Dimension-Free Iteration Complexity of Finite Sum Optimization Problems

Yossi Arjevani, Ohad Shamir

Introduction

Many machine learning tasks reduce to Finite Sum Minimization (FSM) problems of the form

where fif_{i} are LL-smooth and μ\mu-strongly convex. In recent years, a major breakthrough was made when a linear convergence rate was established for this setting (SAG and SDCA ), and since then, many methods have been developed to achieve better convergence rate. However, whereas a large body of literature is devoted for upper bounds, the optimal convergence rate with respect to the problem parameters is not quite settled.

Let us discuss existing lower bounds for this setting, along with their shortcomings, in detail. One approach to obtain lower bounds for this setting is to consider the average of carefully handcrafted functions defined on nn disjoint sets of variables. This approach was taken by Agarwal and Bottou who derived a lower bound for FSM under the first-order oracle model (see Nemirovsky and Yudin ). In this model, optimization algorithms are assumed to access a given function by issuing queries to an external first-order oracle procedure. Upon receiving a query point in the problem domain, the oracle reports the corresponding function value and gradient. The construction used by Agarwal and Bottou consisted of nn different quadratic functions which are adversarially determined based on the first-order queries being issued during the optimization process. The resulting bound in this case does not apply to stochastic algorithms, rendering it invalid for current state-of-the-art methods. Another instantiation of this approach was made by Lan who considered nn disjoint copies of a quadratic function proposed by Nesterov in [13, Section 2.1.2]. This technique is based on the assumption that any iterate generated by the optimization algorithm lies in the span of previously acquired gradients. This assumption is rather permissive and is satisfied by many first-order algorithms, e.g., SAG and SAGA . However, the lower bound stated in the paper faces limitations in a few aspects. First, the validity of the derived bound is restricted to d/nd/n iterations. In many datasets, even if d,nd,n are very large, d/nd/n is quite small. Accordingly, the admissible regime of the lower bound is often not very interesting. Secondly, it is not clear how the proposed construction can be expressed as a Regularized Loss Minimization (RLM) problem with linear predictors (see Section 4). This suggests that methods specialized in dual RLM problems, such as SDCA and accelerated proximal SDCA , can not be addressed by this bound. Thirdly, at least the formal theorem requires assumptions (such as querying in the span of previous gradients, or sampling from a fixed distribution over the individual functions), which are not met by some state-of-the-art methods, such as coordinate descent methods, SVRG and without-replacements sampling algorithms .

In this work, building upon the framework of oblivious algorithms, we take a somewhat more abstract point of view which allows us to easily incorporate coordinate-descent methods, as well as stochastic algorithms. Our framework subsumes the vast majority of optimization methods for machine learning problems, in particular, it applies to SDCA, accelerated proximal SDCA, SDCA without duality , SAG, SAGA, SVRG and acceleration schemes ), as well as for a large number of methods for smooth convex optimization (i.e., FSM with n=1n=1), e.g., (stochastic) Gradient descent (GD), Accelerated Gradient Descent (AGD, ), the Heavy-Ball method (HB, ) and stochastic coordinate descent.

Under this structural assumption, we derive lower bounds for FSM (1), according to which the iteration complexity, i.e., the number of iterations required to obtain an ϵ\epsilon-optimal solution in terms of function value, is at leastFollowing standard conventions, here tilde notation hides logarithmic factors in the parameters of a given class of optimization problems, e.g., smoothness parameter and number of components.

where κ\kappa denotes the condition number of F(w)F({\mathbf{w}}) (that is, the smoothness parameter over the strong convexity parameter). To the best of our knowledge, this is the first tight lower bound to address all the algorithms mentioned above. Moreover, our bound is dimension-free and thus apply to settings in machine learning which are not covered in the current literature (e.g., when nn is Ω(d)\Omega(d)). We also derive a dimension-free nearly-optimal lower bound for smooth convex optimization of

which holds for any oblivious stochastic first-order algorithm. It should be noted that our lower bounds remain valid under any source of randomness which may be introduced into the optimization process (by the oracle or by the optimization algorithm). In particular, our bounds hold in cases where the variance of the iterates produced by the algorithm converges to zero, a highly desirable property of optimization algorithms in this setting.

Two implications can be readily derived from this lower bound. First, obliviousness forms a real barrier for optimization algorithms, and whereas non-oblivious algorithms may achieve a super-linear convergence rate at latter stages of the optimization process (e.g., quasi-newton), or practically zero error after Θ(d)\Theta(d) iterations (e.g. Center of Gravity method, MCG), oblivious algorithms are bound to linear convergence indefinitely, as demonstrated by Figure 1. We believe that this indicates that a major progress can be made in solving machine learning problems by employing non-oblivious methods for settings where dnd\ll n. It should be further noted that another major advantage of non-oblivious algorithms is their ability to obtain optimal convergence rates without an explicit specification of the problem parameters (e.g., [5, Section 4.1]).

Secondly, many practitioners have noticed that oftentimes sampling the individual functions without replacement at each iteration performs better than sampling with replacement (e.g., , see also ). The fact that our lower bound holds regardless of how the individual functions are sampled and is attained using with-replacement sampling (e.g., accelerated proximal SDCA), implies that, in terms of iteration complexity, one should expect to gain no more than log factors in the problem parameters when using one method over the other (it is noteworthy that when comparing with and without replacement samplings, apart from iteration complexity, other computational resources, such as limited communication in distributed settings , may significantly affect the overall runtime).

Framework

Due to difficulties which arise when studying the complexity of general optimization problems under discrete computational models, it is common to analyze the computational hardness of optimization algorithms by modeling the way a given algorithm interacts with the problem instances (without limiting its computational resources). In the seminal work of Nemirovsky and Yudin , it is shown that algorithms which access the function at hand exclusively by querying a first-order oracle require at least

oracle calls to obtain an ϵ\epsilon-optimal solution (note that, here and throughout this section we refer to FSM problems with n=1n=1). This lower bound is tight and its dimension-free part is attained by Nesterov’s well-known accelerated gradient descent, and by MCG otherwise. The fact that this approach is based on information considerations alone is very appealing and renders it valid for any first-order algorithm. However, discarding the resources needed for executing a given algorithm, in particular the per-iteration cost (in time and space), the complexity boundaries drawn by this approach are too crude from a computational point of view. Indeed, the per-iteration cost of MCG, the only method known with oracle complexity of O(dln(1/ϵ))\mathcal{O}{\left(d\ln(1/\epsilon)\right)}, is excessively high, rendering it prohibitive for high-dimensional problems.

2 Definitions

When considering lower bounds one must be very precise as to the scope of optimization algorithms to which they apply. Below, we give formal definitions for oblivious stochastic CLI optimization algorithms and iteration complexity (which serves as a crude proxy for their computational complexity).

A class of optimization problems is an ordered triple (F,I,O)(\mathcal{F},\mathcal{I},\mathcal{O}), where F\mathcal{F} is a family of functions defined over some linear space designated by domF\text{dom}\mathcal{F}, I\mathcal{I} is the side-information given prior to the optimization process and Of:domF×ΘdomF\mathcal{O}_{f}:\text{dom}\mathcal{F}\times\Theta\to\text{dom}\mathcal{F} is a suitable oracle parametrized by some parameters set Θ\Theta, i.e., an external procedure which upon receiving xdomF{\mathbf{x}}\in\text{dom}\mathcal{F} and θΘ\theta\in\Theta, returns some Of(θ)dom(F)\mathcal{O}_{f}(\theta)\in\text{dom}(\mathcal{F}).

For example, in FSM, F\mathcal{F} contains functions as defined in (1), the side-information contains the smooth parameter LL, the strong convexity parameter μ\mu and the number of components nn (although it carries a crucial effect on the iteration complexity, e.g., , in this work, we shall ignore the side-information and assume that all the parameters of the class are given). We shall assume that both first-order and coordinate-descent oracles (see 10,11 below) are allowed to be used during the optimization process. Formally, this is done by introducing an additional parameter which indicates which oracle is being addressed. This added degree of freedom does not violate our lower bounds.

We now turn to rigorously define CLI optimization algorithms. Note that, compared with the definition of first-order pp-CLIs provided in , here, in order to handle coordinate-descent and first-order oracles in a unified manner, we base our formulation on general oracle procedures.

An optimization algorithm is called a Canonical Linear Iterative (CLI) optimization algorithm over a class of optimization problems (F,I,O)(\mathcal{F},\mathcal{I},\mathcal{O}), if given an instance fFf\in\mathcal{F} and initialization points {wi(0)}iJdom(F)\{{\mathbf{w}}^{(0)}_{i}\}_{i\in\mathcal{J}}\subseteq\text{dom}(\mathcal{F}), where J\mathcal{J} is some index set, it operates by iteratively generating points such that for any iJi\in\mathcal{J},

Note that assigning different weights to different terms in (4) can be done through θij(k)Θ\theta^{(k)}_{ij}\in\Theta (e.g., oracle 10 below). This allows a succinct definition for obliviosity. Lastly, we define iteration complexity.

The iteration complexity of a given CLI w.r.t. a given problem class (F,I,O)(\mathcal{F},\mathcal{I},\mathcal{O}) is defined to be the minimal number of iterations KK such that

where the expectation is taken over all the randomness introduced into the optimization process (choosing w1(k){\mathbf{w}}_{1}^{(k)} merely serves as a convention and is not necessary for our bounds to hold).

3 Proof Technique - Deriving Lower Bounds via Approximation Theory

Consider the following parametrized class of LL-smooth and μ\mu-strongly convex optimization problems,

Clearly, the minimizer of fηf_{\eta} are w(η)1/ηw^{*}(\eta)\coloneqq 1/\eta, with norm bounded by 1/μ1/\mu. For simplicity, we will consider a special case, namely, vanilla gradient descent (GD) with step size 1/L1/L, which produces new iterates as follows

Setting the initialization point to be w(0)(η)=0w^{(0)}(\eta)=0, we derive an explicit expression for w(k)(η)w^{(k)}(\eta):

It turns our that each w(k)(η)w^{(k)}(\eta) forms a univariate polynomial whose degree is at most kk. Furthermore, since fη(w)f_{\eta}(w) are LL-smooth μ\mu-strongly convex for any η[μ,L]\eta\in[\mu,L], standard convergence analysis for GD (e.g., , Theorem 2.1.14) guarantees that w(k)(η)w(η)(12/(1+κ))k2w(η)|w^{(k)}(\eta)-w^{*}(\eta)|\leq(1-2/(1+\kappa))^{\frac{k}{2}}|w^{*}(\eta)|, where κ\kappa denotes the condition number. Substituting Equation (6) for w(k)(η)w^{(k)}(\eta) yields

Thus, we see that the faster the convergence rate of a given optimization algorithm is, the better the induced sequence of polynomials (w(k)(η))k0(w^{(k)}(\eta))_{k\geq 0} approximate 1/η1/\eta w.r.t. the maximum norm L([μ,L])\|\cdot\|_{L_{\infty}([\mu,L])} over [μ,L][\mu,L]. In Fig. 2, we compare the first 4 polynomials induced by GD and AGD. Not surprisingly, AGD polynomials approximates 1/η1/\eta better than those of GD.

Now, one may ask, assuming that iterates of a given optimization algorithm A\mathcal{A} for (5) can be expressed as polynomials sk(η)s_{k}(\eta) whose degree does not exceed the iteration number, just how fast can these iterates converge to the minimizer? Since the convergence rate is bounded from below by sk(η)1/ηL([μ,L])\|s_{k}(\eta)-1/\eta\|_{L_{\infty}([\mu,L])}, we may address the following question instead:

where Pk\mathcal{P}_{k} denotes the set of univariate polynomials whose degree does not exceed kk. Problem (7) and other related settings are main topics of study in approximation theory. Accordingly, our technique for proving lower bounds makes an extensive use of tools borrowed from this area. Specifically, in a paper from 1899 Chebyshev showed that

by which we derive the following theorem (see Appendix A.1 for a detailed proof).

In the following sections, we apply oblivious CLI on various parameterized optimization problems so that the resulting iterates are polynomials in the problem parameters. We then apply arguments similar to the above

A similar reduction, from optimization problems to approximation problems, was used before in a few contexts to analyze the iteration complexity of deterministic CLIs (e.g., [5, Section 3], see also Conjugate Gradient convergence analysis ). But, what if we allow random algorithms? should we expect the same iteration complexity? To answer this, we use Yao’s minimax principle according to which the performance of a given stochastic optimization algorithm w.r.t. to its worst input are bounded from below by the performance of the best deterministic algorithm w.r.t. distributions over the input space. Thus, following a similar reduction one can show that the convergence rate of stochastic algorithms is bounded from below by

That is, a lower bound for the stochastic case can be attained by considering an approximation problem w.r.t. weighted L1L_{1} with the uniform distribution over [μ,L][\mu,L]. Other approximation problems considered in this work involve L2L_{2}-norm and different distributions. We provide a schematic description of our proof technique in Scheme 2.1.

Lower Bound for Finite Sums Minimization Methods

Having described our analytic approach, we now turn to present some concrete applications, starting with iteration complexity lower bounds in the context of FSM problems (1). In what follows, we derive a lower bound on the iteration complexity of oblivious (possibly stochastic) CLI algorithms equipped with first-order and coordinate-descent oracles for FSM. Strictly speaking, we focus on optimization algorithms equipped with both generalized first order oracle,

where ei{\mathbf{e}}_{i} denotes the ii’th unit vector. We remark that coordinate-descent steps w.r.t. partial gradients can be implemented using (10) by setting AA to be some principal minor of the unit matrix.It should be further noted that our results below hold for scenarios where the optimization algorithm is free to call a different oracle at different iterations.

It is easy to verify that the minimizers of (12) are

We would like to show that the coordinates of the iterates of deterministic oblivious CLIs, which minimize FηF_{\boldsymbol{\eta}} using first-order and coordinate-descent oracles, form multivariate polynomials in η\boldsymbol{\eta} of total degrees (the maximal sum of powers over all the terms) which does not exceed the iteration number. Indeed, if the coordinates of wi(k)(η){\mathbf{w}}_{i}^{(k)}(\boldsymbol{\eta}) are multivariate polynomial in η\boldsymbol{\eta} of total degree at most kk, then the coordinates of the vectors returned by both oracles

are multivariate polynomials of total degree of at most k+1k+1, as all the parameters (A,B,C,iA,B,C,i and jj) do not depend on η\boldsymbol{\eta} (due to obliviosity) and the rest of the terms (Qηj,q,I,1/(Qηj)ii,(Qηj)i,,eiQ_{\eta_{j}},{\mathbf{q}},I,1/(Q_{\eta_{j}})_{ii},(Q_{\eta_{j}})_{i,*},{\mathbf{e}}_{i} and qiq_{i}) are either linear in ηj\eta_{j} or constants. Now, since the next iterates are generated simply by summing up all the oracle answers, they also form multivariate polynomials of total degree of at most k+1k+1. Thus, denoting the first coordinate of w1(k)(η){\mathbf{w}}_{1}^{(k)}(\boldsymbol{\eta}) by s(η)s(\boldsymbol{\eta}) and using Inequality (8), we get the following bound

where Ω(1)\Omega(1) designates a constant which does not depend on kk (but may depend on the problem parameters). Lastly, this implies that for any deterministic oblivious CLI and any iteration number, there exists some ηHFSM\boldsymbol{\eta}\in\mathcal{H}_{\text{FSM}} such that the convergence rate of the algorithm, when applied on FηF_{\boldsymbol{\eta}}, is bounded from below by Inequality (27). We note that, as opposed to other related lower bounds, e.g., , our proof is non-constructive. As discussed in subsection 2.3, this type of analysis can be extended to stochastic algorithms by considering (26) w.r.t. other norms such as weighted L1L_{1}-norm. We now arrive at the following theorem whose proof, including the corresponding logarithmic factors and constants, can be found in Appendix A.2.

The iteration complexity of oblivious (possibly stochastic) CLIs for FSM (1) equipped with first-order (10) and coordinate-descent oracles (11), is bounded from below by

The lower bound stated in Theorem 2 is tight and is attained by, e.g., SAG combined with an acceleration scheme (e.g., ). Moreover, as mentioned earlier, our lower bound does not depend on the problem dimension (or equivalently, holds for any number of iterations, regardless of dd and nn), and covers coordinate descent methods with stochastic or deterministic coordinate schedule (in the special case where n=1n=1, this gives a lower bound for minimizing smooth and strongly convex functions by performing steepest coordinate descent steps). Also, our bound implies that using mini-batches for tackling FSM does not reduce the overall iteration complexity. Lastly, it is noteworthy that the nn term in the lower bound above holds for any algorithm accompanied with an incremental oracle, which grants access to at most one individual function each time.

We also derive a nearly-optimal lower bound for smooth non-strongly convex functions for the more restricted setting of n=1n=1 and first-order oracle. The parameterized subset of functions we use (see Scheme 2.1) is gη(x)η2x2Rηe1x,η(0,L]g_{\eta}({\mathbf{x}})\coloneqq\frac{\eta}{2}\left\|{\mathbf{x}}\right\|^{2}-R\eta{\mathbf{e}}_{1}^{\top}{\mathbf{x}},\quad\eta\in(0,L]. The corresponding minimizer (as a function of η\eta) is x(η)=Re1{\mathbf{x}}^{*}(\eta)=R{\mathbf{e}}_{1}, and in this case we seek to approximate it w.r.t. L2L_{2}-norm using kk-degree univariate polynomials whose constant term vanishes. The resulting bound is dimension-free and improves upon other bounds for this setting (e.g. ) in that it applies to deterministic algorithms, as well as to stochastic algorithms (see A.3 for proof).

The iteration complexity of any oblivious (possibly stochastic) CLI for LL-smooth convex functions equipped with a first-order oracle, is bounded from below by

Lower Bound for Dual Regularized Loss Minimization with Linear Predictors

The form of functions (12) discussed in the previous section does not readily adapt to general RLM problems with linear predictors, i.e.,

such as SDCA and accelerated proximal SDCA, are not covered by Theorem 2. Accordingly, in this section, we address the iteration complexity of oblivious (possibly stochastic) CLI algorithms equipped with dual RLM oracles:

We state below the corresponding lower bound, whose proof, including logarithmic factors and constants, can be found in Appendix A.4.

The iteration complexity of oblivious (possibly stochastic) CLIs for RLM (28) equipped with dual RLM oracles (30) is bounded from below by

References

Appendix A Proofs

Proof According to the way A\mathcal{A} generates iterates, we have

for some polynomial sk(η)s_{k}(\eta) of degree at most kk. By Lemma 6, we have

Now, since fηf_{\eta} is μ\mu-strongly convex, we have,

Hence, by Lemma 12, the minimal number of iterations required to get an ϵ\epsilon-optimal solution is at least

A.2 Proof of Theorem 2 - Finite Sums

where we put ηi=η1i1ηnin\eta^{\mathbf{i}}=\eta_{1}^{i_{1}}\cdots\eta_{n}^{i_{n}} and i=i1++in|{\mathbf{i}}|=i_{1}+\cdots+i_{n}. In words, Pkn\mathcal{P}_{k}^{n} is the set of all multivariate polynomials over nn indeterminates whose total degree (the maximal sum of the degrees over all terms) is less than or equal to kk. Lastly, given s(η)Pkn{\mathbf{s}}(\boldsymbol{\eta})\in\mathcal{P}_{k}^{n} we define

This notation will come in handy in the main proof.

The lemma below describes the functional form assumed by iterates produced by oblivious CLIs.

When applied on (12) with suitable first-order and coordinate-descent oracles (as defined in 25), the coordinates of iterates produced by oblivious stochastic CLIs form multivariate polynomials in η\boldsymbol{\eta} with random real coefficients whose total degree does not exceed the iteration number.

Proof Let A\mathcal{A} be a oblivious stochastic CLI, and suppose we apply A\mathcal{A} on the class of problems (12) parameterized by η\boldsymbol{\eta}, using both first-order and coordinate-descent oracles as defined in 25. We use mathematical induction to show that for any k=0,1,k=0,1,\dots, the coordinate of the kk’th iterate produced by such process can be expressed as a distribution over multivariate polynomials in η\boldsymbol{\eta} of degree at most kk.

For the inductive step, assume that any coordinate of wi(k)(η){\mathbf{w}}^{(k)}_{i}(\boldsymbol{\eta}) can be expressed as a distribution over Pkn\mathcal{P}^{n}_{k}. It is easy to see that for any wi(k)(η){\mathbf{w}}_{i}^{(k)}(\boldsymbol{\eta}), the answers of both oracles,

form a distribution over Pk+1n\mathcal{P}^{n}_{k+1}, as all the random quantities involved in the expressions (A,B,C,iA,B,C,i and jj) do not depend on η1,,ηn\eta_{1},\dots,\eta_{n} (due to obliviosity) and the rest of the terms (I,Qηj,1/(Qηj)ii,(Qηj)i,,ei,qiI,Q_{\eta_{j}},1/(Q_{\eta_{j}})_{ii},(Q_{\eta_{j}})_{i,*},{\mathbf{e}}_{i},q_{i} and q{\mathbf{q}}) are either linear in ηj\eta_{j} or constants. Lastly, wi(k+1){\mathbf{w}}_{i}^{(k+1)} are computed by simply summing up all the oracle answers, and as such, form again distributions over Pk+1n\mathcal{P}^{n}_{k+1}.

Proof [Theorem 2] Let A\mathcal{A} be a oblivious stochastic CLI. By Lemma 1 the first coordinate of w1(k)(η){\mathbf{w}}^{(k)}_{1}(\boldsymbol{\eta}) (the point returned by the algorithm at the kk’th iteration) when applied on the class of problems (12) distributes according to some distribution D\mathcal{D} over Pkn\mathcal{P}^{n}_{k}. Thus, by Yao principle, since each polynomial in (Pkn)d(\mathcal{P}^{n}_{k})^{d} represents a single deterministic algorithm, we have

where U(H)\mathcal{U}(\mathcal{H}) denotes a distribution over H\mathcal{H} which corresponds to first drawing jU([n])j\sim\mathcal{U}([n]) at random, and then setting the coordinates of η\boldsymbol{\eta} as follows

Furthermore, it is easy to verify that the corresponding minimizers of (12) are

where the first inequality follows by focusing on the first coordinate of s(η)w(η){\mathbf{s}}(\boldsymbol{\eta})-{\mathbf{w}}^{*}(\boldsymbol{\eta}). Now, set α=(n1)Lμ2+nL+μ2\alpha=-(n-1)\frac{L-\mu}{2}+n\frac{L+\mu}{2} and note that

Thus, by Lemma 8 (using the same value for α\alpha and noting that α>(Lμ)/2\alpha>(L-\mu)/2) yields

where kjk_{j} denotes the degree of sj(ηj)s_{j}(\eta_{j}). Plugging in this into Inequality (A.2) we get

Since uρuu\mapsto\rho^{u} is a decreasing and convex function for any 1>ρ>01>\rho>0, we have

where the last inequality is due to the fact that s(η)Pkn{\mathbf{s}}(\boldsymbol{\eta})\in\mathcal{P}^{n}_{k} which implies that j=1nkjk\sum_{j=1}^{n}k_{j}\leq k. Finally, we have,

where the first inequality follows by the μ\mu-strong convexity of FηF_{\boldsymbol{\eta}} and the second inequality follows by Jensen inequality. Using Lemma 12, we get that the iteration complexity of A\mathcal{A} is at least

This, together with Theorem 5 below, concludes the proof.

We bound from below the number of iterations required to obtain a non-trivial accuracy.

Let j[n]j\in[n], let ηj,1H\boldsymbol{\eta}_{j,1}\in\mathcal{H} be a parameters vector whose all coordinates are Lμ2-\frac{L-\mu}{2} and let ηj,2H\boldsymbol{\eta}_{j,2}\in\mathcal{H} be a parameters vector whose all coordinates are Lμ2-\frac{L-\mu}{2}, except for the jj’th coordinate which we set to be Lμ2\frac{L-\mu}{2}. If κ>3\kappa>3, then

where the last inequality follows from κ>3\kappa>3.

The iteration complexity of any stochastic optimization algorithm (not necessarily CLI) which gathers information on FηF_{\boldsymbol{\eta}} (with κ>3\kappa>3) only by means of incremental oracles, i.e., oracles which upon receiving query return an answer which depends on not more than one individual function, is at least nn.

Proof Let A\mathcal{A} be a stochastic optimization algorithm. According to Yao’s principle, we can bound from below the ϵ\epsilon-optimality of A\mathcal{A} after k<nk<n iterations by estimating the ϵ\epsilon-optimality of any deterministic algorithm w.r.t. to distribution D(H)\mathcal{D}(\mathcal{H}) over H\mathcal{H} defined by: draw j[n]j\in[n] and set η\boldsymbol{\eta} to be ηj,1\boldsymbol{\eta}_{j,1} or ηj,2\boldsymbol{\eta}_{j,2} as defined in Lemma 2 w.p. 1/21/2. Then,

where the last inequality follows from Lemma 2. Thus, for sufficiently small ϵ\epsilon, one must perform at least nn iterations in order to obtain an ϵ\epsilon-optimal solution.

A.3 Proof of Theorem 3 - Smooth Functions

with a first-order oracle (as defined in 10 with n=1n=1), the coordinates of iterates produced by oblivious stochastic CLIs whose is initialization iterate is xi(0)=0{\mathbf{x}}_{i}^{(0)}=0 form polynomials in η\eta with random real coefficients which vanishes at η=0\eta=0 and whose degree does not exceed the iteration number.

Proof Let A\mathcal{A} be a oblivious stochastic CLI, and suppose we apply A\mathcal{A} on the class of problems (37) parameterized by η\eta, using a first-order. We use mathematical induction to show that for any k=0,1,k=0,1,\dots, the coordinate of the kk’th iterate produced by such process can be expressed as a distribution over Pk\overline{\mathcal{P}}_{k}.

As the first iterate xi(0){\mathbf{x}}^{(0)}_{i} is assumed to be zero, the base case is trivial. For the inductive step, assume that any coordinate of xi(k){\mathbf{x}}^{(k)}_{i} can be expressed as a distribution over Pk\overline{\mathcal{P}}_{k}. It is easy to see that for any xi(k){\mathbf{x}}_{i}^{(k)}, the answers of the first-order oracle,

form a distribution over Pk+10\mathcal{P}_{k+1}^{0}, as the random quantities involved in the expressions (A,BA,B and CC) do no depend on η\eta (due to obliviosity) and the rest of the terms (η\eta and RηeiR\eta{\mathbf{e}}_{i}) are homogenous in η\eta. Lastly, xi(k+1){\mathbf{x}}_{i}^{(k+1)} are computed by simply summing up all the oracle answers, and as such, form again distributions over Pk+10\mathcal{P}_{k+1}^{0}.

Proof [Theorem 3] Let N\mathcal{N} be a oblivious stochastic CLI and let α(1,0)\alpha\in(-1,0). Our derivation of lower bounds for stochastic CLIs is established via Yao principle. Fix some k{0,1,}k\in\{0,1,\dots\}. By Lemma 3, x1(k)(η){\mathbf{x}}^{(k)}_{1}(\eta) distributes according to some distribution D\mathcal{D} over (Pk)d(\overline{\mathcal{P}}_{k})^{d}. Thus, by Yao principle, since each polynomial in (Pk)d(\overline{\mathcal{P}}_{k})^{d} represents a single deterministic algorithm, we have

where E([0,L],α)\mathcal{E}([0,L],\alpha) (abbr. E\mathcal{E}) denotes a distribution over (0,L](0,L] with a probability density function

where the first inequality follows by the fact that hηh_{\eta} is η\eta-strongly convex and the second inequality follows by focusing on the first coordinate of s(η)x(η){\mathbf{s}}(\eta)-{\mathbf{x}}^{*}(\eta). Invoking Lemma 9 yields

Thus, in this case the iteration complexity is bound from below by

A.4 Proof of Theorem 4 - Regularized Empirical Loss Minimization

In which case, the corresponding dual is:

Note that, all the eigenvalues of QψQ_{\boldsymbol{\psi}} are bigger than 1. Therefore, DψD_{\boldsymbol{\psi}} is 1-strongly convex. We assume that the oracles at the algorithms’ disposal are the dual RLM oracles defined in (30),

Lastly, we will need the following definitions

to ease notation in subsequent proofs (where p\partial p denotes the total degree of pp and Pkn\mathcal{P}_{k}^{n} is defined in (31)). Thus, Qkn\mathcal{Q}_{k}^{n} contains dd-dimensional vectors whose entries are nn-multivariate polynomials expressions in sinψ1,,sinψn\sin\psi_{1},\dots,\sin\psi_{n}, such that the sum of the degrees of the dd-polynomials does not exceed kk. In addition, given t(ψ)Qk,dn{\mathbf{t}}(\boldsymbol{\psi})\in\mathcal{Q}_{k,d}^{n} we define

As usual, we start by stating the functional form assumed by iterates produced by this sort of optimization algorithms.

When applied on (38) with a dual RLM oracle (as defined in 30), the coordinates of iterates produced by oblivious stochastic CLIs form nn multivariate polynomials expressions in sinψ1,,sinψn/2\sin\psi_{1},\dots,\sin\psi_{n/2} with random coefficients, such that the sum of the degrees of these polynomials does not exceed the iteration number.

Proof Let A\mathcal{A} be a oblivious stochastic CLI, and suppose we apply A\mathcal{A} on the class of problems (38) parameterized by ψ\boldsymbol{\psi}, using dual RLM oracles as defined in 30. We use mathematical induction to show that for any k=0,1,k=0,1,\dots, the coordinate of the kk’th iterate produced by such process can be expressed as a distribution over polynomial expressions in sinψ1,,sinψn/2\sin\psi_{1},\dots,\sin\psi_{n/2} whose sum of degrees is less than or equal kk.

For the inductive step, assume that αi(k)\boldsymbol{\alpha}^{(k)}_{i} can be expressed as a distribution over Qk,nn/2\mathcal{Q}^{n/2}_{k,n}. It is easy to see that for any αi(k)\boldsymbol{\alpha}_{i}^{(k)}, the answer of the dual RLM oracle

Let A\mathcal{A} be a oblivious stochastic CLI. By Lemma 4 the coordinates of α1(k)\boldsymbol{\alpha}^{(k)}_{1} (the point returned by the algorithm at the kk’th iteration) when applied on the class of problems (38) distributes according to some distribution D\mathcal{D} over (Qkn/2)n(\mathcal{Q}^{n/2}_{k})^{n}. Furthermore, it is easy to verify that the corresponding minimizers of (38) are

α1(k)(ψ)\boldsymbol{\alpha}^{(k)}_{1}(\boldsymbol{\psi}) distributes according to some distribution D\mathcal{D} over Qk,nn/2\mathcal{Q}^{n/2}_{k,n}. Thus, by Yao principle, since each polynomial in Qk,nn/2\mathcal{Q}^{n/2}_{k,n} represents a single deterministic algorithm, we have

where U(H)\mathcal{U}(\mathcal{H}) denotes a distribution over H\mathcal{H} which corresponds to of first drawing jU([n/2])j\sim\mathcal{U}([n/2]) at random, and then drawing ψj\psi_{j} according to distribution defined by the p.d.f. pψj(ψ)=cos(ψ)/2p_{\psi_{j}}(\psi)=\cos(\psi)/2 over [π/2,π/2][-\pi/2,\pi/2] (for iji\neq j we set ψi=0\psi_{i}=0 ). We now have,

where the first inequality follows by focusing on the jj’th coordinate of s(ψ)α(ψ){\mathbf{s}}(\psi)-\boldsymbol{\alpha}^{*}(\boldsymbol{\psi}) in each summand. Now, set α=1+λn,L=3,μ=1\alpha=1+\lambda n,L=3,\mu=1 and note that

Thus, by Lemma 8, using the same value for α\alpha and noting that α>1=(Lμ)/2\alpha>1=(L-\mu)/2) yields

where kjk_{j} denotes the degree of sj(ηj)s_{j}(\eta_{j}). Plugging in this into Inequality (A.4) we get

Since uρuu\mapsto\rho^{u} is a decreasing and convex function for any 1>ρ>01>\rho>0, we have

where the last inequality is due to the fact that s(ψ)Qk,nn/2(sinψ){\mathbf{s}}(\boldsymbol{\psi})\in\mathcal{Q}^{n/2}_{k,n}(\sin\psi) which implies that j=1nkjk\sum_{j=1}^{n}k_{j}\leq k. Finally, we have,

where the first inequality follows by the 11-strong convexity of DψD_{\boldsymbol{\psi}} and the third inequality follows by Jensen inequality. Using Lemma 12, we get that the iteration complexity of A\mathcal{A} is at least

Lastly, we bound from below the number of iterations required to obtain a non-trivial accuracy.

Let j[n]j\in[n], let ψj,1H\boldsymbol{\psi}_{j,1}\in\mathcal{H} be a parameters vector whose all coordinates are π/2-\pi/2 and let ηψ,2H\boldsymbol{\eta}_{\boldsymbol{\psi},2}\in\mathcal{H} be a parameters vector whose all coordinates are π/2-\pi/2, except for the jj’th coordinate which we set to be π/2\pi/2. Then

When applied on (38) ,the iteration complexity of oblivious stochastic CLI algorithms equipped with a dual RLM oracle DψD_{\boldsymbol{\psi}} is at least n/2n/2.

Proof Let A\mathcal{A} be a stochastic optimization algorithm. By Lemma 4 the coordinates of α1(k)\boldsymbol{\alpha}^{(k)}_{1} (the point returned by the algorithm at the kk’th iteration) when applied on the class of problems (38) distributes according to some distribution D\mathcal{D} over (Qkn/2)n(\mathcal{Q}^{n/2}_{k})^{n}. By Yao principle, since each polynomial in Qk,nn/2\mathcal{Q}^{n/2}_{k,n} represents a single deterministic algorithm, we have

where D(H)\mathcal{D}(\mathcal{H}) denotes a distribution over H\mathcal{H} which corresponds to the process of first drawing jU([n/2])j\sim\mathcal{U}([n/2]) at random, and then set ψ\boldsymbol{\psi} to be ψj,1\boldsymbol{\psi}_{j,1} or ψj,2\boldsymbol{\psi}_{j,2} as defined in Lemma 5 with equal probability. Now, for k<n/2k<n/2, there exists some j[n/2]j\in[n/2] such that t(ψ){\mathbf{t}}(\boldsymbol{\psi}) does not depend on ψj\psi_{j}. This yields,

where the last inequality follows from Lemma 5. Thus, for sufficiently small ϵ\epsilon, one must perform at least n/2n/2 iterations in order to obtain an ϵ\epsilon-optimal solution.

In the following section we analyze the best polynomial approximation of some functions w.r.t. LL_{\infty}, L1L_{1} and L2L_{2} norm, based on standard results from the approximation theory (see generally, Allan Pinkus. On L1-approximation, 1989; Theodore J Rivlin. An introduction to the approximation of functions, 2003; Ronald A DeVore and George G Lorentz. Constructive approximation, 1993; Naum Il’ich Akhiezer and Charles J Hyman. Theory of approximation. Translated by Charles J. Hyman. New York, 1956; Isidor Pavlovich Natanson. Constructive function theory, 1964).

where we used the fact that Pk\mathcal{P}_{k} is invariant under pre-composition and post-composition with linear function in the second, fourth and fifth equalities. Now, since

combining Inequality 8 with Lemma 10, yields

Let Uk(η)U_{k}(\eta) denote the kk’th second order Chebyshev polynomial, i.e.,

Proof We integrate by substituting η=eiθ+eiθ2(=cos(θ))\eta=\frac{e^{i\theta}+e^{-i\theta}}{2}\left(=\cos(\theta)\right),

where the last equality is due to the fact that the integrand is an even function in θ\theta. Lastly, since for any j=1,,kj=1,\dots,k we have

and since e2πij/(k+1)1e^{-2\pi ij/(k+1)}\neq 1 for j=1,,kj=1,\dots,k, it follows that

This, together with the case where j=0j=0,

implies that all the terms in (A.5.2) vanish, thus concluding the proof.

Given μ<L\mu<L (note that, here μ\mu and LL are allowed to take negative values), we define

By substituting η\eta for (μL)η2\frac{(\mu-L)\eta}{2}, we get the following corollary.

We now use Corollary 1 to bound from below the best polynomial L1L_{1}-approximation error w.r.t. 1/(η+α)1/(\eta+\alpha) over the interval [μ,L][\mu,L].

Let p(η)Pk1p(\eta)\in\mathcal{P}_{k-1}. Then, for any (Lμ)/2<α(L-\mu)/2<\alpha we have

Proof First, note that the following two inequalities

Substituting μL2η\frac{\mu-L}{2}\eta for η\eta, yields

Now, plugging in the definition of Uk(η)U_{k}(\eta) (see (52)) and applying Lemma 11, we get

for any u>1u>1. Using this inequality with (A.5.2) where u=2αLμu=\frac{2\alpha}{L-\mu}, yields

shows that this problem can be seen as a best L2L_{2}-approximation for η1+α2\eta^{\frac{1+\alpha}{2}} in the kk-dimensional space spanned by gi=ηi+1+α2, i=1,,kg_{i}=\eta^{i+\frac{1+\alpha}{2}},~{}i=1,\dots,k (accordingly, g0=η1+α2g_{0}=\eta^{\frac{1+\alpha}{2}}). By [2, Equation (3), p. 16], we have

where G()G(\cdot) is Gram matrix (whose entries are the inner products of its arguments). First, note that

It follows that both matrices can be expressed as a Cauchy matrices, that is

where xi=i+α+1,yj=j1, i,j[k]x_{i}=i+\alpha+1,y_{j}=-j-1,~{}i,j\in[k] and ui=i+α,vj=j, i,j[k+1]u_{i}=i+\alpha,v_{j}=-j,~{}i,j\in[k+1]. The determinant of Cauchy matrix AA defined by sequences wi,zjw_{i},z_{j} one has

To estimate the middle term, we apply arguments similar to the integral test for infinite series. First, note that,

Now, since for any α(1,0)\alpha\in(-1,0), it holds that xlnxx+α+1x\mapsto\ln\frac{x}{x+\alpha+1} is a monotone decreasing function (over xαx\neq\alpha), it holds that

Now, by the following standard inequality

A.6 Technical Lemmas

and limxγ(x)=0\lim_{x\to\infty}\gamma(x)=0. Therefore, by using identity (see Section F.31. in ), we get

where the last equality is due to Lemma 10.

Let L>μ>0L>\mu>0, c>0c>0 and α0\alpha\geq 0. Then

and limxδ(x)=0\lim_{x\to\infty}\delta(x)=0. Thus, we obtained the following inequality