Making SGD Parameter-Free

Yair Carmon, Oliver Hinder

Introduction

Stochastic convex optimization (SCO) is a cornerstone of both the theory and practice of machine learning. Consequently, there is intense interest in developing SCO algorithms that require little to no prior knowledge of the problem parameters, and hence little to no tuning. In this work we consider the fundamental problem of non-smooth SCO (in a potentially unbounded domain) and seek methods that are adaptive to a key problem parameter: the initial distance to optimality.

Current approaches for tackling this problem focus on the more general online learning problem of parameter-free regret minimization, where the goal is to obtain regret guarantees that are valid for comparators with arbitrary norms. Research on parameter-free regret minimization has led to practical algorithms for stochastic optimization, methods that are able to adapt to many problem parameters simultaneously, and methods that can work with any norm. In the basic Euclidean setting with 1-Lipschitz losses where only the initial distance to optimality is unknown, there are essentially matching upper and lower bounds, showing that the best achievable parameter-free average regret scales as

where $T$ is the number of steps, $\|\mathring{x}\|$ is the (Euclidean) comparator norm, and $\varepsilon>0$ represents the (user-chosen) regret we will incur even if the comparator norm is zero. This is larger by a logarithmic factor than the optimal average-regret when the comparator norm is known in advance.

Parameter-free regret bounds immediately translate into parameter-free SCO algorithms using online-to-batch conversion . The expected optimality gap bound of the resulting algorithm is identical to (1) when we replace $\mathring{x}$ by $x_{\star}-x_{0}$ , i.e., the difference between the optimum and the initial point. This bound is a logarithmic factor worse than what stochastic gradient descent (SGD) can achieve when we know the distance to optimality and use it to compute step sizes. While this logarithmic factor is unavoidable for regret minimization, it is unclear if it is necessary for SCO.

In this paper we show it is possible to obtain stronger parameter-free rates for SCO by moving beyond the regret minimization abstraction. In particular, for any $\varepsilon>0$ and $\delta\in(0,1)$ , we obtain probability $1-\delta$ optimality gap bounds of

which is better than any bound achievable by online-to-batch conversion. While replacing the logarithmic factor by a double-logarithmic factor may appear a small improvement, we consider it important due to the fundamental nature of the problem as well as the theoretical separation it establishes between parameter-free SCO and OCO. Such separations are rare in the literature; we are only aware of one prior example .

Our method also provides high probability guarantees on the suboptimality gap. This resolves an open problem in parameter-free optimization; see and [29, §7]. We are able to form high probability bounds because, unlike other parameter-free SCO algorithms, we prove a strong localization guarantee: our output $\bar{x}$ satisfies $\|\bar{x}-x_{\star}\|=O(\|x_{0}-x_{\star}\|)$ , and key intermediate points satisfy a similar bound as well. We suspect that such localization is difficult to establish with online-to-batch conversion, since online parameter-free algorithms may need to let their iterates fluctuate wildly in order to handle difficult adversaries.

In addition to independence of $\|x_{0}-x_{\star}\|$ , our algorithm exhibits three additional forms of adaptivity. First, our algorithm has adaptivity to gradient norms on par with the best existing parameter-free result : the leading term of our bounds scales with a sum of squared observed gradient norms, and an a-priori gradient norm bound only affects low-order terms. Second, as a consequence, in the smooth and noiseless case our algorithm exhibits a $\frac{\log\log T}{T}$ rate of convergence. Finally, via a simple restart scheme we obtain the optimal rate for strongly-convex stochastic problems (up to double-logarithmic factors), without knowledge of the strong-convexity parameter.

On a technical level, our development differs significantly from prior parameter-free optimization methods. While online methods rely on advanced tools such as coin betting and online Newton steps, our approach is essentially a careful scheme for correctly setting the step size of SGD. Underlying our algorithm is a parameter-free certificate for SGD, which implies both localization and optimality gap bounds. The certificate takes the form of an implicit equation over the SGD step size, which we solve via bisection on the logarithm of the step size. To obtain high-probability bounds, we develop a time-uniform empirical-Bernstein-type concentration bound independent of any a-priori assumptions on the iterate norms. Given the ubiquity of SGD in practice and in the classroom, our insights on how to choose its step size may be of independent interest.

In the following subsections we review additional related work, as well as the problem setup and notation. Section 2 develops our parameter-free step size certificate. Section 3 presents our algorithm and its analysis in the noiseless regime. Section 4 lifts the analysis to the stochastic setting, proving our main result on parameter-free SCO. Finally, Section 5 shows how our method adapts to smoothness and (via restarts) to strong convexity.

1.1 Additional related work

The literature on noiseless optimization also offers a rich variety of parameter-free algorithms. In the smooth setting, the Armijo rule is a standard technique for choosing step sizes for gradient descent. Using variants of this idea combined with acceleration, achieves essentially optimal and parameter-free rates of convergence . The Polyak step size rule simultaneously achieves optimal rates for smooth, non-smooth and strongly-convex optimization , but requires knowledge of the optimal function value. This requirement can be relaxed, making the Polyak method parameter-free, but at the cost of a multiplicative logarithmic factor to its bound . Consequently, non-smooth parameter-free deterministic optimization appears to be as hard as SCO. Multiple works generalize line-search and the Polyak method to the stochastic setting , but do not obtain parameter-free rates in the sense we consider here.

Limitations of online-to-batch conversion.

To the best of our knowledge, the only previous example of an SCO rate that is provably unachievable by online-to-batch conversion of a (uniform) regret bound occurs for strongly-convex optimization. Specifically, any online strongly-convex optimization algorithm must have logarithmic regret (implying suboptimality $(\log T)/T$ via online-to-batch conversion), while Hazan and Kale and others have achieved the optimal $1/T$ rate for stochastic strongly-convex optimization. The variant of our algorithm in Section 5.2 is based on the Epoch-SGD algorithm of Hazan and Kale, and simultaneously breaks both regret minimization barriers, achieving optimality gap $(\log\log T)/T$ for parameter-free strongly-convex stochastic optimization with high probability.

Grid search.

In practice, the standard technique for selecting the step size of SGD (and hyperparameters more broadly) is grid search. This typically consists of testing all step sizes on a geometrically spaced grid and choosing the one with the best performance on a held-out set. Compared to our method, such grid search is computationally wasteful, as it tests exponentially more step sizes than we do. Moreover, in the context of parameter-free SCO, proving guarantees for grid search is surprisingly difficult, since it is unclear how to bound the objective value estimation error for points that may be arbitrarily far apart.

1.2 Problem setup and notation

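Our development revolves around the classical fixed step size stochastic gradient descent (SGD) algorithm. Given step size $\eta$ and initialization $x_{0}$ , SGD iterates

$x_{i+1}(\eta)=\Pi_{\mathcal{X}}\big(x_{i}(\eta)-\eta g_{i}(\eta)\big)$ ,

where $g_{i}(\eta)$ is the stochastic gradient queried at $x_{i}(\eta)$ and $\Pi_{\mathcal{X}}$ is the Euclidean projection onto $\mathcal{X}$ ; we intentionally feature the $\eta$ dependence of $x_{i}(\eta)$ and $g_{i}(\eta)$ prominently. We define the following quantities associated with the SGD iterates. First, we write the distance to $x_{\star}$ and its running maximum as

$d_{t}(\eta)\coloneqq\|x_{t}(\eta)-x_{\star}\|$ and $\bar{d}_{t}(\eta)\coloneqq\max_{i\leq t}d_{i}(\eta)$ .

Replacing $x_{\star}$ with $x_{0}$ in the above definitions, we write

$r_{t}(\eta)\coloneqq\|x_{t}(\eta)-x_{0}\|$ and $\bar{r}_{t}(\eta)\coloneqq\max_{i\leq t}r_{i}(\eta)$ .

Finally, we denote the running sum of squared gradient norms and gradient oracle error by

$G_{t}(\eta)\coloneqq\sum_{i<t}\|g_{i}(\eta)\|^{2}$ and $\Delta_{t}(\eta)\coloneqq g_{t}(\eta)-\nabla f(x_{t}(\eta))$ .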

Throughout, $\|\cdot\|$ denotes the Euclidean norm. We use $\log$ to denote the base 2 logarithm, and write $\log_{+}(x)\coloneqq\max\{2,\log(x)\}$ to simplify $O(\cdot)$ notation. For any particular value of $\eta$ , the quantities $x_{i}(\eta)$ , $g_{i}(\eta)$ , etc. always refer to a single realization of the random process they represent.
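
For readers who prefer code, the setup above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `run_sgd`, the list-based vectors, and the toy objective $f(x)=|x-3|$ are assumptions made for the example. It runs fixed-step SGD and tracks the maximum distance traveled $\bar{r}_{T}(\eta)=\max_{i\leq T}\|x_{i}(\eta)-x_{0}\|$ and the sum of squared gradient norms $G_{T}(\eta)=\sum_{i<T}\|g_{i}(\eta)\|^{2}$ .

```python
import math

def run_sgd(x0, eta, T, grad_oracle, project=None):
    """Run T steps of fixed-step SGD; return (average iterate, r_bar_T, G_T)."""
    x0 = [float(v) for v in x0]
    x = list(x0)
    running_sum = [0.0] * len(x0)   # accumulates x_0 + ... + x_{T-1}
    r_bar, G = 0.0, 0.0
    for _ in range(T):
        running_sum = [s + v for s, v in zip(running_sum, x)]
        g = grad_oracle(x)
        G += sum(gj * gj for gj in g)            # running sum of squared gradient norms
        x = [xj - eta * gj for xj, gj in zip(x, g)]
        if project is not None:                  # optional Euclidean projection onto X
            x = project(x)
        r_bar = max(r_bar, math.dist(x, x0))     # running maximum of ||x_i - x_0||
    x_avg = [s / T for s in running_sum]
    return x_avg, r_bar, G

def abs_grad(x):
    # Subgradient of the hypothetical objective f(x) = |x - 3| (illustration only)
    return [float((x[0] > 3.0) - (x[0] < 3.0))]

x_avg, r_bar, G = run_sgd([0.0], eta=0.5, T=4, grad_oracle=abs_grad)
```

In this toy run the iterates are 0, 0.5, 1, 1.5, 2, so the averaged point is 0.75, the maximum distance traveled is 2, and the four unit-norm subgradients sum to $G_{4}=4$ .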

A parameter-free step-size selection criterion for SGD

In this section we present the key component of our development: a computable certificate for the efficiency of a candidate SGD step size. For ease of exposition, in this section we restrict some of our arguments to the exact gradient setting, but emphasize that they ultimately translate to high-probability bounds in the stochastic setting.

Consider the noiseless setting with step size $\eta$ , iterates $x_{0},x_{1}(\eta),\ldots,x_{T}(\eta)$ and gradients $g_{0}$ , $g_{1}(\eta)$ , $\ldots,g_{T-1}(\eta)$ . It is well-known that if $\eta$ satisfies

Our key proposal is to approximate the distance to the optimum $d_{0}$ with a computable proxy: the maximum distance traveled by the algorithm, $\bar{r}_{T}(\eta)\coloneqq\max_{i\leq T}\|x_{0}-x_{i}(\eta)\|$ . We consider step sizes that (approximately) satisfy

$\eta=\phi(\eta)\coloneqq\frac{\bar{r}_{T}(\eta)}{\sqrt{\alpha G_{T}(\eta)+\beta}}$ (4)

for nonnegative damping parameters $\alpha$ and $\beta$ ; in the exact gradient setting we can set $\alpha$ to any number $>1$ and $\beta=0$ , while the stochastic setting requires scaling $\alpha$ and $\beta$ roughly as $\mathsf{poly}(\log\log T)$ . Intuitively, $\bar{r}_{T}$ approximates $d_{0}$ since the SGD iterates should converge to $x_{\star}$ and therefore $\|x_{0}-x_{T}(\eta)\|$ should be similar to $\|x_{0}-x_{\star}\|$ . However, in non-smooth optimization, convergence to $x_{\star}$ can be arbitrarily slow. We nevertheless prove that, when $\eta\leq\phi(\eta)$ , we have $\bar{r}_{T}(\eta)=O(d_{0})$ (Lemma 2 below). With this result and a refined SGD error bound (Lemma 1 below), we show that (with exact gradients) any $\eta$ satisfying criterion (4) recovers the optimal error bound.
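
To make the criterion concrete, here is a small Python sketch that evaluates $\phi(\eta)=\bar{r}_{T}(\eta)/\sqrt{\alpha G_{T}(\eta)+\beta}$ (the form used later in the proofs) by running $T$ steps of unconstrained gradient descent. The function name `phi`, the toy objective $f(x)=|x-3|$ and the chosen constants are assumptions for illustration only.

```python
import math

def phi(eta, x0, T, grad, alpha=3.0, beta=0.0):
    """Evaluate r_bar_T(eta) / sqrt(alpha * G_T(eta) + beta) using T gradient queries."""
    x = [float(v) for v in x0]
    start = list(x)
    r_bar, G = 0.0, 0.0
    for _ in range(T):
        g = grad(x)
        G += sum(gj * gj for gj in g)
        x = [xj - eta * gj for xj, gj in zip(x, g)]
        r_bar = max(r_bar, math.dist(x, start))
    denom = math.sqrt(alpha * G + beta)
    # Degenerate case (all gradients zero and beta = 0): a convention of this sketch.
    return r_bar / denom if denom > 0 else math.inf

def abs_grad(x):
    # Subgradient of the hypothetical objective f(x) = |x - 3|
    return [float((x[0] > 3.0) - (x[0] < 3.0))]

phi_small = phi(0.01, [0.0], 16, abs_grad)   # tiny step: iterates crawl toward 3
phi_large = phi(10.0, [0.0], 16, abs_grad)   # huge step: iterates bounce between 0 and 10
```

With the tiny step the iterates travel $16\eta$ in a straight line, so $\phi(\eta)=\eta\sqrt{T/\alpha}>\eta$ ; with the huge step the distance traveled equals $\eta$ itself, so $\phi(\eta)=\eta/\sqrt{\alpha T}<\eta$ . The sign of $\phi(\eta)-\eta$ therefore flips somewhere in between, which is what a root-finding procedure can exploit.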

In the noiseless setting, any step size $\eta>0$ satisfying (4) with $\alpha>1$ and $\beta=0$ produces $\bar{x}\coloneqq\frac{1}{T}\sum_{i<T}x_{i}(\eta)$ such that $\|x_{\star}-\bar{x}\|\leq\frac{2\alpha}{\alpha-1}\|x_{\star}-x_{0}\|$ and

Before proving Proposition 1, let us briefly discuss its algorithmic implications. Since the function $\phi(\cdot)$ is computable (at the cost of $T$ gradient queries) without a-priori assumptions on $d_{0}$ , we have reduced parameter-free optimization to solving the one-dimensional implicit equation (4). However, the function $\phi$ might be discontinuous and an exact solution to the implicit equation might not even exist. Nevertheless, in the next section we show that finding an interval $[\eta,2\eta]$ in which $h\mapsto\phi(h)-h$ changes sign produces nearly the same error certificates at an interval edge. Since such an interval is readily found via bisection, this forms the basis of a working parameter-free step size tuner. We leave the details to Section 3 and for the remainder of this section prove Proposition 1.
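
The bracketing idea can be sketched as a log-scale bisection over any callable `phi`. This is an illustrative sketch rather than the paper's RootFindingBisection: the function name `log_bisection`, the discontinuous `toy_phi`, and the choice to return the whole bracket (instead of selecting one endpoint, which Section 3 handles) are assumptions.

```python
import math

def log_bisection(phi, eta_lo, eta_hi):
    """Shrink [eta_lo, eta_hi] until eta_hi <= 2 * eta_lo, keeping a sign change
    of h -> phi(h) - h inside: eta_lo <= phi(eta_lo) and eta_hi > phi(eta_hi)."""
    if not (eta_lo <= phi(eta_lo) and eta_hi > phi(eta_hi)):
        raise ValueError("initial interval does not bracket a sign change")
    while eta_hi > 2 * eta_lo:
        eta_mid = math.sqrt(eta_lo * eta_hi)   # midpoint on the logarithmic scale
        if eta_mid <= phi(eta_mid):
            eta_lo = eta_mid
        else:
            eta_hi = eta_mid
    return eta_lo, eta_hi

def toy_phi(h):
    # Hypothetical discontinuous phi: phi(h) - h jumps from positive to negative
    # at h = 0.37, so the equation phi(h) = h has no exact solution.
    return 1.0 if h < 0.37 else 0.1

lo, hi = log_bisection(toy_phi, 1e-6, 1e6)
```

Each iteration halves the logarithm of the ratio `eta_hi / eta_lo`, so starting from a ratio $\rho$ the loop needs roughly $\log\log\rho$ evaluations of `phi` to reach a factor-2 bracket, even though `toy_phi` has no root.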

The proof of Proposition 1 hinges on two lemmas. The first is a variant of the standard SGD error bound (recall that $\Delta_{i}(\eta)=g_{i}(\eta)-\nabla f(x_{i}(\eta))$ is zero in the noiseless setting).

Since $\eta$ is fixed throughout this proof, we streamline notation by dropping it from $x_{i},g_{i},\Delta_{i}$ and $G_{i}$ . By convexity and the definition of $\Delta_{i}$ ,

From $x_{i+1}=\Pi_{\mathcal{X}}(x_{i}-\eta g_{i})$ we can derive the standard subgradient method inequality

$d_{i+1}^{2}\leq d_{i}^{2}-2\eta\left<g_{i},x_{i}-x_{\star}\right>+\eta^{2}\|g_{i}\|^{2}$ (7)

for all $\eta\geq 0$ and $i=0,\dots,T-1$ . Rearranging and summing over $i<T$ gives

The inequality $(\star)$ is where our proof deviates from the textbook derivation [28, Theorem 2.13.]; it holds because $d_{0}-d_{T}\leq r_{T}$ due to the triangle inequality, and either $d_{T}\leq d_{0}$ holds or $d_{0}^{2}-d_{T}^{2}\leq 0\leq 2d_{0}\bar{r}_{T}$ . Substituting into (6) and applying $f(\bar{x})\leq\frac{1}{T}\sum_{i<T}f(x_{i})$ by Jensen’s inequality gives (5). ∎

The second lemma shows that any $\eta$ satisfying $\eta\leq\phi(\eta)$ is guaranteed to produce iterates that do not wander too far from $x_{\star}$ . This is our basic localization guarantee.

In the noiseless setting with $\alpha>1$ and $\eta>0$ , if $\eta\leq\phi(\eta)$ then we have $\bar{d}_{T}(\eta)\leq\frac{\alpha+1}{\alpha-1}d_{0}$ and $\bar{r}_{T}(\eta)\leq\frac{2\alpha}{\alpha-1}d_{0}$ .

This proof once more drops the explicit dependence on $\eta$ . Summing the inequality (7) over $i<t$ and noting that $\left<g_{i},x_{i}-x_{\star}\right>\geq f(x_{i})-f(x_{\star})\geq 0$ due to convexity and the noiseless setting, we have $d_{t}^{2}\leq d_{0}^{2}+\eta^{2}G_{t}$ for every $t$ . Maximizing over $t\leq T$ and substituting $\eta\leq\phi(\eta)\leq\bar{r}_{T}/\sqrt{\alpha G_{T}}$ yields

where $(\star)$ follows from the triangle inequality: $\bar{r}_{T}=r_{t}\leq d_{t}+d_{0}\leq\bar{d}_{T}+d_{0}$ for some $t\leq T$ . Rearranging yields $\left(\bar{d}_{T}-\frac{1}{\alpha-1}d_{0}\right)^{2}\leq\frac{\alpha^{2}}{(\alpha-1)^{2}}d_{0}^{2},$ and therefore $\bar{d}_{T}\leq\frac{\alpha+1}{\alpha-1}d_{0}$ as required. The bound $\bar{r}_{T}\leq\frac{2\alpha}{\alpha-1}d_{0}$ follows from substituting $\bar{d}_{T}\leq\frac{\alpha+1}{\alpha-1}d_{0}$ into $\bar{r}_{T}\leq\bar{d}_{T}+d_{0}$ . ∎
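
The rearrangement step can be expanded as follows. This is a worked version in the proof's own notation, starting from the inequality $\bar{d}_{T}^{2}\leq d_{0}^{2}+\frac{1}{\alpha}(\bar{d}_{T}+d_{0})^{2}$ that results from combining $d_{t}^{2}\leq d_{0}^{2}+\eta^{2}G_{t}$ , $\eta\leq\bar{r}_{T}/\sqrt{\alpha G_{T}}$ and the triangle inequality; it uses only $\alpha>1$ .

```latex
\begin{align*}
\bar{d}_T^2 \le d_0^2 + \tfrac{1}{\alpha}\big(\bar{d}_T + d_0\big)^2
&\iff (\alpha-1)\,\bar{d}_T^2 - 2 d_0 \bar{d}_T \le (\alpha+1)\, d_0^2 \\
&\iff \Big(\bar{d}_T - \tfrac{1}{\alpha-1} d_0\Big)^2
      \le \tfrac{(\alpha+1)(\alpha-1) + 1}{(\alpha-1)^2}\, d_0^2
      = \tfrac{\alpha^2}{(\alpha-1)^2}\, d_0^2 \\
&\implies \bar{d}_T \le \tfrac{1}{\alpha-1} d_0 + \tfrac{\alpha}{\alpha-1} d_0
      = \tfrac{\alpha+1}{\alpha-1}\, d_0 .
\end{align*}
```

For example, with $\alpha=3$ this gives $\bar{d}_{T}\leq 2d_{0}$ and hence $\bar{r}_{T}\leq 3d_{0}$ .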

Proposition 1 follows from substituting $\eta=\bar{r}_{T}(\eta)/\sqrt{\alpha G_{T}}$ into bound (5) yielding $f(\bar{x})-f(x_{\star})\leq\frac{d_{0}\sqrt{\alpha G_{T}(\eta)}}{T}+\frac{\bar{r}_{T}(\eta)\sqrt{G_{T}(\eta)}}{2\sqrt{\alpha}T}$ , and using $\bar{r}_{T}(\eta)\leq\frac{2\alpha}{\alpha-1}d_{0}$ from Lemma 2. $\blacksquare$
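
The final substitution can be written out explicitly. The following worked arithmetic uses only the two displayed terms and the bound $\bar{r}_{T}(\eta)\leq\frac{2\alpha}{\alpha-1}d_{0}$ from Lemma 2; the resulting constant is derived here for illustration and may be stated in a different but equivalent form in Proposition 1.

```latex
\begin{align*}
f(\bar{x}) - f(x_\star)
&\le \frac{d_0\sqrt{\alpha G_T(\eta)}}{T}
   + \frac{\bar{r}_T(\eta)\sqrt{G_T(\eta)}}{2\sqrt{\alpha}\,T} \\
&\le \frac{d_0\sqrt{\alpha G_T(\eta)}}{T}
   + \frac{2\alpha}{\alpha-1}\cdot\frac{d_0\sqrt{G_T(\eta)}}{2\sqrt{\alpha}\,T}
 = \frac{\alpha^{3/2}}{\alpha-1}\cdot\frac{d_0\sqrt{G_T(\eta)}}{T} .
\end{align*}
```

For $\alpha=3$ the constant is $3\sqrt{3}/2\approx 2.6$ .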

Algorithm description, and analysis for exact gradients

In this section we turn the step-size selection criterion presented in the previous section into a complete algorithm (Algorithm 1), valid for stochastic as well as exact gradients, and analyze it in the simpler setting of exact gradients, deferring the stochastic case to the following section. Our algorithm consists of a core log-scale bisection subroutine (RootFindingBisection) coupled with an outer loop that acts as an aggressive doubling scheme on the upper limit of the bisection. We describe and analyze the two components in Sections 3.1 and 3.2, respectively. Then, in Section 3.3, we put these results together and obtain parameter-free rates in the exact gradient setting.

The following lemma shows that the output of RootFindingBisection in fact satisfies a similar bound. See Appendix A.1 for the (easy) proof, and note the lemma also holds in the stochastic case.

We now combine Lemmas 1, 2 and 3 to show an error bound for GD with the $\eta$ selected by RootFindingBisection. The proof of Proposition 2 appears in Appendix A.2.

3.2 Doubling scheme for upper bisection limit

3.3 Error guarantees for exact gradients

With Algorithm 1 explained and Proposition 1 and Lemma 4 in place, we are ready to state the parameter-free convergence guarantee in the exact gradient setting. For simplicity of exposition, we fix $\alpha=3$ and $\beta=0$ , but note that any $\alpha>1$ yields a similar guarantee.

for some $\eta^{\prime}\in[\eta,2\eta]$ , or $\eta=\eta_{\varepsilon}$ and

The proof of Theorem 1 appears in Section A.4. Let us briefly compare the bounds in Theorem 1 to our guarantees for a solution to $\eta=\phi(\eta)$ shown in Proposition 1. The “typical case” bound (11) is similar to the error bound of Proposition 1 with only two notable differences beyond a slightly larger constant factor. First, by eq. (10), the value of $T$ in Theorem 1 is smaller than the total complexity budget by a double-logarithmic factor; this is the cost of performing a bisection instead of assuming we start with a solution to the implicit equation. Second, the term $G_{T}(\eta^{\prime})$ in the RHS of (11) is computed at $\eta^{\prime}$ that is possibly different than the $\eta$ for which we prove the error bound.

While bounding the error of SGD with step size $\eta$ using the gradients observed by SGD with a different step size $\eta^{\prime}$ is unconventional, our resulting bounds appear to be as useful as their more conventional counterparts. First, note that $\eta$ and $\eta^{\prime}$ are within a factor of 2 of each other, and we can bring this factor arbitrarily close to 1 by running more bisection steps. Second, despite the difference in $\eta$ , we can still use our lower bounds to obtain (up to double-logarithmic factors) a $1/T$ rate of convergence for smooth problems with unknown smoothness (see Section 5.1); this is the hallmark of error bounds that scale with $\sqrt{G_{T}}$ . Finally, the different $\eta^{\prime}$ issue disappears if we assume $f$ is $L$ -Lipschitz. In particular, from Theorem 1 with $\eta_{\varepsilon}=\frac{\varepsilon}{2L^{2}}$ , using $G_{T}(\eta)\leq L^{2}T$ for all $\eta$ and $\|g_{0}\|\geq(f(x_{0})-f(x_{\star}))/d_{0}$ (due to convexity of $f$ ) we get that

Analysis for stochastic gradients

In this section, we extend the analysis of Algorithm 1 to the stochastic setting, using the following simple strategy: we define a “good event” under which the noiseless analysis goes through essentially unchanged (Section 4.1), and show that this event occurs with high probability (Section 4.2), obtaining a stochastic, high-probability, analog of our exact gradient result (Section 4.3).

With this definition in hand, slightly modified versions of our key lemmas from the deterministic analysis (Lemma 2, Proposition 2, Lemma 4) continue to hold. See Sections B.1, B.2 and B.3 for proofs of these results, which follow very similarly to their exact-gradient counterparts.

4.2 The good event is likely

We now arrive at the challenging part of the stochastic analysis: showing that the good event we defined occurs with high probability. For this, we require the following standard assumption.

The stochastic gradient oracle satisfies $\|\mathcal{O}(\eta)\|\leq L$ with probability 1.

In online parameter-free optimization such an assumption is unavoidable if one seeks regret scaling linearly in the comparator norm. However, similarly to the best prior results, our bounds depend on $L$ only via a low-order term.

The following result shows that, for appropriate choices of $\alpha$ and $\beta$ and any fixed $\eta\geq 0$ the event $\mathfrak{E}_{T,\alpha,\beta}(\eta)$ has high probability.

Proposition 4 makes no a-priori assumption on the size of $x_{i}(\eta)-x_{\star}$ , instead controlling it empirically via $\bar{d}_{t}(\eta)$ ; this is unusual in the literature and crucial for our purposes. Our proof (given in Section B.5) relies on a time-uniform empirical-Bernstein-type martingale concentration bound . However, since this result requires martingale differences that are bounded with probability 1, we cannot apply it on $\left<\Delta_{i}(\eta),x_{i}(\eta)-x_{\star}\right>$ (which is not bounded), nor can we apply it on $\left<\Delta_{i}(\eta),x_{i}(\eta)-x_{\star}\right>/\bar{d}_{t}(\eta)$ (which is bounded but is not adapted to any filtration). Instead, we consider processes of the form $\left<\Delta_{i}(\eta),\Pi_{1}([x_{i}(\eta)-x_{\star}]/s)\right>$ , where $\Pi_{1}(\cdot)$ is the projection to the unit ball and $s$ is a fixed scalar. By carefully union bounding over a set of $O(\log T)$ values of $s$ , we are able to control the probability of $\mathfrak{E}_{T,\alpha,\beta}(\eta)$ .
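
The truncation device described above can be illustrated with a short Python sketch. The helper names `project_unit_ball` and `truncated_inner` and the numeric inputs are assumptions for illustration; the sketch only shows the two properties the argument relies on, namely that the projection leaves $[x-x_{\star}]/s$ untouched when $\|x-x_{\star}\|\leq s$ , and that the resulting inner product is always bounded by $\|\Delta\|$ .

```python
import math

def project_unit_ball(v):
    """Euclidean projection onto the unit ball: v / max(1, ||v||)."""
    norm = math.sqrt(sum(c * c for c in v))
    scale = max(1.0, norm)
    return [c / scale for c in v]

def truncated_inner(delta, x, x_star, s):
    """Compute <delta, Pi_1((x - x_star) / s)> for a fixed scalar s > 0."""
    direction = project_unit_ball([(xi - xs) / s for xi, xs in zip(x, x_star)])
    return sum(d * u for d, u in zip(delta, direction))

delta = [3.0, 4.0]                      # hypothetical oracle error, norm 5
x, x_star = [3.0, 4.0], [0.0, 0.0]      # ||x - x_star|| = 5

inside = truncated_inner(delta, x, x_star, s=10.0)   # 5 <= s: projection is inactive
outside = truncated_inner(delta, x, x_star, s=1.0)   # 5 > s: projection clips to norm 1
```

Because $s$ is a fixed scalar rather than the data-dependent $\bar{d}_{t}(\eta)$ , each such term depends only on the current iterate and gradient, which is what makes the process adapted; covering a geometric grid of $s$ values then recovers control at the scale of $\bar{d}_{t}(\eta)$ up to a factor of 2.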

Having shown that the good event occurs with high probability for any fixed $\eta$ , our next step is to show that, for proper choices of $\alpha^{(k)}$ and $\beta^{(k)}$ , good events hold with high probability for each and every value of $\eta$ that Algorithm 1 might try. Noting that, for each value of $k$ , Algorithm 1 only tests step size values of the form $2^{j}\eta_{\varepsilon}$ for $j\in\{0,\ldots,2^{k}\}$ , the following lemma (which is a direct application of union bounds) provides the required guarantee; see proof in Section B.6.

Then, under Assumption 1, we have $\Pr\big(\bigcap_{k=2,4,8,\ldots}\bigcap_{j=0,1,\ldots,2^{k}}\mathfrak{E}_{B,\alpha^{(k)},\beta^{(k)}}(2^{j}\eta_{\varepsilon})\big)\geq 1-\delta$ .
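
The union-bound bookkeeping behind this lemma can be checked with exact arithmetic. The sketch below assumes, as in the proof in Section B.6, that each event indexed by $(k,j)$ is allotted failure probability $2^{-2k}\delta$ , and that for each $k\in\{2,4,8,\ldots\}$ there are $2^{k}+1$ candidate step sizes; the function name `union_bound_mass` and the truncation at a finite number of $k$ values are choices made for illustration.

```python
from fractions import Fraction

def union_bound_mass(max_power=6):
    """Sum (2**k + 1) * 2**(-2k) over k = 2, 4, 8, ..., 2**max_power, exactly.

    The result is the fraction of delta consumed by the union bound over all
    tested step sizes 2**j * eta_eps with j in {0, ..., 2**k}.
    """
    total = Fraction(0)
    for m in range(1, max_power + 1):
        k = 2 ** m
        num_step_sizes = 2 ** k + 1             # j = 0, 1, ..., 2**k
        share_per_event = Fraction(1, 4 ** k)   # 2**(-2k) of delta per event
        total += num_step_sizes * share_per_event
    return total

first_term = union_bound_mass(max_power=1)   # only k = 2
partial_sum = union_bound_mass(max_power=6)  # k = 2, 4, 8, 16, 32, 64
```

The $k=2$ term alone contributes $5/16$ , and the later terms shrink doubly exponentially, so the truncated sum stays well below 1 and the total failure probability remains at most $\delta$ .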

4.3 Parameter-free rates for stochastic convex optimization

We are ready to state our main result; see proof in Section B.7.

and for $C=\log\frac{1}{\delta}+\log\log_{+}\frac{B\|x_{\star}-x_{0}\|}{\eta_{\varepsilon}L}$ , either

for some $\eta^{\prime}\in[\eta,2\eta]$ , or

Let us compare our bounds to the best known prior bounds, which follow from online-to-batch conversion of parameter-free regret bounds. McMahan and Orabona achieve an optimal parameter-free regret bound for algorithms that are not adaptive to gradient norms: For any user-specified $\varepsilon$ and gradient budget $B$ , their result guarantees an expected optimality gap of $O\big{(}\frac{\varepsilon+{d_{0}L}\sqrt{\Lambda}}{\sqrt{B}}\big{)}$ where $\Lambda=\log\big{(}1+\frac{d_{0}L}{\varepsilon}\big{)}$ is logarithmic in $\frac{1}{\varepsilon}$ . In comparison, by taking $\eta_{\varepsilon}=O(\frac{\varepsilon}{L^{2}B})$ we guarantee a probability $1-\delta$ optimality gap of $O\big{(}\frac{(\varepsilon+{d_{0}L})\lambda^{2}}{\sqrt{B}}\big{)}$ , where $\lambda=\log\left(\frac{1}{\delta}\log_{+}\frac{Bd_{0}L}{\varepsilon}\right)$ is only double-logarithmic in $\frac{1}{\varepsilon}$ ; see Section B.8 for a slightly tighter bound in this setting.

For small values of $\varepsilon$ , our bounds show a clear asymptotic improvement over prior art. However, we note that for a hypothetical optimally-tuned $\varepsilon$ (which depends on the unknown problem parameter $d_{0}$ ), the logarithmic factor $\Lambda$ of prior work becomes $O(1)$ , while our double-logarithmic factor $\lambda$ remains $O(\log(\frac{1}{\delta}\log B))$ . This occurs because Lemma 6 only provides a somewhat loose bound on $\eta_{\max}$ , and because of the union bound in the proof of Proposition 4. We can mitigate this issue at a cost of adaptivity to gradient norm; see Section D.2 for further discussion.

Our results give the first high-probability parameter-free rates. However, while high-probability bounds are generally considered stronger than expectation bounds, it is not clear how to deduce an expectation bound from our results without increasing the error by a $\mathsf{poly}(\log B)$ factor, due to the need to set $\delta=\mathsf{poly}(1/B)$ . Finally, we note that Theorem 2 also guarantees that the output of the algorithm is at most a multiplicative factor further from $x_{\star}$ than $x_{0}$ was; we believe that this type of guarantee is new in the parameter-free setting. Orabona and Pál [31, Lemma 25] bound the distance moved by Follow the Regularized Leader iterates, but not by a multiple of $\|x_{\star}-x_{0}\|$ .

Adaptivity to problem structure

In this section we showcase our algorithm’s adaptivity by proving stronger rates of convergence under smoothness and strong-convexity assumptions, without introducing any new parameters.

Let us assume that $f$ is $S$ -smooth (i.e., has $S$ -Lipschitz gradient), and consider for simplicity the exact gradient setting; we believe that similar results extend to the stochastic setting as well. Under these assumptions, we show that Algorithm 1, without any changes, achieves (up to double-logarithmic factors) the $Sd_{0}^{2}/T$ suboptimality bound of optimally-tuned GD, as long as $\eta_{\varepsilon}<\frac{1}{2S}$ . See Section C.1 for proof.

We remark that UniXGrad features optimal (accelerated) rates for smooth problems without dependence on the parameter $S$ . However, unlike our method, it requires knowledge of $d_{0}$ .

5.2 Adaptivity to strong convexity using restarts

We now consider a standard strongly-convex stochastic setup [e.g., 17, 36] in which we assume $f$ to be $\mu$ -strongly-convex in $\mathcal{X}$ and admit a stochastic gradient oracle bounded by $L$ . (Note that this implies a bound of $L/\mu$ on the diameter of $\mathcal{X}$ ). Hazan and Kale propose to run SGD for epochs of doubling length and halving step sizes. For a total gradient budget of $B$ , they obtain the optimal bound $O(L^{2}/(\mu B))$ on the expected optimality gap. However, their scheme requires the initial step size to be proportional to $1/\mu$ , and hence requires knowledge of $\mu$ .

We show that restarting Algorithm 1 with doubling gradient budgets (and no step size to tune) recovers (up to double-logarithmic factors) the optimal $1/B$ rate of convergence. To describe the procedure formally, let $\textsc{ParameterFreeTuner}(x_{0},B,\delta,\eta_{\varepsilon})$ denote the output of Algorithm 1 with initial point $x_{0}$ , gradient budget $B$ , failure probability $\delta$ , minimal step size $\eta_{\varepsilon}$ and $\alpha^{(k)},\beta^{(k)}$ as in eq. 13. For user-specified $\varepsilon>0$ and $\delta\in(0,1)$ and $x^{(0)}=x_{0}$ , our doubling procedure is

See proof in Section C.2. Compared to results obtained via parameter-free strongly-convex regret bounds [12, Thm. 7], we remove a squared logarithmic factor, breaking two regret minimization barriers at once.
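
The restart structure can be sketched as a simple loop. This is a loose illustration under stated assumptions, not the procedure analyzed in Section C.2: the callable `tuner` stands in for ParameterFreeTuner, the budgets start at 1 and double, and $\delta$ and $\eta_{\varepsilon}$ are kept constant across restarts, whereas the actual per-restart schedule for these parameters may differ. The stub `halving_stub` is purely hypothetical and exists only to make the example runnable.

```python
def restart_with_doubling_budgets(tuner, x0, total_budget, delta, eta_eps):
    """Call `tuner` with gradient budgets 1, 2, 4, ... while they fit in total_budget.

    Each call is warm-started from the previous output; returns the final point
    and the list of budgets actually used.
    """
    x, budget, spent, budgets = x0, 1, 0, []
    while spent + budget <= total_budget:
        x = tuner(x, budget, delta, eta_eps)   # warm start from the previous output
        spent += budget
        budgets.append(budget)
        budget *= 2
    return x, budgets

def halving_stub(x, budget, delta, eta_eps):
    # Hypothetical stand-in for ParameterFreeTuner: pretends each restart halves
    # the distance to an optimum located at 0.
    return x / 2.0

x_final, budgets = restart_with_doubling_budgets(halving_stub, 64.0, 100, 0.05, 1e-3)
```

With a total budget of 100 the loop runs six restarts with budgets 1 through 32, so the number of restarts grows only logarithmically in the total budget while the last restart uses a constant fraction of it.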

Acknowledgment

We thank Shira Baneth for pointing out typos in an earlier version of this paper. YC was partially supported by the Len Blavatnik and the Blavatnik Family Foundation and an Alon Fellowship. OH was partially supported by the Pitt Momentum Funds.

References

Appendix A Proofs for Section 3

A.2 Proof of Proposition 2

A.3 Proof of Lemma 4

Assume by contradiction that $\eta>\eta_{\max}$ but $\eta\leq\phi(\eta)$ . On the one hand,

which implies $\bar{r}_{T}(\eta)>\frac{2\alpha}{\alpha-1}d_{0}$ . On the other hand, by Lemma 2, we have $\bar{r}_{T}(\eta)\leq\frac{2\alpha}{\alpha-1}d_{0}$ which yields a contradiction.

A.4 Proof of Theorem 1

and the complexity of all preceding bisection calls is at most

Next, we establish (10). Note that Algorithm 1 indeed always returns a point of the form $\bar{x}=\frac{1}{T}\sum_{i<T}x_{i}(\eta)$ with some $T\geq 1$ ; the edge case of returning in algorithm 1 corresponds to $T=1$ . Moreover, in the typical case of returning in algorithm 1, we have $T=\left\lfloor\frac{B}{2k}\right\rfloor\geq\frac{B}{4k}$ for some $k\leq B/4$ . By Lemma 4, we have

giving the claimed lower bound (10) on $T$ .

It remains to show that one of the conclusions (11) and (12) must hold. When Algorithm 1 returns in algorithm 1, this follows immediately from Proposition 2 (if $\eta_{\varepsilon}\leq\phi(\eta_{\varepsilon})$ then conclusion (11) holds; if $\eta_{\varepsilon}>\phi(\eta_{\varepsilon})$ then either $\eta_{\varepsilon}\sqrt{3G_{T}(\eta_{\varepsilon})}\leq 3d_{0}$ and conclusion (11) holds, or $d_{0}\sqrt{3G_{T}(\eta_{\varepsilon})}\leq\eta_{\varepsilon}G_{T}(\eta_{\varepsilon})$ and conclusion (12) holds). In the edge case of returning $\bar{x}=x_{0}$ in algorithm 1, corresponding to $T=1$ , conclusion (11) clearly holds, as $\|x_{0}-x_{\star}\|\leq 4\|x_{0}-x_{\star}\|$ trivially and $f(x_{0})-f(x_{\star})\leq\left<g_{0},x_{0}-x_{\star}\right>\leq\|x_{0}-x_{\star}\|\|g_{0}\|$ due to convexity of $f$ . We remark that due to inequality (10), the $T=1$ edge case is only possible for a very small iteration budget $B=O(\log\log_{+}\frac{\|x_{0}-x_{\star}\|}{\eta_{\varepsilon}\|g_{0}\|})$ . ∎

Appendix B Proofs for Section 4

The proof proceeds similarly to the proof of Lemma 2, except that instead of assuming exact subgradients we make use of the definition of $\mathfrak{E}_{T,\alpha,\beta}$ . As usual in proofs where $\eta$ is fixed, we drop the explicit dependence on it from $x_{t},g_{t},d_{t},\bar{d}_{t},r_{t},\bar{r}_{t}$ and $G_{t}$ .

Noting that $\left<\nabla f(x_{i}),x_{i}-x_{\star}\right>\geq f(x_{i})-f(x_{\star})\geq 0$ due to convexity and that $\sum_{i<t}\left<\Delta_{i},x_{i}-x_{\star}\right>\geq-\frac{1}{4}\max\{\bar{d}_{t},\eta\sqrt{\beta}\}\sqrt{\alpha G_{t}+\beta}$ for all $t\leq T$ due to $\mathfrak{E}_{T,\alpha,\beta}$ holding, we have

Maximizing both sides of the inequality over $t\leq T$ and recalling that $\eta\leq\phi(\eta)=\bar{r}_{T}/\sqrt{\alpha G_{T}+\beta}$ , we get

Substituting $\bar{r}_{T}\leq\bar{d}_{T}+d_{0}$ (which holds due to the triangle inequality), we get

where the final equality is due to $\alpha>1$ . Thus, we arrive again at inequality (8) from the proof of Lemma 2, but with $\alpha$ replaced by $\alpha^{\prime}=2\alpha/(\alpha+2)$ . We consequently find that

B.2 Proof of Proposition 3

The remainder of the proof is very similar to the proof of Proposition 2. Combining (5), (9) and (21) yields:

B.3 Proof of Lemma 6

The proof is essentially identical to the proof of Lemma 4, with Lemma 5 used instead of Lemma 2; we give it here for completeness.

To show the first part of the lemma, assume that $\mathfrak{E}_{T,\alpha,\beta}(\eta)$ holds and assume by contradiction that $\eta>\eta_{\max}(\alpha,\beta)$ but $\eta\leq\phi(\eta)$ . On the one hand,

which implies $\bar{r}_{T}(\eta)>\frac{4\alpha}{\alpha-2}d_{0}$ . On the other hand, by Lemma 5, we have $\bar{r}_{T}(\eta)\leq\frac{4\alpha}{\alpha-2}d_{0}$ which yields a contradiction.

B.4 A martingale concentration bound

The following corollary is a translation of [18, Theorem 4] which simplifies notation at the cost of looser constants. We remark that it holds even when $\log$ denotes the natural logarithm (as is the convention in ).

Let $X_{t}$ be adapted to $\mathcal{F}_{t}$ such that $\left|X_{t}\right|\leq 1$ with probability 1 for all $t$ . Then, for every $\delta\in\left(0,1\right)$ and any $\hat{X}_{t}\in\mathcal{F}_{t-1}$ such that $\lvert\hat{X}_{t}\rvert\leq 1$ with probability 1,

where $A_{t}(\delta)=\log\left(\frac{60\log(6t)}{\delta}\right)$ .

Throughout the proof we use the binary maximization notation $a\vee b\coloneqq\max\{a,b\}$ .

We apply [18, Theorem 4] with $a=-b=1$ and the polynomial stitched boundary [18, Eq. (10)] with parameters $m,\eta,s\geq 1$ to be specified below. This yields

with $\zeta$ denoting the Riemann zeta function,

Let us first simplify $S_{\delta/2}\left(m\vee v\right)$ and then choose the parameters $m,\eta,s$ to yield decent constants. Writing $Z=s\log\log\left(\frac{\eta(m\vee v)}{m}\right)+\log\frac{2\zeta\left(s\right)}{\delta\log^{s}\eta}$ , we have

Note that $\log\log\left(\frac{\eta(m\vee v)}{m}\right)\geq\log\log\eta$ and therefore $Z\geq\log\frac{2\zeta\left(s\right)}{\delta}\geq\log(2\zeta(s))$ . Therefore, if $m\leq\log(2\zeta(s))$ , we have the slightly simplified bound

Moreover, for $m\geq\eta$ we may upper bound $Z$ by

Taking $\eta=m=1.8$ and $s=1.05$ , one easily confirms that $m\leq 3\leq\log(2\zeta(s))$ and, substituting back $k_{1}$ and $k_{2}$ as defined above, we have

Finally, noting that $\big{(}X_{s}-\hat{X}_{s}\big{)}^{2}\leq 4$ for all $s$ , we may substitute $m+v\leq 6t$ in the bound above, concluding the proof. ∎

B.5 Proof of Proposition 4

Since $\eta$ is fixed throughout this proof, we drop the explicit dependence on it to simplify notation. Furthermore, we define the normalized/shorthand quantities:

With these definitions, the failure probability we wish to bound is

Note that $k_{t}$ satisfies the following

The first set of inequalities follows from the definition of $\bar{d}_{t}^{\prime}$ , $k_{t}$ and $s_{k}$ , while the latter inequality is due to the fact that $\bar{d}_{t}\leq d_{0}+t\eta L\leq d_{0}+\frac{t}{4}\eta\sqrt{\beta}$ , which follows from the definition of SGD, the triangle inequality, the assumption $\|g_{i}\|\leq L$ w.p. 1, and $\beta\geq 16L^{2}$ .

Writing $\Pi_{1}(x)=x/\max\{1,\|x\|\}$ for the projection to the Euclidean unit ball, we now bound the failure probability as follows

where $(i)$ follows from $\|x_{i}-x_{\star}\|=d_{i}\leq\bar{d}_{t}\leq s_{k_{t}}$ (which means that the projection does nothing), $(ii)$ follows from $s_{k_{t}}\leq 2\bar{d}_{t}^{\prime}$ , and $(iii)$ follows from $0\leq k_{t}\leq\log(6t)\leq\log(6T)$ and a union bound. We can now define a nicely behaved stochastic process: for every $i$ and $k$ let

and note that $X_{i}^{(k)}$ is adapted to the filtration $\mathcal{F}_{i}=\sigma(g_{0},g_{1},\ldots,g_{i})$ (i.e., $X_{i}^{(k)}\in\mathcal{F}_{i}$ ) and satisfies $\left\lvert X_{i}^{(k)}\right\rvert\leq\frac{\|g_{i}\|}{L}\leq 1$ by Cauchy-Schwarz and $\|g_{i}\|\leq L$ . Applying Corollary 1 with $\hat{X}=0$ as the predictable sequence, we obtain, for any $k$ and $\delta^{\prime}\in(0,1)$ ,

where $A_{t}(\delta^{\prime})=\log\left(\frac{60\log(6t)}{\delta^{\prime}}\right)$ . Note that

Furthermore, note that, for $\delta^{\prime}=\frac{\delta}{\log(6T)}$ we have that $A_{t}(\delta^{\prime})\leq A_{T}(\delta^{\prime})=C=\alpha/32^{2}\leq\alpha^{\prime}/16$ and that $A_{t}^{2}(\delta^{\prime})\leq A_{T}^{2}(\delta^{\prime})=C^{2}\leq\beta^{\prime}/16$ . Substituting into inequality (23) we conclude that

for all $k$ , and therefore $\Pr(\mathfrak{E}_{T,\alpha,\beta}^{c})\leq\delta$ by the bound (22). ∎
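The two elementary facts used in this proof, that the projection $\Pi_{1}$ leaves points inside the unit ball untouched and that an inner product between a gradient scaled by $1/L$ and a projected point has magnitude at most 1, can be illustrated with a short sketch. The random test vectors, the dimension and the value of `L` below are hypothetical, and the quantity computed is only an illustrative process of this inner-product form, not necessarily the exact $X_{i}^{(k)}$ of the proof.

```python
import math
import random

def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

def inner(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def project_unit_ball(x):
    """Pi_1(x) = x / max{1, ||x||}: Euclidean projection onto the unit ball."""
    scale = max(1.0, norm(x))
    return [xi / scale for xi in x]

random.seed(0)
L = 2.0  # hypothetical gradient-norm bound
max_abs = 0.0
for _ in range(1000):
    g = [random.uniform(-1, 1) for _ in range(5)]
    scale = L * random.random() / max(norm(g), 1e-12)
    g = [gi * scale for gi in g]  # rescaled so that ||g|| <= L
    x = [random.gauss(0, 3) for _ in range(5)]  # may lie far outside the ball
    value = inner([gi / L for gi in g], project_unit_ball(x))
    max_abs = max(max_abs, abs(value))

print(max_abs)  # stays at most 1 by Cauchy-Schwarz
```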

B.6 Proof of Lemma 7

Fixing some $k\in\{2,4,8,\ldots\}$ and noting that $C_{k}=\log\left(\frac{60\log^{2}(6B)}{2^{-2k}\delta}\right)$ , we may apply Proposition 4 with $T=B$ , $\alpha=\alpha^{(k)}$ , $\beta=\beta^{(k)}$ and failure probability $2^{-2k}\delta$ , giving $1-\Pr(\mathfrak{E}_{B,\alpha^{(k)},\beta^{(k)}}(\eta))\leq 2^{-2k}\delta$ for any $\eta$ . Therefore, by the union bound

Applying the union bound once more, we have

B.7 Proof of Theorem 2

The bound $B$ on the algorithm’s query number is deterministic and therefore follows exactly as in the proof of Theorem 1. For the remainder of the analysis we assume the event $\bigcap_{k=2,4,8,\ldots}\bigcap_{j=0,1,\ldots,2^{k}}\mathfrak{E}_{B,\alpha^{(k)},\beta^{(k)}}(2^{j}\eta_{\varepsilon})$ holds, which by Lemma 7 happens with probability at least $1-\delta$ ; we will show that, conditional on this event holding, the conclusions of the theorem hold deterministically. (Note that $\mathfrak{E}_{B,\alpha^{(k)},\beta^{(k)}}$ implies $\mathfrak{E}_{T_{k},\alpha^{(k)},\beta^{(k)}}$ for all $T_{k}\leq B$ ).
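The probability accounting behind Lemma 7 can be checked numerically. The sketch below is illustrative only: it verifies that $C_{k}$ equals $A_{B}(\delta^{\prime})$ with $\delta^{\prime}=2^{-2k}\delta/\log(6B)$ , matching the choice of $\delta^{\prime}$ in the proof of Proposition 4, and that summing the failure probabilities $2^{-2k}\delta$ over $j=0,\ldots,2^{k}$ and $k=2,4,8,\ldots$ stays below $\delta$ . The values of `B`, `delta` and the truncation `max_exponent` are hypothetical.

```python
import math

def A(t, delta):
    """A_t(delta) = log(60 log(6t) / delta)."""
    return math.log(60 * math.log(6 * t) / delta)

def C_k(k, B, delta):
    """C_k = log(60 log^2(6B) / (2^{-2k} delta))."""
    return math.log(60 * math.log(6 * B) ** 2 / (2.0 ** (-2 * k) * delta))

def union_bound_mass(max_exponent=6):
    """Sum of 2^{-2k} over j = 0..2^k and k = 2, 4, 8, ..., 2^max_exponent."""
    total = 0.0
    for e in range(1, max_exponent + 1):
        k = 2 ** e
        total += (2 ** k + 1) * 2.0 ** (-2 * k)
    return total

B, delta = 1024, 0.05  # hypothetical budget and confidence level
for k in (2, 4, 8):
    delta_k = 2.0 ** (-2 * k) * delta
    assert math.isclose(A(B, delta_k / math.log(6 * B)), C_k(k, B, delta))

mass = union_bound_mass()
print(mass)  # total failure probability is at most mass * delta
```

Since the terms decay doubly exponentially in the exponent, truncating the sum at `max_exponent=6` changes the total only negligibly.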

(where we have used $\alpha\geq\alpha^{(0)}\geq 32^{2}$ ), and, for some $\eta^{\prime}\in[\eta,2\eta]$ ,

In the edge case that $\eta_{\varepsilon}>\phi(\eta_{\varepsilon})$ in the final bisection, we separately consider the cases $\eta_{\varepsilon}\sqrt{\alpha G_{T}(\eta_{\varepsilon})+\beta}\leq 5d_{0}$ and $\eta_{\varepsilon}\sqrt{\alpha G_{T}(\eta_{\varepsilon})+\beta}>5d_{0}$ . In the former case, Proposition 3 gives

so conclusion (15) holds as before. In the second case, where $5d_{0}<\eta_{\varepsilon}\sqrt{\alpha G_{T}(\eta_{\varepsilon})+\beta}$ , we have

Recalling that $\alpha=O(C)$ and $\beta=O(C^{2}L^{2})$ , we see that conclusion (16) holds.

Finally, if Algorithm 1 returns through its fallback return statement rather than the one analyzed above, we have $\bar{x}=x_{0}$ and $T=1$ , and so conclusion (15) holds trivially, since $\|x_{0}-x_{\star}\|\leq 6\|x_{0}-x_{\star}\|$ and $f(x_{0})-f(x_{\star})\leq\left<\nabla f(x_{0}),x_{0}-x_{\star}\right>\leq\|x_{0}-x_{\star}\|L=O(d_{0}\sqrt{C^{2}L^{2}})$ due to convexity of $f$ and Assumption 1. ∎

B.8 A corollary for uniform gradient bounds

The following corollary translates Theorem 2 to the setting where we replace all observed gradient norms by $L$ . In it, $\lambda$ represents a double-logarithmic factor and we use $\iota<1$ to denote low-order terms, which can be ignored as soon as $B=\Omega(\lambda^{2})$ .

The corollary follows by substitution of $\eta_{\varepsilon}=\frac{\varepsilon}{L^{2}}$ into Theorem 2. In particular, the bound (14) becomes

and the upper bound on $\|\bar{x}-x_{\star}\|$ in (16) is

which, when substituting $T\leq B$ and $\iota^{2}=\lambda(\lambda+B)$ and combining with the bound on $\|\bar{x}-x_{\star}\|$ in (15) yields the claimed distance bound.

Finally, recalling that $T\geq 1$ and $T=\Omega(B/m)$ , the suboptimality bound in (15) reads

and the suboptimality bound in (16) reads

combined, these yield the claimed suboptimality bound. ∎

Appendix C Proofs for Section 5

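Recall the following basic property of any $S$ -smooth function $f$ [6, Lemma 3.4]: for all $u$ and $v$ ,

$$f(u)\leq f(v)+\left<\nabla f(v),u-v\right>+\frac{S}{2}\|u-v\|^{2}.\qquad(25)$$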

Using this fact we establish two useful inequalities for our proof. First, for any $\eta\leq\frac{1}{S}$ substituting $u=x_{i+1}(\eta)$ and $v=x_{i}(\eta)$ into (25) gives $\frac{\eta}{2}\|g_{i}(\eta)\|^{2}\leq f(x_{i}(\eta))-f(x_{i+1}(\eta))$ . Summing over $i<T$ we obtain

where $(\star)$ follows from (25) with $u=x_{0}$ and $v=x_{\star}$ . Second, for any $\eta\geq 0$ , substituting $u=x_{i}(\eta)-\frac{1}{S}g_{i}(\eta)$ and $v=x_{i}(\eta)$ into (25) yields $\|g_{i}(\eta)\|^{2}\leq 2S\left[f(x_{i}(\eta))-f\left(x_{i}(\eta)-\tfrac{1}{S}g_{i}(\eta)\right)\right]\leq 2S\left[f(x_{i}(\eta))-f\left(x_{\star}\right)\right]$ . Summing over $i<T$ yields

and rearranging gives $\sqrt{\sum_{i=1}^{T}f(x_{i}(\eta))-f(x_{\star})}=O(\sqrt{S}d_{0})$ again.
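The two smoothness inequalities used above can be checked on a toy problem. The sketch below is an illustration under stated assumptions: it uses a hypothetical diagonal quadratic $f(x)=\frac{1}{2}\sum_{j}a_{j}x_{j}^{2}$ (so $S=\max_{j}a_{j}$ , $x_{\star}=0$ and $f(x_{\star})=0$ ) and exact gradients $g_{i}=\nabla f(x_{i})$ , and it is not the paper's algorithm.

```python
def f(x, a):
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))

def grad(x, a):
    return [ai * xi for ai, xi in zip(a, x)]

def gd_check(a, x0, eta, steps, tol=1e-12):
    """Run gradient descent and test both inequalities at every step."""
    S = max(a)  # smoothness constant of the diagonal quadratic
    x = list(x0)
    descent_ok, gradient_ok, sum_sq = True, True, 0.0
    for _ in range(steps):
        g = grad(x, a)
        g_sq = sum(gi * gi for gi in g)
        x_next = [xi - eta * gi for xi, gi in zip(x, g)]
        # (eta / 2) ||g_i||^2 <= f(x_i) - f(x_{i+1}), valid for eta <= 1 / S
        descent_ok = descent_ok and 0.5 * eta * g_sq <= f(x, a) - f(x_next, a) + tol
        # ||g_i||^2 <= 2 S [f(x_i) - f(x_star)], with f(x_star) = 0 here
        gradient_ok = gradient_ok and g_sq <= 2 * S * f(x, a) + tol
        sum_sq += g_sq
        x = x_next
    d0_sq = sum(xi * xi for xi in x0)
    # Telescoped form: (eta / 2) sum_i ||g_i||^2 <= f(x_0) - f(x_star) <= (S / 2) d_0^2
    telescoped_ok = 0.5 * eta * sum_sq <= 0.5 * S * d0_sq + tol
    return descent_ok, gradient_ok, telescoped_ok

a = [0.5, 1.0, 4.0]        # hypothetical curvatures, so S = 4
x0 = [1.0, -2.0, 0.5]      # hypothetical starting point
result = gd_check(a, x0, eta=1.0 / max(a), steps=50)
print(result)
```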

C.2 Proof of Theorem 4

First, note that by Theorem 2, computing $x^{(M)}$ requires $\sum_{m=1}^{M}B^{(m)}=2+4+\cdots+2^{M}=2^{M+1}-2\leq B$ gradient queries, as claimed. Next, note that

and therefore by the union bound, with probability at least $1-\delta$ the conclusions of Theorem 2 hold for all applications of ParameterFreeTuner; we proceed with our analysis conditional on that event.
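The doubling budget schedule $B^{(m)}=2^{m}$ used at the start of this proof can be sketched as follows. The function name `budget_schedule` is hypothetical, and choosing $M$ as the largest integer with $2^{M+1}-2\leq B$ is an assumption made for illustration; the text only requires that the total not exceed $B$ .

```python
def budget_schedule(B):
    """Return [2, 4, ..., 2^M] for the largest M with 2^(M+1) - 2 <= B."""
    M = 0
    while 2 ** (M + 2) - 2 <= B:
        M += 1
    return [2 ** m for m in range(1, M + 1)]

B = 64  # hypothetical total gradient budget
schedule = budget_schedule(B)
total = sum(schedule)
print(schedule, total)
```

For a budget that is a power of two, the last stage receives $2^{M}=B/2$ queries under this choice, consistent with the relation used at the end of the proof.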

Note that $\delta^{(m)}\geq\delta^{(M)}=\Omega(\delta/\log^{2}B)$ and $\eta_{\varepsilon}^{(m)}\geq\eta_{\varepsilon}^{(M)}=\Omega(\varepsilon/(L^{2}B))$ . Consequently, we have

With this, we apply Theorem 2 to bound the suboptimality of $x^{(m)}$ for $m\leq M$ . Let $T^{(m)}$ be the corresponding $T$ from Theorem 2, and note that $T^{(m)}\geq\max\{1,B^{(m)}/\lambda\}$ . Noting also that $G_{T^{(m)}}(\eta^{\prime})/T^{(m)}\leq L^{2}$ for all $\eta^{\prime}$ , we have

for some $\widetilde{\varepsilon}=O(\varepsilon\lambda^{2})$ and $\widetilde{L}=O(L\lambda^{3/2})$ , and all $m\leq M$ . Applying strong convexity, we have that $\frac{\mu}{2}\|x^{(m-1)}-x_{\star}\|^{2}\leq f(x^{(m-1)})-f(x_{\star})$ which implies

where $(\star)$ follows from $\sqrt{ab}\leq\max\{2a,b/2\}$ . Iterating this bound and noting that $2B^{(m-1)}=B^{(m)}=2^{m}$ we conclude that

Finally, the strong convexity and Lipschitz continuity assumptions imply that

$$\frac{\mu}{2}\|x^{(0)}-x_{\star}\|^{2}\leq f(x^{(0)})-f(x_{\star})\leq L\|x^{(0)}-x_{\star}\|,$$

and therefore $\|x^{(0)}-x_{\star}\|\leq\frac{2L}{\mu}$ and, using Lipschitz continuity again, $f(x^{(0)})-f(x_{\star})\leq\frac{2L^{2}}{\mu}\leq\frac{4\widetilde{L}^{2}}{\mu}$ . Substituting back into (29), the second and third terms merge. Recalling that $B^{(M)}=2^{M}=B/2$ , and that $\widetilde{\varepsilon}=O(\varepsilon\lambda^{2})$ and $\widetilde{L}=O(L\lambda^{3/2})$ , we have

Appendix D Additional discussion

Optimality gap bounds obtained via online-to-batch conversion have the appealing property of holding for any comparator $\mathring{x}\in\mathcal{X}$ and not necessarily a minimizer of $f$ . Consequently, the parameter-free regret minimization algorithm of McMahan and Orabona outputs a point $\bar{x}$ with an error bound of the form

In contrast, we only provide guarantees for $\mathring{x}=x_{\star}$ , a minimizer of $f$ . This can be restrictive in settings where $\|x_{0}-x_{\star}\|$ is very large or possibly infinite, i.e., when the minimum of $f$ is not attained, as is the case in logistic regression on separable data.

However, the assumption that $x_{\star}$ is optimal can be relaxed. In particular, our only real requirement from $x_{\star}$ is that, for every SGD iterate $x_{t}(\eta)$ evaluated in Algorithm 1, we have $f(x_{t}(\eta))-f(x_{\star})\geq 0$ . In the noiseless setting, we may modify Algorithm 1 to return the GD iterate with the lowest objective value (from all the GD executions combined). Such a modified algorithm would satisfy the error bounds in Theorem 1 with respect to an arbitrary point $x_{\star}$ : if the algorithm’s output has function value smaller than $f(x_{\star})$ , we are done; otherwise, we have $f(x_{t}(\eta))-f(x_{\star})\geq 0$ for every GD iterate, and our analysis goes through. (Note, however, that we lose the guarantee on the distance between $x_{\star}$ and the algorithm’s output.) Extension to the stochastic case is more involved since we do not have the privilege of choosing the best SGD iterate; we leave it to future work.
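The best-iterate modification described above can be sketched in a few lines. This is a minimal illustration under loud assumptions: the function `gd_best_iterate`, the fixed grid of step sizes and the toy objective $f(x)=|x_{1}|+|x_{2}|$ are all hypothetical, and the grid stands in for the bisection over step sizes that Algorithm 1 actually performs.

```python
def gd_best_iterate(f, subgrad, x0, step_sizes, steps):
    """Run (sub)gradient descent for each step size and keep the single
    iterate with the lowest objective value across all runs."""
    best_x, best_val = list(x0), f(x0)
    for eta in step_sizes:
        x = list(x0)
        for _ in range(steps):
            g = subgrad(x)
            x = [xi - eta * gi for xi, gi in zip(x, g)]
            val = f(x)
            if val < best_val:  # noiseless setting: f can be evaluated exactly
                best_x, best_val = list(x), val
    return best_x, best_val

def f_abs(x):
    return sum(abs(xi) for xi in x)

def subgrad_abs(x):
    # A valid subgradient of the l1 norm (taking 0 at the kink).
    return [(xi > 0) - (xi < 0) for xi in x]

x0 = [3.0, -2.0]
best_x, best_val = gd_best_iterate(f_abs, subgrad_abs, x0, [0.01, 0.1, 1.0], 20)
print(best_x, best_val)
```

Because the returned point is a minimum over all evaluated iterates, its objective value never exceeds that of any single run, which is the property the argument above relies on.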

D.2 An alternative bisection target without gradient norm adaptivity

Algorithm 1 is fairly adaptive to stochastic gradient norms, with performance guarantees that depend mainly on observed norms, featuring an a-priori gradient norm bound in low-order terms. Moreover, in the noiseless setting our method requires no a-priori bound on gradient norms and our bounds depend solely on observed norms.

It is possible, however, to slightly simplify our method and sharpen some of our bounds by forgoing adaptivity to gradient norms. Specifically, if we only seek guarantees that depend on an a-priori gradient norm bound $L$ , then it is possible to replace the bisection target $\phi$ defined in Algorithm 1 with

Our analysis applies to this modified bisection target as well, but with $G_{T}(\eta)$ replaced by $L^{2}T$ throughout. Moreover, this modification allows us to slightly improve two parts of the analysis.

First, we may sharpen the bound on $\eta_{\max}$ in Lemmas 4 and 6 to $\eta_{\max}=O\left(\frac{d_{0}}{L\sqrt{T}}\right)$ , improving our bound on the maximum value of $k$ used in Algorithm 1. In the deterministic case, this allows us to establish optimality gap bounds scaling as $\varepsilon+\lambda^{\prime}\frac{d_{0}L}{\sqrt{B}}$ for $\lambda^{\prime}=O\Big{(}\sqrt{\log\log_{+}\frac{d_{0}L}{\varepsilon\sqrt{B}}}\Big{)}$ which satisfies $\lambda^{\prime}=O(1)$ for $\varepsilon=\frac{d_{0}L}{\sqrt{B}}$ , similarly to the bounds of previous works, as discussed at the end of Section 4.

Second, in the stochastic setting, we may use Blackwell’s inequality instead of the time-uniform empirical Bernstein inequality. This allows us to take $\alpha^{(k)}=2k+O\left(\log\left(\frac{1}{\delta}\log B\right)\right)$ , eliminating the additive square logarithmic term stemming from $\beta^{(k)}$ in (13). Consequently, in the stochastic setting we obtain, with probability $1-\delta$ , an optimality gap bound of $\varepsilon+\lambda^{\prime\prime}\frac{d_{0}L}{\sqrt{B}}$ for $\lambda^{\prime\prime}=O\Big{(}{\lambda^{\prime}\big{[}\lambda^{\prime}+\sqrt{\log\left(\frac{1}{\delta}\log B\right)}\big{]}}\Big{)}$ , with $\lambda^{\prime}$ defined above. Therefore, in the stochastic case we do not remove the $\log\log B$ term entirely, even for the optimal $\varepsilon$ . The source of the remaining $\log\log B$ is the union bound we use in the proof of Proposition 4, which might be removable via a more careful probabilistic argument.