Making SGD Parameter-Free

Yair Carmon, Oliver Hinder

Introduction

Stochastic convex optimization (SCO) is a cornerstone of both the theory and practice of machine learning. Consequently, there is intense interest in developing SCO algorithms that require little to no prior knowledge of the problem parameters, and hence little to no tuning . In this work we consider the fundamental problem of non-smooth SCO (in a potentially unbounded domain) and seek methods that are adaptive to a key problem parameter: the initial distance to optimality.

Current approaches for tackling this problem focus on the more general online learning problem of parameter-free regret minimization , where the goal is to to obtain regret guarantees that are valid for comparators with arbitrary norms. Research on parameter-free regret minimization has lead to practical algorithms for stochastic optimization , methods that are able to adapt to many problem parameters simultaneously and methods that can work with any norm . In the basic Euclidean setting with 1-Lipschitz losses where only the initial distance to optimality is unknown, there are essentially matching upper and lower bounds , showing that the best achievable parameter-free average regret scales as

where TT is the number of steps, x˚\|\mathring{x}\| is the (Euclidean) comparator norm, and ε>0\varepsilon>0 represents the (user-chosen) regret we will incur even if the comparator norm is zero. This is larger by a logarithmic factor than the optimal average-regret when the comparator norm is known in advance.

Parameter-free regret bounds immediately translate into parameter-free SCO algorithms using online-to-batch conversion . The expected optimality gap bound of the resulting algorithm is identical to (1) when we replace x˚\mathring{x} by xx0x_{\star}-x_{0}, i.e., the difference between the optimum and the initial point. This bound is a logarithmic factor worse than what stochastic gradient descent (SGD) can achieve when we know the distance to optimality and use it to compute step sizes. While this logarithmic factor is unavoidable for regret minimization, it is unclear if it is necessary for SCO.

In this paper we show it is possible to obtain stronger parameter-free rates for SCO by moving beyond the regret minimization abstraction. In particular, for any ε>0\varepsilon>0 and δ(0,1)\delta\in(0,1), we a obtain probability 1δ1-\delta optimality gap bounds of

which is better than any bound achievable by online-to-batch conversion. While replacing the logarithmic factor by a double-logarithmic factor may appear a small improvement, we consider it important due to the fundamental nature of the problem as well as the theoretical separation it establishes between parameter-free SCO and OCO. Such separations are rare in the literature; we are only aware of one prior example .

Our method also provides high probability guarantees on the suboptimality gap. This resolves an open problem in parameter-free optimization; see and [29, §7]. We are able to form high probability bounds because, unlike other parameter-free SCO algorithms, we prove a strong localization guarantee: our output xˉ\bar{x} satisfies xˉx=O(x0x)\|\bar{x}-x_{\star}\|=O(\|x_{0}-x_{\star}\|), and key intermediate points satisfy a similar bound as well. We suspect that such localization is difficult to establish with online-to-batch conversion, since online parameter-free algorithms may need to let their iterates fluctuate wildly in order to handle difficult adversaries.

In addition to independence of x0x\|x_{0}-x_{\star}\|, our algorithm exhibits three additional forms of adaptivity. First, our algorithm has adaptivity to gradient norms on par with the best existing parameter-free result : the leading term of our bounds scales with a sum of squared observed gradient norms, and an a-priori gradient norm bound only affects low-order terms. Second, as a consequence, in the smooth and noiseless case our algorithm exhibits a loglogTT\frac{\log\log T}{T} rate of convergence. Finally, via a simple restart scheme we obtain the optimal rate for strongly-convex stochastic problems (up to double-logarithmic factors), without knowledge of the strong-convexity parameter.

On a technical level, our development differs significantly from prior parameter-free optimization methods. While online methods rely on advanced tools such as coin betting , and online Newton steps , our approach is essentially a careful scheme for correctly setting the step size of SGD. Underlying our algorithm is a parameter-free certificate for SGD, which implies both localization and optimality gap bounds. The certificate takes the form of an implicit equation over the SGD step size, which we solve via bisection on the logarithm of the step size. To obtain high-probability bounds, we develop a time-uniform empirical-Berstein-type concentration bound independent of any a-priori assumptions on the iterate norms. Given the ubiquity of SGD in practice and in the classroom, our insights on how choose its step size may be of independent interest.

In the following subsections we review additional related work, as well as the problem setup and notation. Section 2 develops our parameter-free step size certificate. Section 3 presents our algorithm and its analysis in the noiseless regime. Section 4 lifts the analysis to the stochastic setting, proving our main result on parameter-free SCO. Finally, Section 5 shows how our method adapts to smoothness and (via restarts) to strong convexity.

1 Additional related work

The literature on noiseless optimization also offers a rich variety of parameter-free algorithms. In the smooth setting, the Armijo rule is a standard technique for choosing step sizes for gradient descent. Using variants of this idea combined with acceleration, achieves essentially optimal and parameter-free rates of convergence . The Polyak step size rule simultaneously achieves optimal rates for smooth, non-smooth and strongly-convex optimization , but requires knowledge of the optimal function value. This requirement can be relaxed, making the Polyak method parameter-free, but at the cost of a multiplicative logarithmic factor to its bound . Consequently, non-smooth parameter-free deterministic optimization appears to be as hard as SCO. Multiple works generalize line-search and the Polyak method to the stochastic setting , but do not obtain parameter-free rates in the sense we consider here.

Limitations of online-to-batch conversion.

To the best of our knowledge, the only previous example of an SCO rate that is provably unachievable by online to batch conversion of a (uniform) regret bound occurs for strongly-convex optimization. Specifically, any online strongly-convex optimization algorithm must have logarithmic regret (implying suboptimality (logT)/T(\log T)/T via online to batch conversion) , while Hazan and Kale and others have achieved the optimal 1/T1/T rate for stochastic strongly-convex optimization. The variant of our algorithm in Section 5.2 is based on the Epoch-SGD algorithm of , and simultantiously breaks both regret minization barriers, achieving optimality gap (loglogT)/T(\log\log T)/T for parameter-free strongly-convex stochastic optimization with high probability.

Grid search.

In practice, the standard technique for selecting the step size of SGD (and hyperparameters more broadly) is grid search . This typically consists of testing all step sizes on a geometrically spaced grid and choosing the one with the best performance on a held out set. Compared to our method, such grid search is computationally wasteful, as it tests exponentially more steps sizes than we do. Moreover, in the context of parameter-free SCO, proving guarantees for grid search is surprisingly difficult, since it is unclear how to bound the objective value estimation error for points that may be arbitrarily far apart.

2 Problem setup and notation

Our development revolves around the classical fixed step size stochastic gradient descent (SGD) algorithm. Given step size η\eta and initialization x0x_{0}, SGD iterates

and ΠX\Pi_{\mathcal{X}} is the Euclidean projection onto X\mathcal{X}; we intentionally feature the η\eta dependence of xi(η)x_{i}(\eta) and gi(η)g_{i}(\eta) prominently. We define the following quantities associated with the SGD iterates. First, we write the distance to xx_{\star} and its running maximum as

Replacing xx_{\star} with x0x_{0} in the above definitions, we write

Finally, we denote the running sum of squared gradient norms and gradient oracle error by

Throughout, \|\cdot\| denotes the Euclidean norm. We use log\log to denote the base 2 logarithm, and write log+(x)max{2,log(x)}\log_{+}(x)\coloneqq\max\{2,\log(x)\} to simplify O()O(\cdot) notation. For any particular value of η\eta, the quantities xi(η)x_{i}(\eta), gi(η)g_{i}(\eta), etc. always refer to a single realization of the random process they represent.

A parameter-free step-size selection criterion for SGD

In this section we present the key component of our development: a computable certificate for the efficiency of a candidate SGD step size. For ease of exposition, in this section we restrict some of our arguments to the exact gradient setting, but emphasize that they ultimately translate to high-probability bounds in the stochastic setting.

Consider the noiseless setting with step size η\eta, iterates x0,x1(η),,xT(η)x_{0},x_{1}(\eta),\ldots,x_{T}(\eta) and gradients g0g_{0}, g1(η)g_{1}(\eta), ,gT1(η)\ldots,g_{T-1}(\eta). It is well-known that if η\eta satisfies

Our key proposal is to approximate the distance to the optimum d0d_{0} with a computable proxy: the maximum distance traveled by the algorithm, rˉT(η)maxiTx0xi(η)\bar{r}_{T}(\eta)\coloneqq\max_{i\leq T}\|x_{0}-x_{i}(\eta)\|. We consider step sizes that (approximately) satisfy

for nonnegative damping parameters α\alpha and β\beta; in the exact gradient setting we can set α\alpha to any number >1>1 and β=0\beta=0, while the stochastic setting requires scaling α\alpha and β\beta roughly as poly(loglogT)\mathsf{poly}(\log\log T). Intuitively, rˉT\bar{r}_{T} approximates d0d_{0} since the SGD iterates should converge to xx_{\star} and therefore x0xT(η)\|x_{0}-x_{T}(\eta)\| should be similar to x0x\|x_{0}-x_{\star}\|. However, in non-smooth optimization, convergence to xx_{\star} can be arbitrarily slow. We nevertheless prove that, when ηϕ(η)\eta\leq\phi(\eta), we have rˉT(η)=O(d0)\bar{r}_{T}(\eta)=O(d_{0}) (Lemma 2 below). With this result and a refined SGD error bound (Lemma 1 below), we show that (with exact gradients) any η\eta satisfying criterion (4) recovers the optimal error bound.

In the noiseless setting, any step size η>0\eta>0 satisfying (4) with α>1\alpha>1 and β=0\beta=0 produces xˉ1Ti<Txi(η)\bar{x}\coloneqq\frac{1}{T}\sum_{i<T}x_{i}(\eta) such that xxˉ2αα1xx0\|x_{\star}-\bar{x}\|\leq\frac{2\alpha}{\alpha-1}\|x_{\star}-x_{0}\| and

Before proving Proposition 1, let us briefly discuss its algorithmic implications. Since the function ϕ()\phi(\cdot) is computable (at the cost of TT gradient queries) without a-priori assumptions on d0d_{0}, we have reduced parameter-free optimization to solving the one-dimensional implicit equation (4). However, the function ϕ\phi might be discontinuous and an exact solution to the implicit equation might not even exist. Nevertheless, in the next section we show that finding an interval [η,2η][\eta,2\eta] in which hϕ(h)hh\mapsto\phi(h)-h changes sign, produces nearly the same error certificates at an interval edge. Since such interval is readily found via bisection, this forms the basis of a working parameter-free step size tuner. We leave the details to Section 3 and for the remainder of this section prove Proposition 1.

The proof of Proposition 1 hinges on two lemmas. The first is a variant of the standard SGD error bound (recall that Δi(η)=gi(η)f(xi(η))\Delta_{i}(\eta)=g_{i}(\eta)-\nabla f(x_{i}(\eta)) is zero in the noiseless setting).

Since η\eta is fixed throughout this proof, we streamline notation by dropping it from xi,gi,Δix_{i},g_{i},\Delta_{i} and GiG_{i}. By convexity and the definition of Δi\Delta_{i},

From xi+1=ΠX(xiηgi)x_{i+1}=\Pi_{\mathcal{X}}(x_{i}-\eta g_{i}) we can derive the standard subgradient method inequality

for all η0\eta\geq 0 and i=0,,T1i=0,\dots,T-1. Rearranging and summing over i<Ti<T gives

The inequality ()(\star) is where our proof deviates from the textbook derivation [28, Theorem 2.13.]; it holds because d0dTrTd_{0}-d_{T}\leq r_{T} due to the triangle inequality, and either dTd0d_{T}\leq d_{0} holds or d02dT202d0rˉTd_{0}^{2}-d_{T}^{2}\leq 0\leq 2d_{0}\bar{r}_{T}. Substituting into (6) and applying f(xˉ)1Ti<Tf(xi)f(\bar{x})\leq\frac{1}{T}\sum_{i<T}f(x_{i}) by Jensens’s inequality gives (5). ∎

The second lemma shows that for η\eta satisfying ηϕ(η)\eta\leq\phi(\eta) is guaranteed to produce iterates that do not wander too far from xx_{\star}. This is our basic localization guarantee.

In the noiseless setting with α>1\alpha>1 and η>0\eta>0, if ηϕ(η)\eta\leq\phi(\eta) then we have dˉT(η)α+1α1d0\bar{d}_{T}(\eta)\leq\frac{\alpha+1}{\alpha-1}d_{0} and rˉT(η)2αα1d0\bar{r}_{T}(\eta)\leq\frac{2\alpha}{\alpha-1}d_{0}.

This proof once more drops the explicit dependence on η\eta. Summing the inequality (7) over i<ti<t and noting that that <gi,xix>f(xi)f(x)0\left<g_{i},x_{i}-x_{\star}\right>\geq f(x_{i})-f(x_{\star})\geq 0 due to convexity and the noiseless setting, we have dt2d02+η2Gtd_{t}^{2}\leq d_{0}^{2}+\eta^{2}G_{t} for every tt. Maximizing over tTt\leq T and substituting ηϕ(η)rˉT/αGT\eta\leq\phi(\eta)\leq\bar{r}_{T}/\sqrt{\alpha G_{T}} yields

where ()(\star) follows from the triangle inequality: rˉT=rtdt+d0dˉT+d0\bar{r}_{T}=r_{t}\leq d_{t}+d_{0}\leq\bar{d}_{T}+d_{0} for some tTt\leq T. Rearranging yields (dˉT1α1d0)2α2(α1)2d02,\left(\bar{d}_{T}-\frac{1}{\alpha-1}d_{0}\right)^{2}\leq\frac{\alpha^{2}}{(\alpha-1)^{2}}d_{0}^{2}, and therefore dˉTα+1α1d0\bar{d}_{T}\leq\frac{\alpha+1}{\alpha-1}d_{0} as required. The bound rˉT2αα1d0\bar{r}_{T}\leq\frac{2\alpha}{\alpha-1}d_{0} follows from substituting dˉTα+1α1d0\bar{d}_{T}\leq\frac{\alpha+1}{\alpha-1}d_{0} into rˉTdˉT+d0\bar{r}_{T}\leq\bar{d}_{T}+d_{0}. ∎

Proposition 1 follows from substituting η=rˉT(η)/αGT\eta=\bar{r}_{T}(\eta)/\sqrt{\alpha G_{T}} into bound (5) yielding f(xˉ)f(x)d0αGT(η)T+rˉT(η)GT(η)2αTf(\bar{x})-f(x_{\star})\leq\frac{d_{0}\sqrt{\alpha G_{T}(\eta)}}{T}+\frac{\bar{r}_{T}(\eta)\sqrt{G_{T}(\eta)}}{2\sqrt{\alpha}T}, and using rˉT(η)2αα1d0\bar{r}_{T}(\eta)\leq\frac{2\alpha}{\alpha-1}d_{0} from Lemma 2. \blacksquare

Algorithm description, and analysis for exact gradients

In this section we turn the step-size selection criterion presented in the previous section into a complete algorithm (Algorithm 1)—valid for stochastic as well as exact gradients—and analyze it in the simpler setting of exact gradients, deferring the stochastic case to the following section. Our algorithm consists of a core log-scale bisection subroutine (RootFindingBisection) coupled with an outer loop that acts as an aggressive doubling scheme on the upper limit of the bisection. We describe and analyze the two components Sections 3.1 and 3.2, respectively. Then, in Section 3.3, we put these results together and obtain parameter-free rates in the exact gradient setting.

The following lemma shows that the output of RootFindingBisection in fact satisfies a similar bound. See Appendix A.1 for the (easy) proof, and note the lemma also holds in the stochastic case.

We now combine Lemmas 1, 2 and 3 to show an error bound for GD with the η\eta selected by RootFindingBisection. The proof of Proposition 2 appears in Appendix A.2.

2 Doubling scheme for upper bisection limit

3 Error guarantees for exact gradients

With Algorithm 1 explained and Proposition 1 and Lemma 4 in place, we are ready to state the parameter-free convergence guarantee in the exact gradient setting. For simplicity of exposition, we fix α=3\alpha=3 and β=0\beta=0, but note that any α>1\alpha>1 yields a similar guarantee.

for some η[η,2η]\eta^{\prime}\in[\eta,2\eta], or η=ηε\eta=\eta_{\varepsilon} and

The proof of Theorem 1 appears in Section A.4. Let us briefly compare the bounds in Theorem 1 to our guarantees for a solution to η=ϕ(η)\eta=\phi(\eta) shown in Proposition 1. The “typical case” bound (11) is similar to the error bound of Proposition 1 with only two notable differences beyond a slightly larger constant factor. First, by eq. (10), the value of TT in Theorem 1 is smaller than the total complexity budget by a double-logarithmic factor; this is the cost of performing a bisection instead of assuming we start with a solution to the implicit equation. Second, the term GT(η)G_{T}(\eta^{\prime}) in the RHS of (11) is computed at η\eta^{\prime} that is possibly different than the η\eta for which we prove the error bound.

While bounding the error of SGD with step size η\eta using the gradients observed by SGD with a different step size η\eta^{\prime} is unconventional, our resulting bounds appear to be as useful as their more conventional counterparts. First, note that η\eta and η\eta^{\prime} are within a factor of 2 of each other, and we can bring this factor arbitrarily close to 1 by running more bisection steps. Second, despite the difference in η\eta, we can still use our lower bounds to obtain (up to double-logarithmic factors) a 1/T1/T rate of convergence for smooth problems with unknown smoothness (see Section 5.1); this is the hallmark of error bounds that scale with GT\sqrt{G_{T}}. Finally, the different η\eta^{\prime} issue disappears if we assume ff is LL-Lipschitz. In particular, from Theorem 1 with ηε=ε2L2\eta_{\varepsilon}=\frac{\varepsilon}{2L^{2}}, using GT(η)L2TG_{T}(\eta)\leq L^{2}T for all η\eta and g0(f(x0)f(x))/d0\|g_{0}\|\geq(f(x_{0})-f(x_{\star}))/d_{0} (due to convexity of ff) we get that

Analysis for stochastic gradients

In this section, we extend the analysis of Algorithm 1 to the stochastic setting, using the following simple strategy: we define a “good event” under which the noiseless analysis goes through essentially unchanged (Section 4.1), and show that this event occurs with high probability (Section 4.2), obtaining a stochastic, high-probability, analog of our exact gradient result (Section 4.3).

With this definition in hand, slightly modified versions of our key lemmas from the deterministic analysis (Lemma 2, Proposition 2, Lemma 4) continue to hold. See Sections B.1, B.2 and B.3 for proofs of these results, which follow very similarly to their exact-gradient counterparts.

2 The good event is likely

We now arrive at the challenging part of the stochastic analysis: showing that the good event we defined occurs with high probability. For this, we require the following standard assumption.

The stochastic gradient oracle satisfies O(η)L\|\mathcal{O}(\eta)\|\leq L with probability 1.

In online parameter-free optimization such assumption is unavoidable if one seeks regret scaling linearly in the comparator norm . However, similarly to the best prior results, our bounds depend on LL only via a low-order term.

The following result shows that, for appropriate choices of α\alpha and β\beta and any fixed η0\eta\geq 0 the event ET,α,β(η)\mathfrak{E}_{T,\alpha,\beta}(\eta) has high probability.

Proposition 4 makes no a-priori assumption on the size of xi(η)xx_{i}(\eta)-x_{\star}, instead controlling it empirically via dˉt(η)\bar{d}_{t}(\eta); this is unusual in the literature and crucial for our purposes. Our proof (given in Section B.5) relies on a time-uniform empirical-Bernstein-type martingale concentration bound . However, since this result requires martingale differences that are bounded with probability 1, we cannot apply it on <Δi(η),xi(η)x>\left<\Delta_{i}(\eta),x_{i}(\eta)-x_{\star}\right> (which is not bounded), nor can we apply it on <Δi(η),xi(η)x>/dˉt(η)\left<\Delta_{i}(\eta),x_{i}(\eta)-x_{\star}\right>/\bar{d}_{t}(\eta) (which is bounded but is not adapted to any filtration). Instead, we consider processes of the form <Δi(η),Π1([xi(η)x]/s)>\left<\Delta_{i}(\eta),\Pi_{1}([x_{i}(\eta)-x_{\star}]/s)\right>, where Π1()\Pi_{1}(\cdot) is the projection to the unit ball and ss is a fixed scalar. By carefully union bounding over a set of O(logT)O(\log T) values of ss, we are able to control the probability of ET,α,β(η)\mathfrak{E}_{T,\alpha,\beta}(\eta).

Having shown that the good event occurs with high probably for any fixed η\eta, our next step is to show that, for proper choices of α(k)\alpha^{(k)} and β(k)\beta^{(k)}, good events hold with high probability for each and every single value of η\eta Algorithm 1 might try. Noting that for, each value of kk, Algorithm 1 only tests step size values of the form 2jηε2^{j}\eta_{\varepsilon} for j{0,,2k}j\in\{0,\ldots,2^{k}\}, the following lemma (which is a direct application of union bounds) provides the required guarantee; see proof in Section B.6.

Then, under Assumption 1, we have Pr(k=2,4,8,j=0,1,,2kEB,α(k),β(k)(2jηε))1δ\Pr*(\bigcap_{k=2,4,8,\ldots}\bigcap_{j=0,1,\ldots,2^{k}}\mathfrak{E}_{B,\alpha^{(k)},\beta^{(k)}}(2^{j}\eta_{\varepsilon}))\geq 1-\delta.

3 Parameter-free rates for stochastic convex optimization

We are ready to state our main result; see proof in Section B.7.

and for C=log1δ+loglog+Bxx0ηεLC=\log\frac{1}{\delta}+\log\log_{+}\frac{B\|x_{\star}-x_{0}\|}{\eta_{\varepsilon}L}, either

for some η[η,2η]\eta^{\prime}\in[\eta,2\eta], or

Let us compare our bounds to the best known prior bounds, which follow from from online to batch conversion of parameter-free regret bounds. McMahan and Orabona achieve an optimal parameter-free regret bound for algorithms that are not adaptive to gradient norms: For any user-specified ε\varepsilon and gradient budget BB, their result guarantees an expected optimality gap of O\big{(}\frac{\varepsilon+{d_{0}L}\sqrt{\Lambda}}{\sqrt{B}}\big{)} where \Lambda=\log\big{(}1+\frac{d_{0}L}{\varepsilon}\big{)} is logarithmic in 1ε\frac{1}{\varepsilon}. In comparison, by taking ηε=O(εL2B)\eta_{\varepsilon}=O(\frac{\varepsilon}{L^{2}B}) we guarantee a probability 1δ1-\delta optimality gap of O\big{(}\frac{(\varepsilon+{d_{0}L})\lambda^{2}}{\sqrt{B}}\big{)}, where λ=log(1δlog+Bd0Lε)\lambda=\log\left(\frac{1}{\delta}\log_{+}\frac{Bd_{0}L}{\varepsilon}\right) is only double-logarithmic in 1ε\frac{1}{\varepsilon}; see Section B.8 for a slightly tighter bound in this setting.

For small values of ε\varepsilon, our bounds show a clear asymptotic improvement over prior art. However, we note that for a hypothetical optimally-tuned ε\varepsilon (which depends on the unknown problem parameter d0d_{0}), the logarithmic factor Λ\Lambda of prior work becomes O(1)O(1), while our double-logarithmic factor λ\lambda remains O(log(1δlogB))O(\log(\frac{1}{\delta}\log B)). This occurs because Lemma 6 only provides a somewhat loose bound on ηmax\eta_{\max}, and because of the union bound in the proof of Proposition 4. We can mitigate this issue at a cost of adaptivity to gradient norm; see Section D.2 for further discussion.

Our results give the the first high-probability parameter-free rates. However, while high-probability bounds are generally considered stronger than expectation bounds, it is not clear how to deduce an expectation bound from our results without increasing the error by a poly(logB)\mathsf{poly}(\log B) factor, due to the need to set δ=poly(1/B)\delta=\mathsf{poly}(1/B). Finally, we note that Theorem 2 also guarantees that the output of the algorithm is at most a multiplicative factor further from xx_{\star} than x0x_{0} was; we believe that this type of guarantee is new in the parameter-free setting.Orabona and Pál [31, Lemma 25] bound the distance moved by Follow the Regularized Leader iterates, but not by a multiple of xx0\|x_{\star}-x_{0}\|.

Adaptivity to problem structure

In this section we showcase our algorithm’s adaptivity by proving stronger rates of convergence under smoothness and strong-convexity assumptions, without introducing any new parameters.

Let us assume that ff is SS-smooth (i.e., has SS-Lipschitz gradient), and consider for simplicity the exact gradient setting; we believe that similar results extend to the stochastic setting as well. Under these assumptions, we show that Algorithm 1, without any changes, achieves (up to double-logarithmic factors) the Sd02/TSd_{0}^{2}/T suboptimality bound of optimally-tuned GD, as long as ηε<12S\eta_{\varepsilon}<\frac{1}{2S}. See Section C.1 for proof.

We remark that UniXGrad features optimal (accelerated) rates for smooth problems without dependence on the parameter SS. However, unlike our method, it requires knowledge of d0d_{0}.

2 Adaptivity strong convexity using restarts

We now consider a standard strongly-convex stochastic setup [e.g., 17, 36] in which we assume ff to be μ\mu-strongly-convex in X\mathcal{X} and admit a stochastic gradient oracle bounded by LL. (Note that this implies a bound of L/μL/\mu on the diameter of X\mathcal{X}). Hazan and Kale propose to run SGD for epochs of doubling length and halving step sizes. For a total gradient budget of BB, they obtain the optimal bound O(L2/(μB))O(L^{2}/(\mu B)) on the expected optimality gap. However, their scheme requires the initial step size to be proportional to 1/μ1/\mu, and hence requires knowledge of μ\mu.

We show that restarting Algorithm 1 with doubling gradient budgets (and no step size to tune) recovers (up to double-logarithmic factors) the optimal 1/B1/B rate of convergence. To describe the procedure formally, let \textscParameterFreeTuner(x0,B,δ,ηε)\textsc{ParameterFreeTuner}(x_{0},B,\delta,\eta_{\varepsilon}) denote the output of Algorithm 1 with initial point x0x_{0}, gradient budget BB, failure probability δ\delta, minimal step size ηε\eta_{\varepsilon} and α(k),β(k)\alpha^{(k)},\beta^{(k)} as in eq. 13. For user-specified ε>0\varepsilon>0 and δ(0,1)\delta\in(0,1) and x(0)=x0x^{(0)}=x_{0}, our doubling procedure is

See proof in Section C.2. Compared to results obtained via parameter-free strongly-convex regret bounds [12, Thm. 7], we remove a squared logarithmic factor, breaking two regret minimization barriers at once.

Acknowledgment

We thank Shira Baneth for pointing out typos in an earlier version of this paper. YC was partially supported by the Len Blavatnik and the Blavatnik Family Foundation and an Alon Fellowship. OH was partially supported by the Pitt Momentum Funds.

References

Appendix A Proofs for Section 3

A.2 Proof of Proposition 2

A.3 Proof of Lemma 4

Assume by contradiction that η>ηmax\eta>\eta_{\max} but ηϕ(η)\eta\leq\phi(\eta). On the one hand,

which implies rˉT(η)>2αα1d0\bar{r}_{T}(\eta)>\frac{2\alpha}{\alpha-1}d_{0}. On the other hand, by Lemma 2, we have rˉT(η)2αα1d0\bar{r}_{T}(\eta)\leq\frac{2\alpha}{\alpha-1}d_{0} which yields a contradiction.

A.4 Proof of Theorem 1

and the complexity of all preceding bisection calls is at most

Next, we establish (10). Note that Algorithm 1 indeed always returns a point of the form xˉ=1Ti<Txi(η)\bar{x}=\frac{1}{T}\sum_{i<T}x_{i}(\eta) with some T1T\geq 1; the edge case of returning in algorithm 1 corresponds to T=1T=1. Moreover, in the typical case of returning in algorithm 1, we have T=B2kB4kT=\left\lfloor\frac{B}{2k}\right\rfloor\geq\frac{B}{4k} for some kB/4k\leq B/4. By Lemma 4, we have

giving the claimed lower bound (10) on TT.

It remains to show that one of the conclusions (11) and (12) must hold. When Algorithm 1 returns in algorithm 1, this follows immediately from Proposition 2 (if ηεϕ(ηε)\eta_{\varepsilon}\leq\phi(\eta_{\varepsilon}) then conclusion (11) holds; if ηε>ϕ(ηε)\eta_{\varepsilon}>\phi(\eta_{\varepsilon}) then either ηε3GT(ηε)3d0\eta_{\varepsilon}\sqrt{3G_{T}(\eta_{\varepsilon})}\leq 3d_{0} and conclusion (11) holds, or d03GT(ηε)ηεGT(ηε)d_{0}\sqrt{3G_{T}(\eta_{\varepsilon})}\leq\eta_{\varepsilon}G_{T}(\eta_{\varepsilon}) and conclusion (12) holds). In the edge case of returning xˉ=x0\bar{x}=x_{0} in algorithm 1, corresponding to T=1T=1, conclusion (11) clearly holds, as x0x4x0x\|x_{0}-x_{\star}\|\leq 4\|x_{0}-x_{\star}\| trivially and f(x0)f(x)<g0,x0x>x0xg0f(x_{0})-f(x_{\star})\leq\left<g_{0},x_{0}-x_{\star}\right>\leq\|x_{0}-x_{\star}\|\|g_{0}\| due to convexity of ff. We remark that due to inequality (10), the T=1T=1 edge case is only possible for a very small iteration budget B=O(loglog+x0xηεg0)B=O(\log\log_{+}\frac{\|x_{0}-x_{\star}\|}{\eta_{\varepsilon}\|g_{0}\|}). ∎

Appendix B Proofs for Section 4

The proof proceeds similarly to the proof of Lemma 2, except that instead of assuming exact subgradients we make use of the definition of ET,α,β\mathfrak{E}_{T,\alpha,\beta}. As usual in proofs where η\eta is fixed, we drop the explicit dependence on it from xt,gt,dt,dˉt,rt,rˉtx_{t},g_{t},d_{t},\bar{d}_{t},r_{t},\bar{r}_{t} and GtG_{t}.

Noting that <f(xi),xix>f(xi)f(x)0\left<\nabla f(x_{i}),x_{i}-x_{\star}\right>\geq f(x_{i})-f(x_{\star})\geq 0 due to convexity and that i<t<Δi,xix>14max{dˉt,ηβ}αGt+β\sum_{i<t}\left<\Delta_{i},x_{i}-x_{\star}\right>\geq-\frac{1}{4}\max\{\bar{d}_{t},\eta\sqrt{\beta}\}\sqrt{\alpha G_{t}+\beta} for all tTt\leq T due to ET,α,β\mathfrak{E}_{T,\alpha,\beta} holding, we have

Maximizing both sides of the inequality over tTt\leq T and recalling that ηϕ(η)=rˉT/αGT+β\eta\leq\phi(\eta)=\bar{r}_{T}/\sqrt{\alpha G_{T}+\beta}, we get

Substituting rˉTdˉT+d0\bar{r}_{T}\leq\bar{d}_{T}+d_{0} (which holds due to the triangle inequality), we get

where the final equality is due to α>1\alpha>1. Thus, we arrive again at inequality (8) from the proof of Lemma 2, but with α\alpha replaced by α=2α/(α+2)\alpha^{\prime}=2\alpha/(\alpha+2). We consequently find that

B.2 Proof of Proposition 3

The remainder of the proof is very similar to the proof of Proposition 2. Combining (5), (9) and (21) yields:

B.3 Proof of Lemma 6

The proof is essentially identical to the proof of Lemma 4, with Lemma 5 used instead of Lemma 2; we give it here for completeness.

To show the first part of the lemma, assume that ET,α,β(η)\mathfrak{E}_{T,\alpha,\beta}(\eta) holds and assume by contradiction that η>ηmax(α,β)\eta>\eta_{\max}(\alpha,\beta) but ηϕ(η)\eta\leq\phi(\eta). On the one hand,

which implies rˉT(η)>4αα2d0\bar{r}_{T}(\eta)>\frac{4\alpha}{\alpha-2}d_{0}. On the other hand, by Lemma 5, we have rˉT(η)4αα2d0\bar{r}_{T}(\eta)\leq\frac{4\alpha}{\alpha-2}d_{0} which yields a contradiction.

B.4 A martingale concentration bound

The following corollary is a translation of [18, Theorem 4] which simplifies notation at the cost of looser constants. We remark that it holds even when log\log denotes the natural logarithm (as is the convention in ).

Let XtX_{t} be adapted to Ft\mathcal{F}_{t} such that Xt1\left|X_{t}\right|\leq 1 with probability 1 for all tt. Then, for every δ(0,1)\delta\in\left(0,1\right) and any X^tFt1\hat{X}_{t}\in\mathcal{F}_{t-1} such that X^t1\lvert\hat{X}_{t}\rvert\leq 1 with probability 1,

where At(δ)=log(60log(6t)δ)A_{t}(\delta)=\log\left(\frac{60\log(6t)}{\delta}\right).

Throughout we proof we use the binary maximization notation abmax{a,b}a\vee b\coloneqq\max\{a,b\}.

We apply [18, Theorem 4] with a=b=1a=-b=1 and the polynomial stitched boundary [18, Eq. (10)] with parameters m,η,s1m,\eta,s\geq 1 to be specified below. This yields

with ζ\zeta denoting the Riemann zeta function,

Let us first simplify Sδ/2(mv)S_{\delta/2}\left(m\vee v\right) and then choose the parameters m,η,sm,\eta,s to yield decent constants. Writing Z=sloglog(η(mv)m)+log2ζ(s)δlogsηZ=s\log\log\left(\frac{\eta(m\vee v)}{m}\right)+\log\frac{2\zeta\left(s\right)}{\delta\log^{s}\eta}, we have

Note that loglog(η(mv)m)loglogη\log\log\left(\frac{\eta(m\vee v)}{m}\right)\geq\log\log\eta and therefore Zlog2ζ(s)δlog(2ζ(s))Z\geq\log\frac{2\zeta\left(s\right)}{\delta}\geq\log(2\zeta(s)). Therefore, if mlog(2ζ(s))m\leq\log(2\zeta(s)), we have the slightly simplified bound

Moreover, for mηm\geq\eta we may upper bound ZZ by

Taking η=m=1.8\eta=m=1.8 and s=1.05s=1.05, one easily confirms that m3log(2ζ(s))m\leq 3\leq\log(2\zeta(s)) and, substituting back k1k_{1} and k2k_{2} as defined above, we have

Finally, noting that \big{(}X_{s}-\hat{X}_{s}\big{)}^{2}\leq 4 for all ss, we may substitute m+v6tm+v\leq 6t in the bound above, concluding the proof. ∎

B.5 Proof of Proposition 4

Since η\eta is fixed throughout this proof, we drop the explicit dependence on it to simplify notation. Furthermore, we define the normalized/shorthand quantities:

With these definitions, the failure probability we wish to bound is

Note that ktk_{t} satisfies the following

The first set of inequalities follows from the definition of dˉt\bar{d}_{t}^{\prime}, ktk_{t} and sks_{k}, while the latter inequality is due to the fact that dˉtd0+tηLd0+t4ηβ\bar{d}_{t}\leq d_{0}+t\eta L\leq d_{0}+\frac{t}{4}\eta\sqrt{\beta}, which follows from the definition of SGD, the triangle inequality, the assumption giL\|g_{i}\|\leq L w.p. 1, and β16L2\beta\geq 16L^{2}.

Writing Π1(x)=x/max{1,x}\Pi_{1}(x)=x/\max\{1,\|x\|\} for the projection to the Euclidean unit ball, we now bound the failure probability as follows

where (i)(i) follows from xix=didˉtskt\|x_{i}-x_{\star}\|=d_{i}\leq\bar{d}_{t}\leq s_{k_{t}} (which means that the projection does nothing), (ii)(ii) follows from skt2dˉts_{k_{t}}\leq 2\bar{d}_{t}^{\prime}, and (iii)(iii) follows from 0ktlog(6t)log(6T)0\leq k_{t}\leq\log(6t)\leq\log(6T) and a union bound. We can now define a nicely behaved stochastic process: for every ii and kk let

and note that Xi(k)X_{i}^{(k)} is adapted to the filtration Ft=σ(g0,g1,,gt)\mathcal{F}_{t}=\sigma(g_{0},g_{1},\ldots,g_{t}) (i.e., Xi(k)FtX_{i}^{(k)}\in\mathcal{F}_{t}) and satisfies Xi(k)giL1\left\lvert X_{i}^{(k)}\right\rvert\leq\frac{\|g_{i}\|}{L}\leq 1 by Cauchy-Schwarz and giL\|g_{i}\|\leq L. Applying Corollary 1 with X^=0\hat{X}=0 as the predictable sequence, we obtain, for any kk and δ(0,1)\delta^{\prime}\in(0,1),

where At(δ)=log(60log(6t)δ)A_{t}(\delta^{\prime})=\log\left(\frac{60\log(6t)}{\delta^{\prime}}\right). Note that

Furthermore, note that, for δ=δlog(6T)\delta^{\prime}=\frac{\delta}{\log(6T)} we have that At(δ)AT(δ)=C=α/322α/16A_{t}(\delta^{\prime})\leq A_{T}(\delta^{\prime})=C=\alpha/32^{2}\leq\alpha^{\prime}/16 and that At2(δ)At2(δ)=C2β/16A_{t}^{2}(\delta^{\prime})\leq A_{t}^{2}(\delta^{\prime})=C^{2}\leq\beta^{\prime}/16. Substituting to inequality (23) we conclude that

for all kk, and therefore Pr(ET,α,βc)δ\Pr*(\mathfrak{E}_{T,\alpha,\beta}^{c})\leq\delta by the bound (22). ∎

B.6 Proof of Lemma 7

Fixing some k{2,4,8,}k\in\{2,4,8,\ldots\} and noting that Ck=log(60log2(6B)22kδ)C_{k}=\log\left(\frac{60\log^{2}(6B)}{2^{-2k}\delta}\right), we may apply Proposition 4 with T=BT=B, α=α(k)\alpha=\alpha^{(k)}, β=β(k)\beta=\beta^{(k)} and failure probability 22kδ2^{-2k}\delta, giving 1Pr(EB,α(k),β(k)(η))22kδ1-\Pr*(\mathfrak{E}_{B,\alpha^{(k)},\beta^{(k)}}(\eta))\leq 2^{-2k}\delta for any η\eta. Therefore, by the union bound

Applying the union bound once more, we have

B.7 Proof of Theorem 2

The bound BB on the algorithm’s query number is deterministic and therefore follows exactly as in the proof of Theorem 1. For the remainder of the analysis we assume the event k=2,4,8,j=0,1,,2kEB,α(k),β(k)(2jηε)\bigcap_{k=2,4,8,\ldots}\bigcap_{j=0,1,\ldots,2^{k}}\mathfrak{E}_{B,\alpha^{(k)},\beta^{(k)}}(2^{j}\eta_{\varepsilon}) holds, which by Lemma 7 happens with probability at least 1δ1-\delta; we will show that, conditional on this event holding, the conclusions of the theorem hold deterministically. (Note that EB,α(k),β(k)\mathfrak{E}_{B,\alpha^{(k)},\beta^{(k)}} implies ETk,α(k),β(k)\mathfrak{E}_{T_{k},\alpha^{(k)},\beta^{(k)}} for all TkBT_{k}\leq B).

(where we have used αα(0)322\alpha\geq\alpha^{(0)}\geq 32^{2}), and, for some η[η,2η]\eta^{\prime}\in[\eta,2\eta],

In the edge case that ηε>ϕ(ηε)\eta_{\varepsilon}>\phi(\eta_{\varepsilon}) in the final bisection, we separately consider the cases ηεαGT(ηε)+β5d0\eta_{\varepsilon}\sqrt{\alpha G_{T}(\eta_{\varepsilon})+\beta}\leq 5d_{0} and ηεαGT(ηε)+β>5d0\eta_{\varepsilon}\sqrt{\alpha G_{T}(\eta_{\varepsilon})+\beta}>5d_{0}. In the former case, Proposition 3 gives

so conclusion (15) holds as before. In the second case, where 5d0<ηεαGT(ηε)+β5d_{0}<\eta_{\varepsilon}\sqrt{\alpha G_{T}(\eta_{\varepsilon})+\beta}, we have

Recalling that α=O(C)\alpha=O(C) and β=O(C2L2)\beta=O(C^{2}L^{2}), we see that conclusion (16) holds.

Finally, if Algorithm 1 returns in algorithm 1 instead of algorithm 1, we have xˉ=x0\bar{x}=x_{0} and T=1T=1, and so conclusion (15) holds trivially, since x0x6x0x\|x_{0}-x_{\star}\|\leq 6\|x_{0}-x_{\star}\| and f(x0)f(x)<f(x0),x0x>x0xL=O(d0C2L2)f(x_{0})-f(x_{\star})\leq\left<\nabla f(x_{0}),x_{0}-x_{\star}\right>\leq\|x_{0}-x_{\star}\|L=O(d_{0}\sqrt{C^{2}L^{2}}) due to convexity of ff and Assumption 1. ∎

B.8 A corollary for uniform gradient bounds

The following corollary translates Theorem 2 to the setting where we replace all observed gradient norms by LL. In it λ\lambda represents a double-logarithmic factor and we use ι<1\iota<1 to indicate low order terms which can be ignored as soon as B=Ω(λ2)B=\Omega(\lambda^{2}).

The corollary follows by substitution of ηε=εL2\eta_{\varepsilon}=\frac{\varepsilon}{L^{2}} into Theorem 2. In particular, the bound (14) becomes

and the upper bound on xˉx\|\bar{x}-x_{\star}\| in (16) is

which, when substituting TBT\leq B and ι2=λ(λ+B)\iota^{2}=\lambda(\lambda+B) and combining with the bound on xˉx\|\bar{x}-x_{\star}\| in (15) yields the claimed distance bound.

Finally, recalling that T1T\geq 1 and T=Ω(B/m)T=\Omega(B/m), the suboptimality bound in (15) reads

and the suboptimality bound in (16) reads

combined, these yield the claimed suboptimality bound. ∎

Appendix C Proofs for Section 5

Recall the following basic property of any SS-smooth functions [6, Lemma 3.4],

Using this fact we establish two useful inequalities for our proof. First, for any η1S\eta\leq\frac{1}{S} substituting u=xi+1(η)u=x_{i+1}(\eta) and v=xi(η)v=x_{i}(\eta) into (25) gives η2gi(η)2f(xi(η))f(xi+1(η))\frac{\eta}{2}\|g_{i}(\eta)\|^{2}\leq f(x_{i}(\eta))-f(x_{i+1}(\eta)). Summing over i<Ti<T we obtain

where ()(\star) follows from (25) with u=x0u=x_{0} and v=xv=x_{\star}. Second, for any η0\eta\geq 0, substituting xi(η)1Sgi(η)x_{i}(\eta)-\frac{1}{S}g_{i}(\eta) and v=xt(η)v=x_{t}(\eta) into (25) yields gi(η)22S[f(xi(η))f(xi(η)1Sgi(η))]2S[f(xi(η))f(x)]\|g_{i}(\eta)\|^{2}\leq 2S\left[f(x_{i}(\eta))-f\left(x_{i}(\eta)-\tfrac{1}{S}g_{i}(\eta)\right)\right]\leq 2S\left[f(x_{i}(\eta))-f\left(x_{\star}\right)\right]. Summing for i<Ti<T yields

and rearranging gives i=1Tf(xi(η))f(x)=O(Sd0)\sqrt{\sum_{i=1}^{T}f(x_{i}(\eta))-f(x_{\star})}=O(\sqrt{S}d_{0}) again.

C.2 Proof of Theorem 4

First, note that by Theorem 2, computing x(M)x^{(M)} requires m=1MB(m)=2+4++2M=2M+12B\sum_{m=1}^{M}B^{(m)}=2+4+\cdots+2^{M}=2^{M+1}-2\leq B gradients queries as claimed. Next, note that

and therefore by the union bound, with probability at least 1δ1-\delta the conclusions of Theorem 2 hold for all applications of ParameterFreeTuner; we proceed with our analysis conditional on that event.

Note that δ(m)δ(M)=Ω(δ/log2B)\delta^{(m)}\geq\delta^{(M)}=\Omega(\delta/\log^{2}B) and ηε(m)ηε(M)=Ω(ε/(L2B))\eta_{\varepsilon}^{(m)}\geq\eta_{\varepsilon}^{(M)}=\Omega(\varepsilon/(L^{2}B)). Consequently, we have

With this, we apply Theorem 2 on to bound the suboptimality of x(m)x^{(m)} for mMm\leq M. Let T(m)T^{(m)} be the corresponding TT from Theorem 2, and note that T(m)max{1,B(m)/λ}T^{(m)}\geq\max\{1,B^{(m)}/\lambda\}. Noting also that GT(m)(η)/T(m)L2G_{T^{(m)}}(\eta^{\prime})/T^{(m)}\leq L^{2} for all η\eta^{\prime}, we have

for some ε~=O(ελ2)\widetilde{\varepsilon}=O(\varepsilon\lambda^{2}) and L~=O(Lλ3/2)\widetilde{L}=O(L\lambda^{3/2}), and all mMm\leq M. Applying strong convexity, we have that μ2x(m1)x2f(x(m1))f(x)\frac{\mu}{2}\|x^{(m-1)}-x_{\star}\|^{2}\leq f(x^{(m-1)})-f(x_{\star}) which implies

where ()(\star) follows from abmax{2a,b/2}\sqrt{ab}\leq\max\{2a,b/2\}. Iterating this bound and noting that 2B(m1)=B(m)=2m2B^{(m-1)}=B^{(m)}=2^{m} we conclude that

Finally, the strong convexity and Lipschitz continuity assumptions imply that

and therefore x(0)x2Lμ\|x^{(0)}-x_{\star}\|\leq\frac{2L}{\mu} and, using Lipschitz continuity again, f(x(0))f(x)2L2μ4L~2μf(x^{(0)})-f(x_{\star})\leq\frac{2L^{2}}{\mu}\leq\frac{4\widetilde{L}^{2}}{\mu}. Substituting back into (29), the second and third terms merge. Recalling that B(M)=2M=B/2B^{(M)}=2^{M}=B/2, and that ε~=O(ελ2)\widetilde{\varepsilon}=O(\varepsilon\lambda^{2}) and L~=O(Lλ3/2)\widetilde{L}=O(L\lambda^{3/2}), we have

Appendix D Additional discussion

Optimality gap bounds obtained via online-to-batch conversion have the appealing property of holding for any comparator x˚X\mathring{x}\in\mathcal{X} and not necessarily a minimizer of ff . Consequently, the parameter-free regret minimization algorithm of McMahan and Orabona outputs a point xˉ\bar{x} with an error bound of the form

In contrast, we only provide guarantees for x=xx^{\prime}=x_{\star}, a minimizer of ff. This can be restrictive in settings where x0x\|x_{0}-x_{\star}\| is very large or possibly infinite, i.e., when the minimum of ff is not attained, as is the case in logistic regression on separable data.

However, the assumption that xx_{\star} is optimal can be relaxed. In particular, our only real requirement from xx_{\star} is that, for every SGD iterate xt(η)x_{t}(\eta) evaluated in Algorithm 1, we have f(xt(η))f(x)0f(x_{t}(\eta))-f(x_{\star})\geq 0. In the noiseless setting, we may modify Algorithm 1 to return the GD iterate with lowest objective value (from all the GD executions combined). Such modified algorithm would satisfy the error bounds in Theorem 1 with respect to an arbitrary point xx_{\star}: if the algorithm’s output has function value smaller than f(x)f(x_{\star}), we are done; otherwise, we have f(xt(η))f(x)0f(x_{t}(\eta))-f(x_{\star})\geq 0 for every GD iterate, and our analysis goes through. (Note, however, that we lose the guarantee on the distance between xx_{\star} and the algorithm’s output). Extension to the stochastic case is more involved since we do not have the privilege of choosing the best SGD iterate; we leave it to future work.

D.2 An alternative bisection target without gradient norm adaptivity

Algorithm 1 is fairly adaptive to stochastic gradient norms, with performance guarantees that depend mainly on observed norms, featuring an a-priori gradient norm bound in low-order terms. Moreover, in the noiseless setting our method requires no a-priori bound on gradient norms and our bounds depend solely on observed norms.

It is possible, however, to slightly simplify our method and sharpen some of our bounds by forgoing adaptivity to gradient norms. Specifically, if we only seek guarantees that depend on an a-priori gradient norm bound LL, then it is possible to replace the bisection target ϕ\phi defined in algorithm 1 of Algorithm 1 with

Our analysis applies to this modified bisection target as well, but with GT(η)G_{T}(\eta) replaced by L2TL^{2}T throughout. Moreover, this modification allows us to slightly improve two parts of the analysis.

First, we may sharpen the bound on ηmax\eta_{\max} in Lemmas 4 and 6 to ηmax=O(d0LT)\eta_{\max}=O\left(\frac{d_{0}}{L\sqrt{T}}\right), improving our bound on the maximum value of kk used in Algorithm 1. In the deterministic case, this allows us to establish optimality gap bounds scaling as ε+λd0LB\varepsilon+\lambda^{\prime}\frac{d_{0}L}{\sqrt{B}} for \lambda^{\prime}=O\Big{(}\sqrt{\log\log_{+}\frac{d_{0}L}{\varepsilon\sqrt{B}}}\Big{)} which satisfies λ=O(1)\lambda^{\prime}=O(1) for ε=d0LB\varepsilon=\frac{d_{0}L}{\sqrt{B}}, similarly to the bounds of previous works, as discussed at the end of Section 4.

Second, in the stochastic setting, we may use Blackwell’s inequality instead of the time uniform empirical Bernstein. This allows us to take α(k)=2k+O(log(1δlogB))\alpha^{(k)}=2k+O\left(\log\left(\frac{1}{\delta}\log B\right)\right), eliminating the additive square logarithmic term stemming from β(k)\beta^{(k)} in (13). Consequently, in the stochastic setting we obtain a probability 1δ1-\delta optimality gap bound of ε+λd0LB\varepsilon+\lambda^{\prime\prime}\frac{d_{0}L}{\sqrt{B}} for \lambda^{\prime\prime}=O\Big{(}{\lambda^{\prime}\big{[}\lambda^{\prime}+\sqrt{\log\left(\frac{1}{\delta}\log B\right)}\big{]}}\Big{)}, with λ\lambda^{\prime} defined above. Therefore, in the stochastic case we do not remove loglogB\log\log B term entirely, even for the optimal ε\varepsilon. The source of the remaining loglogB\log\log B is the union bound we use in the proof of Proposition 4, which might be removable via a more careful probabilistic argument.