Adaptive Proximal Gradient Method for Convex Optimization

Yura Malitsky, Konstantin Mishchenko

Intro

In this paper, we address a convex minimization problem

We are interested in the cases when either FF is differentiable and then we will use notation F=fF=f, or it has a composite additive structure as F=f+gF=f+g. Here, ff represents a convex and differentiable function, while gg is convex, lower semi-continuous (lsc), and prox-friendly. Throughout the paper, we will interchangeably refer to the smoothness of ff and the Lipschitzness of f\nabla f, occasionally with the adjective "locally," indicating that it is restricted to a bounded set. We will refer to this property as smoothness, without mentioning the Lipschitzness of ff, so we hope there will be no confusion in this regard.

For simplicity, in most of the introduction, we consider only the simpler problem minxf(x)\min_{x}f(x). We study one of the most classical optimization algorithm — gradient descent —

Its simplicity and the sole prerequisite of knowing the gradient of ff make it appealing for diverse applications. This method is central in modern continuous optimization, forming the bedrock for numerous extensions.

Given the initial point x0x^{0}, the only thing we need to implement (1) is to choose a stepsize αk\alpha_{k} (also known as a learning rate in machine learning literature). This seemingly tiny detail is crucial for the method convergence and performance. When a user invokes GD as a solver, the standard approach would be to pick an arbitrary value for αk\alpha_{k}, run the algorithm, and observe its behavior. If it diverges at some point, the user would try a smaller stepsize and repeat the same procedure. If, on the other hand, the method takes too much time to converge, the user might try to increase the stepsize. In practice, this approach is not very efficient, as we have no theoretical guarantees for a randomly guessed stepsize, and the divergence may occur after a long time. Both underestimating and overestimating the stepsize can, thus, lead to a large overhead.

Below we briefly list possible approaches to choosing or estimating the stepsize and we provide a more detailed literature overview in Section 5.

When ff is LL-smooth, GD can utilize a fixed stepsize αk=α<2L\alpha_{k}=\alpha<\frac{2}{L} and values larger than 2L\frac{2}{L} will provably lead to divergence. Consequently, in such scenarios, the rate of convergence is given by f(xk)f=O(1αk)f(x^{k})-f_{*}=\mathcal{O}\left(\frac{1}{\alpha k}\right), clearly indicating a direct dependence on the stepsize. Nevertheless, several drawbacks emerge from this approach:

LL is not available in many practical scenarios;

if the curvature of ff changes a lot, GD with the global value of LL may be too conservative;

Linesearch.

Also known as backtracking in the literature. In the kk-th iteration we compute xk+1x^{k+1} with a certain stepsize αk\alpha_{k} and check a specific condition. If the condition holds, we accept xk+1x^{k+1} and proceed to the next iteration; otherwise we halve αk\alpha_{k} and recompute xk+1x^{k+1} using this reduced stepsize. This approach, while the most robust and theoretically sound, incurs substantially higher computational cost compared to regular GD due to the linesearch procedure.

Adagrad-type algorithms.

These are the methods of the typeWe provide only the simplest instance of such algorithms.

where v10v_{-1}\geq 0 is some constants, and dkd_{k} is an estimate of x0x\|x^{0}-x^{*}\| some some solution xx^{*}. While such methods indeed have certain nice properties, dkd_{k} is usually either constant or quickly converges to a constant value, so a quick glance at (2) will reveal that its stepsizes are decreasing. Therefore, despite the name, we cannot expect true adaptivity of this method to the local curvature of ff.

Heuristics.

Numerous heuristics exist for selecting αk\alpha_{k} based on local properties of ff and f\nabla f, with the Barzilai-Borwein method [BB88] being among the most widely popular. However, it is crucial to note that we are not particularly interested in such approaches, as they lack consistency and may even lead to divergence, even for simple convex problems.

We have already mentioned adaptivity a few times, without properly introducing it. Now let us try to properly understand its meaning in the context of gradient descent. Besides the initial point x0x^{0}, GD has only one degree of freedom — its stepsize. From the analysis we know that it has to be approximately an inverse of the local smoothness. We call a method adaptive, if it automatically adapts a stepsize to this local smoothness without additional expensive computation and the method does not deteriorate the rate of the original method in the worst case. In our case, original method is GD with a fixed stepsize.

By this definition, GD with linesearch is not adaptive, because it finds the right stepsize with some extra evaluations of ff or f\nabla f. GD with diminishing steps (as in subgradient or Adagrad methods) is also not adaptive, because decreasing steps cannot in general represent well the function’s curvature; also the rate of the subgradient method is definitely worse. It goes without saying, that for a good method its rate must experience improvement when we confine the class of smooth convex functions to the strongly convex ones.

In a previous work [MM20], which serves as the cornerstone for the current paper, the authors proposed an adaptive gradient method named “Adaptive Gradient Descent without Descent” (AdGD). In the current paper, we

deepen our understanding of AdGD and identify its limitations;

refine its theory to accommodate even larger steps;

extend the revised algorithm from unconstrained to the proximal case.

The analysis in the last two cases is not a trivial extension, and we were rather pleasantly surprised that this was possible at all. After all, the theory of GD is well-established and we thought it to be too well-explored for us to discover something new.

Continuous point of view.

It is instructive for some time to switch from the discrete setting to the continuous and to compare gradient descent (GD) with its parent — gradient flow (GF)

where tt is the time variable and x(t)x^{\prime}(t) denotes the derivative of x(t)x(t) with respect to tt. To guarantee existence and uniqueness of a trajectory x(t)x(t) of GF, it is sufficient to assume that f\nabla f is locally Lipschitz-continuous. Then one can prove convergence of x(t)x(t) to a minimizer of ff in just a few lines. For GD, on the other hand, the central assumption is global Lipschitzness of f\nabla f. Our analysis of gradient descent makes it level: local Lipschitzness suffices for both. Or to put it differently, we provide an adaptive discretization of GF that converges under the same assumptions as the original continuous problem (3).

Proximal case.

We emphasize that there is already an excellent extension by Latafat et al. [Lat+23] of the work [MM20] to the additive composite case. Our proposed result, however, is based on an improved unconstrained analysis and uses a different proof. We believe that both these facts will be of interest. We don’t have a good understanding why, but for us finding the proof for the proximal case was quite challenging. It does not follow the standard lines of arguments and uses a novel Lyapunov energy in the analysis.

Nonconvex problems.

We believe that our algorithm will be no less important in the nonconvex case, where gradients are rarely globally Lipschitz continuous and where the curvature may change more drastically. It is true that our analysis applies only to the convex case, but, as far as we know, limited theory has never yet prevented practitioners from using methods in a broader setting. And based on our (speculative) experience, we found it challenging to identify nonconvex functions where the method did not converge to a local solution.

Outline.

In Section 2, we begin by revisiting AdGD from [MM20], examining its limitations, and demonstrating a simple way to enhance it. This section maintains an informal tone, making it easily accessible for quick reading and classroom presentation. In Section 3, we further improve the method and provide all formal proofs. Section 4 extends the improved method to the proximal case. In Section 5 we put our finding in the perspective and compare it to some existing works. Lastly, in Section 6, we conduct experiments to evaluate the proposed method against different linesearch variants.

1 Preliminaries

A convex LL-smooth function ff is characterized by the following inequality

This is equivalently of saying that f\nabla f is a 1L\frac{1}{L}-cocoercive operator, that is

For a convex differentiable ff that is not LL-smooth one can only say that f\nabla f is monotone, that is

We use notation [t]+=max{t,0}[t]_{+}=\max\{t,0\} and for any a>0a>0 we suppose that a0=+\frac{a}{0}=+\infty. With a slight abuse of notation, we write [n][n] to denote the set {1,,n}\{1,\dots,n\}. A solution and the value of the optimization problem minf(x)\min f(x) are denoted by xx^{*} and ff_{*}, respectively.

Adaptive gradient descent: better analysis

Similarly to the standard GD, this method leads to O(1/k)\mathcal{O}(1/k) convergence rate. However, unlike the former, it doesn’t require any knowledge about Lipschitz constant of f\nabla f and doesn’t even require a global Lipschitz continuity of f\nabla f.

The update for αk\alpha_{k} has two ingredients. The first bound αk1+θk1αk1\alpha_{k}\leqslant\sqrt{1+\theta_{k-1}}\alpha_{k-1} sets how fast steps may increase from iteration to iteration. The second αkxkxk12f(xk)f(xk1)\alpha_{k}\leqslant\frac{\|x^{k}-x^{k-1}\|}{2\|\nabla f(x^{k})-\nabla f(x^{k-1})\|} corresponds to the estimate of local Lipschitzness of f\nabla f.

It is important to understand how essential these bounds are. Do we really need to control the growth rate of αk\alpha_{k} or is it an artifact of our analysis? For the second bound, it is not clear whether 22 in the denominator is necessary. For example, given LL-smooth ff, our scheme (7) does not encompass a standard GD with αk=1L\alpha_{k}=\frac{1}{L} for all kk.

Answering the first question is relatively easy. Consider the following function

1<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><mn>1</mn></mrow><annotationencoding="application/xtex">1</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.6444em;"></span><spanclass="mord">1</span></span></span></span></span>x1<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mn>1</mn></mrow><annotation encoding="application/x-tex">1</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">1</span></span></span></span></span>xf(x)f(x) where parameters a,b>0a,b>0 are chosen to ensure that f(±1)f(\pm 1) and f(±1)f^{\prime}(\pm 1) are well-defined, namely a=2a=2 and b=2log232b=2\log 2-\frac{3}{2}, see Lemma 16 in Appendix.

From an optimization point of view, ff is a nice function. In particular, it is convex (even locally strongly convex) and its gradient is 11-Lipschitz, see Lemma 16. This means that both GD and AdGD linearly converge on it. However, if we remove the first condition for αk\alpha_{k} in AdGD, this new modified algorithm will fail to converge. We can prove an even stronger statement. Specifically, let c1c\geqslant 1, α0=1\alpha_{0}=1 and consider the following method

In other words, the update in (9) is the same as in (7) except we removed the first constraint for αk\alpha_{k} in (7) and introduced a constant factor cc to make the second one more general.

For any c1c\geqslant 1 there exists x0x^{0} such that the method (9) applied to ff defined in (8) diverges.

The formal proof of this statement is in Appendix, but its main idea should be intuitively clear. First, observe that for xx with large absolute value, f(x)f(x) behaves mostly like a linear function. However, f(x)f^{\prime}(x) approaches 1-1 when xx\to-\infty and +1+1 when x+x\to+\infty. Therefore, if xkx^{k} and xk1x^{k-1} have the same sign, the local smoothness estimate will be too optimistic and xk+1x^{k+1} will “leapfrog” the optimum. In contrast, if the signs of xkx^{k} and xk1x^{k-1} are different, then xk+1x^{k+1} will fail to get sufficiently close to the optimum. It is interesting to remark that on this function both versions of the Barzilai-Borwein method will diverge as well.

Consequently, the answer to the first question is affirmative: we do need some extra condition for the stepsize αk\alpha_{k}.

Second bound.

The answer to the second question is the opposite: it is indeed an artifact of our previous analysis. In the next section, we propose an improvement over the previous version [MM20]. We give a concise presentation in an informal way. We keep a more formal style for section 3 where even better version (also slightly more complicated) will be presented.

1 Improving AdGD

The analysis of GD usually starts from the standard identity, followed by convexity inequality

In [MM20] the only “nontrivial” step in the proof was upper bounding αk2f(xk)2\alpha_{k}^{2}\|\nabla f(x^{k})\|^{2}, that is xk+1xk2\|x^{k+1}-x^{k}\|^{2}. Now we do it in a slightly different way. First, we need the following fact.

For GD iterates (xk)(x^{k}) with arbitrary stepsizes, it holds

This is just monotonicity of f\nabla f in disguise:

Now we are going to bound xk+1xk2\|x^{k+1}-x^{k}\|^{2}. For convenience, denote the approximate local Lipschitz constant as

Let αk\alpha_{k} satisfy αkf(xk)f(xk1)γxkxk1\alpha_{k}\|\nabla f(x^{k})-\nabla f(x^{k-1})\|\leqslant\gamma\|x^{k}-x^{k-1}\| for some γ>0\gamma>0, that is αkLkγ\alpha_{k}L_{k}\leqslant\gamma. Using u2=uv2v2+2u,v\|u\|^{2}=\|u-v\|^{2}-\|v\|^{2}+2\langle u,v\rangle, we have

where the last inequality follows from convexity of ff. For γ<1\gamma<1 we can rewrite (2.1) as

Substituting this inequality into (2.1) gives us

As we want to telescope the above inequality, we require

On the other hand, we have already used that αkLkγ\alpha_{k}L_{k}\leqslant\gamma. These two conditions lead to the bound

where γ(0,1)\gamma\in(0,1) can be arbitrary. Now by playing with different values of γ\gamma, we obtain different instances of adaptive gradient descent method. For instance, by setting γ=12\gamma=\frac{1}{\sqrt{2}}, we get

which is a strict improvement upon the original version in [MM20]. A simple reason why this is possible is that, unlike in [MM20], we did not resort to the Cauchy-Schwarz inequality and instead relied on transformation (2.1) and Lemma 1.

We summarize the new scheme in Algorithm 1. We do not provide a formal proof for this scheme and hope that inequality (2.1) should be sufficient for the curious reader to complete the proof. In any case, the next section will contain a further improvement with all the missing proofs.

One might notice that we have used several times monotonicity of f\nabla f, where we actually could use a stronger property of cocoercivity (5). That is true, but we just prefer simplicity. We recommend work [Lat+23] that exploits cocoercivity in this framework.

Adaptive gradient descent: larger stepsize

In this section we modify Algorithm 1 to use even larger steps resulting in Algorithm 2. This, however, will require slightly more complex analysis.

Recall the notation [t]+=max{t,0}[t]_{+}=\max\{t,0\} and note that the second bound αkαk1[2αk12Lk21]+\alpha_{k}\leqslant\frac{\alpha_{k-1}}{\sqrt{[2\alpha_{k-1}^{2}L_{k}^{2}-1]_{+}}} is equivalent to

which obviously allows for larger range of αk\alpha_{k} than αk2Lk212\alpha_{k}^{2}L_{k}^{2}\leqslant\frac{1}{2} in Algorithm 1. On the other hand, the first bound αk23+θk1αk1\alpha_{k}\leqslant\sqrt{\frac{2}{3}+\theta_{k-1}}\alpha_{k-1} is definitely worse. At the moment, it is not even clear whether it allows αk\alpha_{k} to increase.

A notable distinction between Algorithm 2 and Algorithm 1 is that the former allows to use a standard fixed step αk=1L\alpha_{k}=\frac{1}{L}, provided that ff is LL-smooth. For instance, if we start from α0=1L\alpha_{0}=\frac{1}{L} and use LLkL\geqslant L_{k} in every iteration (we can always use a larger value), then it follows from (15) and θk113\theta_{k-1}\geqslant\frac{1}{3} that αk=1L\alpha_{k}=\frac{1}{L} for all k1k\geqslant 1.

This is an important though not very exciting topic. Algorithm 2 requires an initial stepsize α0\alpha_{0}. While, as it will be proved later, the algorithm converges for any value α0>0\alpha_{0}>0 with the rate

(see eq. (21) for the definition of RR), the choice of α0\alpha_{0} will impact further steps due to the bound αk2/3+θk1αk1\alpha_{k}\leqslant\sqrt{2/3+\theta_{k-1}}\alpha_{k-1}. Because of this reason, we do not want to choose α0\alpha_{0} too small. On the other hand, too large α0\alpha_{0} will make RR large. To counterbalance these two extremes, we suggest to do the following:

The upper bound ensures that α0\alpha_{0} is not too large, while the lower ensures that it is not too small either. In most scenarios, this requires to run a linesearch, but we emphasize one more time: it is only needed for the first iteration. In some sense, our condition (16) is similar to classical Goldstein’s rule [Gol62] on selecting the stepsize: not too small and not too big.

Of course, if we start with a very small α0\alpha_{0}, only the first bound for αk\alpha_{k} will be active for some time, and we will eventually reach a reasonable range for a stepsize. However, linesearch with a more aggressive factor (say, 1010) will allow us to reach this range faster. If we start with α0=108\alpha_{0}=10^{-8} when in fact a reasonable range for steps in this region is $,thenwewillneedatleast, then we will need at least100iterationsofourmethod,whilelinesearchwithafactoriterations of our method, while linesearch with a factor10willfinditinlessthanwill find it in less than10$ iterations.

It may happen that the problem is degenerated in a sense that for any α0\alpha_{0}, α0L1<12\alpha_{0}L_{1}<\frac{1}{\sqrt{2}}. In other words, increasing α0\alpha_{0} leads to decreasing L1L_{1} and linesearch may never stop. In this case we should terminate a linesearch after α0\alpha_{0} reaches any prescribed value, say 11.

1 Analysis

For iterates (xk)(x^{k}) of Algorithm 2 it holds

Before we continue, let us give some intuition for this lemma. Its analysis follows mostly the same steps as in (2.1). However, now we will split αk2f(xk1)2\alpha_{k}^{2}\|\nabla f(x^{k-1})\|^{2} into two parts and use one of it to improve the smoothness bound for αk\alpha_{k}.

We start from the third line in (2.1) and then apply the above-mentioned splitting:

For iterates (xk)(x^{k}) of Algorithm 2 and any solution xx^{*} it holds

The sequence (xk)(x^{k}) is bounded. In particular, for any solution xx^{*} we have xkB(x,R)x^{k}\in B(x^{*},R), where

The first bound for αk\alpha_{k} in Algorithm 2 gives us 3αkθk(2+3θk1)αk13\alpha_{k}\theta_{k}\leqslant(2+3\theta_{k-1})\alpha_{k-1}. We use it in (3) and telescope then to obtain

This immediately implies that (xk)(x^{k}) is bounded, but we would like to obtain the bound without an intermediate iterate x1x^{1}. From (2.1) we know that

Using that θ0=13\theta_{0}=\frac{1}{3} completes the proof. ∎

We could have used θ0=0\theta_{0}=0 as we did in Algorithm 1 which would have improved the final constant RR. However, since the first bound for αk\alpha_{k} is worse this time, we would need a more complicated initial bound for α0\alpha_{0}. We decided to keep it simple.

Notation. For brevity, we write αk1\alpha_{k}\mapsto 1 to denote that in the kk-th iteration αk\alpha_{k} satisfies the first bound, that is αk=23+θk1\alpha_{k}=\sqrt{\frac{2}{3}+\theta_{k-1}}. Similarly, for αk2\alpha_{k}\mapsto 2. Also let LL be the Lipschitz constant of f\nabla f over the set B(x,R)B(x^{*},R). This means that LkLL_{k}\leqslant L for all kk.

Next few statements are not very important for the first reading, as they only concern with a lower bound of i=1kαi\sum_{i=1}^{k}\alpha_{i}. The main statement in Theorem 2 is valid independently of them, so the reader can go directly there.

If αk2\alpha_{k}\mapsto 2, then αk12L\alpha_{k}\geqslant\frac{1}{\sqrt{2}L} and αk1+αk2L\alpha_{k-1}+\alpha_{k}\geqslant\frac{2}{L}.

Note that in this case αk=αk12αk12Lk21\alpha_{k}=\frac{\alpha_{k-1}}{\sqrt{2\alpha_{k-1}^{2}L_{k}^{2}-1}}, and hence 1αk12+1αk2=2Lk2\frac{1}{\alpha_{k-1}^{2}}+\frac{1}{\alpha_{k}^{2}}=2L_{k}^{2}. This implies that αk12Lk12L\alpha_{k}\geqslant\frac{1}{\sqrt{2}L_{k}}\geqslant\frac{1}{\sqrt{2}L}. By AM-GM inequality,

and the conclusion αk1+αk82Lk2=2Lk\alpha_{k-1}+\alpha_{k}\geqslant\sqrt{\frac{8}{2L^{2}_{k}}}=\frac{2}{L_{k}} follows. ∎

If α0\alpha_{0} satisfies (16), then αk13L\alpha_{k}\geqslant\frac{1}{\sqrt{3}L} for all k1k\geqslant 1.

We use induction. For k=1k=1, we have either α1=23+θ0α012L\alpha_{1}=\sqrt{\frac{2}{3}+\theta_{0}}\alpha_{0}\geqslant\frac{1}{\sqrt{2}L} or α12\alpha_{1}\mapsto 2, which in view of Lemma 5 also implies α112L\alpha_{1}\geqslant\frac{1}{\sqrt{2}L}.

Suppose that αk113L\alpha_{k-1}\geqslant\frac{1}{\sqrt{3}L} and we must show that αk13L\alpha_{k}\geqslant\frac{1}{\sqrt{3}L}. If αk2\alpha_{k}\mapsto 2, then we are done. Therefore, suppose that αk1\alpha_{k}\mapsto 1. Consider two options for αk1\alpha_{k-1}. If αk11\alpha_{k-1}\mapsto 1, then θk12/3\theta_{k-1}\geqslant\sqrt{2/3}. Thus, for αk\alpha_{k} we have that

If αk12\alpha_{k-1}\mapsto 2, then αk112L\alpha_{k-1}\geqslant\frac{1}{\sqrt{2}L} and hence

It is clear from above proof that condition α012L1\alpha_{0}\geqslant\frac{1}{\sqrt{2}L_{1}} from (16) was used only to give us the basis for induction. Without that condition, one can still show in the same way that αkmin{α0,13L}\alpha_{k}\geqslant\min\{\alpha_{0},\frac{1}{\sqrt{3}L}\}.

Summing this result from 11 to kk yields i=1kαik3L\sum_{i=1}^{k}\alpha_{i}\geqslant\frac{k}{\sqrt{3}L}. The stepsize in the previous section is lower bounded by a 12L\sqrt{1}{\sqrt{2}L}, so it is natural to wonder: why is the current section contains a “larger stepsize”? The answer is that while we cannot show that each individual step is larger, we still show in the next theorem that its total length will be lower bounded by the same quantity.

Let ff be convex with a locally Lipschitz gradient f\nabla f. Then the sequence (xk)(x^{k}) generated by Algorithm 2 converges to a solution and

where RR is defined as in (21). In particular, if α0\alpha_{0} satisfies (16), then

where LL is the Lipschitz constant of f\nabla f over B(x,R)B(x^{*},R).

Of course, the important bound here is (23). The second bound only shows that our choice of stepsizes αk\alpha_{k} cannot be too bad. The bound in (24) is stronger than the bound 3LR22k\frac{\sqrt{3}LR^{2}}{2k}, which could be obtained as a direct consequence of Lemma 6. The derivation of the sharper bound as in (24) is presented in Section 3.2.

We proceed in the same way as in Lemma 4, but this time we keep all the terms that were discarded earlier. Specifically, summing (3) over all kk yields

where the last two bounds follow from the same arguments as in Lemma 4. Note that each factor (αk(2+3θk)3αk+1θk+1)(\alpha_{k}(2+3\theta_{k})-3\alpha_{k+1}\theta_{k+1}) is nonnegative and their sum is

In particular, if α0\alpha_{0} satisfies (16), then inequality (24) is a direct consequence of Lemma 10, which we prove in the next section.

It remains to prove that (xk)(x^{k}) converges to a solution. Next arguments will be similar to the ones in [MM20]. We have already proved that (xk)(x^{k}) is bounded. As ff is LL-smooth over B(x,R)B(x^{*},R), we have

Using this sharper bound instead of plain convexity in (2.1) and repeating the same arguments as in Lemma 3, we end up with the same inequality plus the extra term

Now, by telescoping this inequality we infer that i=1kαiLf(xi)2R2\sum_{i=1}^{k}\frac{\alpha_{i}}{L}\|\nabla f(x^{i})\|^{2}\leqslant R^{2}. Since the sequence (αk)(\alpha_{k}) is separated from (note that this is independent of condition (16) by Remark 4), we conclude that f(xk)0\nabla f(x^{k})\to 0 as kk\to\infty. Hence, all limit points of (xk)(x^{k}) are solutions. Applying 3θkαk(2+3θk1)αk13\theta_{k}\alpha_{k}\leqslant(2+3\theta_{k-1})\alpha_{k-1} in (3.1) we get

where bk=xkxk12+αk1(2+3θk1)(f(xk1)f)b_{k}=\|x^{k}-x^{k-1}\|^{2}+\alpha_{k-1}(2+3\theta_{k-1})(f(x^{k-1})-f_{*}). Then the convergence of (xk)(x^{k}) to a solution follows from the standard Opial-type arguments. ∎

2 Better bounds for the sum of stepsizes

In this section we prove the bound i=1kαik2L\sum_{i=1}^{k}\alpha_{i}\geqslant\frac{k}{\sqrt{2}L}.

If θk<13\theta_{k}<\frac{1}{3}, then αk2\alpha_{k}\mapsto 2 and αk1Lk>5\alpha_{k-1}L_{k}>\sqrt{5}, αk2Lk32\alpha_{k-2}L_{k}\geqslant\frac{3}{2}, αk3Lk1\alpha_{k-3}L_{k}\geqslant 1.

By definition, αk1\alpha_{k}\mapsto 1 means that αk=23+θk1αk1\alpha_{k}=\sqrt{\frac{2}{3}+\theta_{k-1}}\alpha_{k-1} and thus θk23\theta_{k}\geqslant\sqrt{\frac{2}{3}}. Hence, αk2\alpha_{k}\mapsto 2. Then we have that θk=12αk12Lk21<13\theta_{k}=\frac{1}{\sqrt{2\alpha_{k-1}^{2}L_{k}^{2}-1}}<\frac{1}{3} which implies αk1Lk>5\alpha_{k-1}L_{k}>\sqrt{5}. Since we get a large αk1\alpha_{k-1}, the first bound on stepsizes does not allow previous steps to be much smaller. That is the idea we shall use.

For any kk, we have that θk23+θk1\theta_{k}\leqslant\sqrt{\frac{2}{3}+\theta_{k-1}}. As θ01\theta_{0}\leqslant 1, it is trivial to prove that θk1+1132t0\theta_{k}\leqslant\frac{1+\sqrt{\frac{11}{3}}}{2}\eqqcolon t_{0}, which is the root of t23+t=0t-\sqrt{\frac{2}{3}+t}=0. From αk1Lk>5\alpha_{k-1}L_{k}>\sqrt{5}, it follows that

Hence, to prove αk2Lk32\alpha_{k-2}L_{k}\geqslant\frac{3}{2}, it only remains to check that 5t032\frac{\sqrt{5}}{t_{0}}\geqslant\frac{3}{2}.

And to prove αk3Lk1\alpha_{k-3}L_{k}\geqslant 1, we must check that 32t01\frac{3}{2t_{0}}\geqslant 1. ∎

Given the sequence (αk)k1(\alpha_{k})_{k\geq 1}, we call its element αm\alpha_{m} a breakpoint, if θm<13\theta_{m}<\frac{1}{3} and αm<1L\alpha_{m}<\frac{1}{L} . The next lemma says that a small step can only occur shortly after a breakpoint.

If αk<12L\alpha_{k}<\frac{1}{\sqrt{2}L}, then exactly one of the following holds

αk1<αk\alpha_{k-1}<\alpha_{k} and αk2\alpha_{k-2} is a breakpoint.

In view of Lemma 5, the statement implies that αk1\alpha_{k}\mapsto 1. Suppose that αk1\alpha_{k-1} is not a breakpoint, since otherwise we are done. This means that either (a) αk11L\alpha_{k-1}\geqslant\frac{1}{L} or (b) αk1<1L\alpha_{k-1}<\frac{1}{L} and θk113\theta_{k-1}\geqslant\frac{1}{3}. In the first case we immediately get a contradiction, since αk=23+θk1αk1231L>12L\alpha_{k}=\sqrt{\frac{2}{3}+\theta_{k-1}}\alpha_{k-1}\geqslant\sqrt{\frac{2}{3}}\frac{1}{L}>\frac{1}{\sqrt{2}L}. Then if we consider (b), the bound θk113\theta_{k-1}\geqslant\frac{1}{3} implies that αk1αk<12L\alpha_{k-1}\leqslant\alpha_{k}<\frac{1}{\sqrt{2}L}. Then we can apply the same arguments as above, but to αk1\alpha_{k-1}. This means that either αk2\alpha_{k-2} will be a breakpoint or we will have a chain αk2αk1αk<12L\alpha_{k-2}\leqslant\alpha_{k-1}\leqslant\alpha_{k}<\frac{1}{\sqrt{2}L}. However, the latter option cannot occur, because using θk11\theta_{k-1}\geqslant 1 and αk113L\alpha_{k-1}\geqslant\frac{1}{\sqrt{3}L} ensure us that

Although a breakpoint indicates that we are in the region with a small stepsize, Lemma 7 guarantees that previous steps were quite large. The next lemma shows that in total we make a significant progress.

If αm\alpha_{m} is a breakpoint, then j=22αm+j>5L\sum_{j=-2}^{2}\alpha_{m+j}>\frac{5}{L}.

If αm\alpha_{m} is a breakpoint, then on one hand Lemma 7 implies that αm15Lm\alpha_{m-1}\geqslant\frac{\sqrt{5}}{L_{m}}, αm232Lm\alpha_{m-2}\geqslant\frac{3}{2L_{m}}. On the other hand, we have that αm12L\alpha_{m}\geqslant\frac{1}{\sqrt{2}L}, αm+113L\alpha_{m+1}\geqslant\frac{1}{\sqrt{3}L}, and αm+213L\alpha_{m+2}\geqslant\frac{1}{\sqrt{3}L}. Combining, we get

If α0\alpha_{0} satisfies (16), then for any k1k\geqslant 1 we have

Let M={m is a breakpoint: αm+1<12L}\mathcal{M}=\{m\text{ is a breakpoint: }\alpha_{m+1}<\frac{1}{\sqrt{2}L}\}. We can split i=1kαi\sum_{i=1}^{k}\alpha_{i} into two terms as

We claim that elements in the “rest” are greater or equal than 12L\frac{1}{\sqrt{2}L}. Indeed, if αi<12L\alpha_{i}<\frac{1}{\sqrt{2}L} is in the “rest” term, then either αi1\alpha_{i-1} is a breakpoint or αi1<12L\alpha_{i-1}<\frac{1}{\sqrt{2}L} and αi2\alpha_{i-2} is a breakpoint, as Lemma 8 suggests. In either case, αi\alpha_{i} must be included in the first sum, by the definition of M\mathcal{M}.

Now let us estimate both terms. The first sum in (28) is greater than 5ML>5M2L\frac{5|\mathcal{M}|}{L}>\frac{5|\mathcal{M}|}{\sqrt{2}L}, by Lemma 9. The total sum in the “rest” term is not less than k5M2L\frac{k-5|\mathcal{M}|}{\sqrt{2}L}. Hence, the desired inequality follows. It has to be only noted that if k1Mk-1\in\mathcal{M}, we have to additionally consider the sum j=21αk1+j42L\sum_{j=-2}^{1}\alpha_{k-1+j}\geqslant\frac{4}{\sqrt{2}L}, for which the bound follows from the same arguments as in Lemma 9. ∎

It is obvious that our analysis was not optimal. For instance, whenever αk2\alpha_{k}\mapsto 2, we could use αk1+αk2L\alpha_{k-1}+\alpha_{k}\geqslant\frac{2}{L} instead of more conservative 22L\frac{2}{\sqrt{2}L}. Similarly, we got a much better bound for every breakpoint. However, we did not want to overcomplicate an already tedious examination. We leave it as an open question if one can provide a bound closer to kL\frac{k}{L} (or better?) with a readable proof.

Adaptive proximal gradient method

In this section we turn to a more general problem of a composite optimization problem

As one can notice, the only difference between Algorithm 3 and Algorithm 2 is the presence of the proximal mapping. The analysis, however, is not a straightforward generalization.

Recall that the second bound for the stepsize αk\alpha_{k} is equivalent to

We can rewrite xk+1=proxαkg(xkαkf(xk))x^{k+1}=\operatorname{prox}_{\alpha_{k}g}(x^{k}-\alpha_{k}\nabla f(x^{k})) as an implicit equation

where ~g(xk+1)\widetilde{\nabla}g(x^{k+1}) is a certain subgradient of gg at xk+1x^{k+1}, that is ~g(xk+1)g(xk+1)\widetilde{\nabla}g(x^{k+1})\in\partial g(x^{k+1}). For this particular subgradient we will also use notation

First, we adapt our basic inequality (2.1) to the more general case. By prox-inequality, we have

Then we set x=xx=x^{*} above and transform it into

This standard inequality is at the heart of the analysis of the proximal gradient method. To complete the full proof, or rather to get the final inequality, the classical analysis only requires applying one convexity inequality and one descent lemma to function ff. Our analysis, however, will be different. The main nuisance is that in the kk-th iteration the proximal map yields us g(xk+1)g(x)g(x^{k+1})-g(x^{*}) term, while our adaptivity approach works with f(xk)f(x)f(x^{k})-f(x^{*}), as we remember from before. Thus, our first obstacle is to understand how to combine these two terms.

First, we estimate the term f(xk),xxk+1\langle\nabla f(x^{k}),x^{*}-x^{k+1}\rangle in the RHS of (33). We have

where in the last inequality we used separately convexity of ff and gg. Applying this inequality in (33) yields

As we see, the final inequality is very much in the spirit of (2.1).

For iterates (xk)(x^{k}) with arbitrary stepsizes, it holds

As before, this is just monotonicity of f\nabla f in disguise:

The next lemma is special for the composite case. Although it looks like this fact should be known in the literature, we were not able to identify it.

For iterates (xk)(x^{k}) of the proximal gradient method with arbitrary stepsizes, it holds

This time it is just a monotonicity of g\partial g in disguise:

where the last inequality follows from monotonicity of g\partial g and (31). ∎

In Section 3 we estimated xk+1xk2=αk2f(xk)2\|x^{k+1}-x^{k}\|^{2}=\alpha_{k}^{2}\|\nabla f(x^{k})\|^{2}. This time, xk+1xk2\|x^{k+1}-x^{k}\|^{2} and αk2~F(xk)2\alpha_{k}^{2}\|\widetilde{\nabla}F(x^{k})\|^{2} are different and it is the latter term that matters to us.

For iterates (xk)(x^{k}) of Algorithm 3 it holds

The main idea of the proof is exactly the same as in Lemma 2. However, the presence of ~g(xk)\widetilde{\nabla}g(x^{k}) make it slightly more cumbersome. The previous two lemmata are instrumental on our way. We have

For iterates (xk)(x^{k}) of Algorithm 3 and any solution xx^{*} it holds

The sequence (xk)(x^{k}) is bounded. In particular, for any solution xx^{*} of (29) we have xkB(x,R)x^{k}\in B(x^{*},R).

The same as in Lemma 4. We use (14) to telescope until k=1k=1 and then apply (4) with k=0k=0 to bound x1x2\|x^{1}-x^{*}\|^{2}. ∎

Let ff be convex with a locally Lipschitz gradient f\nabla f and gg be convex lsc. Then the sequence (xk)(x^{k}) generated by Algorithm 3 converges to a solution of (29) and

In particular, if α0\alpha_{0} satisfies (16), then

where LL is the Lipschitz constant of f\nabla f over B(x,R)B(x^{*},R).

The proof of inequalities (39) and (40) is almost identical to the one in Theorem 2. The proof of convergence of (xk)(x^{k}) to a solution is, however, more nuanced. The nontrivial part is to show that all limit points of (xk)(x^{k}) are solutions. While on the surface, it should be no harder than before, the fact that limk+αk\lim_{k\to+\infty}\alpha_{k} can be ++\infty complicates things a bit.

Let xx^{*} be a solution of (29). By LL-smoothness of ff over B(x,R)B(x^{*},R), we have

Using this improved bound, similarly as it was done in (3.1), we get

By telescoping this inequality as before, we can now additionally infer that

and thus, f(xk)f(x)0\|\nabla f(x^{k})-\nabla f(x^{*})\|\to 0. Specifically, this implies f(xk)f(xk1)0\nabla f(x^{k})-\nabla f(x^{k-1})\to 0 as kk\to\infty.

We want to prove that all limit points of (xk)(x^{k}) are solutions. To this end, we will use prox-inequality (32) rewritten as

which in turn, by convexity of ff, leads to

The left-hand side has two terms, and the second term evidently tends to as f(xk+1)f(xk)0\nabla f(x^{k+1})-\nabla f(x^{k})\to 0. If we can show the same for the first term, it will imply that all limit points of (xk)(x^{k}) are solutions.

Consider (43) again, but this time we set x=xkx=x^{k}. This yields

We manipulate the inequality above as follows

where in the last inequality we applied Cauchy-Schwarz and Young’s inequalities. From this we deduce that

Note that the sequence (αkf(xk)f(x)2)\left(\alpha_{k}\|\nabla f(x^{k})-\nabla f(x^{*})\|^{2}\right) is summable by (42). Also, the sequence (δk)(\delta_{k}) is summable, since (xk)(x^{k}) is bounded and g(xk)g(x^{k}) is lower-bounded: g(xk)Ff(xk)>g(x^{k})\geqslant F_{*}-f(x^{k})>-\infty for all kk. Hence, k1αkxk+1xk2<+\sum_{k}\frac{1}{\alpha_{k}}\|x^{k+1}-x^{k}\|^{2}<+\infty and, thus, 1αkxk+1xk20\frac{1}{\alpha_{k}}\|x^{k+1}-x^{k}\|^{2}\to 0 as k+k\to+\infty. Given that (αk)(\alpha_{k}) is separated from zero, it immediately follows that 1αkxk+1xk0\frac{1}{\alpha_{k}}\|x^{k+1}-x^{k}\|\to 0 as well.

Therefore, we have proved that all limit points of (xk)(x^{k}) are solutions. The proof of convergence of the whole sequence (xk)(x^{k}) runs as before in Theorem 2. ∎

We didn’t derive a linear convergence of the adaptive proximal gradient, when FF is strongly convex. We only mention that it is quite straightforward and goes along the same lines as the original AdGD in [MM20, Theorem 2] in the strongly convex regime.

Literature and discussion

There are many variants of linesearch procedures that go back to celebrated works of Goldstein [Gol62] and Armijo [Arm66]. We discuss an efficient implementation of the latter in detail in the next section. For other variants of linesearch, we refer to [BN16, Sal17].

Adagrad-type methods.

Original Adagrad algorithm was proposed simultaneously in [DHS11] and [MS10]. The method has had a stunning impact on machine learning applications. It has also spawned a stream of various extensions that retain the same idea of using eventually decreasing steps. Because of this, its adaptivity is more prominent in the non-smooth regime, where stepsizes must be diminishing to guarantee convergence. Recent works [DM23, IHC23] have proposed ways to increase dkd_{k} in the update (2) and [KMJ23] even proved convergence of such method on smooth objectives. However, the stepsize in these method still eventually stops increasing as dkd_{k} is upper bounded by a constant.

In addition, Adagrad-type methods are usually sensitive to the initialization, as they either degrade in performance when dk=Dd_{k}=D and DD is not chosen carefully, or their convergence rate depends multiplicatively on log(x0x/d0)\log(\|x^{0}-x^{*}\|/d_{0}). In contrast, in our methods, the cost of estimating α0\alpha_{0} to satisfy condition (16) is additive and its impact vanishes as the total number of iterations increases.

Refined results on GD with a fixed stepsize.

Paper [TV22] summarizes quite well the difficulty of GD analysis with large steps. In it, the authors derive sharp convergence bounds separately for two cases αL(0,1]\alpha L\in(0,1] and αL(1,2)\alpha L\in(1,2), and the latter case is considerably harder. In our analysis it is even harder, since the steps can go far beyond the global upper bound 2L\frac{2}{L}. A surprising recent result [Gri23] showcases how actually little is understood in this case.

Small gradient.

The lack-of-descent property makes it hard to deduce the O(1/k)\mathcal{O}(1/k) rate for the last-iterate f(xk)\|\nabla f(x^{k})\|, which is known for GD with a fixed stepsize. We leave it as an open problem to establish a rate.

Extensions.

Because the analysis of the algorithm is so special, it is not easy to extend it to basic generalizations of GD. However, some works have already build upon it. In [VMC21], the authors consider a convex smooth minimization subject to linear constraints and combined the adaptive GD [MM20] with the Chambolle-Pock algorithm [CP10]. The authors of [Lat+23] went even further and considered a more general composite minimization problem subject to linear constraints, where the same two ideas as before were combined with a novel way of handling the prox mapping.

If we consider variational inequalities settings in the monotone case, then it is not clear how such adaptivity can help, since the most natural extension, the forward-backward method will diverge. On the other hand, an adaptive golden ratio algorithm [Mal19], which is the precursor of a given work, already has all the properties that AdProxGD has.

Experiments

In the experimentshttps://github.com/ymalitsky/AdProxGD we compare our method to the ProxGD with Armijo’s linesearch. We believe it is the best and arguably the most popular alternative to our method. An efficient implementation of Armijo’s linesearch requires two parameters, s>1s>1 and r<1r<1. In the kk-th iteration, the first iteration of linesearch starts from αk=sαk1\alpha_{k}=s\alpha_{k-1}, that is, we want to try a slightly larger step than in the previous iteration. If linesearch does not terminate, we start decreasing a stepsize geometrically with a ratio rr. Formally, we are looking for the largest αk=sriαk1\alpha_{k}=sr^{i}\alpha_{k-1}, for i=0,1,i=0,1,\dots, such that for xk+1=proxαkg(xkαkf(xk))x^{k+1}=\operatorname{prox}_{\alpha_{k}g}(x^{k}-\alpha_{k}\nabla f(x^{k})) it holds that

It is evident that each iteration of this linesearch requires one evaluation of ff and proxg\operatorname{prox}_{g}. However, it is important to highlight that in some cases, the last evaluation of f(xk+1)f(x^{k+1}) (during linesearch) may not incur any additional costs, as certain expensive operations, such as matrix-vector multiplication, can be reused to compute the next gradient f(xk+1)\nabla f(x^{k+1}). Throughout our comparisons, we consistently took these factors into account and reported only essential operations that cannot be further reused.

Our legend will stay the same for all plots:

AdProxGD (1.2, 0.5) (1.5, 0.8) (1.1, 0.5) (1.2, 0.9) (1.1, 0.9) (1.5, 0.5) (1.2, 0.8) (1.1, 0.8) (1.5, 0.9)

where each pair of numbers represent (s,r)(s,r) for ProxGD with linesearch described above. As we will see, the choice of (s,r)(s,r) matters a lot.

We consider [BV04, Equation (7.5)], where our goal is to estimate the inverse of a covariance matrix YY subject to eigenvalue bounds. Formally, this problem can be formulated as follows

Low-rank matrix completion

We consider a famous low-rank matrix completion problem in the form

where Ω\Omega is a subset of indices (i,j)(i,j) and rr is the supposed maximum rank. To project onto the spectral ball C={X ⁣:Xr}\mathcal{C}=\{X\colon\|X\|_{*}\leqslant r\}, computing the singular value decomposition (SVD) is required, making it the most computationally expensive operation in this setting.

We created matrix AA by multiplying matrices UU and VV^{\top}, where UU and VV are nn-by-rr matrices with entries sampled from a normal distribution. The subset Ω\Omega was randomly chosen as a fraction of 15n2\frac{1}{5}n^{2} entries from [n]×[n][n]\times[n]. The obtained results are depicted in Figure 2, where we solely compared the number of computed SVDs.

Minimal length piecewise-linear curve subject to linear constraints.

While applying the proximal gradient method, the most computationally expensive operation is computing the projection onto C={x ⁣:Ax=b}\mathcal{C}=\{x\colon Ax=b\}. Assuming that AA is full rank with mnm\leqslant n, this projection can be computed as PCz=zA(AA)1(Azb)P_{\mathcal{C}}z=z-A^{\top}(AA^{\top})^{-1}(Az-b).

In comparison, we focused solely on the number of computed projections. We generated a random mm-by-nn matrix AA and random vector ww with entries sampled from a normal distribution and set b=Awb=Aw.

Nonnegative matrix factorization.

We want to solve the matrix factorization problem subject to nonnegative constraints:

where AA is a given nn-by-nn low-rank matrix. Although nonconvex, this problem is famously well-tackled by first-order methods. In each iteration, the gradient f(x)\nabla f(x) involves 3 matrix-matrix multiplications, whereas evaluating the objective f(x)f(x) only requires 1. Note that for the last iteration of the linesearch, the computed matrix product can be reused to compute the gradient for the next iteration.

We created matrix AA by multiplying matrices BB and CC^{\top}, where BB and CC are nn-by-rr matrices with entries sampled from a normal distribution. Negative entries in both matrices BB and CC were then set to zero. The results are presented in Figure 4, where the number of gradients roughly means the number of 3 matrix-matrix multiplications.

Dual of the entropy maximization problem.

Consider the entropy maximization problem subject to linear constraints

Conclusion.

Based on our preliminary experiments, it is evident that AdProxGD indeed performs better. To our surprise, a few specific pairs (r,s)(r,s) consistently outperform the rest among ProxGD with linesearch. We are not aware of any theoretical finding that would confirm this evidence.

Appendix

The function ff defined in (8) satisfies the following properties:

ff^{\prime} is LL-Lipschitz with L=1L=1.

ff is locally strongly convex, i.e., for any bounded set X\mathcal{X} there exists a constant μX>0\mu_{\mathcal{X}}>0 such that f(x)f(y)μXxy|f^{\prime}(x)-f^{\prime}(y)|\geqslant\mu_{\mathcal{X}}|x-y| for any x,yXx,y\in\mathcal{X}.

First, let us find ff^{\prime} and ff^{\prime\prime}:

so indeed a=2a=2 and b=2log232b=2\log 2-\frac{3}{2}. Convexity of ff follows from the fact that f(x)>0f^{\prime\prime}(x)>0 for any xx. Lipschitzness of ff^{\prime} follows directly from the bound f(x)1f^{\prime\prime}(x)\leqslant 1 for all xx. Similarly, local strong convexity follows from the bound f(x)1maxzX(1+z)2μXf^{\prime\prime}(x)\geq\frac{1}{\max_{z\in\mathcal{X}}(1+|z|)^{2}}\eqqcolon\mu_{\mathcal{X}} for any xXx\in\mathcal{X}. Finally, the last two properties trivially follow from the expression for f(x)f^{\prime}(x). ∎

Let us choose x0=r+2x^{0}=r+2 with a sufficiently large r>6cr>6c. This readily implies that

Our goal is to show that the iterates follow a very specific pattern. Namely, we prove that for all k0k\geqslant 0,

If this condition holds true, then the sequence (xk)(x^{k}) must be divergent.

First, observe that if xk,xk1r|x^{k}|,|x^{k-1}|\geqslant r and sign(xk)=sign(xk1)\operatorname{sign}(x^{k})=\operatorname{sign}(x^{k-1}), then the smoothness estimate admits a simple expression:

Therefore, in that case αkf(xk)>r(1+xk)2cf(xk)=rxkc>3xk\alpha_{k}|f^{\prime}(x^{k})|>\frac{r(1+|x^{k}|)}{2c}|f^{\prime}(x^{k})|=\frac{r|x^{k}|}{c}>3|x^{k}|. Since sign(f(xk))=sign(xk)\operatorname{sign}(f^{\prime}(x^{k}))=\operatorname{sign}(x^{k}), it implies that xk+1>2xk|x^{k+1}|>2|x^{k}| and sign(xk+1)sign(xk)\operatorname{sign}(x^{k+1})\neq\operatorname{sign}(x^{k}).

Next, if xk,xk1r>3|x^{k}|,|x^{k-1}|\geqslant r>3 with xk2xk1|x^{k}|\geqslant 2|x^{k-1}| and sign(xk)sign(xk1)\operatorname{sign}(x^{k})\neq\operatorname{sign}(x^{k-1}), then we have 2xk1+xk>32\frac{2|x^{k}|}{1+|x^{k}|}>\frac{3}{2} and

This implies αk<xk2cxk2\alpha_{k}<\frac{|x^{k}|}{2c}\leqslant\frac{|x^{k}|}{2}. Since sign(f(xk))=sign(xk)\operatorname{sign}(f^{\prime}(x^{k}))=\operatorname{sign}(x^{k}) and αk1+xkxk2(1+xk)<12\frac{\alpha_{k}}{1+|x^{k}|}\leqslant\frac{|x^{k}|}{2(1+|x^{k}|)}<\frac{1}{2}, we conclude that sign(xk+1)=sign(xk)\operatorname{sign}(x^{k+1})=\operatorname{sign}(x^{k}) and

As x0x^{0} and x1x^{1} satisfy the first case, by induction we deduce that all iterates (xk)(x^{k}) follow the described pattern. ∎

The authors would like to thank Puya Latafat, who found a subtle error in the convergence proof of (xk)(x^{k}) in Theorem 3 in the first version of this manuscript.

References