Adaptive Proximal Gradient Method for Convex Optimization

Yura Malitsky, Konstantin Mishchenko

Intro

In this paper, we address a convex minimization problem

We are interested in the cases when either $F$ is differentiable and then we will use notation $F=f$ , or it has a composite additive structure as $F=f+g$ . Here, $f$ represents a convex and differentiable function, while $g$ is convex, lower semi-continuous (lsc), and prox-friendly. Throughout the paper, we will interchangeably refer to the smoothness of $f$ and the Lipschitzness of $\nabla f$ , occasionally with the adjective "locally," indicating that it is restricted to a bounded set. We will refer to this property as smoothness, without mentioning the Lipschitzness of $f$ , so we hope there will be no confusion in this regard.

For simplicity, in most of the introduction, we consider only the simpler problem $\min_{x}f(x)$ . We study one of the most classical optimization algorithm — gradient descent —

Its simplicity and the sole prerequisite of knowing the gradient of $f$ make it appealing for diverse applications. This method is central in modern continuous optimization, forming the bedrock for numerous extensions.

Given the initial point $x^{0}$ , the only thing we need to implement (1) is to choose a stepsize $\alpha_{k}$ (also known as a learning rate in machine learning literature). This seemingly tiny detail is crucial for the method convergence and performance. When a user invokes GD as a solver, the standard approach would be to pick an arbitrary value for $\alpha_{k}$ , run the algorithm, and observe its behavior. If it diverges at some point, the user would try a smaller stepsize and repeat the same procedure. If, on the other hand, the method takes too much time to converge, the user might try to increase the stepsize. In practice, this approach is not very efficient, as we have no theoretical guarantees for a randomly guessed stepsize, and the divergence may occur after a long time. Both underestimating and overestimating the stepsize can, thus, lead to a large overhead.

Below we briefly list possible approaches to choosing or estimating the stepsize and we provide a more detailed literature overview in Section 5.

When $f$ is $L$ -smooth, GD can utilize a fixed stepsize $\alpha_{k}=\alpha<\frac{2}{L}$ and values larger than $\frac{2}{L}$ will provably lead to divergence. Consequently, in such scenarios, the rate of convergence is given by $f(x^{k})-f_{*}=\mathcal{O}\left(\frac{1}{\alpha k}\right)$ , clearly indicating a direct dependence on the stepsize. Nevertheless, several drawbacks emerge from this approach:

$L$ is not available in many practical scenarios;

if the curvature of $f$ changes a lot, GD with the global value of $L$ may be too conservative;

Linesearch.

Also known as backtracking in the literature. In the $k$ -th iteration we compute $x^{k+1}$ with a certain stepsize $\alpha_{k}$ and check a specific condition. If the condition holds, we accept $x^{k+1}$ and proceed to the next iteration; otherwise we halve $\alpha_{k}$ and recompute $x^{k+1}$ using this reduced stepsize. This approach, while the most robust and theoretically sound, incurs substantially higher computational cost compared to regular GD due to the linesearch procedure.

Adagrad-type algorithms.

These are the methods of the typeWe provide only the simplest instance of such algorithms.

where $v_{-1}\geq 0$ is some constants, and $d_{k}$ is an estimate of $\|x^{0}-x^{*}\|$ some some solution $x^{*}$ . While such methods indeed have certain nice properties, $d_{k}$ is usually either constant or quickly converges to a constant value, so a quick glance at (2) will reveal that its stepsizes are decreasing. Therefore, despite the name, we cannot expect true adaptivity of this method to the local curvature of $f$ .

Heuristics.

Numerous heuristics exist for selecting $\alpha_{k}$ based on local properties of $f$ and $\nabla f$ , with the Barzilai-Borwein method [BB88] being among the most widely popular. However, it is crucial to note that we are not particularly interested in such approaches, as they lack consistency and may even lead to divergence, even for simple convex problems.

We have already mentioned adaptivity a few times, without properly introducing it. Now let us try to properly understand its meaning in the context of gradient descent. Besides the initial point $x^{0}$ , GD has only one degree of freedom — its stepsize. From the analysis we know that it has to be approximately an inverse of the local smoothness. We call a method adaptive, if it automatically adapts a stepsize to this local smoothness without additional expensive computation and the method does not deteriorate the rate of the original method in the worst case. In our case, original method is GD with a fixed stepsize.

By this definition, GD with linesearch is not adaptive, because it finds the right stepsize with some extra evaluations of $f$ or $\nabla f$ . GD with diminishing steps (as in subgradient or Adagrad methods) is also not adaptive, because decreasing steps cannot in general represent well the function’s curvature; also the rate of the subgradient method is definitely worse. It goes without saying, that for a good method its rate must experience improvement when we confine the class of smooth convex functions to the strongly convex ones.

In a previous work [MM20], which serves as the cornerstone for the current paper, the authors proposed an adaptive gradient method named “Adaptive Gradient Descent without Descent” (AdGD). In the current paper, we

deepen our understanding of AdGD and identify its limitations;

refine its theory to accommodate even larger steps;

extend the revised algorithm from unconstrained to the proximal case.

The analysis in the last two cases is not a trivial extension, and we were rather pleasantly surprised that this was possible at all. After all, the theory of GD is well-established and we thought it to be too well-explored for us to discover something new.

Continuous point of view.

It is instructive for some time to switch from the discrete setting to the continuous and to compare gradient descent (GD) with its parent — gradient flow (GF)

where $t$ is the time variable and $x^{\prime}(t)$ denotes the derivative of $x(t)$ with respect to $t$ . To guarantee existence and uniqueness of a trajectory $x(t)$ of GF, it is sufficient to assume that $\nabla f$ is locally Lipschitz-continuous. Then one can prove convergence of $x(t)$ to a minimizer of $f$ in just a few lines. For GD, on the other hand, the central assumption is global Lipschitzness of $\nabla f$ . Our analysis of gradient descent makes it level: local Lipschitzness suffices for both. Or to put it differently, we provide an adaptive discretization of GF that converges under the same assumptions as the original continuous problem (3).

Proximal case.

We emphasize that there is already an excellent extension by Latafat et al. [Lat+23] of the work [MM20] to the additive composite case. Our proposed result, however, is based on an improved unconstrained analysis and uses a different proof. We believe that both these facts will be of interest. We don’t have a good understanding why, but for us finding the proof for the proximal case was quite challenging. It does not follow the standard lines of arguments and uses a novel Lyapunov energy in the analysis.

Nonconvex problems.

We believe that our algorithm will be no less important in the nonconvex case, where gradients are rarely globally Lipschitz continuous and where the curvature may change more drastically. It is true that our analysis applies only to the convex case, but, as far as we know, limited theory has never yet prevented practitioners from using methods in a broader setting. And based on our (speculative) experience, we found it challenging to identify nonconvex functions where the method did not converge to a local solution.

Outline.

In Section 2, we begin by revisiting AdGD from [MM20], examining its limitations, and demonstrating a simple way to enhance it. This section maintains an informal tone, making it easily accessible for quick reading and classroom presentation. In Section 3, we further improve the method and provide all formal proofs. Section 4 extends the improved method to the proximal case. In Section 5 we put our finding in the perspective and compare it to some existing works. Lastly, in Section 6, we conduct experiments to evaluate the proposed method against different linesearch variants.

1 Preliminaries

A convex $L$ -smooth function $f$ is characterized by the following inequality

This is equivalently of saying that $\nabla f$ is a $\frac{1}{L}$ -cocoercive operator, that is

For a convex differentiable $f$ that is not $L$ -smooth one can only say that $\nabla f$ is monotone, that is

We use notation $[t]_{+}=\max\{t,0\}$ and for any $a>0$ we suppose that $\frac{a}{0}=+\infty$ . With a slight abuse of notation, we write $[n]$ to denote the set $\{1,\dots,n\}$ . A solution and the value of the optimization problem $\min f(x)$ are denoted by $x^{*}$ and $f_{*}$ , respectively.

Adaptive gradient descent: better analysis

Similarly to the standard GD, this method leads to $\mathcal{O}(1/k)$ convergence rate. However, unlike the former, it doesn’t require any knowledge about Lipschitz constant of $\nabla f$ and doesn’t even require a global Lipschitz continuity of $\nabla f$ .

The update for $\alpha_{k}$ has two ingredients. The first bound $\alpha_{k}\leqslant\sqrt{1+\theta_{k-1}}\alpha_{k-1}$ sets how fast steps may increase from iteration to iteration. The second $\alpha_{k}\leqslant\frac{\|x^{k}-x^{k-1}\|}{2\|\nabla f(x^{k})-\nabla f(x^{k-1})\|}$ corresponds to the estimate of local Lipschitzness of $\nabla f$ .

It is important to understand how essential these bounds are. Do we really need to control the growth rate of $\alpha_{k}$ or is it an artifact of our analysis? For the second bound, it is not clear whether $2$ in the denominator is necessary. For example, given $L$ -smooth $f$ , our scheme (7) does not encompass a standard GD with $\alpha_{k}=\frac{1}{L}$ for all $k$ .

Answering the first question is relatively easy. Consider the following function

$1<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mn>1</mn></mrow><annotation encoding="application/x-tex">1</annotation></semantics></math>1x$ $f(x)$ where parameters $a,b>0$ are chosen to ensure that $f(\pm 1)$ and $f^{\prime}(\pm 1)$ are well-defined, namely $a=2$ and $b=2\log 2-\frac{3}{2}$ , see Lemma 16 in Appendix.

From an optimization point of view, $f$ is a nice function. In particular, it is convex (even locally strongly convex) and its gradient is $1$ -Lipschitz, see Lemma 16. This means that both GD and AdGD linearly converge on it. However, if we remove the first condition for $\alpha_{k}$ in AdGD, this new modified algorithm will fail to converge. We can prove an even stronger statement. Specifically, let $c\geqslant 1$ , $\alpha_{0}=1$ and consider the following method

In other words, the update in (9) is the same as in (7) except we removed the first constraint for $\alpha_{k}$ in (7) and introduced a constant factor $c$ to make the second one more general.

For any $c\geqslant 1$ there exists $x^{0}$ such that the method (9) applied to $f$ defined in (8) diverges.

The formal proof of this statement is in Appendix, but its main idea should be intuitively clear. First, observe that for $x$ with large absolute value, $f(x)$ behaves mostly like a linear function. However, $f^{\prime}(x)$ approaches $-1$ when $x\to-\infty$ and $+1$ when $x\to+\infty$ . Therefore, if $x^{k}$ and $x^{k-1}$ have the same sign, the local smoothness estimate will be too optimistic and $x^{k+1}$ will “leapfrog” the optimum. In contrast, if the signs of $x^{k}$ and $x^{k-1}$ are different, then $x^{k+1}$ will fail to get sufficiently close to the optimum. It is interesting to remark that on this function both versions of the Barzilai-Borwein method will diverge as well.

Consequently, the answer to the first question is affirmative: we do need some extra condition for the stepsize $\alpha_{k}$ .

Second bound.

The answer to the second question is the opposite: it is indeed an artifact of our previous analysis. In the next section, we propose an improvement over the previous version [MM20]. We give a concise presentation in an informal way. We keep a more formal style for section 3 where even better version (also slightly more complicated) will be presented.

1 Improving AdGD

The analysis of GD usually starts from the standard identity, followed by convexity inequality

In [MM20] the only “nontrivial” step in the proof was upper bounding $\alpha_{k}^{2}\|\nabla f(x^{k})\|^{2}$ , that is $\|x^{k+1}-x^{k}\|^{2}$ . Now we do it in a slightly different way. First, we need the following fact.

For GD iterates $(x^{k})$ with arbitrary stepsizes, it holds

This is just monotonicity of $\nabla f$ in disguise:

Now we are going to bound $\|x^{k+1}-x^{k}\|^{2}$ . For convenience, denote the approximate local Lipschitz constant as

Let $\alpha_{k}$ satisfy $\alpha_{k}\|\nabla f(x^{k})-\nabla f(x^{k-1})\|\leqslant\gamma\|x^{k}-x^{k-1}\|$ for some $\gamma>0$ , that is $\alpha_{k}L_{k}\leqslant\gamma$ . Using $\|u\|^{2}=\|u-v\|^{2}-\|v\|^{2}+2\langle u,v\rangle$ , we have

where the last inequality follows from convexity of $f$ . For $\gamma<1$ we can rewrite (2.1) as

Substituting this inequality into (2.1) gives us

As we want to telescope the above inequality, we require

On the other hand, we have already used that $\alpha_{k}L_{k}\leqslant\gamma$ . These two conditions lead to the bound

where $\gamma\in(0,1)$ can be arbitrary. Now by playing with different values of $\gamma$ , we obtain different instances of adaptive gradient descent method. For instance, by setting $\gamma=\frac{1}{\sqrt{2}}$ , we get

which is a strict improvement upon the original version in [MM20]. A simple reason why this is possible is that, unlike in [MM20], we did not resort to the Cauchy-Schwarz inequality and instead relied on transformation (2.1) and Lemma 1.

We summarize the new scheme in Algorithm 1. We do not provide a formal proof for this scheme and hope that inequality (2.1) should be sufficient for the curious reader to complete the proof. In any case, the next section will contain a further improvement with all the missing proofs.

One might notice that we have used several times monotonicity of $\nabla f$ , where we actually could use a stronger property of cocoercivity (5). That is true, but we just prefer simplicity. We recommend work [Lat+23] that exploits cocoercivity in this framework.

Adaptive gradient descent: larger stepsize

In this section we modify Algorithm 1 to use even larger steps resulting in Algorithm 2. This, however, will require slightly more complex analysis.

Recall the notation $[t]_{+}=\max\{t,0\}$ and note that the second bound $\alpha_{k}\leqslant\frac{\alpha_{k-1}}{\sqrt{[2\alpha_{k-1}^{2}L_{k}^{2}-1]_{+}}}$ is equivalent to

which obviously allows for larger range of $\alpha_{k}$ than $\alpha_{k}^{2}L_{k}^{2}\leqslant\frac{1}{2}$ in Algorithm 1. On the other hand, the first bound $\alpha_{k}\leqslant\sqrt{\frac{2}{3}+\theta_{k-1}}\alpha_{k-1}$ is definitely worse. At the moment, it is not even clear whether it allows $\alpha_{k}$ to increase.

A notable distinction between Algorithm 2 and Algorithm 1 is that the former allows to use a standard fixed step $\alpha_{k}=\frac{1}{L}$ , provided that $f$ is $L$ -smooth. For instance, if we start from $\alpha_{0}=\frac{1}{L}$ and use $L\geqslant L_{k}$ in every iteration (we can always use a larger value), then it follows from (15) and $\theta_{k-1}\geqslant\frac{1}{3}$ that $\alpha_{k}=\frac{1}{L}$ for all $k\geqslant 1$ .

This is an important though not very exciting topic. Algorithm 2 requires an initial stepsize $\alpha_{0}$ . While, as it will be proved later, the algorithm converges for any value $\alpha_{0}>0$ with the rate

(see eq. (21) for the definition of $R$ ), the choice of $\alpha_{0}$ will impact further steps due to the bound $\alpha_{k}\leqslant\sqrt{2/3+\theta_{k-1}}\alpha_{k-1}$ . Because of this reason, we do not want to choose $\alpha_{0}$ too small. On the other hand, too large $\alpha_{0}$ will make $R$ large. To counterbalance these two extremes, we suggest to do the following:

The upper bound ensures that $\alpha_{0}$ is not too large, while the lower ensures that it is not too small either. In most scenarios, this requires to run a linesearch, but we emphasize one more time: it is only needed for the first iteration. In some sense, our condition (16) is similar to classical Goldstein’s rule [Gol62] on selecting the stepsize: not too small and not too big.

Of course, if we start with a very small $\alpha_{0}$ , only the first bound for $\alpha_{k}$ will be active for some time, and we will eventually reach a reasonable range for a stepsize. However, linesearch with a more aggressive factor (say, $10$ ) will allow us to reach this range faster. If we start with $\alpha_{0}=10^{-8}$ when in fact a reasonable range for steps in this region is $ $, then we will need at least$ 100 $iterations of our method, while linesearch with a factor$ 10 $will find it in less than$ 10$ iterations.

It may happen that the problem is degenerated in a sense that for any $\alpha_{0}$ , $\alpha_{0}L_{1}<\frac{1}{\sqrt{2}}$ . In other words, increasing $\alpha_{0}$ leads to decreasing $L_{1}$ and linesearch may never stop. In this case we should terminate a linesearch after $\alpha_{0}$ reaches any prescribed value, say $1$ .

1 Analysis

For iterates $(x^{k})$ of Algorithm 2 it holds

Before we continue, let us give some intuition for this lemma. Its analysis follows mostly the same steps as in (2.1). However, now we will split $\alpha_{k}^{2}\|\nabla f(x^{k-1})\|^{2}$ into two parts and use one of it to improve the smoothness bound for $\alpha_{k}$ .

We start from the third line in (2.1) and then apply the above-mentioned splitting:

For iterates $(x^{k})$ of Algorithm 2 and any solution $x^{*}$ it holds

The sequence $(x^{k})$ is bounded. In particular, for any solution $x^{*}$ we have $x^{k}\in B(x^{*},R)$ , where

The first bound for $\alpha_{k}$ in Algorithm 2 gives us $3\alpha_{k}\theta_{k}\leqslant(2+3\theta_{k-1})\alpha_{k-1}$ . We use it in (3) and telescope then to obtain

This immediately implies that $(x^{k})$ is bounded, but we would like to obtain the bound without an intermediate iterate $x^{1}$ . From (2.1) we know that

Using that $\theta_{0}=\frac{1}{3}$ completes the proof. ∎

We could have used $\theta_{0}=0$ as we did in Algorithm 1 which would have improved the final constant $R$ . However, since the first bound for $\alpha_{k}$ is worse this time, we would need a more complicated initial bound for $\alpha_{0}$ . We decided to keep it simple.

Notation. For brevity, we write $\alpha_{k}\mapsto 1$ to denote that in the $k$ -th iteration $\alpha_{k}$ satisfies the first bound, that is $\alpha_{k}=\sqrt{\frac{2}{3}+\theta_{k-1}}$ . Similarly, for $\alpha_{k}\mapsto 2$ . Also let $L$ be the Lipschitz constant of $\nabla f$ over the set $B(x^{*},R)$ . This means that $L_{k}\leqslant L$ for all $k$ .

Next few statements are not very important for the first reading, as they only concern with a lower bound of $\sum_{i=1}^{k}\alpha_{i}$ . The main statement in Theorem 2 is valid independently of them, so the reader can go directly there.

If $\alpha_{k}\mapsto 2$ , then $\alpha_{k}\geqslant\frac{1}{\sqrt{2}L}$ and $\alpha_{k-1}+\alpha_{k}\geqslant\frac{2}{L}$ .

Note that in this case $\alpha_{k}=\frac{\alpha_{k-1}}{\sqrt{2\alpha_{k-1}^{2}L_{k}^{2}-1}}$ , and hence $\frac{1}{\alpha_{k-1}^{2}}+\frac{1}{\alpha_{k}^{2}}=2L_{k}^{2}$ . This implies that $\alpha_{k}\geqslant\frac{1}{\sqrt{2}L_{k}}\geqslant\frac{1}{\sqrt{2}L}$ . By AM-GM inequality,

and the conclusion $\alpha_{k-1}+\alpha_{k}\geqslant\sqrt{\frac{8}{2L^{2}_{k}}}=\frac{2}{L_{k}}$ follows. ∎

If $\alpha_{0}$ satisfies (16), then $\alpha_{k}\geqslant\frac{1}{\sqrt{3}L}$ for all $k\geqslant 1$ .

We use induction. For $k=1$ , we have either $\alpha_{1}=\sqrt{\frac{2}{3}+\theta_{0}}\alpha_{0}\geqslant\frac{1}{\sqrt{2}L}$ or $\alpha_{1}\mapsto 2$ , which in view of Lemma 5 also implies $\alpha_{1}\geqslant\frac{1}{\sqrt{2}L}$ .

Suppose that $\alpha_{k-1}\geqslant\frac{1}{\sqrt{3}L}$ and we must show that $\alpha_{k}\geqslant\frac{1}{\sqrt{3}L}$ . If $\alpha_{k}\mapsto 2$ , then we are done. Therefore, suppose that $\alpha_{k}\mapsto 1$ . Consider two options for $\alpha_{k-1}$ . If $\alpha_{k-1}\mapsto 1$ , then $\theta_{k-1}\geqslant\sqrt{2/3}$ . Thus, for $\alpha_{k}$ we have that

If $\alpha_{k-1}\mapsto 2$ , then $\alpha_{k-1}\geqslant\frac{1}{\sqrt{2}L}$ and hence

It is clear from above proof that condition $\alpha_{0}\geqslant\frac{1}{\sqrt{2}L_{1}}$ from (16) was used only to give us the basis for induction. Without that condition, one can still show in the same way that $\alpha_{k}\geqslant\min\{\alpha_{0},\frac{1}{\sqrt{3}L}\}$ .

Summing this result from $1$ to $k$ yields $\sum_{i=1}^{k}\alpha_{i}\geqslant\frac{k}{\sqrt{3}L}$ . The stepsize in the previous section is lower bounded by a $\sqrt{1}{\sqrt{2}L}$ , so it is natural to wonder: why is the current section contains a “larger stepsize”? The answer is that while we cannot show that each individual step is larger, we still show in the next theorem that its total length will be lower bounded by the same quantity.

Let $f$ be convex with a locally Lipschitz gradient $\nabla f$ . Then the sequence $(x^{k})$ generated by Algorithm 2 converges to a solution and

where $R$ is defined as in (21). In particular, if $\alpha_{0}$ satisfies (16), then

where $L$ is the Lipschitz constant of $\nabla f$ over $B(x^{*},R)$ .

Of course, the important bound here is (23). The second bound only shows that our choice of stepsizes $\alpha_{k}$ cannot be too bad. The bound in (24) is stronger than the bound $\frac{\sqrt{3}LR^{2}}{2k}$ , which could be obtained as a direct consequence of Lemma 6. The derivation of the sharper bound as in (24) is presented in Section 3.2.

We proceed in the same way as in Lemma 4, but this time we keep all the terms that were discarded earlier. Specifically, summing (3) over all $k$ yields

where the last two bounds follow from the same arguments as in Lemma 4. Note that each factor $(\alpha_{k}(2+3\theta_{k})-3\alpha_{k+1}\theta_{k+1})$ is nonnegative and their sum is

In particular, if $\alpha_{0}$ satisfies (16), then inequality (24) is a direct consequence of Lemma 10, which we prove in the next section.

It remains to prove that $(x^{k})$ converges to a solution. Next arguments will be similar to the ones in [MM20]. We have already proved that $(x^{k})$ is bounded. As $f$ is $L$ -smooth over $B(x^{*},R)$ , we have

Using this sharper bound instead of plain convexity in (2.1) and repeating the same arguments as in Lemma 3, we end up with the same inequality plus the extra term

Now, by telescoping this inequality we infer that $\sum_{i=1}^{k}\frac{\alpha_{i}}{L}\|\nabla f(x^{i})\|^{2}\leqslant R^{2}$ . Since the sequence $(\alpha_{k})$ is separated from (note that this is independent of condition (16) by Remark 4), we conclude that $\nabla f(x^{k})\to 0$ as $k\to\infty$ . Hence, all limit points of $(x^{k})$ are solutions. Applying $3\theta_{k}\alpha_{k}\leqslant(2+3\theta_{k-1})\alpha_{k-1}$ in (3.1) we get

where $b_{k}=\|x^{k}-x^{k-1}\|^{2}+\alpha_{k-1}(2+3\theta_{k-1})(f(x^{k-1})-f_{*})$ . Then the convergence of $(x^{k})$ to a solution follows from the standard Opial-type arguments. ∎

2 Better bounds for the sum of stepsizes

In this section we prove the bound $\sum_{i=1}^{k}\alpha_{i}\geqslant\frac{k}{\sqrt{2}L}$ .

If $\theta_{k}<\frac{1}{3}$ , then $\alpha_{k}\mapsto 2$ and $\alpha_{k-1}L_{k}>\sqrt{5}$ , $\alpha_{k-2}L_{k}\geqslant\frac{3}{2}$ , $\alpha_{k-3}L_{k}\geqslant 1$ .

By definition, $\alpha_{k}\mapsto 1$ means that $\alpha_{k}=\sqrt{\frac{2}{3}+\theta_{k-1}}\alpha_{k-1}$ and thus $\theta_{k}\geqslant\sqrt{\frac{2}{3}}$ . Hence, $\alpha_{k}\mapsto 2$ . Then we have that $\theta_{k}=\frac{1}{\sqrt{2\alpha_{k-1}^{2}L_{k}^{2}-1}}<\frac{1}{3}$ which implies $\alpha_{k-1}L_{k}>\sqrt{5}$ . Since we get a large $\alpha_{k-1}$ , the first bound on stepsizes does not allow previous steps to be much smaller. That is the idea we shall use.

For any $k$ , we have that $\theta_{k}\leqslant\sqrt{\frac{2}{3}+\theta_{k-1}}$ . As $\theta_{0}\leqslant 1$ , it is trivial to prove that $\theta_{k}\leqslant\frac{1+\sqrt{\frac{11}{3}}}{2}\eqqcolon t_{0}$ , which is the root of $t-\sqrt{\frac{2}{3}+t}=0$ . From $\alpha_{k-1}L_{k}>\sqrt{5}$ , it follows that

Hence, to prove $\alpha_{k-2}L_{k}\geqslant\frac{3}{2}$ , it only remains to check that $\frac{\sqrt{5}}{t_{0}}\geqslant\frac{3}{2}$ .

And to prove $\alpha_{k-3}L_{k}\geqslant 1$ , we must check that $\frac{3}{2t_{0}}\geqslant 1$ . ∎

Given the sequence $(\alpha_{k})_{k\geq 1}$ , we call its element $\alpha_{m}$ a breakpoint, if $\theta_{m}<\frac{1}{3}$ and $\alpha_{m}<\frac{1}{L}$ . The next lemma says that a small step can only occur shortly after a breakpoint.

If $\alpha_{k}<\frac{1}{\sqrt{2}L}$ , then exactly one of the following holds

$\alpha_{k-1}<\alpha_{k}$ and $\alpha_{k-2}$ is a breakpoint.

In view of Lemma 5, the statement implies that $\alpha_{k}\mapsto 1$ . Suppose that $\alpha_{k-1}$ is not a breakpoint, since otherwise we are done. This means that either (a) $\alpha_{k-1}\geqslant\frac{1}{L}$ or (b) $\alpha_{k-1}<\frac{1}{L}$ and $\theta_{k-1}\geqslant\frac{1}{3}$ . In the first case we immediately get a contradiction, since $\alpha_{k}=\sqrt{\frac{2}{3}+\theta_{k-1}}\alpha_{k-1}\geqslant\sqrt{\frac{2}{3}}\frac{1}{L}>\frac{1}{\sqrt{2}L}$ . Then if we consider (b), the bound $\theta_{k-1}\geqslant\frac{1}{3}$ implies that $\alpha_{k-1}\leqslant\alpha_{k}<\frac{1}{\sqrt{2}L}$ . Then we can apply the same arguments as above, but to $\alpha_{k-1}$ . This means that either $\alpha_{k-2}$ will be a breakpoint or we will have a chain $\alpha_{k-2}\leqslant\alpha_{k-1}\leqslant\alpha_{k}<\frac{1}{\sqrt{2}L}$ . However, the latter option cannot occur, because using $\theta_{k-1}\geqslant 1$ and $\alpha_{k-1}\geqslant\frac{1}{\sqrt{3}L}$ ensure us that

Although a breakpoint indicates that we are in the region with a small stepsize, Lemma 7 guarantees that previous steps were quite large. The next lemma shows that in total we make a significant progress.

If $\alpha_{m}$ is a breakpoint, then $\sum_{j=-2}^{2}\alpha_{m+j}>\frac{5}{L}$ .

If $\alpha_{m}$ is a breakpoint, then on one hand Lemma 7 implies that $\alpha_{m-1}\geqslant\frac{\sqrt{5}}{L_{m}}$ , $\alpha_{m-2}\geqslant\frac{3}{2L_{m}}$ . On the other hand, we have that $\alpha_{m}\geqslant\frac{1}{\sqrt{2}L}$ , $\alpha_{m+1}\geqslant\frac{1}{\sqrt{3}L}$ , and $\alpha_{m+2}\geqslant\frac{1}{\sqrt{3}L}$ . Combining, we get

If $\alpha_{0}$ satisfies (16), then for any $k\geqslant 1$ we have

Let $\mathcal{M}=\{m\text{ is a breakpoint: }\alpha_{m+1}<\frac{1}{\sqrt{2}L}\}$ . We can split $\sum_{i=1}^{k}\alpha_{i}$ into two terms as

We claim that elements in the “rest” are greater or equal than $\frac{1}{\sqrt{2}L}$ . Indeed, if $\alpha_{i}<\frac{1}{\sqrt{2}L}$ is in the “rest” term, then either $\alpha_{i-1}$ is a breakpoint or $\alpha_{i-1}<\frac{1}{\sqrt{2}L}$ and $\alpha_{i-2}$ is a breakpoint, as Lemma 8 suggests. In either case, $\alpha_{i}$ must be included in the first sum, by the definition of $\mathcal{M}$ .

Now let us estimate both terms. The first sum in (28) is greater than $\frac{5|\mathcal{M}|}{L}>\frac{5|\mathcal{M}|}{\sqrt{2}L}$ , by Lemma 9. The total sum in the “rest” term is not less than $\frac{k-5|\mathcal{M}|}{\sqrt{2}L}$ . Hence, the desired inequality follows. It has to be only noted that if $k-1\in\mathcal{M}$ , we have to additionally consider the sum $\sum_{j=-2}^{1}\alpha_{k-1+j}\geqslant\frac{4}{\sqrt{2}L}$ , for which the bound follows from the same arguments as in Lemma 9. ∎

It is obvious that our analysis was not optimal. For instance, whenever $\alpha_{k}\mapsto 2$ , we could use $\alpha_{k-1}+\alpha_{k}\geqslant\frac{2}{L}$ instead of more conservative $\frac{2}{\sqrt{2}L}$ . Similarly, we got a much better bound for every breakpoint. However, we did not want to overcomplicate an already tedious examination. We leave it as an open question if one can provide a bound closer to $\frac{k}{L}$ (or better?) with a readable proof.

Adaptive proximal gradient method

In this section we turn to a more general problem of a composite optimization problem

As one can notice, the only difference between Algorithm 3 and Algorithm 2 is the presence of the proximal mapping. The analysis, however, is not a straightforward generalization.

Recall that the second bound for the stepsize $\alpha_{k}$ is equivalent to

We can rewrite $x^{k+1}=\operatorname{prox}_{\alpha_{k}g}(x^{k}-\alpha_{k}\nabla f(x^{k}))$ as an implicit equation

where $\widetilde{\nabla}g(x^{k+1})$ is a certain subgradient of $g$ at $x^{k+1}$ , that is $\widetilde{\nabla}g(x^{k+1})\in\partial g(x^{k+1})$ . For this particular subgradient we will also use notation

First, we adapt our basic inequality (2.1) to the more general case. By prox-inequality, we have

Then we set $x=x^{*}$ above and transform it into

This standard inequality is at the heart of the analysis of the proximal gradient method. To complete the full proof, or rather to get the final inequality, the classical analysis only requires applying one convexity inequality and one descent lemma to function $f$ . Our analysis, however, will be different. The main nuisance is that in the $k$ -th iteration the proximal map yields us $g(x^{k+1})-g(x^{*})$ term, while our adaptivity approach works with $f(x^{k})-f(x^{*})$ , as we remember from before. Thus, our first obstacle is to understand how to combine these two terms.

First, we estimate the term $\langle\nabla f(x^{k}),x^{*}-x^{k+1}\rangle$ in the RHS of (33). We have

where in the last inequality we used separately convexity of $f$ and $g$ . Applying this inequality in (33) yields

As we see, the final inequality is very much in the spirit of (2.1).

For iterates $(x^{k})$ with arbitrary stepsizes, it holds

As before, this is just monotonicity of $\nabla f$ in disguise:

The next lemma is special for the composite case. Although it looks like this fact should be known in the literature, we were not able to identify it.

For iterates $(x^{k})$ of the proximal gradient method with arbitrary stepsizes, it holds

This time it is just a monotonicity of $\partial g$ in disguise:

where the last inequality follows from monotonicity of $\partial g$ and (31). ∎

In Section 3 we estimated $\|x^{k+1}-x^{k}\|^{2}=\alpha_{k}^{2}\|\nabla f(x^{k})\|^{2}$ . This time, $\|x^{k+1}-x^{k}\|^{2}$ and $\alpha_{k}^{2}\|\widetilde{\nabla}F(x^{k})\|^{2}$ are different and it is the latter term that matters to us.

For iterates $(x^{k})$ of Algorithm 3 it holds

The main idea of the proof is exactly the same as in Lemma 2. However, the presence of $\widetilde{\nabla}g(x^{k})$ make it slightly more cumbersome. The previous two lemmata are instrumental on our way. We have

For iterates $(x^{k})$ of Algorithm 3 and any solution $x^{*}$ it holds

The sequence $(x^{k})$ is bounded. In particular, for any solution $x^{*}$ of (29) we have $x^{k}\in B(x^{*},R)$ .

The same as in Lemma 4. We use (14) to telescope until $k=1$ and then apply (4) with $k=0$ to bound $\|x^{1}-x^{*}\|^{2}$ . ∎

Let $f$ be convex with a locally Lipschitz gradient $\nabla f$ and $g$ be convex lsc. Then the sequence $(x^{k})$ generated by Algorithm 3 converges to a solution of (29) and

In particular, if $\alpha_{0}$ satisfies (16), then

where $L$ is the Lipschitz constant of $\nabla f$ over $B(x^{*},R)$ .

The proof of inequalities (39) and (40) is almost identical to the one in Theorem 2. The proof of convergence of $(x^{k})$ to a solution is, however, more nuanced. The nontrivial part is to show that all limit points of $(x^{k})$ are solutions. While on the surface, it should be no harder than before, the fact that $\lim_{k\to+\infty}\alpha_{k}$ can be $+\infty$ complicates things a bit.

Let $x^{*}$ be a solution of (29). By $L$ -smoothness of $f$ over $B(x^{*},R)$ , we have

Using this improved bound, similarly as it was done in (3.1), we get

By telescoping this inequality as before, we can now additionally infer that

and thus, $\|\nabla f(x^{k})-\nabla f(x^{*})\|\to 0$ . Specifically, this implies $\nabla f(x^{k})-\nabla f(x^{k-1})\to 0$ as $k\to\infty$ .

We want to prove that all limit points of $(x^{k})$ are solutions. To this end, we will use prox-inequality (32) rewritten as

which in turn, by convexity of $f$ , leads to

The left-hand side has two terms, and the second term evidently tends to as $\nabla f(x^{k+1})-\nabla f(x^{k})\to 0$ . If we can show the same for the first term, it will imply that all limit points of $(x^{k})$ are solutions.

Consider (43) again, but this time we set $x=x^{k}$ . This yields

We manipulate the inequality above as follows

where in the last inequality we applied Cauchy-Schwarz and Young’s inequalities. From this we deduce that

Note that the sequence $\left(\alpha_{k}\|\nabla f(x^{k})-\nabla f(x^{*})\|^{2}\right)$ is summable by (42). Also, the sequence $(\delta_{k})$ is summable, since $(x^{k})$ is bounded and $g(x^{k})$ is lower-bounded: $g(x^{k})\geqslant F_{*}-f(x^{k})>-\infty$ for all $k$ . Hence, $\sum_{k}\frac{1}{\alpha_{k}}\|x^{k+1}-x^{k}\|^{2}<+\infty$ and, thus, $\frac{1}{\alpha_{k}}\|x^{k+1}-x^{k}\|^{2}\to 0$ as $k\to+\infty$ . Given that $(\alpha_{k})$ is separated from zero, it immediately follows that $\frac{1}{\alpha_{k}}\|x^{k+1}-x^{k}\|\to 0$ as well.

Therefore, we have proved that all limit points of $(x^{k})$ are solutions. The proof of convergence of the whole sequence $(x^{k})$ runs as before in Theorem 2. ∎

We didn’t derive a linear convergence of the adaptive proximal gradient, when $F$ is strongly convex. We only mention that it is quite straightforward and goes along the same lines as the original AdGD in [MM20, Theorem 2] in the strongly convex regime.

Literature and discussion

There are many variants of linesearch procedures that go back to celebrated works of Goldstein [Gol62] and Armijo [Arm66]. We discuss an efficient implementation of the latter in detail in the next section. For other variants of linesearch, we refer to [BN16, Sal17].

Adagrad-type methods.

Original Adagrad algorithm was proposed simultaneously in [DHS11] and [MS10]. The method has had a stunning impact on machine learning applications. It has also spawned a stream of various extensions that retain the same idea of using eventually decreasing steps. Because of this, its adaptivity is more prominent in the non-smooth regime, where stepsizes must be diminishing to guarantee convergence. Recent works [DM23, IHC23] have proposed ways to increase $d_{k}$ in the update (2) and [KMJ23] even proved convergence of such method on smooth objectives. However, the stepsize in these method still eventually stops increasing as $d_{k}$ is upper bounded by a constant.

In addition, Adagrad-type methods are usually sensitive to the initialization, as they either degrade in performance when $d_{k}=D$ and $D$ is not chosen carefully, or their convergence rate depends multiplicatively on $\log(\|x^{0}-x^{*}\|/d_{0})$ . In contrast, in our methods, the cost of estimating $\alpha_{0}$ to satisfy condition (16) is additive and its impact vanishes as the total number of iterations increases.

Refined results on GD with a fixed stepsize.

Paper [TV22] summarizes quite well the difficulty of GD analysis with large steps. In it, the authors derive sharp convergence bounds separately for two cases $\alpha L\in(0,1]$ and $\alpha L\in(1,2)$ , and the latter case is considerably harder. In our analysis it is even harder, since the steps can go far beyond the global upper bound $\frac{2}{L}$ . A surprising recent result [Gri23] showcases how actually little is understood in this case.

Small gradient.

The lack-of-descent property makes it hard to deduce the $\mathcal{O}(1/k)$ rate for the last-iterate $\|\nabla f(x^{k})\|$ , which is known for GD with a fixed stepsize. We leave it as an open problem to establish a rate.

Extensions.

Because the analysis of the algorithm is so special, it is not easy to extend it to basic generalizations of GD. However, some works have already build upon it. In [VMC21], the authors consider a convex smooth minimization subject to linear constraints and combined the adaptive GD [MM20] with the Chambolle-Pock algorithm [CP10]. The authors of [Lat+23] went even further and considered a more general composite minimization problem subject to linear constraints, where the same two ideas as before were combined with a novel way of handling the prox mapping.

If we consider variational inequalities settings in the monotone case, then it is not clear how such adaptivity can help, since the most natural extension, the forward-backward method will diverge. On the other hand, an adaptive golden ratio algorithm [Mal19], which is the precursor of a given work, already has all the properties that AdProxGD has.

Experiments

In the experimentshttps://github.com/ymalitsky/AdProxGD we compare our method to the ProxGD with Armijo’s linesearch. We believe it is the best and arguably the most popular alternative to our method. An efficient implementation of Armijo’s linesearch requires two parameters, $s>1$ and $r<1$ . In the $k$ -th iteration, the first iteration of linesearch starts from $\alpha_{k}=s\alpha_{k-1}$ , that is, we want to try a slightly larger step than in the previous iteration. If linesearch does not terminate, we start decreasing a stepsize geometrically with a ratio $r$ . Formally, we are looking for the largest $\alpha_{k}=sr^{i}\alpha_{k-1}$ , for $i=0,1,\dots$ , such that for $x^{k+1}=\operatorname{prox}_{\alpha_{k}g}(x^{k}-\alpha_{k}\nabla f(x^{k}))$ it holds that

It is evident that each iteration of this linesearch requires one evaluation of $f$ and $\operatorname{prox}_{g}$ . However, it is important to highlight that in some cases, the last evaluation of $f(x^{k+1})$ (during linesearch) may not incur any additional costs, as certain expensive operations, such as matrix-vector multiplication, can be reused to compute the next gradient $\nabla f(x^{k+1})$ . Throughout our comparisons, we consistently took these factors into account and reported only essential operations that cannot be further reused.

Our legend will stay the same for all plots:

AdProxGD (1.2, 0.5) (1.5, 0.8) (1.1, 0.5) (1.2, 0.9) (1.1, 0.9) (1.5, 0.5) (1.2, 0.8) (1.1, 0.8) (1.5, 0.9)

where each pair of numbers represent $(s,r)$ for ProxGD with linesearch described above. As we will see, the choice of $(s,r)$ matters a lot.

We consider [BV04, Equation (7.5)], where our goal is to estimate the inverse of a covariance matrix $Y$ subject to eigenvalue bounds. Formally, this problem can be formulated as follows

Low-rank matrix completion

We consider a famous low-rank matrix completion problem in the form

where $\Omega$ is a subset of indices $(i,j)$ and $r$ is the supposed maximum rank. To project onto the spectral ball $\mathcal{C}=\{X\colon\|X\|_{*}\leqslant r\}$ , computing the singular value decomposition (SVD) is required, making it the most computationally expensive operation in this setting.

We created matrix $A$ by multiplying matrices $U$ and $V^{\top}$ , where $U$ and $V$ are $n$ -by- $r$ matrices with entries sampled from a normal distribution. The subset $\Omega$ was randomly chosen as a fraction of $\frac{1}{5}n^{2}$ entries from $[n]\times[n]$ . The obtained results are depicted in Figure 2, where we solely compared the number of computed SVDs.

Minimal length piecewise-linear curve subject to linear constraints.

While applying the proximal gradient method, the most computationally expensive operation is computing the projection onto $\mathcal{C}=\{x\colon Ax=b\}$ . Assuming that $A$ is full rank with $m\leqslant n$ , this projection can be computed as $P_{\mathcal{C}}z=z-A^{\top}(AA^{\top})^{-1}(Az-b)$ .

In comparison, we focused solely on the number of computed projections. We generated a random $m$ -by- $n$ matrix $A$ and random vector $w$ with entries sampled from a normal distribution and set $b=Aw$ .

Nonnegative matrix factorization.

We want to solve the matrix factorization problem subject to nonnegative constraints:

where $A$ is a given $n$ -by- $n$ low-rank matrix. Although nonconvex, this problem is famously well-tackled by first-order methods. In each iteration, the gradient $\nabla f(x)$ involves 3 matrix-matrix multiplications, whereas evaluating the objective $f(x)$ only requires 1. Note that for the last iteration of the linesearch, the computed matrix product can be reused to compute the gradient for the next iteration.

We created matrix $A$ by multiplying matrices $B$ and $C^{\top}$ , where $B$ and $C$ are $n$ -by- $r$ matrices with entries sampled from a normal distribution. Negative entries in both matrices $B$ and $C$ were then set to zero. The results are presented in Figure 4, where the number of gradients roughly means the number of 3 matrix-matrix multiplications.

Dual of the entropy maximization problem.

Consider the entropy maximization problem subject to linear constraints

Conclusion.

Based on our preliminary experiments, it is evident that AdProxGD indeed performs better. To our surprise, a few specific pairs $(r,s)$ consistently outperform the rest among ProxGD with linesearch. We are not aware of any theoretical finding that would confirm this evidence.

Appendix

The function $f$ defined in (8) satisfies the following properties:

$f^{\prime}$ is $L$ -Lipschitz with $L=1$ .

$f$ is locally strongly convex, i.e., for any bounded set $\mathcal{X}$ there exists a constant $\mu_{\mathcal{X}}>0$ such that $|f^{\prime}(x)-f^{\prime}(y)|\geqslant\mu_{\mathcal{X}}|x-y|$ for any $x,y\in\mathcal{X}$ .

First, let us find $f^{\prime}$ and $f^{\prime\prime}$ :

so indeed $a=2$ and $b=2\log 2-\frac{3}{2}$ . Convexity of $f$ follows from the fact that $f^{\prime\prime}(x)>0$ for any $x$ . Lipschitzness of $f^{\prime}$ follows directly from the bound $f^{\prime\prime}(x)\leqslant 1$ for all $x$ . Similarly, local strong convexity follows from the bound $f^{\prime\prime}(x)\geq\frac{1}{\max_{z\in\mathcal{X}}(1+|z|)^{2}}\eqqcolon\mu_{\mathcal{X}}$ for any $x\in\mathcal{X}$ . Finally, the last two properties trivially follow from the expression for $f^{\prime}(x)$ . ∎

Let us choose $x^{0}=r+2$ with a sufficiently large $r>6c$ . This readily implies that

Our goal is to show that the iterates follow a very specific pattern. Namely, we prove that for all $k\geqslant 0$ ,

If this condition holds true, then the sequence $(x^{k})$ must be divergent.

First, observe that if $|x^{k}|,|x^{k-1}|\geqslant r$ and $\operatorname{sign}(x^{k})=\operatorname{sign}(x^{k-1})$ , then the smoothness estimate admits a simple expression:

Therefore, in that case $\alpha_{k}|f^{\prime}(x^{k})|>\frac{r(1+|x^{k}|)}{2c}|f^{\prime}(x^{k})|=\frac{r|x^{k}|}{c}>3|x^{k}|$ . Since $\operatorname{sign}(f^{\prime}(x^{k}))=\operatorname{sign}(x^{k})$ , it implies that $|x^{k+1}|>2|x^{k}|$ and $\operatorname{sign}(x^{k+1})\neq\operatorname{sign}(x^{k})$ .

Next, if $|x^{k}|,|x^{k-1}|\geqslant r>3$ with $|x^{k}|\geqslant 2|x^{k-1}|$ and $\operatorname{sign}(x^{k})\neq\operatorname{sign}(x^{k-1})$ , then we have $\frac{2|x^{k}|}{1+|x^{k}|}>\frac{3}{2}$ and

This implies $\alpha_{k}<\frac{|x^{k}|}{2c}\leqslant\frac{|x^{k}|}{2}$ . Since $\operatorname{sign}(f^{\prime}(x^{k}))=\operatorname{sign}(x^{k})$ and $\frac{\alpha_{k}}{1+|x^{k}|}\leqslant\frac{|x^{k}|}{2(1+|x^{k}|)}<\frac{1}{2}$ , we conclude that $\operatorname{sign}(x^{k+1})=\operatorname{sign}(x^{k})$ and

As $x^{0}$ and $x^{1}$ satisfy the first case, by induction we deduce that all iterates $(x^{k})$ follow the described pattern. ∎

The authors would like to thank Puya Latafat, who found a subtle error in the convergence proof of $(x^{k})$ in Theorem 3 in the first version of this manuscript.