Efficient Methods for Structured Nonconvex-Nonconcave Min-Max Optimization

Jelena Diakonikolas, Constantinos Daskalakis, Michael I. Jordan

Introduction

Min-max optimization and min-max duality theory lie at the foundations of game theory and mathematical programming, and have found far-reaching applications across a range of disciplines, including complexity theory, statistics, control theory, and online learning theory. Most recently, min-max optimization has played an important role in machine learning, notably in the adversarial training of deep neural network classifiers and the training of generative deep neural network models. These recent applications have heightened the importance of solving min-max optimization problems with nonconvex-nonconcave objectives, taking the following general form:

where $\mathbf{x}$ and $\mathbf{y}$ are real-valued vectors and $f$ is not (necessarily) convex in $\mathbf{x}$ for all $\mathbf{y}$ and/or not (necessarily) concave in $\mathbf{y}$ for all $\mathbf{x}$ . There may also be constraints on $\mathbf{x}$ and $\mathbf{y}$ , and in many applications $\mathbf{x}$ and $\mathbf{y}$ are high-dimensional vectors.

When the objective function is not convex-concave, von Neumann’s celebrated min-max theorem fails to apply, and so do most standard optimization methods for solving (1.1). This has motivated several lines of investigation, which include extensions of the min-max theorem beyond convex-concave objectives (e.g. Sion’s theorem for quasiconvex-quasiconcave objectives), and the pursuit of computational procedures that target solutions to (1.1) even in the absence of a min-max theorem; see Section 1.1 for a review of recent work. Of course, without strong assumptions on $f$ , (1.1) is an intractable problem, at least as intractable as general nonconvex optimization. Thus, the literature has targeted locally optimal solutions, in the same spirit as the targeting of local optima in non-convex optimization. Naturally, there are various notions of local optimality that have been studied in the literature. Our focus here will be on the simplest such notion, namely first-order local optimality, for which, despite the apparent simplicity, many challenges arise Daskalakis and Panageas (2018), Mazumdar et al. (2020).

In contrast to classical optimization problems, where useful results can be obtained with very mild assumptions on the objective function, in min-max optimization it is necessary to impose non-trivial assumptions on $f$ , even when the goal is only to compute locally optimal solutions. Indeed, Daskalakis et al. (2021) establish intractability results in the constrained setting of the problem, wherein first-order locally optimal solutions are guaranteed to exist whenever the objective is smooth. Moreover, they show that even the computation of approximate solutions is PPAD-complete and, if the objective function is accessible through value-queries and gradient-queries, exponentially many such queries are necessary (in particular, exponential in at least one of the following: the inverse approximation parameter, the smoothness constant of $f$ , or the diameter of the constraint set).

We expect similar intractability results to hold in the unconstrained case, which is the case considered in this paper, even when restricting to smooth objectives that have a non-empty set of optimal solutions.Note that these are stationary points of $f$ in this case. Indeed, fixed-point complexity-based intractability results for the constrained case are typically extendable to the unconstrained case, by embedding the hard instances within an unbounded domain.

Given the aforedescribed intractability results, our goal is to identify structural properties that make it possible to solve min-max optimization problems with smooth objectives. Focusing on the unconstrained setting of (1.1), we view it as a special case (obtained by considering the operator $F([\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil$ \m@th\scriptstyle# $&$ \m@th\scriptstyle{}# $\hfil\cr\mathbf{x}\\ \mathbf{y}\crcr}}])=\big{[}\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil$ \m@th\scriptstyle# $&$ \m@th\scriptstyle{}# $\hfil\cr\nabla_{\mathbf{x}}f(\mathbf{x},\mathbf{y})\\ -\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y})\crcr}}\big{]}$ ) of the unconstrained variational inequality problem (svi), and consider instead this more general problem. We identify conditions for $F$ under which a generalized version of the extragradient method of Korpelevich (1976), which we propose, converges to a solution of (svi) (or, in the special case of (1.1), to a stationary point of $f$ ) at a rate of $1/\sqrt{k}$ in the number of iterations $k$ . Our condition, presented as Assumption 1, postulates that there exists a solution to (svi) that only violates the stronger (mvi) requirement in a controlled manner that we delineate. Our generalized extragradient method is based on an aggressive interpolation step, as specified by (eg $+$ ), and our main convergence result is Theorem 3.2. We additionally show, in Theorems 4.2 and 4.5, that the algorithm converges in non-Euclidean settings, under the stronger condition that an (mvi) solution exists, or when we only have stochastic oracle access to $F$ (or, in the special case of (1.1), to the gradient of $f$ ).

The condition on $F$ under which our main result applies is weaker than the assumption that a solution to (mvi) exists Zhou et al. (2017), Mertikopoulos et al. (2019), Malitsky (2019), Song et al. (2020), an assumption which is already satisfied by several interesting families of min-max objectives, including quasiconvex-concave families or starconvex-concave families. Our significantly weaker condition applies in particular to (min-max objectives $f$ with corresponding) operators $F$ that are negatively comonotone Bauschke et al. (2020) or positively cohypomonotone Combettes and Pennanen (2004). These conditions have been studied in the literature for at least a couple of decades, but only asymptotic convergence results were available prior to our work for computing solutions to (svi). In contrast, our rates are asymptotically identical to the rates that we would get under the stronger assumption that a solution to (mvi) exists, and sidestep the intractability results for (1.1) suggested by Daskalakis et al. (2021) for general smooth objectives.

1 Further Related Work

A large number of recent works target identifying practical first-order, low-order, or efficient online learning methods for solving min-max optimization problems in a variety of settings, ranging from the well-behaved setting of convex-concave objectives to the challenging setting of nonconvex-nonconcave objectives. There has been substantial work for convex-concave and nonconvex-concave objectives, targeting the computation of min-max solutions to (1.1) or, respectively, stationary points of $f$ or $\Phi(\mathbf{x}):=\max_{\mathbf{y}}f(\mathbf{x},\mathbf{y})$ . This work has focused on attaining improved convergence rates Kong and Monteiro (2019), Lin et al. (2020b), Thekumparampil et al. (2019), Nouiehed et al. (2019), Lu et al. (2020), Zhao (2019), Alkousa et al. (2019), Azizian et al. (2020), Golowich et al. (2020), Lin et al. (2020a), Diakonikolas (2020) and/or obtaining last-iterate convergence guarantees Daskalakis et al. (2018), Daskalakis and Panageas (2018), Mazumdar et al. (2020), Mertikopoulos et al. (2018), Lin et al. (2018), Hamedani and Aybat (2018), Adolphs et al. (2019), Daskalakis and Panageas (2019), Liang and Stokes (2019), Gidel et al. (2019), Mokhtari et al. (2020), Abernethy et al. (2019), Liu et al. (2020).

In the nonconvex-nonconcave setting, research has focused on identifying different notions of local min-max solutions Daskalakis and Panageas (2018), Mazumdar et al. (2020), Jin et al. (2020), Mangoubi and Vishnoi (2021) and studying the existence and (local) convergence properties of learning methods to these points Wang et al. (2019), Mangoubi et al. (2020), Mangoubi and Vishnoi (2021). As already discussed, recent work of Daskalakis et al. (2021) shows that, for general smooth objectives, the computation of even approximate first-order locally optimal min-max solutions is intractable, motivating the identification of structural assumptions on the objective function for which these intractability barriers can be bypassed.

An example such assumption, which is closely related to the one made in this work, is that an (mvi) solution exists for the operator $F([\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil$ \m@th\scriptstyle# $&$ \m@th\scriptstyle{}# $\hfil\cr\mathbf{x}\\ \mathbf{y}\crcr}}])=\big{[}\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil$ \m@th\scriptstyle# $&$ \m@th\scriptstyle{}# $\hfil\cr\nabla_{\mathbf{x}}f(\mathbf{x},\mathbf{y})\\ -\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y})\crcr}}\big{]}$ , as studied by Zhou et al. (2017), Lin et al. (2018), Mertikopoulos et al. (2019), Malitsky (2019), Liu et al. (2020), Song et al. (2020). As we have already discussed, the assumption we make for our main result in this work is weaker. Table 1 provides a comparison of our results to those of existing works, considering the deterministic setting (i.e. having exact value access to $F$ ).

In unconstrained Euclidean setups, the best known convergence rates are of the order $1/\sqrt{k}$ Dang and Lan (2015), Song et al. (2020), under the assumption that an (mvi) solution exists. We obtain the same rate under our weaker Assumption 1. Moreover, under our weaker assumption, we show that the accumulation points of the sequence of iterates of our algorithm are (svi) solutions. This was previously established for alternative algorithms and under the stronger assumption that an (mvi) solution exists Mertikopoulos et al. (2019), Malitsky (2019).

Finally, it is worth noting that Zhou et al. (2017), Mertikopoulos et al. (2019), Malitsky (2019), Song et al. (2020) consider constrained optimization setups, which are not considered in our work. We believe that generalizing our results to constrained setups is possible, and defer such generalizations to future work.

Notation and Preliminaries

We are interested in finding stationary points for min-max problems of the form:

The Minty Variational Inequality problem consists in finding $\mathbf{u}^{*}$ such that:

in which case $\mathbf{u}^{*}$ is referred to as the weak solution to the variational inequality corresponding to $F$ and $\mathcal{U}$ . If we assume that $F$ is monotone, then (2.1) implies that every solution to (svi) is also a solution to (mvi), and the two solution sets are equivalent. More generally, if $F$ is not monotone, all that can be said is that the set of (mvi) solutions is a subset of the set of (svi) solutions. In particular, (mvi) solutions may not exist even when (svi) solutions exist. These facts follow from Minty’s theorem (see, e.g., (Kinderlehrer and Stampacchia, 2000, Chapter 3)).

We will not, in general, be assuming that $F$ is monotone. Note that the Lipschitzness of $F$ on its own is not sufficient to guarantee that the problem is computationally tractable, as discussed in the introduction. Thus, additional structure is needed, which we introduce in the following.

We define the class of problems with weak (mvi) solutions as the class of problems in which $F$ satisfies the following assumption.

There exists $\mathbf{u}^{*}\in\mathcal{U}^{*}$ such that:

for some parameter $\rho\in\big{[}0,\frac{1}{4L}\big{)}$ .

2 Example Settings Satisfying Assumption 1

The class of problems that have weak (mvi) solutions in the sense of Assumption 1 generalizes other structured non-monotone variational inequality problems, as we discuss in this section.

When $\rho=0,$ we recover the class of problems that have an (mvi) solution. This class contains all unconstrained variationally coherent problems studied in, e.g., Zhou et al. (2017), Mertikopoulos et al. (2019), which encompass all min-max problems with objectives that are bilinear, pseudo-convex-concave, quasi-convex-concave, and star-convex-concave.

When $\rho>0$ and $p=2$ , Assumption 1 is implied by $F$ being $-\frac{\rho}{2}$ -comonotone Bauschke et al. (2020) or $\frac{\rho}{2}$ -cohypomonotone Combettes and Pennanen (2004), defined as follows:

In particular, Assumption 1 is equivalent to requiring that (2.2) be satisfied for general $\mathbf{u}$ and $\mathbf{v}=\mathbf{u}^{*}$ , where $\mathbf{u}^{*}$ is a solution to (svi) (in which case $F(\mathbf{u}^{*})=\textbf{0}$ ). Note that Assumption 1 does not imply that a solution to (mvi) exists, unless $\rho=0$ . It is further important to note that cohypomonotone operators arise as inverses of operators that only need to be Lipschitz-continuous. (In fact, even a weaker property suffices; see Bauschke et al. (2020).) This is particularly interesting as, combined with our main result, it implies that we can efficiently find zeros of inverses of Lipschitz-continuous operators, as long as those inverses are sufficiently Lipschitz, even though finding zeros of Lipschitz-continuous operators is computationally intractable, in general, as we have discussed.

It is interesting to note that Assumption 1 does not imply that, in the min-max setting, $f$ is convex-concave (or, more generally, that $F$ is monotone), even in a neighborhood of an (svi) solution $\mathbf{u}^{*}=\big{[}\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil$ \m@th\scriptstyle# $&$ \m@th\scriptstyle{}# $\hfil\cr\mathbf{x}^{*}\\ \mathbf{y}^{*}\crcr}}\big{]}$ , i.e., a stationary point of $f$ . To see this, fix $\mathbf{y}=\mathbf{y}^{*}$ and consider $f(\mathbf{x},\mathbf{y}^{*})$ for $\mathbf{x}$ in a small neighborhood of $\mathbf{x}^{*}.$ Using the fact that a continuously-differentiable function is well-approximated by its linear approximation within small neighborhoods, all that we are able to deduce from Assumption 1 is that

In particular, Assumption 1 does not preclude that $f(\mathbf{x}^{*},\mathbf{y}^{*})$ is larger than $f(\mathbf{x},\mathbf{y}^{*})$ ; it only bounds how much larger it can be by a quantity proportional to $\|\nabla f(\mathbf{x},\mathbf{y}^{*})\|_{p^{*}}^{2}$ . Compare this also to the Polyak-Łojasiewicz condition (see, e.g., Nouiehed et al. (2019), Yang et al. (2020)), which imposes the opposite inequality, namely, that $f(\mathbf{x},\mathbf{y}^{*})-f(\mathbf{x}^{*},\mathbf{y}^{*})$ is bounded above by a multiple of $\|\nabla f(\mathbf{x},\mathbf{y}^{*})\|_{p^{*}}^{2}$ .

One way that a generic operator $F$ may satisfy Assumption 1 is when there is a constant $\lambda>0$ such that for some $\mathbf{u}^{*}\in\mathcal{U}^{*}$ we have

and when the operator $F$ does not plateau or become too close to a linear operator around $\mathbf{u}^{*};$ namely, $\|F(\mathbf{u})-F(\mathbf{u}^{*})\|_{p^{*}}\geq\mu\|\mathbf{u}-\mathbf{u}^{*}\|_{p}$ . (Note that (2.3) is always satisfied with $\lambda=2L$ for $L$ -Lipschitz operators, but we may need $\lambda$ to be smaller than $2L$ ). Then Assumption 1 would be satisfied with $\rho=\frac{\lambda}{\mu}.$ For a min-max problem, assuming $f$ is twice differentiable, this would mean that the lowest eigenvalue of the symmetric part of the Jacobian of $\big{[}\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil$ \m@th\scriptstyle# $&$ \m@th\scriptstyle{}# $\hfil\cr\nabla_{\mathbf{x}}f(\mathbf{x},\mathbf{y})\\ -\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y})\crcr}}\big{]}$ is bounded below by $-\lambda/2$ in any direction $\mathbf{u}-\mathbf{u}^{*}$ and the function $f$ is sufficiently “curved” (not close to a linear or a constant function) around $\mathbf{u}^{*}=\big{[}\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil$ \m@th\scriptstyle# $&$ \m@th\scriptstyle{}# $\hfil\cr\mathbf{x}^{*}\\ \mathbf{y}^{*}\crcr}}\big{]}.$

Finally, we discuss a concrete min-max application wherein there are no (mvi) solutions, but there do exist (svi) solutions satisfying the weak (mvi) condition of Assumption 1. This application arises in the context of two-agent zero-sum reinforcement learning problems studied by many authors, including recently by Daskalakis et al. (2020). In Section 5.1 of that work, the authors consider a special case of the general two-agent zero-sum RL problem, called von Neumann’s ratio game, for which they observe that, even on a random example, the (mvi) solution set is empty, yet the extragradient method still converges in practice (albeit at a slower rate). Interestingly, it is easy to construct examples of the von Neumann ratio game for which no (mvi) solution exists, but the weak (mvi) condition of Assumption 1 does hold, and yet the stronger cohypomonotonicity condition of (2.2) does not hold. Indeed, one such example is obtained for the game shown in Proposition 2 of their paper, setting $s=1/2$ and $\epsilon=.49$ . Here (mvi) fails, the weak (mvi) condition of Assumption 1 is satisfied, and cohypomonotonicity fails to hold, e.g., for $\mathbf{u}=(\mathbf{x},\mathbf{y})=(0.1,0.3)$ and $\mathbf{v}=(\mathbf{x}’,\mathbf{y}’)=(0.8,0.3)$ . To be clear, the von Neumann ratio game gives rise to a constrained min-max problem while our algorithm targets the unconstrained setting. While extending our result to the constrained setting remains open, our example here demonstrates that there is value in further studying the weak (mvi) condition of Assumption 1 in the constrained setting as well.

3 Useful Definitions and Facts

Observe that when $p=2,$ we recover the standard definition of strong convexity. Thus, uniform convexity is a generalization of strong convexity.

It is immediate that the Bregman divergence of a convex function is non-negative.

We now outline some useful auxiliary results used specifically in Section 4, where we study the case that $p$ is not necessarily equal to 2.

Then, for $p^{*}=\frac{p}{p-1},$ $q^{*}=\frac{q}{q-1}$ :

The statements in the proposition are simple corollaries of conjugacy of the functions $\psi(\mathbf{u})=\frac{1}{q}\|\mathbf{u}\|_{p}^{q}$ and $\psi^{*}(\mathbf{z})=\frac{1}{q^{*}}\|\mathbf{z}\|_{p^{*}}^{q^{*}}$ . In particular, the first part follows from

by the definition of a convex conjugate and using that $\frac{1}{q}\|\mathbf{u}\|_{p}^{q}$ and $\frac{1}{q^{*}}\|\mathbf{z}\|_{p^{*}}^{q^{*}}$ are conjugates of each other, which are standard exercises in convex analysis for $q\in\{p,2\}$ (see, e.g., (Borwein and Zhu, 2004, Exercise 4.4.2) and (Boyd et al., 2004, Example 3.27)).

Another useful result is the following proposition, which will allow us to relate Lipschitzness of $F$ to uniform convexity of the prox mapping $\frac{1}{q}\|\cdot\|_{p}^{q}$ in the definition of the algorithm. The ideas used in the proof can be found in the proofs of (d’Aspremont et al., 2018, Lemma 5.7), (Nesterov, 2015, Lemma 2), and in (Devolder et al., 2014, Section 2.3). The proof is provided for completeness.

For any $L>0$ , $\kappa>0$ , $q\geq\kappa,$ $t\geq 0$ , and $\delta>0$ ,

The proof is based on the Fenchel-Young inequality and the conjugacy of functions $\frac{|x|^{r}}{r}$ and $\frac{|y|^{s}}{s}$ for $r,s\geq 1$ , $\frac{1}{r}+\frac{1}{s}=1,$ which implies $xy\leq\frac{x^{r}}{r}+\frac{y^{s}}{s},$ $\forall x,y\geq 0.$ In particular, setting $r=q/\kappa,$ $s=q/(q-\kappa)$ , and $x=t^{\kappa},$ we have

It remains to set $\frac{\delta}{2}=\frac{L(q-\kappa)}{q\kappa}y^{\frac{\kappa}{q-\kappa}},$ which, solving for $y,$ gives $y=\big{(}\frac{\delta q\kappa}{2L(q-\kappa)}\big{)}^{q-\kappa}$ , and verify that, under this choice, $\Lambda=\frac{Lt^{q}}{qy}.$ ∎

Generalized Extragradient for Problems with Weak MVI Solutions

In this section, we consider the setup with the Euclidean norm $\|\cdot\|=\|\cdot\|_{2}$ , i.e., $p=2.$ To address the class of problems with weak (mvi) solutions (see Assumption 1), we introduce the following generalization of the extragradient algorithm, to which we refer as Extragradient+ (eg $+$ ).

where $\beta\in(0,1]$ is a parameter of the algorithm and $a_{k}>0$ is the step size. When $\beta=1,$ we recover standard eg.

The analysis relies on the following merit (or gap) function:

for some $\mathbf{u}^{*}\in\mathcal{U}^{*}$ for which $F$ satisfies Assumption 1. Then Assumption 1 implies that $h_{k}\geq 0,$ $\forall k.$

The first (and main) step is to bound all $h_{k}$ ’s above, as in the following lemma.

where $h_{k}$ is defined as in Eq. (3.1).

Fix any $k\geq 0$ and write $h_{k}$ equivalently as

The proof proceeds by bounding from above individual terms on the right-hand side of Eq. (3.3). For the first term, the first-order optimality in the definition of $\mathbf{u}_{k+1}$ gives:

For the second term on the right-hand side of Eq. (3.3), the first-order optimality in the definition of $\bar{\mathbf{u}}_{k}$ implies:

which, similarly as for the first term, leads to:

For the third term on the right-hand side of Eq. (3.3), applying the Cauchy-Schwarz inequality, $L$ -Lipschitzness of $F,$ and Young’s inequality, respectively, we have:

where the last inequality holds for any $\gamma>0.$

Using the fact that $\bar{\mathbf{u}}_{k}-\mathbf{u}_{k}=-\frac{a_{k}}{\beta}F(\mathbf{u}_{k})$ , $\mathbf{u}_{k+1}-\mathbf{u}_{k}=-a_{k}F(\bar{\mathbf{u}}_{k})$ and combining Eqs. (3.4)-(3.6) with Eq. (3.3), we have:

Using Lemma 3.1, we can now draw conclusions about the convergence of eg $+$ by choosing parameters $\beta,\gamma$ and the step sizes $a_{k}$ to guarantee that $h_{k}<\frac{1}{2}\|\mathbf{u}^{*}-\mathbf{u}_{k}\|^{2}-\frac{1}{2}\|\mathbf{u}^{*}-\mathbf{u}_{k+1}\|^{2}$ as long as $\|F(\bar{\mathbf{u}}_{k})\|\neq 0$ .

all accumulation points of $\{\bar{\mathbf{u}}_{k}\}_{k\geq 0}$ are in $\mathcal{U}^{*}$ .

Applying Lemma 3.1 with the choice of $a_{k}$ and $\beta$ from the theorem statement and with $\gamma=1,$ we get

By Assumption 1, $\rho<\frac{1}{4L},$ and, thus, the constant multiplying $\|F(\bar{\mathbf{u}}_{k})\|^{2}$ is strictly negative.

As $h_{k}\geq 0$ (by Assumption 1), we can conclude that

As $\frac{1}{4L}\Big{(}\frac{1}{4L}-\rho\Big{)}>0,$ Eq. (3.7) implies that $\|F(\bar{\mathbf{u}}_{k})\|$ converges to zero as $k\to\infty.$ Further, as $\bar{\mathbf{u}}_{k}-\mathbf{u}_{k}=-\frac{a_{k}}{\beta}F(\mathbf{u}_{k}),$ using triangle inequality and $F(\mathbf{u}^{*})=\textbf{0}:$

where we have used that $F$ is $L$ -Lipschitz. Now, as $\|\mathbf{u}_{k}-\mathbf{u}^{*}\|$ is bounded (by $\|\mathbf{u}_{0}-\mathbf{u}^{*}\|,$ from Eq. (3.7)), it follows that the sequence $\{\bar{\mathbf{u}}_{k}\}$ is bounded as well, and thus has a converging subsequence. Let $\{\bar{\mathbf{u}}_{k_{i}}\}$ be any converging subsequence of $\{\bar{\mathbf{u}}_{k}\}$ and let $\bar{\mathbf{u}}^{*}$ be its corresponding accumulation point. Then, as $\|F(\bar{\mathbf{u}}_{k})\|$ converges to zero as $k\to\infty,$ it follows that $\|F(\bar{\mathbf{u}}_{k_{i}})\|$ converges to zero as $i\to\infty,$ and so it must be $\bar{\mathbf{u}}^{*}\in\mathcal{U}^{*}.$

For Part (ii), telescoping Eq. (3.7), we get:

and $\frac{1}{k+1}\sum_{i=0}^{k}\|F(\bar{\mathbf{u}}_{i})\|^{2}\geq\min_{0\leq i\leq k}\|F(\bar{\mathbf{u}}_{i})\|^{2}$ . ∎

Due to Eq. (3.8), we have that all the iterates of eg $+$ with the parameter setting as in Theorem 3.2 remain in the ball centered at $\mathbf{u}^{*}$ and of radius at most $2\|\mathbf{u}_{0}-\mathbf{u}^{*}\|$ . Thus, Assumption 1 does not need to hold globally for the result to apply; it suffices that it only applies locally to points from the ball around $\mathbf{u}^{*}$ with radius $2\|\mathbf{u}_{0}-\mathbf{u}^{*}\|$ .

It is possible to obtain similar convergence results as those of Theorem 3.2 under different parameter choices. In particular, for $\gamma\in(0,1],$ it suffices that $a_{k}\leq\frac{\beta\gamma}{L}$ and $\rho<a_{k}(1-\beta).$ We settled on the choice made in Theorem 3.2 as it is simple and requires tuning only one parameter, $L$ .

where $\mathcal{F}_{k}$ and $\bar{\mathcal{F}}_{k}$ denote the natural filtrations, including all the randomness up to the construction of points $\mathbf{u}_{k}$ and $\bar{\mathbf{u}}_{k},$ respectively, and ${\bar{\sigma}_{k}}^{2},{\sigma_{k+1}}^{2}$ are the variance constants. Observe that $\mathcal{F}_{k}\subseteq\bar{\mathcal{F}}_{k}$ and $\bar{\mathcal{F}}_{k}\subseteq\mathcal{F}_{k+1}.$ To simplify the notation, we denote:

The variant of the method we consider here is stated as follows:

This is immediate for $p>2,$ by the definition of $\phi_{p}$ and using that $q=p$ and $m_{p}=1$ when $p>2.$ For $p\in(1,2],$ we have that $q=2,$ and Eq. (4.6) follows by strong convexity of $\frac{1}{2}\|\cdot\|_{p}^{2}.$

As in the case of Euclidean norms, the analysis relies on the following merit function:

Moreover, as before, Assumption 1 guarantees that $h_{k}\geq 0,$ $\forall k.$

We start by first proving a lemma that holds for generic choices of algorithm parameters $a_{k}$ and $\beta.$ We will then use this lemma to deduce the convergence bounds for different choices of $p>1$ and both deterministic and the stochastic oracle access to $F.$

where $h_{k}$ is defined as in Eq. (4.7), $\delta_{k}$ is any positive number, and $\Lambda_{k}=\Big{(}\frac{q-2}{\delta_{k}q}\Big{)}^{\frac{q-2}{2}}L^{q/2}$ . When $q=2,$ the statement also holds with $\delta_{k}=0$ and $\Lambda_{k}=L.$

We begin the proof by writing $h_{k}$ equivalently as:

The proof now proceeds by bounding individual terms on the right-hand side of the last equality.

Equivalently, applying the definition of $M_{k+1}(\cdot)$ to the last inequality:

where the last inequality follows from Eq. (4.6).

where the inequality is by $\nabla\bar{M}_{k}(\bar{\mathbf{u}}_{k})=\textbf{0}$ and the fact that $\frac{1}{q}\|\cdot\|_{p}^{q}$ is $q$ -uniformly convex w.r.t. $\|\cdot\|_{p}$ with constant $m_{p}$ , by the choice of $q$ from Eq. (4.3). Applying the definition of $\bar{M}_{k}(\mathbf{u})$ to the last inequality:

where $(i)$ is by Hölder’s inequality, $(ii)$ is by $L$ -Lipschitzness of $F,$ and $(iii)$ is by Young’s inequality, which holds for any $\gamma>0.$ Now, let $\delta_{k}>0$ and $\Lambda_{k}=\Big{(}\frac{2(q-\kappa)}{\delta_{k}q\kappa}\Big{)}^{\frac{q-\kappa}{\kappa}}L^{q/\kappa}.$ Then, applying Proposition 2.6 to the last two terms in the last inequality:

Observe that when $q=2,$ there is no need to apply Proposition 2.6, and the last inequality is satisfied with $\delta_{k}=0$ and $\Lambda_{k}=L.$

Combining Eqs. (4.9)-(4.11) with Eq. (4.8), we have:

The main result is summarized in the following theorem.

Let $p\in(1,2]$ . If $\beta=m_{p}=p-1,$ $a_{k}=\frac{{m_{p}}^{3/2}}{2L}$ , then all accumulation points of $\{\mathbf{u}_{k}\}_{k\geq 0}$ are in $\mathcal{U}^{*},$ and, furthermore $\forall k\geq 0$ :

In particular, within $k=O\big{(}\frac{L^{2}\|\mathbf{u}^{*}-\mathbf{u}_{0}\|_{p}^{2}}{{(p-1)}^{2}\epsilon^{2}}\big{)}$ iterations eg ${}_{p}+$ can output a point $\mathbf{u}$ with $\|F(\mathbf{u})\|_{p^{*}}\leq\epsilon.$

Let $p\in(2,\infty)$ . If $\beta=\frac{1}{2},$ $\delta_{k}=\delta>0$ , $\Lambda=\big{(}\frac{q-2}{\delta q}\big{)}^{\frac{q-2}{2}}L^{q/2}$ , and $a_{k}=\frac{1}{2\Lambda}=a$ , then, $\forall k\geq 0$ :

In particular, for any $\epsilon>0,$ there is a choice of $\delta=\frac{\epsilon^{2}}{C_{p}L},$ where $C_{p}$ is a constant that only depends on $p$ , such that eg ${}_{p}+$ can output a point $\mathbf{u}$ with $\|F(\mathbf{u})\|_{p^{*}}\leq\epsilon$ in at most

iterations. Here, the $O_{p}$ notation hides constants that only depend on $p.$

Observe that, as $\bar{\bm{\eta}}_{i}=\bm{\eta}_{i}=\textbf{0},$ $\forall i\geq 0$ and $\rho=0,$ Lemma 4.1 and the definition of $h_{k}$ give:

Proof of Part (i). In this case, we can set $\delta_{k}=0$ (see Lemma 4.1), $\Lambda_{k}=L,$ and $q=2.$ Therefore, setting $\beta=m_{p},$ $a_{k}=\frac{{m_{p}}^{3/2}}{2L}$ , and $\gamma=\frac{1}{\sqrt{m_{p}}}$ we get from Eq. (4.12) that

It follows that $\|\bar{\mathbf{u}}_{k}-\mathbf{u}_{k}\|_{p}^{2}$ converges to zero as $k\to\infty.$ By the definition of $\bar{\mathbf{u}}_{k}$ and Proposition 2.5, $\frac{1}{2}\|\bar{\mathbf{u}}_{k}-\mathbf{u}_{k}\|_{p}^{2}=\frac{{a_{k}}^{2}}{2\beta^{2}}\|F(\mathbf{u}_{k})\|_{p^{*}}^{2},$ and so $\|F(\mathbf{u}_{k})\|_{p^{*}}$ converges to zero as $k\to\infty.$ Further, as $\phi_{p}(\mathbf{u}^{*},\mathbf{u}_{k})\leq\phi_{p}(\mathbf{u}^{*},\mathbf{u}_{0})<\infty$ and $\phi_{p}(\mathbf{u}^{*},\mathbf{u}_{k})\geq\frac{m_{p}}{2}\|\mathbf{u}^{*}-\mathbf{u}_{k}\|_{p}^{2},$ $m_{p}>0,$ it follows that $\|\mathbf{u}^{*}-\mathbf{u}_{k}\|_{p}$ is bounded, and, thus, $\{\mathbf{u}_{k}\}_{k\geq 0}$ is a bounded sequence. The proof that all accumulation points of $\{\mathbf{u}_{k}\}_{k\geq 0}$ are in $\mathcal{U}^{*}$ is standard and omitted (see the proof of Theorem 3.2 for a similar argument).

To bound $\frac{1}{k+1}\sum_{i=0}^{k}\|F(\mathbf{u}_{i})\|_{p^{*}}^{2}$ , we telescope the inequality from Eq. (4.13) to get:

Proof of Part (ii). In this case, $q=p$ , $\phi_{p}(\mathbf{u},\mathbf{v})=\frac{1}{p}\|\mathbf{u}-\mathbf{v}\|_{p}^{p},$ and $m_{p}=1.$ Using Proposition 2.5, $\|\mathbf{u}_{k}-\bar{\mathbf{u}}_{k}\|_{p}^{p}=\frac{{a_{k}}^{p^{*}}}{\beta^{p^{*}}}\|F(\mathbf{u}_{k})\|_{p^{*}}^{p^{*}}$ and $\|\mathbf{u}_{k+1}-\mathbf{u}_{k}\|_{p}^{p}={a_{k}}^{p^{*}}\|F(\bar{\mathbf{u}}_{k})\|_{p^{*}}^{p^{*}}.$ Combining with Eq. (4.12), we have:

Now let $\gamma=1,$ $\beta=\frac{1}{2},$ $\delta_{k}=\delta>0$ , and $a_{k}=\frac{1}{2\Lambda_{k}}=\frac{1}{2\Lambda}=a$ . Then $a_{k}\Lambda_{k}\gamma-\beta=a_{k}\Lambda_{k}/\gamma-\beta=0$ and Eq. (4.14) simplifies to:

Telescoping the last inequality and then dividing by $\frac{{a_{k}}^{p^{*}}(k+1)}{2p},$ we have:

Now, for eg ${}_{p}+$ to be able to output a point $\mathbf{u}$ with $\|F(\mathbf{u})\|_{p^{*}}\leq\epsilon,$ it suffices to show that for some choice of $\delta$ and $k$ we can make the right-hand side of Eq. (4.15) at most $\epsilon^{p^{*}}$ . This is true because then eg ${}_{p}+$ can output the point $\bar{\mathbf{u}}_{i}=\operatorname*{argmin}_{0\leq i\leq k}\|F(\bar{\mathbf{u}}_{i})\|_{p^{*}}.$ For stochastic setups, the guarantee would be in expectation, and eg ${}_{p}+$ could output a point $\bar{\mathbf{u}}_{i}$ with $i$ chosen uniformly at random from $\{0,\dots,k\}$ , as discussed in the proof of Theorem 3.2.

Observe first that, as $\Lambda=\big{(}\frac{p-2}{p\delta}\big{)}^{\frac{p-2}{2}}L^{p/2}$ and $p^{*}=\frac{p}{p-1},$ we have that:

Setting $\frac{2p\delta}{{a}^{p^{*}-1}}\leq\frac{\epsilon^{p^{*}}}{2}$ , recalling that $p^{*}=\frac{p}{p-1}$ , and rearranging, we have:

It can be verified numerically that $(\frac{p-2}{p})^{\frac{p-2}{p}}$ is a constant between $\frac{1}{e}$ and $1,$ while it is clear that $2^{\frac{2(2p-1)}{p}}p^{\frac{2(p-1)}{p}}=O(p^{2})$ is a constant that only depends on $p$ . Hence, it suffices to set $\delta=\frac{\epsilon^{2}}{C_{p}L},$ where $C_{p}=2^{\frac{2(2p-1)}{p}}p^{\frac{2(p-1)}{p}}.$

It remains to bound the number of iterations $k$ so that $\frac{2\|\mathbf{u}^{*}-\mathbf{u}_{0}\|_{p}^{p}}{a^{p^{*}}(k+1)}\leq\frac{\epsilon^{p^{*}}}{2}.$ Equivalently, we need $k+1\geq\frac{4\|\mathbf{u}^{*}-\mathbf{u}_{0}\|_{p}^{p}}{a^{p^{*}}\epsilon^{p^{*}}}$ . Plugging $\delta=\frac{\epsilon^{2}}{C_{p}L}$ into the definition of $\Lambda,$ using that $p^{*}=\frac{p}{p-1},$ and simplifying, we have:

Stochastic oracle access.

We start by bounding the stochastic error $\mathcal{E}^{s}$ in expectation.

Let $\mathcal{E}^{s}=-a_{k}\left\langle\bar{\bm{\eta}}_{k},\bar{\mathbf{u}}_{k}-\mathbf{u}^{*}\right\rangle-a_{k}\left\langle\bar{\bm{\eta}}_{k}-\bm{\eta}_{k},\bar{\mathbf{u}}_{k}-\mathbf{u}_{k+1}\right\rangle$ , where $\bar{\bm{\eta}}_{k}$ and $\bm{\eta}_{k}$ are defined as in Eq. (4.2) and all the assumptions of Theorem 4.5 below apply. Then, for $q$ defined by Eq. (4.3) and any $\tau>0$ :

where the expectation is w.r.t. all the randomness in the algorithm.

Let us start by bounding $-a_{k}\left\langle\bar{\bm{\eta}}_{k},\bar{\mathbf{u}}_{k}-\mathbf{u}^{*}\right\rangle$ first. Conditioning on $\bar{\mathcal{F}}_{k}$ , $\bar{\bm{\eta}}_{k}$ is independent of $\bar{\mathbf{u}}_{k}$ and $\mathbf{u}^{*}$ , and, thus:

The second term, $-a_{k}\left\langle\bar{\bm{\eta}}_{k}-\bm{\eta}_{k},\bar{\mathbf{u}}_{k}-\mathbf{u}_{k+1}\right\rangle,$ can be bounded using Hölder’s inequality and Young’s inequality as follows:

where the last line is by Jensen’s inequality, as $q^{*}\in(1,2],$ and so $(\cdot)^{q^{*}/2}$ is concave. Using Young’s inequality and linearity of expectation:

We are now ready to bound the total oracle complexity of eg ${}_{p}+$ (and its special case eg $+$ ), as follows.

Combining Lemmas 4.1 and 4.4, we have, $\forall k\geq 0$ :

Proof of Part (i). In this case, $q=2$ , $m_{p}=1,$ $\delta=0,$ $\Lambda_{k}=L,$ and $\phi_{p}(\mathbf{u}^{*},\mathbf{u})=\frac{1}{2}\|\mathbf{u}^{*}-\mathbf{u}\|_{2}^{2},$ and, further, $\mathbf{u}_{k+1}-\mathbf{u}_{k}=-a_{k}F(\bar{\mathbf{u}}_{k}),$ so Eq. (4.17) simplifies to

Taking $\beta=\frac{1}{2},$ $\tau^{2}=\frac{1}{4},$ $\gamma=\sqrt{2},$ and $a_{k}=\frac{1}{2\sqrt{2}L},$ and recalling that $\bar{\rho}=\frac{1}{4\sqrt{2}L},$ we have:

Telescoping the last inequality and dividing both sides by $a_{k}(\bar{\rho}-\rho)(k+1),$ we get:

To finish the proof of this part, we require that both terms on the right-hand side of the last inequality are bounded by $\frac{\epsilon^{2}}{2}.$ For the first term, this leads to:

Proof of Part (ii). In this case, $q=2$ , $m_{p}=p-1,$ $\delta=0,$ $\Lambda_{k}=L,$ and $\rho=0.$ Thus, Eq. (4.17) simplifies to:

Telescoping the last inequality and dividing both sides by $(k+1)\frac{{a_{k}}^{2}m_{p}}{4\beta^{2}},$ we have:

and the bound on the total number of samples follows.

Proof of Part (iii). In this case, $q=p,$ $m_{p}=1,$ $\rho=0,$ $\phi_{p}(\mathbf{u}^{*},\mathbf{u})=\frac{1}{p}\|\mathbf{u}^{*}-\mathbf{u}\|_{p}^{p}$ , and we take $\delta_{k}=\delta>0,$ $\Lambda_{k}=\Lambda=\big{(}\frac{p-2}{p\delta}\big{)}^{\frac{p-2}{2}}L^{\frac{p}{2}}.$ Eq. (4.17) now simplifies to:

Telescoping the last inequality and then dividing both sides by $\frac{a^{p^{*}}}{2p}(k+1),$ we have:

To complete the proof, as before it suffices to show that we can choose $k$ and $n$ so that $\frac{2p\|\mathbf{u}^{*}-\mathbf{u}_{0}\|_{p}^{p}}{a^{p^{*}}(k+1)}+\frac{2p\delta}{{a}^{p^{*}}-1}\leq\frac{\epsilon^{p^{*}}}{2}$ and $\frac{2^{\frac{p+2}{p-1}}p\sigma^{p^{*}}}{p^{*}n}\leq\frac{\epsilon^{p^{*}}}{2}$ . For the former, following the same argument as in the proof of Theorem 4.2, Part (ii), it suffices to choose $\delta=O_{p}(\frac{\epsilon^{2}}{L})$ , which leads to:

The total number of queries to the stochastic oracle is then bounded by $k(1+n).$ ∎

Discussion

Acknowledgements

We wish to thank Steve Wright for a useful discussion regarding convergence of sequences. We also wish to thank the Simons Institute for the Theory of Computing where some of this work was conducted.

JD was supported by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin–Madison with funding from the Wisconsin Alumni Research Foundation and by the NSF Award CCF-2007757. CD was supported by NSF Awards IIS-1741137, CCF-1617730, and CCF-1901292, by a Simons Investigator Award, by the Simons Collaboration on the Theory of Algorithmic Fairness, and by the DOE PhILMs project (No. DE-AC05-76RL01830). MJ was supported in part by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764.