Efficient Methods for Structured Nonconvex-Nonconcave Min-Max Optimization

Jelena Diakonikolas, Constantinos Daskalakis, Michael I. Jordan

Introduction

Min-max optimization and min-max duality theory lie at the foundations of game theory and mathematical programming, and have found far-reaching applications across a range of disciplines, including complexity theory, statistics, control theory, and online learning theory. Most recently, min-max optimization has played an important role in machine learning, notably in the adversarial training of deep neural network classifiers and the training of generative deep neural network models. These recent applications have heightened the importance of solving min-max optimization problems with nonconvex-nonconcave objectives, taking the following general form:

where x\mathbf{x} and y\mathbf{y} are real-valued vectors and ff is not (necessarily) convex in x\mathbf{x} for all y\mathbf{y} and/or not (necessarily) concave in y\mathbf{y} for all x\mathbf{x}. There may also be constraints on x\mathbf{x} and y\mathbf{y}, and in many applications x\mathbf{x} and y\mathbf{y} are high-dimensional vectors.

When the objective function is not convex-concave, von Neumann’s celebrated min-max theorem fails to apply, and so do most standard optimization methods for solving (1.1). This has motivated several lines of investigation, which include extensions of the min-max theorem beyond convex-concave objectives (e.g. Sion’s theorem for quasiconvex-quasiconcave objectives), and the pursuit of computational procedures that target solutions to (1.1) even in the absence of a min-max theorem; see Section 1.1 for a review of recent work. Of course, without strong assumptions on ff, (1.1) is an intractable problem, at least as intractable as general nonconvex optimization. Thus, the literature has targeted locally optimal solutions, in the same spirit as the targeting of local optima in non-convex optimization. Naturally, there are various notions of local optimality that have been studied in the literature. Our focus here will be on the simplest such notion, namely first-order local optimality, for which, despite the apparent simplicity, many challenges arise Daskalakis and Panageas (2018), Mazumdar et al. (2020).

In contrast to classical optimization problems, where useful results can be obtained with very mild assumptions on the objective function, in min-max optimization it is necessary to impose non-trivial assumptions on ff, even when the goal is only to compute locally optimal solutions. Indeed, Daskalakis et al. (2021) establish intractability results in the constrained setting of the problem, wherein first-order locally optimal solutions are guaranteed to exist whenever the objective is smooth. Moreover, they show that even the computation of approximate solutions is PPAD-complete and, if the objective function is accessible through value-queries and gradient-queries, exponentially many such queries are necessary (in particular, exponential in at least one of the following: the inverse approximation parameter, the smoothness constant of ff, or the diameter of the constraint set).

We expect similar intractability results to hold in the unconstrained case, which is the case considered in this paper, even when restricting to smooth objectives that have a non-empty set of optimal solutions.Note that these are stationary points of ff in this case. Indeed, fixed-point complexity-based intractability results for the constrained case are typically extendable to the unconstrained case, by embedding the hard instances within an unbounded domain.

Given the aforedescribed intractability results, our goal is to identify structural properties that make it possible to solve min-max optimization problems with smooth objectives. Focusing on the unconstrained setting of (1.1), we view it as a special case (obtained by considering the operator F([\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\mathbf{x}\\ \mathbf{y}\crcr}}])=\big{[}\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\nabla_{\mathbf{x}}f(\mathbf{x},\mathbf{y})\\ -\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y})\crcr}}\big{]}) of the unconstrained variational inequality problem (svi), and consider instead this more general problem. We identify conditions for FF under which a generalized version of the extragradient method of Korpelevich (1976), which we propose, converges to a solution of (svi) (or, in the special case of (1.1), to a stationary point of ff) at a rate of 1/k1/\sqrt{k} in the number of iterations kk. Our condition, presented as Assumption 1, postulates that there exists a solution to (svi) that only violates the stronger (mvi) requirement in a controlled manner that we delineate. Our generalized extragradient method is based on an aggressive interpolation step, as specified by (eg++), and our main convergence result is Theorem 3.2. We additionally show, in Theorems 4.2 and 4.5, that the algorithm converges in non-Euclidean settings, under the stronger condition that an (mvi) solution exists, or when we only have stochastic oracle access to FF (or, in the special case of (1.1), to the gradient of ff).

The condition on FF under which our main result applies is weaker than the assumption that a solution to (mvi) exists Zhou et al. (2017), Mertikopoulos et al. (2019), Malitsky (2019), Song et al. (2020), an assumption which is already satisfied by several interesting families of min-max objectives, including quasiconvex-concave families or starconvex-concave families. Our significantly weaker condition applies in particular to (min-max objectives ff with corresponding) operators FF that are negatively comonotone Bauschke et al. (2020) or positively cohypomonotone Combettes and Pennanen (2004). These conditions have been studied in the literature for at least a couple of decades, but only asymptotic convergence results were available prior to our work for computing solutions to (svi). In contrast, our rates are asymptotically identical to the rates that we would get under the stronger assumption that a solution to (mvi) exists, and sidestep the intractability results for (1.1) suggested by Daskalakis et al. (2021) for general smooth objectives.

1 Further Related Work

A large number of recent works target identifying practical first-order, low-order, or efficient online learning methods for solving min-max optimization problems in a variety of settings, ranging from the well-behaved setting of convex-concave objectives to the challenging setting of nonconvex-nonconcave objectives. There has been substantial work for convex-concave and nonconvex-concave objectives, targeting the computation of min-max solutions to (1.1) or, respectively, stationary points of ff or Φ(x):=maxyf(x,y)\Phi(\mathbf{x}):=\max_{\mathbf{y}}f(\mathbf{x},\mathbf{y}). This work has focused on attaining improved convergence rates Kong and Monteiro (2019), Lin et al. (2020b), Thekumparampil et al. (2019), Nouiehed et al. (2019), Lu et al. (2020), Zhao (2019), Alkousa et al. (2019), Azizian et al. (2020), Golowich et al. (2020), Lin et al. (2020a), Diakonikolas (2020) and/or obtaining last-iterate convergence guarantees Daskalakis et al. (2018), Daskalakis and Panageas (2018), Mazumdar et al. (2020), Mertikopoulos et al. (2018), Lin et al. (2018), Hamedani and Aybat (2018), Adolphs et al. (2019), Daskalakis and Panageas (2019), Liang and Stokes (2019), Gidel et al. (2019), Mokhtari et al. (2020), Abernethy et al. (2019), Liu et al. (2020).

In the nonconvex-nonconcave setting, research has focused on identifying different notions of local min-max solutions Daskalakis and Panageas (2018), Mazumdar et al. (2020), Jin et al. (2020), Mangoubi and Vishnoi (2021) and studying the existence and (local) convergence properties of learning methods to these points Wang et al. (2019), Mangoubi et al. (2020), Mangoubi and Vishnoi (2021). As already discussed, recent work of Daskalakis et al. (2021) shows that, for general smooth objectives, the computation of even approximate first-order locally optimal min-max solutions is intractable, motivating the identification of structural assumptions on the objective function for which these intractability barriers can be bypassed.

An example such assumption, which is closely related to the one made in this work, is that an (mvi) solution exists for the operator F([\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\mathbf{x}\\ \mathbf{y}\crcr}}])=\big{[}\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\nabla_{\mathbf{x}}f(\mathbf{x},\mathbf{y})\\ -\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y})\crcr}}\big{]}, as studied by Zhou et al. (2017), Lin et al. (2018), Mertikopoulos et al. (2019), Malitsky (2019), Liu et al. (2020), Song et al. (2020). As we have already discussed, the assumption we make for our main result in this work is weaker. Table 1 provides a comparison of our results to those of existing works, considering the deterministic setting (i.e. having exact value access to FF).

In unconstrained Euclidean setups, the best known convergence rates are of the order 1/k1/\sqrt{k} Dang and Lan (2015), Song et al. (2020), under the assumption that an (mvi) solution exists. We obtain the same rate under our weaker Assumption 1. Moreover, under our weaker assumption, we show that the accumulation points of the sequence of iterates of our algorithm are (svi) solutions. This was previously established for alternative algorithms and under the stronger assumption that an (mvi) solution exists Mertikopoulos et al. (2019), Malitsky (2019).

Finally, it is worth noting that Zhou et al. (2017), Mertikopoulos et al. (2019), Malitsky (2019), Song et al. (2020) consider constrained optimization setups, which are not considered in our work. We believe that generalizing our results to constrained setups is possible, and defer such generalizations to future work.

Notation and Preliminaries

We are interested in finding stationary points for min-max problems of the form:

The Minty Variational Inequality problem consists in finding u\mathbf{u}^{*} such that:

in which case u\mathbf{u}^{*} is referred to as the weak solution to the variational inequality corresponding to FF and U\mathcal{U}. If we assume that FF is monotone, then (2.1) implies that every solution to (svi) is also a solution to (mvi), and the two solution sets are equivalent. More generally, if FF is not monotone, all that can be said is that the set of (mvi) solutions is a subset of the set of (svi) solutions. In particular, (mvi) solutions may not exist even when (svi) solutions exist. These facts follow from Minty’s theorem (see, e.g., (Kinderlehrer and Stampacchia, 2000, Chapter 3)).

We will not, in general, be assuming that FF is monotone. Note that the Lipschitzness of FF on its own is not sufficient to guarantee that the problem is computationally tractable, as discussed in the introduction. Thus, additional structure is needed, which we introduce in the following.

We define the class of problems with weak (mvi) solutions as the class of problems in which FF satisfies the following assumption.

There exists uU\mathbf{u}^{*}\in\mathcal{U}^{*} such that:

for some parameter \rho\in\big{[}0,\frac{1}{4L}\big{)}.

2 Example Settings Satisfying Assumption 1

The class of problems that have weak (mvi) solutions in the sense of Assumption 1 generalizes other structured non-monotone variational inequality problems, as we discuss in this section.

When ρ=0,\rho=0, we recover the class of problems that have an (mvi) solution. This class contains all unconstrained variationally coherent problems studied in, e.g., Zhou et al. (2017), Mertikopoulos et al. (2019), which encompass all min-max problems with objectives that are bilinear, pseudo-convex-concave, quasi-convex-concave, and star-convex-concave.

When ρ>0\rho>0 and p=2p=2, Assumption 1 is implied by FF being ρ2-\frac{\rho}{2}-comonotone Bauschke et al. (2020) or ρ2\frac{\rho}{2}-cohypomonotone Combettes and Pennanen (2004), defined as follows:

In particular, Assumption 1 is equivalent to requiring that (2.2) be satisfied for general u\mathbf{u} and v=u\mathbf{v}=\mathbf{u}^{*}, where u\mathbf{u}^{*} is a solution to (svi) (in which case F(u)=0F(\mathbf{u}^{*})=\textbf{0}). Note that Assumption 1 does not imply that a solution to (mvi) exists, unless ρ=0\rho=0. It is further important to note that cohypomonotone operators arise as inverses of operators that only need to be Lipschitz-continuous. (In fact, even a weaker property suffices; see Bauschke et al. (2020).) This is particularly interesting as, combined with our main result, it implies that we can efficiently find zeros of inverses of Lipschitz-continuous operators, as long as those inverses are sufficiently Lipschitz, even though finding zeros of Lipschitz-continuous operators is computationally intractable, in general, as we have discussed.

It is interesting to note that Assumption 1 does not imply that, in the min-max setting, ff is convex-concave (or, more generally, that FF is monotone), even in a neighborhood of an (svi) solution \mathbf{u}^{*}=\big{[}\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\mathbf{x}^{*}\\ \mathbf{y}^{*}\crcr}}\big{]}, i.e., a stationary point of ff. To see this, fix y=y\mathbf{y}=\mathbf{y}^{*} and consider f(x,y)f(\mathbf{x},\mathbf{y}^{*}) for x\mathbf{x} in a small neighborhood of x.\mathbf{x}^{*}. Using the fact that a continuously-differentiable function is well-approximated by its linear approximation within small neighborhoods, all that we are able to deduce from Assumption 1 is that

In particular, Assumption 1 does not preclude that f(x,y)f(\mathbf{x}^{*},\mathbf{y}^{*}) is larger than f(x,y)f(\mathbf{x},\mathbf{y}^{*}); it only bounds how much larger it can be by a quantity proportional to f(x,y)p2\|\nabla f(\mathbf{x},\mathbf{y}^{*})\|_{p^{*}}^{2}. Compare this also to the Polyak-Łojasiewicz condition (see, e.g., Nouiehed et al. (2019), Yang et al. (2020)), which imposes the opposite inequality, namely, that f(x,y)f(x,y)f(\mathbf{x},\mathbf{y}^{*})-f(\mathbf{x}^{*},\mathbf{y}^{*}) is bounded above by a multiple of f(x,y)p2\|\nabla f(\mathbf{x},\mathbf{y}^{*})\|_{p^{*}}^{2}.

One way that a generic operator FF may satisfy Assumption 1 is when there is a constant λ>0\lambda>0 such that for some uU\mathbf{u}^{*}\in\mathcal{U}^{*} we have

and when the operator FF does not plateau or become too close to a linear operator around u;\mathbf{u}^{*}; namely, F(u)F(u)pμuup\|F(\mathbf{u})-F(\mathbf{u}^{*})\|_{p^{*}}\geq\mu\|\mathbf{u}-\mathbf{u}^{*}\|_{p}. (Note that (2.3) is always satisfied with λ=2L\lambda=2L for LL-Lipschitz operators, but we may need λ\lambda to be smaller than 2L2L). Then Assumption 1 would be satisfied with ρ=λμ.\rho=\frac{\lambda}{\mu}. For a min-max problem, assuming ff is twice differentiable, this would mean that the lowest eigenvalue of the symmetric part of the Jacobian of \big{[}\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\nabla_{\mathbf{x}}f(\mathbf{x},\mathbf{y})\\ -\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y})\crcr}}\big{]} is bounded below by λ/2-\lambda/2 in any direction uu\mathbf{u}-\mathbf{u}^{*} and the function ff is sufficiently “curved” (not close to a linear or a constant function) around \mathbf{u}^{*}=\big{[}\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\mathbf{x}^{*}\\ \mathbf{y}^{*}\crcr}}\big{]}.

Finally, we discuss a concrete min-max application wherein there are no (mvi) solutions, but there do exist (svi) solutions satisfying the weak (mvi) condition of Assumption 1. This application arises in the context of two-agent zero-sum reinforcement learning problems studied by many authors, including recently by Daskalakis et al. (2020). In Section 5.1 of that work, the authors consider a special case of the general two-agent zero-sum RL problem, called von Neumann’s ratio game, for which they observe that, even on a random example, the (mvi) solution set is empty, yet the extragradient method still converges in practice (albeit at a slower rate). Interestingly, it is easy to construct examples of the von Neumann ratio game for which no (mvi) solution exists, but the weak (mvi) condition of Assumption 1 does hold, and yet the stronger cohypomonotonicity condition of (2.2) does not hold. Indeed, one such example is obtained for the game shown in Proposition 2 of their paper, setting s=1/2s=1/2 and ϵ=.49\epsilon=.49. Here (mvi) fails, the weak (mvi) condition of Assumption 1 is satisfied, and cohypomonotonicity fails to hold, e.g., for u=(x,y)=(0.1,0.3)\mathbf{u}=(\mathbf{x},\mathbf{y})=(0.1,0.3) and v=(x,y)=(0.8,0.3)\mathbf{v}=(\mathbf{x}’,\mathbf{y}’)=(0.8,0.3). To be clear, the von Neumann ratio game gives rise to a constrained min-max problem while our algorithm targets the unconstrained setting. While extending our result to the constrained setting remains open, our example here demonstrates that there is value in further studying the weak (mvi) condition of Assumption 1 in the constrained setting as well.

3 Useful Definitions and Facts

Observe that when p=2,p=2, we recover the standard definition of strong convexity. Thus, uniform convexity is a generalization of strong convexity.

It is immediate that the Bregman divergence of a convex function is non-negative.

We now outline some useful auxiliary results used specifically in Section 4, where we study the case that pp is not necessarily equal to 2.

Then, for p=pp1,p^{*}=\frac{p}{p-1}, q=qq1q^{*}=\frac{q}{q-1}:

The statements in the proposition are simple corollaries of conjugacy of the functions ψ(u)=1qupq\psi(\mathbf{u})=\frac{1}{q}\|\mathbf{u}\|_{p}^{q} and ψ(z)=1qzpq\psi^{*}(\mathbf{z})=\frac{1}{q^{*}}\|\mathbf{z}\|_{p^{*}}^{q^{*}}. In particular, the first part follows from

by the definition of a convex conjugate and using that 1qupq\frac{1}{q}\|\mathbf{u}\|_{p}^{q} and 1qzpq\frac{1}{q^{*}}\|\mathbf{z}\|_{p^{*}}^{q^{*}} are conjugates of each other, which are standard exercises in convex analysis for q{p,2}q\in\{p,2\} (see, e.g., (Borwein and Zhu, 2004, Exercise 4.4.2) and (Boyd et al., 2004, Example 3.27)).

Another useful result is the following proposition, which will allow us to relate Lipschitzness of FF to uniform convexity of the prox mapping 1qpq\frac{1}{q}\|\cdot\|_{p}^{q} in the definition of the algorithm. The ideas used in the proof can be found in the proofs of (d’Aspremont et al., 2018, Lemma 5.7), (Nesterov, 2015, Lemma 2), and in (Devolder et al., 2014, Section 2.3). The proof is provided for completeness.

For any L>0L>0, κ>0\kappa>0, qκ,q\geq\kappa, t0t\geq 0, and δ>0\delta>0,

The proof is based on the Fenchel-Young inequality and the conjugacy of functions xrr\frac{|x|^{r}}{r} and yss\frac{|y|^{s}}{s} for r,s1r,s\geq 1, 1r+1s=1,\frac{1}{r}+\frac{1}{s}=1, which implies xyxrr+yss,xy\leq\frac{x^{r}}{r}+\frac{y^{s}}{s}, x,y0.\forall x,y\geq 0. In particular, setting r=q/κ,r=q/\kappa, s=q/(qκ)s=q/(q-\kappa), and x=tκ,x=t^{\kappa}, we have

It remains to set δ2=L(qκ)qκyκqκ,\frac{\delta}{2}=\frac{L(q-\kappa)}{q\kappa}y^{\frac{\kappa}{q-\kappa}}, which, solving for y,y, gives y=\big{(}\frac{\delta q\kappa}{2L(q-\kappa)}\big{)}^{q-\kappa}, and verify that, under this choice, Λ=Ltqqy.\Lambda=\frac{Lt^{q}}{qy}.

Generalized Extragradient for Problems with Weak MVI Solutions

In this section, we consider the setup with the Euclidean norm =2\|\cdot\|=\|\cdot\|_{2}, i.e., p=2.p=2. To address the class of problems with weak (mvi) solutions (see Assumption 1), we introduce the following generalization of the extragradient algorithm, to which we refer as Extragradient+ (eg++).

where β(0,1]\beta\in(0,1] is a parameter of the algorithm and ak>0a_{k}>0 is the step size. When β=1,\beta=1, we recover standard eg.

The analysis relies on the following merit (or gap) function:

for some uU\mathbf{u}^{*}\in\mathcal{U}^{*} for which FF satisfies Assumption 1. Then Assumption 1 implies that hk0,h_{k}\geq 0, k.\forall k.

The first (and main) step is to bound all hkh_{k}’s above, as in the following lemma.

where hkh_{k} is defined as in Eq. (3.1).

Fix any k0k\geq 0 and write hkh_{k} equivalently as

The proof proceeds by bounding from above individual terms on the right-hand side of Eq. (3.3). For the first term, the first-order optimality in the definition of uk+1\mathbf{u}_{k+1} gives:

For the second term on the right-hand side of Eq. (3.3), the first-order optimality in the definition of uˉk\bar{\mathbf{u}}_{k} implies:

which, similarly as for the first term, leads to:

For the third term on the right-hand side of Eq. (3.3), applying the Cauchy-Schwarz inequality, LL-Lipschitzness of F,F, and Young’s inequality, respectively, we have:

where the last inequality holds for any γ>0.\gamma>0.

Using the fact that uˉkuk=akβF(uk)\bar{\mathbf{u}}_{k}-\mathbf{u}_{k}=-\frac{a_{k}}{\beta}F(\mathbf{u}_{k}), uk+1uk=akF(uˉk)\mathbf{u}_{k+1}-\mathbf{u}_{k}=-a_{k}F(\bar{\mathbf{u}}_{k}) and combining Eqs. (3.4)-(3.6) with Eq. (3.3), we have:

Using Lemma 3.1, we can now draw conclusions about the convergence of eg++ by choosing parameters β,γ\beta,\gamma and the step sizes aka_{k} to guarantee that hk<12uuk212uuk+12h_{k}<\frac{1}{2}\|\mathbf{u}^{*}-\mathbf{u}_{k}\|^{2}-\frac{1}{2}\|\mathbf{u}^{*}-\mathbf{u}_{k+1}\|^{2} as long as F(uˉk)0\|F(\bar{\mathbf{u}}_{k})\|\neq 0.

all accumulation points of {uˉk}k0\{\bar{\mathbf{u}}_{k}\}_{k\geq 0} are in U\mathcal{U}^{*}.

Applying Lemma 3.1 with the choice of aka_{k} and β\beta from the theorem statement and with γ=1,\gamma=1, we get

By Assumption 1, ρ<14L,\rho<\frac{1}{4L}, and, thus, the constant multiplying F(uˉk)2\|F(\bar{\mathbf{u}}_{k})\|^{2} is strictly negative.

As hk0h_{k}\geq 0 (by Assumption 1), we can conclude that

As \frac{1}{4L}\Big{(}\frac{1}{4L}-\rho\Big{)}>0, Eq. (3.7) implies that F(uˉk)\|F(\bar{\mathbf{u}}_{k})\| converges to zero as k.k\to\infty. Further, as uˉkuk=akβF(uk),\bar{\mathbf{u}}_{k}-\mathbf{u}_{k}=-\frac{a_{k}}{\beta}F(\mathbf{u}_{k}), using triangle inequality and F(u)=0:F(\mathbf{u}^{*})=\textbf{0}:

where we have used that FF is LL-Lipschitz. Now, as uku\|\mathbf{u}_{k}-\mathbf{u}^{*}\| is bounded (by u0u,\|\mathbf{u}_{0}-\mathbf{u}^{*}\|, from Eq. (3.7)), it follows that the sequence {uˉk}\{\bar{\mathbf{u}}_{k}\} is bounded as well, and thus has a converging subsequence. Let {uˉki}\{\bar{\mathbf{u}}_{k_{i}}\} be any converging subsequence of {uˉk}\{\bar{\mathbf{u}}_{k}\} and let uˉ\bar{\mathbf{u}}^{*} be its corresponding accumulation point. Then, as F(uˉk)\|F(\bar{\mathbf{u}}_{k})\| converges to zero as k,k\to\infty, it follows that F(uˉki)\|F(\bar{\mathbf{u}}_{k_{i}})\| converges to zero as i,i\to\infty, and so it must be uˉU.\bar{\mathbf{u}}^{*}\in\mathcal{U}^{*}.

For Part (ii), telescoping Eq. (3.7), we get:

and 1k+1i=0kF(uˉi)2min0ikF(uˉi)2\frac{1}{k+1}\sum_{i=0}^{k}\|F(\bar{\mathbf{u}}_{i})\|^{2}\geq\min_{0\leq i\leq k}\|F(\bar{\mathbf{u}}_{i})\|^{2}. ∎

Due to Eq. (3.8), we have that all the iterates of eg++ with the parameter setting as in Theorem 3.2 remain in the ball centered at u\mathbf{u}^{*} and of radius at most 2u0u2\|\mathbf{u}_{0}-\mathbf{u}^{*}\|. Thus, Assumption 1 does not need to hold globally for the result to apply; it suffices that it only applies locally to points from the ball around u\mathbf{u}^{*} with radius 2u0u2\|\mathbf{u}_{0}-\mathbf{u}^{*}\|.

It is possible to obtain similar convergence results as those of Theorem 3.2 under different parameter choices. In particular, for γ(0,1],\gamma\in(0,1], it suffices that akβγLa_{k}\leq\frac{\beta\gamma}{L} and ρ<ak(1β).\rho<a_{k}(1-\beta). We settled on the choice made in Theorem 3.2 as it is simple and requires tuning only one parameter, LL.

where Fk\mathcal{F}_{k} and Fˉk\bar{\mathcal{F}}_{k} denote the natural filtrations, including all the randomness up to the construction of points uk\mathbf{u}_{k} and uˉk,\bar{\mathbf{u}}_{k}, respectively, and σˉk2,σk+12{\bar{\sigma}_{k}}^{2},{\sigma_{k+1}}^{2} are the variance constants. Observe that FkFˉk\mathcal{F}_{k}\subseteq\bar{\mathcal{F}}_{k} and FˉkFk+1.\bar{\mathcal{F}}_{k}\subseteq\mathcal{F}_{k+1}. To simplify the notation, we denote:

The variant of the method we consider here is stated as follows:

This is immediate for p>2,p>2, by the definition of ϕp\phi_{p} and using that q=pq=p and mp=1m_{p}=1 when p>2.p>2. For p(1,2],p\in(1,2], we have that q=2,q=2, and Eq. (4.6) follows by strong convexity of 12p2.\frac{1}{2}\|\cdot\|_{p}^{2}.

As in the case of Euclidean norms, the analysis relies on the following merit function:

Moreover, as before, Assumption 1 guarantees that hk0,h_{k}\geq 0, k.\forall k.

We start by first proving a lemma that holds for generic choices of algorithm parameters aka_{k} and β.\beta. We will then use this lemma to deduce the convergence bounds for different choices of p>1p>1 and both deterministic and the stochastic oracle access to F.F.

where hkh_{k} is defined as in Eq. (4.7), δk\delta_{k} is any positive number, and \Lambda_{k}=\Big{(}\frac{q-2}{\delta_{k}q}\Big{)}^{\frac{q-2}{2}}L^{q/2}. When q=2,q=2, the statement also holds with δk=0\delta_{k}=0 and Λk=L.\Lambda_{k}=L.

We begin the proof by writing hkh_{k} equivalently as:

The proof now proceeds by bounding individual terms on the right-hand side of the last equality.

Equivalently, applying the definition of Mk+1()M_{k+1}(\cdot) to the last inequality:

where the last inequality follows from Eq. (4.6).

where the inequality is by Mˉk(uˉk)=0\nabla\bar{M}_{k}(\bar{\mathbf{u}}_{k})=\textbf{0} and the fact that 1qpq\frac{1}{q}\|\cdot\|_{p}^{q} is qq-uniformly convex w.r.t. p\|\cdot\|_{p} with constant mpm_{p}, by the choice of qq from Eq. (4.3). Applying the definition of Mˉk(u)\bar{M}_{k}(\mathbf{u}) to the last inequality:

where (i)(i) is by Hölder’s inequality, (ii)(ii) is by LL-Lipschitzness of F,F, and (iii)(iii) is by Young’s inequality, which holds for any γ>0.\gamma>0. Now, let δk>0\delta_{k}>0 and \Lambda_{k}=\Big{(}\frac{2(q-\kappa)}{\delta_{k}q\kappa}\Big{)}^{\frac{q-\kappa}{\kappa}}L^{q/\kappa}. Then, applying Proposition 2.6 to the last two terms in the last inequality:

Observe that when q=2,q=2, there is no need to apply Proposition 2.6, and the last inequality is satisfied with δk=0\delta_{k}=0 and Λk=L.\Lambda_{k}=L.

Combining Eqs. (4.9)-(4.11) with Eq. (4.8), we have:

The main result is summarized in the following theorem.

Let p(1,2]p\in(1,2]. If β=mp=p1,\beta=m_{p}=p-1, ak=mp3/22La_{k}=\frac{{m_{p}}^{3/2}}{2L}, then all accumulation points of {uk}k0\{\mathbf{u}_{k}\}_{k\geq 0} are in U,\mathcal{U}^{*}, and, furthermore k0\forall k\geq 0:

In particular, within k=O\big{(}\frac{L^{2}\|\mathbf{u}^{*}-\mathbf{u}_{0}\|_{p}^{2}}{{(p-1)}^{2}\epsilon^{2}}\big{)} iterations egp+{}_{p}+ can output a point u\mathbf{u} with F(u)pϵ.\|F(\mathbf{u})\|_{p^{*}}\leq\epsilon.

Let p(2,)p\in(2,\infty). If β=12,\beta=\frac{1}{2}, δk=δ>0\delta_{k}=\delta>0, \Lambda=\big{(}\frac{q-2}{\delta q}\big{)}^{\frac{q-2}{2}}L^{q/2}, and ak=12Λ=aa_{k}=\frac{1}{2\Lambda}=a, then, k0\forall k\geq 0:

In particular, for any ϵ>0,\epsilon>0, there is a choice of δ=ϵ2CpL,\delta=\frac{\epsilon^{2}}{C_{p}L}, where CpC_{p} is a constant that only depends on pp, such that egp+{}_{p}+ can output a point u\mathbf{u} with F(u)pϵ\|F(\mathbf{u})\|_{p^{*}}\leq\epsilon in at most

iterations. Here, the OpO_{p} notation hides constants that only depend on p.p.

Observe that, as ηˉi=ηi=0,\bar{\bm{\eta}}_{i}=\bm{\eta}_{i}=\textbf{0}, i0\forall i\geq 0 and ρ=0,\rho=0, Lemma 4.1 and the definition of hkh_{k} give:

Proof of Part (i). In this case, we can set δk=0\delta_{k}=0 (see Lemma 4.1), Λk=L,\Lambda_{k}=L, and q=2.q=2. Therefore, setting β=mp,\beta=m_{p}, ak=mp3/22La_{k}=\frac{{m_{p}}^{3/2}}{2L}, and γ=1mp\gamma=\frac{1}{\sqrt{m_{p}}} we get from Eq. (4.12) that

It follows that uˉkukp2\|\bar{\mathbf{u}}_{k}-\mathbf{u}_{k}\|_{p}^{2} converges to zero as k.k\to\infty. By the definition of uˉk\bar{\mathbf{u}}_{k} and Proposition 2.5, 12uˉkukp2=ak22β2F(uk)p2,\frac{1}{2}\|\bar{\mathbf{u}}_{k}-\mathbf{u}_{k}\|_{p}^{2}=\frac{{a_{k}}^{2}}{2\beta^{2}}\|F(\mathbf{u}_{k})\|_{p^{*}}^{2}, and so F(uk)p\|F(\mathbf{u}_{k})\|_{p^{*}} converges to zero as k.k\to\infty. Further, as ϕp(u,uk)ϕp(u,u0)<\phi_{p}(\mathbf{u}^{*},\mathbf{u}_{k})\leq\phi_{p}(\mathbf{u}^{*},\mathbf{u}_{0})<\infty and ϕp(u,uk)mp2uukp2,\phi_{p}(\mathbf{u}^{*},\mathbf{u}_{k})\geq\frac{m_{p}}{2}\|\mathbf{u}^{*}-\mathbf{u}_{k}\|_{p}^{2}, mp>0,m_{p}>0, it follows that uukp\|\mathbf{u}^{*}-\mathbf{u}_{k}\|_{p} is bounded, and, thus, {uk}k0\{\mathbf{u}_{k}\}_{k\geq 0} is a bounded sequence. The proof that all accumulation points of {uk}k0\{\mathbf{u}_{k}\}_{k\geq 0} are in U\mathcal{U}^{*} is standard and omitted (see the proof of Theorem 3.2 for a similar argument).

To bound 1k+1i=0kF(ui)p2\frac{1}{k+1}\sum_{i=0}^{k}\|F(\mathbf{u}_{i})\|_{p^{*}}^{2}, we telescope the inequality from Eq. (4.13) to get:

Proof of Part (ii). In this case, q=pq=p, ϕp(u,v)=1puvpp,\phi_{p}(\mathbf{u},\mathbf{v})=\frac{1}{p}\|\mathbf{u}-\mathbf{v}\|_{p}^{p}, and mp=1.m_{p}=1. Using Proposition 2.5, ukuˉkpp=akpβpF(uk)pp\|\mathbf{u}_{k}-\bar{\mathbf{u}}_{k}\|_{p}^{p}=\frac{{a_{k}}^{p^{*}}}{\beta^{p^{*}}}\|F(\mathbf{u}_{k})\|_{p^{*}}^{p^{*}} and uk+1ukpp=akpF(uˉk)pp.\|\mathbf{u}_{k+1}-\mathbf{u}_{k}\|_{p}^{p}={a_{k}}^{p^{*}}\|F(\bar{\mathbf{u}}_{k})\|_{p^{*}}^{p^{*}}. Combining with Eq. (4.12), we have:

Now let γ=1,\gamma=1, β=12,\beta=\frac{1}{2}, δk=δ>0\delta_{k}=\delta>0, and ak=12Λk=12Λ=aa_{k}=\frac{1}{2\Lambda_{k}}=\frac{1}{2\Lambda}=a. Then akΛkγβ=akΛk/γβ=0a_{k}\Lambda_{k}\gamma-\beta=a_{k}\Lambda_{k}/\gamma-\beta=0 and Eq. (4.14) simplifies to:

Telescoping the last inequality and then dividing by akp(k+1)2p,\frac{{a_{k}}^{p^{*}}(k+1)}{2p}, we have:

Now, for egp+{}_{p}+ to be able to output a point u\mathbf{u} with F(u)pϵ,\|F(\mathbf{u})\|_{p^{*}}\leq\epsilon, it suffices to show that for some choice of δ\delta and kk we can make the right-hand side of Eq. (4.15) at most ϵp\epsilon^{p^{*}}. This is true because then egp+{}_{p}+ can output the point uˉi=argmin0ikF(uˉi)p.\bar{\mathbf{u}}_{i}=\operatorname*{argmin}_{0\leq i\leq k}\|F(\bar{\mathbf{u}}_{i})\|_{p^{*}}. For stochastic setups, the guarantee would be in expectation, and egp+{}_{p}+ could output a point uˉi\bar{\mathbf{u}}_{i} with ii chosen uniformly at random from {0,,k}\{0,\dots,k\}, as discussed in the proof of Theorem 3.2.

Observe first that, as \Lambda=\big{(}\frac{p-2}{p\delta}\big{)}^{\frac{p-2}{2}}L^{p/2} and p=pp1,p^{*}=\frac{p}{p-1}, we have that:

Setting 2pδap1ϵp2\frac{2p\delta}{{a}^{p^{*}-1}}\leq\frac{\epsilon^{p^{*}}}{2}, recalling that p=pp1p^{*}=\frac{p}{p-1}, and rearranging, we have:

It can be verified numerically that (p2p)p2p(\frac{p-2}{p})^{\frac{p-2}{p}} is a constant between 1e\frac{1}{e} and 1,1, while it is clear that 22(2p1)pp2(p1)p=O(p2)2^{\frac{2(2p-1)}{p}}p^{\frac{2(p-1)}{p}}=O(p^{2}) is a constant that only depends on pp. Hence, it suffices to set δ=ϵ2CpL,\delta=\frac{\epsilon^{2}}{C_{p}L}, where Cp=22(2p1)pp2(p1)p.C_{p}=2^{\frac{2(2p-1)}{p}}p^{\frac{2(p-1)}{p}}.

It remains to bound the number of iterations kk so that 2uu0ppap(k+1)ϵp2.\frac{2\|\mathbf{u}^{*}-\mathbf{u}_{0}\|_{p}^{p}}{a^{p^{*}}(k+1)}\leq\frac{\epsilon^{p^{*}}}{2}. Equivalently, we need k+14uu0ppapϵpk+1\geq\frac{4\|\mathbf{u}^{*}-\mathbf{u}_{0}\|_{p}^{p}}{a^{p^{*}}\epsilon^{p^{*}}}. Plugging δ=ϵ2CpL\delta=\frac{\epsilon^{2}}{C_{p}L} into the definition of Λ,\Lambda, using that p=pp1,p^{*}=\frac{p}{p-1}, and simplifying, we have:

Stochastic oracle access.

We start by bounding the stochastic error Es\mathcal{E}^{s} in expectation.

Let Es=akηˉk,uˉkuakηˉkηk,uˉkuk+1\mathcal{E}^{s}=-a_{k}\left\langle\bar{\bm{\eta}}_{k},\bar{\mathbf{u}}_{k}-\mathbf{u}^{*}\right\rangle-a_{k}\left\langle\bar{\bm{\eta}}_{k}-\bm{\eta}_{k},\bar{\mathbf{u}}_{k}-\mathbf{u}_{k+1}\right\rangle, where ηˉk\bar{\bm{\eta}}_{k} and ηk\bm{\eta}_{k} are defined as in Eq. (4.2) and all the assumptions of Theorem 4.5 below apply. Then, for qq defined by Eq. (4.3) and any τ>0\tau>0:

where the expectation is w.r.t. all the randomness in the algorithm.

Let us start by bounding akηˉk,uˉku-a_{k}\left\langle\bar{\bm{\eta}}_{k},\bar{\mathbf{u}}_{k}-\mathbf{u}^{*}\right\rangle first. Conditioning on Fˉk\bar{\mathcal{F}}_{k}, ηˉk\bar{\bm{\eta}}_{k} is independent of uˉk\bar{\mathbf{u}}_{k} and u\mathbf{u}^{*}, and, thus:

The second term, akηˉkηk,uˉkuk+1,-a_{k}\left\langle\bar{\bm{\eta}}_{k}-\bm{\eta}_{k},\bar{\mathbf{u}}_{k}-\mathbf{u}_{k+1}\right\rangle, can be bounded using Hölder’s inequality and Young’s inequality as follows:

where the last line is by Jensen’s inequality, as q(1,2],q^{*}\in(1,2], and so ()q/2(\cdot)^{q^{*}/2} is concave. Using Young’s inequality and linearity of expectation:

We are now ready to bound the total oracle complexity of egp+{}_{p}+ (and its special case eg++), as follows.

Combining Lemmas 4.1 and 4.4, we have, k0\forall k\geq 0:

Proof of Part (i). In this case, q=2q=2, mp=1,m_{p}=1, δ=0,\delta=0, Λk=L,\Lambda_{k}=L, and ϕp(u,u)=12uu22,\phi_{p}(\mathbf{u}^{*},\mathbf{u})=\frac{1}{2}\|\mathbf{u}^{*}-\mathbf{u}\|_{2}^{2}, and, further, uk+1uk=akF(uˉk),\mathbf{u}_{k+1}-\mathbf{u}_{k}=-a_{k}F(\bar{\mathbf{u}}_{k}), so Eq. (4.17) simplifies to

Taking β=12,\beta=\frac{1}{2}, τ2=14,\tau^{2}=\frac{1}{4}, γ=2,\gamma=\sqrt{2}, and ak=122L,a_{k}=\frac{1}{2\sqrt{2}L}, and recalling that ρˉ=142L,\bar{\rho}=\frac{1}{4\sqrt{2}L}, we have:

Telescoping the last inequality and dividing both sides by ak(ρˉρ)(k+1),a_{k}(\bar{\rho}-\rho)(k+1), we get:

To finish the proof of this part, we require that both terms on the right-hand side of the last inequality are bounded by ϵ22.\frac{\epsilon^{2}}{2}. For the first term, this leads to:

Proof of Part (ii). In this case, q=2q=2, mp=p1,m_{p}=p-1, δ=0,\delta=0, Λk=L,\Lambda_{k}=L, and ρ=0.\rho=0. Thus, Eq. (4.17) simplifies to:

Telescoping the last inequality and dividing both sides by (k+1)ak2mp4β2,(k+1)\frac{{a_{k}}^{2}m_{p}}{4\beta^{2}}, we have:

and the bound on the total number of samples follows.

Proof of Part (iii). In this case, q=p,q=p, mp=1,m_{p}=1, ρ=0,\rho=0, ϕp(u,u)=1puupp\phi_{p}(\mathbf{u}^{*},\mathbf{u})=\frac{1}{p}\|\mathbf{u}^{*}-\mathbf{u}\|_{p}^{p}, and we take δk=δ>0,\delta_{k}=\delta>0, \Lambda_{k}=\Lambda=\big{(}\frac{p-2}{p\delta}\big{)}^{\frac{p-2}{2}}L^{\frac{p}{2}}. Eq. (4.17) now simplifies to:

Telescoping the last inequality and then dividing both sides by ap2p(k+1),\frac{a^{p^{*}}}{2p}(k+1), we have:

To complete the proof, as before it suffices to show that we can choose kk and nn so that 2puu0ppap(k+1)+2pδap1ϵp2\frac{2p\|\mathbf{u}^{*}-\mathbf{u}_{0}\|_{p}^{p}}{a^{p^{*}}(k+1)}+\frac{2p\delta}{{a}^{p^{*}}-1}\leq\frac{\epsilon^{p^{*}}}{2} and 2p+2p1pσppnϵp2\frac{2^{\frac{p+2}{p-1}}p\sigma^{p^{*}}}{p^{*}n}\leq\frac{\epsilon^{p^{*}}}{2}. For the former, following the same argument as in the proof of Theorem 4.2, Part (ii), it suffices to choose δ=Op(ϵ2L)\delta=O_{p}(\frac{\epsilon^{2}}{L}), which leads to:

The total number of queries to the stochastic oracle is then bounded by k(1+n).k(1+n).

Discussion

Acknowledgements

We wish to thank Steve Wright for a useful discussion regarding convergence of sequences. We also wish to thank the Simons Institute for the Theory of Computing where some of this work was conducted.

JD was supported by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin–Madison with funding from the Wisconsin Alumni Research Foundation and by the NSF Award CCF-2007757. CD was supported by NSF Awards IIS-1741137, CCF-1617730, and CCF-1901292, by a Simons Investigator Award, by the Simons Collaboration on the Theory of Algorithmic Fairness, and by the DOE PhILMs project (No. DE-AC05-76RL01830). MJ was supported in part by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764.

References