Efficient Methods for Structured Nonconvex-Nonconcave Min-Max Optimization
Jelena Diakonikolas, Constantinos Daskalakis, Michael I. Jordan
Introduction
Min-max optimization and min-max duality theory lie at the foundations of game theory and mathematical programming, and have found far-reaching applications across a range of disciplines, including complexity theory, statistics, control theory, and online learning theory. Most recently, min-max optimization has played an important role in machine learning, notably in the adversarial training of deep neural network classifiers and the training of generative deep neural network models. These recent applications have heightened the importance of solving min-max optimization problems with nonconvex-nonconcave objectives, taking the following general form:
where and are real-valued vectors and is not (necessarily) convex in for all and/or not (necessarily) concave in for all . There may also be constraints on and , and in many applications and are high-dimensional vectors.
When the objective function is not convex-concave, von Neumann’s celebrated min-max theorem fails to apply, and so do most standard optimization methods for solving (1.1). This has motivated several lines of investigation, which include extensions of the min-max theorem beyond convex-concave objectives (e.g. Sion’s theorem for quasiconvex-quasiconcave objectives), and the pursuit of computational procedures that target solutions to (1.1) even in the absence of a min-max theorem; see Section 1.1 for a review of recent work. Of course, without strong assumptions on , (1.1) is an intractable problem, at least as intractable as general nonconvex optimization. Thus, the literature has targeted locally optimal solutions, in the same spirit as the targeting of local optima in non-convex optimization. Naturally, there are various notions of local optimality that have been studied in the literature. Our focus here will be on the simplest such notion, namely first-order local optimality, for which, despite the apparent simplicity, many challenges arise Daskalakis and Panageas (2018), Mazumdar et al. (2020).
In contrast to classical optimization problems, where useful results can be obtained with very mild assumptions on the objective function, in min-max optimization it is necessary to impose non-trivial assumptions on , even when the goal is only to compute locally optimal solutions. Indeed, Daskalakis et al. (2021) establish intractability results in the constrained setting of the problem, wherein first-order locally optimal solutions are guaranteed to exist whenever the objective is smooth. Moreover, they show that even the computation of approximate solutions is PPAD-complete and, if the objective function is accessible through value-queries and gradient-queries, exponentially many such queries are necessary (in particular, exponential in at least one of the following: the inverse approximation parameter, the smoothness constant of , or the diameter of the constraint set).
We expect similar intractability results to hold in the unconstrained case, which is the case considered in this paper, even when restricting to smooth objectives that have a non-empty set of optimal solutions.Note that these are stationary points of in this case. Indeed, fixed-point complexity-based intractability results for the constrained case are typically extendable to the unconstrained case, by embedding the hard instances within an unbounded domain.
Given the aforedescribed intractability results, our goal is to identify structural properties that make it possible to solve min-max optimization problems with smooth objectives. Focusing on the unconstrained setting of (1.1), we view it as a special case (obtained by considering the operator F([\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\mathbf{x}\\ \mathbf{y}\crcr}}])=\big{[}\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\nabla_{\mathbf{x}}f(\mathbf{x},\mathbf{y})\\ -\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y})\crcr}}\big{]}) of the unconstrained variational inequality problem (svi), and consider instead this more general problem. We identify conditions for under which a generalized version of the extragradient method of Korpelevich (1976), which we propose, converges to a solution of (svi) (or, in the special case of (1.1), to a stationary point of ) at a rate of in the number of iterations . Our condition, presented as Assumption 1, postulates that there exists a solution to (svi) that only violates the stronger (mvi) requirement in a controlled manner that we delineate. Our generalized extragradient method is based on an aggressive interpolation step, as specified by (eg), and our main convergence result is Theorem 3.2. We additionally show, in Theorems 4.2 and 4.5, that the algorithm converges in non-Euclidean settings, under the stronger condition that an (mvi) solution exists, or when we only have stochastic oracle access to (or, in the special case of (1.1), to the gradient of ).
The condition on under which our main result applies is weaker than the assumption that a solution to (mvi) exists Zhou et al. (2017), Mertikopoulos et al. (2019), Malitsky (2019), Song et al. (2020), an assumption which is already satisfied by several interesting families of min-max objectives, including quasiconvex-concave families or starconvex-concave families. Our significantly weaker condition applies in particular to (min-max objectives with corresponding) operators that are negatively comonotone Bauschke et al. (2020) or positively cohypomonotone Combettes and Pennanen (2004). These conditions have been studied in the literature for at least a couple of decades, but only asymptotic convergence results were available prior to our work for computing solutions to (svi). In contrast, our rates are asymptotically identical to the rates that we would get under the stronger assumption that a solution to (mvi) exists, and sidestep the intractability results for (1.1) suggested by Daskalakis et al. (2021) for general smooth objectives.
1 Further Related Work
A large number of recent works target identifying practical first-order, low-order, or efficient online learning methods for solving min-max optimization problems in a variety of settings, ranging from the well-behaved setting of convex-concave objectives to the challenging setting of nonconvex-nonconcave objectives. There has been substantial work for convex-concave and nonconvex-concave objectives, targeting the computation of min-max solutions to (1.1) or, respectively, stationary points of or . This work has focused on attaining improved convergence rates Kong and Monteiro (2019), Lin et al. (2020b), Thekumparampil et al. (2019), Nouiehed et al. (2019), Lu et al. (2020), Zhao (2019), Alkousa et al. (2019), Azizian et al. (2020), Golowich et al. (2020), Lin et al. (2020a), Diakonikolas (2020) and/or obtaining last-iterate convergence guarantees Daskalakis et al. (2018), Daskalakis and Panageas (2018), Mazumdar et al. (2020), Mertikopoulos et al. (2018), Lin et al. (2018), Hamedani and Aybat (2018), Adolphs et al. (2019), Daskalakis and Panageas (2019), Liang and Stokes (2019), Gidel et al. (2019), Mokhtari et al. (2020), Abernethy et al. (2019), Liu et al. (2020).
In the nonconvex-nonconcave setting, research has focused on identifying different notions of local min-max solutions Daskalakis and Panageas (2018), Mazumdar et al. (2020), Jin et al. (2020), Mangoubi and Vishnoi (2021) and studying the existence and (local) convergence properties of learning methods to these points Wang et al. (2019), Mangoubi et al. (2020), Mangoubi and Vishnoi (2021). As already discussed, recent work of Daskalakis et al. (2021) shows that, for general smooth objectives, the computation of even approximate first-order locally optimal min-max solutions is intractable, motivating the identification of structural assumptions on the objective function for which these intractability barriers can be bypassed.
An example such assumption, which is closely related to the one made in this work, is that an (mvi) solution exists for the operator F([\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\mathbf{x}\\ \mathbf{y}\crcr}}])=\big{[}\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\nabla_{\mathbf{x}}f(\mathbf{x},\mathbf{y})\\ -\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y})\crcr}}\big{]}, as studied by Zhou et al. (2017), Lin et al. (2018), Mertikopoulos et al. (2019), Malitsky (2019), Liu et al. (2020), Song et al. (2020). As we have already discussed, the assumption we make for our main result in this work is weaker. Table 1 provides a comparison of our results to those of existing works, considering the deterministic setting (i.e. having exact value access to ).
In unconstrained Euclidean setups, the best known convergence rates are of the order Dang and Lan (2015), Song et al. (2020), under the assumption that an (mvi) solution exists. We obtain the same rate under our weaker Assumption 1. Moreover, under our weaker assumption, we show that the accumulation points of the sequence of iterates of our algorithm are (svi) solutions. This was previously established for alternative algorithms and under the stronger assumption that an (mvi) solution exists Mertikopoulos et al. (2019), Malitsky (2019).
Finally, it is worth noting that Zhou et al. (2017), Mertikopoulos et al. (2019), Malitsky (2019), Song et al. (2020) consider constrained optimization setups, which are not considered in our work. We believe that generalizing our results to constrained setups is possible, and defer such generalizations to future work.
Notation and Preliminaries
We are interested in finding stationary points for min-max problems of the form:
The Minty Variational Inequality problem consists in finding such that:
in which case is referred to as the weak solution to the variational inequality corresponding to and . If we assume that is monotone, then (2.1) implies that every solution to (svi) is also a solution to (mvi), and the two solution sets are equivalent. More generally, if is not monotone, all that can be said is that the set of (mvi) solutions is a subset of the set of (svi) solutions. In particular, (mvi) solutions may not exist even when (svi) solutions exist. These facts follow from Minty’s theorem (see, e.g., (Kinderlehrer and Stampacchia, 2000, Chapter 3)).
We will not, in general, be assuming that is monotone. Note that the Lipschitzness of on its own is not sufficient to guarantee that the problem is computationally tractable, as discussed in the introduction. Thus, additional structure is needed, which we introduce in the following.
We define the class of problems with weak (mvi) solutions as the class of problems in which satisfies the following assumption.
There exists such that:
for some parameter \rho\in\big{[}0,\frac{1}{4L}\big{)}.
2 Example Settings Satisfying Assumption 1
The class of problems that have weak (mvi) solutions in the sense of Assumption 1 generalizes other structured non-monotone variational inequality problems, as we discuss in this section.
When we recover the class of problems that have an (mvi) solution. This class contains all unconstrained variationally coherent problems studied in, e.g., Zhou et al. (2017), Mertikopoulos et al. (2019), which encompass all min-max problems with objectives that are bilinear, pseudo-convex-concave, quasi-convex-concave, and star-convex-concave.
When and , Assumption 1 is implied by being -comonotone Bauschke et al. (2020) or -cohypomonotone Combettes and Pennanen (2004), defined as follows:
In particular, Assumption 1 is equivalent to requiring that (2.2) be satisfied for general and , where is a solution to (svi) (in which case ). Note that Assumption 1 does not imply that a solution to (mvi) exists, unless . It is further important to note that cohypomonotone operators arise as inverses of operators that only need to be Lipschitz-continuous. (In fact, even a weaker property suffices; see Bauschke et al. (2020).) This is particularly interesting as, combined with our main result, it implies that we can efficiently find zeros of inverses of Lipschitz-continuous operators, as long as those inverses are sufficiently Lipschitz, even though finding zeros of Lipschitz-continuous operators is computationally intractable, in general, as we have discussed.
It is interesting to note that Assumption 1 does not imply that, in the min-max setting, is convex-concave (or, more generally, that is monotone), even in a neighborhood of an (svi) solution \mathbf{u}^{*}=\big{[}\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\mathbf{x}^{*}\\ \mathbf{y}^{*}\crcr}}\big{]}, i.e., a stationary point of . To see this, fix and consider for in a small neighborhood of Using the fact that a continuously-differentiable function is well-approximated by its linear approximation within small neighborhoods, all that we are able to deduce from Assumption 1 is that
In particular, Assumption 1 does not preclude that is larger than ; it only bounds how much larger it can be by a quantity proportional to . Compare this also to the Polyak-Łojasiewicz condition (see, e.g., Nouiehed et al. (2019), Yang et al. (2020)), which imposes the opposite inequality, namely, that is bounded above by a multiple of .
One way that a generic operator may satisfy Assumption 1 is when there is a constant such that for some we have
and when the operator does not plateau or become too close to a linear operator around namely, . (Note that (2.3) is always satisfied with for -Lipschitz operators, but we may need to be smaller than ). Then Assumption 1 would be satisfied with For a min-max problem, assuming is twice differentiable, this would mean that the lowest eigenvalue of the symmetric part of the Jacobian of \big{[}\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\nabla_{\mathbf{x}}f(\mathbf{x},\mathbf{y})\\ -\nabla_{\mathbf{y}}f(\mathbf{x},\mathbf{y})\crcr}}\big{]} is bounded below by in any direction and the function is sufficiently “curved” (not close to a linear or a constant function) around \mathbf{u}^{*}=\big{[}\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\mathbf{x}^{*}\\ \mathbf{y}^{*}\crcr}}\big{]}.
Finally, we discuss a concrete min-max application wherein there are no (mvi) solutions, but there do exist (svi) solutions satisfying the weak (mvi) condition of Assumption 1. This application arises in the context of two-agent zero-sum reinforcement learning problems studied by many authors, including recently by Daskalakis et al. (2020). In Section 5.1 of that work, the authors consider a special case of the general two-agent zero-sum RL problem, called von Neumann’s ratio game, for which they observe that, even on a random example, the (mvi) solution set is empty, yet the extragradient method still converges in practice (albeit at a slower rate). Interestingly, it is easy to construct examples of the von Neumann ratio game for which no (mvi) solution exists, but the weak (mvi) condition of Assumption 1 does hold, and yet the stronger cohypomonotonicity condition of (2.2) does not hold. Indeed, one such example is obtained for the game shown in Proposition 2 of their paper, setting and . Here (mvi) fails, the weak (mvi) condition of Assumption 1 is satisfied, and cohypomonotonicity fails to hold, e.g., for and . To be clear, the von Neumann ratio game gives rise to a constrained min-max problem while our algorithm targets the unconstrained setting. While extending our result to the constrained setting remains open, our example here demonstrates that there is value in further studying the weak (mvi) condition of Assumption 1 in the constrained setting as well.
3 Useful Definitions and Facts
Observe that when we recover the standard definition of strong convexity. Thus, uniform convexity is a generalization of strong convexity.
It is immediate that the Bregman divergence of a convex function is non-negative.
We now outline some useful auxiliary results used specifically in Section 4, where we study the case that is not necessarily equal to 2.
Then, for :
The statements in the proposition are simple corollaries of conjugacy of the functions and . In particular, the first part follows from
by the definition of a convex conjugate and using that and are conjugates of each other, which are standard exercises in convex analysis for (see, e.g., (Borwein and Zhu, 2004, Exercise 4.4.2) and (Boyd et al., 2004, Example 3.27)).
Another useful result is the following proposition, which will allow us to relate Lipschitzness of to uniform convexity of the prox mapping in the definition of the algorithm. The ideas used in the proof can be found in the proofs of (d’Aspremont et al., 2018, Lemma 5.7), (Nesterov, 2015, Lemma 2), and in (Devolder et al., 2014, Section 2.3). The proof is provided for completeness.
For any , , , and ,
The proof is based on the Fenchel-Young inequality and the conjugacy of functions and for , which implies In particular, setting , and we have
It remains to set which, solving for gives y=\big{(}\frac{\delta q\kappa}{2L(q-\kappa)}\big{)}^{q-\kappa}, and verify that, under this choice, ∎
Generalized Extragradient for Problems with Weak MVI Solutions
In this section, we consider the setup with the Euclidean norm , i.e., To address the class of problems with weak (mvi) solutions (see Assumption 1), we introduce the following generalization of the extragradient algorithm, to which we refer as Extragradient+ (eg).
where is a parameter of the algorithm and is the step size. When we recover standard eg.
The analysis relies on the following merit (or gap) function:
for some for which satisfies Assumption 1. Then Assumption 1 implies that
The first (and main) step is to bound all ’s above, as in the following lemma.
where is defined as in Eq. (3.1).
Fix any and write equivalently as
The proof proceeds by bounding from above individual terms on the right-hand side of Eq. (3.3). For the first term, the first-order optimality in the definition of gives:
For the second term on the right-hand side of Eq. (3.3), the first-order optimality in the definition of implies:
which, similarly as for the first term, leads to:
For the third term on the right-hand side of Eq. (3.3), applying the Cauchy-Schwarz inequality, -Lipschitzness of and Young’s inequality, respectively, we have:
where the last inequality holds for any
Using the fact that , and combining Eqs. (3.4)-(3.6) with Eq. (3.3), we have:
Using Lemma 3.1, we can now draw conclusions about the convergence of eg by choosing parameters and the step sizes to guarantee that as long as .
all accumulation points of are in .
Applying Lemma 3.1 with the choice of and from the theorem statement and with we get
By Assumption 1, and, thus, the constant multiplying is strictly negative.
As (by Assumption 1), we can conclude that
As \frac{1}{4L}\Big{(}\frac{1}{4L}-\rho\Big{)}>0, Eq. (3.7) implies that converges to zero as Further, as using triangle inequality and
where we have used that is -Lipschitz. Now, as is bounded (by from Eq. (3.7)), it follows that the sequence is bounded as well, and thus has a converging subsequence. Let be any converging subsequence of and let be its corresponding accumulation point. Then, as converges to zero as it follows that converges to zero as and so it must be
For Part (ii), telescoping Eq. (3.7), we get:
and . ∎
Due to Eq. (3.8), we have that all the iterates of eg with the parameter setting as in Theorem 3.2 remain in the ball centered at and of radius at most . Thus, Assumption 1 does not need to hold globally for the result to apply; it suffices that it only applies locally to points from the ball around with radius .
It is possible to obtain similar convergence results as those of Theorem 3.2 under different parameter choices. In particular, for it suffices that and We settled on the choice made in Theorem 3.2 as it is simple and requires tuning only one parameter, .
where and denote the natural filtrations, including all the randomness up to the construction of points and respectively, and are the variance constants. Observe that and To simplify the notation, we denote:
The variant of the method we consider here is stated as follows:
This is immediate for by the definition of and using that and when For we have that and Eq. (4.6) follows by strong convexity of
As in the case of Euclidean norms, the analysis relies on the following merit function:
Moreover, as before, Assumption 1 guarantees that
We start by first proving a lemma that holds for generic choices of algorithm parameters and We will then use this lemma to deduce the convergence bounds for different choices of and both deterministic and the stochastic oracle access to
where is defined as in Eq. (4.7), is any positive number, and \Lambda_{k}=\Big{(}\frac{q-2}{\delta_{k}q}\Big{)}^{\frac{q-2}{2}}L^{q/2}. When the statement also holds with and
We begin the proof by writing equivalently as:
The proof now proceeds by bounding individual terms on the right-hand side of the last equality.
Equivalently, applying the definition of to the last inequality:
where the last inequality follows from Eq. (4.6).
where the inequality is by and the fact that is -uniformly convex w.r.t. with constant , by the choice of from Eq. (4.3). Applying the definition of to the last inequality:
where is by Hölder’s inequality, is by -Lipschitzness of and is by Young’s inequality, which holds for any Now, let and \Lambda_{k}=\Big{(}\frac{2(q-\kappa)}{\delta_{k}q\kappa}\Big{)}^{\frac{q-\kappa}{\kappa}}L^{q/\kappa}. Then, applying Proposition 2.6 to the last two terms in the last inequality:
Observe that when there is no need to apply Proposition 2.6, and the last inequality is satisfied with and
Combining Eqs. (4.9)-(4.11) with Eq. (4.8), we have:
The main result is summarized in the following theorem.
Let . If , then all accumulation points of are in and, furthermore :
In particular, within k=O\big{(}\frac{L^{2}\|\mathbf{u}^{*}-\mathbf{u}_{0}\|_{p}^{2}}{{(p-1)}^{2}\epsilon^{2}}\big{)} iterations eg can output a point with
Let . If , \Lambda=\big{(}\frac{q-2}{\delta q}\big{)}^{\frac{q-2}{2}}L^{q/2}, and , then, :
In particular, for any there is a choice of where is a constant that only depends on , such that eg can output a point with in at most
iterations. Here, the notation hides constants that only depend on
Observe that, as and Lemma 4.1 and the definition of give:
Proof of Part (i). In this case, we can set (see Lemma 4.1), and Therefore, setting , and we get from Eq. (4.12) that
It follows that converges to zero as By the definition of and Proposition 2.5, and so converges to zero as Further, as and it follows that is bounded, and, thus, is a bounded sequence. The proof that all accumulation points of are in is standard and omitted (see the proof of Theorem 3.2 for a similar argument).
To bound , we telescope the inequality from Eq. (4.13) to get:
Proof of Part (ii). In this case, , and Using Proposition 2.5, and Combining with Eq. (4.12), we have:
Now let , and . Then and Eq. (4.14) simplifies to:
Telescoping the last inequality and then dividing by we have:
Now, for eg to be able to output a point with it suffices to show that for some choice of and we can make the right-hand side of Eq. (4.15) at most . This is true because then eg can output the point For stochastic setups, the guarantee would be in expectation, and eg could output a point with chosen uniformly at random from , as discussed in the proof of Theorem 3.2.
Observe first that, as \Lambda=\big{(}\frac{p-2}{p\delta}\big{)}^{\frac{p-2}{2}}L^{p/2} and we have that:
Setting , recalling that , and rearranging, we have:
It can be verified numerically that is a constant between and while it is clear that is a constant that only depends on . Hence, it suffices to set where
It remains to bound the number of iterations so that Equivalently, we need . Plugging into the definition of using that and simplifying, we have:
Stochastic oracle access.
We start by bounding the stochastic error in expectation.
Let , where and are defined as in Eq. (4.2) and all the assumptions of Theorem 4.5 below apply. Then, for defined by Eq. (4.3) and any :
where the expectation is w.r.t. all the randomness in the algorithm.
Let us start by bounding first. Conditioning on , is independent of and , and, thus:
The second term, can be bounded using Hölder’s inequality and Young’s inequality as follows:
where the last line is by Jensen’s inequality, as and so is concave. Using Young’s inequality and linearity of expectation:
We are now ready to bound the total oracle complexity of eg (and its special case eg), as follows.
Combining Lemmas 4.1 and 4.4, we have, :
Proof of Part (i). In this case, , and and, further, so Eq. (4.17) simplifies to
Taking and and recalling that we have:
Telescoping the last inequality and dividing both sides by we get:
To finish the proof of this part, we require that both terms on the right-hand side of the last inequality are bounded by For the first term, this leads to:
Proof of Part (ii). In this case, , and Thus, Eq. (4.17) simplifies to:
Telescoping the last inequality and dividing both sides by we have:
and the bound on the total number of samples follows.
Proof of Part (iii). In this case, , and we take \Lambda_{k}=\Lambda=\big{(}\frac{p-2}{p\delta}\big{)}^{\frac{p-2}{2}}L^{\frac{p}{2}}. Eq. (4.17) now simplifies to:
Telescoping the last inequality and then dividing both sides by we have:
To complete the proof, as before it suffices to show that we can choose and so that and . For the former, following the same argument as in the proof of Theorem 4.2, Part (ii), it suffices to choose , which leads to:
The total number of queries to the stochastic oracle is then bounded by ∎
Discussion
Acknowledgements
We wish to thank Steve Wright for a useful discussion regarding convergence of sequences. We also wish to thank the Simons Institute for the Theory of Computing where some of this work was conducted.
JD was supported by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin–Madison with funding from the Wisconsin Alumni Research Foundation and by the NSF Award CCF-2007757. CD was supported by NSF Awards IIS-1741137, CCF-1617730, and CCF-1901292, by a Simons Investigator Award, by the Simons Collaboration on the Theory of Algorithmic Fairness, and by the DOE PhILMs project (No. DE-AC05-76RL01830). MJ was supported in part by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764.