Escaping limit cycles: Global convergence for constrained nonconvex-nonconcave minimax problems
Thomas Pethick, Puya Latafat, Panagiotis Patrinos, Olivier Fercoq, Volkan Cevher
Introduction
Many machine learning applications, from generative adversarial networks (GANs) to robust reinforcement learning, result in nonconvex-nonconcave constrained minimax problems, which pose notorious difficulties to the scalable first order methods. Indeed, there is no shortage of results illustrating divergent or cycling behavior when going beyond minimization problems (Benaım & Hirsch, 1999; Hommes & Ochea, 2012; Mertikopoulos et al., 2018b; Hsieh et al., 2021).
Traditionally, minimax problems have been studied for more than half a century under the umbrella of the variational inequalities (VIs). The extragradient-type algorithms from the VI literature was recently brought to the awareness of the machine learning community (Mertikopoulos et al., 2018a; Gidel et al., 2018; Böhm et al., 2020), and have provided a principled way of stabilizing training and avoiding Poincaré recursions. However, these results mostly concern the convex-concave setting.
In nonconvex-nonconcave minimax problems, or more generally nonmonotone variational inequalities (VIs), even finding a local solution is in general intractable. This has been made precise through exponential lower bound of the classical optimization type (Hirsch & Vavasis, 1987) and computational complexity results (Papadimitriou, 1994; Daskalakis et al., 2021b). This is in sharp contrast to minimization problems, where only finding a global solution is intractable. The recent result of (Hsieh et al., 2021) provides some intuition behind this difference by showing that the asymptotic limits of most schemes, including extragradient, can converge to attracting limit cycles.
To make progress in lieu of these negative results, Diakonikolas et al. (2021) proposes a simple generalization of extragradient, called (EG+), that can converge to a stationary point even for a class of nonmonotone problems provided that the weak Minty variational inequality (MVI) holds. This problem class is parametrized by a constant , which controls the degree of nonconvexity. However, given the range of in Diakonikolas et al. (2021), the new class is still too small to include even the simplest counterexample of Hsieh et al. (2021) for the general Robbins-Monro schemes.
Building on the analysis in Diakonikolas et al. (2021), we propose a new adaptive scheme, called (CurvatureEG+), that converges even in the difficult counter example of Hsieh et al. (2021) as illustrated in Fig. 1. Our main contributions are summarized below.
We propose an adaptive extragradient-type algorithm that converges for a larger range of , the parameter in the weak MVI assumption (cf. Item 3) than previously known.
More importantly, we show that convergence is ensured if , where is the extrapolation stepsize. This is crucial since by selecting through a backtracking procedure larger stepsizes are allowed, which in turn implies convergence for more negative values of , thus capturing a larger class of problems. In addition, we show that the linesearch eventually passes without triggering any backtrack if initialized based on the Jacobian of (cf. Section 4).
We present a non-adaptive variant of our algorithm (CEG+), and show that for particular parameter choices (EG+) of Diakonikolas et al. (2021), and when the celebrated forward-backward-forward (FBF) algorithm of Tseng (2000) are recovered, thus unifying and generalizing both methods. We improve upon Diakonikolas et al. (2021) by not only relaxing the problem class but also the stepsize range. We show that our results are tight by providing a matching lower bound, thus providing a complete picture of (EG+) under weak MVI.
Related work
The community has resorted to various approaches to make progress for nonconvex-nonconcave minimax problems. One line of work focuses on deriving local convergence results (Mazumdar et al., 2019; Fiez & Ratliff, 2020; Heusel et al., 2017). For global results, the two primary approaches have been to either assume a global oracle for the inner problem (Jin et al., 2019; Davis & Drusvyatskiy, 2018) or assume particular problem structure such as the Polyak-Łojasiewicz condition (Nouiehed et al., 2019; Yang et al., 2020) or concavity for the inner problem (Rafique et al., 2019).
We follow the same tradition of assuming structure, but from the general perspective of operator theory. The idea of studying minimax and related problems through the lens of variational inequality has a long history (Minty, 1962; Rockafellar, 1976; Polyak, 1987; Bertsekas, 1997), with recent renewed interest due to its relevance for minimax formulations (Mertikopoulos et al., 2018a; Gidel et al., 2018; Azizian et al., 2020).
One relaxation of the monotone case for which we have positive results is that of Minty variational inequalities (MVI) (Mertikopoulos et al., 2018a; Song et al., 2021; Zhou et al., 2017; Liu et al., 2021), which includes all quasiconvex-concave and starconvex-concave problems. Diakonikolas et al. (2021) introduced the relaxed condition of weak MVI. In the unconstrained setting they showed non-asymptotic convergence results under a restricted problem constant . Similarly to us, Lee & Kim (2021a) extends the regime but under the stronger condition of cohypomonotonicity. They do so by studying a more evolved variant of extragradient building on anchoring techniques. We instead directly improve upon (EG+) and generalize it to new settings.
In the stochastic setting, usually the stepsize for the extrapolation step is diminishing. This is the case in Böhm et al. (2020) where they consider a forward-backward-forward type scheme. However, they remain in the monotone setting, where the limit cycles are non-attracting, as exemplified by a bilinear game. Hsieh et al. (2021) recently showed that a large family of algorithms, which includes the extragradient method with diminishing stepsize, can converge to attracting limit cycles. Going beyond this restriction, prior to Diakonikolas et al. (2021), Hsieh et al. (2020) interestingly considers two separate and diminishing stepsizes under the stronger assumption of MVI.
Problem formulation and preliminaries
holds. The set of all such points is denoted by . Throughout the paper problem (2.1) is studied under the following assumptions (definitions can be found in Appendix A).
Weak Minty variational inequality (MVI) holds, i.e., there exists a nonempty set such that for all and some
Generally, we do not require the weak Minty assumption to hold at every . In fact, as shown in Theorem 3.1 nonemptiness of is sufficient for ensuring that the limit points belong to . Interestingly, despite nonmonotonicity of , global (as opposed to subsequential) convergence can be established when , an assumption that is still weaker than cohypomonotonicity.
VIs provide a convenient abstraction for a range of problems. We mention some central examples below but otherwise defer to the overview in Facchinei & Pang (2007). Subsequently, we provide examples where the weak MVI holds.
Example 1: (minimax optimization). A comprehensive way to capture a wide range of applications in machine learning is to consider structured minimax problems of the form
As it will become clear in the next section (cf. Algorithm 1), the main computations involved in the proposed scheme are evaluations of and resolvent . Recall that the resolvent of a maximally monotone operator is firmly nonexpansive with full domain (cf. (Bauschke & Combettes, 2017, Sect. 23)). If is the subdifferential operator of a convex function , then its resolvent is the proximal mapping. For instance when is as in Assumption I, then its resolvent is given by .
The corresponding first order optimality conditions may be written as and Fz=({\nabla}_{z_{1}}\varphi_{1}(\mathchoice{\text{\boldmath{\displaystyle z}}}{\text{\boldmath{\textstyle z}}}{\text{\boldmath{\scriptstyle z}}}{\text{\boldmath{\scriptscriptstyle z}}}),\ldots,{\nabla}_{z_{N}}\varphi_{N}(\mathchoice{\text{\boldmath{\displaystyle z}}}{\text{\boldmath{\textstyle z}}}{\text{\boldmath{\scriptstyle z}}}{\text{\boldmath{\scriptscriptstyle z}}})).
A solution to (2.1) thus returns a candidate for which the first order condition of the above problems is satisfied. In the monotone case these two solution concepts coincide, while in the more general case of weak MVI, we provide examples where this still holds. In particular, we introduce in Section 5 a nonconvex-nonconcave minimax game which additionally exhibits limit cycles for . As a consequence most schemes including gradient descent ascent, extragradient and optimistic gradient descent ascent do not converge to a stationary point globally (Hsieh et al., 2021). However, the global Nash equilibrium satisfies Item 3 with , which we show is sufficient for global convergence of (CEG+).
The weak MVI condition is satisfied in certain reinforcement learning settings. Specifically, Diakonikolas et al. (2021); Daskalakis et al. (2021a) considers a two-player zero-sum game where the weak MVI holds, while neither MVI nor cohypomonotonicity holds. Interestingly, the formulation requires constraint—a condition they do not handle. We thus provide the first provable algorithm for this setting. Weak MVI also contains all quasiconvex-concave and starconvex-concave problems. For further examples, the literature on cohypomonotonicity (Bauschke et al., 2020) is relevant since it implies weak MVI, see for instance Lee & Kim (2021b, Example 1).
Generalizing Extragradient+
Our starting point is the Extragradient+ (EG+) algorithm of Diakonikolas et al. (2021) which is identical to extragradient (Korpelevich, 1976) except for the second stepsize being smaller. They only treat the inclusion (2.1) when , and in our notation require . Specifically,
where they choose and (Diakonikolas et al., 2021, Thm. 3.2).
We generalize (EG+) in Algorithm 1 to take the operator into account—consequently we capture constraint and regularized problems as well. In addition, the scheme is adaptive in . We will show that the weaker requirement of suffices even for the more general inclusion (2.1).
The main convergence results of Algorithm 1 are established in the next theorem. The proof is largely inspired by recent developments in operator splitting techniques in the framework of monotone inclusions (Latafat & Patrinos, 2017; Giselsson, 2021). The key idea lies in interpreting each iteration of the algorithm as a projection onto a certain hyperplane, an interpretation that dates back to Solodov & Tseng (1996); Solodov & Svaiter (1999).
where . Moreover, the following holds
Note that whenever , Item 2 may be used to derive a similar inequality in terms of by lower bounding in (3.1). We also remark that tighter rates may be obtained in the regime , however, this will not be pursued in this work.
Although we do not incur additional costs for evaluating the adaptive stepsize in 1.4, it proves instructive to present a variant with constant stepsize. As a result we compare the range of our stepsizes against Diakonikolas et al. (2021) showing an improvement by a factor of . Moreover, in the monotone case (), with a certain choice of stepsizes the algorithm reduces to the celebrated forward-backward-forward (FBF) algorithm of Tseng (2000). We remark that the relation of FBF to projection-type algorithms was noted in Tseng (2000), (Giselsson, 2021, Sect. 6.2.1).
To this end, in this subsection consider the following non-adaptive variant of Algorithm 1 that generalizes (EG+). Letting :
The convergence of this algorithm is an immediate byproduct of Theorem 3.1. To see this, note that the update in 1.5 may be written as , for . Therefore, convergence is still ensured for any as the difference may be absorbed by the relaxation parameter . Note that by -cocoercivity of (cf. Item 1)
where . Moreover, the claims of Items 1 and 2 hold true.
The setting of Diakonikolas et al. (2021) in (EG+) involves the stepsizes , . Note that when restricting to , the iterates (CEG+) simplify to this form owing to the fact that . In comparison, in our setting if (the smallest permitted in Diakonikolas et al. (2021)) is selected, then based on our analysis in Corollary 3.2 we may select , and , thus the upper bound for the second stepsize is times that of Diakonikolas et al. (2021).
In Corollary 3.2 the range of stepsizes , may alternatively be set as \gamma\in\big{(}\lfloor-2\rho\rfloor_{+},\nicefrac{{1}}{{L}}\big{)}, . This is due to the fact that if (strictly), then is strictly -cocoercive. Therefore, in (3.2), holds, and thus the stepsize is permitted. Although this may appear to be of little practical significance, by setting , , and in (CEG+), we obtain , which is the forward-backward-forward (FBF) algorithm of Tseng (2000), (Bauschke & Combettes, 2017, Thm. 26.17)). ∎
2 Lower bounds
We show that the result in Corollary 3.2 is tight by providing a matching lower bound when . We do so by fixing and showing a stepsize dependent lower bound. In particular, note that if as in Diakonikolas et al. (2021, Thm. 3.2), then Theorem 3.4 implies a lower bound of for the (EG+) scheme. The lower bound is contextualized in Fig. 2 by relating it to our convergence results and existing results in the literature.
Adaptively taking larger stepsizes using local curvature
The proposed scheme involves a backtracking linesearch that uses the local curvature for its initial guess. The reason being that this will immediately pass, close enough to the solution , by argument of continuity. More precisely, we will set the initial guess to something slightly smaller than , where denotes the Jacobian of at and is the spectral norm. Note that, despite the use of second order information, the scheme remains efficient since only requires one eigenvalue computation performed through Jacobian-vector product (Pearlmutter, 1994).
Given an initial point and , the final scheme which we denote (CurvatureEG+) proceeds for as follows:
The linesearch terminates in finite time with ;
The convergence results for (CurvatureEG+) are deduced based of the above lemma and Theorem 3.1 and are provided in Corollary B.1 in Section B.2. We illustrate the behavior of (CurvatureEG+) in Fig. 1 and in Section 6.
Constructing toy examples
When Item 3 holds for negative , limit cycles of the underlying operator can emerge. We illustrate this with simple polynomial examples for which all the properties of interest can be computed in closed form.
A PolarGame denotes a two-player game whose associated operator has limit cycles at for all where .
This turns out to be particularly easy to construct in polar coordinates as the name suggests (see Section C.1). Apart from introducing arbitrary number of limit cycles it also gives us control over . This is illustrated in the following instantiations capturing three important cases.
Example 3: (PolarGame). Consider where and . We have the following three cases:
(i) then (ii) then (iii) then
where denotes the Lipschitz constant of restricted to the constraint set. For all cases exhibits limit cycles at and . Proof is deferred to Section C.2.
Example 4: (minimax). In the particular case of constrained minimax problem we introduce the following polynomial game:
where . We provide proof of the following properties in Section C.3:
There exists a repellant limit cycle and an attracting limit cycle of .
is a global Nash equilibrium for which Item 3 holds inside the constraint with , where denotes the Lipschitz constant of restricted to the constraint set.
Experiments
The algorithms considered in the experiments include the adaptive Algorithm 1, (CurvatureEG+), and constant stepsize methods that can be seen as instances of (CEG+) for various choices of and . When and we recover a constrained variant of extragradient, which we denote CEG. When we denote the scheme CEG+, which is the direct generalization to the constraint setting of the (EG+) scheme studied in Diakonikolas et al. (2021, Thm. 3.2). Note that this choice of restricts the problem class for which we otherwise can have guaranteed convergence according to Corollary 3.2. When is chosen adaptively according to Algorithm 1 we refer to it as AdaptiveEG+. Finally, when is additionally chosen adaptively we use the name (CurvatureEG+).
In the stochastic setting, when and , effectively both stepsizes diminish, and we recover a constrained variant of the popular stochastic extragradient scheme (see e.g. Hsieh et al. (2021, Algorithm 3)), which we refer to as SEG. We also consider a heuristic variant where and only is decreasing, which we refer to as SEG+.
Conclusion
This paper introduced an EG-type algorithm for a class of nonconvex-nonconcave minimax problems that satisfy the weak Minty variational inequality (MVI). The range of parameter in the weak MVI was extended compared to EG+ of Diakonikolas et al. (2021), and tightness of our results were demonstrated through construction of a counter example. In addition, EG+ (Diakonikolas et al., 2021), as well as the forward-backward-forward algorithm (Tseng, 2000) were all shown to be special cases of our scheme. Furthermore, (CurvatureEG+) was proposed that performs a backtracking linesearch on the extrapolation stepsize allowing for larger stepsizes and relaxes the condition to which is often a much weaker condition. More importantly, it is shown that asymptotically the linesearch always passes with for any , thus ratifying the name (CurvatureEG+). Future direction include exploring applications of the proposed algorithm in particular in the setting of GANs. It is also interesting to develope a variance reduced variant of the algorithm for finite sum minimax problems.
Acknowledgments and disclosure of funding
We would like to especially thank Yu-Guan Hsieh for providing valuable feedback and discussion. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement n° 725594 - time-data). This work was supported by the Swiss National Science Foundation (SNSF) under grant number 200021_205011. The work of the second and third author was supported by the Research Foundation Flanders (FWO) postdoctoral grant 12Y7622N and research projects G081222N, G0A0920N, G086518N, and G086318N; Research Council KU Leuven C1 project No. C14/18/068; Fonds de la Recherche Scientifique – FNRS and the Fonds Wetenschappelijk Onderzoek – Vlaanderen under EOS project no 30468160 (SeLMA); European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 953348. The work of Olivier Fercoq was supported by the Agence National de la Recherche grant ANR-20-CE40-0027, Optimal Primal-Dual Algorithms (APDO).
References
Appendix A Preliminary definitions
Notationally we will use throughout. We additionally recall some standard definitions and results and refer to Bauschke & Combettes (2017); Rockafellar (1970)) for further details.
and it is said to be -comonotone if for all
The operator is said to be maximally (co)monotone if its graph is not strictly contained in the graph of another (co)monotone operator.
We say that is monotone if it is -monotone. When , -comonotonicity is also referred to as -cohypomonotonicity.
Moreover, is said to be nonexpansive if it is -Lipschitz continuous, and firmly nonexpansive if it is -cocoercive.
The following lemma plays an important role in our convergence analysis.
is -Lipschitz if and only if is -cocoercive.
The first claim follows directly from (Bauschke & Combettes, 2017, Prop.4.11). That is strongly monotone is a consequence of the Cauchy Schwarz inequality and Lipschitz continuity of :
In turn, the last claim follows from the Cauchy-Schwarz inequality. ∎
Appendix B Proofs and further results
Let . By 1.3 . Therefore,
In what follows we will show that Algorithm 1 is equivalent to taking a forward-backward step followed by a correction step. Consider the updates
where in the inequality Item 1 was used. Hence, by (B.3) the stepsize is positive and bounded away from zero. Moreover, if , then from (B.3) we may conclude that which implies that the generated sequence remains constant and (cf. (B.1)).
The projection onto for any is given by
Moreover, (B.1) together with Item 3 at yields
thus ensuring . The projection onto is then given by where is as in 1.4.
The proof of convergence was already given prior to the statement of the corollary. It remains to derive (3.3). By Item 3 and owing to -cocoercivity of (cf. Item 1)
Therefore, provided that we have
Telescoping the above inequality yields the claimed inequality. ∎
B.2 Convergence results and proofs of Section 4
The convergence results for (CurvatureEG+) are provided in the next corollary where in Item 3 is allowed to take potentially larger values provided that . Note that owing to the lower bound on (cf. Item 1), the weak MVI assumption in the corollary is always satisfied if , however, in practice may take larger values.
Moreover, if (as is the case in 3), and is continuously differentiable, then eventually the backtrack will never be invoked.
Observe that in the proof of Theorem 3.1 -Lipschitz continuity of is only used at the generated points and (see (B.3)), and is thus ensured by the linesearch Algorithm 2. Therefore, it is easy to see that is positive and bounded away from zero provided that , see (B.3). Moreover, since , arguing as in Item 2 we obtain . Hence, it follows from (B.5) that
1: Since is -Lipschitz continuous the linesearch would terminate in finite steps. Either satisfies the condition, or else the backtrack procedure is invoked, which in turn implies the previous candidate should have violated the condition leading the the claimed lower bound.
B.3 Proofs of Section 3.2
To prove the lower bound we introduce the following unconstrained bilinear minimax problem with an unstable critical point.
Example 5: Consider the following minimax problem:
The associated operator of Section B.3 can easily be computed,
where . In this particular case, both and turn out to be constants. By simple calculation we have,
where is the spectral norm. Since the norm of the Jacobian is constant it equates the global Lipschitz constant, .
By linearity of , one step of (EG+) is conveniently also a linear operator. Specifically,
We know that a linear dynamical system is globally asymptotically stable if and only if the spectral radius of the linear mapping is strictly less than 1.
Let be the eigenvalues of . Then the spectral radius is the largest absolute value of the eigenvalues. For this becomes,
Equation (B.13) provides a specification for Section B.3. As long as (B.12) is satisfied, (EG+) is guaranteed to converge for . On the other hand, since (B.10) is a linear system, we simultaneously learn that picking any larger would imply non-convergence through (given ). We can trivially embed problem (B.10) into a higher dimension to generalize the result. Noting that completes the proof. ∎
We provide Mathematica code to verify each step of the above proof.The supplementary code can be found at https://github.com/LIONS-EPFL/weak-minty-code/.
Appendix C Toy examples
In the following appendix, denotes the Lipschitz constant of restricted to the constraint set and is the parameter of the weak MVI (Item 3) when restricted to the constraint set. This restriction of the definitions is warranted, since remains within the constraint set in all simulations, while is guaranteed to stay within by definition of 1.3 in Algorithm 1 (and likewise for all other considered method treating problem (2.1)).
All computer-assisted calculations can be found in the supplementary code.1
with . Transforming this dynamics into cartesian coordinates yields the desired vectorfield, , while subsequently integrating with respect to and yields the two potentials associated with the two players. Note that the roots for the polynomial defining are not strictly necessary for showing existence of limit cycles, but leads to a simpler form for . We illustrate the construction in Fig. 5.
Let be the evolution in cartesian coordinates of the associated vectorfield in polar coordinates defined by (C.1). Then the only stationary point of is at the origin and there exists a limit cycle at for all .
Let . It is easy to see from (C.1) that the only stationary point is at . By construction, is a polynomial with roots for all , so any trajectory starting on the circle defined by remains in that set. However, is strictly nonzero. As a consequence is nonzero, so must define a limit cycle, which proofs the claim. ∎
C.2 Proof for properties of Definition 1
This can easily be verified by a change of variables. From Proposition 1 it then follows, that there must exist a limit cycle at and . To verify the conditions on we compute the closed form solution to and in Mathematica:
For we have and
For we have and
For we have and
It can easily be verified that the stated conditions for in Definition 1 are met for the values above. This completes the proof.
We provide Mathematica code verifying the construction of and the closed form solutions to and .
C.3 Proof for properties of Definition 1
Under the definitions of and in Appendix C, we claim that the origin in (GlobalForsaken) is a global Nash equilibrium and satisfies Item 3 with .
To verify that is indeed a global Nash equilibrium we need to check that the solution cannot be unilaterally improved. In other words, the solution should coincide with where
We can easily verify this with Minimize in Mathematica, since the functions are polynomial for which a closed form solutions to the global optimization problem will be returned.
To find for we solve the global minimization problem,
for which a closed form solution can be found with Mathematica, which when numerically evaluated is approximately .
We need to compute to ensure . In our case of convex constraints, , we have that where denotes the spectral norm (Rockafellar & Wets, 2009, Thm. 9.2 and 9.7). Under our constraint , this can similarly be computed in closed form, yielding . So which satisfy the condition . This completes the proof.
Let be the associated operator of in (GlobalForsaken) defined as . Define the radius as . Then, has a stable critical point at the origin , at least one attracting limit cycle in the region defined by and at least one repellant limit cycle within .
We follow a similar argument as in Hsieh et al. (2021, D.2). We can compute the associated operator ,
With a change of variables into polar coordinates we get that evolves as,
When this reduces to and we observe that for any . Likewise for , we have that which implies . Since there is no stationary point in the region it then follows from the Poincaré-Bendixson theorem (Teschl, 2012, Thm. 7.16) that there must exist at least one attracting limit cycle in . Further, it is easy to see that is a critical point and that it is stable by inspection of the Jacobian . Since is trapping, it follows from Poincaré–Hopf index theorem, that there must exist a repellant limit cycles in the region defined by . This completes the proof. ∎
C.4 Proof of properties for (Hsieh et al., 2021, Example 5.2)
Example 6: (Hsieh et al., 2021, Example 5.2)
where .
By using Mathematica, we can obtain a closed form solution of the Lipschitz constant of restricted to the constraint set, which we find to be . Mathematica can solve approximately for the critical point, yielding . To find we want to globally minimize for . Mathematica finds the candidate for which . So must be at least this small, i.e. . Since , this implies that . See Forsaken.nb for Mathematica-assisted computations.
This rules out convergence guarantees for both (CEG+) and AdaptiveEG+ (Algorithm 1), which is supported by the simulation in Figure 8. However, as observed, (CurvatureEG+) converges in the simulations.