Escaping limit cycles: Global convergence for constrained nonconvex-nonconcave minimax problems

Thomas Pethick, Puya Latafat, Panagiotis Patrinos, Olivier Fercoq, Volkan Cevher

Introduction

Many machine learning applications, from generative adversarial networks (GANs) to robust reinforcement learning, result in nonconvex-nonconcave constrained minimax problems, which pose notorious difficulties to the scalable first order methods. Indeed, there is no shortage of results illustrating divergent or cycling behavior when going beyond minimization problems (Benaım & Hirsch, 1999; Hommes & Ochea, 2012; Mertikopoulos et al., 2018b; Hsieh et al., 2021).

Traditionally, minimax problems have been studied for more than half a century under the umbrella of the variational inequalities (VIs). The extragradient-type algorithms from the VI literature was recently brought to the awareness of the machine learning community (Mertikopoulos et al., 2018a; Gidel et al., 2018; Böhm et al., 2020), and have provided a principled way of stabilizing training and avoiding Poincaré recursions. However, these results mostly concern the convex-concave setting.

In nonconvex-nonconcave minimax problems, or more generally nonmonotone variational inequalities (VIs), even finding a local solution is in general intractable. This has been made precise through exponential lower bound of the classical optimization type (Hirsch & Vavasis, 1987) and computational complexity results (Papadimitriou, 1994; Daskalakis et al., 2021b). This is in sharp contrast to minimization problems, where only finding a global solution is intractable. The recent result of (Hsieh et al., 2021) provides some intuition behind this difference by showing that the asymptotic limits of most schemes, including extragradient, can converge to attracting limit cycles.

To make progress in lieu of these negative results, Diakonikolas et al. (2021) proposes a simple generalization of extragradient, called (EG+), that can converge to a stationary point even for a class of nonmonotone problems provided that the weak Minty variational inequality (MVI) holds. This problem class is parametrized by a constant ρ\rho, which controls the degree of nonconvexity. However, given the range of ρ\rho in Diakonikolas et al. (2021), the new class is still too small to include even the simplest counterexample of Hsieh et al. (2021) for the general Robbins-Monro schemes.

Building on the analysis in Diakonikolas et al. (2021), we propose a new adaptive scheme, called (CurvatureEG+), that converges even in the difficult counter example of Hsieh et al. (2021) as illustrated in Fig. 1. Our main contributions are summarized below.

We propose an adaptive extragradient-type algorithm that converges for a larger range of ρ\rho, the parameter in the weak MVI assumption (cf. Item 3) than previously known.

More importantly, we show that convergence is ensured if 2ρ+γk>02\rho+\gamma_{k}>0, where γk\gamma_{k} is the extrapolation stepsize. This is crucial since by selecting γk\gamma_{k} through a backtracking procedure larger stepsizes are allowed, which in turn implies convergence for more negative values of ρ\rho, thus capturing a larger class of problems. In addition, we show that the linesearch eventually passes without triggering any backtrack if initialized based on the Jacobian of FF (cf. Section 4).

We present a non-adaptive variant of our algorithm (CEG+), and show that for particular parameter choices (EG+) of Diakonikolas et al. (2021), and when ρ=0\rho=0 the celebrated forward-backward-forward (FBF) algorithm of Tseng (2000) are recovered, thus unifying and generalizing both methods. We improve upon Diakonikolas et al. (2021) by not only relaxing the problem class but also the stepsize range. We show that our results are tight by providing a matching lower bound, thus providing a complete picture of (EG+) under weak MVI.

Related work

The community has resorted to various approaches to make progress for nonconvex-nonconcave minimax problems. One line of work focuses on deriving local convergence results (Mazumdar et al., 2019; Fiez & Ratliff, 2020; Heusel et al., 2017). For global results, the two primary approaches have been to either assume a global oracle for the inner problem (Jin et al., 2019; Davis & Drusvyatskiy, 2018) or assume particular problem structure such as the Polyak-Łojasiewicz condition (Nouiehed et al., 2019; Yang et al., 2020) or concavity for the inner problem (Rafique et al., 2019).

We follow the same tradition of assuming structure, but from the general perspective of operator theory. The idea of studying minimax and related problems through the lens of variational inequality has a long history (Minty, 1962; Rockafellar, 1976; Polyak, 1987; Bertsekas, 1997), with recent renewed interest due to its relevance for minimax formulations (Mertikopoulos et al., 2018a; Gidel et al., 2018; Azizian et al., 2020).

One relaxation of the monotone case for which we have positive results is that of Minty variational inequalities (MVI) (Mertikopoulos et al., 2018a; Song et al., 2021; Zhou et al., 2017; Liu et al., 2021), which includes all quasiconvex-concave and starconvex-concave problems. Diakonikolas et al. (2021) introduced the relaxed condition of weak MVI. In the unconstrained setting they showed non-asymptotic convergence results under a restricted problem constant ρ\rho. Similarly to us, Lee & Kim (2021a) extends the regime but under the stronger condition of cohypomonotonicity. They do so by studying a more evolved variant of extragradient building on anchoring techniques. We instead directly improve upon (EG+) and generalize it to new settings.

In the stochastic setting, usually the stepsize for the extrapolation step is diminishing. This is the case in Böhm et al. (2020) where they consider a forward-backward-forward type scheme. However, they remain in the monotone setting, where the limit cycles are non-attracting, as exemplified by a bilinear game. Hsieh et al. (2021) recently showed that a large family of algorithms, which includes the extragradient method with diminishing stepsize, can converge to attracting limit cycles. Going beyond this restriction, prior to Diakonikolas et al. (2021), Hsieh et al. (2020) interestingly considers two separate and diminishing stepsizes under the stronger assumption of MVI.

Problem formulation and preliminaries

holds. The set of all such points is denoted by zerT{zRn0Tz}\operatorname*{zer}T\coloneqq{\mathopen{}\left\{z\in R^{n}{}\mid{}0\in Tz\right\}\mathclose{}}. Throughout the paper problem (2.1) is studied under the following assumptions (definitions can be found in Appendix A).

Weak Minty variational inequality (MVI) holds, i.e., there exists a nonempty set SzerT\mathcal{S}^{\star}\subseteq\operatorname*{zer}T such that for all zSz^{\star}\in\mathcal{S}^{\star} and some ρ(12L,)\rho\in(-\tfrac{1}{2L},\infty)

Generally, we do not require the weak Minty assumption to hold at every zzerTz^{\star}\in\operatorname*{zer}T. In fact, as shown in Theorem 3.1 nonemptiness of S\mathcal{S}^{\star} is sufficient for ensuring that the limit points belong to zerT\operatorname*{zer}T. Interestingly, despite nonmonotonicity of FF, global (as opposed to subsequential) convergence can be established when S=zerT\mathcal{S}^{\star}=\operatorname*{zer}T, an assumption that is still weaker than cohypomonotonicity.

VIs provide a convenient abstraction for a range of problems. We mention some central examples below but otherwise defer to the overview in Facchinei & Pang (2007). Subsequently, we provide examples where the weak MVI holds.

Example 1: (minimax optimization). A comprehensive way to capture a wide range of applications in machine learning is to consider structured minimax problems of the form

As it will become clear in the next section (cf. Algorithm 1), the main computations involved in the proposed scheme are evaluations of FF and resolvent JA=(id+A)1J_{A}={\mathopen{}\left({\rm id}+A\right)\mathclose{}}^{-1}. Recall that the resolvent of a maximally monotone operator is firmly nonexpansive with full domain (cf. (Bauschke & Combettes, 2017, Sect. 23)). If A=fA=\partial f is the subdifferential operator of a convex function ff, then its resolvent is the proximal mapping. For instance when AA is as in Assumption I, then its resolvent is given by JA(x,y)=(proxg(x),proxh(y))J_{A}(x,y)=(\operatorname{prox}_{g}(x),\operatorname{prox}_{h}(y)).

The corresponding first order optimality conditions may be written as Az=(g1(z1),,gN(zN))Az=(\partial g_{1}(z_{1}),\ldots,\partial g_{N}(z_{N})) and Fz=({\nabla}_{z_{1}}\varphi_{1}(\mathchoice{\text{\boldmath{\displaystyle z}}}{\text{\boldmath{\textstyle z}}}{\text{\boldmath{\scriptstyle z}}}{\text{\boldmath{\scriptscriptstyle z}}}),\ldots,{\nabla}_{z_{N}}\varphi_{N}(\mathchoice{\text{\boldmath{\displaystyle z}}}{\text{\boldmath{\textstyle z}}}{\text{\boldmath{\scriptstyle z}}}{\text{\boldmath{\scriptscriptstyle z}}})).

A solution to (2.1) thus returns a candidate for which the first order condition of the above problems is satisfied. In the monotone case these two solution concepts coincide, while in the more general case of weak MVI, we provide examples where this still holds. In particular, we introduce in Section 5 a nonconvex-nonconcave minimax game which additionally exhibits limit cycles for FzFz. As a consequence most schemes including gradient descent ascent, extragradient and optimistic gradient descent ascent do not converge to a stationary point globally (Hsieh et al., 2021). However, the global Nash equilibrium satisfies Item 3 with ρ>\nicefrac12L\rho>-\nicefrac{{1}}{{2L}}, which we show is sufficient for global convergence of (CEG+).

The weak MVI condition is satisfied in certain reinforcement learning settings. Specifically, Diakonikolas et al. (2021); Daskalakis et al. (2021a) considers a two-player zero-sum game where the weak MVI holds, while neither MVI nor cohypomonotonicity holds. Interestingly, the formulation requires constraint—a condition they do not handle. We thus provide the first provable algorithm for this setting. Weak MVI also contains all quasiconvex-concave and starconvex-concave problems. For further examples, the literature on cohypomonotonicity (Bauschke et al., 2020) is relevant since it implies weak MVI, see for instance Lee & Kim (2021b, Example 1).

Generalizing Extragradient+

Our starting point is the Extragradient+ (EG+) algorithm of Diakonikolas et al. (2021) which is identical to extragradient (Korpelevich, 1976) except for the second stepsize being smaller. They only treat the inclusion (2.1) when A0A\equiv 0, and in our notation require ρ(\nicefrac18L,0]\rho\in(-\nicefrac{{1}}{{8L}},0]. Specifically,

where they choose γk=\nicefrac1L\gamma_{k}=\nicefrac{{1}}{{L}} and αˉk=\nicefrac12\bar{\alpha}_{k}=\nicefrac{{1}}{{2}} (Diakonikolas et al., 2021, Thm. 3.2).

We generalize (EG+) in Algorithm 1 to take the operator AA into account—consequently we capture constraint and regularized problems as well. In addition, the scheme is adaptive in αˉk\bar{\alpha}_{k}. We will show that the weaker requirement of ρ(\nicefrac12L,)\rho\in(-\nicefrac{{1}}{{2L}},\infty) suffices even for the more general inclusion (2.1).

The main convergence results of Algorithm 1 are established in the next theorem. The proof is largely inspired by recent developments in operator splitting techniques in the framework of monotone inclusions (Latafat & Patrinos, 2017; Giselsson, 2021). The key idea lies in interpreting each iteration of the algorithm as a projection onto a certain hyperplane, an interpretation that dates back to Solodov & Tseng (1996); Solodov & Svaiter (1999).

where κ=lim infkλk(2λk)(δk+\nicefracγk2)2\kappa=\liminf_{k\to\infty}\lambda_{k}(2-\lambda_{k})(\delta_{k}+\nicefrac{{\gamma_{k}}}{{2}})^{2}. Moreover, the following holds

Note that whenever lim supkγk<1L\limsup_{k\to\infty}\gamma_{k}<\tfrac{1}{L}, Item 2 may be used to derive a similar inequality in terms of zˉkzk\|\bar{z}^{k}-z^{k}\| by lower bounding HzˉkHzk\|H\bar{z}^{k}-Hz^{k}\| in (3.1). We also remark that tighter rates may be obtained in the regime ρ0\rho\geq 0, however, this will not be pursued in this work.

Although we do not incur additional costs for evaluating the adaptive stepsize αk\alpha_{k} in 1.4, it proves instructive to present a variant with constant stepsize. As a result we compare the range of our stepsizes against Diakonikolas et al. (2021) showing an improvement by a factor of \nicefrac32\nicefrac{{3}}{{2}}. Moreover, in the monotone case (ρ=0\rho=0), with a certain choice of stepsizes the algorithm reduces to the celebrated forward-backward-forward (FBF) algorithm of Tseng (2000). We remark that the relation of FBF to projection-type algorithms was noted in Tseng (2000), (Giselsson, 2021, Sect. 6.2.1).

To this end, in this subsection consider the following non-adaptive variant of Algorithm 1 that generalizes (EG+). Letting αˉk(0,1+\nicefrac2δkγk)\bar{\alpha}_{k}\in(0,1+\nicefrac{{2\delta_{k}}}{{\gamma_{k}}}):

The convergence of this algorithm is an immediate byproduct of Theorem 3.1. To see this, note that the zk+1z^{k+1} update in 1.5 may be written as zk+1=zk+2ηkαk(HzˉkHzk)z^{k+1}{}={}z^{k}{}+{}2\eta_{k}\alpha_{k}(H\bar{z}^{k}-Hz^{k}), for ηk(0,1)\eta_{k}\in(0,1). Therefore, convergence is still ensured for any αˉk<2αk\bar{\alpha}_{k}<2\alpha_{k} as the difference may be absorbed by the relaxation parameter ηk\eta_{k}. Note that by \nicefrac12\nicefrac{{1}}{{2}}-cocoercivity of HH (cf. Item 1)

where κ=αˉ(1+2δγαˉ)\kappa=\bar{\alpha}(1+\tfrac{2\delta}{\gamma}-\bar{\alpha}). Moreover, the claims of Items 1 and 2 hold true.

The setting of Diakonikolas et al. (2021) in (EG+) involves the stepsizes γk=\nicefrac1L\gamma_{k}=\nicefrac{{1}}{{L}}, αk=\nicefrac12\alpha_{k}=\nicefrac{{1}}{{2}}. Note that when restricting to A0A\equiv 0, the iterates (CEG+) simplify to this form owing to the fact that HzˉkHzk=zˉkγFzˉkHzk=γFzˉkH\bar{z}^{k}-Hz^{k}=\bar{z}^{k}-\gamma F\bar{z}^{k}-Hz^{k}=-\gamma F\bar{z}^{k}. In comparison, in our setting if δ=ρ=\nicefrac18L\delta=\rho=-\nicefrac{{1}}{{8L}} (the smallest ρ\rho permitted in Diakonikolas et al. (2021)) is selected, then based on our analysis in Corollary 3.2 we may select γk=\nicefrac1L\gamma_{k}=\nicefrac{{1}}{{L}}, and αˉk(0,\nicefrac34)\bar{\alpha}_{k}\in(0,\nicefrac{{3}}{{4}}), thus the upper bound for the second stepsize is \nicefrac32\nicefrac{{3}}{{2}} times that of Diakonikolas et al. (2021).

In Corollary 3.2 the range of stepsizes γ\gamma, αˉ\bar{\alpha} may alternatively be set as \gamma\in\big{(}\lfloor-2\rho\rfloor_{+},\nicefrac{{1}}{{L}}\big{)}, αˉ(0,1+\nicefrac2δγ]\bar{\alpha}\in(0,1+\nicefrac{{2\delta}}{{\gamma}}]. This is due to the fact that if γ<\nicefrac1L\gamma<\nicefrac{{1}}{{L}} (strictly), then HH is strictly \nicefrac12\nicefrac{{1}}{{2}}-cocoercive. Therefore, in (3.2), 1+2δγ<2αk1+\tfrac{2\delta}{\gamma}<2\alpha_{k} holds, and thus the stepsize αˉ=1+2δγ\bar{\alpha}=1+\tfrac{2\delta}{\gamma} is permitted. Although this may appear to be of little practical significance, by setting γ(0,\nicefrac1L)\gamma\in(0,\nicefrac{{1}}{{L}}), δ=ρ=0\delta=\rho=0, and αˉ=1\bar{\alpha}=1 in (CEG+), we obtain zk+1=zˉk+γFzkγFzˉkz^{k+1}=\bar{z}^{k}+\gamma Fz^{k}-\gamma F\bar{z}^{k}, which is the forward-backward-forward (FBF) algorithm of Tseng (2000), (Bauschke & Combettes, 2017, Thm. 26.17)). ∎

2 Lower bounds

We show that the result in Corollary 3.2 is tight by providing a matching lower bound when A0A\equiv 0. We do so by fixing αˉk\bar{\alpha}_{k} and showing a stepsize dependent lower bound. In particular, note that if αˉk=\nicefrac12\bar{\alpha}_{k}=\nicefrac{{1}}{{2}} as in Diakonikolas et al. (2021, Thm. 3.2), then Theorem 3.4 implies a lower bound of ρ>\nicefrac14L\rho>-\nicefrac{{1}}{{4L}} for the (EG+) scheme. The lower bound is contextualized in Fig. 2 by relating it to our convergence results and existing results in the literature.

Adaptively taking larger stepsizes using local curvature

The proposed scheme involves a backtracking linesearch that uses the local curvature for its initial guess. The reason being that this will immediately pass, close enough to the solution zz^{\star}, by argument of continuity. More precisely, we will set the initial guess to something slightly smaller than JF(zk)1\|JF(z^{k})\|^{-1}, where JF(z)JF(z) denotes the Jacobian of FF at zz and \|\cdot\| is the spectral norm. Note that, despite the use of second order information, the scheme remains efficient since JF(z)[]\|JF({}z{})[{}{}{}]\| only requires one eigenvalue computation performed through Jacobian-vector product (Pearlmutter, 1994).

Given an initial point z0=zinitz^{0}=z^{\rm init} and ν(0,1)\nu\in(0,1), the final scheme which we denote (CurvatureEG+) proceeds for k=0,1,k=0,1,\dots as follows:

The linesearch terminates in finite time with γmin{γinit,\nicefracντL}\gamma\geq\min\{\gamma^{\rm init},\nicefrac{{\nu\tau}}{{L}}\};

The convergence results for (CurvatureEG+) are deduced based of the above lemma and Theorem 3.1 and are provided in Corollary B.1 in Section B.2. We illustrate the behavior of (CurvatureEG+) in Fig. 1 and in Section 6.

Constructing toy examples

When Item 3 holds for negative ρ\rho, limit cycles of the underlying operator FzFz can emerge. We illustrate this with simple polynomial examples for which all the properties of interest can be computed in closed form.

A PolarGame denotes a two-player game whose associated operator FF has limit cycles at z2=ci\|z\|_{2}=c_{i} for all i[k]i\in[k] where ci0c_{i}\neq 0.

This turns out to be particularly easy to construct in polar coordinates as the name suggests (see Section C.1). Apart from introducing arbitrary number of limit cycles it also gives us control over ρ\rho. This is illustrated in the following instantiations capturing three important cases.

Example 3: (PolarGame). Consider Fz=(ψ(x,y)y,ψ(y,x)+x)Fz={\mathopen{}\left(\psi(x,y)-y,\psi(y,x)+x\right)\mathclose{}} where z\nicefrac1110\|z\|_{\infty}\leq\nicefrac{{11}}{{10}} and ψ(x,y)=116ax(1+x2+y2)(9+16x2+16y2)\psi(x,y)=\frac{1}{16}ax(-1+x^{2}+y^{2})(-9+16x^{2}+16y^{2}). We have the following three cases:

(i) a=1a=1 then ρ(1L,12L)\rho\in(-\frac{1}{L},-\frac{1}{2L}) (ii) a=34a=\frac{3}{4} then ρ(12L,13L)\rho\in(-\frac{1}{2L},-\frac{1}{3L}) (iii) a=13a=\frac{1}{3} then ρ(18L,110L)\rho\in(-\frac{1}{8L},-\frac{1}{10L})

where LL denotes the Lipschitz constant of FF restricted to the constraint set. For all cases FF exhibits limit cycles at z=1\|z\|=1 and z=\nicefrac34\|z\|=\nicefrac{{3}}{{4}}. Proof is deferred to Section C.2.

Example 4: (minimax). In the particular case of constrained minimax problem we introduce the following polynomial game:

where ψ(z)=2z621z43+z23\psi(z)=\frac{2z^{6}}{21}-\frac{z^{4}}{3}+\frac{z^{2}}{3}. We provide proof of the following properties in Section C.3:

There exists a repellant limit cycle and an attracting limit cycle of FF.

z=(0,0)z^{\star}=(0,0) is a global Nash equilibrium for which Item 3 holds inside the constraint with ρ>\nicefrac12L\rho>-\nicefrac{{1}}{{2L}}, where LL denotes the Lipschitz constant of FF restricted to the constraint set.

Experiments

The algorithms considered in the experiments include the adaptive Algorithm 1, (CurvatureEG+), and constant stepsize methods that can be seen as instances of (CEG+) for various choices of γk\gamma_{k} and αˉk\bar{\alpha}_{k}. When γk=\nicefrac1L\gamma_{k}=\nicefrac{{1}}{{L}} and αˉk=1\bar{\alpha}_{k}=1 we recover a constrained variant of extragradient, which we denote CEG. When αˉk=\nicefrac12\bar{\alpha}_{k}=\nicefrac{{1}}{{2}} we denote the scheme CEG+, which is the direct generalization to the constraint setting of the (EG+) scheme studied in Diakonikolas et al. (2021, Thm. 3.2). Note that this choice of αˉk\bar{\alpha}_{k} restricts the problem class for which we otherwise can have guaranteed convergence according to Corollary 3.2. When αˉk\bar{\alpha}_{k} is chosen adaptively according to Algorithm 1 we refer to it as AdaptiveEG+. Finally, when γk\gamma_{k} is additionally chosen adaptively we use the name (CurvatureEG+).

In the stochastic setting, when γk=\nicefrac1k\gamma_{k}=\nicefrac{{1}}{{k}} and αk=1\alpha_{k}=1, effectively both stepsizes diminish, and we recover a constrained variant of the popular stochastic extragradient scheme (see e.g. Hsieh et al. (2021, Algorithm 3)), which we refer to as SEG. We also consider a heuristic variant where γk=\nicefrac1L\gamma_{k}=\nicefrac{{1}}{{L}} and only αk\alpha_{k} is decreasing, which we refer to as SEG+.

Conclusion

This paper introduced an EG-type algorithm for a class of nonconvex-nonconcave minimax problems that satisfy the weak Minty variational inequality (MVI). The range of parameter in the weak MVI was extended compared to EG+ of Diakonikolas et al. (2021), and tightness of our results were demonstrated through construction of a counter example. In addition, EG+ (Diakonikolas et al., 2021), as well as the forward-backward-forward algorithm (Tseng, 2000) were all shown to be special cases of our scheme. Furthermore, (CurvatureEG+) was proposed that performs a backtracking linesearch on the extrapolation stepsize γk\gamma_{k} allowing for larger stepsizes and relaxes the condition ρ>12L\rho>\tfrac{1}{2L} to ρ>\nicefracγk2\rho>\nicefrac{{-\gamma_{k}}}{{2}} which is often a much weaker condition. More importantly, it is shown that asymptotically the linesearch always passes with γk=νJF(zk)[]1\gamma_{k}=\nu\|JF({}z^{k}{})[{}{}{}]\|^{-1} for any ν(0,1)\nu\in(0,1), thus ratifying the name (CurvatureEG+). Future direction include exploring applications of the proposed algorithm in particular in the setting of GANs. It is also interesting to develope a variance reduced variant of the algorithm for finite sum minimax problems.

Acknowledgments and disclosure of funding

We would like to especially thank Yu-Guan Hsieh for providing valuable feedback and discussion. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement n° 725594 - time-data). This work was supported by the Swiss National Science Foundation (SNSF) under grant number 200021_205011. The work of the second and third author was supported by the Research Foundation Flanders (FWO) postdoctoral grant 12Y7622N and research projects G081222N, G0A0920N, G086518N, and G086318N; Research Council KU Leuven C1 project No. C14/18/068; Fonds de la Recherche Scientifique – FNRS and the Fonds Wetenschappelijk Onderzoek – Vlaanderen under EOS project no 30468160 (SeLMA); European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 953348. The work of Olivier Fercoq was supported by the Agence National de la Recherche grant ANR-20-CE40-0027, Optimal Primal-Dual Algorithms (APDO).

References

Appendix A Preliminary definitions

Notationally we will use x+max{0,x}\lfloor x\rfloor_{+}\coloneqq\max\{0,x\} throughout. We additionally recall some standard definitions and results and refer to Bauschke & Combettes (2017); Rockafellar (1970)) for further details.

and it is said to be ρ\rho-comonotone if for all (x,y),(x,y)gphA(x,y),(x^{\prime},y^{\prime})\in\operatorname{gph}A

The operator AA is said to be maximally (co)monotone if its graph is not strictly contained in the graph of another (co)monotone operator.

We say that AA is monotone if it is -monotone. When ρ<0\rho<0, ρ\rho-comonotonicity is also referred to as ρ|\rho|-cohypomonotonicity.

Moreover, AA is said to be nonexpansive if it is 11-Lipschitz continuous, and firmly nonexpansive if it is 11-cocoercive.

The following lemma plays an important role in our convergence analysis.

AA is 11-Lipschitz if and only if T=idAT={\rm id}-A is \nicefrac12\nicefrac{{1}}{{2}}-cocoercive.

The first claim follows directly from (Bauschke & Combettes, 2017, Prop.4.11). That TT is strongly monotone is a consequence of the Cauchy Schwarz inequality and Lipschitz continuity of AA:

In turn, the last claim follows from the Cauchy-Schwarz inequality. ∎

Appendix B Proofs and further results

Let H=idγkFH={\rm id}-\gamma_{k}F. By 1.3 Hzkzˉk+γkAzˉkHz^{k}\in\bar{z}^{k}+\gamma_{k}A\bar{z}^{k}. Therefore,

In what follows we will show that Algorithm 1 is equivalent to taking a forward-backward step followed by a correction step. Consider the updates

where in the inequality Item 1 was used. Hence, by (B.3) the stepsize αk\alpha_{k} is positive and bounded away from zero. Moreover, if zkDkz^{k}\in\mathcal{D}_{k}, then from (B.3) we may conclude that HzˉkHzk0\|H\bar{z}^{k}-Hz^{k}\|\leq 0 which implies that the generated sequence remains constant and zˉkzerT\bar{z}^{k}\in\operatorname*{zer}T (cf. (B.1)).

The projection onto Dk\mathcal{D}_{k} for any vDkv\notin\mathcal{D}_{k} is given by

Moreover, (B.1) together with Item 3 at zˉk\bar{z}^{k} yields

thus ensuring zSDkz^{\star}\in\mathcal{S}^{\star}\subseteq\mathcal{D}_{k}. The projection onto Dk\mathcal{D}_{k} is then given by ΠDk(zk)=zk+αk(HzˉkHzk),\operatorname{\Pi}_{\mathcal{D}_{k}}(z^{k})=z^{k}+\alpha_{k}(H\bar{z}^{k}-Hz^{k}), where αk\alpha_{k} is as in 1.4.

The proof of convergence was already given prior to the statement of the corollary. It remains to derive (3.3). By Item 3 and owing to \nicefrac12\nicefrac{{1}}{{2}}-cocoercivity of HH (cf. Item 1)

Therefore, provided that αˉ>0\bar{\alpha}>0 we have

Telescoping the above inequality yields the claimed inequality. ∎

B.2 Convergence results and proofs of Section 4

The convergence results for (CurvatureEG+) are provided in the next corollary where ρ\rho in Item 3 is allowed to take potentially larger values provided that ρ>\nicefracγk2\rho>-\nicefrac{{\gamma_{k}}}{{2}}. Note that owing to the lower bound on γk\gamma_{k} (cf. Item 1), the weak MVI assumption in the corollary is always satisfied if ρ(\nicefracντ2L,)\rho\in(-\nicefrac{{\nu\tau}}{{2L}},\infty), however, in practice γk\gamma_{k} may take larger values.

Moreover, if zk,zˉkzzerTz^{k},\bar{z}^{k}\to z^{\star}\in\operatorname*{zer}T (as is the case in 3), and FF is continuously differentiable, then eventually the backtrack will never be invoked.

Observe that in the proof of Theorem 3.1 11-Lipschitz continuity of γkF\gamma_{k}F is only used at the generated points zˉk\bar{z}^{k} and zkz^{k} (see (B.3)), and is thus ensured by the linesearch Algorithm 2. Therefore, it is easy to see that αk\alpha_{k} is positive and bounded away from zero provided that ρ>\nicefracγk2\rho>-\nicefrac{{\gamma_{k}}}{{2}} , see (B.3). Moreover, since γkFzˉkFzkνzˉkzk\gamma_{k}\|F\bar{z}^{k}-Fz^{k}\|\leq\nu\|\bar{z}^{k}-z^{k}\|, arguing as in Item 2 we obtain HzˉkHzk(1ν)zˉkzk\|H\bar{z}^{k}-Hz^{k}\|\geq(1-\nu)\|\bar{z}^{k}-z^{k}\|. Hence, it follows from (B.5) that

1: Since FF is LL-Lipschitz continuous the linesearch would terminate in finite steps. Either γinit\gamma^{\rm init} satisfies the condition, or else the backtrack procedure is invoked, which in turn implies the previous candidate γ/τ\gamma/\tau should have violated the condition leading the the claimed lower bound.

B.3 Proofs of Section 3.2

To prove the lower bound we introduce the following unconstrained bilinear minimax problem with an unstable critical point.

Example 5: Consider the following minimax problem:

The associated operator of Section B.3 can easily be computed,

where z=(x,y)z=(x,y). In this particular case, both LL and ρ\rho turn out to be constants. By simple calculation we have,

where \|\cdot\| is the spectral norm. Since the norm of the Jacobian is constant it equates the global Lipschitz constant, L=JF(z)L=\|JF(z)\|.

By linearity of FF, one step of (EG+) is conveniently also a linear operator. Specifically,

We know that a linear dynamical system is globally asymptotically stable if and only if the spectral radius of the linear mapping is strictly less than 1.

Let λ1,λ2\lambda_{1},\lambda_{2} be the eigenvalues of TT. Then the spectral radius is the largest absolute value of the eigenvalues. For TT this becomes,

Equation (B.13) provides a specification for Section B.3. As long as (B.12) is satisfied, (EG+) is guaranteed to converge for γk=\nicefrac1L\gamma_{k}=\nicefrac{{1}}{{L}}. On the other hand, since (B.10) is a linear system, we simultaneously learn that picking cc any larger would imply non-convergence through maxiλi1\max_{i}|\lambda_{i}|\geq 1 (given z00z^{0}\neq 0). We can trivially embed problem (B.10) into a higher dimension to generalize the result. Noting that c=ρLc=-\rho L completes the proof. ∎

We provide Mathematica code to verify each step of the above proof.The supplementary code can be found at https://github.com/LIONS-EPFL/weak-minty-code/.

Appendix C Toy examples

In the following appendix, LL denotes the Lipschitz constant of FF restricted to the constraint set and ρ\rho is the parameter of the weak MVI (Item 3) when restricted to the constraint set. This restriction of the definitions is warranted, since zkz^{k} remains within the constraint set in all simulations, while zˉk\bar{z}^{k} is guaranteed to stay within by definition of 1.3 in Algorithm 1 (and likewise for all other considered method treating problem (2.1)).

All computer-assisted calculations can be found in the supplementary code.1

with a,b0a,b\neq 0. Transforming this dynamics into cartesian coordinates yields the desired vectorfield, FF, while subsequently integrating with respect to xx and yy yields the two potentials associated with the two players. Note that the roots {ci}i=1k{\mathopen{}\left\{-c_{i}{}\mid{}{}\right\}\mathclose{}}_{i=1}^{k} for the polynomial defining r˙\dot{r} are not strictly necessary for showing existence of limit cycles, but leads to a simpler form for FzFz. We illustrate the construction in Fig. 5.

Let Fz=(x˙,y˙)Fz=(\dot{x},\dot{y}) be the evolution in cartesian coordinates of the associated vectorfield in polar coordinates defined by (C.1). Then the only stationary point of FF is at the origin (0,0)(0,0) and there exists a limit cycle at r=cir=c_{i} for all i[k]i\in[k].

Let r=x2+y2r=\sqrt{x^{2}+y^{2}}. It is easy to see from (C.1) that the only stationary point is at r=0r=0. By construction, r˙\dot{r} is a polynomial with roots cic_{i} for all i[k]i\in[k], so any trajectory starting on the circle defined by r=cir=c_{i} remains in that set. However, θ˙\dot{\theta} is strictly nonzero. As a consequence FzFz is nonzero, so r=cir=c_{i} must define a limit cycle, which proofs the claim. ∎

C.2 Proof for properties of Definition 1

This can easily be verified by a change of variables. From Proposition 1 it then follows, that there must exist a limit cycle at z=1\|z\|=1 and z=\nicefrac34\|z\|=\nicefrac{{3}}{{4}}. To verify the conditions on ρ\rho we compute the closed form solution to ρ\rho and LL in Mathematica:

For a=1a=1 we have ρ=501761050977\rho=-\frac{50176}{1050977} and L=2538096704424929+7024698961720000L=\frac{\sqrt{2538096\sqrt{704424929}+70246989617}}{20000}

For a=\nicefrac34a=\nicefrac{{3}}{{4}} we have ρ=60211216798825\rho=-\frac{602112}{16798825} and L=76142886383574361+63502290655380000L=\frac{\sqrt{7614288\sqrt{6383574361}+635022906553}}{80000}

For a=\nicefrac13a=\nicefrac{{1}}{{3}} we have ρ=1505289439585\rho=-\frac{150528}{9439585} and L=2538096754424929+7344698961760000L=\frac{\sqrt{2538096\sqrt{754424929}+73446989617}}{60000}

It can easily be verified that the stated conditions for ρ\rho in Definition 1 are met for the values above. This completes the proof.

We provide Mathematica code verifying the construction of FF and the closed form solutions to LL and ρ\rho.

C.3 Proof for properties of Definition 1

Under the definitions of ρ\rho and LL in Appendix C, we claim that the origin (0,0)(0,0) in (GlobalForsaken) is a global Nash equilibrium and satisfies Item 3 with ρ>\nicefrac12L\rho>-\nicefrac{{1}}{{2L}}.

To verify that (0,0)(0,0) is indeed a global Nash equilibrium we need to check that the solution cannot be unilaterally improved. In other words, the solution should coincide with (x,y)(x^{\star},y^{\star}) where

We can easily verify this with Minimize in Mathematica, since the functions are polynomial for which a closed form solutions to the global optimization problem will be returned.

To find ρ\rho for z=(0,0)z^{\star}=(0,0) we solve the global minimization problem,

for which a closed form solution can be found with Mathematica, which when numerically evaluated is approximately 0.119732-0.119732.

We need to compute LL to ensure ρ>\nicefrac12L\rho>-\nicefrac{{1}}{{2L}}. In our case of convex constraints, C\mathcal{C}, we have that L=supzCJF(z)L=\sup_{z\in\mathcal{C}}\|JF(z)\| where \|\cdot\| denotes the spectral norm (Rockafellar & Wets, 2009, Thm. 9.2 and 9.7). Under our constraint z\nicefrac43\|z\|_{\infty}\leq\nicefrac{{4}}{{3}}, this can similarly be computed in closed form, yielding L=\nicefrac12(940959721901+74125591)2835L=\nicefrac{{\sqrt{\frac{1}{2}{\mathopen{}\left(9409\sqrt{59721901}+74125591\right)\mathclose{}}}}}{{2835}}. So 12L0.165432-\frac{1}{2L}\approx-0.165432 which satisfy the condition ρ>12L\rho>-\frac{1}{2L}. This completes the proof.

Let FF be the associated operator of ϕ\phi in (GlobalForsaken) defined as Fz=(xϕ(x,y),yϕ(x,y))Fz=({\nabla}_{x}\phi(x,y),-{\nabla}_{y}\phi(x,y)). Define the radius as r=zr=\|z\|. Then, FzFz has a stable critical point at the origin (0,0)(0,0), at least one attracting limit cycle in the region defined by \nicefrac32<r<2\sqrt{\nicefrac{{3}}{{2}}}<r<2 and at least one repellant limit cycle within r\nicefrac32r\leq\sqrt{\nicefrac{{3}}{{2}}}.

We follow a similar argument as in Hsieh et al. (2021, D.2). We can compute the associated operator FF,

With a change of variables into polar coordinates (r,θ)(r,\theta) we get that r=x2+y2r=\sqrt{x^{2}+y^{2}} evolves as,

When r=\nicefrac32r=\sqrt{\nicefrac{{3}}{{2}}} this reduces to r˙=3cos(4θ)+5566\dot{r}=\frac{3\cos(4\theta)+5}{56\sqrt{6}} and we observe that r˙>0\dot{r}>0 for any θ\theta. Likewise for r=2r=2, we have that r˙=421(22cos(4θ)+25)\dot{r}=-\frac{4}{21}(22\cos(4\theta)+25) which implies r˙<0\dot{r}<0. Since there is no stationary point in the region S={(r,θ):\nicefrac32<r<2}\mathcal{S}={\mathopen{}\left\{(r,\theta):\sqrt{\nicefrac{{3}}{{2}}}<r<2{}\mid{}{}\right\}\mathclose{}} it then follows from the Poincaré-Bendixson theorem (Teschl, 2012, Thm. 7.16) that there must exist at least one attracting limit cycle in S\mathcal{S}. Further, it is easy to see that (0,0)(0,0) is a critical point and that it is stable by inspection of the Jacobian JF(z)JF(z). Since S\mathcal{S} is trapping, it follows from Poincaré–Hopf index theorem, that there must exist a repellant limit cycles in the region defined by r<\nicefrac32r<\sqrt{\nicefrac{{3}}{{2}}}. This completes the proof. ∎

C.4 Proof of properties for (Hsieh et al., 2021, Example 5.2)

Example 6: (Hsieh et al., 2021, Example 5.2)

where ψ(z)=14z212z4+16z6\psi(z)=\frac{1}{4}z^{2}-\frac{1}{2}z^{4}+\frac{1}{6}z^{6}.

By using Mathematica, we can obtain a closed form solution of the Lipschitz constant LL of FF restricted to the constraint set, which we find to be L=18012(1089801761+993841)L=\frac{1}{80}\sqrt{\frac{1}{2}{\mathopen{}\left(1089\sqrt{801761}+993841\right)\mathclose{}}}. Mathematica can solve approximately for the critical point, yielding z=(0.0780267,0.411934)z^{\star}=(0.0780267,0.411934). To find ρ\rho we want to globally minimize ρ(z):=Fz,zzFz2\rho(z):=\frac{{\mathopen{}\left\langle Fz,z-z^{\star}\right\rangle\mathclose{}}}{\|Fz\|^{2}} for zDz\in\mathcal{D}. Mathematica finds the candidate z=(1.01236,0.104749)z^{\prime}=(-1.01236,-0.104749) for which ρ(z)=0.477761\rho(z^{\prime})=-0.477761. So ρ\rho must be at least this small, i.e. ρ<0.477761\rho<-0.477761. Since \nicefrac12L0.04-\nicefrac{{1}}{{2L}}\approx-0.04, this implies that ρ<\nicefrac12L\rho<-\nicefrac{{1}}{{2L}}. See Forsaken.nb for Mathematica-assisted computations.

This rules out convergence guarantees for both (CEG+) and AdaptiveEG+ (Algorithm 1), which is supported by the simulation in Figure 8. However, as observed, (CurvatureEG+) converges in the simulations.