The Limit Points of (Optimistic) Gradient Descent in Min-Max Optimization
Constantinos Daskalakis, Ioannis Panageas
Introduction
The celebrated min-max theorem was a founding stone in the development of Game Theory , and is intimately related to strong linear programming duality , Blackwell’s approachability theory , and the theory of no-regret learning . The theorem states that if is a convex-concave function, and , are compact and concave subsets of Euclidean space, then
If represents the payment of the player to the player under choices of strategies and by these two players, the min-max theorem reassures us that an equilibrium of the game exists, and that the equilibrium payoffs to both players are unique.
What does not follow directly from the min-max theorem is whether there exist dynamics via which players would arrive at equilibrium if they were to follow some simple rule to update their current strategies. This has been the topic of a long line of investigation starting with Julia Robinson’s celebrated analysis of fictitious play , and leading to the development of no-regret learning .
Renewed interest in this problem has been recently motivated by the task of training Generative Adversarial Networks (GANs) , where two deep neural networks, the generator and the discriminator, are trained in tandem using first order methods, aiming at solving a min-max problem, of the following form, albeit typically with a non convex-concave objective function :
Here represents the parameters of the generator deep neural net, represents the parameters of the discriminator neural net, and is some measure of how close the distribution generated by the generator appears to the true distribution from the perspective of the discriminator.
Min-max optimization in non convex-concave settings is a central problem for many research communities, however our knowledge is very limited from optimization perspective. Moreover, for such applications of first-order methods to min-max problems in Machine Learning, it is especially important that the last-iterate maintained by the min and the max dynamics converges to a desirable solution. Unfortunately, even when is convex-concave, it is rare that guarantees are known for the last iterate (see for continuous time learning dynamics that may cycle). Some guarantees are known for continuous-time dynamics , but for discrete-time dynamics it is typically only shown that the average-iterates converge to min-max equilibrium. Recent work of shows that, while Gradient Descent/Ascent (GDA) dynamics performed by the min/max players may diverge, the Optimistic version dynamics of exhibit last iterate convergence to min-max solutions (which we shall call Optimistic Gradient Descent/Ascent (OGDA)), whenever is linear in and . The goal of our paper is to understand the limit points of GDA and OGDA dynamics (points that last iterate might converge to) for general functions . In particular, we answer the following questions:
are the stable limit points of GDA and OGDA locally min-max solutions?
how do the stable limit points of GDA and OGDA relate to each other?
We provide answers to these questions after defining our dynamics of interest formally.
with some constant step size Note that for the rest of this paper.. However, there are examples (functions and initial points ) in which the system of equations (3) cycles (see ). To break down this behavior, the authors in analyzed another optimization algorithm which is called Optimistic Gradient Descent/Ascent (OGDA)We note that OGDA has some resemblance with Polyak’s heavy ball method. However, one important difference is that OGDA has “negative momentum” while the heavy ball method has “positive momentum.”, the equations of which boil down to the following:
One of their key results was to show convergence to the solution for the case of bilinear objective functions, namely .
Our contribution and techniques: In this paper we analyze Gradient Descent/Ascent (GDA) and Optimistic Gradient Descent/Ascent (OGDA) dynamics applied to min-max optimization problems. Our starting point is to show that both dynamics avoid their unstable fixed points (GDA-unstable and OGDA-unstable respectively, as defined in Section 1.1). This is shown by using techniques from dynamical systems, following the line of work of recent papers in Optimization and Machine Learning . In a nutshell we show that the update rules of both dynamics are local diffeomorphismsA local diffeomorphism is a function that locally is invertible, smooth and its (local) inverse is also smooth. and we then make use of Center-Stable manifold theorem A.1. These results are given as Theorems 2.2, 3.2. One important step in our approach is the construction of a dynamical system for OGDA in order to apply dynamical systems’ techniques.
We next study the set of stable fixed points of GDA dynamics and their relation to locally min-max solutions, called local min-max.In optimization literature they are called local saddles.). Informally, a local min-max critical point satisfies the following: compared to value , if we fix and perturb infinitesimally, the value of does not increase and similarly if we fix and perturb infinitesimally, the value of does not decrease. We show that the set of stable fixed points of GDA is a superset of the set of local min-max and there are functions in which this inclusion is strict. This is given in Lemmas 2.4, 2.7, 2.5.
Finally, we analyze OGDA dynamics which is a bit trickier than GDA due to the nature of the dynamics, namely the existence of memory in the dynamics: the next iterate depends on the gradient of the current and previous point. We construct a dynamical system that captures OGDA dynamics (see Equation (12)), using a construction that is commonly employed in differential equations. Importantly, we establish a mapping (relation) between the eigenvalues of the Jacobian of the update rules of both GDA and OGDA, showing that OGDA stable fixed points is a superset of GDA stable ones (under mild assumptions on the stepsize), namely (we suggest the reader to see first Remark 1.5 to avoid confusion):
We note that the inclusion above are strict.
1 Important Definitions
We have already stated our min-max problem of interest (2) as well as the Gradient Descent/Ascent (GDA) dynamics (3) and Optimistic Gradient Descent/Ascent (OGDA) dynamics (4) that we plan to analyze. We provide some further definitions.
Let be continuously differentiable. We call a fixed point linearly stable or just stable if, for the Jacobian of computed at , it holds that its spectral radius is at most one and otherwise we call it linearly unstable or just unstable.
A fixed point of is called Lyapunov stable if, for every , there exists a such that if with Ball of radius . we have that for every . That is, if dynamics starts close enough to , it remains close for all times.
A fixed point of is called (locally) asymptotically stable (or attracting) if it is Lyapunov stable and there exists a such that, for all we have that as . That is, there is a small neighborhood around so that, for all initializations in that neighborhood, the dynamics converges to .
We call a fixed point hyperbolic iff the Jacobian of computed at has no eigenvalues with absolute value .
If the Jacobian of the update rule at a stable fixed point has spectral radius less than one, then the fixed point is asymptotically stable. Therefore, if a fixed point is hyperbolic, then linear stability implies asymptotic stability.
It is easy to see that a fixed point of the GDA dynamics (3) arises whenever , or in other words whenever such that .
Since the OGDA dynamics (4) has memory, it is more appropriate to think of the dynamics as mapping a quadruple to a quadruple . In this case, a fixed point arises whenever , or in other words whenever and .
Given Proposition 1.4, it follows that spectral analysis of the Jacobian of the fixed points can give us qualitative information about the local behavior of the dynamics. Unless otherwise specified, throughout this paper, whenever we say “stable” we mean linearly stable. GDA/OGDA-stable critical points are critical points that are stable with respect to GDA/OGDA dynamics (for fixed stepsize , otherwise are unstable). Moreover since different choices of stepsize might give different stability for GDA and OGDA dynamics, we are interested in the case is “sufficiently” small. Therefore in the sections we characterize the GDA/OGDA-stable critical points, a point is classified as GDA/OGDA-stable if there exists a sufficiently small number such that for all stepsizes we have that the is a stable fixed point of GDA/OGDA dynamics (in case there exists a small so that for all stepsizes we have that is an unstable fixed point of GDA/OGDA dynamics, it is classified as GDA/OGDA-unstable).
Optimization.
We use the following standard terminology.
For a min-max problem (2) where is twice continuously differentiable,
A point is a critical point of if .
A critical point is isolated if there is a neighborhood around where is the only critical point.If the critical points are isolated then they are countably many or finite. Otherwise it is called non-isolated.
A critical point is a local min-max point if there exists a neighborhood around so that for all we have that .In optimization literature these critical points are also called local saddle points. If is the whole domain then we call it global min-max.
A critical point is a strongly local min-max point if and .
2 Formal Statement of Results
We present our main results for GDA and OGDA, to be proven in Sections 2 and 3. Some of our claims make use of the following assumptions about the objective function of (2):
(the Hessian of ) is invertible for all .
GDA is non-imaginary at a critical point of iff
has no eigenvalue whose real part is . captures the difference where is the Jacobian of GDA dynamics and the identity matrix.
Our two main results are stated as follows:
Assume is twice differentiable and is Lipschitz with constant .
Let be a local min max critical point that satisfies Assumption 1.8. For sufficiently small it holds that is GDA-stable fixed point. There is a function with critical point which violates Assumption 1.8, is local min-max but not GDA-stable for any (Lemmas 2.4, 2.7 and 2.6).
Additionally, if is a strongly local min max critical point then Assumption 1.8 is satisfied and for sufficiently small we get is GDA-stable (Remark 2.8).
Finally there is a function with a critical point which is not local min-max but it is GDA-stable (for sufficiently small , Lemma 2.5).
Let be a GDA-stable fixed point. For it holds that is OGDA-stable. Moreover the inclusion is strict, i.e., there is a function with critical point which is OGDA-stable but not GDA-stable (for small enough , Lemmas 3.4 and 3.5).
Assume is twice differentiable and is Lipschitz with constant . The set of initial vectors so that GDA converges to (linearly) GDA-unstable fixed points (critical points) is of measure zero. Under Assumption 1.7, the set of initial vectors so that OGDA converges to (linearly) OGDA-unstable fixed points (critical points) is of measure zero. These statements are captured by Theorems 2.2 and 3.2.
Analysis of Gradient Descent/Ascent
In this section we analyze the local behavior (which carries over to a global characterization under Lemma 2.1 and Center-stable manifold theorem A.1) of GDA dynamics (3). In all our statements (theorems, lemmas etc) we work with real-valued function that is twice differentiable and we also assume is Lipschitz with constant and that the stepsize satisfies (unless stated otherwise in the statement of a lemma/theorem).
We need to show the following lemma in order to use the stable manifold theorem (see Theorem A.1).
Let be twice differentiable and is Lipschitz with constant . Assume that . The update rule of the GDA dynamics (3) is a local diffeomorphism.
Let be the update rule of the dynamics (3). It suffices to show that the Jacobian of is invertible and by the use of Inverse Function theorem, the claim follows. After straightforward calculations we get
It suffices to show that the matrix below does not have an eigenvalue that is equal to (by just subtracting the identity matrix),
If the function is L-Lipschitz, it follows that (Lemma 6 in ). Therefore by equation (9) we have that . The claim follows. ∎
Let be twice differentiable and is Lipschitz with constant . Assume that and let be the update rule of the GDA dynamics (3), be a GDA-unstable critical point and be its stable set, i.e.,
It holds that is of Lebesgue measure zero. Moreover if is union of the stable sets of all GDA-unstable critical points, then has also measure zero (namely the proof works for non-isolated critical points).
The following corollary is immediate from Theorem 2.2.
2 Characterizing GDA-stability
Assume that and let be a local min-max critical point of and matrix (see equations (5)) computed at has real eigenvalues. It holds that is GDA-stable.
By definition of local min-max, it holds that is positive semi-definite and also is negative semi-definite. Hence the symmetric matrix below (matrix is given by equation (8))
is negative semi-definite. We use the Ky Fan inequality which states that the sequence (in decreasing order) of the eigenvalues of majorizes the real part of the sequence of the eigenvalues of (see , page 4). By assumption that has real eigenvalues we conclude that since is negative semi-definite. Therefore the spectrum of lies in $(\mathbf{x}^{*},\mathbf{y}^{*})$ is GDA-stable. ∎
The converse of Lemma 2.4 is false. There are functions with critical points that are GDA-stable but not local min-max. An example is See Figure 1..
We provide an example with two variables (so that we can also give a figure). Let . Computing the Jacobian of the update rule of dynamics (3) at point we get that
Both eigenvalues of have magnitude less than 1 (for any where ). Finally matrix has real eigenvalues. Therefore there exists a neighborhood of so that for all , we get that for GDA dynamics (3). However it is clear that is not a local min-max. See also Figure 1 for a pictorial illustration of the result. ∎
We end Section 2 by characterizing the case in which has complex eigenvalues.
There are functions with critical points that are not GDA-stable but are local min-max when matrix (see equations (5)) has imaginary eigenvalues.
Let . It is clear that critical point is a local min-max point. Computing the Jacobian of the update rule of dynamics (3) at point we get that
For any we have that the eigenvalues of are ,We denote . so they have magnitude greater than (and is clear that has complex eigenvalues). It is easy to see that , i.e., inductively we have
We complete the characterization for the relation between GDA-stable critical points and local min-max with the following lemma:
Let be a local min-max critical point of and matrix (see equations (5)) computed at has all its eigenvalues with real part nonzero (i.e., Assumption 1.8). There is a small enough step-size so that is GDA-stable.
The proof follows the steps of the proof of Lemma 2.4. Similarly, using Ky Fan inequality we know that for any eigenvalue of it holds that
Hence we conclude that . Additionally, the corresponding eigenvalue of is . By choosing , it is easy to see that for all the eigenvalues of , hence the eigenvalues of have magnitude less than one. ∎
If the critical point is strongly local min-max then and hence is attracting under GDA dynamics, i.e., it holds that .
Optimistic Gradient Descent/Ascent
The results of the previous section cannot carry over to Optimistic Gradient Descent/Ascent due to the fact that the dynamics has memory and is more challenging to analyze. Here we show that Optimistic Gradient Descent/Ascent avoid OGDA-unstable critical points and we also relate the eigenvalues of the Jacobian of OGDA to the eigenvalues of the Jacobian of GDA. In particular we show that (inclusion strict). In the beginning we will construct a dynamical system that captures the dynamics of OGDA (4).
We define the function to be for all (think of the last two vector components as dummy for function , its value does not depend on them). Hence it is clear that and . The same holds for and .
We define the following function which consists of 4 components:
It is not hard to check that so captures exactly the dynamics of OGDA (4). The idea behind the construction of the dynamical system above is common in the literature of ODEs (ordinal differential equations) where in order to solve (typically to understand the qualitative behavior) a higher order ODE, one approach is to express it as a linear system of ODEs.
2 Analyzing OGDA via system (12)
As in the case of GDA, we need to show the following key lemma in order to use the Center-stable manifold theorem.
Let is real-valued and is Lipschitz with constant and . Under the Assumption 1.7 we get that the update rule of the OGDA dynamics (12) is a local diffeomorphism.
It suffices to show that the Jacobian of , denoted by is invertible and then by Inverse Function theorem the claim follows. After calculations the Jacobian boils down to the following (we set ) :
Observe that for , the matrix is not invertible, as opposed to the case of GDA which is the identity matrix and hence is invertible. It is easy to see that for , then , namely it is not even (not even locally).
The null space of is the same as the null space of the following matrix (after row and column operations)
It is clear that under the assumption that the Hessian is invertible (see Assumption 1.7), we get that
Again as in Section 2, we are able to prove the following measure zero argument using Lemma 3.1 and Center-Stable manifold theorem.
Let be twice differentiable and is Lipschitz with constant . Suppose that Assumption 1.7 holds and . Let be the update rule of the OGDA dynamics (4), be a OGDA-unstable critical point and be its stable set, i.e.,
It holds that is of Lebesgue measure zero. Moreover if is union of the stable sets of all OGDA-unstable critical points, then has also measure zero (namely the proof works for non-isolated critical points).
The following corollary is immediate from Theorem 3.2.
3 Characterizing OGDA-stability
In this subsection we provide an analysis for the eigenvalues of the Jacobian matrix of the update rule of the system (12). We begin by claiming that the set of GDA-stable critical points is a subset of the set of OGDA-critical points. We manage to show this by constructing a mapping between the eigenvalues of and .
Let be twice differentiable and be L-Lipschitz. Assume that and suppose is a critical point that is GDA-stable (i.e., stable according to dynamics (3)). The critical point is stable according to OGDA dynamics (4).
A fixed point of the dynamics (4) is of the form (see Remark 1.5). The Jacobian of the update rule becomes as follows:
We would like to find a relation between the eigenvalues of matrix (16) and matrix (6) (relate the Jacobian of both dynamics GDA and OGDA). We start with the matrix
The absolute value of the determinant of a matrix remains invariant under row/column operations (add a multiple of a row/column to another row/column or exchange rows/columns). After such operations, the determinant of the matrix above has determinant in absolute value equal to (we assume that )
The determinant above is equal to , where
It is clear that is not an eigenvalue of . Let be the characteristic polynomial of (6, Jacobian of GDA dynamics at ). The characteristic polynomial of ends up being equal to
Let be an eigenvalue of matrix (8), i.e., is an eigenvalue of . From relation (17) it turns out that the roots of the polynomial
are eigenvalues of the matrix . For it holds that and it turns out that all the roots of the quadratic equation (18) have magnitude at most one (see Mathematica code in Section A.1 for a proof of the inequality). ∎
We conclude the subsection with the following claim and a remark.
There are functions with critical points that are OGDA-stable but not GDA-stable.
The easiest example is . It is clear that the Jacobian of GDA dynamics (3) is given by
which has eigenvalues (magnitude greater than one) and hence the critical point is GDA-unstable. However, the Jacobian of OGDA dynamics (4) is given by
which has the four eigenvalues . For all the four eigenvalues have magnitude less than or equal to 1, hence is OGDA-stable (see mathematica code A.2 for the inequality claim). Another example which is not bilinear (Assumption 1.7 is satisfied) is the function (this is used in the example section). ∎
We would like to note that some of our results (e.g., Lemma 3.1 and Theorem 3.2) are not applicable to a generic bilinear function , since if is not a square matrix, the Hessian is not invertible (they are applicable only when is square matrix and invertible).
Examples and Experiments
The function has the property that the critical point is GDA-stable but not local min-max (see Lemma 2.5). Moreover, consider . This function has the property that the critical point is GDA-unstable and is easy to check that is not a local min-max. We construct the polynomial function . Function has the property that around behaves like and around behaves like . The GDA dynamics of can be seen in Figure 2. However more critical points are created. There are five critical points, i.e, (in interval , the last critical point is computed approximately). In Table 1 we observe that the critical point is stable for OGDA but unstable for GDA (essentially OGDA has more attracting critical points). Moreover, our theorems of avoiding unstable fixed points are verified with this experiment. Note that there are some initial conditions that GDA and OGDA dynamics don’t converge (3% and 9.8% respectively).
2 Higher dimensional
Let , where is the random 3-degree polynomial as mentioned above and . It is clear that locally at behaves like function (which has as a local min-max critical point). We run for 10000 uniformly random points in and it turns out that 87% of initial points converge to in OGDA as opposed to GDA which 79.3% fraction converged. This experiment indicates qualitative difference between the two methods, where the area of region of attraction in OGDA is a bit larger.
Conclusion
In this paper we made a step towards understanding first order methods which are used to solve min-max optimization problems, by analyzing the local behavior of GDA and OGDA dynamics around critical points. Our paper is an indication that important first order methods we analyze fail to converge to only local min-max solutions(standard concept in optimization literature). Whether or not local min-max solutions is a good concept is out of the scope of this paper. Local min-max solutions might not be all equally good and some may be bad, which is really important in applications such as training GANs. Nevertheless, even for minimization problems, finding good local minima is a hard task that is not well understood in the literature (most first order methods guarantee convergence to some local minimum, without guarantees about its quality). A forteriori guaranteeing good solutions in a min-max problem is a harder proposition and an important open question.
References
Appendix A Missing theorems and proofs
Let be a fixed point for the local diffeomorphism . Suppose that , where is the span of the eigenvectors corresponding to eigenvalues of magnitude less than or equal to one of , and is the span of the eigenvectors corresponding to eigenvalues of magnitude greater than one of Jacobian of function .. Then there exists a embedded disk of dimension that is tangent to at called the local stable center manifold. Moreover, there exists a neighborhood of , such that , and .
It follows the general line of the papers . We assume that the update rule of GDA, OGDA dynamics is a diffeomorphism (as proved in Lemmas 2.1 and 3.1). The proof is generic and has appeared in . Let be the set of unstable critical points of a dynamical system with update rule a function (in ). For each , there is an associated open neighborhood promised by the Stable Manifold Theorem A.1. forms an open cover, and since is second-countable we can extract a countable subcover, so that .
Define (stable set of ). Fix a point . Since , then for some non-negative integer and all , . Since we have a countable sub-cover, for some and all . This implies that for all . By Theorem A.1, is a subset of the local center stable manifold which has co-dimension at least one, and is thus measure zero.
Finally, implies that . Since is unknown we union over all non-negative integers, to obtain . Since was arbitrary, we have shown that . Using Lemma 1 of page 5 in and that countable union of measure zero sets is measure zero, has measure zero.