On Finding Local Nash Equilibria (and Only Local Nash Equilibria) in Zero-Sum Games

Eric V. Mazumdar, Michael I. Jordan, S. Shankar Sastry

Introduction

The classical problem of finding Nash equilibria in multi-player games has been a focus of intense research in computer science, control theory, economics and mathematics (Basar and Olsder, 1998; Nisan et al., 2007; C. Daskalakis, 2009). Some connections have been made between this extensive literature and machine learning (see, e.g., Cesa-Bianchi and Lugosi, 2006; Banerjee and Peng, 2003; Foerster et al., 2017), but these connections have focused principally on decision-making by single agents and multiple agents, and not on the burgeoning pattern-recognition side of machine learning, with its focus on large data sets and simple gradient-based algorithms for prediction and inference. This gap has begun to close in recent years, due to new formulations of learning problems as involving competition between subsystems that are construed as adversaries (Goodfellow et al., 2014), the need to robustify learning systems with regard to against actual adversaries (Xu et al., 2009) and with regard to mismatch between assumptions and data-generating mechanisms (Yang, 2011; Giordano et al., 2018), and an increasing awareness that real-world machine-learning systems are often embedded in larger economic systems or networks (Jordan, 2018).

These emerging connections bring significant algorithmic and conceptual challenges to the fore. Indeed, while gradient-based learning has been a major success in machine learning, both in theory and in practice, work on gradient-based algorithms in game theory has often highlighted their limitations. For example, gradient-based approaches are known to be difficult to tune and train (Daskalakis et al., 2017; Mescheder et al., 2017; Hommes and Ochea, 2012; Balduzzi et al., 2018), and recent work has shown that gradient-based learning will almost surely avoid a subset of the local Nash equilibria in general-sum games (Mazumdar and Ratliff, ). Moreover, there is no shortage of work showing that gradient-based algorithms can converge to limit cycles or even diverge in game-theoretic settings (Benaïm and Hirsch, 1999; Hommes and Ochea, 2012; Daskalakis et al., 2017; Mertikopoulos et al., 2018b).

These drawbacks have led to a renewed interest in approaches to finding the Nash equilibria of zero-sum games, or equivalently, to solving saddle point problems. Recent work has attempted to use second-order information to reduce oscillations around equilibria and speed up convergence to fixed points of the gradient dynamics (Mescheder et al., 2017; Balduzzi et al., 2018). Other recent approaches have attempted to tackle the problem from the variational inequality perspective but also with an eye on reducing oscillatory behaviors (Mertikopoulos et al., 2018a; Gidel et al., 2018).

None of these approaches, however, address a fundamental issue that arises in zero-sum games. As we will discuss, the set of attracting fixed points for the gradient dynamics in zero-sum games can include critical points that are not Nash equilibria. In fact, any saddle point of the underlying function that does not satisfy a particular alignment condition of a Nash equilibrium is a candidate attracting equilibrium for the gradient dynamics. Further, as we show, these points are attracting for a variety of recently proposed adjustments to gradient-based algorithms, including consensus optimization (Mescheder et al., 2017), the symplectic gradient adjustment (Balduzzi et al., 2018), and a two-timescale version of simultaneous gradient descent (Heusel et al., 2017). Moreover, we show by counterexample that these algorithms can all converge to non-Nash stationary points.

We present a new gradient-based algorithm for finding the local Nash equilibria of two-player zero-sum games and prove that the only stationary points to which the algorithm can converge are local Nash equilibria. Our algorithm makes essential use of the underlying structure of zero-sum games. To obtain our theoretical results we work in continuous time—via an ordinary differential equation (ODE)—and our algorithm is obtained via a discretization of the ODE. While a naive discretization would require a matrix inversion and would be computationally burdensome, our discretization is a two-timescale discretization that avoids matrix inversion entirely and is of a similar computational complexity as that of other gradient-based algorithms.

The paper is organized as follows. In Section 2 we define our notation and the problem we address. In Section 3 we define the limiting ODE that we would like our algorithm to follow and show that it has the desirable property that its only limit points are local Nash equilibria of the game. In Section 4 we introduce local symplectic surgery, a two-timescale procedure that asymptotically tracks the limiting ODE and show that it can be implemented efficiently. Finally, in Section 5 we present two numerical examples to validate the algorithm. The first is a toy example with three local Nash equilibria, and one non-Nash fixed point. We show that simultaneous gradient descent and other recently proposed algorithms for zero-sum games can converge to any of the four points while the proposed algorithm only converges to the local Nash equilibria. The second example is a small generative adversarial network (GAN), where we show that the proposed algorithm converges to a suitable solution within a similar number of steps as simultaneous gradient descent.

Preliminaries

for all xx and yy in neighborhoods of xx^{*} and yy^{*} respectively. Such equilibria are locally optimal for both players with respect to their own decision variable, meaning that neither player has an incentive to unilaterally deviate from such a point. As was shown in Ratliff et al. (2013), generically, local Nash equilibria will satisfy slightly stronger conditions, namely they will be differential Nash equilibria (DNE):

Dxf(x,y)=0D_{x}f(x^{*},y^{*})=0 and Dyf(x,y)=0D_{y}f(x^{*},y^{*})=0.

Dxx2f(x,y)0D^{2}_{xx}f(x^{*},y^{*})\succ 0, and Dyy2f(x,y)0D^{2}_{yy}f(x^{*},y^{*})\prec 0.

Here DxfD_{x}f and DyfD_{y}f denote the partial derivatives of ff with respect to xx and yy respectively, and Dxx2fD^{2}_{xx}f and Dyy2fD^{2}_{yy}f denote the matrices of second derivatives of ff with respect to xx and yy. Both differential and local Nash equilibria in two-player zero-sum games are, by definition, special saddle points of the function ff that satisfy a particular alignment condition with respect to the player’s decision variables. Indeed, the definition of differential Nash equilibria, which holds for almost all local Nash equilibria in a formal mathematical sense, makes this condition clear: the directions of positive and negative curvature of the function ff at a local Nash equilibria must be aligned with the minimizing and maximizing player’s decision variables respectively.

We note that the key difference between local and differential Nash equilibria is that Dxx2f(x,y)D^{2}_{xx}f(x,y), and Dyy2f(x,y)D^{2}_{yy}f(x,y) are required to be definite instead of semidefinite. This distinction simplifies our analysis while still allowing our results to hold for almost all continuous games.

Having introduced local Nash equilibria as the solution concept of interest, we now consider how to find such solutions, and in particular we highlight some issues with gradient-based algorithms in zero-sum continuous games. The most common method of finding local Nash equilibria in such games is to have both players randomly initialize their variables (x0,y0)(x_{0},y_{0}) and then follow their respective gradients. That is, at each step n=1,2,...n=1,2,..., each agent updates their variable as follows:

In this notation, the simGD update is given by:

Since (1) is in the form of a discrete-time dynamical system, it is natural to examine its limiting behavior through the lens of dynamical systems theory. Intuitively, given a properly chosen sequence of step sizes, (1) should have the same limiting behavior as the continuous-time flow:

We remark that the diagonal blocks of J(z)J(z) are always symmetric and Dxy2f=(Dyx2f)TD^{2}_{xy}f=(D^{2}_{yx}f)^{T}. Thus J(z)J(z) can be written as the sum of a block symmetric matrix S(z)S(z) and a block anti-symmetric matrix A(z)A(z), where:

Given the structure of the Jacobian, we can now draw links between differential Nash equilibria and equilibrium concepts in dynamical systems theory. We focus on hyperbolic critical points of ω\omega.

It is well known that hyperbolic critical points are generic among critical points of smooth dynamical systems (see e.g. (Sastry, 1999)), meaning that our focus on hyperbolic critical points is not very restrictive. Of particular interest are locally asymptotically stable equilibria of the dynamics.

LASE have the desirable property that they are locally exponentially attracting under the flow of ω-\omega. This implies that a properly discretized version of z˙=ω(z)\dot{z}=-\omega(z) will also converge exponentially fast in a neighborhood of such points. LASE are the only attracting hyperbolic equilibria. Thus, making statements about all the LASE of a certain continuous-time dynamical system allows us to characterize all attracting hyperbolic equilibria.

As shown in Ratliff et al. (2013) and Nagarajan and Kolter (2017), the fact that all differential Nash equilibria are critical points of ω\omega coupled with the structure of JJ in zero-sum games guarantees that all differential Nash equilibria of the game are LASE of the gradient dynamics. However the converse is not true. The structure present in zero-sum games is not enough to ensure that the differential Nash equilibria are the only LASE of the gradient dynamics. When either Dxx2fD^{2}_{xx}f or Dyy2fD^{2}_{yy}f is indefinite at a critical point of ω(z)\omega(z), the Jacobian can still have eigenvalues with strictly positive real parts.

Such points, which we refer to as non-Nash LASE of (2), are what makes having guarantees on the convergence of algorithms in zero-sum games particularly difficult. Non-Nash LASE are not locally optimal for both players, and may not even be optimal for one of the players. By definition, at least one of the two players has a direction in which they would move to unilaterally decrease their cost. Such points arise solely due to the gradient dynamics, and persist even in other gradient-based dynamics suggested in the literature. In Appendix B, we show that three recent algorithms for finding local Nash equilibria in zero-sum continuous games—consensus optimization, symplectic gradient adjustment, and a two-time scale version of simGD—are susceptible to converge to such points and therefore have no guarantees of convergence to local Nash equilibria. We note that such points can be very common since every saddle point of ff that is not a local Nash equilibrium is a candidate non-Nash LASE of the gradient dynamics. Further, local minima or maxima of ff could also be non-Nash LASE of the gradient dynamics.

To understand how non-Nash equilibria can be attracting under the flow of ω-\omega, we again analyze the Jacobian of ω\omega. At such points, the symmetric matrix S(z)S(z) must have both positive and negative eigenvalues. The sum of SS with AA, however, has eigenvalues with strictly positive real part. Thus, the anti-symmetric matrix A(z)A(z) can be seen as stabilizing such points.

Previous gradient-based algorithms for zero-sum games have also pinpointed the matrix AA as the source of problems in zero-sum games, however they focus on a different issue. Consensus optimization (Mescheder et al., 2017) and the symplectic gradient adjustment (Balduzzi et al., 2018) both seek to adjust the gradient dynamics to reduce oscillatory behaviors in neighborhoods of stable equilibria. Since the matrix A(z)A(z) is anti-symmetric, it has only imaginary eigenvalues. If it dominates SS, then the eigenvalues of JJ can have a large imaginary component. This leads to oscillations around equilibria that have been shown empirically to slow down convergence (Mescheder et al., 2017). Both of these adjustments rely on tunable hyper-parameters to achieve their goals. Their effectiveness is therefore highly reliant on the choice of parameter. Further, as shown in Appendix B neither of the adjustments are able to rule out convergence to non-Nash equilibria.

A second promising line of research into theoretically sound methods of finding the Nash equilibria of zero-sum games has approached the issue from the perspective of variational inequalities (Mertikopoulos et al., 2018a; Gidel et al., 2018). In Mertikopoulos et al. (2018a) extragradient methods were used to solve coherent saddle point problems and reduce oscillations when converging to saddle points. In such problems, however, all saddle points of the function ff are assumed to be local Nash equilibria, and thus the issue of converging to non-Nash equilibria is assumed away. Similarly, by assuming that ω\omega is monotone, as in the theoretical treatment of the averaging scheme proposed in Gidel et al. (2018), the cost function is implicitly assumed to be convex-concave. This in turn implies that the Jacobian satisfies the conditions for a Nash equilibrium everywhere. The behavior of their approaches in more general zero-sum games with less structure (like the training of GANs) is therefore not well known. Moreover, since their approach relies on averaging the gradients, they do not fundamentally change the nature of the critical points of simGD.

In the following sections we propose an algorithm for which the only LASE are the differential Nash equilibria of the game. We also show that, regardless of the choice of hyper-parameter, the Jacobian of the new dynamics at LASE has real eigenvalues, which means that the dynamics cannot exhibit oscillatory behaviors around differential Nash equilibria.

Constructing the limiting differential equation

In this section we define the continuous-time flow that our discrete-time algorithm should ideally follow.

We do not require J(z)J(z) to be invertible everywhere, but only at the critical points of ff.

The function λ\lambda ensures that, even when JJ is not invertible everywhere, the inverse matrix in (4) exists. The vanishing condition λ(z)=0    ω(z)=0\lambda(z)=0\iff\omega(z)=0 ensures us that the Jacobian of the adjustment term is exactly JT(z)J^{T}(z) at differential Nash equilibria.

The dynamics introduced in (4) can be seen as an adjusted version of the gradient dynamics where the adjustment term only allows trajectories to approach critical points of ω\omega along the players’ axes. If a critical point is not locally optimal for one of the players (i.e., it is a non-Nash critical point) then that player can push the dynamics out of a neighborhood of that point. The mechanism is easier to see if we assume JJ is invertible and set λ(z)0\lambda(z)\equiv 0. This results in the following dynamics:

In this simplified form we can see that the Jacobian of the adjustment is approximately JTJ^{T} when ω(z)2||\omega(z)||_{2} is small. This approximation is exact at critical points of ω\omega. Adding this adjustment term to ω\omega exactly cancels out the rotational part of the vector field contributed by the antisymmetric matrix A(z)A(z) in a neighborhood of critical points. Since we identified A(z)A(z) as the source of oscillatory behaviors and non-Nash equilibria in Section 2, this adjustment addresses both of these issues. The following theorem establishes this formally.

If zz is a critical point of z˙=h(z)\dot{z}=-h(z), then the Jacobian of hh at zz has real eigenvalues.

Clearly, ω(z)=0    h(z)=0\omega(z)=0\implies h(z)=0. To show the converse, we assume that h(z)=0h(z)=0 but ω(z)0\omega(z)\neq 0. This implies that:

Since we assumed that this cannot be true, we must have that h(z)=0    ω(z)=0h(z)=0\implies\omega(z)=0.

Having shown that under our assumptions, the critical points of hh are the same as those of ω\omega, we now note that the Jacobian of h(z)h(z) at a critical point must have the form:

By assumption, at critical points, J(z)J(z) is invertible and λ(z)=0\lambda(z)=0. Given that ω(z)=0\omega(z)=0, terms that include ω(z)\omega(z) disappear, and the adjustment term contributes only a factor of JT(z)J^{T}(z) to the Jacobian of hh at a critical point. This exactly cancels out the antisymmetric part of the Jacobian of ω\omega. The Jacobian of hh is therefore symmetric at critical points of ω\omega and has positive eigenvalues only when Dxx2f(z)0D_{xx}^{2}f(z)\succ 0 and Dyy2f(z)0D_{yy}^{2}f(z)\prec 0.

Since these are also the conditions for differential Nash equilibria, all differential Nash equilibria of G\mathcal{G} must be LASE of z˙=h(z)\dot{z}=-h(z). Further, non-Nash LASE of z˙=ω(z)\dot{z}=-\omega(z) cannot be LASE of z˙=h(z)\dot{z}=-h(z), since by definition either Dxx2f(z)D_{xx}^{2}f(z) or Dyy2f(z)D_{yy}^{2}f(z) is indefinite at such points. To show the second part of the theorem, we simply note that JhJ_{h} must be symmetric at all critical points which in turn implies that it has only real eigenvalues.

The continuous-time dynamical system therefore solves both of the problems we highlighted in Section 2, for any choice of the function λ\lambda that satisfies our assumptions. The assumption that ω(z)\omega(z) is never an eigenvector of JT(z)(JT(z)J(z)+λ(z)I)1JT(z)J^{T}(z)\left(J^{T}(z)J(z)+\lambda(z)I\right)^{-1}J^{T}(z) with an eigenvalue of 1-1 ensures that the adjustment does not create new critical points. In high dimensions the assumption is mild since the scenario is extremely specific, but it is also possible to show that this assumption can be removed entirely by adding a time-varying term to \eqrefeq:ctalg\eqref{eq:ctalg} while still retaining the theoretical guarantees. We show this in Appendix A.

Theorem 4 shows that the only attracting hyperbolic equilibria of the limiting ordinary differential equation (ODE) are the differential Nash equilibria of the game. Also, since Jh(z)J_{h}(z) is symmetric at critical points of ω\omega, if either Dxx2f(z)D_{xx}^{2}f(z) or Dyy2f(z)-D_{yy}^{2}f(z) has at least one negative eigenvalue then such a point would be a linearly unstable equilibrium of z˙=h(z)\dot{z}=-h(z). Such points are linearly unstable and are therefore almost surely avoided when the algorithm is randomly initialized (Benaïm and Hirsch, 1995; Sastry, 1999).

Theorem 4 also guarantees that the continuous-time dynamics do not oscillate near critical points. Oscillatory behaviors, as outlined in Mescheder et al. (2017), are known to slow down convergence of the discretized version of the process. Reducing oscillations near critical points is the main goal of consensus optimization (Mescheder et al., 2017) and the symplectic gradient adjustment (Balduzzi et al., 2018). However, for both algorithms, the extent to which they are able to reduce the oscillations depends on the choice of hyperparameter. The proposed dynamics achieves this for any λ(z)\lambda(z) that satisfies our assumptions.

We close this section by noting that one can pre-multiply the adjustment term by some function g(z)g(z) such that ω(z)=0    g(z)=1\omega(z)=0\implies g(z)=1 while still retaining the theoretical properties described in Theorem 4. Such a function can be used to ensure that the dynamics closely track a trajectory of simGD except in neighborhoods of critical points. For example, if the matrix JJ is ill-conditioned, such a term could be used to ensure that the adjustment does not dominate the underlying gradient dynamics. In Section 5 we give an example of such a damping function.

Two-timescale approximation

Given the limiting ODE, we could perform a straightforward Euler discretization to obtain a discrete-time update having the form:

However, due to the matrix inversion, such a discrete-time update would be prohibitively expensive to implement in high-dimensional parameter spaces like those encountered when training GANs. To solve this problem, we now introduce a two-timescale approximation to the continuous-time dynamics that has the same limiting behavior, but is much faster to compute at each iteration, than the simple discretization. Since this procedure serves to exactly remove the symplectic part, A(z)A(z) of Jacobian in neighborhoods of hyperbolic critical points, we refer to this two-timescale procedure as local symplectic surgery (LSS). In Appendix A we derive the two-timescale update rule for the time-varying version of the limiting ODE and show that it also has the same properties.

The two-timescale approximation to (4) is given by:

where h1h_{1} and h2h_{2} are defined as:

and the sequences of step sizes {an}n=0\{a_{n}\}_{n=0}^{\infty},{bn}n=0\{b_{n}\}_{n=0}^{\infty} satisfy the following assumptions:

The sequences {an}n=0\{a_{n}\}_{n=0}^{\infty} and {bn}n=0\{b_{n}\}_{n=0}^{\infty} satisfy:

i=1ai=\sum_{i=1}^{\infty}a_{i}=\infty, and i=1bi=\sum_{i=1}^{\infty}b_{i}=\infty;

i=1ai2<\sum_{i=1}^{\infty}a^{2}_{i}<\infty, and i=1bi2<\sum_{i=1}^{\infty}b^{2}_{i}<\infty;

limnanbn=0\lim_{n\rightarrow\infty}\frac{a_{n}}{b_{n}}=0.

We note that h2h_{2} is Lipschitz continuous in vv uniformly in zz under Assumption 1.

The vv process performs gradient descent on a regularized version of least squares, where the regularization is governed by λ(z)\lambda(z). If the vnv_{n} process is on a faster time scale, the intuition is that it will first converge to (JT(zn)J(zn)+λ(n)I)1ω(zn)(J^{T}(z_{n})J(z_{n})+\lambda(_{n})I)^{-1}\omega(z_{n}), and then znz_{n} will track the limiting ODE in (4). In the next section we show that this behavior holds even in the presence of noise.

The key benefit to the two-timescale process is that zn+1z_{n+1} and vn+1v_{n+1} can be computed efficiently since neither require a matrix inversion. In fact, as we show in Appendix C, the computation can be done with Jacobian-vector products with the same order of complexity as that of simGD, consensus optimization, and the symplectic gradient adjustment. This insight gives rise to the procedure outlined in Algorithm 1.

We now show that LSS asymptotically tracks the limiting ODE even in the presence of noise. This implies that the algorithm has the same limiting behavior as (4). In particular, our setup allows us to treat the case where one only has access to unbiased estimates of h1h_{1} and h2h_{2} at each iteration. This is the setting most likely to be encountered in practice, for example in the case of training GANs in a mini-batch setting.

We assume that we have access to estimators h^1\hat{h}_{1} and h^2\hat{h}_{2} such that:

To place this in the form of classical two-timescale stochastic approximation processes, we write each estimator h^1\hat{h}_{1} and h^2\hat{h}_{2} as the sum of its mean and zero-mean noise processes MzM^{z} and MvM^{v} respectively. This results in the following two timescale process:

We assume that the noise processes satisfy the following standard conditions (Benaïm, 1999; Borkar, 2008):

Assumptions on the noise: Define the filtration Fn\mathcal{F}_{n}:

for n0n\geq 0. Given Fn\mathcal{F}_{n}, we assume that:

Mn+1vM_{n+1}^{v} and Mn+1zM_{n+1}^{z} are conditionally independent given Fn\mathcal{F}_{n} for n0n\geq 0.

Given our assumptions on the estimator, cost function, and step sizes we now show that (7) asymptotically tracks a trajectory of the continuous-time dynamics almost surely. Since h1h_{1}, h2h_{2}, and v(z)=(JT(z)J(z)+λ(z)I)1ω(z)v^{*}(z)=(J^{T}(z)J(z)+\lambda(z)I)^{-1}\omega(z) are not uniformly Lipschitz continuous in both zz and vv, we cannot directly invoke results from the literature. Instead, we adapt the proof of Theorem 2 in Chapter 6 of Borkar (2008) to show that vnv(zn)v_{n}\rightarrow v^{*}(z_{n}) almost surely. We then invoke Proposition 4.1 from Benaïm (1999) to show that znz_{n} asymptotically tracks z˙=h(z)\dot{z}=-h(z). We note that this approach only holds on the event {supnzn+vn<}\{\sup_{n}||z_{n}||+||v_{n}||<\infty\}. Thus, if the stochastic approximation process remains bounded, then under our assumptions we are sure to track a trajectory of the limiting ODE.

Under Assumptions 1-3, and on the event {supnzn2+vn2<}\{\sup_{n}||z_{n}||_{2}+||v_{n}||_{2}<\infty\}:

where Mˉn+1z=anbnMn+1z\bar{M}^{z}_{n+1}=\frac{a_{n}}{b_{n}}M^{z}_{n+1}. By assumption, anbn0\frac{a_{n}}{b_{n}}\rightarrow 0. Since h1h_{1} is locally Lipschitz continuous, it is bounded on the event {supnzn2+vn2<}\{\sup_{n}||z_{n}||_{2}+||v_{n}||_{2}<\infty\}. Thus, anbnh1(zn,vn)0\frac{a_{n}}{b_{n}}h_{1}(z_{n},v_{n})\rightarrow 0 almost surely.

From Lemma 1 in Chapter 6 of Borkar (2008) , the above processes, on the event {supnzn2+vn2<}\{\sup_{n}||z_{n}||_{2}+||v_{n}||_{2}<\infty\}, converge almost surely to internally chain-transitive invariant sets of v˙=h2(z,v)\dot{v}=-h_{2}(z,v) and z˙=0\dot{z}=0. Since, for a fixed zz, h2(z,v)h_{2}(z,v) is a Lipschitz continuous function of vv with a globally asymptotically stable equilibrium at (JT(z)J(z)+λ(z)I)1ω(z)(J^{T}(z)J(z)+\lambda(z)I)^{-1}\omega(z), the claim follows.

Having shown that vnv(zn)20||v_{n}-v^{*}(z_{n})||_{2}\rightarrow 0 almost surely, we now show that znz_{n} will asymptotically track a trajectory of the limiting ODE. Let us first define z(t,s,zs)z(t,s,z_{s}) for tst\geq s to be the trajectory of z˙=h(z)\dot{z}=-h(z) starting at zsz_{s} at time ss.

Given Assumptions 1-3, let tn=i=0n1ait_{n}=\sum_{i=0}^{n-1}a_{i}. On the event {supnzn2+vn2<}\{sup_{n}||z_{n}||_{2}+||v_{n}||_{2}<\infty\}, for any integer K>0K>0 we have:

Proof The proof makes use of Propositions 4.1 and 4.2 in Benaïm (1999) which are supplied in Appendix E.

We note that, from Lemma 5, (v(zn)vn)0\left(v^{*}(z_{n})-v_{n}\right)\rightarrow 0 almost surely. Since JT(zn)2<Lω||J^{T}(z_{n})||_{2}<L_{\omega}, we can write this process as:

where χn0\chi_{n}\rightarrow 0 almost surely. Since hh is continuously differentiable, it is locally Lipschitz, and on the event {supnzn+vn<}\{sup_{n}||z_{n}||+||v_{n}||<\infty\} it is bounded. It thus induces a continuous globally integrable vector field, and therefore satisfies the assumptions for Propositions 4.1 in Benaïm (1999). Further, by assumption the sequence of step sizes and martingale difference sequences satisfy the assumptions of Proposition 4.2 in Benaïm (1999). Invoking Proposition 4.1 and 4.2 in Benaïm (1999) gives us the desired result.

Theorem 6 guarantees that LSS asymptotically tracks a trajectory of the limiting ODE. The approximation will therefore avoid non-Nash equilibria of the gradient dynamics. Further, the only locally asymptotically stable points for LSS must be the differential Nash equilibria of the game.

Numerical Examples

This function is a fourth-order polynomial that is scaled by an exponential to ensure that it is bounded. The gradient dynamics z˙=ω(z)\dot{z}=-\omega(z) associated with function have four LASE. By evaluating the Jacobian of ω\omega at these points we find that three of the LASE are local Nash equilibria. These are denoted by ‘x’ in Figure 1. The fourth LASE is a non-Nash equilibrium which is denoted with a star. In Figure 1, we plot the sample paths of both simGD and our limiting ODE from the same initial positions, shown with red dots. We clearly see that simGD converges to all four LASE, depending on the initialization. Our algorithm, on the other hand, only converges to the local Nash equilibria. When initialized close to the non-Nash equilibrium it diverges from the simGD path and ends up converging to a LNE.

This numerical example also allows us to study the behavior of our algorithm around LASE. By focusing on a local Nash equilibrium, as in Figure 1B, we observe that the limiting ODE approaches it directly even when simGD displays oscillatory behaviors. This empirically validates the second part of Theorem 4.

In Figure 2 we empirically validate that LSS asymptotically tracks the limiting ODE. When the fast timescale has not converged, the process tracks the gradient dynamics. Once it has converged however, we see that it closely tracks the limiting ODE which leads it to converge to only the local Nash equilibria. This behavior highlights an issue with the two-timescale approach. Since the non-Nash equilibria of the gradient dynamics are saddle points for the new dynamics they can slow down convergence. However, the process will eventually escape such points (Benaïm, 1999).

In our numerical experiments we let λ(z)=ξ1(1eω(z)2)\lambda(z)=\xi_{1}\left(1-e^{-||\omega(z)||^{2}}\right). We also make use of a damping function gg as described in Section 3. The limiting ODE is therefore given by:

where v=JT(z)(JT(z)J(z)+λ(z)I)1JT(z)ω(z)v=J^{T}(z)(J^{T}(z)J(z)+\lambda(z)I)^{-1}J^{T}(z)\omega(z). For the two-timescale process, since there is no noise we use constant step sizes and the following update:

where ξ1=1e4\xi_{1}=1e-4,ξ2=1e4\xi_{2}=1e-4,γ1=0.004\gamma_{1}=0.004, and γ2=0.005\gamma_{2}=0.005.

2 Generative adversarial network

We now train a generative adversarial network with LSS. Both the discriminator and generator are fully connected neural networks with four hidden layers of 16 neurons each. The tanh activation function is used since it satisfies the smoothness assumptions for our functions. For the latent space, we use a 16-dimensional Gaussian with mean zero and covariance Σ=0.1I16\Sigma=0.1I_{16}. The ground truth distribution is a mixture of eight Gaussians with their modes uniformly spaced around the unit circle and covariance Σ=1e4I2\Sigma=1e-4I_{2}.

In Figure 3, we show the progression of the generator at , 50005000, 1200012000, and 2000020000 iterations for a GAN initialized with the same weights and biases and then trained with A. simGD and B. LSS. We can see empirically that, in this example, LSS converges to the true distribution while simGD quickly suffers mode collapse, showing how the adjusted dynamics can lead to convergence to better equilibria. Further numerical experiments are shown in Appendix D.

We caution that convergence rate per se is not necessarily a reasonable metric on which to compare performance in the GAN setting or in other game-theoretic settings. Competing algorithms may converge faster than our method when used to train GANs, but once because the competitors could be converging quickly to a non-Nash equilibrium, which is not desirable. Indeed, the optimal solution is known to be a local Nash equilibrium for GANs (Goodfellow et al., 2014; Nagarajan and Kolter, 2017). LSS may initially move towards a non-Nash equilibrium, while subsequently escaping the neighborhood of such points before converging. This will lead to a slower convergence rate, but a better quality solution.

Discussion

We have introduced local symplectic surgery, a new two-timescale algorithm for finding the local Nash equilibria of two-player zero-sum continuous games. We have established that this comes with the guarantee that the only hyperbolic critical points to which it can converge are the local Nash equilibria of the underlying game. This significantly improves upon previous methods for finding such points which, as shown in Appendix B, cannot give such guarantees. We have analyzed the asymptotic properties of the proposed algorithm and have shown that the algorithm can be implemented efficiently. Altogether, these results show that the proposed algorithm yields limit points with game-theoretic relevance while ruling out oscillations near those equilibria and having a similar per-iteration complexity as existing methods which do not come with the same guarantees. Our numerical examples allow us to empirically observe these properties.

It is important to emphasize that our analysis has been limited to neighborhoods of equilibria; the proposed algorithm can converge in principle to limit cycles at other locations of the space. These are hard to rule out completely. Moreover, some of these limit cycles may actually have some game-theoretic relevance (Hommes and Ochea, 2012; Benaim and Hirsch, 1997). Another limitation of our analysis is that we have assumed the existence of local Nash equilibria in games. Showing that they exist and finding them is very hard to do in general. Our algorithm will converge to local Nash equilibria, but may diverge when the game does not admit equilibria or when the algorithm does not come any equilibria its region of attraction. Thus, divergence of our algorithm is not a certificate that no equilibria exist. Such caveats, however, are the same as those for other gradient-based approaches for finding local Nash equilibria.

Another drawback to our approach is the use of second-order information. Though the two-timescale approximation does not need access to the full Jacobian of the gradient dynamics, the update does involve computing Jacobian-vector products. This is similar to other recently proposed approaches but will be inherently slower to compute than pure first- or zeroth-order methods. Bridging this gap while retaining similar theoretical properties remains an interesting avenue of further research.

In all, we have shown that some of the inherent flaws to gradient-based methods in zero-sum games can be overcome by designing our algorithms to take advantage of the game-theoretic setting. Indeed, by using the structure of local Nash equilibria we designed an algorithm that has significantly stronger theoretical support than existing approaches.

References

A Time-varying adjustment

In this section we analyze a slightly different version of (4) that allows us to remove the assumption that ω(z)\omega(z) is never an eigenvector of JT(z)(JT(z)J(z)+λ(z)I)1JT(z)J^{T}(z)(J^{T}(z)J(z)+\lambda(z)I)^{-1}J^{T}(z) with associated eigenvalue 1-1. Though this assumption is relatively mild, since intuitively it will be very rare that ω\omega is exactly the eigenvector of the adjustment matrix, we show that by adding a third term to (4) we can remove it entirely while retaining our theoretical guarantees. The new dynamics are constructed by adding a time-varying term to the dynamics that goes to zero only when ω(z)\omega(z) is zero. This guarantees that the only critical points of the limiting dynamics are the critical points of ω\omega. The analysis of these dynamics is slightly more involved and requires generalizations of the definition of a LASE to handle time-varying dynamics. We first define an equilibrium of a potentially time-varying dynamical system θ˙=g(θ,t)\dot{\theta}=g(\theta,t) as a point θ\theta^{*} such that g(θ,t)0g(\theta^{*},t)\equiv 0 for all t0t\geq 0. We can now generalize the definition of a LASE to the time-varying setting.

Locally uniformly asymptotically stable equilibria, under this definition, also have the property that they are locally exponentially attracting under the flow, θ˙=f(θ,t)\dot{\theta}=-f(\theta,t). Further, since the linearization around a locally uniformly asymptotically stable equilibrium is time-invariant, we can still invoke converse Lyapunov theorems like those presented in Sastry (1999) when deriving the non-asymptotic bounds.

Having defined equilibria and a generalization of LASE for time-varying systems, we now introduce a time-varying version of the continuous-time ODE presented in Section 3 which allows us to remove the assumption that ω(z)\omega(z) is never an eigenvector of JT(z)(JT(z)J(z)+λ(z)I)1JT(z)J^{T}(z)(J^{T}(z)J(z)+\lambda(z)I)^{-1}J^{T}(z) with associated eigenvalue 1-1. The limiting ODE is given by:

where h(z)h(z) is as described in Section 3, g(z,t)g(z,t) can be decomposed as:

ω(z)=0    Dzλ1(z)=0\omega(z)=0\implies D_{z}\lambda_{1}(z)=0,

  t00\nexists\ \ t_{0}\geq 0 such that Dtu(t)0    tt0D_{t}u(t)\equiv 0\ \ \forall\ \ t\geq t_{0}.

u(t)ξ3    t0||u(t)||\leq\xi_{3}\ \ \forall\ \ t\geq 0.

Thus we require that the time-varying adjustment term gTVg_{TV} must be bounded and is equal to zero only when ω(z)=0\omega(z)=0. Most importantly, we require that for any zz that is not a critical point of ω\omega, gTVg_{TV} must be changing in time. An example of a gTVg_{TV} that satisfies these requirements is:

These conditions, as the next theorem shows, allow us to guarantee that the only locally asymptotically stable equilibria are the differential Nash equilibria of the game.

Under Assumption 1 the continuous-time dynamical system z˙=hTV(z,t)\dot{z}=-h_{TV}(z,t) satisfies:

If zz is an equilibrium point of z˙=hTV(z,t)\dot{z}=-h_{TV}(z,t), then the Jacobian of hTVh_{TV} at zz is time-invariant and has real eigenvalues.

By construction ω(z)=0    hTV(z,t)0  t0\omega(z)=0\implies h_{TV}(z,t)\equiv 0\ \ \forall t\geq 0. To show the converse, we assume that there exists a zz such that hTV(z,t)0    t0h_{TV}(z,t)\equiv 0\ \ \forall\ \ t\geq 0 but ω(z)0\omega(z)\neq 0. This implies that:

Since zz is a constant and λ1(z)>0\lambda_{1}(z)>0, taking the derivative of both sides with respect to tt gives us the following condition on gg under our assumption:

By assumption this cannot be true. Thus, we have a contradiction and hTV(z,t)0  t0    ω(z)=0h_{TV}(z,t)\equiv 0\ \ \forall t\geq 0\implies\omega(z)=0.

Having shown that the critical points of z˙=hTV(z,t)\dot{z}=-h_{TV}(z,t) are the same as that of z˙=ω(z)\dot{z}=-\omega(z), we now note that the Jacobian of hTV(z,t)h_{TV}(z,t), at critical points, must be S(z)S(z). Under the same development as the proof of Theorem 4 the Jacobian of hTVh_{TV} is given by:

Again, by construction DzgTV(z,t)=0D_{z}g_{TV}(z,t)=0 when ω(z)=0\omega(z)=0. The third term therefore disappears and we have that JTV(z)=S(z)J_{TV}(z)=S(z). The proof now follows from that of Theorem 4.

We have shown that adding a time-varying term to the original adjusted dynamics allows us to remove the assumption that the adjustment term is never exactly ω(z)-\omega(z). As in Section 3 we can now construct a two-timescale process that asymptotically tracks (8). We assume that u(t)u(t) is a deterministic function of a trajectory of an ODE:

The form of u(t)u(t) introduced in (9), u(t)=cos(t)u0u(t)=\cos(t)u_{0} can be written as u(t)=(θ(t))u0u(t)=(*\theta(t))u_{0}, where θ\theta satisfies the linear dynamical system:

Given this setup, the continuous-time dynamics can be written as:

Having made this further assumption on the time-varying term, we now introduce the two-timescale process that asymptotically tracks (10). This process is given by:

Proceeding as in Section 3, we write h^5(z,v,θ)=h5(z,v,θ)+Mz\hat{h}_{5}(z,v,\theta)=h_{5}(z,v,\theta)+M^{z} and h^6(z,v)=h6(z,v)+Mv\hat{h}_{6}(z,v)=h_{6}(z,v)+M^{v} where MzM^{z} and MvM^{v} are martingale difference sequences satisfying Assumption 3. We note that the θn\theta_{n} process is deterministic.

This two-timescale process gives rise to the time-varying version of local symplectic surgery (TVLSS) outlined in Algorithm 2.

We can now proceed as in Section 4 to show that (11) has the same limiting behavior as the limiting ODE in (10).

Under Assumptions 1-3, and on the event {supnzn+vn+θn<}\{\sup_{n}||z_{n}||+||v_{n}||+||\theta_{n}||<\infty\}:

The proof of this lemma is exactly the same as that of Lemma 5, and we do not repeat the argument. We can now invoke Proposition 4.1 and 4.2 from Benaïm (1999) to show that the two-timescale process tracks a trajectory of the limiting ODE. For tst\geq s, let θ(t,s,θs)\theta(t,s,\theta_{s}) and z(t,s,zs)z(t,s,z_{s}) be the trajectories of the dynamical systems introduced in (10), starting at states θs\theta_{s} and zsz_{s} respectively at time ss.

Let Assumptions 1-3 hold. Further, assume h3h_{3} is Lipschitz continuous and define tn=i=0n1ait_{n}=\sum_{i=0}^{n-1}a_{i}. Then on the event {supnzn+vn+θn<}\{sup_{n}||z_{n}||+||v_{n}||+|\theta_{n}||<\infty\}, for any integer K>0K>0:

The proof of Theorem 10 is exactly like that of Theorem 6.

Together, Theorem 8 and Theorem 10 show that our approach can be modified to allow us to relax some of our assumptions, while still maintaining the theoretical properties of our algorithms.

B Counter-examples for other algorithms

Staying with our earlier notation, the player with variable xx seeks to minimize ff, and the player with variable yy minimizes f-f.

This game has only one critical point, (x,y)=(0,0)(x,y)=(0,0), and the combined gradient dynamics for this game are linear and are given by:

Since the diagonal is not strictly positive, (x,y)=(0,0)(x,y)=(0,0) is not an LNE. However, since the eigenvalues of J(x,y)J(x,y) are 0.45+0.835i0.45+0.835i and 0.450.835i0.45-0.835i, (x,y)=(0,0)(x,y)=(0,0) is a LASE of the gradient dynamics.

We first show that even running SGA on two timescales as suggested by Heusel et al. (2017), cannot guarantee convergence to only local Nash equilibria. We assume that the maximizing player is on the slower timescale, and, in the following proposition, show that SGA on two timescales can still converge to non-Nash fixed points.

Simultaneous gradient ascent on two timescales can converge to non-Nash equilibria.

Proof Consider the game introduced in Example 2, and the following dynamics:

where ωi\omega_{i} denotes the ithith component of ω\omega as described in Example 2 and an,bna_{n},b_{n} are sequences of step sizes satisfying Assumption 2.

Since these dynamics have a unique (exponentially) stable equilibrium at y=0y=0, (xn,yn)(0,0)(x_{n},y_{n})\rightarrow(0,0). As we showed, the origin is non-Nash, so we are done.

The previous proposition shows that a two-timescale version of simultaneous gradient ascent will still be susceptible to non-Nash equilibria. This implies that such an approach cannot guarantee convergence to only local Nash equilibria. We now show that the consensus optimization approach introduced in Mescheder et al. (2017), can also converge to non-Nash points.

Consensus optimization can converge to non-Nash equilibria.

Proof The update in consensus optimization is given by:

where λ>0\lambda>0 is a hyperparameter to be chosen. We note that the signs are different than in Mescheder et al. (2017) because we consider simultaneous gradient descent as the base algorithm while they consider simultaneous gradient ascent. The limiting dynamics of this approach is given by:

At a critical point zz where ω(z)=0\omega(z)=0, the Jacobian of this dynamics is given by:

Now, in the game described in Example 2, the Jacobian of the consensus optimization dynamics would be:

For λ>0\lambda>0, JCO(x,y)J_{CO}(x,y) has eigenvalues with strictly positive real parts, which implies that, with any choice of the hyper-parameter λ\lambda, consensus optimization will converge to the non-Nash equilibrium at (0,0) in Example 2.

The preceding proposition shows that the adjustment to the gradient dynamics proposed in Mescheder et al. (2017) does not solve the issue of non-Nash fixed points. This is not surprising since the primary goal for the algorithm is to reduce oscillation around equilibria and speed up convergence to the stable equilibria of the gradient dynamics. We note that as shown in Theorem 4, the proposed algorithm also achieves this goal.

The final algorithm we consider is the symplectic gradient adjustment proposed in Balduzzi et al. (2018). In the text, the authors do remark that the adjustment is not enough to guarantee convergence to only Nash equilibrium. For completeness, we make this assertion concrete by showing that the adjustment term is not enough to avoid the non-Nash equilibrium in Example 2.

The symplectic gradient adjustment to the gradient dynamics can converge to non-Nash equilibria.

Proof The dynamics resulting from the symplectic gradient adjustment is given by:

where λ>0\lambda>0 is a hyperparameter to be chosen. The limiting dynamics of this algorithm is given by:

At a critical point zz where ω(z)=0\omega(z)=0, the Jacobian of the above dynamics is given by:

Now, in the game described in Example 2, the Jacobian of the SGA dynamics is given by:

This has eigenvalues with strictly positive real parts for all values of λ>0\lambda>0, which implies that the symplectic gradient adjustment will converge to the non-Nash equilibrium at (0,0) in Example 2.

C Efficient computation of the two-timescale update

The next proposition shows that, using the structure of JJ that was discussed in Section 2, the term J(z)vJ(z)v can also be efficiently computed. Together these results show that the per-iteration complexity of LSS is on the same order as that of consensus optimization and the symplectic gradient adjustment.

We note that the above expression is only possible due to the structure of the Jacobian in two-player zero-sum games. Having written J(z)J(z) as above, it is now clear that computing J(z)vJ(z)v will require two Jacobian-vector products.

D Further numerical examples

We now present further numerical experiments as well as the specific experimental setup we employed when training GANs. In Figure 4 we show further numerical experiments that show the training of a generative adversarial network from various initial conditions. In Figure 4A, both simGD and LSS quickly converge to the correct solution. In Figure 4B, however, simGD only correctly identifies five of the Gaussians in the mixture while LSS converges to the correct solution.

For the training of the GAN, we randomly initialize the weights and biases using the standard TensorFlow initializations. For both the implementation of simGD and LSS used to generate Figure 3 and Figure 4, we used the RMSProp optimizer with step size 2e42e-4 for the xx and yy processes. For the vv process in LSS, we used the RMSProp optimizer with step size 1e51e-5. We note that these do not satisfy our assumptions on the step size sequences for the two-timescale process, but are meant to show that the approach still works in practical settings. Lastly, we chose a damping function g(z)=e0.001ω(z)2g(z)=e^{-0.001||\omega(z)||^{2}} and a batch size of 128128.

E Useful propositions

In this section we present a useful theorem from previous work that we invoke in our proofs. The first proposition, a combination of Proposition 4.1 and 4.2 from Benaïm (1999), gives us conditions under which a stochastic approximation process is an asymptotic pseudo-trajectory of the underlying ODE.

Let hh be a continuous globally integrable vector field. Further, let x(t,s,xs)x(t,s,x_{s}) for tst\geq s be a trajectory of the dynamical system x˙=h(x)\dot{x}=-h(x) starting from state xsx_{s} at time ss. Finally let the stochastic approximation process be given by:

limnχn=0\lim_{n\rightarrow\infty}\chi_{n}=0 almost surely.