Last-Iterate Convergence: Zero-Sum Games and Constrained Min-Max Optimization

Constantinos Daskalakis, Ioannis Panageas

Introduction

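A central problem in Game Theory and Optimization is computing a pair of probability vectors $(\mathbf{x},\mathbf{y})$ solving

$$\min_{\mathbf{y}\in\Delta_{m}}\max_{\mathbf{x}\in\Delta_{n}}\mathbf{x}^{\top}A\mathbf{y},\qquad(1)$$

where $\Delta_{n}\subset\mathbb{R}^{n}$ and $\Delta_{m}\subset\mathbb{R}^{m}$ are probability simplices and $A$ is an $n\times m$ matrix. Von Neumann's minimax theorem states that

$$\min_{\mathbf{y}\in\Delta_{m}}\max_{\mathbf{x}\in\Delta_{n}}\mathbf{x}^{\top}A\mathbf{y}=\max_{\mathbf{x}\in\Delta_{n}}\min_{\mathbf{y}\in\Delta_{m}}\mathbf{x}^{\top}A\mathbf{y},\qquad(2)$$

and that all solutions to the LHS are solutions to the RHS, and vice versa. This result was a cornerstone in the development of Game Theory. Indeed, interpreting $\mathbf{x}^{\top}A\mathbf{y}$ as the payment of the “min player” to the “max player” when the former selects a distribution $\mathbf{y}$ over columns and the latter selects a distribution $\mathbf{x}$ over rows of matrix $A$, a solution to (1) constitutes an equilibrium of the game defined by matrix $A$, called a “minimax equilibrium”: a pair of randomized strategies such that neither player can improve their payoff by unilaterally changing their distribution.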

Besides their fundamental value for Game Theory, it is known that (1) and (2) are also intimately related to Linear Programming. It was shown by von Neumann that (2) follows from strong linear programming duality. Moreover, it was suggested by Dantzig and recently proven by Adler that any linear program can be solved by solving some min-max problem of the form (1). In particular, min-max problems of form (1) are exactly as expressive as min-max problems of the following form, which capture any linear program (by Lagrangifying the constraints):

Soon after the minimax theorem was proven and its connection to linear programming was forged, researchers proposed dynamics for solving min-max optimization problems by having the min and max players of (1) run a simple learning procedure in tandem. An early method, proposed by Brown and analyzed by Robinson, was fictitious play. Soon after, Blackwell’s approachability theorem propelled the field of online learning, which led to the discovery of several learning algorithms that converge to minimax equilibrium at faster rates while also being robust to adversarial environments, i.e., situations where one of the players of the game deviates from the prescribed dynamics. These learning methods, called “no-regret”, include the celebrated multiplicative-weights-update method, follow-the-regularized-leader, and follow-the-perturbed-leader. Compared to centralized linear programming procedures, the advantages of these methods are the simplicity of their steps and their robustness to adversarial environments.

Despite the extensive literature on no-regret learning, an unsatisfactory feature of known results is that min-max equilibrium is shown to be attained only in an average sense. To be precise, if $(\mathbf{x}^{t},\mathbf{y}^{t})$ is the trajectory of a no-regret learning method, it is usually shown that the average ${1\over t}\sum_{\tau\leq t}{\mathbf{x}^{\tau}}^{\top}A\mathbf{y}^{\tau}$ converges to the optimal value of (1) as $t\rightarrow\infty$. Moreover, if the solution to (1) is unique, then ${1\over t}\sum_{\tau\leq t}(\mathbf{x}^{\tau},\mathbf{y}^{\tau})$ converges to the optimal solution. Unfortunately, that does not mean that the last iterate $(\mathbf{x}^{t},\mathbf{y}^{t})$ converges to an optimal solution, and indeed it commonly diverges or enters a limit cycle. In the optimization literature, Nesterov provides a method that gives pointwise convergence (i.e., convergence of the last iterate) for problem (1); however, his algorithm is not a no-regret learning algorithm. (Nesterov showed that optimizing $f_{\mu}(\mathbf{x}):=\mu\ln(\frac{1}{m}\sum_{j=1}^{m}e^{-\frac{1}{\mu}(A\mathbf{x})_{j}}),$ $g_{\nu}(\mathbf{y}):=\nu\ln(\frac{1}{n}\sum_{j=1}^{n}e^{\frac{1}{\nu}(A^{\top}\mathbf{y})_{j}})$ for $\mu=\Theta(\frac{\epsilon}{\log m}),$ $\nu=\Theta(\frac{\epsilon}{\log n})$ yields an $O(\epsilon)$ approximation to problem (1).) Recent work by Daskalakis et al and Liang and Stokes studies whether last iterate convergence can be established for no-regret learning methods in the simple unconstrained min-max problem of the form:

Motivated by the afore-described lines of work, and the importance of last iterate convergence for Game Theory and the modern applications of GDA-style methods in Optimization, our goal in this work is to generalize the results of these works to the general min-max problem (3), or equivalently (1); indeed, we will focus on the latter, but our algorithms are readily applicable to the former as the two problems are equivalent. With the constraint that $(\mathbf{x},\mathbf{y})$ should remain in $\Delta_{n}\times\Delta_{m}$, GDA and OGDA are not applicable. Indeed, the natural GDA-style method for min-max problems in this case is the celebrated Multiplicative-Weights-Update (MWU) method, which is tantamount to FTRL with entropy-regularization. Unsurprisingly, in the same way that GDA suffers in the unconstrained problem (4), MWU exhibits cycling in the constrained problem (1) (this has been shown in recent work and also observed empirically). So it is natural for us to study instead its optimistic variant, “Optimistic Multiplicative-Weights-Update (OMWU)” (also called Optimistic Hedge), which corresponds to Optimistic FTRL with entropy-regularization, the equations of which are given in Section 2.2. Our main result is the following (restated as Theorem 2.7 after Section 2.2); it answers an open question from prior work as it applies to two-player zero-sum games:

Whenever (1) has a unique optimal solution $(\mathbf{x}^{*},\mathbf{y}^{*})$ , OMWU with appropriate choice of learning rate and initialized at the pair of uniform distributions $({1\over n}{\bf 1},{1\over m}{\bf 1})$ exhibits last-iterate convergence to the optimal solution. That is, if $(\mathbf{x}^{t},\mathbf{y}^{t})$ are the vectors maintained by OMWU at step $t$ , then $\lim_{t\to\infty}(\mathbf{x}^{t},\mathbf{y}^{t})=(\mathbf{x}^{*},\mathbf{y}^{*})$ .

We note that the assumption about uniqueness of the optimal solution for problem (1) is generic in the following sense: within the set of all zero-sum games, the set of zero-sum games with non-unique equilibrium has Lebesgue measure zero. This implies that if the entries of $A$ are sampled independently from some continuous distribution, then with probability one the min-max problem (1) has a unique solution.

Our paper provides two important messages:

It strengthens the intuition that optimism helps the trajectories of learning dynamics stabilize (e.g., Optimistic MWU vs MWU or Optimistic GDA vs GDA; as the papers of Syrgkanis et al and Daskalakis et al also do).

The techniques we use to prove convergence of the last iterate, which typically appear in the dynamical systems literature, are fundamentally different from those commonly used to prove convergence of the time average of a learning algorithm.

Preliminaries

A recurrence relation of the form $\mathbf{x}^{t+1}=w(\mathbf{x}^{t})$ is a discrete time dynamical system, with update rule $w:\mathcal{S}\to\mathcal{S}$ where $\mathcal{S}=\Delta_{n}\times\Delta_{m}\times\Delta_{n}\times\Delta_{m}$ for our purposes. The point $\mathbf{z}$ is called a fixed point or equilibrium of $w$ if $w(\mathbf{z})=\mathbf{z}$ . We will be interested in the following well known fact that will be used in our proofs.

Proposition 2.1. If the Jacobian of the update rule $w$ (we assume $w$ is a continuously differentiable function) at a fixed point $\mathbf{z}$ has spectral radius less than one, then there exists a neighborhood $U$ around $\mathbf{z}$ such that for all $\mathbf{x}\in U$, the dynamics converges to $\mathbf{z}$, i.e., $\lim_{n\to\infty}w^{n}(\mathbf{x})=\mathbf{z}$. We call $w$ an asymptotically stable mapping in $U$.
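To make the stability criterion above concrete, here is a minimal sketch on a one-dimensional toy map. The map $w(x)=\cos(x)$ is an assumption chosen purely for illustration and does not appear in the paper; in one dimension the Jacobian is the scalar $w'(z)$ and its spectral radius is $|w'(z)|$.

```python
import math

def iterate(w, x0, steps):
    """Apply the update rule w repeatedly and return the last iterate."""
    x = x0
    for _ in range(steps):
        x = w(x)
    return x

# Toy update rule (illustrative assumption, not from the paper): w(x) = cos(x).
w = math.cos
z = iterate(w, 1.0, 200)           # numerical fixed point, w(z) is approximately z
jacobian_at_z = -math.sin(z)       # 1x1 Jacobian w'(z)
spectral_radius = abs(jacobian_at_z)

# A second starting point in the neighborhood reaches the same fixed point.
z_other_start = iterate(w, 0.5, 200)
print(z, spectral_radius, z_other_start)
```

Because the spectral radius is below one, nearby starting points contract toward the same fixed point, which is the behavior the criterion predicts.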

2.2 OMWU Method

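Our main contribution is that the last iterate of OMWU converges to the optimal solution. The OMWU dynamics is defined as follows ( $t\geq 1$ ), for all $i\in[n]$ and $j\in[m]$:

$$x_{i}^{t+1}=x_{i}^{t}\frac{e^{2\eta(A\mathbf{y}^{t})_{i}-\eta(A\mathbf{y}^{t-1})_{i}}}{\sum_{k=1}^{n}x_{k}^{t}e^{2\eta(A\mathbf{y}^{t})_{k}-\eta(A\mathbf{y}^{t-1})_{k}}},\qquad y_{j}^{t+1}=y_{j}^{t}\frac{e^{-2\eta(A^{\top}\mathbf{x}^{t})_{j}+\eta(A^{\top}\mathbf{x}^{t-1})_{j}}}{\sum_{k=1}^{m}y_{k}^{t}e^{-2\eta(A^{\top}\mathbf{x}^{t})_{k}+\eta(A^{\top}\mathbf{x}^{t-1})_{k}}}.\qquad(5)$$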

Points $(\mathbf{x}^{1},\mathbf{y}^{1}),(\mathbf{x}^{0},\mathbf{y}^{0})$ are the initial conditions and are given as input. We call $0<\eta<1$ the stepsize of the dynamics. It is more convenient to interpret OMWU dynamics as mapping a quadruple to quadruple ( $(\mathbf{x}^{t},\mathbf{y}^{t},\mathbf{x}^{t-1},\mathbf{y}^{t-1})\to(\mathbf{x}^{t+1},\mathbf{y}^{t+1},\mathbf{x}^{t},\mathbf{y}^{t})$ , see Section 3.2 for the construction of the dynamical system).

Let $(\mathbf{x}^{*},\mathbf{y}^{*})$ be the optimal solution. We see that $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})$ is a fixed point of the mapping. Furthermore, $\Delta_{n}\times\Delta_{m}\times\Delta_{n}\times\Delta_{m}$ is invariant under OMWU dynamics. For $t\geq 1$, if $x_{i}^{t}=0$ then $x_{i}$ remains zero for all times greater than $t$, and if it is positive, it remains positive (both numerator and denominator are positive); the same holds for vector $\mathbf{y}$. In words, at all times OMWU satisfies the non-negativity constraints, and the renormalization factor (denominator) makes the coordinates of both $\mathbf{x}$ and $\mathbf{y}$ sum to one. A last observation is that every fixed point of OMWU dynamics (mapping a quadruple to a quadruple) has the form $(\mathbf{x},\mathbf{y},\mathbf{x},\mathbf{y})$ (two identical copies). Equation (8) shows how to express OMWU dynamics as a dynamical system.
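The dynamics and its invariance properties described above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the $2\times 2$ payoff matrix, the stepsize `eta = 0.05`, the iteration count, and all function names are assumptions. The matrix was chosen so that the unique equilibrium is $\mathbf{x}^{*}=(3/7,4/7)$, $\mathbf{y}^{*}=(2/7,5/7)$, and plain MWU is included only for contrast.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def row_payoffs(A, y):
    """(A y)_i for every row i."""
    return [sum(a * yj for a, yj in zip(row, y)) for row in A]

def col_payoffs(A, x):
    """(A^T x)_j for every column j."""
    return [sum(A[i][j] * x[i] for i in range(len(A))) for j in range(len(A[0]))]

def omwu_step(A, x, y, x_prev, y_prev, eta):
    """One OMWU step; x is the max player, y is the min player."""
    Ay, Ay_prev = row_payoffs(A, y), row_payoffs(A, y_prev)
    Atx, Atx_prev = col_payoffs(A, x), col_payoffs(A, x_prev)
    x_next = normalize([xi * math.exp(2 * eta * g - eta * gp)
                        for xi, g, gp in zip(x, Ay, Ay_prev)])
    y_next = normalize([yj * math.exp(-2 * eta * g + eta * gp)
                        for yj, g, gp in zip(y, Atx, Atx_prev)])
    return x_next, y_next

def mwu_step(A, x, y, eta):
    """One plain MWU step (no optimism), for comparison only."""
    Ay, Atx = row_payoffs(A, y), col_payoffs(A, x)
    x_next = normalize([xi * math.exp(eta * g) for xi, g in zip(x, Ay)])
    y_next = normalize([yj * math.exp(-eta * g) for yj, g in zip(y, Atx)])
    return x_next, y_next

def kl(p_star, p):
    """KL divergence from p_star to p, skipping zero entries of p_star."""
    return sum(ps * math.log(ps / pi) for ps, pi in zip(p_star, p) if ps > 0)

# Hypothetical game with unique equilibrium x* = (3/7, 4/7), y* = (2/7, 5/7).
A = [[3.0, -1.0], [-2.0, 1.0]]
x_star, y_star = [3 / 7, 4 / 7], [2 / 7, 5 / 7]
eta, steps = 0.05, 5000

x = x_prev = [0.5, 0.5]
y = y_prev = [0.5, 0.5]
kl_start = kl(x_star, x) + kl(y_star, y)
for _ in range(steps):
    x_next, y_next = omwu_step(A, x, y, x_prev, y_prev, eta)
    x_prev, y_prev, x, y = x, y, x_next, y_next
kl_omwu = kl(x_star, x) + kl(y_star, y)

xm, ym = [0.5, 0.5], [0.5, 0.5]
for _ in range(steps):
    xm, ym = mwu_step(A, xm, ym, eta)
kl_mwu = kl(x_star, xm) + kl(y_star, ym)

print("OMWU last iterate:", x, y, "KL:", kl_omwu)
print("MWU  last iterate:", xm, ym, "KL:", kl_mwu)
```

On this toy instance the OMWU iterates stay on the simplex and approach the equilibrium, while the KL divergence of the MWU iterates from the equilibrium does not shrink below its starting value.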

2.3 Linear Variant of OMWU

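We provide the linear variant of OMWU dynamics (5) because we use it in some intermediate lemmas (which appear in the appendix):

$$x_{i}^{t+1}=x_{i}^{t}\frac{1+2\eta(A\mathbf{y}^{t})_{i}-\eta(A\mathbf{y}^{t-1})_{i}}{\sum_{k=1}^{n}x_{k}^{t}\left(1+2\eta(A\mathbf{y}^{t})_{k}-\eta(A\mathbf{y}^{t-1})_{k}\right)},\qquad y_{j}^{t+1}=y_{j}^{t}\frac{1-2\eta(A^{\top}\mathbf{x}^{t})_{j}+\eta(A^{\top}\mathbf{x}^{t-1})_{j}}{\sum_{k=1}^{m}y_{k}^{t}\left(1-2\eta(A^{\top}\mathbf{x}^{t})_{k}+\eta(A^{\top}\mathbf{x}^{t-1})_{k}\right)}.\qquad(6)$$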

This dynamics is derived by considering the first order approximation of the exponential function. Stepsize $\eta$ in this case should be chosen sufficiently small so that both numerator and denominator are positive.
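As a quick numerical illustration of the first-order approximation, the sketch below compares one step of the exponential update with one step of its linearization for the max player. The matrix, the vectors `x`, `w`, `z`, and the function names are illustrative assumptions; the expectation that the gap scales like $\eta^{2}$ is the statement of Lemma B.1 in the appendix.

```python
import math

def payoffs(A, y):
    """(A y)_i for every row i."""
    return [sum(a * yj for a, yj in zip(row, y)) for row in A]

def optimistic_signal(A, w, z):
    """u_i = 2 (A w)_i - (A z)_i, with w the current and z the previous y."""
    return [2 * aw - az for aw, az in zip(payoffs(A, w), payoffs(A, z))]

def omwu_x(A, x, w, z, eta):
    """Exponential (OMWU) update of the max player's vector x."""
    u = optimistic_signal(A, w, z)
    weights = [xi * math.exp(eta * ui) for xi, ui in zip(x, u)]
    total = sum(weights)
    return [wt / total for wt in weights]

def linear_omwu_x(A, x, w, z, eta):
    """Linear variant: exp(eta * u) replaced by 1 + eta * u (needs 1 + eta * u_i > 0)."""
    u = optimistic_signal(A, w, z)
    weights = [xi * (1 + eta * ui) for xi, ui in zip(x, u)]
    total = sum(weights)
    return [wt / total for wt in weights]

def l1_gap(A, x, w, z, eta):
    """L1 distance between the two next iterates."""
    a, b = omwu_x(A, x, w, z, eta), linear_omwu_x(A, x, w, z, eta)
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

# Hypothetical inputs chosen for illustration.
A = [[3.0, -1.0], [-2.0, 1.0]]
x, w, z = [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]

gap_big = l1_gap(A, x, w, z, 0.01)
gap_small = l1_gap(A, x, w, z, 0.005)
ratio = gap_big / gap_small  # close to 4 if the gap is of order eta^2
print(gap_big, gap_small, ratio)
```

Halving the stepsize should shrink the gap by roughly a factor of four, which is what a second-order discrepancy looks like numerically.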

2.4 More definitions and statement of our result

Definition 2.3. Assume $\alpha>0$. We call a point $(\mathbf{x},\mathbf{y})\in\Delta_{n}\times\Delta_{m}$ $\alpha$-close if for each $i$ we have that $x_{i}\leq\alpha$ or $|\mathbf{x}^{\top}A\mathbf{y}-(A\mathbf{y})_{i}|\leq\alpha$, and for each $j$ it holds that $y_{j}\leq\alpha$ or $|\mathbf{x}^{\top}A\mathbf{y}-(A^{\top}\mathbf{x})_{j}|\leq\alpha$.

Think of $\alpha$-close points as $\alpha$-approximate optimal solutions for min-max problems that are induced by submatrices of $A$ ( $\alpha$-approximate stationary points). Note that a $0$-close point $(\mathbf{x},\mathbf{y})$ is not necessarily an optimal solution of problem (1)!

Think of $\epsilon$-approximate points (in the additive $\epsilon$-approximate Nash equilibrium sense) as approximate optimal solutions to the min-max problem (1). In particular, if $(\mathbf{x},\mathbf{y})$ is $0$-approximate, then $(\mathbf{x},\mathbf{y})$ is an optimal solution of problem (1).
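The difference between the two notions above can be checked on a small example. The sketch below implements the $\alpha$-close test as stated and, as an assumption, measures $\epsilon$-approximation by the standard additive Nash gap (the largest gain either player can obtain by deviating). The payoff matrix and function names are hypothetical.

```python
def row_payoffs(A, y):
    """(A y)_i for every row i."""
    return [sum(a * yj for a, yj in zip(row, y)) for row in A]

def col_payoffs(A, x):
    """(A^T x)_j for every column j."""
    return [sum(A[i][j] * x[i] for i in range(len(A))) for j in range(len(A[0]))]

def game_value(A, x, y):
    """x^T A y."""
    return sum(xi * g for xi, g in zip(x, row_payoffs(A, y)))

def is_alpha_close(A, x, y, alpha):
    """Each coordinate is either small (<= alpha) or nearly indifferent."""
    v = game_value(A, x, y)
    rows_ok = all(xi <= alpha or abs(v - g) <= alpha
                  for xi, g in zip(x, row_payoffs(A, y)))
    cols_ok = all(yj <= alpha or abs(v - g) <= alpha
                  for yj, g in zip(y, col_payoffs(A, x)))
    return rows_ok and cols_ok

def nash_gap(A, x, y):
    """Smallest eps such that (x, y) is an additive eps-approximate equilibrium
    (assumed notion: x maximizes, y minimizes)."""
    v = game_value(A, x, y)
    return max(max(row_payoffs(A, y)) - v, v - min(col_payoffs(A, x)))

# Hypothetical game with unique equilibrium x* = (3/7, 4/7), y* = (2/7, 5/7).
A = [[3.0, -1.0], [-2.0, 1.0]]
pure = ([1.0, 0.0], [1.0, 0.0])
equilibrium = ([3 / 7, 4 / 7], [2 / 7, 5 / 7])

pure_is_close = is_alpha_close(A, pure[0], pure[1], 0.0)
pure_gap = nash_gap(A, pure[0], pure[1])
eq_gap = nash_gap(A, equilibrium[0], equilibrium[1])
print(pure_is_close, pure_gap, eq_gap)
```

The pure profile passes the closeness test with $\alpha=0$ yet has a large Nash gap, which matches the warning that close points need not be optimal.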

We finish the preliminary section by stating formally the main result.

Theorem 2.7. Let $A$ be an $n\times m$ matrix and assume that

$$\min_{\mathbf{y}\in\Delta_{m}}\max_{\mathbf{x}\in\Delta_{n}}\mathbf{x}^{\top}A\mathbf{y}$$

has a unique solution $(\mathbf{x}^{*},\mathbf{y}^{*})$. Then for $\eta$ sufficiently small (depending on $n,m,A$ ), starting from the uniform distribution, i.e., $(\mathbf{x}^{1},\mathbf{y}^{1})=(\mathbf{x}^{0},\mathbf{y}^{0})=({1\over n}{\bf 1},{1\over m}{\bf 1})$, it holds that

$$\lim_{t\to\infty}(\mathbf{x}^{t},\mathbf{y}^{t})=(\mathbf{x}^{*},\mathbf{y}^{*})$$

under OMWU dynamics. The stepsize $\eta$ is constant, i.e., it does not scale with time. (Our proof also works if the starting points $(\mathbf{x}^{1},\mathbf{y}^{1}),(\mathbf{x}^{0},\mathbf{y}^{0})$ are both in the interior of $\Delta_{n}\times\Delta_{m}$ and not necessarily uniform; however, the choice of $\eta$ then depends on the initial distributions as well, and not only on $n,m,A$.)

We note that our theorem does not make clear how small $\eta$ must be or how it depends on the size of $A$. Moreover, OMWU has two phases (the phase where KL divergence decreases and the local asymptotic stability phase, see theorems below); the stepsize is constant within each phase but might change when we move from phase one to phase two. Nevertheless, our convergence result holds for constant stepsizes, as opposed to the classic no-regret learning literature where $\eta$ scales like $\frac{1}{\sqrt{T}}$ after $T$ iterations. Another result of this flavor that we know of concerns the MWU algorithm on congestion games.

Last iterate convergence of OMWU

In this section we show our main result (Theorem 2.7) by breaking the proof into three key theorems. The first theorem says that the KL divergence from the $t$-th iterate $(\mathbf{x}^{t},\mathbf{y}^{t})$ to the optimal solution $(\mathbf{x}^{*},\mathbf{y}^{*})$, i.e., (the sum of KL divergences, to be exact)

$$D_{KL}((\mathbf{x}^{*},\mathbf{y}^{*})||(\mathbf{x}^{t},\mathbf{y}^{t}))=\sum_{i}x_{i}^{*}\log\frac{x_{i}^{*}}{x_{i}^{t}}+\sum_{j}y_{j}^{*}\log\frac{y_{j}^{*}}{y_{j}^{t}},$$

decreases with time $t\geq 2$ by at least $\Omega(\eta^{3})$ per iteration, unless the iterate $(\mathbf{x}^{t},\mathbf{y}^{t})$ is $O(\eta^{1/3})$-close (see Definition 2.3). Moreover, provided that the stepsize $\eta$ is small enough, we show the structural result that the first such close iterate lies in a neighborhood of $(\mathbf{x}^{*},\mathbf{y}^{*})$ that becomes smaller and smaller as $\eta\to 0$. Finally, once OMWU dynamics has reached a small neighborhood around $(\mathbf{x}^{*},\mathbf{y}^{*})$, we show that the update rule of the dynamical system induced by OMWU is locally (asymptotically) stable (for a possibly different choice of learning rate), and the last iterate convergence result follows. Formally we show:

Theorem 3.1. Let $(\mathbf{x}^{*},\mathbf{y}^{*})$ be the unique optimal solution of problem (1) and $\eta$ sufficiently small. Then $D_{KL}((\mathbf{x}^{*},\mathbf{y}^{*})||(\mathbf{x}^{t},\mathbf{y}^{t}))$ decreases with time $t$ by at least $\Omega(\eta^{3})$ per iteration, unless $(\mathbf{x}^{t},\mathbf{y}^{t})$ is $O(\eta^{1/3})$-close.

Theorem 3.2. Assume that $(\mathbf{x}^{*},\mathbf{y}^{*})$ is the unique optimal solution of problem (1). Let $T$ (depending on $\eta$ ) be the first time the KL divergence does not decrease by $\Omega(\eta^{3})$. Then, as $\eta\to 0$, the distance of the $\eta^{1/3}$-close point $(\mathbf{x}^{T},\mathbf{y}^{T})$ from $(\mathbf{x}^{*},\mathbf{y}^{*})$ goes to zero, i.e., $\lim_{\eta\to 0}\left\|(\mathbf{x}^{*},\mathbf{y}^{*})-(\mathbf{x}^{T},\mathbf{y}^{T})\right\|_{1}=0$.

Theorem 3.3. Let $(\mathbf{x}^{*},\mathbf{y}^{*})$ be the unique optimal solution to the min-max problem (1). There exists a neighborhood $U:=U(\eta)\subset\Delta_{n}\times\Delta_{m}\times\Delta_{n}\times\Delta_{m}$ of $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})$ so that for all $(\mathbf{x}^{1},\mathbf{y}^{1},\mathbf{x}^{0},\mathbf{y}^{0})\in U$ we have that $\lim_{t\to\infty}(\mathbf{x}^{t},\mathbf{y}^{t},\mathbf{x}^{t-1},\mathbf{y}^{t-1})=(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})$ under OMWU dynamics as defined in (5) and (8) (Section 3.2). (Since $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})$ might be on the boundary of $\Delta_{n}\times\Delta_{m}\times\Delta_{n}\times\Delta_{m}$, $U$ is the intersection of an open ball around $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})$ with $\Delta_{n}\times\Delta_{m}\times\Delta_{n}\times\Delta_{m}$.)

Assuming these three theorems, our main result is straightforward.

In the next subsections we will provide the proofs to all three key theorems.

In this subsection we argue about the proofs of Theorems 3.1 and 3.2. The inequality we managed to prove (see in the appendix the proof of Theorem 3.1) is the following:

The proof of the inequality is quite long, so we choose to provide intuition here and defer the details to the appendix. The inequality says that OMWU dynamics makes good progress (KL divergence decreases by at least $\Omega(\eta^{3})$ ) as long as the current and previous iterates $(\mathbf{x}^{t},\mathbf{y}^{t}),(\mathbf{x}^{t-1},\mathbf{y}^{t-1})$ are not $\alpha$-close for $\alpha$ chosen to be $O(\eta^{1/3})$. This situation appears often in gradient methods: when the dynamics is close to a stationary point, the gradient of $f$ is small and the progress is small, as opposed to the case where the gradient of $f$ is large and there is satisfactory progress. The RHS of inequality (7) captures the “distance” from stationarity. Thus, as long as we are not close to a stationary point (i.e., $O(\eta^{1/3})$-close) during the time window $1,2,\ldots,k$, the KL divergence from the current ( $k$-th) iterate to the optimum has decreased by at least $\Omega(k\eta^{3})$ compared to the KL divergence from the first iterate to the optimum.

Moreover, suppose that at some point of OMWU dynamics the KL divergence from the current iterate to the optimum does not decrease by at least $\Omega(\eta^{3})$, and let $T$ be the iteration at which this happens. As we have already argued, $(\mathbf{x}^{T},\mathbf{y}^{T})$ is an $O(\eta^{1/3})$-close point. We can show that, as long as $\eta$ is sufficiently small, for all $i,j$ in the support of $(\mathbf{x}^{*},\mathbf{y}^{*})$, $x_{i}^{T},y_{j}^{T}$ are at least $\Omega(\eta^{1/3})$, i.e., coordinates in the support of the optimum have non-negligible probability in $(\mathbf{x}^{T},\mathbf{y}^{T})$. Formally:

Lemma 3.4. Let $i\in\textrm{Supp}(\mathbf{x}^{*})$ and $j\in\textrm{Supp}(\mathbf{y}^{*})$ . It holds that $x_{i}^{T}\geq\frac{1}{2}\eta^{1/3}$ and $y_{j}^{T}\geq\frac{1}{2}\eta^{1/3}$ as long as

By definition of $T$ , the KL divergence is decreasing for $2\leq t\leq T-1$ , thus

Therefore $x_{i}^{*}\log\frac{1}{x_{i}^{T-1}}<\sum_{i}x_{i}^{*}\log\frac{1}{x^{1}_{i}}+\sum_{j}y_{j}^{*}\log\frac{1}{y^{1}_{j}}=\log(mn)$ . It follows that $x_{i}^{T-1}>1/(mn)^{\frac{1}{x_{i}^{*}}}\geq\eta^{1/3}$ for $x_{i}^{*}>0$ ( $i\in\textrm{Supp}(\mathbf{x}^{*})$ ). Since $|x_{i}^{T}-x_{i}^{T-1}|$ is $O(\eta)$ (Lemma B.1), we get $x_{i}^{T}\geq\eta^{1/3}-O(\eta)\geq\frac{1}{2}\eta^{1/3}$ for $\eta$ sufficiently small, and the result follows. The same argument works for $y_{j}^{T}$ . ∎

Lemma 3.4 indicates that the stepsize $\eta$ might have to be exponentially small in the dimension (OMWU dynamics is slow when $\eta$ is very small). We can now prove Theorem 3.2.
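To get a feel for how small the stepsize may need to be, the sketch below evaluates the requirement $\eta^{1/3}\leq(mn)^{-1/x_{i}^{*}}$ that appears in the proof above, i.e., $\eta\leq(mn)^{-3/x_{i}^{*}}$. The choice of a full-support equilibrium with $x_{i}^{*}=1/n$ and $n=m$ is an assumption made only for illustration, and the function name is hypothetical; the result is reported in $\log_{10}$ to avoid underflow.

```python
import math

def log10_stepsize_bound(n, m, min_prob):
    """log10 of (m*n)^(-3/min_prob), where min_prob is the smallest positive
    equilibrium probability (illustrative reading of the proof of Lemma 3.4)."""
    return -(3.0 / min_prob) * math.log10(m * n)

# Hypothetical scenario: square games whose equilibrium is uniform, x_i^* = 1/n.
for n in (2, 5, 10, 20):
    print(n, log10_stepsize_bound(n, n, 1.0 / n))
```

Under these assumptions the exponent grows linearly in $n$ (up to a logarithmic factor), so the admissible stepsize shrinks exponentially with the dimension.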

From Lemma 3.4 and definition of $T$ , we get that $|(A\mathbf{y}^{T})_{i}-\mathbf{x}^{T\ \top}A\mathbf{y}^{T}|$ is $O(\eta^{1/3})$ for all $i$ in the support of $\mathbf{x}^{*}$ and $|(A^{\top}\mathbf{x}^{T})_{j}-\mathbf{x}^{T\ \top}A\mathbf{y}^{T}|$ is $O(\eta^{1/3})$ for all $j$ in the support of $\mathbf{y}^{*}$ . We consider $(\mathbf{w}^{T},\mathbf{z}^{T})$ to be the projection of the point $(\mathbf{x}^{T},\mathbf{y}^{T})$ by removing all the coordinates that have probability mass less than $\frac{1}{2}\eta^{1/3}$ and rescale so that the coordinates sum up to one.

We restrict ourselves to the corresponding subproblem (submatrix). It is clear that $(\mathbf{w}^{T},\mathbf{z}^{T})$ is an $O(\eta^{1/3})$-approximate solution for the subproblem (by $\epsilon$-approximate optimal solution we mean the additive $\epsilon$-approximate Nash equilibrium notion, see Definition 2.5). Let $v=\mathbf{x}^{*\top}A\mathbf{y}^{*}$ be the optimal value. By uniqueness of the optimal solution, we get that $(A\mathbf{y}^{*})_{i}=v$ for all $i\in\textrm{Supp}(\mathbf{x}^{*})$ and $(A\mathbf{y}^{*})_{i}<v$ otherwise (see Lemma C.3 of the cited work for a proof, which uses Farkas’ lemma; we use this fact again in Section 3.2). Similarly, for the min player $\mathbf{y}$, $(A^{\top}\mathbf{x}^{*})_{j}=v$ if $j$ lies in the support of $\mathbf{y}^{*}$ and $(A^{\top}\mathbf{x}^{*})_{j}>v$ otherwise. We choose $\eta$ so small that every $O(\eta^{1/3})$-approximate solution $(\mathbf{x},\mathbf{y})$ has the property that $(A\mathbf{y})_{i}\leq v-\eta^{1/4}$ , $(A^{\top}\mathbf{x})_{j}\geq v+\eta^{1/4}$ for all $i\notin\textrm{Supp}(\mathbf{x}^{*})$ and $j\notin\textrm{Supp}(\mathbf{y}^{*})$ respectively (this is possible by continuity of the bilinear function and Claim 3.5 below). Since $\eta^{1/4}\gg\eta^{1/3}$ for small $\eta$, we conclude that the coordinates of $(\mathbf{w}^{T},\mathbf{z}^{T})$ that are not in the support of the optimal solution have probability mass $O(\eta^{1/3})$ at time $T$.

Claim 3.5. Let $(\mathbf{x}^{*},\mathbf{y}^{*})$ be the unique optimal solution to problem (1). For every $\epsilon>0$, there exists a $\delta(\epsilon)>0$ so that every $\delta$-approximate solution $(\mathbf{x},\mathbf{y})$ satisfies $|x_{i}-x_{i}^{*}|<\epsilon$ for all $i\in[n]$. The analogous statement holds for player $\mathbf{y}$.

We will prove this by contradiction. Assume there is an $\epsilon$ that violates this statement. We choose a sequence $\delta_{k}$ so that $\lim_{k\to\infty}\delta_{k}=0$ and also there is a sequence $(\mathbf{x}_{k},\mathbf{y}_{k})$ of $\delta_{k}$ -approximate Nash equilibrium with $|x_{k,i}-x^{*}_{i}|\geq\epsilon$ for some strategy $i$ . Since $\Delta_{n}\times\Delta_{m}$ is compact and the sequence above is bounded, there is a convergent subsequence. The limit of the convergent subsequence is a Nash equilibrium by definition of $\delta$ -approximate (Definition 2.5). By uniqueness it follows that the $i$ -th coordinate of the convergent sequence must converge to $x_{i}^{*}$ , hence we reached a contradiction. ∎

Therefore, if we restrict to the subproblem induced by the strategies in the support of $(\mathbf{x}^{*},\mathbf{y}^{*})$ , the projected vector $(\mathbf{w}^{T},\mathbf{z}^{T})$ is a $O(\eta^{1/3})$ -approximate solution of the subgame.

3.2 Proving local convergence

The purpose of this section is to prove Theorem 3.3. First of all, we assume that the stepsize $\eta>0$ is some fixed constant (sufficiently small, not necessarily the same stepsize as in the first phase where KL divergence decreases). To show asymptotic stability of OMWU dynamics in a neighborhood of the optimal solution $(\mathbf{x}^{*},\mathbf{y}^{*})$, we first construct a dynamical system that captures OMWU. We then prove that the Jacobian of the update rule of that dynamical system, computed at the optimal solution, has spectral radius less than one. This suffices to prove asymptotic stability (see Proposition 2.1). As a result, once OMWU reaches a small neighborhood of $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})$, it converges pointwise (last iterate convergence) to it (the neighborhood is around the quadruple $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})$ because the dynamical system maps a quadruple to a quadruple). Below we provide the update rule $g$ of the dynamical system, which consists of 4 components, $g(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w})=(g_{1},g_{2},g_{3},g_{4})$ with

$$g_{1,i}=x_{i}\frac{e^{2\eta(A\mathbf{y})_{i}-\eta(A\mathbf{w})_{i}}}{\sum_{t=1}^{n}x_{t}e^{2\eta(A\mathbf{y})_{t}-\eta(A\mathbf{w})_{t}}},\qquad g_{2,j}=y_{j}\frac{e^{-2\eta(A^{\top}\mathbf{x})_{j}+\eta(A^{\top}\mathbf{z})_{j}}}{\sum_{t=1}^{m}y_{t}e^{-2\eta(A^{\top}\mathbf{x})_{t}+\eta(A^{\top}\mathbf{z})_{t}}},\qquad g_{3}=\mathbf{x},\qquad g_{4}=\mathbf{y},\qquad(8)$$

so that $(\mathbf{x}^{t+1},\mathbf{y}^{t+1},\mathbf{x}^{t},\mathbf{y}^{t})=g(\mathbf{x}^{t},\mathbf{y}^{t},\mathbf{x}^{t-1},\mathbf{y}^{t-1})$ and $g$ captures exactly the dynamics of OMWU (5). The equations of the Jacobian of $g$ can be found in the appendix (see Section A).

The rest of the section constitutes the proof of Theorem 3.3. Let $v=\mathbf{x}^{*\ \top}A\mathbf{y}^{*}$ , i.e., $v$ is the value of the bilinear function $\mathbf{x}^{\top}A\mathbf{y}$ at the optimal solution. We will analyze the Jacobian computed at $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})$ (see also Equations (14) for the Jacobian computed at this point).

Assume $i\notin\textrm{Supp}(\mathbf{x}^{*})$ . Then

$$\frac{\partial g_{1,i}}{\partial x_{i}}=\frac{e^{\eta(A\mathbf{y}^{*})_{i}}}{e^{\eta v}}$$

and all other partial derivatives of $g_{1,i}$ are zero, thus $\frac{e^{\eta(A\mathbf{y}^{*})_{i}}}{e^{\eta v}}$ is an eigenvalue of the Jacobian computed at $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})$ . Moreover, because of uniqueness of the optimal solution, it holds that $\frac{e^{\eta(A\mathbf{y}^{*})_{i}}}{e^{\eta v}}<1$ because $(A\mathbf{y}^{*})_{i}-v<0$ (see Lemma C.3 of the cited work for a proof, which uses Farkas’ Lemma). Similarly, for $j\notin\textrm{Supp}(\mathbf{y}^{*})$ it holds that $\frac{\partial g_{2,j}}{\partial y_{j}}=\frac{e^{-\eta(A^{\top}\mathbf{x}^{*})_{j}}}{e^{-\eta v}}<1$ (again by Lemma C.3, it holds that $(A^{\top}\mathbf{x}^{*})_{j}-v>0$ ) and all other partial derivatives of $g_{2,j}$ are zero; hence $\frac{e^{-\eta(A^{\top}\mathbf{x}^{*})_{j}}}{e^{-\eta v}}$ is an eigenvalue of the Jacobian computed at the optimal solution.
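The eigenvalue claimed for coordinates outside the support can be sanity-checked numerically. The sketch below uses a hypothetical $3\times 2$ game whose third row is strictly dominated, so that $\mathbf{x}^{*}=(3/7,4/7,0)$, $\mathbf{y}^{*}=(2/7,5/7)$ and $v=1/7$; it compares the closed-form value $e^{\eta((A\mathbf{y}^{*})_{i}-v)}$ with a finite-difference estimate of $\partial g_{1,i}/\partial x_{i}$ at the fixed point. The matrix, stepsize, and helper name are assumptions, and the perturbation treats $x_{i}$ as a free coordinate, as a partial derivative does.

```python
import math

def g1_coordinate(A, x, y, w, eta, i):
    """i-th coordinate of the x-update of the map g; w is the previous y.
    The previous x (called z in the text) does not enter this component."""
    def weight(k):
        ay = sum(a * yj for a, yj in zip(A[k], y))
        aw = sum(a * wj for a, wj in zip(A[k], w))
        return math.exp(2 * eta * ay - eta * aw)
    s_x = sum(xk * weight(k) for k, xk in enumerate(x))
    return x[i] * weight(i) / s_x

# Hypothetical game: the third row is strictly dominated by the second one.
A = [[3.0, -1.0], [-2.0, 1.0], [-3.0, -2.0]]
x_star, y_star = [3 / 7, 4 / 7, 0.0], [2 / 7, 5 / 7]
v = 1 / 7          # value x*^T A y*
eta = 0.05
i = 2              # index outside the support of x*

Ay_i = sum(a * yj for a, yj in zip(A[i], y_star))
eigenvalue = math.exp(eta * (Ay_i - v))   # closed form from the text

h = 1e-7
x_pert = list(x_star)
x_pert[i] += h
finite_diff = (g1_coordinate(A, x_pert, y_star, y_star, eta, i)
               - g1_coordinate(A, x_star, y_star, y_star, eta, i)) / h
print(eigenvalue, finite_diff)
```

The two numbers agree up to discretization error and are below one, consistent with the unused strategy losing mass geometrically near the equilibrium.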

Let $p(\lambda)$ be the characteristic polynomial of the matrix (10). After row/column operations it boils down to

where $q(\lambda)$ is the characteristic polynomial of

Experiments

For the latter case, we fix $n=50$ and we consider the error $\epsilon$ to be $\{0.5,0.25,0.0625,0.015625,0.007812\}$ . Starting from the uniform distribution, we count the number of iterations needed to reach error $\epsilon$ . The stepsize $\eta$ is fixed at $0.01$ at all times. The results can be found in Figure 4. If we had to guess, the relation between dimension and iterations appears to be between linear and quadratic (i.e., OMWU dynamics has roughly cubic to quartic running time in $n$ if we count the cost of each iteration as quadratic), and the number of iterations $t$ appears to grow polynomially in $1/\epsilon$ .

We note the importance of the stepsize $\eta$ : it must be sufficiently small for our proofs to work. If $\eta$ is chosen too large, then OMWU might not converge (it might cycle; we observed such behavior in experiments). On the other hand, the smaller $\eta$ is chosen, the smaller the progress of OMWU dynamics (see the inequality claim for KL divergence) and hence the slower the dynamics.
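The iteration-counting protocol described in this section can be sketched as follows. This is not the experimental code behind the reported figures: the $2\times 2$ matrix, the use of the additive Nash gap as the error measure, the iteration cap, and the function names are all assumptions; only the stepsize $0.01$, the uniform start, and the list of error levels are taken from the text.

```python
import math

def omwu_step(A, x, y, x_prev, y_prev, eta):
    """One OMWU step for max player x and min player y."""
    n, m = len(A), len(A[0])
    Ay = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
    Ayp = [sum(A[i][j] * y_prev[j] for j in range(m)) for i in range(n)]
    Atx = [sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
    Atxp = [sum(A[i][j] * x_prev[i] for i in range(n)) for j in range(m)]
    xw = [x[i] * math.exp(2 * eta * Ay[i] - eta * Ayp[i]) for i in range(n)]
    yw = [y[j] * math.exp(-2 * eta * Atx[j] + eta * Atxp[j]) for j in range(m)]
    return [v / sum(xw) for v in xw], [v / sum(yw) for v in yw]

def nash_gap(A, x, y):
    """Additive Nash gap, used here as the error measure (an assumption)."""
    n, m = len(A), len(A[0])
    Ay = [sum(A[i][j] * y[j] for j in range(m)) for i in range(n)]
    Atx = [sum(A[i][j] * x[i] for i in range(n)) for j in range(m)]
    v = sum(x[i] * Ay[i] for i in range(n))
    return max(max(Ay) - v, v - min(Atx))

def iterations_to_reach(A, eps_list, eta, max_iters):
    """First iteration at which the gap drops to each error level."""
    n, m = len(A), len(A[0])
    x = x_prev = [1.0 / n] * n
    y = y_prev = [1.0 / m] * m
    targets = sorted(eps_list, reverse=True)
    hits, t = {}, 0
    while targets and t <= max_iters:
        gap = nash_gap(A, x, y)
        while targets and gap <= targets[0]:
            hits[targets.pop(0)] = t
        x_next, y_next = omwu_step(A, x, y, x_prev, y_prev, eta)
        x_prev, y_prev, x, y = x, y, x_next, y_next
        t += 1
    return hits

# Hypothetical small game; the paper's experiments use larger matrices.
A = [[3.0, -1.0], [-2.0, 1.0]]
eps_list = [0.5, 0.25, 0.0625, 0.015625, 0.007812]
hits = iterations_to_reach(A, eps_list, eta=0.01, max_iters=300000)
for eps in sorted(hits, reverse=True):
    print(eps, hits[eps])
```

Repeating such a run over several matrix sizes and error levels would give the kind of iteration counts discussed above, although the exact numbers depend on the game and on the chosen error measure.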

Conclusion

In this paper we showed that a no-regret algorithm called Optimistic Multiplicative Weights Update (OMWU) converges pointwise to a Nash equilibrium in two-player zero-sum games (see also a concurrent work to ours, in which the authors provide a pointwise result about other dynamics, using different techniques). Our analysis is novel and does not follow the standard approaches of the literature of no-regret learning. We believe that our techniques can be useful in the analysis of other learning algorithms with no provable guarantees of pointwise convergence.

One interesting open question is to show that OMWU algorithm converges in polynomial time in $n,m$ (for proper choice of stepsize $\eta$ ) and find exact rates of convergence. Another possible future direction is to generalize our results about OMWU beyond the bilinear setting.

Appendix A Equations of the Jacobian of OMWU dynamics

Set $S_{x}=\sum_{t=1}^{n}x_{t}e^{2\eta(A\mathbf{y})_{t}-\eta(A\mathbf{w})_{t}}$ , $S_{y}=\sum_{t=1}^{m}y_{t}e^{-2\eta(A^{\top}\mathbf{x})_{t}+\eta(A^{\top}\mathbf{z})_{t}}$ and let $i,j$ be arbitrary indexes ( $g_{1,i}$ captures the $i$ -th coordinate of function $g_{1}$ etc),

Set $S_{x}=\sum_{t=1}^{n}x_{t}^{*}e^{\eta(A\mathbf{y}^{*})_{t}}$ , $S_{y}=\sum_{t=1}^{m}y_{t}^{*}e^{-\eta(A^{\top}\mathbf{x}^{*})_{t}}$ and let $i,j$ be arbitrary indexes ( $g_{1,i}$ captures the $i$ -th coordinate of function $g_{1}$ etc). Let $v=\mathbf{x}^{*\top}A\mathbf{y}^{*}$ ; it is not hard to see that $(A\mathbf{y}^{*})_{i}=(A^{\top}\mathbf{x}^{*})_{j}=v$ for all $i\in\textrm{Supp}(\mathbf{x}^{*}),j\in\textrm{Supp}(\mathbf{y}^{*})$ and hence $S_{x}=e^{\eta v},S_{y}=e^{-\eta v}$ . We get that:

Appendix B Missing claims and proofs

Lemma B.1 shows that the change between next and current iterate in both OMWU algorithms (classic and linear variant) is of order $O(\eta)$ and that the difference between the next iterate of both algorithms is $O(\eta^{2})$ .

Let $\mathbf{x}\in\Delta_{n}$ be the vector of the max player, $\mathbf{w},\mathbf{z}\in\Delta_{m}$ and suppose $\mathbf{x}^{\prime},\mathbf{x}^{\prime\prime}$ are the next iterates of OMWU and its linear variant with current vector $\mathbf{x}$ and vectors $\mathbf{w},\mathbf{z}$ of the min player. It holds that

Analogously, it holds for vector $\mathbf{y}\in\Delta_{m}$ of the min player and its next iterates.

Let $\eta$ be sufficiently small (smaller than maximum in absolute value entry of $A$ ).

and hence $\left\|\mathbf{x}^{\prime}-\mathbf{x}^{\prime\prime}\right\|_{1}$ is $O(\eta^{2})$ . Moreover we have that

By triangle inequality and the two above proofs we get the third part of the lemma. ∎

Lemmas B.2, B.3 and B.5 will be used in the proof of Theorem 3.1.

Let $\mathbf{x}\in\Delta_{n}$ , $\mathbf{w},\mathbf{z}\in\Delta_{m}$ and suppose $\mathbf{x}^{\prime},\mathbf{x}^{\prime\prime}$ are the next iterates of OMWU and its linear variant with current vector $\mathbf{x}$ and inputs $\mathbf{w},\mathbf{z}$ , i.e., $\mathbf{x}^{\prime}$ has coordinates $x_{i}^{\prime}=x_{i}\frac{e^{2\eta(A\mathbf{w})_{i}-\eta(A\mathbf{z})_{i}}}{\sum_{j}x_{j}e^{2\eta(A\mathbf{w})_{j}-\eta(A\mathbf{z})_{j}}}$ and $\mathbf{x}^{\prime\prime}$ has coordinates $x_{i}^{\prime\prime}=x_{i}\frac{1+{2\eta(A\mathbf{w})_{i}-\eta(A\mathbf{z})_{i}}}{\sum_{j}x_{j}(1+2\eta(A\mathbf{w})_{j}-\eta(A\mathbf{z})_{j})}$ . It holds that (for $\eta$ sufficiently small)

It suffices to prove the second equality. The rest follow from Lemma B.1. Set $B=(\mathbf{1}_{n}\mathbf{1}_{m}^{\top}+\eta A)$ . We have that $x_{i}^{\prime\prime}=x_{i}\frac{(B(2\mathbf{w}-\mathbf{z}))_{i}}{\mathbf{x}^{\top}B(2\mathbf{w}-\mathbf{z})}$ (from definition of linear variant of OMWU dynamics). It follows that

Using same arguments as in proof of Lemma B.2 we have the following lemma:

Let $\mathbf{y}\in\Delta_{m}$ , $\mathbf{w},\mathbf{z}\in\Delta_{n}$ and suppose $\mathbf{y}^{\prime}$ is the next iterate of OMWU with current vector $\mathbf{y}$ and inputs $\mathbf{w},\mathbf{z}$ , i.e., $\mathbf{y}^{\prime}$ has coordinates $y_{i}^{\prime}=y_{i}\frac{e^{-2\eta(A^{\top}\mathbf{w})_{i}+\eta(A^{\top}\mathbf{z})_{i}}}{\sum_{j}y_{j}e^{-2\eta(A^{\top}\mathbf{w})_{j}+\eta(A^{\top}\mathbf{z})_{j}}}$ . It holds that (for $\eta$ sufficiently small)

Let $(\mathbf{x}^{t},\mathbf{y}^{t})$ be the $t$ -th iterate of OMWU dynamics (5). For each time step $t\geq 2$ it holds that

The second equality comes from Lemmas B.2 and B.3. By canceling out the common terms and bringing the appropriate remaining terms to the LHS, the claim follows. ∎

Let $(\mathbf{x}^{t},\mathbf{y}^{t})$ denote the $t$ -th iterate of OMWU dynamics. It holds for $t\geq 2$ that

where $(\mathbf{x}^{*},\mathbf{y}^{*})$ is the optimal solution of the min-max problem.

It is true that $x_{i}^{t}\geq(1-O(\eta))x_{i}^{t-1}$ , hence $x_{i}^{t}\geq\frac{1}{2}x_{i}^{t-1}$ for $\eta$ sufficiently small. Therefore $2\mathbf{x}^{t}-\mathbf{x}^{t-1}$ lies in the simplex $\Delta_{n}$ (its coordinates are nonnegative and sum to $2-1=1$ ). Hence, since $(\mathbf{x}^{*},\mathbf{y}^{*})$ is the optimum (Nash equilibrium), we get that $(2\mathbf{x}^{t\ \top}-\mathbf{x}^{t-1\ \top})A\mathbf{y}^{*}\leq\mathbf{x}^{*\ \top}A\mathbf{y}^{*}$ ( $\mathbf{x}$ is the max player). The second inequality is proved similarly. ∎

We compute the difference between $D_{KL}((\mathbf{x}^{*},\mathbf{y}^{*})||(\mathbf{x}^{t+1},\mathbf{y}^{t+1}))$ and $D_{KL}((\mathbf{x}^{*},\mathbf{y}^{*})||(\mathbf{x}^{t},\mathbf{y}^{t}))$

We use Lemma B.5 and we get that ${-2\eta\mathbf{x}^{*\ \top}A\mathbf{y}^{t}+\eta\mathbf{x}^{*\ \top}A\mathbf{y}^{t-1}+2\eta\mathbf{x}^{t\ \top}A\mathbf{y}^{*}-\eta\mathbf{x}^{t-1\ \top}A\mathbf{y}^{*}}\leq 0$ , therefore the LHS (difference in the KL divergence) is at most

We furthermore use second order Taylor approximation ( $\eta$ is sufficiently small) to the function $e^{x}$ and we get that previous expression is at most

Finally, using Taylor approximation on $\log(1+x)$ and Lemma B.4 (last equality) we get the following system:

It is clear that as long as $(\mathbf{x}^{t},\mathbf{y}^{t})$ (and thus $(\mathbf{x}^{t-1},\mathbf{y}^{t-1})$ by Lemma B.1) is not $O({\eta}^{1/3})$ -close, from above inequalities/equalities we get

meaning that KL divergence decreases by at least $\Omega(\eta^{3})$ and the claim follows. ∎

Let $D$ be a real diagonal matrix with positive diagonal entries and $S$ be a real skew-symmetric matrix ( $S^{\top}=-S$ ). It holds that $SD$ has eigenvalues with real part zero (i.e., it has only imaginary eigenvalues).

Let $\mathbf{z}^{*}$ denote the conjugate transpose of a vector $\mathbf{z}\neq\mathbf{0}$ , and suppose $\mathbf{z}^{*}$ is a left eigenvector of $SD$ with complex eigenvalue $\lambda$ . It holds that

Since $D$ has positive diagonal entries, we conclude that $\mathbf{z}^{*}D^{-1}\mathbf{z}\neq 0$ (since $\mathbf{z}\neq\mathbf{0}$ ), thus $\lambda=-\overline{\lambda}$ and the claim follows. ∎
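For reference, one way to write out the identity behind this conclusion is sketched below. It assumes only what the lemma states: $\mathbf{z}^{*}SD=\lambda\mathbf{z}^{*}$ , $S$ is real with $S^{\top}=-S$ , and $D$ is real diagonal with positive entries (so $D^{-1}$ exists and $\mathbf{z}^{*}D^{-1}\mathbf{z}$ is a positive real number).

```latex
\begin{align*}
\mathbf{z}^{*} S D = \lambda\, \mathbf{z}^{*}
  &\;\Longrightarrow\; \mathbf{z}^{*} S \mathbf{z}
     = \lambda\, \mathbf{z}^{*} D^{-1} \mathbf{z}
  && \text{(multiply on the right by } D^{-1}\mathbf{z}\text{)} \\
\left(\mathbf{z}^{*} S \mathbf{z}\right)^{*}
  = \mathbf{z}^{*} S^{\top} \mathbf{z}
  = -\,\mathbf{z}^{*} S \mathbf{z}
  &\;\Longrightarrow\; -\lambda\, \mathbf{z}^{*} D^{-1} \mathbf{z}
     = \overline{\lambda}\, \mathbf{z}^{*} D^{-1} \mathbf{z}
  && \text{(conjugate transpose of both sides)}
\end{align*}
```

Dividing by the nonzero quantity $\mathbf{z}^{*}D^{-1}\mathbf{z}$ gives $\lambda=-\overline{\lambda}$ , i.e., the real part of $\lambda$ is zero.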