Cycles in adversarial regularized learning

Panayotis Mertikopoulos, Christos Papadimitriou, Georgios Piliouras

Introduction

color=DarkGreen!20!LightGray,author=CP,inline]Regularization is optimization’s latest and most incisive twist, its present zeitgeist: Through the introduction of a new component to the objective, a new algorithm results which overcomes ill-conditioning and overfitting, and achieves sparsity and parsimony without sacrificing efficiency. In the context of on-line optimization, regularization is exemplified…

Regularization is a fundamental and incisive method in optimization, its present zeitgeist and its entry into machine learning. Through the introduction of a new component in the objective, regularization techniques overcome ill-conditioning and overfitting, and they yield algorithms that achieve sparsity and parsimony without sacrificing efficiency .

In the context of online optimization, these features are exemplified in the family of learning algorithms known as “Follow the Regularized Leader” (FoReL) . “Follow the Regularized Leader” (FoReL) represents an important archetype of adaptive behavior for several reasons: it provides optimal min-max regret guarantees (O(t1/2)\operatorname{\mathcal{O}}(t^{-1/2}) in an adversarial setting), it offers significant flexibility with respect to the geometry of the problem at hand, and it captures numerous other dynamics as special cases (hedge, multiplicative weights, gradient descent, etc.) . As such, given that these regret guarantees hold without any further assumptions about how payoffs/costs are determined at each stage, the dynamics of FoReL have been the object of intense scrutiny and study in algorithmic game theory.

The standard way of analyzing such no-regret dynamics in games involves a two-step approach. The first step exploits the fact that the empirical frequency of play under a no-regret algorithm converges to the game’s set of coarse correlated equilibria (CCE). The second involves proving some useful property of the game’s coarse correlated equilibria: For instance, leveraging (λ,μ)(\lambda,\mu)-robustness implies that the social welfare at a CCE lies within a small constant of the optimum social welfare; as another example, the product of the marginal distributions of CCE in zero-sum games is Nash. In this way, the no-regret properties of FoReL can be turned into convergence guarantees for the players’ empirical frequency of play (that is, in a time-averaged, correlated sense).

Recently, several papers have moved beyond this “black-box” framework and focused instead on obtaining stronger regret/convergence guarantees for systems of learning algorithms coupled together in games with a specific structure. Along these lines, Daskalakis et al. and Rakhlin and Sridharan developed classes of dynamics that enjoy a O(logt/t)\operatorname{\mathcal{O}}(\log t/t) regret minimization rate in two-player zero-sum games. Syrgkanis et al. further analyzed a recency biased variant of FoReL in more general multi-player games and showed that it is possible to achieve an O(t3/4)\operatorname{\mathcal{O}}(t^{-3/4}) regret minimization rate. The social welfare converges at a rate of O(t1)\operatorname{\mathcal{O}}(t^{-1}), a result which was extended to standard versions of FoReL dynamics in .

Whilst a regret-based analysis provides significant insights about these systems, it does not answer a fundamental behavioral question:

Does the system converge to a Nash equilibrium? Does it even stabilize?

The dichotomy between a self-stabilizing, convergent system and a system with recurrent cycles is of obvious significance, but a regret-based analysis cannot distinguish between the two. Indeed, convergent, recurrent, and even chaotic systems may exhibit equally strong regret minimization properties in general games, so the question remains: What does the long-run behavior of FoReL look like, really?

This question becomes particularly interesting and important under perfect competition (such as zero-sum games and variants thereof). Especially in practice, zero-sum games can capture optimization “duels” : for example, two Internet search engines competing to maximize their market share can be modeled as players in a zero-sum game with a convex strategy space. In it was shown that the time-average of a regret-minimizing class of dynamics converges to an approximate equilibrium of the game. Finally, zero-sum games have also been used quite recently as a model for deep learning optimization techniques in image generation and discrimination .

In each of the above cases, min-max strategies are typically thought of as the axiomatically correct prediction. The fact that the time average of the marginals of a FoReL procedure converges to such states is considered as further evidence of the correctness of this prediction. However, the long-run behavior of the actual sequence of play (as opposed to its time-averages) seems to be trickier, and a number of natural questions arise:

Does optimization-driven learning converge under perfect competition?

Does fast regret minimization necessarily imply (fast) equilibration in this case?

We settle these questions with a resounding “no”. Specifically, we show that the behavior of FoReL in zero-sum games with an interior equilibrium (e.g. Matching Pennies) is Poincaré recurrent, implying that almost every trajectory revisits any (arbitrarily small) neighborhood of its starting point infinitely often. Importantly, the observed cycling behavior is robust to the agents’ choice of regularization mechanism (each agent could be using a different regularizer), and it applies to any positive affine transformation of zero-sum games (and hence all strictly competitive games ) even though these transformations lead to different trajectories of play. Finally, this cycling behavior also persists in the case of networked competition, i.e. for constant-sum polymatrix games .

Given that the no-regret guarantees of FoReL require a decreasing step-size (or learning rate),A standard trick is to decrease step-sizes by a constant factor after a window of “doubling” length . we focus on a smooth version of FoReL described by a dynamical system in continuous time. The resulting FoReL dynamics enjoy a particularly strong O(t1)\operatorname{\mathcal{O}}(t^{-1}) regret minimization rate and they capture as a special case the replicator dynamics and the projection dynamics , arguably the most widely studied game dynamics in biology, evolutionary game theory and transportation science . In this way, our analysis unifies and generalizes many prior results on the cycling behavior of evolutionary dynamics and it provides a new interpretation of these results through the lens of optimization and machine learning.

From a technical point of view, our analysis touches on several issues. Our first insight is to focus not on the simplex of the players’ mixed strategies, but on a dual space of payoff differences. The reason for this is that the vector of cumulative payoff differences between two strategies fully determines a player’s mixed strategy under FoReL, and it is precisely these differences that ultimately drive the players’ learning process. Under this transformation, FoReL exhibits a striking property, incompressibility: the flow of the dynamics is volume-preserving, so a ball of initial conditions in this dual space can never collapse to a point.

That being said, the evolution of such a ball in the space of payoffs could be transient, implying in particular that the players’ mixed strategies could converge (because the choice map that links payoff differences to strategies is nonlinear). To rule out such behaviors, we show that FoReL in zero-sum games with an interior Nash equilibrium has a further important property: it admits a constant of motion. Specifically, if x=(xi)iNx^{\ast}=(x^{\ast}_{i})_{i\in\mathcal{N}} is an interior equilibrium of the game and yiy_{i} is an arbitrary point in the payoff space of player ii, this constant is given by the coupling function

where hi(yi)=maxxi{yi,xihi(xi)}h_{i}^{\ast}(y_{i})=\max_{x_{i}}\{\langle y_{i},x_{i}\rangle-h_{i}(x_{i})\} is the convex conjugate of the regularizer hih_{i} that generates the learning process of player ii (for the details, see Sections 3 and 4). Coupled with the dynamics’ incompressibility, this invariance can be used to show that FoReL is recurrent: after some finite time, almost every trajectory returns arbitrarily close to its initial state.

On the other hand, if the game does not admit an interior equilibrium, the coupling above is no longer a constant of motion. In this case, GG decreases over time until the support of the players’ mixed strategies matches that of a Nash equilibrium with maximal support: as this point in time is approached, GG essentially becomes a constant. Thus, in general zero-sum games, FoReL wanders perpetually in the smallest face of the game’s strategy space containing all of the game’s equilibria; indeed, the only possibility that FoReL converges is if the game admits a unique Nash equilibrium in pure strategies – a fairly restrictive requirement.

Definitions from game theory

Players can also use mixed strategies, i.e. mixed probability distributions xi=(xiαi)αiAiΔ(Ai)x_{i}=(x_{i\alpha_{i}})_{\alpha_{i}\in\mathcal{A}_{i}}\in\Delta(\mathcal{A}_{i}) over their action sets Ai\mathcal{A}_{i}. The resulting probability vector xix_{i} is called a mixed strategy and we write Xi=Δ(Ai)\mathcal{X}_{i}=\Delta(\mathcal{A}_{i}) for the mixed strategy space of player ii. Aggregating over players, we also write X=iXi\mathcal{X}=\prod_{i}\mathcal{X}_{i} for the game’s strategy space, i.e. the space of all strategy profiles x=(xi)iNx=(x_{i})_{i\in\mathcal{N}}.

In this context (and in a slight abuse of notation), the expected payoff of the ii-th player in the profile x=(x1,,xN)x=(x_{1},\dotsc,x_{N}) is

To keep track of the payoff of each pure strategy, we also write viαi(x)=ui(αi;xi)v_{i\alpha_{i}}(x)=u_{i}(\alpha_{i};x_{-i}) for the payoff of strategy αiAi\alpha_{i}\in\mathcal{A}_{i} under the profile xXx\in\mathcal{X} and vi(x)=(viαi(x))αiAiv_{i}(x)=(v_{i\alpha_{i}}(x))_{\alpha_{i}\in\mathcal{A}_{i}} for the resulting payoff vector of player ii. We then have

where v,xvx\langle v,x\rangle\equiv v^{\top}x denotes the ordinary pairing between vv and xx.

The most widely used solution concept in game theory is that of a Nash equilibrium (NE), defined here as a mixed strategy profile xXx^{\ast}\in\mathcal{X} such that

for every deviation xiXix_{i}\in\mathcal{X}_{i} of player ii and all iNi\in\mathcal{N}. Writing supp(xi)={αiAi:xi>0}\operatorname{supp}(x^{\ast}_{i})=\{\alpha_{i}\in\mathcal{A}_{i}:x^{\ast}_{i}>0\} for the support of xiXix^{\ast}_{i}\in\mathcal{X}_{i}, a Nash equilibrium xXx^{\ast}\in\mathcal{X} is called pure if supp(xi)={αi}\operatorname{supp}(x^{\ast}_{i})=\{\alpha^{\ast}_{i}\} for some αiAi\alpha^{\ast}_{i}\in\mathcal{A}_{i} and all iNi\in\mathcal{N}. At the other end of the spectrum, xx^{\ast} is said to be interior (or fully mixed) if supp(xi)=Ai\operatorname{supp}(x^{\ast}_{i})=\mathcal{A}_{i} for all iNi\in\mathcal{N}. Finally, a coarse correlated equilibrium (CCE) is a distribution π\pi over the set of action profiles AiAi\mathcal{A}\equiv\prod_{i}\mathcal{A}_{i} such that, for every player iNi\in\mathcal{N} and every action βiAi\beta_{i}\in\mathcal{A}_{i}, we have αAvi(α)π(α)αiAivi(βi,αi)πi(αi)\sum_{\alpha\in\mathcal{A}}v_{i}(\alpha)\pi(\alpha)\geq\sum_{\alpha_{-i}\in\mathcal{A}_{-i}}v_{i}(\beta_{i},\alpha_{-i})\pi_{i}(\alpha_{-i}), where πi(αi)=αiαiπ(αi,αi)\pi_{i}(\alpha_{-i})=\sum_{\alpha_{i}\in\alpha_{i}}\pi(\alpha_{i},\alpha_{-i}) is the marginal distribution of π\pi with respect to ii.

2. Zero-sum games and zero-sum polymatrix games

Perhaps the most widely studied class of finite games (and certainly the first to be considered) is that of 22-player zero-sum games, i.e. when N={1,2}\mathcal{N}=\{1,2\} and u1=u2u_{1}=-u_{2}. Letting uu1=u2u\equiv u_{1}=-u_{2}, the value of a 22-player zero-sum game Γ\Gamma is defined as

with equality following from von Neumann’s celebrated min-max theorem . As is well known, the solutions of this saddle-point problem form a closed, convex set consisting precisely of the game’s Nash equilibria; moreover, the players’ equilibrium payoffs are simply uΓu_{\Gamma} and uΓ-u_{\Gamma} respectively. As a result, Nash equilibrium is the standard game-theoretic prediction in such games.

An important question that arises here is whether the straightforward equilibrium structure of zero-sum games extends to the case of a network of competitors. Following , an NN-player pairwise zero-/constant-sum polymatrix game consists of an (undirected) interaction graph GG(N,E)\mathcal{G}\equiv\mathcal{G}(\mathcal{N},\mathcal{E}) whose set of nodes N\mathcal{N} represents the competing players, with two nodes i,jNi,j\in\mathcal{N} connected by an edge e=(i,j)e=(i,j) in E\mathcal{E} if and only if the corresponding players compete with each other in a two-player zero-/constant-sum game.

where Ni={jN:{i,j}E}\mathcal{N}_{i}=\{j\in\mathcal{N}:\{i,j\}\in\mathcal{E}\} denotes the set of “neighbors” of player ii. In other words, the payoff to player ii is simply the the sum of all payoffs in the zero-/constant-sum games that player ii plays with their neighbors.

No-regret learning via regularization

Throughout this paper, our focus will be on repeated decision making in low-information environments where the players don’t know the rules of the game (perhaps not even that they are playing a game). In this case, even if the game admits a unique Nash equilibrium, it is not reasonable to assume that players are able to pre-compute their component of an equilibrium strategy – let alone assume that all players are fully rational, that there is common knowledge of rationality, etc.

With this in mind, we only make the bare-bones assumption that every player seeks to at least minimize their “regret”, i.e. the average payoff difference between a player’s mixed strategy at time t0t\geq 0 and the player’s best possible strategy in hindsight. Formally, assuming that play evolves in continuous time, the regret of player ii along the sequence of play x(t)x(t) is defined as

and we say that player ii has no regret under x(t)x(t) if lim suptRegi(t)0\limsup_{t\to\infty}\operatorname{Reg}_{i}(t)\leq 0.

The most widely used scheme to achieve this worst-case guarantee is known as “Follow the Regularized Leader” (FoReL), an exploitation-exploration class of policies that consists of playing a mixed strategy that maximizes the player’s expected cumulative payoff (the exploitation part) minus a regularization term (exploration). In our continuous-time framework, this is described by the learning dynamics

In Appendix A, we present in detail two of the prototypical examples of (FoReL): i ) the multiplicative weights (MW) dynamics induced by the entropic regularizer function hi(x)=αiAixiαilogxiαih_{i}(x)=\sum_{\alpha_{i}\in\mathcal{A}_{i}}x_{i\alpha_{i}}\log x_{i\alpha_{i}} (which lead to the replicator dynamics of evolutionary game theory); and ii ) the projection dynamics induced by the Euclidean regularizer hi(x)=12xi2h_{i}(x)=\frac{1}{2}\lVert x_{i}\rVert^{2}. For concreteness, we will assume in what follows that the regularizer of every player iNi\in\mathcal{N} satisfies the following minimal requirements:

hih_{i} is continuous and strictly convex on Xi\mathcal{X}_{i}.

hih_{i} is smooth on the relative interior of every face of Xi\mathcal{X}_{i} (including Xi\mathcal{X}_{i} itself).

Under these basic assumptions, the “regularized leader” Qi(yi)Q_{i}(y_{i}) is well-defined in the sense that (3.2) admits a unique solution. More importantly, we have the following no-regret guarantee:

A player following (FoReL) enjoys an O(1/t)\operatorname{\mathcal{O}}(1/t) regret bound, no matter what other players do. Specifically, if player iNi\in\mathcal{N} follows (FoReL), then, for every continuous trajectory of play xi(t)x_{-i}(t) of the opponents of player ii, we have

where 0pti=maxhiminhi0pt_{i}=\max h_{i}-\min h_{i} is a positive constant.

color=DodgerBlue!20!LightGray,author=PM,inline]Guys, review the stuff below to make sure I’m not committing any strategic blunders.

To streamline our discussion, we relegate the proof of Theorem 3.1 to Appendix C; we also refer to for a similar regret bound for (FoReL) in the context of online convex optimization. Instead of discussing the proof, we close this section by noting that (3.3) represents a striking improvement over the Θ(t1/2)\Theta(t^{-1/2}) worst-case bound for FoReL in discrete time . In view of this, the continuous-time framework we consider here can be seen as particularly amenable to learning because it allows players seek to minimize their regret (and thus converge to coarse correlated equilibria) at the fastest possible rate.

Recurrence in adversarial regularized learning

In this section, our aim is to take a closer look at the ramifications of fast regret minimization under (FoReL) beyond convergence to the set of coarse correlated equilibria. Indeed, as is well known, this set is fairly large and may contain thoroughly non-rationalizable strategies: for instance, Viossat and Zapechelnyuk recently showed that a coarse correlated equilibrium could assign positive selection probability only to strictly dominated strategies. Moreover, the time-averaging that is inherent in the definition of the players’ regret leaves open the possibility of complex day-to-day behavior e.g. periodicity, recurrence, limit cycles or chaos . Motivated by this, we examine the long-run behavior of the (FoReL) in the popular setting of zero-sum games (with or without interior equilibria) and several extensions thereof.

color=DodgerBlue!20!LightGray,author=PM,inline]George, please doublecheck what I wrote below: it was written in a hurry, so maybe I missed something. Also, what’s your go-to reference for all this? Arnold? Hirsch and Smale?

A key notion in our analysis is that of (Poincaré) recurrence. Intuitively, a dynamical system is recurrent if, after a sufficiently long (but finite) time, almost every state returns arbitrarily close to the system’s initial state.Here, “almost” means that the set of such states has full Lebesgue measure. More formally, given a dynamical system on X\mathcal{X} that is defined by means of a semiflow Φ ⁣:X×[0,)X\Phi\colon\mathcal{X}\times[0,\infty)\to\mathcal{X}, we have:Recall that a continuous map Φ ⁣:X×[0,)X\Phi\colon\mathcal{X}\times[0,\infty)\to\mathcal{X} is a semiflow if Φ(x,0)=x\Phi(x,0)=x and Φ(x,t+s)=Φ(Φ(x,t),s)\Phi(x,t+s)=\Phi(\Phi(x,t),s) for all xXx\in\mathcal{X} and all s,t0s,t\geq 0. Heuristically, Φt(x)Φ(x,t)\Phi_{t}(x)\equiv\Phi(x,t) describes the trajectory of the dynamical system starting at xx.

A point xXx\in\mathcal{X} is said to be recurrent under Φ\Phi if, for every neighborhood UU of xx in X\mathcal{X}, there exists an increasing sequence of times tnt_{n}\uparrow\infty such that Φ(x,tn)U\Phi(x,t_{n})\in U for all nn. Moreover, the flow Φ\Phi is called (Poincaré) recurrent if, for every measurable subset AA of X\mathcal{X}, the set of recurrent points in AA has full measure.

An immediate consequence of Definition 4.1 is that, if a point is recurrent, there exists an increasing sequence of times tnt_{n}\uparrow\infty such that Φ(x,tn)x\Phi(x,t_{n})\to x. On that account, recurrence can be seen as the flip side of convergence: under the latter, (almost) every initial state of the dynamics eventually reaches some well-defined end-state; instead, under the former, the system’s orbits fill the entire state space and return arbitarily close to their starting points infinitely often (so there is no possibility of convergence beyond trivial cases).

Our first result is that (FoReL) is recurrent (and hence, non-convergent) in zero-sum games with an interior Nash equilibrium:

Let Γ\Gamma be a 22-player zero-sum game that admits an interior Nash equilibrium. Then, almost every solution trajectory of (FoReL) is recurrent; specifically, for (Lebesgue) almost every initial condition x(0)=Q(y(0))Xx(0)=Q(y(0))\in\mathcal{X}, there exists an increasing sequence of times tnt_{n}\uparrow\infty such that x(tn)x(0)x(t_{n})\to x(0).

The proof of Theorem 4.2 is fairly complicated, so we outline the basic steps below:

We first show that the dynamics of the score sequence y(t)y(t) are incompressible, i.e. the volume of a set of initial conditions remains invariant as the dynamics evolve over time. By Poincaré’s recurrence theorem (cf. Appendix B), if every solution orbit y(t)y(t) of (FoReL) remains in a compact set for all t0t\geq 0, incompressibility implies recurrence.

To counter the possibility of solutions escaping to infinity, we introduce a transformed system based on the differences between scores (as opposed to the scores themselves). To establish boundedness in these dynamics, we consider the “primal-dual” coupling

where xx^{\ast} is an interior Nash equilibrium and hi(yi)=maxxiXi{yi,xihi(xi)}h_{i}^{\ast}(y_{i})=\max_{x_{i}\in\mathcal{X}_{i}}\{\langle y_{i},x_{i}\rangle-h_{i}(x_{i})\} denotes the convex conjugate of hih_{i}.This coupling is closely related to the so-called Bregman divergence – for the details, see . The key property of this coupling is that it remains invariant under (FoReL); however, its level sets are not bounded so, again, precompactness of solutions is not guaranteed.

Nevertheless, under the score transformation described above, the level sets of GG are compact. Since the transformed dynamics are invariant under said transformation, Poincaré’s theorem finally implies recurrence.

To make the above plan precise, fix some “benchmark” strategy α^iAi\hat{\alpha}_{i}\in\mathcal{A}_{i} for every player iNi\in\mathcal{N} and, for all αiAi{α^i}\alpha_{i}\in\mathcal{A}_{i}\mathopen{}\setminus\{\hat{\alpha}_{i}\}, consider the corresponding score differences

Now, under (FoReL), the score differences (4.2) evolve as

However, since the right-hand side (RHS) of (4.3) depends on x=Q(y)x=Q(y) and the mapping yzy\mapsto z is not invertible (so yy cannot be expressed as a function of zz), the above does not a priori constitute an autonomous dynamical system (as required to apply Poincaré’s recurrence theorem). Our first step below is to show that (4.3) does in fact constitute a well-defined dynamical system on zz.

where we used the fact that αiAixiαi=1\sum_{\alpha_{i}\in\mathcal{A}_{i}}x_{i\alpha_{i}}=1. The above shows that Qi(yi)=Qi(yi)Q_{i}(y_{i}^{\prime})=Q_{i}(y_{i}) if and only if Πi(yi)=Πi(yi)\Pi_{i}(y_{i})=\Pi_{i}(y_{i}^{\prime}), so Q^i\hat{Q}_{i} is well-defined.

Letting Q^(Q^1,,Q^N)\hat{Q}\equiv(\hat{Q}_{1},\dotsc,\hat{Q}_{N}) denote the aggregation of the players’ individual choice maps Q^i\hat{Q}_{i}, it follows immediately that Q(y)=Q^(Π(y))=Q^(z)Q(y)=\hat{Q}(\Pi(y))=\hat{Q}(z) by construction. Hence, the dynamics (4.3) may be written as

These dynamics obviously constitute an autonomous system, so our goal will be to use Liouville’s formula and Poincaré’s theorem in order to establish recurrence and then conclude that the induced trajectory of play x(t)x(t) is recurrent by leveraging the properties of Q^\hat{Q}.

As a first step towards applying Liouville’s formula, we note that the dynamics (4.6) are incompressible. Indeed, we have

because viv_{i} does not depend on xix_{i}. We thus obtain divzV(z)=0\operatorname{div}_{z}V(z)=0, i.e. the dynamics (4.6) are incompressible.

We now show that every solution orbit z(t)z(t) of (4.6) is precompact, that is, supt0z(t)<\sup_{t\geq 0}\lVert z(t)\rVert<\infty. To that end, note that the coupling G(y)=iN[hi(yi)yi,xi]G(y)=\sum_{i\in\mathcal{N}}[h_{i}^{\ast}(y_{i})-\langle y_{i},x^{\ast}_{i}\rangle] defined in (4.1) remains invariant under (FoReL) when Γ\Gamma is a 22-player zero-sum game. Indeed, by Lemma C.1, we have

where we used the fact that Qi=hiQ_{i}=\nabla h_{i}^{\ast} in the first line (cf. (C.2) above), and the assumption that xx^{\ast} is an interior Nash equilibrium of a 22-player zero-sum game in the last one. We conclude that G(y(t))G(y(t)) remains constant under (FoReL), as claimed.

We close this section by noting that the invariance of (4.1) under (FoReL) induces a foliation of X\mathcal{X}, with each individual trajectory of (FoReL) living on a “leaf” of the foliation (a level set of GG). Fig. 1 provides a schematic illustration of this foliation/cycling structure.

2. Zero-sum games with no interior equilibria

At first sight, Theorem 4.2 suggests that cycling is ubiquitous in zero-sum games; however, if the game does not admit an interior equilibrium, the behavior of (FoReL) turns out to be qualitatively different. To state our result for such games, it will be convenient to assume that the players’ regularizer functions are strongly convex, i.e. each hih_{i} can be bounded from below by a quadratic minorant:

for all xi,xiXix_{i},x_{i}^{\prime}\in\mathcal{X}_{i} and for all tt\in. Under this technical assumption, we have:

Let Γ\Gamma be a 22-player zero-sum game that does not admit an interior Nash equilibrium. Then, for every initial condition of (FoReL), the induced trajectory of play x(t)x(t) converges to the boundary of X\mathcal{X}. Specifically, if xx^{\ast} is a Nash equilibrium of Γ\Gamma with maximal support, x(t)x(t) converges to the relative interior of the face of X\mathcal{X} spanned by supp(x)\operatorname{supp}(x^{\ast}).

Theorem 4.3 is our most comprehensive result for the behavior of (FoReL) in zero-sum games, so several remarks are in order. First, we note that Theorem 4.3 complements Theorem 4.2 in a very natural way: specifically, if Γ\Gamma admits an interior Nash equilibrium, Theorem 4.3 suggests that the solutions of (FoReL) will stay within the relative interior X\mathcal{X}^{\circ} of X\mathcal{X} (since an interior equilibrium is supported on all actions). Of course, Theorem 4.2 provides a stronger result because it states that, within X\mathcal{X}^{\circ}, (FoReL) is recurrent. Hence, applying both results in tandem, we obtain the following heuristic for the behavior of (FoReL) in zero-sum games:

In the long run, (FoReL) wanders in perpetuity in the smallest face of X\mathcal{X} containing the equilibrium set of Γ\Gamma.

This leads to two extremes: On the one hand, if Γ\Gamma admits an interior equilibrium, (FoReL) is recurrent and cycles in the level sets of the coupling function (4.1). At the other end of the spectrum, if Γ\Gamma admits only a single, pure equilibrium, then (FoReL) converges to it (since it has to wander in a singleton set). In all other “in-between” cases, (FoReL) exhibits a hybrid behavior, converging to the face of X\mathcal{X} that is spanned by the maximal support equilibrium of Γ\Gamma, and then cycling in that face in perpetuity.

The reason for this behavior is that the coupling (4.1) is no longer a constant of motion of (FoReL) if the game does not admit an interior equilibrium. As we show in Appendix C, the coupling (4.1) is strictly decreasing when the support of x(t)x(t) is strictly greater than that of a Nash equilibrium xx^{\ast} with maximal support. When the two match, the rate of change of (4.1) drops to zero, and we fall back to a “constrained” version of Theorem 4.2. We make this argument precise in Appendix C (where we present the proof of Theorem 4.3).

3. Zero-sum polymatrix games & positive affine payoff transformations

We close this section by showing that the recurrence properties of (FoReL) are not unique to “vanilla” zero-sum games, but also occur when there is a network of competitors – i.e. in NN-player zero-sum polymatrix games. In fact, the recurrence results carry over to any NN-player game which is isomorphic to a constant-sum polymatrix game with an interior equilibrium up to a positive-affine payoff transformation (possibly different transformation for each agent). For example, this class of games contains all strictly competitive games . Such transformations do not affect the equilibrium structure of the game, but can affect the geometry of the trajectories; nevertheless, the recurrent behavior persists as shown by the following result:

Let Γ=(Γe)eE\Gamma=(\Gamma_{e})_{e\in\mathcal{E}} be a constant-sum polymatrix game (or a positive affine payoff transformation thereof). If Γ\Gamma admits an interior Nash equilibrium, almost every solution trajectory of (FoReL) is recurrent; specifically, for (Lebesgue) almost every initial condition x(0)=Q(y(0))Xx(0)=Q(y(0))\in\mathcal{X}, there exists an increasing sequence of times tnt_{n}\uparrow\infty such that x(tn)x(0)x(t_{n})\to x(0).

We leave the case of zero-sum polymatrix games with no interior equilibria to future work.

Conclusions

Our results show that the behavior of regularized learning in adversarial environments is considerably more intricate than the strong no-regret properties of FoReL might at first suggest. Even though the empirical frequency of play under FoReL converges to the set of coarse correlated equilibria (possibly at an increased rate, depending on the game’s structure), the actual trajectory of play under FoReL is recurrent and exhibits cycles in zero-sum games. We find this property particularly interesting as it suggests that “black box” guarantees are not the be-all/end-all of learning in games: the theory of dynamical systems is rife with complex phenomena and notions that arise naturally when examining the behavior of learning algorithms in finer detail.

Appendix A Examples of FoReL dynamics

Perhaps the most widely known example of a regularized choice map is the so-called logit choice map

This choice model was first studied in the context of discrete choice theory by McFadden and it leads to the multiplicative weights (MW) dynamics:The terminology “multiplicative weights” refers to the fact that (MW) is the continuous version of the discrete-time multiplicative weights update rule: xiαi(t+1)=xiαi(t)eηiviαi(x(t))βiAixiβi(t)eηiviβi(x(t)),x_{i\alpha_{i}}(t+1)=\frac{x_{i\alpha_{i}}(t)e^{\eta_{i}v_{i\alpha_{i}}(x(t))}}{\sum_{\beta_{i}\in\mathcal{A}_{i}}x_{i\beta_{i}}(t)e^{\eta_{i}v_{i\beta_{i}}(x(t))}}, (MWU) where ηi>0\eta_{i}>0 is the scheme’s “learning rate”. For more details about (MWU), we refer the reader to .

As is well known, the logit map above is obtained by the model (3.2) by considering the entropic regularizer

i.e. the (negative) Gibbs–Shannon entropy function. A simple differentiation of (MW) then shows that the players’ mixed strategies evolve according to the dynamics

This equation describes the replicator dynamics of , the most widely studied model for evolution under natural selection in population biology and evolutionary game theory. The basic relation between (MW) and (RD) was first noted in a single-agent environment by and was explored further in game theory by and many others.

Another widely used example of regularization is given by the quadratic penalty

The induced choice map (3.2) is the (Euclidean) projection map

leading to the projected reinforcement learning process

The players’ mixed strategies are then known to follow the projection dynamics

over all intervals for which the support of x(t)x(t) remains constant . The dynamics (PD) were introduced in game theory by as a geometric model of the evolution of play in population games; for a closely related approach, see also and references therein.

Appendix B Liouville’s formula and Poincaré recurrence

Below we present for completeness some basic results from the theory of dynamical systems.

Liouville’s formula can be applied to any system of autonomous differential equations with a continuously differentiable vector field ξ\xi on an open domain of Sk\mathcal{S}\subset^{k}. The divergence of ξ\xi at xSx\in\mathcal{S} is defined as the trace of the corresponding Jacobian at xx, i.e., div[ξ(x)]=i=1kξixi(x)\text{div}[\xi(x)]=\sum_{i=1}^{k}\frac{\partial\xi_{i}}{\partial x_{i}}(x). Since divergence is a continuous function we can compute its integral over measurable sets ASA\subset\mathcal{S}. Given any such set AA, let A(t)={Φ(x0,t):x0A}A(t)=\{\Phi(x_{0},t):x_{0}\in A\} be the image of AA under map Φ\Phi at time tt. A(t)A(t) is measurable and is volume is vol[A(t)]=A(t)dx\text{vol}[A(t)]=\int_{A(t)}dx. Liouville’s formula states that the time derivative of the volume A(t)A(t) exists and is equal to the integral of the divergence over A(t)A(t):

A vector field is called divergence free if its divergence is zero everywhere. Liouville’s formula trivially implies that volume is preserved in such flows.

Poincaré’s recurrence theorem

The notion of recurrence that we will be using in this paper goes back to Poincaré and specifically to his study of the three-body problem. In 1890, in his celebrated work , he proved that whenever a dynamical system preserves volume almost all trajectories return arbitrarily close to their initial position, and they do so an infinite number of times. More precisely, Poincaré established the following:

Poincaré Recurrence: If a flow preserves volume and has only bounded orbits then for each open set there exist orbits that intersect the set infinitely often.

Appendix C Technical proofs

The first result that we prove in this appendix is a key technical lemma concerning the evolution of the coupling function (4.1):

Let piXip_{i}\in\mathcal{X}_{i} and let Gi(yi)=hi(yi)yi,piG_{i}(y_{i})=h_{i}^{\ast}(y_{i})-\langle y_{i},p_{i}\rangle denote the coupling (4.1) for player iNi\in\mathcal{N}. If player iNi\in\mathcal{N} follows (FoReL), we have

for every trajectory of play xi(t)x_{-i}(t) of all players other than ii.

We begin by recalling the “maximizing argument” identity

which expresses the choice map QiQ_{i} as a function of the convex conjugate of hih_{i} [40, p. 149]. With this at hand, a simple differentiation gives

where the last step follows from the fact that xi(t)=Qi(yi(t))=hi(yi(t))x_{i}(t)=Q_{i}(y_{i}(t))=\nabla h_{i}^{\ast}(y_{i}(t)). ∎

Armed with this lemma, we proceed to prove the no-regret guarantees of (FoReL):

Fix some base point piXip_{i}\in\mathcal{X}_{i} and let Li(t)=Gi(yi(t))=hi(yi(t))yi(t),piL_{i}(t)=G_{i}(y_{i}(t))=h_{i}(y_{i}(t))-\langle y_{i}(t),p_{i}\rangle. Then, by Lemma C.1, we have

and hence, after integrating and rearranging, we get

where we used the fact that ui(pi;xi)=vi(x),piu_{i}(p_{i};x_{-i})=\langle v_{i}(x),p_{i}\rangle – cf. Eq. 2.2 in Section 2. However, expanding the RHS of (C.5), we get

where we used the defining property of convex conjugation in the second and third lines above – i.e. that hi(yi)yi,xihi(xi)h_{i}^{\ast}(y_{i})\geq\langle y_{i},x_{i}\rangle-h_{i}(x_{i}) for all xiXix_{i}\in\mathcal{X}_{i}, with equality if and only if xi=Qi(yi)x_{i}=Q_{i}(y_{i}). Thus, maximizing (C.5) over piXip_{i}\in\mathcal{X}_{i}, we finally obtain

We now turn to two-player zero-sum games that do not admit interior equilibria. To describe such equilibria in more detail, we consider below the notion of essential and non-essential strategies:

A strategy αi\alpha_{i} of agent i{1,2}i\in\{1,2\} in a zero sum game is called essential if there exists a Nash equilibrium in which player ii plays αi\alpha_{i} with positive probability. A strategy that is not essential is called non-essential.

As it turns out, the Nash equilibria of a zero-sum game admit a very useful characterization in terms of essential and non-essential strategies:

Let Γ\Gamma be a 22-player zero-sum game that does not admit an interior Nash equilibrium. Then, there exists a mixed Nash equilibrium (x1,x2)(x_{1},x_{2}) such that a) each agent plays each of their essential strategies with positive probability; and b) for each agent deviating to a non-essential strategy results to a strictly worse performance than the value of the game.

The key step in proving this characterization is Farkas’ lemma; the version we employ here is due to Gale, Kuhn and Tucker ): color=DodgerBlue!20!LightGray,author=PM,inline]Provide reference? Done.

color=DodgerBlue!20!LightGray,author=PM,inline]Some final fine-tuning needed here, will do so tomorrow.

Assume without loss of generality that the value of the zero-sum game is zero. and that the first agent is a maximizing agent. Let AA be the payoff matrix of the first agent and hence AT=AA^{T}=A the payoff matrix of the second/minimizing agent. We will show first that for any non-essential strategy αi\alpha_{i} of each agent there exists a Nash equilibrium strategy of his opponent such that the expected performance of αi\alpha_{i} is strictly worse than the value of the game (i.e. zero).

It suffices to argue this for the first agent. Let αi\alpha_{i} be one of his non-essential strategies then by definition there does not exist any Nash equilibrium strategy of that agent that chooses αi\alpha_{i} with positive probability. This is equivalent to the negation of the following statement:

To complete the proof, for each essential strategy of the first agent there exists one equilibrium strategy of his that chooses it with positive probability (by definition). Similarly, for each non-essential strategy of the second agent there exists one equilibrium strategy of the first agent such that makes the expected payoff of that non-essential strategy strictly worse than the value of the game. The barycenter of all the above equilibrium strategies is still an equilibrium strategy (by convexity) and has all the desired properties. ∎

With all this at hand, we are finally in a position to prove Theorem 4.3:

We first show that the coupling G(y)=iN[hi(yi)yi,xi]G(y)=\sum_{i\in\mathcal{N}}[h_{i}^{\ast}(y_{i})-\langle y_{i},x^{\ast}_{i}\rangle] defined in (4.1) given any fully mixed initial condition strictly increases under (FoReL) when Γ\Gamma is a 22-player zero-sum game that does not have an equilibrium with full support.

Indeed, by (C.3) there exists a mixed Nash equilibrium (x1,x2)(x^{\ast}_{1},x^{\ast}_{2}) such that i) both players employ each of their essential strategies with positive probability over time; and ii) every player deviating to a non-essential strategy obtains a payoff lower than the value of the game. As a result, any player playing an interior (fully mixed) strategy against such an equilibrium strategy must receive less utility than their value. In more detail, we have

where we used the fact that Qi=hiQ_{i}=\nabla h_{i}^{\ast} in the first line (cf. Appendix D), and the assumption that xx^{\ast} is a Nash equilibrium of a 22-player zero-sum game such that any agent playing an interior (fully mixed) strategy against such an equilibrium strategy must receive less utility than their value (and hence the agent himself receives more utility than the value of the game). We thus conclude that G(y(t))G(y(t)) strictly increases under (FoReL), as claimed.

Let x=(x1,x2)x^{\ast}=(x^{\ast}_{1},x^{\ast}_{2}) be the Nash equilibrium identified in (C) and let L(t)=G(y(t))=iN[hi(yi(t))yi(t),xi]L(t)=G(y(t))=\sum_{i\in\mathcal{N}}[h_{i}^{\ast}(y_{i}(t))-\langle y_{i}(t),x^{\ast}_{i}\rangle] denote the primal-dual coupling (4.1) between y(t)y(t) and xx^{\ast}. From (C), we have that starting from any fully mixed strategy profile x(0)iint(Xi)x(0)\in\prod_{i}\text{int}(\mathcal{X}_{i}) and for all t0t\geq 0, L(t)=y˙(t),G(y(t))<0L^{\prime}(t)=\langle\dot{y}(t),\nabla G(y(t))\rangle<0. However, GG is bounded from below by imaxxiXihi(xi)-\sum_{i}\max_{x_{i}\in\mathcal{X}_{i}}h_{i}(x_{i}), and since G(y(t))G(y(t)) is strictly decreasing, it must exhibit a finite limit.

We begin by noting that x(t)=Q(y(t))x(t)=Q(y(t)) is Lipschitz continuous in tt. Indeed, vv is Lipschitz continuous on X\mathcal{X} by linearity; furthermore, since the regularizer functions hih_{i} are assumed KiK_{i}-strongly convex, it follows that QiQ_{i} is (1/Ki)(1/K_{i})-continuous by standard convex analysis arguments [32, Theorem 12.60]. In turn, this implies that the field of motion V(y)v(Q(y))V(y)\equiv v(Q(y)) of (FoReL) is Lipschitz continuous, so the dynamics are well-posed and y(t)y(t) is differentiable. Since y˙=v\dot{y}=v and, in addition, vv is bounded on X\mathcal{X}, we conclude that y˙\dot{y} is bounded so, in particular, y(t)y(t) is Lipschitz continuous on [0,)[0,\infty). We thus conclude that x(t)=Q(y(t))x(t)=Q(y(t)) is Lipschitz continuous as the composition of Lipschitz continuous functions.

We now further claim that L(t)L^{\prime}(t) is also Lipschitz continuous in tt. Indeed, by (C.1), we have L(t)=iNvi(x(t)),xi(t)xiL^{\prime}(t)=\sum_{i\in\mathcal{N}}\langle v_{i}(x(t)),x_{i}(t)-x^{\ast}_{i}\rangle; since viv_{i} is Lipschitz continuous in xx and x(t)x(t) is Lipschitz continuous in tt, our claim follows trivially. Hence, by Lemma D.3, we conclude that limtL(t)=limiNvi(x(t),xi(t)xi=0\lim_{t\to\infty}L^{\prime}(t)=\lim_{\to\infty}\sum_{i\in\mathcal{N}}\langle v_{i}(x(t),x_{i}(t)-x^{\ast}_{i}\rangle=0.

By (C), we know that L(t)<0L^{\prime}(t)<0 as long as x(t)x(t) is interior. Hence, any ω\omega-limit x^\hat{x} of x(t)x(t) cannot be interior (given that the embedded game does not have any interior Nash equilibria). Moreover, we can repeat this argument for any subspace such that the restriction of the game on that subspace (when ignoring the strategies that are played with probability zero) does not have a fully mixed NE. We thus conclude that the support of x^\hat{x} must be a subset of the support of xx^{\ast}. Since Γ\Gamma does not admit an interior equilibrium, xx^{\ast} does not have full support, so every ω\omega-limit of x(t)x(t) lies on the boundary of X\mathcal{X}, as claimed. ∎

We close this appendix with the proof of our result on constant-sum polymatrix games (and positive affine transformations thereof):

Our proof follows closely that of Theorem 4.2; to streamline our presentation, we only highlight here the points that differ due to working with (an positive-affine transformations of) a network of constant-sum games (as opposed to a single 22-player zero-sum game).

The first such point is the incompressibility of the “reduced” dynamics (4.6). By definition, we have ui(x)=jNiuij(xi,xj)u_{i}(x)=\sum_{j\in\mathcal{N}_{i}}u_{ij}(x_{i},x_{j}), so we also have

Since uij(αi,xj)u_{ij}(\alpha_{i},x_{j}) does not depend on xix_{i}, we readily obtain αiviαi(x)=0\partial_{\alpha_{i}}v_{i\alpha_{i}}(x)=0 and incompressibility follows as before.

Let the network game in question be isomorphic to a network of constant-sum games after the following positive-affine transformation of utilities, ui(x)aiui(x)+biu_{i}(x)\leftarrow a_{i}u_{i}(x)+b_{i} where ai>0a_{i}>0. The second point of interest is the use of the coupling G(y)=iNai[hi(yi)yi,xi]G(y)=\sum_{i\in\mathcal{N}}a_{i}[h_{i}^{\ast}(y_{i})-\langle y_{i},x^{\ast}_{i}\rangle] as a constant of motion for (FoReL). Indeed, adapting the derivation of (C.1), we now get

where the third line follows by regrouping the summands in the second line by edge, and the last line follows as in the case of (C.1). This implies that G(y(t))G(y(t)) remains constant along any solution of (FoReL), so the rest of the proof follows as in the case of Theorem 4.2. ∎

Appendix D Auxiliary results

In this appendix, we provide two auxiliary results that are used in the proof of Theorem 4.2. The first one shows that if the score difference between two strategies grows large, the strategy with the lower score becomes extinct:

Set xn=Q(yn)x_{n}=Q(y_{n}) and, by descending to a subsequence if necessary, assume there exists some ε>0\varepsilon>0 such that xα,nε>0x_{\alpha,n}\geq\varepsilon>0 for all nn. Then, by the defining relation Q(y)=arg max{y,xh(x)}Q(y)=\operatorname*{arg\,max}\{\langle y,x\rangle-h(x)\} of QQ, we have:

for all xΔx^{\prime}\in\Delta. Therefore, taking xn=xn+ε(eβeα)x_{n}^{\prime}=x_{n}+\varepsilon(e_{\beta}-e_{\alpha}), we readily obtain

which contradicts our original assumption that yα,nyβ,ny_{\alpha,n}-y_{\beta,n}\to-\infty. With Δ\Delta compact, the above shows that xα=0x_{\alpha}^{\ast}=0 for any limit point xx^{\ast} of xnx_{n}, i.e. Qα(yn)0Q_{\alpha}(y_{n})\to 0. ∎

A key step of the proof of Theorem 4.2 consists of showing that the level sets of the Fenchel coupling G(p,y)G(p,y) become bounded under the coordinate reduction transformation yΠ(y)=zy\mapsto\Pi(y)=z, so every solution orbit z(t)z(t) of (4.6) also remains bounded. We encode this in the following lemma:

We argue by contradiction. Indeed, assume that the sequence Gnh(yn)yn,pG_{n}\equiv h^{\ast}(y_{n})-\langle y_{n},p\rangle is bounded but lim supnyα,nyβ,n=\limsup_{n\to\infty}\lvert y_{\alpha,n}-y_{\beta,n}\rvert=\infty for some α,βA\alpha,\beta\in\mathcal{A}. Letting yn+=maxαyα,ny_{n}^{+}=\max_{\alpha}y_{\alpha,n} and yn=minαAyα,ny_{n}^{-}=\min_{\alpha\in\mathcal{A}}y_{\alpha,n}, this implies that lim supn(yn+yn)=\limsup_{n\to\infty}(y_{n}^{+}-y_{n}^{-})=\infty. Hence, by descending to a subsequence if necessary, there exist α+,αA\alpha^{+},\alpha^{-}\in\mathcal{A} such that a) yn±=yα±,ny_{n}^{\pm}=y_{\alpha^{\pm},n}for all nn; and b) yα+,nyα,ny_{\alpha^{+},n}-y_{\alpha^{-},n}\to\inftyas nn\to\infty.

By construction, we have yα,n=ynyα,nyn+=yα+,ny_{\alpha^{-},n}=y_{n}^{-}\leq y_{\alpha,n}\leq y_{n}^{+}=y_{\alpha^{+},n} for all αA\alpha\in\mathcal{A}. Thus, by descending to a further subsequence if necessary, we may assume that the index set A\mathcal{A} can be partitioned into two nonempty sets A+\mathcal{A}^{+} and A\mathcal{A}^{-} such that

yn+yα,ny_{n}^{+}-y_{\alpha,n} is bounded for all αA+\alpha\in\mathcal{A}^{+}.

yn+yα,ny_{n}^{+}-y_{\alpha,n}\to\infty for all αA\alpha\in\mathcal{A}^{-}.

and construct the required partition {A+,A}\{\mathcal{A}^{+},\mathcal{A}^{-}\} according to the following procedure:

Thus, if we let xn=Q(yn)x_{n}=Q(y_{n}) we readily obtain:

where we used the fact that αApα=αAxα,n=1\sum_{\alpha\in\mathcal{A}}p_{\alpha}=\sum_{\alpha\in\mathcal{A}}x_{\alpha,n}=1 in the first line. The first sum above is bounded by assumption. As for the second one, the fact that yα+,nyα,n=yn+yα,ny_{\alpha^{+},n}-y_{\alpha,n}=y_{n}^{+}-y_{\alpha,n}\to\infty implies that xα,n0x_{\alpha,n}\to 0 for all αA\alpha\in\mathcal{A}^{-} (by Lemma D.1 above). We thus get lim infn(pαxα,n)>0\liminf_{n}(p_{\alpha}-x_{\alpha,n})>0 (recall that pΔ ⁣p\in\Delta^{\!\circ}), and hence, αA(yα,nyn+)(pαxα,n)\sum_{\alpha\in\mathcal{A}^{-}}(y_{\alpha,n}-y_{n}^{+})(p_{\alpha}-x_{\alpha,n})\to-\infty.

From the above, we conclude that yn,pxn\langle y_{n},p-x_{n}\rangle\to-\infty as nn\to\infty. However, by construction, we also have

Since hh is finite on xx, it follows that GnG_{n}\to-\infty, contradicting our assumption that GnG_{n} is bounded. Retracing our steps, this implies that supnyα,nyβ,n<\sup_{n}\lvert y_{\alpha,n}-y_{\beta,n}\rvert<\infty, as claimed. ∎

The final result we state here is a technical result regarding the asymptotic behavior of the derivative of functions with a finite limit at infinity:

References