Stable Opponent Shaping in Differentiable Games

Alistair Letcher, Jakob Foerster, David Balduzzi, Tim Rocktäschel, Shimon Whiteson

Introduction

While machine learning has traditionally focused on optimising single objectives, generative adversarial nets (GANs) (Goodfellow et al., 2014) have showcased the potential of architectures dealing with multiple interacting goals. They have since then proliferated substantially, including intrinsic curiosity (Pathak et al., 2017), imaginative agents (Racanière et al., 2017), synthetic gradients (Jaderberg et al., 2017), hierarchical reinforcement learning (RL) (Wayne & Abbott, 2014; Vezhnevets et al., 2017) and multi-agent RL in general (Busoniu et al., 2008).

These can effectively be viewed as differentiable games played by cooperating and competing agents – which may simply be different internal components of a single system, like the generator and discriminator in GANs. The difficulty is that each loss depends on all parameters, including those of other agents. While gradient descent on single functions has been widely successful, converging to local minima under rather mild conditions (Lee et al., 2017), its simultaneous generalisation can fail even in simple two-player, two-parameter zero-sum games. No algorithm has yet been shown to converge, even locally, in all differentiable games.

Related Work.

Convergence has widely been studied in convex nn-player games, see especially Rosen (1965); Facchinei & Kanzow (2007). However, the recent success of non-convex games exemplified by GANs calls for a better understanding of this general class where comparatively little is known. Mertikopoulos & Zhou (2018) recently prove local convergence of no-regreat learning to variationally stable equilibria, though under a number of regularity assumptions.

Conversely, a number of algorithms have been successful in the non-convex setting for restricted classes of games. These include policy prediction in two-player two-action bimatrix games (Zhang & Lesser, 2010); WoLF in two-player two-action games (Bowling & Veloso, 2001); AWESOME in repeated games (Conitzer & Sandholm, 2007); Optimistic Mirror Descent in two-player bilinear zero-sum games (Daskalakis et al., 2018) and Consensus Optimisation (CO) in two-player zero-sum games (Mescheder et al., 2017). An important body of work including Heusel et al. (2017); Nagarajan & Kolter (2017) has also appeared for the specific case of GANs.

Working towards bridging this gap, some of the authors recently proposed Symplectic Gradient Adjustment (SGA), see Balduzzi et al. (2018). This algorithm is provably ‘attracted’ to stable fixed points while ‘repelled’ from unstable ones in all differentiable games (nn-player, non-convex). Nonetheless, these results are weaker than strict convergence guarantees. Moreover, SGA agents may act against their own self-interest by prioritising stability over individual loss. SGA was also discovered independently by Gemp & Mahadevan (2018), drawing on variational inequalities.

In a different direction, Learning with Opponent-Learning Awareness (LOLA) (Foerster et al., 2018) modifies the learning objective by predicting and differentiating through opponent learning steps. This is intuitively appealing and experimentally successful, encouraging cooperation in settings like the Iterated Prisoner’s Dilemma (IPD) where more stable algorithms like SGA defect. However, LOLA has no guarantees of converging or even preserving fixed points of the game.

Contribution.

We begin by constructing the first explicit tandem game where LOLA agents adopt ‘arrogant’ behaviour and converge to non-fixed points. We pinpoint the cause of failure and show that a natural variant named LookAhead (LA), discovered before LOLA by Zhang & Lesser (2010), successfully preserves fixed points. We then prove that LookAhead locally converges and avoids strict saddles in all differentiable games, filling a theoretical gap in multi-agent learning. This is enabled through a unified approach based on fixed-point iterations and dynamical systems. These techniques apply equally well to algorithms like CO and SGA, though this is not our present focus.

While LookAhead is theoretically robust, the shaping component endowing LOLA with a capacity to exploit opponent dynamics is lost. We solve this dilemma with an algorithm named Stable Opponent Shaping (SOS), trading between stability and exploitation by interpolating between LookAhead and LOLA. Using an intuitive and theoretically grounded criterion for this interpolation parameter, SOS inherits both strong convergence guarantees from LA and opponent shaping from LOLA.

On the experimental side, we show that SOS plays tit-for-tat in the IPD on par with LOLA, while all other methods mostly defect. We display the practical consequences of our theoretical guarantees in the tandem game, where SOS always outperforms LOLA. Finally we implement a more involved GAN setup, testing for mode collapse and mode hopping when learning Gaussian mixture distributions. SOS successfully spreads mass across all Gaussians, at least matching dedicated algorithms like CO, while LA is significantly slower and simultaneous gradient descent fails entirely.

Background

We frame the problem of multi-agent learning as a game. Adapted from Balduzzi et al. (2018), the following definition insists only on differentiability for gradient-based methods to apply. This concept is strictly more general than stochastic games, whose parameters are usually restricted to action-state transition probabilities or functional approximations thereof.

We write iLk=θiLk\nabla_{i}L^{k}=\nabla_{\theta^{i}}L^{k} and ijLk=θjθiLk\nabla_{ij}L^{k}=\nabla_{\theta^{j}}\nabla_{\theta^{i}}L^{k} for any i,j,ki,j,k. Define the simultaneous gradient of the game as the concatenation of each player’s gradient,

The iith component of ξ\xi is the direction of greatest increase in LiL^{i} with respect to θi\theta^{i}. If each agent minimises their loss independently from others, they perform GD on their component iLi\nabla_{i}L^{i} with learning rate αi\alpha_{i}. Hence, the parameter update for all agents is given by θθαξ\theta\leftarrow\theta-\alpha\odot\xi, where \alpha=(\alpha_{1},\ldots,\alpha_{n})^{\mathchoice{\raisebox{0.0pt}{\displaystyle\intercal}}{\raisebox{0.0pt}{\textstyle\intercal}}{\raisebox{0.0pt}{\scriptstyle\intercal}}{\raisebox{0.0pt}{\scriptscriptstyle\intercal}}} and \odot is element-wise multiplication. This is also called naive learning (NL), reducing to θθαξ\theta\leftarrow\theta-\alpha\xi if agents have the same learning rate. This is assumed for notational simplicity, though irrelevant to our results. The following example shows that NL can fail to converge.

Consider L1/2=±xyL^{1/2}=\pm xy, where players control the xx and yy parameters respectively. The origin is a (global and unique) Nash equilibrium. The simultaneous gradient is ξ=(y,x)\xi=(y,-x) and cycles around the origin. Explicitly, a gradient step from (x,y)(x,y) yields

which has distance from the origin (1+α2)(x2+y2)>(x2+y2)(1+\alpha^{2})(x^{2}+y^{2})>(x^{2}+y^{2}) for any α>0\alpha>0 and (x,y)0(x,y)\neq 0. It follows that agents diverge away from the origin for any α>0\alpha>0. The cause of failure is that ξ\xi is not the gradient of a single function, implying that each agent’s loss is inherently dependent on others. This results in a contradiction between the non-stationarity of each agent, and the optimisation of each loss independently from others. Failure of convergence in this simple two-player zero-sum game shows that gradient descent does not generalise well to differentiable games. We consider an alternative solution concept to Nash equilibria before introducing LOLA.

2 Stable fixed points

Consider the game given by L1=L2=xyL^{1}=L^{2}=xy where players control the xx and yy parameters respectively. The optimal solution is (x,y)±(,)(x,y)\to\pm(\infty,-\infty), since then L1=L2L^{1}=L^{2}\to-\infty. However the origin is a global Nash equilibrium, while also a saddle point of xyxy. It is highly undesirable to converge to the origin in this game, since infinitely better losses can be reached in the anti-diagonal direction. In this light, Nash equilibria cannot be the right solution concept to aim for in multi-agent learning. To define stable fixed points, first introduce the ‘Hessian’ of the game as the block matrix

This can equivalently be viewed as the Jacobian of the vector field ξ\xi. Importantly, note that HH is not symmetric in general unless n=1n=1, in which case we recover the usual Hessian H=2LH=\nabla^{2}L.

A point θˉ\bar{\theta} is a fixed point if ξ(θˉ)=0\xi(\bar{\theta})=0. It is stable if H(θˉ)0H(\bar{\theta})\succeq 0, unstable if H(θˉ)0H(\bar{\theta})\prec 0 and a strict saddle if H(θˉ)H(\bar{\theta}) has an eigenvalue with negative real part.

The name ‘fixed point’ is coherent with GD, since ξ(θˉ)=0\xi(\bar{\theta})=0 implies a fixed update θˉθˉαξ(θˉ)=θˉ\bar{\theta}\leftarrow\bar{\theta}-\alpha\xi(\bar{\theta})=\bar{\theta}. Though Nash equilibria were shown to be inadequate above, it is not obvious that stable fixed points (SFPs) are a better solution concept. In Appendix A we provide intuition for why SFPs are both closer to local minima in the context of multi-loss optimisation, and more tractable for convergence proofs. Moreover, this definition is an improved variant on that in Balduzzi et al. (2018), assuming positive semi-definiteness only at θˉ\bar{\theta} instead of holding in a neighbourhood. This makes the class of SFPs as large as possible, while sufficient for all our theoretical results.

Assuming invertibility of H(θˉ)H(\bar{\theta}) at SFPs is crucial to all convergence results in this paper. The same assumption is present in related work including Mescheder et al. (2017), and cannot be avoided. Even for single losses, a fixed point with singular Hessian can be a local minimum, maximum, or saddle point. Invertibility is thus necessary to ensure that SFPs really are ‘local minima’. This is omitted from now on. Finally note that unstable fixed points are a subset of strict saddles, making 6 both stronger and more general than results for SGA by Balduzzi et al. (2018).

3 Learning with opponent-learning awareness (LOLA)

Accounting for nonstationarity, Learning with Opponent-Learning Awareness (LOLA) modifies the learning objective by predicting and differentiating through opponent learning steps (Foerster et al., 2018). For simplicity, if n=2n=2 then agent 1 optimises L1(θ1,θ2+Δθ2)L^{1}(\theta^{1},\theta^{2}+\Delta\theta^{2}) with respect to θ1\theta^{1}, where Δθ2\Delta\theta^{2} is the predicted learning step for agent 2. Foerster et al. (2018) assume that opponents are naive learners, namely Δθ2=α22L2\Delta\theta^{2}=-\alpha_{2}\nabla_{2}L^{2}. After first-order Taylor expansion, the loss is approximately given by L1+2L1Δθ2L^{1}+\nabla_{2}L^{1}\cdot\Delta\theta^{2}. By minimising this quantity, agent 1 learns parameters that align the opponent learning step Δθ2\Delta\theta^{2} with the direction of greatest decrease in L1L^{1}, exploiting opponent dynamics to further reduce one’s losses. Differentiating with respect to θ1\theta^{1}, the adjustment is

By explicitly differentiating through Δθ2\Delta\theta^{2} in the rightmost term, LOLA agents actively shape opponent learning. This has proven effective in reaching cooperative equilibria in multi-agent learning, finding success in a number of games including tit-for-tat in the IPD. The middle term above was originally dropped by the authors because “LOLA focuses on this shaping of the learning direction of the opponent”. We choose not to eliminate this term, as also inherent in LOLA-DiCE (Foerster et al., 2018). Preserving both terms will in fact be key to developing stable opponent shaping.

Writing {\mathchoice{\raisebox{0.0pt}{\displaystyle\chi}}{\raisebox{0.0pt}{\textstyle\chi}}{\raisebox{0.0pt}{\scriptstyle\chi}}{\raisebox{0.0pt}{\scriptscriptstyle\chi}}}=\operatorname{diag}(H_{o}^{\mathchoice{\raisebox{0.0pt}{\displaystyle\intercal}}{\raisebox{0.0pt}{\textstyle\intercal}}{\raisebox{0.0pt}{\scriptstyle\intercal}}{\raisebox{0.0pt}{\scriptscriptstyle\intercal}}}\nabla L), the LOLA gradient adjustment is

While experimentally successful, LOLA fails to preserve fixed points θˉ\bar{\theta} of the game since

in general. Even if θˉ\bar{\theta} is a Nash equilibrium, the update θˉθˉα\textsclolaθˉ\bar{\theta}\leftarrow\bar{\theta}-\alpha\textsc{lola}\neq\bar{\theta} can push them away despite parameters being optimal. This may worsen the losses for all agents, as in the game below.

Imagine a tandem controlled by agents facing opposite directions, who feed xx and yy force into their pedals respectively. Negative numbers correspond to pedalling backwards.

Moving coherently requires xyx\approx-y, embodied by a quadratic loss (x+y)2(x+y)^{2}. However it is easier for agents to pedal forwards, translated by linear losses 2x-2x and 2y-2y. The game is thus given by L1(x,y)=(x+y)22xL^{1}(x,y)=(x+y)^{2}-2x and L2(x,y)=(x+y)22yL^{2}(x,y)=(x+y)^{2}-2y. These sub-goals are incompatible, so agents cannot simply accelerate forwards. The SFPs are given by {x+y=1}\{x+y=1\}. Computing {\mathchoice{\raisebox{0.8pt}{\displaystyle\chi}}{\raisebox{0.8pt}{\textstyle\chi}}{\raisebox{0.8pt}{\scriptstyle\chi}}{\raisebox{0.8pt}{\scriptscriptstyle\chi}}}(x,1-x)=(4,4)\neq 0, none of these are preserved by LOLA. Instead, we show in Appendix C that LOLA can only converge to sub-optimal scenarios with worse losses for both agents, for any α\alpha.

Intuitively, the root of failure is that LOLA agents try to shape opponent learning and enforce compliance by accelerating forwards, assuming a dynamic response from their opponent. The other agent does the same, so they become ‘arrogant’ and suffer by pushing strongly in opposite directions.

Method

The shaping term χ\textstyle\chi prevents LOLA from preserving fixed points. Consider removing this component entirely, giving (IαHo)ξ(I-\alpha H_{o})\xi. This variant preserves fixed points, but what does it mean from the perspective of each agent? Note that LOLA optimises L1(θ1,θ2+Δθ2)L^{1}(\theta^{1},\theta^{2}+\Delta\theta^{2}) with respect to θ1\theta^{1}, while Δθ2\Delta\theta^{2} is a function of θ1\theta^{1}. In other words, we assume that our opponent’s learning step depends on our current optimisation with respect to θ1\theta^{1}. This is inaccurate, since opponents cannot see our updated parameters until the next step. Instead, assume we optimise L1(θ1,θ^2+Δθ2(θ^1,θ^2))L^{1}(\theta^{1},\hat{\theta}^{2}+\Delta\theta^{2}(\hat{\theta}^{1},\hat{\theta}^{2})) where θ^1,θ^2\hat{\theta}^{1},\hat{\theta}^{2} are the current parameters. After Taylor expansion, the gradient with respect to θ1\theta^{1} is given by

since Δθ2(θ^1,θ^2)\Delta\theta^{2}(\hat{\theta}^{1},\hat{\theta}^{2}) does not depend on θ1\theta^{1}. In vectorial form, we recover the variant (IαHo)ξ(I-\alpha H_{o})\xi since the shaping term corresponds precisely to differentiating through Δθ2\Delta\theta^{2}. We name this LookAhead, which was discovered before LOLA by Zhang & Lesser (2010) though not explicitly named. Using the stop-gradient operator \botThis operator is implemented in TensorFlow as stop_gradient and in PyTorch as detach., this can be reformulated as optimising L1(θ1,θ2+Δθ2)L^{1}(\theta^{1},\theta^{2}+\bot\Delta\theta^{2}) where \bot prevents gradient flowing from Δθ2\Delta\theta^{2} upon differentiation.

The main result of Zhang & Lesser (2010) is that LookAhead converges to Nash equilibria in the small class of two-player, two-action bimatrix games. We will prove local convergence to SFP and non-convergence to strict saddles in all differentiable games. On the other hand, by discarding the problematic shaping term, we also eliminated LOLA’s capacity to exploit opponent dynamics and encourage cooperation. This will be witnessed in the IPD, where LookAhead agents mostly defect.

2 Stable opponent shaping (SOS)

We propose Stable Opponent Shaping (SOS), an algorithm preserving both advantages at once. Define the partial stop-gradient operator pp+(1p)I\bot^{p}\coloneqq p\bot+(1-p)I, where II is the identity and pp stands for partial. A pp-LOLA agent optimises the modified objective

collapsing to LookAhead at p=0p=0 and LOLA at p=1p=1. The resulting gradient is given by

with ξ0=\textscla\xi_{0}=\textsc{la}. We obtain an algorithm trading between shaping and stability as a function of pp. Note however that preservation of fixed points only holds if pp is infinitesimal, in which case pp-LOLA is almost identical to LookAhead – losing the very purpose of interpolation. Instead we propose a two-part criterion for pp at each learning step, through which all guarantees descend.

First choose pp such that ξp\xi_{p} points in the same direction as LookAhead. This will not be enough to prove convergence itself, but prevents arrogant behaviour by ensuring convergence only to fixed points. Formally, the first criterion is given by ξp,ξ00\langle\xi_{p},\xi_{0}\rangle\geq 0. If \langle-\alpha{\mathchoice{\raisebox{0.8pt}{\displaystyle\chi}}{\raisebox{0.8pt}{\textstyle\chi}}{\raisebox{0.8pt}{\scriptstyle\chi}}{\raisebox{0.8pt}{\scriptscriptstyle\chi}}},\xi_{0}\rangle\geq 0 then ξp,ξ00\langle\xi_{p},\xi_{0}\rangle\geq 0 automatically, so we choose p=1p=1 for maximal shaping. Otherwise choose

with any hyperparameter 0<a<10<a<1. This guarantees a positive inner product

We complement this with a second criterion ensuring local convergence. The idea is to scale pp by a function of ξ\lVert\xi\rVert if ξ\lVert\xi\rVert is small enough, which certainly holds in neighbourhoods of fixed points. Let 0<b<10<b<1 be a hyperparameter and take p=ξ2p=\lVert\xi\rVert^{2} if ξ<b\lVert\xi\rVert<b, otherwise p=1p=1. Choosing p1p_{1} and p2p_{2} according to these criteria, the two-part criterion is p=min{p1,p2}p=\min\{p_{1},p_{2}\}. SOS is obtained by combining pp-LOLA with this criterion, as summarised in Algorithm 1. Crucially, all theoretical results in the next section are independent from the choice of hyperparameters aa and bb.

Theoretical Results

Our central theoretical contribution is that LookAhead and SOS converge locally to SFP and avoid strict saddles in all differentiable games. Since the learning gradients involve second-order Hessian terms, our results assume thrice continuously differentiable losses (omitted hereafter). Losses which are C2C^{2} but not C3C^{3} are very degenerate, so this is a mild assumption. Statements made about SOS crucially hold for any hyperparameters a,b(0,1)a,b\in(0,1). See Appendices D, LABEL: and E for detailed proofs.

Convergence is proved using Ostrowski’s Theorem. This reduces convergence of a gradient adjustment gg to positive stability (eigenvalues with positive real part) of g\nabla g at stable fixed points.

Let H0H\succeq 0 be invertible with symmetric diagonal blocks. Then there exists ϵ>0\epsilon>0 such that (IαHo)H(I-\alpha H_{o})H is positive stable for all 0<α<ϵ0<\alpha<\epsilon.

This type of result would usually be proved either by analytical means showing positive definiteness and hence positive stability, or direct eigenvalue analysis. We show in Appendix D that (IαHo)H(I-\alpha H_{o})H is not necessarily positive definite, while there is no necessary relationship between eigenpairs of HH and HoH_{o}. This makes our theorem all the more interesting and non-trivial. We use a similarity transformation trick to circumvent the dual obstacle, allowing for analysis of positive definiteness with respect to a new inner product. We obtain positive stability by invariance under change of basis.

LookAhead converges locally to stable fixed points for α>0\alpha>0 sufficiently small.

Using the second criterion for pp, we prove local convergence of SOS in all differentiable games despite the presence of a shaping term (unlike LOLA).

SOS converges locally to stable fixed points for α>0\alpha>0 sufficiently small.

2 Avoiding strict saddles

Using the first criterion for pp, we prove that SOS only converges to fixed points (unlike LOLA).

If SOS converges to θˉ\bar{\theta} and α>0\alpha>0 is small then θˉ\bar{\theta} is a fixed point of the game.

Now assume that θ\theta is initialised randomly (or with arbitrarily small noise), as is standard in ML. Let F(θ)=θαξp(θ)F(\theta)=\theta-\alpha\xi_{p}(\theta) be the SOS iteration. Using both the second criterion and the Stable Manifold Theorem from dynamical systems, we can prove that every strict saddle θˉ\bar{\theta} has a neighbourhood UU such that {θUFn(θ)θˉ as n}\{\theta\in U\mid F^{n}(\theta)\to\bar{\theta}\text{ as }n\to\infty\} has measure zero for α>0\alpha>0 sufficiently small. Since θ\theta is initialised randomly, we obtain the following result.

SOS locally avoids strict saddles almost surely, for α>0\alpha>0 sufficiently small.

This also holds for LookAhead, and could be strenghtened to global initialisations provided a strong boundedness assumption on H2\lVert H\rVert_{2}. This is trickier for SOS since p(θ)p(\theta) is not globally continuous. Altogether, our results for LookAhead and the correct criterion for pp-LOLA lead to some of the strongest theoretical guarantees in multi-agent learning. Furthermore, SOS retains all of LOLA’s opponent shaping capacity while LookAhead does not, as shown experimentally in the next section.

Experiments and Discussion

We evaluate the performance of SOS in three differentiable games. We first showcase opponent shaping and superiority over LA/CO/SGA/NL in the Iterated Prisoner’s Dilemma (IPD). This leaves SOS and LOLA, which have differed only in theory up to now. We bridge this gap by showing that SOS always outperforms LOLA in the tandem game, avoiding arrogant behaviour by decaying pp while LOLA overshoots. Finally we test SOS on a more involved GAN learning task, with results similar to dedicated methods like Consensus Optimisation.

This game is an infinite sequence of the well-known Prisoner’s Dilemma, where the payoff is discounted by a factor γ[0,1)\gamma\in[0,1) at each iteration. Agents are endowed with a memory of actions at the previous state. Hence there are 55 parameters for each agent ii: the probability Pi(Cstate)P^{i}(C\mid state) of cooperating at start state s0=s_{0}=\varnothing or state st=(at11,at12)s_{t}=(a_{t-1}^{1},a_{t-1}^{2}) for t>0t>0. One Nash equilibrium is to always defect (DD), with a normalised loss of 22. A better equilibrium with loss 11 is named tit-for-tat (TFT), where each player begins by cooperating and then mimicks the opponent’s previous action.

We run 300 training episodes for SOS, LA, CO, SGA and NL. The parameters are initialised following a normal distribution around 1/21/2 probability of cooperation, with unit variance. We fix α=1\alpha=1 and γ=0.96\gamma=0.96, following Foerster et al. (2018). We choose a=0.5a=0.5 and b=0.1b=0.1 for SOS. The first is a robust and arbitrary middle ground, while the latter is intentionally small to avoid poor SFP.

Tandem:

Though local convergence is guaranteed for SOS, it is possible that SOS diverges from poor initialisations. This turns out to be impossible in the tandem game since the Hessian is globally positive semi-definite. We show this explicitly by running 300 training episodes for SOS and LOLA. Parameters are initialised following a normal distribution around the origin. We found performance to be robust to hyperparameters a,ba,b. Here we fix a=b=0.5a=b=0.5 and α=0.1\alpha=0.1.

Gaussian mixtures:

We reproduce a setup from Balduzzi et al. (2018). The game is to learn a Gaussian mixture distribution using GANs. Data is sampled from a highly multimodal distribution designed to probe the tendency to collapse onto a subset of modes during training – see ground truth in Appendix F. The generator and discriminator networks each have 6 ReLU layers of 384 neurons, with 2 and 1 output neurons respectively. Learning rates are chosen by grid search at iteration 8k, with a=0.5a=0.5 and b=0.1b=0.1 for SOS, following the same reasoning as the IPD.

2 Results and discussion

Results are given in Figure 2. Parameters in part (A) are the end-run probabilities of cooperating for each memory state, encoded in different colours. Only 50 runs are shown for visibility. Losses at each step are displayed in part (B), averaged across 300 episodes with shaded deviations.

SOS and LOLA mostly succeed in playing tit-for-tat, displayed by the accumulation of points in the correct corners of (A) plots. For instance, CC and CD points are mostly in the top right and left corners so agent 2 responds to cooperation with cooperation. Agents also cooperate at the start state, represented by \varnothing points all hidden in the top right corner. Tit-for-tat strategy is further indicated by the losses close to 11 in part (B). On the other hand, most points for LA/CO/SGA/NL are accumulated at the bottom left, so agents mostly defect. This results in poor losses, demonstrating the limited effectiveness of recent proposals like SGA and CO. Finally note that trained parameters and losses for SOS are almost identical to those for LOLA, displaying equal capacity in opponent shaping while also inheriting convergence guarantees and outperforming LOLA in the next experiment.

Tandem:

Results are given in Figure 3. SOS always succeeds in decreasing pp to reach the correct equilibria, with losses averaging at . LOLA fails to preserve fixed points, overshooting with losses averaging at 4/94/9. The criterion for SOS is shown in action in part (B), decaying pp to avoid overshooting. This illustrates that purely theoretical guarantees descend into practical outperformance. Note that SOS even gets away from the LOLA fixed points if initialised there (not shown), converging to improved losses using the alignment criterion with LookAhead.

Gaussian mixtures:

The generator distribution and KL divergence are given at {\{2k, 4k, 6k, 8k}\} iterations for NL, CO and SOS in Figure 4. Results for SGA, LOLA and LA are in Appendix F. SOS achieves convincing results by spreading mass across all Gaussians, as do CO/SGA/LOLA. LookAhead is significantly slower, while NL fails through mode collapse and hopping. Only visual inspection was used for comparison by Balduzzi et al. (2018), while KL divergence gives stronger numerical evidence here. SOS and CO are slightly superior to others with reference to this metric. However CO is aimed specifically toward two-player zero-sum GAN optimisation, while SOS is widely applicable with strong theoretical guarantees in all differentiable games.

Conclusion

Theoretical results in machine learning have significantly helped understand the causes of success and failure in applications, from optimisation to architecture. While gradient descent on single losses has been studied extensively, algorithms dealing with interacting goals are proliferating, with little grasp of the underlying dynamics. The analysis behind CO and SGA has been helpful in this respect, though lacking either in generality or convergence guarantees. The first contribution of this paper is to provide a unified framework and fill this theoretical gap with robust convergence results for LookAhead in all differentiable games. Capturing stable fixed points as the correct solution concept was essential for these techniques to apply.

Furthermore, we showed that opponent shaping is both a powerful approach leading to experimental success and cooperative behaviour – while at the same time preventing LOLA from preserving fixed points in general. This conundrum is solved through a robust interpolation between LookAhead and LOLA, giving birth to SOS through a robust criterion. This was partially enabled by choosing to preserve the ‘middle’ term in LOLA, and using it to inherit stability from LookAhead. This results in convergence guarantees stronger than all previous algorithms, but also in practical superiority over LOLA in the tandem game. Moreover, SOS fully preserves opponent shaping and outperforms SGA, CO, LA and NL in the IPD by encouraging tit-for-tat policy instead of defecting. Finally, SOS convincingly learns Gaussian mixtures on par with the dedicated CO algorithm.

Acknowledgements

This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713). It was also supported by the Oxford-Google DeepMind Graduate Scholarship.

References

Appendix A Stable Fixed Points

In the main text we showed that Nash equilibria are inadequate in multi-agent learning, exemplified by the simple game given by L1=L2=xyL^{1}=L^{2}=xy, where the origin is a global Nash equilibrium but a saddle point of the losses. It is not however obvious that SFP are a better solution concept. We begin by pointing out that for single losses, invertibility and symmetry of the Hessian imply positive definiteness at SFP. These are exactly local minima of LL detected by the second partial derivative test, namely those points provably attainable by gradient descent.

To emphasise this, note that gradient descent does not converge locally to all local minima. This can be seen by considering the example L(x,y)=y2L(x,y)=y^{2} and the local (global) minimum (0,0)(0,0). There is no neighbourhood for which gradient descent converges to (0,0)(0,0), since initialising at (x0,y0)(x_{0},y_{0}) will always converge to (x0,0)(x_{0},0) for appropriate learning rates, with x00x_{0}\neq 0 almost surely. This occurs precisely because the Hessian is singular at (0,0)(0,0). Though a degenerate example, this suggests an important difference to make between the ideal solution concept (local minima) and that for which local convergence claims are possible to attain (local minima with invertible H0H\succeq 0).

Accordingly, the definition of SFP is the immediate generalisation of ‘fixed points with positive semi-definite Hessian’, or in other words, ‘second-order-tractable local minima’. It is important to impose only positive semi-definiteness to keep the class as large as possible, despite strict positive definiteness holding for single losses due to symmetry. Imposing strict positivity would for instance exclude the origin in the cyclic game L1=xy=L2L^{1}=xy=-L^{2}, a point certainly worthy of convergence.

Note also that imposing a weaker condition than H0H\succeq 0 would be incorrect. Invertibility aside, local convergence of gradient descent on single functions cannot be guaranteed if H0H\nsucceq 0, since such points are strict saddles. These are almost always avoided by gradient descent, as proven by Lee et al. (2016) and Panageas & Piliouras (2017). It is thus necessary to impose H0H\succeq 0 as a minimal requirement in optimisation methods attempting to generalise gradient descent.

A matrix HH is positive semi-definite iff the same holds for its symmetric part S=(H+H^{\mathchoice{\raisebox{0.0pt}{\displaystyle\intercal}}{\raisebox{0.0pt}{\textstyle\intercal}}{\raisebox{0.0pt}{\scriptstyle\intercal}}{\raisebox{0.0pt}{\scriptscriptstyle\intercal}}})/2, so SFP could equivalently be defined as S(θˉ)0S(\bar{\theta})\succeq 0. This is the original formulation given by part of the authors (Balduzzi et al., 2018), who also imposed the extra requirement S(θ)0S(\theta)\succeq 0 in a neighbourhood of θˉ\bar{\theta}. After discussion we decided to drop this assumption, pointing out that it is 1) more restrictive, 2) superficial to all theoretical results and 3) weakens the analogy with tractable local minima. The only thing gained by imposing semi-positivity in a neighbourhood is that SFP become a subset of Nash equilibria.

Regarding unstable fixed points and strict saddles, note that H(θˉ)0H(\bar{\theta})\succ 0 implies H(θ)0H(\theta)\succ 0 in a neighbourhood, hence being equivalent to the definition in Balduzzi et al. (2018). It follows also that unstable points are a subset of strict saddles: if H(θˉ)0H(\bar{\theta})\prec 0 then all eigenvalues are negative since any eigenpair (v,λ)(v,\lambda) satisfies

We introduced strict saddles in this paper as a generalisation of unstable FP, which are more difficult to handle but nonetheless tractable using dynamical systems. The name is chosen by analogy to the definition in Lee et al. (2016) for single losses.

Appendix B Lola Vectorial Form

in the usual assumption of equal learning rates.

for agent 11, and so on for each agent. First-order Taylor expansion yields

and similarly for each agent. Differentiating with respect to θi\theta^{i}, the adjustment for player ii is

Appendix C Tandem Game

We provide a more detailed exposition of the tandem game in this section, including computation of fixed points for NL/LOLA and corresponding losses. Recall that the game is given by

Intuitively, agents wants to have xyx\approx-y since (x+y)2(x+y)^{2} is the leading loss, but would also prefer to have positive xx and yy. These are incompatible, so the agents must not be ‘arrogant’ and instead make concessions. The fixed points are given by

namely any pair (x,1x)(x,1-x). The corresponding losses are L1=12x=L2L^{1}=1-2x=-L^{2}, summing to for any xx. We have

everywhere, so all fixed points are SFP. LOLA fails to preserve these, since

which is non-zero for any SFP (x,1x)(x,1-x). Instead, LOLA can only converge to points such that

The fixed points for LOLA are thus pairs (x,y)(x,y) such that

noting that (12α)/(14α)>1(1-2\alpha)/(1-4\alpha)>1 for all α>0\alpha>0. This leads to worse losses

for agent 1 and similarly for agent 2. In particular, losses always sum to something greater than . This becomes negligible as the learning rate becomes smaller, but is always positive nonetheless Taking α\alpha arbitrarily small is not a viable solution since convergence will in turn be arbitrarily slow. LOLA is thus not a strong algorithm candidate for all differentiable games.

Appendix D Convergence Proofs

We use Ostrowski’s theorem as a unified framework for proving local convergence of gradient-based methods. This is a standard result on fixed-point iterations, adapted from (Ortega & Rheinboldt, 2000, 10.1.3). We also invoke and prove a topological result of our own, D.9, at the end of this section. This is useful in deducing local convergence, though not central to intuition.

A matrix MM is called positive stable if all its eigenvalues have positive real part.

Recall the simultaneous gradient ξ\xi and the Hessian HH defined for differentiable games. Let XX be any matrix with continuously differentiable entries.

Assume xˉ\bar{x} is a fixed point of a differentiable game such that XH(xˉ)XH(\bar{x}) is positive stable. Then the iterative procedure

converges locally to xˉ\bar{x} for α>0\alpha>0 sufficiently small.

By definition of fixed points, ξ(xˉ)=0\xi(\bar{x})=0 and so

is positive stable by assumption, namely has eigenvalues ak+ibka_{k}+ib_{k} with ak>0a_{k}>0. It follows that

has eigenvalues 1αakiαbk1-\alpha a_{k}-i\alpha b_{k}, which are in the unit circle for small α\alpha. More precisely,

which is always possible for ak>0a_{k}>0. Hence F(xˉ)\nabla F(\bar{x}) has eigenvalues in the unit circle for 0<α<mink2ak/(ak2+bk2)0<\alpha<\min_{k}2a_{k}/(a_{k}^{2}+b_{k}^{2}), and we are done by Ostrowski’s Theorem since xˉ\bar{x} is a fixed point of FF. ∎

We apply this corollary to LookAhead, which is given by

where X=(IαHo)X=(I-\alpha H_{o}). It is thus sufficient to prove the following result.

Let H0H\succeq 0 invertible with symmetric diagonal blocks. Then there exists ϵ>0\epsilon>0 such that (IαHo)H(I-\alpha H_{o})H is positive stable for all 0<α<ϵ0<\alpha<\epsilon.

Note that (IαHo)H(I-\alpha H_{o})H may fail to be positive definite, though true in the case of 2×22\times 2 matrices. This no longer holds in higher dimensions, exemplified by the Hessian

By direct computation (symbolic in α\alpha), one can show that G=(IαHo)HG=(I-\alpha H_{o})H always has positive eigenvalues for small α>0\alpha>0, whereas its symmetric part SS always has a negative eigenvalue with magnitude in the order of α\alpha. This implies that SS and in turn GG is not positive definite. As such, an analytical proof of the theorem involving bounds on the corresponding bilinear form will fail.

This makes the result all the more interesting, but more involved. Central to the proof is a similarity transformation proving positive definiteness with respect to a different inner product, a novel technique we have not found in the multi-agent learning literature.

We cannot study the eigenvalues of GG directly, since there is no necessary relationship between eigenpairs of HH and HoH_{o}. In the aim of using analytical tools, the trick is to find a positive definite matrix which is similar to GG, thus sharing the same positive eigenvalues. First define

where HdH_{d} is the sub-matrix of diagonal blocks,and rewrite

Note that HdH_{d} is block diagonal with symmetric blocks iiLi0\nabla_{ii}L^{i}\succeq 0, so (I+αHd)(I+\alpha H_{d}) is symmetric and positive definite for all α0\alpha\geq 0. In particular its principal square root

for all non-zero uu. In particular MM provides a similarity transformation which eliminates HdH_{d} from G1G_{1} while simultaneously delivering positive semi-definiteness. We can now prove that

First note that a Taylor expansion of MM in α\alpha yields

There are two cases to distinguish. If u^{\mathchoice{\raisebox{0.0pt}{\displaystyle\intercal}}{\raisebox{0.0pt}{\textstyle\intercal}}{\raisebox{0.0pt}{\scriptstyle\intercal}}{\raisebox{0.0pt}{\scriptscriptstyle\intercal}}}Hu>0 then

for α\alpha sufficiently small. Otherwise, u^{\mathchoice{\raisebox{0.0pt}{\displaystyle\intercal}}{\raisebox{0.0pt}{\textstyle\intercal}}{\raisebox{0.0pt}{\scriptstyle\intercal}}{\raisebox{0.0pt}{\scriptscriptstyle\intercal}}}Hu=0 and consider decomposing HH into symmetric and antisymmetric parts S=(H+H^{\mathchoice{\raisebox{0.0pt}{\displaystyle\intercal}}{\raisebox{0.0pt}{\textstyle\intercal}}{\raisebox{0.0pt}{\scriptstyle\intercal}}{\raisebox{0.0pt}{\scriptscriptstyle\intercal}}})/2 and A=(H-H^{\mathchoice{\raisebox{0.0pt}{\displaystyle\intercal}}{\raisebox{0.0pt}{\textstyle\intercal}}{\raisebox{0.0pt}{\scriptstyle\intercal}}{\raisebox{0.0pt}{\scriptscriptstyle\intercal}}})/2, so that H=S+AH=S+A. By antisymmetry of AA we have u^{\mathchoice{\raisebox{0.0pt}{\displaystyle\intercal}}{\raisebox{0.0pt}{\textstyle\intercal}}{\raisebox{0.0pt}{\scriptstyle\intercal}}{\raisebox{0.0pt}{\scriptscriptstyle\intercal}}}Au=0 and hence u^{\mathchoice{\raisebox{0.0pt}{\displaystyle\intercal}}{\raisebox{0.0pt}{\textstyle\intercal}}{\raisebox{0.0pt}{\scriptstyle\intercal}}{\raisebox{0.0pt}{\scriptscriptstyle\intercal}}}Hu=0=u^{\mathchoice{\raisebox{0.0pt}{\displaystyle\intercal}}{\raisebox{0.0pt}{\textstyle\intercal}}{\raisebox{0.0pt}{\scriptstyle\intercal}}{\raisebox{0.0pt}{\scriptscriptstyle\intercal}}}Su. Now H0H\succeq 0 implies S0S\succeq 0, so by Cholesky decomposition of SS there exists a matrix TT such that S=T^{\mathchoice{\raisebox{0.0pt}{\displaystyle\intercal}}{\raisebox{0.0pt}{\textstyle\intercal}}{\raisebox{0.0pt}{\scriptstyle\intercal}}{\raisebox{0.0pt}{\scriptscriptstyle\intercal}}}T. In particular 0=u^{\mathchoice{\raisebox{0.0pt}{\displaystyle\intercal}}{\raisebox{0.0pt}{\textstyle\intercal}}{\raisebox{0.0pt}{\scriptstyle\intercal}}{\raisebox{0.0pt}{\scriptscriptstyle\intercal}}}Su=\lVert Tu\rVert^{2} implies Tu=0Tu=0, and in turn Su=0Su=0. Since HH is invertible and u0u\neq 0, we have 0Hu=Au0\neq Hu=Au and so Au2>0\lVert Au\rVert^{2}>0. It follows in particular that

Using positive semi-definiteness of M1G1MM^{-1}G_{1}M,

for α>0\alpha>0 small enough. We conclude that for any uSmu\in S^{m} there is ϵu>0\epsilon_{u}>0 such that

for all uSmu\in S^{m} and 0<α<ϵ0<\alpha<\epsilon. It follows that M1GMM^{-1}GM is positive definite for all 0<α<ϵ0<\alpha<\epsilon and thus GG is positive stable for α\alpha in the same range, by similarity. ∎

LookAhead converges locally to stable fixed points for α>0\alpha>0 sufficiently small.

For any SFP θˉ\bar{\theta} we have ξ(θˉ)=0\xi(\bar{\theta})=0 and H(θˉ)0H(\bar{\theta})\succeq 0 invertible by definition, with diagonal blocks iiLi\nabla_{ii}L^{i} symmetric by twice continuous differentiability. We are done by the result above and D.3. ∎

We now prove that local convergence results descend to SOS. The following lemma establishes the crucial claim that our criterion for pp is C1C^{1} in neighbourhoods of fixed points. This is necessary to invoke analytical arguments including Ostrowski’s Theorem, and would be untrue globally.

If θˉ\bar{\theta} is a fixed point and α\alpha is sufficiently small then p=ξ2p=\lVert\xi\rVert^{2} in a neighbourhood of θˉ\bar{\theta}.

First note that ξ(θˉ)=0\xi(\bar{\theta})=0, so there is a (bounded) neighbourhood VV of θˉ\bar{\theta} such that ξ(θ)<b\lVert\xi(\theta)\rVert<b for all θV\theta\in V, for any choice of hyperparameter b(0,1)b\in(0,1). In particular p2(θ)=ξ(θ)2p_{2}(\theta)=\lVert\xi(\theta)\rVert^{2} by definition of the second criterion. We want to show that p(θ)=p2(θ)p(\theta)=p_{2}(\theta) near θˉ\bar{\theta}, or equivalently p1(θ)p2(θ)p_{1}(\theta)\geq p_{2}(\theta). Since p2(θ)=ξ(θ)2<b2<1p_{2}(\theta)=\lVert\xi(\theta)\rVert^{2}<b^{2}<1 in VV, it remains only to show that

in some neighbourhood UVU\subseteq V of θˉ\bar{\theta}, for any choice of hyperparameter a(0,1)a\in(0,1). Now by boundedness of VV and continuity of χ\textstyle\chi, there exists c>0c>0 such that \lVert-\alpha{\mathchoice{\raisebox{0.8pt}{\displaystyle\chi}}{\raisebox{0.8pt}{\textstyle\chi}}{\raisebox{0.8pt}{\scriptstyle\chi}}{\raisebox{0.8pt}{\scriptscriptstyle\chi}}}(\theta)\rVert=\alpha^{2}\lVert{\mathchoice{\raisebox{0.8pt}{\displaystyle\chi}}{\raisebox{0.8pt}{\textstyle\chi}}{\raisebox{0.8pt}{\scriptstyle\chi}}{\raisebox{0.8pt}{\scriptscriptstyle\chi}}}(\theta)\rVert<c for all θV\theta\in V and bounded α\alpha. It follows by Cauchy-Schwartz that

in VV, for some d>0d>0 and α\alpha sufficiently small, by boundedness of VV and continuity of HoH_{o}. Finally there is a sub-neighbourhood UVU\subset V such that ξ(θ)<ad/c\lVert\xi(\theta)\rVert<ad/c for all θU\theta\in U, so that adξ/c>ξ(θ)2ad\lVert\xi\rVert/c>\lVert\xi(\theta)\rVert^{2} and hence

in UU. Hence p(θ)=min{p1(θ),p2(θ)}=p2(θ)=ξ(θ)2p(\theta)=\min\{p_{1}(\theta),p_{2}(\theta)\}=p_{2}(\theta)=\lVert\xi(\theta)\rVert^{2} for all θU\theta\in U, as required. ∎

SOS converges locally to stable fixed points for α>0\alpha>0 sufficiently small.

Though the criterion for pp is dual, we will only use the second part. More precisely,

if ξ<b\lVert\xi\rVert<b. The aim is to show that if θˉ\bar{\theta} is an SFP then ξp(θˉ)\nabla\xi_{p}(\bar{\theta}) is positive stable for small α\alpha, using Ostrowski to conclude as usual. The first problem we face is that ξp\nabla\xi_{p} does not exist everywhere, since p(θ)p(\theta) is not a continuous function. However we know by D.7 that p=ξ2p=\lVert\xi\rVert^{2} in a neighbourhood UU of θˉ\bar{\theta}, so ξp\xi_{p} is continuously differentiable in UU. Moreover, p(θˉ)=ξ(θˉ)2=0p(\bar{\theta})=\lVert\xi(\bar{\theta})\rVert^{2}=0 with gradient

by definition of fixed points. It follows that

which is identical to LookAhead. This is positive stable for all 0<α<ϵ0<\alpha<\epsilon, and θˉ\bar{\theta} is a fixed point of the iteration since

We conclude by D.3 that SOS converges locally to SFP for any a,b(0,1)a,b\in(0,1) and α\alpha sufficiently small. ∎

For any uYu\in Y there is ϵu>0\epsilon_{u}>0 such that

We would like to extend this uniformly in uu, namely prove that

for some ϵ>0\epsilon>0. Now g1(0,)g^{-1}(0,\infty) is open by continuity of gg, so each (0,ϵu)×{u}(0,\epsilon_{u})\times\{u\} has a neighbourhood XuX_{u} contained in g1(0,)g^{-1}(0,\infty). Open sets in a product topology are unions of open products, so

In particular (0,ϵu)xUx(0,\epsilon_{u})\subseteq\bigcup_{x}U_{x} and at least one VxV_{x} contains uu, so we can take the open neighbourhood to be

for some neighbourhood VuV_{u} of uu. In particular YuYVuY\subseteq\bigcup_{u\in Y}V_{u}, and by compactness there is a finite cover Yi=1kVuiY\subseteq\bigcup_{i=1}^{k}V_{u_{i}}\,. Letting ϵ=min{ϵi}i=1k>0\epsilon=\min\{\epsilon_{i}\}_{i=1}^{k}>0, we obtain the required inclusion

Appendix E Non-convergence Proofs

Let aka_{k} and bkb_{k} be sequences of real numbers, and define ck=min{ak,bk}c_{k}=\min\{a_{k},b_{k}\}. If

for all kMk\geq M, kNk^{\prime}\geq N. Expanding the absolute value, this implies

for all kmax{M,N}k\geq\max\{M,N\}. Now ckakc_{k}\leq a_{k} for all kk, hence

If SOS converges to θˉ\bar{\theta} and α>0\alpha>0 is small then θˉ\bar{\theta} is a fixed point of the game.

If θkθˉ\theta_{k}\to\bar{\theta} as kk\to\infty then taking limits on both sides of the iteration yields

and so limkξp(θk)=0\lim\limits_{k}\xi_{p}(\theta_{k})=0, omitting kk\to\infty for convenience. It follows by continuity that

noting that p(θ)p(\theta) is not a globally continuous function. Assume for contradiction that ξ0(θˉ)0\xi_{0}(\bar{\theta})\neq 0. There are two cases to distinguish for clarity.

First assume \langle-\alpha{\mathchoice{\raisebox{0.8pt}{\displaystyle\chi}}{\raisebox{0.8pt}{\textstyle\chi}}{\raisebox{0.8pt}{\scriptstyle\chi}}{\raisebox{0.8pt}{\scriptscriptstyle\chi}}},\xi_{0}\rangle(\bar{\theta})\geq 0. Note that limkp(θk)0\lim_{k}p(\theta_{k})\geq 0 since p(θ)0p(\theta)\geq 0 for all θ\theta, and so

This is a contradiction since limkξp(θk)=0\lim_{k}\xi_{p}(\theta_{k})=0.

by continuity and E.1. Finally we conclude

In both cases a contradiction is obtained, hence ξ0(θˉ)=0=(IαHo)ξ(θˉ)\xi_{0}(\bar{\theta})=0=(I-\alpha H_{o})\xi(\bar{\theta}). Now note that (IαHo)(θˉ)(I-\alpha H_{o})(\bar{\theta}) is singular iff Ho(θˉ)H_{o}(\bar{\theta}) has an eigenvalue 1/α1/\alpha, which is impossible for α\alpha sufficiently small. Hence (IαHo)ξ(θˉ)=0(I-\alpha H_{o})\xi(\bar{\theta})=0 implies ξ(θˉ)=0\xi(\bar{\theta})=0, as required. ∎

Now assume that θ\theta is initialised randomly (or with arbitrarily small noise around a point), as is standard in ML. We prove that SOS locally avoids strict saddles using the Stable Manifold Theorem, inspired from Lee et al. (2017).

SOS locally avoids strict saddles almost surely, for α>0\alpha>0 sufficiently small.

Let θˉ\bar{\theta} a strict saddle and recall that SOS is given by

Recall by D.7 that p(θ)=ξ(θ)2p(\theta)=\lVert\xi(\theta)\rVert^{2} for all θ\theta in a neighbourhood UU of θˉ\bar{\theta}. Restricting FF to UU, all terms involved are continuously differentiable and

by assumption that ξ(θˉ)=0\xi(\bar{\theta})=0. Since all terms except II are of order at least α\alpha, F(θˉ)\nabla F(\bar{\theta}) is invertible for all α\alpha sufficiently small. By the inverse function theorem, there exists a neighbourhood VV of θˉ\bar{\theta} such that FF is has a continuously differentiable inverse on VV. Hence FF restricted to UVU\cap V is a C1C^{1} diffeomorphism with fixed point θˉ\bar{\theta}.

By definition of strict saddles, H(θˉ)H(\bar{\theta}) has a negative eigenvalue. It follows by continuity that (IαHo)H(θˉ)(I-\alpha H_{o})H(\bar{\theta}) also has a negative eigenvalue a+iba+ib with a<0a<0 for α\alpha sufficiently small. Finally,

has an eigenvalue λ=1αaiαb\lambda=1-\alpha a-i\alpha b with

It follows that EsE^{s} has codimension at least one, implying in turn that the local stable set WW has measure zero. We can now prove that

Now F1F^{-1} is C1C^{1}, hence locally Lipschitz and thus preserves sets of measure zero, so that Fn(W)F^{-n}(W) has measure zero for each nn. Countable unions of measure zero sets are still measure zero, so we conclude that ZZ also has measure zero. In other words, SOS converges to θˉ\bar{\theta} with zero probability upon random initialisation of θ\theta in UU. ∎

Appendix F Further Gaussian Mixture Experiments

In the Gaussian mixture experiment, data is sampled from a highly multimodal distribution designed to probe the tendency to collapse onto a subset of modes during training, given in Figure 5.

The generator distribution and KL divergence are given at {\{2k, 4k, 6k, 8k}\} iterations for LA, LOLA and SGA in Figure 6. LOLA and SGA successfully spread mass across all Gaussians. LookAhead displays mode collapse and hopping in early stages, but begins to incorporate further mixtures near 8k8k iterations. We ran further iterations and discovered that LookAhead eventually spreads mass across all mixtures, though very slowly. Comparing with results for NL/CO/SOS in the main text, we see that CO/SOS/LOLA/SGA are equally successful in qualitative terms.

Note that SOS/CO are slightly superior with respect to KL divergence after 6-8k iterations, though LOLA is initially faster. This may be due only to random sampling. We also noticed experimentally that LOLA often moves away from the correct distribution after 8-10k iterations (not shown), while SOS stays stable in the long run. This may occur thanks to the two-part criterion encouraging convergence, while LOLA continually attempts to exploit opponent learning.

Finally we plot ξ\lVert\xi\rVert at all iterations up to 12k12k for SOS, LA and NL in Figure 7 (other algorithms are omitted for visibility). This gives further evidence of SOS converging quite rapidly to the correct distribution, while NL perpetually suffers from mode hopping and LA lags behind significantly.