Negative Momentum for Improved Game Dynamics

Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Remi Lepriol, Gabriel Huang, Simon Lacoste-Julien, Ioannis Mitliagkas

INTRODUCTION

Recent advances in machine learning are largely driven by the success of gradient-based optimization methods for the training process. A common learning paradigm is empirical risk minimization, where a (potentially non-convex) objective, that depends on the data, is minimized. However, some recently introduced approaches require the joint minimization of several objectives. For example, actor-critic methods can be written as a bi-level optimization problem (Pfau and Vinyals, 2016) and generative adversarial networks (GANs) (Goodfellow et al., 2014) use a two-player game formulation.

Games generalize the standard optimization framework by introducing different objective functions for different optimizing agents, known as players. We are commonly interested in finding a local Nash equilibrium: a set of parameters from which no player can (locally and unilaterally) improve its objective function. Games with differentiable objectives often proceed by simultaneous or alternating gradient steps on the players’ objectives. Even though the dynamics of gradient based methods is well understood for minimization problems, new issues appear in multi-player games. For instance, some stable stationary points of the dynamics may not be (local) Nash equilibria (Adolphs et al., 2018; Daskalakis and Panageas, 2018).

Motivated by a decreasing trend of momentum values in GAN literature (see Fig. 1), we study the effect of two particular algorithmic choices: (i) the choice between simultaneous and alternating updates, and (ii) the choice of step-size and momentum value. The idea behind our approach is that a momentum term combined with the alternating gradient method can be used to manipulate the natural oscillatory behavior of adversarial games. We summarize our main contributions:

We prove in §5 that the alternating gradient method with negative momentum is the only setting within our study parameters (Fig. 2) that converges on a bilinear smooth game. Using a zero or positive momentum value, or doing simultaneous updates in such games fails to converge.

We show in §4 that, for general dynamics, when the eigenvalues of the Jacobian have a large imaginary part, negative momentum can improve the local convergence properties of the gradient method.

We confirm the benefits of negative momentum for training GANs with the notoriously ill-behaved saturating loss on both toy settings, and real datasets.

§2 describes the fundamentals of the analytic setup that we use. §3 provides a formulation for the optimal step-size, and discusses the constraints and intuition behind it. §4 presents our theoretical results and guarantees on negative momentum. §5 studies the properties of alternating and simultaneous methods with negative momentum on a bilinear smooth game. §6 contains experimental results on toy and real datasets. Finally, in §7, we review some of the existing work on smooth game optimization as well as GAN stability and convergence.

BACKGROUND

Game theory formulation of GANs

Generative adversarial networks consist of a discriminator DφD_{{\bm{\varphi}}} and a generator GθG_{{\bm{\theta}}}. In this game, the discriminator’s objective is to tell real from generated examples. The generator’s goal is to produce examples that are sufficiently close to real examples to confuse the discriminator.

From a game theory point of view, GAN training is a differentiable two-player game: the discriminator DφD_{{\bm{\varphi}}} aims at minimizing its cost function LD{\mathcal{L}}_{D} and the generator GθG_{{\bm{\theta}}} aims at minimizing its own cost function LG{\mathcal{L}}_{G}. Using the same formulation as the one in Mescheder et al. (2017) and Gidel et al. (2018), the GAN objective has the following form,

Given such a game setup, GAN training consists of finding a local Nash Equilibrium, which is a state (φ,θ)({\bm{\varphi}}^{*},{\bm{\theta}}^{*}) in which neither the discriminator nor the generator can improve their respective cost by a small change in their parameters. In order to analyze the dynamics of gradient-based methods near a Nash Equilibrium, we look at the gradient vector field,

and its associated Jacobian v(φ,θ)\nabla\bm{v}({\bm{\varphi}},{\bm{\theta}}),

Games in which LG=LD{\mathcal{L}}_{G}=-{\mathcal{L}}_{D} are called zero-sum games and (1) can be reformulated as a min-max problem. This is the case for the original min-max GAN formulation, but not the case for the non-saturating loss (Goodfellow et al., 2014) which is commonly used in practice.

For a zero-sum game, we note LG=LD=L{\mathcal{L}}_{G}=-{\mathcal{L}}_{D}={\mathcal{L}}. When the matrices φ2L(φ,θ)\nabla^{2}_{{\bm{\varphi}}}{\mathcal{L}}({\bm{\varphi}},{\bm{\theta}}) and θ2L(φ,θ)\nabla^{2}_{{\bm{\theta}}}{\mathcal{L}}({\bm{\varphi}},{\bm{\theta}}) are zero, the Jacobian is anti-symmetric and has pure imaginary eigenvalues. We call games with pure imaginary eigenvalues purely adversarial games. This is the case in a simple bilinear game L(φ,θ):=φAθ{\mathcal{L}}({\bm{\varphi}},{\bm{\theta}})\mathrel{\mathop{\ordinarycolon}}={\bm{\varphi}}^{\top}{\bm{A}}{\bm{\theta}}. This game can be formulated as a GAN where the true distribution is a Dirac on 0, the generator is a Dirac on θ\theta and the discriminator is linear. This setup was extensively studied in 2D by Gidel et al. (2018).

Conversely, when φθL(φ,θ)\nabla_{{\bm{\varphi}}}\nabla_{{\bm{\theta}}}{\mathcal{L}}({\bm{\varphi}},{\bm{\theta}}) is zero and the matrices φ2L(φ,θ)\nabla^{2}_{{\bm{\varphi}}}{\mathcal{L}}({\bm{\varphi}},{\bm{\theta}}) and θ2L(φ,θ)-\nabla^{2}_{{\bm{\theta}}}{\mathcal{L}}({\bm{\varphi}},{\bm{\theta}}) are symmetric and definite positive, the Jacobian is symmetric and has real positive eigenvalues. We call games with real positive eigenvalues purely cooperative games. This is the case, for example, when the objective function L{\mathcal{L}} is separable such as L(φ,θ)=f(φ)g(θ){\mathcal{L}}({\bm{\varphi}},{\bm{\theta}})=f({\bm{\varphi}})-g({\bm{\theta}}) where ff and gg are two convex functions. Thus, the optimization can be reformulated as two separated minimization of ff and gg with respect to their respective parameters.

These notions of adversarial and cooperative games can be related to the notions of potential games (Monderer and Shapley, 1996) and Hamiltonian games recently introduced by Balduzzi et al. (2018): a game is a potential game (resp. Hamiltonian game) if its Jacobian is symmetric (resp. asymmetric). Our definition of cooperative game is a bit more general than the definition of potential game since some non-symmetric matrices may have positive eigenvalues. Similarly, the notion of adversarial game generalizes the Hamiltonian games since some non-antisymmetric matrices may have pure imaginary eigenvalues, for instance,

Simultaneous Gradient Method.

Let us consider the dynamics of the simultaneous gradient method. It is defined as the repeated application of the operator,

Then, if the gradient method converges, and its limit point ω=(φ,θ){\bm{\omega}}^{*}=({\bm{\varphi}}^{*},{\bm{\theta}}^{*}) is a fixed point of FηF_{\eta} such that v(ω)\nabla v({\bm{\omega}}^{*}) is positive-definite, then ω{\bm{\omega}}^{*} is a local Nash equilibrium. Interestingly, some of the stable stationary points of gradient dynamics may not be Nash equilibrium (Adolphs et al., 2018). In this work, we focus on the local convergence properties near the stationary points of gradient . To the best of our knowledge, there is no first order method alleviating this issue. In the following, ω{\bm{\omega}}^{*} is a stationary point of the gradient dynamics (i.e. a point such that v(ω)=0\bm{v}({\bm{\omega}}^{*})=0).

TUNING THE STEP-SIZE

Under certain conditions on a fixed point operator, linear convergence is guaranteed in a neighborhood around a fixed point.

If the spectral radius ρmax:=ρ(Fη(ω))<1\rho_{\max}\mathrel{\mathop{\ordinarycolon}}=\rho(\nabla F_{\eta}({\bm{\omega}}^{*}))<1, then, for ω0{\bm{\omega}}_{0} in a neighborhood of ω{\bm{\omega}}^{*}, the distance of ωt{\bm{\omega}}_{t} to the stationary point ω{\bm{\omega}}^{*} converges at a linear rate of \mathcal{O}\big{(}(\rho_{\max}+\epsilon)^{t}\big{)}\,,\,\forall\epsilon>0.

If the eigenvalues of v(ω)\nabla\bm{v}({\bm{\omega}}^{*}) all have a positive real-part, then for small enough η\eta, the eigenvalues of Fη(ω)\nabla F_{\eta}({\bm{\omega}}^{*}) are inside a convergence circle of radius ρmax<1\rho_{\max}<1, as illustrated in Fig. 3. Thm. 1 guarantees the existence of an optimal step-size ηbest\eta_{best} which yields a non-trivial convergence rate ρmax<1\rho_{\max}<1. Thm. 2 gives analytic bounds on the optimal step-size ηbest\eta_{best}, and lower-bounds the best convergence rate ρmax(ηbest)\rho_{\max}(\eta_{best}) we can expect.

If the eigenvalues of v(ω)\nabla\bm{v}({\bm{\omega}}^{*}) all have a positive real-part, then, the best step-size ηbest\eta_{best}, which minimizes the spectral radius ρmax(η)\rho_{\max}(\eta) of Fη(φ,θ)\nabla F_{\eta}({\bm{\varphi}}^{*},{\bm{\theta}}^{*}), is the solution of a (convex) quadratic by parts problem, and satisfies,

where (λk=rkeiψk)1km=Sp(v(φ,θ))(\lambda_{k}=r_{k}e^{i\psi_{k}})_{1\leq k\leq m}=\operatorname*{Sp}(\nabla\bm{v}({\bm{\varphi}}^{*},{\bm{\theta}}^{*})) are sorted such that 0<(1/λ1)(1/λm)0<\Re(1/\lambda_{1})\leq\cdots\leq\Re(1/\lambda_{m}). Particularly, when ηbest=(1/λ1)\eta_{best}=\Re(1/\lambda_{1}) we are in the case of the top plot of Fig.3 and ρmax(ηbest)2=sin(ψ1)2  .\rho_{\max}(\eta_{best})^{2}=\sin(\psi_{1})^{2}\;.

When v\nabla\bm{v} is positive-definite, the best ηbest\eta_{best} is attained either because of one or several limiting eigenvalues. We illustrate and interpret these two cases in Fig. 3. In multivariate convex optimization, the optimal step-size depends on the extreme eigenvalues and their ratio, the condition number. Unfortunately, the notion of the condition number does not trivially extend to games, but Thm. 2 seems to indicate that the real part of the inverse of the eigenvalues play an important role in the dynamics of smooth games. We think that a notion of condition number might be meaningful for such games and we propose an illustrative example to discuss this point in §B. Note that when the eigenvalues are pure positive real numbers belonging to [μ,L][\mu,L], (8) provides the standard bound ρmax1μ/L\rho_{\max}\leq 1-\mu/L obtained with a step-size η=1/L\eta=1/L (see §D.2 for details).

Re<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><mimathvariant="bold">I</mi><mimathvariant="bold">m</mi></mrow><annotationencoding="application/xtex">Im</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.6861em;"></span><spanclass="mord"><spanclass="mordmathbf">Im</span></span></span></span></span></span>1+0i<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><mn>1</mn><mo></mo><msub><mi>λ</mi><mn>1</mn></msub></mrow><annotationencoding="application/xtex">1λ1</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.7278em;verticalalign:0.0833em;"></span><spanclass="mord">1</span><spanclass="mspace"style="marginright:0.2222em;"></span><spanclass="mbin"></span><spanclass="mspace"style="marginright:0.2222em;"></span></span><spanclass="base"><spanclass="strut"style="height:0.8444em;verticalalign:0.15em;"></span><spanclass="mord"><spanclass="mordmathnormal">λ</span><spanclass="msupsub"><spanclass="vlisttvlistt2"><spanclass="vlistr"><spanclass="vlist"style="height:0.3011em;"><spanstyle="top:2.55em;marginleft:0em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmtight">1</span></span></span></span></span><spanclass="vlists"></span></span><spanclass="vlistr"><spanclass="vlist"style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>1ηλ1<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><mn>1</mn><mo></mo><msub><mi>λ</mi><mn>2</mn></msub></mrow><annotationencoding="application/xtex">1λ2</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.7278em;verticalalign:0.0833em;"></span><spanclass="mord">1</span><spanclass="mspace"style="marginright:0.2222em;"></span><spanclass="mbin"></span><spanclass="mspace"style="marginright:0.2222em;"></span></span><spanclass="base"><spanclass="strut"style="height:0.8444em;verticalalign:0.15em;"></span><spanclass="mord"><spanclass="mordmathnormal">λ</span><spanclass="msupsub"><spanclass="vlisttvlistt2"><spanclass="vlistr"><spanclass="vlist"style="height:0.3011em;"><spanstyle="top:2.55em;marginleft:0em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmtight">2</span></span></span></span></span><spanclass="vlists"></span></span><spanclass="vlistr"><spanclass="vlist"style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>1ηλ2<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><mn>1</mn><mo></mo><msub><mi>λ</mi><mn>3</mn></msub></mrow><annotationencoding="application/xtex">1λ3</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.7278em;verticalalign:0.0833em;"></span><spanclass="mord">1</span><spanclass="mspace"style="marginright:0.2222em;"></span><spanclass="mbin"></span><spanclass="mspace"style="marginright:0.2222em;"></span></span><spanclass="base"><spanclass="strut"style="height:0.8444em;verticalalign:0.15em;"></span><spanclass="mord"><spanclass="mordmathnormal">λ</span><spanclass="msupsub"><spanclass="vlisttvlistt2"><spanclass="vlistr"><spanclass="vlist"style="height:0.3011em;"><spanstyle="top:2.55em;marginleft:0em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmtight">3</span></span></span></span></span><spanclass="vlists"></span></span><spanclass="vlistr"><spanclass="vlist"style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>1ηλ3<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><msub><mi>ψ</mi><mn>1</mn></msub></mrow><annotationencoding="application/xtex">ψ1</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.8889em;verticalalign:0.1944em;"></span><spanclass="mord"><spanclass="mordmathnormal"style="marginright:0.0359em;">ψ</span><spanclass="msupsub"><spanclass="vlisttvlistt2"><spanclass="vlistr"><spanclass="vlist"style="height:0.3011em;"><spanstyle="top:2.55em;marginleft:0.0359em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmtight">1</span></span></span></span></span><spanclass="vlists"></span></span><spanclass="vlistr"><spanclass="vlist"style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>ψ2\mathbf{Re}<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi mathvariant="bold">I</mi><mi mathvariant="bold">m</mi></mrow><annotation encoding="application/x-tex">\mathbf{Im}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6861em;"></span><span class="mord"><span class="mord mathbf">Im</span></span></span></span></span></span>1+0i<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mn>1</mn><mo>−</mo><msub><mi>λ</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">1-\lambda_{1}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em;"></span><span class="mord">1</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal">λ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>1-\eta\lambda_{1}<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mn>1</mn><mo>−</mo><msub><mi>λ</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">1-\lambda_{2}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em;"></span><span class="mord">1</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal">λ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>1-\eta\lambda_{2}<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mn>1</mn><mo>−</mo><msub><mi>λ</mi><mn>3</mn></msub></mrow><annotation encoding="application/x-tex">1-\lambda_{3}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em;"></span><span class="mord">1</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal">λ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">3</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>1-\eta\lambda_{3}<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msub><mi>ψ</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">\psi_{1}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8889em;vertical-align:-0.1944em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.0359em;">ψ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:-0.0359em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>\psi_{2}ψ3\psi_{3}\tkzMarkRightAngle Re<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><mimathvariant="bold">I</mi><mimathvariant="bold">m</mi></mrow><annotationencoding="application/xtex">Im</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.6861em;"></span><spanclass="mord"><spanclass="mordmathbf">Im</span></span></span></span></span></span>1+0i<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><mn>1</mn><mo></mo><msub><mi>λ</mi><mn>1</mn></msub></mrow><annotationencoding="application/xtex">1λ1</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.7278em;verticalalign:0.0833em;"></span><spanclass="mord">1</span><spanclass="mspace"style="marginright:0.2222em;"></span><spanclass="mbin"></span><spanclass="mspace"style="marginright:0.2222em;"></span></span><spanclass="base"><spanclass="strut"style="height:0.8444em;verticalalign:0.15em;"></span><spanclass="mord"><spanclass="mordmathnormal">λ</span><spanclass="msupsub"><spanclass="vlisttvlistt2"><spanclass="vlistr"><spanclass="vlist"style="height:0.3011em;"><spanstyle="top:2.55em;marginleft:0em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmtight">1</span></span></span></span></span><spanclass="vlists"></span></span><spanclass="vlistr"><spanclass="vlist"style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>1ηλ1<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><mn>1</mn><mo></mo><msub><mi>λ</mi><mn>2</mn></msub></mrow><annotationencoding="application/xtex">1λ2</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.7278em;verticalalign:0.0833em;"></span><spanclass="mord">1</span><spanclass="mspace"style="marginright:0.2222em;"></span><spanclass="mbin"></span><spanclass="mspace"style="marginright:0.2222em;"></span></span><spanclass="base"><spanclass="strut"style="height:0.8444em;verticalalign:0.15em;"></span><spanclass="mord"><spanclass="mordmathnormal">λ</span><spanclass="msupsub"><spanclass="vlisttvlistt2"><spanclass="vlistr"><spanclass="vlist"style="height:0.3011em;"><spanstyle="top:2.55em;marginleft:0em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmtight">2</span></span></span></span></span><spanclass="vlists"></span></span><spanclass="vlistr"><spanclass="vlist"style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>1ηλ2<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><mn>1</mn><mo></mo><msub><mi>λ</mi><mn>3</mn></msub></mrow><annotationencoding="application/xtex">1λ3</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.7278em;verticalalign:0.0833em;"></span><spanclass="mord">1</span><spanclass="mspace"style="marginright:0.2222em;"></span><spanclass="mbin"></span><spanclass="mspace"style="marginright:0.2222em;"></span></span><spanclass="base"><spanclass="strut"style="height:0.8444em;verticalalign:0.15em;"></span><spanclass="mord"><spanclass="mordmathnormal">λ</span><spanclass="msupsub"><spanclass="vlisttvlistt2"><spanclass="vlistr"><spanclass="vlist"style="height:0.3011em;"><spanstyle="top:2.55em;marginleft:0em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmtight">3</span></span></span></span></span><spanclass="vlists"></span></span><spanclass="vlistr"><spanclass="vlist"style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>1ηλ3\mathbf{Re}<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi mathvariant="bold">I</mi><mi mathvariant="bold">m</mi></mrow><annotation encoding="application/x-tex">\mathbf{Im}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6861em;"></span><span class="mord"><span class="mord mathbf">Im</span></span></span></span></span></span>1+0i<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mn>1</mn><mo>−</mo><msub><mi>λ</mi><mn>1</mn></msub></mrow><annotation encoding="application/x-tex">1-\lambda_{1}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em;"></span><span class="mord">1</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal">λ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">1</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>1-\eta\lambda_{1}<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mn>1</mn><mo>−</mo><msub><mi>λ</mi><mn>2</mn></msub></mrow><annotation encoding="application/x-tex">1-\lambda_{2}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em;"></span><span class="mord">1</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal">λ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">2</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>1-\eta\lambda_{2}<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mn>1</mn><mo>−</mo><msub><mi>λ</mi><mn>3</mn></msub></mrow><annotation encoding="application/x-tex">1-\lambda_{3}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.7278em;vertical-align:-0.0833em;"></span><span class="mord">1</span><span class="mspace" style="margin-right:0.2222em;"></span><span class="mbin">−</span><span class="mspace" style="margin-right:0.2222em;"></span></span><span class="base"><span class="strut" style="height:0.8444em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal">λ</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3011em;"><span style="top:-2.55em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">3</span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span></span></span></span></span>1-\eta\lambda_{3} Figure 3: Eigenvalues λi\lambda_{i} of the Jacobian v(ϕ,θ)\nabla\bm{v}(\bm{\phi^{*}},\bm{\theta^{*}}) and their trajectories 1ηλi1-\eta\lambda_{i} for growing step-sizes. The unit circle is drawn in black, and the red dashed circle has radius equal to the largest eigenvalue μmax\mu_{\max}, which is directly related to the convergence rate. Therefore, smaller red circles mean better convergence rates. Top: The red circle is limited by the tangent trajectory line 1ηλ11-\eta\lambda_{1}, which means the best convergence rate is limited only by the eigenvalue which will pass furthest from the origin as η\eta grows, i.e., λi=argmin(1/λi)\lambda_{i}=\arg\min\Re(1/\lambda_{i}). Bottom: The red circle is cut (not tangent) by the trajectories at points 1ηλ11-\eta\lambda_{1} and 1ηλ31-\eta\lambda_{3}. The η\eta is optimal because any increase in η\eta will push the eigenvalue λ1\lambda_{1} out of the red circle, while any decrease in step-size will retract the eigenvalue λ3\lambda_{3} out of the red circle, which will lower the convergence rate in any case. Figure inspired by Mescheder et al. (2017). Note that, in (9), we have δ>0\delta>0 because (λk)(\lambda_{k}) are sorted such that, (1/λk)(1/λ1),1km\Re(1/\lambda_{k})\geq\Re(1/\lambda_{1})\,,\,\forall 1\leq k\leq m. In (8), we can see that if the Jacobian of v\bm{v} has an almost purely imaginary eigenvalue rjeψjr_{j}e^{\psi_{j}} then sin(ψj)\sin(\psi_{j}) is close to 11 and thus, the convergence rate of the gradient method may be arbitrarily close to 1. Zhang and Mitliagkas (2017) provide an analysis of the momentum method for quadratics, showing that momentum can actually help to better condition the model. One interesting point from their work is that the best conditioning is achieved when the added momentum makes the Jacobian eigenvalues turn from positive reals into complex conjugate pairs. Our goal is to use momentum to wrangle game dynamics into convergence manipulating the eigenvalues of the Jacobian.

NEGATIVE MOMENTUM

Note that for β=0\beta=0, we recover the gradient method.

In some situations, if β<0\beta<0 is adjusted properly, negative momentum can improve the convergence rate to a local stationary point by pushing the eigenvalues of its Jacobian towards the origin. In the following theorem, we provide an explicit equation for the eigenvalues of the Jacobian of Fη,βF_{\eta,\beta}.

The eigenvalues of Fη,β(ω)\nabla F_{\eta,\beta}({\bm{\omega}}^{*}) are

where Δ:=14β(1ηλ+β)2,  λSp(v(ω))\Delta\mathrel{\mathop{\ordinarycolon}}=1-\frac{4\beta}{(1-\eta\lambda+\beta)^{2}}\,,\;\lambda\in\operatorname*{Sp}(\nabla\bm{v}({\bm{\omega}}^{*})) and Δ12\Delta^{\frac{1}{2}} is the complex square root of Δ\Delta with positive real part If Δ\Delta is a negative real number we set Δ12:=iΔ\Delta^{\frac{1}{2}}\mathrel{\mathop{\ordinarycolon}}=i\sqrt{-\Delta}. Moreover we have the following Taylor approximation,

When β\beta is small enough, Δ\Delta is a complex number close to 11. Consequently, μ+\mu_{+} is close to the original eigenvalue for gradient dynamics 1ηλ1-\eta\lambda, and μ\mu_{-}, the eigenvalue introduced by the state augmentation, is close to 0. We formalize this intuition by providing the first order approximation of both eigenvalues.

In Fig. 4, we illustrate the effects of negative momentum on a game described in (4). Negative momentum shifts the original eigenvalues (trajectories in light red) by pushing them to the left towards the origin (trajectories in light blue).

Since our goal is to minimize the largest magnitude of the eigenvalues of Fη,βF_{\eta,\beta} which are computed in Thm. 3, we want to understand the effect of β\beta on these eigenvalues with potential large magnitude. Let λSp(v(ω))\lambda\in\operatorname*{Sp}(\nabla\bm{v}({\bm{\omega}}^{*})), we define the (squared) magnitude ρλ,η(β)\rho_{\lambda,\eta}(\beta) that we want to optimize,

We study the local behavior of ρλ,η\rho_{\lambda,\eta} for small β\beta. The following theorem shows that a well suited β\beta decreases ρλ,η\rho_{\lambda,\eta}, which corresponds to faster convergence.

For any λSp(v(ω))\lambda\in\operatorname*{Sp}(\nabla\bm{v}({\bm{\omega}}^{*})) s.t. (λ)>0\Re(\lambda)>0,

Particularly, we have ρλ,(1/λ)(0)=2(λ)(1/λ)>0\rho_{\lambda,\Re(1/\lambda)}^{\prime}(0)=2\Re(\lambda)\Re(1/\lambda)>0 and Arg(λ)π4((1/λ),2(1/λ))I(λ)|\text{Arg}(\lambda)|\geq\frac{\pi}{4}\Rightarrow\left(\Re({1}/{\lambda}),2\Re({1}/{\lambda})\right)\subset I(\lambda).

As we have seen previously in Fig. 3 and Thm. 2, there are only few eigenvalues which slow down the convergence. Thm. 4 is a local result showing that a small negative momentum can improve the magnitude of the limiting eigenvalues in the following cases: when there is only one limiting eigenvalue λ1\lambda_{1} (since in that case the optimal step-size is ηbest=(1/λ1)I(λ1)\eta_{best}=\Re(1/\lambda_{1})\in I(\lambda_{1})) or when there are several limiting eigenvalues λ1,,λk\lambda_{1},\ldots,\lambda_{k} and the intersection I(λ1)I(λk)I(\lambda_{1})\cap\ldots\cap I(\lambda_{k}) is not empty. We point out that we do not provide any guarantees on whether this intersection is empty or not but note that if the absolute value of the argument of λ1\lambda_{1} is larger than π/4\pi/4 then by (10), our theorem provides that the optimal step-size ηbest\eta_{best} belongs to I(λ1)I(\lambda_{1}).

Since our result is local, it does not provide any guarantees on large negative values of β\beta. Nevertheless, we numerically optimized (17) with respect to β\beta and η\eta and found that for any non-imaginary fixed eigenvalue λ\lambda, the optimal momentum is negative and the associated optimal step-size is larger than η^(λ)\hat{\eta}(\lambda). Another interesting aspect of negative momentum is that it admits larger step-sizes (see Fig. 4 and 5).

For a game with purely imaginary eigenvalues, when ηλ1|\eta\lambda|\ll 1, Thm. 3 shows that μ+(β,η,λ)1(1+β)ηλ\mu_{+}(\beta,\eta,\lambda)\approx 1-(1+\beta)\eta\lambda. Therefore, at the first order, β\beta only has an impact on the imaginary part of μ+\mu_{+}. Consequently μ+\mu_{+} cannot be pushed into the unit circle, and the convergence guarantees of Thm. 1 do not apply. In other words, the analysis above provides convergence rates for games without any pure imaginary eigenvalues. It excludes the purely adversarial bilinear example (α=0\alpha=0 in Eq. 4) that is discussed in the next section.

BILINEAR SMOOTH GAMES

In this section we analyze the dynamics of a purely adversarial game described by,

The first order stationary condition for this game characterizes the solutions (θ,φ)({\bm{\theta}}^{*},{\bm{\varphi}}^{*}) as

If b{\bm{b}} (resp. c{\bm{c}}) does not belong to the column space of A{\bm{A}} (resp. A{\bm{A}}^{\top}), the game (18) admits no equilibrium. In the following, we assume that an equilibrium does exist for this game. Consequently, there exist b{\bm{b}}^{\prime} and c{\bm{c}}^{\prime} such that b=Ab{\bm{b}}={\bm{A}}{\bm{b}}^{\prime} and c=Ac{\bm{c}}={\bm{A}}^{\top}{\bm{c}}^{\prime}. Using the translations θθc{\bm{\theta}}\rightarrow{\bm{\theta}}-{\bm{c}}^{\prime} and φφb{\bm{\varphi}}\rightarrow{\bm{\varphi}}-{\bm{b}}^{\prime}, we can assume without loss of generality, that pdp\geq d, b=0{\bm{b}}=\bm{0} and c=0{\bm{c}}=\bm{0}. We provide upper and lower bounds on the squared distance from the known equilibrium,

where (θ,φ)({\bm{\theta}}^{*},{\bm{\varphi}}^{*}) is the projection of (θt,φt){\bm{\theta}}_{t},{\bm{\varphi}}_{t}) onto the solution space. We show in §C, Lem. 2 that, for our methods of interest, this projection has a simple formulation that only depends on the initialization (θ0,φ0)({\bm{\theta}}_{0},{\bm{\varphi}}_{0}).

We aim to understand the difference between the dynamics of simultaneous steps and alternating steps. Practitioners have been widely using the latter instead of the former when optimizing GANs despite the rich optimization literature on simultaneous methods.

We define this class of methods with momentum using the following formulas,

In our simple setting, the operator Fη,βsimF^{\text{sim}}_{\eta,\beta} is linear. One way to study the asymptotic properties of the sequence (θt,φt)({\bm{\theta}}_{t},{\bm{\varphi}}_{t}) is to compute the eigenvalues of Fη,βsim\nabla F^{\text{sim}}_{\eta,\beta}. The following proposition characterizes these eigenvalues.

The eigenvalues of Fη,βsim\nabla F^{\text{sim}}_{\eta,\beta} are the roots of the \nth4 order polynomials:

Interestingly, these roots only depend on the product η1η2\eta_{1}\eta_{2} meaning that any re-scaling η1γη1,  η21γη2\eta_{1}\rightarrow\gamma\eta_{1}\,,\;\eta_{2}\rightarrow\frac{1}{\gamma}\eta_{2} does not change the eigenvalues of Fη,βsim\nabla F^{\text{sim}}_{\eta,\beta} and consequently the asymptotic dynamics of the iterates (θt,φt)({\bm{\theta}}_{t},{\bm{\varphi}}_{t}). The magnitude of the eigenvalues described in (22), characterizes the asymptotic properties for the iterates of the simultaneous method (21). We report the maximum magnitude of these roots for a given λ\lambda and for a grid of step-sizes and momentum values in Fig 7. We observe that they are always larger than 1, which transcribes a diverging behavior. The following theorem provides an analytical rate of divergence.

For any η1,η20\eta_{1},\eta_{2}\geq 0 and β1=β2=β\beta_{1}=\beta_{2}=\beta, the iterates of the simultaneous methods (21) diverge as,

This theorem states that the iterates of the simultaneous method (21) diverge geometrically for β116\beta\geq-\tfrac{1}{16}. Interestingly, this geometric divergence implies that even a uniform averaging of the iterates (standard in game optimization to ensure convergence (Freund et al., 1999)) cannot alleviate this divergence.

2 Alternating gradient descent

Alternating gradient methods take advantage of the fact that the iterates θt+1{\bm{\theta}}_{t+1} and φt+1{\bm{\varphi}}_{t+1} are computed sequentially, to plug the value of θt+1{\bm{\theta}}_{t+1} (instead of θt{\bm{\theta}}_{t} for simultaneous update rule) into the update of φt+1{\bm{\varphi}}_{t+1},

This slight change between (21) and (23) significantly shifts the eigenvalues of the Jacobian. We first characterize them with the following proposition.

The eigenvalues of Fη,βalt\nabla F^{\text{alt}}_{\eta,\beta} are the roots of the \nth4 order polynomials:

The same way as in (22), these roots only depend on the product η1η2\eta_{1}\eta_{2}. The only difference is that the monomial with coefficient η1η2λ\eta_{1}\eta_{2}\lambda is of degree 2 in (22) and of degree 3 in (24). This difference is major since, for well chosen values of negative momentum, the eigenvalues described in Prop. 2 lie in the unit disk (see Fig. 7). As a consequence, the iterates of the alternating method with no momentum are bounded and do converge if we add some well chosen negative momentum:

If we set η1σmax(A)\eta\leq\frac{1}{\sigma_{\max}(A)}, β1=12\beta_{1}=-\frac{1}{2} and β2=0\beta_{2}=0 then we have

If we set β1=0\beta_{1}=0 and β2=0\beta_{2}=0, then there exists M>1M>1 such that for any η1,η20\eta_{1},\eta_{2}\geq 0, Δt=Θ(Δ0)\Delta_{t}=\Theta(\Delta_{0}).

Our results from this section, namely Thm. 5 and Thm. 6, are summarized in Fig. 2, and demonstrate how alternating steps can improve the convergence properties of the gradient method for bilinear smooth games. Moreover, combining them with negative momentum can surprisingly lead to a linearly convergent method. The conjecture provided in Fig. 2 (divergence of the alternating method with positive momentum) is backed-up by the results provided in Fig. 5 and §A.1.

EXPERIMENTS AND DISCUSSION

Fashion MNIST and CIFAR 10

[Fig. 6] In our third set of experiments, we use negative momentum in a GAN setup on CIFAR-10 (Krizhevsky and Hinton, 2009) and Fashion-MNIST (Xiao et al., 2017) with saturating loss and alternating steps. We use residual networks for both the generator and the discriminator with no batch-normalization. Following the same architecture as Gulrajani et al. (2017), each residual block is made of two 3×33\times 3 convolution layers with ReLU activation function. Up-sampling and down-sampling layers are respectively used in the generator and discriminator. We experiment with different values of momentum on the discriminator and a constant value of 0.5 for the momentum of the generator. We observe that using a negative value can generally result in samples with higher quality and inception scores. Intuitively, using negative momentum only on the discriminator slows down the learning process of the discriminator and allows for better flow of the gradient to the generator. Note that we provide an additional experiment on mixture of Gaussians in § A.2.

RELATED WORK

From an optimization point of view, a lot of work has been done in the context of understanding momentum and its variants (Polyak, 1964; Qian, 1999; Nesterov, 2013; Sutskever et al., 2013). Some recent studies have emphasized the importance of momentum tuning in deep learning such as Sutskever et al. (2013), Kingma and Ba (2015), and Zhang and Mitliagkas (2017), however, none of them consider using negative momentum. Among recent work, using robust control theory, Lessard et al. (2016) study optimization procedures and cover a variety of algorithms including momentum methods. Their analysis is global and they establish worst-case bounds for smooth and strongly-convex functions. Mitliagkas et al. (2016) considered negative momentum in the context of asynchronous single-objective minimization. They show that asynchronous-parallel dynamics ‘bleed’ into optimization updates introducing momentum-like behavior into SGD. They argue that algorithmic momentum and asynchrony-induced momentum add up to create an effective ‘total momentum’ value. They conclude that to attain the optimal (positive) effective momentum in an asynchronous system, one would have to reduce algorithmic momentum to small or sometimes negative values. This differs from our work where we show that for games the optimal effective momentum may be negative. Ghadimi et al. (2015) analyze momentum and provide global convergence properties for functions with Lipschitz-continuous gradients. However, all the results mentioned above are restricted to minimization problems. The purpose of our work is to try to understand how momentum influences game dynamics which is intrinsically different from minimization dynamics.

Finally, similar proof techniques based on the study of the eigenvalues of a state-augmented operator have been recently used by Daskalakis and Panageas (2018) for the study of the optimistic gradient method (OGDA). However, even though OGDA and Polyak’s momentum can be seen as a variant of the gradient method with an additional term, these additional terms are fundamentally different. In OGDA it is a difference between the two previous gradients, while in Polyak’s method it is a difference between the two past iterates.

GANs as games

A lot of recent work has attempted to make GAN training easier with new optimization methods. Daskalakis et al. (2018) extrapolate the next value of the gradient using previous history and Gidel et al. (2018) explore averaging and introduce a variant of the extra-gradient algorithm. Balduzzi et al. (2018) develop new methods to understand the dynamics of general games: they decompose second-order dynamics into two components using Helmholtz decomposition and use the fact that the optimization of Hamiltonian games is well understood. It differs from our work since we do not consider any decomposition of the Jacobian but focus on the manipulation of its eigenvalues. Recently, Liang and Stokes (2018) provide a unifying theory for smooth two-player games for non-asymptotic local convergence. They also provide theory for choosing the right step-size required for convergence.

From another perspective, Odena et al. (2018) show that in a GAN setup, the average conditioning of the Jacobian of the generator becomes ill-conditioned during training. They propose Jacobian clamping to improve the inception score and Frechet Inception Distance. Mescheder et al. (2017) provide discussion on how the eigenvalues of the Jacobian govern the local convergence properties of GANs. They argue that the presence of eigenvalues with zero real-part and large imaginary-part results in oscillatory behavior but do not provide results on the optimal step-size and on the impact of momentum. Nagarajan and Kolter (2017) also analyze the local stability of GANs as an approximated continuous dynamical system. They show that during training of a GAN, the eigenvalues of the Jacobian of the corresponding vector field are pushed away from one along the real axis.

CONCLUSION

In this paper, we show analytically and empirically that alternating updates with negative momentum is the only method within our study parameters (Fig.2) that converges in bilinear smooth games. We study the effects of using negative values of momentum in a GAN setup both theoretically and experimentally. We show that, for a large class of adversarial games, negative momentum may improve the convergence rate of gradient-based methods by shifting the eigenvalues of the Jacobian appropriately into a smaller convergence disk. We found that, in simple yet intuitive examples, using negative momentum makes convergence to the Nash Equilibrium easier. Our experiments support the use of negative momentum for saturating losses on mixtures of Gaussians, as well as on other tasks using CIFAR-10 and fashion MNIST. Altogether, fully stabilizing learning in GANs requires a deep understanding of the underlying highly non-linear dynamics. We believe our work is a step towards a better understanding of these dynamics. We encourage deep learning researchers and practitioners to include negative values of momentum in their hyper-parameter search.

We believe that our results explain a decreasing trend in momentum values used for training GANs in the past few years reported in Fig. 4. Some of the most successful papers use zero momentum (Arjovsky et al., 2017; Gulrajani et al., 2017) for architectures that would otherwise call for high momentum values in a non-adversarial setting.

Acknowledgments

This research was partially supported by the Canada CIFAR AI Chair Program, the FRQNT nouveaux chercheurs program, 2019-NC-257943, the Canada Excellence Research Chair in “Data Science for Real-time Decision-making”, by the NSERC Discovery Grant RGPIN-2017-06936, a Google Focused Research Award and an IVADO grant. Authors would like to thank NVIDIA corporation for providing the NVIDIA DGX-1 used for this research. Authors are also grateful to Guojun Zhang, Frédéric Bastien, Florian Bordes, Adam Beberg, Cam Moore and Nithya Natesan for their support.

Appendix A ADDITIONNAL FIGURES

In Figure 7 we numerically (using the formula provided in Proposition 1 and 2) computed the maximum magnitude of the eigenvalues gradient descent with negative momentum on a bilinear objective as a function of the step size η\eta and the momentum β\beta. We can notice that on one hand, for simultaneous gradient method, no value of η\eta and β\beta provide a maximum magnitude smaller than 1, causing a divergence of the algorithm. On the other hand, for alternating gradient method there exists a sweet spot where the maximum magnitude of the eigenvalues of the operator is smaller than 1 insuring that this method does converge linearly (since the Jacobian of a bilinear minmax proble is constant).

A.2 Mixture of Gaussian

[Fig. 8] In this set of experiments we evaluate the effect of using negative momentum for a GAN with saturating loss and alternating steps. The data in this experiment comes from eight Gaussian distributions which are distributed uniformly around the unit circle. The goal is to force the generator to generate 2-D samples that are coming from all of the 8 distributions. Although this looks like a simple task, many GANs fail to generate diverse samples in this setup. This experiment shows whether the algorithm prevents mode collapse or not.

We use a fully connected network with 4 hidden ReLU layers where each layer has 256 hidden units. The latent code of the generator is an 8-dimensional multivariate Gaussian. The model is trained for 100,000 iterations with a learning rate of 0.010.01 for stochastic gradient descent along with values of zero, 0.5-0.5 and 0.50.5 momentum. We observe that negative momentum considerably improves the results compared to positive or zero momentum.

Appendix B DISCUSSION ON MOMENTUM AND CONDITIONING

In this section, we analyze the effect of the conditioning of the problem on the optimal value of momentum. Consider the following formulation as an extension of the bilinear min-max game discussed in §5, Eq. 4 (p=d=np=d=n),

where D{\bm{D}} is a square diagonal positive-definite matrix,

and its condition number is κ(D)=dn,n/d1,1\kappa({\bm{D}})=d_{n,n}/d_{1,1}. Thus, we can re-write the vector field and the Jacobian as a function of α\alpha and D{\bm{D}},

The corresponding eigenvalues λ\lambda of the Jacobian are,

For simplicity, in the following we will note Fη,β\nabla F_{\eta,\beta} for Fη,β(φ,θ,α,D)\nabla F_{\eta,\beta}({\bm{\varphi}},{\bm{\theta}},\bm{\alpha},{\bm{D}}).

Using Thm. (3), the eigenvalues of Fη,β\nabla F_{\eta,\beta} are,

where Δ:=14β(1ηλ+β)2\Delta\mathrel{\mathop{\ordinarycolon}}=1-\frac{4\beta}{(1-\eta\lambda+\beta)^{2}} and Δ12\Delta^{\frac{1}{2}} is the complex square root of Δ\Delta with positive real part.

Hence the spectral radius of Fη,β\nabla F_{\eta,\beta} can be explicitly formulated as a function of β\beta and η\eta,

In Figure 9, we numerically computed the optimal β\beta that minimizes ρmax(Fη,β)\rho_{max}(\nabla F_{\eta,\beta}) as a function of the step-size η\eta, for n=2n=2, d1,1=1/κd_{1,1}=1/\kappa and d2,2=1d_{2,2}=1. To balance the game between the adversarial part and the cooperative part, we normalize the matrix D{\bm{D}} such that the sum of its diagonal elements is nn. It can be seen that there is a competition between the type of the game (adversarial and cooperative) versus the conditioning of the matrix D{\bm{D}}. In a more cooperative regime, increasing κ\kappa results in more positive values of momentum which is consistent with the intuition that cooperative games are almost minimization problems where the optimum value for the momentum is known (Polyak, 1964) to be \beta=\big{(}\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\big{)}^{2}. Interestingly, even if the condition number of D{\bm{D}} is large, when the game is adversarial enough, the optimum value for the momentum is negative. This experimental setting seems to suggest the existence of a multidimensional condition number taking into account the difficulties introduced by the ill conditioning of D{\bm{D}} as well as the adversarial component of the game.

Appendix C LEMMAS AND DEFINITIONS

Recall that the spectral radius ρ(A)\rho(A) of a matrix AA is the maximum magnitude of its eigenvalues.

For a symmetric matrix, this is equal to the spectral norm, which is the operator norm induced by the vector 2-norm. However, we are dealing with general matrices, so these two values may be different. The spectral radius is always smaller than the spectral norm, but it’s not a norm itself, as illustrated by the example below:

where we used the fact that the spectral norm is also the square root of the largest singular value.

In this section we will introduce three lemmas that we will use in the proofs of §D.

The first lemma is about the determinant of a block matrix.

Let A,B,C,DA,B,C,D four matrices such that CC and DD commute. Then

where A\begin{vmatrix}A\end{vmatrix} is the determinant of AA.

The second lemma is about the iterates of the simultaneous and the alternating methods introduced in §5 for the bilinear game. It shows that we can pick a subspace where the iterates will remain.

Let (θt,φt)({\bm{\theta}}_{t},{\bm{\varphi}}_{t}) the updates computed by the simultaneous (resp. alternating) gradient method with momentum (21) (resp. (23)). There exists are couple (θ,φ)({\bm{\theta}}^{*},{\bm{\varphi}}^{*}) solution of (18) only depending on (θ0,φ0)({\bm{\theta}}_{0},{\bm{\varphi}}_{0}) such that,

Let us start with the simultaneous updates (21).

Let UDV=A{\bm{U}}^{\top}{\bm{D}}{\bm{V}}={\bm{A}} the SVD of A{\bm{A}} where U{\bm{U}} and V{\bm{V}} are orthogonal matrices and

where rr is the rank of AA and σ1σr>0\sigma_{1}\geq\cdots\geq\sigma_{r}>0 are the (positive) singular values of AA. The update rules (21) implies that,

Since the solutions (θ,φ)({\bm{\theta}}^{*},{\bm{\varphi}}^{*}) of (18) verify the following first order conditions:

One can set (θ,φ)({\bm{\theta}}^{*},{\bm{\varphi}}^{*}) as in (37) to be a couple of solution of (18) such that U(θ0θ)span(D){\bm{U}}({\bm{\theta}}_{0}-{\bm{\theta}}^{*})\in span(D) and V(φ0φ)span(D){\bm{V}}({\bm{\varphi}}_{0}-{\bm{\varphi}}^{*})\in span(D). By an immediate recurrence, using (36) we have that for any initialization (θ0,φ0)({\bm{\theta}}_{0},{\bm{\varphi}}_{0}) there exists a couple (θ,φ)({\bm{\theta}}^{*},{\bm{\varphi}}^{*}) such that that for any t0t\geq 0,

The proof for the alternated updates (23) are the same since we only use the fact that the iterates stay on the span of interest. ∎

If ρ(M)<1\rho(M)<1, and MM is diagonalizable, then ut2O((ρ(M))tu02)\|{\bm{u}}_{t}\|_{2}\in O((\rho({\bm{M}}))^{t}\|{\bm{u}}_{0}\|_{2}).

If ρ(M)>1\rho(M)>1, then there exist u0{\bm{u}}_{0} such that ut2Ω(ρ(M))tu02\|{\bm{u}}_{t}\|_{2}\in\Omega(\rho({\bm{M}}))^{t}\|{\bm{u}}_{0}\|_{2}.

If λ=1,  λSp(M)|\lambda|=1\,,\;\forall\lambda\in Sp(M), and MM is diagonalizable then ut2Θ(u02)\|{\bm{u}}_{t}\|_{2}\in\Theta(\|{\bm{u}}_{0}\|_{2}).

If λ=1,  λSp(M)|\lambda|=1\,,\;\forall\lambda\in Sp(M), we can diagonalize M{\bm{M}} such that M=PDP1{\bm{M}}={\bm{P}}{\bm{D}}{\bm{P}}^{-1} where P{\bm{P}} is invertible and D{\bm{D}} is a diagonal matrix with complex values of magnitude 1.

Appendix D PROOFS OF THE THEOREMS AND PROPOSITIONS

Let us recall the Theorem proposed by Bertsekas (1999, Proposition 4.4.1). We also provide a convergence rate that was not previously stated in (Bertsekas, 1999).

If the spectral radius ρmax:=ρ(Fη(ω))<1\rho_{\max}\mathrel{\mathop{\ordinarycolon}}=\rho(\nabla F_{\eta}({\bm{\omega}}^{*}))<1, then, for ω0{\bm{\omega}}_{0} in a neighborhood of ω{\bm{\omega}}^{*}, the distance of ωt{\bm{\omega}}_{t} to the stationary point ω{\bm{\omega}}^{*} converges at a linear rate of \mathcal{O}\big{(}(\rho_{\max}+\epsilon)^{t}\big{)}\,,\,\forall\epsilon>0.

For brevity let us write xt:=(ϕt,θt)x_{t}\mathrel{\mathop{\ordinarycolon}}=(\phi_{t},\theta_{t}) for t0t\geq 0 and x:=(ϕ,θ)x^{*}\mathrel{\mathop{\ordinarycolon}}=(\phi^{*},\theta^{*}). Let ϵ>0\epsilon>0.

By Proposition A.15 (Bertsekas, 1999) there exists a norm \|\cdot\| such that its induced matrix norm has the following property:

Then by definition of the sequence (xt)(x_{t}) and since xx^{*} is a fixed point of FηF_{\eta}, we have that,

Since FηF_{\eta} is assumed to be continuously differentiable by the mean value theorem we have that

Since the induced norm of a square matrix is continuous on its elements and since we assumed that Fη\nabla F_{\eta} was continuous, there exists δ>0\delta>0 such that,

Finally, we get that if xtxδ\|x_{t}-x^{*}\|\leq\delta, then,

where in the last line we used (47) and (51). Consequently, if ρ(Fη(x)<1\rho(\nabla F_{\eta}(x^{*})<1 and if x0xδ\|x_{0}-x^{*}\|\leq\delta, we have that,

D.2 Proof of Thm. 2

If the eigenvalues of v(ω)\nabla\bm{v}({\bm{\omega}}^{*}) all have a positive real-part, then, the best step-size ηbest\eta_{best}, which minimizes the spectral radius ρmax(η)\rho_{\max}(\eta) of Fη(φ,θ)\nabla F_{\eta}({\bm{\varphi}}^{*},{\bm{\theta}}^{*}), is the solution of a (convex) quadratic by parts problem, and satisfies,

where (λk=rkeiψk)1km=Sp(v(φ,θ))(\lambda_{k}=r_{k}e^{i\psi_{k}})_{1\leq k\leq m}=\operatorname*{Sp}(\nabla\bm{v}({\bm{\varphi}}^{*},{\bm{\theta}}^{*})) are sorted such that 0<(1/λ1)(1/λm)0<\Re(1/\lambda_{1})\leq\cdots\leq\Re(1/\lambda_{m}). Particularly, when ηbest=(1/λ1)\eta_{best}=\Re(1/\lambda_{1}) we are in the case of the top plot of Fig.3 and ρmax(ηbest)2=sin(ψ1)2  .\rho_{\max}(\eta_{best})^{2}=\sin(\psi_{1})^{2}\;.

The eigenvalues of Fη\nabla F_{\eta} are 1ηλ1-\eta\lambda, for λSp(v(φ,θ))\lambda\in\operatorname*{Sp}(\nabla\bm{v}({\bm{\varphi}},{\bm{\theta}})). Our goal is to solve

where {λ1,,λm}\{\lambda_{1},\ldots,\lambda_{m}\} is the spectrum of v(φ,θ)\nabla\bm{v}({\bm{\varphi}}^{*},{\bm{\theta}}^{*}). we can develop the magnitude to get,

The function ηmax1infi(η)\eta\mapsto\max_{1\leq i\leq n}f_{i}(\eta) is a convex function quadratic by part. This function goes to ++\infty as η\eta gets larger, so it reaches its minimum over [0,)[0,\infty). We can notice that each function fif_{i} reaches its minimum for ηi=(λi)λi2=(1/λi)\eta_{i}=\frac{\Re(\lambda_{i})}{|\lambda_{i}|^{2}}=\Re(1/\lambda_{i}). Consequently, if we order the eigenvalues such that,

Then developing 1η1λ12|1-\eta_{1}\lambda_{1}|^{2}, we get that,

where λ1=r1eiψ1\lambda_{1}=r_{1}e^{i\psi_{1}}. Moreover, we also have that

This upper bound is then achieved for η=(1/λ1)\eta=\Re(1/\lambda_{1}). Moreover is Sp(v(φ,θ)[μ,L]\operatorname*{Sp}(\nabla{\bm{v}}({\bm{\varphi}}^{*},{\bm{\theta}}^{*})\subset[\mu,L] we have that, λ1=L\lambda_{1}=L and that

Consequently we recover the standard upper bound ρmax212μL+μ2L=(1μ/L)2\rho_{\max}^{2}\leq 1-2\frac{\mu}{L}+\frac{\mu^{2}}{L}=(1-\mu/L)^{2} provided in the convex case.

D.3 Proof of Thm. 3

We are now interested in the eigenvalues of the Simultaneous Gradient Method with Momentum.

The eigenvalues of Fη,β(ω)\nabla F_{\eta,\beta}({\bm{\omega}}^{*}) are

where Δ:=14β(1ηλ+β)2,  λSp(v(ω))\Delta\mathrel{\mathop{\ordinarycolon}}=1-\frac{4\beta}{(1-\eta\lambda+\beta)^{2}}\,,\;\lambda\in\operatorname*{Sp}(\nabla\bm{v}({\bm{\omega}}^{*})) and Δ12\Delta^{\frac{1}{2}} is the complex square root of Δ\Delta with positive real part If Δ\Delta is a negative real number we set Δ12:=iΔ\Delta^{\frac{1}{2}}\mathrel{\mathop{\ordinarycolon}}=i\sqrt{-\Delta}. Moreover we have the following Taylor approximation,

Its characteristic polynomial can be written:

where v(ω)=PTP1\nabla\bm{v}({\bm{\omega}}^{*})=PTP^{-1} and TT is an upper-triangular matrix. Finally by Lemma 1 we have that,

Let λ\lambda one of the λi\lambda_{i} we have,

where Δ:=(1ηλ+β)24β\Delta\mathrel{\mathop{\ordinarycolon}}=(1-\eta\lambda+\beta)^{2}-4\beta and λSp(v(ω))\lambda\in\operatorname*{Sp}(\nabla\bm{v}({\bm{\omega}}^{*})). This can be rewritten as,

where Δ:=14β(1ηλ+β)2,  λSp(v(φ,θ))\Delta\mathrel{\mathop{\ordinarycolon}}=1-\frac{4\beta}{(1-\eta\lambda+\beta)^{2}}\,,\;\lambda\in\operatorname*{Sp}(\nabla\bm{v}({\bm{\varphi}}^{*},{\bm{\theta}}^{*})) and Δ12\Delta^{\frac{1}{2}} is the complex square root of Δ\Delta with real positive part (if Δ\Delta is a real negative number, we set Δ12:=iΔ\Delta^{\frac{1}{2}}\mathrel{\mathop{\ordinarycolon}}=i\sqrt{-\Delta}). Moreover we have the following Taylor approximation,

D.4 Proof of Thm. 4

We are interested in the impact of small Momentum values on the convergence rate of Simultaneous Gradient Method.

For any λSp(v(ω))\lambda\in\operatorname*{Sp}(\nabla\bm{v}({\bm{\omega}}^{*})) s.t. (λ)>0\Re(\lambda)>0,

Particularly, we have ρλ,(1/λ)(0)=2(λ)(1/λ)>0\rho_{\lambda,\Re(1/\lambda)}^{\prime}(0)=2\Re(\lambda)\Re(1/\lambda)>0 and Arg(λ)π4((1/λ),2(1/λ))I(λ)|\text{Arg}(\lambda)|\geq\frac{\pi}{4}\Rightarrow\left(\Re({1}/{\lambda}),2\Re({1}/{\lambda})\right)\subset I(\lambda).

Recall the definitions of μ+\mu_{+} and μ\mu_{-} from Thm. 3, and the definition of the radius:

When β\beta is close to , μ\mu_{-} is close also to whereas μ+\mu_{+} is close to 1ηλ1-\eta\lambda. In general 1ηλ01-\eta\lambda\neq 0, so around , ρλ,η(β)=μ+(β)2=μ+(β)μˉ+(β)\rho_{\lambda,\eta}(\beta)=|\mu_{+}(\beta)|^{2}=\mu_{+}(\beta)\bar{\mu}_{+}(\beta). The special case where 1ηλ=01-\eta\lambda=0 is excluded from this analysis because it means that the eigenvalue λ\lambda is not one constraining the learning rate as seen in Thm. 2. Computing the derivative of ρ\rho give us

The sign of ρλ,η(0)\rho^{\prime}_{\lambda,\eta}(0) is determined by the sign of

This quadratic function is strictly positive on the open interval (λ(λ)λ(λ),λ+(λ)λ(λ))\left(\frac{|\lambda|-|\Im(\lambda)|}{|\lambda|\Re(\lambda)},\frac{|\lambda|+|\Im(\lambda)|}{|\lambda|\Re(\lambda)}\right).

Moreover since (1/λ)=(λ)λ2\Re(1/\lambda)=\frac{\Re(\lambda)}{|\lambda|^{2}}, we have that 1λ(1/λ)2=1(λ)(1/λ)|1-\lambda\Re(1/\lambda)|^{2}=1-\Re(\lambda)\Re(1/\lambda) (see Eq. 65) and then,

Finally writting λ=reiψ\lambda=re^{i\psi} we get that,

and arg(λ)π4|\arg(\lambda)|\geq\frac{\pi}{4} implies that (23(1/λ),2(1/λ))\left(\frac{2}{3}\Re(1/\lambda),2\Re(1/\lambda)\right)

D.5 Proof of Thm. 5

We are now in the special case of a bilinear game. We first consider the simultaneous gradient step with momentum The operator Fη,βF_{\eta,\beta} is defined as:

The eigenvalues of Fη,βsim\nabla F^{\text{sim}}_{\eta,\beta} are the roots of the \nth4 order polynomials:

Particularly, when β1=β2=0\beta_{1}=\beta_{2}=0 and η1=η2=η\eta_{1}=\eta_{2}=\eta we have,

Then the characteristic polynomial of this matrix is equal to,

Then we can use Lemma 1 to compute this determinant,

Where for the last equality we added to the first block column the second one multiplied by η2AXX(X1β2)+β2\eta_{2}{\bm{A}}^{\top}\frac{X}{X(X-1-\beta_{2})+\beta_{2}}. It’s now time to introduce rr the rank of A{\bm{A}}. We can diagonalize AA=Udiag(λ1,,λr,0,,0)U{\bm{A}}^{\top}{\bm{A}}={\bm{U}}^{\top}diag(\lambda_{1},\ldots,\lambda_{r},0,\ldots,0){\bm{U}} to get the determinant of a triangular matrix,

where λk\lambda_{k} are the positive eigenvalues of AA{\bm{A}}^{\top}{\bm{A}}. This is the characteristic polynomial we were seeking, taking into account the null singular values of AA.

In particular, when β1=β2=0\beta_{1}=\beta_{2}=0, we get,

For any η1,η20\eta_{1},\eta_{2}\geq 0 and β1=β2=β\beta_{1}=\beta_{2}=\beta, the iterates of the simultaneous methods (21) diverge as,

We report the maximum magnitudes of the eigenvalues of the polynomial from Prop. 1 in Fig. 7. We observe that they are larger than 1. We now prove it in several cases. Let us start with the simpler case β1=β2=0\beta_{1}=\beta_{2}=0. Using Lemma 2, there exists (θ,φ)({\bm{\theta}}^{*},{\bm{\varphi}}^{*}) such that for any t0t\geq 0,

where in line 1 we used that Aφ=0{\bm{A}}{\bm{\varphi}}^{*}=0 and in line 3 we used that φtφ{\bm{\varphi}}_{t}-{\bm{\varphi}}^{*} is orthogonal to the null space of A{\bm{A}}, so that we lower bound the product by the smallest non-zero singular value σmin(A)\sigma_{\min}({\bm{A}}). The same way, we get:

where σmin2(A)\sigma^{2}_{\min}({\bm{A}}) is the minimal (positive) squared singular value of A{\bm{A}}.

Now we can try to handle the case where β1=β2=β0\beta_{1}=\beta_{2}=\beta\neq 0. To prove Thm. 5 we will prove the following Proposition

Let Fη,βsimF^{\text{sim}}_{\eta,\beta} the operator defined in (21).

For β0\beta\geq 0 its radial spectrum is lower bounded by 1+η1η2σmax2(A)1+\eta_{1}\eta_{2}\sigma^{2}_{\max}(A).

For 1/16β<0-1/16\leq\beta<0 its radial spectrum is lower bounded by 1+η1η2σmax2(A)/171+\eta_{1}\eta_{2}\sigma^{2}_{\max}(A)/17.

Let us use Proposition 1 to get that the eigenvalues of our linear operator are the solutions of

Let us fix λ>0\lambda>0 belonging to Sp(AA)Sp({\bm{A}}^{\top}{\bm{A}}). For simplicity, let us note α2=η2λ\alpha^{2}=\eta^{2}\lambda. We can then notice that this polynomial can be factorized as

Then the roots of these 2 quadratic polynomials are

where ±z1/2\pm z^{1/2} are the complex square roots of zz with positive imaginary part. Our goal is going to be to show that z1z_{1} has a magnitude larger than 11.

Let us first assume that β<0\beta<0. We have that,

where for the two inequalities we used 1+x1+x,x0\sqrt{1+x}\geq 1+x\,,\,\forall x\leq 0. With the same ideas we can lower bound the Imaginary part of z1/2z^{1/2},

Consequently we can use (115) and (120) to lower bound the magnitude of z1z_{1} (defined in Eq. 108) as,

For 1/16β<0-1/16\leq\beta<0 we have that α2+16α2βα2+(1β)2α217\alpha^{2}+16\frac{\alpha^{2}\beta}{\alpha^{2}+(1-\beta)^{2}}\geq\frac{\alpha^{2}}{17}. Hence,

Let us now consider the case β0\beta\geq 0. By using the fact that a+ba,a,b0\sqrt{a+b}\geq\sqrt{a}\,,\,\forall a,b\geq 0 we have that,

To conclude this proof we just need to combine Proposition • ‣ 3 with Lemma 3 saying that if the spectral radius is strictly larger than 1 then the iterates diverge. ∎

D.6 Proof of Thm. 6

The eigenvalues of Fη,βalt\nabla F^{\text{alt}}_{\eta,\beta} are the roots of the \nth4 order polynomials:

Particularly for β1=β2=0\beta_{1}=\beta_{2}=0 and η1η2=η2\eta_{1}\eta_{2}=\eta^{2} we get

Particularly for β1=12\beta_{1}=-\frac{1}{2} and β2=0\beta_{2}=0 we get

Let us recall the definition of Fη,βaltF^{\text{alt}}_{\eta,\beta}, (for compactness we note Fη,βalt=Fη1,η2,β1,β2altF^{\text{alt}}_{\eta,\beta}=F^{\text{alt}}_{\eta_{1},\eta_{2},\beta_{1},\beta_{2}})

Hence, the matrix Fη,βaltF^{\text{alt}}_{\eta,\beta} is,

Then the characteristic polynomial of Fη,βaltF^{\text{alt}}_{\eta,\beta} is equal to

Where in the last line we added the third block column multiplied by 1X\frac{1}{X} to the first one.Then we have

where we added to the second block line the first block line by η2A-\eta_{2}{\bm{A}}^{\top}. Then our determinant is triangular by squared blocks of size m×mm\times m and we can write,

Now we can diagonalize AA{\bm{A}}^{\top}{\bm{A}} to get,

where (λk)1kr(\lambda_{k})_{1\leq k\leq r} are the positive eigenvalues of AA{\bm{A}}^{\top}{\bm{A}} of rank rr. Particularly, when β1=β2=0\beta_{1}=\beta_{2}=0 we have that,

We report the maximum magnitudes of the eigenvalues of the polynomial from Prop. 2 in Fig. 7. We observe that they are smaller than 1 for a large choice of step-size and momentum values. This is a satisfying numerical result but we want analytical convergence rates. This is what we prove in Thm. 6.

If we set η1σmax(A)\eta\leq\frac{1}{\sigma_{\max}(A)}, β1=12\beta_{1}=-\frac{1}{2} and β2=0\beta_{2}=0 then we have

If we set β1=0\beta_{1}=0 and β2=0\beta_{2}=0, then there exists M>1M>1 such that for any η1,η20\eta_{1},\eta_{2}\geq 0, Δt=Θ(Δ0)\Delta_{t}=\Theta(\Delta_{0}).

In Lemma 2 we showed of the affine transformations θtU(θtθ){\bm{\theta}}_{t}\rightarrow{\bm{U}}({\bm{\theta}}_{t}-{\bm{\theta}}^{*}) and φtV(φtφ){\bm{\varphi}}_{t}\rightarrow{\bm{V}}({\bm{\varphi}}_{t}-{\bm{\varphi}}^{*}) allow us to work on the span of a diagonal matrix D{\bm{D}}. Then in that case the eigenspace of D{\bm{D}} do not interact with each other. In the sense that for each coordinate of [U(θtθ)]i[{\bm{U}}({\bm{\theta}}_{t}-{\bm{\theta}}^{*})]_{i} and [V(φtφ)]i[{\bm{V}}({\bm{\varphi}}_{t}-{\bm{\varphi}}^{*})]_{i} (1ir1\leq i\leq r) we have from (36) that

Consequently we only need to study the 4 dimensional linear operators

for σ1σr>0\sigma_{1}\leq\cdots\leq\sigma_{r}>0 the positive singular values of A{\bm{A}}. These equations are a particular case of (135). Using the proof of Proposition 2 the eigenvalues of these matrices are the solution of

When, β1=β2=0\beta_{1}=\beta_{2}=0 we have that,

Then the roots of Pi(X)P_{i}(X) are and two complex conjugate value with a magnitude equal to the constant term of (X1)2+η1η2Xσi2(X-1)^{2}+\eta_{1}\eta_{2}X\sigma_{i}^{2} which is 11. Since these two eigenvalues are different, the matrix (147) is diagonalizable (for β1=β2=0\beta_{1}=\beta_{2}=0 we can remove the state augmentation to only work with these two eigenvector). Consequently our linear operator is diagonalizable and has all its eigenvalues larger than 1 in magnitude, we can then apply Lemma 3 to conclude that Δt=Ω(Δ0)\Delta_{t}=\Omega(\Delta_{0}).

When, β1=12\beta_{1}=-\frac{1}{2} and β2=0\beta_{2}=0, we have that Pi(X)=XQi(X)P_{i}(X)=XQ_{i}(X) where

Then Pi(1/2)=+η1η2σi24>0P_{i}(-1/2)=+\frac{\eta_{1}\eta_{2}\sigma_{i}^{2}}{4}>0 and Pi(1)=2+η1η2σi2P_{i}(-1)=-2+\eta_{1}\eta_{2}\sigma_{i}^{2}. If η1η2<2σi2\eta_{1}\eta_{2}<\frac{2}{\sigma_{i}^{2}} we have Pi(1)<0P_{i}(-1)<0. Consequently, this polynomial has a negative root λ\lambda_{-} such that 1<λ<12<0-1<\lambda_{-}<-\frac{1}{2}<0. Moreover the derivative of Qi(X)Q_{i}(X) is

If η1η2<32σi2\eta_{1}\eta_{2}<\frac{3}{2\sigma_{i}^{2}}, then Qi(x)>0,x>0Q^{\prime}_{i}(x)>0\,,\forall x>0. Since Qi(0)=1/2>0Q_{i}(0)=1/2>0 then Qi(x)>0,x0Q_{i}(x)>0\,,\forall x\geq 0 and consequently all the real roots of QiQ_{i} are negative.

Since by the root coefficient relationship the sum of the roots of QiQ_{i} has to be equal to 32η1η2σi2>0\frac{3}{2}-\eta_{1}\eta_{2}\sigma_{i}^{2}>0, all the roots of QiQ_{i} cannot be real (because the real roots of QiQ_{i} are negative). Hence QiQ_{i} has two conjugate roots λc\lambda_{c} and λˉc\bar{\lambda}_{c} and one real negative root λr\lambda_{r}. Let us consider 1<λr<1/2-1<\lambda_{r}<-1/2, we have,

where we called α=η1η2σi2\alpha=\eta_{1}\eta_{2}\sigma_{i}^{2}. Thus we have,

where we used 1x2x281x,1>x01-\frac{x}{2}-\frac{x^{2}}{8}\geq\sqrt{1-x}\,,\,1>x\geq 0. Moreover the roots coefficient relationship are

where we called α=η1η2σi2\alpha=\eta_{1}\eta_{2}\sigma_{i}^{2}. Plugging (154) into (155) we get

Multiplying by λr\lambda_{r} and plugging (156) in we get

where we used that λr<12\lambda_{r}<-\frac{1}{2}. Consequently, since in the theorem we assumed that η1η21σmax(A)2\eta_{1}\eta_{2}\leq\frac{1}{\sigma_{\max}({\bm{A}})^{2}}, we have that α1\alpha\leq 1, we have

One last thing to say is that the four roots of PiP_{i} which are the four eigenvalues of the matrix in (147) are different and consequently this matrix is diagonalizable.

We can then apply Lemma 3 in a case of a spectral radius strictly smaller that 11 to conclude that,

because U{\bm{U}} and V{\bm{V}} are orthogonal.