Interaction Matters: A Note on Non-asymptotic Local Convergence of Generative Adversarial Networks
Tengyuan Liang, James Stokes
Introduction
In this paper we consider the non-asymptotic local convergence and stability of discrete-time gradient-based optimization algorithms for solving smooth two-player zero-sum games of the form,
The motivation behind our non-asymptotic analysis follows from the observation that Generative Adversarial Networks (GANs) lack principled understanding at both the computational and algorithmic level. GAN optimization is a special case of (1), which has been developed for learning a complex and multi-modal probability distribution based on samples from (over ), through learning a generator function that transforms the input distribution (over ) to match the target . Ignoring the parameter regularization, the value function corresponding to a GAN is of the form,
Optimization of GANs (and value functions of the form (1) at large) is hard, both in theory and in practice (Singh et al., 2000; Pfau and Vinyals, 2016; Salimans et al., 2016). Global optimization of a general value function with multiple saddle points is impractical and unstable, so we instead resort to the more modest problem of searching for a local saddle point such that no player has the incentive to deviate locally
For smooth value functions, the above conditions are equivalent to the following solution concept:
is called a local Nash equilibrium if
, ;
, .
In practice, discrete-time dynamical systems are employed to numerically approach the saddle points of , as is the case in GANs (Goodfellow et al., 2014), and in primal-dual methods for non-linear optimization (Singh et al., 2000). The simplest possibility is Simultaneous Gradient Ascent (SGA), which corresponds to the following discrete-time dynamical system,
where is the step size or learning rate. In the limit of vanishing step size, SGA approximates a continuous-time autonomous dynamical system, the asymptotic convergence of which has been established in Singh et al. (2000); Cherukuri et al. (2017); Nagarajan and Kolter (2017). In practice, however, it has been widely reported that the discrete-time SGA dynamics for GAN optimization suffers from instabilities due to the possibility of complex eigenvalues in the operator of the dynamical system (Salimans et al., 2016; Metz et al., 2016; Nagarajan and Kolter, 2017; Mescheder et al., 2017; Heusel et al., 2017). We believe room for improvement still exists in the current theory, which we hope will render it to be more informative in practice:
Non-asymptotic convergence speed. In practice, one is concerned with finite step size which is typically subject to extensive hyperparameter tuning. Detailed characterizations on the convergence speed, and theoretical insights on the choice of learning rate can be helpful.
Unified simple analysis for modified saddle point dynamics. Several attempts to fix GAN optimization have been put forth by independent researchers, which modify the dynamics (Mescheder et al., 2017; Daskalakis et al., 2017; Yadav et al., 2017) using very different insights. A unified analysis that reviews the deeper connections amongst these proposals helps to better understand the saddle point dynamics at large.
In this paper, we address the above points by studying the theory of non-asymptotic convergence of SGA and related discrete-time saddle point dynamics, namely, Optimistic Mirror Descent (OMD), Consensus Optimization (CO), Implicit Updates (IU) and Predictive Method (PM). More concretely, we provide the following theoretical contributions about the crucial effect of the off-diagonal interaction term in two-player games:
Stable case: curse of the interaction term. Locally, SGA converges exponentially fast to a stable Nash equilibrium with a carefully chosen learning rate. This can be viewed as a generalization (rather than a special case) of the local convergence guarantee for single-player gradient descent for strongly-convex functions. In addition, we quantitatively isolate the slow-down in the convergence rate of two-player SGA compared to single-player gradient descent, due to the presence of the off-diagonal interaction term for the two-player game.
Unstable case: blessing of the interaction term. For unstable Nash equilibria, SGA diverges away for any non-zero learning rate. We discover a unified non-asymptotic analysis that encompasses four proposed modified dynamics — OMD, CO, IU and PM. The analysis shows that all these algorithms, at a high level, share the same idea of utilizing the curvature introduced by the interaction term . Unlike the slow sub-linear rate of convergence experienced by single-player gradient descent for non-strongly convex functionsIn fact, Nesterov (2013) constructed a convex function that is non-strongly convex, such that all first order methods suffer slow sub-linear rate of convergence (in optimization literature, linear rate refers to exponential convergence speed)., four modified dynamics effectively exploit the interaction term to achieve exponential convergence to unstable Nash equilibria. The analysis also provides specific advice on the choice of learning rate for each procedure, albeit restricted to the simple case of bi-linear games.
The organization of the paper is as follows. In Section 2 we consider the situation when locally, the value function satisfies strict convexity/concavity. We show non-asymptotic exponential convergence to Nash equilibria for SGA, and identify an optimal learning rate. To reveal and understand the new features of the modified dynamics, we study in Section 3 the minimal unstable bilinear game, showing that proposed stabilizing techniques all achieve exponential convergence to unstable Nash equilibria. Finally, in Section 4 we take a step closer to the real world by numerically evaluating each of the proposed dynamical systems using value functions of GAN form (2), under objective evaluation metrics. Proofs are deferred to the Appendix A.
Stable Case: Non-asymptotic Local Convergence
In this section we will establish the non-asymptotic convergence of SGA dynamics to saddle points that are stable local Nash equilibria. With a properly chosen learning rate, the local convergence can be intuitively pictured as cycling inwards to these saddle points, where the distance to the saddle point of interest is exponentially contracting. First, let’s introduce the notion of stable equilibrium.
is called a stable local Nash equilibrium if
, ;
, .
The above notion of stability is stronger than the Definition 1.1, in the sense that and have smallest eigenvalues bounded away from .
It will prove convenient to introduce some notation before introducing the main theorem. Let us define the following block-wise abbreviation for the matrix of second derivatives,
where denote the largest and smallest eigenvalue of matrix .
( defined in Eqn. (2)) obtains an -minimizer such that , as long as
It is interesting to compare the convergence speed of the saddle point dynamics to conventional gradient descent in one variable, for a strongly-convex function. We remind the reader that to obtain an -minimizer for a strongly-convex function, one needs the following number of iterations of gradient descent,
depending on whether we are optimizing with respect to or , respectively. It is now evident that due to the presence of , the convergence of two-player SGA to a saddle-point can be significantly slower than convergence of single-player gradient descent. In particular, applying the eigenvalue interlacing theorem to the principal submatrix of we obtain
Therefore the convergence of SGA is slower than that in the conventional GDRecall that .
We would like to emphasize that for the saddle point convergence, the slow-down effect of the interaction term is explicit in our non-asymptotic analysis.
The intuition that the discrete-time SGA dynamics cycles inward to a stable Nash equilibrium exponentially fast can be seen in the following way. The presence of the off-diagonal anti-symmetric component in Eqn. (4) means that the associated linear operator of the discrete-time dynamics has complex eigenvalues, which results in periodic cycling behavior. However, due to the explicit choice of , the distance to stable Nash equilibrium is shrinking exponentially fast. The local exponential stability in the infinitesimal/asymptotic case when has already been studied in a paper Nagarajan and Kolter (2017) (Theorem 3.1 therein) by showing the Jacobian matrix of a particular form of GAN objective is Hurwitz (has all strictly negative eigenvalues). There are two distinct differences in our result: (1) we provide non-asymptotic convergence, with specific guidance on the choice of learning rate ; (2) our analysis goes through analyzing the singular values (which is rather different from the modulus of eigenvalue for a general matrix), instead of involving the complex eigenvalues, and this simple technique generalizes to four other modified saddle point dynamics which we discuss in the next section.
In fact, one can show that the slow-down effect of the interaction term for SGA in the above theorem is indeed necessary.
The above corollary shows that for any stepsize , to obtain -solution, the number of SGA iteration is at least . Namely, the interaction term is indeed a curse to the convergence rate.
Unstable Case: Local Bi-Linear Problem
Oscillation and instability for SGA occurs when the problem is non-strongly convex-concave, as in the bi-linear game (or more precisely, at least linear in one player). This observation was first pointed out using a very simple linear game in Salimans et al. (2016). More generally, as a result of Theorem 1, this phenomenon occurs when the local Nash equilibrium is non-stable,
Let’s consider an extreme case when and . In this case, we will use a novel unified non-asymptotic analysis to show that the following proposed dynamics can fix the oscillation problem and provide exponential convergence to unstable Nash equilibria:
Optimistic Mirror Descent (OMD) in Daskalakis et al. (2017)
A modified version of Predictive Methods (PM) motivated from Yadav et al. (2017)
Consensus Optimization (CO) introduced in Mescheder et al. (2017)
Our analysis shows that these stabilizing techniques, at a high level, all manipulate the dynamics to utilize the curvature generated by the interaction term — which we refer to as the “blessing” of the interaction term, to contrast with the “slow-down effect” of the interaction term in the strongly convex-concave case (Theorem 1). Once again, as alluded to in the introduction, this fast linear-rate convergence result in the non-strongly convex-concave two-player game should be contrasted with the significantly slower sub-linear convergence rate for all first-order-methods in convex but non-strongly convex single-player optimization. The latter was proved by a lower bound argument in (Nesterov, 2013, Theorem 2.1.7). The main result proved in this section is informally stated
All these four modified dynamics, in the bi-linear game, enjoy the last iterate exponential convergence guarantee.
The bilinear game can be motivated by considering the Taylor expansion of a general smooth two-player game around a non-stable Nash equilibrium (), assuming that . Now consider the simple bi-linear game . With the SGA dynamics defined in (1), one can easily verify that
Therefore, the continuous limit is cycling around a sphere, while with any practical learning rate , the distance to the Nash equilibrium can be increasing exponentially instead of converging. Per Theorem 1 and the discussion above, instability for SGA only occurs when the local game is approximately bi-linear. From now on, therefore, we will focus on the simplest unstable form of the game, the bi-linear game, to isolate the main idea behind fixing the instability problem. The proof technique can be extended to more general settings, with a sacrifice of simplicity.
Daskalakis, Ilyas, Syrgkanis, and Zeng (2017) employed Optimistic Mirror Descent (OMD) (Rakhlin and Sridharan, 2013) motivated by online learning to solve the instability problem in GANs. Here we provide a stronger result, showing that the last iterate of OMD enjoys exponential convergence for bi-linear games. We note that although the last-iterate convergence of this OMD procedure was already rigorously proved in Daskalakis et al. (2017), the exponential convergence is not known to the best of our knowledge.
Consider a bi-linear game Assume and is full rank. Then the OMD dynamics,
obtains an -minimizer such that , provided
under the assumption that .
Let us compare our result with the last-iterate convergence result in Daskalakis et al. (2017). Roughly speaking, (Daskalakis et al., 2017, Theorem 1) asserts that to obtain an -minimizer, one requires a learning rate scaling as and a number of iterations bounded by
In contrast, we show that with step size chosen independently of , the last iterate of OMD falls within of the saddle point after a number of iterations given by
In other words, we improved the dependence on from polynomial to logarithmic. This improved analysis also coincides with the exponential convergence found in simulations.
2 (Modified) Predictive Methods
From a very different motivation in ordinary differential equations, Yadav et al. (2017) proposed Predictive Methods (PM) to fix the instability problem. The intuition is to evaluate the gradient at a predictive future location and then perform the update. In this section, we propose and analyze a modified version of the predictive method (for simultaneous gradient updates), inspired by Yadav et al. (2017).
Consider the following modified PM dynamics,
Consider a bi-linear game Assume and is full rank. Fix some . Then the PM dynamics in Eqn. (3.2) with learning rate
obtains an -minimizer such that , provided
under the assumption that .
3 Implicit Updates
Implicit Update (IU) rules have been shown to be more robust compared to explicit updates, and typically match the performance or even outperform the latter empirically in online learning (Kulis and Bartlett, 2010). We will show that a simple adaptation of implicit updates for simultaneous gradient ascent/descent resolves the instability problem in the bi-linear case.
Consider a bi-linear game Assume and is full rank. Then the implicit updates
obtains an -minimizer such that , provided
under the assumption that .
4 Consensus Optimization
Consensus Optimization (CO) is another elegant attempt to fix the aforementioned problem, proposed in Mescheder et al. (2017). The idea is to add a potential component to the pure-curl vector field associated with SGA in the bi-linear game, in order to attract the dynamics to the critical points. (Mescheder et al., 2017; Nagarajan and Kolter, 2017) analyzed the infinitesimal flow version of the consensus optimization, and intuitively showed that it pushes the real part of the eigenvalue away from , to ensure asymptotic convergence. In this section, we provide a simple convergence analysis of the discretized dynamics, of the same flavor as the previous section. An upshot of the analysis is that it sheds light on possible choices of learning rate.
Recall that the regularization term defining consensus optimization is given by,
Surprisingly, we find that the consensus optimization coincides with the modified predictive method for the bi-linear game, as described by the following
Consider a bi-linear game Assume and is full rank. Recall defined in Eqn. (8), and fix some . Then the CO dynamics with the same learning rate as in Thm. 4,
converges exponentially fast in the same way as the PM dynamics in Thm. 4.
Experiments
In the simplistic setting of bilinear games we have seen that exponential convergence can be achieved for appropriate choice of learning rate and this is indeed confirmed by numerical experiments as shown in Fig. 1. In reality, however, the assumption of bilinearity is not applicable to value functions of GAN form and indeed
recent large-scale studies of GAN optimization (Lucic et al., 2017) suggest that improvements from algorithmic changes mostly disappeared after taking into account hyper-parameter tuning and randomness of initialization. They conclude that “future GAN research should be based on more systematic and objective evaluation procedures.” Inspired by this conclusion, we conduct a systematic evaluation of the proposed optimization algorithms on two basic density learning problems, and introduce corresponding objective evaluation metrics. The goal of this analysis is not to achieve state-of-art performance, but rather to compare and contrast the existing proposals in a carefully controlled learning environment. We focus on the Wasserstein GAN formulation so that the value function is given by
The coefficients of the gradient penalty and consensus optimization terms were determined by a coarse parameter search and then locked to throughout. In order to make close contact with our theoretical formalism, we optimize the above loss functions using simultaneous gradient updates with fixed learning rate of .
The above analytical form of the value function sheds some light on the nature of the local Nash equilibrium solution concept. In particular, if one solves for the condition of being a Nash equilibrium, one does not conclude that . The result depends on the rank of the matrix .
The evaluation of different optimization algorithms involved comparing the target density and the analytical generator density after training iterations (Fig. 2). For simplicity, we chose the evaluation metric to be the Frobenius norm of the difference between the covariance matrices . The covariance learning experiments were conducted in the well-specified and over-parametrized regime (, ) using hidden units for the discriminator network.
2 Mixture of Gaussians
In practical applications, GANs are typically trained using the empirical distribution of the samples, where the samples are drawn from an idealized multi-modal probability distribution. To capture the notion of a multi-modal data distribution, we focus on a mixture of 8 Gaussians with means located at the vertices of a regular octagon inscribed in the unit circle, where each component has a fixed diagonal covariance of width . In contrast to previous visual-based evaluations, we estimate the Wasserstein-1 distance between the target density and the distribution of the random variable implied by the trained generator network. The estimate is obtained by solving the linear program which computes the earth mover’s distance between the sample estimates and , respectively, and approaches the population version as the number of samples .
The experiments with the mixture of Gaussians used 2 dimensional Gaussian as input (). Both the generator and discriminator networks consisted of hidden layers with units per hidden layer. The estimate of the Wasserstein-1 distance was calculated using a sample size of after training for iterations. It is clear from Fig. 3 that the Wasserstein-1 distance correlates closely with the visual fit to the target distribution. The empirical evaluation (Fig. 2) shows that the separation between consensus optimization and competing algorithms disappears on the mixture distribution, suggesting that the qualitative ranking is not robust to the choice of loss landscape. These findings demand deeper understanding of the global structure of the landscape, including the formulation of regularization to tame the notoriously difficult GAN optimization (Arbel et al., 2018), which is not captured by our local stability analysis.
Conclusions and Future Work
In this paper we made a first step towards understanding the local convergence rate of the discrete-time gradient-based saddle point dynamics for solving smooth two-player zero-sum games, including GANs as the leading motivation. The focus of the paper is on illustrating how local geometry affects the convergence speed and choice of learning-rate for both stable and unstable local Nash Equilibria. A curious fact we proved is that modified first-order dynamics such as OMD converge with linear-rate to unstable local Nash Equilibria, as a consequence of the interaction term between the two players.
We acknowledge that there are still critical steps left open by our analysis in order to understand the effectiveness of heuristic methods for training GANs for distribution learning. Solving this problem requires an understanding of the ability of various stable/unstable local Nash Equilibria to represent distributions in the statistical sense. To the best of our knowledge, it remains unclear whether convergence to the stable local solution concept (Definition 2.1) is better than converging/oscillating/escaping an unstable local solution, in terms of distribution learning. Recent progress by Daskalakis and Panageas (2018b, a) employs the stable manifold Theorem (Galor, 2007) to show that certain dynamics avoid unstable local solutions (barring initialization in a set of Lebesgue measure zero). Overall, a satisfactory theory — in both the computational and statistical sense — for answering how heuristic gradient-based saddle point dynamics for GANs are able to learn distributions is still wide open for future investigation.
References
Appendix A Technical Proofs
Define the line interpolation between two points,
Then the SGA dynamics can be written as (using Taylor’s theorem with remainder)
Assume that one can prove for some , and , with a proper choice of , the largest singular value is bounded above by 1,
Then due to convexity of the operator norm, the dynamics of SGA is contracting locally because,
assuming , . Abbreviate
is the square root of the largest eigenvalue of the following symmetric matrix
It is clear that when , the largest eigenvalue of the above matrix is . Observe
Therefore, to obtain an -minimizer one requires a number of steps equal to
Consider where is strictly positive. Then gradient descent corresponds to and thus . Setting we have so . Therefore \|\theta_{t}\|\leq\|\theta_{0}\|\big{[}1-\lambda_{\min}(A)/\lambda_{\max}(A)\big{]}^{t}\leq\|\theta_{0}\|e^{-t\lambda_{\min}(A)/\lambda_{\max}(A)}. The number of iterations required to obtain an -minimizer is thus bounded as .
If we define D=\operatorname*{diag}\big{(}(1-\eta)^{2}I+\eta^{2}CC^{T},(1-\eta)^{2}I+\eta^{2}C^{T}C\big{)} then using the Rayleigh quotient representation of we obtain,
regardless of the choice of , which proves the claim. ∎
Recall that the OMD dynamics iteratively updates
The commutative property follows from a singular value decomposition argument: Letting be the SVD of ( diagonal) one finds,
Using the above equality, the commutative property follows
Now we have the following relations for OMD,
Let’s analyze the singular values of and . We have,
For small enough, the spectral radius satisfies the strict inequality . Concretely, for example,
Therefore .
The RHS of Eqn. (13) is upper bounded because
By our assumption so
But so
one can ensure . ∎
In this case, the consensus optimization satisfies the following update
Again let’s analyze the singular values of the operator , or equivalently, the eigenvalues of ,
Now consider the largest eigenvalue of , for a fixed , with a properly chosen . Using the SVD , we obtain
In this case, the implicit update satisfies the update rule
Let’s analyze the singular values of the matrix , or equivalently the root of eigenvalues of
It is clear that the singular values of , denoted by is sandwiched between
If we choose , then
Using the fact that for all , , then
Recall Eqn. (22), using the fact for all , , we know
one can ensure . ∎
Note this linear system is the same as that in Thm. 6. Therefore the convergence analysis follows in the same way as Thm. 6. ∎