Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets

Mingrui Liu, Youssef Mroueh, Jerret Ross, Wei Zhang, Xiaodong Cui, Payel Das, Tianbao Yang

Introduction

Adaptive gradient algorithms (Duchi et al., 2011; Tieleman & Hinton, 2012; Kingma & Ba, 2014; Reddi et al., 2019) are very popular in training deep neural networks due to their computational efficiency and minimal need for hyper-parameter tuning (Kingma & Ba, 2014). For example, Adagrad (Duchi et al., 2011) automatically adjusts the learning rate for each dimension of the model parameter according to the information of history gradients, while its computational cost is almost the same as Stochastic Gradient Descent (SGD). However, in supervised deep learning (for example, image classification tasks using a deep convolutional neural network), there is not enough evidence showing that adaptive gradient methods converge faster than its non-adaptive counterpart (i.e., SGD) on benchmark datasets. For example, it is argued in (Wilson et al., 2017) that adaptive gradient methods often find a solution with worse performance than SGD. Specifically, Wilson et al. (2017) observed that Adagrad has slower convergence than SGD in terms of both training and testing error, while using VGG (Simonyan & Zisserman, 2014) on CIFAR10 data.

GANs (Goodfellow et al., 2014) are a popular class of generative models. In a nutshell, they consist of a generator and a discriminator, both of which are defined by deep neural networks. The generator and the discriminator are trained under an adversarial cost, corresponding to a non-convex non-concave min-max problem. GANs are known to be notoriously difficult to train. In practice, Adam (Kingma & Ba, 2014) is the defacto optimizer used for GAN training. The common optimization strategy is to alternatively update the discriminator and the generator (Arjovsky et al., 2017; Gulrajani et al., 2017). Using Adam is important in GAN training, since replacing it with non-adaptive methods (e.g. SGD) would significantly deteriorate the performance. This paper studies and attempts to answer the following question:

Why do adaptive gradient methods outperform their non-adaptive counterparts in GAN training?

We analyze a variant of Optimistic Stochastic Gradient (OSG) in (Daskalakis & Panageas, 2018) and propose an adaptive variant named Optimistic Adagrad (OAdagrad) for solving a class of non-convex non-concave min-max problems. Both of them are shown to enjoy state-of-the-art complexities. We further prove that the convergence rate of OAdagrad to an ϵ\epsilon-first-order stationary point depends on the growth rate of the cumulative stochastic gradient. In our experiments, we observed an interesting phenomenon while using adaptive gradient methods for training GANs: the cumulative stochastic gradient grows at a slow rate. This observation is in line with the prediction of our theory suggesting improved convergence rate for OAdagrad in GAN training, when the growth rate of the cumulative stochastic gradient is slow.

Since GAN is a min-max optimization problem in nature, our problem of interest is to solve the following stochastic optimization problem:

where U\mathcal{U}, V\mathcal{V} are closed and convex sets, F(u,v)F(\mathbf{u},\mathbf{v}) is possibly non-convex in u\mathbf{u} and non-concave in v\mathbf{v}. ξ\xi is a random variable following an unknown distribution D\mathcal{D}. In GAN training, u\mathbf{u}, v\mathbf{v} represent the parameters of generator and discriminator respectively.

The ideal goal for solving (1) is to find a saddle point (u,v)U×V(\mathbf{u}_{*},\mathbf{v}_{*})\in\mathcal{U}\times\mathcal{V} such that F(u,v)F(u,v)F(u,v)F(\mathbf{u}_{*},\mathbf{v})\leq F(\mathbf{u}_{*},\mathbf{v}_{*})\leq F(\mathbf{u},\mathbf{v}_{*}) for uU,vV\forall\mathbf{u}\in\mathcal{U},\forall\mathbf{v}\in\mathcal{V}.

To achieve this goal, the typical assumption usually made is that the objective function is convex-concave. When F(u,v)F(\mathbf{u},\mathbf{v}) is convex in u\mathbf{u} and concave in v\mathbf{v}, non-asymptotic guarantee in terms of the duality gap is well established by a series of work (Nemirovski & Yudin, 1978; Nemirovski, 2004; Nesterov, 2007; Nemirovski et al., 2009; Juditsky et al., 2011). However, when F(u,v)F(\mathbf{u},\mathbf{v}) is non-convex in u\mathbf{u} and non-concave in v\mathbf{v}, finding the saddle point is NP-hard in general. Instead, we focus on finding the first-order stationary point provided that the objective function is smooth. I.e. we aim to find (u,v)U×V(\mathbf{u},\mathbf{v})\in\mathcal{U}\times\mathcal{V} such that uF(u,v)=0\nabla_{\mathbf{u}}F(\mathbf{u},\mathbf{v})=0, vF(u,v)=0\nabla_{\mathbf{v}}F(\mathbf{u},\mathbf{v})=0. Note that this is a necessary condition for finding the (local) saddle point.

To avoid the cost of an additional oracle call in extragradient step, several studies (Chiang et al., 2012; Rakhlin & Sridharan, 2013; Daskalakis et al., 2017; Gidel et al., 2018; Xu et al., 2019) proposed single-call variants of the extragradient algorithm. Some of them focus on the convex setting (e.g. (Chiang et al., 2012; Rakhlin & Sridharan, 2013)), while others focus on the non-convex setting (Xu et al., 2019). The closest to our work is the work by (Daskalakis et al., 2017; Gidel et al., 2018), where the min-max setting and GAN training are considered. However, the convergence of those algorithms is only shown for a class of bilinear problems in (Daskalakis et al., 2017) and for monotone variational inequalities in (Gidel et al., 2018). Hence a big gap remains between the specific settings studied in (Daskalakis et al., 2017; Gidel et al., 2018) and more general non-convex non-concave min-max problems. Table 1 provides a complete overview of our results and existing results. It is hard to give justice to the large body of work on min-max optimization, so we refer the interested reader to Appendix B that gives a comprehensive survey of related previous methods that are not covered in this Table.

Our main goal is to design stochastic first-order algorithms with low iteration complexity, low per-iteration cost and suitable for a general class of non-convex non-concave min-max problems. The main tool we use in our analysis is variational inequality.

Our main contributions are summarized as follows:

Following (Daskalakis et al., 2017), we extend optimistic stochastic gradient (OSG) analysis beyond the bilinear and unconstrained case, by assuming the Lipschitz continuity of the operator TT and the existence of a solution for the variational inequality MVI(T,X)\text{MVI}(T,\mathcal{X}). These conditions were considered in the analysis of the stochastic extragradient algorithm in (Iusem et al., 2017). We analyze a variant of Optimistic Stochastic Gradient (OSG) under these conditions, inspired by the analysis of (Iusem et al., 2017). We show that OSG achieves state-of-the-art iteration complexity O(1/ϵ4)O(1/\epsilon^{4}) for finding an ϵ\epsilon-first-order stationary point. Note that our OSG variant only requires invoking one stochastic first-order oracle while enjoying the state-of-the-art iteration complexity achieved by stochastic extragradient method (Iusem et al., 2017).

Under the same conditions, we design an adaptive gradient algorithm named Optimistic Adagrad (OAdagrad), and show that it enjoys better adaptive complexity O(ϵ21α)O\left(\epsilon^{-\frac{2}{1-\alpha}}\right), where α\alpha characterizes the growth rate of cumulative stochastic gradient and 0α1/20\leq\alpha\leq 1/2. Similar to Adagrad (Duchi et al., 2011), our main innovation is in considering variable metrics according to the geometry of the data in order to achieve potentially faster convergence rate for a class of nonconvex-nonconcave min-max games. Note that this adaptive complexity improves upon the non-adaptive one (i.e. O(1/ϵ4)O(1/\epsilon^{4})) achieved by OSG. To the best of our knowledge, we establish the first known adaptive complexity for adaptive gradient algorithms in a class of non-convex non-concave min-max problems.

We demonstrate the effectiveness of our algorithms in GAN training on CIFAR10 data. Empirical results identify an important reason behind why adaptive gradient methods behave well in GANs, which is due to the fact that the cumulative stochastic gradient grows in a slow rate. We also show that OAdagrad outperforms Simultaneous Adam in sample quality in ImageNet generation using self-attention GANs (Zhang et al., 2018). This confirms the superiority of OAdagrad in min-max optimization.

Preliminaries and Notations

In this section, we fix some notations and give formal definitions of variational inequalities, and their relationship to the min-max problem (1).

An operator TT is monotone if T(x)T(y),xy0\langle T(\mathbf{x})-T(\mathbf{y}),\mathbf{x}-\mathbf{y}\rangle\geq 0 for x,yX\forall\mathbf{x},\mathbf{y}\in\mathcal{X}. An operator TT is pseudo-monotone if T(x),yx0T(y),yx0\langle T(\mathbf{x}),\mathbf{y}-\mathbf{x}\rangle\geq 0\Rightarrow\langle T(\mathbf{y}),\mathbf{y}-\mathbf{x}\rangle\geq 0 for x,yX\forall\mathbf{x},\mathbf{y}\in\mathcal{X}. An operator TT is γ\gamma-strongly-monotone if T(x)T(y),xyγ2xy2\langle T(\mathbf{x})-T(\mathbf{y}),\mathbf{x}-\mathbf{y}\rangle\geq\frac{\gamma}{2}\|\mathbf{x}-\mathbf{y}\|^{2} for x,yX\forall\mathbf{x},\mathbf{y}\in\mathcal{X}.

We give here formal definitions of monotonic operators TT and the ϵ\epsilon-first-order stationary point.

A point xX\mathbf{x}\in\mathcal{X} is called ϵ\epsilon-first-order stationary point if T(x)ϵ\|T(\mathbf{x})\|\leq\epsilon.

Remark: We make the following observations:

From the definition, it is evident that strong-monotonicity \Rightarrow monotonicity \Rightarrowpseudo-monotonicity. Assuming SVI has a solution and pseudo-monotonicity of the operator TT imply that MVI(T,X)\text{MVI}(T,\mathcal{X}) has a solution. To see that, assume that SVI has a nonempty solution set, i.e. there exists x\mathbf{x}_{*} such that T(x),yx0\langle T(\mathbf{x}_{*}),\mathbf{y}-\mathbf{x}_{*}\rangle\geq 0 for any y\mathbf{y}. Noting that pseudo-monotonicity means that for every y,x\mathbf{y},\mathbf{x}, T(x),yx0\langle T(\mathbf{x}),\mathbf{y}-\mathbf{x}\rangle\geq 0 implies T(y),yx0\langle T(\mathbf{y}),\mathbf{y}-\mathbf{x}\rangle\geq 0, we have T(y),yx0\langle T(\mathbf{y}),\mathbf{y}-\mathbf{x}_{*}\rangle\geq 0 for any y\mathbf{y}, which means that x\mathbf{x}_{*} is the solution of Minty variational inequality. Note that the reverse may not be true and an example is provided in Appendix G.

For the min-max problem (1), when F(u,v)F(\mathbf{u},\mathbf{v}) is convex in u\mathbf{u} and concave in v\mathbf{v}, TT is monotone. And, therefore solving SVI(T,X)\text{SVI}(T,\mathcal{X}) is equivalent to solving (1). When TT is not monotone, by assuming TT is Lipschitz continuous, it can be shown that the solution set of (1) is a subset of the solution set of SVI(T,X)\text{SVI}(T,\mathcal{X}). However, even solving SVI(T,X)\text{SVI}(T,\mathcal{X}) is NP-hard in general and hence we resort to finding an ϵ\epsilon-first-order stationary point.

Throughout the paper, we make the following assumption:

TT is LL-Lipschitz continuous, i.e. T(x1)T(x2)2Lx1x22\|T(\mathbf{x}_{1})-T(\mathbf{x}_{2})\|_{2}\leq L\|\mathbf{x}_{1}-\mathbf{x}_{2}\|_{2} for x1,x2X\forall\mathbf{x}_{1},\mathbf{x}_{2}\in\mathcal{X}.

MVI(T,X)\text{MVI}(T,\mathcal{X}) has a solution, i.e. there exists x\mathbf{x}_{*} such that T(x),xx0\left\langle T(\mathbf{x}),\mathbf{x}-\mathbf{x}_{*}\right\rangle\geq 0 for xX\forall\mathbf{x}\in\mathcal{X}.

Remark: Assumptions (i) and (iii) are commonly used assumptions in the literature of variational inequalities and non-convex optimization (Juditsky et al., 2011; Ghadimi & Lan, 2013; Iusem et al., 2017). Assumption (ii) is used frequently in previous work focusing on analyzing algorithms that solve non-monotone variational inequalities (Iusem et al., 2017; Lin et al., 2018; Mertikopoulos et al., 2018). Assumption (ii) is weaker than other assumptions usually considered, such as pseudo-monotonicity, monotonicity, or coherence as assumed in (Mertikopoulos et al., 2018). For non-convex minimization problem, it has been shown that this assumption holds while using SGD to learn neural networks (Li & Yuan, 2017; Kleinberg et al., 2018; Zhou et al., 2019).

Optimistic Stochastic Gradient

This section serves as a warm-up and motivation of our main theoretical contribution presented in the next section. Inspired by (Iusem et al., 2017), we present an algorithm called Optimistic Stochastic Gradient (OSG) that saves the cost of the additional oracle call as required in (Iusem et al., 2017) and maintains the same iteration complexity. The main algorithm is described in Algorithm 1, where mtm_{t} denotes the minibatch size for estimating the first-order oracle. It is worth mentioning that Algorithm 1 becomes stochastic extragradient method if one changes T(zk1;ξk1i)T(\mathbf{z}_{k-1};\xi_{k-1}^{i}) to T(xk1;ξk1i)T(\mathbf{x}_{k-1};\xi_{k-1}^{i}) in line 3. Stochastic extragradient method requires to compute stochastic gradient over both sequences {xk}\{\mathbf{x}_{k}\} and {zk}\{\mathbf{z}_{k}\}. In contrast, {xk}\{\mathbf{x}_{k}\} is an ancillary sequence in OSG and the stochastic gradient is only computed over the sequence of {zk}\{\mathbf{z}_{k}\}. Thus, stochastic extragradient method is twice as expensive as OSG in each iteration. In some tasks (e.g. training GANs) where the stochastic gradient computation is expensive, OSG is numerically more appealing.

The detailed derivation of (2) can be found in Appendix F.

Suppose that Assumption 1 holds. Let rα(zk)=zkΠX(zkαT(zk))r_{\alpha}(\mathbf{z}_{k})=\left\|\mathbf{z}_{k}-\Pi_{\mathcal{X}}\left(\mathbf{z}_{k}-\alpha T(\mathbf{z}_{k})\right)\right\|. Let η1/9L\eta\leq 1/9L and run Algorithm 1 for NN iterations. Then we have

Remark: There are two implications of Corollary 1.

Optimistic Adagrad

Before introducing Optimistic Adagrad, we present here a quick overview of Adagrad (Duchi et al., 2011). The main objective in Adagrad is to solve the following minimization problem:

where w\mathbf{w} is the model parameter, and ζ\zeta is an random variable following distribution P\mathcal{P}. The update rule of Adagrad is

where η>0\eta>0, g^t=f(wt;ζt)\hat{\mathbf{g}}_{t}=\nabla f(\mathbf{w}_{t};\zeta_{t}), Ht=diag((i=1tg^ig^i)12)H_{t}=\text{diag}\left(\left(\sum_{i=1}^{t}\hat{\mathbf{g}}_{i}\circ\hat{\mathbf{g}}_{i}\right)^{\frac{1}{2}}\right) with \circ denoting the Hadamard product. Adagrad when taking Ht=IH_{t}=I reduces to SGD. Different from SGD, Adagrad dynamically incorporates knowledge of history gradients to perform more informative gradient-based learning. When solving a convex minimization problem and the gradient is sparse, Adagrad converges faster than SGD. There are several variants of Adagrad, including Adam (Kingma & Ba, 2014), RMSProp (Tieleman & Hinton, 2012), and AmsGrad (Reddi et al., 2019). All of them share the spirit, as they take advantage of the information provided by the history of gradients. Wilson et al. (2017) provide a complete overview of different adaptive gradient methods in a unified framework. It is worth mentioning that Adagrad can not be directly applied to solve non-convex non-concave min-max problems with provable guarantee.

2 Optimistic Adagrad for min-max optimization

There exists G>0G>0 and δ>0\delta>0 such that T(z;ξ)2G\left\|T(\mathbf{z};\xi)\right\|_{2}\leq G, T(z;ξ)δ\left\|T(\mathbf{z};\xi)\right\|_{\infty}\leq\delta for all z\mathbf{z} almost surely.

There exists a universal constant D>0D>0 such that xk2D/2\|\mathbf{x}_{k}\|_{2}\leq D/2 for k=1,,Nk=1,\ldots,N, and x2D/2\|\mathbf{x}_{*}\|_{2}\leq D/2.

Remark: Assumption 2 (i) is a standard one often made in literature (Duchi et al., 2011). Assumption 2 (ii) holds when we use normalization layers in the discriminator and generator such as spectral normalization of weights (Miyato et al., 2018; Zhang et al., 2018), that will keep the norms of the weights bounded. Regularization techniques such as weight decay also ensure that the weights of the networks remain bounded throughout the training.

Define g^k=1mi=1mT(zk;ξki)\text{Define }\widehat{\mathbf{g}}_{k}=\frac{1}{m}\sum_{i=1}^{m}T(\mathbf{z}_{k};\xi_{k}^{i}), xH=x,Hx\|\mathbf{x}\|_{H}=\sqrt{\left\langle\mathbf{x},H\mathbf{x}\right\rangle}. Denote g^0:k\widehat{\mathbf{g}}_{0:k} by the concatenation of g^0,,g^k\widehat{\mathbf{g}}_{0},\ldots,\widehat{\mathbf{g}}_{k}, and denote g^0:k,i\widehat{\mathbf{g}}_{0:k,i} by the ii-th row of g^0:k\widehat{\mathbf{g}}_{0:k}.

Suppose Assumption 1 and 2 hold. Suppose g^1:k,i2δkα\|\widehat{\mathbf{g}}_{1:k,i}\|_{2}\leq\delta k^{\alpha} with 0α1/20\leq\alpha\leq 1/2 for every i=1,,di=1,\ldots,d and every k=1,,Nk=1,\ldots,N. When ηδ9L\eta\leq\frac{\delta}{9L}, after running Algorithm 2 for NN iterations, we have

We denote g^1:k\widehat{\mathbf{g}}_{1:k} by the cumulative stochastic gradient, where g^1:k,i2δkα\|\widehat{\mathbf{g}}_{1:k,i}\|_{2}\leq\delta k^{\alpha} characterizes the growth rate of the gradient in terms of ii-th coordinate. In our proof, a key quantity is i=1dg^1:k,i2\sum_{i=1}^{d}\|\widehat{\mathbf{g}}_{1:k,i}\|_{2} that crucially affects the computational complexity of Algorithm 2. Since i=1dg^1:k,i2δdkα\sum_{i=1}^{d}\|\widehat{\mathbf{g}}_{1:k,i}\|_{2}\leq\delta dk^{\alpha}, in the worst case, α=12\alpha=\frac{1}{2}. But in practice, the stochastic gradient is usually sparse, and hence α\alpha can be strictly smaller than 12\frac{1}{2}.

As shown in Theorem 2, the minibatch size used in Algorithm 2 for estimating the first-order oracle can be any positive constant and independent of ϵ\epsilon. This is more practical than the results established in Theorem 1, since the minibatch size in Theorem 1 does either increase in terms of number of iterations or is dependent on ϵ\epsilon. When α=12\alpha=\frac{1}{2}, the complexity of Algorithm 2 is O(1/ϵ4)O(1/\epsilon^{4}), which matches the complexity stated in Theorem 1. When α<12\alpha<\frac{1}{2}, the complexity of OAdagrad given in Algorithm 2 is O(ϵ21α)O\left(\epsilon^{-\frac{2}{1-\alpha}}\right), i.e., strictly better than that of OSG given in Algorithm 1.

Alternating Adam is very popular in GAN training (Goodfellow et al., 2014; Arjovsky et al., 2017; Gulrajani et al., 2017; Brock et al., 2018). In Alternating Adam, one alternates between multiple steps of Adam on the discriminator and a single step of Adam on the generator. The key difference between OAdagrad and Alternating Adam is that OAdagrad updates the discriminator and generator simultaneously. It is worth mentioning that OAdagrad naturally fits into the framework of Optimistic Adam proposed in (Daskalakis et al., 2017). Taking β1=0,β21\beta_{1}=0,\beta_{2}\rightarrow 1 in their Algorithm 1 reduces to OAdagrad with annealing learning rate. To the best of our knowledge, there is no convergence proof for Alternating Adam for non-convex non-concave problems. Our convergence proof for OAdagrad provides a theoretical justification of a special case of Optimistic Adam.

Experiments

In the first experiment, we verify the effectiveness of the proposed algorithms in GAN training using the PyTorch framework (Paszke et al., 2017). We use Wasserstein GAN with gradient penalty (WGAN-GP) (Gulrajani et al., 2017) and CIFAR10 data in our experiments. The architectures of discriminator and generator, and the penalty parameter in WGAN-GP are set to be same as in the original paper. We compare Alternating Adam, OSG and OAdagrad, where the Alternating Adam is to run 55 steps of Adam on the discriminator before performing 11 step of Adam on the generator. We try different batch sizes (64,128,256)(64,128,256) for each algorithm. For each algorithm, we tune the learning rate in the range of {1×103,2×104,1×104,2×105,1×105}\{1\times 10^{-3},2\times 10^{-4},1\times 10^{-4},2\times 10^{-5},1\times 10^{-5}\} when using batch size 64, and use the same learning rate for batch size 128128 and 256256. We report Inception Score (IS) (Salimans et al., 2016) as a function of number of iterations. Figure 1 suggests that OAdagrad performs better than OSG and Alternating Adam, and OAdagrad results in higher IS. We compare the generated CIFAR10 images associated with these three methods, which is included in Appendix A. We also provide experimental results to compare the performance of different algorithms using different minibatch sizes, which are included in Appendix E.

Growth Rate of Cumulative Stochastic Gradient

In the second experiment, we employ OAdagrad to train GANs and study the growth rate of the cumulative stochastic gradient (i.e., i=1dg^1:N,i2\sum_{i=1}^{d}\|\widehat{\mathbf{g}}_{1:N,i}\|_{2}). We tune the learning rate from {1×103,2×104,1×104,2×105,1×105}\{1\times 10^{-3},2\times 10^{-4},1\times 10^{-4},2\times 10^{-5},1\times 10^{-5}\} and choose batch size to be 64. In Figure 2, the blue curve and red curve stand for the growth rate for OAdagrad and its corresponding tightest polynomial growth upper bound respectively. NN is the number of iterations, and cc is a multiplicative constant such that the red curve and blue curve overlaps at the starting point of the training. The degree of the polynomial is determined using binary search. We can see that the growth rate of cumulative stochastic gradient grows very slowly in GANs (the worst-case polynomial degree is 0.50.5, but it is 0.20.2 for WGAN-GP on CIFAR10 and 0.070.07 for WGAN on LSUN Bedroom dataset). As predicted by our theory, this behavior explains the faster convergence of OAdagrad versus OSG, consistent with what is observed empirically in Figure 1.

Self-Attention GAN on ImageNet

In the third experiment, we consider GAN training on large-scale dataset. We use the model from Self-Attention GAN (Zhang et al., 2018) (SA-GAN) and ImageNet as our dataset. Note that in this setting the boundedness of both generator (GG) and discriminator (DD) is ensured by spectral normalization of both GG and DD. Three separate experiments are performed, including Alternating Adam (baseline), Simultaneous Adam (Mescheder et al., 2017), and OAdagrad. It should be mentioned that the update rule of Simultaneous Adam involves performing Adam-type update for discriminator and generator simultaneously. Training is performed with batch size 128 for all experiments.

For the baseline experiment (Alternating Adam) we use the default settings and hyper parameters reported in SA-GAN (Zhang et al., 2018) (note that we are not using the same batch size of 256256 as in (Zhang et al., 2018) due to limited computational resources). In our experience, Alternating Adam training for a batch size of 128128 with same learning rate as in SA-GAN (0.00010.0001 for generator and 0.00040.0004 for discriminator) collapsed. This does not mean that Alternating Adam fails, it just needs more tuning to find the correct range of learning rates for the particular batch size we have. With the hyperparameters ranges we tried Alternating Adam collapsed, with extra tuning efforts and an expensive computational budget Alternating Adam would eventually succeed. This is inline with the large scale study in (Lucic et al., 2018) that states that given a large computational budget for tuning hyper-parameters most GANs training succeed equally.

For both OAdagrad and Simultaneous Adam, we use different learning rate for generator and discriminator, as suggested in (Heusel et al., 2017). Specifically, the learning rates used are 10310^{-3} for the generator and 4×1054\times 10^{-5} for the discriminator. We report both Inception Score (IS) and Fréchet Inception Distance (Heusel et al., 2017) (FID) as a function of number of iterations.

We compare the generated ImageNet images associated with the three optimization methods in Appendix 4. Since Alternating Adam collapsed we don’t report its Inception Score or FID. As it can be seen in Figure 3 and Appendix 4, OAdagrad outperforms simultaneous Adam in quantitative metrics (IS and FID) and in sample quality generation. Future work will include investigating whether OAdagrad would benefit from training with larger batch size, in order to achieve state-of-the-art results.

Conclusion

In this paper, we explain the effectiveness of adaptive gradient methods in training GANs from both theoretical and empirical perspectives. Theoretically, we provide two efficient stochastic algorithms for solving a class of min-max non-convex non-concave problems with state-of-the-art computational complexities. We also establish adaptive complexity results for an Adagrad-style algorithm by using coordinate-wise stepsize according to the geometry of the history data. The algorithm is proven to enjoy faster adaptive convergence than its non-adaptive counterpart when the gradient is sparse, which is similar to Adagrad applied to convex minimization problem. We have conducted extensive empirical studies to verify our theoretical findings. In addition, our experimental results suggest that the reason why adaptive gradient methods deliver good practical performance for GAN training is due to the slow growth rate of the cumulative stochastic gradient.

Acknowledgments

The authors thank the anonymous reviewers for their helpful comments. M. Liu and T. Yang are partially supported by National Science Foundation CAREER Award 1844403. M. Liu would like to thank Xiufan Yu from Pennsylvania State University and Zehao Dou from Yale University for helpful discussions.

References

Appendix A More Experimental Results

Comparison of Generated CIFAR10 Images by Different Optimization Methods In this section, we report the generated CIFAR10 images during the training of WGAN-GP by three optimization methods (OSG, OAdagrad, Alternating Adam). Every method uses batch size 64, and 1 iteration represents calculating the stochastic gradient with minibatch size 64 once. Figure 4 consists of images by three optimization methods at iteration 8000. Visually we can see that OAdagrad is better than Alternating Adam, and both of them are significantly better than OSG. It is consistent with the inception score results reported in Figure 1, and it also illustrates the tremendous benefits delivered by adaptive gradient methods when training GANs.

Comparison of Generated ImageNet Images by Different Optimization Methods In this section, we report the generated ImageNet images during the training of Self-Attention GAN by three optimization methods (OAdagrad, Simultaneous Adam, Alternating Adam). Every method uses batch size 128 and 1 iteration represents calculating the stochastic gradient with minibatch 128 once. Figure 5 consists of images by three optimization methods at iteration 135000. Visually it is apparent that OAdagrad is better than Simultaneous Adam, and both of them are significantly than Alternating Adam.

Unofficial PyTorch Inception Score and FID results for SA-GAN on ImageNet

Appendix B Related Work

For convex-concave min-max optimization, the extragradient method was first proposed by (Korpelevich, 1976). Later on, under gradient Lipschitz condition, Nemirovski (2004) extended the idea of extragradient to mirror-prox and obtained the O(1/N)O(1/N) convergence rate in terms of the duality gap (see also (Nesterov, 2007)), where NN is the number of iterations. When only the stochastic first-order oracle is available, the stochastic mirror-prox was analyzed by (Juditsky et al., 2011). The convergence rates for both deterministic and stochastic mirror-prox are optimal (Nemirovsky & Yudin, 1983). Recently, Zhao (2019) developed a nearly-optimal stochastic first-order algorithm when the primal variable is strongly convex in the primal variable. Bach & Levy (2019) proposed a universal algorithm that is adaptive to smoothness and noise, and simultaneously achieves optimal convergence rate.

There is a plethora of work analyzing one-sided nonconvex min-max problem, where the objective function is nonconvex in the minimization variable but concave in maximization variable. When the function is weakly-convex in terms of the minimization variable, Rafique et al. (2018) propose a stage-wise stochastic algorithm that approximately solves a convex-concave subproblem by adding a quadratic regularizer and show the first-order convergence of the equivalent minimization problem. Under the same setting, Lu et al. (2019) utilize block-based optimization strategy and show the convergence of the stationarity gap. By further assuming that the function is smooth in the minimization variable, Lin et al. (2019) show that (stochastic) gradient descent ascent is able to converge to the first-order stationary point of the equivalent minimization problem. Liu et al. (2020) cast the problem of stochastic AUC maximization with deep neural networks into a nonconvex-concave min-max problem, show the PL (Polyak-Łojasiewicz) condition holds for the objective of the outer minimization problem, and propose an algorithm and establish its fast convergence rate.

A more challenging problem is the non-convex non-concave min-max problem. Dang & Lan (2015) demonstrate that the deterministic extragradient method is able to converge to ϵ\epsilon-first-order stationary point with non-asymptotic guarantee. Under the condition that the objective function is weakly-convex and weakly-concave, Lin et al. (2018) designs a stage-wise algorithm, where in each stage a strongly-convex strongly-concave subproblem is constructed by adding quadratic terms and appropriate stochastic algorithms can be employed to approximately solve it. They also show the convergence to the stationary point. Sanjabi et al. (2018) design an alternating deterministic optimization algorithm, in which multiple steps of gradient ascent for dual variable are conducted before one step of gradient descent for primal variable is performed. They show the convergence to stationary point based on the assumption that the inner maximization problem satisfies PL condition (Polyak, 1969). Our work is different from these previous methods in many aspects. In comparison to (Lin et al., 2018), our result does not need the bounded domain assumption. Furthermore, our iteration complexity is O(1/ϵ4)O(1/\epsilon^{4}) to achieve ϵ\epsilon-first-order stationary point while the corresponding complexity in (Lin et al., 2018) is O(1/ϵ6)O(1/\epsilon^{6}). When comparing to (Sanjabi et al., 2018), we do not assume that the PL (Polyak-Łojasiewicz) condition holds. Additionally, our algorithm is stochastic and not restricted to the deterministic case. Apparently the most related work to the present one is (Iusem et al., 2017). The stochastic extragradient method analyzed in (Iusem et al., 2017) requires calculation of two stochastic gradients per iteration, while the present algorithm only needs one since it memorizes the stochastic gradient in the previous iteration to guide the update in the current iteration. Nevertheless, we achieve the same iteration complexity as in (Iusem et al., 2017).

There are a body of work analyzing the convergence behavior of min-max optimization algorithms and its application in training GANs (Heusel et al., 2017; Daskalakis & Panageas, 2018; Nagarajan & Kolter, 2017; Grnarova et al., 2017; Yadav et al., 2017; Gidel et al., 2018; Mertikopoulos et al., 2018; Mazumdar et al., 2019). A few of them (Heusel et al., 2017; Daskalakis & Panageas, 2018; Mazumdar et al., 2019) only have asymptotic convergence. Others (Nagarajan & Kolter, 2017; Grnarova et al., 2017; Daskalakis et al., 2017; Yadav et al., 2017; Gidel et al., 2018; Mertikopoulos et al., 2018) focus on more restricted settings. For example, Nagarajan & Kolter (2017); Grnarova et al. (2017) require the concavity of the objective function in terms of dual variable. Yadav et al. (2017); Gidel et al. (2018) assume the objective to be convex-concave. Mertikopoulos et al. (2018) imposes the so-called coherence condition which is stronger than our assumption. Daskalakis et al. (2017) analyze the last-iteration convergence for bilinear problem. Recently, Gidel et al. (2019) analyze the benefits of using negative momentum in alternating gradient descent to improve the training of a bilinear game. Chavdarova et al. (2019) develop a variance-reduced extragradient method and shows its linear convergence under strong monotonicity and finite-sum structure assumptions. Azizian et al. (2019) provide a unified analysis of extragradient for bilinear game, strongly monotone case, and their intermediate cases. However, none of them give non-asymptotic convergence results for the class of non-convex non-concave min-max problem considered in our paper.

Appendix C Proof of Theorem 1

C.2 Lemmas

Let xX\mathbf{x}_{*}\in\mathcal{X}^{*}, where X\mathcal{X}^{*} is the set of optimal solutions of MVI(T,X)(T,\mathcal{X}), i.e. T(x),xx0\left\langle T(\mathbf{x}),\mathbf{x}-\mathbf{x}_{*}\right\rangle\geq 0 holds for xX\forall\mathbf{x}\in\mathcal{X}. Define ϵk=1mki=1mkT(zk,ξki)T(zk)\epsilon_{k}=\frac{1}{m_{k}}\sum_{i=1}^{m_{k}}T(\mathbf{z}_{k},\xi_{k}^{i})-T(\mathbf{z}_{k}), and T^(ϵk,zk)=T(zk)+ϵk\widehat{T}(\epsilon_{k},\mathbf{z}_{k})=T(\mathbf{z}_{k})+\epsilon_{k}. For any xX\mathbf{x}\in\mathcal{X}, we have

where (a) holds by using Fact 1. Note that

where the last inequality holds by the fact that xzk,T(zk)0\left\langle\mathbf{x}_{*}-\mathbf{z}_{k},T(\mathbf{z}_{k})\right\rangle\leq 0 since x\mathbf{x}_{*} is a solution of MVI(T,X)\text{MVI}(T,\mathcal{X}). Note that

where (a) holds by xkzk,xk1ηT^(ϵk1,zk1)zk0\left\langle\mathbf{x}_{k}-\mathbf{z}_{k},\mathbf{x}_{k-1}-\eta\widehat{T}(\epsilon_{k-1},\mathbf{z}_{k-1})-\mathbf{z}_{k}\right\rangle\leq 0 and Cauchy-Schwartz inequality, where the former inequality comes from Fact 2 and the update rules of the algorithm, (b) holds by the update rule of zk\mathbf{z}_{k} and xk\mathbf{x}_{k}, (c) holds by the nonexpansion property of the projection operator, (d) holds since TT is LL-Lipschitz continuous, (e) holds since (a+b+c)23a2+3b2+3c2(a+b+c)^{2}\leq 3a^{2}+3b^{2}+3c^{2}.

Define Λk=2xzk,ηϵk\Lambda_{k}=2\langle\mathbf{x}_{*}-\mathbf{z}_{k},\eta\epsilon_{k}\rangle. Taking x=x\mathbf{x}=\mathbf{x}_{*} in (8) and combining (9) and (10), we have

Take summation over k=1,,Nk=1,\ldots,N in (12) and note that x0=z0\mathbf{x}_{0}=\mathbf{z}_{0}, which yields

By taking η19L\eta\leq\frac{1}{9L}, we have 136η2L2121-36\eta^{2}L^{2}\geq\frac{1}{2}, and we have the result. ∎

C.3 Main Proof of Theorem 1

Define rη(zk)=zkΠX(zkηT(zk))r_{\eta}(\mathbf{z}_{k})=\left\|\mathbf{z}_{k}-\Pi_{\mathcal{X}}\left(\mathbf{z}_{k}-\eta T(\mathbf{z}_{k})\right)\right\|. Our goal is to get a bound on rη(zk)r_{\eta}(\mathbf{z}_{k}). We have:

where (a) holds since (a+b)22a2+2b2(a+b)^{2}\leq 2a^{2}+2b^{2}, (b) holds by the non-expansion property of the projection operator and (a+b)22a2+2b2(a+b)^{2}\leq 2a^{2}+2b^{2}.

Let xX\mathbf{x}_{*}\in\mathcal{X}^{*}, where X\mathcal{X}^{*} is the set of optimal solutions of MVI(T,X)\text{MVI}(T,\mathcal{X}), i.e. T(x),xx0\left\langle T(\mathbf{x}),\mathbf{x}-\mathbf{x}_{*}\right\rangle\geq 0 holds for xX\forall\mathbf{x}\in\mathcal{X}. Define ϵk=1mki=1mkT(zk,ξki)T(zk)\epsilon_{k}=\frac{1}{m_{k}}\sum_{i=1}^{m_{k}}T(\mathbf{z}_{k},\xi_{k}^{i})-T(\mathbf{z}_{k}), and T^(ϵk,zk)=T(zk)+ϵk\widehat{T}(\epsilon_{k},\mathbf{z}_{k})=T(\mathbf{z}_{k})+\epsilon_{k}. Define Λk=2xzk,ηϵk\Lambda_{k}=2\langle\mathbf{x}_{*}-\mathbf{z}_{k},\eta\epsilon_{k}\rangle.

By summing over kk in Equation (14) and using Equation (7) in Lemma 1, we have

Taking expectation and divided by NN on both sides, we have

Appendix D Proof of Theorem 2

In this section, we define gk=T(zk)\mathbf{g}_{k}=T(\mathbf{z}_{k}), ϵk=g^kgk\epsilon_{k}=\widehat{\mathbf{g}}_{k}-\mathbf{g}_{k}.

For any positive definite diagonal matrix HH satisfying HδIH\succeq\delta I with δ>0\delta>0, if T(x1)T(x2)2Lx1x22\|T(\mathbf{x}_{1})-T(\mathbf{x}_{2})\|_{2}\leq L\|\mathbf{x}_{1}-\mathbf{x}_{2}\|_{2} for x1,x2X\mathbf{x}_{1},\mathbf{x}_{2}\in\mathcal{X}, then

Note that HδIH\succeq\delta I, we have 0<H11δI0<H^{-1}\preceq\frac{1}{\delta}I. Noting that xH=xHx\|\mathbf{x}\|_{H}=\sqrt{\mathbf{x}^{\top}H\mathbf{x}}, we have

When ηδ9L\eta\leq\frac{\delta}{9L}, we have

Define ϵk=g^kgk\epsilon_{k}=\widehat{\mathbf{g}}_{k}-\mathbf{g}_{k}. For any xX\mathbf{x}\in\mathcal{X}, we have

where the last inequality holds by the fact that xzk,gk0\left\langle\mathbf{x}_{*}-\mathbf{z}_{k},\mathbf{g}_{k}\right\rangle\leq 0 since x\mathbf{x}_{*} is a solution of MVI(T,X)\text{MVI}(T,\mathcal{X}). Note that

where (a) holds by the update rule of zk\mathbf{z}_{k} and xk\mathbf{x}_{k} in Algorithm 2, (b) holds by the triangle inequality, (c) holds by utilizing the Lipschitz continuity of TT, Lemma 2 and the fact that Hk1δIH_{k-1}\succeq\delta I for any kk, (d) holds since (a+b+c)23a2+3b2+3c2(a+b+c)^{2}\leq 3a^{2}+3b^{2}+3c^{2}.

Define Λk=2xzk,ηϵk\Lambda_{k}=2\langle\mathbf{x}_{*}-\mathbf{z}_{k},\eta\epsilon_{k}\rangle. Taking x=x\mathbf{x}=\mathbf{x}_{*} in (18) and combining (19) and (20), we have

Taking summation over k=1,,Nk=1,\ldots,N in (22), and noting that x0=z0\mathbf{x}_{0}=\mathbf{z}_{0}, xHt112xHt12\|\mathbf{x}\|_{H_{t-1}^{-1}}^{2}\geq\|\mathbf{x}\|^{2}_{H_{t}^{-1}} for all x\mathbf{x} and t1t\geq 1, we have

By taking ηδ9L\eta\leq\frac{\delta}{9L}, we have 136η2L2δ2121-\frac{36\eta^{2}L^{2}}{\delta^{2}}\geq\frac{1}{2}, and we have the result. ∎

When g^1:N,i2δNα\|\widehat{\mathbf{g}}_{1:N,i}\|_{2}\leq\delta N^{\alpha} with 0α1/20\leq\alpha\leq 1/2 for every ii, we have

When g^1:N,i2δNα\|\widehat{\mathbf{g}}_{1:N,i}\|_{2}\leq\delta N^{\alpha} with 0α1/20\leq\alpha\leq 1/2 for every ii, we have

where (a) holds since we have k=1Ng^kHk122i=1dg^1:N,i2\sum_{k=1}^{N}\|\widehat{\mathbf{g}}_{k}\|_{H_{k}^{-1}}^{2}\leq 2\sum_{i=1}^{d}\left\|\widehat{\mathbf{g}}_{1:N,i}\right\|_{2} by the setting of HkH_{k} and utilizing Lemma 4 of (Duchi et al., 2011), (b) holds because of g^1:N,i2δNα\|\widehat{\mathbf{g}}_{1:N,i}\|_{2}\leq\delta N^{\alpha}.

D.2 Main Proof of Theorem 2

where (a) holds by the triangle inequality, (b) is due to (a+b)22a2+2b2(a+b)^{2}\leq 2a^{2}+2b^{2}, (c) holds by the update rule of xk\mathbf{x}_{k} of Algorithm 2, (d) comes from the triangle inequality and (a+b)22a2+2b2(a+b)^{2}\leq 2a^{2}+2b^{2}.

Taking summation over k=1,,Nk=1,\ldots,N over (29) and invoking Lemma 3, we have

Dividing η2N\eta^{2}N on both sides, we have

Appendix E More Experimental Results on CIFAR10

In Figure 7, we compare the performance of OSG, Alternating Adam (AlterAdam) and OAdagrad under the same minibatch size setting on CIFAR10 dataset, where one epoch means one pass of the dataset. We can see that OAdagrad and Alternating Adam behave consistently better than OSG. When the minibatch size is small (e.g., 64), OAdagrad and Alternating Adam have comparable performance, but when the minibatch size is large (e.g., 128, 256), OAdagrad converges faster than Alternating Adam. This phenomenon shows the benefits of OAdagrad when large minibatch size is used.

Appendix F The equivalence between OSG in unconstrained case and the algorithm in Daskalakis et al. (2017)

Define g^k=1mki=1mkT(zk;ξki)\hat{\mathbf{g}}_{k}=\frac{1}{m_{k}}\sum_{i=1}^{m_{k}}T(\mathbf{z}_{k};\xi_{k}^{i}), then the update rule of Algorithm 1 becomes

where the first equality comes from (33) by replacing kk to k+1k+1, the second equality holds by (34), and the third equality holds by using (33) again. (35) is the algorithm in (Daskalakis et al. 2017).

Appendix G The existence of MVI solution may not imply pseudo-monotonicity

Define T(x)=f(x)T(x)=\nabla f(x). Then T(x)=sin(x)T(x)=-\sin(x) if 0x2π0\leq x\leq 2\pi and T(x)=0T(x)=0 if x0x\leq 0 or x2πx\geq 2\pi. Then we know that π\pi is the solution of both SVI (i.e. T(π),xπ0\langle T(\pi),x-\pi\rangle\geq 0 for any xXx\in\mathcal{X}) and MVI (i.e. T(x),xπ0\langle T(x),x-\pi\rangle\geq 0 for any xXx\in\mathcal{X}). However TT is not pseudo-monotone. To see this, take x=0x=0 and y=π4y=\frac{\pi}{4} and we have T(x),yx=0\langle T(x),y-x\rangle=0 and T(y),yx<0\langle T(y),y-x\rangle<0, which means that TT is not pseudo-monotone.