Global Convergence and Variance-Reduced Optimization for a Class of Nonconvex-Nonconcave Minimax Problems
Junchi Yang, Negar Kiyavash, Niao He
Introduction
We consider minimax optimization problems of the forms
where is a random vector with support , and is a possibly nonconvex-nonconcave function. Minimax problems have been widely studied in game theory and operations research. Recent emerging applications in machine learning have further stimulated a surge of interest in these problems. For example, generative adversarial networks (GANs) (Goodfellow et al., 2016) can be viewed as a two-player game between a generator that produces synthetic data and a discriminator that differentiates between true data and synthetic data. In reinforcement learning, solving Bellman equations can also be reformulated as minimax optimization problems (Chen and Wang, 2016; Dai et al., 2017, 2018). Other applications include robust optimization (Namkoong and Duchi, 2016, 2017), adversarial machine learning (Sinha et al., 2017; Madry et al., 2017), unsupervised learning (Xu et al., 2005), and so on.
The most natural and frequently used methods for solving minimax problems (1) and (2) are the gradient descent ascent (GDA) algorithms (or their stochastic variants), with either simultaneous or alternating updates of the primal-dual variables, referred to as SGDA and AGDA, respectively, throughout the paper. While these algorithms have received much empirical success especially in adversarial training, it is known that these GDA algorithms with constant stepsizes could fail to converge for general smooth function (Mescheder et al., 2018), even for the bilinear games (Gidel et al., 2019); even when they do converge, the stable limit point may not be a local Nash equilibrium (Daskalakis et al., 2018; Mazumdar and Ratliff, 2018). On the other hand, GDA algorithms can converge linearly to the saddle point for strongly-convex-strongly-concave functions (Facchinei and Pang, 2007). Moreover, for many simple nonconvex-nonconcave objective functions, such as, we also observe that GDA algorithms with constant stepsizes indeed converge to the global Nash equilibrium (or saddle point), at a linear rate (see Figure 1). This also holds true for their stochastic variants, albeit at a sublinear rate. These facts naturally raise a question: Is there a general condition under which GDA algorithms converge to the global optima?
In this paper, we address these two questions and specifically focus on the alternating gradient descent ascent, namely AGDA. This is due to several considerations. First of all, it has been recently shown that alternating updates of GDA are more stable than simultaneous updates (Gidel et al., 2019; Bailey et al., 2019). Note that for a convex-concave matrix game, SGDA may diverge while AGDA is proven to always have bounded iterates (Gidel et al., 2019). See Figure 2 for a simple illustration. Secondly, in general, it is much more challenging to analyze AGDA than SGDA. There is a lack of discussion on the convergence of AGDA for general minimax problems in the literature. Our contributions are summarized as follows.
Second, for minimax problems with the finite sum structure, we introduce a variance-reduced AGDA algorithm (VR-AGDA) that leverages the idea of stochastic variance reduced gradient (SVRG) (Johnson and Zhang, 2013; Reddi et al., 2016a) with the alternating updates. We prove that VR-AGDA achieves the complexity of in the region and in the region , where is the number of component functions. This greatly improves over the complexity of AGDA when applied to the finite sum minimax problems. We summarize the results of these algorithms in Table 1. Our numerical experiments further demonstrate that VR-AGDA performs significantly better than AGDA and Stoc-AGDA, especially for problems with large condition numbers. To our best knowledge, this is the first work to provide a variance reduced algorithm and theoretical guarantees in the nonconvex-nonconcave regime of minimax optimization.
2 Related work
There has been a recent surge in research on solving minimax optimization beyond the convex-concave regime (Sinha et al., 2017; Chen et al., 2017; Qian et al., 2019; Thekumparampil et al., 2019; Lin et al., 2018; Nouiehed et al., 2019; Abernethy et al., 2019), but they differ from our work from various perspectives. For example, Chen et al. (2017); Sinha et al. (2017); Lin et al. (2019); Thekumparampil et al. (2019) considered the minimax problem when the objective function is nonconvex in but concave in and focused on achieving convergence to stationary points. Their algorithms require solving the inner maximization or some sub-problems with high accuracy at every iteration, which are different from AGDA. Lin et al. (2018) considered a general class of weakly-convex weakly-concave minimax problems and proposed an inexact proximal point method to find an -stationary point. Their convergence result relies on assuming the existence of a solution to the corresponding Minty variational inequality, which is often hard to verify. Abernethy et al. (2019) recently showed the linear convergence of a second-order iterative algorithm, called Hamiltonian gradient descent (HGD), for a subclass of “sufficiently bilinear” functions. Compared with their work, the PL condition we consider in this paper is easier to verify and GDA algorithms are much simpler.
PL condition.
Recently, Nouiehed et al. (2019) studied a class of minimax problems where the objective only satisfies a one-sided PL condition and introduced the GDmax algorithm, which takes multiple ascent steps at every iteration. Our work differs from (Nouiehed et al., 2019) in two aspects: (i) we consider the two-sided PL condition which guarantees global convergence We also show that AGDA can find stationary point for minimax problems under the one-sided PL condition within iterations in Appendix D.; (ii) we consider AGDA which takes one ascent step at every iteration. Another closely related work is Cai et al. (2019). The authors considered a specific application in generative adversarial imitation learning with linear quadratic regulator dynamics. This is a special example that falls under the two-sided PL condition.
Variance-reduced minimax optimization.
There exists a few works that apply variance reduction techniques to minimax optimization. Palaniappan and Bach (2016); Luo et al. (2019) provided linear-convergent algorithms for strongly-convex-strongly-concave objectives, based on simultaneous updates. Du and Hu (2019) extended the result to convex-strongly-concave objectives with full-rank coupling bilinear term. In contrast, we are dealing with a much broader class of objectives that are possibly nonconvex-nonconcave. We point out that Luo et al. (2020) recently introduced a variance-reduced algorithm for finding the stationary point of nonconvex-strongly-concave problems, which is again different from our setting.
The rest of this paper is organized as follows. In Section 2, we introduce the two-sided PL condition and show the equivalence of three min-max optimality criteria under this condition. In Section 3, we describe deterministic and stochastic AGDA algorithms, and provide convergence analyses of those algorithms under the two-sided PL condition. In Section 4, we introduce the variance-reduced AGDA algorithm and establish its convergence results. In Section 5, we provide numerical performance of these algorithms for robust least square and imitation learning for LQR.
Global optima and two-sided PL condition
Throughout this paper, we assume that the function in (1) is continuously differentiable and has Lipschitz gradient. We state it as a basic assumption. Here is used to denote the Euclidean norm.
There exists a positive constant such that
We now define three notions of optimality for minimax problems. The most direct notion of optimality is global minimax point, at which is an optimal solution to the function and is an optimal solution to . In the two-player zero-sum game, the notion of saddle point is also widely used (Von Neumann et al., 2007; Nash, 1953). For a saddle point , is an optimal solution to and is an optimal solution to .
is a global minimax point, if for any
is a saddle point, if for any
is a stationary point, if
For general nonconvex-nonconcave minimax problems, these three notions of optimality are not necessarily equivalent. A stationary point may not be a saddle point or a global minimax point; a global minimax point may not be a saddle point or a stationary point. Note that generally speaking, for minimax problems, a saddle point or a global minimax point may not always exist. However, since our goal in this paper is to find global optima, in the remainder of the paper, we assume that a saddle point always exists.
We introduce a straightforward generalization of the PL condition to the minimax problem: function satisfies the PL condition with constant with respect to , and - satisfies PL condition with constant with respect to . We formally state this in the following definition.
A continuously differentiable function satisfies the two-sided PL condition if there exist constants such that:
The two-sided PL condition does not imply convexity-concavity, and it is a much weaker condition than strong-convexity-strong-concavity. In Lemma 2.1, we show that three notions of optimality are equivalent under the two-sided PL condition. Note that they may not be unique.
If the objective function satisfies the two-sided PL condition, then the following holds true:
Below we give some examples that satisfy this condition.
The nonconvex-nonconcave function in the introduction, satisfies the two-sided PL condition with (see Appendix A).
, where is strongly-convex-strongly-concave and and are arbitrary matrices, satisfies the two-sided PL condition.
The generative adversarial imitation learning for LQR can be formulated as , where is strongly-concave in terms of and satisfies PL condition in terms of (see (Cai et al., 2019) for more details), thus satisfying the two-sided PL condition.
Under the two-sided PL condition, the function can be shown to satisfy PL condition with (see Appendix A). Moreover, it holds that is also -smooth with (Nouiehed et al., 2019). Finally, we denote and , which represents the condition number of the problem.
Global convergence of AGDA and Stoc-AGDA
In this section, we establish the convergence rate of the stochastic alternating gradient descent ascent (Stoc-AGDA) algorithm, which we present in Algorithm 1, under the two-sided PL condition. Stoc-AGDA updates variables and sequentially using stochastic gradient descent/ascent steps. Here we make standard assumptions about stochastic gradients and .
and are unbiased stochastic estimators of and and have variances bounded by .
Note that Stoc-AGDA with constant stepsizes (i.e., and ) and noiseless stochastic gradient (i.e., ) reduces to AGDA:
We will measure the inaccuracy of through the potential function
We first consider Stoc-AGDA with constant stepsizes. We show that will converge linearly to a neighbourhood of the optimal set.
Suppose Assumptions 1, 2, 3 hold and satisfies the two-sided PL condition with and . Define . If we run Algorithm 1 with and , then
where
In the theorem above, we choose smaller than , , because our potential function is not symmetric about and . Another reason is because we want to approach faster so that is a better approximation for (, see Nouiehed et al. (2019)). Indeed, it is common to use different learning rates for and in GDA algorithms for nonconvex minimax problems; see e.g., Jin et al. (2019) and Lin et al. (2019). Note that the ratio between these two learning rates is quite crucial here. We also observe empirically when the same learning rate is used, even if small, the algorithm may not converge to saddle points.
When , . If and , the error term will go to 0. When using smaller stepsizes, the algorithm reaches a smaller neighbour of the saddle point yet at the cost of a slower rate, as the contraction factor also deteriorates.
Linear convergence of AGDA
Setting , it follows immediately from the previous theorem that AGDA converges linearly under the two-sided PL condition. Moreover, we have
Suppose Assumptions 1, 2 hold and satisfies the two-sided PL condition with and . Define . If we run AGDA with and , then
Furthermore, converges to some saddle point , and
where is a constant depending on and .
The above theorem implies that the limit point of is a saddle point and the distance to the saddle point decreases in the order of . Note that in the special case when the objective is strongly-convex-strongly-concave, it is known that SGDA (GDA with simultaneous updates) achieves an iteration complexity (see, e.g., Facchinei and Pang (2007)) and this can be further improved to by extragradient methods (Korpelevich, 1976), Nesterov’s dual extrapolation (Nesterov and Scrimali, 2006) or accelerated proximal point algorithm (Lin et al., 2020). However, these result relies heavily on the strong monotonicity of the corresponding variational inequality. For the general two-sided PL condition, we may not achieve the same dependency on .
Stoc-AGDA with diminishing stepsizes
While Stoc-AGDA with constant stepsizes only converges linearly to a neighbourhood of the saddle point, Stoc-AGDA with diminishing stepsizes converges to the saddle point but at a sublinear rate .
Suppose Assumptions 1, 2, 3 hold and satisfies the two-sided PL condition with and . Define . If we run algorithm 1 with stepsizes and for some and such that , then we have
Note the rate is affected by , and the first term in the definition of is controlled by the initial point. In practice, we can find a good initial point by running Stoc-AGDA with constant stepsizes so that only the second term in the definition of matters. Then by choosing , we have . Thus, the convergence rate of Stoc-AGDA is .
Stochastic variance reduced algorithm
In this section, we study the minimax problem in (2) with the finite-sum structure:
which arises ubiquitously in machine learning. We are especially interested in the case when is large. We assume the overall objective function still satisfies the two-sided PL condition with and , but we do not assume each to satisfy the two-sided PL condition. Instead of Assumption 1, we now assume each component has Lipschitz gradients.
If we run AGDA with full gradients to solve the finite-sum minimax problem, the total complexity for finding an -optimal solution is by Theorem 3.2. Despite the linear convergence, the per-iteration cost is high and the complexity can be huge when the number of components and condition number are large. Instead, if we run Stoc-AGDA, this leads to the total complexity by Remark 3, which has worse dependence on .
Motivated by the recent success of stochastic variance reduced gradient (SVRG) technique (Johnson and Zhang, 2013; Reddi et al., 2016a; Palaniappan and Bach, 2016), we introduce the VR-AGDA algorithm (presented in Algorithm 2), that combines AGDA with SVRG so that the linear convergence is preserved while improving the dependency on and . VR-AGDA can be viewed as the applying SVRG to AGDA with restarting: at every epoch , we restart the SVRG subroutine (with outer iterations, inner steps) by initializing it with , which is randomly selected from previous SVRG subroutine. This is partly inspired by the GD-SVRG algorithm for minimizing PL functions (Reddi et al., 2016a). Notice when , VR-AGDA reduces to a double-loop algorithm which is similar to the SVRG for saddle point problems proposed by Palaniappan and Bach (2016), except for several notable differences: (i) we are using the alternating updates rather than simultaneous updates, (ii) as a result, we require to sample two independent indices rather than one at each iteration, and (iii) most importantly, we are dealing with possibly nonconvex-nonconcave objectives that satisfy the two-sided PL condition.
The following two theorems capture the convergence of VR-AGDA.
for VR-AGDA to achieve an -optimal solution.
Under the same assumptions in Theorem 4.1 and further assuming , if we run VR-AGDA with , , , and , where are constants irrelevant to , then This further implies a total complexity of
for VR-AGDA to achieve an -optimal solution.
Theorems 4.1 and 4.2 are different in their choices of stepsizes and iteration numbers, which gives rise to different complexities. Another difference is that Theorem 4.2 only works in the regime where the number of components is not “too large” compared to the condition number, i.e., , which naturally guarantees .
Since AGDA has complexity \mathcal{O}\big{(}n\kappa^{3}\log(1/\epsilon)\big{)}, VR-AGDA with the setting in Theorem 4.1 is better than AGDA when . With the setting in Theorem 4.2, VR-AGDA outperforms AGDA as long as the assumption holds. As a result of these two theorems, VR-AGDA always improves over AGDA. Furthermore, VR-AGDA with the second setting has a lower complexity than the first setting in the regime , although the first setting allows a simpler double-loop algorithm. Figure 3 summarizes the performance of VR-AGDA compared to AGDA in different regimes of and .
Experiments
In the introduction, we already presented the convergence results of AGDA on a two-dimensional nonconvex-nonconcave function that satisfies the two-sided PL condition. In this section, we will present numerical experiments on machine learning applications: robust least square and imitation learning for linear quadratic regulators (LQR). Particularly, we focus on the comparison between AGDA, Stoc-AGDA, and VR-AGDA.
where we also adopt the general M-(semi-)norm in (13): and is positive semi-definite. satisfies the two-sided PL condition when , because it can be written as the composition of a strongly-convex-strongly-concave function and an affine function (Example 2). However, is not strongly convex about , and when is not full-rank, it is not strongly concave about .
Evaluation. For each dataset, we compare three algorithms: AGDA, Stoc-AGDA, and VR-AGDA. We tune the stepsizes of all algorithms to achieve the best convergence. For Stoc-AGDA, we choose constant stepsizes to form a fair comparison with the other two. We report the potential function value, i.e., described in our theorems, and distance to the limit point . These errors are plotted against the number of gradient evaluations normalized by (i.e., number of full gradients). Results are reported in Figure 4. We observe that VR-AGDA and AGDA both exhibit linear convergence, and the speedup of VR-AGDA is fairly significant when the condition number is large, whereas Stoc-AGDA progresses fast at the beginning and stagnates later on. These numerical results clearly validate our theoretical findings.
2 Generative adversarial imitation learning for LQR
The optimal control problem for LQR can be formulated as:
where the trajectory is induced by LQR dynamics and policy . In generative adversarial imitation learning for LQR, the trajectories induced by an expert policy are observed and part of the goal is to learn the cost function parameters and from the expert. This can be formulated as a minimax problem (Cai et al., 2019):
where , and is a strongly-convex regularizer. We sample initial points from and approximate by sample average
Note that satisfies the PL condition in terms of (Fazel et al., 2018), and is strongly-concave in terms of , so the function satisfies the two-sided PL condition.
In our experiment, we use for some and . We generate three datasets with different dimensions: (1) ; (2) ; (3) . The initial distribution is and we sample initial points. The exact gradients can be computed based on the compact forms established in Fazel et al. (2018); Cai et al. (2019). We compare AGDA and VR-AGDA under fine-tuned stepsizes, and track their errors in terms of . The result is reported in Figure 5, which again indicates that VR-AGDA significantly outperforms AGDA.
Conclusion
In this paper, we identify a subclass of nonconvex-nonconcave minimax problems, represented by the the so-called two-side PL condition, for which AGDA and Stoc-AGDA can converge to global saddle points. We also propose the first linearly-convergent variance-reduced AGDA algorithm that is provably always faster than AGDA, for this subclass of minimax problems . We hope this work can shed some light on the understanding of nonconvex-nonconcave minimax optimization: (1) different learning rates for two players are essential in GDA algorithms with alternating updates; (2) convexity-concavity is not a watershed to guarantee global convergence of GDA algorithms; (3) the complexity of solving minimax problems under PL conditions may have high-order dependence on the condition number in contrast to problems with strong convex-concavity conditions. It remains interesting to explore whether similar results apply to GDA algorithms with simultaneous updates and whether these algorithms can be further accelerated with momentum or catalyst schemes.
References
Appendix A Proofs for Section 2
If is l-smooth and it satisfies PL with constant , then it also satisfies error bound (EB) condition with , i.e.
where is the projection of onto the optimal set, also it satisfies quadratic growth (QG) condition with , i.e.
Conversely, if is l-smooth and it satisfies EB with constant , then it satisfies PL with constant .
From the above lemma, we easily derive that .
In the minimax problem, when satisfies PL condition with constant for any and satisfies Assumption 1, then the function is -smooth with and for any .
In the minimax problem 1, when the objective function satisfies Assumption 1 (Lipschitz gradient) and the two-sided PL condition with constant and , then function satisfies the PL condition with .
Since satisfies PL condition with constant , we get
Combining equation (14) and (15), we obtain,
The following lemma states that stochastic gradient descent converges linearly to the neighbourhood of the optimal set under PL condition. The proof is based on [Karimi et al., 2016].
(stationary point) (saddle point): From the definition of PL condition, if is a stationary point,
so , and therefore is a saddle point.
(saddle point) (global minimax point): Follow from definitions.
(global minimax point) (stationary point): If is a global minimax point, then by definition,
Then by first order necessary condition, we have,
Thus, is a stationary point.
satisfies the two-sided PL condition with .
It is not hard to derive that , and , i.e. . Therefore, is the only saddle point. Then compute the gradients:
so is -smooth with for any and is -smooth with for any . Then note that:
So satisfies EB with , and - satisfies EB with . By Lemma A.1, we have satisfies PL with constant and - satisfies PL with constant .
Appendix B Proofs for Section 3
Before we step into proofs for Theorem 3.1, 3.2 and 3.3, we first present a contraction theorem for each iteration.
and such that .
Because is -smooth by Lemma A.2, we have
Taking expectation of both side and use Assumption 3, we get
where in the second inequality we use Assumption 3, and in the third inequality we use . Because is -smooth and -PL, by Lemma A.4, when we have
Because of lipschitz continuity of the gradient, we can bound as
Taking expectation of both side and use Assumption 3,
Combining (18) and (22), we have for
where in the second inequality we use Young’s Inequality and . Now it suffices to bound and by and . With Lemma A.2, we have:
for any . Now we fix to be the projection of on the the set . Because satisfies PL condition with , and Lemma A.1 therefore indicates it also satisfies quadratic growth condition with , i.e.
Because satisfies PL condition with by Lemma A.3,
In the setting of Theorem 1, and . By Thoerem B.1, We only need to choose , , and to let . Here we first choose and . Then
where in the last inequality we just plug in and and use . Also,
where in the last inequality we plug in and and we use by our choice of . Note that , because . Define , and by Theorem B.1,
We verify that by noting: and . ∎
The first part of Theorem 3.2 is a direct corollary of Theorem 3.1 by setting . We show the second part by noting that
where in the second inequality is the projection of on the the set and , in the third inequality we use lipschtiz continuity of gradient, and in the last inequality we use quadratic growth condition. Also,
where in the first equality is the projection of on the set and , in the second inequality is the projection of on the the set and , and in the last inequality we use quadratic growth condition. Therefore with (32) and (33),
where . Letting , we have
so converges and by first part of this theorem the limit must be a saddle point. Thus we have
with . ∎
First note that since , . Similar to the proof of Theorem 3.1, by choosing and in the Theorem B.1, we have . We prove the theorem by induction. When t = 1, it is naturally satisfied by definition of . We assume that . Then by Theorem B.1,
where in the second inequality we plug in and , in the last inequality we use and the fact that sum of last two terms in (34) is no greater than 0 by our choice of . ∎
Appendix C Proofs for Section 4
Because the proof is long, we break the proof into three parts for the convenience of understanding the intuition behind it.
Note that these are unbiased stochastic gradients. Similar to the proof of Theorem B.1 (replace in (18) ), with , we have
where in the last inequality we use Young’s inequality to the inner product and is a constant which we will determine later. Similarly,
where in the last inequality we use Young’s inequality to the inner product and is a constant. We are going to construct a potential function
and we will determine and later. Combine (35), (36) and (38),
Then we bound the variance of the stochastic gradients,
Consider the second line. Using PL condition and assuming , which we will justify later by our choices of and , we have
where in the last inequality we use (35) and (20). Now we plug this into ,
where we define and . With ,
Then plugging in (26), (27) and (42), we get
Now we are ready to define sequences and . Let , and
The left hand side is exactly , because is sampled uniformly from .
It suffices to choose proper , , and such that . Driven by the proof, we choose
We will choose and later and we let . Plug back to and , we have
where in the last inequality we assume .
We define . Then combining (53) and (54), we easily get
and note that so . Then we want to lower bound . Rearrange (48),
where in the inequality, we use and assume that . Rearranging (49),
where in the inequality we use and assume and . Note that and . Then we have
Letting and , we have
where we use . By plugging in and into (55), we have
We choose , and , where is irrelevant to . Then since , after plugging in and , we have
Therefore, for choosing small enough and small enough, we have . Now it remains to verify several assumptions we made in the proof. The first is . Since , this assumption easily holds when . The second assumption we want to verify is . Note that
So this assumption can also be easily satisfied when is small. The last assumption we need to verify is . Because and (61),
So this assumption holds when and are small.
We start from Part 3 of the proof of Theorem 4.1. We now choose , , and then
Therefore, for choosing small enough and small enough, we have . Other assumptions can be easily verified by the same way as in the proof of Theorem 4.1. ∎
Appendix D AGDA for minimax problems under one-sided PL condition
We are here to show that if satisfies PL condition with constant and may be nonconvex (referred to as PL game by Nouiehed et al. ), AGDA as presented in Algorithm 3 can find -stationary point of within iterations. Note that GDmax has complexity on minimax problems under the one-sided PL condition [Nouiehed et al., 2019]; SGDA has complexity on nonconvex-strongly-concave minimax problems [Lin et al., 2019]. Here we define condition number and is still defined the same as before. The proof is based on our previous analysis and Lin et al. .
Suppose Assumption 1 holds and satisfies PL condition with constant for any . If we run Algorithm 3 with and , then
where and .
For convenience, we still define . Since it can be easily verified that , by (18) and (26), we have
where in the second inequality we use Young’s inequality, and in third inequality we use (26). We write
We note that because by our choice of . Also,
where in the last inequality we use and . Plugging into (73),
where in the inequality we use again.