Global Convergence and Variance-Reduced Optimization for a Class of Nonconvex-Nonconcave Minimax Problems

Junchi Yang, Negar Kiyavash, Niao He

Introduction

We consider minimax optimization problems of the forms

where $\xi$ is a random vector with support $\Xi$ , and $f(x,y)$ is a possibly nonconvex-nonconcave function. Minimax problems have been widely studied in game theory and operations research. Recent emerging applications in machine learning have further stimulated a surge of interest in these problems. For example, generative adversarial networks (GANs) (Goodfellow et al., 2016) can be viewed as a two-player game between a generator that produces synthetic data and a discriminator that differentiates between true data and synthetic data. In reinforcement learning, solving Bellman equations can also be reformulated as minimax optimization problems (Chen and Wang, 2016; Dai et al., 2017, 2018). Other applications include robust optimization (Namkoong and Duchi, 2016, 2017), adversarial machine learning (Sinha et al., 2017; Madry et al., 2017), unsupervised learning (Xu et al., 2005), and so on.

The most natural and frequently used methods for solving minimax problems (1) and (2) are the gradient descent ascent (GDA) algorithms (or their stochastic variants), with either simultaneous or alternating updates of the primal-dual variables, referred to as SGDA and AGDA, respectively, throughout the paper. While these algorithms have received much empirical success especially in adversarial training, it is known that these GDA algorithms with constant stepsizes could fail to converge for general smooth function (Mescheder et al., 2018), even for the bilinear games (Gidel et al., 2019); even when they do converge, the stable limit point may not be a local Nash equilibrium (Daskalakis et al., 2018; Mazumdar and Ratliff, 2018). On the other hand, GDA algorithms can converge linearly to the saddle point for strongly-convex-strongly-concave functions (Facchinei and Pang, 2007). Moreover, for many simple nonconvex-nonconcave objective functions, such as, $f(x,y)=x^{2}+3\sin^{2}x\sin^{2}y-4y^{2}-10\sin^{2}y,$ we also observe that GDA algorithms with constant stepsizes indeed converge to the global Nash equilibrium (or saddle point), at a linear rate (see Figure 1). This also holds true for their stochastic variants, albeit at a sublinear rate. These facts naturally raise a question: Is there a general condition under which GDA algorithms converge to the global optima?

In this paper, we address these two questions and specifically focus on the alternating gradient descent ascent, namely AGDA. This is due to several considerations. First of all, it has been recently shown that alternating updates of GDA are more stable than simultaneous updates (Gidel et al., 2019; Bailey et al., 2019). Note that for a convex-concave matrix game, SGDA may diverge while AGDA is proven to always have bounded iterates (Gidel et al., 2019). See Figure 2 for a simple illustration. Secondly, in general, it is much more challenging to analyze AGDA than SGDA. There is a lack of discussion on the convergence of AGDA for general minimax problems in the literature. Our contributions are summarized as follows.

Second, for minimax problems with the finite sum structure, we introduce a variance-reduced AGDA algorithm (VR-AGDA) that leverages the idea of stochastic variance reduced gradient (SVRG) (Johnson and Zhang, 2013; Reddi et al., 2016a) with the alternating updates. We prove that VR-AGDA achieves the complexity of $\mathcal{O}((n^{2/3}\kappa^{3}\log(1/\epsilon))$ in the region $n\leq\kappa^{9}$ and $\mathcal{O}(n+\kappa^{9})\log(1/\epsilon))$ in the region $n\geq\kappa^{9}$ , where $n$ is the number of component functions. This greatly improves over the $\mathcal{O}\left(n\kappa^{3}\log\frac{1}{\epsilon}\right)$ complexity of AGDA when applied to the finite sum minimax problems. We summarize the results of these algorithms in Table 1. Our numerical experiments further demonstrate that VR-AGDA performs significantly better than AGDA and Stoc-AGDA, especially for problems with large condition numbers. To our best knowledge, this is the first work to provide a variance reduced algorithm and theoretical guarantees in the nonconvex-nonconcave regime of minimax optimization.

2 Related work

There has been a recent surge in research on solving minimax optimization beyond the convex-concave regime (Sinha et al., 2017; Chen et al., 2017; Qian et al., 2019; Thekumparampil et al., 2019; Lin et al., 2018; Nouiehed et al., 2019; Abernethy et al., 2019), but they differ from our work from various perspectives. For example, Chen et al. (2017); Sinha et al. (2017); Lin et al. (2019); Thekumparampil et al. (2019) considered the minimax problem when the objective function is nonconvex in $x$ but concave in $y$ and focused on achieving convergence to stationary points. Their algorithms require solving the inner maximization or some sub-problems with high accuracy at every iteration, which are different from AGDA. Lin et al. (2018) considered a general class of weakly-convex weakly-concave minimax problems and proposed an inexact proximal point method to find an $\epsilon$ -stationary point. Their convergence result relies on assuming the existence of a solution to the corresponding Minty variational inequality, which is often hard to verify. Abernethy et al. (2019) recently showed the linear convergence of a second-order iterative algorithm, called Hamiltonian gradient descent (HGD), for a subclass of “sufficiently bilinear” functions. Compared with their work, the PL condition we consider in this paper is easier to verify and GDA algorithms are much simpler.

PL condition.

Recently, Nouiehed et al. (2019) studied a class of minimax problems where the objective only satisfies a one-sided PL condition and introduced the GDmax algorithm, which takes multiple ascent steps at every iteration. Our work differs from (Nouiehed et al., 2019) in two aspects: (i) we consider the two-sided PL condition which guarantees global convergence We also show that AGDA can find $\epsilon-$ stationary point for minimax problems under the one-sided PL condition within $\mathcal{O}(1/\epsilon^{2})$ iterations in Appendix D.; (ii) we consider AGDA which takes one ascent step at every iteration. Another closely related work is Cai et al. (2019). The authors considered a specific application in generative adversarial imitation learning with linear quadratic regulator dynamics. This is a special example that falls under the two-sided PL condition.

Variance-reduced minimax optimization.

There exists a few works that apply variance reduction techniques to minimax optimization. Palaniappan and Bach (2016); Luo et al. (2019) provided linear-convergent algorithms for strongly-convex-strongly-concave objectives, based on simultaneous updates. Du and Hu (2019) extended the result to convex-strongly-concave objectives with full-rank coupling bilinear term. In contrast, we are dealing with a much broader class of objectives that are possibly nonconvex-nonconcave. We point out that Luo et al. (2020) recently introduced a variance-reduced algorithm for finding the stationary point of nonconvex-strongly-concave problems, which is again different from our setting.

The rest of this paper is organized as follows. In Section 2, we introduce the two-sided PL condition and show the equivalence of three min-max optimality criteria under this condition. In Section 3, we describe deterministic and stochastic AGDA algorithms, and provide convergence analyses of those algorithms under the two-sided PL condition. In Section 4, we introduce the variance-reduced AGDA algorithm and establish its convergence results. In Section 5, we provide numerical performance of these algorithms for robust least square and imitation learning for LQR.

Global optima and two-sided PL condition

Throughout this paper, we assume that the function $f(x,y)$ in (1) is continuously differentiable and has Lipschitz gradient. We state it as a basic assumption. Here $\|\cdot\|$ is used to denote the Euclidean norm.

There exists a positive constant $l>0$ such that

We now define three notions of optimality for minimax problems. The most direct notion of optimality is global minimax point, at which $x^{*}$ is an optimal solution to the function $g(x):=\max_{y}f(x,y)$ and $y^{*}$ is an optimal solution to $\max_{y}f(x^{*},y)$ . In the two-player zero-sum game, the notion of saddle point is also widely used (Von Neumann et al., 2007; Nash, 1953). For a saddle point $(x^{*},y^{*})$ , $x^{*}$ is an optimal solution to $\min_{x}f(x,y^{*})$ and $y^{*}$ is an optimal solution to $\max_{y}f(x^{*},y)$ .

$(x^{*},y^{*})$ is a global minimax point, if for any $(x,y):$

$(x^{*},y^{*})$ is a saddle point, if for any $(x,y):$

$(x^{*},y^{*})$ is a stationary point, if $:$

For general nonconvex-nonconcave minimax problems, these three notions of optimality are not necessarily equivalent. A stationary point may not be a saddle point or a global minimax point; a global minimax point may not be a saddle point or a stationary point. Note that generally speaking, for minimax problems, a saddle point or a global minimax point may not always exist. However, since our goal in this paper is to find global optima, in the remainder of the paper, we assume that a saddle point always exists.

We introduce a straightforward generalization of the PL condition to the minimax problem: function $f(x,y)$ satisfies the PL condition with constant $\mu_{1}$ with respect to $x$ , and - $f$ satisfies PL condition with constant $\mu_{2}$ with respect to $y$ . We formally state this in the following definition.

A continuously differentiable function $f(x,y)$ satisfies the two-sided PL condition if there exist constants $\mu_{1},\mu_{2}>0$ such that:

The two-sided PL condition does not imply convexity-concavity, and it is a much weaker condition than strong-convexity-strong-concavity. In Lemma 2.1, we show that three notions of optimality are equivalent under the two-sided PL condition. Note that they may not be unique.

If the objective function $f(x,y)$ satisfies the two-sided PL condition, then the following holds true:

Below we give some examples that satisfy this condition.

The nonconvex-nonconcave function in the introduction, $f(x,y)=x^{2}+3\sin^{2}x\sin^{2}y-4y^{2}-10\sin^{2}y$ satisfies the two-sided PL condition with $\mu_{1}=1/16,\mu_{2}=1/11$ (see Appendix A).

$f(x,y)=F(Ax,By)$ , where $F(\cdot,\cdot)$ is strongly-convex-strongly-concave and $A$ and $B$ are arbitrary matrices, satisfies the two-sided PL condition.

The generative adversarial imitation learning for LQR can be formulated as $\min_{K}\min_{\theta}m(K,\theta)$ , where $m$ is strongly-concave in terms of $\theta$ and satisfies PL condition in terms of $K$ (see (Cai et al., 2019) for more details), thus satisfying the two-sided PL condition.

Under the two-sided PL condition, the function $g(x):=\max_{y}f(x,y)$ can be shown to satisfy PL condition with $\mu_{1}$ (see Appendix A). Moreover, it holds that $g$ is also $L$ -smooth with $L:=l+l^{2}/\mu_{2}$ (Nouiehed et al., 2019). Finally, we denote $\mu=\min(\mu_{1},\mu_{2})$ and $\kappa=\frac{l}{\mu}$ , which represents the condition number of the problem.

Global convergence of AGDA and Stoc-AGDA

In this section, we establish the convergence rate of the stochastic alternating gradient descent ascent (Stoc-AGDA) algorithm, which we present in Algorithm 1, under the two-sided PL condition. Stoc-AGDA updates variables $x$ and $y$ sequentially using stochastic gradient descent/ascent steps. Here we make standard assumptions about stochastic gradients $G_{x}(x,y,\xi)$ and $G_{y}(x,y,\xi)$ .

$G_{x}(x,y,\xi)$ and $G_{y}(x,y,\xi)$ are unbiased stochastic estimators of $\nabla_{x}f(x,y)$ and $\nabla_{y}f(x,y)$ and have variances bounded by $\sigma^{2}>0$ .

Note that Stoc-AGDA with constant stepsizes (i.e., $\tau_{1}^{t}=\tau_{1}$ and $\tau_{2}^{t}=\tau_{2}$ ) and noiseless stochastic gradient (i.e., $\sigma^{2}=0$ ) reduces to AGDA:

We will measure the inaccuracy of $(x_{t},y_{t})$ through the potential function

We first consider Stoc-AGDA with constant stepsizes. We show that $\{(x_{t},y_{t})\}_{t}$ will converge linearly to a neighbourhood of the optimal set.

Suppose Assumptions 1, 2, 3 hold and $f(x,y)$ satisfies the two-sided PL condition with $\mu_{1}$ and $\mu_{2}$ . Define $P_{t}:=a_{t}+\frac{1}{10}b_{t}$ . If we run Algorithm 1 with $\tau_{2}^{t}=\tau_{2}\leq\frac{1}{l}$ and $\tau_{1}^{t}=\tau_{1}\leq\frac{\mu_{2}^{2}\tau_{2}}{18l^{2}}$ , then

where $\delta=\frac{(1-\mu_{2}\tau_{2})(L+l)\tau_{1}^{2}+l\tau_{2}^{2}+10L\tau_{1}^{2}}{10\mu_{1}\tau_{1}}\sigma^{2}.$

In the theorem above, we choose $\tau_{1}$ smaller than $\tau_{2}$ , $\tau_{1}/\tau_{2}\leq\mu_{2}^{2}/(18l^{2})$ , because our potential function is not symmetric about $x$ and $y$ . Another reason is because we want $y_{t}$ to approach $y^{*}(x_{t})\in\arg\max_{y}f(x_{t},y)$ faster so that $\nabla_{x}f(x_{t},y_{t})$ is a better approximation for $\nabla g(x_{t})$ ( $\nabla g(x)=\nabla_{x}f(x,y^{*}(x))$ , see Nouiehed et al. (2019)). Indeed, it is common to use different learning rates for $x$ and $y$ in GDA algorithms for nonconvex minimax problems; see e.g., Jin et al. (2019) and Lin et al. (2019). Note that the ratio between these two learning rates is quite crucial here. We also observe empirically when the same learning rate is used, even if small, the algorithm may not converge to saddle points.

When $t\rightarrow\infty$ , $P_{t}\to\delta$ . If $\tau_{1}\rightarrow 0$ and $\tau_{2}^{2}/\tau_{1}\rightarrow 0$ , the error term $\delta$ will go to 0. When using smaller stepsizes, the algorithm reaches a smaller neighbour of the saddle point yet at the cost of a slower rate, as the contraction factor also deteriorates.

Linear convergence of AGDA

Setting $\sigma^{2}=0$ , it follows immediately from the previous theorem that AGDA converges linearly under the two-sided PL condition. Moreover, we have

Suppose Assumptions 1, 2 hold and $f(x,y)$ satisfies the two-sided PL condition with $\mu_{1}$ and $\mu_{2}$ . Define $P_{t}:=a_{t}+\frac{1}{10}b_{t}$ . If we run AGDA with $\tau_{1}=\frac{\mu_{2}^{2}}{18l^{3}}$ and $\tau_{2}=\frac{1}{l}$ , then

Furthermore, $\{(x_{t},y_{t})\}_{t}$ converges to some saddle point $(x^{*},y^{*})$ , and

where $\alpha$ is a constant depending on $\mu_{1},\mu_{2}$ and $l$ .

The above theorem implies that the limit point of $\{(x_{t},y_{t})\}_{t}$ is a saddle point and the distance to the saddle point decreases in the order of $\mathcal{O}\left((1-\kappa^{-3})^{t}\right)$ . Note that in the special case when the objective is strongly-convex-strongly-concave, it is known that SGDA (GDA with simultaneous updates) achieves an $\mathcal{O}(\kappa^{2}\log(1/\epsilon))$ iteration complexity (see, e.g., Facchinei and Pang (2007)) and this can be further improved to $\mathcal{O}(\kappa\log(1/\epsilon))$ by extragradient methods (Korpelevich, 1976), Nesterov’s dual extrapolation (Nesterov and Scrimali, 2006) or accelerated proximal point algorithm (Lin et al., 2020). However, these result relies heavily on the strong monotonicity of the corresponding variational inequality. For the general two-sided PL condition, we may not achieve the same dependency on $\kappa$ .

Stoc-AGDA with diminishing stepsizes

While Stoc-AGDA with constant stepsizes only converges linearly to a neighbourhood of the saddle point, Stoc-AGDA with diminishing stepsizes converges to the saddle point but at a sublinear rate $\mathcal{O}(1/t)$ .

Suppose Assumptions 1, 2, 3 hold and $f(x,y)$ satisfies the two-sided PL condition with $\mu_{1}$ and $\mu_{2}$ . Define $P_{t}=a_{t}+\frac{1}{10}b_{t}$ . If we run algorithm 1 with stepsizes $\tau_{1}^{t}=\frac{\beta}{\gamma+t}$ and $\tau_{2}^{t}=\frac{18l^{2}\beta}{\mu_{2}^{2}(\gamma+t)}$ for some $\beta>2/\mu_{1}$ and $\gamma>0$ such that $\tau_{1}^{1}\leq\min\{1/L,\mu_{2}^{2}/18l^{2}\}$ , then we have

Note the rate is affected by $\nu$ , and the first term in the definition of $\nu$ is controlled by the initial point. In practice, we can find a good initial point by running Stoc-AGDA with constant stepsizes so that only the second term in the definition of $\nu$ matters. Then by choosing $\beta=3/\mu_{1}$ , we have $\nu=\mathcal{O}\left(\frac{l^{5}\sigma^{2}}{\mu_{1}^{2}\mu_{2}^{4}}\right)$ . Thus, the convergence rate of Stoc-AGDA is $\mathcal{O}\left(\frac{\kappa^{5}\sigma^{2}}{\mu t}\right)$ .

Stochastic variance reduced algorithm

In this section, we study the minimax problem in (2) with the finite-sum structure:

which arises ubiquitously in machine learning. We are especially interested in the case when $n$ is large. We assume the overall objective function $f(x,y)$ still satisfies the two-sided PL condition with $\mu_{1}$ and $\mu_{2}$ , but we do not assume each $f_{i}$ to satisfy the two-sided PL condition. Instead of Assumption 1, we now assume each component $f_{i}$ has Lipschitz gradients.

If we run AGDA with full gradients to solve the finite-sum minimax problem, the total complexity for finding an $\epsilon$ -optimal solution is $\mathcal{O}(n\kappa^{3}\log(1/\epsilon))$ by Theorem 3.2. Despite the linear convergence, the per-iteration cost is high and the complexity can be huge when the number of components $n$ and condition number $\kappa$ are large. Instead, if we run Stoc-AGDA, this leads to the total complexity $\mathcal{O}\left(\frac{\kappa^{5}\sigma^{2}}{\mu_{2}\epsilon}\right)$ by Remark 3, which has worse dependence on $\epsilon$ .

Motivated by the recent success of stochastic variance reduced gradient (SVRG) technique (Johnson and Zhang, 2013; Reddi et al., 2016a; Palaniappan and Bach, 2016), we introduce the VR-AGDA algorithm (presented in Algorithm 2), that combines AGDA with SVRG so that the linear convergence is preserved while improving the dependency on $n$ and $\kappa$ . VR-AGDA can be viewed as the applying SVRG to AGDA with restarting: at every epoch $k$ , we restart the SVRG subroutine (with $T$ outer iterations, $N$ inner steps) by initializing it with $(x^{k},y^{k})$ , which is randomly selected from previous SVRG subroutine. This is partly inspired by the GD-SVRG algorithm for minimizing PL functions (Reddi et al., 2016a). Notice when $T=1$ , VR-AGDA reduces to a double-loop algorithm which is similar to the SVRG for saddle point problems proposed by Palaniappan and Bach (2016), except for several notable differences: (i) we are using the alternating updates rather than simultaneous updates, (ii) as a result, we require to sample two independent indices rather than one at each iteration, and (iii) most importantly, we are dealing with possibly nonconvex-nonconcave objectives that satisfy the two-sided PL condition.

The following two theorems capture the convergence of VR-AGDA.

for VR-AGDA to achieve an $\epsilon$ -optimal solution.

Under the same assumptions in Theorem 4.1 and further assuming $n\leq\kappa^{9}$ , if we run VR-AGDA with $\tau_{1}=\beta/(28\kappa^{2}ln^{2/3})$ , $\tau_{2}=\beta/(ln^{2/3})$ , $N=\lfloor\alpha\beta^{-2/3}n(2+4\beta^{1/2}n^{-1/3})^{-1}\rfloor$ , and $T=\lceil\kappa^{3}n^{-1/3}\rceil$ , where $\alpha,\beta$ are constants irrelevant to $l,n,\mu_{1},\mu_{2}$ , then $P_{k+1}\leq\frac{1}{2}P_{k}.$ This further implies a total complexity of

for VR-AGDA to achieve an $\epsilon$ -optimal solution.

Theorems 4.1 and 4.2 are different in their choices of stepsizes and iteration numbers, which gives rise to different complexities. Another difference is that Theorem 4.2 only works in the regime where the number of components $n$ is not “too large” compared to the condition number, i.e., $n\leq\kappa^{9}$ , which naturally guarantees $T=\lceil\kappa^{3}n^{-1/3}\rceil\geq 1$ .

Since AGDA has complexity $\mathcal{O}\big{(}n\kappa^{3}\log(1/\epsilon)\big{)}$ , VR-AGDA with the setting in Theorem 4.1 is better than AGDA when $n\geq\mathcal{\kappa}^{6}$ . With the setting in Theorem 4.2, VR-AGDA outperforms AGDA as long as the assumption $n\leq\kappa^{9}$ holds. As a result of these two theorems, VR-AGDA always improves over AGDA. Furthermore, VR-AGDA with the second setting has a lower complexity than the first setting in the regime $n\leq\kappa^{9}$ , although the first setting allows a simpler double-loop algorithm. Figure 3 summarizes the performance of VR-AGDA compared to AGDA in different regimes of $n$ and $\kappa$ .

Experiments

In the introduction, we already presented the convergence results of AGDA on a two-dimensional nonconvex-nonconcave function that satisfies the two-sided PL condition. In this section, we will present numerical experiments on machine learning applications: robust least square and imitation learning for linear quadratic regulators (LQR). Particularly, we focus on the comparison between AGDA, Stoc-AGDA, and VR-AGDA.

where we also adopt the general M-(semi-)norm in (13): $\|x\|_{M}^{2}=x^{T}Mx$ and $M$ is positive semi-definite. $F(x,y)$ satisfies the two-sided PL condition when $\lambda>1$ , because it can be written as the composition of a strongly-convex-strongly-concave function and an affine function (Example 2). However, $F(x,y)$ is not strongly convex about $x$ , and when $M$ is not full-rank, it is not strongly concave about $y$ .

Evaluation. For each dataset, we compare three algorithms: AGDA, Stoc-AGDA, and VR-AGDA. We tune the stepsizes of all algorithms to achieve the best convergence. For Stoc-AGDA, we choose constant stepsizes to form a fair comparison with the other two. We report the potential function value, i.e., $P_{t}$ described in our theorems, and distance to the limit point $\|(x_{t},y_{t})-(x^{*},y^{*})\|^{2}$ . These errors are plotted against the number of gradient evaluations normalized by $n$ (i.e., number of full gradients). Results are reported in Figure 4. We observe that VR-AGDA and AGDA both exhibit linear convergence, and the speedup of VR-AGDA is fairly significant when the condition number is large, whereas Stoc-AGDA progresses fast at the beginning and stagnates later on. These numerical results clearly validate our theoretical findings.

2 Generative adversarial imitation learning for LQR

The optimal control problem for LQR can be formulated as:

where the trajectory is induced by LQR dynamics and policy $K$ . In generative adversarial imitation learning for LQR, the trajectories induced by an expert policy $K_{E}$ are observed and part of the goal is to learn the cost function parameters $Q$ and $R$ from the expert. This can be formulated as a minimax problem (Cai et al., 2019):

where $m(K,Q,R):=C(K;Q,R)-C(K_{E};Q,R)-\Phi(Q,R)$ , $\Theta=\{(Q,R):\alpha_{Q}I\preceq Q\preceq\beta_{Q}I,\quad\alpha_{R}I\preceq R\preceq\beta_{R}I\}$ and $\Phi$ is a strongly-convex regularizer. We sample $n$ initial points $x_{0}^{(1)},x_{0}^{(2)},...,x_{0}^{(n)}$ from $\mathcal{D}$ and approximate $C(K;Q,R)$ by sample average

Note that $m_{n}$ satisfies the PL condition in terms of $K$ (Fazel et al., 2018), and $m_{n}$ is strongly-concave in terms of $(Q,R)$ , so the function satisfies the two-sided PL condition.

In our experiment, we use $\Phi(Q,R)=\lambda(\|Q-\bar{Q}\|^{2}+\|R-\bar{R}\|^{2})$ for some $\bar{Q},\bar{R}$ and $\lambda=1$ . We generate three datasets with different dimensions: (1) $d=3,k=2$ ; (2) $d=20,k=10$ ; (3) $d=30,k=20$ . The initial distribution $\mathcal{D}$ is $\mathcal{N}(0,I_{d})$ and we sample $n=100$ initial points. The exact gradients can be computed based on the compact forms established in Fazel et al. (2018); Cai et al. (2019). We compare AGDA and VR-AGDA under fine-tuned stepsizes, and track their errors in terms of $\|K_{t}-K^{*}\|^{2}+\|Q_{t}-Q^{*}\|_{F}^{2}+\|R_{t}-R^{*}\|_{F}^{2}$ . The result is reported in Figure 5, which again indicates that VR-AGDA significantly outperforms AGDA.

Conclusion

In this paper, we identify a subclass of nonconvex-nonconcave minimax problems, represented by the the so-called two-side PL condition, for which AGDA and Stoc-AGDA can converge to global saddle points. We also propose the first linearly-convergent variance-reduced AGDA algorithm that is provably always faster than AGDA, for this subclass of minimax problems . We hope this work can shed some light on the understanding of nonconvex-nonconcave minimax optimization: (1) different learning rates for two players are essential in GDA algorithms with alternating updates; (2) convexity-concavity is not a watershed to guarantee global convergence of GDA algorithms; (3) the complexity of solving minimax problems under PL conditions may have high-order dependence on the condition number in contrast to problems with strong convex-concavity conditions. It remains interesting to explore whether similar results apply to GDA algorithms with simultaneous updates and whether these algorithms can be further accelerated with momentum or catalyst schemes.

References

Appendix A Proofs for Section 2

If $f(\cdot)$ is l-smooth and it satisfies PL with constant $\mu$ , then it also satisfies error bound (EB) condition with $\mu$ , i.e.

where $x_{p}$ is the projection of $x$ onto the optimal set, also it satisfies quadratic growth (QG) condition with $\mu$ , i.e.

Conversely, if $f(\cdot)$ is l-smooth and it satisfies EB with constant $\mu$ , then it satisfies PL with constant $\mu/l$ .

From the above lemma, we easily derive that $l\geq\mu$ .

In the minimax problem, when $-f(x,\cdot)$ satisfies PL condition with constant $\mu_{2}$ for any $x$ and $f$ satisfies Assumption 1, then the function $g(x):=\max_{y}f(x,y)$ is $L$ -smooth with $L:=l+l^{2}/\mu_{2}$ and $\nabla g(x)=\nabla_{x}f(x,y^{*}(x))$ for any $y^{*}(x)\in\arg\max_{y}f(x,y)$ .

In the minimax problem 1, when the objective function $f$ satisfies Assumption 1 (Lipschitz gradient) and the two-sided PL condition with constant $\mu_{1}$ and $\mu_{2}$ , then function $g(x):=\max_{y}f(x,y)$ satisfies the PL condition with $\mu_{1}$ .

Since $f(\cdot,y)$ satisfies PL condition with constant $\mu_{1}$ , we get

Combining equation (14) and (15), we obtain,

The following lemma states that stochastic gradient descent converges linearly to the neighbourhood of the optimal set under PL condition. The proof is based on [Karimi et al., 2016].

(stationary point) $\Longrightarrow$ (saddle point): From the definition of PL condition, if $(x^{*},y^{*})$ is a stationary point,

so $\max_{y}f(x^{*},y)=f(x^{*},y^{*})=\min_{x}f(x,y^{*})$ , and therefore $f(x^{*},y^{*})$ is a saddle point.

(saddle point) $\Longrightarrow$ (global minimax point): Follow from definitions.

(global minimax point) $\Longrightarrow$ (stationary point): If $(x^{*},y^{*})$ is a global minimax point, then by definition,

Then by first order necessary condition, we have,

Thus, $(x^{*},y^{*})$ is a stationary point.

satisfies the two-sided PL condition with $\mu_{1}=1/16,\mu_{2}=1/14$ .

It is not hard to derive that $\arg\min_{x}f(x,y)=0,\forall y$ , and $\arg\max_{y}f(x,y)=0,\forall x$ , i.e. $x^{*}(y)=y^{*}(x)=0,\forall x,y$ . Therefore, $(0,0)$ is the only saddle point. Then compute the gradients:

so $f(\cdot,y)$ is $L_{1}$ -smooth with $L_{1}=8$ for any $x$ and $f(x,\cdot)$ is $L_{2}$ -smooth with $L_{2}=28$ for any $y$ . Then note that:

So $f(\cdot,y)$ satisfies EB with $\mu_{EB1}=1/2$ , and - $f(x,\cdot)$ satisfies EB with $\mu_{EB2}=2$ . By Lemma A.1, we have $f(\cdot,y)$ satisfies PL with constant $\mu_{1}=1/16$ and - $f(x,\cdot)$ satisfies PL with constant $\mu_{1}=1/14$ .

Appendix B Proofs for Section 3

Before we step into proofs for Theorem 3.1, 3.2 and 3.3, we first present a contraction theorem for each iteration.

and $\lambda,\beta>0$ such that $k_{1}\leq 1$ .

Because $g$ is $L$ -smooth by Lemma A.2, we have

Taking expectation of both side and use Assumption 3, we get

where in the second inequality we use Assumption 3, and in the third inequality we use $\tau_{1}\leq 1/L$ . Because $-f(x_{t+1},y)$ is $l$ -smooth and $\mu_{1}$ -PL, by Lemma A.4, when $\tau_{1}\leq 1/l$ we have

Because of lipschitz continuity of the gradient, we can bound $f(x_{t},y_{t})-f(x_{t+1},y_{t})$ as

Taking expectation of both side and use Assumption 3,

Combining (18) and (22), we have for $\forall\lambda>0$

where in the second inequality we use Young’s Inequality and $\beta>0$ . Now it suffices to bound $\|g(x_{t})\|^{2}$ and $\|\nabla_{x}f(x_{t},y_{t})-\nabla g(x_{t})\|^{2}$ by $a_{t}$ and $b_{t}$ . With Lemma A.2, we have:

for any $y^{*}(x_{t})\in\arg\max_{y}f(x_{t},y)$ . Now we fix $y^{*}(x_{t})$ to be the projection of $y_{t}$ on the the set $\arg\max_{y}f(x_{t},y)$ . Because $-f(\bm{x_{t}},\cdot)$ satisfies PL condition with $\mu_{2}$ , and Lemma A.1 therefore indicates it also satisfies quadratic growth condition with $\mu_{2}$ , i.e.

Because $g$ satisfies PL condition with $\mu_{1}$ by Lemma A.3,

In the setting of Theorem 1, $\tau_{1}^{t}=\tau_{1}$ and $\tau_{2}^{t}=\tau_{2},\forall t$ . By Thoerem B.1, We only need to choose $\tau_{1}$ , $\tau_{2}$ , $\lambda$ and $\beta$ to let $k_{1},k_{2}<1$ . Here we first choose $\beta=1$ and $\lambda=1/10$ . Then

where in the last inequality we just plug in $\beta$ and $\lambda$ and use $l\tau_{1}\leq 1$ . Also,

where in the last inequality we plug in $\beta$ and $\lambda$ and we use $\frac{\mu_{2}^{2}\tau_{2}}{\tau_{1}l^{2}}\leq 18$ by our choice of $\tau_{1}$ . Note that $\frac{1}{2}\tau_{1}\mu_{1}<\frac{l^{2}\tau_{1}}{\mu_{2}}$ , because $\left(\frac{1}{2}\tau_{1}\mu_{1}\right)/\left(\frac{l^{2}\tau_{1}}{\mu_{2}}\right)=\frac{\mu_{1}\mu_{2}}{2l^{2}}<1$ . Define $P_{t}:=a_{t}+\frac{1}{10}b_{t}$ , and by Theorem B.1,

We verify that $\tau_{1}\leq 1/L$ by noting: $\tau_{1}\leq\frac{\mu_{2}^{2}\tau_{2}}{18l^{2}}\leq\frac{\mu_{2}^{2}}{18l^{3}}\leq\frac{\mu_{2}}{2l^{2}}$ and $L=l+\frac{l^{2}}{\mu_{2}}\leq\frac{2l^{2}}{\mu_{2}}$ . ∎

The first part of Theorem 3.2 is a direct corollary of Theorem 3.1 by setting $\sigma=0$ . We show the second part by noting that

where in the second inequality $y^{*}(x_{t})$ is the projection of $y_{t}$ on the the set $\arg\max_{y}f(x_{t},y)$ and $\nabla_{y}f(x_{t},y^{*}(x_{t}))=0$ , in the third inequality we use lipschtiz continuity of gradient, and in the last inequality we use quadratic growth condition. Also,

where in the first equality $x^{*}$ is the projection of $x_{t}$ on the set $\arg\min_{x}g(x)$ and $\nabla g(x^{*})=0$ , in the second inequality $y^{*}(x_{t})$ is the projection of $y_{t}$ on the the set $\arg\max_{y}f(x_{t},y)$ and $\nabla g(x_{t})=\nabla_{x}f(x_{t},y_{t})$ , and in the last inequality we use quadratic growth condition. Therefore with (32) and (33),

where $c=1-\frac{\mu_{1}\mu_{2}^{2}}{36l^{3}}$ . Letting $\alpha_{1}=\left[\frac{2(1+\tau_{2}^{2}l^{2})\tau_{1}^{2}L^{2}}{\mu_{1}}+\frac{20(1+\tau_{2}^{2}l^{2})\tau_{1}^{2}l^{2}+20l^{2}\tau_{2}^{2}}{\mu_{2}}\right]P_{0}$ , we have

so $\{(x_{t},y_{t})\}_{t}$ converges and by first part of this theorem the limit $(x^{*},y^{*})$ must be a saddle point. Thus we have

with $\alpha=2\left[\frac{2(1+\tau_{2}^{2}l^{2})\tau_{1}^{2}L^{2}}{\mu_{1}}+\frac{20(1+\tau_{2}^{2}l^{2})\tau_{1}^{2}l^{2}+20l^{2}\tau_{2}^{2}}{\mu_{2}}\right]/(1-\sqrt{c})^{2}$ . ∎

First note that since $\tau_{1}^{t}\leq\mu_{2}^{2}/18l^{2}$ , $\tau_{2}^{t}=\frac{18l^{2}\beta}{\mu_{2}^{2}(\gamma+t)}=\frac{18l^{2}\tau_{1}^{t}}{\mu_{2}^{2}}\leq\frac{1}{l}$ . Similar to the proof of Theorem 3.1, by choosing $\beta=1$ and $\lambda=1/10$ in the Theorem B.1, we have $\min\{k_{1},k_{2}\}=\frac{1}{2}\mu_{1}\tau_{1}^{t}$ . We prove the theorem by induction. When t = 1, it is naturally satisfied by definition of $\nu$ . We assume that $P_{t}\leq\frac{\nu}{\gamma+t}$ . Then by Theorem B.1,

where in the second inequality we plug in $\tau_{1}^{t}$ and $\tau_{2}^{t}$ , in the last inequality we use $(\gamma+t+1)(\gamma+t-1)\leq(\gamma+t)^{2}$ and the fact that sum of last two terms in (34) is no greater than 0 by our choice of $\nu$ . ∎

Appendix C Proofs for Section 4

Because the proof is long, we break the proof into three parts for the convenience of understanding the intuition behind it.

Note that these are unbiased stochastic gradients. Similar to the proof of Theorem B.1 (replace $\sigma^{2}$ in (18) ), with $\tau_{1}\leq 1/L$ , we have

where in the last inequality we use Young’s inequality to the inner product and $\beta_{1}>0$ is a constant which we will determine later. Similarly,

where in the last inequality we use Young’s inequality to the inner product and $\beta_{2}>0$ is a constant. We are going to construct a potential function

and we will determine $\lambda,c_{j}$ and $d_{j}$ later. Combine (35), (36) and (38),

Then we bound the variance of the stochastic gradients,

Consider the second line. Using PL condition $\|\nabla_{y}f(x_{j+1},y_{j})\|^{2}\geq 2\mu_{2}[g(x_{j+1})-f(x_{j+1},y_{j})]$ and assuming $\lambda\geq d_{j+1}(\tau_{2}+1/\beta_{2})$ , which we will justify later by our choices of $d_{j+1}$ and $\beta_{2}$ , we have

where in the last inequality we use (35) and (20). Now we plug this into $R_{j+1}$ ,

where we define $\zeta=1-\tau_{2}\mu_{2}+\frac{\lambda}{2}d_{j+1}\left(\tau_{2}^{2}+\frac{\tau_{2}}{\beta_{2}}\right)\mu_{2}$ and $\psi=1-\zeta$ . With $\|\nabla_{x}f(x_{j},y_{j})\|^{2}\leq 2\|\nabla g(x_{j})\|^{2}+2\|\nabla g(x_{j})-\nabla_{x}f(x_{j},y_{j})\|^{2}$ ,

Then plugging in (26), (27) and (42), we get

Now we are ready to define sequences $\{c_{j}\}_{j}$ and $\{d_{j}\}_{j}$ . Let $c_{N}=d_{N}=0$ , and

The left hand side is exactly $a^{k+1}+\lambda b^{k+1}$ , because $(x_{k},y_{k})$ is sampled uniformly from $\{\{(x_{t,j},y_{t,j})\}_{j=0}^{N-1}\}_{t=0}^{T-1}$ .

It suffices to choose proper $\tau_{1}$ , $\tau_{2}$ , $N$ and $T$ such that $NT\gamma>1$ . Driven by the proof, we choose

We will choose $k_{1},k_{2},k_{3}$ and $k_{4}$ later and we let $k_{1},k_{2},k_{3},k_{4}\leq 1$ . Plug back to $c_{j}$ and $d_{j}$ , we have

where in the last inequality we assume $k_{3}^{2}+\frac{k_{3}}{k_{4}}\leq 1$ .

We define $e_{j}=\max\{c_{j},d_{j}\}$ . Then combining (53) and (54), we easily get

and note that $e_{j}>e_{j+1}$ so $e_{j}\leq e_{0},\forall j$ . Then we want to lower bound $\gamma$ . Rearrange (48),

where in the inequality, we use $\lambda=1/20$ and assume that $\frac{1}{\kappa^{2}}k_{3}^{2}(k_{1}+\frac{1}{k_{2}})\leq 10$ . Rearranging (49),

where in the inequality we use $\lambda=1/20$ and assume $k_{1}\leq k_{3}/28$ and $\frac{1}{\kappa^{2}}k_{3}^{2}\left(k_{1}+\frac{1}{k_{2}}\right)\leq 1/4$ . Note that $\frac{1}{2}\tau_{1}\mu_{1}=\frac{\mu_{1}}{2\kappa^{2}l}k_{1}$ and $\frac{l^{2}\tau_{1}}{2\min\{\mu_{1},\mu_{2}\}}=\frac{l}{2\kappa^{2}\min\{\mu_{1},\mu_{2}\}}k_{1}$ . Then we have

Letting $k_{1}/k_{2}=k_{3}/k_{4}$ and $k_{1}=\frac{1}{28}k_{3}$ , we have

where we use $c_{j},d_{j}\leq e_{0},\forall j$ . By plugging in $k_{1}=k_{3}/28$ and $\lambda=1/20$ into (55), we have

We choose $T=1$ , $k_{3}=\beta\kappa^{-6}$ and $N=\alpha(2k_{3}^{3/2}+4k_{3}^{2})^{-1}\geq\frac{\alpha}{2}k_{2}^{-3/2}$ , where $\alpha,\beta$ is irrelevant to $n,l,\mu_{1},\mu_{2}$ . Then since $(1+2k_{3}^{3/2}+4k_{3}^{2})^{N}\leq e^{\alpha}$ , after plugging in $N$ and $k_{3}$ , we have

Therefore, for choosing $\alpha$ small enough and $\beta$ small enough, we have $NT\gamma\geq 2$ . Now it remains to verify several assumptions we made in the proof. The first is $\frac{k_{3}}{k_{4}}+k_{3}^{2}\leq 1$ . Since $\frac{k_{3}}{k_{4}}+k_{3}^{2}=k_{3}^{1/2}+k_{3}^{2}$ , this assumption easily holds when $\beta\leq 1/4$ . The second assumption we want to verify is $\frac{1}{\kappa^{2}}k_{3}^{2}\left(k_{1}+\frac{1}{k_{2}}\right)\leq 1/4$ . Note that

So this assumption can also be easily satisfied when $\beta$ is small. The last assumption we need to verify is $\lambda\geq d_{j+1}\left(\tau_{2}+\frac{1}{\beta_{2}}\right)$ . Because $d_{j+1}\leq e_{0}$ and (61),

So this assumption holds when $\alpha$ and $\beta$ are small.

We start from Part 3 of the proof of Theorem 4.1. We now choose $k_{3}=\beta n^{-2/3}$ , $N=\alpha(2k_{3}^{3/2}+4k_{3}^{2})^{-1}$ , and $T=\kappa^{3}n^{-1/3}$ then

Therefore, for choosing $\alpha$ small enough and $\beta$ small enough, we have $NT\gamma\geq 2$ . Other assumptions can be easily verified by the same way as in the proof of Theorem 4.1. ∎

Appendix D AGDA for minimax problems under one-sided PL condition

We are here to show that if $-f(x,\cdot)$ satisfies PL condition with constant $\mu$ and $f(\cdot,y)$ may be nonconvex (referred to as PL game by Nouiehed et al. ), AGDA as presented in Algorithm 3 can find $\epsilon$ -stationary point of $g(x):=\max_{y}f(x,y)$ within $\mathcal{O}(\epsilon^{-2})$ iterations. Note that GDmax has complexity $\mathcal{O}(\epsilon^{-2}\log(1/\epsilon))$ on minimax problems under the one-sided PL condition [Nouiehed et al., 2019]; SGDA has complexity $\mathcal{O}(\epsilon^{-2})$ on nonconvex-strongly-concave minimax problems [Lin et al., 2019]. Here we define condition number $\kappa=\frac{\mu}{l}$ and $L$ is still defined the same as before. The proof is based on our previous analysis and Lin et al. .

Suppose Assumption 1 holds and $-f(x,\cdot)$ satisfies PL condition with constant $\mu$ for any $x$ . If we run Algorithm 3 with $\tau_{1}=\frac{1}{20\kappa^{2}l}$ and $\tau_{2}=\frac{1}{l}$ , then

where $a_{0}=g(x_{0})-g^{*}$ and $b_{0}=g(x_{0})-f(x_{0},y_{0})$ .

For convenience, we still define $b_{t}=g(x_{t})-f(x_{t},y_{t})$ . Since it can be easily verified that $\tau_{1}\leq 1/L$ , by (18) and (26), we have

where in the second inequality we use Young’s inequality, and in third inequality we use (26). We write

We note that $\beta=(1-\mu_{2}\tau_{2})\left[\frac{3}{2}\tau_{1}+l\tau_{1}^{2}\right]\leq\frac{5}{2}\tau_{1}$ because $l/\tau_{1}\leq 1$ by our choice of $\tau_{1}$ . Also,

where in the last inequality we use $\mu_{2}\tau_{2}=1/\kappa$ and $(1-\mu_{2}\tau_{2})\frac{\tau_{1}l^{2}}{\mu_{2}}=(1-1/\kappa)/(20\kappa)\leq 1/(20\kappa)$ . Plugging into (73),

where in the inequality we use $1-\alpha\geq 1/(2\kappa)$ again.