Global Convergence and Variance-Reduced Optimization for a Class of Nonconvex-Nonconcave Minimax Problems

Junchi Yang, Negar Kiyavash, Niao He

Introduction

We consider minimax optimization problems of the forms

where ξ\xi is a random vector with support Ξ\Xi, and f(x,y)f(x,y) is a possibly nonconvex-nonconcave function. Minimax problems have been widely studied in game theory and operations research. Recent emerging applications in machine learning have further stimulated a surge of interest in these problems. For example, generative adversarial networks (GANs) (Goodfellow et al., 2016) can be viewed as a two-player game between a generator that produces synthetic data and a discriminator that differentiates between true data and synthetic data. In reinforcement learning, solving Bellman equations can also be reformulated as minimax optimization problems (Chen and Wang, 2016; Dai et al., 2017, 2018). Other applications include robust optimization (Namkoong and Duchi, 2016, 2017), adversarial machine learning (Sinha et al., 2017; Madry et al., 2017), unsupervised learning (Xu et al., 2005), and so on.

The most natural and frequently used methods for solving minimax problems (1) and (2) are the gradient descent ascent (GDA) algorithms (or their stochastic variants), with either simultaneous or alternating updates of the primal-dual variables, referred to as SGDA and AGDA, respectively, throughout the paper. While these algorithms have received much empirical success especially in adversarial training, it is known that these GDA algorithms with constant stepsizes could fail to converge for general smooth function (Mescheder et al., 2018), even for the bilinear games (Gidel et al., 2019); even when they do converge, the stable limit point may not be a local Nash equilibrium (Daskalakis et al., 2018; Mazumdar and Ratliff, 2018). On the other hand, GDA algorithms can converge linearly to the saddle point for strongly-convex-strongly-concave functions (Facchinei and Pang, 2007). Moreover, for many simple nonconvex-nonconcave objective functions, such as, f(x,y)=x2+3sin2xsin2y4y210sin2y,f(x,y)=x^{2}+3\sin^{2}x\sin^{2}y-4y^{2}-10\sin^{2}y, we also observe that GDA algorithms with constant stepsizes indeed converge to the global Nash equilibrium (or saddle point), at a linear rate (see Figure 1). This also holds true for their stochastic variants, albeit at a sublinear rate. These facts naturally raise a question: Is there a general condition under which GDA algorithms converge to the global optima?

In this paper, we address these two questions and specifically focus on the alternating gradient descent ascent, namely AGDA. This is due to several considerations. First of all, it has been recently shown that alternating updates of GDA are more stable than simultaneous updates (Gidel et al., 2019; Bailey et al., 2019). Note that for a convex-concave matrix game, SGDA may diverge while AGDA is proven to always have bounded iterates (Gidel et al., 2019). See Figure 2 for a simple illustration. Secondly, in general, it is much more challenging to analyze AGDA than SGDA. There is a lack of discussion on the convergence of AGDA for general minimax problems in the literature. Our contributions are summarized as follows.

Second, for minimax problems with the finite sum structure, we introduce a variance-reduced AGDA algorithm (VR-AGDA) that leverages the idea of stochastic variance reduced gradient (SVRG) (Johnson and Zhang, 2013; Reddi et al., 2016a) with the alternating updates. We prove that VR-AGDA achieves the complexity of O((n2/3κ3log(1/ϵ))\mathcal{O}((n^{2/3}\kappa^{3}\log(1/\epsilon)) in the region nκ9n\leq\kappa^{9} and O(n+κ9)log(1/ϵ))\mathcal{O}(n+\kappa^{9})\log(1/\epsilon)) in the region nκ9n\geq\kappa^{9}, where nn is the number of component functions. This greatly improves over the O(nκ3log1ϵ)\mathcal{O}\left(n\kappa^{3}\log\frac{1}{\epsilon}\right) complexity of AGDA when applied to the finite sum minimax problems. We summarize the results of these algorithms in Table 1. Our numerical experiments further demonstrate that VR-AGDA performs significantly better than AGDA and Stoc-AGDA, especially for problems with large condition numbers. To our best knowledge, this is the first work to provide a variance reduced algorithm and theoretical guarantees in the nonconvex-nonconcave regime of minimax optimization.

2 Related work

There has been a recent surge in research on solving minimax optimization beyond the convex-concave regime (Sinha et al., 2017; Chen et al., 2017; Qian et al., 2019; Thekumparampil et al., 2019; Lin et al., 2018; Nouiehed et al., 2019; Abernethy et al., 2019), but they differ from our work from various perspectives. For example, Chen et al. (2017); Sinha et al. (2017); Lin et al. (2019); Thekumparampil et al. (2019) considered the minimax problem when the objective function is nonconvex in xx but concave in yy and focused on achieving convergence to stationary points. Their algorithms require solving the inner maximization or some sub-problems with high accuracy at every iteration, which are different from AGDA. Lin et al. (2018) considered a general class of weakly-convex weakly-concave minimax problems and proposed an inexact proximal point method to find an ϵ\epsilon-stationary point. Their convergence result relies on assuming the existence of a solution to the corresponding Minty variational inequality, which is often hard to verify. Abernethy et al. (2019) recently showed the linear convergence of a second-order iterative algorithm, called Hamiltonian gradient descent (HGD), for a subclass of “sufficiently bilinear” functions. Compared with their work, the PL condition we consider in this paper is easier to verify and GDA algorithms are much simpler.

PL condition.

Recently, Nouiehed et al. (2019) studied a class of minimax problems where the objective only satisfies a one-sided PL condition and introduced the GDmax algorithm, which takes multiple ascent steps at every iteration. Our work differs from (Nouiehed et al., 2019) in two aspects: (i) we consider the two-sided PL condition which guarantees global convergence We also show that AGDA can find ϵ\epsilon-stationary point for minimax problems under the one-sided PL condition within O(1/ϵ2)\mathcal{O}(1/\epsilon^{2}) iterations in Appendix D.; (ii) we consider AGDA which takes one ascent step at every iteration. Another closely related work is Cai et al. (2019). The authors considered a specific application in generative adversarial imitation learning with linear quadratic regulator dynamics. This is a special example that falls under the two-sided PL condition.

Variance-reduced minimax optimization.

There exists a few works that apply variance reduction techniques to minimax optimization. Palaniappan and Bach (2016); Luo et al. (2019) provided linear-convergent algorithms for strongly-convex-strongly-concave objectives, based on simultaneous updates. Du and Hu (2019) extended the result to convex-strongly-concave objectives with full-rank coupling bilinear term. In contrast, we are dealing with a much broader class of objectives that are possibly nonconvex-nonconcave. We point out that Luo et al. (2020) recently introduced a variance-reduced algorithm for finding the stationary point of nonconvex-strongly-concave problems, which is again different from our setting.

The rest of this paper is organized as follows. In Section 2, we introduce the two-sided PL condition and show the equivalence of three min-max optimality criteria under this condition. In Section 3, we describe deterministic and stochastic AGDA algorithms, and provide convergence analyses of those algorithms under the two-sided PL condition. In Section 4, we introduce the variance-reduced AGDA algorithm and establish its convergence results. In Section 5, we provide numerical performance of these algorithms for robust least square and imitation learning for LQR.

Global optima and two-sided PL condition

Throughout this paper, we assume that the function f(x,y)f(x,y) in (1) is continuously differentiable and has Lipschitz gradient. We state it as a basic assumption. Here \|\cdot\| is used to denote the Euclidean norm.

There exists a positive constant l>0l>0 such that

We now define three notions of optimality for minimax problems. The most direct notion of optimality is global minimax point, at which xx^{*} is an optimal solution to the function g(x):=maxyf(x,y)g(x):=\max_{y}f(x,y) and yy^{*} is an optimal solution to maxyf(x,y)\max_{y}f(x^{*},y). In the two-player zero-sum game, the notion of saddle point is also widely used (Von Neumann et al., 2007; Nash, 1953). For a saddle point (x,y)(x^{*},y^{*}), xx^{*} is an optimal solution to minxf(x,y)\min_{x}f(x,y^{*}) and yy^{*} is an optimal solution to maxyf(x,y)\max_{y}f(x^{*},y).

(x,y)(x^{*},y^{*}) is a global minimax point, if for any (x,y):(x,y):

(x,y)(x^{*},y^{*}) is a saddle point, if for any (x,y):(x,y):

(x,y)(x^{*},y^{*}) is a stationary point, if ::

For general nonconvex-nonconcave minimax problems, these three notions of optimality are not necessarily equivalent. A stationary point may not be a saddle point or a global minimax point; a global minimax point may not be a saddle point or a stationary point. Note that generally speaking, for minimax problems, a saddle point or a global minimax point may not always exist. However, since our goal in this paper is to find global optima, in the remainder of the paper, we assume that a saddle point always exists.

We introduce a straightforward generalization of the PL condition to the minimax problem: function f(x,y)f(x,y) satisfies the PL condition with constant μ1\mu_{1} with respect to xx, and -ff satisfies PL condition with constant μ2\mu_{2} with respect to yy. We formally state this in the following definition.

A continuously differentiable function f(x,y)f(x,y) satisfies the two-sided PL condition if there exist constants μ1,μ2>0\mu_{1},\mu_{2}>0 such that:

The two-sided PL condition does not imply convexity-concavity, and it is a much weaker condition than strong-convexity-strong-concavity. In Lemma 2.1, we show that three notions of optimality are equivalent under the two-sided PL condition. Note that they may not be unique.

If the objective function f(x,y)f(x,y) satisfies the two-sided PL condition, then the following holds true:

Below we give some examples that satisfy this condition.

The nonconvex-nonconcave function in the introduction, f(x,y)=x2+3sin2xsin2y4y210sin2yf(x,y)=x^{2}+3\sin^{2}x\sin^{2}y-4y^{2}-10\sin^{2}y satisfies the two-sided PL condition with μ1=1/16,μ2=1/11\mu_{1}=1/16,\mu_{2}=1/11 (see Appendix A).

f(x,y)=F(Ax,By)f(x,y)=F(Ax,By), where F(,)F(\cdot,\cdot) is strongly-convex-strongly-concave and AA and BB are arbitrary matrices, satisfies the two-sided PL condition.

The generative adversarial imitation learning for LQR can be formulated as minKminθm(K,θ)\min_{K}\min_{\theta}m(K,\theta), where mm is strongly-concave in terms of θ\theta and satisfies PL condition in terms of KK (see (Cai et al., 2019) for more details), thus satisfying the two-sided PL condition.

Under the two-sided PL condition, the function g(x):=maxyf(x,y)g(x):=\max_{y}f(x,y) can be shown to satisfy PL condition with μ1\mu_{1} (see Appendix A). Moreover, it holds that gg is also LL-smooth with L:=l+l2/μ2L:=l+l^{2}/\mu_{2} (Nouiehed et al., 2019). Finally, we denote μ=min(μ1,μ2)\mu=\min(\mu_{1},\mu_{2}) and κ=lμ\kappa=\frac{l}{\mu}, which represents the condition number of the problem.

Global convergence of AGDA and Stoc-AGDA

In this section, we establish the convergence rate of the stochastic alternating gradient descent ascent (Stoc-AGDA) algorithm, which we present in Algorithm 1, under the two-sided PL condition. Stoc-AGDA updates variables xx and yy sequentially using stochastic gradient descent/ascent steps. Here we make standard assumptions about stochastic gradients Gx(x,y,ξ)G_{x}(x,y,\xi) and Gy(x,y,ξ)G_{y}(x,y,\xi).

Gx(x,y,ξ)G_{x}(x,y,\xi) and Gy(x,y,ξ)G_{y}(x,y,\xi) are unbiased stochastic estimators of xf(x,y)\nabla_{x}f(x,y) and yf(x,y)\nabla_{y}f(x,y) and have variances bounded by σ2>0\sigma^{2}>0.

Note that Stoc-AGDA with constant stepsizes (i.e., τ1t=τ1\tau_{1}^{t}=\tau_{1} and τ2t=τ2\tau_{2}^{t}=\tau_{2}) and noiseless stochastic gradient (i.e., σ2=0\sigma^{2}=0) reduces to AGDA:

We will measure the inaccuracy of (xt,yt)(x_{t},y_{t}) through the potential function

We first consider Stoc-AGDA with constant stepsizes. We show that {(xt,yt)}t\{(x_{t},y_{t})\}_{t} will converge linearly to a neighbourhood of the optimal set.

Suppose Assumptions 1, 2, 3 hold and f(x,y)f(x,y) satisfies the two-sided PL condition with μ1\mu_{1} and μ2\mu_{2}. Define Pt:=at+110btP_{t}:=a_{t}+\frac{1}{10}b_{t}. If we run Algorithm 1 with τ2t=τ21l\tau_{2}^{t}=\tau_{2}\leq\frac{1}{l} and τ1t=τ1μ22τ218l2\tau_{1}^{t}=\tau_{1}\leq\frac{\mu_{2}^{2}\tau_{2}}{18l^{2}}, then

where δ=(1μ2τ2)(L+l)τ12+lτ22+10Lτ1210μ1τ1σ2.\delta=\frac{(1-\mu_{2}\tau_{2})(L+l)\tau_{1}^{2}+l\tau_{2}^{2}+10L\tau_{1}^{2}}{10\mu_{1}\tau_{1}}\sigma^{2}.

In the theorem above, we choose τ1\tau_{1} smaller than τ2\tau_{2}, τ1/τ2μ22/(18l2)\tau_{1}/\tau_{2}\leq\mu_{2}^{2}/(18l^{2}), because our potential function is not symmetric about xx and yy. Another reason is because we want yty_{t} to approach y(xt)argmaxyf(xt,y)y^{*}(x_{t})\in\arg\max_{y}f(x_{t},y) faster so that xf(xt,yt)\nabla_{x}f(x_{t},y_{t}) is a better approximation for g(xt)\nabla g(x_{t}) (g(x)=xf(x,y(x))\nabla g(x)=\nabla_{x}f(x,y^{*}(x)), see Nouiehed et al. (2019)). Indeed, it is common to use different learning rates for xx and yy in GDA algorithms for nonconvex minimax problems; see e.g., Jin et al. (2019) and Lin et al. (2019). Note that the ratio between these two learning rates is quite crucial here. We also observe empirically when the same learning rate is used, even if small, the algorithm may not converge to saddle points.

When tt\rightarrow\infty, PtδP_{t}\to\delta. If τ10\tau_{1}\rightarrow 0 and τ22/τ10\tau_{2}^{2}/\tau_{1}\rightarrow 0, the error term δ\delta will go to 0. When using smaller stepsizes, the algorithm reaches a smaller neighbour of the saddle point yet at the cost of a slower rate, as the contraction factor also deteriorates.

Linear convergence of AGDA

Setting σ2=0\sigma^{2}=0, it follows immediately from the previous theorem that AGDA converges linearly under the two-sided PL condition. Moreover, we have

Suppose Assumptions 1, 2 hold and f(x,y)f(x,y) satisfies the two-sided PL condition with μ1\mu_{1} and μ2\mu_{2}. Define Pt:=at+110btP_{t}:=a_{t}+\frac{1}{10}b_{t}. If we run AGDA with τ1=μ2218l3\tau_{1}=\frac{\mu_{2}^{2}}{18l^{3}} and τ2=1l\tau_{2}=\frac{1}{l}, then

Furthermore, {(xt,yt)}t\{(x_{t},y_{t})\}_{t} converges to some saddle point (x,y)(x^{*},y^{*}), and

where α\alpha is a constant depending on μ1,μ2\mu_{1},\mu_{2} and ll.

The above theorem implies that the limit point of {(xt,yt)}t\{(x_{t},y_{t})\}_{t} is a saddle point and the distance to the saddle point decreases in the order of O((1κ3)t)\mathcal{O}\left((1-\kappa^{-3})^{t}\right). Note that in the special case when the objective is strongly-convex-strongly-concave, it is known that SGDA (GDA with simultaneous updates) achieves an O(κ2log(1/ϵ))\mathcal{O}(\kappa^{2}\log(1/\epsilon)) iteration complexity (see, e.g., Facchinei and Pang (2007)) and this can be further improved to O(κlog(1/ϵ))\mathcal{O}(\kappa\log(1/\epsilon)) by extragradient methods (Korpelevich, 1976), Nesterov’s dual extrapolation (Nesterov and Scrimali, 2006) or accelerated proximal point algorithm (Lin et al., 2020). However, these result relies heavily on the strong monotonicity of the corresponding variational inequality. For the general two-sided PL condition, we may not achieve the same dependency on κ\kappa.

Stoc-AGDA with diminishing stepsizes

While Stoc-AGDA with constant stepsizes only converges linearly to a neighbourhood of the saddle point, Stoc-AGDA with diminishing stepsizes converges to the saddle point but at a sublinear rate O(1/t)\mathcal{O}(1/t).

Suppose Assumptions 1, 2, 3 hold and f(x,y)f(x,y) satisfies the two-sided PL condition with μ1\mu_{1} and μ2\mu_{2}. Define Pt=at+110btP_{t}=a_{t}+\frac{1}{10}b_{t}. If we run algorithm 1 with stepsizes τ1t=βγ+t\tau_{1}^{t}=\frac{\beta}{\gamma+t} and τ2t=18l2βμ22(γ+t)\tau_{2}^{t}=\frac{18l^{2}\beta}{\mu_{2}^{2}(\gamma+t)} for some β>2/μ1\beta>2/\mu_{1} and γ>0\gamma>0 such that τ11min{1/L,μ22/18l2}\tau_{1}^{1}\leq\min\{1/L,\mu_{2}^{2}/18l^{2}\}, then we have

Note the rate is affected by ν\nu, and the first term in the definition of ν\nu is controlled by the initial point. In practice, we can find a good initial point by running Stoc-AGDA with constant stepsizes so that only the second term in the definition of ν\nu matters. Then by choosing β=3/μ1\beta=3/\mu_{1}, we have ν=O(l5σ2μ12μ24)\nu=\mathcal{O}\left(\frac{l^{5}\sigma^{2}}{\mu_{1}^{2}\mu_{2}^{4}}\right). Thus, the convergence rate of Stoc-AGDA is O(κ5σ2μt)\mathcal{O}\left(\frac{\kappa^{5}\sigma^{2}}{\mu t}\right).

Stochastic variance reduced algorithm

In this section, we study the minimax problem in (2) with the finite-sum structure:

which arises ubiquitously in machine learning. We are especially interested in the case when nn is large. We assume the overall objective function f(x,y)f(x,y) still satisfies the two-sided PL condition with μ1\mu_{1} and μ2\mu_{2}, but we do not assume each fif_{i} to satisfy the two-sided PL condition. Instead of Assumption 1, we now assume each component fif_{i} has Lipschitz gradients.

If we run AGDA with full gradients to solve the finite-sum minimax problem, the total complexity for finding an ϵ\epsilon-optimal solution is O(nκ3log(1/ϵ))\mathcal{O}(n\kappa^{3}\log(1/\epsilon)) by Theorem 3.2. Despite the linear convergence, the per-iteration cost is high and the complexity can be huge when the number of components nn and condition number κ\kappa are large. Instead, if we run Stoc-AGDA, this leads to the total complexity O(κ5σ2μ2ϵ)\mathcal{O}\left(\frac{\kappa^{5}\sigma^{2}}{\mu_{2}\epsilon}\right) by Remark 3, which has worse dependence on ϵ\epsilon.

Motivated by the recent success of stochastic variance reduced gradient (SVRG) technique (Johnson and Zhang, 2013; Reddi et al., 2016a; Palaniappan and Bach, 2016), we introduce the VR-AGDA algorithm (presented in Algorithm 2), that combines AGDA with SVRG so that the linear convergence is preserved while improving the dependency on nn and κ\kappa. VR-AGDA can be viewed as the applying SVRG to AGDA with restarting: at every epoch kk, we restart the SVRG subroutine (with TT outer iterations, NN inner steps) by initializing it with (xk,yk)(x^{k},y^{k}), which is randomly selected from previous SVRG subroutine. This is partly inspired by the GD-SVRG algorithm for minimizing PL functions (Reddi et al., 2016a). Notice when T=1T=1, VR-AGDA reduces to a double-loop algorithm which is similar to the SVRG for saddle point problems proposed by Palaniappan and Bach (2016), except for several notable differences: (i) we are using the alternating updates rather than simultaneous updates, (ii) as a result, we require to sample two independent indices rather than one at each iteration, and (iii) most importantly, we are dealing with possibly nonconvex-nonconcave objectives that satisfy the two-sided PL condition.

The following two theorems capture the convergence of VR-AGDA.

for VR-AGDA to achieve an ϵ\epsilon-optimal solution.

Under the same assumptions in Theorem 4.1 and further assuming nκ9n\leq\kappa^{9} , if we run VR-AGDA with τ1=β/(28κ2ln2/3)\tau_{1}=\beta/(28\kappa^{2}ln^{2/3}), τ2=β/(ln2/3)\tau_{2}=\beta/(ln^{2/3}), N=αβ2/3n(2+4β1/2n1/3)1N=\lfloor\alpha\beta^{-2/3}n(2+4\beta^{1/2}n^{-1/3})^{-1}\rfloor, and T=κ3n1/3T=\lceil\kappa^{3}n^{-1/3}\rceil, where α,β\alpha,\beta are constants irrelevant to l,n,μ1,μ2l,n,\mu_{1},\mu_{2}, then Pk+112Pk.P_{k+1}\leq\frac{1}{2}P_{k}. This further implies a total complexity of

for VR-AGDA to achieve an ϵ\epsilon-optimal solution.

Theorems 4.1 and 4.2 are different in their choices of stepsizes and iteration numbers, which gives rise to different complexities. Another difference is that Theorem 4.2 only works in the regime where the number of components nn is not “too large” compared to the condition number, i.e., nκ9n\leq\kappa^{9}, which naturally guarantees T=κ3n1/31T=\lceil\kappa^{3}n^{-1/3}\rceil\geq 1.

Since AGDA has complexity \mathcal{O}\big{(}n\kappa^{3}\log(1/\epsilon)\big{)}, VR-AGDA with the setting in Theorem 4.1 is better than AGDA when nκ6n\geq\mathcal{\kappa}^{6}. With the setting in Theorem 4.2, VR-AGDA outperforms AGDA as long as the assumption nκ9n\leq\kappa^{9} holds. As a result of these two theorems, VR-AGDA always improves over AGDA. Furthermore, VR-AGDA with the second setting has a lower complexity than the first setting in the regime nκ9n\leq\kappa^{9}, although the first setting allows a simpler double-loop algorithm. Figure 3 summarizes the performance of VR-AGDA compared to AGDA in different regimes of nn and κ\kappa.

Experiments

In the introduction, we already presented the convergence results of AGDA on a two-dimensional nonconvex-nonconcave function that satisfies the two-sided PL condition. In this section, we will present numerical experiments on machine learning applications: robust least square and imitation learning for linear quadratic regulators (LQR). Particularly, we focus on the comparison between AGDA, Stoc-AGDA, and VR-AGDA.

where we also adopt the general M-(semi-)norm in (13): xM2=xTMx\|x\|_{M}^{2}=x^{T}Mx and MM is positive semi-definite. F(x,y)F(x,y) satisfies the two-sided PL condition when λ>1\lambda>1, because it can be written as the composition of a strongly-convex-strongly-concave function and an affine function (Example 2). However, F(x,y)F(x,y) is not strongly convex about xx, and when MM is not full-rank, it is not strongly concave about yy.

Evaluation. For each dataset, we compare three algorithms: AGDA, Stoc-AGDA, and VR-AGDA. We tune the stepsizes of all algorithms to achieve the best convergence. For Stoc-AGDA, we choose constant stepsizes to form a fair comparison with the other two. We report the potential function value, i.e., PtP_{t} described in our theorems, and distance to the limit point (xt,yt)(x,y)2\|(x_{t},y_{t})-(x^{*},y^{*})\|^{2}. These errors are plotted against the number of gradient evaluations normalized by nn (i.e., number of full gradients). Results are reported in Figure 4. We observe that VR-AGDA and AGDA both exhibit linear convergence, and the speedup of VR-AGDA is fairly significant when the condition number is large, whereas Stoc-AGDA progresses fast at the beginning and stagnates later on. These numerical results clearly validate our theoretical findings.

2 Generative adversarial imitation learning for LQR

The optimal control problem for LQR can be formulated as:

where the trajectory is induced by LQR dynamics and policy KK. In generative adversarial imitation learning for LQR, the trajectories induced by an expert policy KEK_{E} are observed and part of the goal is to learn the cost function parameters QQ and RR from the expert. This can be formulated as a minimax problem (Cai et al., 2019):

where m(K,Q,R):=C(K;Q,R)C(KE;Q,R)Φ(Q,R)m(K,Q,R):=C(K;Q,R)-C(K_{E};Q,R)-\Phi(Q,R), Θ={(Q,R):αQIQβQI,αRIRβRI}\Theta=\{(Q,R):\alpha_{Q}I\preceq Q\preceq\beta_{Q}I,\quad\alpha_{R}I\preceq R\preceq\beta_{R}I\} and Φ\Phi is a strongly-convex regularizer. We sample nn initial points x0(1),x0(2),...,x0(n)x_{0}^{(1)},x_{0}^{(2)},...,x_{0}^{(n)} from D\mathcal{D} and approximate C(K;Q,R)C(K;Q,R) by sample average

Note that mnm_{n} satisfies the PL condition in terms of KK (Fazel et al., 2018), and mnm_{n} is strongly-concave in terms of (Q,R)(Q,R), so the function satisfies the two-sided PL condition.

In our experiment, we use Φ(Q,R)=λ(QQˉ2+RRˉ2)\Phi(Q,R)=\lambda(\|Q-\bar{Q}\|^{2}+\|R-\bar{R}\|^{2}) for some Qˉ,Rˉ\bar{Q},\bar{R} and λ=1\lambda=1. We generate three datasets with different dimensions: (1) d=3,k=2d=3,k=2; (2) d=20,k=10d=20,k=10; (3) d=30,k=20d=30,k=20. The initial distribution D\mathcal{D} is N(0,Id)\mathcal{N}(0,I_{d}) and we sample n=100n=100 initial points. The exact gradients can be computed based on the compact forms established in Fazel et al. (2018); Cai et al. (2019). We compare AGDA and VR-AGDA under fine-tuned stepsizes, and track their errors in terms of KtK2+QtQF2+RtRF2\|K_{t}-K^{*}\|^{2}+\|Q_{t}-Q^{*}\|_{F}^{2}+\|R_{t}-R^{*}\|_{F}^{2}. The result is reported in Figure 5, which again indicates that VR-AGDA significantly outperforms AGDA.

Conclusion

In this paper, we identify a subclass of nonconvex-nonconcave minimax problems, represented by the the so-called two-side PL condition, for which AGDA and Stoc-AGDA can converge to global saddle points. We also propose the first linearly-convergent variance-reduced AGDA algorithm that is provably always faster than AGDA, for this subclass of minimax problems . We hope this work can shed some light on the understanding of nonconvex-nonconcave minimax optimization: (1) different learning rates for two players are essential in GDA algorithms with alternating updates; (2) convexity-concavity is not a watershed to guarantee global convergence of GDA algorithms; (3) the complexity of solving minimax problems under PL conditions may have high-order dependence on the condition number in contrast to problems with strong convex-concavity conditions. It remains interesting to explore whether similar results apply to GDA algorithms with simultaneous updates and whether these algorithms can be further accelerated with momentum or catalyst schemes.

References

Appendix A Proofs for Section 2

If f()f(\cdot) is l-smooth and it satisfies PL with constant μ\mu, then it also satisfies error bound (EB) condition with μ\mu, i.e.

where xpx_{p} is the projection of xx onto the optimal set, also it satisfies quadratic growth (QG) condition with μ\mu, i.e.

Conversely, if f()f(\cdot) is l-smooth and it satisfies EB with constant μ\mu, then it satisfies PL with constant μ/l\mu/l.

From the above lemma, we easily derive that lμl\geq\mu.

In the minimax problem, when f(x,)-f(x,\cdot) satisfies PL condition with constant μ2\mu_{2} for any xx and ff satisfies Assumption 1, then the function g(x):=maxyf(x,y)g(x):=\max_{y}f(x,y) is LL-smooth with L:=l+l2/μ2L:=l+l^{2}/\mu_{2} and g(x)=xf(x,y(x))\nabla g(x)=\nabla_{x}f(x,y^{*}(x)) for any y(x)argmaxyf(x,y)y^{*}(x)\in\arg\max_{y}f(x,y).

In the minimax problem 1, when the objective function ff satisfies Assumption 1 (Lipschitz gradient) and the two-sided PL condition with constant μ1\mu_{1} and μ2\mu_{2}, then function g(x):=maxyf(x,y)g(x):=\max_{y}f(x,y) satisfies the PL condition with μ1\mu_{1}.

Since f(,y)f(\cdot,y) satisfies PL condition with constant μ1\mu_{1}, we get

Combining equation (14) and (15), we obtain,

The following lemma states that stochastic gradient descent converges linearly to the neighbourhood of the optimal set under PL condition. The proof is based on [Karimi et al., 2016].

(stationary point) \Longrightarrow (saddle point): From the definition of PL condition, if (x,y)(x^{*},y^{*}) is a stationary point,

so maxyf(x,y)=f(x,y)=minxf(x,y)\max_{y}f(x^{*},y)=f(x^{*},y^{*})=\min_{x}f(x,y^{*}), and therefore f(x,y)f(x^{*},y^{*}) is a saddle point.

(saddle point) \Longrightarrow (global minimax point): Follow from definitions.

(global minimax point) \Longrightarrow (stationary point): If (x,y)(x^{*},y^{*}) is a global minimax point, then by definition,

Then by first order necessary condition, we have,

Thus, (x,y)(x^{*},y^{*}) is a stationary point.

satisfies the two-sided PL condition with μ1=1/16,μ2=1/14\mu_{1}=1/16,\mu_{2}=1/14.

It is not hard to derive that argminxf(x,y)=0,y\arg\min_{x}f(x,y)=0,\forall y, and argmaxyf(x,y)=0,x\arg\max_{y}f(x,y)=0,\forall x, i.e. x(y)=y(x)=0,x,yx^{*}(y)=y^{*}(x)=0,\forall x,y. Therefore, (0,0)(0,0) is the only saddle point. Then compute the gradients:

so f(,y)f(\cdot,y) is L1L_{1}-smooth with L1=8L_{1}=8 for any xx and f(x,)f(x,\cdot) is L2L_{2}-smooth with L2=28L_{2}=28 for any yy. Then note that:

So f(,y)f(\cdot,y) satisfies EB with μEB1=1/2\mu_{EB1}=1/2, and -f(x,)f(x,\cdot) satisfies EB with μEB2=2\mu_{EB2}=2. By Lemma A.1, we have f(,y)f(\cdot,y) satisfies PL with constant μ1=1/16\mu_{1}=1/16 and -f(x,)f(x,\cdot) satisfies PL with constant μ1=1/14\mu_{1}=1/14.

Appendix B Proofs for Section 3

Before we step into proofs for Theorem 3.1, 3.2 and 3.3, we first present a contraction theorem for each iteration.

and λ,β>0\lambda,\beta>0 such that k11k_{1}\leq 1.

Because gg is LL-smooth by Lemma A.2, we have

Taking expectation of both side and use Assumption 3, we get

where in the second inequality we use Assumption 3, and in the third inequality we use τ11/L\tau_{1}\leq 1/L. Because f(xt+1,y)-f(x_{t+1},y) is ll-smooth and μ1\mu_{1}-PL, by Lemma A.4, when τ11/l\tau_{1}\leq 1/l we have

Because of lipschitz continuity of the gradient, we can bound f(xt,yt)f(xt+1,yt)f(x_{t},y_{t})-f(x_{t+1},y_{t}) as

Taking expectation of both side and use Assumption 3,

Combining (18) and (22), we have for λ>0\forall\lambda>0

where in the second inequality we use Young’s Inequality and β>0\beta>0. Now it suffices to bound g(xt)2\|g(x_{t})\|^{2} and xf(xt,yt)g(xt)2\|\nabla_{x}f(x_{t},y_{t})-\nabla g(x_{t})\|^{2} by ata_{t} and btb_{t}. With Lemma A.2, we have:

for any y(xt)argmaxyf(xt,y)y^{*}(x_{t})\in\arg\max_{y}f(x_{t},y). Now we fix y(xt)y^{*}(x_{t}) to be the projection of yty_{t} on the the set argmaxyf(xt,y)\arg\max_{y}f(x_{t},y). Because f(xt,)-f(\bm{x_{t}},\cdot) satisfies PL condition with μ2\mu_{2}, and Lemma A.1 therefore indicates it also satisfies quadratic growth condition with μ2\mu_{2}, i.e.

Because gg satisfies PL condition with μ1\mu_{1} by Lemma A.3,

In the setting of Theorem 1, τ1t=τ1\tau_{1}^{t}=\tau_{1} and τ2t=τ2,t\tau_{2}^{t}=\tau_{2},\forall t. By Thoerem B.1, We only need to choose τ1\tau_{1}, τ2\tau_{2}, λ\lambda and β\beta to let k1,k2<1k_{1},k_{2}<1. Here we first choose β=1\beta=1 and λ=1/10\lambda=1/10. Then

where in the last inequality we just plug in β\beta and λ\lambda and use lτ11l\tau_{1}\leq 1. Also,

where in the last inequality we plug in β\beta and λ\lambda and we use μ22τ2τ1l218\frac{\mu_{2}^{2}\tau_{2}}{\tau_{1}l^{2}}\leq 18 by our choice of τ1\tau_{1}. Note that 12τ1μ1<l2τ1μ2\frac{1}{2}\tau_{1}\mu_{1}<\frac{l^{2}\tau_{1}}{\mu_{2}}, because (12τ1μ1)/(l2τ1μ2)=μ1μ22l2<1\left(\frac{1}{2}\tau_{1}\mu_{1}\right)/\left(\frac{l^{2}\tau_{1}}{\mu_{2}}\right)=\frac{\mu_{1}\mu_{2}}{2l^{2}}<1. Define Pt:=at+110btP_{t}:=a_{t}+\frac{1}{10}b_{t}, and by Theorem B.1,

We verify that τ11/L\tau_{1}\leq 1/L by noting: τ1μ22τ218l2μ2218l3μ22l2\tau_{1}\leq\frac{\mu_{2}^{2}\tau_{2}}{18l^{2}}\leq\frac{\mu_{2}^{2}}{18l^{3}}\leq\frac{\mu_{2}}{2l^{2}} and L=l+l2μ22l2μ2L=l+\frac{l^{2}}{\mu_{2}}\leq\frac{2l^{2}}{\mu_{2}}. ∎

The first part of Theorem 3.2 is a direct corollary of Theorem 3.1 by setting σ=0\sigma=0. We show the second part by noting that

where in the second inequality y(xt)y^{*}(x_{t}) is the projection of yty_{t} on the the set argmaxyf(xt,y)\arg\max_{y}f(x_{t},y) and yf(xt,y(xt))=0\nabla_{y}f(x_{t},y^{*}(x_{t}))=0, in the third inequality we use lipschtiz continuity of gradient, and in the last inequality we use quadratic growth condition. Also,

where in the first equality xx^{*} is the projection of xtx_{t} on the set argminxg(x)\arg\min_{x}g(x) and g(x)=0\nabla g(x^{*})=0, in the second inequality y(xt)y^{*}(x_{t}) is the projection of yty_{t} on the the set argmaxyf(xt,y)\arg\max_{y}f(x_{t},y) and g(xt)=xf(xt,yt)\nabla g(x_{t})=\nabla_{x}f(x_{t},y_{t}), and in the last inequality we use quadratic growth condition. Therefore with (32) and (33),

where c=1μ1μ2236l3c=1-\frac{\mu_{1}\mu_{2}^{2}}{36l^{3}}. Letting α1=[2(1+τ22l2)τ12L2μ1+20(1+τ22l2)τ12l2+20l2τ22μ2]P0\alpha_{1}=\left[\frac{2(1+\tau_{2}^{2}l^{2})\tau_{1}^{2}L^{2}}{\mu_{1}}+\frac{20(1+\tau_{2}^{2}l^{2})\tau_{1}^{2}l^{2}+20l^{2}\tau_{2}^{2}}{\mu_{2}}\right]P_{0}, we have

so {(xt,yt)}t\{(x_{t},y_{t})\}_{t} converges and by first part of this theorem the limit (x,y)(x^{*},y^{*}) must be a saddle point. Thus we have

with α=2[2(1+τ22l2)τ12L2μ1+20(1+τ22l2)τ12l2+20l2τ22μ2]/(1c)2\alpha=2\left[\frac{2(1+\tau_{2}^{2}l^{2})\tau_{1}^{2}L^{2}}{\mu_{1}}+\frac{20(1+\tau_{2}^{2}l^{2})\tau_{1}^{2}l^{2}+20l^{2}\tau_{2}^{2}}{\mu_{2}}\right]/(1-\sqrt{c})^{2}. ∎

First note that since τ1tμ22/18l2\tau_{1}^{t}\leq\mu_{2}^{2}/18l^{2}, τ2t=18l2βμ22(γ+t)=18l2τ1tμ221l\tau_{2}^{t}=\frac{18l^{2}\beta}{\mu_{2}^{2}(\gamma+t)}=\frac{18l^{2}\tau_{1}^{t}}{\mu_{2}^{2}}\leq\frac{1}{l}. Similar to the proof of Theorem 3.1, by choosing β=1\beta=1 and λ=1/10\lambda=1/10 in the Theorem B.1, we have min{k1,k2}=12μ1τ1t\min\{k_{1},k_{2}\}=\frac{1}{2}\mu_{1}\tau_{1}^{t}. We prove the theorem by induction. When t = 1, it is naturally satisfied by definition of ν\nu. We assume that Ptνγ+tP_{t}\leq\frac{\nu}{\gamma+t}. Then by Theorem B.1,

where in the second inequality we plug in τ1t\tau_{1}^{t} and τ2t\tau_{2}^{t}, in the last inequality we use (γ+t+1)(γ+t1)(γ+t)2(\gamma+t+1)(\gamma+t-1)\leq(\gamma+t)^{2} and the fact that sum of last two terms in (34) is no greater than 0 by our choice of ν\nu. ∎

Appendix C Proofs for Section 4

Because the proof is long, we break the proof into three parts for the convenience of understanding the intuition behind it.

Note that these are unbiased stochastic gradients. Similar to the proof of Theorem B.1 (replace σ2\sigma^{2} in (18) ), with τ11/L\tau_{1}\leq 1/L, we have

where in the last inequality we use Young’s inequality to the inner product and β1>0\beta_{1}>0 is a constant which we will determine later. Similarly,

where in the last inequality we use Young’s inequality to the inner product and β2>0\beta_{2}>0 is a constant. We are going to construct a potential function

and we will determine λ,cj\lambda,c_{j} and djd_{j} later. Combine (35), (36) and (38),

Then we bound the variance of the stochastic gradients,

Consider the second line. Using PL condition yf(xj+1,yj)22μ2[g(xj+1)f(xj+1,yj)]\|\nabla_{y}f(x_{j+1},y_{j})\|^{2}\geq 2\mu_{2}[g(x_{j+1})-f(x_{j+1},y_{j})] and assuming λdj+1(τ2+1/β2)\lambda\geq d_{j+1}(\tau_{2}+1/\beta_{2}), which we will justify later by our choices of dj+1d_{j+1} and β2\beta_{2}, we have

where in the last inequality we use (35) and (20). Now we plug this into Rj+1R_{j+1},

where we define ζ=1τ2μ2+λ2dj+1(τ22+τ2β2)μ2\zeta=1-\tau_{2}\mu_{2}+\frac{\lambda}{2}d_{j+1}\left(\tau_{2}^{2}+\frac{\tau_{2}}{\beta_{2}}\right)\mu_{2} and ψ=1ζ\psi=1-\zeta. With xf(xj,yj)22g(xj)2+2g(xj)xf(xj,yj)2\|\nabla_{x}f(x_{j},y_{j})\|^{2}\leq 2\|\nabla g(x_{j})\|^{2}+2\|\nabla g(x_{j})-\nabla_{x}f(x_{j},y_{j})\|^{2},

Then plugging in (26), (27) and (42), we get

Now we are ready to define sequences {cj}j\{c_{j}\}_{j} and {dj}j\{d_{j}\}_{j}. Let cN=dN=0c_{N}=d_{N}=0, and

The left hand side is exactly ak+1+λbk+1a^{k+1}+\lambda b^{k+1}, because (xk,yk)(x_{k},y_{k}) is sampled uniformly from {{(xt,j,yt,j)}j=0N1}t=0T1\{\{(x_{t,j},y_{t,j})\}_{j=0}^{N-1}\}_{t=0}^{T-1}.

It suffices to choose proper τ1\tau_{1}, τ2\tau_{2}, NN and TT such that NTγ>1NT\gamma>1. Driven by the proof, we choose

We will choose k1,k2,k3k_{1},k_{2},k_{3} and k4k_{4} later and we let k1,k2,k3,k41k_{1},k_{2},k_{3},k_{4}\leq 1. Plug back to cjc_{j} and djd_{j}, we have

where in the last inequality we assume k32+k3k41k_{3}^{2}+\frac{k_{3}}{k_{4}}\leq 1.

We define ej=max{cj,dj}e_{j}=\max\{c_{j},d_{j}\}. Then combining (53) and (54), we easily get

and note that ej>ej+1e_{j}>e_{j+1} so eje0,je_{j}\leq e_{0},\forall j. Then we want to lower bound γ\gamma. Rearrange (48),

where in the inequality, we use λ=1/20\lambda=1/20 and assume that 1κ2k32(k1+1k2)10\frac{1}{\kappa^{2}}k_{3}^{2}(k_{1}+\frac{1}{k_{2}})\leq 10. Rearranging (49),

where in the inequality we use λ=1/20\lambda=1/20 and assume k1k3/28k_{1}\leq k_{3}/28 and 1κ2k32(k1+1k2)1/4\frac{1}{\kappa^{2}}k_{3}^{2}\left(k_{1}+\frac{1}{k_{2}}\right)\leq 1/4. Note that 12τ1μ1=μ12κ2lk1\frac{1}{2}\tau_{1}\mu_{1}=\frac{\mu_{1}}{2\kappa^{2}l}k_{1} and l2τ12min{μ1,μ2}=l2κ2min{μ1,μ2}k1\frac{l^{2}\tau_{1}}{2\min\{\mu_{1},\mu_{2}\}}=\frac{l}{2\kappa^{2}\min\{\mu_{1},\mu_{2}\}}k_{1}. Then we have

Letting k1/k2=k3/k4k_{1}/k_{2}=k_{3}/k_{4} and k1=128k3k_{1}=\frac{1}{28}k_{3}, we have

where we use cj,dje0,jc_{j},d_{j}\leq e_{0},\forall j. By plugging in k1=k3/28k_{1}=k_{3}/28 and λ=1/20\lambda=1/20 into (55), we have

We choose T=1T=1, k3=βκ6k_{3}=\beta\kappa^{-6} and N=α(2k33/2+4k32)1α2k23/2N=\alpha(2k_{3}^{3/2}+4k_{3}^{2})^{-1}\geq\frac{\alpha}{2}k_{2}^{-3/2}, where α,β\alpha,\beta is irrelevant to n,l,μ1,μ2n,l,\mu_{1},\mu_{2}. Then since (1+2k33/2+4k32)Neα(1+2k_{3}^{3/2}+4k_{3}^{2})^{N}\leq e^{\alpha}, after plugging in NN and k3k_{3}, we have

Therefore, for choosing α\alpha small enough and β\beta small enough, we have NTγ2NT\gamma\geq 2. Now it remains to verify several assumptions we made in the proof. The first is k3k4+k321\frac{k_{3}}{k_{4}}+k_{3}^{2}\leq 1. Since k3k4+k32=k31/2+k32\frac{k_{3}}{k_{4}}+k_{3}^{2}=k_{3}^{1/2}+k_{3}^{2}, this assumption easily holds when β1/4\beta\leq 1/4. The second assumption we want to verify is 1κ2k32(k1+1k2)1/4\frac{1}{\kappa^{2}}k_{3}^{2}\left(k_{1}+\frac{1}{k_{2}}\right)\leq 1/4. Note that

So this assumption can also be easily satisfied when β\beta is small. The last assumption we need to verify is λdj+1(τ2+1β2)\lambda\geq d_{j+1}\left(\tau_{2}+\frac{1}{\beta_{2}}\right). Because dj+1e0d_{j+1}\leq e_{0} and (61),

So this assumption holds when α\alpha and β\beta are small.

We start from Part 3 of the proof of Theorem 4.1. We now choose k3=βn2/3k_{3}=\beta n^{-2/3}, N=α(2k33/2+4k32)1N=\alpha(2k_{3}^{3/2}+4k_{3}^{2})^{-1}, and T=κ3n1/3T=\kappa^{3}n^{-1/3} then

Therefore, for choosing α\alpha small enough and β\beta small enough, we have NTγ2NT\gamma\geq 2. Other assumptions can be easily verified by the same way as in the proof of Theorem 4.1. ∎

Appendix D AGDA for minimax problems under one-sided PL condition

We are here to show that if f(x,)-f(x,\cdot) satisfies PL condition with constant μ\mu and f(,y)f(\cdot,y) may be nonconvex (referred to as PL game by Nouiehed et al. ), AGDA as presented in Algorithm 3 can find ϵ\epsilon-stationary point of g(x):=maxyf(x,y)g(x):=\max_{y}f(x,y) within O(ϵ2)\mathcal{O}(\epsilon^{-2}) iterations. Note that GDmax has complexity O(ϵ2log(1/ϵ))\mathcal{O}(\epsilon^{-2}\log(1/\epsilon)) on minimax problems under the one-sided PL condition [Nouiehed et al., 2019]; SGDA has complexity O(ϵ2)\mathcal{O}(\epsilon^{-2}) on nonconvex-strongly-concave minimax problems [Lin et al., 2019]. Here we define condition number κ=μl\kappa=\frac{\mu}{l} and LL is still defined the same as before. The proof is based on our previous analysis and Lin et al. .

Suppose Assumption 1 holds and f(x,)-f(x,\cdot) satisfies PL condition with constant μ\mu for any xx. If we run Algorithm 3 with τ1=120κ2l\tau_{1}=\frac{1}{20\kappa^{2}l} and τ2=1l\tau_{2}=\frac{1}{l}, then

where a0=g(x0)ga_{0}=g(x_{0})-g^{*} and b0=g(x0)f(x0,y0)b_{0}=g(x_{0})-f(x_{0},y_{0}).

For convenience, we still define bt=g(xt)f(xt,yt)b_{t}=g(x_{t})-f(x_{t},y_{t}). Since it can be easily verified that τ11/L\tau_{1}\leq 1/L, by (18) and (26), we have

where in the second inequality we use Young’s inequality, and in third inequality we use (26). We write

We note that β=(1μ2τ2)[32τ1+lτ12]52τ1\beta=(1-\mu_{2}\tau_{2})\left[\frac{3}{2}\tau_{1}+l\tau_{1}^{2}\right]\leq\frac{5}{2}\tau_{1} because l/τ11l/\tau_{1}\leq 1 by our choice of τ1\tau_{1}. Also,

where in the last inequality we use μ2τ2=1/κ\mu_{2}\tau_{2}=1/\kappa and (1μ2τ2)τ1l2μ2=(11/κ)/(20κ)1/(20κ)(1-\mu_{2}\tau_{2})\frac{\tau_{1}l^{2}}{\mu_{2}}=(1-1/\kappa)/(20\kappa)\leq 1/(20\kappa). Plugging into (73),

where in the inequality we use 1α1/(2κ)1-\alpha\geq 1/(2\kappa) again.