A Unified Analysis of Extra-gradient and Optimistic Gradient Methods for Saddle Point Problems: Proximal Point Approach

Aryan Mokhtari, Asuman Ozdaglar, Sarath Pattathil

Introduction

In this paper, we study the following saddle point problem

Motivated by the interest in computational methods for solving the minmax problem in (1), in this paper we consider convergence rate analysis of discrete-time gradient based optimization algorithms for finding a saddle point of problem (1). We focus on Extra-gradient (EG) and Optimistic Gradient Descent Ascent (OGDA) methods, which have attracted much attention in the recent literature because of their superior empirical performance in GAN training (see Liang & Stokes (2018), Daskalakis et al. (2017)). EG is a classical method which was introduced in Korpelevich (1976). Its linear rate of convergence for smooth and strongly convex-strongly concave functions f(x,y)f({\mathbf{x}},{\mathbf{y}}) and bilinear functions, i.e., f(x,y)=xAyf({\mathbf{x}},{\mathbf{y}})={\mathbf{x}}^{\top}{\mathbf{A}}{\mathbf{y}}, was established in the variational inequality literature (see (Facchinei & Pang, 2007) and (Tseng, 1995)). The convergence properties of OGDA were recently studied in Daskalakis et al. (2017), which showed the convergence of the iterates to a neighborhood of the solution when the objective function is bilinear. The recent paper Liang & Stokes (2018) used a dynamical system approach to prove the linear convergence of the OGDA and EG methods for the special case when f(x,y)=xAyf({\mathbf{x}},{\mathbf{y}})={\mathbf{x}}^{\top}{\mathbf{A}}{\mathbf{y}} and the matrix A{\mathbf{A}} is square and full rank. It also presented a linear convergence rate of the vanilla Gradient Ascent Descent (GDA) method when the objective function f(x,y)f({\mathbf{x}},{\mathbf{y}}) is strongly convex-strongly concave. In a recent paper Gidel et al. (2018), a variant of the EG method is considered, relating it to OGDA updates, and show the linear convergence of the corresponding EG iterates in the case where f(x,y)f({\mathbf{x}},{\mathbf{y}}) is strongly convex-strongly concavef(x,y)f({\mathbf{x}},{\mathbf{y}}) is strongly convex-strongly concave when it is strongly convex with respect to x{\mathbf{x}} and strongly concave with respect to y.{\mathbf{y}}. (though without showing the convergence rate for the OGDA iterates).

The previous works use disparate approaches to analyze EG and OGDA methods, obtaining results in several different settings and making it difficult to see the connections and unifying principles between these iterative methods. In this paper, we show that the update of EG and OGDA can be interpreted as approximations of the Proximal Point (PP) method, introduced in Martinet (1970) and studied in Rockafellar (1976b). This viewpoint allows us to understand why EG and OGDA are convergent for a bilinear problem. It also enables us to generalize OGDA (in terms of parameters) and obtain new convergence rate results for these generalized algorithms for the bilinear case. Our results recover the linear convergence rate results of Tseng (1995) for EG and the linear rate results of Liang & Stokes (2018) for the bilinear case of OGDA. We obtain new linear convergence rate estimates for OGDA for the strongly convex-strongly concave case as well as linear convergence rates for the generalized OGDA method.

Related Work. The result in Tseng (1995) showed convergence of EG method to an ϵ\epsilon optimal solution with iteration complexity of O(κlog(1/ϵ))\mathcal{O}(\kappa\log(1/\epsilon)) (see Assumption 1 and Remark 1 for the definition of κ\kappa) , when the function f(x,y)f({\mathbf{x}},{\mathbf{y}}) is smooth and strongly convex-strongly concave and when ff is bilinear. A variational inequality perspective of saddle point problems was used in proving these results. More recently, Liang & Stokes (2018) analyzed EG and OGDA for the case when ff is bilinear, using a dynamical system perspective. The authors showed a complexity of O(κlog(1/ϵ))\mathcal{O}(\kappa\log(1/\epsilon)) for OGDA and a complexity of O(κ2log(1/ϵ))\mathcal{O}(\kappa^{2}\log(1/\epsilon)) for EG, without anlyzing the general strongly convex-strongly concave setting. In another recent independent work, Gidel et al. (2018) analyzed the convergence of the OGDA method using the interpretation that OGDA is a variant of EG using extrapolation from the past. In this connection, the OGDA iterates are the “midpoints” whereas (Gidel et al., 2018) provides a convergence of the original points (not the OGDA iterates) to an error of ϵ\epsilon in O(κlog(1/ϵ))\mathcal{O}(\kappa\log(1/\epsilon)). In this paper, we establish an overall complexity of O(κlog(1/ϵ))\mathcal{O}(\kappa\log(1/\epsilon)) for both OGDA and EG in bilinear and strongly convex-strongly concave settings by interpreting these methods as approximations of the proximal point method. The results of our paper are compared with existing results in Table 1. Apart from the algorithms summarized in Table 1, we also propose a generalized version of OGDA which extends the classical OGDA algorithm to a wider range of stepsize parameters and show its convergence for bilinear case.

There are several papers that study the convergence rate of algorithms for solving saddle point problems over a compact set. Nemirovski Nemirovski (2004) showed O(1/k)\mathcal{O}(1/k) convergence rate for the mirror-prox algorithm (a special case of which is the EG method) in convex-concave saddle point problems over compact sets. This result was extended to unbounded sets by Monteiro & Svaiter (2010) where a different error criterion was used. Nedić & Ozdaglar (2009) analyzed the (sub)Gradient Descent Ascent (GDA) algorithm for convex-concave saddle point problems when the (sub)gradients are bounded over the constraint set.

Several papers study the special case of Problem (1) when the objective function is of the form f(x,y)=g(x)+xAyh(y)f({\mathbf{x}},{\mathbf{y}})=g({\mathbf{x}})+{\mathbf{x}}^{\top}{\mathbf{A}}{\mathbf{y}}-h({\mathbf{y}}), i.e., the cross term is bilinear. For this case, when the functions gg and hh are strongly convex, primal-dual gradient-type methods converge linearly (Chen & Rockafellar, 1997; Bauschke et al., 2011). Further, Du & Hu (2018) showed that GDA achieves a linear convergence rate when gg is convex and hh is strongly convex. Chambolle & Pock (2011) introduced a primal-dual variant of the proximal point method that converges to a saddle point at a sublinear rate when gg and hh are convex and at a linear rate when gg and hh are strongly convex.

For the case when f(x,y)f({\mathbf{x}},{\mathbf{y}}) is strongly concave with respect to y{\mathbf{y}}, but possibly nonconvex with respect to x{\mathbf{x}}, Sanjabi et al. (2018a) provided convergence to a first-order stationary point using an algorithm that requires running multiple updates with respect to y{\mathbf{y}} at each step. Recently, Sanjabi et al. (2018b) extended this result to the setting when ff is Polyak-Lojasiewicz with respect to y{\mathbf{y}}.

There are several papers which solve stochastic version of Problem (1), i.e., the case where one does not have access to the exact gradients of the function, but in fact an unbiased estimate of it. Papers including Nemirovski et al. (2009); Juditsky et al. (2011); Chen et al. (2014) solve this problem in the case where the objective function is convex in x{\mathbf{x}} and concave in y{\mathbf{y}}. More recently Palaniappan & Bach (2016) uses a variance reduced version of the proximal gradient method and Chavdarova et al. (2019) uses a variance reduced version of the EG method to solve Problem (1) when the function is strongly convex in x{\mathbf{x}} and strongly concave in y{\mathbf{y}} and the function has a finite sum structure.

Optimistic gradient methods have also been studied in the context of convex online learning. In particular, Rakhlin & Sridharan (2013a, b) introduced the general version of the Optimistic Mirror Descent algorithm in the framework of online optimization. Prior to this work, a special case of Optimistic Mirror descent was analyzed by Chiang et al. (2012), again in the context of online learning.

Outline. The rest of the paper is organized as follows. We start the paper by presenting some definitions and preliminaries required for presenting our results in Section 2. Then, we revisit the Proximal Point (PP) point method in Section 3 and present its convergence properties for bilinear (Theorem 1) and general strongly convex-strongly concave (Theorem 2) problems . In Section 4, we show that the Optimistic Gradient Descent Ascent (OGDA) is an approximation of PP (Proposition 1) and prove its linear convergence rate for bilinear (Theorem 3) and strongly convex-strongly concave (Theorem 4) problems. We generalize the OGDA method in terms of its parameters and show the convergence of the generalized OGDA method for the bilinear case (Theorem 5). In Section 5, we recap the update of Extra-gradient (EG) method for solving a saddle point problem. Then, we show that EG can be interpreted as an approximation of PP (Proposition 2) and use this interpretation to study the convergence properties of EG in bilinear problems (Theorem 6) and general strongly convex-strongly concave problems (Theorem 7). We close the paper with concluding remarks. Due to space limitation our numerical experiments and proofs are presented in the supplementary material.

Notation. Lowercase boldface v{\mathbf{v}} denotes a vector and uppercase boldface A{\mathbf{A}} denotes a matrix. We use v\|{\mathbf{v}}\| to denote the Euclidean norm of vector v{\mathbf{v}}. Given a multi-input function f(x,y)f({\mathbf{x}},{\mathbf{y}}), its gradient with respect to x{\mathbf{x}} and y{\mathbf{y}} at (x0,y0)({\mathbf{x}}_{0},{\mathbf{y}}_{0}) are denoted by xf(x0,y0)\nabla_{\mathbf{x}}f({\mathbf{x}}_{0},{\mathbf{y}}_{0}) and yf(x0,y0)\nabla_{\mathbf{y}}f({\mathbf{x}}_{0},{\mathbf{y}}_{0}), respectively. We refer to the largest and smallest eigenvalues of a matrix A{\mathbf{A}} by λmax(A)\lambda_{\max}({\mathbf{A}}) and λmin(A)\lambda_{\min}({\mathbf{A}}), respectively.

Preliminaries

In this section we present properties and notations used in our results.

Throughout the paper, we consider two specific cases of Problem (1) stated in the next set of assumptions.

The function f(x,y)f({\mathbf{x}},{\mathbf{y}}) is continuously differentiable in x{\mathbf{x}} and y{\mathbf{y}}. Further, ff is μx\mu_{x}-strongly convex in x{\mathbf{x}} and μy\mu_{y}-strongly concave in y{\mathbf{y}}. The unique saddle point of f(x,y)f({\mathbf{x}},{\mathbf{y}}) is denoted by (x,y)({\mathbf{x}}^{*},{\mathbf{y}}^{*}). We define μ=min{μx,μy}\mu=\min\{\mu_{x},\mu_{y}\}.

The gradient xf(x,y)\nabla_{{\mathbf{x}}}f({\mathbf{x}},{\mathbf{y}}), is LxL_{x}-Lipschitz in x{\mathbf{x}} and LxyL_{xy}-Lipschitz in y{\mathbf{y}}, i.e.,

Moreover, the gradient yf(x,y)\nabla_{{\mathbf{y}}}f({\mathbf{x}},{\mathbf{y}}), is LyL_{y}-Lipschitz in y{\mathbf{y}} and LyxL_{yx}-Lipschitz in x{\mathbf{x}}, i.e.,

We define L=max{Lx,Lxy,Ly,Lyx}L=\max\{L_{x},L_{xy},L_{y},L_{yx}\}.

Under Assumptions 2 and 3, we define the condition number of the problem as κ:=L/μ\kappa:=L/\mu.

In the following sections, we present and analyze three different iterative algorithms for solving the saddle point problem introduced in (1). The kk-th iterates of any of these algorithms are denoted by (xk,yk)({\mathbf{x}}_{k},{\mathbf{y}}_{k}). We denote

as the distance to the saddle point (x,y)({\mathbf{x}}^{*},{\mathbf{y}}^{*}) at iteration kk.

Proximal Point method

We start our analysis by Proximal Point (PP) method, which will serve as a benchmark for the analysis of Extra-gradient and Optimistic Gradient Descent Ascent methods. The update of PP method for minimizing a convex function hh is defined as

where η\eta is a positive scalar (Bertsekas, 1999; Beck, 2017). Using the optimality condition of the update in (3), one can also write the update of the PP method as xk+1=xkηh(xk+1){\mathbf{x}}_{k+1}={\mathbf{x}}_{k}-\eta\nabla h({\mathbf{x}}_{k+1}). This expression shows that the PP method is an implicit algorithm. Convergence properties of PP for convex minimization have been extensively studied (Rockafellar, 1976a; Güler, 1991; Ferris, 1991; Eckstein & Bertsekas, 1992; Parikh et al., 2014; Beck, 2017). The extension of PP for solving saddle point problems has been also studied in Rockafellar (1976b). Here, we recap the update of PP for solving the min-max problem in (1). To do so, we define the iterates {xk+1,yk+1}\{{\mathbf{x}}_{k+1},{\mathbf{y}}_{k+1}\} as the unique solution to the saddle point problem

Using the optimality conditions of (4) (which are necessary and sufficient since the problem in (4) is convex), the update of the PP method for the saddle point problem in (1) can be written as

Note that implementing the system of updates in (5) requires computing the operators (I+ηxf)1({\mathbf{I}}+\eta\nabla_{\mathbf{x}}f)^{-1} and (I+ηyf)1({\mathbf{I}}+\eta\nabla_{\mathbf{y}}f)^{-1}, and, therefore, may not be computationally affordable for any general function ff.

In the following theorem, we show that the PP method converges linearly to (x,y)=(0,0)({\mathbf{x}}^{*},{\mathbf{y}}^{*})=({\mathbf{0}},{\mathbf{0}}) which is the unique solution of the problem minxmaxyxBy\min_{\mathbf{x}}\max_{\mathbf{y}}{\mathbf{x}}^{\top}{\mathbf{B}}{\mathbf{y}}. This result was established in Theorem 2 of (Rockafellar, 1976b) and we mention it here for completeness and we later use it as a benchmark.

Consider the saddle point problem in (1) under Assumption 1 and the proximal point method in (5). Further, recall the definition of rkr_{k} in (2). Then, for any η>0\eta>0, the iterates {xk,yk}k0\{{\mathbf{x}}_{k},{\mathbf{y}}_{k}\}_{k\geq 0} generated by the proximal point method satisfy

In the following theorem, we characterize the convergence rate of PP for a function f(x,y)f({\mathbf{x}},{\mathbf{y}}) that is strongly convex with respect to x{\mathbf{x}} and strongly concave with respect to y{\mathbf{y}}. Once again, this result was established in (Rockafellar, 1976b) and we mention it here for completeness and we later use it as a benchmark.

Consider the saddle point problem in (1) under Assumption 2 and the proximal point method in (5). Further, recall the definition of rkr_{k} in (2). Then, for any η>0\eta>0, the iterates {xk,yk}k0\{{\mathbf{x}}_{k},{\mathbf{y}}_{k}\}_{k\geq 0} generated by the proximal point method satisfy

Theorem 2 states that for the general saddle point problem in (1), if the function is strongly convex-strongly concave, the iterates generated by the PP method converge linearly to the optimal solution.

Optimistic Gradient Descent Ascent method

In this section, we study the Optimistic Gradient Descent Ascent (OGDA) method for solving saddle point problems. We first show that OGDA can be considered as an approximation of the proximal point method. Then, we use this interpretation to analyze its convergence properties for bilinear and strongly convex-strongly concave settings. The proximal point approximation approach also allows us to generalize the update of OGDA as we discuss in detail in Section 4.2.

The main idea behind the updates of the OGDA method is the addition of a “negative-momentum” term to the updates which can be clearly seen when we write the iterations as follows:

The last term in parenthesis for each of the updates can be interpreted as a “negative-momentum”, differentiating the OGDA method from vanilla Gradient Descent Ascent (GDA).

We analyze the OGDA method as an approximation of the Proximal Point (PP) method presented in Section 3. We first focus on the bilinear case (Assumption 1) for which the OGDA updates are

Note that the update of the PP method for the variable x{\mathbf{x}} in the considered bilinear problem is

where we used the fact that Iη2BBI-\eta^{2}{\mathbf{B}}{\mathbf{B}}^{\top} is an approximation of (I+η2BB)1(I+\eta^{2}{\mathbf{B}}{\mathbf{B}}^{\top})^{-1} with an error of o(η2)o(\eta^{2}). Regrouping the terms and using the updates of the PP method yield

where the last expression is the OGDA update for variable x{\mathbf{x}} plus an additional error of o(η2)o(\eta^{2}). A similar derivation can be done for the update of variable y{\mathbf{y}} to show that OGDA is an approximation of the PP method up to o(η2)o(\eta^{2}). In the following proposition, we show that this observation can be generalized for any general smooth (possibly nonconvex) function f(x,y)f({\mathbf{x}},{\mathbf{y}}).

Consider the saddle point problem in (1). Given a point (xk,yk)({\mathbf{x}}_{k},{\mathbf{y}}_{k}), let (x^k+1,y^k+1)(\hat{{\mathbf{x}}}_{k+1},\hat{{\mathbf{y}}}_{k+1}) be the point we obtain by performing the PP update on (xk,yk)({\mathbf{x}}_{k},{\mathbf{y}}_{k}), and let (xk+1,yk+1)({\mathbf{x}}_{k+1},{\mathbf{y}}_{k+1}) be the point we obtain by performing the OGDA update on (xk,yk)({\mathbf{x}}_{k},{\mathbf{y}}_{k}). Then, for a given stepsize η>0\eta>0 we have

To analyze the convergence of OGDA, we view it as a proximal point algorithm with an additional error term. In the following theorem, we characterize the convergence rate of the OGDA method for the bilinear saddle point problem defined in Assumption 1.

Consider the saddle point problem in (1) under Assumption 1 and the OGDA method outlined in Algorithm 1. Further, recall the definition of rkr_{k} in (2). If we set η=(1/40λmax(BB))\eta=(1/{40\sqrt{\lambda_{\max}({\mathbf{B}}^{\top}{\mathbf{B}})}}), then the iterates {xk,yk}k0\{{\mathbf{x}}_{k},{\mathbf{y}}_{k}\}_{k\geq 0} generated by the OGDA method satisfy

where r^0=max{r2,r1,r0}\hat{r}_{0}=\max\{r_{2},r_{1},r_{0}\} and cc is a positive constant independent of the problem parameters.

The result in Theorem 3 shows linear convergence of OGDA in a bilinear problem of the form f(x,y)=xByf({\mathbf{x}},{\mathbf{y}})={\mathbf{x}}^{\top}{\mathbf{B}}{\mathbf{y}} where matrix B{\mathbf{B}} is square and full rank. It further shows that the overall number of iterations to obtain an ϵ\epsilon-accurate solution is of O(κlog(1/ϵ))\mathcal{O}(\kappa\log(1/\epsilon)), where κ\kappa is the problem condition number as defined in Assumption 1. We would like to mention that this result is similar to the one shown in (Liang & Stokes, 2018), except here we analyze OGDA as an approximation of PP.

In the following theorem, we again use the proximal point approximation interpretation of OGDA to provide a convergence rate estimate for this algorithm when it is used for solving a general strongly convex-strongly concave saddle point problem.

Consider the saddle point problem in (1) under Assumptions 2 and 3 and the OGDA method outlined in Algorithm 1. Further, recall the definition of rkr_{k} in (2). If we set η=(1/(4L))\eta=(1/(4L)), then the iterates {xk,yk}k0\{{\mathbf{x}}_{k},{\mathbf{y}}_{k}\}_{k\geq 0} generated by OGDA satisfy

where r^0=c1κ2r0\hat{r}_{0}=c_{1}\kappa^{2}r_{0} and c,c1c,c_{1} are positive constant independent of the problem parameters.

The result in Theorem 4 shows that OGDA converges linearly to the optimal solution under the assumptions that ff is smooth and strongly convex-strongly concave. In other words, it shows that to achieve a point with error rkϵr_{k}\leq\epsilon, we need to run at most O(κlog(1/ϵ))\mathcal{O}(\kappa\log(1/\epsilon)) iterations of OGDA. This result can be compared with the results in (Gidel et al., 2018), a recent independent work which derived the OGDA updates as a variant of the EG updates (interpreting OGDA as extrapolation from the past). In this connection, the OGDA iterates are the “midpoints” whereas (Gidel et al., 2018) provides a convergence of the original points (not the OGDA iterates) to an error of ϵ\epsilon in O(κlog(1/ϵ))\mathcal{O}(\kappa\log(1/\epsilon)).

2 Generalized OGDA method

The update of OGDA both in theory and practice is only studied for the case that the coefficients of both xf(xk,yk)\nabla_{{\mathbf{x}}}f({\mathbf{x}}_{k},{\mathbf{y}}_{k}) and xf(xk,yk)xf(xk1,yk1)\nabla_{{\mathbf{x}}}f({\mathbf{x}}_{k},{\mathbf{y}}_{k})-\nabla_{{\mathbf{x}}}f({\mathbf{x}}_{k-1},{\mathbf{y}}_{k-1}) are η\eta. This implies that in the OGDA update at step kk, the coefficient of the current gradient, i.e., xf(xk,yk)\nabla_{{\mathbf{x}}}f({\mathbf{x}}_{k},{\mathbf{y}}_{k}), should be exactly twice the coefficient of the negative of the previous gradient, i.e., xf(xk,yk)-\nabla_{{\mathbf{x}}}f({\mathbf{x}}_{k},{\mathbf{y}}_{k}). It has been an open question to see if different stepsizes can be used for these terms. In this section, we generalize OGDA where the coefficients for the gradient descent and the negative momentum terms are not necessary equal to each other. We consider the following OGDA dynamics with general stepsize parameters α,β>0\alpha,\beta>0:

Note that for α=β\alpha=\beta, we recover the original OGDA method. Our goal is to show that OGDA is convergent even if α\alpha and β\beta are not equal to each other, as long as their difference is sufficiently small. In the following theorem, we formally state our result for the generalized OGDA method described in (7) and (8) when the objective function ff has a bilinear form of f(x,y)=xByf({\mathbf{x}},{\mathbf{y}})={\mathbf{x}}^{\top}{\mathbf{B}}{\mathbf{y}}.

Consider the saddle point problem in (1) under Assumption 1 and the generalized OGDA method in (7)-(8). Further, recall the definition of rkr_{k} in (2). If we set α=1/(40λmax(BB))\alpha={1}/({40\sqrt{\lambda_{\max}({\mathbf{B}}^{\top}{\mathbf{B}})}}) and α\alpha and β\beta satisfy the conditions 0<αKα2βα0<\alpha-K\alpha^{2}\leq\beta\leq\alpha for some constant K>0K>0, then the iterates {xk,yk}k0\{{\mathbf{x}}_{k},{\mathbf{y}}_{k}\}_{k\geq 0} generated by the generalized OGDA method satisfy

where r^0=max{r2,r1,r0}\hat{r}_{0}=\max\{r_{2},r_{1},r_{0}\} and cc is a positive constant independent of the problem parameters.

Theorem 5 shows that it is not necessary to use a factor of 22 in the OGDA update to have a linearly convergent method and for a wide range of parameters this result holds. A result similar to Theorem 5 can be established when β>α\beta>\alpha. We do not state the results here due to space limitations.

Extra-gradient method

In this section, we study the Extra-gradient (EG) method for solving the general saddle point problem in (1) and provide linear rates of convergence for the bilinear and the strongly convex-strongly concave case by interpreting this algorithm as an approximation of the proximal point method.

The main idea of the EG method is to use the gradient at the current point to find a mid-point, and then use the gradient at that mid-point to find the next iterate. To be more precise, given a stepsize η>0\eta>0, the update of EG at step kk for solving the saddle point problem in (1) has two steps. First, we find mid-point iterates xk+1/2{\mathbf{x}}_{k+1/2} and yk+1/2{\mathbf{y}}_{k+1/2} by performing a primal-dual gradient update as

Then, the gradients evaluated at the midpoints xk+1/2{\mathbf{x}}_{k+1/2} and yk+1/2{\mathbf{y}}_{k+1/2} are used to compute the new iterates xk+1{\mathbf{x}}_{k+1} and yk+1{\mathbf{y}}_{k+1} by performing the updates

The steps of the EG method for solving saddle point problems are outlined in Algorithm 2.

Note that in the update of the EG method, as the name suggests, requires evaluation of extra gradients at the midpoints xk+1/2{\mathbf{x}}_{k+1/2} and yk+1/2{\mathbf{y}}_{k+1/2} which doubles the computational complexity of EG compared to the vanilla Gradient Descent Ascent (GDA) method. We show next EG approximates the Proximal Point (PP) method more accurately, as compared to the GDA method. Consider the bilinear saddle point problem defined in Assumption 1. By following the update of PP in Section 3 and simplifying the expressions, the PP update for the bilinear problem under Assumption 1 can be written as

As the computation of the inverse (I+η2BB)1({\mathbf{I}}+\eta^{2}{\mathbf{B}}^{\top}{\mathbf{B}})^{-1} could be costly, one can use I{\mathbf{I}} instead with an error of o(η)o(\eta). This approximation retrieves the update of GDA which is known to possibly diverge for bilinear saddle point problems (see Daskalakis et al. (2017)). If we use the more accurate approximation (I+η2BB)1(Iη2BB)(I+\eta^{2}{\mathbf{B}}{\mathbf{B}}^{\top})^{-1}\approx({\mathbf{I}}-\eta^{2}{\mathbf{B}}{\mathbf{B}}^{\top}) which has an error of o(η2)o(\eta^{2}), we obtain

If we ignore the extra terms in (9)-(10) which are of o(η2)o(\eta^{2}), we recover the update of the EG method for the bilinear saddle point problem defined in Assumption 1. Therefore, in the bilinear problem, the EG method can be interpreted as an approximation of the PP method with an error of o(η2)o(\eta^{2}). In the following proposition, we extend this result and show that for any general smooth (possibly nonconvex) function f(x,y)f({\mathbf{x}},{\mathbf{y}}), EG is an o(η2)o(\eta^{2}) approximation of PP.

Consider the saddle point problem in (1). Given a point (xk,yk)({\mathbf{x}}_{k},{\mathbf{y}}_{k}), let (x^k+1,y^k+1)(\hat{{\mathbf{x}}}_{k+1},\hat{{\mathbf{y}}}_{k+1}) be the point we obtain by performing the PP update on (xk,yk)({\mathbf{x}}_{k},{\mathbf{y}}_{k}), and let (xk+1,yk+1)({\mathbf{x}}_{k+1},{\mathbf{y}}_{k+1}) be the point we obtain by performing the EG update on (xk,yk)({\mathbf{x}}_{k},{\mathbf{y}}_{k}). Then, for a given stepsize η>0\eta>0 we have

The next theorem views the EG method as the PP method with an error and properly bounds the error to provide convergence rate estimates for the EG method in the bilinear case.

Consider the saddle point problem in (1) under Assumption 1 and the extra-gradient (EG) method outlined in Algorithm 2. Further, recall the definition of rkr_{k} in (2). If we set η=1/(22λmax(BB))\eta={1}/({2\sqrt{2\lambda_{\max}({\mathbf{B}}^{\top}{\mathbf{B}})}}), then the iterates {xk,yk}k0\{{\mathbf{x}}_{k},{\mathbf{y}}_{k}\}_{k\geq 0} generated by the EG method satisfy

where cc is a positive constant independent of the problem parameters.

The result in Theorem 6 shows linear convergence of the iterates generated by the EG method for a bilinear problem of the form f(x,y)=xByf({\mathbf{x}},{\mathbf{y}})={\mathbf{x}}^{\top}{\mathbf{B}}{\mathbf{y}} where the matrix B{\mathbf{B}} is square and full rank. In other words, we obtain that the overall number of iterations to reach a point satisfying xk2+yk2ϵ\|{\mathbf{x}}_{k}\|^{2}+\|{\mathbf{y}}_{k}\|^{2}\leq\epsilon is at most O(κlog(1/ϵ))\mathcal{O}(\kappa\log(1/\epsilon)) which matches the rate achieved in Tseng (1995) for bilinear problems. It is worth mentioning that we obtain this optimal complexity of O(κlog(1/ϵ))\mathcal{O}(\kappa\log(1/\epsilon)) for EG in bilinear problems by analyzing this algorithm as an approximation of the PP method which differs from the approach used in (Tseng, 1995) that directly analyzes EG using a variational inequality approach. We further would like to add that this result improves the O(κ2log(1/ϵ))\mathcal{O}(\kappa^{2}\log(1/\epsilon)) of (Liang & Stokes, 2018) for EG in bilinear problems.

The following theorem characterizes the convergence rate of the EG method when f(x,y)f({\mathbf{x}},{\mathbf{y}}) is strongly convex-strongly concave.

Consider the saddle point problem in (1) under Assumptions 2 and 3 and the extra-gradient (EG) method outlined in Algorithm 2. Further, recall the definition of the condition number κ\kappa in Remark 1 and the definition of rkr_{k} in (2). If we set η=1/(4L)\eta={1}/({4L}), then the iterates {xk,yk}k0\{{\mathbf{x}}_{k},{\mathbf{y}}_{k}\}_{k\geq 0} generated by the EG method satisfy

where cc is a positive constant independent of the problem parameters.

The result in Theorem 7 characterizes a linear convergence rate for the EG algorithm in a general smooth and strongly convex-strongly concave case. Similar to the bilinear case, our proof relies on interpreting EG as an approximation of the PP method. Theorem 7 further shows that the computational complexity of EG to achieve an ϵ\epsilon-suboptimal solution, i.e., xk+1x2+yk+1y2ϵ\|{\mathbf{x}}_{k+1}-{\mathbf{x}}^{*}\|^{2}+\|{\mathbf{y}}_{k+1}-{\mathbf{y}}^{*}\|^{2}\leq\epsilon, is O(κlog(1/ϵ))\mathcal{O}(\kappa\log(1/\epsilon)), where κ=L/μ\kappa=L/\mu is the condition number of the number. Note that this complexity bound can also be obtained from the results in (Tseng, 1995).

Numerical Experiments

In this section, we compare the performance of the Proximal Point (PP) method with the Extra–Gradient (EG), Gradient Descent Ascent (GDA), and Optimistic Gradient Descent Ascent (OGDA) methods.

We first focus on the following bilinear problem

We proceed to study the performance of PP, EG, GDA, and OGDA for solving the following strongly convex-strongly concave saddle point problem

This is the saddle point reformulation of the linear regression

with an L2L_{2} regularization, as shown in Du & Hu (2018). As done in Du & Hu (2018), we generate the rows of the matrix A{\mathbf{A}} according to a Gaussian distribution N(0,Id)\mathcal{N}(0,I_{d}). Here, we set d=50d=50 and n=10n=10, and assume b=0{\mathbf{b}}={\mathbf{0}}. We also set the regularizer to λ=1/n\lambda=1/n. Figure 3 illustrates the distance to the optimal solution of the considered algorithms versus the number of iterations. As we can see, EG and OGDA perform better than GDA and their convergence paths are closer to the one for PP which has the fastest rate. This observation matches our theoretical claim that EG and OGDA are more accurate approximations of PP relative to GDA.

Conclusions

We consider discrete time gradient based methods for solving convex-concave saddle point problems, with a focus on the Extra-gradient (EG) and the Optimistic Gradient Descent Ascent (OGDA) methods. We show that EG and OGDA can be seen as approximations of the classical Proximal Point (PP) method. We provide linear rate estimates for the bilinear and strongly convex-strongly concave saddle point problems for EG and OGDA as well as their generalizations.

References

Supplementary Material

First, note that the update of the Proximal Point (PP) method for the bilinear problem (Assumption 1) can be written as

We can simplify the above iterations and write them as an explicit algorithm as follows:

Let us define the symmetric matrices Qx=(I+η2BB)1{\mathbf{Q}}_{x}=({\mathbf{I}}+\eta^{2}{\mathbf{B}}{\mathbf{B}}^{\top})^{-1} and Qy=(I+η2BB)1{\mathbf{Q}}_{y}=({\mathbf{I}}+\eta^{2}{\mathbf{B}}^{\top}{\mathbf{B}})^{-1}. Based on these definitions, and the expressions in (18) and (19) we can show that the sum xk+12+yk+12\|{\mathbf{x}}_{k+1}\|^{2}+\|{\mathbf{y}}_{k+1}\|^{2} can be written as

To simplify the expression in (8.1) we first prove the following lemma which is also useful in the rest of proofs.

Let B=UΛV{\mathbf{B}}={\mathbf{U}}\boldsymbol{\Lambda}{\mathbf{V}}^{\top} be the singular value decomposition of B{\mathbf{B}}. Here U{\mathbf{U}} and V{\mathbf{V}} are orthonormal matrices and Λ\boldsymbol{\Lambda} is a diagonal matrix with the eigenvalues of B{\mathbf{B}} as the diagonal entries. Then, we have:

Here we used the property that UU=VV=I{\mathbf{U}}^{\top}{\mathbf{U}}={\mathbf{V}}^{\top}{\mathbf{V}}={\mathbf{I}}. Now, we simplify the other side to get:

Now, since U(η2Λ2+I)1Λ2V=UΛ(η2Λ+I)1V{\mathbf{U}}(\eta^{2}\boldsymbol{\Lambda}^{2}+{\mathbf{I}})^{-1}\boldsymbol{\Lambda}^{2}{\mathbf{V}}^{\top}={\mathbf{U}}\boldsymbol{\Lambda}(\eta^{2}\boldsymbol{\Lambda}+{\mathbf{I}})^{-1}{\mathbf{V}}^{\top}, the claim in (21) follows. Using a similar argument we can also prove the equality in (22). ∎

Using the result in Lemma 1 we can show that

where the second equality holds as ab=ba{\mathbf{a}}^{\top}{\mathbf{b}}={\mathbf{b}}^{\top}{\mathbf{a}}. Hence, the expression in (26) can be simplified as

We simplify equation (26) as follows. Consider the term involving xk{\mathbf{x}}_{k}. We have

Now we use Lemma 1 to simplify (8.1) as follows

where the last equality follows by replacing Qx{\mathbf{Q}}_{x} by its definition. The same simplification follows for the terms invovling yk{\mathbf{y}}_{k} which leads to the expression

Substitute Qxxk2+η2QyBxk2\|{\mathbf{Q}}_{x}{\mathbf{x}}_{k}\|^{2}+\eta^{2}\|{\mathbf{Q}}_{y}{\mathbf{B}}^{\top}{\mathbf{x}}_{k}\|^{2} and Qyyk2+η2QxByk2\|{\mathbf{Q}}_{y}{\mathbf{y}}_{k}\|^{2}+\eta^{2}\|{\mathbf{Q}}_{x}{\mathbf{B}}{\mathbf{y}}_{k}\|^{2} in (26) with the expressions in (8.1) and (29), respectively, to obtain

Now, using the expression in (30) and the fact that λmin(BTB)=λmin(BBT)\lambda_{\min}({\mathbf{B}}^{T}{\mathbf{B}})=\lambda_{\min}({\mathbf{B}}{\mathbf{B}}^{T}) we can write

2 Proof of Theorem 2

The update of PP method can be written as

where we used the fact that ϕyk+1(xk+1)=0\nabla\phi_{{\mathbf{y}}_{k+1}}({\mathbf{x}}_{k+1})={\mathbf{0}}. Replace ϕyk+1(x)\phi_{{\mathbf{y}}_{k+1}}({\mathbf{x}}) and ϕyk+1(xk+1)\phi_{{\mathbf{y}}_{k+1}}({\mathbf{x}}_{k+1}) with their definition in (33) and further set x=x{\mathbf{x}}={\mathbf{x}}^{*} to obtain

since ϕxk+1(yk+1)=0\nabla\phi_{{\mathbf{x}}_{k+1}}({\mathbf{y}}_{k+1})={\mathbf{0}}. Replace ϕxk+1(y)\phi_{{\mathbf{x}}_{k+1}}({\mathbf{y}}) and ϕxk+1(yk+1)\phi_{{\mathbf{x}}_{k+1}}({\mathbf{y}}_{k+1}) with their definitions and further set y=y{\mathbf{y}}={\mathbf{y}}^{*} to obtain

In particular, by setting (x,y)=(xk+1,yk+1)({\mathbf{x}},{\mathbf{y}})=({\mathbf{x}}_{k+1},{\mathbf{y}}_{k+1}) we obtain that

Now, considering (35), by adding and subtracting f(x,y)f({\mathbf{x}}^{*},{\mathbf{y}}^{*}) we can write

By using the inequality in (40) we can write

Similarly, considering (38), we can write

Add equations (43) and (45), and use the definition μ=min{μx,μy}\mu=\min\{\mu_{x},\mu_{y}\} to obtain

Regrouping the terms and using the definition rk=xkx2+yky2r_{k}=\|{\mathbf{x}}_{k}-{\mathbf{x}}^{*}\|^{2}+\|{\mathbf{y}}_{k}-{\mathbf{y}}^{*}\|^{2} leads to

3 Proof of Proposition 1

We start from the Proximal Point (PP) dynamics and show that an O(η2)\mathcal{O}(\eta^{2}) approximation of this dynamics leads to OGDA. The PP updates are as follows

By writing the Taylor’s expansion of xf\nabla_{{\mathbf{x}}}f, we obtain

On adding and subtracting the term ηxf(xk,yk)\eta\nabla_{{\mathbf{x}}}f({\mathbf{x}}_{k},{\mathbf{y}}_{k}), we get

Note that from the Taylors expansion of xxf,xf\nabla_{{\mathbf{x}}{\mathbf{x}}}f,\nabla_{{\mathbf{x}}}f and the PP updates, we have

Again, from the Taylor’s expansion of xyf,yf\nabla_{{\mathbf{x}}{\mathbf{y}}}f,\nabla_{{\mathbf{y}}}f and the PP updates, we have

Making the approximations of Equations (52) and (54) in Equation (50) yields

Making this substitution back in Equation (55), we get

which is equivalent to the OGDA update plus an additional error term of order o(η2)o(\eta^{2}). The same analysis can be done for the dual updates as well to obtain

This shows that the OGDA updates and the PP updates differ by o(η2)o(\eta^{2}).

4 Proof of Theorem 3

We define the following symmetric matrices

We rewrite the properties of Ex{\mathbf{E}}_{x} and Ey{\mathbf{E}}_{y} which are

Recall that the update of OGDA for the bilinear problem can be written as

The update for the variable x{\mathbf{x}} can be written as an approximate variant of the PP update as follows

Therefore, the error between the OGDA and Proximal updates for the variable x{\mathbf{x}} is given by

We first derive an upper bound for the term in the first parentheses (ηByk+ηByk1+η2BBxkη3BBByk)(-\eta{\mathbf{B}}{\mathbf{y}}_{k}+\eta{\mathbf{B}}{\mathbf{y}}_{k-1}+\eta^{2}{\mathbf{B}}{\mathbf{B}}^{\top}{\mathbf{x}}_{k}-\eta^{3}{\mathbf{B}}{\mathbf{B}}^{\top}{\mathbf{B}}{\mathbf{y}}_{k}).

Once again, using the OGDA updates for (xkxk1)({\mathbf{x}}_{k}-{\mathbf{x}}_{k-1}) and (xk1xk2)({\mathbf{x}}_{k-1}-{\mathbf{x}}_{k-2}), we have

Therefore, considering the expressions in (62) and (8.4) the error between the updates of OGDA and PP for the variable x{\mathbf{x}} can be written as

We apply the same argument for the update of the variable y{\mathbf{y}}. Combining these results we obtain that the update of OGDA can be written as

As in the proof of Theorem 1, we define Qx=(I+η2BB)1{\mathbf{Q}}_{x}=({\mathbf{I}}+\eta^{2}{\mathbf{B}}{\mathbf{B}}^{\top})^{-1} and Qy=(I+η2BB)1{\mathbf{Q}}_{y}=({\mathbf{I}}+\eta^{2}{\mathbf{B}}^{\top}{\mathbf{B}})^{-1}. Then, we can show that

Define rk=xk2+yk2r_{k}=\|{\mathbf{x}}_{k}\|^{2}+\|{\mathbf{y}}_{k}\|^{2}. We have:

And for η=140λmax(BB)\eta=\frac{1}{40\sqrt{\lambda_{\max}({\mathbf{B}}^{\top}{\mathbf{B}})}}.

5 Proof of Theorem 4

We define z=[x;y]{\mathbf{z}}=[{\mathbf{x}};{\mathbf{y}}] and F(z)=[xf(x,y);yf(x,y)]F({\mathbf{z}})=[\nabla_{{\mathbf{x}}}f({\mathbf{x}},{\mathbf{y}});-\nabla_{{\mathbf{y}}}f({\mathbf{x}},{\mathbf{y}})]. Then the OGDA updates can be compactly written as:

We write the update in terms of the Proximal Point method with an error εk=η(F(zk+1)2F(zk)+F(zk1))\boldsymbol{\varepsilon}_{k}=\eta(F({\mathbf{z}}_{k+1})-2F({\mathbf{z}}_{k})+F({\mathbf{z}}_{k-1})) as follows:

On rearranging Equation (72) and using the fact that F(z)=0F({\mathbf{z}}^{*})=0, where z=[x;y]{\mathbf{z}}^{*}=[{\mathbf{x}}^{*};{\mathbf{y}}^{*}], we get:

On squaring Equation (74) and using Young’s inequality, we get:

Now, using strong convexity, and substituting η=1/4L\eta=1/4L, we get:

The following part of the proof is inspired by the result of Theorem 11 of Gidel et al. (2018). For OGDA iterates, we have:

On subtracting z{\mathbf{z}}^{*} from both sides and squaring, we have:

Substituting Equations (79) and (80) in Equation (78), we have:

Substituting Equation (84) in Equation (82), we get:

For η1/4L\eta\leq 1/4L, we have: 12ημ4η2L2>01-2\eta\mu-4\eta^{2}L^{2}>0 and therefore, can ignore the last term, which gives us (for η1/4L\eta\leq 1/4L)

since η1/4L\eta\leq 1/4L, we have (1ημ)4η2L2(1-\eta\mu)\geq 4\eta^{2}L^{2}, which gives:

Substituting Equation (89) back in Equation (76), we get:

On defining r^0=1024κ2z0z2\hat{r}_{0}=1024\kappa^{2}\|{\mathbf{z}}_{0}-{\mathbf{z}}^{*}\|^{2}, we have:

6 Proof of Theorem 5

The generalized OGDA method for bilinear problems is given by:

We compare this with the Proximal Point (PP) method with stepsize α\alpha. The proof follows long the exact same lines as the proof of Theorem 3.

We define the following symmetric matrices

We rewrite the properties of Ex{\mathbf{E}}_{x} and Ey{\mathbf{E}}_{y} which are

Therefore, the error between the OGDA and Proximal updates for the variable x{\mathbf{x}} is given by

We first derive an upper bound for the term in the first parentheses (βByk+βByk1+α2BBxkα3BBByk)(-\beta{\mathbf{B}}{\mathbf{y}}_{k}+\beta{\mathbf{B}}{\mathbf{y}}_{k-1}+\alpha^{2}{\mathbf{B}}{\mathbf{B}}^{\top}{\mathbf{x}}_{k}-\alpha^{3}{\mathbf{B}}{\mathbf{B}}^{\top}{\mathbf{B}}{\mathbf{y}}_{k}).

Using the generalized OGDA update, we have:

Once again, using the generalized OGDA updates for (xkxk1)({\mathbf{x}}_{k}-{\mathbf{x}}_{k-1}) and (xk1xk2)({\mathbf{x}}_{k-1}-{\mathbf{x}}_{k-2}), we have

Therefore, considering the expressions in (95) and (8.6) the error between the updates of OGDA and PP for the variable x{\mathbf{x}} can be written as

Now, the convergence proof follows along the same lines as the proof of Theorem 3. We set η=max{α,β}\eta=\max\{\alpha,\beta\}, and we need the additional assumption:

due to the presence of the term α(αβ)BBxk\alpha(\alpha-\beta){\mathbf{B}}{\mathbf{B}}^{\top}{\mathbf{x}}_{k}, Let

On making these substitutions, we get the same result as Theorem 3.

7 Proof of Proposition 2

The Extragradient updates can be written as

By writing the Taylor’s expansion of xf\nabla_{{\mathbf{x}}}f we obtain that

By following the same argument for y{\mathbf{y}} we obtain

Now we find a second order approximation for the Proximal Point Method. Note that the update of the proximal point method for variable x{\mathbf{x}} can be written as

where in the second equality we replaced xk+1{\mathbf{x}}_{k+1} and yk+1{\mathbf{y}}_{k+1} in the gradient with their updates. Hence, using Taylor’s series we can show that

where in the second equality we used the fact that xf(xk+1,yk+1)=xf(xk,yk)+O(η)\nabla_{{\mathbf{x}}}f({\mathbf{x}}_{k+1},{\mathbf{y}}_{k+1})=\nabla_{{\mathbf{x}}}f({\mathbf{x}}_{k},{\mathbf{y}}_{k})+\mathcal{O}(\eta) and yf(xk+1,yk+1)=yf(xk,yk)+O(η)\nabla_{{\mathbf{y}}}f({\mathbf{x}}_{k+1},{\mathbf{y}}_{k+1})=\nabla_{{\mathbf{y}}}f({\mathbf{x}}_{k},{\mathbf{y}}_{k})+\mathcal{O}(\eta). Similarly, we find the approximation of the update of y{\mathbf{y}} which leads to

Comparing the expressions in (8.7) and (103) with the ones in (8.7) and (8.7) implies that the difference between the updates of PP and EG is at most o(η2)o(\eta^{2}) and this completes the proof.

8 Proof of Theorem 6

Define the following symmetric error matrices

which are useful to characterize the difference between the updates of EG and PP for a bilinear problem. Note that we can bound the norms of Ex{\mathbf{E}}_{x} and Ey{\mathbf{E}}_{y} as

Since λmax(BB)=λmax(BB)\lambda_{\max}({\mathbf{B}}^{\top}{\mathbf{B}})=\lambda_{\max}({\mathbf{B}}{\mathbf{B}}^{\top}), we have:

Also, from Lemma 1 in the proof of Theorem 1, and the definitions of the error matrices in (107) it can be verified that

Moreover, using the definitions of Ex{\mathbf{E}}_{x} and Ey{\mathbf{E}}_{y} in (107), the EG updates can be written as

As in the proof of Theorem 1, we define Qx=(I+η2BB)1{\mathbf{Q}}_{x}=({\mathbf{I}}+\eta^{2}{\mathbf{B}}{\mathbf{B}}^{\top})^{-1} and Qy=(I+η2BB)1{\mathbf{Q}}_{y}=({\mathbf{I}}+\eta^{2}{\mathbf{B}}^{\top}{\mathbf{B}})^{-1}. Using these definitions we can show that

Now before adding the two sides of the expressions in (115) and (116), note that some of the cross terms in (115) and (116) cancel out. For instance, using Lemma 1 and Equations (111) and (112) we can show that

By using similar arguments it can be shown that summing two sides of the expressions in (115) and (116) leads to

where in the second equality we used the simplifications

Define rk=xk2+yk2r_{k}=\|{\mathbf{x}}_{k}\|^{2}+\|{\mathbf{y}}_{k}\|^{2}. We have:

Choosing η=122λmax(BB)\eta=\frac{1}{2\sqrt{2\lambda_{\max}({\mathbf{B}}^{\top}{\mathbf{B}})}}, we have:

9 Proof of Theorem 7

Define z=[x;y]{\mathbf{z}}=[{\mathbf{x}};{\mathbf{y}}] and F(z)=[xf(x,y);yf(x,y)]F({\mathbf{z}})=[\nabla_{{\mathbf{x}}}f({\mathbf{x}},{\mathbf{y}});-\nabla_{{\mathbf{y}}}f({\mathbf{x}},{\mathbf{y}})]. Then the EG updates can be compactly written as:

We write the update in terms of the Proximal Point method with an error εk=η(F(zk+1)F(zk+1/2))\boldsymbol{\varepsilon}_{k}=\eta(F({\mathbf{z}}_{k+1})-F({\mathbf{z}}_{k+1/2})) as follows:

On squaring and simplifying this expression, we have:

where z=[x;y]{\mathbf{z}}^{*}=[{\mathbf{x}}^{*};{\mathbf{y}}^{*}] (Note that rk=zkz2r_{k}=\|{\mathbf{z}}_{k}-{\mathbf{z}}^{*}\|^{2}). We simplify the right hand side of Equation (125) as follows-

The following part of the proof is inspired by the result of Theorem 11 of Gidel et al. (2018). We simplify the inner products and give a lower bound using strong convexity as follows:

since F(z)=0F({\mathbf{z}}^{*})=0. Now, using Young’s inequality, we have:

Substituting the above inequality in Equation (126), we have:

Since zk+1/2z22zkz2+2zk+1/2zk2\|{\mathbf{z}}_{k+1/2}-{\mathbf{z}}^{*}\|^{2}\leq 2\|{\mathbf{z}}_{k}-{\mathbf{z}}^{*}\|^{2}+2\|{\mathbf{z}}_{k+1/2}-{\mathbf{z}}_{k}\|^{2}, we have:

For η=1/4L\eta=1/4L, we have η2L2+2ημ1<1\eta^{2}L^{2}+2\eta\mu-1<1 (since μL\mu\leq L), which gives: