A Unified Analysis of Extra-gradient and Optimistic Gradient Methods for Saddle Point Problems: Proximal Point Approach
Aryan Mokhtari, Asuman Ozdaglar, Sarath Pattathil
Introduction
In this paper, we study the following saddle point problem
Motivated by the interest in computational methods for solving the minmax problem in (1), in this paper we consider convergence rate analysis of discrete-time gradient based optimization algorithms for finding a saddle point of problem (1). We focus on Extra-gradient (EG) and Optimistic Gradient Descent Ascent (OGDA) methods, which have attracted much attention in the recent literature because of their superior empirical performance in GAN training (see Liang & Stokes (2018), Daskalakis et al. (2017)). EG is a classical method which was introduced in Korpelevich (1976). Its linear rate of convergence for smooth and strongly convex-strongly concave functions and bilinear functions, i.e., , was established in the variational inequality literature (see (Facchinei & Pang, 2007) and (Tseng, 1995)). The convergence properties of OGDA were recently studied in Daskalakis et al. (2017), which showed the convergence of the iterates to a neighborhood of the solution when the objective function is bilinear. The recent paper Liang & Stokes (2018) used a dynamical system approach to prove the linear convergence of the OGDA and EG methods for the special case when and the matrix is square and full rank. It also presented a linear convergence rate of the vanilla Gradient Ascent Descent (GDA) method when the objective function is strongly convex-strongly concave. In a recent paper Gidel et al. (2018), a variant of the EG method is considered, relating it to OGDA updates, and show the linear convergence of the corresponding EG iterates in the case where is strongly convex-strongly concave is strongly convex-strongly concave when it is strongly convex with respect to and strongly concave with respect to (though without showing the convergence rate for the OGDA iterates).
The previous works use disparate approaches to analyze EG and OGDA methods, obtaining results in several different settings and making it difficult to see the connections and unifying principles between these iterative methods. In this paper, we show that the update of EG and OGDA can be interpreted as approximations of the Proximal Point (PP) method, introduced in Martinet (1970) and studied in Rockafellar (1976b). This viewpoint allows us to understand why EG and OGDA are convergent for a bilinear problem. It also enables us to generalize OGDA (in terms of parameters) and obtain new convergence rate results for these generalized algorithms for the bilinear case. Our results recover the linear convergence rate results of Tseng (1995) for EG and the linear rate results of Liang & Stokes (2018) for the bilinear case of OGDA. We obtain new linear convergence rate estimates for OGDA for the strongly convex-strongly concave case as well as linear convergence rates for the generalized OGDA method.
Related Work. The result in Tseng (1995) showed convergence of EG method to an optimal solution with iteration complexity of (see Assumption 1 and Remark 1 for the definition of ) , when the function is smooth and strongly convex-strongly concave and when is bilinear. A variational inequality perspective of saddle point problems was used in proving these results. More recently, Liang & Stokes (2018) analyzed EG and OGDA for the case when is bilinear, using a dynamical system perspective. The authors showed a complexity of for OGDA and a complexity of for EG, without anlyzing the general strongly convex-strongly concave setting. In another recent independent work, Gidel et al. (2018) analyzed the convergence of the OGDA method using the interpretation that OGDA is a variant of EG using extrapolation from the past. In this connection, the OGDA iterates are the “midpoints” whereas (Gidel et al., 2018) provides a convergence of the original points (not the OGDA iterates) to an error of in . In this paper, we establish an overall complexity of for both OGDA and EG in bilinear and strongly convex-strongly concave settings by interpreting these methods as approximations of the proximal point method. The results of our paper are compared with existing results in Table 1. Apart from the algorithms summarized in Table 1, we also propose a generalized version of OGDA which extends the classical OGDA algorithm to a wider range of stepsize parameters and show its convergence for bilinear case.
There are several papers that study the convergence rate of algorithms for solving saddle point problems over a compact set. Nemirovski Nemirovski (2004) showed convergence rate for the mirror-prox algorithm (a special case of which is the EG method) in convex-concave saddle point problems over compact sets. This result was extended to unbounded sets by Monteiro & Svaiter (2010) where a different error criterion was used. Nedić & Ozdaglar (2009) analyzed the (sub)Gradient Descent Ascent (GDA) algorithm for convex-concave saddle point problems when the (sub)gradients are bounded over the constraint set.
Several papers study the special case of Problem (1) when the objective function is of the form , i.e., the cross term is bilinear. For this case, when the functions and are strongly convex, primal-dual gradient-type methods converge linearly (Chen & Rockafellar, 1997; Bauschke et al., 2011). Further, Du & Hu (2018) showed that GDA achieves a linear convergence rate when is convex and is strongly convex. Chambolle & Pock (2011) introduced a primal-dual variant of the proximal point method that converges to a saddle point at a sublinear rate when and are convex and at a linear rate when and are strongly convex.
For the case when is strongly concave with respect to , but possibly nonconvex with respect to , Sanjabi et al. (2018a) provided convergence to a first-order stationary point using an algorithm that requires running multiple updates with respect to at each step. Recently, Sanjabi et al. (2018b) extended this result to the setting when is Polyak-Lojasiewicz with respect to .
There are several papers which solve stochastic version of Problem (1), i.e., the case where one does not have access to the exact gradients of the function, but in fact an unbiased estimate of it. Papers including Nemirovski et al. (2009); Juditsky et al. (2011); Chen et al. (2014) solve this problem in the case where the objective function is convex in and concave in . More recently Palaniappan & Bach (2016) uses a variance reduced version of the proximal gradient method and Chavdarova et al. (2019) uses a variance reduced version of the EG method to solve Problem (1) when the function is strongly convex in and strongly concave in and the function has a finite sum structure.
Optimistic gradient methods have also been studied in the context of convex online learning. In particular, Rakhlin & Sridharan (2013a, b) introduced the general version of the Optimistic Mirror Descent algorithm in the framework of online optimization. Prior to this work, a special case of Optimistic Mirror descent was analyzed by Chiang et al. (2012), again in the context of online learning.
Outline. The rest of the paper is organized as follows. We start the paper by presenting some definitions and preliminaries required for presenting our results in Section 2. Then, we revisit the Proximal Point (PP) point method in Section 3 and present its convergence properties for bilinear (Theorem 1) and general strongly convex-strongly concave (Theorem 2) problems . In Section 4, we show that the Optimistic Gradient Descent Ascent (OGDA) is an approximation of PP (Proposition 1) and prove its linear convergence rate for bilinear (Theorem 3) and strongly convex-strongly concave (Theorem 4) problems. We generalize the OGDA method in terms of its parameters and show the convergence of the generalized OGDA method for the bilinear case (Theorem 5). In Section 5, we recap the update of Extra-gradient (EG) method for solving a saddle point problem. Then, we show that EG can be interpreted as an approximation of PP (Proposition 2) and use this interpretation to study the convergence properties of EG in bilinear problems (Theorem 6) and general strongly convex-strongly concave problems (Theorem 7). We close the paper with concluding remarks. Due to space limitation our numerical experiments and proofs are presented in the supplementary material.
Notation. Lowercase boldface denotes a vector and uppercase boldface denotes a matrix. We use to denote the Euclidean norm of vector . Given a multi-input function , its gradient with respect to and at are denoted by and , respectively. We refer to the largest and smallest eigenvalues of a matrix by and , respectively.
Preliminaries
In this section we present properties and notations used in our results.
Throughout the paper, we consider two specific cases of Problem (1) stated in the next set of assumptions.
The function is continuously differentiable in and . Further, is -strongly convex in and -strongly concave in . The unique saddle point of is denoted by . We define .
The gradient , is -Lipschitz in and -Lipschitz in , i.e.,
Moreover, the gradient , is -Lipschitz in and -Lipschitz in , i.e.,
We define .
Under Assumptions 2 and 3, we define the condition number of the problem as .
In the following sections, we present and analyze three different iterative algorithms for solving the saddle point problem introduced in (1). The -th iterates of any of these algorithms are denoted by . We denote
as the distance to the saddle point at iteration .
Proximal Point method
We start our analysis by Proximal Point (PP) method, which will serve as a benchmark for the analysis of Extra-gradient and Optimistic Gradient Descent Ascent methods. The update of PP method for minimizing a convex function is defined as
where is a positive scalar (Bertsekas, 1999; Beck, 2017). Using the optimality condition of the update in (3), one can also write the update of the PP method as . This expression shows that the PP method is an implicit algorithm. Convergence properties of PP for convex minimization have been extensively studied (Rockafellar, 1976a; Güler, 1991; Ferris, 1991; Eckstein & Bertsekas, 1992; Parikh et al., 2014; Beck, 2017). The extension of PP for solving saddle point problems has been also studied in Rockafellar (1976b). Here, we recap the update of PP for solving the min-max problem in (1). To do so, we define the iterates as the unique solution to the saddle point problem
Using the optimality conditions of (4) (which are necessary and sufficient since the problem in (4) is convex), the update of the PP method for the saddle point problem in (1) can be written as
Note that implementing the system of updates in (5) requires computing the operators and , and, therefore, may not be computationally affordable for any general function .
In the following theorem, we show that the PP method converges linearly to which is the unique solution of the problem . This result was established in Theorem 2 of (Rockafellar, 1976b) and we mention it here for completeness and we later use it as a benchmark.
Consider the saddle point problem in (1) under Assumption 1 and the proximal point method in (5). Further, recall the definition of in (2). Then, for any , the iterates generated by the proximal point method satisfy
In the following theorem, we characterize the convergence rate of PP for a function that is strongly convex with respect to and strongly concave with respect to . Once again, this result was established in (Rockafellar, 1976b) and we mention it here for completeness and we later use it as a benchmark.
Consider the saddle point problem in (1) under Assumption 2 and the proximal point method in (5). Further, recall the definition of in (2). Then, for any , the iterates generated by the proximal point method satisfy
Theorem 2 states that for the general saddle point problem in (1), if the function is strongly convex-strongly concave, the iterates generated by the PP method converge linearly to the optimal solution.
Optimistic Gradient Descent Ascent method
In this section, we study the Optimistic Gradient Descent Ascent (OGDA) method for solving saddle point problems. We first show that OGDA can be considered as an approximation of the proximal point method. Then, we use this interpretation to analyze its convergence properties for bilinear and strongly convex-strongly concave settings. The proximal point approximation approach also allows us to generalize the update of OGDA as we discuss in detail in Section 4.2.
The main idea behind the updates of the OGDA method is the addition of a “negative-momentum” term to the updates which can be clearly seen when we write the iterations as follows:
The last term in parenthesis for each of the updates can be interpreted as a “negative-momentum”, differentiating the OGDA method from vanilla Gradient Descent Ascent (GDA).
We analyze the OGDA method as an approximation of the Proximal Point (PP) method presented in Section 3. We first focus on the bilinear case (Assumption 1) for which the OGDA updates are
Note that the update of the PP method for the variable in the considered bilinear problem is
where we used the fact that is an approximation of with an error of . Regrouping the terms and using the updates of the PP method yield
where the last expression is the OGDA update for variable plus an additional error of . A similar derivation can be done for the update of variable to show that OGDA is an approximation of the PP method up to . In the following proposition, we show that this observation can be generalized for any general smooth (possibly nonconvex) function .
Consider the saddle point problem in (1). Given a point , let be the point we obtain by performing the PP update on , and let be the point we obtain by performing the OGDA update on . Then, for a given stepsize we have
To analyze the convergence of OGDA, we view it as a proximal point algorithm with an additional error term. In the following theorem, we characterize the convergence rate of the OGDA method for the bilinear saddle point problem defined in Assumption 1.
Consider the saddle point problem in (1) under Assumption 1 and the OGDA method outlined in Algorithm 1. Further, recall the definition of in (2). If we set , then the iterates generated by the OGDA method satisfy
where and is a positive constant independent of the problem parameters.
The result in Theorem 3 shows linear convergence of OGDA in a bilinear problem of the form where matrix is square and full rank. It further shows that the overall number of iterations to obtain an -accurate solution is of , where is the problem condition number as defined in Assumption 1. We would like to mention that this result is similar to the one shown in (Liang & Stokes, 2018), except here we analyze OGDA as an approximation of PP.
In the following theorem, we again use the proximal point approximation interpretation of OGDA to provide a convergence rate estimate for this algorithm when it is used for solving a general strongly convex-strongly concave saddle point problem.
Consider the saddle point problem in (1) under Assumptions 2 and 3 and the OGDA method outlined in Algorithm 1. Further, recall the definition of in (2). If we set , then the iterates generated by OGDA satisfy
where and are positive constant independent of the problem parameters.
The result in Theorem 4 shows that OGDA converges linearly to the optimal solution under the assumptions that is smooth and strongly convex-strongly concave. In other words, it shows that to achieve a point with error , we need to run at most iterations of OGDA. This result can be compared with the results in (Gidel et al., 2018), a recent independent work which derived the OGDA updates as a variant of the EG updates (interpreting OGDA as extrapolation from the past). In this connection, the OGDA iterates are the “midpoints” whereas (Gidel et al., 2018) provides a convergence of the original points (not the OGDA iterates) to an error of in .
2 Generalized OGDA method
The update of OGDA both in theory and practice is only studied for the case that the coefficients of both and are . This implies that in the OGDA update at step , the coefficient of the current gradient, i.e., , should be exactly twice the coefficient of the negative of the previous gradient, i.e., . It has been an open question to see if different stepsizes can be used for these terms. In this section, we generalize OGDA where the coefficients for the gradient descent and the negative momentum terms are not necessary equal to each other. We consider the following OGDA dynamics with general stepsize parameters :
Note that for , we recover the original OGDA method. Our goal is to show that OGDA is convergent even if and are not equal to each other, as long as their difference is sufficiently small. In the following theorem, we formally state our result for the generalized OGDA method described in (7) and (8) when the objective function has a bilinear form of .
Consider the saddle point problem in (1) under Assumption 1 and the generalized OGDA method in (7)-(8). Further, recall the definition of in (2). If we set and and satisfy the conditions for some constant , then the iterates generated by the generalized OGDA method satisfy
where and is a positive constant independent of the problem parameters.
Theorem 5 shows that it is not necessary to use a factor of in the OGDA update to have a linearly convergent method and for a wide range of parameters this result holds. A result similar to Theorem 5 can be established when . We do not state the results here due to space limitations.
Extra-gradient method
In this section, we study the Extra-gradient (EG) method for solving the general saddle point problem in (1) and provide linear rates of convergence for the bilinear and the strongly convex-strongly concave case by interpreting this algorithm as an approximation of the proximal point method.
The main idea of the EG method is to use the gradient at the current point to find a mid-point, and then use the gradient at that mid-point to find the next iterate. To be more precise, given a stepsize , the update of EG at step for solving the saddle point problem in (1) has two steps. First, we find mid-point iterates and by performing a primal-dual gradient update as
Then, the gradients evaluated at the midpoints and are used to compute the new iterates and by performing the updates
The steps of the EG method for solving saddle point problems are outlined in Algorithm 2.
Note that in the update of the EG method, as the name suggests, requires evaluation of extra gradients at the midpoints and which doubles the computational complexity of EG compared to the vanilla Gradient Descent Ascent (GDA) method. We show next EG approximates the Proximal Point (PP) method more accurately, as compared to the GDA method. Consider the bilinear saddle point problem defined in Assumption 1. By following the update of PP in Section 3 and simplifying the expressions, the PP update for the bilinear problem under Assumption 1 can be written as
As the computation of the inverse could be costly, one can use instead with an error of . This approximation retrieves the update of GDA which is known to possibly diverge for bilinear saddle point problems (see Daskalakis et al. (2017)). If we use the more accurate approximation which has an error of , we obtain
If we ignore the extra terms in (9)-(10) which are of , we recover the update of the EG method for the bilinear saddle point problem defined in Assumption 1. Therefore, in the bilinear problem, the EG method can be interpreted as an approximation of the PP method with an error of . In the following proposition, we extend this result and show that for any general smooth (possibly nonconvex) function , EG is an approximation of PP.
Consider the saddle point problem in (1). Given a point , let be the point we obtain by performing the PP update on , and let be the point we obtain by performing the EG update on . Then, for a given stepsize we have
The next theorem views the EG method as the PP method with an error and properly bounds the error to provide convergence rate estimates for the EG method in the bilinear case.
Consider the saddle point problem in (1) under Assumption 1 and the extra-gradient (EG) method outlined in Algorithm 2. Further, recall the definition of in (2). If we set , then the iterates generated by the EG method satisfy
where is a positive constant independent of the problem parameters.
The result in Theorem 6 shows linear convergence of the iterates generated by the EG method for a bilinear problem of the form where the matrix is square and full rank. In other words, we obtain that the overall number of iterations to reach a point satisfying is at most which matches the rate achieved in Tseng (1995) for bilinear problems. It is worth mentioning that we obtain this optimal complexity of for EG in bilinear problems by analyzing this algorithm as an approximation of the PP method which differs from the approach used in (Tseng, 1995) that directly analyzes EG using a variational inequality approach. We further would like to add that this result improves the of (Liang & Stokes, 2018) for EG in bilinear problems.
The following theorem characterizes the convergence rate of the EG method when is strongly convex-strongly concave.
Consider the saddle point problem in (1) under Assumptions 2 and 3 and the extra-gradient (EG) method outlined in Algorithm 2. Further, recall the definition of the condition number in Remark 1 and the definition of in (2). If we set , then the iterates generated by the EG method satisfy
where is a positive constant independent of the problem parameters.
The result in Theorem 7 characterizes a linear convergence rate for the EG algorithm in a general smooth and strongly convex-strongly concave case. Similar to the bilinear case, our proof relies on interpreting EG as an approximation of the PP method. Theorem 7 further shows that the computational complexity of EG to achieve an -suboptimal solution, i.e., , is , where is the condition number of the number. Note that this complexity bound can also be obtained from the results in (Tseng, 1995).
Numerical Experiments
In this section, we compare the performance of the Proximal Point (PP) method with the Extra–Gradient (EG), Gradient Descent Ascent (GDA), and Optimistic Gradient Descent Ascent (OGDA) methods.
We first focus on the following bilinear problem
We proceed to study the performance of PP, EG, GDA, and OGDA for solving the following strongly convex-strongly concave saddle point problem
This is the saddle point reformulation of the linear regression
with an regularization, as shown in Du & Hu (2018). As done in Du & Hu (2018), we generate the rows of the matrix according to a Gaussian distribution . Here, we set and , and assume . We also set the regularizer to . Figure 3 illustrates the distance to the optimal solution of the considered algorithms versus the number of iterations. As we can see, EG and OGDA perform better than GDA and their convergence paths are closer to the one for PP which has the fastest rate. This observation matches our theoretical claim that EG and OGDA are more accurate approximations of PP relative to GDA.
Conclusions
We consider discrete time gradient based methods for solving convex-concave saddle point problems, with a focus on the Extra-gradient (EG) and the Optimistic Gradient Descent Ascent (OGDA) methods. We show that EG and OGDA can be seen as approximations of the classical Proximal Point (PP) method. We provide linear rate estimates for the bilinear and strongly convex-strongly concave saddle point problems for EG and OGDA as well as their generalizations.
References
Supplementary Material
First, note that the update of the Proximal Point (PP) method for the bilinear problem (Assumption 1) can be written as
We can simplify the above iterations and write them as an explicit algorithm as follows:
Let us define the symmetric matrices and . Based on these definitions, and the expressions in (18) and (19) we can show that the sum can be written as
To simplify the expression in (8.1) we first prove the following lemma which is also useful in the rest of proofs.
Let be the singular value decomposition of . Here and are orthonormal matrices and is a diagonal matrix with the eigenvalues of as the diagonal entries. Then, we have:
Here we used the property that . Now, we simplify the other side to get:
Now, since , the claim in (21) follows. Using a similar argument we can also prove the equality in (22). ∎
Using the result in Lemma 1 we can show that
where the second equality holds as . Hence, the expression in (26) can be simplified as
We simplify equation (26) as follows. Consider the term involving . We have
Now we use Lemma 1 to simplify (8.1) as follows
where the last equality follows by replacing by its definition. The same simplification follows for the terms invovling which leads to the expression
Substitute and in (26) with the expressions in (8.1) and (29), respectively, to obtain
Now, using the expression in (30) and the fact that we can write
2 Proof of Theorem 2
The update of PP method can be written as
where we used the fact that . Replace and with their definition in (33) and further set to obtain
since . Replace and with their definitions and further set to obtain
In particular, by setting we obtain that
Now, considering (35), by adding and subtracting we can write
By using the inequality in (40) we can write
Similarly, considering (38), we can write
Add equations (43) and (45), and use the definition to obtain
Regrouping the terms and using the definition leads to
3 Proof of Proposition 1
We start from the Proximal Point (PP) dynamics and show that an approximation of this dynamics leads to OGDA. The PP updates are as follows
By writing the Taylor’s expansion of , we obtain
On adding and subtracting the term , we get
Note that from the Taylors expansion of and the PP updates, we have
Again, from the Taylor’s expansion of and the PP updates, we have
Making the approximations of Equations (52) and (54) in Equation (50) yields
Making this substitution back in Equation (55), we get
which is equivalent to the OGDA update plus an additional error term of order . The same analysis can be done for the dual updates as well to obtain
This shows that the OGDA updates and the PP updates differ by .
4 Proof of Theorem 3
We define the following symmetric matrices
We rewrite the properties of and which are
Recall that the update of OGDA for the bilinear problem can be written as
The update for the variable can be written as an approximate variant of the PP update as follows
Therefore, the error between the OGDA and Proximal updates for the variable is given by
We first derive an upper bound for the term in the first parentheses .
Once again, using the OGDA updates for and , we have
Therefore, considering the expressions in (62) and (8.4) the error between the updates of OGDA and PP for the variable can be written as
We apply the same argument for the update of the variable . Combining these results we obtain that the update of OGDA can be written as
As in the proof of Theorem 1, we define and . Then, we can show that
Define . We have:
And for .
5 Proof of Theorem 4
We define and . Then the OGDA updates can be compactly written as:
We write the update in terms of the Proximal Point method with an error as follows:
On rearranging Equation (72) and using the fact that , where , we get:
On squaring Equation (74) and using Young’s inequality, we get:
Now, using strong convexity, and substituting , we get:
The following part of the proof is inspired by the result of Theorem of Gidel et al. (2018). For OGDA iterates, we have:
On subtracting from both sides and squaring, we have:
Substituting Equations (79) and (80) in Equation (78), we have:
Substituting Equation (84) in Equation (82), we get:
For , we have: and therefore, can ignore the last term, which gives us (for )
since , we have , which gives:
Substituting Equation (89) back in Equation (76), we get:
On defining , we have:
6 Proof of Theorem 5
The generalized OGDA method for bilinear problems is given by:
We compare this with the Proximal Point (PP) method with stepsize . The proof follows long the exact same lines as the proof of Theorem 3.
We define the following symmetric matrices
We rewrite the properties of and which are
Therefore, the error between the OGDA and Proximal updates for the variable is given by
We first derive an upper bound for the term in the first parentheses .
Using the generalized OGDA update, we have:
Once again, using the generalized OGDA updates for and , we have
Therefore, considering the expressions in (95) and (8.6) the error between the updates of OGDA and PP for the variable can be written as
Now, the convergence proof follows along the same lines as the proof of Theorem 3. We set , and we need the additional assumption:
due to the presence of the term , Let
On making these substitutions, we get the same result as Theorem 3.
7 Proof of Proposition 2
The Extragradient updates can be written as
By writing the Taylor’s expansion of we obtain that
By following the same argument for we obtain
Now we find a second order approximation for the Proximal Point Method. Note that the update of the proximal point method for variable can be written as
where in the second equality we replaced and in the gradient with their updates. Hence, using Taylor’s series we can show that
where in the second equality we used the fact that and . Similarly, we find the approximation of the update of which leads to
Comparing the expressions in (8.7) and (103) with the ones in (8.7) and (8.7) implies that the difference between the updates of PP and EG is at most and this completes the proof.
8 Proof of Theorem 6
Define the following symmetric error matrices
which are useful to characterize the difference between the updates of EG and PP for a bilinear problem. Note that we can bound the norms of and as
Since , we have:
Also, from Lemma 1 in the proof of Theorem 1, and the definitions of the error matrices in (107) it can be verified that
Moreover, using the definitions of and in (107), the EG updates can be written as
As in the proof of Theorem 1, we define and . Using these definitions we can show that
Now before adding the two sides of the expressions in (115) and (116), note that some of the cross terms in (115) and (116) cancel out. For instance, using Lemma 1 and Equations (111) and (112) we can show that
By using similar arguments it can be shown that summing two sides of the expressions in (115) and (116) leads to
where in the second equality we used the simplifications
Define . We have:
Choosing , we have:
9 Proof of Theorem 7
Define and . Then the EG updates can be compactly written as:
We write the update in terms of the Proximal Point method with an error as follows:
On squaring and simplifying this expression, we have:
where (Note that ). We simplify the right hand side of Equation (125) as follows-
The following part of the proof is inspired by the result of Theorem of Gidel et al. (2018). We simplify the inner products and give a lower bound using strong convexity as follows:
since . Now, using Young’s inequality, we have:
Substituting the above inequality in Equation (126), we have:
Since , we have:
For , we have (since ), which gives: