Optimistic Dual Extrapolation for Coherent Non-monotone Variational Inequalities
Chaobing Song, Zhengyuan Zhou, Yichao Zhou, Yong Jiang, Yi Ma
Introduction
where is called a strong solution of VIP. For the minimax problem
let . Then solving (1) is equivalent to finding a first-order Nash equilibrium of the minimax problem (2) .
The operator will be monotone if
VI with monotone operators has been well studied, which provides a concise and optimal framework for convex-concave minimax problems . For monotone VIP, it is well known that the strong solution satisfying (1) is also equivalent to the solution satisfying:
where is called a weak solution of VIP. A classical result under the monotone and Lipschitz continuous assumptions is that the Mirror-Prox algorithm can converge to an -accurate weak solution in terms of ergodic averaging in iterations, which is optimal for first-order methods in solving monotone VIPs . Nemirovski’s Mirror-Prox is a non-Euclidean extension of the extragradient method from the perspective of mirror descent. Another important non-Euclidean extension is Nesterov’s dual extrapolation from the perspective of dual averaging, which also has the optimal convergence rate. The main difference between mirror descent and dual averaging is the way of combining the constraint (or the regularization term if exists) into the projection (or the proximal) step .
Despite obtaining the optimal convergence rate, both Mirror-Prox and dual extrapolation are two-call extragradient methods that need to evaluate gradients twice per iteration. In some contexts such as training deep neural networks, evaluating gradients can be expensive. Thus it will have significant practical benefits if we only need one gradient evaluation per iteration and still maintain the same convergence rate. In terms of single-call methods for minimax problems, vanilla gradient descent ascent (and its mirror descent generalizations) might be a natural choice. Unfortunately, it is not guaranteed and it can diverge even in simple monotone settings . Consequently, after the (two-call) extragradient method , several single-call extragradient methods have been analyzed under the monotone setting and share the same convergence rates with Mirror-Prox and dual extrapolation . However, there is an increasing trend in applying these single-call extragradient methods to stabilize the training of generative adversarial networks (GAN) , which is nonconvex-nonconcave in general and hence has remained underexplored.
Nonconvex-Nonconcave Minimax Problems.
Despite the well-developed convergence theory for monotone VIPs and thus for convex-concave minimax problems, many minimax problems arising in modern machine learning are nevertheless nonconvex-nonconcave, such as GAN , adversarial training , gradient reversal for domain adaption , and multi-agent reinforcement learning . As a result, the corresponding VI is not monotone and the aforementioned theoretical guarantees for monotone VIPs no longer apply. First, for non-monotone VIPs, it is nontrivial to obtain the rate of convergence to a weak solution, thus one may explore the rate of convergence to a strong solution instead. Second, without the monotone property, the ergodic averaging technique will no longer have theoretical guarantees, thus we might need to choose the last iterate or best iterate. However, the classical convergence result said little about the rate of convergence to a weak solution or the convergence of last iterate or best iterate.Recently, shows the first tight last iterate result for general smooth convex-concave minimax problems with Lipschitz derivatives of operators.
To obtain theoretical guarantees beyond the monotone setting, a common approach is to relax the lower bound (3) in the monotone assumption. Along this research line, several more general assumptions have been proposed, such as the pseudo-monotone assumption and its variants , and the generalized monotone assumption . In the machine learning community, similar concepts have also been proposed, such as variational coherence . For simplicity, we coin the problem class along this research line as coherent non-monotone variational inequalities. Among them, is the first to provide explicit global convergence results such that the best iterate of the N-EG method can converge to an -accurate strong solution in iterations under the generalized monotone and Lipschitz continuous assumptions. However, N-EG needs to evaluate gradient twice per iteration, which is less desirable when gradient evaluation is expensive. For the single-call extragradient method , under a second-order conditionAs we will see, it is a localized version of our assumption., very recently has provided local linear convergence results in certain non-monotone setting, while the constants in these results remain implicit. The following problem remains open: Can single-call extragradient methods have explicit global convergence results beyond the monotone setting?
Contributions of This Paper.
In this paper we develop an Optimistic Dual Extrapolation (OptDE) method that provably converges to a strong solution for coherent non-monotone VIPs. The OptDE method can be viewed as a single-call variant of Nesterov’s dual extrapolation that maintains its “anticipatory” properties. We characterize convergence rates of the best iterateFor given a number of iterations, the best iterate can be explicitly found and happen before the last iterate. of OptDE under two coherent non-monotone assumptions, where the merit function is given in Definition 1 and is the natural norm used in algorithms. As shown in Table 1, when the problem has a weak solution , our method matches the best known rate of N-EG . Further strengthening the assumption to that a -weak solution exists with – nevertheless a weaker condition than the strongly monotone assumption required in previous work, we are able to obtain a linear convergence rate of . For this setting, we can also use the distance to measure the progress and obtain a linear convergence result; meanwhile, despite not shown in Table 1, we also obtain a linear convergence result of the last iterate. Our result shows that even under the two coherent non-monotone assumptions, the convergence rate of single-call extragradient methods can be comparable to that of the N-EG method with two gradient evaluations per iteration.
Our coherent non-monotone analysis for the setting that a -weak solution exists has two meaningful corollaries about best iterate and last iterate in the monotone setting, respectively: With a regularization trick, both the best iterate and last iterateHere the last iterate is not in the classical sense, which will be explained in Section 3. of OptDE can be an -accurate solution in number of iterations. To our knowledge, the near-optimal result for attaining an -accurate strong solution was only appeared in very recently with a two-loop Halpern iteration method, while our result is obtained by the simpler single-loop single-call OptDE method.
Meanwhile, we extend the OptDE algorithm to the stochastic setting as Stochastic OptDE (SOptDE) and show that our results in the deterministic setting can be naturally generalized to the stochastic setting. This allows us to characterize the stochastic oracle complexity (i.e., the number of stochastic oracles we access) of SOptDE under the coherent non-monotone assumptions. The results under the stochastic setting are summarized in Table 2.The results of the SEG , ESA algorithms are given under pseudomonotone and strongly pseudomonotone assumptions respectively, which are slightly stronger than our assumptions. As we see, the results match the best-known results of SEG The original result of SEG is given by “square natural residual”, which can be used to derive the strong solution guarantee in Table 2 (see the supplementary material for detail). and ESA respectively, while both SEG and ESA need two gradient evaluations per iteration. Meanwhile, under the assumption that a -weak solution exists, we obtain the first theoretical guarantee in terms of the merit function in Definition 1.
Last but not least, different from N-EG and ESA , the proposed OptDE and SOptDE algorithms only need the norm square being strongly convex but not necessarily globally Lipschitz continuous, which will be significant if is a non-Euclidean norm: can not be strongly convex and globally Lipschitz continuous simultaneously in general.
Technical Assumptions
To measure the accuracy of iterates to a strong solution, we consider the following “restricted strong merit function”.
With and , Definition 1 becomes the definition of the strong solution in (1). In the nonconvex-nonconcave minimax setting, Definition 1 has been proposed as the definition of the -accurate first-order Nash equilibrium . If is a bounded set, then we still have an effective measure even if ; if is unbounded, then needs to be a finite positive parameter. To give a unified measure for both bounded and unbounded settings, we set to be a finite positive parameter.
Throughout this paper, we make the following standard Lipschitz continuous assumption.
For the VIP in (1), where is the Lipschitz constant.
Meanwhile, we assume that the (possible non-Euclidean) norm satisfies Assumption 2.
is -strongly convex () with respect to (w.r.t.) and the dual norm of gradient is bounded by :
From , is -strongly convex . Without loss of generality, in Assumption 2, we assume For all the norm setting , we have
For the norm , we define the prox-mapping as
and assume that it can be solved efficiently. Meanwhile, we also define the corresponding Bregman divergence of :
Obviously we have
Then we make Assumptions 3 and 4 for the coherent non-monotone VIP we study.
For the VIP in (1), there exists a weak solution such that .
For the VIP in (1), given there exists a -weak solution with parameter such that .
Assumption 3 assumes the existence of weak solutions, which is also adopted in . Assumption 3 is slightly weaker than the variational coherence assumption or the generalized monotone assumption . Some nontrivial examples satisfying the generalized monotone assumption can be found in . The generalized monotone assumption is in turn weaker than the pseudo-monotone assumption , which is weaker than the monotone assumption (3).
Assumption 4 further assumes a stronger variant of Assumption 3, which is also called as strongly variational stability in . For the Euclidean setting where and thus , the inequality is simplified to . Assumption 4 is weaker than the strongly pseudo-monotone and strongly monotone assumptions, but as we will see, is already sufficient to ensure a linear convergence rate for our method.
Our main motivation in making Assumptions 3 and 4 is to prove explicit global convergence results for VIP under conditions as weak as possible. However, the non-monotone subsets of Assumptions 3 and 4, a.k.a., pseudomonotone and strongly pseudomonotone respectively, also have many real applications in competitive exchange economy , fractional programming , and product pricing . Meanwhile, the restriction of Assumption 4 in minimization problems such as one-point convexity is also used in analyzing neural networks.
Optimistic Dual Extrapolation
In this section, we present the optimistic dual extrapolation (OptDE) algorithm for solving the VIP in (1). The method is a single-call variant of Nesterov’s dual extrapolation . The overall algorithm is summarized as Algorithm 1. The algorithm works under either Assumption 3 by setting or Assumption 4 with .
For Algorithm 1, we define two constants and in Step 2. Then we initialize three vectors and in Step 3. In the main loop, we update the two positive numbers and in Step 5. Then we perform an “extrapolation” step in Step 6 and then “dual averaging” steps in Steps 7 and 8. As we see, as Algorithm 1 only performs one new gradient evaluation in Step , it is “optimistic” hence the name “optimistic dual extrapolation”. Once Algorithm 1 runs iterations, we return the best iterate measured by the sum of residual norms This return value is given according to our convergence analysis..
Compared with Nesterov’s dual extrapolation, the main difference is that the extrapolation Step 6 is a prox-mapping on , not on . Compared with past extra-gradient , the main difference is that we perform dual averaging by Steps 7 and 8, instead of a “mirror descent” step. Compared with N-EG which is claimed to be a non-Euclidean extragradient method , not only we perform just one gradient evaluation per iteration but also do not require to have bounded Lipschitz continuous gradients, which is significant in the non-Euclidean setting since the norm square for may not have globally bounded Lipschitz continuous gradients.
In the following, we assume is a solution that satisfies Assumption 3 if or satisfies Assumption 4 if .
with C_{0}=\big{(}1+\frac{\delta}{{\alpha\gamma}}\big{)}\sqrt{\frac{8\alpha}{\gamma}}, , and
Theorem 1 implies our main result in Table 1. As we see, for except for constants, our result is the same with the two-call extragradient method N-EG . However, to analyze single-call methods, particularly for the setting , the analysis is much more involved and leads to an interesting criterion of return value in Step 10 of Algorithm 1. For the setting then linear convergence rates can be obtained in terms of both restricted strong merit solution and solution distance. Meanwhile, for the setting , our result in terms of restricted strong merit solution (10) can not be implied by the result of the solution distance (12), while the reverse side is true. Furthermore, when , the result (10) is also used in deriving Corollary 1 for the monotone setting. Finally, to simplify our analysis, we did not yet optimize the constants in (10) and (12), which probably can be further improved.
In Theorem 1, we provide a unified result for the two settings and in terms of the best iterate. However, when we can also prove linear convergence rates in terms of last iterate, which is given in Proposition 1 below.
Let Assumptions 1 and 2 hold. For the setting (i.e., Assumption 4 holds), after iterations, Algorithm 1 returns a such that
with defined in Theorem 1, and
By Proposition 1, to prove the linear convergence of the last iterate, we do not need the strongly monotone assumption, but only Assumption 4. Despite the last iterate also has a linear convergence rate, it is slower than the rate of best iterate in Theorem 1. As we will see, Proposition 1 will also be used to prove the last iterate convergence for the monotone setting in a non-classical sense.
The motivation behind OptDE is that by generalizing Nesterov’s estimation sequence, we can perform a unified convergence analysis under Assumptions 3 and 4. However, as shown in , if a regularizer exists, the (regularized) dual averaging steps (Steps 7 and 8 of Algorithm 1) can help us better explore the structure of regularizers such as sparsity when it exists.
has given local convergence analysis in terms of solution distance by assuming that Assumption 4 holds in a neighbourhood of the optimal solution. The analysis in needs extra techniques, while the constants in the rates of are implicit. Our solution distance result in (12) can be viewed as a global and explicit version of by assuming Assumption 4 holds globally. Meanwhile, does not give any result under Assumption 3 or in terms of restricted strong solution under Assumption 4 whereas our analysis does.
Our results are mainly given under the coherent non-monotone Assumptions 3 and 4. As shown in Theorem 1, under Assumption 3 that includes the monotone assumption, we can obtain an -accurate strong solution in iterations. However, in the following we show that with a regularization trick, the rate can be much better in the monotone setting by using our results in Theorem 1 and Proposition 1.
First, to give our results in the monotone setting, we have Lemma 1.
If the VIP is monotone, then the regularized problem VIP satisfies Assumption 4 with
Due to Lemma 1, we can apply Theorem 1 and Proposition 1 to the regularized problem VIP, and then obtain Corollaries 1 and 2 for the VIP, respectively.
Given , let Assumptions 1 and 2 hold for the regularized problem VIP. By optimizing the regularized problem by Algorithm 4, then the best iterate returned by Algorithm 4 satisfies
Compared with Theorem 1 and Proposition 1, we need an extra condition in Corollary 1, which can be satisfied by choosing a large enough . By Corollary 1, by choosing K=O\big{(}\frac{1}{\epsilon}\log\frac{1}{\epsilon}\big{)}, we will obtain an -accurate solution. Note that does not appear in our algorithm and is not relevant to the choice of
Given , let Assumptions 1 and 2 hold for the regularized problem VIP. By optimizing the regularized problem by Algorithm 4, the last iterate of Algorithm 4 satisfies
Similar to Corollary 1 for best iterate, in Corollary 2, by choosing K=O\big{(}\frac{1}{\epsilon}\log\frac{1}{\epsilon}\big{)}, the last iterate will be an -accurate strong solution, which is significantly better than the tight bound for last iterate . Nevertheless, it should be noted that Corollary 2 is in a non-classical sense: we do not guarantee last iterate convergence for all , but only after K=O\big{(}\frac{1}{\epsilon}\log\frac{1}{\epsilon}\big{)} with a prescribed accuracy parameter Thus our result does not contradict with the lower bound of last iterate .
Meanwhile, our proof only relies on the regularized problem VIP satisfying Assumption 4 with which holds if the VIP is monotone. However, it is not necessary for the VIP to be monotone. For instance, if the VIP satisfies Assumption 3 and then the VIP also satisfies Assumption 4 with Of course, letting is impractical and we leave the more general setting of under non-monotone settings for further research.
Recently, has proposed a different Halpern iteration method under the monotone and Lipschitz assumptions. The Halpern iteration method does not need to know the Lipschitz constant and thus is parameter-free, and also attains the O\big{(}\frac{1}{\epsilon}\log\frac{1}{\epsilon}\big{)} convergence rate. Nevertheless, there are two major differences: The Halpern iteration method has two-loop, while our OptDE method is a single-loop single-call method; now the Halpern iteration method is limited to the Euclidean setting, while ours can have theoretical guarantees in the non-Euclidean setting.
Stochastic Optimistic Dual Extrapolation
For ( Assumption 4 holds), we have
As show in (16), for the setting (a.k.a., Assumption 3) even if the number of iterations , the expected restricted strong merit function can only be upper bounded by O\big{(}\frac{s}{L}\big{)}. Thus to guarantee the convergence of SOptDE, the variance should be , such as s^{2}=O\big{(}\frac{1}{K}\big{)}. In the Euclidean setting that by the concentration inequality , to attain a variance of O\big{(}\frac{1}{K}\big{)}, we need samples. Thus combining the setting and the result in (16), it can be verified that the single-call SOptDE method needs number of samples to obtain an -accurate solution in terms of the expected restricted strong merit function.
Under the stronger Assumption 4, our result is given in terms of the expected solution distance. As shown in (17), under Assumption 4, SOptDE can converge provably even when the variance is constant. In fact, the O\big{(}\frac{1}{K}\big{)} is optimal and has been obtained by the two-call extragradient method ESA under the pseudomonotone assumption. Meanwhile, used the plain stochastic gradient descent algorithm and obtained the O\big{(}\frac{1}{K}\big{)} result for strongly monotone variational inequalities, which can also be extended to the setting that -weak solution exists.
With the aggressive parameter setting and a large batch size strategy, we also obtain the first convergence guarantee in terms of restricted strong merit function as shown in Table 2 (see details in the supplementary material).
Concluding Remarks
In this paper, we proposed a single-call extragradient method optimistic dual extrapolation (OptDE) beyond the monotone setting and also extended it to the stochastic setting as stochastic optimistic dual extrapolation (SOptDE). We systematically proved the convergence results of OptDE and SOptDE under the Assumption 3 that a weak solution exists and Assumption 4 that a strongly weak solution exists. We also show beneficial implications of our analysis in both non-monotone and monotone settings. In the future, we will further study how the proposed new methods may lead to improved computational efficiency and performance guarantees in a wide range of machine learning problems such as the training of adversarial deep neural networks.
Broader Impact
In this paper, we discuss a systematic theoretical analysis for single-call extragradient methods, which has been widely used for modern machine learning applications. The theoretical results in this paper can bring in meaningful insight and understanding for practical algorithms.
Acknowledgement
Chaobing and Yi acknowledge support from Tsinghua-Berkeley Shenzhen Institute (TBSI) Research Fund. Yichao and Yi acknowledge funding from Sony Research. Yi acknowledges support from ONR grant N00014-20-1-2002 and the joint Simons Foundation-NSF DMS grant #2031899, as well as support from Berkeley AI Research (BAIR), Berkeley FHL Vive Center for Enhanced Reality, and Berkeley Center for Augmented Cognition.
References
Appendix A Convergence Analysis of Optimistic Dual Extrapolation
Based on the optimality condition of and Assumption 2, we have Lemma 2.
then , we have
In Lemma 2, the sequence can be viewed as the errors we need to bound in each iteration. The upper bound of the sum of is given in Lemma 3 below.
In Algorithm 1, we have
where we define for convenience.
By Lemma 3, \forall\;0<\alpha\leq\min\Big{\{}\frac{1}{4\sqrt{2}},\frac{\sqrt{3}}{4\sqrt{\gamma}}\Big{\}} and is upper bounded by the sum of strictly negative terms about , which makes it possible to give a upper bound about . To show the guarantees by restricted strong merit function and the distance , we give Lemma 4.
In Algorithm 1, we have,
Then combining Lemmas 2, 3 and 4, we obtain Theorem 1 in main body (see Section C.4 for the proof.).
Appendix B Convergence Analysis of Stochastic Optimistic Dual Extrapolation
We can extend the proof for the OptDE method in Section A to the stochastic setting for Lemmas 5, 6 and 7 and then obtain Theorems 2. First, we extend Lemma 2 into Lemma 5.
In Algorithm 5, , we have the following inequality: , let
Compared with the of Lemma 2, contains an extra term . Then based on the definition of and Assumption 5, we have Lemma 6.
In Algorithm 5, and we have
Lemma 6 extends Lemma 3 into the stochastic setting. Meanwhile, by the optimality condition of , and Assumptions 1, 2 and 5, we can extend Lemma 4 to Lemma 7.
In Algorithm 5, for the setting and we have,
Then combining Lemmas 5, 6 and 7, we obtain Theorem 2 for the SOptDE method in the main body (see Section D.4 for the proof).
It turns out that with the conservative setting , we can not obtain strong convergence results in terms of restricted strong merit function. To obtain the rate , we need adopt the more aggressive setting with a large batch size strategy, which is given in Algorithm 3. With this setting, we have Proposition 2.
with C_{1}:=4\big{(}1+\frac{\delta}{{\alpha\gamma}}\big{)}\sqrt{\frac{\alpha}{\gamma}}, , and
The proof of Proposition 2 follows the same pipeline of proving Theorem 2, except that we use the setting that is also used in Algorithm 4. We leave the proof of Proposition 2 as a simple exercise.
In Proposition 2, if we hope the variance of the stochastic estimation as then we need stochastic samples per iteration. Meanwhile, to attain an expected -accurate strong solution, we will need number of iterations. Thus the total number of stochastic samples we need is
B.2 The “ (quadratic) natural residual function” [18] and restricted strong merit function
In our notation, for any the (quadratic) natural residual function in is defined by: given
which can be used to derive the restricted strong merit function as Proposition 3.
Let Then we have
where is by the optimality condition of , is by the Cauchy-Schwarz inequality, is by the Lipschitz continuity of and the bounded assumption (7). So we have
Appendix C Proof of Section A
By the definition of proximal operator (8), we can equivalently reformulate the optimistic dual extrapolation (OptDE) algorithm in the main body as Algorithm 4. Then based on the definition of in Step 7 and the definition of the Bregman divergence , we can verify that
where is an arbitrary vector in and is irrelevant to the minimizer In our context, plays the role of a “generalized estimation sequence” to help us conduct convergence analysis. By the -strong convexity of the Bregman divergence , we know that is strongly convex with strong convexity parameter
Proof. Given the definition of the generalized estimation sequence in (31) and the minimizer in Algorithm 4, by the optimality condition of , we have:
where is by the optimality condition (32), and is by the convexity of both and
Meanwhile for , by the definition of , we have
where is by the -strong convexity of Meanwhile, by the strong convexity of in Assumption 2, we have
Then combining (34) and (35), and after simple arrangements, we have
where is by the fact and the upper bound of by (33). By the setting in Algorithm 4 and (37), we have
Then based on the definition of in Lemma 2, after simple arrangements, Lemma 2 is proved.
C.2 Proof of Lemma 3
Proof. By the definition of in Lemma 2, we have:
where is by the Cauchy–Schwarz inequality, is the Lipschitz continuous Assumption 1, is by the fact is by the triangle inequality of norm and is by the fact
Then by the optimality condition of in the -th iteration of Algorithm 4, we have:
By combining (39), (40) and (41), we have
from Step 5 of Algorithm 4 and the setting
and is by the setting that
With the , for convenience, we set By summing (42) from to , we have
where is by the fact , and is by the fact that . Lemma 3 is proved.
C.3 Proof of Lemma 4
where is by the optimality condition of , is by the Cauchy-Schwarz inequality, is by the Lipschitz continuity of and the bounded assumption (7), is by the triangle inequality of norm So we have
Meanwhile, if there exists a that satisfies Assumption 4, , with then in (45), let and by the fact and , we have
C.4 Proof of Theorem 1
Proof. Firstly, by the setting and we have:
If then
If , then A_{k}=\frac{1}{\sigma}\Big{(}1+\frac{{\alpha\gamma}\sigma}{L}\Big{)}^{k}-\frac{1}{\sigma}.
Let be the in Assumption 3 if or the in Assumption 4 if . Then by the property of , we have So by (49), it follows that
By the setting with in Algorithm 4, we have Meanwhile, for convenience, we have set . So we have
So by (20) of Lemma 4 and (52), it follows that
Similarly, if then by (21) of Lemma 4 and (52), we have
Then by defining C_{0}:=\Big{(}1+\frac{\delta}{{\alpha\gamma}}\Big{)}\sqrt{\frac{8\alpha}{\gamma}}, Theorem 1 is proved.
C.5 Proof of Proposition 1
Proof. The proof follows the same paradigm of Section C.4. Firstly, by the setting and we have
Then the (51) of Section C.4 is replaced by
Then similar to (52) to (54), we obtain the last iterate convergence result as
Thus by the definition of in Theorem 1, Proposition 1 is proved.
C.6 Proof of Lemma 1
Proof. By the definition of the Bregman divergence we have
So combining (57) and (58), it follows that
So if is monotone, then we have:
As Assumption 4 includes the strongly monotone assumption, by (60), we know that the VIP satisfies Assumption 4 with parameter
C.7 Proof of Corollary 1
Proof. By Theorem 1 and Lemma 1, if we optimize the regularized problem VIP by the ODE Algorithm 4, then after iterations, we have
where is defined in Theorem 1, A_{K-1}=\frac{1}{\epsilon}\Big{(}1+\frac{\sqrt{\alpha\gamma}\epsilon}{L}\Big{)}^{K-1}-\frac{1}{\epsilon}.
Meanwhile, by the convexity of we have
C.8 Proof of Corollary 2
Proof. By Proposition 1 and Lemma 1, if we optimize the regularized problem VIP by the OptDE Algorithm 4, then after iterations, we have
where is defined in Theorem 1, a_{K-1}=\frac{\alpha\gamma}{L}\Big{(}1+\frac{\alpha\gamma\sigma}{L}\Big{)}^{K-2}.
Meanwhile, by the convexity of we have
Appendix D Proof of Section B
By the definition of proximal operator (8), we can equivalently reformulate the stochastic optimistic dual extrapolation (SODE) of the main body as below. Then based on the definition of in Step 7 and the definition of the Bregman divergence , we can verify that
where is an arbitrary vector in and is irrelevant to the minimizer In our context, plays the role of a “generalized estimation sequence” to help us conduct convergence analysis. By the -strong convexity of the Bregman divergence , we know that is strongly convex with strong convexity parameter
Proof. Given the definition of the generalized estimation sequence in (66) and by the optimality condition of the minimizer in the Step 6 of Algorithm 5, we have:
where is by the optimality condition (67) and is by the convexity of and .
where is the -strong convexity of Meanwhile, by the -strong convexity of , we have
where is by the fact , the upper bound of in (68), is by the setting in Algorithm 5. Meanwhile, taking expectation on , we have:
So taking expectation on the randomness of all the history for (72), and using (73) and the definition of in Lemma 5, after simple arrangements, Lemma 5 is proved.
D.2 Proof of Lemma 6
Proof. By the definition of in Lemma 5, we have:
where is by the Cauchy-Schwarz inequality, is by the triangle inequality of the norm , is by the Lipschitz continuity of , is by the fact and is by the fact
Then by the optimality condition of in Algorithm 5, we have:
Combining (74), (75) and (76) with , we have
For both the settings and , by our setting, we have and , so we have
where is by the condition and Assumption 5. Lemma 6 is proved.
D.3 Proof of Lemma 7
It follows that:
where is by the fact when is by the optimality condition of .
is by the Cauchy Schwarz inequality and simple arrangement, and is by Assumption 1.
where is by the fact .
So taking expectation on , by Assumption 5, we have:
D.4 Proof of Theorem 2
Proof. Firstly, by the setting and we have
If then
If , then A_{k}=\Big{(}\frac{\alpha\gamma}{4L}\Big{)}^{2}\sigma(k+1)^{2}.
Then for both the setting ( Assumption 3 holds) and ( Assumption 4 holds), we have
So in Lemma 5, let , we have
where is by the -strong convexity Bregman divergence of , is by the Assumption 3 () or the Assumption 4 (), is by Lemma 5, and is by Lemma 6.
So taking expectation on all the history, we have
Then taking expectation on all the history, we have
where is by the Jensen inequality, is by the fact that and is by (87).
where is by Lemma 7, is by the triangle inequality of and Assumption 5, is by the triangle inequality of , is by (87) and (88).
For by (84) and (85), we have