Optimistic Dual Extrapolation for Coherent Non-monotone Variational Inequalities

Chaobing Song, Zhengyuan Zhou, Yichao Zhou, Yong Jiang, Yi Ma

Introduction

where w{\bm{w}}^{*} is called a strong solution of VIP(F,W)(F,{\mathcal{W}}). For the minimax problem

let WX×Y,w[xy],F(w)[xf(x,y)yf(x,y)]{\mathcal{W}}\equiv{\mathcal{X}}\times{\mathcal{Y}},{\bm{w}}\equiv\left[\begin{smallmatrix}{\bm{x}}\\ {\bm{y}}\end{smallmatrix}\right],F({\bm{w}})\equiv\left[\begin{smallmatrix}\nabla_{{\bm{x}}}f({\bm{x}},{\bm{y}})\\ -\nabla_{{\bm{y}}}f({\bm{x}},{\bm{y}})\end{smallmatrix}\right]. Then solving (1) is equivalent to finding a first-order Nash equilibrium of the minimax problem (2) .

The operator F(w)F({\bm{w}}) will be monotone if

VI with monotone operators has been well studied, which provides a concise and optimal framework for convex-concave minimax problems . For monotone VIP(F,W)(F,{\mathcal{W}}), it is well known that the strong solution satisfying (1) is also equivalent to the solution wW{\bm{w}}^{*}\in{\mathcal{W}} satisfying:

where w{\bm{w}}^{*} is called a weak solution of VIP(F,W)(F,{\mathcal{W}}). A classical result under the monotone and Lipschitz continuous assumptions is that the Mirror-Prox algorithm can converge to an ϵ\epsilon-accurate weak solution in terms of ergodic averaging in O(1/ϵ)O(1/\epsilon) iterations, which is optimal for first-order methods in solving monotone VIPs . Nemirovski’s Mirror-Prox is a non-Euclidean extension of the extragradient method from the perspective of mirror descent. Another important non-Euclidean extension is Nesterov’s dual extrapolation from the perspective of dual averaging, which also has the optimal O(1/ϵ)O(1/\epsilon) convergence rate. The main difference between mirror descent and dual averaging is the way of combining the constraint (or the regularization term if exists) into the projection (or the proximal) step .

Despite obtaining the optimal convergence rate, both Mirror-Prox and dual extrapolation are two-call extragradient methods that need to evaluate gradients twice per iteration. In some contexts such as training deep neural networks, evaluating gradients can be expensive. Thus it will have significant practical benefits if we only need one gradient evaluation per iteration and still maintain the same convergence rate. In terms of single-call methods for minimax problems, vanilla gradient descent ascent (and its mirror descent generalizations) might be a natural choice. Unfortunately, it is not guaranteed and it can diverge even in simple monotone settings . Consequently, after the (two-call) extragradient method , several single-call extragradient methods have been analyzed under the monotone setting and share the same convergence rates with Mirror-Prox and dual extrapolation . However, there is an increasing trend in applying these single-call extragradient methods to stabilize the training of generative adversarial networks (GAN) , which is nonconvex-nonconcave in general and hence has remained underexplored.

Nonconvex-Nonconcave Minimax Problems.

Despite the well-developed convergence theory for monotone VIPs and thus for convex-concave minimax problems, many minimax problems arising in modern machine learning are nevertheless nonconvex-nonconcave, such as GAN , adversarial training , gradient reversal for domain adaption , and multi-agent reinforcement learning . As a result, the corresponding VI is not monotone and the aforementioned theoretical guarantees for monotone VIPs no longer apply. First, for non-monotone VIPs, it is nontrivial to obtain the rate of convergence to a weak solution, thus one may explore the rate of convergence to a strong solution instead. Second, without the monotone property, the ergodic averaging technique will no longer have theoretical guarantees, thus we might need to choose the last iterate or best iterate. However, the classical convergence result said little about the rate of convergence to a weak solution or the convergence of last iterate or best iterate.Recently, shows the first tight last iterate result for general smooth convex-concave minimax problems with Lipschitz derivatives of operators.

To obtain theoretical guarantees beyond the monotone setting, a common approach is to relax the lower bound (3) in the monotone assumption. Along this research line, several more general assumptions have been proposed, such as the pseudo-monotone assumption and its variants , and the generalized monotone assumption . In the machine learning community, similar concepts have also been proposed, such as variational coherence . For simplicity, we coin the problem class along this research line as coherent non-monotone variational inequalities. Among them, is the first to provide explicit global convergence results such that the best iterate of the N-EG method can converge to an ϵ\epsilon-accurate strong solution in O(1/ϵ2)O(1/\epsilon^{2}) iterations under the generalized monotone and Lipschitz continuous assumptions. However, N-EG needs to evaluate gradient twice per iteration, which is less desirable when gradient evaluation is expensive. For the single-call extragradient method , under a second-order conditionAs we will see, it is a localized version of our assumption., very recently has provided local linear convergence results in certain non-monotone setting, while the constants in these results remain implicit. The following problem remains open: Can single-call extragradient methods have explicit global convergence results beyond the monotone setting?

Contributions of This Paper.

In this paper we develop an Optimistic Dual Extrapolation (OptDE) method that provably converges to a strong solution for coherent non-monotone VIPs. The OptDE method can be viewed as a single-call variant of Nesterov’s dual extrapolation that maintains its “anticipatory” properties. We characterize convergence rates of the best iterateFor given a number of iterations, the best iterate can be explicitly found and happen before the last iterate. of OptDE under two coherent non-monotone assumptions, where the merit function is given in Definition 1 and \|\cdot\| is the natural norm used in algorithms. As shown in Table 1, when the problem has a weak solution w{\bm{w}}^{*}, our method matches the best known rate O(1/ϵ2)O({1}/{\epsilon^{2}}) of N-EG . Further strengthening the assumption to that a σ\sigma-weak solution w{\bm{w}}^{*} exists with σ>0\sigma>0 – nevertheless a weaker condition than the strongly monotone assumption required in previous work, we are able to obtain a linear convergence rate of O(log1ϵ)O(\log\frac{1}{\epsilon}). For this setting, we can also use the distance w2\|\cdot-{\bm{w}}^{*}\|^{2} to measure the progress and obtain a linear convergence result; meanwhile, despite not shown in Table 1, we also obtain a linear convergence result of the last iterate. Our result shows that even under the two coherent non-monotone assumptions, the convergence rate of single-call extragradient methods can be comparable to that of the N-EG method with two gradient evaluations per iteration.

Our coherent non-monotone analysis for the setting that a σ\sigma-weak solution exists has two meaningful corollaries about best iterate and last iterate in the monotone setting, respectively: With a regularization trick, both the best iterate and last iterateHere the last iterate is not in the classical sense, which will be explained in Section 3. of OptDE can be an ϵ\epsilon-accurate solution in O(1ϵlog1ϵ)O(\frac{1}{\epsilon}\log\frac{1}{\epsilon}) number of iterations. To our knowledge, the near-optimal result O(1ϵlog1ϵ)O(\frac{1}{\epsilon}\log\frac{1}{\epsilon}) for attaining an ϵ\epsilon-accurate strong solution was only appeared in very recently with a two-loop Halpern iteration method, while our result is obtained by the simpler single-loop single-call OptDE method.

Meanwhile, we extend the OptDE algorithm to the stochastic setting as Stochastic OptDE (SOptDE) and show that our results in the deterministic setting can be naturally generalized to the stochastic setting. This allows us to characterize the stochastic oracle complexity (i.e., the number of stochastic oracles we access) of SOptDE under the coherent non-monotone assumptions. The results under the stochastic setting are summarized in Table 2.The results of the SEG , ESA algorithms are given under pseudomonotone and strongly pseudomonotone assumptions respectively, which are slightly stronger than our assumptions. As we see, the results match the best-known results of SEG The original result of SEG is given by “square natural residual”, which can be used to derive the strong solution guarantee in Table 2 (see the supplementary material for detail). and ESA respectively, while both SEG and ESA need two gradient evaluations per iteration. Meanwhile, under the assumption that a σ\sigma-weak solution exists, we obtain the first theoretical guarantee in terms of the merit function in Definition 1.

Last but not least, different from N-EG and ESA , the proposed OptDE and SOptDE algorithms only need the norm square 2\|\cdot\|^{2} being strongly convex but not necessarily globally Lipschitz continuous, which will be significant if \|\cdot\| is a non-Euclidean norm: 2\|\cdot\|^{2} can not be strongly convex and globally Lipschitz continuous simultaneously in general.

Technical Assumptions

To measure the accuracy of iterates to a strong solution, we consider the following “restricted strong merit function”.

With ϵ0\epsilon\to 0 and D+D\to+\infty, Definition 1 becomes the definition of the strong solution in (1). In the nonconvex-nonconcave minimax setting, Definition 1 has been proposed as the definition of the ϵ\epsilon-accurate first-order Nash equilibrium . If W{\mathcal{W}} is a bounded set, then we still have an effective measure even if D+D\rightarrow+\infty; if W{\mathcal{W}} is unbounded, then DD needs to be a finite positive parameter. To give a unified measure for both bounded and unbounded settings, we set DD to be a finite positive parameter.

Throughout this paper, we make the following standard Lipschitz continuous assumption.

For the VIP(F,W)(F,{\mathcal{W}}) in (1), w,vW,\forall{\bm{w}},{\bm{v}}\in{\mathcal{W}}, F(w)F(v)Lwv,\|F({\bm{w}})-F({\bm{v}})\|_{*}\leq L\|{\bm{w}}-{\bm{v}}\|, where L>0L>0 is the Lipschitz constant.

Meanwhile, we assume that the (possible non-Euclidean) norm \|\cdot\| satisfies Assumption 2.

12w2\frac{1}{2}\|{\bm{w}}\|^{2} is γ\gamma-strongly convex (0<γ10<\gamma\leq 1) with respect to (w.r.t.) \|\cdot\| and the dual norm of gradient 12w2\nabla\frac{1}{2}\|{\bm{w}}\|^{2} is bounded by δw(δ>0)\delta\|{\bm{w}}\|(\delta>0):

From , 12p2(1<p2)\frac{1}{2}\|\cdot\|_{p}^{2}(1<p\leq 2) is (p1)(p-1)-strongly convex w.r.t.w.r.t. p\|\cdot\|_{p}. Without loss of generality, in Assumption 2, we assume 0<γ1.0<\gamma\leq 1. For all the norm setting 12p2(1<p2)\frac{1}{2}\|\cdot\|_{p}^{2}(1<p\leq 2), we have δ=1.\delta=1.

For the norm \|\cdot\|, we define the prox-mapping as

and assume that it can be solved efficiently. Meanwhile, we also define the corresponding Bregman divergence of 122\frac{1}{2}\|\cdot\|^{2}: w,vW,\forall{\bm{w}},{\bm{v}}\in{\mathcal{W}},

Obviously we have Vv(w)γ2wv2.V_{{\bm{v}}}({\bm{w}})\geq\frac{\gamma}{2}\|{\bm{w}}-{\bm{v}}\|^{2}.

Then we make Assumptions 3 and 4 for the coherent non-monotone VIP(F,W)(F,{\mathcal{W}}) we study.

For the VIP(F,W)(F,{\mathcal{W}}) in (1), there exists a weak solution wW{\bm{w}}^{*}\in{\mathcal{W}} such that wW,\forall{\bm{w}}\in{\mathcal{W}}, F(w),ww0\langle F({\bm{w}}),{\bm{w}}-{\bm{w}}^{*}\rangle\geq 0.

For the VIP(F,W)(F,{\mathcal{W}}) in (1), given w0W,{\bm{w}}_{0}\in{\mathcal{W}}, there exists a σ\sigma-weak solution wW{\bm{w}}^{*}\in{\mathcal{W}} with parameter σ>0\sigma>0 such that wW,\forall{\bm{w}}\in{\mathcal{W}}, F(w),wwσγ(Vww0(ww0)+Vww0(ww0))\langle F({\bm{w}}),{\bm{w}}-{\bm{w}}^{*}\rangle\geq\frac{\sigma}{\gamma}(V_{{\bm{w}}-{\bm{w}}_{0}}({\bm{w}}^{*}-{\bm{w}}_{0})+V_{{\bm{w}}^{*}-{\bm{w}}_{0}}({\bm{w}}-{\bm{w}}_{0})).

Assumption 3 assumes the existence of weak solutions, which is also adopted in . Assumption 3 is slightly weaker than the variational coherence assumption or the generalized monotone assumption . Some nontrivial examples satisfying the generalized monotone assumption can be found in . The generalized monotone assumption is in turn weaker than the pseudo-monotone assumption , which is weaker than the monotone assumption (3).

Assumption 4 further assumes a stronger variant of Assumption 3, which is also called as strongly variational stability in . For the Euclidean setting where :=2\|\cdot\|:=\|\cdot\|_{2} and thus γ=1\gamma=1, the inequality is simplified to F(w),wwσww22\langle F({\bm{w}}),{\bm{w}}-{\bm{w}}^{*}\rangle\geq\sigma\|{\bm{w}}-{\bm{w}}^{*}\|_{2}^{2}. Assumption 4 is weaker than the strongly pseudo-monotone and strongly monotone assumptions, but as we will see, is already sufficient to ensure a linear convergence rate for our method.

Our main motivation in making Assumptions 3 and 4 is to prove explicit global convergence results for VIP(F,W)(F,{\mathcal{W}}) under conditions as weak as possible. However, the non-monotone subsets of Assumptions 3 and 4, a.k.a., pseudomonotone and strongly pseudomonotone respectively, also have many real applications in competitive exchange economy , fractional programming , and product pricing . Meanwhile, the restriction of Assumption 4 in minimization problems such as one-point convexity is also used in analyzing neural networks.

Optimistic Dual Extrapolation

In this section, we present the optimistic dual extrapolation (OptDE) algorithm for solving the VIP(F,W)(F,{\mathcal{W}}) in (1). The method is a single-call variant of Nesterov’s dual extrapolation . The overall algorithm is summarized as Algorithm 1. The algorithm works under either Assumption 3 by setting σ=0\sigma=0 or Assumption 4 with σ>0\sigma>0.

For Algorithm 1, we define two constants A0A_{0} and α\alpha in Step 2. Then we initialize three vectors w0,z0{\bm{w}}_{0},{\bm{z}}_{0} and g0{\bm{g}}_{0} in Step 3. In the main loop, we update the two positive numbers aka_{k} and AkA_{k} in Step 5. Then we perform an “extrapolation” step in Step 6 and then “dual averaging” steps in Steps 7 and 8. As we see, as Algorithm 1 only performs one new gradient evaluation in Step 88, it is “optimistic” hence the name “optimistic dual extrapolation”. Once Algorithm 1 runs KK iterations, we return the best iterate measured by the sum of residual norms wkzk1+wk1zk1\|{\bm{w}}_{k}-{\bm{z}}_{k-1}\|+\|{\bm{w}}_{k-1}-{\bm{z}}_{k-1}\|This return value is given according to our convergence analysis..

Compared with Nesterov’s dual extrapolation, the main difference is that the extrapolation Step 6 is a prox-mapping on F(wk1)F({\bm{w}}_{k-1}), not on F(zk1)F({\bm{z}}_{k-1}). Compared with past extra-gradient , the main difference is that we perform dual averaging by Steps 7 and 8, instead of a “mirror descent” step. Compared with N-EG which is claimed to be a non-Euclidean extragradient method , not only we perform just one gradient evaluation per iteration but also do not require 122\frac{1}{2}\|\cdot\|^{2} to have bounded Lipschitz continuous gradients, which is significant in the non-Euclidean setting since the norm square 12p2\frac{1}{2}\|\cdot\|_{p}^{2} for p(1,2)p\in(1,2) may not have globally bounded Lipschitz continuous gradients.

In the following, we assume w{\bm{w}}^{*} is a solution that satisfies Assumption 3 if σ=0\sigma=0 or satisfies Assumption 4 if σ>0\sigma>0.

with C_{0}=\big{(}1+\frac{\delta}{{\alpha\gamma}}\big{)}\sqrt{\frac{8\alpha}{\gamma}}, a1=αγLa_{1}=\frac{{\alpha\gamma}}{L}, and

Theorem 1 implies our main result in Table 1. As we see, for σ=0,\sigma=0, except for constants, our result is the same with the two-call extragradient method N-EG . However, to analyze single-call methods, particularly for the setting σ=0\sigma=0, the analysis is much more involved and leads to an interesting criterion of return value in Step 10 of Algorithm 1. For the setting σ>0,\sigma>0, then linear convergence rates can be obtained in terms of both restricted strong merit solution and solution distance. Meanwhile, for the setting σ>0\sigma>0, our result in terms of restricted strong merit solution (10) can not be implied by the result of the solution distance (12), while the reverse side is true. Furthermore, when σ>0\sigma>0, the result (10) is also used in deriving Corollary 1 for the monotone setting. Finally, to simplify our analysis, we did not yet optimize the constants in (10) and (12), which probably can be further improved.

In Theorem 1, we provide a unified result for the two settings σ=0\sigma=0 and σ>0\sigma>0 in terms of the best iterate. However, when σ>0,\sigma>0, we can also prove linear convergence rates in terms of last iterate, which is given in Proposition 1 below.

Let Assumptions 1 and 2 hold. For the setting σ>0\sigma>0 (i.e., Assumption 4 holds), K1,\forall K\geq 1, after KK iterations, Algorithm 1 returns a wK{{\bm{w}}}_{K} such that

with C0C_{0} defined in Theorem 1, a0=a1a_{0}=a_{1} and K1,\forall K\geq 1,

By Proposition 1, to prove the linear convergence of the last iterate, we do not need the strongly monotone assumption, but only Assumption 4. Despite the last iterate also has a linear convergence rate, it is slower than the rate of best iterate in Theorem 1. As we will see, Proposition 1 will also be used to prove the last iterate convergence for the monotone setting in a non-classical sense.

The motivation behind OptDE is that by generalizing Nesterov’s estimation sequence, we can perform a unified convergence analysis under Assumptions 3 and 4. However, as shown in , if a regularizer exists, the (regularized) dual averaging steps (Steps 7 and 8 of Algorithm 1) can help us better explore the structure of regularizers such as sparsity when it exists.

has given local convergence analysis in terms of solution distance by assuming that Assumption 4 holds in a neighbourhood of the optimal solution. The analysis in needs extra techniques, while the constants in the rates of are implicit. Our solution distance result in (12) can be viewed as a global and explicit version of by assuming Assumption 4 holds globally. Meanwhile, does not give any result under Assumption 3 or in terms of restricted strong solution under Assumption 4 whereas our analysis does.

Our results are mainly given under the coherent non-monotone Assumptions 3 and 4. As shown in Theorem 1, under Assumption 3 that includes the monotone assumption, we can obtain an ϵ\epsilon-accurate strong solution in O(ϵ2)O(\epsilon^{-2}) iterations. However, in the following we show that with a regularization trick, the rate can be much better in the monotone setting by using our results in Theorem 1 and Proposition 1.

First, to give our results in the monotone setting, we have Lemma 1.

If the VIP(F,W)(F,{\mathcal{W}}) is monotone, then the regularized problem VIP(F+ϵ12γw02,W)(F+\epsilon\nabla\frac{1}{2\gamma}\|\cdot-{\bm{w}}_{0}\|^{2},{\mathcal{W}}) satisfies Assumption 4 with σ=ϵ.\sigma=\epsilon.

Due to Lemma 1, we can apply Theorem 1 and Proposition 1 to the regularized problem VIP(F+ϵ12γw02,W)(F+\epsilon\nabla\frac{1}{2\gamma}\|\cdot-{\bm{w}}_{0}\|^{2},{\mathcal{W}}), and then obtain Corollaries 1 and 2 for the VIP(F,W)(F,{\mathcal{W}}), respectively.

Given w0W{\bm{w}}_{0}\in{\mathcal{W}}, let Assumptions 1 and 2 hold for the regularized problem VIP(F+ϵ12γw02,W)(F+\epsilon\nabla\frac{1}{2\gamma}\|\cdot-{\bm{w}}_{0}\|^{2},{\mathcal{W}}). By optimizing the regularized problem by Algorithm 4, then the best iterate returned by Algorithm 4 satisfies

Compared with Theorem 1 and Proposition 1, we need an extra condition ww0D\|{\bm{w}}-{\bm{w}}_{0}\|\leq D in Corollary 1, which can be satisfied by choosing a large enough DD. By Corollary 1, by choosing K=O\big{(}\frac{1}{\epsilon}\log\frac{1}{\epsilon}\big{)}, we will obtain an O(Dϵ)O(D\epsilon)-accurate solution. Note that DD does not appear in our algorithm and is not relevant to the choice of ϵ.\epsilon.

Given w0W{\bm{w}}_{0}\in{\mathcal{W}}, let Assumptions 1 and 2 hold for the regularized problem VIP(F+ϵ12γw02,W)(F+\epsilon\nabla\frac{1}{2\gamma}\|\cdot-{\bm{w}}_{0}\|^{2},{\mathcal{W}}). By optimizing the regularized problem by Algorithm 4, the last iterate of Algorithm 4 satisfies

Similar to Corollary 1 for best iterate, in Corollary 2, by choosing K=O\big{(}\frac{1}{\epsilon}\log\frac{1}{\epsilon}\big{)}, the last iterate will be an O(Dϵ)O(D\epsilon)-accurate strong solution, which is significantly better than the tight bound O(1/ϵ2)O(1/\epsilon^{2}) for last iterate . Nevertheless, it should be noted that Corollary 2 is in a non-classical sense: we do not guarantee last iterate convergence for all K1K\geq 1, but only after K=O\big{(}\frac{1}{\epsilon}\log\frac{1}{\epsilon}\big{)} with a prescribed accuracy parameter ϵ.\epsilon. Thus our result does not contradict with the lower bound of last iterate .

Meanwhile, our proof only relies on the regularized problem VIP(F+ϵ12γw02,W)(F+\epsilon\nabla\frac{1}{2\gamma}\|\cdot-{\bm{w}}_{0}\|^{2},{\mathcal{W}}) satisfying Assumption 4 with σ=ϵ,\sigma=\epsilon, which holds if the VIP(F,W)(F,{\mathcal{W}}) is monotone. However, it is not necessary for the VIP(F,W)(F,{\mathcal{W}}) to be monotone. For instance, if the VIP(F,W)(F,{\mathcal{W}}) satisfies Assumption 3 and w0=w,{\bm{w}}_{0}={\bm{w}}^{*}, then the VIP(F+ϵ12γw02,W)(F+\epsilon\nabla\frac{1}{2\gamma}\|\cdot-{\bm{w}}_{0}\|^{2},{\mathcal{W}}) also satisfies Assumption 4 with σ=ϵ.\sigma=\epsilon. Of course, letting w0=w{\bm{w}}_{0}={\bm{w}}^{*} is impractical and we leave the more general setting of w0{\bm{w}}_{0} under non-monotone settings for further research.

Recently, has proposed a different Halpern iteration method under the monotone and Lipschitz assumptions. The Halpern iteration method does not need to know the Lipschitz constant and thus is parameter-free, and also attains the O\big{(}\frac{1}{\epsilon}\log\frac{1}{\epsilon}\big{)} convergence rate. Nevertheless, there are two major differences: The Halpern iteration method has two-loop, while our OptDE method is a single-loop single-call method; now the Halpern iteration method is limited to the Euclidean setting, while ours can have theoretical guarantees in the non-Euclidean setting.

Stochastic Optimistic Dual Extrapolation

For σ>0,\sigma>0, (i.e.,i.e., Assumption 4 holds), we have

As show in (16), for the setting σ=0\sigma=0 (a.k.a., Assumption 3) even if the number of iterations KK\rightarrow\infty, the expected restricted strong merit function can only be upper bounded by O\big{(}\frac{s}{L}\big{)}. Thus to guarantee the convergence of SOptDE, the variance should be o(1)o(1), such as s^{2}=O\big{(}\frac{1}{K}\big{)}. In the Euclidean setting that :=2,\|\cdot\|:=\|\cdot\|_{2}, by the concentration inequality , to attain a variance of O\big{(}\frac{1}{K}\big{)}, we need O(K)O(K) samples. Thus combining the setting s2=O(1K)s^{2}=O(\frac{1}{K}) and the result in (16), it can be verified that the single-call SOptDE method needs O(1/ϵ4)O(1/\epsilon^{4}) number of samples to obtain an ϵ\epsilon-accurate solution in terms of the expected restricted strong merit function.

Under the stronger Assumption 4, our result is given in terms of the expected solution distance. As shown in (17), under Assumption 4, SOptDE can converge provably even when the variance s2s^{2} is constant. In fact, the O\big{(}\frac{1}{K}\big{)} is optimal and has been obtained by the two-call extragradient method ESA under the pseudomonotone assumption. Meanwhile, used the plain stochastic gradient descent algorithm and obtained the O\big{(}\frac{1}{K}\big{)} result for strongly monotone variational inequalities, which can also be extended to the setting that σ\sigma-weak solution exists.

With the aggressive parameter setting ak=αγ(1+σAk1)La_{k}=\frac{\alpha\gamma(1+\sigma A_{k-1})}{L} and a large batch size strategy, we also obtain the first convergence guarantee O(1/ϵ2log1ϵ)O(1/\epsilon^{2}\log\frac{1}{\epsilon}) in terms of restricted strong merit function as shown in Table 2 (see details in the supplementary material).

Concluding Remarks

In this paper, we proposed a single-call extragradient method optimistic dual extrapolation (OptDE) beyond the monotone setting and also extended it to the stochastic setting as stochastic optimistic dual extrapolation (SOptDE). We systematically proved the convergence results of OptDE and SOptDE under the Assumption 3 that a weak solution exists and Assumption 4 that a strongly weak solution exists. We also show beneficial implications of our analysis in both non-monotone and monotone settings. In the future, we will further study how the proposed new methods may lead to improved computational efficiency and performance guarantees in a wide range of machine learning problems such as the training of adversarial deep neural networks.

Broader Impact

In this paper, we discuss a systematic theoretical analysis for single-call extragradient methods, which has been widely used for modern machine learning applications. The theoretical results in this paper can bring in meaningful insight and understanding for practical algorithms.

Acknowledgement

Chaobing and Yi acknowledge support from Tsinghua-Berkeley Shenzhen Institute (TBSI) Research Fund. Yichao and Yi acknowledge funding from Sony Research. Yi acknowledges support from ONR grant N00014-20-1-2002 and the joint Simons Foundation-NSF DMS grant #2031899, as well as support from Berkeley AI Research (BAIR), Berkeley FHL Vive Center for Enhanced Reality, and Berkeley Center for Augmented Cognition.

References

Appendix A Convergence Analysis of Optimistic Dual Extrapolation

Based on the optimality condition of zk{\bm{z}}_{k} and Assumption 2, we have Lemma 2.

then u,w0W\forall{\bm{u}},{\bm{w}}_{0}\in{\mathcal{W}}, we have

In Lemma 2, the sequence {E1k}\{E_{1k}\} can be viewed as the errors we need to bound in each iteration. The upper bound of the sum of {E1k}\{E_{1k}\} is given in Lemma 3 below.

In Algorithm 1, k[K],\forall k\in[K], we have

where we define a0:=a1a_{0}:=a_{1} for convenience.

By Lemma 3, \forall\;0<\alpha\leq\min\Big{\{}\frac{1}{4\sqrt{2}},\frac{\sqrt{3}}{4\sqrt{\gamma}}\Big{\}} and k[K],k\in[K], k=1KE1k\sum_{k=1}^{K}E_{1k} is upper bounded by the sum of strictly negative terms about wkzk12+wk1zk12\|{\bm{w}}_{k}-{\bm{z}}_{k-1}\|^{2}+\|{\bm{w}}_{k-1}-{\bm{z}}_{k-1}\|^{2}, which makes it possible to give a upper bound about mink[K](wkzk1+wk1zk1)\min_{k\in[K]}(\|{\bm{w}}_{k}-{\bm{z}}_{k-1}\|+\|{\bm{w}}_{k-1}-{\bm{z}}_{k-1}\|). To show the guarantees by restricted strong merit function and the distance wkw\|{\bm{w}}_{k}-{\bm{w}}^{*}\|, we give Lemma 4.

In Algorithm 1, k[K],\forall k\in[K], we have,

Then combining Lemmas 2, 3 and 4, we obtain Theorem 1 in main body (see Section C.4 for the proof.).

Appendix B Convergence Analysis of Stochastic Optimistic Dual Extrapolation

We can extend the proof for the OptDE method in Section A to the stochastic setting for Lemmas 5, 6 and 7 and then obtain Theorems 2. First, we extend Lemma 2 into Lemma 5.

In Algorithm 5, k[K]\forall k\in[K], we have the following inequality: u,w0W\forall{\bm{u}},{\bm{w}}_{0}\in{\mathcal{W}}, let

Compared with the E1kE_{1k} of Lemma 2, E2kE_{2k} contains an extra term akF(wk)F(wk;ξk),wkua_{k}\langle F({\bm{w}}_{k})-F({\bm{w}}_{k};\xi_{k}),{\bm{w}}_{k}-{\bm{u}}\rangle. Then based on the definition of E2kE_{2k} and Assumption 5, we have Lemma 6.

In Algorithm 5, k[K]\forall k\in[K] and uW,\forall{\bm{u}}\in{\mathcal{W}}, we have

Lemma 6 extends Lemma 3 into the stochastic setting. Meanwhile, by the optimality condition of wk{\bm{w}}_{k}, and Assumptions 1, 2 and 5, we can extend Lemma 4 to Lemma 7.

In Algorithm 5, for the setting σ=0\sigma=0 and k[K],\forall k\in[K], we have,

Then combining Lemmas 5, 6 and 7, we obtain Theorem 2 for the SOptDE method in the main body (see Section D.4 for the proof).

It turns out that with the conservative setting ak=αγ1+σAk1La_{k}=\frac{\alpha\gamma\sqrt{1+\sigma A_{k-1}}}{L}, we can not obtain strong convergence results in terms of restricted strong merit function. To obtain the rate O(1ϵ2log1ϵ)O\left(\frac{1}{\epsilon^{2}}\log\frac{1}{\epsilon}\right), we need adopt the more aggressive setting ak=αγ(1+σAk1)La_{k}=\frac{\alpha\gamma{(1+\sigma A_{k-1}})}{L} with a large batch size strategy, which is given in Algorithm 3. With this setting, we have Proposition 2.

with C_{1}:=4\big{(}1+\frac{\delta}{{\alpha\gamma}}\big{)}\sqrt{\frac{\alpha}{\gamma}}, a1=αγLa_{1}=\frac{{\alpha\gamma}}{L}, and

The proof of Proposition 2 follows the same pipeline of proving Theorem 2, except that we use the setting ak=αγ(1+σAk1)La_{k}=\frac{\alpha\gamma({1+\sigma A_{k-1}})}{L} that is also used in Algorithm 4. We leave the proof of Proposition 2 as a simple exercise.

In Proposition 2, if we hope the variance of the stochastic estimation {F(wk;ξk)}\{F({\bm{w}}_{k};\xi_{k})\} as s2=O(1AK1+a1),s^{2}=O(\frac{1}{A_{K-1}+a_{1}}), then we need O(AK1+a1)O(A_{K-1}+a_{1}) stochastic samples per iteration. Meanwhile, to attain an expected ϵ\epsilon-accurate strong solution, we will need O(log1ϵ)O(\log\frac{1}{\epsilon}) number of iterations. Thus the total number of stochastic samples we need is O(1ϵ2log1ϵ).O(\frac{1}{\epsilon^{2}}\log\frac{1}{\epsilon}).

B.2 The “ (quadratic) natural residual function” [18] and restricted strong merit function

In our notation, for any wW,{\bm{w}}\in{\mathcal{W}}, the (quadratic) natural residual function in is defined by: given η>0,\eta>0,

which can be used to derive the restricted strong merit function as Proposition 3.

Let w:=Pw(ηF(w))=arg minzW{ηF(w),z+12γzw2}.{\bm{w}}^{\prime}:=P_{{\bm{w}}}(\eta F({\bm{w}}))=\operatorname*{arg\,min}_{{\bm{z}}\in{\mathcal{W}}}\{\langle\eta F({\bm{w}}),{\bm{z}}\rangle+\frac{1}{2\gamma}\|{\bm{z}}-{\bm{w}}\|^{2}\}. Then we have

where (a)(a) is by the optimality condition of w{\bm{w}}^{\prime}, (b)(b) is by the Cauchy-Schwarz inequality, (c)(c) is by the Lipschitz continuity of F(w)F({\mathbf{w}}) and the bounded assumption (7). So we have

Appendix C Proof of Section A

By the definition of proximal operator (8), we can equivalently reformulate the optimistic dual extrapolation (OptDE) algorithm in the main body as Algorithm 4. Then based on the definition of gk{\bm{g}}_{k} in Step 7 and the definition of the Bregman divergence Vw(u)(w,uW)V_{{\bm{w}}}({\bm{u}})({\bm{w}},{\bm{u}}\in{\mathcal{W}}), we can verify that

where u{\bm{u}} is an arbitrary vector in W{\mathcal{W}} and is irrelevant to the minimizer zk.{\bm{z}}_{k}. In our context, ψk(z)\psi_{k}({\bm{z}}) plays the role of a “generalized estimation sequence” to help us conduct convergence analysis. By the γ\gamma-strong convexity of the Bregman divergence Vwiw0(zw0)V_{{\bm{w}}_{i}-{\bm{w}}_{0}}({\bm{z}}-{\bm{w}}_{0}), we know that ψk(z)\psi_{k}({\bm{z}}) is strongly convex with strong convexity parameter 1+σi=1kai=1+σAk.1+\sigma\sum_{i=1}^{k}a_{i}=1+\sigma A_{k}.

Proof. Given the definition of the generalized estimation sequence ψk(z)\psi_{k}({\bm{z}}) in (31) and the minimizer zk{\bm{z}}_{k} in Algorithm 4, by the optimality condition of zk{\bm{z}}_{k}, we have: uW,\forall{\bm{u}}\in{\mathcal{W}},

where (a)(a) is by the optimality condition (32), and (b)(b) is by the convexity of both Vwiw0(uw0)V_{{\bm{w}}_{i}-{\bm{w}}_{0}}({\bm{u}}-{\bm{w}}_{0}) and 12γuw02.\frac{1}{2\gamma}\|{\bm{u}}-{\bm{w}}_{0}\|^{2}.

Meanwhile for k1k\geq 1, by the definition of ψk(zk)\psi_{k}({\bm{z}}_{k}) , we have

where (a)(a) is by the (1+σAk1)(1+\sigma A_{k-1})-strong convexity of ψk1(z).\psi_{k-1}({\bm{z}}). Meanwhile, by the strong convexity of 122\frac{1}{2}\|\cdot\|^{2} in Assumption 2, we have

Then combining (34) and (35), and after simple arrangements, we have

where (a)(a) is by the fact ψ0(z0)=0\psi_{0}({\bm{z}}_{0})=0 and the upper bound of ψK(zK)\psi_{K}({\bm{z}}_{K}) by (33). By the setting ak=αγ(1+σAk1)La_{k}=\frac{{\alpha\gamma}(1+\sigma A_{k-1})}{L} in Algorithm 4 and (37), we have

Then based on the definition of E1kE_{1k} in Lemma 2, after simple arrangements, Lemma 2 is proved.

C.2 Proof of Lemma 3

Proof. By the definition of E1kE_{1k} in Lemma 2, we have: k[K],\forall k\in[K],

where (a)(a) is by the Cauchy–Schwarz inequality, (b)(b) is the Lipschitz continuous Assumption 1, (c)(c) is by the fact aba2+b24,ab\leq a^{2}+\frac{b^{2}}{4}, (d)(d) is by the triangle inequality of norm \|\cdot\| and (e)(e) is by the fact (a+b)22(a2+b2).(a+b)^{2}\leq 2(a^{2}+b^{2}).

Then by the optimality condition of wk{\bm{w}}_{k} in the kk-th iteration of Algorithm 4, we have: zW,\forall{\bm{z}}\in{\mathcal{W}},

By combining (39), (40) and (41), we have

from Step 5 of Algorithm 4 and the setting

and (b)(b) is by the setting that 0<γ1.0<\gamma\leq 1.

With the w0=z0{\bm{w}}_{0}={\bm{z}}_{0}, for convenience, we set a0:=a1.a_{0}:=a_{1}. By summing (42) from k=1k=1 to KK, we have

where (a)(a) is by the fact w0=z0{\bm{w}}_{0}={\bm{z}}_{0}, and (b)(b) is by the fact that akak1>0a_{k}\geq a_{k-1}>0. Lemma 3 is proved.

C.3 Proof of Lemma 4

where (a)(a) is by the optimality condition of wk{\bm{w}}_{k}, (b)(b) is by the Cauchy-Schwarz inequality, (c)(c) is by the Lipschitz continuity of F(w)F({\mathbf{w}}) and the bounded assumption (7), (d)(d) is by the triangle inequality of norm .\|\cdot\|. So we have

Meanwhile, if there exists a w{\bm{w}}^{*} that satisfies Assumption 4, i.e.,i.e., wW\forall{\bm{w}}\in{\mathcal{W}}, F(w),wwσγ(Vww0(ww0)+Vww0(ww0))\langle F({\bm{w}}),{\bm{w}}-{\bm{w}}^{*}\rangle\geq\frac{\sigma}{\gamma}(V_{{\bm{w}}-{\bm{w}}_{0}}({\bm{w}}^{*}-{\bm{w}}_{0})+V_{{\bm{w}}^{*}-{\bm{w}}_{0}}({\bm{w}}-{\bm{w}}_{0})) with σ>0,\sigma>0, then in (45), let w:=w,{\bm{w}}:={\bm{w}}^{*}, and by the fact Vwkw0(ww0)γ2wkw2V_{{\bm{w}}_{k}-{\bm{w}}_{0}}({\bm{w}}^{*}-{\bm{w}}_{0})\geq\frac{\gamma}{2}\|{\bm{w}}_{k}-{\bm{w}}^{*}\|^{2} and Vww0(wkw0)γ2wkw2V_{{\bm{w}}^{*}-{\bm{w}}_{0}}({\bm{w}}_{k}-{\bm{w}}_{0})\geq\frac{\gamma}{2}\|{\bm{w}}_{k}-{\bm{w}}^{*}\|^{2}, we have

C.4 Proof of Theorem 1

Proof. Firstly, by the setting ak=αγ(1+σAk1)La_{k}=\frac{{\alpha\gamma}(1+\sigma A_{k-1})}{L} and A0=0,Ak=Ak1+ak,A_{0}=0,A_{k}=A_{k-1}+a_{k}, we have: k0,\forall k\geq 0,

If σ=0,\sigma=0, then Ak=αγkL.A_{k}=\frac{\alpha\gamma k}{L}.

If σ>0\sigma>0, then A_{k}=\frac{1}{\sigma}\Big{(}1+\frac{{\alpha\gamma}\sigma}{L}\Big{)}^{k}-\frac{1}{\sigma}.

Let w{\bm{w}} be the w{\bm{w}}^{*} in Assumption 3 if σ=0\sigma=0 or the w{\bm{w}}^{*} in Assumption 4 if σ>0\sigma>0. Then by the property of w{\bm{w}}^{*}, we have F(wk),wkwσγVwkw0(ww0)0.\langle F({\bm{w}}_{k}),{\bm{w}}_{k}-{\bm{w}}^{*}\rangle-\frac{\sigma}{\gamma}V_{{\bm{w}}_{k}-{\bm{w}}_{0}}({\bm{w}}^{*}-{\bm{w}}_{0})\geq 0. So by (49), it follows that

By the setting Ak=Ak1+akA_{k}=A_{k-1}+a_{k} with A0=0A_{0}=0 in Algorithm 4, we have Ak=i=1kai.A_{k}=\sum_{i=1}^{k}a_{i}. Meanwhile, for convenience, we have set a0=a1a_{0}=a_{1}. So we have

So by (20) of Lemma 4 and (52), it follows that

Similarly, if σ>0,\sigma>0, then by (21) of Lemma 4 and (52), we have

Then by defining C_{0}:=\Big{(}1+\frac{\delta}{{\alpha\gamma}}\Big{)}\sqrt{\frac{8\alpha}{\gamma}}, Theorem 1 is proved.

C.5 Proof of Proposition 1

Proof. The proof follows the same paradigm of Section C.4. Firstly, by the setting ak=αγ(1+σAk1)La_{k}=\frac{{\alpha\gamma}(1+\sigma A_{k-1})}{L} and A0=0,Ak=Ak1+ak,A_{0}=0,A_{k}=A_{k-1}+a_{k}, we have k0,\forall k\geq 0,

Then the (51) of Section C.4 is replaced by

Then similar to (52) to (54), we obtain the last iterate convergence result as

Thus by the definition of C0C_{0} in Theorem 1, Proposition 1 is proved.

C.6 Proof of Lemma 1

Proof. By the definition of the Bregman divergence Vw(v),V_{{\bm{w}}}({\bm{v}}), we have

So combining (57) and (58), it follows that

So if F(w)F({\bm{w}}) is monotone, then we have: w0,w,vW,\forall{\bm{w}}_{0},{\bm{w}},{\bm{v}}\in{\mathcal{W}},

As Assumption 4 includes the strongly monotone assumption, by (60), we know that the VIP(F+ϵ12γw02,W)(F+\epsilon\nabla\frac{1}{2\gamma}\|\cdot-{\bm{w}}_{0}\|^{2},{\mathcal{W}}) satisfies Assumption 4 with parameter σ=ϵ.\sigma=\epsilon.

C.7 Proof of Corollary 1

Proof. By Theorem 1 and Lemma 1, if we optimize the regularized problem VIP(F+ϵ12γw02,W)(F+\epsilon\nabla\frac{1}{2\gamma}\|\cdot-{\bm{w}}_{0}\|^{2},{\mathcal{W}}) by the ODE Algorithm 4, then after KK iterations, we have

where C0C_{0} is defined in Theorem 1, A_{K-1}=\frac{1}{\epsilon}\Big{(}1+\frac{\sqrt{\alpha\gamma}\epsilon}{L}\Big{)}^{K-1}-\frac{1}{\epsilon}.

Meanwhile, by the convexity of 12γww02,\frac{1}{2\gamma}\|{{\bm{w}}}-{\bm{w}}_{0}\|^{2}, we have

C.8 Proof of Corollary 2

Proof. By Proposition 1 and Lemma 1, if we optimize the regularized problem VIP(F+ϵ12γw02,W)(F+\epsilon\nabla\frac{1}{2\gamma}\|\cdot-{\bm{w}}_{0}\|^{2},{\mathcal{W}}) by the OptDE Algorithm 4, then after KK iterations, we have

where C0C_{0} is defined in Theorem 1, a_{K-1}=\frac{\alpha\gamma}{L}\Big{(}1+\frac{\alpha\gamma\sigma}{L}\Big{)}^{K-2}.

Meanwhile, by the convexity of 12γww02,\frac{1}{2\gamma}\|{{\bm{w}}}-{\bm{w}}_{0}\|^{2}, we have

Appendix D Proof of Section B

By the definition of proximal operator (8), we can equivalently reformulate the stochastic optimistic dual extrapolation (SODE) of the main body as below. Then based on the definition of gk{\bm{g}}_{k} in Step 7 and the definition of the Bregman divergence Vw(u)V_{{\bm{w}}}({\bm{u}}), we can verify that

where u{\bm{u}} is an arbitrary vector in W{\mathcal{W}} and is irrelevant to the minimizer zk.{\bm{z}}_{k}. In our context, ψ^k(z)\hat{\psi}_{k}({\bm{z}}) plays the role of a “generalized estimation sequence” to help us conduct convergence analysis. By the γ\gamma-strong convexity of the Bregman divergence Vwiw0(zw0)V_{{\bm{w}}_{i}-{\bm{w}}_{0}}({\bm{z}}-{\bm{w}}_{0}), we know that ψ^k(z)\hat{\psi}_{k}({\bm{z}}) is strongly convex with strong convexity parameter 1+σi=1kai=1+σAk.1+\sigma\sum_{i=1}^{k}a_{i}=1+\sigma A_{k}.

Proof. Given the definition of the generalized estimation sequence ψ^k(z)\hat{\psi}_{k}({\bm{z}}) in (66) and by the optimality condition of the minimizer zk{\bm{z}}_{k} in the Step 6 of Algorithm 5, we have: uW,\forall{\bm{u}}\in{\mathcal{W}},

where (a)(a) is by the optimality condition (67) and (b)(b) is by the convexity of Vwiw0(uw0)V_{{\bm{w}}_{i}-{\bm{w}}_{0}}({\bm{u}}-{\bm{w}}_{0}) and 12γuw02\frac{1}{2\gamma}\|{\bm{u}}-{\bm{w}}_{0}\|^{2}.

where (a)(a) is the (1+σAk1)(1+\sigma A_{k-1})-strong convexity of ψ^k1(z).\hat{\psi}_{k-1}({\bm{z}}). Meanwhile, by the γ\gamma-strong convexity of 122\frac{1}{2}\|\cdot\|^{2}, we have

where (a)(a) is by the fact ψ^0(z0)=0\hat{\psi}_{0}({\bm{z}}_{0})=0, the upper bound of ψ^K(zK)\hat{\psi}_{K}({\bm{z}}_{K}) in (68), (b)(b) is by the setting ak2=(αγ)2(1+σAk1)L2a_{k}^{2}=\frac{{(\alpha\gamma)^{2}}(1+\sigma A_{k-1})}{L^{2}} in Algorithm 5. Meanwhile, taking expectation on ξk\xi_{k}, we have: uW,\forall{\bm{u}}\in{\mathcal{W}},

So taking expectation on the randomness of all the history for (72), and using (73) and the definition of {E2k}\{E_{2k}\} in Lemma 5, after simple arrangements, Lemma 5 is proved.

D.2 Proof of Lemma 6

Proof. By the definition of E2kE_{2k} in Lemma 5, we have: k[K],\forall k\in[K],

where (a)(a) is by the Cauchy-Schwarz inequality, (b)(b) is by the triangle inequality of the norm \|\cdot\|_{*}, (c)(c) is by the Lipschitz continuity of F(w)F({\bm{w}}), (d)(d) is by the fact aba2+b24,ab\leq a^{2}+\frac{b^{2}}{4}, (e),(f)(e),(f) and (g)(g) is by the fact (a+b)22(a2+b2).(a+b)^{2}\leq 2(a^{2}+b^{2}).

Then by the optimality condition of wk{\bm{w}}_{k} in Algorithm 5, we have: zW,\forall{\bm{z}}\in{\mathcal{W}},

Combining (74), (75) and (76) with z:=zk{\bm{z}}:={\bm{z}}_{k}, we have

For both the settings σ=0\sigma=0 and σ>0\sigma>0, by our setting, we have aka1=αγLa_{k}\geq a_{1}=\frac{{\alpha\gamma}}{L} and α=min{γ32,116}\alpha=\min\{\frac{\gamma}{32},\frac{1}{16}\}, so we have

where (a)(a) is by the condition w0=z0{\bm{w}}_{0}={\bm{z}}_{0} and Assumption 5. Lemma 6 is proved.

D.3 Proof of Lemma 7

It follows that: wW\forall{\bm{w}}\in{\mathcal{W}}

where (a)(a) is by the fact ak=αγLa_{k}=\frac{\alpha\gamma}{L} when σ=0,\sigma=0, (b)(b) is by the optimality condition of wk{\bm{w}}_{k}.

(a)(a) is by the Cauchy Schwarz inequality and simple arrangement, and (b)(b) is by Assumption 1.

where (a)(a) is by the fact aba22+b22ab\leq\frac{a^{2}}{2}+\frac{b^{2}}{2}.

So taking expectation on ξk1\xi_{k-1}, by Assumption 5, we have: wW\forall{\bm{w}}\in{\mathcal{W}}

D.4 Proof of Theorem 2

Proof. Firstly, by the setting ak=αγ1+σAk1La_{k}=\frac{{\alpha\gamma}\sqrt{1+\sigma A_{k-1}}}{L} and A0=0,Ak=Ak1+ak,A_{0}=0,A_{k}=A_{k-1}+a_{k}, we have

If σ=0,\sigma=0, then Ak=αγkL.A_{k}=\frac{{\alpha\gamma}k}{L}.

If σ>0\sigma>0, then A_{k}=\Big{(}\frac{\alpha\gamma}{4L}\Big{)}^{2}\sigma(k+1)^{2}.

Then for both the setting σ=0\sigma=0 (i.e.,i.e., Assumption 3 holds) and σ>0\sigma>0 (i.e.,i.e., Assumption 4 holds), we have

So in Lemma 5, let u=w{\bm{u}}={\bm{w}}^{*}, we have

where (a)(a) is by the γ\gamma-strong convexity Bregman divergence of Vww0(wkw0)V_{{\bm{w}}^{*}-{\bm{w}}_{0}}({\bm{w}}_{k}-{\bm{w}}_{0}), (b)(b) is by the Assumption 3 (σ=0\sigma=0) or the Assumption 4 (σ>0\sigma>0), (c)(c) is by Lemma 5, and (d)(d) is by Lemma 6.

So taking expectation on all the history, we have

Then taking expectation on all the history, we have

where (a)(a) is by the Jensen inequality, (b)(b) is by the fact that (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}) and (c)(c) is by (87).

where (a)(a) is by Lemma 7, (b)(b) is by the triangle inequality of \|\cdot\| and Assumption 5, (c)(c) is by the triangle inequality of \|\cdot\|, (d)(d) is by (87) and (88).

For σ>0,\sigma>0, by (84) and (85), we have