Optimistic Dual Extrapolation for Coherent Non-monotone Variational Inequalities

Chaobing Song, Zhengyuan Zhou, Yichao Zhou, Yong Jiang, Yi Ma

Introduction

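The variational inequality problem VIP $(F,{\mathcal{W}})$ is to find ${\bm{w}}^{*}\in{\mathcal{W}}$ such that

$$\langle F({\bm{w}}^{*}),{\bm{w}}-{\bm{w}}^{*}\rangle\geq 0,\quad\forall{\bm{w}}\in{\mathcal{W}},\qquad(1)$$

where ${\bm{w}}^{*}$ is called a strong solution of VIP $(F,{\mathcal{W}})$. For the minimax problem

$$\min_{{\bm{x}}\in{\mathcal{X}}}\max_{{\bm{y}}\in{\mathcal{Y}}}f({\bm{x}},{\bm{y}}),\qquad(2)$$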

let ${\mathcal{W}}\equiv{\mathcal{X}}\times{\mathcal{Y}},{\bm{w}}\equiv\left[\begin{smallmatrix}{\bm{x}}\\ {\bm{y}}\end{smallmatrix}\right],F({\bm{w}})\equiv\left[\begin{smallmatrix}\nabla_{{\bm{x}}}f({\bm{x}},{\bm{y}})\\ -\nabla_{{\bm{y}}}f({\bm{x}},{\bm{y}})\end{smallmatrix}\right]$ . Then solving (1) is equivalent to finding a first-order Nash equilibrium of the minimax problem (2) .
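As a small illustration of this mapping (the function $f(x,y)=xy$ with ${\mathcal{X}}={\mathcal{Y}}=\mathbb{R}$ is our own choice of example, and the strong-solution condition used is the standard one, $\langle F({\bm{w}}^{*}),{\bm{w}}-{\bm{w}}^{*}\rangle\geq 0$ for all ${\bm{w}}\in{\mathcal{W}}$):

$$F({\bm{w}})=\begin{bmatrix}\partial_{x}f(x,y)\\ -\partial_{y}f(x,y)\end{bmatrix}=\begin{bmatrix}y\\ -x\end{bmatrix},\qquad F({\bm{w}}^{*})={\bm{0}}\ \text{at}\ {\bm{w}}^{*}=(0,0),\qquad\langle F({\bm{w}}^{*}),{\bm{w}}-{\bm{w}}^{*}\rangle=0\geq 0\quad\forall{\bm{w}}\in\mathbb{R}^{2}.$$

So the origin satisfies the strong-solution condition, and it is exactly the point where both partial derivatives of $f$ vanish, i.e., the first-order Nash equilibrium of $\min_{x}\max_{y}xy$.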

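The operator $F$ is monotone if

$$\langle F({\bm{w}})-F({\bm{v}}),{\bm{w}}-{\bm{v}}\rangle\geq 0,\quad\forall{\bm{w}},{\bm{v}}\in{\mathcal{W}}.\qquad(3)$$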

VIs with monotone operators have been well studied and provide a concise and optimal framework for convex-concave minimax problems. For a monotone VIP $(F,{\mathcal{W}})$, it is well known that a strong solution satisfying (1) is equivalent to a solution ${\bm{w}}^{*}\in{\mathcal{W}}$ satisfying:

$$\langle F({\bm{w}}),{\bm{w}}-{\bm{w}}^{*}\rangle\geq 0,\quad\forall{\bm{w}}\in{\mathcal{W}},\qquad(4)$$

where ${\bm{w}}^{*}$ is called a weak solution of VIP $(F,{\mathcal{W}})$. A classical result under the monotone and Lipschitz continuous assumptions is that the Mirror-Prox algorithm converges to an $\epsilon$-accurate weak solution, in terms of ergodic averaging, in $O(1/\epsilon)$ iterations, which is optimal for first-order methods in solving monotone VIPs. Nemirovski’s Mirror-Prox is a non-Euclidean extension of the extragradient method from the perspective of mirror descent. Another important non-Euclidean extension is Nesterov’s dual extrapolation from the perspective of dual averaging, which also has the optimal $O(1/\epsilon)$ convergence rate. The main difference between mirror descent and dual averaging is the way the constraint (or the regularization term, if one exists) is combined into the projection (or proximal) step.
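To make the difference between the two projection styles concrete, here is a minimal Euclidean sketch. The box constraint, the step size, and the two-element gradient sequence are illustrative assumptions of ours and are not taken from the paper: mirror (projected) descent projects the current iterate after every step, whereas dual averaging projects the starting point shifted by the accumulated gradients.

```python
import numpy as np

def project_box(v, lo=0.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^d (illustrative constraint set)."""
    return np.clip(v, lo, hi)

def mirror_descent(x0, grads, step=1.0):
    """Projected descent: move from the current iterate, then project."""
    x = np.asarray(x0, dtype=float)
    for g in grads:
        x = project_box(x - step * np.asarray(g, dtype=float))
    return x

def dual_averaging(x0, grads, step=1.0):
    """Dual averaging: accumulate gradients, then project x0 minus the running sum."""
    x0 = np.asarray(x0, dtype=float)
    g_sum = np.zeros_like(x0)
    x = x0.copy()
    for g in grads:
        g_sum += step * np.asarray(g, dtype=float)
        x = project_box(x0 - g_sum)
    return x

# Hypothetical gradient sequence chosen so that the constraint becomes active.
grads = [np.array([2.0]), np.array([-1.0])]
x_md = mirror_descent([0.5], grads)
x_da = dual_averaging([0.5], grads)
```

In this toy run the two schemes end at different points: once mirror descent hits the boundary it forgets how far the unconstrained step overshot, while dual averaging keeps the full gradient sum.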

Despite attaining the optimal convergence rate, both Mirror-Prox and dual extrapolation are two-call extragradient methods that need to evaluate gradients twice per iteration. In some contexts, such as training deep neural networks, evaluating gradients can be expensive. Thus there would be significant practical benefit if only one gradient evaluation per iteration were needed while the same convergence rate was maintained. In terms of single-call methods for minimax problems, vanilla gradient descent ascent (and its mirror descent generalizations) might be a natural choice. Unfortunately, its convergence is not guaranteed, and it can diverge even in simple monotone settings. Consequently, after the (two-call) extragradient method, several single-call extragradient methods have been analyzed under the monotone setting and share the same convergence rates as Mirror-Prox and dual extrapolation. Meanwhile, there is an increasing trend of applying these single-call extragradient methods to stabilize the training of generative adversarial networks (GANs), which is nonconvex-nonconcave in general, and the behavior of these methods in that setting has remained underexplored.
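The divergence of gradient descent ascent and the stabilizing effect of the extra gradient call can be seen on the bilinear problem $\min_{x}\max_{y}xy$. The sketch below is a Euclidean, unconstrained illustration; the test problem, the step size 0.17, and the iteration count are our own choices, not values from the paper.

```python
import numpy as np

def F(w):
    """VI operator of f(x, y) = x * y: F(w) = [df/dx, -df/dy] = [y, -x]."""
    x, y = w
    return np.array([y, -x])

def gda(w0, step, iters):
    """Gradient descent ascent: one operator evaluation per iteration."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w - step * F(w)
    return w

def extragradient(w0, step, iters):
    """Two-call extragradient: extrapolate, then update with the gradient at the extrapolated point."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w_half = w - step * F(w)
        w = w - step * F(w_half)
    return w

w0 = [1.0, 1.0]
gda_norm = np.linalg.norm(gda(w0, 0.17, 2000))
eg_norm = np.linalg.norm(extragradient(w0, 0.17, 2000))
```

On this problem each gradient descent ascent step multiplies the distance to the solution $(0,0)$ by $\sqrt{1+a^{2}}>1$ for step size $a$, whereas each extragradient step multiplies it by $\sqrt{1-a^{2}+a^{4}}<1$.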

Nonconvex-Nonconcave Minimax Problems.

Despite the well-developed convergence theory for monotone VIPs and thus for convex-concave minimax problems, many minimax problems arising in modern machine learning are nevertheless nonconvex-nonconcave, such as GANs, adversarial training, gradient reversal for domain adaptation, and multi-agent reinforcement learning. As a result, the corresponding VI is not monotone and the aforementioned theoretical guarantees for monotone VIPs no longer apply. First, for non-monotone VIPs, it is nontrivial to obtain the rate of convergence to a weak solution, so one may explore the rate of convergence to a strong solution instead. Second, without the monotone property, the ergodic averaging technique no longer has theoretical guarantees, so we might need to choose the last iterate or the best iterate. However, the classical convergence results say little about the rate of convergence to a strong solution or about the convergence of the last iterate or the best iterate. Recently, the first tight last-iterate result was shown for general smooth convex-concave minimax problems with Lipschitz derivatives of operators.

To obtain theoretical guarantees beyond the monotone setting, a common approach is to relax the lower bound (3) in the monotone assumption. Along this research line, several more general assumptions have been proposed, such as the pseudo-monotone assumption and its variants, and the generalized monotone assumption. In the machine learning community, similar concepts have also been proposed, such as variational coherence. For simplicity, we refer to the problem class along this research line as coherent non-monotone variational inequalities. Among these works, the first explicit global convergence result shows that the best iterate of the N-EG method converges to an $\epsilon$-accurate strong solution in $O(1/\epsilon^{2})$ iterations under the generalized monotone and Lipschitz continuous assumptions. However, N-EG needs to evaluate the gradient twice per iteration, which is less desirable when gradient evaluation is expensive. For the single-call extragradient method, local linear convergence results have very recently been provided in certain non-monotone settings under a second-order condition (as we will see, a localized version of our assumption), while the constants in these results remain implicit. The following problem remains open: Can single-call extragradient methods have explicit global convergence results beyond the monotone setting?

Contributions of This Paper.

In this paper we develop an Optimistic Dual Extrapolation (OptDE) method that provably converges to a strong solution for coherent non-monotone VIPs. The OptDE method can be viewed as a single-call variant of Nesterov’s dual extrapolation that maintains its “anticipatory” properties. We characterize convergence rates of the best iterate of OptDE (for a given number of iterations, the best iterate can be found explicitly and may occur before the last iterate) under two coherent non-monotone assumptions, where the merit function is given in Definition 1 and $\|\cdot\|$ is the natural norm used in the algorithms. As shown in Table 1, when the problem has a weak solution ${\bm{w}}^{*}$, our method matches the best known rate $O({1}/{\epsilon^{2}})$ of N-EG. Strengthening the assumption to the existence of a $\sigma$-weak solution ${\bm{w}}^{*}$ with $\sigma>0$ (still a weaker condition than the strongly monotone assumption required in previous work), we obtain a linear convergence rate of $O(\log\frac{1}{\epsilon})$. For this setting, we can also use the distance $\|\cdot-{\bm{w}}^{*}\|^{2}$ to measure the progress and obtain a linear convergence result; meanwhile, although it is not shown in Table 1, we also obtain a linear convergence result for the last iterate. Our result shows that even under the two coherent non-monotone assumptions, the convergence rate of single-call extragradient methods can be comparable to that of the N-EG method with two gradient evaluations per iteration.

Our coherent non-monotone analysis for the setting where a $\sigma$-weak solution exists has two meaningful corollaries about the best iterate and the last iterate in the monotone setting, respectively: with a regularization trick, both the best iterate and the last iterate of OptDE (here the last iterate is not in the classical sense, as explained in Section 3) can be an $\epsilon$-accurate solution in $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$ iterations. To our knowledge, the near-optimal result $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$ for attaining an $\epsilon$-accurate strong solution appeared only very recently, with a two-loop Halpern iteration method, while our result is obtained by the simpler single-loop single-call OptDE method.

Meanwhile, we extend the OptDE algorithm to the stochastic setting as Stochastic OptDE (SOptDE) and show that our results in the deterministic setting can be naturally generalized to the stochastic setting. This allows us to characterize the stochastic oracle complexity (i.e., the number of stochastic oracle calls) of SOptDE under the coherent non-monotone assumptions. The results under the stochastic setting are summarized in Table 2. (The results of the SEG and ESA algorithms are given under pseudomonotone and strongly pseudomonotone assumptions respectively, which are slightly stronger than our assumptions.) As we see, our results match the best-known results of SEG and ESA respectively, while both SEG and ESA need two gradient evaluations per iteration. (The original result of SEG is given in terms of the “square natural residual”, which can be used to derive the strong solution guarantee in Table 2; see the supplementary material for details.) Meanwhile, under the assumption that a $\sigma$-weak solution exists, we obtain the first theoretical guarantee in terms of the merit function in Definition 1.

Last but not least, unlike N-EG and ESA, the proposed OptDE and SOptDE algorithms only need the squared norm $\|\cdot\|^{2}$ to be strongly convex, and not necessarily to have a globally Lipschitz continuous gradient. This is significant if $\|\cdot\|$ is a non-Euclidean norm: in general, $\|\cdot\|^{2}$ cannot be strongly convex and have a globally Lipschitz continuous gradient simultaneously.

Technical Assumptions

To measure the accuracy of iterates to a strong solution, we consider the following “restricted strong merit function”.

With $\epsilon\to 0$ and $D\to+\infty$ , Definition 1 becomes the definition of the strong solution in (1). In the nonconvex-nonconcave minimax setting, Definition 1 has been proposed as the definition of the $\epsilon$ -accurate first-order Nash equilibrium . If ${\mathcal{W}}$ is a bounded set, then we still have an effective measure even if $D\rightarrow+\infty$ ; if ${\mathcal{W}}$ is unbounded, then $D$ needs to be a finite positive parameter. To give a unified measure for both bounded and unbounded settings, we set $D$ to be a finite positive parameter.

Throughout this paper, we make the following standard Lipschitz continuous assumption.

For the VIP $(F,{\mathcal{W}})$ in (1), $\forall{\bm{w}},{\bm{v}}\in{\mathcal{W}},$ $\|F({\bm{w}})-F({\bm{v}})\|_{*}\leq L\|{\bm{w}}-{\bm{v}}\|,$ where $L>0$ is the Lipschitz constant.

Meanwhile, we assume that the (possibly non-Euclidean) norm $\|\cdot\|$ satisfies Assumption 2.

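$\frac{1}{2}\|{\bm{w}}\|^{2}$ is $\gamma$-strongly convex ($0<\gamma\leq 1$) with respect to (w.r.t.) $\|\cdot\|$, and the dual norm of the gradient $\nabla\frac{1}{2}\|{\bm{w}}\|^{2}$ is bounded by $\delta\|{\bm{w}}\|$ ($\delta>0$): for all ${\bm{w}},{\bm{v}}$,

$$\frac{1}{2}\|{\bm{w}}\|^{2}\geq\frac{1}{2}\|{\bm{v}}\|^{2}+\Big\langle\nabla\frac{1}{2}\|{\bm{v}}\|^{2},{\bm{w}}-{\bm{v}}\Big\rangle+\frac{\gamma}{2}\|{\bm{w}}-{\bm{v}}\|^{2},\qquad(6)$$

$$\Big\|\nabla\frac{1}{2}\|{\bm{w}}\|^{2}\Big\|_{*}\leq\delta\|{\bm{w}}\|.\qquad(7)$$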

It is known that $\frac{1}{2}\|\cdot\|_{p}^{2}$ ($1<p\leq 2$) is $(p-1)$-strongly convex w.r.t. $\|\cdot\|_{p}$. Without loss of generality, in Assumption 2, we assume $0<\gamma\leq 1$. For every norm setting $\frac{1}{2}\|\cdot\|_{p}^{2}$ ($1<p\leq 2$), we have $\delta=1$.
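The claim $\delta=1$ for $p$-norms can be checked numerically. The sketch below uses the closed-form gradient of $\frac{1}{2}\|{\bm{w}}\|_{p}^{2}$, namely $\|{\bm{w}}\|_{p}^{2-p}\,\mathrm{sign}(w_{i})|w_{i}|^{p-1}$ componentwise, and measures it in the dual norm $\|\cdot\|_{q}$ with $1/p+1/q=1$. The function name, the choice $p=1.5$, and the random test points are ours.

```python
import numpy as np

def half_sq_pnorm_grad(w, p):
    """Gradient of 0.5 * ||w||_p^2 for 1 < p <= 2 (zero at the origin)."""
    norm = np.linalg.norm(w, ord=p)
    if norm == 0.0:
        return np.zeros_like(w)
    return norm ** (2.0 - p) * np.sign(w) * np.abs(w) ** (p - 1.0)

rng = np.random.default_rng(0)
p = 1.5
q = p / (p - 1.0)  # dual exponent, 1/p + 1/q = 1

# Ratio ||grad||_q / ||w||_p; Assumption 2 asks that this be at most delta.
ratios = []
for _ in range(100):
    w = rng.normal(size=5)
    grad = half_sq_pnorm_grad(w, p)
    ratios.append(np.linalg.norm(grad, ord=q) / np.linalg.norm(w, ord=p))
```

The ratio equals one up to rounding because $\sum_{i}|w_{i}|^{(p-1)q}=\|{\bm{w}}\|_{p}^{p}$, so the dual norm of the gradient is exactly $\|{\bm{w}}\|_{p}$.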

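For the norm $\|\cdot\|$, we define the prox-mapping as

$$P_{{\bm{w}}}({\bm{g}}):=\operatorname*{arg\,min}_{{\bm{z}}\in{\mathcal{W}}}\Big\{\langle{\bm{g}},{\bm{z}}\rangle+\frac{1}{2\gamma}\|{\bm{z}}-{\bm{w}}\|^{2}\Big\},\qquad(8)$$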

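and assume that it can be solved efficiently. Meanwhile, we also define the corresponding Bregman divergence of $\frac{1}{2}\|\cdot\|^{2}$: $\forall{\bm{w}},{\bm{v}}\in{\mathcal{W}},$

$$V_{{\bm{v}}}({\bm{w}}):=\frac{1}{2}\|{\bm{w}}\|^{2}-\frac{1}{2}\|{\bm{v}}\|^{2}-\Big\langle\nabla\frac{1}{2}\|{\bm{v}}\|^{2},{\bm{w}}-{\bm{v}}\Big\rangle.\qquad(9)$$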

Obviously we have $V_{{\bm{v}}}({\bm{w}})\geq\frac{\gamma}{2}\|{\bm{w}}-{\bm{v}}\|^{2}.$
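The lower bound on the Bregman divergence can be verified numerically for a $p$-norm. This sketch assumes the standard Bregman divergence $V_{{\bm{v}}}({\bm{w}})=h({\bm{w}})-h({\bm{v}})-\langle\nabla h({\bm{v}}),{\bm{w}}-{\bm{v}}\rangle$ with $h=\frac{1}{2}\|\cdot\|_{p}^{2}$ and $\gamma=p-1$; the helper names and test points are ours.

```python
import numpy as np

def h(w, p):
    """Distance-generating function h(w) = 0.5 * ||w||_p^2."""
    return 0.5 * np.linalg.norm(w, ord=p) ** 2

def grad_h(w, p):
    """Gradient of h for 1 < p <= 2 (zero at the origin)."""
    norm = np.linalg.norm(w, ord=p)
    if norm == 0.0:
        return np.zeros_like(w)
    return norm ** (2.0 - p) * np.sign(w) * np.abs(w) ** (p - 1.0)

def bregman(v, w, p):
    """V_v(w) = h(w) - h(v) - <grad h(v), w - v>."""
    return h(w, p) - h(v, p) - grad_h(v, p) @ (w - v)

rng = np.random.default_rng(1)
p = 1.5
gamma = p - 1.0

# Gap V_v(w) - (gamma / 2) * ||w - v||_p^2, which should be nonnegative.
gaps = []
for _ in range(200):
    v, w = rng.normal(size=4), rng.normal(size=4)
    gaps.append(bregman(v, w, p) - 0.5 * gamma * np.linalg.norm(w - v, ord=p) ** 2)
min_gap = min(gaps)

# Euclidean sanity check: for p = 2 the divergence is exactly 0.5 * ||w - v||_2^2.
v, w = rng.normal(size=4), rng.normal(size=4)
euclid_err = abs(bregman(v, w, 2.0) - 0.5 * np.linalg.norm(w - v) ** 2)
```

A random check of this kind is only a sanity test, not a proof; the inequality itself follows from the $\gamma$-strong convexity in Assumption 2.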

Then we make Assumptions 3 and 4 for the coherent non-monotone VIP $(F,{\mathcal{W}})$ we study.

For the VIP $(F,{\mathcal{W}})$ in (1), there exists a weak solution ${\bm{w}}^{*}\in{\mathcal{W}}$ such that $\forall{\bm{w}}\in{\mathcal{W}},$ $\langle F({\bm{w}}),{\bm{w}}-{\bm{w}}^{*}\rangle\geq 0$ .

For the VIP $(F,{\mathcal{W}})$ in (1), given ${\bm{w}}_{0}\in{\mathcal{W}},$ there exists a $\sigma$ -weak solution ${\bm{w}}^{*}\in{\mathcal{W}}$ with parameter $\sigma>0$ such that $\forall{\bm{w}}\in{\mathcal{W}},$ $\langle F({\bm{w}}),{\bm{w}}-{\bm{w}}^{*}\rangle\geq\frac{\sigma}{\gamma}(V_{{\bm{w}}-{\bm{w}}_{0}}({\bm{w}}^{*}-{\bm{w}}_{0})+V_{{\bm{w}}^{*}-{\bm{w}}_{0}}({\bm{w}}-{\bm{w}}_{0}))$ .

Assumption 3 assumes the existence of weak solutions, which is also adopted in . Assumption 3 is slightly weaker than the variational coherence assumption or the generalized monotone assumption . Some nontrivial examples satisfying the generalized monotone assumption can be found in . The generalized monotone assumption is in turn weaker than the pseudo-monotone assumption , which is weaker than the monotone assumption (3).

Assumption 4 further assumes a stronger variant of Assumption 3, which is also called strongly variational stability. For the Euclidean setting where $\|\cdot\|:=\|\cdot\|_{2}$ and thus $\gamma=1$, the inequality simplifies to $\langle F({\bm{w}}),{\bm{w}}-{\bm{w}}^{*}\rangle\geq\sigma\|{\bm{w}}-{\bm{w}}^{*}\|_{2}^{2}$. Assumption 4 is weaker than the strongly pseudo-monotone and strongly monotone assumptions but, as we will see, is already sufficient to ensure a linear convergence rate for our method.
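The Euclidean simplification can be checked in two lines, assuming the standard Bregman divergence $V_{{\bm{v}}}({\bm{w}})=\frac{1}{2}\|{\bm{w}}\|^{2}-\frac{1}{2}\|{\bm{v}}\|^{2}-\langle\nabla\frac{1}{2}\|{\bm{v}}\|^{2},{\bm{w}}-{\bm{v}}\rangle$. With $\|\cdot\|=\|\cdot\|_{2}$ we have $\nabla\frac{1}{2}\|{\bm{v}}\|_{2}^{2}={\bm{v}}$, so

$$V_{{\bm{v}}}({\bm{w}})=\frac{1}{2}\|{\bm{w}}\|_{2}^{2}-\frac{1}{2}\|{\bm{v}}\|_{2}^{2}-\langle{\bm{v}},{\bm{w}}-{\bm{v}}\rangle=\frac{1}{2}\|{\bm{w}}-{\bm{v}}\|_{2}^{2},$$

$$\frac{\sigma}{\gamma}\Big(V_{{\bm{w}}-{\bm{w}}_{0}}({\bm{w}}^{*}-{\bm{w}}_{0})+V_{{\bm{w}}^{*}-{\bm{w}}_{0}}({\bm{w}}-{\bm{w}}_{0})\Big)=\sigma\Big(\frac{1}{2}\|{\bm{w}}^{*}-{\bm{w}}\|_{2}^{2}+\frac{1}{2}\|{\bm{w}}-{\bm{w}}^{*}\|_{2}^{2}\Big)=\sigma\|{\bm{w}}-{\bm{w}}^{*}\|_{2}^{2}.$$

In particular, the dependence on ${\bm{w}}_{0}$ cancels in the Euclidean case.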

Our main motivation in making Assumptions 3 and 4 is to prove explicit global convergence results for VIP $(F,{\mathcal{W}})$ under conditions as weak as possible. Moreover, special cases of Assumptions 3 and 4 that are still non-monotone, namely the pseudomonotone and strongly pseudomonotone settings respectively, have many real applications in competitive exchange economy, fractional programming, and product pricing. Meanwhile, the restriction of Assumption 4 to minimization problems, such as one-point convexity, is also used in analyzing neural networks.

Optimistic Dual Extrapolation

In this section, we present the optimistic dual extrapolation (OptDE) algorithm for solving the VIP $(F,{\mathcal{W}})$ in (1). The method is a single-call variant of Nesterov’s dual extrapolation . The overall algorithm is summarized as Algorithm 1. The algorithm works under either Assumption 3 by setting $\sigma=0$ or Assumption 4 with $\sigma>0$ .

For Algorithm 1, we define two constants $A_{0}$ and $\alpha$ in Step 2. Then we initialize three vectors ${\bm{w}}_{0},{\bm{z}}_{0}$ and ${\bm{g}}_{0}$ in Step 3. In the main loop, we update the two positive numbers $a_{k}$ and $A_{k}$ in Step 5. Then we perform an “extrapolation” step in Step 6, followed by “dual averaging” steps in Steps 7 and 8. Since Algorithm 1 performs only one new gradient evaluation per iteration, in Step 8, it is “optimistic”, hence the name “optimistic dual extrapolation”. Once Algorithm 1 has run $K$ iterations, we return the best iterate as measured by the sum of residual norms $\|{\bm{w}}_{k}-{\bm{z}}_{k-1}\|+\|{\bm{w}}_{k-1}-{\bm{z}}_{k-1}\|$ (this return criterion follows from our convergence analysis).

Compared with Nesterov’s dual extrapolation, the main difference is that the extrapolation Step 6 is a prox-mapping on $F({\bm{w}}_{k-1})$, not on $F({\bm{z}}_{k-1})$. Compared with the past extragradient method, the main difference is that we perform dual averaging in Steps 7 and 8 instead of a “mirror descent” step. Compared with N-EG, which is claimed to be a non-Euclidean extragradient method, we not only perform just one gradient evaluation per iteration but also do not require $\frac{1}{2}\|\cdot\|^{2}$ to have Lipschitz continuous gradients. This is significant in the non-Euclidean setting, since the squared norm $\frac{1}{2}\|\cdot\|_{p}^{2}$ for $p\in(1,2)$ may not have globally Lipschitz continuous gradients.
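The steps described above can be sketched in the simplest possible setting. The code below is not Algorithm 1 itself: it is our own specialization to the Euclidean norm ($\gamma=1$), the unconstrained case ${\mathcal{W}}=\mathbb{R}^{d}$ (so the prox-mapping reduces to a plain subtraction), and $\sigma=0$ (so $a_{k}=\alpha/L$ is constant). The function name `optde_euclidean`, the bilinear test problem, and the value $\alpha=0.17$ (chosen below $\frac{1}{4\sqrt{2}}$) are assumptions made for illustration.

```python
import numpy as np

def optde_euclidean(F, w0, L, K, alpha=0.17):
    """Single-call optimistic dual extrapolation sketch (Euclidean, W = R^d, sigma = 0)."""
    w0 = np.asarray(w0, dtype=float)
    a = alpha / L                     # a_k = alpha * gamma / L with gamma = 1, sigma = 0
    z_prev, w_prev = w0.copy(), w0.copy()
    g = np.zeros_like(w0)             # running sum of a_k * F(w_k)
    F_prev = F(w_prev)                # F(w_0), evaluated once before the loop
    best_w, best_res = w0.copy(), np.inf
    for _ in range(K):
        w = z_prev - a * F_prev       # extrapolation: reuses the stored F(w_{k-1})
        # residual used to select the returned iterate
        res = np.linalg.norm(w - z_prev) + np.linalg.norm(w_prev - z_prev)
        if res < best_res:
            best_w, best_res = w.copy(), res
        F_prev = F(w)                 # the only new operator evaluation in this iteration
        g += a * F_prev
        z_prev = w0 - g               # dual averaging: start point minus accumulated gradients
        w_prev = w
    return best_w, w_prev             # (best iterate, last iterate)

def F_bilinear(w):
    """VI operator of f(x, y) = x * y, which is monotone with L = 1."""
    x, y = w
    return np.array([y, -x])

best, last = optde_euclidean(F_bilinear, [1.0, 1.0], L=1.0, K=2000)
best_norm, last_norm = np.linalg.norm(best), np.linalg.norm(last)
```

With a constraint set or a non-Euclidean norm, the two subtractions would be replaced by the prox-mapping of the paper, and for $\sigma>0$ the step size $a_{k}$ would grow with $A_{k-1}$; neither is covered by this sketch.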

In the following, we assume ${\bm{w}}^{*}$ is a solution that satisfies Assumption 3 if $\sigma=0$ or satisfies Assumption 4 if $\sigma>0$ .

with $C_{0}=\big{(}1+\frac{\delta}{{\alpha\gamma}}\big{)}\sqrt{\frac{8\alpha}{\gamma}},$ $a_{1}=\frac{{\alpha\gamma}}{L}$ , and

Theorem 1 implies our main result in Table 1. As we see, for $\sigma=0$, except for constants, our result is the same as that of the two-call extragradient method N-EG. However, for single-call methods, particularly in the setting $\sigma=0$, the analysis is much more involved and leads to an interesting criterion for the return value in Step 10 of Algorithm 1. For the setting $\sigma>0$, linear convergence rates can be obtained in terms of both the restricted strong merit function and the solution distance. Meanwhile, for the setting $\sigma>0$, our result in terms of the restricted strong merit function (10) cannot be implied by the result on the solution distance (12), while the converse is true. Furthermore, when $\sigma>0$, the result (10) is also used in deriving Corollary 1 for the monotone setting. Finally, to simplify our analysis, we have not optimized the constants in (10) and (12), which can probably be further improved.

In Theorem 1, we provide a unified result for the two settings $\sigma=0$ and $\sigma>0$ in terms of the best iterate. However, when $\sigma>0,$ we can also prove linear convergence rates in terms of last iterate, which is given in Proposition 1 below.

Let Assumptions 1 and 2 hold. For the setting $\sigma>0$ (i.e., Assumption 4 holds), $\forall K\geq 1,$ after $K$ iterations, Algorithm 1 returns a ${{\bm{w}}}_{K}$ such that

with $C_{0}$ defined in Theorem 1, $a_{0}=a_{1}$ and $\forall K\geq 1,$

By Proposition 1, to prove the linear convergence of the last iterate, we do not need the strongly monotone assumption, but only Assumption 4. Although the last iterate also has a linear convergence rate, it is slower than the rate of the best iterate in Theorem 1. As we will see, Proposition 1 will also be used to prove the last iterate convergence for the monotone setting in a non-classical sense.

The motivation behind OptDE is that by generalizing Nesterov’s estimation sequence, we can perform a unified convergence analysis under Assumptions 3 and 4. Moreover, if a regularizer exists, the (regularized) dual averaging steps (Steps 7 and 8 of Algorithm 1) can help us better exploit the structure induced by the regularizer, such as sparsity.

A local convergence analysis in terms of solution distance has previously been given by assuming that Assumption 4 holds in a neighbourhood of the optimal solution. That analysis needs extra techniques, and the constants in its rates are implicit. Our solution distance result in (12) can be viewed as a global and explicit version of that result, obtained by assuming that Assumption 4 holds globally. Meanwhile, that work does not give any result under Assumption 3, or in terms of the restricted strong merit function under Assumption 4, whereas our analysis does.

Our results are mainly given under the coherent non-monotone Assumptions 3 and 4. As shown in Theorem 1, under Assumption 3 that includes the monotone assumption, we can obtain an $\epsilon$ -accurate strong solution in $O(\epsilon^{-2})$ iterations. However, in the following we show that with a regularization trick, the rate can be much better in the monotone setting by using our results in Theorem 1 and Proposition 1.

First, to give our results in the monotone setting, we have Lemma 1.

If the VIP $(F,{\mathcal{W}})$ is monotone, then the regularized problem VIP $(F+\epsilon\nabla\frac{1}{2\gamma}\|\cdot-{\bm{w}}_{0}\|^{2},{\mathcal{W}})$ satisfies Assumption 4 with $\sigma=\epsilon.$

Due to Lemma 1, we can apply Theorem 1 and Proposition 1 to the regularized problem VIP $(F+\epsilon\nabla\frac{1}{2\gamma}\|\cdot-{\bm{w}}_{0}\|^{2},{\mathcal{W}})$ , and then obtain Corollaries 1 and 2 for the VIP $(F,{\mathcal{W}})$ , respectively.
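In the Euclidean case ($\gamma=1$) the regularization trick amounts to adding $\epsilon({\bm{w}}-{\bm{w}}_{0})$ to the operator. The sketch below wraps a monotone operator this way and checks numerically that the result is $\epsilon$-strongly monotone, which is the property that lets the regularized problem be treated as a $\sigma=\epsilon$ problem. The helper name `regularize`, the bilinear operator, and the numerical values are our own illustrative choices.

```python
import numpy as np

def regularize(F, w0, eps):
    """Return F_eps(w) = F(w) + eps * (w - w0), the Euclidean form of the regularized operator."""
    w0 = np.asarray(w0, dtype=float)
    def F_eps(w):
        return F(w) + eps * (np.asarray(w, dtype=float) - w0)
    return F_eps

def F(w):
    """Monotone (but not strongly monotone) operator of f(x, y) = x * y."""
    x, y = w
    return np.array([y, -x])

rng = np.random.default_rng(0)
eps = 0.1
F_eps = regularize(F, w0=[1.0, -2.0], eps=eps)

# Margin <F_eps(w) - F_eps(v), w - v> - eps * ||w - v||^2; nonnegative means eps-strong monotonicity.
margins = []
for _ in range(100):
    w, v = rng.normal(size=2), rng.normal(size=2)
    margins.append((F_eps(w) - F_eps(v)) @ (w - v) - eps * np.linalg.norm(w - v) ** 2)
min_margin = min(margins)
```

For this bilinear operator the margin is zero up to rounding, because $\langle F({\bm{w}})-F({\bm{v}}),{\bm{w}}-{\bm{v}}\rangle=0$ and the regularizer contributes exactly $\epsilon\|{\bm{w}}-{\bm{v}}\|_{2}^{2}$.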

Given ${\bm{w}}_{0}\in{\mathcal{W}}$, let Assumptions 1 and 2 hold for the regularized problem VIP $(F+\epsilon\nabla\frac{1}{2\gamma}\|\cdot-{\bm{w}}_{0}\|^{2},{\mathcal{W}})$. If we solve the regularized problem with Algorithm 4, then the best iterate returned by Algorithm 4 satisfies

Compared with Theorem 1 and Proposition 1, we need an extra condition $\|{\bm{w}}-{\bm{w}}_{0}\|\leq D$ in Corollary 1, which can be satisfied by choosing a large enough $D$ . By Corollary 1, by choosing $K=O\big{(}\frac{1}{\epsilon}\log\frac{1}{\epsilon}\big{)}$ , we will obtain an $O(D\epsilon)$ -accurate solution. Note that $D$ does not appear in our algorithm and is not relevant to the choice of $\epsilon.$

Similar to Corollary 1 for the best iterate, in Corollary 2, by choosing $K=O\big{(}\frac{1}{\epsilon}\log\frac{1}{\epsilon}\big{)}$, the last iterate will be an $O(D\epsilon)$-accurate strong solution, which is significantly better than the tight bound $O(1/\epsilon^{2})$ for the last iterate. Nevertheless, it should be noted that Corollary 2 holds in a non-classical sense: we do not guarantee last iterate convergence for all $K\geq 1$, but only after $K=O\big{(}\frac{1}{\epsilon}\log\frac{1}{\epsilon}\big{)}$ iterations with a prescribed accuracy parameter $\epsilon$. Thus our result does not contradict the lower bound for the last iterate.

Meanwhile, our proof only relies on the regularized problem VIP $(F+\epsilon\nabla\frac{1}{2\gamma}\|\cdot-{\bm{w}}_{0}\|^{2},{\mathcal{W}})$ satisfying Assumption 4 with $\sigma=\epsilon,$ which holds if the VIP $(F,{\mathcal{W}})$ is monotone. However, it is not necessary for the VIP $(F,{\mathcal{W}})$ to be monotone. For instance, if the VIP $(F,{\mathcal{W}})$ satisfies Assumption 3 and ${\bm{w}}_{0}={\bm{w}}^{*},$ then the VIP $(F+\epsilon\nabla\frac{1}{2\gamma}\|\cdot-{\bm{w}}_{0}\|^{2},{\mathcal{W}})$ also satisfies Assumption 4 with $\sigma=\epsilon.$ Of course, letting ${\bm{w}}_{0}={\bm{w}}^{*}$ is impractical and we leave the more general setting of ${\bm{w}}_{0}$ under non-monotone settings for further research.

Recently, a different Halpern iteration method has been proposed under the monotone and Lipschitz assumptions. The Halpern iteration method does not need to know the Lipschitz constant and thus is parameter-free, and it also attains the $O\big{(}\frac{1}{\epsilon}\log\frac{1}{\epsilon}\big{)}$ convergence rate. Nevertheless, there are two major differences: the Halpern iteration method has two loops, while our OptDE method is a single-loop single-call method; and the Halpern iteration method is currently limited to the Euclidean setting, while ours has theoretical guarantees in the non-Euclidean setting.

Stochastic Optimistic Dual Extrapolation

For $\sigma>0,$ ( $i.e.,$ Assumption 4 holds), we have

As shown in (16), for the setting $\sigma=0$ (i.e., Assumption 3), even if the number of iterations $K\rightarrow\infty$, the expected restricted strong merit function can only be upper bounded by $O\big{(}\frac{s}{L}\big{)}$. Thus, to guarantee the convergence of SOptDE, the variance should be $o(1)$, such as $s^{2}=O\big{(}\frac{1}{K}\big{)}$. In the Euclidean setting where $\|\cdot\|:=\|\cdot\|_{2}$, by the concentration inequality, attaining a variance of $O\big{(}\frac{1}{K}\big{)}$ requires $O(K)$ samples. Thus, combining the setting $s^{2}=O(\frac{1}{K})$ with the result in (16), it can be verified that the single-call SOptDE method needs $O(1/\epsilon^{4})$ samples to obtain an $\epsilon$-accurate solution in terms of the expected restricted strong merit function.
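The counting behind the $O(1/\epsilon^{4})$ figure can be sketched as follows. We assume here (our reading, since (16) is only summarized in the text) that the bound consists of a deterministic part decaying as $O(1/\sqrt{K})$, consistent with the $O(1/\epsilon^{2})$ iteration count for $\sigma=0$, plus a noise part of order $s/L$:

$$\underbrace{O\Big(\frac{1}{\sqrt{K}}\Big)}_{\text{deterministic part}}+\underbrace{O\Big(\frac{s}{L}\Big)}_{\text{noise part}}=O\Big(\frac{1}{\sqrt{K}}\Big)\ \text{ when } s^{2}=O\Big(\frac{1}{K}\Big)\quad\Longrightarrow\quad K=O\Big(\frac{1}{\epsilon^{2}}\Big),$$

$$\text{total samples}=\underbrace{K}_{\text{iterations}}\times\underbrace{O(K)}_{\text{samples per iteration}}=O(K^{2})=O\Big(\frac{1}{\epsilon^{4}}\Big).$$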

Under the stronger Assumption 4, our result is given in terms of the expected solution distance. As shown in (17), under Assumption 4, SOptDE provably converges even when the variance $s^{2}$ is constant. In fact, the $O\big{(}\frac{1}{K}\big{)}$ rate is optimal and has been obtained by the two-call extragradient method ESA under the strongly pseudomonotone assumption. Meanwhile, the plain stochastic gradient descent algorithm has been used to obtain the $O\big{(}\frac{1}{K}\big{)}$ result for strongly monotone variational inequalities, which can also be extended to the setting where a $\sigma$-weak solution exists.

With the aggressive parameter setting $a_{k}=\frac{\alpha\gamma(1+\sigma A_{k-1})}{L}$ and a large batch size strategy, we also obtain the first convergence guarantee $O(1/\epsilon^{2}\log\frac{1}{\epsilon})$ in terms of restricted strong merit function as shown in Table 2 (see details in the supplementary material).

Concluding Remarks

In this paper, we proposed a single-call extragradient method, optimistic dual extrapolation (OptDE), that works beyond the monotone setting, and extended it to the stochastic setting as stochastic optimistic dual extrapolation (SOptDE). We systematically proved convergence results for OptDE and SOptDE under Assumption 3 that a weak solution exists and Assumption 4 that a $\sigma$-weak solution exists. We also showed beneficial implications of our analysis in both non-monotone and monotone settings. In the future, we will further study how the proposed methods may lead to improved computational efficiency and performance guarantees in a wide range of machine learning problems, such as the training of adversarial deep neural networks.

Broader Impact

In this paper, we present a systematic theoretical analysis of single-call extragradient methods, which have been widely used in modern machine learning applications. The theoretical results in this paper can bring meaningful insight and understanding for practical algorithms.

Acknowledgement

Chaobing and Yi acknowledge support from Tsinghua-Berkeley Shenzhen Institute (TBSI) Research Fund. Yichao and Yi acknowledge funding from Sony Research. Yi acknowledges support from ONR grant N00014-20-1-2002 and the joint Simons Foundation-NSF DMS grant #2031899, as well as support from Berkeley AI Research (BAIR), Berkeley FHL Vive Center for Enhanced Reality, and Berkeley Center for Augmented Cognition.

Appendix A Convergence Analysis of Optimistic Dual Extrapolation

Based on the optimality condition of ${\bm{z}}_{k}$ and Assumption 2, we have Lemma 2.

then $\forall{\bm{u}},{\bm{w}}_{0}\in{\mathcal{W}}$ , we have

In Lemma 2, the sequence $\{E_{1k}\}$ can be viewed as the errors we need to bound in each iteration. The upper bound of the sum of $\{E_{1k}\}$ is given in Lemma 3 below.

In Algorithm 1, $\forall k\in[K],$ we have

where we define $a_{0}:=a_{1}$ for convenience.

By Lemma 3, for any $0<\alpha\leq\min\Big{\{}\frac{1}{4\sqrt{2}},\frac{\sqrt{3}}{4\sqrt{\gamma}}\Big{\}}$, the sum $\sum_{k=1}^{K}E_{1k}$ is upper bounded by a sum of strictly negative terms in $\|{\bm{w}}_{k}-{\bm{z}}_{k-1}\|^{2}+\|{\bm{w}}_{k-1}-{\bm{z}}_{k-1}\|^{2}$, which makes it possible to give an upper bound on $\min_{k\in[K]}(\|{\bm{w}}_{k}-{\bm{z}}_{k-1}\|+\|{\bm{w}}_{k-1}-{\bm{z}}_{k-1}\|)$. To show the guarantees in terms of the restricted strong merit function and the distance $\|{\bm{w}}_{k}-{\bm{w}}^{*}\|$, we give Lemma 4.

In Algorithm 1, $\forall k\in[K],$ we have,

Then combining Lemmas 2, 3 and 4, we obtain Theorem 1 in the main body (see Section C.4 for the proof).

Appendix B Convergence Analysis of Stochastic Optimistic Dual Extrapolation

We can extend the proof for the OptDE method in Section A to the stochastic setting via Lemmas 5, 6 and 7 and then obtain Theorem 2. First, we extend Lemma 2 to Lemma 5.

In Algorithm 5, $\forall k\in[K]$ , we have the following inequality: $\forall{\bm{u}},{\bm{w}}_{0}\in{\mathcal{W}}$ , let

Compared with the $E_{1k}$ of Lemma 2, $E_{2k}$ contains an extra term $a_{k}\langle F({\bm{w}}_{k})-F({\bm{w}}_{k};\xi_{k}),{\bm{w}}_{k}-{\bm{u}}\rangle$ . Then based on the definition of $E_{2k}$ and Assumption 5, we have Lemma 6.

In Algorithm 5, $\forall k\in[K]$ and $\forall{\bm{u}}\in{\mathcal{W}},$ we have

Lemma 6 extends Lemma 3 into the stochastic setting. Meanwhile, by the optimality condition of ${\bm{w}}_{k}$ , and Assumptions 1, 2 and 5, we can extend Lemma 4 to Lemma 7.

In Algorithm 5, for the setting $\sigma=0$ and $\forall k\in[K],$ we have,

Then combining Lemmas 5, 6 and 7, we obtain Theorem 2 for the SOptDE method in the main body (see Section D.4 for the proof).

It turns out that with the conservative setting $a_{k}=\frac{\alpha\gamma\sqrt{1+\sigma A_{k-1}}}{L}$, we cannot obtain strong convergence results in terms of the restricted strong merit function. To obtain the rate $O\left(\frac{1}{\epsilon^{2}}\log\frac{1}{\epsilon}\right)$, we need to adopt the more aggressive setting $a_{k}=\frac{\alpha\gamma(1+\sigma A_{k-1})}{L}$ with a large batch size strategy, which is given in Algorithm 3. With this setting, we have Proposition 2.

with $C_{1}:=4\big{(}1+\frac{\delta}{{\alpha\gamma}}\big{)}\sqrt{\frac{\alpha}{\gamma}},$ $a_{1}=\frac{{\alpha\gamma}}{L}$ , and

The proof of Proposition 2 follows the same pipeline as the proof of Theorem 2, except that we use the setting $a_{k}=\frac{\alpha\gamma(1+\sigma A_{k-1})}{L}$ that is also used in Algorithm 4. We leave the proof of Proposition 2 as a simple exercise.

In Proposition 2, if we want the variance of the stochastic estimates $\{F({\bm{w}}_{k};\xi_{k})\}$ to be $s^{2}=O(\frac{1}{A_{K-1}+a_{1}})$, then we need $O(A_{K-1}+a_{1})$ stochastic samples per iteration. Meanwhile, to attain an expected $\epsilon$-accurate strong solution, we need $O(\log\frac{1}{\epsilon})$ iterations. Thus the total number of stochastic samples we need is $O(\frac{1}{\epsilon^{2}}\log\frac{1}{\epsilon})$.

B.2 The “(quadratic) natural residual function” [18] and the restricted strong merit function

In our notation, for any ${\bm{w}}\in{\mathcal{W}},$ the (quadratic) natural residual function in [18] is defined by: given $\eta>0,$

which can be used to derive the restricted strong merit function as in Proposition 3.

Let ${\bm{w}}^{\prime}:=P_{{\bm{w}}}(\eta F({\bm{w}}))=\operatorname*{arg\,min}_{{\bm{z}}\in{\mathcal{W}}}\{\langle\eta F({\bm{w}}),{\bm{z}}\rangle+\frac{1}{2\gamma}\|{\bm{z}}-{\bm{w}}\|^{2}\}.$ Then we have

where $(a)$ is by the optimality condition of ${\bm{w}}^{\prime}$ , $(b)$ is by the Cauchy-Schwarz inequality, $(c)$ is by the Lipschitz continuity of $F({\mathbf{w}})$ and the bounded assumption (7). So we have

Appendix C Proof of Section A

By the definition of proximal operator (8), we can equivalently reformulate the optimistic dual extrapolation (OptDE) algorithm in the main body as Algorithm 4. Then based on the definition of ${\bm{g}}_{k}$ in Step 7 and the definition of the Bregman divergence $V_{{\bm{w}}}({\bm{u}})({\bm{w}},{\bm{u}}\in{\mathcal{W}})$ , we can verify that

where ${\bm{u}}$ is an arbitrary vector in ${\mathcal{W}}$ and is irrelevant to the minimizer ${\bm{z}}_{k}.$ In our context, $\psi_{k}({\bm{z}})$ plays the role of a “generalized estimation sequence” to help us conduct convergence analysis. By the $\gamma$ -strong convexity of the Bregman divergence $V_{{\bm{w}}_{i}-{\bm{w}}_{0}}({\bm{z}}-{\bm{w}}_{0})$ , we know that $\psi_{k}({\bm{z}})$ is strongly convex with strong convexity parameter $1+\sigma\sum_{i=1}^{k}a_{i}=1+\sigma A_{k}.$

Proof. Given the definition of the generalized estimation sequence $\psi_{k}({\bm{z}})$ in (31) and the minimizer ${\bm{z}}_{k}$ in Algorithm 4, by the optimality condition of ${\bm{z}}_{k}$ , we have: $\forall{\bm{u}}\in{\mathcal{W}},$

where $(a)$ is by the optimality condition (32), and $(b)$ is by the convexity of both $V_{{\bm{w}}_{i}-{\bm{w}}_{0}}({\bm{u}}-{\bm{w}}_{0})$ and $\frac{1}{2\gamma}\|{\bm{u}}-{\bm{w}}_{0}\|^{2}.$

Meanwhile for $k\geq 1$ , by the definition of $\psi_{k}({\bm{z}}_{k})$ , we have

where $(a)$ is by the $(1+\sigma A_{k-1})$ -strong convexity of $\psi_{k-1}({\bm{z}}).$ Meanwhile, by the strong convexity of $\frac{1}{2}\|\cdot\|^{2}$ in Assumption 2, we have

Then combining (34) and (35), and after simple arrangements, we have

where $(a)$ is by the fact $\psi_{0}({\bm{z}}_{0})=0$ and the upper bound of $\psi_{K}({\bm{z}}_{K})$ by (33). By the setting $a_{k}=\frac{{\alpha\gamma}(1+\sigma A_{k-1})}{L}$ in Algorithm 4 and (37), we have

Then based on the definition of $E_{1k}$ in Lemma 2, after simple arrangements, Lemma 2 is proved.

C.2 Proof of Lemma 3

Proof. By the definition of $E_{1k}$ in Lemma 2, we have: $\forall k\in[K],$

where $(a)$ is by the Cauchy–Schwarz inequality, $(b)$ is by the Lipschitz continuity in Assumption 1, $(c)$ is by the fact $ab\leq a^{2}+\frac{b^{2}}{4},$ $(d)$ is by the triangle inequality of the norm $\|\cdot\|$, and $(e)$ is by the fact $(a+b)^{2}\leq 2(a^{2}+b^{2}).$
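For completeness, the two elementary facts used in steps $(c)$ and $(e)$ each follow from completing a square, for any real numbers $a$ and $b$:

$$a^{2}-ab+\frac{b^{2}}{4}=\Big(a-\frac{b}{2}\Big)^{2}\geq 0\quad\Longrightarrow\quad ab\leq a^{2}+\frac{b^{2}}{4},$$

$$2(a^{2}+b^{2})-(a+b)^{2}=(a-b)^{2}\geq 0\quad\Longrightarrow\quad(a+b)^{2}\leq 2(a^{2}+b^{2}).$$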

Then by the optimality condition of ${\bm{w}}_{k}$ in the $k$ -th iteration of Algorithm 4, we have: $\forall{\bm{z}}\in{\mathcal{W}},$

By combining (39), (40) and (41), we have

from Step 5 of Algorithm 4 and the setting

and $(b)$ is by the setting that $0<\gamma\leq 1.$

With ${\bm{w}}_{0}={\bm{z}}_{0}$ and, for convenience, the convention $a_{0}:=a_{1},$ summing (42) from $k=1$ to $K$ gives

where $(a)$ is by the fact ${\bm{w}}_{0}={\bm{z}}_{0}$ , and $(b)$ is by the fact that $a_{k}\geq a_{k-1}>0$ . Lemma 3 is proved.

C.3 Proof of Lemma 4

where $(a)$ is by the optimality condition of ${\bm{w}}_{k}$ , $(b)$ is by the Cauchy–Schwarz inequality, $(c)$ is by the Lipschitz continuity of $F({\bm{w}})$ and the boundedness assumption (7), and $(d)$ is by the triangle inequality for the norm $\|\cdot\|.$ So we have

Meanwhile, if there exists a ${\bm{w}}^{*}$ that satisfies Assumption 4, $i.e.,$ $\forall{\bm{w}}\in{\mathcal{W}}$ , $\langle F({\bm{w}}),{\bm{w}}-{\bm{w}}^{*}\rangle\geq\frac{\sigma}{\gamma}(V_{{\bm{w}}-{\bm{w}}_{0}}({\bm{w}}^{*}-{\bm{w}}_{0})+V_{{\bm{w}}^{*}-{\bm{w}}_{0}}({\bm{w}}-{\bm{w}}_{0}))$ with $\sigma>0,$ then in (45), let ${\bm{w}}:={\bm{w}}^{*},$ and by the fact $V_{{\bm{w}}_{k}-{\bm{w}}_{0}}({\bm{w}}^{*}-{\bm{w}}_{0})\geq\frac{\gamma}{2}\|{\bm{w}}_{k}-{\bm{w}}^{*}\|^{2}$ and $V_{{\bm{w}}^{*}-{\bm{w}}_{0}}({\bm{w}}_{k}-{\bm{w}}_{0})\geq\frac{\gamma}{2}\|{\bm{w}}_{k}-{\bm{w}}^{*}\|^{2}$ , we have

C.4 Proof of Theorem 1

Proof. Firstly, by the setting $a_{k}=\frac{{\alpha\gamma}(1+\sigma A_{k-1})}{L}$ and $A_{0}=0,A_{k}=A_{k-1}+a_{k},$ we have: $\forall k\geq 0,$

If $\sigma=0,$ then $A_{k}=\frac{\alpha\gamma k}{L}.$

If $\sigma>0$ , then $A_{k}=\frac{1}{\sigma}\Big{(}1+\frac{{\alpha\gamma}\sigma}{L}\Big{)}^{k}-\frac{1}{\sigma}$ .
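Both closed forms follow from the recursion, since $1+\sigma A_{k}=(1+\sigma A_{k-1})\big(1+\frac{\alpha\gamma\sigma}{L}\big)$. As a sanity check, they can be compared numerically against the recursion itself. The following is a minimal Python sketch; the function names and the sample parameter values are illustrative assumptions and are not part of Algorithm 4.

```python
import math


def step_sizes(alpha, gamma, sigma, L, K):
    """Run a_k = alpha*gamma*(1 + sigma*A_{k-1})/L and A_k = A_{k-1} + a_k with A_0 = 0.

    Returns (a, A) where a[k] and A[k] hold a_k and A_k; a[0] is a placeholder.
    """
    a = [None]
    A = [0.0]
    for _ in range(1, K + 1):
        a_k = alpha * gamma * (1.0 + sigma * A[-1]) / L
        a.append(a_k)
        A.append(A[-1] + a_k)
    return a, A


def closed_form_A(alpha, gamma, sigma, L, k):
    """Closed form of A_k: linear in k if sigma == 0, geometric if sigma > 0."""
    c = alpha * gamma / L
    if sigma == 0:
        return c * k
    return ((1.0 + c * sigma) ** k - 1.0) / sigma


def closed_form_a(alpha, gamma, sigma, L, k):
    """Closed form of a_k for k >= 1, obtained from a_k = c * (1 + sigma * A_{k-1})."""
    c = alpha * gamma / L
    return c * (1.0 + c * sigma) ** (k - 1)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    params = dict(alpha=0.25, gamma=1.0, sigma=0.5, L=2.0)
    a, A = step_sizes(K=20, **params)
    for k in (1, 5, 20):
        print(k, A[k], closed_form_A(k=k, **params))
```

With $\sigma>0$ the step sizes $a_{k}$ grow geometrically, whereas with $\sigma=0$ they stay constant at $\frac{\alpha\gamma}{L}$.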

Let ${\bm{w}}$ be the ${\bm{w}}^{*}$ in Assumption 3 if $\sigma=0$ or the ${\bm{w}}^{*}$ in Assumption 4 if $\sigma>0$ . Then by the property of ${\bm{w}}^{*}$ , we have $\langle F({\bm{w}}_{k}),{\bm{w}}_{k}-{\bm{w}}^{*}\rangle-\frac{\sigma}{\gamma}V_{{\bm{w}}_{k}-{\bm{w}}_{0}}({\bm{w}}^{*}-{\bm{w}}_{0})\geq 0.$ So by (49), it follows that

By the setting $A_{k}=A_{k-1}+a_{k}$ with $A_{0}=0$ in Algorithm 4, we have $A_{k}=\sum_{i=1}^{k}a_{i}.$ Meanwhile, for convenience, we have set $a_{0}=a_{1}$ . So we have

So by (20) of Lemma 4 and (52), it follows that

Similarly, if $\sigma>0,$ then by (21) of Lemma 4 and (52), we have

Then by defining $C_{0}:=\Big{(}1+\frac{\delta}{{\alpha\gamma}}\Big{)}\sqrt{\frac{8\alpha}{\gamma}},$ Theorem 1 is proved.

C.5 Proof of Proposition 1

Proof. The proof follows the same paradigm of Section C.4. Firstly, by the setting $a_{k}=\frac{{\alpha\gamma}(1+\sigma A_{k-1})}{L}$ and $A_{0}=0,A_{k}=A_{k-1}+a_{k},$ we have $\forall k\geq 0,$

Then the (51) of Section C.4 is replaced by

Then similar to (52) to (54), we obtain the last iterate convergence result as

Thus by the definition of $C_{0}$ in Theorem 1, Proposition 1 is proved.

C.6 Proof of Lemma 1

Proof. By the definition of the Bregman divergence $V_{{\bm{w}}}({\bm{v}}),$ we have

So combining (57) and (58), it follows that

So if $F({\bm{w}})$ is monotone, then we have: $\forall{\bm{w}}_{0},{\bm{w}},{\bm{v}}\in{\mathcal{W}},$

As Assumption 4 includes the strongly monotone assumption, by (60), we know that the VIP $(F+\epsilon\nabla\frac{1}{2\gamma}\|\cdot-{\bm{w}}_{0}\|^{2},{\mathcal{W}})$ satisfies Assumption 4 with parameter $\sigma=\epsilon.$
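The identity behind this argument can be sketched as follows, assuming the standard Bregman divergence $V_{{\bm{x}}}({\bm{y}})=h({\bm{y}})-h({\bm{x}})-\langle\nabla h({\bm{x}}),{\bm{y}}-{\bm{x}}\rangle$ generated by $h=\frac{1}{2}\|\cdot\|^{2}$, with $h$ differentiable (the symbol $h$ is introduced here only for illustration). Adding $V_{{\bm{x}}}({\bm{y}})$ and $V_{{\bm{y}}}({\bm{x}})$ cancels the function values, and substituting ${\bm{x}}={\bm{w}}-{\bm{w}}_{0}$ and ${\bm{y}}={\bm{v}}-{\bm{w}}_{0}$ gives the contribution of the regularizer:

$$V_{{\bm{x}}}({\bm{y}})+V_{{\bm{y}}}({\bm{x}})=\langle\nabla h({\bm{x}})-\nabla h({\bm{y}}),{\bm{x}}-{\bm{y}}\rangle,$$

$$\Big\langle\epsilon\nabla\tfrac{1}{2\gamma}\|{\bm{w}}-{\bm{w}}_{0}\|^{2}-\epsilon\nabla\tfrac{1}{2\gamma}\|{\bm{v}}-{\bm{w}}_{0}\|^{2},{\bm{w}}-{\bm{v}}\Big\rangle=\frac{\epsilon}{\gamma}\Big(V_{{\bm{w}}-{\bm{w}}_{0}}({\bm{v}}-{\bm{w}}_{0})+V_{{\bm{v}}-{\bm{w}}_{0}}({\bm{w}}-{\bm{w}}_{0})\Big).$$

Under this assumption, the right-hand side has the same form as the lower bound in Assumption 4 with $\sigma=\epsilon$, and monotonicity of $F$ only adds a nonnegative term to the left-hand side.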

C.7 Proof of Corollary 1

Proof. By Theorem 1 and Lemma 1, if we optimize the regularized problem VIP $(F+\epsilon\nabla\frac{1}{2\gamma}\|\cdot-{\bm{w}}_{0}\|^{2},{\mathcal{W}})$ by the ODE Algorithm 4, then after $K$ iterations, we have

where $C_{0}$ is defined in Theorem 1 and $A_{K-1}=\frac{1}{\epsilon}\Big{(}1+\frac{\alpha\gamma\epsilon}{L}\Big{)}^{K-1}-\frac{1}{\epsilon}.$

Meanwhile, by the convexity of $\frac{1}{2\gamma}\|{{\bm{w}}}-{\bm{w}}_{0}\|^{2},$ we have

C.8 Proof of Corollary 2

Proof. By Proposition 1 and Lemma 1, if we optimize the regularized problem VIP $(F+\epsilon\nabla\frac{1}{2\gamma}\|\cdot-{\bm{w}}_{0}\|^{2},{\mathcal{W}})$ by the OptDE Algorithm 4, then after $K$ iterations, we have

where $C_{0}$ is defined in Theorem 1 and, with $\sigma=\epsilon$ by Lemma 1, $a_{K-1}=\frac{\alpha\gamma}{L}\Big{(}1+\frac{\alpha\gamma\sigma}{L}\Big{)}^{K-2}.$

Meanwhile, by the convexity of $\frac{1}{2\gamma}\|{{\bm{w}}}-{\bm{w}}_{0}\|^{2},$ we have

Appendix D Proof of Section B

By the definition of proximal operator (8), we can equivalently reformulate the stochastic optimistic dual extrapolation (SODE) of the main body as below. Then based on the definition of ${\bm{g}}_{k}$ in Step 7 and the definition of the Bregman divergence $V_{{\bm{w}}}({\bm{u}})$ , we can verify that

where ${\bm{u}}$ is an arbitrary vector in ${\mathcal{W}}$ and is irrelevant to the minimizer ${\bm{z}}_{k}.$ In our context, $\hat{\psi}_{k}({\bm{z}})$ plays the role of a “generalized estimation sequence” to help us conduct convergence analysis. By the $\gamma$ -strong convexity of the Bregman divergence $V_{{\bm{w}}_{i}-{\bm{w}}_{0}}({\bm{z}}-{\bm{w}}_{0})$ , we know that $\hat{\psi}_{k}({\bm{z}})$ is strongly convex with strong convexity parameter $1+\sigma\sum_{i=1}^{k}a_{i}=1+\sigma A_{k}.$

Proof. Given the definition of the generalized estimation sequence $\hat{\psi}_{k}({\bm{z}})$ in (66) and the optimality condition of the minimizer ${\bm{z}}_{k}$ in Step 6 of Algorithm 5, we have: $\forall{\bm{u}}\in{\mathcal{W}},$

where $(a)$ is by the optimality condition (67) and $(b)$ is by the convexity of $V_{{\bm{w}}_{i}-{\bm{w}}_{0}}({\bm{u}}-{\bm{w}}_{0})$ and $\frac{1}{2\gamma}\|{\bm{u}}-{\bm{w}}_{0}\|^{2}$ .

where $(a)$ is by the $(1+\sigma A_{k-1})$ -strong convexity of $\hat{\psi}_{k-1}({\bm{z}}).$ Meanwhile, by the $\gamma$ -strong convexity of $\frac{1}{2}\|\cdot\|^{2}$ , we have

where $(a)$ is by the fact $\hat{\psi}_{0}({\bm{z}}_{0})=0$ and the upper bound of $\hat{\psi}_{K}({\bm{z}}_{K})$ in (68), and $(b)$ is by the setting $a_{k}^{2}=\frac{{(\alpha\gamma)^{2}}(1+\sigma A_{k-1})}{L^{2}}$ in Algorithm 5. Meanwhile, taking expectation with respect to $\xi_{k}$ , we have: $\forall{\bm{u}}\in{\mathcal{W}},$

So taking expectation on the randomness of all the history for (72), and using (73) and the definition of $\{E_{2k}\}$ in Lemma 5, after simple arrangements, Lemma 5 is proved.

D.2 Proof of Lemma 6

Proof. By the definition of $E_{2k}$ in Lemma 5, we have: $\forall k\in[K],$

where $(a)$ is by the Cauchy–Schwarz inequality, $(b)$ is by the triangle inequality for the norm $\|\cdot\|_{*}$ , $(c)$ is by the Lipschitz continuity of $F({\bm{w}})$ , $(d)$ is by the fact $ab\leq a^{2}+\frac{b^{2}}{4},$ and $(e),(f)$ and $(g)$ are by the fact $(a+b)^{2}\leq 2(a^{2}+b^{2}).$

Then by the optimality condition of ${\bm{w}}_{k}$ in Algorithm 5, we have: $\forall{\bm{z}}\in{\mathcal{W}},$

Combining (74), (75) and (76) with ${\bm{z}}:={\bm{z}}_{k}$ , we have

For both the settings $\sigma=0$ and $\sigma>0$ , by our setting, we have $a_{k}\geq a_{1}=\frac{{\alpha\gamma}}{L}$ and $\alpha=\min\{\frac{\gamma}{32},\frac{1}{16}\}$ , so we have

where $(a)$ is by the condition ${\bm{w}}_{0}={\bm{z}}_{0}$ and Assumption 5. Lemma 6 is proved.

D.3 Proof of Lemma 7

It follows that: $\forall{\bm{w}}\in{\mathcal{W}}$

where $(a)$ is by the fact $a_{k}=\frac{\alpha\gamma}{L}$ when $\sigma=0,$ $(b)$ is by the optimality condition of ${\bm{w}}_{k}$ .

where $(a)$ is by the Cauchy–Schwarz inequality and simple rearrangement, and $(b)$ is by Assumption 1.

where $(a)$ is by the fact $ab\leq\frac{a^{2}}{2}+\frac{b^{2}}{2}$ .

So taking expectation on $\xi_{k-1}$ , by Assumption 5, we have: $\forall{\bm{w}}\in{\mathcal{W}}$

D.4 Proof of Theorem 2

Proof. Firstly, by the setting $a_{k}=\frac{{\alpha\gamma}\sqrt{1+\sigma A_{k-1}}}{L}$ and $A_{0}=0,A_{k}=A_{k-1}+a_{k},$ we have

If $\sigma=0,$ then $A_{k}=\frac{{\alpha\gamma}k}{L}.$

If $\sigma>0$ , then $A_{k}\geq\Big{(}\frac{\alpha\gamma}{4L}\Big{)}^{2}\sigma(k+1)^{2}$ .
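Unlike the deterministic setting, the square-root recursion has no simple geometric solution, so it can be helpful to compare it numerically with the quadratic expression $\big(\frac{\alpha\gamma}{4L}\big)^{2}\sigma(k+1)^{2}$. The sketch below is a minimal Python illustration; the function names and parameter values are hypothetical choices and are not taken from Algorithm 5. For the sample values used, the recursion stays at or above the quadratic expression for every $k\geq 1$.

```python
import math


def stochastic_A(alpha, gamma, sigma, L, K):
    """Run a_k = alpha*gamma*sqrt(1 + sigma*A_{k-1})/L and A_k = A_{k-1} + a_k with A_0 = 0.

    Returns the list A with A[k] = A_k for k = 0, ..., K.
    """
    A = [0.0]
    for _ in range(1, K + 1):
        a_k = alpha * gamma * math.sqrt(1.0 + sigma * A[-1]) / L
        A.append(A[-1] + a_k)
    return A


def quadratic_expression(alpha, gamma, sigma, L, k):
    """Evaluate (alpha*gamma/(4L))^2 * sigma * (k+1)^2."""
    return (alpha * gamma / (4.0 * L)) ** 2 * sigma * (k + 1) ** 2


if __name__ == "__main__":
    # Hypothetical values; alpha = min(gamma/32, 1/16) with gamma = 1.
    params = dict(alpha=1.0 / 32.0, gamma=1.0, sigma=1.0, L=1.0)
    A = stochastic_A(K=200, **params)
    for k in (1, 10, 100, 200):
        print(k, A[k], quadratic_expression(k=k, **params))
```

When $\sigma=0$ the square root equals one and the recursion reduces to $A_{k}=\frac{\alpha\gamma k}{L}$, matching the first case.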

Then for both settings, $\sigma=0$ ( $i.e.,$ Assumption 3 holds) and $\sigma>0$ ( $i.e.,$ Assumption 4 holds), we have

So, letting ${\bm{u}}={\bm{w}}^{*}$ in Lemma 5, we have

where $(a)$ is by the $\gamma$ -strong convexity of the Bregman divergence $V_{{\bm{w}}^{*}-{\bm{w}}_{0}}({\bm{w}}_{k}-{\bm{w}}_{0})$ , $(b)$ is by Assumption 3 ( $\sigma=0$ ) or Assumption 4 ( $\sigma>0$ ), $(c)$ is by Lemma 5, and $(d)$ is by Lemma 6.

So taking expectation on all the history, we have

Then taking expectation on all the history, we have

where $(a)$ is by the Jensen inequality, $(b)$ is by the fact that $(a+b)^{2}\leq 2(a^{2}+b^{2})$ and $(c)$ is by (87).

where $(a)$ is by Lemma 7, $(b)$ is by the triangle inequality of $\|\cdot\|$ and Assumption 5, $(c)$ is by the triangle inequality of $\|\cdot\|$ , $(d)$ is by (87) and (88).

For $\sigma>0,$ by (84) and (85), we have