On the convergence of single-call stochastic extra-gradient methods

Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, Panayotis Mertikopoulos

Introduction

Deep learning is arguably the fastest-growing field in artificial intelligence: its applications range from image recognition and natural language processing to medical anomaly detection, drug discovery, and most fields where computers are required to make sense of massive amounts of data. In turn, this has spearheaded a prolific research thrust in optimization theory with the twofold aim of demystifying the successes of deep learning models and of providing novel methods to overcome their failures.

Introduced by Goodfellow et al. , generative adversarial networks have become the youngest torchbearers of the deep learning revolution and have occupied the forefront of this drive in more ways than one. First, the adversarial training of deep neural nets has given rise to new challenges regarding the efficient allocation of parallelizable resources, the compatibility of the chosen architectures, etc. Second, the loss landscape in GANs is no longer that of a minimization problem but that of a zero-sum, min-max game – or, more generally, a variational inequality (VI).

Variational inequalities are a flexible and widely studied framework in optimization which, among others, incorporates minimization, saddle-point, Nash equilibrium, and fixed point problems. As such, there is an extensive literature devoted to solving variational inequalities in different contexts; for an introduction, see and references therein. In particular, in the setting of monotone variational inequalities with Lipschitz continuous operators, it is well known that the optimal rate of convergence is O(1/t)\operatorname{\mathcal{O}}(1/t), and that this rate is achieved by the EG algorithm of Korpelevich and its Bregman variant, the Mirror-Prox (MP) algorithm of Nemirovski .Korpelevich proved the method’s asymptotic convergence for pseudomonotone variational inequalities. The O(1/t)\operatorname{\mathcal{O}}(1/t) convergence rate was later established by Nemirovski with ergodic averaging.

These algorithms require two projections and two oracle calls per iteration, so they are more costly than standard Forward-Backward / descent methods. As a result, there are two complementary strands of literature aiming to reduce one (or both) of these cost multipliers – that is, the number of projections and/or the number of oracle calls per iteration. The first class contains algorithms like the Forward-Backward-Forward (FBF) method of Tseng , while the second focuses on gradient extrapolation mechanisms like Popov’s modified Arrow–Hurwicz algorithm .

In deep learning, the latter direction has attracted considerably more interest than the former. The main reason for this is that neural net training often does not involve constraints (and, when it does, they are relatively cheap to handle). On the other hand, gradient calculations can become very costly, so a decrease in the number of oracle calls could offer significant practical benefits. In view of this, our aim in this paper is (i) to develop a synthetic approach to methods that retain the anticipatory properties of the Extra-Gradient algorithm while making a single oracle call per iteration; and (ii) to derive quantitative convergence results for such single-call extra-gradient (11-EG) algorithms.

Our first contribution complements the existing literature (reviewed below and in Section 3) by showing that the class of single-call extra-gradient (11-EG) algorithms under study attains the optimal O(1/t)\operatorname{\mathcal{O}}(1/t) convergence rate of the two-call method in deterministic variational inequalities with a monotone, Lipschitz continuous operator. Subsequently, we show that this rate is also achieved in stochastic variational inequalities with strongly monotone operators provided that the optimizer has access to an oracle with bounded variance (but not necessarily bounded second moments).

Importantly, this stochastic result concerns both the method’s “ergodic average” (a weighted average of the sequence of points generated by the algorithm) as well as its “last iterate” (the last generated point). The reason for this dual focus is that averaging can be very useful in convex/monotone landscapes, but it is not as beneficial in non-monotone problems (where Jensen’s inequality does not apply). On that account, last-iterate convergence results comprise an essential stepping stone for venturing beyond monotone problems.

Armed with these encouraging results, we then focus on non-monotone problems and show that, with high probability, the method’s last iterate exhibits a O(1/t)\operatorname{\mathcal{O}}(1/t) local convergence rate to solutions of non-monotone variational inequalities that satisfy a second-order sufficient condition. To the best of our knowledge, this is the first convergence rate guarantee of this type for stochastic, non-monotone variational inequalities.

Related work.

The prominence of Extra-Gradient/Mirror-Prox methods in solving variational inequalities and saddle-point problems has given rise to a vast corpus of literature which we cannot hope to do justice here. Especially in the context of adversarial networks, there has been a flurry of recent activity relating variants of the Extra-Gradient algorithm to GAN training, see e.g., and references therein. For concreteness, we focus here on algorithms with a single-call structure and refer the reader to Sections 3, 4 and 5 for additional details.

The first variant of Extra-Gradient with a single oracle call per iteration dates back to Popov . This algorithm was subsequently studied by, among others, Chiang et al. , Rakhlin and Sridharan and Gidel et al. ; see also for a “reflected” variant, for an “optimistic” one, and Section 3 for a discussion of the differences between these variants. In the context of deterministic, strongly monotone variational inequalities with Lipschitz continuous operators, the last iterate of the method was shown to exhibit a geometric convergence rate ; similar geometric convergence results also extend to bilinear saddle-point problems , even though the operator involved is not strongly monotone. In turn, this implies the convergence of the method’s ergodic average, but at a O(1/t)\operatorname{\mathcal{O}}(1/t) rate (because of the hysteresis of the average). In view of this, the fact that 11-EG methods retain the optimal O(1/t)\operatorname{\mathcal{O}}(1/t) convergence rate in deterministic variational inequalities without strong monotonicity assumptions closes an important gap in the literature.A few weeks after the submission of our paper, we were made aware of a very recent preprint by Mokhtari et al. which also establishes a O(1/t)\operatorname{\mathcal{O}}(1/t) convergence rate for the algorithm’s “optimistic” variant in saddle-point problems (in terms of the Nikaido–Isoda gap function). To the best of our knowledge, this is the closest result to our own in the literature.

At the local level, the geometric convergence results discussed above echo a surge of interest in local convergence guarantees of optimization algorithms applied to games and saddle-point problems, see e.g., and references therein. In more detail, Liang and Stokes proved local geometric convergence for several algorithms in possibly non-monotone saddle-point problems under a local smoothness condition. In a similar vein, Daskalakis and Panageas analyzed the limit points of (optimistic) gradient descent, and showed that local saddle points are stable stationary points; subsequently, Adolphs et al. and Mazumdar et al. proposed a class of algorithms that eliminate stationary points which are not local Nash equilibria.

Geometric convergence results of this type are inherently deterministic because they rely on an associated resolvent operator being firmly nonexpansive – or, equivalently, rely on the use of the center manifold theorem. In a stochastic setting, these techniques are no longer applicable because the contraction property cannot be maintained in the presence of noise; in fact, unless the problem at hand is amenable to variance reduction – e.g., as in – geometric convergence is not possible if the noise process is even weakly isotropic. Instead, for monotone problems, Cui and Shanbhag and Gidel et al. showed that the ergodic average of the method attains a O(1/t)\operatorname{\mathcal{O}}(1/\sqrt{t}) convergence rate. Our global convergence results for stochastic variational inequalities improve this rate to O(1/t)\operatorname{\mathcal{O}}(1/t) in strongly monotone variational inequalities for both the method’s ergodic average and its last iterate. In the same light, our local O(1/t)\operatorname{\mathcal{O}}(1/t) convergence results for non-monotone variational inequalities provide a key extension of local, deterministic convergence results to a fully stochastic setting, all the while retaining the fastest convergence rate for monotone variational inequalities.

For convenience, our contributions relative to the state of the art are summarized in Table 1.

Problem setup and blanket assumptions

To provide some intuition about (VI), we discuss two important examples below:

Given the original formulation of GANs as (stochastic) saddle-point problems , this observation has been at the core of a vigorous literature at the interface between optimization, game theory, and deep learning, see e.g., and references therein.∎

The operator analogue of convexity for a function is monotonicity, i.e.,

Specifically, when V=fV=\nabla f for some sufficiently smooth function ff, this condition is equivalent to ff being convex . In this case, following Nesterov and Juditsky et al. , the quality of a candidate solution x^X\hat{x}\in\mathcal{X} can be assessed via the so-called error (or merit) function

Assume VV is monotone. If xx^{\star} is a solution of (VI), we have Err(x)=0\operatorname{Err}(x^{\star})=0 and ErrR(x)=0\operatorname{Err}_{R}(x^{\star})=0 for all sufficiently large RR. Conversely, if ErrR(x^)=0\operatorname{Err}_{R}(\hat{x})=0 for large enough R>0R>0 and some x^XR\hat{x}\in\mathcal{X}_{R}, then x^\hat{x} is a solution of (VI).

color=Orchid!20!LightGray,author=Yu-Guan:]We use mean squared error in the strongly monotone case. In light of this result, Err\operatorname{Err} and ErrR\operatorname{Err}_{R} will be among our principal measures of convergence in the sequel.

Blanket assumptions.

With all this in hand, we present below the main assumptions that will underlie the bulk of the analysis to follow.

The solution set X\mathcal{X}^{\star} of (VI) is nonempty.

The operator VV is β\beta-Lipschitz continuous, i.e.,

In some cases, we will also strengthen Assumption 3 to:

The operator VV is α\alpha-strongly monotone, i.e.,

Throughout our paper, we will be interested in sequences of points XtXX_{t}\in\mathcal{X} generated by algorithms that can access the operator VV via a stochastic oracle .Depending on the algorithm, the sequence index tt may take positive integer or half-integer values (or both). Formally, this is a black-box mechanism which, when called at XtXX_{t}\in\mathcal{X}, returns the estimate

In the above, Ft\mathcal{F}_{t} denotes the history (natural filtration) of XtX_{t}, so XtX_{t} is adapted to Ft\mathcal{F}_{t} by definition; on the other hand, since the tt-th instance of ZtZ_{t} is generated randomly from XtX_{t}, ZtZ_{t} is not adapted to Ft\mathcal{F}_{t}. Obviously, if σ2=0\sigma^{2}=0, we have the deterministic, perfect feedback case Vt=V(Xt)V_{t}=V(X_{t}).

Algorithms

In the general framework outlined in the previous section, the Extra-Gradient (EG) algorithm of Korpelevich can be stated in recursive form as

Single-call variants of the Extra-Gradient algorithm.

Given the significant computational overhead of gradient calculations, a key desideratum is to drop the second oracle call in (EG) while retaining the algorithm’s “anticipatory” properties. In light of this, we will focus on methods that perform a single oracle call at the leading state Xt+1/2X_{t+1/2}, but replace the update rule for Xt+1/2X_{t+1/2} (and, possibly, XtX_{t} as well) with a proxy that compensates for the missing gradient. Concretely, we will examine the following family of single-call extra-gradient (11-EG) algorithms:

[Proxy: use Vt1/2V_{t-1/2} instead of VtV_{t} in the calculation of Xt+1/2X_{t+1/2}; use Xt+1/2+γtVt1/2X_{t+1/2}+\gamma_{t}V_{t-1/2} instead of XtX_{t} in the calculation of Xt+1X_{t+1}; no projection]

These are the main algorithmic schemes that we will consider, so a few remarks are in order. First, given the extensive literature on the subject, this list is not exhaustive; see e.g., for a generalization of (OG), for a variant that employs averaging to update the algorithm’s base state XtX_{t}, and for a proxy defined via “negative momentum”. Nevertheless, the algorithms presented above appear to be the most widely used single-call variants of (EG), and they illustrate very clearly the two principal mechanisms for approximating missing gradients: (i ) using past gradients (as in the Past Extra-Gradient (PEG) and Optimistic Gradient (OG) variants); and/or (ii ) using a difference of successive states (as in the Reflected Gradient (RG) variant).

We also take this opportunity to provide some background and clear up some issues on terminology regarding the methods presented above. First, the idea of using past gradients dates back at least to Popov , who introduced (PEG) as a “modified Arrow–Hurwicz” method a few years after the original paper of Korpelevich ; the same algorithm is called “meta” in and “extrapolation from the past” in (but see also the note regarding optimism below). The terminology “Reflected Gradient” and the precise formulation that we use here for (RG) is due to Malitsky . The well-known primal-dual algorithm of Chambolle and Pock can be seen as a one-sided, alternating variant of the method for saddle-point problems; see also for a more recent take.

Finally, the terminology “optimistic” is due to Rakhlin and Sridharan , who provided a unified view of (PEG) and (EG) based on the sequence of oracle vectors used to update the algorithm’s leading state Xt+1/2X_{t+1/2}.More precisely, Rakhlin and Sridharan use the term Optimistic Mirror Descent (OMD) in reference to the Mirror-Prox method of Nemirovski , itself a variant of (EG) with projections defined by means of a Bregman function; for a related treatment, see Nesterov and Juditsky et al. . Because the framework of encompasses two different algorithms, there is some danger of confusion regarding the use of the term “optimism”; in particular, both (EG) and (PEG) can be seen as instances of optimism. The specific formulation of (OG) that we present here is the projected version of the algorithm considered by Daskalakis et al. ;To see this, note that the difference between two consecutive intermediate steps Xt1/2X_{t-1/2} and Xt+1/2X_{t+1/2} can be written as Xt+1/2=ΠX(Xt1/2(γt1+γt)Vt1/2+γt1Vt3/2)X_{t+1/2}=\operatorname{\operatorname{\Pi}}_{\mathcal{X}}(X_{t-1/2}-(\gamma_{t-1}+\gamma_{t})V_{t-1/2}+\gamma_{t-1}V_{t-3/2}). Writing (OG) in the form presented above shows that (OG) can also be viewed as a single-call variant of the FBF method of Tseng . by contrast, the “optimistic” method of Mertikopoulos et al. is equivalent to (EG) – not (PEG) or (OG).

The proof of this proposition follows by a simple rearrangement of the update rules for (PEG), (RG) and (OG), so we omit it. In the projected case, the 11-EG updates presented above are no longer equivalent – though, of course, they remain closely related.

Deterministic analysis

We begin with the deterministic analysis, i.e., when the optimizer receives oracle feedback of the form (7) with σ=0\sigma=0. In terms of presentation, we keep the global and local cases separated and we interleave our results for the generated sequence XtX_{t} and its ergodic average. To streamline our presentation, we defer the details of the proofs to the paper’s supplement and only discuss here the main ideas.

Our first result below shows that the algorithms under study achieve the optimal O(1/t)\operatorname{\mathcal{O}}(1/t) ergodic convergence rate in monotone problems with Lipschitz continuous operators.

Suppose that VV satisfies Assumptions 1, 2 and 3. Assume further that a 11-EG algorithm is run with perfect oracle feedback and a constant step-size γ<1/(cβ)\gamma<1/(c\beta), where c=1+2c=1+\sqrt{2} for the RG variant and c=2c=2 for the PEG and OG variants. Then, for all R>0R>0, we have

where Xˉt=t1s=1tXs+1/2\bar{X}_{t}=t^{-1}\sum_{s=1}^{t}X_{s+1/2} is the ergodic average of the algorithm’s sequence of leading states.

This result shows that the EG and 11-EG algorithms share the same convergence rate guarantees, so we can safely drop one gradient calculation per iteration in the monotone case. The proof of the theorem is based on the following technical lemma which enables us to treat the different variants of the 11-EG method in a unified way.

for all pXRp\in\mathcal{X}_{R} and all s{1,,t}s\in\{1,\dotsc,t\}. Then,

For Examples 1 and 2 it is possible to state both Theorem 1 and Lemma 2 with more adapted measures. We refer the readers to the supplement for more details.

The use of Lemma 2 is tailored to time-averaged sequences like Xˉt\bar{X}_{t}, and relies on establishing a suitable “quasi-descent inequality” of the form (10) for the iterates of 11-EG. Doing this requires in turn a careful comparison of successive iterates of the algorithm via the Lipschitz continuity assumption for VV; we defer the precise treatment of this argument to the paper’s supplement.

On the other hand, because the role of averaging is essential in this argument, the convergence of the algorithm’s last iterate requires significantly different techniques. To the best of our knowledge, there are no comparable convergence rate guarantees for XtX_{t} under Assumptions 1, 2 and 3; however, if Assumption 3 is strengthened to Assumption 3(s), the convergence of XtX_{t} to the (necessarily unique) solution of (VI) occurs at a geometric rate. For completeness, we state here a consolidated version of the geometric convergence results of Malitsky , Gidel et al. , and Mokhtari et al. .

Assume that VV satisfies Assumptions 1, 2 and 3(s), and let xx^{\star} denote the (necessarily unique) solution of (VI). If a 11-EG algorithm is run with a sufficiently small step-size γ\gamma, the generated sequence XtX_{t} converges to xx^{\star} at a rate of Xtx=O(exp(ρt))\lVert X_{t}-x^{\star}\rVert=\operatorname{\mathcal{O}}(\exp(-\rho\,t)) for some ρ>0\rho>0.

2 Local convergence

We continue by presenting a local convergence result for deterministic, non-monotone problems. To state it, we will employ the following notion of regularity in lieu of Assumptions 1, 2 and 3 and 3(s).

We say that xx^{\star} is a regular solution of (VI) if VV is C1C^{1}-smooth in a neighborhood of xx^{\star} and the Jacobian JacV(x)\operatorname{Jac}_{V}(x^{\star}) is positive-definite along rays emanating from xx^{\star}, i.e.,

This notion of regularity is an extension of similar conditions that have been employed in the local analysis of loss minimization and saddle-point problems. More precisely, if V=fV=\nabla f for some loss function ff, this definition is equivalent to positive-definiteness of the Hessian along qualified constraints [5, Chap. 3.2]. As for saddle-point problems and smooth games, variants of this condition can be found in several different sources, see e.g., and references therein.

Under this condition, we obtain the following local geometric convergence result for 11-EG methods.

Let xx^{\star} be a regular solution of (VI). If a 11-EG method is run with perfect oracle feedback and is initialized sufficiently close to xx^{\star} with a sufficiently small constant step-size,we have Xtx=O(exp(ρt))\lVert X_{t}-x^{\star}\rVert=\operatorname{\mathcal{O}}(\exp(-\rho\,t)) for some ρ>0\rho>0.

The proof of this theorem relies on showing that (i ) VVessentially behaves like a smooth, strongly monotone operator close to xx^{\star}; and (ii ) if the method is initialized in a small enough neighborhood of xx^{\star}, it will remain in said neighborhood for all tt. As a result, Theorem 4 essentially follows by “localizing” Theorem 2 to this neighborhood.

As a preamble to our stochastic analysis in the next section, we should state here that, albeit straightforward, the proof strategy outlined above breaks down if we have access to VV only via a stochastic oracle. In this case, a single “bad” realization of the feedback noise ZtZ_{t} could drive the process away from the attraction region of any local solution of (VI). For this reason, the stochastic analysis requires significantly different tools and techniques and is considerably more intricate.

Stochastic analysis

We now present our analysis for stochastic variational inequalities with oracle feedback of the form (7). For concreteness, given that the PEG variant of the 11-EG method employs the most straightforward proxy mechanism, we will focus on this variant throughout; for the other variants, the proofs and corresponding explicit expressions follow from the same rationale (as in the case of Theorem 1).

As we mentioned in the introduction, under Assumptions 1, 2 and 3, Cui and Shanbhag and Gidel et al. showed that 11-EG methods attain a O(1/t)\operatorname{\mathcal{O}}(1/\sqrt{t}) ergodic convergence rate. By strengthening Assumption 3 to Assumption 3(s), we show that this result can be augmented in two synergistic ways: under Assumptions 1, 2 and 3(s), both the last iterate and the ergodic average of 11-EG achieve a O(1/t)\operatorname{\mathcal{O}}(1/t) convergence rate.

Suppose that VV satisfies Assumptions 1, 2 and 3(s), and assume that (PEG) is run with stochastic oracle feedback of the form (7) and a step-size of the form γt=γ/(t+b)\gamma_{t}=\gamma/(t+b) for some γ>1/α\gamma>1/\alpha and b4βγb\geq 4\beta\gamma. Then, the generated sequence of the algorithm’s base states satisfies

while its ergodic average Xˉt=t1s=1tXs\bar{X}_{t}=t^{-1}\sum_{s=1}^{t}X_{s} enjoys the bound color=DodgerBlue!30,author=Pan:]Why O(1/t)\operatorname{\mathcal{O}}(1/t)? I think it would be better to write o(logt/t)o(\log t/t). Also, I would suggest to write the O()\operatorname{\mathcal{O}}(\dots) with 1/1/\cdots, otherwise they stand out too much (because of the huge parentheses). color=Orchid!20!LightGray,author=Yu-Guan:]I prefer to stay with 1t\frac{1}{t}. Let’s see what the others think.

Regarding our proof strategy for the last iterate of the process, we can no longer rely either on a contraction argument or the averaging mechanism that yields the O(1/t)\operatorname{\mathcal{O}}(1/\sqrt{t}) ergodic convergence rate. Instead, we show in the appendix that XtX_{t} is (stochastically) quasi-Fejér in the sense of ; then, leveraging the method’s specific step-size, we employ successive numerical sequence estimates to control the summability error and obtain the O(1/t)\operatorname{\mathcal{O}}(1/t) rate.

2 Local convergence

We proceed to examine the convergence of the method in the stochastic, non-monotone case. Our main result in this regard is the following.

There are neighborhoods UU and U1U_{1} of xx^{\star} in X\mathcal{X} such that, if X1/2U,X1U1X_{1/2}\in U,X_{1}\in U_{1}, the event

occurs with probability at least 1δ1-\delta.

where M=supxUV(x)<M=\sup_{x\in U}\lVert V(x)\rVert<\infty and α=infxUV(x),xx/xx2>0\alpha=\inf_{x\in U}\langle V(x),x-x^{\star}\rangle/\lVert x-x^{\star}\rVert^{2}>0.

The finiteness of MM and the positivity of α\alpha are both consequences of the regularity of xx^{\star} and their values only depend on the size of the neighborhood UU. Taking a larger UU would increase the algorithm’s certified initialization basin but it would also negatively impact its convergence rate (since MM would increase while α\alpha would decrease). Likewise, the neighborhood U1U_{1} only depends on the size of UU and, as we explain in the appendix, it suffices to take U1U_{1} to be “one fourth” of UU.

From the above, it becomes clear that the situation is significantly more involved than the corresponding deterministic analysis. This is also reflected in the proof of Theorem 6 which requires completely new techniques, well beyond the straightforward localization scheme underlying Theorem 4. More precisely, a key step in the proof (which we detail in the appendix) is to show that the iterates of the method remain close to xx^{\star} for all tt with arbitrarily high probability. In turn, this requires showing that the probability of getting a string of “bad” noise realizations of arbitrary length is controllably small. Even then however, the global analysis still cannot be localized because conditioning changes the probability law under which the oracle noise is unbiased. Accounting for this conditional bias requires a surprisingly delicate probabilistic argument which we also detail in the supplement.

Concluding remarks

Our aim in this paper was to provide a synthetic view of single-call surrogates to the Extra-Gradient algorithm, and to establish optimal convergence rates in a range of different settings – deterministic, stochastic, and/or non-monotone. Several interesting avenues open up as a result, from extending the theory to more general Bregman proximal settings, to developing an adaptive version as in the recent work for two-call methods. We defer these research directions to future work.

Acknowledgments

This work benefited from financial support by MIAI Grenoble Alpes (Multidisciplinary Institute in Artificial Intelligence). P. Mertikopoulos was partially supported by the French National Research Agency (ANR) grant ORACLESS (ANR–16–CE33–0004–01) and the EU COST Action CA16228 “European Network for Game Theory” (GAMENET).

References

Appendix A Technical lemmas

Since pCp\in\mathcal{C}, we have the following property x+(xy),x+p0\langle x^{+}-(x-y),x^{+}-p\rangle\leq 0, leading to

If C2C1\mathcal{C}_{2}\subseteq\mathcal{C}_{1}, for all pC2p\in\mathcal{C}_{2}, it holds

With x2+C2C1x^{+}_{2}\in\mathcal{C}_{2}\subseteq\mathcal{C}_{1}, we can apply A.1 to (x,y,x+,p,C)(x,y2,x2+,p,C2)(x,y,x^{+},p,\mathcal{C})\leftarrow(x,y_{2},x^{+}_{2},p,\mathcal{C}_{2}) and (x,y,x+,p,C)(x,y1,x1+,x2+,C1)(x,y,x^{+},p,\mathcal{C})\leftarrow(x,y_{1},x^{+}_{1},x^{+}_{2},\mathcal{C}_{1}), which yields

By summing (A.6) and (A.7), we readily get the first inequality of (A.4). We conclude with help of Young’s inequality 2y2y1,x1+x2+y2y12+x1+x2+22\langle y_{2}-y_{1},x^{+}_{1}-x^{+}_{2}\rangle\leq\lVert y_{2}-y_{1}\rVert^{2}+\lVert x^{+}_{1}-x^{+}_{2}\rVert^{2}. ∎

By substituting kt+bk\leftarrow t+b, (A.8) combined with (A.11) yields

Let us define atatq/((q1)(t+b))a^{\prime}_{t}\coloneqq a_{t}-q^{\prime}/((q-1)(t+b)). (A.12) becomes

This inequality holds for all tt0t\geq t_{0}. Then, either: • ata^{\prime}_{t} becomes non-positive for some t>t1=max(t0,qb)t>t_{1}=\max(t_{0},\lfloor q\rfloor-b), and (A.13) implies that this is also the case for all subsequent tt, which leads to

• or ata^{\prime}_{t} is positive for all t>t1t>t_{1} and we get

The Lipschitz continuity is straightforward: a C1C^{1}-smooth operator is necessarily locally Lipschitz and thus Lipshitz on every compact. The proof consists in establishing the existence of α\alpha. To this end, we consider the following function:

Consequently, writing z=xxTCX(x)z=x-x^{\star}\in\operatorname{TC}_{\mathcal{X}}(x^{\star}), xλ=x+λ(xx)Kx^{\prime}_{\lambda}=x^{\star}+\lambda(x-x^{\star})\in\mathcal{K}, we have

Finally, since xx^{\star} is a solution of (VI), we have V(x),xx0\langle V(x^{\star}),x-x^{\star}\rangle\geq 0 and

Appendix B Proofs for the deterministic setting

For any pXRp\in\mathcal{X}_{R}, we have X1p2R2\lVert X_{1}-p\rVert^{2}\leq R^{2}, and by monoticity of VV,

In other words, for all pXRp\in\mathcal{X}_{R},

Dividing the two sides of (B.3) by 2s=1tλs2\sum_{s=1}^{t}\lambda_{s} and maximizing over pXRp\in\mathcal{X}_{R} leads to the desired result.

B.2 Proof of Theorem 1

To facilitate analysis and presentation of our results, (PEG) and (OG) are initialized with random X12X_{\frac{1}{2}} and X1X_{1} in X\mathcal{X} while for (RG) we start with X0X_{0} and X12X_{\frac{1}{2}}. We are constrained to have different initial states in (RG) due to its specific formulation.

For t1t\geq 1, the second inequality of A.2 (b) applied to (x,y1,y2,x1+,x2+,C1,C2)(Xt,γV(Xt12),γV(Xt+12),Xt+12,Xt+1,X,X)(x,y_{1},y_{2},x^{+}_{1},x^{+}_{2},\mathcal{C}_{1},\mathcal{C}_{2})\leftarrow(X_{t},\gamma V(X_{t-\frac{1}{2}}),\gamma V(X_{t+\frac{1}{2}}),X_{t+\frac{1}{2}},X_{t+1},\mathcal{X},\mathcal{X}) results in

where we used the fact that VV is β\beta-Lipschitz continuous for the second inequality.

Now, let us use Young’s inequality a+b22a2+2b2\lVert a+b\rVert^{2}\leq 2\lVert a\rVert^{2}+2\lVert b\rVert^{2} to get

and the non-expansiveness of the projection to get for any t2t\geq 2,

where we used the fact that γ1/(2β)\gamma\leq 1/(2\beta) in the last inequality; and in order to display a telescopic term, we reformulate (B.8) as

We now substitute (B.9) in (B.4) to get for all t2t\geq 2,

and thus (10) holds true for all t2t\geq 2 with λt=γ\lambda_{t}=\gamma and μt=γ2β2Xt12Xt322\mu_{t}=\gamma^{2}\beta^{2}\lVert X_{t-\frac{1}{2}}-X_{t-\frac{3}{2}}\rVert^{2}.

which also matches (10) for t=1t=1 with λt=γ\lambda_{t}=\gamma, μ2\mu_{2} as defined previously, and μ1=4γ2β2X1X122X1X122\mu_{1}=4\gamma^{2}\beta^{2}\lVert X_{1}-X_{\frac{1}{2}}\rVert^{2}\leq\lVert X_{1}-X_{\frac{1}{2}}\rVert^{2}. Thus, Lemma 2 enables us to conclude the proof for Past Extra-Gradient (PEG).

Optimistic Gradient (OG).

The update of OG with constant step-size γ\gamma can be written as

One the one hand, since Xt+12=ΠX(XtγV(Xt12))X_{t+\frac{1}{2}}=\operatorname{\operatorname{\Pi}}_{\mathcal{X}}(X_{t}-\gamma V(X_{t-\frac{1}{2}})) and pXp\in\mathcal{X}, we have

On the other other hand, by definition of Xt+1X_{t+1} and the β\beta-Lipschitz continuity of VV,

Then, applying the same arguments used to get (B.9), we can show that for all t2t\geq 2,

Putting together (B.14), (B.15), (B.16), and (B.17), we obtain for γ1/(2β)\gamma\leq 1/(2\beta) and for all t2t\geq 2,

Reflected Gradient (RG).

As Xt=ΠX(Xt1γV(Xt12))X_{t}=\operatorname{\operatorname{\Pi}}_{\mathcal{X}}(X_{t-1}-\gamma V(X_{t-\frac{1}{2}})) and Xt1,Xt+1XX_{t-1},X_{t+1}\in\mathcal{X}, it follows

By summing (B.22) and (B.23) and rearranging the terms, we get

By using twice Young’s inequality: i) 2a,bεa2+(1/ε)b22\langle a,b\rangle\leq\varepsilon\lVert a\rVert^{2}+(1/\varepsilon)\lVert b\rVert^{2} with ε=1/2\varepsilon=1/\sqrt{2}; then ii) a+b2(1+ε)a2+(1+1/ε)b2\lVert a+b\rVert^{2}\leq(1+\varepsilon^{\prime})\lVert a\rVert^{2}+(1+1/\varepsilon^{\prime})\lVert b\rVert^{2} with ε=1+2\varepsilon^{\prime}=1+\sqrt{2}, we have

B.3 Lemma 2 with other suboptimality measures

Here we discuss how the statement of Lemma 2, and consequently also that of Theorem 1, can be adjusted to consider more adapted convergence measures in the cases of loss minimization and min-max optimization. The notations are those of Examples 1 and 2, and we write xˉ=(s=1tλs)1s=1tλsXs+12\bar{x}=(\sum_{s=1}^{t}\lambda_{s})^{-1}\sum_{s=1}^{t}\lambda_{s}X_{s+\frac{1}{2}}.

V=fV=\nabla f is monotone implies the convexity of ff, so

This is true for any pXp\in\mathcal{X}, and especially for pXp\in\mathcal{X}^{\star}. Let R=dist(x1,X)R=\operatorname{dist}(x_{1},\mathcal{X}^{\star}). By invoking (B.1), we conclude

Min-max optimization.

V=(θL,ϕL)V=(\nabla_{\theta}\mathcal{L},-\nabla_{\phi}\mathcal{L}) being monotone is equivalent to L\mathcal{L} being convex-concave. In such saddle-point problems, the quality of a candidate solution x^=(θ^,ϕ^)\hat{x}=(\hat{\theta},\hat{\phi}) is often assessed via the Nikaido–Isoda function , defined here as

provided of course that the right-hand side is well-posed. Its restricted variant NIR\operatorname{NI}_{R} can also be defined by analogy with the definition of ErrR\operatorname{Err}_{R}.

Let us denote Xs+12=(θs+12,ϕs+12)X_{s+\frac{1}{2}}=(\theta_{s+\frac{1}{2}},\phi_{s+\frac{1}{2}}) and p=(θ,ϕ)p=(\theta,\phi). By convex-concavity of L\mathcal{L}, it holds

We can again apply Jensen’s inequality to show that

B.4 Proof of Theorem 4

for the three algorithms with μt0\mu_{t}\geq 0. By imposing X12=X1X_{\frac{1}{2}}=X_{1} in PEG and OG, we get μ1=0\mu_{1}=0. Similarly, we may impose X0=X12X_{0}=X_{\frac{1}{2}} in RG, leading to μ1X1X02γ2V(X0)2\mu_{1}\leq\lVert X_{1}-X_{0}\rVert^{2}\leq\gamma^{2}\lVert V(X_{0})\rVert^{2}. It is thus possible to choose the adequate initial points and γ\gamma such that X1x2+μ1r24\lVert X_{1}-x^{\star}\rVert^{2}+\mu_{1}\leq\frac{r^{2}}{4}, which in turn guarantees Xtx2r24\lVert X_{t}-x^{\star}\rVert^{2}\leq\frac{r^{2}}{4}.

Part (ii). We now proceed to prove that we may choose γ\gamma sufficiently small such that if Xtx2r24\lVert X_{t}-x^{\star}\rVert^{2}\leq\frac{r^{2}}{4} and Xt12KX_{t-\frac{1}{2}}\in\mathcal{K} then Xt+12KX_{t+\frac{1}{2}}\in\mathcal{K}. We notice that for the three algorithms, we have

by the non-expansiveness of the projection.In particular this also holds for RG since then Xt+12Xt=XtXt1=ΠX(Xt1γV(Xt12))ΠX(Xt1)X_{t+\frac{1}{2}}-X_{t}=X_{t}-X_{t-1}=\operatorname{\operatorname{\Pi}}_{\mathcal{X}}(X_{t-1}-\gamma V(X_{t-\frac{1}{2}}))-\operatorname{\operatorname{\Pi}}_{\mathcal{X}}(X_{t-1}). We define MsupxKV(x)<M\coloneqq\sup_{x\in\mathcal{K}}\lVert V(x)\rVert<\infty where the finiteness of MM comes from the continuity of VV and the boundedness of K\mathcal{K}. We choose γr/(2M)\gamma\leq r/(2M) so that γ2V(Xt12)2r24\gamma^{2}\lVert V(X_{t-\frac{1}{2}})\rVert^{2}\leq\frac{r^{2}}{4} since Xt12KX_{t-\frac{1}{2}}\in\mathcal{K}. Then, by Young’s inequality, we get

In other words, Xt+12KX_{t+\frac{1}{2}}\in\mathcal{K}.

Conclusion. We first notice that the conditions on the initial points and the stepsize γ\gamma do not depend on the iteration. Thus, by simple induction we have that if we initialize the algorithm such that

Appendix C Proofs for the stochastic setting

Let us focus in this section on the (PEG) algorithm:

As in the proof of Theorem 1, we first apply A.2 (b) with (x,y1,y2,x1+,x2+,C1,C2)(Xt,γtVt12,γtVt+12,Xt+12,Xt+1,X,X)(x,y_{1},y_{2},x^{+}_{1},x^{+}_{2},\mathcal{C}_{1},\mathcal{C}_{2})\leftarrow(X_{t},\gamma_{t}V_{t-\frac{1}{2}},\gamma_{t}V_{t+\frac{1}{2}},X_{t+\frac{1}{2}},X_{t+1},\mathcal{X},\mathcal{X}) and the solution xXx^{\star}\in\mathcal{X} as a trial point to obtain

The following holds true thanks to the law of total expectation,

By Young’s inequality, β\beta-Lipschitz continuity of VV, and non-expansiveness of the projection, we have

Notice that the choice b4βγb\geq 4\beta\gamma implies 8γt2β2+2γtβ18\gamma_{t}^{2}\beta^{2}+2\gamma_{t}\beta\leq 1, which in turn yields 8γt2β21αγt8\gamma_{t}^{2}\beta^{2}\leq 1-\alpha\gamma_{t}. Combining (C.2) and (C.3), similarly to (B.9), we can thus show that

Since xx^{\star} is the unique solution of (VI), it follows V(x),Xt+12x0\langle V(x^{\star}),X_{t+\frac{1}{2}}-x^{\star}\rangle\geq 0. Consequently, with strong monotonicity of VV, we get

Taking expectation over (C.1) and using (C.4), (C.5), (C.8) leads to

Using 8γt2β2+2αγt108\gamma_{t}^{2}\beta^{2}+2\alpha\gamma_{t}-1\leq 0, (C.9) reduces to

The second term on the left-hand side (LHS) of the inequality is always positive, and (13) follows immediately.

Ergodic convergence.

The convergence of Xˉt\bar{X}_{t} as shown in (14) can be deduce directly from above by using Jensen’s inequality:

and then we bound the right-hand side (RHS) of the inequality by (13).

C.2 Proof of Theorem 6

We start by defining some important quantities that will be used in our proof. For any T1T\geq 1, we set

We additionally define Q02γ12V122Q_{0}\coloneqq 2\gamma_{1}^{2}\lVert V_{\frac{1}{2}}\rVert^{2}, H0{Q0ε}H_{0}\coloneqq\{Q_{0}\leq\varepsilon\} and H1E0ΩH_{-1}\coloneqq E_{0}\coloneqq\Omega, where Ω\Omega denotes the whole sample space. It follows from the definitions that both (HT)T1(H_{T})_{T\geq{-1}} and (ET)T0(E_{T})_{T\geq 0} are decreasing sequences of events. Moreover, we have HTFT+1H_{T}\in\mathcal{F}_{T+1} while ETFTE_{T}\in\mathcal{F}_{T}. Also notice that E=T0ETE_{\infty}=\bigcap_{T\geq 0}E_{T}.

In terms of notation, for an event EΩE\subseteq\Omega, we denote by \mathds1E\operatorname{\mathds{1}}_{E} its indicator function and EcE^{c} its complementary. For any pair of events E,FΩE,F\subseteq\Omega, we denote by EFE\setminus F the event “EE and not FF” i.e., EFcE\cap F^{c}.

The proof of the theorem relies on the two following lemmas.

For any T0T\geq 0, we have the inclusion HT1ETH_{T-1}\subseteq E_{T}.

Initialization: H1E0H_{-1}\subseteq E_{0} is clear. To prove that we also have H0E1H_{0}\subseteq E_{1}, we use Young’s inequality to get

On the one hand, since X1U1X_{1}\in U_{1} by assumption, it holds X1x2r216\lVert X_{1}-x^{\star}\rVert^{2}\leq\frac{r^{2}}{16}. On the other hand,

For any realization in H0H_{0}, we have 2γ1V122r282\gamma_{1}\lVert V_{\frac{1}{2}}\rVert^{2}\leq\frac{r^{2}}{8}; and so we can deduce from (C.18) that X32x2r24<r2\lVert X_{\frac{3}{2}}-x^{\star}\rVert^{2}\leq\frac{r^{2}}{4}<r^{2}. Since X32XX_{\frac{3}{2}}\in\mathcal{X}, it follows that X32UX_{\frac{3}{2}}\in U. This means that H0E1H_{0}\subseteq E_{1}.

Inductive step: Suppose that HT1ETH_{T-1}\subseteq E_{T} holds for some T1T\geq 1. We would like to prove HTET+1H_{T}\subseteq E_{T+1}. To do so, we show that XT+1x2716r2\lVert X_{T+1}-x^{\star}\rVert^{2}\leq\frac{7}{16}r^{2} for any realization in HTH_{T}. Applying A.2 (b) as in (C.1) yields for all t{1,...,T}t\in\{1,...,T\},

where in the last line we can use V(Xt+12),Xt+12x0\langle V(X_{t+\frac{1}{2}}),X_{t+\frac{1}{2}}-x^{\star}\rangle\geq 0 since by induction hypothesis, HTHT1ETH_{T}\subseteq H_{T-1}\subseteq E_{T}, which means for any realization in HTH_{T}, Xt+12UX_{t+\frac{1}{2}}\in U for all t{1,...,T}t\in\{1,...,T\}.

By definition of HTH_{T}, we have ST2QTr416S_{T}^{2}\leq Q_{T}\leq\frac{r^{4}}{16} (so STr24\lvert S_{T}\rvert\leq\frac{r^{2}}{4}) and RTQTr28R_{T}\leq Q_{T}\leq\frac{r^{2}}{8}. Using that X1x2r216\lVert X_{1}-x^{\star}\rVert^{2}\leq\frac{r^{2}}{16} by assumption, it follows immediately that XT+1x2716r2\lVert X_{T+1}-x^{\star}\rVert^{2}\leq\frac{7}{16}r^{2}.

Finally, in order to bound XT+32x2\lVert X_{T+\frac{3}{2}}-x^{\star}\rVert^{2}, we again rely on Young’s inequality:

For any realization in HTH_{T}, we have that

Thus, (C.22) implies that XT+32x2r2\lVert X_{T+\frac{3}{2}}-x^{\star}\rVert^{2}\leq r^{2}, and subsequently XT+32UX_{T+\frac{3}{2}}\in U. As HTETH_{T}\subseteq E_{T} and ET+1={XT+32U}ETE_{T+1}=\{X_{T+\frac{3}{2}}\in U\}\operatorname*{\cap}E_{T}, we have proven that HTET+1H_{T}\subseteq E_{T+1}. ∎

For t1t\geq 1, we have the following recurrence inequality

where M4M2+4σ2+4r2σ2\mathcal{M}\coloneqq 4M^{2}+4\sigma^{2}+4r^{2}\sigma^{2} and εmin(r28,r416)\varepsilon\coloneqq\min\left(\frac{r^{2}}{8},\frac{r^{4}}{16}\right).

Moreover, if t=1t=1, the bound can be refined to

where the second equality comes from the fact that as Ht1Ht2H_{t-1}\subseteq H_{t-2}, we have Ht1=Ht2(Ht2Ht1)H_{t-1}=H_{t-2}\setminus(H_{t-2}\setminus H_{t-1}).

Since St1S_{t-1} and Ht1H_{t-1} are Ft\mathcal{F}_{t}-measurable, we get

By C.1, Ht1EtH_{t-1}\subseteq E_{t} which means that for any realization in Ht1H_{t-1}, we have Xt+12UX_{t+\frac{1}{2}}\in U. Therefore, Xt+12x2\mathds1Ht1r2\mathds1Ht1\lVert X_{t+\frac{1}{2}}-x^{\star}\rVert^{2}\operatorname{\mathds{1}}_{H_{t-1}}\leq r^{2}\operatorname{\mathds{1}}_{H_{t-1}} and consequently

Using again that Ht1H_{t-1} is Ft\mathcal{F}_{t}-measurable along with the boundedness of the variance of Zt+12Z_{t+\frac{1}{2}} (see Eq. (8b)), we get

Applying once again the techniques above and relying on the boundedness of VV (as for any realization in Ht1EtH_{t-1}\subseteq E_{t} we have Xt+12UX_{t+\frac{1}{2}}\in U and M=supxUV(x)<M=\sup_{x\in U}V(x)<\infty), we get

Using that Ht1Ht2H_{t-1}\subseteq H_{t-2} and repeating the arguments leading to (C.32), we have

Combining (C.28), (C.29), (C.31), (C.32) and (C.33), we get

For the last term on the RHS of (C.27), we get by definition that for any realization in Ht2Ht1H_{t-2}\setminus H_{t-1}, Qt1>εQ_{t-1}>\varepsilon and thus

Substituting (C.34) and (C.35) into (C.27) gives exactly (C.25).

The case t=1t=1 is proved similarly. In fact,

Consequently by using H0E1H_{0}\subseteq E_{1}, we have

By definition H1H0={Q0>ε}H_{-1}\setminus H_{0}=\{Q_{0}>\varepsilon\}, which shows (C.35) is equally true with t=1t=1. (C.26) can then be immediately deduced from (C.27). ∎

The last line is true since QTQ_{T} is a positive random variable.

We now use Lemma C.2 by summing (C.25) from t=2t=2 to TT and (C.26) which leads to

We set Γt=1γt2<\Gamma\coloneqq\sum_{t=1}^{\infty}\gamma_{t}^{2}<\infty. Combining (C.38), (C.39) and (C.40), we obtain

Furthermore, for any realization in EtE_{t}, Xt+12UX_{t+\frac{1}{2}}\in U so that V(Xt+12),Xt+12xαXt+12x2\langle V(X_{t+\frac{1}{2}}),X_{t+\frac{1}{2}}-x^{\star}\rangle\geq\alpha\lVert X_{t+\frac{1}{2}}-x^{\star}\rVert^{2} and thus equation (C.8) holds, which allows us to write

We also recall that as EtFtE_{t}\in\mathcal{F}_{t}, it holds

Taking expectation over (C.44) then leads to

We can choose bb sufficiently large so that 2αγt102\alpha\gamma_{t}-1\leq 0 for all t1t\geq 1. Using EtEt1E_{t}\subseteq E_{t-1}, we obtain