Halpern Iteration for Near-Optimal and Parameter-Free Monotone Inclusion and Strong Solutions to Variational Inequalities

Jelena Diakonikolas

Introduction

the monotone inclusion problem consists in finding a point u\mathbf{u}^{*} that satisfies:

Monotone inclusion is a fundamental problem in continuous optimization that is closely related to variational inequalities (VIs) with monotone operators, which model a plethora of problems in mathematical programming, game theory, engineering, and finance (Facchinei and Pang, 2003, Section 1.4). Within machine learning, VIs with monotone operators and associated monotone inclusion problems arise, for example, as an abstraction of convex-concave min-max optimization problems, which naturally model adversarial training (Madry et al., 2018; Arjovsky et al., 2017; Arjovsky and Bottou, 2017; Goodfellow et al., 2014).

On the other hand, approximate monotone inclusion is well-defined even for unbounded feasible sets. In the context of min-max optimization, it corresponds to guarantees in terms of stationarity. Specifically, in the unconstrained setting, solving monotone inclusion corresponds to minimizing the norm of the gradient of Φ.\Phi. Note that even in the special setting of convex optimization, convergence in norm of the gradient is much less understood than convergence in optimality gap (Nesterov, 2012; Kim and Fessler, 2018). Further, unlike classical results for VIs that provide convergence guarantees for approximating weak solutions (Nemirovski, 2004; Nesterov, 2007), approximations to monotone inclusion lead to approximations to strong solutions (see Section 1.2 for definitions of weak and strong solutions and their relationship to monotone inclusion).

We leverage the connections between nonexpansive maps, structured monotone operators, and proximal maps to obtain near-optimal algorithms for solving monotone inclusion over different classes of problems with Lipschitz-continuous operators. In particular, we make use of the classical Halpern iteration, which is defined by (Halpern, 1967):

In addition to its simplicity, Halpern iteration is particularly relevant to machine learning applications, as it is an implicitly regularized method with the following property: if the set of fixed points of TT is non-empty, then Halpern iteration (Hal) started at a point u0\mathbf{u}_{0} and applied with any choice of step sizes {λk}k1\{\lambda_{k}\}_{k\geq 1} that satisfy all of the following conditions:

A special case of what is now known as the Halpern iteration (Hal) was introduced and its asymptotic convergence properties were analyzed by Halpern (1967) in the setting of u0=0\mathbf{u}_{0}=\textbf{0} and T:B2B2,T:\mathcal{B}_{2}\to\mathcal{B}_{2}, where B2\mathcal{B}_{2} is the unit Euclidean ball. Using the proof-theoretic techniques of Kohlenbach (2008), Leustean (2007) extracted from the asymptotic convergence result of Wittmann (1992) the rate at which Halpern iteration converges to a fixed point. The results obtained by Leustean (2007) are rather loose and provide guarantees of the form T(uk)uk=O(Mlog(k))\|T(\mathbf{u}_{k})-\mathbf{u}_{k}\|=O(\frac{M}{\log(k)}) in the best case (obtained for λk=Θ(1k)\lambda_{k}=\Theta(\frac{1}{k})), where Mu0+T(u0)+uk,M\geq\|\mathbf{u}_{0}\|+\|T(\mathbf{u}_{0})\|+\|\mathbf{u}_{k}\|, k.\forall k. A tighter result that shows that T(uk)uk\|T(\mathbf{u}_{k})-\mathbf{u}_{k}\| decreases at rate that is at least as good as 1/k1/\sqrt{k} was obtained by Kohlenbach (2011). The results of Leustean (2007) and Kohlenbach (2011) apply to general normed spaces. The work of Kohlenbach (2011) also provided an explicit rate of metastability that characterizes the convergence of the sequence of iterates {uk}\{\mathbf{u}_{k}\} in Hilbert spaces.

More recently, Lieder (2017) proved that under the standard assumption that TT has a fixed point u\mathbf{u}^{*} and for the step size λk=1k+1,\lambda_{k}=\frac{1}{k+1}, Halpern iteration converges to a fixed point as T(uk)uk=2u0uk+1.\|T(\mathbf{u}_{k})-\mathbf{u}_{k}\|=\frac{2\|\mathbf{u}_{0}-\mathbf{u}^{*}\|}{k+1}. A similar result but for an alternative algorithm was recently obtained by Kim (2019). These two results (as well as all the results from this paper) only apply to Hilbert spaces. Unlike Halpern iteration, the algorithm introduced by Kim (2019) is not known to possess the implicit regularization property discussed earlier in this paper. The results of Lieder (2017) and Kim (2019) can be used to obtain the same 1/k1/k convergence rate for monotone inclusion with a cocoercive operator but only if the cocoercivity parameter is known, which is rarely the case in practice. Similarly, those results can also be extended to more general monotone Lipschitz operators but only if the proximal map (or resolvent) of FF can be computed exactly, an assumption that can rarely be met (see Section 1.2 for definitions of cocoercive operators and proximal maps). We also note that the results of Lieder (2017) and Kim (2019) were obtained using the performance estimation (PEP) framework of Drori and Teboulle (2014). The convergence proofs resulting from the use of PEP are computer-assisted: they are generated as solutions to large semidefinite programs, which typically makes them hard to interpret and generalize.

Our approach is arguably simpler, as it relies on the use of a potential function, which allows us to remove the assumptions about the knowledge of the problem parameters and availability of exact proximal maps. Our main contributions are summarized as follows:

We introduce a new, potential-based, proof of convergence of Halpern iteration that applies to more general step sizes λk\lambda_{k} than handled by the analysis of Lieder (2017) (Section 2). The proof is simple and only requires elementary algebra. Further, the proof is derived for cocoercive operators and leads to a parameter-free algorithm for monotone inclusion. We also extend this parameter-free method to the constrained setting using the concept of gradient mapping generalized to monotone operators (Section 2.1). To the best of our knowledge, this is the first work to obtain the 1/k1/k convergence rate with a parameter-free method.

Results for monotone Lipschitz operators.

Up to a logarithmic factor, we obtain the same 1/k1/k convergence rate for the parameter-free setting of the more general monotone Lipschitz operators (Section 2.2). The best known convergence rate established by previous work for the same setting was of the order 1/k1/\sqrt{k} (Dang and Lan, 2015; Ryu et al., 2019). We obtain the improved convergence rate through the use of the Halpern iteration with inexact proximal maps that can be implemented efficiently. The idea of coupling inexact proximal maps with another method is similar in spirit to the Catalyst framework (Lin et al., 2017) and other instantiations of the inexact proximal-point method, such as, e.g., in the work of Davis and Drusvyatskiy (2019); Asi and Duchi (2019); Lin et al. (2018). However, we note that, unlike in the previous work, the coupling used here is with a method (Halpern iteration) whose convergence properties were not well-understood and for which no simple potential-based convergence proof existed prior to our work.

Results for strongly monotone Lipschitz operators.

We show that a simple restarting-based approach applied to our method for operators that are only monotone and Lipschitz (described above) leads to a parameter-free method for strongly monotone and Lipschitz operators (Section 2.3). Under mild assumptions about the problem parameters and up to a poly-logarithmic factor, the resulting algorithm is iteration-complexity-optimal. To the best of our knowledge, this is the first near-optimal parameter-free method for the setting of strongly monotone Lipschitz operators and any of the associated problems – monotone inclusion, VIs, or convex-concave min-max optimization.

Lower bounds.

To certify near-optimality of the analyzed methods, we provide lower bounds that rely on algorithmic reductions between different problem classes and highlight connections between them (Section 3). The lower bounds are derived by leveraging the recent lower bound of Ouyang and Xu (2019) for approximating the optimality gap in convex-concave min-max optimization.

2 Notation and Preliminaries

Let UE\mathcal{U}\subseteq E be closed and convex, and let F:EEF:E\rightarrow E be an LL-Lipschitz-continuous operator defined on U.\mathcal{U}. Namely, we assume that:

The definition of monotonicity was already provided in Eq. (1.1), and easily specializes to monotonicity on the set U\mathcal{U} by restricting u,v\mathbf{u},\mathbf{v} to be from U.\mathcal{U}. Further, FF is said to be:

strongly monotone (or coercive) on U\mathcal{U} with parameter mm, if:

cocoercive on U\mathcal{U} with parameter γ\gamma, if:

It is immediate from the definition of cocoercivity that every γ\gamma-cocoercive operator is monotone and 1/γ1/\gamma-Lipschitz. The latter follows by applying the Cauchy-Schwarz inequality to the left-hand side of Eq. (1.5) and then dividing both sides by γF(u)F(v)\gamma\|F(\mathbf{u})-F(\mathbf{v})\|.

Examples of monotone operators include the gradient of a convex function and appropriately modified gradient of a convex-concave function. Namely, if a function Φ(x,y)\Phi(\mathbf{x},\mathbf{y}) is convex in x\mathbf{x} and concave in y,\mathbf{y}, then F([\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\mathbf{x}\\ \mathbf{y}\crcr}}])=[\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\nabla_{\mathbf{x}}\Phi(\mathbf{x},\mathbf{y})\\ -\nabla_{\mathbf{y}}\Phi(\mathbf{x},\mathbf{y})\crcr}}] is monotone.

The Stampacchia Variational Inequality (SVI) problem consists in finding uU\mathbf{u}^{*}\in\mathcal{U} such that:

In this case, u\mathbf{u}^{*} is also referred to as a strong solution to the variational inequality (VI) corresponding to FF and U\mathcal{U}. The Minty Variational Inequality (MVI) problem consists in finding u\mathbf{u}^{*} such that:

in which case u\mathbf{u}^{*} is referred to as a weak solution to the variational inequality corresponding to FF and U\mathcal{U}. In general, if FF is continuous, then the solutions to (MVI) are a subset of the solutions to (SVI). If we assume that FF is monotone, then (1.1) implies that every solution to (SVI) is also a solution to (MVI), and thus the two solution sets are equivalent. The solution set to monotone inclusion is the same as the solution set to (SVI).

Approximate versions of variational inequality problems (SVI) and (MVI) are defined as follows: Given ϵ>0,\epsilon>0, find an ϵ\epsilon-approximate solution uϵU,\mathbf{u}^{*}_{\epsilon}\in\mathcal{U}, which is a solution that satisfies:

Clearly, when FF is monotone, an ϵ\epsilon-approximate solution to (SVI) is also an ϵ\epsilon-approximate solution to (MVI); the reverse does not hold in general.

Similarly, ϵ\epsilon-approximate monotone inclusion can be defined as fidning uϵ\mathbf{u}^{*}_{\epsilon} that satisfies:

where B(ϵ)\mathcal{B}(\epsilon) is the ball w.r.t. \|\cdot\|, centered at 0 and of radius ϵ.\epsilon. We will sometimes write Eq. (1.6) in the equivalent form F(uϵ)IU(uϵ)+B(ϵ).-F(\mathbf{u}^{*}_{\epsilon})\in\partial I_{\mathcal{U}}(\mathbf{u}^{*}_{\epsilon})+\mathcal{B}(\epsilon). The following fact is immediate from Eq. (1.6).

Given FF and U,\mathcal{U}, let uϵ\mathbf{u}^{*}_{\epsilon} satisfy Eq. (1.6). Then:

where Buϵ\mathcal{B}_{\mathbf{u}^{*}_{\epsilon}} denotes the unit ball w.r.t. ,\|\cdot\|, centered at Buϵ.\mathcal{B}_{\mathbf{u}^{*}_{\epsilon}}.

Further, if the diameter of U\mathcal{U}, D=supu,vUuvD=\sup_{\mathbf{u},\mathbf{v}\in\mathcal{U}}\|\mathbf{u}-\mathbf{v}\|, is bounded, then:

Thus, when the diameter DD is bounded, any ϵD\frac{\epsilon}{D}-approximate solution to monotone inclusion is an ϵ\epsilon-approximate solution to (SVI) (and thus also to (MVI)); the converse does not hold in general. Recall that when DD is unbounded, neither (SVI) nor (MVI) can be approximated.

Nonexpansive Maps.

Let T:EET:E\to E. We say that TT is nonexpansive on UE\mathcal{U}\subseteq E, if u,vU:\forall\mathbf{u},\mathbf{v}\in\mathcal{U}:

Nonexpansive maps are closely related to cocoercive operators, and here we summarize some of the basic properties that are used in our analysis. More information can be found in, e.g., the book by Bauschke and Combettes (2011).

TT is said to be firmly nonexpansive or averaged, if u,vU:\forall\mathbf{u},\mathbf{v}\in\mathcal{U}:

Useful properties of firmly nonexpansive maps are summarized in the following fact.

Halpern Iteration for Monotone Inclusion and Variational Inequalities

Halpern iteration is typically stated for nonexpansive maps TT as in (Hal). Because our interest is in cocoercive operators FF with the unknown parameter 1/L,1/L, we instead work with the following version of the Halpern iteration:

where Lk(0,),k.L_{k}\in(0,\infty),\,\forall k. If LL was known, we could simply set Lk+1=L,L_{k+1}=L, in which case (H) would be equivalent to the standard Halpern iteration, due to Fact 1.2. We assume throughout that λ1=12.\lambda_{1}=\frac{1}{2}.

We start with the assumption that the setting is unconstrained: UE.\mathcal{U}\equiv E. We will see in Section 2.1 how the result can be extended to the constrained case. Section 2.2 will consider the case of operators that are monotone and Lipschitz, while Section 2.3 will deal with the strongly monotone and Lipschitz case. Some of the proofs are omitted and are instead provided in Appendix A.

To analyze the convergence of (H) for the appropriate choices of sequences {λi}i1\{\lambda_{i}\}_{i\geq 1} and {Li}i1,\{L_{i}\}_{i\geq 1}, we make use of the following potential function:

Let us first show that if AkCkA_{k}\mathcal{C}_{k} is non-increasing with kk for an appropriately chosen sequence of positive numbers {Ak}k1,\{A_{k}\}_{k\geq 1}, then we can deduce a property that, under suitable conditions on {λi}i1\{\lambda_{i}\}_{i\geq 1} and {Li}i1,\{L_{i}\}_{i\geq 1}, implies a convergence rate for (H).

Let Ck\mathcal{C}_{k} be defined as in Eq. (2.1) and let u\mathbf{u}^{*} be the solution to (MI) that minimizes u0u\|\mathbf{u}_{0}-\mathbf{u}^{*}\|. Assume further that F(u1)F(u0),u1u01L1F(u1)F(u0)2.\left\langle F(\mathbf{u}_{1})-F(\mathbf{u}_{0}),\mathbf{u}_{1}-\mathbf{u}_{0}\right\rangle\geq\frac{1}{L_{1}}\|F(\mathbf{u}_{1})-F(\mathbf{u}_{0})\|^{2}. If Ak+1Ck+1AkCk,A_{k+1}\mathcal{C}_{k+1}\leq A_{k}\mathcal{C}_{k}, k1,\forall k\geq 1, where {Ai}i1\{A_{i}\}_{i\geq 1} is a sequence of positive numbers that satisfies A1=1A_{1}=1, then:

Using Lemma 2.1, our goal is now to show that we can choose Lk=O(L)L_{k}=O(L) and λk=O(1k),\lambda_{k}=O(\frac{1}{k}), which in turn would imply the desired 1/k1/k convergence rate: F(uk)=O(Lu0uk).\|F(\mathbf{u}_{k})\|=O(\frac{L\|\mathbf{u}_{0}-\mathbf{u}^{*}\|}{k}). The following lemma provides sufficient conditions for {Ai}i1,\{A_{i}\}_{i\geq 1}, {λi}i1\{\lambda_{i}\}_{i\geq 1}, and {Li}i1\{L_{i}\}_{i\geq 1} to ensure that Ak+1Ck+1AkCk,A_{k+1}\mathcal{C}_{k+1}\leq A_{k}\mathcal{C}_{k}, k1,\forall k\geq 1, so that Lemma 2.1 applies.

Let Ck\mathcal{C}_{k} be defined as in Eq. (2.1). Let {Ai}i1\{A_{i}\}_{i\geq 1} be defined recursively as A1=1A_{1}=1 and Ak+1=Akλk(1λk)λk+1A_{k+1}=A_{k}\frac{\lambda_{k}}{(1-\lambda_{k})\lambda_{k+1}} for k1.k\geq 1. Assume that {λi}i1\{\lambda_{i}\}_{i\geq 1} is chosen so that λ1=12\lambda_{1}=\frac{1}{2} and for k1:k\geq 1: λk+112λk+1λkLk(1λk)Lk+1\frac{\lambda_{k+1}}{1-2\lambda_{k+1}}\geq\frac{\lambda_{k}L_{k}}{(1-\lambda_{k})L_{k+1}}. Finally, assume that Lk(0,)L_{k}\in(0,\infty) and F(uk)F(uk1),ukuk11LkF(uk)F(uk1)2\left\langle F(\mathbf{u}_{k})-F(\mathbf{u}_{k-1}),\mathbf{u}_{k}-\mathbf{u}_{k-1}\right\rangle\geq\frac{1}{L_{k}}\|F(\mathbf{u}_{k})-F(\mathbf{u}_{k-1})\|^{2}, k.\forall k. Then,

Observe first the following. If we knew LL and set Lk=L,L_{k}=L, λk=1k+1,\lambda_{k}=\frac{1}{k+1}, and Ak=k(k+1)/2,A_{k}=k(k+1)/2, then all of the conditions from Lemma 2.2 would be satisfied, and Lemma 2.1 would then imply F(uk)Lu0uk,\|F(\mathbf{u}_{k})\|\leq\frac{L\|\mathbf{u}_{0}-\mathbf{u}^{*}\|}{k}, which recovers the result of Lieder (2017). The choice λk=1k+1\lambda_{k}=\frac{1}{k+1} is also the tightest possible that satisfies the conditions Lemma 2.2 – the inequality relating λk+1\lambda_{k+1} and λk\lambda_{k} is satisfied with equality. This result is in line with the numerical observations made by Lieder (2017), who observed that the convergence of Halpern iteration is fastest for λk=1k+1\lambda_{k}=\frac{1}{k+1}.

To construct a parameter-free method, we use that FF is LL-cocoercive; namely, that there exists a constant L<L<\infty such that FF satisfies Eq. (1.5) with γ=1/L\gamma=1/L. The idea is to start to with a “guess” of LL (e.g., L0=1L_{0}=1) and double the guess LkL_{k} as long as F(uk)F(uk1),ukuk1<1LkF(uk)F(uk1)2.\left\langle F(\mathbf{u}_{k})-F(\mathbf{u}_{k-1}),\mathbf{u}_{k}-\mathbf{u}_{k-1}\right\rangle<\frac{1}{L_{k}}\|F(\mathbf{u}_{k})-F(\mathbf{u}_{k-1})\|^{2}. The total number of times that the guess can be doubled is bounded above by max{0,log2(2L/L0)}.\max\{0,\log_{2}(2L/L_{0})\}. Parameter λk\lambda_{k} is simply chosen to satisfy the condition from Lemma 2.2. The algorithm pseudocode is stated in Algorithm 1 for a given accuracy specified at the input.

We now prove the first of our main results. Note that the total number of arithmetic operations in Algorithm 1 is of the order of the number of oracle queries to FF multiplied by the complexity of evaluating FF at a point. The same will be true for all the algorithms stated in this paper, except that the complexity of evaluating FF may be replaced by the complexity of projections onto U\mathcal{U}.

Given u0U\mathbf{u}_{0}\in\mathcal{U} and an operator FF that is 1L\frac{1}{L}-cocoercive on E,E, Algorithm 1 returns a point uk\mathbf{u}_{k} such that F(uk)ϵ\|F(\mathbf{u}_{k})\|\leq\epsilon after at most max{2L,L0}u0uϵ+max{0,log2(2L/L0)}\frac{\max\{2L,L_{0}\}\|\mathbf{u}_{0}-\mathbf{u}^{*}\|}{\epsilon}+\max\{0,\log_{2}(2L/L_{0})\} oracle queries to FF.

As FF is 1L\frac{1}{L}-cocoercive, Lkmax{2L,L0}L_{k}\leq\max\{2L,L_{0}\} and the total number of times that the algorithm enters the inner while loop is at most max{0,log2(2L/L0)}.\max\{0,\log_{2}(2L/L_{0})\}. The parameters satisfy the assumptions of Lemmas 2.1 and 2.2, and, thus, F(uk)Lkλk1λku0u.\|F(\mathbf{u}_{k})\|\leq L_{k}\frac{\lambda_{k}}{1-\lambda_{k}}\|\mathbf{u}_{0}-\mathbf{u}^{*}\|. Hence, we only need to show that λk\lambda_{k} decreases sufficiently fast with k.k. As LkL_{k} can only be increased in any iteration, we have that

Hence, the total number of outer iterations is at most max{2L,L0}u0uϵ\frac{\max\{2L,L_{0}\}\|\mathbf{u}_{0}-\mathbf{u}^{*}\|}{\epsilon}. Combining with the maximum total number of inner iterations from the beginning of the proof, the result follows. ∎

Assume now that UE.\mathcal{U}\subseteq E. We will make use of a counterpart to gradient mapping (Nesterov, 2018, Chapter 2) that we refer to as the operator mapping, defined as:

where \Pi_{\mathcal{U}}\big{(}\mathbf{u}-\frac{1}{\eta}F(\mathbf{u})\big{)} is the projection operator, namely:

Operator mapping generalizes a cocoercive operator to the constrained case: when UE,\mathcal{U}\equiv E, GηF.G_{\eta}\equiv F.

It is a well-known fact that the projection operator is firmly-nonexpansive (Bauschke and Combettes, 2011, Proposition 4.16). Thus, Fact 1.3 can be used to show that, if FF is 1L\frac{1}{L}-cocoercive and ηL,\eta\geq L, then GηG_{\eta} is 12η\frac{1}{2\eta}-cocoercive. This is shown in the following (simple) proposition.

Let FF be an 1L\frac{1}{L}-cocoercive operator and let GηG_{\eta} be defined as in Eq. (1.1), where ηL.\eta\geq L. Then GηG_{\eta} is 12η\frac{1}{2\eta}-cocoercive.

As GηG_{\eta} is 12η\frac{1}{2\eta}-cocoercive, applying results from the beginning of the section to GηG_{\eta}, it is now immediate that Algorithm 2 (provided for completeness) produces uk\mathbf{u}_{k} with GLk(uk)ϵ\|G_{L_{k}}(\mathbf{u}_{k})\|\leq\epsilon after at most max{4L,L0}u0uϵ+max{0,log2(4L/L0)}\frac{\max\{4L,L_{0}\}\|\mathbf{u}_{0}-\mathbf{u}^{*}\|}{\epsilon}+\max\{0,\log_{2}(4L/L_{0})\} oracle queries to FF (as each computation of GηG_{\eta} requires one oracle query to FF).

To complete this subsection, it remains to show that GηG_{\eta} is a good surrogate for approximating (MI) (and (SVI)). This is indeed the case and it follows as a suitable generalization of Lemma 3 from Ghadimi and Lan (2016), which is provided here for completeness.

Let GηG_{\eta} be defined as in Eq. (2.2). Denote uˉ=ΠU(uF(u)/η),\bar{\mathbf{u}}=\Pi_{\mathcal{U}}(\mathbf{u}-F(\mathbf{u})/\eta), so that Gη(u)=η(uuˉ).G_{\eta}(\mathbf{u})=\eta(\mathbf{u}-\bar{\mathbf{u}}). If, for some uU,\mathbf{u}\in\mathcal{U}, Gη(u)ϵ,\|G_{\eta}(\mathbf{u})\|\leq\epsilon, then

Lemma 2.5 implies that when the operator mapping is small in norm ,\|\cdot\|, then uˉ=ΠU(uF(u)/η)\bar{\mathbf{u}}=\Pi_{\mathcal{U}}(\mathbf{u}-F(\mathbf{u})/\eta) is an approximate solution to (MI) corresponding to FF on U.\mathcal{U}. We can now formally bound the number of oracle queries to FF needed to approximate (MI) and (SVI).

Given u0U\mathbf{u}_{0}\in\mathcal{U} and a 1L\frac{1}{L}-cocoercive operator FF, Algorithm 2 returns uˉkU\bar{\mathbf{u}}_{k}\in\mathcal{U} such that

GLk(uˉk)ϵ2\|G_{L_{k}}(\bar{\mathbf{u}}_{k})\|\leq\frac{\epsilon}{2}, maxv{UBuˉk}F(uˉk),uˉkvϵ\max_{\mathbf{v}\in\{\mathcal{U}\cap{\cal B}_{\bar{\mathbf{u}}_{k}}\}}\left\langle F(\bar{\mathbf{u}}_{k}),\bar{\mathbf{u}}_{k}-\mathbf{v}\right\rangle\leq\epsilon after at most

maxvUF(uˉ),uˉvϵ\max_{\mathbf{v}\in\mathcal{U}}\left\langle F(\bar{\mathbf{u}}),\bar{\mathbf{u}}-\mathbf{v}\right\rangle\leq\epsilon after at most

Further, every point uk\mathbf{u}_{k} that Algorithm 2 constructs is from the feasible set: ukU,\mathbf{u}_{k}\in\mathcal{U}, k0\forall k\geq 0, and a simple modification to the algorithm takes at most max{4L,L0}u0uϵ+max{0,log2(4L/L0)}\frac{\max\{4L,L_{0}\}\|\mathbf{u}_{0}-\mathbf{u}^{*}\|}{\epsilon}+\max\{0,\log_{2}(4L/L_{0})\} oracle queries to FF to construct a point such that GLk(uk)ϵ\|G_{L_{k}}(\mathbf{u}_{k})\|\leq\epsilon.

By the definition of Gη,G_{\eta}, if u0U,\mathbf{u}_{0}\in\mathcal{U}, then ukU,\mathbf{u}_{k}\in\mathcal{U}, for all k.k. This follows simply as:

Observe that, due to Line 2 of Algorithm 2, LkLˉk.L_{k}\geq\bar{L}_{k}. The rest of the proof follows using Lemma 2.5, Fact 1.1, and the same reasoning as in the proof of Theorem 2.3. Observe that if the goal is to only output a point uk\mathbf{u}_{k} such that GLk(uk)ϵ\|G_{L_{k}}(\mathbf{u}_{k})\|\leq\epsilon, then computing uˉk\bar{\mathbf{u}}_{k} and F(uˉk)F(\bar{\mathbf{u}}_{k}) is not needed, and the algorithm can instead use GLk(uk)>ϵ\|G_{L_{k}}(\mathbf{u}_{k})\|>\epsilon as the exit condition in the outer while loop. ∎

2 Setups with non-Cocoercive Lipschitz Operators

Finding a point uU\mathbf{u}\in\mathcal{U} such that P(u)ϵ\|P(\mathbf{u})\|\leq\epsilon is sufficient for approximating monotone inclusion (and (SVI)). This is shown in the following simple proposition, provided here for completeness.

As P(u)ϵ,\|P(\mathbf{u})\|\leq\epsilon, the result follows. ∎

Let uˉk=JF+IU(uk),\bar{\mathbf{u}}_{k}^{*}=J_{F+I_{\mathcal{U}}}(\mathbf{u}_{k}), where ukU\mathbf{u}_{k}\in\mathcal{U} and FF is LL-Lipschitz. Then, there exists a parameter-free algorithm that queries FF at most O((L+1)log(Lukuˉkϵ))O((L+1)\log(\frac{L\|\mathbf{u}_{k}-\bar{\mathbf{u}}_{k}^{*}\|}{\epsilon})) times and outputs a point uˉk\bar{\mathbf{u}}_{k} such that uˉkuˉkϵ.\|\bar{\mathbf{u}}_{k}-\bar{\mathbf{u}}^{*}_{k}\|\leq\epsilon.

To obtain the desired result, we need to prove the convergence of a Halpern iteration with inexact evaluations of the cocoercive operator PP. Note that here we do know the cocoercivity parameter of PP – it is equal to 1/21/2. The resulting inexact version of Halpern’s iteration for PP is:

To analyze the convergence of (2.3), we again use the potential function Ck\mathcal{C}_{k} from Eq. (2.1), with PP as the operator. For simplicity of exposition, we take the best choice of λi=1i+1\lambda_{i}=\frac{1}{i+1} that can be obtained from Lemma 2.1 for Li=L=2,L_{i}=L=2, i.\forall i. The key result for this setting is provided in the following lemma, whose proof is deferred to the appendix.

Let Ck\mathcal{C}_{k} be defined as in Eq. (2.1) with PP as the 12\frac{1}{2}-cocoercive operator, and let Lk=2,L_{k}=2, λk=1k+1,\lambda_{k}=\frac{1}{k+1}, and Ak=k(k+1)2,A_{k}=\frac{k(k+1)}{2}, k1\forall k\geq 1. If the iterates uk\mathbf{u}_{k} evolve according to (2.3) for an arbitrary initial point u0U,\mathbf{u}_{0}\in\mathcal{U}, then:

Further, if, k1,\forall k\geq 1, ek1ϵ4k(k+1),\|\mathbf{e}_{k-1}\|\leq\frac{\epsilon}{4k(k+1)}, then P(uK)ϵ\|P(\mathbf{u}_{K})\|\leq\epsilon after at most K=4u0uϵK=\frac{4\|\mathbf{u}_{0}-\mathbf{u}^{*}\|}{\epsilon} iterations.

We are now ready to state the algorithm and prove the main theorem for this subsection.

Let FF be a monotone and LL-Lipschitz operator and let u0U\mathbf{u}_{0}\in\mathcal{U} be an arbitrary initial point. For any ϵ>0,\epsilon>0, Algorithm 3 outputs a point with P(uk)ϵ\|P(\mathbf{u}_{k})\|\leq\epsilon after at most 8uu0ϵ\frac{8\|\mathbf{u}^{*}-\mathbf{u}_{0}\|}{\epsilon} iterations, where each iteration can be implemented with O((L+1)log((L+1)u0uϵ)O((L+1)\log(\frac{(L+1)\|\mathbf{u}_{0}-\mathbf{u}^{*}\|}{\epsilon}) oracle queries to F.F. Hence, the total number of oracle queries to FF is: O\big{(}\frac{(L+1)\|\mathbf{u}_{0}-\mathbf{u}^{*}\|}{\epsilon}\log\big{(}\frac{(L+1)\|\mathbf{u}_{0}-\mathbf{u}^{*}\|}{\epsilon}\big{)}\big{)}.

Similarly as before, P(uk)ϵ\|P(\mathbf{u}_{k})\|\leq\epsilon implies an ϵ\epsilon-approximate solution to (MI), by Proposition 2.7. When the diameter DD is bounded, P(uk)ϵD\|P(\mathbf{u}_{k})\|\leq\frac{\epsilon}{D} implies an ϵ\epsilon-approximate solution to (SVI).

3 Setups with Strongly Monotone and Lipschitz Operators

We now show that by restarting Algorithm 3, we can obtain a parameter-free method with near-optimal oracle complexity. To simplify the exposition, we assume w.l.o.g. that L=Ω(1).L=\Omega(1).

Given FF that is LL-Lipschitz and mm-strongly monotone, consider running the following algorithm A\mathcal{A}, starting with u0U\mathbf{u}_{0}\in\mathcal{U}:

Then, A\mathcal{A} outputs ukU\mathbf{u}_{k}\in\mathcal{U} with P(uk)ϵ\|P(\mathbf{u}_{k})\|\leq\epsilon after at most 1+log2(u0uϵ)1+\log_{2}(\frac{\|\mathbf{u}_{0}-\mathbf{u}^{*}\|}{\epsilon}) iterations, for any ϵ(0,12]\epsilon\in(0,\frac{1}{2}]. The total number of queries to FF until P(uk)ϵ\|P(\mathbf{u}_{k})\|\leq\epsilon is O\big{(}(L+\frac{L}{m})\log(\frac{\|\mathbf{u}_{0}-\mathbf{u}^{*}\|}{\epsilon})\log(L+\frac{L}{m})\big{)}.

The first part is immediate, as each call to Algorithm 3 ensures, due to Theorem 2.10, that

and P(u0)2u0u\|P(\mathbf{u}_{0})\|\leq 2\|\mathbf{u}_{0}-\mathbf{u}^{*}\| as PP is 2-Lipschitz (because it is 12\frac{1}{2}-cocoercive) and P(u)=0.P(\mathbf{u}^{*})=\textbf{0}.

On the other hand, as FF is mm-strongly monotone and u\mathbf{u}^{*} is an (MVI) solution,

Hence: uˉk1u1mP(uk1).\|\bar{\mathbf{u}}^{*}_{k-1}-\mathbf{u}^{*}\|\leq\frac{1}{m}\|P(\mathbf{u}_{k-1})\|. It remains to use the triangle inequality and P(uk1)=uk1uˉk1P(\mathbf{u}_{k-1})=\mathbf{u}_{k-1}-\bar{\mathbf{u}}^{*}_{k-1} to obtain: \|\mathbf{u}_{k-1}-\mathbf{u}^{*}\|\leq\big{(}1+\frac{1}{m}\big{)}\|P(\mathbf{u}_{k-1})\|.

Lower Bound Reductions

In this section, we only state the lower bounds, while more details about the oracle model and the proof are deferred to Appendix A.

For any deterministic algorithm working in the operator oracle model and any L,D>0L,D>0, there exists an LL-Lipschitz-continuous operator FF and a closed convex feasible set U\mathcal{U} with diameter DD such that:

For all ϵ>0\epsilon>0 such that k=LD2ϵ=O(d)k=\frac{LD^{2}}{\epsilon}=O(d), maxuUF(uk),uku=Ω(ϵ)\max_{\mathbf{u}\in\mathcal{U}}\left\langle F(\mathbf{u}_{k}),\mathbf{u}_{k}-\mathbf{u}\right\rangle=\Omega(\epsilon);

For all ϵ>0\epsilon>0 such that k=LDϵ=O(d)k=\frac{LD}{\epsilon}=O(d), maxu{UBuk}F(uk),uku=Ω(ϵ)\max_{\mathbf{u}\in\{\mathcal{U}\cap\mathcal{B}_{\mathbf{u}_{k}}\}}\left\langle F(\mathbf{u}_{k}),\mathbf{u}_{k}-\mathbf{u}\right\rangle=\Omega(\epsilon);

If FF is 1L\frac{1}{L}-cocoercive, then for all ϵ>0\epsilon>0 such that k=LDϵlog(D/ϵ)=O(d)k=\frac{LD}{\epsilon\log(D/\epsilon)}=O(d), it holds that

If FF is mm-strongly monotone, then for all ϵ>0\epsilon>0 such that k=Lm=O(d)k=\frac{L}{m}=O(d), it holds that

Parts (a) and (b) of Lemma 3.1 certify that Algorithm 3 is optimal up to a logarithmic factor, due to Theorem 2.10. This is true because we can run Algorithm 3 with accuracy ϵD\frac{\epsilon}{D} to obtain maxuUF(uk),uku=O(ϵ)\max_{\mathbf{u}\in\mathcal{U}}\left\langle F(\mathbf{u}_{k}),\mathbf{u}_{k}-\mathbf{u}\right\rangle=O(\epsilon) in k=O(LD2ϵlog(LDϵ))k=O(\frac{LD^{2}}{\epsilon}\log(\frac{LD}{\epsilon})) iterations, or with accuracy ϵ\epsilon to obtain maxu{UBuk}F(uk),uku=O(ϵ)\max_{\mathbf{u}\in\{\mathcal{U}\cap\mathcal{B}_{\mathbf{u}_{k}}\}}\left\langle F(\mathbf{u}_{k}),\mathbf{u}_{k}-\mathbf{u}\right\rangle=O(\epsilon) in k=O(LDϵlog(LDϵ))k=O(\frac{LD}{\epsilon}\log(\frac{LD}{\epsilon})) iterations (see Proposition 2.7).

Part (c) of Lemma 3.1 certifies that Algorithm 2 is optimal up to a log(D/ϵ)\log(D/\epsilon) factor, due to Theorem 2.6. Part (d) certifies that the restarting algorithm from Theorem 2.12 is optimal up to a factor log(D/ϵ)log(L/m)\log(D/\epsilon)\log(L/m) whenever L=Ω(L/m).L=\Omega(L/m). Note that L=Ω(L/m)L=\Omega(L/m) can be ensured by a proper scaling of the problem instance, as any such scaling would leave the condition number L/mL/m unaffected and would only impact the target error ϵ,\epsilon, which only appears under a logarithm.

Conclusion

We showed that variants of Halpern iteration can be used to obtain near-optimal methods for solving different classes of monotone inclusion problems with Lipschitz operators. The results highlight connections between monotone inclusion, variational inequalities, fixed points of nonexpansive maps, and proximal-point-type algorithms. Some interesting questions that merit further investigation remain. In particular, one open question that arises is to close the gap between the upper and lower bounds provided here. We conjecture that the optimal complexity of monotone inclusion is: (i) Θ(LDϵ)\Theta(\frac{LD}{\epsilon}) when the operator is either LL-Lipschitz or 1L\frac{1}{L}-cocoercive, and (ii) Θ(Lmlog(LDϵ))\Theta(\frac{L}{m}\log(\frac{LD}{\epsilon})) when the operator is LL-Lipschitz and mm-strongly monotone.

Acknowledgements

We thank Prof. Ulrich Kohlenbach for useful comments and pointers to the literature. We also thank Howard Heaton for pointing out a typo in the proof of Lemma 2.1 in a previous version of this paper.

References

Appendix A Omitted Proofs

Let Ck\mathcal{C}_{k} be defined as in Eq. (2.1) and let u\mathbf{u}^{*} be the solution to (MI) that minimizes u0u\|\mathbf{u}_{0}-\mathbf{u}^{*}\|. Assume further that F(u1)F(u0),u1u01L1F(u1)F(u0)2.\left\langle F(\mathbf{u}_{1})-F(\mathbf{u}_{0}),\mathbf{u}_{1}-\mathbf{u}_{0}\right\rangle\geq\frac{1}{L_{1}}\|F(\mathbf{u}_{1})-F(\mathbf{u}_{0})\|^{2}. If Ak+1Ck+1AkCk,A_{k+1}\mathcal{C}_{k+1}\leq A_{k}\mathcal{C}_{k}, k1,\forall k\geq 1, where {Ai}i1\{A_{i}\}_{i\geq 1} is a sequence of positive numbers that satisfies A1=1A_{1}=1, then:

The statement holds trivially if F(uk)=0,\|F(\mathbf{u}_{k})\|=0, so assume that F(uk)>0.\|F(\mathbf{u}_{k})\|>0. Under the assumption of the lemma, we have that AkCkC1,A_{k}\mathcal{C}_{k}\leq\mathcal{C}_{1}, k1.\forall k\geq 1. From (H) and λ1=12\lambda_{1}=\frac{1}{2}, u1=u01L1F(u0),\mathbf{u}_{1}=\mathbf{u}_{0}-\frac{1}{L_{1}}F(\mathbf{u}_{0}), and thus: C1=1L1F(u1)21L1F(u1),F(u0).\mathcal{C}_{1}=\frac{1}{L_{1}}\|F(\mathbf{u}_{1})\|^{2}-\frac{1}{L_{1}}\left\langle F(\mathbf{u}_{1}),F(\mathbf{u}_{0})\right\rangle.

Let u\mathbf{u}^{*} be an arbitrary solution to (MI) (and thus also to (MVI)). As F(u1)F(u0),u1u01L1F(u1)F(u0)2\left\langle F(\mathbf{u}_{1})-F(\mathbf{u}_{0}),\mathbf{u}_{1}-\mathbf{u}_{0}\right\rangle\geq\frac{1}{L_{1}}\|F(\mathbf{u}_{1})-F(\mathbf{u}_{0})\|^{2} and u1=u01L1F(u0),\mathbf{u}_{1}=\mathbf{u}_{0}-\frac{1}{L_{1}}F(\mathbf{u}_{0}), it follows that F(u1)2F(u0),F(u1),\|F(\mathbf{u}_{1})\|^{2}\leq\left\langle F(\mathbf{u}_{0}),F(\mathbf{u}_{1})\right\rangle, and, thus C10.\mathcal{C}_{1}\leq 0. Further, as Ak>0,A_{k}>0, we also have Ck0,\mathcal{C}_{k}\leq 0, and, hence:

where the last line is by u\mathbf{u}^{*} being a solution to (MVI) and by the Cauchy-Schwarz inequality. The conclusion of the lemma now follows by dividing both sides of F(uk)2Lkλk1λkF(uk)u0u\|F(\mathbf{u}_{k})\|^{2}\leq L_{k}\frac{\lambda_{k}}{1-\lambda_{k}}\|F(\mathbf{u}_{k})\|\cdot\|\mathbf{u}_{0}-\mathbf{u}^{*}\| by F(uk)\|F(\mathbf{u}_{k})\| and observing that the statement holds for an arbitrary solution u\mathbf{u}^{*} to (MI), and thus, it also holds for the one that minimizes the distance to u0.\mathbf{u}_{0}.

Let Ck\mathcal{C}_{k} be defined as in Eq. (2.1). Let {Ai}i1\{A_{i}\}_{i\geq 1} be defined recursively as A1=1A_{1}=1 and Ak+1=Akλk(1λk)λk+1A_{k+1}=A_{k}\frac{\lambda_{k}}{(1-\lambda_{k})\lambda_{k+1}} for k1.k\geq 1. Assume that {λi}i1\{\lambda_{i}\}_{i\geq 1} is chosen so that λ1=12\lambda_{1}=\frac{1}{2} and for k1:k\geq 1: λk+112λk+1λkLk(1λk)Lk+1\frac{\lambda_{k+1}}{1-2\lambda_{k+1}}\geq\frac{\lambda_{k}L_{k}}{(1-\lambda_{k})L_{k+1}}. Finally, assume that Lk(0,)L_{k}\in(0,\infty) and F(uk)F(uk1),ukuk11LkF(uk)F(uk1)2\left\langle F(\mathbf{u}_{k})-F(\mathbf{u}_{k-1}),\mathbf{u}_{k}-\mathbf{u}_{k-1}\right\rangle\geq\frac{1}{L_{k}}\|F(\mathbf{u}_{k})-F(\mathbf{u}_{k-1})\|^{2}, k.\forall k. Then,

which, after expanding the left-hand side, can be equivalently written as:

From (H), we have that uk+1uk=λk+11λk+1(u0uk+1)2Lk+1F(uk)\mathbf{u}_{k+1}-\mathbf{u}_{k}=\frac{\lambda_{k+1}}{1-\lambda_{k+1}}(\mathbf{u}_{0}-\mathbf{u}_{k+1})-\frac{2}{L_{k+1}}F(\mathbf{u}_{k}) and uk+1uk=λk+1(u0uk)2(1λk+1)Lk+1F(uk).\mathbf{u}_{k+1}-\mathbf{u}_{k}=\lambda_{k+1}(\mathbf{u}_{0}-\mathbf{u}_{k})-\frac{2(1-\lambda_{k+1})}{L_{k+1}}F(\mathbf{u}_{k}). Hence:

Rearranging the last inequality and multiplying both sides by Ak+1,A_{k+1}, we have:

The left-hand side of the last inequality if precisely Ak+1Ck+1.A_{k+1}\mathcal{C}_{k+1}. The right-hand side is AkCk,\leq A_{k}\mathcal{C}_{k}, by the choice of sequences {Ai}i1,\{A_{i}\}_{i\geq 1}, {λi}i1.\{\lambda_{i}\}_{i\geq 1}.

A.2 Operator Mapping

Let FF be an 1L\frac{1}{L}-cocoercive operator and let GηG_{\eta} be defined as in Eq. (1.1), where ηL.\eta\geq L. Then GηG_{\eta} is 12η\frac{1}{2\eta}-cocoercive.

As ηL\eta\geq L and FF is 1L\frac{1}{L}-cocoercive, 1ηF(u)F(v),uv1η2F(u)F(v)2.\frac{1}{\eta}\left\langle F(\mathbf{u})-F(\mathbf{v}),\mathbf{u}-\mathbf{v}\right\rangle\geq\frac{1}{\eta^{2}}\|F(\mathbf{u})-F(\mathbf{v})\|^{2}. It remains to apply Young’s inequality, which implies Gη(u)Gη(v),F(u)F(v)12Gη(u)Gη(v)2+12F(u)F(v)2.\left\langle G_{\eta}(\mathbf{u})-G_{\eta}(\mathbf{v}),F(\mathbf{u})-F(\mathbf{v})\right\rangle\leq\frac{1}{2}\|G_{\eta}(\mathbf{u})-G_{\eta}(\mathbf{v})\|^{2}+\frac{1}{2}\|F(\mathbf{u})-F(\mathbf{v})\|^{2}.

A.3 Approximating the Resolvent

Let us start by proving the convergence of a version of the Extragradient method of Korpelevich that does not require the knowledge of the Lipschitz constant LL (but does require knowledge of the strong monotonicity parameter mm; when computing the resolvent we have m=1m=1). The algorithm is summarized in Algorithm 4.

Observe that the update step for uk\mathbf{u}_{k} from Lines 4 and 4 can be written in the form of a projection onto U;\mathcal{U}; we chose to write it in the current form as it is more convenient for the analysis.

We now bound the convergence of Algorithm 4.

Let a0>0a_{0}>0 and let FF be mm-strongly monotone and LL-Lipschitz. Then, Algorithm 4 outputs a point uk\mathbf{u}_{k} with ukuϵ\|\mathbf{u}_{k}-\mathbf{u}^{*}\|\leq\epsilon after at most k=O\big{(}\frac{L}{m}\log(\frac{L\|\mathbf{u}_{0}-\mathbf{u}^{*}\|}{m\epsilon}\big{)}) oracle queries to F,F, where u\mathbf{u}^{*} solves (SVI).

Define Ak=i=0kai.A_{k}=\sum_{i=0}^{k}a_{i}. To prove the lemma, we will use the following gap (or merit) functions:

As FF is strongly monotone, fk0,k.f_{k}\geq 0,\,\forall k. By convention, we take f1=0f_{-1}=0 and A1=0A_{-1}=0, so that A_{k}f_{k}-A_{k-1}f_{k-1}=a_{k}\Big{(}\left\langle F(\bar{\mathbf{u}}_{k}),\bar{\mathbf{u}}_{k}-\mathbf{u}^{*}\right\rangle-\frac{m}{2}\|\bar{\mathbf{u}}_{k}-\mathbf{u}^{*}\|^{2}\Big{)}. Let us now bound AkfkAk1fk1A_{k}f_{k}-A_{k-1}f_{k-1}, and observe that AkfkAk1fk10A_{k}f_{k}-A_{k-1}f_{k-1}\geq 0. First, write

By the first-order optimality of uk+1\mathbf{u}_{k+1} in its definition, we have, u:\forall\mathbf{u}:

By the standard three-point identity (which can also be verified directly):

Thus, setting u=u:\mathbf{u}=\mathbf{u}^{*}:

By the condition of the while loop in Line 4 of Algorithm 4, and because AkfkAk1fk10A_{k}f_{k}-A_{k-1}f_{k-1}\geq 0,

The condition of the while loop in Line 4 of Algorithm 4 is satisfied for any ak12L,a_{k}\leq\frac{1}{2L}, as

where we have used the Cauchy-Schwarz inequality, the fact that FF is LL-Lipschitz, and the Young inequality. Thus, in any iteration, ak>14L,a_{k}>\frac{1}{4L}, and the total number of times the while loop from Line 4 is entered is at most log2(4L/a0).\log_{2}(4L/a_{0}).

From Eq. (A.5), uuk+1211+m/(4L)uuk2(1m8L)uuk2.\|\mathbf{u}^{*}-\mathbf{u}_{k+1}\|^{2}\leq\frac{1}{1+m/(4L)}\|\mathbf{u}^{*}-\mathbf{u}_{k}\|^{2}\leq(1-\frac{m}{8L})\|\mathbf{u}^{*}-\mathbf{u}_{k}\|^{2}. Thus, for any δ>0,\delta>0, uukδ\|\mathbf{u}^{*}-\mathbf{u}_{k}\|\leq\delta for k16Lmlog(uu0δ).k\geq\frac{16L}{m}\log(\frac{\|\mathbf{u}^{*}-\mathbf{u}_{0}\|}{\delta}). Consequently, from Eq. (A.5), uˉkuk2δ\|\bar{\mathbf{u}}_{k}-\mathbf{u}_{k}\|\leq\sqrt{2}\delta whenever uukδ.\|\mathbf{u}^{*}-\mathbf{u}_{k}\|\leq\delta. In particular, for δ=akmϵ52mϵ202L,\delta=\frac{a_{k}m\epsilon}{5\sqrt{2}}\geq\frac{m\epsilon}{20\sqrt{2}L}, uˉkuk2δ=akm5ϵ\|\bar{\mathbf{u}}_{k}-\mathbf{u}_{k}\|\leq\sqrt{2}\delta=\frac{a_{k}m}{5}\epsilon after at most k=16Lmlog(202Luu0mϵk=\frac{16L}{m}\log(\frac{20\sqrt{2}L\|\mathbf{u}^{*}-\mathbf{u}_{0}\|}{m\epsilon} (outer loop) iterations.

On the other hand, as FF is mm-strongly monotone, we also have F(uˉk),uˉkum2uˉku2.\left\langle F(\bar{\mathbf{u}}_{k}),\bar{\mathbf{u}}_{k}-\mathbf{u}^{*}\right\rangle\geq\frac{m}{2}\|\bar{\mathbf{u}}_{k}-\mathbf{u}^{*}\|^{2}. Hence, uˉku4ϵ5.\|\bar{\mathbf{u}}_{k}-\mathbf{u}^{*}\|\leq\frac{4\epsilon}{5}. Finally, applying the triangle inequality and as a+k1/m:a+k\leq 1/m:

Note that we have already bounded the total number of inner and outer loop iterations. Observing that each inner iteration makes 2 oracle queries to FF and each outer iteration makes 22 oracle queries to FF outside of the inner iteration, the bound on the total number of oracle queries to FF follows. ∎

Let uˉk=JF+IU(uk),\bar{\mathbf{u}}_{k}^{*}=J_{F+I_{\mathcal{U}}}(\mathbf{u}_{k}), where ukU\mathbf{u}_{k}\in\mathcal{U} and FF is LL-Lipschitz. Then, there exists a parameter-free algorithm that queries FF at most O((L+1)log((L+1)ukuˉkϵ))O((L+1)\log(\frac{(L+1)\|\mathbf{u}_{k}-\bar{\mathbf{u}}_{k}^{*}\|}{\epsilon})) times and outputs a point uˉk\bar{\mathbf{u}}_{k} such that uˉkuˉkϵ.\|\bar{\mathbf{u}}_{k}-\bar{\mathbf{u}}^{*}_{k}\|\leq\epsilon.

Observe first that uˉk\bar{\mathbf{u}}^{*}_{k} solves (SVI) for operator Fˉ(u)=F(u)+uuk\bar{F}(\mathbf{u})=F(\mathbf{u})+\mathbf{u}-\mathbf{u}_{k} over the set U.\mathcal{U}. This follows from the definition of the resolvent, which implies:

Equivalently: 0Fˉ(uˉk)+IU(uˉk)\textbf{0}\in\bar{F}(\bar{\mathbf{u}}_{k}^{*})+\partial I_{\mathcal{U}}(\bar{\mathbf{u}}^{*}_{k}).

The rest of the proof follows by applying Lemma A.1 to Fˉ,\bar{F}, which is (L+1)(L+1)-Lipschitz and 11-strongly monotone. ∎

A.4 Inexact Halpern Iteration

We start by first proving the following auxiliary result.

Given an initial point u0U,\mathbf{u}_{0}\in\mathcal{U}, let uk\mathbf{u}_{k} evolve according to Eq. (2.3), where λk=1k+1\lambda_{k}=\frac{1}{k+1}. Then,

where u\mathbf{u}^{*} is such that P(u)=0.\|P(\mathbf{u}^{*})\|=0.

where we have used the triangle inequality and nonexpansivity of T.T. The result follows by recursively applying the last inequality and observing that j=ik(1λj)=ik+1.\prod_{j=i}^{k}(1-\lambda_{j})=\frac{i}{k+1}.

Using this proposition, we can now prove the following lemma.

Let Ck\mathcal{C}_{k} be defined as in Eq. (2.1) with PP as the 12\frac{1}{2}-cocoercive operator, and let Lk=2,L_{k}=2, λk=1k+1,\lambda_{k}=\frac{1}{k+1}, and Ak=k(k+1)2,A_{k}=\frac{k(k+1)}{2}, k1\forall k\geq 1. If the iterates uk\mathbf{u}_{k} evolve according to (2.3) for an arbitrary initial point u0U,\mathbf{u}_{0}\in\mathcal{U}, then:

Further, if, k1,\forall k\geq 1, ek1ϵ4k(k+1),\|\mathbf{e}_{k-1}\|\leq\frac{\epsilon}{4k(k+1)}, then P(uK)ϵ\|P(\mathbf{u}_{K})\|\leq\epsilon after at most K=4u0uϵK=\frac{4\|\mathbf{u}_{0}-\mathbf{u}^{*}\|}{\epsilon} iterations.

By the same arguments as in the proof of Lemma 2.1:

Plugging λk+1=1k+2\lambda_{k+1}=\frac{1}{k+2} in the last inequality and using the definition of Ck\mathcal{C}_{k} and the choice of AkA_{k} from the statement of the lemma completes the proof of the first part.

Using the same arguments as in the proof of Lemma 2.2, we can conclude from Ak+1Ck+1AkCk+Ak+1ek,(1λk+1)P(uk)P(uk+1),\quad A_{k+1}\mathcal{C}_{k+1}\leq A_{k}\mathcal{C}_{k}+A_{k+1}\left\langle\mathbf{e}_{k},(1-\lambda_{k+1})P(\mathbf{u}_{k})-P(\mathbf{u}_{k+1})\right\rangle, k1\forall k\geq 1 that:

Let us now bound each ei1,ii+1P(ui1)P(ui)\left\langle\mathbf{e}_{i-1},\frac{i}{i+1}P(\mathbf{u}_{i-1})-P(\mathbf{u}_{i})\right\rangle term. Recall that P(u)=0P(\mathbf{u}^{*})=\textbf{0} and PP is 2-Lipschitz (as discussed in Section 1.2, this follows from PP being 12\frac{1}{2}-cocoercive). Thus, we have:

where we have used Proposition A.2 in the last inequality. In particular, if ei1ϵ4i(i+1)\|\mathbf{e}_{i-1}\|\leq\frac{\epsilon}{4i(i+1)}, then, i1\forall i\geq 1:

Observe that if u0uϵ/2,\|\mathbf{u}_{0}-\mathbf{u}^{*}\|\leq\epsilon/2, as PP is 2-Lipschitz and P(u)=0,P(\mathbf{u}^{*})=\textbf{0}, we would have P(u0)ϵ,\|P(\mathbf{u}_{0})\|\leq\epsilon, and the statement of the second part of the lemma would hold trivially. Assume from now on that u0u>ϵ/2.\|\mathbf{u}_{0}-\mathbf{u}^{*}\|>\epsilon/2. Suppose that P(uk)>ϵ\|P(\mathbf{u}_{k})\|>\epsilon and k4u0uϵ.k\geq\frac{4\|\mathbf{u}_{0}-\mathbf{u}^{*}\|}{\epsilon}. Then, dividing both sides of Eq. (A.7) by P(uk)/2\|P(\mathbf{u}_{k})\|/2 and using that P(uk)>ϵ\|P(\mathbf{u}_{k})\|>\epsilon and u0u>ϵ/2\|\mathbf{u}_{0}-\mathbf{u}^{*}\|>\epsilon/2, we get:

contradicting the assumption that P(uk)>ϵ\|P(\mathbf{u}_{k})\|>\epsilon and completing the proof. ∎

A.5 Strongly Monotone Lipschitz Operators

Given FF that is LL-Lipschitz and mm-strongly monotone, consider running the following algorithm A\mathcal{A}, starting with uˉ0U\bar{\mathbf{u}}_{0}\in\mathcal{U}:

Then, A\mathcal{A} outputs a point ukU\mathbf{u}_{k}\in\mathcal{U} with P(uk)ϵ\|P(\mathbf{u}_{k})\|\leq\epsilon after at most log2(u0u/ϵ)\log_{2}(\|\mathbf{u}_{0}-\mathbf{u}^{*}\|/\epsilon) iterations, where w.l.o.g. ϵ12\epsilon\leq\frac{1}{2}. The total number of oracle queries to FF until this happens is O\big{(}(L+\frac{L}{m})\log(\|\mathbf{u}_{0}-\mathbf{u}^{*}\|/\epsilon)\log(L+\frac{L}{m})\big{)}.

The first part of the theorem is immediate, as each call to Algorithm 3 ensures, due to Theorem 2.10, that

and P(u0)2u0u\|P(\mathbf{u}_{0})\|\leq 2\|\mathbf{u}_{0}-\mathbf{u}^{*}\| as PP is 2-Lipschitz (because it is 12\frac{1}{2}-cocoercive) and P(u)=0.P(\mathbf{u}^{*})=\textbf{0}.

On the other hand, as FF is mm-strongly monotone and u\mathbf{u}^{*} is an (MVI) solution,

Hence: uˉk1u1mP(uk1).\|\bar{\mathbf{u}}^{*}_{k-1}-\mathbf{u}^{*}\|\leq\frac{1}{m}\|P(\mathbf{u}_{k-1})\|. It remains to use the triangle inequality and P(uk1)=uk1uˉk1P(\mathbf{u}_{k-1})=\mathbf{u}_{k-1}-\bar{\mathbf{u}}^{*}_{k-1} to obtain:

A.6 Lower Bounds

We make use of the lower bound from Ouyang and Xu and the algorithmic reductions between the problems considered in previous sections to derive (near-tight) lower bounds for all of the problems considered in this paper.

The lower bounds are for deterministic algorithms working in a (first-order) oracle model. For convex-concave saddle-point problems with the objective Φ(x,y)\Phi(\mathbf{x},\mathbf{y}) and closed convex feasible set X×Y,\mathcal{X}\times\mathcal{Y}, any such algorithm A\mathcal{A} can be described as follows: in each iteration kk, A\mathcal{A} queries a pair of points (xˉk,yˉk)X×Y(\bar{\mathbf{x}}_{k},\bar{\mathbf{y}}_{k})\in\mathcal{X}\times\mathcal{Y} to obtain (xΦ(xˉk,yˉk),yΦ(xˉk,yˉk)),(\nabla_{\mathbf{x}}\Phi(\bar{\mathbf{x}}_{k},\bar{\mathbf{y}}_{k}),\,\nabla_{\mathbf{y}}\Phi(\bar{\mathbf{x}}_{k},\bar{\mathbf{y}}_{k})), and outputs a candidate solution pair (xk,yk)X×Y.(\mathbf{x}_{k},\mathbf{y}_{k})\in\mathcal{X}\times\mathcal{Y}. Both the query points pair (xˉk,yˉk)(\bar{\mathbf{x}}_{k},\bar{\mathbf{y}}_{k}) and the candidate solution pair (xk,yk)(\mathbf{x}_{k},\mathbf{y}_{k}) can only depend on (i) global problem parameters (such as the Lipschitz constant of Φ\Phi’s gradients or the feasible sets X,Y\mathcal{X},\mathcal{Y}) and (ii) oracle queries and answers up to iteration k:k:

We start by summarizing the result from [Ouyang and Xu, 2019, Theorem 9].

where (xk,yk)X×Y(\mathbf{x}_{k},\mathbf{y}_{k})\in\mathcal{X}\times\mathcal{Y} is the algorithm output after kk iterations and RX,RYR_{\mathcal{X}},\,R_{\mathcal{Y}} denote the diameters of the feasible sets X,Y,\mathcal{X},\,\mathcal{Y}, respectively, and where both X,Y,\mathcal{X},\,\mathcal{Y}, are closed and convex.

The assumption of the theorem that k=O(d)k=O(d) means that the lower bound applies in the high-dimensional regime d=Ω(L(RX2+RXRY)ϵ),d=\Omega(\frac{L({R_{\mathcal{X}}}^{2}+R_{\mathcal{X}}R_{\mathcal{Y}})}{\epsilon}), which is standard and generally unavoidable.

In the setting of VIs, we consider a related model in which an algorithm has oracle access to FF and refer to it as the operator oracle model. Similarly as for the saddle-point problems, we consider deterministic algorithms that on a given problem instance described by (F,U)(F,\mathcal{U}) operate as follows: in each iteration kk the algorithm queries a point uˉkU\bar{\mathbf{u}}_{k}\in\mathcal{U}, receives F(uˉk),F(\bar{\mathbf{u}}_{k}), and outputs a solution candidate ukU\mathbf{u}_{k}\in\mathcal{U}. Both uk\mathbf{u}_{k} and uˉk\bar{\mathbf{u}}_{k} can only depend on (i) global problem parameters (such as the feasible set U\mathcal{U} and the Lipschitz parameter of FF), and (ii) oracle queries and answers up to iteration k:k: {uˉi,F(uˉi)}i=0k1.\{\bar{\mathbf{u}}_{i},F(\bar{\mathbf{u}}_{i})\}_{i=0}^{k-1}. Note that all methods described in this paper and most of the commonly used methods for solving VIs, such as, e.g., the mirror-prox method of Nemirovski and dual extrapolation method of Nesterov , work in this oracle model.

For any deterministic algorithm working in the operator oracle model described above and any L,D>0L,D>0, there exists a VI described by an LL-Lipschitz-continuous operator FF and a closed convex feasible set U\mathcal{U} with diameter DD such that:

For all ϵ>0\epsilon>0 such that k=LD2ϵ=O(d)k=\frac{LD^{2}}{\epsilon}=O(d), maxuUF(uk),uku=Ω(ϵ)\max_{\mathbf{u}\in\mathcal{U}}\left\langle F(\mathbf{u}_{k}),\mathbf{u}_{k}-\mathbf{u}\right\rangle=\Omega(\epsilon);

For all ϵ>0\epsilon>0 such that k=LDϵ=O(d)k=\frac{LD}{\epsilon}=O(d), maxu{UBuk}F(uk),uku=Ω(ϵ)\max_{\mathbf{u}\in\{\mathcal{U}\cap\mathcal{B}_{\mathbf{u}_{k}}\}}\left\langle F(\mathbf{u}_{k}),\mathbf{u}_{k}-\mathbf{u}\right\rangle=\Omega(\epsilon);

If FF is 1L\frac{1}{L}-cocoercive, then for all ϵ>0\epsilon>0 such that k=LDϵlog(D/ϵ)=O(d)k=\frac{LD}{\epsilon\log(D/\epsilon)}=O(d), it holds that

If FF is mm-strongly monotone, then for all ϵ>0\epsilon>0 such that k=Lm=O(d)k=\frac{L}{m}=O(d), it holds that

Proof of (a): Suppose that this claim was not true. Then we would be able to solve any instance with LL-Lipschitz FF and U\mathcal{U} with diameter bounded by DD and obtain uk\mathbf{u}_{k} with maxuUF(uk),ukuϵ\max_{\mathbf{u}\in\mathcal{U}}\left\langle F(\mathbf{u}_{k}),\mathbf{u}_{k}-\mathbf{u}\right\rangle\leq\epsilon in o(LD2ϵ)o(\frac{LD^{2}}{\epsilon}) iterations, assuming the appropriate high-dimensional regime. In particular, given any fixed convex-concave Φ(x,y)\Phi(\mathbf{x},\mathbf{y}) with LL-Lipschitz gradients and feasible sets X,Y\mathcal{X},\mathcal{Y} whose diameter is D/2,D/2, let \mathbf{u}=[\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\mathbf{x}\\ \mathbf{y}\crcr}}], F(\mathbf{u})=[\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\nabla_{\mathbf{x}}\Phi(\mathbf{x},\mathbf{y})\\ -\nabla_{\mathbf{y}}\Phi(\mathbf{x},\mathbf{y})\crcr}}], U=X×Y.\mathcal{U}=\mathcal{X}\times\mathcal{Y}. Then, it is not hard to verify that FF is monotone and LL-Lipschitz (see, e.g., Nemirovski , Facchinei and Pang ) and the diameter of U\mathcal{U} is D.D. Thus, by assumption, we would be able to construct a point \mathbf{u}_{k}=[\vbox{\Let@\restore@math@cr\default@tag\halign{\hfil\m@th\scriptstyle#&\m@th\scriptstyle{}#\hfil\cr\mathbf{x}_{k}\\ \mathbf{y}_{k}\crcr}}] for which maxuUF(uk),ukuϵ\max_{\mathbf{u}\in\mathcal{U}}\left\langle F(\mathbf{u}_{k}),\mathbf{u}_{k}-\mathbf{u}\right\rangle\leq\epsilon in o(LD2ϵ)o(\frac{LD^{2}}{\epsilon}) iterations. But then, because Φ\Phi is convex-concave, we would also have, for any xX,yY\mathbf{x}\in\mathcal{X},\mathbf{y}\in\mathcal{Y}:

Because we obtained this bound for an arbitrary LL-Lipschitz convex-concave Φ\Phi and arbitrary feasible sets X,Y\mathcal{X},\mathcal{Y} with diameters D/2,D/2, Theorem A.3 leads to a contradiction.

Proof of (b): If (b) was not true, then we would be able to obtain a point uk\mathbf{u}_{k} with

in k=LD2ϵk=\frac{LD^{2}}{\epsilon} iterations. But the same point would satisfy maxuUF(uk),uku=o(ϵ),\max_{\mathbf{u}\in\mathcal{U}}\left\langle F(\mathbf{u}_{k}),\mathbf{u}_{k}-\mathbf{u}\right\rangle=o(\epsilon), which is a contradiction, due to (a).

Proof of (c): We prove the claim for L=2.L=2. This is w.l.o.g., due to the standard rescaling argument: if FF is 1L\frac{1}{L}-cocoercive, then Fˉ=F/(2L)\bar{F}=F/(2L) is 12\frac{1}{2}-cocoercive. Further, if, for some ukU,\mathbf{u}_{k}\in\mathcal{U},

then maxu{UBuk}F(uk),uku=Ω(Lϵ).\max_{\mathbf{u}\in\{\mathcal{U}\cap\mathcal{B}_{\mathbf{u}_{k}}\}}\left\langle F(\mathbf{u}_{k}),\mathbf{u}_{k}-\mathbf{u}\right\rangle=\Omega(L\epsilon).

Suppose that the claim was not true for a 12\frac{1}{2}-cocoercive operator F.F. Then for any MM-Lipschitz monotone operator G,G, we would be able to use the strategy from Section 2.2 to obtain a point uk\mathbf{u}_{k} with

in k=MDϵk=\frac{MD}{\epsilon} iterations. This is a contradiction, due to (b).

Proof of (d): Suppose that the claim was not true, i.e., that there existed an algorithm that, for any m,L>0,m,L>0, could output uk\mathbf{u}_{k} with maxu{UBuk}Fˉ(uk),uku=ϵ/2\max_{\mathbf{u}\in\{\mathcal{U}\cap\mathcal{B}_{\mathbf{u}_{k}}\}}\left\langle\bar{F}(\mathbf{u}_{k}),\mathbf{u}_{k}-\mathbf{u}\right\rangle=\epsilon/2 in k=o(L/m)k=o(L/m) iterations, for any mm-strongly monotone and LL-Lipschitz operator. Then for any LL-Lipschitz monotone operator FF, we could apply that algorithm to Fˉ()=F()+ϵ2D(u0)\bar{F}(\cdot)=F(\cdot)+\frac{\epsilon}{2D}(\cdot-\mathbf{u}_{0}) to obtain a point uk\mathbf{u}_{k} with maxu{UBuk}Fˉ(uk),uku=ϵ/2\max_{\mathbf{u}\in\{\mathcal{U}\cap\mathcal{B}_{\mathbf{u}_{k}}\}}\left\langle\bar{F}(\mathbf{u}_{k}),\mathbf{u}_{k}-\mathbf{u}\right\rangle=\epsilon/2 in k=o(LD/ϵ)k=o(LD/\epsilon) iterations. But then we would also have: