The Complexity of Constrained Min-Max Optimization

Constantinos Daskalakis, Stratis Skoulakis, Manolis Zampetakis

Introduction

Min-Max Optimization has played a central role in the development of Game Theory [vN28], Convex Optimization [Dan51, Adl13], and Online Learning [Bla56, CBL06, SS12, BCB12, SSBD14, Haz16]. In its general constrained form, it can be written down as follows:

The goal in (1.1) is to find a feasible pair (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}), i.e., g(x,y)0g(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})\leq 0, that satisfies the following

Unfortunately, our ability to solve Problem (1.1) remains rather poor in settings where our objective function ff is not convex-concave. This is emerging as a major challenge in Deep Learning, where min-max optimization has recently found many important applications, such as training Generative Adversarial Networks (see e.g. [GPM+14, ACB17]), and robustifying deep neural network-based models against adversarial attacks (see e.g. [MMS+18]). These applications are indicative of a broader deep learning paradigm wherein robustness properties of a deep learning system are tested and enforced by another deep learning system. In these applications, it is very common to encounter min-max problems with objectives that are nonconvex-nonconcave, and thus evade treatment by the classical algorithmic toolkit targeting convex-concave objectives.

Indeed, the optimization challenges posed by objectives that are nonconvex-nonconcave are not just theoretical frustration. Practical experience with first-order methods is rife with frustration as well. A common experience is that the training dynamics of first-order methods is unstable, oscillatory or divergent, and the quality of the points encountered in the course of training can be poor; see e.g. [Goo16, MPPSD16, DISZ18, MGN18, DP18, MR18, MPP18, ADLH19]. This experience is in stark contrast to minimization (resp. maximization) problems, where even for nonconvex (resp. nonconcave) objectives, first-order methods have been found to efficiently converge to approximate local optima or stationary points (see e.g. [AAZB+17, JGN+17, LPP+19]), while practical methods such Stochastic Gradient Descent, Adagrad, and Adam [DHS11, KB14, RKK18] are driving much of the recent progress in Deep Learning.

The goal of this paper is to shed light on the complexity of min-max optimization problems, and elucidate its difference to minimization and maximization problems—as far as the latter is concerned without loss of generality we focus on minimization problems, as maximization problems behave exactly the same; we will also think of minimization problems in the framework of (1.1), where the variable y\boldsymbol{y} is absent, that is d2=0d_{2}=0. An important driver of our comparison between min-max optimization and minimization is, of course, the nature of the objective. So let us discuss:

\triangleright Convex-Concave Objective. The benign setting for min-max optimization is that where the objective function is convex-concave, while the benign setting for minimization is that where the objective function is convex. In their corresponding benign settings, the two problems behave quite similarly from a computational perspective in that they are amenable to convex programming, as well as first-order methods which only require gradient information about the objective function. Moreover, in their benign settings, both problems have guaranteed existence of a solution under compactness of the constraint set. Finally, it is clear how to define approximate solutions. We just relax the inequalities on the left hand side of (1.2) and (1.3) by some ε>0\varepsilon>0.

\triangleright Nonconvex-Nonconcave Objective. By contrapositive, the challenging setting for min-max optimization is that where the objective is not convex-concave, while the challenging setting for minimization is that where the objective is not convex. In these challenging settings, the behavior of the two problems diverges significantly. The first difference is that, while a solution to a minimization problem is still guaranteed to exist under compactness of the constraint set even when the objective is not convex, a solution to a min-max problem is not guaranteed to exist when the objective is not convex-concave, even under compactness of the constrained set. A trivial example is this: minxmaxy(xy)2\min_{x\in}\max_{y\in}(x-y)^{2}. Unsurprisingly, we show that checking whether a min-max optimization problem has a solution is NP\mathsf{NP}-hard. In fact, we show that checking whether there is an approximate min-max solution is NP\mathsf{NP}-hard, even when the function is Lispchitz and smooth and the desired approximation error is an absolute constant (see Theorem 10.1).

Since min-max solutions may not exist, what could we plausibly hope to compute? There are two obvious targets:

approximate stationary points of ff, as considered e.g. by [ALW19]; and

some type of approximate local min-max solution.

Unfortunately, as far as (I) is concerned, it is still possible that (even approximate) stationary points may not exist, and we show that checking if there is one is NP\mathsf{NP}-hard, even when the constraint set is d^{d}, the objective has Lipschitzness and smoothness polynomial in dd, and the desired approximation is an absolute constant (Theorem 4.1). So we focus on (II), i.e. (approximate) local min-max solutions. Several kinds of those have been proposed in the literature [DP18, MR18, JNJ19]. We consider a generalization of the concept of local min-max equilibria, proposed in [DP18, MR18], that also accommodates approximation.

Given ff, gg as above, and ε,δ>0\varepsilon,\delta>0, some point (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}) is an (ε,δ)(\varepsilon,\delta)-local min-max solution of (1.1), or a (ε,δ)(\varepsilon,\delta)-local min-max equilibrium, if it is feasible, i.e. g(x,y)0g(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})\leq 0, and satisfies:

In words, (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}) is an (ε,δ)(\varepsilon,\delta)-local min-max equilibrium, whenever the min player cannot update x\boldsymbol{x} to a feasible point within δ\delta of x\boldsymbol{x}^{\star} to reduce ff by at least ε\varepsilon, and symmetrically the max player cannot change y\boldsymbol{y} locally to increase ff by at least ε\varepsilon.

We show that the existence and complexity of computing such approximate local min-max equilibria depends on the relationship of ε\varepsilon and δ\delta with the smoothness, LL, and the Lipschitzness, GG, of the objective function ff. We distinguish the following regimes, also shown in Figure 1 together with a summary of our associated results.

\blacktriangleright Trivial Regime. This occurs when δ<εG\delta<\frac{\varepsilon}{G}. This regime is trivial because the GG-Lipschitzness of ff guarantees that all feasible points are (ε,δ)(\varepsilon,\delta)-local min-max solutions.

\blacktriangleright Local Regime. This occurs when δ<2εL\delta<\sqrt{\frac{2\varepsilon}{L}}, and it represents the interesting regime for min-max optimization. In this regime, we use the smoothness of ff to show that (ε,δ)(\varepsilon,\delta)-local min-max solutions always exist. Indeed, we show (Theorem 5.1) that computing them is computationally equivalent to the following variant of (I) which is more suitable for the constrained setting:

(approximate) fixed points of the projected gradient descent-ascent dynamics (Section 3.3).

We show via an application of Brouwer’s fixed point theorem to the iteration map of the projected gradient descent-ascent dynamics that (I)’ are guaranteed to exist. In fact, not only do they exist, but computing them is in PPAD\mathsf{PPAD}, as can be shown by bounding the Lipschitzness of the projected gradient descent-ascent dynamics (Theorem 5.2).

\blacktriangleright Global Regime. This occurs when δ\delta is comparable to the diameter of the constraint set. In this case, the existence of (ε,δ)(\varepsilon,\delta)-local min-max solutions is not guaranteed, and determining their existence is NP\mathsf{NP}-hard, even if ε\varepsilon is an absolute constant (Theorem 10.1).

The main results of this paper, summarized in Figure 1, are to characterize the complexity of computing local min-max solutions in the local regime. Our first main theorem is the following:

Computing (ε,δ)(\varepsilon,\delta)-local min-max solutions of Lipschitz and smooth objectives over convex compact domains in the local regime is PPAD\mathsf{PPAD}-complete. The hardness holds even when the constraint set is a polytope that is a subset of d^{d}, the objective takes values in $andthesmoothness,Lipschitzness,and the smoothness, Lipschitzness,1/\varepsilonandand1/\deltaarepolynomialinthedimension.Equivalently,computingare polynomial in the dimension. Equivalently, computing\alphaapproximatefixedpointsoftheProjectedGradientDescentAscentdynamicsonsmoothandLipschitzobjectivesis-approximate fixed points of the Projected Gradient Descent-Ascent dynamics on smooth and Lipschitz objectives is\mathsf{PPAD}complete,andthehardnessholdsevenwhenthetheconstraintsetisapolytopethatisasubsetof-complete, and the hardness holds even when the the constraint set is a polytope that is a subset of^{d},theobjectivetakesvaluesin, the objective takes values in[-d,d]andsmoothness,Lipschitzness,andand smoothness, Lipschitzness, and1/\alpha$ are polynomial in the dimension.

For the above complexity result we assume that we have “white box” access to the objective function. An important byproduct of our proof, however, is to also establish an unconditional hardness result in the Nemirovsky-Yudin [NY83] oracle optimization model, wherein we are given black-box access to oracles computing the objective function and its gradient. Our second main result is informally stated in Informal Theorem 2.

Assume that we have black-box access to an oracle computing a GG-Lipschitz and LL-smooth objective function f:Pf:\mathcal{P}\to, where Pd\mathcal{P}\subseteq^{d} is a known polytope, and its gradient f\nabla f. Then, computing an (ε,δ)(\varepsilon,\delta)-local min-max solution in the local regime (i.e., when δ<2ε/L\delta<\sqrt{2\varepsilon/L}) requires a number of oracle queries that is exponential in at least one of the following: 1/ε1/\varepsilon, LL, GG, or dd. In fact, exponential in dd-many queries are required even when LL, GG, 1/ε1/\varepsilon and 1/δ1/\delta are all polynomial in dd.

Importantly, the above lower bounds, in both the white-box and the black-box setting, come in sharp contrast to minimization problems, given that finding approximate local minima of smooth non-convex objectives ranging in [B,B][-B,B] in the local regime can be done using first-order methods using O(BL/ε)O(B\cdot L/\varepsilon) time/queries (see Section E). Our results are the first to show an exponential separation between these two fundamental problems in optimization in the black-box setting, and a super-polynomial separation in the white-box setting assuming PPADFP\mathsf{PPAD}\neq\mathsf{FP}.

We very briefly outline some of the main ideas for the PPAD\mathsf{PPAD}-hardness proof that we present in Sections 6 and 7. Our starting point as in many PPAD\mathsf{PPAD}-hardness results is a discrete analog of the problem of finding Brouwer fixed points of a continuous map. Departing from previous work, however, we do not use Sperner’s lemma as the discrete analog of Brouwer’s fixed point theorem. Instead, we define a new problem, called BiSperner, which is useful for showing our hardness results. BiSperner is closely related to the problem of finding panchromatic simplices guaranteed by Sperner’s lemma except, roughly speaking, that the vertices of the simplicization of a dd-dimensional hypercube are colored with 2d2d rather than d+1d+1 colors, every point of the simplicization is colored with dd colors rather than one, and we are seeking a vertex of the simplicization so that the union of colors on the vertices in its neighborhood covers the full set of colors. The first step of our proof is to show that BiSperner is PPAD\mathsf{PPAD}-hard. This step follows from the hardness of computing Brouwer fixed points.

The step that we describe next is only implicitly done by our proof, but it serves as useful intuition for reading and understanding it. We want to define a discrete two-player zero-sum game whose local equilibrium points correspond to solutions of a given BiSperner instance. Our two players, called “minimizer” and “maximizer,” each choose a vertex of the simplicization of the BiSperner instance. For every pair of strategies in our discrete game, i.e. vertices, chosen by our players, we define a function value and gradient values. Note that, at this point, we treat these values at different vertices of the simplicization as independent choices, i.e. are not defining a function over the continuum whose function values and gradient values are consistent with these choices. It is our intention, however, that in the continuous two-player zero-sum game that we obtain in the next paragraph via our interpolation scheme, wherein the minimizer and maximizer may choose any point in the continuous hypercube, the function value determines the payment of the minimizer to the maximizer, and the gradient value determines the direction of the best-response dynamics of the game. Before getting to that continuous game in the next paragraph, the main technical step of this discrete part of our construction is showing that every local equilibrium of the discrete game corresponds to a solution of the BiSperner instance we are reducing from. In order to achieve this we need to add some constraints to couple the strategies of the minimizer and the maximizer player. This step is the reason that the constraints g(x,y)0g(\boldsymbol{x},\boldsymbol{y})\leq 0 appear in the final min-max problem that we produce.

The third and quite challenging step of the proof is to show that we can interpolate in a smooth and computationally efficient way the discrete zero-sum game of the previous step. In low dimensions (treated in Section 6) such smooth and efficient interpolation can be done in a relatively simple way using single-dimensional smooth step functions. In high dimensions, however, the smooth and efficient interpolation becomes a challenging problem and to the best of our knowledge no simple solution exists. For this reason we construct our novel smooth and efficient interpolation coefficients of Section 8. These are a technically involved construction that we believe will prove to be very useful for characterizing the complexity of approximate solutions of other optimization problems.

The last part of our proof is to show that all the previous steps can be implemented in an efficient way both with respect to computational but also with respect to query complexity. This part is essential for both our white-box and black-box results. Although this seems like a relatively easy step, it becomes more difficult due to the complicated expressions in our smooth and efficient interpolation coefficients used in our previous step.

Closing this section we mention that all our NP\mathsf{NP}-hardness results are proven using a cute application of Lovász Local Lemma [EL73], which provides a powerful rounding tool that can drive the inapproximability all the way up to an absolute constant.

2 Local Minimization vs Local Min-Max Optimization

Because our proof is convoluted, involving multiple steps, it is difficult to discern from it why finding local min-max solutions is so much harder than finding local minima. For this reason, we illustrate in this section a fundamental difference between local minimization and local min-max optimization. This provides good intuition about why our hardness construction would fail if we tried to apply it to prove hardness results for finding local minima (which we know don’t exist).

Ultimately a key difference between min-min and min-max optimization is that best-response paths in min-max optimization problems can be closed, i.e., can form a cycle, as shown in Figure 2, Panel (b). On the other hand, this is impossible in min-min problems as the function value must monotonically decrease along best-response paths, thus cycles may not exist.

The above discussion offers qualitative differences between min-min and min-max optimization, which lie in the heart of why our computational intractability results are possible to prove for min-max but not min-min problems. For the precise step in our construction that breaks if we were to switch from a min-max to a min-min problem we refer the reader to Remark 6.9.

3 Further Related Work

There is a broad literature on the complexity of equilibrium computation. Virtually all these results are obtained within the computational complexity formalism of total search problems in NP\mathsf{NP}, which was spearheaded by [JPY88, MP89, Pap94b] to capture the complexity of search problems that are guaranteed to have a solution. Some key complexity classes in this landscape are shown in Figure 3. We give a non-exhaustive list of intractability results for equilibrium computation: [FPT04] prove that computing pure Nash equilibria in congestion games is PLS\mathsf{PLS}-complete; [DGP09] and later [CDT09] show that computing approximate Nash equilibria in normal-form games is PPAD\mathsf{PPAD}-complete; [EY10] study the complexity of computing exact Nash equilibria (which may use irrational probabilities), introducing the complexity class FIXP\mathsf{FIXP};

[VY11, CPY17] consider the complexity of computing Market equilibria; [Das13, Rub15, Rub16] consider the complexity of computing approximate Nash equilibria of constant approximation; [KM18] establish a connection between approximate Nash equilibrium computation and the SoS hierarchy; [Meh14, DFS20] study the complexity of computing Nash equilibria in specially structured games. A result that is particularly useful for our work is the result of [HPV89] which shows black-box query lower bounds for computing Brouwer fixed points of a continuous function. We use this result in Section 9 as an ingredient for proving our black-box lower bounds for computing approximate local min-max solutions.

Beyond equilibrium computation and its applications to Economics and Game Theory, the study of total search problems has found profound connections to many scientific fields, including continuous optimization [DP11, DTZ18], combinatorial optimization [SY91], query complexity [BCE+95], topology [GH19], topological combinatorics and social choice theory [FG18, FG19, FRHSZ20b, FRHSZ20a], algebraic combinatorics [BIQ+17, GKSZ19], and cryptography [Jeř16, BPR15, SZZ18]. For a more extensive overview of total search problems we refer the reader to the recent survey by Daskalakis [Das18].

As already discussed, min-max optimization has intimate connections to the foundations of Game Theory, Mathematical Programming, Online Learning, Statistics, and several other fields. Recent applications of min-max optimization to Machine Learning, such as Generative Adversarial Networks and Adversarial Training, have motivated a slew of recent work targeting first-order (or other light-weight online learning) methods for solving min-max optimization problems for convex-concave, nonconvex-concave, as well as nonconvex-nonconcave objectives. Work on convex-concave and nonconvex-concave objectives has focused on obtaining online learning methods with improved rates [KM19, LJJ19, TJNO19, NSH+19, LTHC19, OX19, Zha19, ADSG19, AMLJG20, GPDO20, LJJ20] and last-iterate convergence guarantees [DISZ18, DP18, MR18, MPP18, RLLY18, HA18, ADLH19, DP19, LS19, GHP+19, MOP19, ALW19], while work on nonconvex-nonconcave problems has focused on identifying different notions of local min-max solutions [JNJ19, MV20] and studying the existence and (local) convergence properties of learning methods at these points [WZB19, MV20, MSV20].

Preliminaries

We study optimization problems involving real-valued functions, considering two access models to such functions.

Black Box Model. In this model we are given access to an oracle Of\mathcal{O}_{f} such that given a point xd\boldsymbol{x}\in^{d} the oracle Of\mathcal{O}_{f} returns the values f(x)f(\boldsymbol{x}) and f(x)\nabla f(\boldsymbol{x}). In this model we assume that we can perform real number arithmetic operations. This is the traditional model used to prove lower bounds in Optimization and Machine Learning [NY83].

Promise Problems. To simplify the exposition of our paper, make the definitions of our computational problems and theorem statements clearer, and make our intractability results stronger, we choose to enforce the following constraints on our function access, Of\mathcal{O}_{f} or Cf\mathcal{C}_{f}, as a promise, rather than enforcing these constraints in some syntactic manner.

Consistency of Function Values and Gradient Values. Given some oracle Of\mathcal{O}_{f} or Turing machine Cf\mathcal{C}_{f}, it is difficult to determine by querying the oracle or examining the description of the Turing machine whether the function and gradient values output on different inputs are consistent with some differentiable function. In all our computational problems, we will only consider instances where this is promised to be the case. Moreover, for all our computational hardness results, the instances of the problems arising from our reductions satisfy these constraints, which are guaranteed syntactically by our reduction.

Lipschitzness, Smoothness and Boundedness. Similarly, given some oracle Of\mathcal{O}_{f} or Turing machine Cf\mathcal{C}_{f}, it is difficult to determine, by querying the oracle or examining the description of the Turing machine, whether the function and gradient values output by Of\mathcal{O}_{f} or Cf\mathcal{C}_{f} are consistent with some Lipschitz, smooth and bounded function with some prescribed Lipschitzness, smoothness, and bound on its absolute value. In all our computational problems, we only consider instances where the GG-Lipschitzness, LL-smoothness and BB-boundedness of the function are promised to hold for the prescribed, in the input of the problem, parameters GG, LL and BB. Moreover, for all our computational hardness results, the instances of the problems arising from our reductions satisfy this constraint, which is guaranteed syntactically by our reduction.

In summary, in the rest of this paper, whenever we prove an upper bound for some computational problem, namely an upper bound on the number of steps or queries to the function oracle required to solve the problem in the black-box model, or the containment of the problem in some complexity class in the white-box model, we assume that the afore-described properties are satisfied by the Of\mathcal{O}_{f} or Cf\mathcal{C}_{f} provided in the input. On the other hand, whenever we prove a lower bound for some computational problem, namely a lower bound on the number of steps/queries required to solve it in the black-box model, or its hardness for some complexity class in the white-box model, the instances arising in our lower bounds are guaranteed to satisfy the above properties syntactically by our constructions. As such, our hardness results will not exploit the difficulty in checking whether Of\mathcal{O}_{f} or Cf\mathcal{C}_{f} satisfy the above constraints in order to infuse computational complexity into our problems, but will faithfully target the computational problems pertaining to min-max optimization of smooth and Lipschitz objectives that we aim to understand in this paper.

1 Complexity Classes and Reductions

In this section we define the main complexity classes that we use in this paper, namely NP\mathsf{NP}, FNP\mathsf{FNP} and PPAD\mathsf{PPAD}, as well as the notion of reduction used to show containment or hardness of a problem for one of these complexity classes.

To define the complexity class PPAD\mathsf{PPAD} we first define the notion of polynomial-time reductions between search problemsIn this paper we only define and consider Karp-reductions between search problems., and the computational problem End-of-a-LineThis problem is sometimes called End-of-the-Line, but we adopt the nomenclature proposed by [Rub16] since we agree that it describes the problem better..

A search problem P1P_{1} is polynomial-time reducible to a search problem P2P_{2} if there exist polynomial-time computable functions f:{0,1}{0,1}f:\left\{0,1\right\}^{*}\to\left\{0,1\right\}^{*} and g:{0,1}×{0,1}×{0,1}{0,1}g:\left\{0,1\right\}^{*}\times\left\{0,1\right\}^{*}\times\left\{0,1\right\}^{*}\to\left\{0,1\right\}^{*} with the following properties: (i) if x\boldsymbol{x} is an input to P1P_{1}, then f(x)f(\boldsymbol{x}) is an input to P2P_{2}; and (ii) if y\boldsymbol{y} is a solution to P2P_{2} on input f(x)f(\boldsymbol{x}), then g(x,f(x),y)g(\boldsymbol{x},f(\boldsymbol{x}),\boldsymbol{y}) is a solution to P1P_{1} on input x\boldsymbol{x}.

To make sense of the above definition, we envision that the circuits CS\mathcal{C}_{S} and CP\mathcal{C}_{P} implicitly define a directed graph, with vertex set {0,1}n\{0,1\}^{n}, such that the directed edge (x,y){0,1}n×{0,1}n(\boldsymbol{x},\boldsymbol{y})\in\left\{0,1\right\}^{n}\times\left\{0,1\right\}^{n} belongs to the graph if and only if CS(x)=y\mathcal{C}_{S}(\boldsymbol{x})=\boldsymbol{y} and CP(y)=x\mathcal{C}_{P}(\boldsymbol{y})=\boldsymbol{x}. As such, all vertices in the implicitly defined graph have in-degree and out-degree at most 11. The above problem permits an output of 0\boldsymbol{0} if 0\boldsymbol{0} has equal in-degree and out-degree in this graph. Otherwise it permits an output x0\boldsymbol{x}\neq\boldsymbol{0} such that x\boldsymbol{x} has in-degree or out-degree equal to . It follows by the parity argument on directed graphs, namely that in every directed graph the sum of in-degrees equals the sum of out-degrees, that End-of-a-Line is a total problem, i.e. that for any possible binary circuits CS\mathcal{C}_{S} and CP\mathcal{C}_{P} there exists a solution of the “0.” kind or the “1.” kind in the definition of our problem (or both). Indeed, if 0\boldsymbol{0} has unequal in- and out-degrees, there must exist another vertex x0\boldsymbol{x}\neq\boldsymbol{0} with unequal in- and out-degrees, thus one of these degrees must be (as all vertices in the graph have in- and out-degrees bounded by 11).

We are finally ready to define the complexity class PPAD\mathsf{PPAD} introduced by [Pap94b].

The complexity class PPAD\mathsf{PPAD} contains all search problems that are polynomial time reducible to the End-of-a-Line problem.

The complexity class PPAD\mathsf{PPAD} is of particular importance, since it contains lots of fundamental problems in Game Theory, Economics, Topology and several other fields [DGP09, Das18]. A particularly important PPAD\mathsf{PPAD}-complete problem is finding fixed points of continuous functions, whose existence is guaranteed by Brouwer’s fixed point theorem.

While not stated exactly in this form, the following is a straightforward implication of the results presented in [CDT09].

Computational Problems of Interest

In this section, we define the computational problems that we study in this paper and discuss our main results, postponing formal statements to Section 4. We start in Section 3.1 by defining the mathematical objects of our study, and proceed in Section 3.2 to define our main computational problems, namely: (1) finding approximate stationary points; (2) finding approximate local minima; and (3) finding approximate local min-max equilibria. In Section 3.3, we present some bonus problems, which are intimately related, as we will see, to problems (2) and (3). As discussed in Section 2, for ease of presentation, we define our problems as promise problems.

We define the concepts of stationary points, local minima, and local min-max equilibria of real valued functions, and make some remarks about their existence, as well as their computational complexity. The formal discussion of the latter is postponed to Sections 3.2 and 4.

Now that we have defined the domain of the real-valued functions that we consider in this paper we are ready to define a notion of approximate stationary points.

It is easy to see that there exist continuously differentiable functions ff that do not have any (approximate) stationary points, e.g. linear functions. As we will see later in this paper, deciding whether a given function ff has a stationary point is NP\mathsf{NP}-hard and, in fact, it is even NP\mathsf{NP}-hard to decide whether a function has an approximate stationary point of a very gross approximation. At the same time, verifying whether a given point is (approximately) stationary can be done efficiently given access to a polynomial-time Turing machine that computes f\nabla f, so the problem of deciding whether an (approximate) stationary point exists lies in NP\mathsf{NP}, as long as we can guarantee that, if there is such a point, there will also be one with polynomial bit complexity. We postpone a formal discussion of the computational complexity of finding (approximate) stationary points or deciding their existence until we have formally defined our corresponding computational problem and settled the bit complexity of its solutions.

For the definition of local minima and local min-max equilibria we need the notion of closed dd-dimensional Euclidean balls.

To be clear, using the term “local minimum” in Definition 3.5 is a bit of a misnomer, since for large enough values of δ\delta the definition captures global minima as well. As δ\delta ranges from large to small, our notion of (ε,δ)(\varepsilon,\delta)-local minimum transitions from being an ε\varepsilon-globally optimal point to being an ε\varepsilon-locally optimal point. Importantly, unlike (approximate) stationary points, a (ε,δ)(\varepsilon,\delta)-local minimum is guaranteed to exist for all ε,δ>0\varepsilon,\delta>0 due to the compactness of dP(A,b)^{d}\cap\mathcal{P}(\boldsymbol{A},\boldsymbol{b}) and the continuity of ff. Thus the problem of finding an (ε,δ)(\varepsilon,\delta)-local minimum is total for arbitrary values of ε\varepsilon and δ\delta. On the negative side, for arbitrary values of ε\varepsilon and δ\delta, there is no polynomial-size and polynomial-time verifiable witness for certifying that a point x\boldsymbol{x}^{\star} is an (ε,δ)(\varepsilon,\delta)-local minimum. Thus the problem of finding an (ε,δ)(\varepsilon,\delta)-local minimum is not known to lie in FNP\mathsf{FNP}. As we will see in Section 4, this issue can be circumvented if we focus on particular settings of ε\varepsilon and δ\delta, in relationship to the Lipschitzness and smoothness of ff and the dimension dd.

Finally we define (ε,δ)(\varepsilon,\delta)-local min-max equilibrium as follows, recasting Definition 1.1 to the constraint set P(A,b)\mathcal{P}(\boldsymbol{A},\boldsymbol{b}).

f(x,y)<f(x,y)+εf(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})<f(\boldsymbol{x},\boldsymbol{y}^{\star})+\varepsilon for every xBd1(δ;x)\boldsymbol{x}\in\mathsf{B}_{d_{1}}(\delta;\boldsymbol{x}^{\star}) with (x,y)P(A,b)(\boldsymbol{x},\boldsymbol{y}^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}); and

f(x,y)>f(x,y)εf(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})>f(\boldsymbol{x}^{\star},\boldsymbol{y})-\varepsilon for every yBd2(δ;y)\boldsymbol{y}\in\mathsf{B}_{d_{2}}(\delta;\boldsymbol{y}^{\star}) with (x,y)P(A,b)(\boldsymbol{x}^{\star},\boldsymbol{y})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}).

Similarly to Definition 3.5, for large enough values of δ\delta, Definition 3.6 captures global min-max equilibria as well. As δ\delta ranges from large to small, our notion of (ε,δ)(\varepsilon,\delta)-local min-max equilibrium transitions from being an ε\varepsilon-approximate min-max equilibrium to being an ε\varepsilon-approximate local min-max equilibrium. Moreover, in comparison to local minima and stationary points, the problem of finding an (ε,δ)(\varepsilon,\delta)-local min-max equilibrium is neither total nor can its solutions be verified efficiently for all values of ε\varepsilon and δ\delta, even when P(A,b)=d\mathcal{P}(\boldsymbol{A},\boldsymbol{b})=^{d}. Again, this issue can be circumvented if we focus on particular settings of ε\varepsilon and δ\delta values, as we will see in Section 4.

2 First-Order Local Optimization Computational Problems

In this section, we define the search problems associated with our aforementioned definitions of approximate stationary points, local minima, and local min-max equilibria. We state our problems in terms of white-box access to the function ff and its gradient. Switching to the black-box variants of our computational problems amounts to simply replacing the Turing machines provided in the input of the problems with oracle access to the function and its gradient, as discussed in Section 2. As per our discussion in the same section, we define our computational problems as promise problems, the promise being that the Turing machine (or oracle) provided in the input to our problems outputs function values and gradient values that are consistent with a smooth and Lipschitz function with the prescribed in the input smoothness and Lipschitzness. Besides making the presentation cleaner, as we discussed in Section 2, the motivation for doing so is to prevent the possibility that computational complexity is tacked into our problems due to the possibility that the Turing machines/oracles provided in the input do not output function and gradient values that are consistent with a Lipschitz and smooth function. Importantly, all our computational hardness results syntactically guarantee that the Turing machines/oracles provided as input to our constructed hard instances satisfy these constraints.

Before stating our main computational problems below, we note that, for each problem, the dimension dd (in unary representation) is also an implicit input, as the description of the Turing machine Cf\mathcal{C}_{f} (or the interface to the oracle Of\mathcal{O}_{f} in the black-box counterpart of each problem below) has size at least linear in dd. We also refer to Remark 2.6 for how we may formally study complexity problems that take a polynomial-time Turing Machine in their input.

It is easy to see that StationaryPoint lies in FNP\mathsf{FNP}. Indeed, if there exists some point xP(A,b)\boldsymbol{x}\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}) such that f(x)2<ε/2\left\|\nabla f(\boldsymbol{x})\right\|_{2}<\varepsilon/2, then by the LL-smoothness of ff there must exist some point xP(A,b)\boldsymbol{x}^{\star}\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}) of bit complexity polynomial in the size of the input such that f(x)2<ε\left\|\nabla f(\boldsymbol{x}^{\star})\right\|_{2}<\varepsilon. On the other hand, it is clear that no such point exists if for all xP(A,b)\boldsymbol{x}\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}), f(x)2>ε\left\|\nabla f(\boldsymbol{x})\right\|_{2}>\varepsilon. We note that the looseness of the output requirement in our problem for functions ff that do not have points xP(A,b)x\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}) such that f(x)2<ε/2\left\|\nabla f(\boldsymbol{x})\right\|_{2}<\varepsilon/2 but do have points xP(A,b)x\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}) such that f(x)2ε\left\|\nabla f(\boldsymbol{x})\right\|_{2}\leq\varepsilon is introduced for the sole purpose of making the problem lie in FNP\mathsf{FNP}, as otherwise we would not be able to guarantee that the solutions to our search problem have polynomial bit complexity. As we show in Section 4, StationaryPoint is also FNP\mathsf{FNP}-hard, even when ε\varepsilon is a constant, the constraint set is very simple, namely P(A,b)=d\mathcal{P}(\boldsymbol{A},\boldsymbol{b})=^{d}, and G,LG,L are both polynomial in dd.

Next, we define the computational problems associated with local minimum and local min-max equilibrium. Recall that the first is guaranteed to have a solution, because, in particular, a global minimum exists due to the continuity of ff and the compactness of P(A,b)\mathcal{P}(\boldsymbol{A},\boldsymbol{b}).

3 Bonus Problems: Fixed Points of Gradient Descent/Gradient Descent-Ascent

Next we present a couple of bonus problems, GDFixedPoint and GDAFixedPoint, which respectively capture the computation of fixed points of the (projected) gradient descent and the (projected) gradient descent-ascent dynamics, with learning rate =1=1. As we see in Section 5, these problems are intimately related, indeed equivalent under polynomial-time reductions, to problems LocalMin and LocalMinMax respectively, in certain regimes of the approximation parameters. Before stating problems GDFixedPoint and GDAFixedPoint, we define the mappings FGDF_{GD} and FGDAF_{GDA} whose fixed points these problems are targeting.

for all (x,y)K(\boldsymbol{x},\boldsymbol{y})\in K, where K(y)={x(x,y)K}K(\boldsymbol{y})=\{\boldsymbol{x}^{\prime}\mid(\boldsymbol{x}^{\prime},\boldsymbol{y})\in K\} and K(x)={y(x,y)K}K(\boldsymbol{x})=\{\boldsymbol{y}^{\prime}\mid(\boldsymbol{x},\boldsymbol{y}^{\prime})\in K\}.

Note that FGDAF_{GDA} is called “unsafe” because the projection happens individually for xxf(x,y)\boldsymbol{x}-\nabla_{\boldsymbol{x}}f(\boldsymbol{x},\boldsymbol{y}) and y+yf(x,y)\boldsymbol{y}+\nabla_{\boldsymbol{y}}f(\boldsymbol{x},\boldsymbol{y}), thus FGDA(x,y)F_{GDA}(\boldsymbol{x},\boldsymbol{y}) may not lie in KK. We also define the “safe” version FsGDAF_{sGDA}, which projects the pair (xxf(x,y),y+yf(x,y))(\boldsymbol{x}-\nabla_{\boldsymbol{x}}f(\boldsymbol{x},\boldsymbol{y}),\boldsymbol{y}+\nabla_{\boldsymbol{y}}f(\boldsymbol{x},\boldsymbol{y})) jointly onto KK. As we show in Section 5 (in particular inside the proof of Theorem 5.2), computing fixed points of FGDAF_{GDA} and FsGDAF_{sGDA} are computationally equivalent so we stick to FGDAF_{GDA} which makes the presentation slightly cleaner.

We are now ready to define GDFixedPoint and GDAFixedPoint. As per earlier discussions, we define these computational problems as promise problems, the promise being that the Turing machine provided in the input to these problems outputs function values and gradient values that are consistent with a smooth and Lipschitz function with the prescribed, in the input to these problems, smoothness and Lipschitzness.

Summary of Results

In this section we summarize our results for the optimization problems that we defined in the previous section. We start with our theorem about the complexity of finding approximate stationary points, which we show to be FNP\mathsf{FNP}-complete even for large values of the approximation.

The computational problem StationaryPoint is FNP\mathsf{FNP}-complete, even when ε\varepsilon is set to any value 1/24\leq 1/24, and even when P(A,b)=d\mathcal{P}(\boldsymbol{A},\boldsymbol{b})=^{d}, G=dG=\sqrt{d}, L=dL=d, and B=1B=1.

The complexity of LocalMin and LocalMinMax is more difficult to characterize, as the nature of these problems changes drastically depending on the relationship of δ\delta with with ε\varepsilon, GG, LL and dd, which determines whether these problems ask for a globally vs locally approximately optimal solution. In particular, there are two regimes wherein the complexity of both problems is simple to characterize.

Global Regime. When δd\delta\geq\sqrt{d} then both LocalMin and LocalMinMax ask for a globally optimal solution. In this regime it is not difficult to see that both problems are FNP\mathsf{FNP}-hard to solve even when ε=Θ(1)\varepsilon=\Theta(1) and GG, LL are O(d)O(d) (see Section 10).

Trivial Regime. When δ\delta satisfies δ<ε/G\delta<\varepsilon/G, then for every point zP(A,b)\boldsymbol{z}\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}) it holds that f(z)f(z)<ε\left|f(\boldsymbol{z})-f(\boldsymbol{z}^{\prime})\right|<\varepsilon for every zBd(δ;z)\boldsymbol{z}^{\prime}\in\mathsf{B}_{d}(\delta;\boldsymbol{z}) with zP(A,b)\boldsymbol{z}^{\prime}\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}). Thus, every point z\boldsymbol{z} in the domain P(A,b)\mathcal{P}(\boldsymbol{A},\boldsymbol{b}) is a solution to both LocalMin and LocalMinMax.

It is clear from our discussion above, and in earlier sections, that, to really capture the complexity of finding local as opposed to global minima/min-max equilibria, we should restrict the value of δ\delta. We identify the following regime, which we call the “local regime.” As we argue shortly, this regime is markedly different from the global regime identified above in that (i) a solution is guaranteed to exist for both our problems of interest, where in the global regime only LocalMin is guaranteed to have a solution; and (ii) their computational complexity transitions to lower complexity classes.

Local Regime. Our main focus in this paper is the regime defined by δ<2ε/L\delta<\sqrt{2\varepsilon/L}. In this regime it is well known that Projected Gradient Descent can solve LocalMin in time O(BL/ε)O(B\cdot L/\varepsilon) (see Appendix E). Our main interest is understanding the complexity of LocalMinMax, which is not well understood in this regime. We note that the use of the constant 22 in the constraint δ<2ε/L\delta<\sqrt{2\varepsilon/L} which defines the local regime has a natural motivation: consider a point z\boldsymbol{z} where a LL-smooth function ff has f(z)=0\nabla f(\boldsymbol{z})=0; it follows from the definition of smoothness that z\boldsymbol{z} is both an (ε,δ)(\varepsilon,\delta)-local min and an (ε,δ)(\varepsilon,\delta)-local min-max equilibrium, as long as δ<2ε/L\delta<\sqrt{2\varepsilon/L}.

The following theorems provide tight upper and lower bounds on the computational complexity of solving LocalMinMax in the local regime. For compactness, we define the following problem:

We define the local-regime local min-max equilibrium computation problem, in short LR-LocalMinMax, to be the search problem LocalMinMax restricted to instances in the local regime, i.e. satisfying δ<2ε/L\delta<\sqrt{2\varepsilon/L}.

The computational problem LR-LocalMinMax belongs to PPAD\mathsf{PPAD}. As a byproduct, if some function ff is GG-Lipschitz and LL-smooth, then an (ε,δ)(\varepsilon,\delta)-local min-max equilibrium is guaranteed to exist when δ<2ε/L\delta<\sqrt{2\varepsilon/L}, i.e. in the local regime.

An important property of our reduction in the proof of Theorem 4.4 is that it is a black-box reduction. We can hence prove the following unconditional lower bound in the black-box model.

Our main goal in the rest of the paper is to provide the proofs of Theorems 4.3, 4.4 and 4.5. In Section 5, we show how to use Brouwer’s fixed point theorem to prove the existence of approximate local min-max equilibrium in the local regime. Moreover, we establish an equivalence between LocalMinMax and GDAFixedPoint, in the local regime, and show that both belong to PPAD\mathsf{PPAD}. In Sections 6 and 7, we provide a detailed proof of our main result, i.e. Theorem 4.4. Finally, in Section 9, we show how our proof from Section 7 produces as a byproduct the black-box, unconditional lower bound of Theorem 4.5. In Section 8, we outline a useful interpolation technique which allows as to interpolate a function given its values and the values of its gradient on a hypergrid, so as to enforce the Lipschitzness and smoothness of the interpolating function. We make heavy use of this technically involved result in all our hardness proofs.

Existence of Approximate Local Min-Max Equilibrium

In this section, we establish the totality of LR-LocalMinMax, i.e. LocalMinMax for instances satisfying δ<2ε/L\delta<\sqrt{2\varepsilon/L} as defined in Definition 4.2. In particular, we prove that every GG-Lipschitz and LL-smooth function admits an (ε,δ)(\varepsilon,\delta)-local min-max equilibrium, as long as δ<2ε/L\delta<\sqrt{2\varepsilon/L}. A byproduct of our proof is in fact that LR-LocalMinMax lies inside PPAD\mathsf{PPAD}. Specifically the main tool that we use to prove our result is a computational equivalence between the problem of finding fixed points of the Gradient Descent/Ascent dynamic, i.e. GDAFixedPoint, and the problem LR-LocalMinMax. A similar equivalence between GDFixedPoint and LocalMin also holds, but the details of that are left to the reader as a simple exercise. Next, we first present the equivalence between GDAFixedPoint and LR-LocalMinMax, and we then show that GDAFixedPoint is in PPAD\mathsf{PPAD}, which then also establishes that LR-LocalMinMax is in PPAD\mathsf{PPAD}.

For arbitrary ε>0\varepsilon>0 and 0<δ<2ε/L0<\delta<\sqrt{2\varepsilon/L}, suppose that (x,y)P(A,b)(\boldsymbol{x}^{\ast},\boldsymbol{y}^{\ast})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}) is an α\alpha-approximate fixed point of FGDAF_{GDA}, i.e., (x,y)FGDA(x,y)2<α\left\|(\boldsymbol{x}^{\ast},\boldsymbol{y}^{\ast})-F_{GDA}(\boldsymbol{x}^{\ast},\boldsymbol{y}^{\ast})\right\|_{2}<\alpha, where α(G+δ)2+4(εL2δ2)(G+δ)2\alpha\leq\frac{\sqrt{(G+\delta)^{2}+4(\varepsilon-\frac{L}{2}\delta^{2})}-(G+\delta)}{2}. Then (x,y)(\boldsymbol{x}^{\ast},\boldsymbol{y}^{\ast}) is also a (ε,δ)(\varepsilon,\delta)-local min-max equilibrium of ff.

For arbitary α>0\alpha>0, suppose that (x,y)(\boldsymbol{x}^{\ast},\boldsymbol{y}^{\ast}) is an (ε,δ)(\varepsilon,\delta)-local min-max equilibrium of ff for ε=α2L(5L+2)2\varepsilon=\frac{\alpha^{2}\cdot L}{(5L+2)^{2}} and δ=ε/L\delta=\sqrt{\varepsilon/L}. Then (x,y)(\boldsymbol{x}^{\ast},\boldsymbol{y}^{\ast}) is also an α\alpha-approximate fixed point of FGDAF_{GDA}.

The proof of Theorem 5.1 is presented in Appendix B.1. As already discussed, we use GDAFixedPoint as an intermediate step to establish the totality of LR-LocalMinMax and to show its inclusion in PPAD\mathsf{PPAD}. This leads to the following theorem.

The computational problems GDAFixedPoint and LR-LocalMinMax are both total search problems and they both lie in PPAD\mathsf{PPAD}.

Observe that Theorem 4.3 is implied by Theorem 5.2 whose proof is presented in Appendix B.2.

Hardness of Local Min-Max Equilibrium – Four-Dimensions

In Section 5, we established that LR-LocalMinMax belongs to PPAD\mathsf{PPAD}. Our proof is via the intermediate problem GDAFixedPoint which we showed that it is computationally equivalent to LR-LocalMinMax. Our next step is to prove the PPAD\mathsf{PPAD}-hardness of LR-LocalMinMax using again GDAFixedPoint as an intermediate problem.

The problem GDAFixedPoint is PPAD\mathsf{PPAD}-complete even in dimension d=4d=4 and B=2B=2. Therefore, LR-LocalMinMax is PPAD\mathsf{PPAD}-complete even in dimension d=4d=4 and B=2B=2.

The first color of every vertex is either 11^{-} or 1+1^{+} and the second color is either 22^{-} or 2+2^{+}.

The first color of all vertices on the left boundary of the grid is 1+1^{+}.

The first color of all vertices on the right boundary of the grid is 11^{-}.

The second color of all vertices on the bottom boundary of the grid is 2+2^{+}.

The second color of all vertices on the top boundary of the grid is 22^{-}.

Consider the function g(x)=M(x)xg(\boldsymbol{x})=M(\boldsymbol{x})-\boldsymbol{x}. Since MM is LL-Lipschitz, the function g:22g:^{2}\to^{2} is also (L+1)(L+1)-Lipschitz. Additionally gg can be easily computed via a polynomial-time Turing machine Cg\mathcal{C}_{g} that uses CM\mathcal{C}_{M} as a subroutine. We construct a proper coloring of a fine grid of 2^{2} using the signs of the outputs of gg. Namely we set n=log(L/γ)+2n=\mathop{\left\lceil\log(L/\gamma)+2\right\rceil} and this defines a 2n×2n2^{n}\times 2^{n} grid over 2^{2} that is indexed by {0,1}n×{0,1}n\{0,1\}^{n}\times\{0,1\}^{n}. Let gη:22g_{\eta}:^{2}\to^{2} be the function that the Turing Machine Cg\mathcal{C}_{g} evaluate when the requested accuracy is η>0\eta>0. Now we can define the circuit Cl\mathcal{C}_{l} as follows, We remind that we abuse the notation and we use a coordinate i{0,1}ni\in\{0,1\}^{n} both as a binary string and as a number in ([2n1]1)\left(\left[2^{n}-1\right]-1\right) and it is clear from the context which of the two we use.

Let η=γ22\eta=\frac{\gamma}{2\sqrt{2}}, there exists xR\boldsymbol{x}\in R such that g1(x)γ22\left|g_{1}(\boldsymbol{x})\right|\leq\frac{\gamma}{2\sqrt{2}} and yR\boldsymbol{y}\in R such that g2(y)γ22\left|g_{2}(\boldsymbol{y})\right|\leq\frac{\gamma}{2\sqrt{2}}.

We will prove the existence of x\boldsymbol{x} and the existence of y\boldsymbol{y} follows using an identical argument. If there exists a corner x\boldsymbol{x} of RR such that g1(x)g_{1}(\boldsymbol{x}) is in the range [η,η][-\eta,\eta] then the claim follows. Suppose not. Using this together with the fact that the first color of one of the corners of RR is 11^{-} and also the first color of one of the corners of RR is 1+1^{+} we conclude that there exist points x,x\boldsymbol{x},\boldsymbol{x}^{\prime} such that gη,1(x)0g_{\eta,1}(\boldsymbol{x})\geq 0 and gη,1(x)0g_{\eta,1}(\boldsymbol{x}^{\prime})\leq 0 The latter is inaccurate for the cases where the vertex (0,j)(0,j) belongs to either facets i=0i=0 or i=2n1i=2^{n}-1. Notice that the coloring in such vertices does not depend on the value of gηg_{\eta}. However in case where the color of such a corner is not consistent with the value of gηg_{\eta}, i.e. gη,1(0,j)<0g_{\eta,1}(0,j)<0 and Cl1(0,j)=1\mathcal{C}_{l}^{1}(0,j)=1 then this means that g1(0,j)η|g_{1}(0,j)|\leq\eta. This is due to the fact that g1(0,j)0g_{1}(0,j)\geq 0 and g1(0,j)g1,η(0,j)η|g_{1}(0,j)-g_{1,\eta}(0,j)|\leq\eta.. But we have that gηg2η\left\|g_{\eta}-g\right\|_{2}\leq\eta. This together with the fact that g1(x)∉[η,η]g_{1}(\boldsymbol{x})\not\in[-\eta,\eta] and g1(x)∉[η,η]g_{1}(\boldsymbol{x}^{\prime})\not\in[-\eta,\eta] implies that g1(x)0g_{1}(\boldsymbol{x})\geq 0 and also g1(x)0g_{1}(\boldsymbol{x}^{\prime})\leq 0. But because of the LL-Lipschitzness of gg and because the distance between x\boldsymbol{x} and x\boldsymbol{x}^{\prime} is at most 2γ4L\sqrt{2}\frac{\gamma}{4L} we conclude that g1(x)g1(x)γ22\left|g_{1}(\boldsymbol{x})-g_{1}(\boldsymbol{x}^{\prime})\right|\leq\frac{\gamma}{2\sqrt{2}}. Hence due to the signs of g1(x)g_{1}(\boldsymbol{x}) and g1(x)g_{1}(\boldsymbol{x}^{\prime}) we conclude that g1(x)γ22\left|g_{1}(\boldsymbol{x})\right|\leq\frac{\gamma}{2\sqrt{2}}. The same way we can prove that g1(y)γ22\left|g_{1}(\boldsymbol{y})\right|\leq\frac{\gamma}{2\sqrt{2}} and the claim follows. ∎

Using the Claim 6.3 and the LL-Lipschitzness of gg we get that for every zR\boldsymbol{z}\in R

where we have used also the fact that for any two points z,w\boldsymbol{z},\boldsymbol{w} it holds that zw22γ4L\left\|\boldsymbol{z}-\boldsymbol{w}\right\|_{2}\leq\sqrt{2}\frac{\gamma}{4L} which follows from the definition of the size of the grid. Therefore we have that g(z)2γ\left\|g(\boldsymbol{z})\right\|_{2}\leq\gamma and hence M(z)z2γ\left\|M(\boldsymbol{z})-\boldsymbol{z}\right\|_{2}\leq\gamma which implies that any point zR\boldsymbol{z}\in R is a γ\gamma-approximate fixed point of MM and the lemma follows. ∎

2 From 2D Bi-Sperner to Fixed Points of Gradient Descent/Ascent

The simplest way to achieve this is to define the function ff locally close to (x,y)(\boldsymbol{x},\boldsymbol{y}) to be equal to

Similarly, if x\boldsymbol{x} is on a vertex of the N×NN\times N grid, and the coloring of this vertex is (1,2)(1^{-},2^{-}), i.e. the output of Cl\mathcal{C}_{l} on this vertex is (1,1)(-1,-1), then we would like to have

The simplest way to achieve this is to define the function ff locally close to (x,y)(\boldsymbol{x},\boldsymbol{y}) to be equal to

In Figure 5 we show pictorially the correspondence of the colors of the vertices of the grid with the gradient of the function ff that we design. As shown in the figure, any set of vertices that share at least one of the colors 1+1^{+}, 11^{-}, 2+2^{+}, 22^{-}, agree on the direction of the gradient with respect the horizontal or the vertical axis. This observation is one of the main ingredients in the proof of correctness of our reduction that we present later in this section.

When x\boldsymbol{x} is not on a vertex of the N×NN\times N grid then our goal is to define ff via interpolating the functions corresponding to the corners of the cell in which x\boldsymbol{x} belongs. The reason that this interpolation is challenging is that we need to make sure the following properties are satisfied

the resulting function ff is both Lipschitz and smooth inside every cell,

the resulting function ff is both Lipschitz and smooth even at the boundaries of every cell, where two differect cells stick together,

For the low dimensional case, that we explore in this section, satisfying the first two properties is not a very difficult task, whereas for the third property we need to be careful and achieving this property is the main technical contribution of this section. On the contrary, for the high-dimensional case that we explore in Section 7 even achieving the first two properties is very challenging and technical.

As we will see in Section 6.2.1, if we accomplish a construction of a function ff with the aforementioned properties, then the fixed points of the projected Gradient Descent/Ascent can only appear inside cells that have all of the colors {1,1+,2,2+}\{1^{-},1^{+},2^{-},2^{+}\} at their corners. To see this consider a cell that misses some color, e.g. 1+1^{+}. Then all the corners of this cell have as first color 11^{-}. Since ff is defined as interpolation of the functions in the corners of the cells, with the aforementioned properties, inside that cell there is always a direction with respect to x1x_{1} and y1y_{1} for which the gradient is large enough. Hence any point inside that cell cannot be a fixed point of the projected Gradient Descent/Ascent. Of course this example provides just an intuition of our construction and ignores case where the cell is on the boundary of the grid. We provide a detailed explanation of this case in Section 6.2.1.

The above neat idea needs some technical adjustments in order to work. At first, the interpolation of the function in the interior of the cell must be smooth enough so that the resulting function is both Lipschitz and smooth. In order to satisfy this, we need to choose appropriate coefficients of the interpolation that interpolate smoothly not only the value of the function but also its derivatives. For this purpose we use the following smooth step function of order 11.

We define S1:S_{1}:\to to be the smooth step function of order 11 that is equal to S1(x)=3x22x3S_{1}(x)=3x^{2}-2x^{3}. Observe that the following hold S1(0)=0S_{1}(0)=0, S1(1)=1S_{1}(1)=1, S1(0)=0S_{1}^{\prime}(0)=0, and S1(1)=0S_{1}^{\prime}(1)=0.

As we have discussed, another issue is that since the interpolation coefficients depend on the value of x\boldsymbol{x} it could be that the derivatives of these coefficients overpower the derivatives of the functions that we interpolate. In this case we could be potentially creating fixed points of Gradient Descent/Ascent even in non panchromatic squares. As we will see later the magnitude of the derivatives from the interpolation coefficients depends on the differences x1y1x_{1}-y_{1} and x2y2x_{2}-y_{2}. Hence if we ensure that these differences are small then the derivatives of the interpolation coefficients will have to remain small and hence they can never overpower the derivatives from the corners of every cell. This is the place in our reduction where we add the constraints A(x,y)b\boldsymbol{A}\cdot(\boldsymbol{x},\boldsymbol{y})\leq\boldsymbol{b} that define the domain of the function ff as we describe in Section 3.

Now that we have summarized the main ideas of our construction we are ready for the formal definition of ff based on the coloring circuit Cl\mathcal{C}_{l}.

where the coefficients α1(x),α2(x)\alpha_{1}(\boldsymbol{x}),\alpha_{2}(\boldsymbol{x})\in are defined as follows

where δ1/(N1)=1/(2n1)\delta\triangleq 1/(N-1)=1/(2^{n}-1).

In Figure 6 we present an example of the application of Definition 6.5 to a specific cell with some given coloring on the corners.

An important property of the definition of the function fClf_{\mathcal{C}_{l}} is that the coefficients used in the definition of αi\alpha_{i} have the following two properties

Hence the function fClf_{\mathcal{C}_{l}} inside a cell is a smooth convex combination of the functions on the corners of the cell, as is suggested from Figure 6. Of course there are many ways to define such convex combination but in our case we use the smooth step function S1S_{1} to ensure the Lipschitz continuous gradient of the overall function fClf_{\mathcal{C}_{l}}. We prove this formally in the next lemma.

Let fClf_{\mathcal{C}_{l}} be the function defined based on a coloring circuit Cl\mathcal{C}_{l}, as per Definition 6.5. Then fClf_{\mathcal{C}_{l}} is continuous and differentiable at any point (x,y)4(\boldsymbol{x},\boldsymbol{y})\in^{4}. Moreover, fClf_{\mathcal{C}_{l}} is Θ(1/δ)\Theta(1/\delta)-Lipschitz and Θ(1/δ2)\Theta(1/\delta^{2})-smooth in the whole 4-dimensional hypercube 4^{4}, where δ=1/(N1)=1/(2n1)\delta=1/(N-1)=1/(2^{n}-1).

Clearly from Definition 6.5, fClf_{\mathcal{C}_{l}} is differentiable at any point (x,y)4(\boldsymbol{x},\boldsymbol{y})\in^{4} in which x\boldsymbol{x} lies on the strict interior of its respective cell. In this case the derivative with respect to x1x_{1} is

where for α1(x)/x1\partial\alpha_{1}(\boldsymbol{x})/\partial x_{1} we have that

Now since maxzS1(z)6\max_{z\in}\left|S^{\prime}_{1}(z)\right|\leq 6, we can conclude that α1(x)x124/δ\left|\frac{\partial\alpha_{1}(\boldsymbol{x})}{\partial x_{1}}\right|\leq 24/\delta. Similarly we can prove that α2(x)x124/δ\left|\frac{\partial\alpha_{2}(\boldsymbol{x})}{\partial x_{1}}\right|\leq 24/\delta, which combined with α1(x)1\left|\alpha_{1}(\boldsymbol{x})\right|\leq 1 implies fCl(x,y)x1O(1/δ)\left|\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})}{\partial x_{1}}\right|\leq O(1/\delta). Using similar reasoning we can prove that fCl(x,y)x2O(1/δ)\left|\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})}{\partial x_{2}}\right|\leq O(1/\delta) and that fCl(x,y)yi1\left|\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})}{\partial y_{i}}\right|\leq 1 for i=1,2i=1,2. Hence

The only thing we are missing to prove the Lipschitzness of fClf_{\mathcal{C}_{l}} is to prove its continuity on the boundaries of the cells of our subdivision. Suppose x\boldsymbol{x} lies on the boundary of some cell, e.g. let x\boldsymbol{x} lie on edge (C,D)(C,D) of one cell that is the same as the edge (A,B)(A^{\prime},B^{\prime}) of the cell to the right of that cell. Since S1(0)=0S_{1}(0)=0, S1(0)=0S^{\prime}_{1}(0)=0 and S1(1)=0S^{\prime}_{1}(1)=0 it holds that α1(x)/x1=0\partial\alpha_{1}(\boldsymbol{x})/\partial x_{1}=0 and the same for α2\alpha_{2}. Therefore the value of fCl/x1\partial f_{\mathcal{C}_{l}}/\partial x_{1} remains the same no matter the cell according to which it was calculated. As a result, fClf_{\mathcal{C}_{l}} is differentiable with respect to x1x_{1} even if x\boldsymbol{x} belongs in the boundary of its cell. Using the exact same reasoning for the rest of the variables, one can show that the function fClf_{\mathcal{C}_{l}} is differentiable at any point (x,y)4(\boldsymbol{x},\boldsymbol{y})\in^{4} and because of the aforementioned bound on the gradient fCl\nabla f_{\mathcal{C}_{l}} we can conclude that fClf_{\mathcal{C}_{l}} is O(1/δ)O(1/\delta)-Lipschitz.

Using very similar calculations, we can compute the closed formulas of the second derivatives of fClf_{\mathcal{C}_{l}} and using the bounds fCl()2\left|f_{\mathcal{C}_{l}}(\cdot)\right|\leq 2, S1()1\left|S_{1}(\cdot)\right|\leq 1, S1()6\left|S^{\prime}_{1}(\cdot)\right|\leq 6, and S1()6\left|S^{\prime\prime}_{1}(\cdot)\right|\leq 6, we can prove that each entry of the Hessian 2fCl(x,y)\nabla^{2}f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y}) is bounded by O(1/δ2)O(1/\delta^{2}) and thus

which implies the Θ(1/δ2)\Theta(1/\delta^{2})-smoothness of fClf_{\mathcal{C}_{l}}. ∎

(+)\boldsymbol{(+)} Construction of Instance for Fixed Points of Gradient Descent/Ascent.

Our construction can be described via the following properties.

The payoff function is the real-valued function fCl(x,y)f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y}) from the Definition 6.5.

The domain is the polytope P(A,b)\mathcal{P}(\boldsymbol{A},\boldsymbol{b}) that we described in Section 3. The matrix A\boldsymbol{A} and the vector b\boldsymbol{b} have constant size and they are computed so that the following inequalities hold

where Δ=δ/12\Delta=\delta/12 and δ=1/(N1)=1/(2n1)\delta=1/(N-1)=1/(2^{n}-1).

The parameter α\alpha is set to be equal to Δ/3\Delta/3.

The parameters GG and LL are set to be equal to the upper bounds on the Lipschitzness and the smoothness of fClf_{\mathcal{C}_{l}} respectively that we derived in Lemma 6.6. Namely we have that G=O(1/δ)=O(2n)G=O(1/\delta)=O(2^{n}) and L=O(1/δ2)=O(22n)L=O(1/\delta^{2})=O(2^{2n}).

We prove this last statement in Lemma 6.8, but before that we need the following technical lemma that will be useful to argue about solution on the boundary of P(A,b)\mathcal{P}(\boldsymbol{A},\boldsymbol{b}).

If xi(α,1α)x_{i}^{\star}\in(\alpha,1-\alpha) and xi(yiΔ+α,yi+Δα)x_{i}^{\star}\in(y_{i}^{\star}-\Delta+\alpha,y_{i}^{\star}+\Delta-\alpha) then fCl(x,y)xiα\left|\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{i}}\right|\leq\alpha.

If xiαx^{\star}_{i}\leq\alpha or xiyiΔ+αx^{\star}_{i}\leq y^{\star}_{i}-\Delta+\alpha then fCl(x,y)xiα\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{i}}\geq-\alpha.

If xi1αx^{\star}_{i}\geq 1-\alpha or xiyi+Δαx^{\star}_{i}\geq y^{\star}_{i}+\Delta-\alpha then fCl(x,y)xiα\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{i}}\leq\alpha.

The symmetric statements for yiy_{i}^{\star} hold. For i{1,2}i\in\{1,2\}:

If yi(α,1α)y_{i}^{\star}\in(\alpha,1-\alpha) and yi(xiΔ+α,xi+Δα)y_{i}^{\star}\in(x_{i}^{\star}-\Delta+\alpha,x_{i}^{\star}+\Delta-\alpha) then fCl(x,y)yiα\left|\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{i}}\right|\leq\alpha.

If yiαy^{\star}_{i}\leq\alpha or yixiΔ+αy^{\star}_{i}\leq x^{\star}_{i}-\Delta+\alpha then fCl(x,y)yiα\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{i}}\leq\alpha.

If yi1αy^{\star}_{i}\geq 1-\alpha or yixi+Δαy^{\star}_{i}\geq x^{\star}_{i}+\Delta-\alpha then fCl(x,y)yiα\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{i}}\geq-\alpha.

For this proof it is convenient to define x^=xxfCl(x,y)\hat{\boldsymbol{x}}=\boldsymbol{x}^{\star}-\nabla_{x}f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}), K(y)={x(x,y)P(A,b))}K(\boldsymbol{y}^{\star})=\{\boldsymbol{x}\mid(\boldsymbol{x},\boldsymbol{y}^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}))\}, and z=ΠK(y)x^\boldsymbol{z}=\Pi_{K(\boldsymbol{y}^{\star})}\hat{\boldsymbol{x}}.

For the second case, we assume for the sake of contradiction that xiαx^{\star}_{i}\leq\alpha and fCl(x,y)xi<α\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{i}}<-\alpha. These imply that x^i>xi+α\hat{x}_{i}>x_{i}^{\star}+\alpha and that zi=min(yi+Δ,x^i,1)>min(Δ,x^i,1)min(3α,xi+α)z_{i}=\min(y_{i}^{\star}+\Delta,\hat{x}_{i},1)>\min(\Delta,\hat{x}_{i},1)\geq\min(3\alpha,x_{i}^{\star}+\alpha). As a result, xizi=zixi>min(3α,x^i+α)xi\left|x_{i}^{\star}-z_{i}\right|=z_{i}-x_{i}^{\star}>\min(3\alpha,\hat{x}_{i}+\alpha)-x_{i}^{\star} which is greater than α\alpha. The latter is a contradiction with the assumption that (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}) is a solution to the GDAFixedPoint problem. Also if we assume that xiyiΔ+αx_{i}^{\star}\leq y_{i}^{\star}-\Delta+\alpha using the same reasoning we get that zi=min(x^i,yi+Δα,1)z_{i}=\min(\hat{x}_{i},y_{i}^{\star}+\Delta-\alpha,1). From this we can again prove that xizi>α\left|x_{i}^{\star}-z_{i}\right|>\alpha which contradicts the fact that (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}) is a solution to GDAFixedPoint.

The third case can be proved using the same arguments as the second case. Then using the corresponding arguments we can prove the corresponding statements for the yy variables. ∎

We are now ready to prove that solutions of GDAFixedPoint can only occur in cells that are either panchromatic or violate the boundary conditions of a proper coloring. For convenience in the rest of this section we define R(x)R(\boldsymbol{x}) to be the cell of the 2n×2n2^{n}\times 2^{n} grid that contains x\boldsymbol{x}.

for i,ji,j such that x1[i2n1,i+12n1] and x2[j2n1,j+12n1]x_{1}\in\left[\frac{i}{2^{n}-1},\frac{i+1}{2^{n}-1}\right]\text{ and }x_{2}\in\left[\frac{j}{2^{n}-1},\frac{j+1}{2^{n}-1}\right] if there are multiple ii, jj that satisfy the above condition then we choose R(x)R(\boldsymbol{x}) to be the cell that corresponds to the ii, jj such that the pair (i,j)(i,j) it the lexicographically first such that ii, jj satisfy the above condition. We also define the corners Rc(x)R_{c}(\boldsymbol{x}) of R(x)R(\boldsymbol{x}) as

where R(x)=[i2n1,i+12n1]×[j2n1,j+12n1]R(\boldsymbol{x})=\left[\frac{i}{2^{n}-1},\frac{i+1}{2^{n}-1}\right]\times\left[\frac{j}{2^{n}-1},\frac{j+1}{2^{n}-1}\right].

x11/(2n1)x_{1}^{\star}\geq 1/(2^{n}-1) and, for all vRc(x)\boldsymbol{v}\in R_{c}(\boldsymbol{x}^{\star}), it holds that Cl1(v)=1\mathcal{C}_{l}^{1}(\boldsymbol{v})=-1.

x1(2n2)/(2n1)x_{1}^{\star}\leq(2^{n}-2)/(2^{n}-1) and, for all vRc(x)\boldsymbol{v}\in R_{c}(\boldsymbol{x}^{\star}), it holds that Cl1(v)=+1\mathcal{C}_{l}^{1}(\boldsymbol{v})=+1.

x21/(2n1)x_{2}^{\star}\geq 1/(2^{n}-1) and, for all vRc(x)\boldsymbol{v}\in R_{c}(\boldsymbol{x}^{\star}), it holds that Cl2(v)=1\mathcal{C}_{l}^{2}(\boldsymbol{v})=-1.

x2(2n2)/(2n1)x_{2}^{\star}\leq(2^{n}-2)/(2^{n}-1) and, for all vRc(x)\boldsymbol{v}\in R_{c}(\boldsymbol{x}^{\star}), it holds that Cl2(v)=+1\mathcal{C}_{l}^{2}(\boldsymbol{v})=+1.

We prove that there is no solution (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}) of GDAFixedPoint that satisfies the statement 1. and the fact that (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}) cannot satisfy the other statements follows similarly. It is convenient for us to define x^=xxfCl(x,y)\hat{\boldsymbol{x}}=\boldsymbol{x}^{\star}-\nabla_{x}f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}), K(y)={x(x,y)P(A,b))}K(\boldsymbol{y}^{\star})=\{\boldsymbol{x}\mid(\boldsymbol{x},\boldsymbol{y}^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}))\}, z=ΠK(y)x^\boldsymbol{z}=\Pi_{K(\boldsymbol{y}^{\star})}\hat{\boldsymbol{x}}, and y^=y+yfCl(x,y)\hat{\boldsymbol{y}}=\boldsymbol{y}^{\star}+\nabla_{y}f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}), K(x)={y(x,y)P(A,b))}K(\boldsymbol{x}^{\star})=\{\boldsymbol{y}\mid(\boldsymbol{x}^{\star},\boldsymbol{y})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}))\}, w=ΠK(x)y^\boldsymbol{w}=\Pi_{K(\boldsymbol{x}^{\star})}\hat{\boldsymbol{y}}.

For the sake of contradiction we assume that there exists a solution of (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}) such that x11/(2n1)x_{1}^{\star}\geq 1/(2^{n}-1) and for all vRc(x)\boldsymbol{v}\in R_{c}(\boldsymbol{x}^{\star}) it holds that Cl1(v)=1\mathcal{C}_{l}^{1}(\boldsymbol{v})=-1. Using the fact that the first color of all the corners of R(x)R(\boldsymbol{x}^{\star}) is 11^{-}, we will prove that (1) fCl(x,y)x11/2\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{1}}\geq 1/2, and (2) fCl(x,y)y1=1\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{1}}=-1.

Let R(x)=[i2n1,i+12n1]×[j2n1,j+12n1]R(\boldsymbol{x}^{\star})=\left[\frac{i}{2^{n}-1},\frac{i+1}{2^{n}-1}\right]\times\left[\frac{j}{2^{n}-1},\frac{j+1}{2^{n}-1}\right], then since all the corners vRc(x)\boldsymbol{v}\in R_{c}(\boldsymbol{x}^{\star}) have Cl1(v)=1\mathcal{C}_{l}^{1}(\boldsymbol{v})=-1, from the Definition 6.5 we have that

where (x1A,x2A)=(i/(2n1),j/(2n1))(x_{1}^{A},x_{2}^{A})=(i/(2^{n}-1),j/(2^{n}-1)), (x1B,x2B)=(i/(2n1),(j+1)/(2n1))(x_{1}^{B},x_{2}^{B})=(i/(2^{n}-1),(j+1)/(2^{n}-1)), (x1C,x2C)=((i+1)/(2n1),(j+1)/(2n1))(x_{1}^{C},x_{2}^{C})=((i+1)/(2^{n}-1),(j+1)/(2^{n}-1)), and (x1D,x2D)=((i+1)/(2n1),j/(2n1))(x_{1}^{D},x_{2}^{D})=((i+1)/(2^{n}-1),j/(2^{n}-1)). If we differentiate this with respect to y1y_{1} we immediately get that fCl(x,y)y1=1\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{1}}=-1. On the other hand if we differentiate with respect to x1x_{1} we get

where the last inequality follows from the fact that S1()3/2\left|S^{\prime}_{1}(\cdot)\right|\leq 3/2 and the fact that, due to the constraints that define the polytope P(A,b)\mathcal{P}(\boldsymbol{A},\boldsymbol{b}), it holds that x2y2Δ\left|x_{2}-y_{2}\right|\leq\Delta.

Hence we have established that if x11/(2n1)x_{1}^{\star}\geq 1/(2^{n}-1) and for all vRc(x)\boldsymbol{v}\in R_{c}(\boldsymbol{x}^{\star}) it holds that Cl1(v)=1\mathcal{C}_{l}^{1}(\boldsymbol{v})=-1 then it holds that that (1) fCl(x,y)x11/2\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{1}}\geq 1/2, and (2) fCl(x,y)y1=1\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{1}}=-1. Now it is easy to see that the only way to satisfy both fCl(x,y)x11/2\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{1}}\geq 1/2 and z1x1α\left|z_{1}-x_{1}^{\star}\right|\leq\alpha is that either x1αx_{1}^{\star}\leq\alpha or x1y1Δ+αx_{1}^{\star}\leq y_{1}^{\star}-\Delta+\alpha. The first case is excluded by the assumption in the first statement of our lemma and our choice of α=Δ/3=1/(36(2n1))\alpha=\Delta/3=1/(36\cdot(2^{n}-1)) thus it holds that x1y1Δ+αx_{1}^{\star}\leq y_{1}^{\star}-\Delta+\alpha. But then we can use the case 3 for the yy variables of Lemma 6.7 and we get that fCl(x,y)y1α\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{1}}\geq-\alpha, which cannot be true since we proved that fCl(x,y)y1=1\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{1}}=-1. Therefore we have a contradiction and the first statement of the lemma holds. Using the same reasoning we prove the rest of the statements. ∎

The computations presented in (6.4) is the precise point where an attempt to prove the hardness of minimization problems would fail. In particular, if our goal was to construct a hard minimization instance then the function fClf_{\mathcal{C}_{l}} would need to have the terms xi+yix_{i}+y_{i} instead of xiyix_{i}-y_{i} so that the fixed points of gradient descent coincide with approximate local minimum of fClf_{\mathcal{C}_{l}}. In that case we cannot lower bound the gradient of (6.4) below from 1/21/2 because the term x2+y2\left|x^{\star}_{2}+y^{\star}_{2}\right| will be the dominant one and hence the sign of the derivative can change depending on the value x2+y2\left|x^{\star}_{2}+y^{\star}_{2}\right|. For a more intuitive explanation of the reason why we cannot prove hardness of minimization problems we refer to the Introduction, at Section 1.2.

We have now all the ingredients to prove Theorem 6.1.

Hardness of Local Min-Max Equilibrium – High-Dimensions

The iith color of every vertex is either the color i+i^{+} or the color ii^{-}.

All the vertices whose iith coordinate is , i.e. they are at the lower boundary of the iith direction, should have the iith color equal to i+i^{+}.

All the vertices whose iith coordinate is 11, i.e. they are at the higher boundary of the iith direction, should have the iith color equal to ii^{-}.

γ\gamma-SuccinctBrouwer is PPAD\mathsf{PPAD}-complete for any fixed constant γ>0\gamma>0.

Consider the function g(x)=M(x)xg(\boldsymbol{x})=M(\boldsymbol{x})-\boldsymbol{x}. Since MM is 1/γ1/\gamma-Lipschitz, g:ddg:^{d}\to^{d} is also (1+1/γ)(1+1/\gamma)-Lipschitz. Additionally gg can be easily computed via a polynomial-time Turing machine Cg\mathcal{C}_{g} that uses CM\mathcal{C}_{M} as a subroutine. We construct the coloring sequences of every vertex of a dd-dimensional grid with N=Θ(d/γ2)N=\Theta(d/\gamma^{2}) points in every direction using gg. Let gη:22g_{\eta}:^{2}\to^{2} be the function that the Turing Machine Cg\mathcal{C}_{g} evaluate when the requested accuracy is η>0\eta>0. For each vertex v=(v1,,vn)([N]1)d\boldsymbol{v}=(v_{1},\ldots,v_{n})\in\left(\left[N\right]-1\right)^{d} of the dd-dimensional grid its coloring sequence Cl(v){1,1}d\mathcal{C}_{l}(\boldsymbol{v})\in\{-1,1\}^{d} is constructed as follows: For each coordinate j=1,,dj=1,\ldots,d,

Let v\boldsymbol{v} be any vertex on the same cubelet with the output vertices v(1)\boldsymbol{v}^{(1)}, \ldots, v(d)\boldsymbol{v}^{(d)}, u(1)\boldsymbol{u}^{(1)}, \ldots, u(d)\boldsymbol{u}^{(d)}. From the guarantees of colors of the sequences v(1)\boldsymbol{v}^{(1)}, \ldots, v(d)\boldsymbol{v}^{(d)}, u(1)\boldsymbol{u}^{(1)}, \ldots, u(d)\boldsymbol{u}^{(d)} we have that either Clj(v)Clj(v(j))=1\mathcal{C}_{l}^{j}(\boldsymbol{v})\cdot\mathcal{C}_{l}^{j}(\boldsymbol{v}^{(j)})=-1 or Clj(v)Clj(u(j))=1\mathcal{C}_{l}^{j}(\boldsymbol{v})\cdot\mathcal{C}_{l}^{j}(\boldsymbol{u}^{(j)})=-1, let v(j)\overline{\boldsymbol{v}}^{(j)} be the vertex v(j)\boldsymbol{v}^{(j)} or u(j)\boldsymbol{u}^{(j)} depending on which one the jjth color has product equal to 1-1 with Clj(v)\mathcal{C}_{l}^{j}(\boldsymbol{v}). Now let η=2dγN\eta=\frac{2\sqrt{d}}{\gamma N} if gj(vN1)[η,η]g_{j}\left(\frac{\boldsymbol{v}}{N-1}\right)\in[-\eta,\eta] then the wanted inequality follows. On the other hand if gj(vN1)[η,η]g_{j}\left(\frac{\boldsymbol{v}}{N-1}\right)\in[-\eta,\eta] then using the fact that g(vN1)gη(vN1)η\left\|g\left(\frac{\boldsymbol{v}}{N-1}\right)-g_{\eta}\left(\frac{\boldsymbol{v}}{N-1}\right)\right\|_{\infty}\leq\eta and that from the definition of the colors we have that either gη,j(vN1)0g_{\eta,j}\left(\frac{\boldsymbol{v}}{N-1}\right)\geq 0, gη,j(v(j)N1)<0g_{\eta,j}\left(\frac{\overline{\boldsymbol{v}}^{(j)}}{N-1}\right)<0 or gη,j(vN1)<0g_{\eta,j}\left(\frac{\boldsymbol{v}}{N-1}\right)<0, gη,j(v^(j)N1)0g_{\eta,j}\left(\frac{\hat{\boldsymbol{v}}^{(j)}}{N-1}\right)\geq 0 we conclude that gj(vN1)0g_{j}\left(\frac{\boldsymbol{v}}{N-1}\right)\geq 0, gj(v(j)N1)<0g_{j}\left(\frac{\overline{\boldsymbol{v}}^{(j)}}{N-1}\right)<0 or gj(vN1)<0g_{j}\left(\frac{\boldsymbol{v}}{N-1}\right)<0, gj(v^(j)N1)0g_{j}\left(\frac{\hat{\boldsymbol{v}}^{(j)}}{N-1}\right)\geq 0 and thus,

where in the second inequality we have used the (1+1/γ)(1+1/\gamma)-Lipschitzness of gg. As a result, the point v^=v/(N1)d\hat{\boldsymbol{v}}=\boldsymbol{v}/(N-1)\in^{d} satisfies M(v^)v^22d/(γN)\left\|M(\hat{\boldsymbol{v}})-\hat{\boldsymbol{v}}\right\|_{2}\leq 2d/(\gamma N) and thus for if we pick N=Θ(d/γ2)N=\Theta(d/\gamma^{2}) then any vertex v\boldsymbol{v} of the panchromatic cell is a solution for γ\gamma-SuccinctBrouwer. ∎

2 From High Dimensional Bi-Sperner to Fixed Points of Gradient Descent/Ascent

where c([N1]1)d\boldsymbol{c}\in\left(\left[N-1\right]-1\right)^{d} such that x[c1N1,c1+1N1]××[cdN1,cd+1N1]\boldsymbol{x}\in\left[\frac{c_{1}}{N-1},\frac{c_{1}+1}{N-1}\right]\times\cdots\times\left[\frac{c_{d}}{N-1},\frac{c_{d}+1}{N-1}\right] and if there are multiple corners c\boldsymbol{c} that satisfy this condition then we choose R(x)R(\boldsymbol{x}) to be the cell that corresponds to the c\boldsymbol{c} that is lexicographically first among those that satisfy the condition. We also define Rc(x)R_{c}(\boldsymbol{x}) to be the set of vertices that are corners of the cublet R(x)R(\boldsymbol{x}), namely

where c([N1]1)d\boldsymbol{c}\in\left(\left[N-1\right]-1\right)^{d} such that R(x)=[c1N1,c1+1N1]××[cdN1,cd+1N1]R(\boldsymbol{x})=\left[\frac{c_{1}}{N-1},\frac{c_{1}+1}{N-1}\right]\times\cdots\times\left[\frac{c_{d}}{N-1},\frac{c_{d}+1}{N-1}\right] Every y\boldsymbol{y} that belongs to the cubelet R(x)R(\boldsymbol{x}) can be written as a convex combination of the vectors v/(N1)\boldsymbol{v}/(N-1) where vRc(x)\boldsymbol{v}\in R_{c}(\boldsymbol{x}). The value of the function fCl(x,y)f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y}) that we construct in this section is determined by the coloring sequences Cl(v)\mathcal{C}_{l}(\boldsymbol{v}) of the vertices vRc(x)\boldsymbol{v}\in R_{c}(\boldsymbol{x}). One of the main challenges that we face though is that the size of Rc(x)R_{c}(\boldsymbol{x}) is 2d2^{d} and hence if we want to be able to compute the value of fCl(x,y)f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y}) efficiently then we have to find a consistent rule to pick a subset of the vertices of Rc(x)R_{c}(\boldsymbol{x}) whose coloring sequence we need to define the function value fCl(x,y)f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y}). Although there are traditional ways to overcome this difficulty using the canonical simplicization of the cubelet R(x)R(\boldsymbol{x}), these technique leads only to functions that are continuous and Lipschitz but they are not enough to guarantee continuity of the gradient and hence the resulting functions are not smooth.

The problem of finding a computationally efficient way to define a continuous function as an interpolation of some fixed function in the corners of a cubelet so that the resulting function is both Lischitz and smooth is surprisingly difficult to solve. For this reason we introduce in this section the smooth and efficient interpolation coefficients (SEIC) that as we will see in Section 7.2.2, is the main technical tool to implement such an interpolation. Our novel interpolation coefficients are of independent interest and we believe that they will serve as a main technical tool for proving other hardness results in continuous optimization in the future.

In this section we only give a high level description of the smooth and efficient interpolation coefficients via their properties that we use in Section 7.2.2 to define the function fClf_{\mathcal{C}_{l}}. The actual construction of the coefficients is very challenging and technical and hence we postpone a detail exposition for Section 8.

For all vertices v([N]1)d\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}, the coefficient Pv(x)\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x}) is a twice-differentiable function and satisfies

Pv(x)xiΘ(d12/δ)\left|\frac{\partial\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})}{\partial x_{i}}\right|\leq\Theta(d^{12}/\delta).

For all v([N]1)d\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}, it holds that Pv(x)0\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})\geq 0 and v([N]1)dPv(x)=vRc(x)Pv(x)=1\sum_{\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}}\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})=\sum_{\boldsymbol{v}\in R_{c}(\boldsymbol{x})}\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})=1.

For all xd\boldsymbol{x}\in^{d}, if xi1/(N1)x_{i}\leq 1/(N-1) for some i[d]i\in[d] then there exists vR+(x)\boldsymbol{v}\in R_{+}(\boldsymbol{x}) such that vi=0v_{i}=0. Respectively, if xi11/(N1)x_{i}\geq 1-1/(N-1) then there exists vR+(x)\boldsymbol{v}\in R_{+}(\boldsymbol{x}) such that vi=1v_{i}=1.

An intuitive explanation of the properties of the SEIC coefficients is the following

The coefficients Pv\mathsf{P}_{\boldsymbol{v}} are both Lipschitz and smooth with Lipschitzness and smoothness parameters that depends polynomially in dd and N=1/δ+1N=1/\delta+1.

The coefficients Pv(x)\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x}) define a convex combination of the vertices Rc(x)R_{c}(\boldsymbol{x}).

For every xd\boldsymbol{x}\in^{d}, out of the NdN^{d} coefficients Pv\mathsf{P}_{\boldsymbol{v}} only d+1d+1 have non-zero value, or non-zero gradient or non-zero Hessian when evaluated at the point x\boldsymbol{x}. Moreover, given xd\boldsymbol{x}\in^{d} we can identify these d+1d+1 coefficients efficiently.

For every xd\boldsymbol{x}\in^{d} that is in a cubelet that touches the boundary there is at least one of the vertices in R+(x)R_{+}(\boldsymbol{x}) that is on the boundary of the continuous hypercube d^{d}.

In Section 10 in the proof of Theorem 10.4 we present a simple application of the existence of the SEIC coefficients for proving very simple black box oracle lower bounds for the global minimization problem.

Based on the existence of these coefficients we are now ready to define the function fClf_{\mathcal{C}_{l}} which is the main construction of our reduction.

2.2 Definition of a Lipschitz and Smooth Function Based on a BiSperner Instance

In this section our goal is to formally define the function fClf_{\mathcal{C}_{l}} and prove its Lipschitzness and smoothness properties in Lemma 7.5.

where αj(x)=v([N]1)dPv(x)Clj(v)\alpha_{j}(\boldsymbol{x})=-\sum_{\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}}\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})\cdot\mathcal{C}^{j}_{l}(\boldsymbol{v}), and Pv\mathsf{P}_{\boldsymbol{v}} are the coefficients defined in Definition 7.3.

We first prove that the function fClf_{\mathcal{C}_{l}} constructed in Definition 7.4 is GG-Lipschitz and LL-smooth for some appropriately selected parameters GG, LL that are polynomial in the dimension dd and in the discretization parameter NN. We use this property to establish that fClf_{\mathcal{C}_{l}} is a valid input to the promise problem GDAFixedPoint.

The function fClf_{\mathcal{C}_{l}} of Definition 7.4 is O(d15/δ)O(d^{15}/\delta)-Lipschitz and O(d27/δ2)O(d^{27}/\delta^{2})-smooth.

If we take the derivative with respect to xix_{i} and yiy_{i} and using property (B) of the coefficients Pv\mathsf{P}_{\boldsymbol{v}} we get the following relations,

Now by the property (C) of Definition 7.3 there are most d+1d+1 vertices v\boldsymbol{v} of Rc(x)R_{c}(\boldsymbol{x}) with the property Pv(x)0\nabla\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})\neq 0. Then if we also use property (A) we get αj(x)xiΘ(d13/δ)\left|\frac{\partial\alpha_{j}(\boldsymbol{x})}{\partial x_{i}}\right|\leq\Theta(d^{13}/\delta) and using the property (B) we get αi(x)1\left|\alpha_{i}(\boldsymbol{x})\right|\leq 1. Thus fCl(x,y)xiΘ(d14/δ)\left|\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})}{\partial x_{i}}\right|\leq\Theta(d^{14}/\delta) and fCl(x,y)yiΘ(d)\left|\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})}{\partial y_{i}}\right|\leq\Theta(d). Therefore we can conclude that fCl(x,y)2Θ(d15/δ)\left\|\nabla f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})\right\|_{2}\leq\Theta(d^{15}/\delta) and hence this proves that the function fClf_{\mathcal{C}_{l}} is Lipschitz continuous with Lipschitz constant Θ(d15/δ)\Theta(d^{15}/\delta).

To prove the smoothness of fClf_{\mathcal{C}_{l}}, we use the property (B) of the Definition 7.3 and we have

2.3 Description and Correctness of the Reduction – Proof of Theorem 4.4

()\boldsymbol{(}\star) Construction of Instance for Fixed Points of Gradient Descent/Ascent.

The payoff function is the real-valued function fCl(x,y)f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y}) from the Definition 7.4.

The domain is the polytope P(A,b)\mathcal{P}(\boldsymbol{A},\boldsymbol{b}) that we described in Section 3. The matrix A\boldsymbol{A} and the vector b\boldsymbol{b} are computed so that the following inequalities hold

The parameter α\alpha is set to be equal to Δ/3\Delta/3.

The parameters GG and LL are set to be equal to the upper bounds on the Lipschitzness and the smoothness of fClf_{\mathcal{C}_{l}} respectively that we derived in Lemma 7.5. Namely we have that G=O(d15/δ)G=O(d^{15}/\delta) and L=O(d27/δ2)L=O(d^{27}/\delta^{2}).

The first thing to observe is that the afore-described reduction is polynomial-time. For this observe that all of α\alpha, GG, LL, A\boldsymbol{A}, and b\boldsymbol{b} have representation that is polynomial in dd even if we use unary instead of binary representation. So the only thing that remains is the existence of a Turing machine CfCl\mathcal{C}_{f_{\mathcal{C}_{l}}} that computes the function and the gradient value of fClf_{\mathcal{C}_{l}} in time polynomial to the size of Cl\mathcal{C}_{l} and the requested accuracy. To prove this we need a detailed description of the SEIC coefficients and for this reason we postpone the proof of this to the Appendix D. Here we state the formally the result that we prove in the Appendix D which together with the discussion above proves that our reduction is indeed polynomial-time.

Moreover the running time of CfCl\mathcal{C}_{f_{\mathcal{C}_{l}}} is polynomial in the binary representation of x\boldsymbol{x}, y\boldsymbol{y}, and log(1/ε)\log(1/\varepsilon).

If xi(α,1α)x_{i}^{\star}\in(\alpha,1-\alpha) and xi(yiΔ+α,yi+Δα)x_{i}^{\star}\in(y_{i}^{\star}-\Delta+\alpha,y_{i}^{\star}+\Delta-\alpha) then fCl(x,y)xiα\left|\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{i}}\right|\leq\alpha.

If xiαx^{\star}_{i}\leq\alpha or xiyiΔ+αx^{\star}_{i}\leq y^{\star}_{i}-\Delta+\alpha then fCl(x,y)xiα\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{i}}\geq-\alpha.

If xi1αx^{\star}_{i}\geq 1-\alpha or xiyi+Δαx^{\star}_{i}\geq y^{\star}_{i}+\Delta-\alpha then fCl(x,y)xiα\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{i}}\leq\alpha.

The symmetric statements for yiy_{i}^{\star} hold.

If yi(α,1α)y_{i}^{\star}\in(\alpha,1-\alpha) and yi(xiΔ+α,xi+Δα)y_{i}^{\star}\in(x_{i}^{\star}-\Delta+\alpha,x_{i}^{\star}+\Delta-\alpha) then fCl(x,y)yiα\left|\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{i}}\right|\leq\alpha.

If yiαy^{\star}_{i}\leq\alpha or yixiΔ+αy^{\star}_{i}\leq x^{\star}_{i}-\Delta+\alpha then fCl(x,y)yiα\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{i}}\leq\alpha.

If yi1αy^{\star}_{i}\geq 1-\alpha or yixi+Δαy^{\star}_{i}\geq x^{\star}_{i}+\Delta-\alpha then fCl(x,y)yiα\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{i}}\geq-\alpha.

The proof of this lemma is identical to the proof of Lemma 6.7 and for this reason we skip the details of the proof here. ∎

xi1/(N1)x_{i}^{\star}\geq 1/(N-1) and for any vR+(x)\boldsymbol{v}\in R_{+}(\boldsymbol{x}^{\star}), it holds that Cli(v)=1\mathcal{C}_{l}^{i}(\boldsymbol{v})=-1.

xi11/(N1)x_{i}^{\star}\leq 1-1/(N-1) and for any vR+(x)\boldsymbol{v}\in R_{+}(\boldsymbol{x}^{\star}), it holds that Cl1(v)=+1\mathcal{C}_{l}^{1}(\boldsymbol{v})=+1.

We prove that there is no solution (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}) of GDAFixedPoint that satisfies the statement 1. and the fact that (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}) cannot satisfy the statement 2. follows similarly. It is convenient for us to define x^=xxfCl(x,y)\hat{\boldsymbol{x}}=\boldsymbol{x}^{\star}-\nabla_{x}f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}), K(y)={x(x,y)P(A,b))}K(\boldsymbol{y}^{\star})=\{\boldsymbol{x}\mid(\boldsymbol{x},\boldsymbol{y}^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}))\}, z=ΠK(y)x^\boldsymbol{z}=\Pi_{K(\boldsymbol{y}^{\star})}\hat{\boldsymbol{x}}, and y^=yyfCl(x,y)\hat{\boldsymbol{y}}=\boldsymbol{y}^{\star}-\nabla_{y}f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}), K(x)={y(x,y)P(A,b))}K(\boldsymbol{x}^{\star})=\{\boldsymbol{y}\mid(\boldsymbol{x}^{\star},\boldsymbol{y})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}))\}, w=ΠK(x)y^\boldsymbol{w}=\Pi_{K(\boldsymbol{x}^{\star})}\hat{\boldsymbol{y}}.

For the sake of contradiction we assume that there exists a solution of (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}) such that x11/(N1)x_{1}^{\star}\geq 1/(N-1) and for any vR+(x)\boldsymbol{v}\in R_{+}(\boldsymbol{x}^{\star}) it holds that Cli(v)=1\mathcal{C}_{l}^{i}(\boldsymbol{v})=-1. Using this fact, we will prove that (1) fCl(x,y)xi1/2\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{i}}\geq 1/2, and (2) fCl(x,y)yi=1\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{i}}=-1.

Let R(x)=[c1N1,c1+1N1]××[cdN1,cd+1N1]R(\boldsymbol{x}^{\star})=\left[\frac{c_{1}}{N-1},\frac{c_{1}+1}{N-1}\right]\times\cdots\times\left[\frac{c_{d}}{N-1},\frac{c_{d}+1}{N-1}\right], then since all the corners vR+(x)\boldsymbol{v}\in R_{+}(\boldsymbol{x}^{\star}) have Cli(v)=1\mathcal{C}_{l}^{i}(\boldsymbol{v})=-1, from the Definition 7.4 we have that

If we differentiate this with respect to yiy_{i} we immediately get that fCl(x,y)yi=1\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{i}}=-1. On the other hand if we differentiate with respect to xix_{i} we get

where the above follows from the following facts: (1) that αj(x)xlΘ(d13/δ)\left|\frac{\partial\alpha_{j}(\boldsymbol{x})}{\partial x_{l}}\right|\leq\Theta(d^{13}/\delta), which is proved in the proof of Lemma 7.5, (2) xjyjΔ\left|x_{j}-y_{j}\right|\leq\Delta, and (3) the definition of Δ\Delta. Now it is easy to see that the only way to satisfy both fCl(x,y)xi1/2\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{i}}\geq 1/2 and zixiα\left|z_{i}-x_{i}^{\star}\right|\leq\alpha is that either xiαx_{i}^{\star}\leq\alpha or xiyiΔ+αx_{i}^{\star}\leq y_{i}^{\star}-\Delta+\alpha. The first case is excluded by the assumption of the first statement of our lemma and our choice of α=Δ/3<1/(N1)\alpha=\Delta/3<1/(N-1), thus it holds that xiyiΔ+αx_{i}^{\star}\leq y_{i}^{\star}-\Delta+\alpha. But then we can use the case 3. for the yy variables of Lemma 6.7 and we get that fCl(x,y)y1α\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{1}}\geq-\alpha, which cannot be true since we proved that fCl(x,y)yi=1\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{i}}=-1. Therefore we have a contradiction and the first statement of the lemma holds. Using the same reasoning we prove the second statement too. ∎

Let (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}) be a solution to the GDAFixedPoint problem with input a Turing machine that represents the function fClf_{\mathcal{C}_{l}}, α=Δ/3\alpha=\Delta/3, where Δ=tδ/d14\Delta=t\cdot\delta/d^{14}, G=Θ(d15/δ)G=\Theta(d^{15}/\delta), L=Θ(d27/δ2)L=\Theta(d^{27}/\delta^{2}), and A\boldsymbol{A}, b\boldsymbol{b} as described in (\star).

For each coordinate ii, there exist the following three mutually exclusive cases,

1N1xi11N1\boldsymbol{\frac{1}{N-1}\leq x_{i}^{\star}\leq 1-\frac{1}{N-1}}: Since R+(x)1\left|R_{+}(\boldsymbol{x}^{\star})\right|\geq 1, it follows directly from Lemma 7.8 that there exists vR+(x)\boldsymbol{v}\in R_{+}(\boldsymbol{x}^{\star}) such that Cli(v)=1\mathcal{C}_{l}^{i}(\boldsymbol{v})=-1 and vR+(x)\boldsymbol{v}^{\prime}\in R_{+}(\boldsymbol{x}^{\star}) such that Cli(v)=+1\mathcal{C}_{l}^{i}(\boldsymbol{v})=+1.

Smooth and Efficient Interpolation Coefficients

In this section we describe the construction of the smooth and efficient interpolation coefficients (SEIC) that we introduced in Section 7.2.1. After the description of the construction we present the statements of the lemmas that prove the properties (A) - (D) of their Definition 7.3 and we refer to the Appendix C. We first remind the definition of the SEIC coefficients.

Our main goal in this section is to prove the following theorem.

One important component of the construction of the SEIC coefficients is the smooth-step functions which we introduce in Section 8.1. These functions also provide a toy example of smooth and efficient interpolation coefficients in 11 dimension. Then in Section 8.2 we present the construction of the SEIC coefficients in multiple dimensions and in Section 8.3 we state the main lemmas that lead to the proof of Theorem 8.1.

For every x0x\leq 0 it holds that g(x)=0g(x)=0, for every x1x\geq 1 it holds that g(x)=1g(x)=1 and for every xx\in it holds that S(x)S(x)\in.

For some kk it holds that gg is kk times continuously differentiable and its kkth derivative satisfies g(k)(0)=0g^{(k)}(0)=0 and g(k)(1)=0g^{(k)}(1)=0.

The largest number kk such that the smoothness property from above holds is characterizes the order of smoothness of the smooth step function gg.

In Section 6 we have already defined and used the smooth step function of order 11. For the construction of the SEIC coefficients we use the smooth step function of order 22 and the smooth step function of order \infty defined as follows.

We note that we use the notation SS instead of S2S_{2} for the smooth step function of order 22 for simplicitly of the exposition of the paper.

We present a plot of these step function in Figure 7, and we summarize some of their properties in Lemma 8.3. A more detailed lemma with additional properties of SS_{\infty} that are useful for the proof of Theorem 8.1 is presented in Lemma C.5 in the Appendix C.

The calculations for SS_{\infty} are more complicated. We have that

We set h(x)(exp(ln(2)x)+exp(ln(2)1x))(1x)2x2h(x)\triangleq\left(\exp\left(\frac{\ln(2)}{x}\right)+\exp\left(\frac{\ln(2)}{1-x}\right)\right)\left(1-x\right)^{2}x^{2} for xx\in and doing simple calculations we get that for x1/2x\leq 1/2 it holds that h(x)14exp(ln(2)x)x2h(x)\geq\frac{1}{4}\exp\left(\frac{\ln(2)}{x}\right)x^{2}. But the later can be easily lower bounded by 1/41/4. Applying the same argument for x1/2x\geq 1/2 we get that in general h(x)1/4h(x)\geq 1/4. Also it is not hard to see that for x1/2x\leq 1/2 it holds that exp(ln(2)x(1x))4exp(ln(2)x)\exp\left(\frac{\ln(2)}{x(1-x)}\right)\leq 4\exp\left(\frac{\ln(2)}{x}\right), whereas for x1/2x\geq 1/2 it holds that exp(ln(2)x(1x))4exp(ln(2)1x)\exp\left(\frac{\ln(2)}{x(1-x)}\right)\leq 4\exp\left(\frac{\ln(2)}{1-x}\right). Combining all these we can conclude that S(x)16\left|S^{\prime}_{\infty}(x)\right|\leq 16. Using similar argument we can prove that S(x)32\left|S^{\prime\prime}_{\infty}(x)\right|\leq 32. For all the derivatives of SS_{\infty} we can inductively prove that

where h0(1)=0h_{0}(1)=0 and all the functions hi(x)h_{i}(x) are bounded. Then the fact that all the derivatives of SS_{\infty} vanish at and at 11 follows by a simple inductive argument. ∎

Using the smooth step functions that we described above we can get a construction of SEIC coefficients for the single dimensional case. Unfortunately the extension to multiple dimensions is substantially harder and invokes new ideas that we explore later in this section. For the single dimensional problem of this example we have the interval $dividedwithdivided withNdiscretepointsandourgoalistodesigndiscrete points and our goal is to designNfunctionsfunctions\mathsf{P}_{1}-\mathsf{P}_{N}$ that satisfy the properties (A) - (D) of Definition 7.3. A simple construction of such functions is the following

Based on Lemma 8.3 it is not hard then to see that Pi\mathsf{P}_{i} is twice differentiable and it has bounded first and second derivatives, hence it satisfies property (A) of Definition 8. Using the fact that 1S(x)=S(1x)1-S_{\infty}(x)=S_{\infty}(1-x) we can also prove property (B). Finally properties (C) and (D) can be proved via the definition of the coefficient Pi\mathsf{P}_{i} from above. In Figure 7 we can see the plot of P3\mathsf{P}_{3} for N=5N=5. We leave the exact proofs of this example as an exercise for the reader.

2 Construction of SEIC Coefficients in High-Dimensions

The goal of this section is to present the construction of the family Id,N\mathcal{I}_{d,N} of smooth and efficient interpolation coefficients for every number of dimensions dd and any discretization parameter NN. Before diving into the details of our construction observe that even the 2-dimensional case with N=2N=2 is not trivial. In particular, the first attempt would be to define the SEIC coefficients based on the simple split of the square 2^{2} to two triangles divided by the diagonal of 2^{2}. Then using any soft-max function that is twice continuously differentiable we define a convex combination at every triangle. Unfortunately this approach cannot work since the resulting coefficients have discontinuous gradients along the diagonal of 2^{2}. We leave the presice calculations of this example as an exercise to the reader.

We start with some definitions about the orientation and the representation of the cubelets of the grid ([N]1)d\left(\left[N\right]-1\right)^{d}. Then we proceed with the definition of the QvQ_{\boldsymbol{v}} functions in Definition 8.7. Finally using QvQ_{\boldsymbol{v}} we can proceed with the construction of the SEIC coefficients.

Each cubelet [c1N1,c1+1N1]××[cdN1,cd+1N1]\left[\frac{c_{1}}{N-1},\frac{c_{1}+1}{N-1}\right]\times\cdots\times\left[\frac{c_{d}}{N-1},\frac{c_{d}+1}{N-1}\right], where c([N1]1)d\boldsymbol{c}\in\left(\left[N-1\right]-1\right)^{d} admits a source vertex sc=(s1,,sd)([N]1)d\boldsymbol{s}^{\boldsymbol{c}}=(s_{1},\ldots,s_{d})\in\left(\left[N\right]-1\right)^{d} and a target vertex tc=(t1,,td)([N]1)d\boldsymbol{t}^{\boldsymbol{c}}=(t_{1},\ldots,t_{d})\in\left(\left[N\right]-1\right)^{d} defined as follows,

Notice that the source sc\boldsymbol{s}^{\boldsymbol{c}} and the target tc\boldsymbol{t}^{\boldsymbol{c}} are vertices of the cubelet whose down-left corner is c\boldsymbol{c}.

(Canonical Representation) Let xd\boldsymbol{x}\in^{d} and R(x)=[c1N1,c1+1N1]××[cdN1,cd+1N1]R(\boldsymbol{x})=\left[\frac{c_{1}}{N-1},\frac{c_{1}+1}{N-1}\right]\times\cdots\times\left[\frac{c_{d}}{N-1},\frac{c_{d}+1}{N-1}\right] where c([N1]1)d\boldsymbol{c}\in\left(\left[N-1\right]-1\right)^{d}. The canonical representation of x\boldsymbol{x} under cubelet with down-left corner c\boldsymbol{c}, denoted by pxc=(p1,,pd)\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=(p_{1},\ldots,p_{d}) is defined as follows,

where tc=(t1,,td)\boldsymbol{t}^{\boldsymbol{c}}=(t_{1},\ldots,t_{d}) and sc=(s1,,sd)\boldsymbol{s}^{\boldsymbol{c}}=(s_{1},\ldots,s_{d}) are respectively the target and the source of R(x)R(\boldsymbol{x}).

Let xd\boldsymbol{x}\in^{d} lying in the cublet

with corners Rc(x)={c1,c1+1}××{cd,cd+1}R_{c}(\boldsymbol{x})=\{c_{1},c_{1}+1\}\times\cdots\times\{c_{d},c_{d}+1\}, where c([N1]1)d\boldsymbol{c}\in\left(\left[N-1\right]-1\right)^{d}. Let also sc=(s1,,sd)\boldsymbol{s}^{\boldsymbol{c}}=(s_{1},\ldots,s_{d}) be the source vertex of R(x)R(\boldsymbol{x}) and pxc=(p1,,pd)\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=(p_{1},\ldots,p_{d}) be the canonical representation of x\boldsymbol{x}. Then for each vertex vRc(x)\boldsymbol{v}\in R_{c}(\boldsymbol{x}) we define the following partition of the set of coordinates [d][d],

where S(x)S_{\infty}(x) and S(x)S(x) are the smooth step function defined in Definition 8.2.

To provide a better understanding of the Definitions 8.5, 8.6, and 8.7 we present the following 33-dimensional example.

We consider a case where d=3d=3 and N=3N=3. Let x=(1.3/3,2.5/3,0.3/3)\boldsymbol{x}=(1.3/3,2.5/3,0.3/3) lying in the cubelet R(x)=[13,23]×[23,1]×[0,13]R(\boldsymbol{x})=\left[\frac{1}{3},\frac{2}{3}\right]\times\left[\frac{2}{3},1\right]\times\left[0,\frac{1}{3}\right], and let c=(1,2,0)\boldsymbol{c}=(1,2,0). Then the source of R(x)R(\boldsymbol{x}) is sc=(2,2,0)\boldsymbol{s}^{\boldsymbol{c}}=(2,2,0) and the target tc=(1,3,1)\boldsymbol{t}^{\boldsymbol{c}}=(1,3,1) (Definition 8.5). The canonical representation of x\boldsymbol{x} is pxc=(0.7,0.5,0.3)\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=(0.7,0.5,0.3) (Definition 8.6). The only vertices with no-zero coefficients Qvc(x)Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x}) are those belonging in the set R+(x)={(1,3,1),(1,3,0),(1,2,0),(2,2,0)}R_{+}(\boldsymbol{x})=\{(1,3,1),(1,3,0),(1,2,0),(2,2,0)\} and again by Definition 8.7 we have that

Q(1,3,1)(x)=S(S(0.3))S(S(0.5))S(S(0.7))Q_{(1,3,1)}(\boldsymbol{x})=S_{\infty}(S(0.3))\cdot S_{\infty}(S(0.5))\cdot S_{\infty}(S(0.7)),

Q(1,3,0)(x)=S(S(0.5)S(0.3))S(S(0.7)S(0.3))Q_{(1,3,0)}(\boldsymbol{x})=S_{\infty}(S(0.5)-S(0.3))\cdot S_{\infty}(S(0.7)-S(0.3)),

Q(1,2,0)(x)=S(S(0.7)S(0.3))S(S(0.7)S(0.5))Q_{(1,2,0)}(\boldsymbol{x})=S_{\infty}(S(0.7)-S(0.3))\cdot S_{\infty}(S(0.7)-S(0.5)),

Q(2,2,0)(x)=S(1S(0.3))S(1S(0.5))S(1S(0.7))Q_{(2,2,0)}(\boldsymbol{x})=S_{\infty}(1-S(0.3))\cdot S_{\infty}(1-S(0.5))\cdot S_{\infty}(1-S(0.7)).

Now based on the Definitions 8.5, 8.6, and 8.7 we are ready to present the construction of the smooth and efficient interpolation coefficients.

Let xd\boldsymbol{x}\in^{d} lying in the cubelet R(x)=[c1N1,c1+1N1]××[cdN1,cd+1N1]R(\boldsymbol{x})=\left[\frac{c_{1}}{N-1},\frac{c_{1}+1}{N-1}\right]\times\cdots\times\left[\frac{c_{d}}{N-1},\frac{c_{d}+1}{N-1}\right]. Then for each vertex v([N]1)d\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d} the coefficient Pv(x)\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x}) is defined as follows,

where the functions Qvc(x)0Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})\geq 0 are defined in Definition 8.7 for any vRc(x)\boldsymbol{v}\in R_{c}(\boldsymbol{x}).

3 Sketch of the Proof of Theorem 8.1

First it is necessary to argue that Pv(x)\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x}) is a continuous function since it could be the case that Qvc(x)/(vRc(x)Qvc(x))Qvc(x)/(vVcQvc(x))Q^{\boldsymbol{c}}_{\boldsymbol{v}}(\boldsymbol{x})/(\sum_{\boldsymbol{v}\in R_{\boldsymbol{c}}(\boldsymbol{x})}Q^{\boldsymbol{c}}_{\boldsymbol{v}}(\boldsymbol{x}))\neq Q^{\boldsymbol{c}^{\prime}}_{\boldsymbol{v}}(\boldsymbol{x})/(\sum_{\boldsymbol{v}\in V_{\boldsymbol{c}^{\prime}}}Q^{\boldsymbol{c}^{\prime}}_{\boldsymbol{v}}(\boldsymbol{x})) for some point x\boldsymbol{x} that lies in the boundary of two adjacent cubelets with down-left corners c\boldsymbol{c} and c\boldsymbol{c}^{\prime} respectively. We specifically design the coefficients Qvc(x)Q_{v}^{\boldsymbol{c}}(\boldsymbol{x}) such as the latter does not occur and this is the main reason that the definition of the function Qvc(x)Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x}) is slightly complicated. For this reason we prove the following lemma.

For any vertex v([N]1)d\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}, Pv(x)\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x}) is a continuous and twice differentiable function and for any vRc(x)\boldsymbol{v}\notin R_{c}(\boldsymbol{x}) it holds that Pv(x)=Pv(x)=2Pv(x)=0\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})=\nabla\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})=\nabla^{2}\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})=0. Moreover, for every xd\boldsymbol{x}\in^{d} the set R+(x)R_{+}(\boldsymbol{x}) of vertices v([N]1)d\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d} such that Pv(x)>0\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})>0 satisfies R+(x)=d+1\left|R_{+}(\boldsymbol{x})\right|=d+1.

Based on Lemma 8.10 and the expression of Pv\mathsf{P}_{\boldsymbol{v}} we can prove that the Pv\mathsf{P}_{\boldsymbol{v}} coefficients defined in Definition 8.9 satisfy the properties (B) and (C) of the definition 7.3. To prove the properties (A) and (D) we also need the following two lemmas.

For any vertex v([N]1)d\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}, it holds that

Let a point xd\boldsymbol{x}\in^{d} and R+(x)R_{+}(\boldsymbol{x}) the set of vertices with Pv(x)>0\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})>0, then we have that

If 0xi<1/(N1)0\leq x_{i}<1/(N-1) then there always exists a vertex vR+(x)\boldsymbol{v}\in R_{+}(\boldsymbol{x}) such that vi=0v_{i}=0.

If 11/(N1)<xi11-1/(N-1)<x_{i}\leq 1 then there always exists a vertex vR+(x)\boldsymbol{v}\in R_{+}(\boldsymbol{x}) such that vi=1v_{i}=1.

The proofs of Lemmas 8.10, 8.11, and 8.12 can be found in Appendix C. Based on Lemmas 8.10, 8.11, and 8.12 we are now ready to prove Theorem 8.1.

The fact that the coefficients Pv\mathsf{P}_{\boldsymbol{v}} satisfy the property (A) follows directly from Lemma 8.11. Property (B) follows directly from the definition of Pv\mathsf{P}_{\boldsymbol{v}} in Definition 8.9 and the simple fact that Qvc(x)0Q^{\boldsymbol{c}}_{\boldsymbol{v}}(\boldsymbol{x})\geq 0. Property (C) follows from the second part of Lemma 8.10. Finally Property (D) follows directly from Lemma 8.12. ∎

Unconditional Black-Box Lower Bounds

In this section our goal is to prove Theorem 4.5 based on the Theorem 4.4 that we proved in Section 7 and the known black box lower bounds that we know for PPAD\mathsf{PPAD} by [HPV89]. In this section we assume that all the real number operation are performed with infinite precision.

Assume that there exists an algorithm AA that has black-box oracle access to the value of a function M:ddM:^{d}\to^{d} and outputs wd\boldsymbol{w}^{\star}\in^{d}. There exists a universal constant c>0c>0 such that if MM is 22-Lipschitz and M(w)w21/(2c)\left\|M(\boldsymbol{w}^{\star})-\boldsymbol{w}^{\star}\right\|_{2}\leq 1/(2c), then AA has to make at least 2d2^{d} different oracle calls to the function value of MM.

It is easy to observe in the reduction in the proof of Theorem 7.2 is a black-box reduction and in every evaluation of the constructed circuit Cl\mathcal{C}_{l} only requires one evaluation of the input function MM. Therefore the proof of Theorem 7.2 together with the Theorem 9.1 imply the following corollary.

Based on Corollary 9.2 and the reduction that we presented in Section 7, we are now ready to prove Theorem 4.5.

Hardness in the Global Regime

In this section our goal is to prove that the complexity of the problems LocalMinMax and LocalMin is significantly increased when ε\varepsilon, δ\delta lie outside the local regime, in the global regime. We start with the following theorem where we show that FNP\mathsf{FNP}-hardness of LocalMinMax.

LocalMinMax is FNP\mathsf{FNP}-hard even when ε\varepsilon is set to any value 1/384\leq 1/384, δ\delta is set to any value 1\geq 1, and even when P(A,b)=d\mathcal{P}(\boldsymbol{A},\boldsymbol{b})=^{d}, G=dG=\sqrt{d}, L=dL=d, and B=dB=d.

We now present a reduction from 3-SAT(3) to LocalMinMax that proves Theorem 10.1. First we remind the definition of the problem 3-SAT(3).

where each wj,zjw_{j},z_{j} are additional variables associated with clause ϕj\phi_{j}. The player that wants to minimize ff controls x,w\boldsymbol{x},\boldsymbol{w} vectors while the maximizing player controls the z\boldsymbol{z} variables.

The formula ϕ\phi admits a satisfying assignment if and only if there exist an (ε,δ)(\varepsilon,\delta)-local min-max equilibrium of f(x,w)f(\boldsymbol{x},\boldsymbol{w}) with ε1/384\varepsilon\leq 1/384, δ=1\delta=1 and (x,w)n+2m(\boldsymbol{x},\boldsymbol{w})\in^{n+2m}.

Let us assume that there exists a satisfying assignment. Given such a satisfying assignment we will construct ((x,w),z)\left((\boldsymbol{x}^{\star},\boldsymbol{w}^{\star}),\boldsymbol{z}^{\star}\right) that is a (0,1)(0,1)-local min-max equilibrium of ff. We set each variable xi1x_{i}^{\star}\triangleq 1 if and only if the respective boolean variable is true. Observe that this implies that Pj(x)=0P_{j}(\boldsymbol{x}^{\star})=0 for all jj, meaning that the strategy profile ((x,w),z)\left((\boldsymbol{x}^{\star},\boldsymbol{w}^{\star}),\boldsymbol{z}^{\star}\right) is a global Nash equilibrium no matter the values of w,z\boldsymbol{w}^{\star},\boldsymbol{z}^{\star}.

On the opposite direction, let us assume that there exists an (ε,δ)(\varepsilon,\delta)-local min-max equilibrium of ff with ε=1/384\varepsilon=1/384 and δ=1\delta=1. In this case we first prove that for each j=1,,mj=1,\ldots,m

Fix any clause jj. In case wjzj1/4\left|w_{j}^{\star}-z_{j}^{\star}\right|\geq 1/4 then the minimizing player can further decrease ff by at least Pj(x)/16P_{j}(x)/16 by setting wjzjw_{j}^{\star}\triangleq z_{j}^{\star}. On the other hand in case wjzj1/4\left|w_{j}^{\star}-z_{j}^{\star}\right|\leq 1/4 then the maximizing player can increase ff by at least Pj(x)/16P_{j}(x^{\star})/16 by moving zjz_{j}^{\star} either to or to 11. We remark that both of the options are feasible since δ=1\delta=1.

Now consider the probability distribution over the boolean assignments where each boolean variable xix_{i} is independently selected to be true with probability xix_{i}^{\star}. Then,

Since each ϕj\phi_{j} shares variables with at most 66 other clauses, the event of ϕj\phi_{j} not being satisfied is dependent with at most 66 other events. By the Lovász Local Lemma [EL73], we get that the probability none of these events occur is positive. As a result, there exists a satisfying assignment. ∎

Next we show the FNP\mathsf{FNP}-hardness of LocalMin. As we can see there is a gap between Theorem 10.1 and Theorem 10.3. In particular, the FNP\mathsf{FNP}-hardness result of LocalMinMax is stronger since it holds for any δ1\delta\geq 1 whereas for the FNP\mathsf{FNP}-hardness of LocalMin our proof needs δd\delta\geq\sqrt{d} when the rest of the parameters remain the same.

LocalMin is FNP\mathsf{FNP}-hard even when ε\varepsilon is set to any value 1/24\leq 1/24, δ\delta is set to any value d\geq\sqrt{d}, and even when P(A,b)=d\mathcal{P}(\boldsymbol{A},\boldsymbol{b})=^{d}, G=dG=\sqrt{d}, L=dL=d, and B=dB=d.

We follow the same proof as in the proof of Theorem 10.1 but we instead set f(x)=j=1mPj(x)f(\boldsymbol{x})=\sum_{j=1}^{m}P_{j}(\boldsymbol{x}) where xn\boldsymbol{x}\in^{n} (the number of variables is d:=nd:=n). We then get that if the initial formula is satisfiable then there exist xP(A,b)\boldsymbol{x}\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}), such that f(x)=0f(\boldsymbol{x})=0. On the other hand if there exist xP(A,b)\boldsymbol{x}\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}) such that f(x)1/24f(\boldsymbol{x})\leq 1/24 then the formula is satisfiable due to the Lovász Local Lemma [EL73]. Therefore the FNP\mathsf{FNP}-hardness follows again from the constructive proof of the Lovász Local Lemma [Mos09, MT10]. Setting δn\delta\geq\sqrt{n} which equals the diameter of the feasibility set implies that in case there exists x^\hat{\boldsymbol{x}} with f(x^)=0f(\hat{\boldsymbol{x}})=0 then all (ε,δ)(\varepsilon,\delta)-LocalMin x\boldsymbol{x}^{\ast} must admit value f(x)1/24f(\boldsymbol{x}^{\ast})\leq 1/24 and thus a satisfying assignment is implied. ∎

Next we prove a black box lower bound for minimization in the global regime. The proof of following lower bound illustrates the strength of the SEIC coefficients presented in Section 8. The next Theorem can also be used to prove the FNP\mathsf{FNP}-hardness of LocalMin in the global regime but with worse Lipschitzness and smoothness parameters than the once at Theorem 10.3 and for this reason we present both of them.

In the worst case, Ω(2d/d)\Omega\left(2^{d}/d\right) value/gradient black-box queries are needed to determine a (ε,δ)(\varepsilon,\delta)-LocalMin for functions f(x):df(\boldsymbol{x}):^{d}\to with G=Θ(d15)G=\Theta(d^{15}), L=Θ(d22)L=\Theta(d^{22}), ε<1\varepsilon<1, δ=d\delta=\sqrt{d}.

The proof is based on the fact that given just black-box access to a boolean formula ϕ:{0,1}d{0,1}\phi:\{0,1\}^{d}\mapsto\{0,1\}, at least Ω(2d)\Omega(2^{d}) queries are needed in order to determine whether ϕ\phi admits a satisfying assignment. The term black-box access refers to the fact that the clauses of the formula are not given and the only way to determine whether a specific boolean assignment is satisfying is by quering the specific binary string.

Given such a black-box oracle for a satisfying assignment dd, we construct the function fϕ(x):df_{\phi}(\boldsymbol{x}):^{d}\mapsto as follows:

for each corner vV\boldsymbol{v}\in V of the d^{d} hypercube, i.e. v{0,1}d\boldsymbol{v}\in\{0,1\}^{d}, we set fϕ(v):=1ϕ(v)f_{\phi}(\boldsymbol{v}):=1-\phi(\boldsymbol{v}).

for the rest of the points xd/V\boldsymbol{x}\in^{d}/V, fϕ(x):=vVPv(x)fϕ(v)f_{\phi}(\boldsymbol{x}):=\sum_{\boldsymbol{v}\in V}P_{\boldsymbol{v}}(\boldsymbol{x})\cdot f_{\phi}(\boldsymbol{v}) where PvP_{\boldsymbol{v}} are the coefficients of Definition 8.9.

We remind that by Lemma 8.11, we get that fϕ(x)2Θ(d12)\left\|\nabla f_{\phi}(\boldsymbol{x})\right\|_{2}\leq\Theta(d^{12}) and 2fϕ(x)2Θ(d25)\left\|\nabla^{2}f_{\phi}(\boldsymbol{x})\right\|_{2}\leq\Theta(d^{25}), meaning that fϕ()f_{\phi}(\cdot) is Θ(d12)\Theta(d^{12})-Lipschitz and Θ(d25)\Theta(d^{25})-smooth. Moreover by Lemma 8.7 , for any xn\boldsymbol{x}\in^{n} the set V(x)={vV:Pv(x)0}V(x)=\{\boldsymbol{v}\in V:P_{\boldsymbol{v}}(\boldsymbol{x})\neq 0\} has cardinality at most d+1d+1, while at the same time vVPv(x)=1\sum_{\boldsymbol{v}\in V}P_{\boldsymbol{v}}(\boldsymbol{x})=1.

In case ϕ\phi is not satisfiable then fϕ(x)=1f_{\phi}(\boldsymbol{x})=1 for all xd\boldsymbol{x}\in^{d} since fϕ(v)=1f_{\phi}(\boldsymbol{v})=1 for all vV\boldsymbol{v}\in V. In case there exists a satisfying assignment v\boldsymbol{v}^{\ast} then fϕ(v)=0f_{\phi}(\boldsymbol{v}^{\ast})=0. Since δd\delta\geq\sqrt{d} that is the diameter of d^{d}, any (ε,δ)(\varepsilon,\delta)-LocalMin x\boldsymbol{x}^{\ast} must have fϕ(x)ε<1f_{\phi}(\boldsymbol{x})\leq\varepsilon<1. Since fϕ(x)vV(x)Pv(x)fϕ(v)<1f_{\phi}(\boldsymbol{x}^{\ast})\triangleq\sum_{\boldsymbol{v}\in V(\boldsymbol{x}^{\ast})}P_{\boldsymbol{v}}(\boldsymbol{x}^{\ast})\cdot f_{\phi}(\boldsymbol{v}^{\ast})<1, there exists at least one vertex v^V(x)\hat{\boldsymbol{v}}\in V(\boldsymbol{x}) with fϕ(v^)=0f_{\phi}(\hat{\boldsymbol{v}})=0, meaning that ϕ(v)=1\phi(\boldsymbol{v}^{\ast})=1. As a result, given an (ε,δ)(\varepsilon,\delta)-LocalMin x\boldsymbol{x}^{\ast} with fϕ(x)<1f_{\phi}(\boldsymbol{x}^{\ast})<1, we can find a satisfying v^\hat{\boldsymbol{v}} by querying ϕ(v)\phi(\boldsymbol{v}) for each vertex vV(x)\boldsymbol{v}\in V(\boldsymbol{x}^{\ast}). Since V(x)d+1\left|V(\boldsymbol{x}^{\ast})\right|\leq d+1, this will take at most d+1d+1 additional queries.

Up next, we argue that in case an (ε,δ)(\varepsilon,\delta)-LocalMin could be determined with less than O(2d/d)O(2^{d}/d) value/gradient queries, then determining whether ϕ\phi admits a satisfying assignment could be done with less that O(2d)O(2^{d}) queries on ϕ\phi (the latter is obviously impossible). Notice that any value/gradient query both fϕ(x)f_{\phi}(\boldsymbol{x}) and fϕ(x)\nabla f_{\phi}(\boldsymbol{x}) can be computed by querying the value fϕ(v)f_{\phi}(\boldsymbol{v}) of the vertices vV(x)\boldsymbol{v}\in V(\boldsymbol{x}). Since V(x)d+1\left|V(\boldsymbol{x})\right|\leq d+1, any value/gradient query of fϕf_{\phi} can be simulated by d+1d+1 queries on ϕ\phi. ∎

Acknowledgements

This work was supported by NSF Awards IIS-1741137, CCF-1617730 and CCF-1901292, by a Simons Investigator Award, by the DOE PhILMs project (No. DE-AC05-76RL01830), and by the DARPA award HR00111990021. M.Z. was also supported by Google Ph.D. Fellowship. S.S. was supported by NRF 2018 Fellowship NRF-NRFF2018-07.

References

Appendix A Proof of Theorem 4.1

We first remind the definition of the 3-SAT(3) problem that we will use for our reduction.

It is well known that 3-SAT(3) is FNP\mathsf{FNP}-complete, for details see §9.2\S 9.2 of [Pap94a]. To prove Theorem 4.1, we reduce 3-SAT(3) to ε\varepsilon-StationaryPoint.

There exists a satisfying assignment for the clauses ϕ1,,ϕm\phi_{1},\ldots,\phi_{m} if and only if there solution of the constructed StationaryPoint with ε=1/24\varepsilon=1/24 a admits solution (x,w)n+m(\boldsymbol{x}^{\star},\boldsymbol{w}^{\star})\in^{n+m} such that f(x,w)2<1/24\left\|\nabla f(\boldsymbol{x}^{\star},\boldsymbol{w}^{\star})\right\|_{2}<1/24.

By the definition of StationaryPoint, in case there exists a pair of points (x^,w^)n+m(\hat{\boldsymbol{x}},\hat{\boldsymbol{w}})\in^{n+m} with f(x^,w^)2<ε/2=1/48\left\|\nabla f(\hat{\boldsymbol{x}},\hat{\boldsymbol{w}})\right\|_{2}<\varepsilon/2=1/48, then a pair of points (x,w)(\boldsymbol{x}^{\star},\boldsymbol{w}^{\star}) with f(x,w)2<ε=1/24\left\|\nabla f(\boldsymbol{x}^{\star},\boldsymbol{w}^{\star})\right\|_{2}<\varepsilon=1/24 must be returned. In case f(x,w)2>ε=1/24\left\|\nabla f(\boldsymbol{x},\boldsymbol{w})\right\|_{2}>\varepsilon=1/24 for all (x,w)n+m(\boldsymbol{x},\boldsymbol{w})\in^{n+m}, the null symbol \bot is returned.

Let us assume that there exists a satisfying assignment of ϕ\phi. Consider the solution (x^,w^)(\hat{\boldsymbol{x}},\hat{\boldsymbol{w}}) constructed as follows: each variable x^i\hat{x}_{i} is set to 11 iff the respective boolean variable is true and w^j=0\hat{w}_{j}=0 for all j=1,,mj=1,\ldots,m. Since the assignment satisfies the CNF-formula ϕ\phi, there exists at least one true literal in each clause ϕj\phi_{j} which means that Pj(x)=0P_{j}(x)=0 for all j=1,,mj=1,\ldots,m. As a result f(x^,w^)wj=Pj(x^)=0\frac{\partial f(\hat{\boldsymbol{x}},\hat{\boldsymbol{w}})}{\partial w_{j}}=P_{j}(\hat{\boldsymbol{x}})=0 for all j=1,,mj=1,\ldots,m. At the same time, f(x^,w^)xi=0\frac{\partial f(\hat{\boldsymbol{x}},\hat{\boldsymbol{w}})}{\partial x_{i}}=0 since w^j=0\hat{w}_{j}=0 for all j=1,,mj=1,\ldots,m. Overall we have that f(x^,w^)=0<1/48=ε/2\nabla f(\hat{\boldsymbol{x}},\hat{\boldsymbol{w}})=0<1/48=\varepsilon/2. As a result, the constructed StationaryPoint instance must return a solution (x,w)(\boldsymbol{x}^{\star},\boldsymbol{w}^{\star}) with f(x,w)2<124=ε\left\|\nabla f(\boldsymbol{x}^{\star},\boldsymbol{w}^{\star})\right\|_{2}<\frac{1}{24}=\varepsilon.

On the opposite direction, the existence of a pair of points (x,w)(\boldsymbol{x}^{\star},\boldsymbol{w}^{\star}) with f(x,w)2<1/24\left\|\nabla f(\boldsymbol{x}^{\star},\boldsymbol{w}^{\star})\right\|_{2}<1/24 implies Pj(x)<1/24P_{j}(\boldsymbol{x}^{\ast})<1/24 for all j=1mj=1\ldots m. Consider the probability distribution over the boolean assignments in which each boolean variable xix_{i} is independently selected to be true with probability xix_{i}^{\star}. Then,

Since ϕj\phi_{j} shares variables with at most 66 other clauses, the bad event of ϕj\phi_{j} not being satisfied is dependent with at most 66 other bad events. By Lovász Local Lemma [EL73], we get that the probability none of the events occurs is positive. As a result, there exists a satisfying assignment. ∎

Using Lemma A.1 we can conclude that ϕ\phi is satisfiable if and only if ff has a 1/241/24-approximate stationary point. What is left to prove the FNP\mathsf{FNP}-hardness is to show how we can find a satisfying assignment of ϕ\phi given an approximate stationary point of ff. This can be done using the celebrated results that provide constructive proofs of the Lovász Local Lemma [Mos09, MT10]. Finally, we remind that the constructed function ff is Θ(d)\Theta\left(\sqrt{d}\right)-Lipschitz and Θ(d)\Theta\left(d\right)-smooth, where dd is the number of variables that is equal to n+mn+m.

Appendix B Missing Proofs from Section 5

In this section we give proofs for the statements presented in Section 5. These statements establish the totality and inclusion to PPAD\mathsf{PPAD} of LR-LocalMinMax and GDAFixedPoint.

We start with establishing claim “1.” in the statement of the theorem. It will be clear that our proof will provide a polynomial-time reduction from LR-LocalMinMax to GDAFixedPoint. Suppose that (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}) is an α\alpha-approximate fixed point of FGDAF_{GDA}, where α\alpha is the specified in the theorem statement function of δ\delta, GG and LL. To simplify our proof, we abuse notation and define f(x)f(x,y)f(\boldsymbol{x})\triangleq f(\boldsymbol{x},\boldsymbol{y}^{\star}), f(x)xf(x,y)\nabla f(\boldsymbol{x})\triangleq\nabla_{x}f(\boldsymbol{x},\boldsymbol{y}^{\star}), K{x(x,y)P(A,b)}K\triangleq\{\boldsymbol{x}\mid(\boldsymbol{x},\boldsymbol{y}^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})\} and x^ΠK(xf(x))\hat{\boldsymbol{x}}\triangleq\Pi_{K}(\boldsymbol{x}^{\star}-\nabla f(\boldsymbol{x}^{\star})). Because (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}) is an α\alpha-approximate fixed point of FFDAF_{FDA}, it follows that x^x2<α\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}\right\|_{2}<\alpha.

f(x),xx<(G+δ+α)α, for all xKBd1(δ;x)\langle\nabla f(\boldsymbol{x}^{\star}),\boldsymbol{x}^{\star}-\boldsymbol{x}\rangle<(G+\delta+\alpha)\cdot\alpha,\text{ for all }\boldsymbol{x}\in K\cap B_{d_{1}}(\delta;\boldsymbol{x}^{\star}).

Using the fact that x^=ΠK(xf(x))\hat{\boldsymbol{x}}=\Pi_{K}(\boldsymbol{x}^{\star}-\nabla f(\boldsymbol{x}^{\star})) and that KK is a convex set we can apply Theorem 1.5.5 (b) of [FP07] to get that

Next, we do some simple algebra to get that, for all xKBd1(δ;x)\boldsymbol{x}\in K\cap B_{d_{1}}(\delta;\boldsymbol{x}^{\star}),

where the second to last inequality follows from Cauchy–Schwarz inequality and the triangle inequality, and the last inequality follows from the triangle inequality and the following facts: (1) xx^2<α\left\|\boldsymbol{x}^{\star}-\hat{\boldsymbol{x}}\right\|_{2}<\alpha, (2) xBd1(δ;x)\boldsymbol{x}\in B_{d_{1}}(\delta;\boldsymbol{x}^{\star}), and (3) f(x,y)2G\left\|\nabla f(\boldsymbol{x},\boldsymbol{y})\right\|_{2}\leq G for all (x,y)P(A,b)(\boldsymbol{x},\boldsymbol{y})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}). ∎

For all xKBd1(δ;x)\boldsymbol{x}\in K\cap B_{d_{1}}(\delta;\boldsymbol{x}^{\star}), from the LL-smoothness of ff we have that

f(x)f(x)f(\boldsymbol{x}^{\star})\leq f(\boldsymbol{x}): In this case we stop, remembering that

f(x)>f(x)f(\boldsymbol{x}^{\star})>f(\boldsymbol{x}): In this case, we consider two further sub-cases:

f(x),xx0\langle\nabla f(\boldsymbol{x}^{*}),\boldsymbol{x}-\boldsymbol{x}^{\star}\rangle\geq 0: in this sub-case, Eq (B.2) gives

where for the last inequality we used that xBd1(δ;x)\boldsymbol{x}\in B_{d_{1}}(\delta;\boldsymbol{x}^{\star}), and that δ<2ε/L\delta<\sqrt{2\varepsilon/L}.

f(x),xx<0\langle\nabla f(\boldsymbol{x}^{*}),\boldsymbol{x}-\boldsymbol{x}^{\star}\rangle<0: in this sub-case, Eq (B.2) gives

where the second inequality follows from the fact that xBd1(δ;x)\boldsymbol{x}\in B_{d_{1}}(\delta;\boldsymbol{x}^{\star}), the third inequality follows from Claim B.1, and the last inequality follows from the constraints δ<2ε/L\delta<\sqrt{2\varepsilon/L} and α(G+δ)2+4(εL2δ2)(G+δ)2\alpha\leq\frac{\sqrt{(G+\delta)^{2}+4(\varepsilon-\frac{L}{2}\delta^{2})}-(G+\delta)}{2}.

In all cases, we get from (B.3), (B.4) and (B.5) that f(x)<f(x)+εf(\boldsymbol{x}^{\star})<f(\boldsymbol{x})+\varepsilon, for all xKBd1(δ;x)x\in K\cap B_{d_{1}}(\delta;\boldsymbol{x}^{\star}). Thus, lifting our abuse of notation, we get that f(x,y)<f(x,y)+εf(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})<f(\boldsymbol{x},\boldsymbol{y}^{\star})+\varepsilon, for all x{xxBd1(δ;x) and (x,y)P(A,b)}\boldsymbol{x}\in\{\boldsymbol{x}\mid\boldsymbol{x}\in B_{d_{1}}(\delta;\boldsymbol{x}^{\star})\text{ and }(\boldsymbol{x},\boldsymbol{y}^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})\}. Using an identical argument we can also show that f(x,y)>f(x,y)εf(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})>f(\boldsymbol{x}^{\star},\boldsymbol{y})-\varepsilon for all y{yyBd2(δ;y) and (x,y)P(A,b)}\boldsymbol{y}\in\{\boldsymbol{y}\mid\boldsymbol{y}\in B_{d_{2}}(\delta;\boldsymbol{y}^{\star})\text{ and }(\boldsymbol{x}^{\star},\boldsymbol{y})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})\}. The first part of the theorem follows.

Now let us establish claim “2.” in the theorem statement. It will be clear that our proof will provide a polynomial-time reduction from GDAFixedPoint to LR-LocalMinMax. For the choice of parameters ε\varepsilon and δ\delta described in the theorem statement, we will show that, if (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}) is an (ε,δ)(\varepsilon,\delta)-local min-max equilibrium of ff, then FGDAx(x,y)x2<α/2\left\|F_{GDAx}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})-\boldsymbol{x}^{\star}\right\|_{2}<\alpha/2 and FGDAy(x,y)y2<α/2\left\|F_{GDAy}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})-\boldsymbol{y}^{\star}\right\|_{2}<\alpha/2. The second part of the theorem will then follow. We only prove that FGDAx(x,y)x2<α/2\left\|F_{GDAx}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})-\boldsymbol{x}^{\star}\right\|_{2}<\alpha/2, as the argument for y\boldsymbol{y}^{\star} is identical. In the argument below we abuse notation in the same way we described earlier. With that notation we will show that x^x2<α/2\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}\right\|_{2}<\alpha/2.

Proof that x^x<α/2\boldsymbol{\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}\right\|<\alpha/2}. From our choice of ε\varepsilon and δ\delta, it is easy to see that δ=α/(5L+2)<α/2\delta=\alpha/(5L+2)<\alpha/2. Thus, if x^x<δ\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}\right\|<\delta, then we automatically get x^x<α/2\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}\right\|<\alpha/2. So it remains to handle the case x^xδ\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}\right\|\geq\delta. We choose xcx+δx^xx^x2\boldsymbol{x}_{c}\triangleq\boldsymbol{x}^{\star}+\delta\frac{\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}}{\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}\right\|_{2}}. It is easy to see that xcBd1(δ;x)\boldsymbol{x}_{c}\in B_{d_{1}}(\delta;\boldsymbol{x}^{\star}) and hence we get that

where the first inequality follows from the fact that (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}) is an (ε,δ)(\varepsilon,\delta)-local min-max equilibrium, the second inequality follows from the LL-smoothness of ff, and the third inequality follows from xcxδ\left\|\boldsymbol{x}_{c}-\boldsymbol{x}^{\star}\right\|\leq\delta and our choice of δ=ε/L\delta=\sqrt{\varepsilon/L}. The above implies:

Since x^x=(xcx)x^x2/δ\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}=\left(\boldsymbol{x}_{c}-\boldsymbol{x}^{\star}\right)\cdot\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}\right\|_{2}/\delta we get that f(x),xx^<3ε2δxx^2\left\langle\nabla f(\boldsymbol{x}^{\star}),\boldsymbol{x}^{\star}-\hat{\boldsymbol{x}}\right\rangle<\frac{3\varepsilon}{2\delta}\left\|\boldsymbol{x}^{\star}-\hat{\boldsymbol{x}}\right\|_{2}. Therefore

where in the above inequality we have also used (B.1). As a result, xx^2<3ε2δ<α/2\left\|\boldsymbol{x}^{\star}-\hat{\boldsymbol{x}}\right\|_{2}<\frac{3\varepsilon}{2\delta}<\alpha/2.

B.2 Proof of Theorem 5.2

We provide a polynomial-time reduction from GDAFixedPoint to Brouwer. This establishes both the totality of GDAFixedPoint and its inclusion to PPAD\mathsf{PPAD}, since Brouwer is both total and lies in PPAD\mathsf{PPAD}, as per Lemma 2.5. It also establishes the totality and inclusion to PPAD\mathsf{PPAD} of LR-LocalMinMax, since LR-LocalMinMax is polynomial-time reducible to GDAFixedPoint, as shown in Theorem 5.1.

We proceed to describe our reduction. Suppose that ff is the GG-Lipschitz and LL-smooth function provided as input to GDAFixedPoint. Suppose also that α\alpha is the approximation parameter provided as input to GDAFixedPoint. Given ff and α\alpha, we define function M:P(A,b)P(A,b)M:\mathcal{P}(\boldsymbol{A},\boldsymbol{b})\rightarrow\mathcal{P}(\boldsymbol{A},\boldsymbol{b}), which serves as input to Brouwer, as follows:

Given that ff is LL-smooth, it follows that MM is (L+1)(L+1)-Lipschitz. We set the approximation parameter provided as input to Brouwer be γ=α2/4(G+2d)\gamma=\alpha^{2}/4(G+2\sqrt{d}).

To show the validity of the afore-described reduction, we prove that every feasible point (x,y)P(A,b)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}) that is a γ\gamma-approximate fixed point of MM, i.e. M(x,y)(x,y)2<γ\left\|M(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})-(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})\right\|_{2}<\gamma is also an α\alpha-approximate fixed point of FGDAF_{GDA}. Observe that since P(A,b)d\mathcal{P}(\boldsymbol{A},\boldsymbol{b})\subseteq^{d} it holds that (x,y)(x,y)2d\left\|(\boldsymbol{x},\boldsymbol{y})-(\boldsymbol{x}^{\prime},\boldsymbol{y}^{\prime})\right\|_{2}\leq\sqrt{d} for all (x,y),(x,y)P(A,b)(\boldsymbol{x},\boldsymbol{y}),(\boldsymbol{x}^{\prime},\boldsymbol{y}^{\prime})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}). Hence, if γ>d\gamma>\sqrt{d}, then finding γ\gamma-approximate fixed points of MM is trivial and the same is true for fiding α\alpha-approximate fixed points of FGDAF_{GDA}, since γ=α2/4(G+2d)\gamma=\alpha^{2}/4(G+2\sqrt{d}) which implies that, if γ>d\gamma>\sqrt{d}, then α>d\alpha>\sqrt{d}. Thus, we may assume that γd\gamma\leq\sqrt{d}.

Next, to simplify notation we define (xΔ,yΔ)=(xxf(x,y),y+yf(x,y))(\boldsymbol{x}_{\Delta},\boldsymbol{y}_{\Delta})=(x^{\star}-\nabla_{x}f(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}),\boldsymbol{y}^{\star}+\nabla_{y}f(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})) and (x^,y^)=argmin(x,y)P(A,b)(xΔ,yΔ)(x,y)2(\hat{\boldsymbol{x}},\hat{\boldsymbol{y}})=\operatorname*{argmin}_{(\boldsymbol{x},\boldsymbol{y})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})}\left\|(\boldsymbol{x}_{\Delta},\boldsymbol{y}_{\Delta})-(\boldsymbol{x},\boldsymbol{y})\right\|_{2}. Given that (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}) is a γ\gamma-approximate fixed point of MM, we have that

Using Theorem 1.5.5 (b) of [FP07], we get that

For all (x,y)P(A,b)(\boldsymbol{x},\boldsymbol{y})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}), (xΔ,yΔ)(x,y),(x,y)(x,y)<(G+2d)γ\left\langle(\boldsymbol{x}_{\Delta},\boldsymbol{y}_{\Delta})-(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}),(\boldsymbol{x},\boldsymbol{y})-(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})\right\rangle<(G+2\sqrt{d})\cdot\gamma.

Now let x=argminxK(y)xxΔ2\boldsymbol{x}^{\prime}=\operatorname*{argmin}_{\boldsymbol{x}\in K(y^{\star})}\left\|\boldsymbol{x}-\boldsymbol{x}_{\Delta}\right\|_{2} where K(y)={x(x,y)P(A,b))}K(\boldsymbol{y}^{\star})=\{\boldsymbol{x}\mid(\boldsymbol{x},\boldsymbol{y}^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}))\}. Using Theorem 1.5.5 (b) of [FP07] for x\boldsymbol{x}^{\prime} we get that xΔx,xx0\left\langle\boldsymbol{x}_{\Delta}-\boldsymbol{x}^{\prime},\boldsymbol{x}^{\star}-\boldsymbol{x}^{\prime}\right\rangle\leq 0. Using Claim B.2 for vector (x,y)P(A,b)(\boldsymbol{x}^{\prime},y^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}) we get that xxΔ,xx<(G+2d)γ\left\langle\boldsymbol{x}^{\star}-\boldsymbol{x}_{\Delta},\boldsymbol{x}^{\star}-\boldsymbol{x}^{\prime}\right\rangle<(G+2\sqrt{d})\gamma. Adding the last two inequalities and using the fact that γ=α2/4(G+2d)\gamma=\alpha^{2}/4(G+2\sqrt{d}) we get the following

Using the exact same reasoning we can also prove that

where K(x)={y(x,y)P(A,b))}K(\boldsymbol{x}^{\star})=\{\boldsymbol{y}\mid(\boldsymbol{x}^{\star},\boldsymbol{y})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b}))\}. Combining the last two inequalities we get that (x,y)(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}) is an α\alpha-approximate fixed point of FGDAF_{GDA}.

Appendix C Missing Proofs from Section 8

In this section we present the missing proofs from Section 8 and more precisely in the following sections we prove the Lemmas 8.10, 8.11, and 8.12. For the rest of the proofs in this section we define L(c)L(\boldsymbol{c}) to be the cubelet which has the down-left corner equal to c\boldsymbol{c}, formaly

and we also define Lc(c)L_{c}(\boldsymbol{c}) to be the set of corners of the cubelet L(c)L(\boldsymbol{c}), or more formally

We start with a lemma about the differentiability properties of the functions QvcQ_{\boldsymbol{v}}^{\boldsymbol{c}} which we defined in Definition 8.7.

1st order differentiability: We remind from the Definition 8.7 that if we let sc=(s1,,sd)\boldsymbol{s}^{\boldsymbol{c}}=(s_{1},\ldots,s_{d}) be the source vertex of R(x)R(\boldsymbol{x}) and pxc=(p1,,pd)\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=(p_{1},\ldots,p_{d}) be the canonical representation of x\boldsymbol{x}. Then for each vertex vRc(x)\boldsymbol{v}\in R_{c}(\boldsymbol{x}) we define the following partition of the set of coordinates [d][d],

Now in case Bvc=B_{\boldsymbol{v}}^{\boldsymbol{c}}=\varnothing, which corresponds to v\boldsymbol{v} being the source node sc\boldsymbol{s}^{\boldsymbol{c}} then Qvc(x)=j=1dS(1S(pj))Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=\prod_{j=1}^{d}S_{\infty}(1-S(p_{j})) which is clearly differentiable as product of compositions of differentiable functions. The exact same holds for Avc=A_{\boldsymbol{v}}^{\boldsymbol{c}}=\varnothing which corresponds to v\boldsymbol{v} being the target vertex tc\boldsymbol{t}^{\boldsymbol{c}} of the cubelet R(x)R(\boldsymbol{x}). We thus focus on the case where Avc,BvcA_{\boldsymbol{v}}^{\boldsymbol{c}},B_{\boldsymbol{v}}^{\boldsymbol{c}}\neq\varnothing. To simplify notation we denote Qvc(x)Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x}) by Q(x)Q(\boldsymbol{x}), AvcA_{\boldsymbol{v}}^{\boldsymbol{c}} by AA and BvcB_{\boldsymbol{v}}^{\boldsymbol{c}} by BB for the rest of this proof. We prove that in case iBi\in B then Q(x)xi\frac{\partial Q(\boldsymbol{x})}{\partial x_{i}} always exits. The case iAi\in A follows then symmetrically. We have the following cases

pi>pjp_{i}>p_{j} for all jAj\in A: Then Q(x)xi\frac{\partial Q(\boldsymbol{x})}{\partial x_{i}} exists since both S()S_{\infty}(\cdot) and S()S(\cdot) are differentiable.

pi<pjp_{i}<p_{j} for some jAj\in A: By Definition 8.7, if ε\varepsilon is sufficiently small then Q(xiε,xi)=Q(xi+ε,xi)=Q(xi,xi)=0Q(x_{i}-\varepsilon,x_{-i})=Q(x_{i}+\varepsilon,\boldsymbol{x}_{-i})=Q(x_{i},\boldsymbol{x}_{-i})=0. Thus Q(x)xi\frac{\partial Q(\boldsymbol{x})}{\partial x_{i}} exists and equals .

pi=pjp_{i}=p_{j} for some jAj\in A and pipjp_{i}\geq p_{j^{\prime}} for all jA{j}j^{\prime}\in A\setminus\{j\}: By Definition 8.7, if ε\varepsilon is sufficiently small then Q(xiε,xi)=0Q(x_{i}-\varepsilon,\boldsymbol{x}_{-i})=0 and also Q(xi,xi)=0Q(x_{i},\boldsymbol{x}_{-i})=0, thus

since both S()S_{\infty}(\cdot) and S()S(\cdot) are differentiable functions, S(S(pi)S(pj))=S(0)=0S_{\infty}(S(p_{i})-S(p_{j}))=S_{\infty}(0)=0, and S(S(pi)S(pj))=S(0)=0S^{\prime}_{\infty}(S(p_{i})-S(p_{j}))=S^{\prime}_{\infty}(0)=0.

pi>pjp_{i}>p_{j} for all jAj\in A: Then Q(x)xi2Q(x)xixk\frac{\partial Q^{\prime}(\boldsymbol{x})}{\partial x_{i}}\triangleq\frac{\partial^{2}Q(\boldsymbol{x})}{\partial x_{i}\partial x_{k}} exists since both S()S_{\infty}(\cdot) and S()S(\cdot) are twice differentiable.

pi<pjp_{i}<p_{j} for some jAj\in A. By Definition 8.7, Q(xiε,xi)=Q(xi+ε,xi)=Q(xi,xi)=0Q^{\prime}(x_{i}-\varepsilon,\boldsymbol{x}_{-i})=Q^{\prime}(x_{i}+\varepsilon,\boldsymbol{x}_{-i})=Q^{\prime}(x_{i},\boldsymbol{x}_{-i})=0. Thus Q(x)xi2Q(x)xixk\frac{\partial Q^{\prime}(\boldsymbol{x})}{\partial x_{i}}\triangleq\frac{\partial^{2}Q(\boldsymbol{x})}{\partial x_{i}\partial x_{k}} exists and equals .

pi=pjp_{i}=p_{j} for some jAj\in A and pi>pjp_{i}>p_{j^{\prime}} for all jA{j}j^{\prime}\in A\setminus\{j\}. By Definition 8.7, if ε\varepsilon is sufficiently small then Q(xiε,xi)=0Q^{\prime}(x_{i}-\varepsilon,\boldsymbol{x}_{-i})=0 and thus

At the same time limε0+Q(xi+ε,xi)Q(xi,xi)ε\lim_{\varepsilon\rightarrow 0^{+}}\frac{Q^{\prime}(x_{i}+\varepsilon,\boldsymbol{x}_{-i})-Q^{\prime}(x_{i},\boldsymbol{x}_{-i})}{\varepsilon} exists since both S()S_{\infty}(\cdot) and S()S(\cdot) are twice differentiable. Moreover equals since S(S(pi)S(pj))=S(0)=0S_{\infty}(S(p_{i})-S(p_{j}))=S_{\infty}(0)=0 and S(S(pi)S(pj))=S(0)=S(0)=S(0)=0S^{\prime}_{\infty}(S(p_{i})-S(p_{j}))=S^{\prime}_{\infty}(0)=S^{\prime\prime}_{\infty}(0)=S(0)=0.

In every step of the above proof where we use properties of SS_{\infty} and SS we use Lemma 8.3. ∎

So far we have established the fact that the functions Qvc(x)Q^{\boldsymbol{c}}_{\boldsymbol{v}}(\boldsymbol{x}) are twice differentiable when x\boldsymbol{x} moves within the same cubelet. Next we will show that when x\boldsymbol{x} moves from one cubelet to another then the corresponding QvcQ^{\boldsymbol{c}}_{\boldsymbol{v}} functions changes value smoothly.

Let xd\boldsymbol{x}\in^{d} such that there exists a coordinate i[d]i\in[d] with the property R(xi+ε,xi)=[c1N1,c1+1N1]××[cdN1,cd+1N1]R(x_{i}+\varepsilon,\boldsymbol{x}_{-i})=\left[\frac{c_{1}}{N-1},\frac{c_{1}+1}{N-1}\right]\times\cdots\times\left[\frac{c_{d}}{N-1},\frac{c_{d}+1}{N-1}\right] and R(xiε,xi)=[c1N1,c1+1N1]××[cdN1,cd+1N1]R(x_{i}-\varepsilon,\boldsymbol{x}_{-i})=\left[\frac{c^{\prime}_{1}}{N-1},\frac{c^{\prime}_{1}+1}{N-1}\right]\times\cdots\times\left[\frac{c^{\prime}_{d}}{N-1},\frac{c^{\prime}_{d}+1}{N-1}\right], with c,c([N1]1)d\boldsymbol{c},\boldsymbol{c}^{\prime}\in\left(\left[N-1\right]-1\right)^{d} and ε\varepsilon sufficiently small, i.e. x\boldsymbol{x} lies in the boundary of two cubelets. Then the following statements hold.

For all vertices vRc(xi+ε,xi)Rc(xiε,xi)\boldsymbol{v}\in R_{c}(x_{i}+\varepsilon,\boldsymbol{x}_{-i})\cap R_{c}(x_{i}-\varepsilon,\boldsymbol{x}_{-i}), it holds that

Qvc(x)=Qvc(x)Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x}),

Qvc(x)xj=Qvc(x)xi\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial x_{j}}=\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x})}{\partial x_{i}} for all i[d]i\in[d], and

Lemma C.2 is crucial since it establishes that Pv(x)\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x}) is a continuous and twice differentiable even when x\boldsymbol{x} moves from one cubelet to another. Since the proof of Lemma C.2 is very long and contains the proof of some sublemmas, we postpone it for the end of this section in Section C.1.1. We now proceed with the proof of Lemma 8.10.

We first prove that Pv(x)\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x}) is a continuous function. Let xd\boldsymbol{x}\in^{d} lying on the boundary of the following cubelets

vi=1mRc(xji+ηi,xji)\boldsymbol{v}\notin\cup_{i=1}^{m}R_{c}(x_{j_{i}}+\eta_{i},\boldsymbol{x}_{-j_{i}}). By Definition 8.9, in all the mm aforementioned cubelets, the coefficient Pv\mathsf{P}_{\boldsymbol{v}} takes value and hence it is continuous in this part of the space.

vjURc(xji+ηi,xji)\boldsymbol{v}\in\cap_{j\in U}R_{c}(x_{j_{i}}+\eta_{i},\boldsymbol{x}_{-j_{i}}) and viURc(xji+ηi,xji)\boldsymbol{v}\notin\cup_{i\in\overline{U}}R_{c}(x_{j_{i}}+\eta_{i},\boldsymbol{x}_{-j_{i}}), for some U[m]U\subseteq[m] with U=[m]U\overline{U}=[m]\setminus U. In this case Pv(xji+ηi,xji)\mathsf{P}_{\boldsymbol{v}}(x_{j_{i}}+\eta_{i},\boldsymbol{x}_{j_{i}}) was computed according to a cubelet with vRc(xji+ηi,xji)\boldsymbol{v}\in R_{c}(x_{j_{i}}+\eta_{i},\boldsymbol{x}_{-j_{i}}). Then Lemma C.2 implies that Qvc(i)(x)=0Q^{\boldsymbol{c}^{(i)}}_{\boldsymbol{v}}(\boldsymbol{x})=0 since vRc(xji+ηi,xji)Rc(xji+ηi,xji)\boldsymbol{v}\in R_{c}(x_{j_{i}}+\eta_{i},\boldsymbol{x}_{-j_{i}})\setminus R_{c}(x_{j_{i^{\prime}}}+\eta_{i^{\prime}},\boldsymbol{x}_{-j_{i^{\prime}}}) where i[m]i^{\prime}\in[m] and iii\neq i^{\prime}. Therefore we conclude that Pv(x)=0\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})=0 and

vi=1mRc(xji+ηi,xji)\boldsymbol{v}\in\cap_{i=1}^{m}R_{c}(x_{j_{i}}+\eta_{i},\boldsymbol{x}_{-j_{i}}). By Lemma C.2 for all i[m]i\in[m] it holds that

which again implies the continuity of Pv(x)\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x}) at x\boldsymbol{x}.

To prove that Pv(x)xi\frac{\partial\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})}{\partial x_{i}} always exists, we consider the following 33 mutually exclusive cases.

vLc(c(1))\boldsymbol{v}\in L_{c}(\boldsymbol{c}^{(1)}) for c(1)C+\boldsymbol{c}^{(1)}\in C^{+} and vLc(c(2))\boldsymbol{v}\in L_{c}(\boldsymbol{c}^{(2)}) for c(2)C\boldsymbol{c}^{(2)}\in C^{-}. Since the coefficient Pv(x)\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x}) is a continuous function, we have that

limε0+Pv(xi+ε,xi)Pv(xi,xi)ε=Qvc(1)(x)xivLc(c(1))Qvc(1)(x)Qvc(1)(x)vLc(c(1))Qvc(1)(x)xi(vLc(c(1))Qvc(1)(x))2\lim_{\varepsilon\rightarrow 0^{+}}\frac{\mathsf{P}_{\boldsymbol{v}}(x_{i}+\varepsilon,\boldsymbol{x}_{-i})-\mathsf{P}_{\boldsymbol{v}}(x_{i},\boldsymbol{x}_{-i})}{\varepsilon}=\frac{\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{(1)}}(\boldsymbol{x})}{\partial x_{i}}\sum_{\boldsymbol{v}^{\prime}\in L_{c}(\boldsymbol{c}^{(1)})}Q_{\boldsymbol{v}^{\prime}}^{\boldsymbol{c}^{(1)}}(\boldsymbol{x})-Q_{\boldsymbol{v}}^{\boldsymbol{c}^{(1)}}(\boldsymbol{x})\sum_{\boldsymbol{v}^{\prime}\in L_{c}(\boldsymbol{c}^{(1)})}\frac{\partial Q_{\boldsymbol{v}^{\prime}}^{\boldsymbol{c}^{(1)}}(\boldsymbol{x})}{\partial x_{i}}}{\left(\sum_{\boldsymbol{v}^{\prime}\in L_{c}(\boldsymbol{c}^{(1)})}Q_{\boldsymbol{v}^{\prime}}^{\boldsymbol{c}^{(1)}}(\boldsymbol{x})\right)^{2}}

limε0+Pv(xi,xi)Pv(xiε,xi)ε=Qvc(2)(x)xivLc(c(2))Qvc(2)(x)Qvc(2)(x)vLc(c(2))Qvc(2)(x)xi(vLc(c(2))Qvc(2)(x))2\lim_{\varepsilon\rightarrow 0^{+}}\frac{\mathsf{P}_{\boldsymbol{v}}(x_{i},\boldsymbol{x}_{-i})-\mathsf{P}_{\boldsymbol{v}}(x_{i}-\varepsilon,\boldsymbol{x}_{-i})}{\varepsilon}=\frac{\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{(2)}}(\boldsymbol{x})}{\partial x_{i}}\sum_{\boldsymbol{v}^{\prime}\in L_{c}(\boldsymbol{c}^{(2)})}Q_{\boldsymbol{v}^{\prime}}^{\boldsymbol{c}^{(2)}}(\boldsymbol{x})-Q_{\boldsymbol{v}}^{\boldsymbol{c}^{(2)}}(\boldsymbol{x})\sum_{\boldsymbol{v}^{\prime}\in L_{c}(\boldsymbol{c}^{(2)})}\frac{\partial Q_{\boldsymbol{v}^{\prime}}^{\boldsymbol{c}^{(2)}}(\boldsymbol{x})}{\partial x_{i}}}{\left(\sum_{\boldsymbol{v}^{\prime}\in L_{c}(\boldsymbol{c}^{(2)})}Q_{\boldsymbol{v}^{\prime}}^{\boldsymbol{c}^{(2)}}(\boldsymbol{x})\right)^{2}}

Both of the above limits exists due to the fact that Qvc(x)Q^{\boldsymbol{c}}_{\boldsymbol{v}}(\boldsymbol{x}) is differentiable (Lemma C.1). Moreover, since vLc(c(1))Lc(c(2))\boldsymbol{v}\in L_{c}(\boldsymbol{c}^{(1)})\cap L_{c}(\boldsymbol{c}^{(2)}), Case 11 of Lemma C.2 implies that the two limits above have exactly the same value and hence Pv\mathsf{P}_{\boldsymbol{v}} is differentiable at x\boldsymbol{x}.

vLc(c(1))\boldsymbol{v}\notin L_{c}(\boldsymbol{c}^{(1)}) for all c(1)C+\boldsymbol{c}^{(1)}\in C^{+}. In the case where vLc(c)\boldsymbol{v}\notin L_{c}(\boldsymbol{c}) for all the down-left corners c\boldsymbol{c} of the cubelets at which x\boldsymbol{x} lies, then by Definition 8.9 Pv(xi,xi)=Pv(xi+ε,xi)=Pv(xiε,xi)=0\mathsf{P}_{\boldsymbol{v}}(x_{i},\boldsymbol{x}_{-i})=\mathsf{P}_{\boldsymbol{v}}(x_{i}+\varepsilon,\boldsymbol{x}_{-i})=\mathsf{P}_{\boldsymbol{v}}(x_{i}-\varepsilon,\boldsymbol{x}_{-i})=0. Thus Pv(x)xi\frac{\partial\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})}{\partial x_{i}} exists and equals . Therefore we may assume that vLc(c)\boldsymbol{v}\in L_{c}(\boldsymbol{c}) for some down-left corner c\boldsymbol{c} of a cubelet at which x\boldsymbol{x} lies. Due to the fact that Pv(x)\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x}) is a continuous function and that vLc(c(1))\boldsymbol{v}\notin L_{c}(\boldsymbol{c}^{(1)}) for all c(1)C+\boldsymbol{c}^{(1)}\in C^{+}, we get that

We also have that vLc(c)/Lcc(1)\boldsymbol{v}\in L_{c}(\boldsymbol{c})/L_{c}{\boldsymbol{c}^{(1)}} where c\boldsymbol{c}, c(1)\boldsymbol{c}^{(1)} are down-left corners of cubelets at which x\boldsymbol{x} lies and (xi+ε,xi)(x_{i}+\varepsilon,\boldsymbol{x}_{-i}) lies respectively. Therefore we get by Case 11 of Lemma C.2 that Qvc(x)=0Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=0 implying that Pv(xi,xi)=0\mathsf{P}_{\boldsymbol{v}}(x_{i},\boldsymbol{x}_{-i})=0. As a result,

We now need to argue that limε0+Pv(xi,xi)Pv(xiε,xi)ε\lim_{\varepsilon\to 0^{+}}\frac{\mathsf{P}_{\boldsymbol{v}}(x_{i},\boldsymbol{x}_{-i})-\mathsf{P}_{\boldsymbol{v}}(x_{i}-\varepsilon,x_{-i})}{\varepsilon} exists and equals . At first observe that 0xiciδ0\leq x_{i}-c_{i}\leq\delta since x\boldsymbol{x} lies in the cubelet with down-left corner c\boldsymbol{c}. In case xici<δx_{i}-c_{i}<\delta then (xi+ε,xi)(x_{i}+\varepsilon,\boldsymbol{x}_{-i}) lies in c\boldsymbol{c} for arbitrarily small ε\varepsilon, meaning that cC+\boldsymbol{c}\in C^{+}. The latter contradicts the fact that vLcc(1)\boldsymbol{v}\notin L_{c}{\boldsymbol{c}^{(1)}} for all c(1)C+\boldsymbol{c}^{(1)}\in C^{+}. As a result, xici=δx_{i}-c_{i}=\delta which implies that cC\boldsymbol{c}\in C^{-} and hence

The above limit equals to since Qvc(x)=Qvc(x)xi=0Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial x_{i}}=0 by applying Lemma C.2 due to the fact that vLc(c)Lc(c(1))\boldsymbol{v}\in L_{c}(\boldsymbol{c})\setminus L_{c}(\boldsymbol{c}^{(1)}).

vLc(c(2))\boldsymbol{v}\notin L_{c}(\boldsymbol{c}^{(2)}) for all c(2)C\boldsymbol{c}^{(2)}\in C^{-}. Symmetrically with the previous case.

The second order differentiability of Pv(x)\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x}) can be established using exactly the same arguments for computing the following limit

since for all the others v([N]1)d\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d} it holds that Qvc(x)=0Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=0, Qvc(x)=0\nabla Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=0, and 2Qvc(x)=0\nabla^{2}Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=0. These vertices vR+(x)\boldsymbol{v}\in R_{+}(\boldsymbol{x}) can be computed in polynomial time as follows: i) the coordinates p1,,pdp_{1},\ldots,p_{d} are sorted in increasing order, and ii) for each m=0,,dm=0,\ldots,d compute the vertex v(m)Lc(c)\boldsymbol{v}^{(m)}\in L_{c}(\boldsymbol{c}),

To finish the proof of Lemma 8.10 we only need the proof of Lemma C.2 which we present in the following section.

Let a point xd\boldsymbol{x}\in^{d} lying in the boundary of the cubelets with down-left corners c=(c1,,cm1,cm,cm+1,,cd)\boldsymbol{c}=(c_{1},\ldots,c_{m-1},c_{m},c_{m+1},\ldots,c_{d}) and c=(c1,,cm1,cm+1,cm+1,,cd)\boldsymbol{c}^{\prime}=(c_{1},\ldots,c_{m-1},c_{m}+1,c_{m+1},\ldots,c_{d}). Then the canonical representation of x\boldsymbol{x} in the cubelet L(c)L(\boldsymbol{c}) is the same with the the canonical representation of x\boldsymbol{x} in the cubelet L(c)L(\boldsymbol{c}^{\prime}). More precisely, pxc=pxc\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}^{\prime}}.

Let cmc_{m} be even. By the definition of the canonical representation in Definition 8.6, the source and target of the cubelets L(c)L(\boldsymbol{c}) and L(c)L(\boldsymbol{c}^{\prime}) are respectively,

sc=(s1,,sm1,cm,sm+1,,sd)\boldsymbol{s}^{\boldsymbol{c}}=(s_{1},\ldots,s_{m-1},c_{m},s_{m+1},\ldots,s_{d}),

tc=(t1,,sm1,cm+1,tm+1,,td)\boldsymbol{t}^{\boldsymbol{c}}=(t_{1},\ldots,s_{m-1},c_{m}+1,t_{m+1},\ldots,t_{d}),

sc=(s1,,sm1,cm+2,sm+1,,sd)\boldsymbol{s}^{\boldsymbol{c}^{\prime}}=(s_{1},\ldots,s_{m-1},c_{m}+2,s_{m+1},\ldots,s_{d}),

tc=(t1,,tm1,cm+1,tm+1,,td)\boldsymbol{t}^{\boldsymbol{c}^{\prime}}=(t_{1},\ldots,t_{m-1},c_{m}+1,t_{m+1},\ldots,t_{d}).

Hence we get that pj=pjp_{j}=p_{j}^{\prime} for jmj\neq m. Since x\boldsymbol{x} belongs to the boundary of both cublets L(c)L(\boldsymbol{c}) and L(c)L(\boldsymbol{c}^{\prime}) we get that xm=cm+1x_{m}=c_{m}+1 which implies that pm=pm=1p_{m}=p_{m}^{\prime}=1. In case cmc_{m} is odd we get that pxc=pxc\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}^{\prime}} but with pm=pm=0p_{m}=p_{m}^{\prime}=0. ∎

Let xd\boldsymbol{x}\in^{d} lying at the intersection of the cubelets L(c)L(\boldsymbol{c}), L(c)L(\boldsymbol{c}^{\prime}) with down-left corners c=(c1,,cm1,cm,cm+1,,cd)\boldsymbol{c}=(c_{1},\ldots,c_{m-1},c_{m},c_{m+1},\ldots,c_{d}), and c=(c1,,cm1,cm+1,cm+1,,cd)\boldsymbol{c}^{\prime}=(c_{1},\ldots,c_{m-1},c_{m}+1,c_{m+1},\ldots,c_{d}). Then the following statements are true.

For all vertices vLc(c)Lc(c)\boldsymbol{v}\in L_{c}(\boldsymbol{c})\cap L_{c}(\boldsymbol{c}^{\prime}) it holds that

Qvc(x)=Qvc(x)Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x}),

Qvc(x)xi=Qvc(x)xi\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial x_{i}}=\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x})}{\partial x_{i}},

Let vLc(c)Lc(c)\boldsymbol{v}\in L_{c}(\boldsymbol{c})\cap L_{c}(\boldsymbol{c}^{\prime}) then we have that

Qvc(x)=Qvc(x)Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x}). By Lemma C.3 we get that the canonical representation pxc=pxc\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}^{\prime}}. Since Qvc(x)Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x}) is a function of the canonical representation pxc\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}} (see Definition 8.9), it holds that Qvc(x)=Qvc(x)Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x}) for all vertices vLc(c)Lc(c)\boldsymbol{v}\in L_{c}(\boldsymbol{c})\cap L_{c}(\boldsymbol{c}^{\prime}).

Qvc(x)xi=Qvc(x)xi\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial x_{i}}=\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x})}{\partial x_{i}}. For imi\neq m, we get that Qvc(x)xi=1tisiQvc(x)pi=1tisiQvc(x)pi=Qvc(x)xi\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial x_{i}}=\frac{1}{t_{i}-s_{i}}\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial p_{i}}=\frac{1}{t^{\prime}_{i}-s^{\prime}_{i}}\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x})}{\partial p^{\prime}_{i}}=\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x})}{\partial x_{i}} since ti=tit_{i}=t_{i}^{\prime} and si=sis_{i}=s_{i}^{\prime} for all imi\neq m. The latter argument cannot be applied for the mm-th coordinate since tmsm=(tmsm)t_{m}-s_{m}=-(t_{m}^{\prime}-s_{m}^{\prime}). However since x\boldsymbol{x} belongs to the boundary of both the cubelets L(c)L(\boldsymbol{c}) and L(c)L(\boldsymbol{c}^{\prime}) it is implied that pm=pmp_{m}=p_{m}^{\prime} is either or 11, meaning that Qvc(x)xm=Qvc(x)xm=0\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial x_{m}}=\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x})}{\partial x_{m}}=0 since S(0)=S(1)=0S^{\prime}(0)=S^{\prime}(1)=0 from Lemma 8.3.

This case follows with the same reasoning with previous case 22.

Let vLc(c)Lc(c)\boldsymbol{v}\in L_{c}(\boldsymbol{c})\cap L_{c}(\boldsymbol{c}^{\prime}). There exists a sequence of corners

such that c(j)c(j+1)1=1\left\|\boldsymbol{c}^{(j)}-\boldsymbol{c}^{(j+1)}\right\|_{1}=1 and vLc(cj)\boldsymbol{v}\in L_{c}(\boldsymbol{c}^{j}) for all j[m]j\in[m]. By Lemma C.4 we get that,

Qvc(j)(x)=Qvc(j+1)(x)Q_{\boldsymbol{v}}^{\boldsymbol{c}^{(j)}}(\boldsymbol{x})=Q_{\boldsymbol{v}}^{\boldsymbol{c}^{(j+1)}}(\boldsymbol{x}).

Qvc(j)(x)xi=Qvc(j+1)(x)xi\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{(j)}}(\boldsymbol{x})}{\partial x_{i}}=\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{(j+1)}}(\boldsymbol{x})}{\partial x_{i}}.

C.2 Proof of Lemma 8.11

We start this section with some fundamental properties of the smooth step function SS_{\infty} that are more fine-grained than the properties we presented in Lemma 8.3.

For d10d\geq 10 there exists a universal constant c>0c>0 such that the following statements hold.

If x1/dx\geq 1/d then S(x)c2dS_{\infty}(x)\geq c\cdot 2^{-d}.

If x1/dx\leq 1/d then S(x)cd22dS_{\infty}^{\prime}(x)\leq c\cdot d^{2}\cdot 2^{-d}.

If x1/dx\geq 1/d then S(x)S(x)cd2\frac{S_{\infty}^{\prime}(x)}{S_{\infty}(x)}\leq c\cdot d^{2}.

If x1/dx\leq 1/d then S(x)cd42d\left|S^{\prime\prime}_{\infty}(x)\right|\leq c\cdot d^{4}\cdot 2^{-d}.

If x1/dx\geq 1/d then S(x)S(x)cd4\frac{\left|S^{\prime\prime}_{\infty}(x)\right|}{S_{\infty}(x)}\leq c\cdot d^{4}.

We compute the derivative of SS_{\infty} and we have that

from which we immediately get S(x)0S^{\prime}_{\infty}(x)\geq 0. Then we can compute the second derivative of SS_{\infty} as follows

We next want to prove that S(x)0S^{\prime\prime}_{\infty}(x)\geq 0 for x1/10x\leq 1/10. To see this observe that 12S(x)1/21-2\cdot S_{\infty}(x)\geq 1/2 for x1/dx\leq 1/d and therefore

hence for x4/ln(2)x\leq 4/\ln(2) it holds that S(x)0S^{\prime\prime}_{\infty}(x)\geq 0. By similar but more tedious calculations we can conclude that S(x)0S^{\prime\prime\prime}_{\infty}(x)\geq 0 for x1/10x\leq 1/10. Hence in the interval x[0,1/10]x\in[0,1/10] all the functions SS_{\infty}, SS^{\prime}_{\infty}, SS^{\prime\prime}_{\infty} are all increasing functions of xx.

Next we show that the function h(x)=21/x+21/(1x)h(x)=2^{-1/x}+2^{-1/(1-x)} is upper and lower bounded. First observe that h(x)max{21/x,21/(1x)}h(x)\geq\max\{2^{-1/x},2^{-1/(1-x)}\}. Now if we set t(x)=21/xt(x)=2^{-1/x} then t(x)=ln(2)t(x)/x2t^{\prime}(x)=\ln(2)t(x)/x^{2} and hence t(x)t(1/2)=1/4t(x)\geq t(1/2)=1/4 for x1/2x\geq 1/2. The same way we can prove that 21/(1x)1/42^{-1/(1-x)}\geq 1/4 for x1/2x\leq 1/2. Therefore h(x)1/4h(x)\geq 1/4 for all xx\in. Also it is not hard to see that 21/x1/22^{-1/x}\leq 1/2 and 21/(1x)1/22^{-1/(1-x)}\leq 1/2 which implies h(x)1h(x)\leq 1. Hence overall we have that h(x)[1/4,1]h(x)\in[1/4,1] for all xx\in. We are now ready to prove the statements.

We have shown that S(x)0S^{\prime}_{\infty}(x)\geq 0 for all xx\in. Hence SS_{\infty} is an increasing function and therefore S(x)S(1/d)S_{\infty}(x)\geq S_{\infty}(1/d) for x1/dx\geq 1/d. Now we have that S(1/d)=2d/h(1/d)2dS_{\infty}(1/d)=2^{-d}/h(1/d)\geq 2^{-d}.

Since S(x)S^{\prime}_{\infty}(x) is increasing for x[0,1/10]x\in[0,1/10], we have that S(x)S(1/d)S^{\prime}_{\infty}(x)\leq S^{\prime}_{\infty}(1/d) for x1/dx\leq 1/d and therefore

Follows directly from the statement 1., the fact that S(x)S^{\prime\prime}_{\infty}(x) is increasing for x[0,1/10]x\in[0,1/10] and the above expression of SS^{\prime\prime}_{\infty} this statement follows.

This statement follows using the same reasoning with statement 3.

In order to prove Lemma 8.11. We first introduce several technical lemmas.

Let xd\boldsymbol{x}\in^{d} lying in cublet L(c)L(\boldsymbol{c}), with c([N]1)d\boldsymbol{c}\in\left(\left[N\right]-1\right)^{d} and let pxc=(p1,,pd)\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=(p_{1},\ldots,p_{d}) be the canonical representation of x\boldsymbol{x}. Then for all vertices vLc(c)\boldsymbol{v}\in L_{c}(\boldsymbol{c}), it holds that

where the last inequality follows by the fact that S()6\left|S^{\prime}(\cdot)\right|\leq 6. Since Ad\left|A\right|\leq d the proof of the lemma will be completed if we are able to show that for any jAj\in A, it holds that

In case S(pi)S(pj)1/d5S(p_{i})-S(p_{j})\geq 1/d^{5} then by case 3.3. of Lemma C.5 we get that S(S(pi)S(pj))cd10S(S(pi)S(pj))\left|S_{\infty}^{\prime}(S(p_{i})-S(p_{j}))\right|\leq c\cdot d^{10}\cdot S_{\infty}(S(p_{i})-S(p_{j})), which implies gthe following

Now consider the case where S(pi)S(pj)1/d5S(p_{i})-S(p_{j})\leq 1/d^{5}. Using case 2.2. of Lemma C.5, we have that

Combining the later with the discussion in the rest of the proof the lemma follows. ∎

For any vertex v([N]1)d\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d} it holds that Pv(x)xiΘ(d12/δ)\left|\frac{\partial\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})}{\partial x_{i}}\right|\leq\Theta\left(d^{12}/\delta\right).

To simplify notation we use Qv(x)Q_{\boldsymbol{v}}(\boldsymbol{x}) instead of Qvc(x)Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x}) for the rest of the proof. Without loss of generality we assume that x\boldsymbol{x} lies on a cubelet L(c)L(\boldsymbol{c}) with c([N]1)d\boldsymbol{c}\in\left(\left[N\right]-1\right)^{d} and vLc(c)\boldsymbol{v}\in L_{c}(\boldsymbol{c}), since otherwise Pv(x)xi=0\frac{\partial\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})}{\partial x_{i}}=0. Let pxc=(p1,,pd)\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=(p_{1},\ldots,p_{d}) be the canonical representation of x\boldsymbol{x} in the cubelet L(c)L(\boldsymbol{c}). Then it holds that

where the last inequality follows by Lemma C.6 and the fact that at most d+1d+1 vertices v\boldsymbol{v} of Lc(c)L_{c}(\boldsymbol{c}) have non-zero gradient as we have proved in Lemma 8.10. Then the proof of Lemma C.7 follows by the fact that pi=xisitisip_{i}=\frac{x_{i}-s_{i}}{t_{i}-s_{i}}. ∎

If additionally it holds that S(pi)S(pm1)1/d5S(p_{i})-S(p_{m_{1}})\leq 1/d^{5} or S(pj)S(pm2)1/d5S(p_{j})-S(p_{m_{2}})\leq 1/d^{5}, then by the case 2.2. of Lemma C.5, we have that

In case S(pi)S(pj)1/d5S(p_{i})-S(p_{j})\geq 1/d^{5} then by case 4.4. of Lemma C.5, we get that CS(pipj)cd20CS(pipj)\left|CS^{{}^{\prime\prime}}(p_{i}-p_{j})\right|\leq cd^{20}\cdot CS(p_{i}-p_{j}) which implies that QΘ(d20)Qvc(x)Q^{{}^{\prime\prime}}\leq\Theta(d^{20})\cdot Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x}).

On the other hand if S(pi)S(pj)1/d5S(p_{i})-S(p_{j})\leq 1/d^{5} then by case 5.5. of Lemma C.5, we get that QCS(pipj)cd20ed5Q^{{}^{\prime\prime}}\leq\left|CS^{{}^{\prime\prime}}(p_{i}-p_{j})\right|\leq c\cdot d^{20}e^{-d^{5}}. As in the proof of Lemma C.6, there exists a vertex vRc(x)\boldsymbol{v}^{\ast}\in R_{\boldsymbol{c}}(\boldsymbol{x}) such that Qvc(x)cd2e(d+1)2d2Q_{\boldsymbol{v}^{\ast}}^{\boldsymbol{c}}(\boldsymbol{x})\geq c^{d^{2}}e^{-(d+1)^{2}d^{2}} and thus QΘ(d20)vLc(c)Qvc(x)Q^{{}^{\prime\prime}}\leq\Theta(d^{20})\sum_{\boldsymbol{v}\in L_{c}(\boldsymbol{c})}Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x}). Overall we get that

If we combine all the above cases then the Lemma follows. ∎

Finally using Lemma C.7 and Lemma C.9 we get the proof of Lemma 8.11.

C.3 Proof of Lemma 8.12

Let 0xi<1/(N1)0\leq x_{i}<1/(N-1) and c=(c1,,ci,,cd)\boldsymbol{c}=(c_{1},\ldots,c_{i},\ldots,c_{d}) denote down-left corner of the cubelet R(x)R(\boldsymbol{x}) at which xd\boldsymbol{x}\in^{d} lies, i.e. xL(c)\boldsymbol{x}\in L(\boldsymbol{c}). Since x1/(N1)\boldsymbol{x}\leq 1/(N-1), this means that ci=0c_{i}=0. By the definition of sources and targets in Definition 8.6, we have that si=0s_{i}=0 and ti=1/(N1)t_{i}=1/(N-1), where sis_{i}, tit_{i} are respectively the ii-th coordinate of the source sc\boldsymbol{s}_{\boldsymbol{c}} and the target tc\boldsymbol{t}_{\boldsymbol{c}} vertex. Let the canonical representation pxc=(p1,,pd)p_{\boldsymbol{x}}^{\boldsymbol{c}}=(p_{1},\ldots,p_{d}) of x\boldsymbol{x} in the cubelet L(c)L(\boldsymbol{c}). Now partition the coordinates [d][d] in the following sets

If B=B=\varnothing then notice that Psc(x)>0\mathsf{P}_{\boldsymbol{s}_{\boldsymbol{c}}}(\boldsymbol{x})>0, since pi<1p_{i}<1, by the fact that xi<1/(N1)x_{i}<1/(N-1). Thus the lemma follows since si=0s_{i}=0. So we may assume that BB\neq\varnothing. In this case consider the corner v=(v1,,vd)\boldsymbol{v}=(v_{1},\ldots,v_{d}) defined as follows

Observe that Qvc(x)>0Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})>0 and thus vR+(x)\boldsymbol{v}\in R_{+}(\boldsymbol{x}). Moreover the coordinate iAi\in A and therefore it holds that vi=si=0v_{i}=s_{i}=0. This proves the first statement of the Lemma.

For the second statement let 11/(N1)xi1/(N1)1-1/(N-1)\leq x_{i}\leq 1/(N-1) and c=(c1,,ci,,cd)\boldsymbol{c}=(c_{1},\ldots,c_{i},\ldots,c_{d}) denote down-left corner of the cubelet R(x)R(\boldsymbol{x}) at which xd\boldsymbol{x}\in^{d} lies, i.e. xL(c)\boldsymbol{x}\in L(\boldsymbol{c}). This means that ci=N2N1c_{i}=\frac{N-2}{N-1}.

Let NN be odd. In this case by the definition of sources and targets in Definition 8.6, we have that si=11/(N1)s_{i}=1-1/(N-1) and ti=1t_{i}=1, where sis_{i}, tit_{i} are respectively the ii-th coordinate of the source and target vertex. Let pxc=(p1,,pd)p_{\boldsymbol{x}}^{\boldsymbol{c}}=(p_{1},\ldots,p_{d}) be the canonical representation of x\boldsymbol{x} under in the cubelet L(c)L(\boldsymbol{c}). Now partition the coordinates [d][d] as follows,

If A=A=\varnothing then notice that for the target vertex tc\boldsymbol{t}_{\boldsymbol{c}}, Ptc(x)>0\mathsf{P}_{\boldsymbol{t}_{\boldsymbol{c}}}(\boldsymbol{x})>0, since pi>0p_{i}>0, by the fact that xi>11/(N1)x_{i}>1-1/(N-1). Thus the lemma follows since ti=1t_{i}=1. So we may assume that AA\neq\varnothing. In this case consider the corner v=(v1,,vd)\boldsymbol{v}=(v_{1},\ldots,v_{d}) defined as follows,

Observe that Qvc(x)>0Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})>0 and thus vR+(x)\boldsymbol{v}\in R_{+}(\boldsymbol{x}). Moreover the coordinate iBi\in B and thus vi=ti=1v_{i}=t_{i}=1.

Let NN be even. In this case we have that ti=11/(N1)t_{i}=1-1/(N-1) and si=1s_{i}=1. Now partition the coordinates [d][d] as follows,

If B=B=\varnothing then notice that for the source vertex sc\boldsymbol{s}_{\boldsymbol{c}}, Psc(x)>0\mathsf{P}_{\boldsymbol{s}_{\boldsymbol{c}}}(\boldsymbol{x})>0, since pi<1p_{i}<1, by the fact that xi>11/(N1)x_{i}>1-1/(N-1). Thus the lemma follows since si=1s_{i}=1. In case BB\neq\varnothing consider the corner v=(v1,,vd)\boldsymbol{v}=(v_{1},\ldots,v_{d}) defined as follows,

Observe that Qvc(x)>0Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})>0 and thus vR+(x)\boldsymbol{v}\in R_{+}(\boldsymbol{x}). Moreover the coordinate iAi\in A and thus vi=si=1v_{i}=s_{i}=1.

If we put together the last two cases then this implies the second statement of the lemma.

Appendix D Constructing the Turing Machine – Proof of Theorem 7.6

In this section we prove Theorem 7.6 establishing that both the function fCl(x,y)f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y}) of Definition 7.4 and its gradient, is computable by a polynomial-time Turing Machine. We prove Theorem 7.6 through a series of Lemmas. To simplify notation we set blog1/εb\triangleq\log 1/\varepsilon.

There exist Turing Machines MSM_{S_{\infty}}, MSM_{S^{\prime}_{\infty}} that given input xx\in and ε\varepsilon in binary form, compute [S(x)]b\left[S_{\infty}(x)\right]_{b} and [S(x)]b\left[S^{\prime}_{\infty}(x)\right]_{b} in time polynomial in b=log(1/ε)b=\log(1/\varepsilon) and the binary representation of xx.

The Turing Machine MSM_{S_{\infty}} outputs the fist bb bits of the following quantity,

where bb^{\prime} will be selected sufficiently large. Notice it is possible to compute the above quantity due to the fact that all functions 1γ+1γ1\frac{1}{\gamma}+\frac{1}{\gamma-1}, 2γ2^{\gamma} and 11+γ\frac{1}{1+\gamma} can be computed with accuracy 2b2^{-b^{\prime}} in polynomial time with respect to bb^{\prime} and the binary representation of γ\gamma [Bre76]. Moreover,

where the first inequality follows from triangle inequality and the second follows from the facts that 1/(1+γ)1/(1+\gamma) is a 11-Lipschitz function of γ\gamma for γ0\gamma\geq 0, and 1/(1+2γ)1/(1+2^{\gamma}) is an ln(2)\ln(2)-Lipschitz function of γ\gamma for γ0\gamma\geq 0. The last inequality follows from the definition of []b\left[\cdot\right]_{b^{\prime}}. Hence W(x)W(x) is indeed equal to [S(x)]b\left[S_{\infty}(x)\right]_{b} if we choose b=b+2b^{\prime}=b+2.

Next we explain how MSM_{S^{\prime}_{\infty}} computes [S(x)]b\left[S^{\prime}_{\infty}(x)\right]_{b}. First notice that S(x)S^{\prime}_{\infty}(x) is equal to

To describe how to compute S(x)S^{\prime}_{\infty}(x) we first assume that we have computed the following quantities. Then based on these quantities we show how S(x)S^{\prime}_{\infty}(x) can be computed and finally we consider the computation of these quantities.

A[1x221x+1x1]bA\leftarrow\left[\frac{1}{x^{2}}2^{-\frac{1}{x}+\frac{1}{x-1}}\right]_{b^{\prime}},

B[1(x1)221x+1x1]bB\leftarrow\left[\frac{1}{(x-1)^{2}}2^{-\frac{1}{x}+\frac{1}{x-1}}\right]_{b^{\prime}},

C[(21x+21x1)2]bC\leftarrow\left[\left(2^{-\frac{1}{x}}+2^{\frac{1}{x-1}}\right)^{2}\right]_{b^{\prime}}.

Then MSM_{S^{\prime}_{\infty}} outputs the fist bb bits of the quantity [[ln2]b[A+BC]b]b\left[\left[\ln 2\right]_{b^{\prime}}\cdot\left[\frac{A+B}{C}\right]_{b^{\prime}}\right]_{b^{\prime}}. We now prove that

Consider the function g(α,β,γ)=α+βγg(\alpha,\beta,\gamma)=\frac{\alpha+\beta}{\gamma} where α,βc1\left|\alpha\right|,\left|\beta\right|\leq c_{1} and γc2\left|\gamma\right|\geq c_{2} where c1,c2c_{1},c_{2} are universal constants. Notice that g(α,β,γ)g(\alpha,\beta,\gamma) is cc-Lipschitz for c=2c22+2c1c22c=\sqrt{\frac{2}{c_{2}^{2}}+\frac{2c_{1}}{c_{2}^{2}}}. Since for sufficiently large bb^{\prime} all the quantities A,B,1x221x+1x1,1(x1)221x+1x1c1\left|A\right|,\left|B\right|,\left|\frac{1}{x^{2}}2^{-\frac{1}{x}+\frac{1}{x-1}}\right|,\left|\frac{1}{(x-1)^{2}}2^{-\frac{1}{x}+\frac{1}{x-1}}\right|\leq c_{1} and C,(21x+21x1)2c2\left|C\right|,\left(2^{-\frac{1}{x}}+2^{\frac{1}{x-1}}\right)^{2}\geq c_{2} where c1,c2c_{1},c_{2} are universal constants we get that

Now consider the function g(α,β)=αβg(\alpha,\beta)=\alpha\cdot\beta where α,βc\left|\alpha\right|,\left|\beta\right|\leq c where cc is a universal constant. In this case g(α,β)g(\alpha,\beta) is 2c\sqrt{2}c-Lipschitz continuous. Since for bb^{\prime} sufficiently large all the quantities [ln2]b,[A+BC]b,ln2,A+BC\left|[\ln 2]_{b^{\prime}}\right|,\left|\left[\frac{A+B}{C}\right]_{b^{\prime}}\right|,\ln 2,\left|\frac{A+B}{C}\right| are bounded by a universal constant cc, we have that,

Next we explain how the values A,BA,B and CC are computed while [ln(2)]b\left[\ln(2)\right]_{b}^{\prime} can easily be computed via standard techniques [Bre76].

Computation of A\boldsymbol{A}. The Turing Machine MSM_{S^{\prime}_{\infty}} will compute AA by taking the first bb^{\prime} bits of the following quantity,

where bb^{\prime\prime} will be taken sufficiently large. We remark that both where both the exponentiation and the natural logarithm can be computed in polynomial-time with respect to the number of accuracy bits and the binary representation of the input [Bre76]. The function 1x221x+1x1=21x+1x1+2lnx/ln2\frac{1}{x^{2}}2^{-\frac{1}{x}+\frac{1}{x-1}}=2^{-\frac{1}{x}+\frac{1}{x-1}+2\ln x/\ln 2} is cc-Lipschitz where cc is a universal constant. Thus,

Computation of B\boldsymbol{B}. Using the same arguments as for AA.

Computation of C\boldsymbol{C}. To compute CC we first compute bb^{\prime\prime} bits of the following quantity,

The latter follows by applying the triangle inequality and the following 33 inequalities.

this holds since for b>1b^{\prime\prime}>1 we have

are both upper-bounded by 22 while the function g(α)=α2g(\alpha)=\alpha^{2} is 44-Lipschitz for α2\left|\alpha\right|\leq 2.

The latter follows since for bb^{\prime\prime} larger than a universal constant, both [2[1x]b]b+[2[1x1]b]b\left[2^{-\left[\frac{1}{x}\right]_{b^{\prime\prime}}}\right]_{b^{\prime\prime}}+\left[2^{\left[\frac{1}{x-1}\right]_{b^{\prime\prime}}}\right]_{b^{\prime\prime}} and 2[1x]b+2[1x1]b2^{-\left[\frac{1}{x}\right]_{b^{\prime\prime}}}+2^{\left[\frac{1}{x-1}\right]_{b^{\prime\prime}}} are greater than a universal constant cc, while the function g(α,β)=1/(α+β)2g(\alpha,\beta)=1/(\alpha+\beta)^{2} is Θ(c3)\Theta\left(c^{3}\right)-Lipschitz for α+βc\alpha+\beta\geq c.

The latter follows since for bb^{\prime\prime} larger than a universal constant it holds that both the quantities in the left hand side are greater than a positive universal constant cc, while the function g(α,β)=1/(2α+2β)g(\alpha,\beta)=1/(2^{-\alpha}+2^{\beta}) for 2α+2βc2^{-\alpha}+2^{\beta}\geq c, α0\alpha\geq 0, and β0\beta\leq 0 is Θ(1/c3)\Theta\left(1/c^{3}\right)-Lipschitz.

There exist Turing Machines MQM_{Q} and MQM_{Q^{\prime}} that given xd\boldsymbol{x}\in^{d} and ε>0\varepsilon>0 in binary form, respectively compute [Qvc(x)]b\left[Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})\right]_{b} and [Qvc(x)]b\left[\nabla Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})\right]_{b} for all vertices v([N]1)d\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d} with Qvc(x)>0Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})>0, where b=log(1/ε)b=\log(1/\varepsilon). These vertices are most d+1d+1. Moreover both MQM_{Q} and MQM_{Q^{\prime}} run in polynomial time with respect to bb, dd and the binary representation of x\boldsymbol{x}.

Both MQM_{Q}, MQM_{Q^{\prime}} firsts compute the canonical representation pxcdp_{\boldsymbol{x}}^{\boldsymbol{c}}\in^{d} with the respect to the cell R(x)R(\boldsymbol{x}) in which x\boldsymbol{x} lies. Such a cell R(x)R(\boldsymbol{x}) can be computed by taking the first (logN+1)(\log N+1)-bits at each coordinate of x\boldsymbol{x}. The source vertex sc=(s1,,sd)\boldsymbol{s}^{\boldsymbol{c}}=(s_{1},\ldots,s_{d}) and the target vertex tc=(t1,,td)\boldsymbol{t}^{\boldsymbol{c}}=(t_{1},\ldots,t_{d}) with respect to R(x)R(\boldsymbol{x}) are also computed. Once this is done we are only interested in vertices vRc(x)\boldsymbol{v}\in R_{\boldsymbol{c}}(\boldsymbol{x}) for which

since for all the other v([N]1)d\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d} both Qvc(x)=0Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=0 and Qvc(x)=0\nabla Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=0. These vertices, that are denoted by R+(x)R_{+}(\boldsymbol{x}), are at most d+1d+1 and can be computed in polynomial time.

The vertices vR+(x)\boldsymbol{v}\in R_{+}(\boldsymbol{x}) can be computed in polynomial time as follows: (i) the coordinates p1,,pdp_{1},\ldots,p_{d} are sorted in increasing order ii) for each m=0,,dm=0,\ldots,d compute the vertex vmRc(x)\boldsymbol{v}^{m}\in R_{\boldsymbol{c}}(\boldsymbol{x}),

By Definition 8.7 it immediately follows that R+(x)m=0d{vm}R_{+}(\boldsymbol{x})\subseteq\bigcup_{m=0}^{d}\{\boldsymbol{v}^{m}\} which also establish that R+(x)d+1\left|R_{+}(\boldsymbol{x})\right|\leq d+1.

where bb^{\prime} is selected sufficiently large. We next prove that this computation indeed outputs [Qvc(x)]b\left[Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})\right]_{b} accurately.

We can now use the above inequality to bound [Qvc(x)pi]bQvc(x)pi\left|\left[\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial p_{i}}\right]_{b^{\prime}}-\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial p_{i}}\right|. More precisely,

Thus the analysis is completed by selecting b=b+Θ(logd)b^{\prime}=b+\Theta(\log d) + Θ(logN)\Theta(\log N). ∎

For accuracy bΘ(d2logd)b^{\prime}\geq\Theta(d^{2}\log d) we get that,

Consider the function g(y)=yi/(j=1d+1yj)g(\boldsymbol{y})=y_{i}/(\sum_{j=1}^{d+1}y_{j}). Notice that for yd+1\boldsymbol{y}\in^{d+1} and j=1d+1yjμ\sum_{j=1}^{d+1}y_{j}\geq\mu then g(y)2Θ(d3/2/μ2)\left\|\nabla g(\boldsymbol{y})\right\|_{2}\leq\Theta(d^{3/2}/\mu^{2}). The latter implies that for y,zd+1\boldsymbol{y},\boldsymbol{z}\in^{d+1} such that j=1d+1yjμ\sum_{j=1}^{d+1}y_{j}\geq\mu and that j=1d+1zjμ\sum_{j=1}^{d+1}z_{j}\geq\mu, it holds that

Since there are at most d+1d+1 vertices vR+(x)\boldsymbol{v}^{\prime}\in R_{+}(\boldsymbol{x}) while both the term vR+(x)[Qvc(x)]b\sum_{\boldsymbol{v}^{\prime}\in R_{+}(\boldsymbol{x})}\left[Q_{\boldsymbol{v}^{\prime}}^{\boldsymbol{c}}(\boldsymbol{x})\right]_{b^{\prime}} and the term vR+(x)Qvc(x)\sum_{\boldsymbol{v}^{\prime}\in R_{+}(\boldsymbol{x})}Q_{\boldsymbol{v}^{\prime}}^{\boldsymbol{c}}(\boldsymbol{x}) are greater than Θ((1/d)d2)\Theta\left((1/d)^{d^{2}}\right), we can apply the above inequality with μ=Θ((1/d)d2)\mu=\Theta\left((1/d)^{d^{2}}\right) and we get the following

The proof is completed via selecting b=b+Θ(d2logd)b^{\prime}=b+\Theta(d^{2}\log d).

we next prove that the above computation is correct.

Setting b=b+Θ(logd)b^{\prime}=b+\Theta\left(\log d\right) we get the desired result. Similarly for fCl(x,y)xi\frac{\partial f_{\mathcal{C}_{l}(\boldsymbol{x},\boldsymbol{y})}}{\partial x_{i}} and fCl(x,y)yi\frac{\partial f_{\mathcal{C}_{l}(\boldsymbol{x},\boldsymbol{y})}}{\partial y_{i}}. ∎

Appendix E Convergence of PGD to Approximate Local Minimum

where η=1/L\eta=1/L and x\boldsymbol{x}^{\star} is a global minimum of ff.

If we run the Projected Gradient Descent algorithm on ff then we have

then due to the LL-smoothness of ff we have that

We can now apply Theorem 1.5.5 (b) of [FP07] to get that

If sum all the above inequalities and divide by TT then we get

Therefore for T2L(f(x0)f(x))ε2T\geq\frac{2L\left(f(\boldsymbol{x}_{0})-f(\boldsymbol{x}^{\star})\right)}{\varepsilon^{2}} we have that