The Complexity of Constrained Min-Max Optimization

Constantinos Daskalakis, Stratis Skoulakis, Manolis Zampetakis

Introduction

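Min-Max Optimization has played a central role in the development of Game Theory [vN28], Convex Optimization [Dan51, Adl13], and Online Learning [Bla56, CBL06, SS12, BCB12, SSBD14, Haz16]. In its general constrained form, it can be written down as follows:

$$\min_{\boldsymbol{x}\in\mathbb{R}^{d_{1}}}\ \max_{\boldsymbol{y}\in\mathbb{R}^{d_{2}}}\ f(\boldsymbol{x},\boldsymbol{y})\quad\text{s.t.}\quad g(\boldsymbol{x},\boldsymbol{y})\leq 0.\qquad(1.1)$$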

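The goal in (1.1) is to find a feasible pair $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ , i.e., $g(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})\leq 0$ , that satisfies the following:

$$f(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})\leq f(\boldsymbol{x},\boldsymbol{y}^{\star})\quad\text{for all }\boldsymbol{x}\text{ such that }g(\boldsymbol{x},\boldsymbol{y}^{\star})\leq 0;\qquad(1.2)$$

$$f(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})\geq f(\boldsymbol{x}^{\star},\boldsymbol{y})\quad\text{for all }\boldsymbol{y}\text{ such that }g(\boldsymbol{x}^{\star},\boldsymbol{y})\leq 0.\qquad(1.3)$$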

Unfortunately, our ability to solve Problem (1.1) remains rather poor in settings where our objective function $f$ is not convex-concave. This is emerging as a major challenge in Deep Learning, where min-max optimization has recently found many important applications, such as training Generative Adversarial Networks (see e.g. [GPM+14, ACB17]), and robustifying deep neural network-based models against adversarial attacks (see e.g. [MMS+18]). These applications are indicative of a broader deep learning paradigm wherein robustness properties of a deep learning system are tested and enforced by another deep learning system. In these applications, it is very common to encounter min-max problems with objectives that are nonconvex-nonconcave, and thus evade treatment by the classical algorithmic toolkit targeting convex-concave objectives.

Indeed, the optimization challenges posed by objectives that are nonconvex-nonconcave are not just a theoretical frustration. Practical experience with first-order methods is rife with frustration as well. A common experience is that the training dynamics of first-order methods are unstable, oscillatory or divergent, and the quality of the points encountered in the course of training can be poor; see e.g. [Goo16, MPPSD16, DISZ18, MGN18, DP18, MR18, MPP18, ADLH19]. This experience is in stark contrast to minimization (resp. maximization) problems, where even for nonconvex (resp. nonconcave) objectives, first-order methods have been found to efficiently converge to approximate local optima or stationary points (see e.g. [AAZB+17, JGN+17, LPP+19]), while practical methods such as Stochastic Gradient Descent, Adagrad, and Adam [DHS11, KB14, RKK18] are driving much of the recent progress in Deep Learning.

The goal of this paper is to shed light on the complexity of min-max optimization problems, and to elucidate how they differ from minimization and maximization problems. As far as the latter are concerned, we focus without loss of generality on minimization problems, since maximization problems behave in exactly the same way; we will also think of minimization problems in the framework of (1.1), where the variable $\boldsymbol{y}$ is absent, that is $d_{2}=0$ . An important driver of our comparison between min-max optimization and minimization is, of course, the nature of the objective. So let us discuss:

$\triangleright$ Convex-Concave Objective. The benign setting for min-max optimization is that where the objective function is convex-concave, while the benign setting for minimization is that where the objective function is convex. In their corresponding benign settings, the two problems behave quite similarly from a computational perspective in that they are amenable to convex programming, as well as first-order methods which only require gradient information about the objective function. Moreover, in their benign settings, both problems have guaranteed existence of a solution under compactness of the constraint set. Finally, it is clear how to define approximate solutions. We just relax the inequalities on the left hand side of (1.2) and (1.3) by some $\varepsilon>0$ .

$\triangleright$ Nonconvex-Nonconcave Objective. Conversely, the challenging setting for min-max optimization is that where the objective is not convex-concave, while the challenging setting for minimization is that where the objective is not convex. In these challenging settings, the behavior of the two problems diverges significantly. The first difference is that, while a solution to a minimization problem is still guaranteed to exist under compactness of the constraint set even when the objective is not convex, a solution to a min-max problem is not guaranteed to exist when the objective is not convex-concave, even under compactness of the constraint set. A trivial example is this: $\min_{x\in[0,1]}\max_{y\in[0,1]}(x-y)^{2}$ . Unsurprisingly, we show that checking whether a min-max optimization problem has a solution is $\mathsf{NP}$ -hard. In fact, we show that checking whether there is an approximate min-max solution is $\mathsf{NP}$ -hard, even when the function is Lipschitz and smooth and the desired approximation error is an absolute constant (see Theorem 10.1).
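The non-existence claim for the trivial example can be checked numerically. The sketch below is only an illustration, not part of the paper's argument: it assumes the domain is the unit interval for both players, discretizes it on a grid, and searches for a grid pair that satisfies both global optimality conditions at once. The helper name `has_minmax_solution` and the second, convex-concave test function are hypothetical choices made for contrast.

```python
import numpy as np

def has_minmax_solution(f, grid):
    """Return True if some grid pair (x*, y*) is a global min-max solution on the grid."""
    values = f(grid[:, None], grid[None, :])   # values[i, j] = f(x_i, y_j)
    col_min = values.min(axis=0)               # best value the min player can reach against y_j
    row_max = values.max(axis=1)               # best value the max player can reach against x_i
    # (x_i, y_j) is a solution iff it is a column minimum and a row maximum simultaneously.
    is_solution = (values <= col_min[None, :]) & (values >= row_max[:, None])
    return bool(is_solution.any())

grid = np.linspace(0.0, 1.0, 101)              # assumed domain: [0, 1] for both players
no_saddle = has_minmax_solution(lambda x, y: (x - y) ** 2, grid)
with_saddle = has_minmax_solution(lambda x, y: (x - 0.5) ** 2 - (y - 0.5) ** 2, grid)
```

For $(x-y)^{2}$ the min player wants $x=y$ while the max player wants $y$ far from $x$, so no pair passes both tests; the convex-concave function does have a saddle point.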

Since min-max solutions may not exist, what could we plausibly hope to compute? There are two obvious targets:

(I) approximate stationary points of $f$ , as considered e.g. by [ALW19]; and

(II) some type of approximate local min-max solution.

Unfortunately, as far as (I) is concerned, it is still possible that (even approximate) stationary points may not exist, and we show that checking if there is one is $\mathsf{NP}$ -hard, even when the constraint set is $[0,1]^{d}$ , the objective has Lipschitzness and smoothness polynomial in $d$ , and the desired approximation is an absolute constant (Theorem 4.1). So we focus on (II), i.e. (approximate) local min-max solutions. Several kinds of those have been proposed in the literature [DP18, MR18, JNJ19]. We consider a generalization of the concept of local min-max equilibria, proposed in [DP18, MR18], that also accommodates approximation.

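Definition 1.1. Given $f$ , $g$ as above, and $\varepsilon,\delta>0$ , some point $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ is an $(\varepsilon,\delta)$ -local min-max solution of (1.1), or an $(\varepsilon,\delta)$ -local min-max equilibrium, if it is feasible, i.e. $g(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})\leq 0$ , and satisfies:

$$f(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})<f(\boldsymbol{x},\boldsymbol{y}^{\star})+\varepsilon\quad\text{for all }\boldsymbol{x}\text{ with }\left\|\boldsymbol{x}-\boldsymbol{x}^{\star}\right\|_{2}\leq\delta\text{ and }g(\boldsymbol{x},\boldsymbol{y}^{\star})\leq 0;$$

$$f(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})>f(\boldsymbol{x}^{\star},\boldsymbol{y})-\varepsilon\quad\text{for all }\boldsymbol{y}\text{ with }\left\|\boldsymbol{y}-\boldsymbol{y}^{\star}\right\|_{2}\leq\delta\text{ and }g(\boldsymbol{x}^{\star},\boldsymbol{y})\leq 0.$$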

In words, $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ is an $(\varepsilon,\delta)$ -local min-max equilibrium, whenever the min player cannot update $\boldsymbol{x}$ to a feasible point within $\delta$ of $\boldsymbol{x}^{\star}$ to reduce $f$ by at least $\varepsilon$ , and symmetrically the max player cannot change $\boldsymbol{y}$ locally to increase $f$ by at least $\varepsilon$ .
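To make the two conditions concrete, here is a minimal sampled check of an $(\varepsilon,\delta)$ -local min-max equilibrium for a function of two scalar variables. It is a sketch under stated assumptions: the feasible set is taken to be the box $[0,1]\times[0,1]$, the neighborhoods are sampled on a finite grid (so a `True` answer is evidence rather than a proof), and the function name `is_local_minmax` is hypothetical.

```python
import numpy as np

def is_local_minmax(f, x_star, y_star, eps, delta, samples=2001):
    """Sampled check of the two (eps, delta) conditions on the assumed box [0, 1] x [0, 1]."""
    # Feasible points within distance delta of x_star (resp. y_star).
    xs = np.linspace(max(0.0, x_star - delta), min(1.0, x_star + delta), samples)
    ys = np.linspace(max(0.0, y_star - delta), min(1.0, y_star + delta), samples)
    value = f(x_star, y_star)
    min_ok = np.all(value < f(xs, y_star) + eps)   # min player cannot reduce f by eps
    max_ok = np.all(value > f(x_star, ys) - eps)   # max player cannot increase f by eps
    return bool(min_ok and max_ok)

f = lambda x, y: (x - y) ** 2
small_radius = is_local_minmax(f, 0.5, 0.5, eps=0.01, delta=0.05)
large_radius = is_local_minmax(f, 0.5, 0.5, eps=0.01, delta=1.0)
```

With $\delta=0.05$ the max player can raise $f$ by at most $0.0025<\varepsilon$, so the point passes; with $\delta=1$ the max player can reach $y=0$ and gain $0.25$, so it fails.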

We show that the existence and complexity of computing such approximate local min-max equilibria depend on the relationship of $\varepsilon$ and $\delta$ with the smoothness, $L$ , and the Lipschitzness, $G$ , of the objective function $f$ . We distinguish the following regimes, also shown in Figure 1 together with a summary of our associated results.

$\blacktriangleright$ Trivial Regime. This occurs when $\delta<\frac{\varepsilon}{G}$ . This regime is trivial because the $G$ -Lipschitzness of $f$ guarantees that all feasible points are $(\varepsilon,\delta)$ -local min-max solutions.

$\blacktriangleright$ Local Regime. This occurs when $\delta<\sqrt{\frac{2\varepsilon}{L}}$ , and it represents the interesting regime for min-max optimization. In this regime, we use the smoothness of $f$ to show that $(\varepsilon,\delta)$ -local min-max solutions always exist. Indeed, we show (Theorem 5.1) that computing them is computationally equivalent to the following variant of (I) which is more suitable for the constrained setting:

(I)' (approximate) fixed points of the projected gradient descent-ascent dynamics (Section 3.3).

We show via an application of Brouwer’s fixed point theorem to the iteration map of the projected gradient descent-ascent dynamics that (I)’ are guaranteed to exist. In fact, not only do they exist, but computing them is in $\mathsf{PPAD}$ , as can be shown by bounding the Lipschitzness of the projected gradient descent-ascent dynamics (Theorem 5.2).

$\blacktriangleright$ Global Regime. This occurs when $\delta$ is comparable to the diameter of the constraint set. In this case, the existence of $(\varepsilon,\delta)$ -local min-max solutions is not guaranteed, and determining their existence is $\mathsf{NP}$ -hard, even if $\varepsilon$ is an absolute constant (Theorem 10.1).
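The notion of an approximate fixed point of projected gradient descent-ascent used in the local regime can be illustrated as follows. This is a minimal sketch, not the paper's definition from Section 3.3: it assumes scalar variables, the box $[0,1]^{2}$ as constraint set (so that projection reduces to clipping each coordinate), and a unit step size; the function names are hypothetical.

```python
import numpy as np

def projected_gda_step(grad_x, grad_y, x, y, step=1.0):
    """One projected gradient descent-ascent step on the assumed box [0, 1]^2."""
    new_x = np.clip(x - step * grad_x(x, y), 0.0, 1.0)   # descent step for the min player
    new_y = np.clip(y + step * grad_y(x, y), 0.0, 1.0)   # ascent step for the max player
    return new_x, new_y

def fixed_point_gap(grad_x, grad_y, x, y, step=1.0):
    """Distance moved by one step; a point is an alpha-approximate fixed point if this is < alpha."""
    new_x, new_y = projected_gda_step(grad_x, grad_y, x, y, step)
    return float(np.hypot(new_x - x, new_y - y))

# Partial derivatives of the example objective f(x, y) = (x - y)^2.
grad_x = lambda x, y: 2.0 * (x - y)
grad_y = lambda x, y: -2.0 * (x - y)

gap_on_diagonal = fixed_point_gap(grad_x, grad_y, 0.5, 0.5)
gap_off_diagonal = fixed_point_gap(grad_x, grad_y, 0.2, 0.6)
```

Every point with $x=y$ has zero gradient and is therefore an exact fixed point of this map, while points off the diagonal are moved by a non-negligible amount.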

The main results of this paper, summarized in Figure 1, characterize the complexity of computing local min-max solutions in the local regime. Our first main theorem is the following:

Informal Theorem 1. Computing $(\varepsilon,\delta)$ -local min-max solutions of Lipschitz and smooth objectives over convex compact domains in the local regime is $\mathsf{PPAD}$ -complete. The hardness holds even when the constraint set is a polytope that is a subset of $[0,1]^{d}$ , the objective takes values in $[-1,1]$ , and the smoothness, Lipschitzness, $1/\varepsilon$ and $1/\delta$ are polynomial in the dimension. Equivalently, computing $\alpha$ -approximate fixed points of the Projected Gradient Descent-Ascent dynamics on smooth and Lipschitz objectives is $\mathsf{PPAD}$ -complete, and the hardness holds even when the constraint set is a polytope that is a subset of $[0,1]^{d}$ , the objective takes values in $[-d,d]$ , and the smoothness, Lipschitzness, and $1/\alpha$ are polynomial in the dimension.

For the above complexity result we assume that we have “white box” access to the objective function. An important byproduct of our proof, however, is to also establish an unconditional hardness result in the Nemirovsky-Yudin [NY83] oracle optimization model, wherein we are given black-box access to oracles computing the objective function and its gradient. Our second main result is informally stated in Informal Theorem 2.

Informal Theorem 2. Assume that we have black-box access to an oracle computing a $G$ -Lipschitz and $L$ -smooth objective function $f:\mathcal{P}\to[-1,1]$ , where $\mathcal{P}\subseteq[0,1]^{d}$ is a known polytope, and its gradient $\nabla f$ . Then, computing an $(\varepsilon,\delta)$ -local min-max solution in the local regime (i.e., when $\delta<\sqrt{2\varepsilon/L}$ ) requires a number of oracle queries that is exponential in at least one of the following: $1/\varepsilon$ , $L$ , $G$ , or $d$ . In fact, exponential in $d$ -many queries are required even when $L$ , $G$ , $1/\varepsilon$ and $1/\delta$ are all polynomial in $d$ .

Importantly, the above lower bounds, in both the white-box and the black-box setting, come in sharp contrast to minimization problems, given that finding approximate local minima of smooth non-convex objectives ranging in $[-B,B]$ in the local regime can be done with first-order methods in $O(B\cdot L/\varepsilon)$ time/queries (see Section E). Our results are the first to show an exponential separation between these two fundamental problems in optimization in the black-box setting, and a super-polynomial separation in the white-box setting assuming $\mathsf{PPAD}\neq\mathsf{FP}$ .
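The contrast with minimization rests on the fact that gradient descent makes guaranteed progress in function value. The sketch below illustrates a simplified, unconstrained analogue of that argument, not the constrained procedure analyzed in the paper: by the standard descent lemma, a step of size $1/L$ decreases an $L$ -smooth function by at least $\|\nabla f\|^{2}/(2L)$, so a function with values in $[-B,B]$ admits at most $4BL/\mathrm{tol}^{2}$ steps with gradient norm at least $\mathrm{tol}$. The test function $\cos$, the starting point, and the tolerance are illustrative assumptions.

```python
import math

def gradient_descent(f, grad, x0, L, tol, max_iters=10**6):
    """Gradient descent with step 1/L; stops once the gradient magnitude drops below tol."""
    x, values = x0, [f(x0)]
    for _ in range(max_iters):
        g = grad(x)
        if abs(g) < tol:
            break
        x = x - g / L                      # each such step lowers f by at least g^2 / (2L)
        values.append(f(x))
    return x, values

# cos is 1-smooth and takes values in [-1, 1], so B = L = 1 (illustrative choice).
f, grad = math.cos, lambda x: -math.sin(x)
B, L, tol = 1.0, 1.0, 1e-3
x_final, values = gradient_descent(f, grad, 0.1, L, tol)
steps = len(values) - 1
bound = 4 * B * L / tol ** 2               # potential-function bound on the number of steps
```

The function values decrease monotonically along the run, which is exactly the potential argument that has no counterpart in min-max problems.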

We very briefly outline some of the main ideas for the $\mathsf{PPAD}$ -hardness proof that we present in Sections 6 and 7. Our starting point, as in many $\mathsf{PPAD}$ -hardness results, is a discrete analog of the problem of finding Brouwer fixed points of a continuous map. Departing from previous work, however, we do not use Sperner’s lemma as the discrete analog of Brouwer’s fixed point theorem. Instead, we define a new problem, called BiSperner, which is useful for showing our hardness results. BiSperner is closely related to the problem of finding panchromatic simplices guaranteed by Sperner’s lemma except, roughly speaking, that the vertices of the simplicization of a $d$ -dimensional hypercube are colored with $2d$ rather than $d+1$ colors, every point of the simplicization is colored with $d$ colors rather than one, and we are seeking a vertex of the simplicization so that the union of colors on the vertices in its neighborhood covers the full set of colors. The first step of our proof is to show that BiSperner is $\mathsf{PPAD}$ -hard. This step follows from the hardness of computing Brouwer fixed points.

The step that we describe next is only implicitly done by our proof, but it serves as useful intuition for reading and understanding it. We want to define a discrete two-player zero-sum game whose local equilibrium points correspond to solutions of a given BiSperner instance. Our two players, called “minimizer” and “maximizer,” each choose a vertex of the simplicization of the BiSperner instance. For every pair of strategies in our discrete game, i.e. vertices, chosen by our players, we define a function value and gradient values. Note that, at this point, we treat these values at different vertices of the simplicization as independent choices, i.e. we are not defining a function over the continuum whose function values and gradient values are consistent with these choices. It is our intention, however, that in the continuous two-player zero-sum game that we obtain in the next paragraph via our interpolation scheme, wherein the minimizer and maximizer may choose any point in the continuous hypercube, the function value determines the payment of the minimizer to the maximizer, and the gradient value determines the direction of the best-response dynamics of the game. Before getting to that continuous game in the next paragraph, the main technical step of this discrete part of our construction is showing that every local equilibrium of the discrete game corresponds to a solution of the BiSperner instance we are reducing from. In order to achieve this we need to add some constraints to couple the strategies of the minimizer and the maximizer player. This step is the reason that the constraints $g(\boldsymbol{x},\boldsymbol{y})\leq 0$ appear in the final min-max problem that we produce.

The third and quite challenging step of the proof is to show that we can interpolate in a smooth and computationally efficient way the discrete zero-sum game of the previous step. In low dimensions (treated in Section 6) such smooth and efficient interpolation can be done in a relatively simple way using single-dimensional smooth step functions. In high dimensions, however, the smooth and efficient interpolation becomes a challenging problem and to the best of our knowledge no simple solution exists. For this reason we construct our novel smooth and efficient interpolation coefficients of Section 8. These are a technically involved construction that we believe will prove to be very useful for characterizing the complexity of approximate solutions of other optimization problems.

The last part of our proof is to show that all the previous steps can be implemented efficiently with respect to both computational and query complexity. This part is essential for both our white-box and black-box results. Although this seems like a relatively easy step, it becomes more difficult due to the complicated expressions in the smooth and efficient interpolation coefficients used in the previous step.

Closing this section, we mention that all our $\mathsf{NP}$ -hardness results are proven using a cute application of the Lovász Local Lemma [EL73], which provides a powerful rounding tool that can drive the inapproximability all the way up to an absolute constant.

1.2 Local Minimization vs Local Min-Max Optimization

Because our proof is convoluted, involving multiple steps, it is difficult to discern from it why finding local min-max solutions is so much harder than finding local minima. For this reason, we illustrate in this section a fundamental difference between local minimization and local min-max optimization. This provides good intuition about why our hardness construction would fail if we tried to apply it to prove hardness results for finding local minima (hardness results that we know do not exist).

Ultimately a key difference between min-min and min-max optimization is that best-response paths in min-max optimization problems can be closed, i.e., can form a cycle, as shown in Figure 2, Panel (b). On the other hand, this is impossible in min-min problems as the function value must monotonically decrease along best-response paths, thus cycles may not exist.
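The difference between cycling and monotone progress can be seen on a toy example. The sketch below uses simultaneous gradient steps on the bilinear function $f(x,y)=xy$ as a stand-in for best-response paths; this substitution, the unconstrained domain, the step size, and the starting points are all illustrative assumptions rather than the construction shown in Figure 2.

```python
def run(x, y, step, sign_y, iters):
    """Simultaneous gradient steps on f(x, y) = x * y.

    sign_y = +1: the y player ascends (min-max); sign_y = -1: the y player descends (min-min).
    """
    points = [(x, y)]
    for _ in range(iters):
        gx, gy = y, x                                    # partial derivatives of x * y
        x, y = x - step * gx, y + sign_y * step * gy     # x always descends
        points.append((x, y))
    return points

def values(path):
    return [px * py for px, py in path]

minmax_path = run(1.0, 0.0, 0.1, +1, 100)
minmin_path = run(1.0, 0.5, 0.1, -1, 30)

# Sign patterns of (x, y) visited by the min-max path: it winds around the origin.
quadrants = {(px > 0, py > 0) for px, py in minmax_path}
minmin_values = values(minmin_path)
minmax_values = values(minmax_path)
```

In the min-min run the function value strictly decreases at every step, so no point can be revisited; in the min-max run the iterates rotate around the origin and the function value goes both up and down.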

The above discussion offers qualitative differences between min-min and min-max optimization, which lie at the heart of why our computational intractability results are possible to prove for min-max but not min-min problems. For the precise step in our construction that breaks if we were to switch from a min-max to a min-min problem we refer the reader to Remark 6.9.

1.3 Further Related Work

There is a broad literature on the complexity of equilibrium computation. Virtually all these results are obtained within the computational complexity formalism of total search problems in $\mathsf{NP}$ , which was spearheaded by [JPY88, MP89, Pap94b] to capture the complexity of search problems that are guaranteed to have a solution. Some key complexity classes in this landscape are shown in Figure 3. We give a non-exhaustive list of intractability results for equilibrium computation: [FPT04] prove that computing pure Nash equilibria in congestion games is $\mathsf{PLS}$ -complete; [DGP09] and later [CDT09] show that computing approximate Nash equilibria in normal-form games is $\mathsf{PPAD}$ -complete; [EY10] study the complexity of computing exact Nash equilibria (which may use irrational probabilities), introducing the complexity class $\mathsf{FIXP}$ ; [VY11, CPY17] consider the complexity of computing Market equilibria; [Das13, Rub15, Rub16] consider the complexity of computing approximate Nash equilibria of constant approximation; [KM18] establish a connection between approximate Nash equilibrium computation and the SoS hierarchy; [Meh14, DFS20] study the complexity of computing Nash equilibria in specially structured games. A result that is particularly useful for our work is the result of [HPV89] which shows black-box query lower bounds for computing Brouwer fixed points of a continuous function. We use this result in Section 9 as an ingredient for proving our black-box lower bounds for computing approximate local min-max solutions.

Beyond equilibrium computation and its applications to Economics and Game Theory, the study of total search problems has found profound connections to many scientific fields, including continuous optimization [DP11, DTZ18], combinatorial optimization [SY91], query complexity [BCE+95], topology [GH19], topological combinatorics and social choice theory [FG18, FG19, FRHSZ20b, FRHSZ20a], algebraic combinatorics [BIQ+17, GKSZ19], and cryptography [Jeř16, BPR15, SZZ18]. For a more extensive overview of total search problems we refer the reader to the recent survey by Daskalakis [Das18].

As already discussed, min-max optimization has intimate connections to the foundations of Game Theory, Mathematical Programming, Online Learning, Statistics, and several other fields. Recent applications of min-max optimization to Machine Learning, such as Generative Adversarial Networks and Adversarial Training, have motivated a slew of recent work targeting first-order (or other light-weight online learning) methods for solving min-max optimization problems for convex-concave, nonconvex-concave, as well as nonconvex-nonconcave objectives. Work on convex-concave and nonconvex-concave objectives has focused on obtaining online learning methods with improved rates [KM19, LJJ19, TJNO19, NSH+19, LTHC19, OX19, Zha19, ADSG19, AMLJG20, GPDO20, LJJ20] and last-iterate convergence guarantees [DISZ18, DP18, MR18, MPP18, RLLY18, HA18, ADLH19, DP19, LS19, GHP+19, MOP19, ALW19], while work on nonconvex-nonconcave problems has focused on identifying different notions of local min-max solutions [JNJ19, MV20] and studying the existence and (local) convergence properties of learning methods at these points [WZB19, MV20, MSV20].

Preliminaries

We study optimization problems involving real-valued functions, considering two access models to such functions.

Black Box Model. In this model we are given access to an oracle $\mathcal{O}_{f}$ such that given a point $\boldsymbol{x}\in[0,1]^{d}$ the oracle $\mathcal{O}_{f}$ returns the values $f(\boldsymbol{x})$ and $\nabla f(\boldsymbol{x})$ . In this model we assume that we can perform real number arithmetic operations. This is the traditional model used to prove lower bounds in Optimization and Machine Learning [NY83].

Promise Problems. To simplify the exposition of our paper, make the definitions of our computational problems and theorem statements clearer, and make our intractability results stronger, we choose to enforce the following constraints on our function access, $\mathcal{O}_{f}$ or $\mathcal{C}_{f}$ , as a promise, rather than enforcing these constraints in some syntactic manner.

Consistency of Function Values and Gradient Values. Given some oracle $\mathcal{O}_{f}$ or Turing machine $\mathcal{C}_{f}$ , it is difficult to determine by querying the oracle or examining the description of the Turing machine whether the function and gradient values output on different inputs are consistent with some differentiable function. In all our computational problems, we will only consider instances where this is promised to be the case. Moreover, for all our computational hardness results, the instances of the problems arising from our reductions satisfy these constraints, which are guaranteed syntactically by our reduction.

Lipschitzness, Smoothness and Boundedness. Similarly, given some oracle $\mathcal{O}_{f}$ or Turing machine $\mathcal{C}_{f}$ , it is difficult to determine, by querying the oracle or examining the description of the Turing machine, whether the function and gradient values output by $\mathcal{O}_{f}$ or $\mathcal{C}_{f}$ are consistent with some Lipschitz, smooth and bounded function with some prescribed Lipschitzness, smoothness, and bound on its absolute value. In all our computational problems, we only consider instances where the $G$ -Lipschitzness, $L$ -smoothness and $B$ -boundedness of the function are promised to hold for the parameters $G$ , $L$ and $B$ prescribed in the input of the problem. Moreover, for all our computational hardness results, the instances of the problems arising from our reductions satisfy this constraint, which is guaranteed syntactically by our reduction.

In summary, in the rest of this paper, whenever we prove an upper bound for some computational problem, namely an upper bound on the number of steps or queries to the function oracle required to solve the problem in the black-box model, or the containment of the problem in some complexity class in the white-box model, we assume that the afore-described properties are satisfied by the $\mathcal{O}_{f}$ or $\mathcal{C}_{f}$ provided in the input. On the other hand, whenever we prove a lower bound for some computational problem, namely a lower bound on the number of steps/queries required to solve it in the black-box model, or its hardness for some complexity class in the white-box model, the instances arising in our lower bounds are guaranteed to satisfy the above properties syntactically by our constructions. As such, our hardness results will not exploit the difficulty in checking whether $\mathcal{O}_{f}$ or $\mathcal{C}_{f}$ satisfy the above constraints in order to infuse computational complexity into our problems, but will faithfully target the computational problems pertaining to min-max optimization of smooth and Lipschitz objectives that we aim to understand in this paper.

2.1 Complexity Classes and Reductions

In this section we define the main complexity classes that we use in this paper, namely $\mathsf{NP}$ , $\mathsf{FNP}$ and $\mathsf{PPAD}$ , as well as the notion of reduction used to show containment or hardness of a problem for one of these complexity classes.

To define the complexity class $\mathsf{PPAD}$ we first define the notion of polynomial-time reductions between search problems (in this paper we only define and consider Karp-reductions between search problems), and the computational problem End-of-a-Line (this problem is sometimes called End-of-the-Line, but we adopt the nomenclature proposed by [Rub16] since we agree that it describes the problem better).

A search problem $P_{1}$ is polynomial-time reducible to a search problem $P_{2}$ if there exist polynomial-time computable functions $f:\left\{0,1\right\}^{*}\to\left\{0,1\right\}^{*}$ and $g:\left\{0,1\right\}^{*}\times\left\{0,1\right\}^{*}\times\left\{0,1\right\}^{*}\to\left\{0,1\right\}^{*}$ with the following properties: (i) if $\boldsymbol{x}$ is an input to $P_{1}$ , then $f(\boldsymbol{x})$ is an input to $P_{2}$ ; and (ii) if $\boldsymbol{y}$ is a solution to $P_{2}$ on input $f(\boldsymbol{x})$ , then $g(\boldsymbol{x},f(\boldsymbol{x}),\boldsymbol{y})$ is a solution to $P_{1}$ on input $\boldsymbol{x}$ .
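The roles of the two maps in this definition can be sketched in code. The example below is a toy illustration only: $P_{1}$ is taken to be "find the index of a maximum entry of a list" and $P_{2}$ "find the index of a minimum entry", both of which are hypothetical stand-ins chosen for brevity, and the helper names are assumptions.

```python
from typing import Callable

def solve_via_reduction(instance, f: Callable, g: Callable, solve_p2: Callable):
    """Solve a P1 instance using any solver for P2 together with the reduction maps (f, g)."""
    p2_instance = f(instance)                        # (i) map the P1 input to a P2 input
    p2_solution = solve_p2(p2_instance)              # any valid P2 solution is acceptable
    return g(instance, p2_instance, p2_solution)     # (ii) map the P2 solution back to P1

# Toy reduction: a maximum of xs sits at the same index as a minimum of the negated list.
negate = lambda xs: [-v for v in xs]
same_index = lambda xs, negated, idx: idx
argmin = lambda xs: min(range(len(xs)), key=xs.__getitem__)

answer = solve_via_reduction([3, 9, 4], negate, same_index, argmin)
```

Note that `g` receives the original input, the transformed input, and the solution, matching the three arguments in the definition.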

To make sense of the above definition, we envision that the circuits $\mathcal{C}_{S}$ and $\mathcal{C}_{P}$ implicitly define a directed graph, with vertex set $\{0,1\}^{n}$ , such that the directed edge $(\boldsymbol{x},\boldsymbol{y})\in\left\{0,1\right\}^{n}\times\left\{0,1\right\}^{n}$ belongs to the graph if and only if $\mathcal{C}_{S}(\boldsymbol{x})=\boldsymbol{y}$ and $\mathcal{C}_{P}(\boldsymbol{y})=\boldsymbol{x}$ . As such, all vertices in the implicitly defined graph have in-degree and out-degree at most $1$ . The above problem permits an output of $\boldsymbol{0}$ if $\boldsymbol{0}$ has equal in-degree and out-degree in this graph. Otherwise it permits an output $\boldsymbol{x}\neq\boldsymbol{0}$ such that $\boldsymbol{x}$ has in-degree or out-degree equal to $0$ . It follows by the parity argument on directed graphs, namely that in every directed graph the sum of in-degrees equals the sum of out-degrees, that End-of-a-Line is a total problem, i.e. that for any possible binary circuits $\mathcal{C}_{S}$ and $\mathcal{C}_{P}$ there exists a solution of the “0.” kind or the “1.” kind in the definition of our problem (or both). Indeed, if $\boldsymbol{0}$ has unequal in- and out-degrees, there must exist another vertex $\boldsymbol{x}\neq\boldsymbol{0}$ with unequal in- and out-degrees, thus one of these degrees must be $0$ (as all vertices in the graph have in- and out-degrees bounded by $1$ ).
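The implicit graph and the two kinds of solutions can be made concrete on a tiny instance. The sketch below enumerates all $2^{n}$ vertices, which is exponential in $n$ and is meant only to illustrate the definitions; vertices are encoded as integers instead of bit strings, the circuits are replaced by lookup tables, and the function name is hypothetical.

```python
def end_of_a_line_solutions(succ, pred, n):
    """Brute-force all solutions of a tiny End-of-a-Line instance on vertices 0 .. 2^n - 1."""
    vertices = range(2 ** n)
    # Edge (x, y) exists iff succ(x) == y and pred(y) == x, so every degree is 0 or 1.
    out_deg = {x: int(pred(succ(x)) == x) for x in vertices}   # is (x, succ(x)) an edge?
    in_deg = {x: int(succ(pred(x)) == x) for x in vertices}    # is (pred(x), x) an edge?
    solutions = []
    if in_deg[0] == out_deg[0]:
        solutions.append(0)                                    # solution of the "0." kind
    solutions += [x for x in vertices                          # solutions of the "1." kind
                  if x != 0 and (in_deg[x] == 0 or out_deg[x] == 0)]
    return solutions

# A line 0 -> 1 -> 2 -> 3; vertices 4..7 are self-loops (in-degree = out-degree = 1).
S = [1, 2, 3, 3, 4, 5, 6, 7]
P = [0, 0, 1, 2, 4, 5, 6, 7]
line_solutions = end_of_a_line_solutions(S.__getitem__, P.__getitem__, 3)

# If every vertex is a self-loop, vertex 0 is balanced and is itself a valid output.
identity = list(range(8))
trivial_solutions = end_of_a_line_solutions(identity.__getitem__, identity.__getitem__, 3)
```

In the first instance vertex 0 is an unbalanced source, so the parity argument forces another unbalanced vertex, here the sink 3.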

We are finally ready to define the complexity class $\mathsf{PPAD}$ introduced by [Pap94b].

The complexity class $\mathsf{PPAD}$ contains all search problems that are polynomial time reducible to the End-of-a-Line problem.

The complexity class $\mathsf{PPAD}$ is of particular importance, since it contains many fundamental problems in Game Theory, Economics, Topology and several other fields [DGP09, Das18]. A particularly important $\mathsf{PPAD}$ -complete problem is finding fixed points of continuous functions, whose existence is guaranteed by Brouwer’s fixed point theorem.

While not stated exactly in this form, the following is a straightforward implication of the results presented in [CDT09].

Computational Problems of Interest

In this section, we define the computational problems that we study in this paper and discuss our main results, postponing formal statements to Section 4. We start in Section 3.1 by defining the mathematical objects of our study, and proceed in Section 3.2 to define our main computational problems, namely: (1) finding approximate stationary points; (2) finding approximate local minima; and (3) finding approximate local min-max equilibria. In Section 3.3, we present some bonus problems, which are intimately related, as we will see, to problems (2) and (3). As discussed in Section 2, for ease of presentation, we define our problems as promise problems.

We define the concepts of stationary points, local minima, and local min-max equilibria of real valued functions, and make some remarks about their existence, as well as their computational complexity. The formal discussion of the latter is postponed to Sections 3.2 and 4.

Now that we have defined the domain of the real-valued functions that we consider in this paper we are ready to define a notion of approximate stationary points.

It is easy to see that there exist continuously differentiable functions $f$ that do not have any (approximate) stationary points, e.g. linear functions. As we will see later in this paper, deciding whether a given function $f$ has a stationary point is $\mathsf{NP}$ -hard and, in fact, it is even $\mathsf{NP}$ -hard to decide whether a function has an approximate stationary point of a very gross approximation. At the same time, verifying whether a given point is (approximately) stationary can be done efficiently given access to a polynomial-time Turing machine that computes $\nabla f$ , so the problem of deciding whether an (approximate) stationary point exists lies in $\mathsf{NP}$ , as long as we can guarantee that, if there is such a point, there will also be one with polynomial bit complexity. We postpone a formal discussion of the computational complexity of finding (approximate) stationary points or deciding their existence until we have formally defined our corresponding computational problem and settled the bit complexity of its solutions.

For the definition of local minima and local min-max equilibria we need the notion of closed $d$ -dimensional Euclidean balls.

To be clear, using the term “local minimum” in Definition 3.5 is a bit of a misnomer, since for large enough values of $\delta$ the definition captures global minima as well. As $\delta$ ranges from large to small, our notion of $(\varepsilon,\delta)$ -local minimum transitions from being an $\varepsilon$ -globally optimal point to being an $\varepsilon$ -locally optimal point. Importantly, unlike (approximate) stationary points, an $(\varepsilon,\delta)$ -local minimum is guaranteed to exist for all $\varepsilon,\delta>0$ due to the compactness of $[0,1]^{d}\cap\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ and the continuity of $f$ . Thus the problem of finding an $(\varepsilon,\delta)$ -local minimum is total for arbitrary values of $\varepsilon$ and $\delta$ . On the negative side, for arbitrary values of $\varepsilon$ and $\delta$ , there is no polynomial-size and polynomial-time verifiable witness for certifying that a point $\boldsymbol{x}^{\star}$ is an $(\varepsilon,\delta)$ -local minimum. Thus the problem of finding an $(\varepsilon,\delta)$ -local minimum is not known to lie in $\mathsf{FNP}$ . As we will see in Section 4, this issue can be circumvented if we focus on particular settings of $\varepsilon$ and $\delta$ , in relationship to the Lipschitzness and smoothness of $f$ and the dimension $d$ .

Finally we define $(\varepsilon,\delta)$ -local min-max equilibrium as follows, recasting Definition 1.1 to the constraint set $\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ .

$f(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})<f(\boldsymbol{x},\boldsymbol{y}^{\star})+\varepsilon$ for every $\boldsymbol{x}\in\mathsf{B}_{d_{1}}(\delta;\boldsymbol{x}^{\star})$ with $(\boldsymbol{x},\boldsymbol{y}^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ ; and

$f(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})>f(\boldsymbol{x}^{\star},\boldsymbol{y})-\varepsilon$ for every $\boldsymbol{y}\in\mathsf{B}_{d_{2}}(\delta;\boldsymbol{y}^{\star})$ with $(\boldsymbol{x}^{\star},\boldsymbol{y})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ .

Similarly to Definition 3.5, for large enough values of $\delta$ , Definition 3.6 captures global min-max equilibria as well. As $\delta$ ranges from large to small, our notion of $(\varepsilon,\delta)$ -local min-max equilibrium transitions from being an $\varepsilon$ -approximate min-max equilibrium to being an $\varepsilon$ -approximate local min-max equilibrium. Moreover, in comparison to local minima and stationary points, the problem of finding an $(\varepsilon,\delta)$ -local min-max equilibrium is neither total nor can its solutions be verified efficiently for all values of $\varepsilon$ and $\delta$ , even when $\mathcal{P}(\boldsymbol{A},\boldsymbol{b})=[0,1]^{d}$ . Again, this issue can be circumvented if we focus on particular settings of $\varepsilon$ and $\delta$ values, as we will see in Section 4.

First-Order Local Optimization Computational Problems

In this section, we define the search problems associated with our aforementioned definitions of approximate stationary points, local minima, and local min-max equilibria. We state our problems in terms of white-box access to the function $f$ and its gradient. Switching to the black-box variants of our computational problems amounts to simply replacing the Turing machines provided in the input of the problems with oracle access to the function and its gradient, as discussed in Section 2. As per our discussion in the same section, we define our computational problems as promise problems, the promise being that the Turing machine (or oracle) provided in the input to our problems outputs function values and gradient values that are consistent with a smooth and Lipschitz function whose smoothness and Lipschitzness are as prescribed in the input. Besides making the presentation cleaner, as we discussed in Section 2, the motivation for doing so is to prevent computational complexity from being tucked into our problems merely because the Turing machines/oracles provided in the input might output function and gradient values that are not consistent with any Lipschitz and smooth function. Importantly, all our computational hardness results syntactically guarantee that the Turing machines/oracles provided as input to our constructed hard instances satisfy these constraints.

Before stating our main computational problems below, we note that, for each problem, the dimension $d$ (in unary representation) is also an implicit input, as the description of the Turing machine $\mathcal{C}_{f}$ (or the interface to the oracle $\mathcal{O}_{f}$ in the black-box counterpart of each problem below) has size at least linear in $d$ . We also refer to Remark 2.6 for how we may formally study complexity problems that take a polynomial-time Turing Machine in their input.

It is easy to see that StationaryPoint lies in $\mathsf{FNP}$ . Indeed, if there exists some point $\boldsymbol{x}\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ such that $\left\|\nabla f(\boldsymbol{x})\right\|_{2}<\varepsilon/2$ , then by the $L$ -smoothness of $f$ there must exist some point $\boldsymbol{x}^{\star}\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ of bit complexity polynomial in the size of the input such that $\left\|\nabla f(\boldsymbol{x}^{\star})\right\|_{2}<\varepsilon$ . On the other hand, it is clear that no such point exists if for all $\boldsymbol{x}\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ , $\left\|\nabla f(\boldsymbol{x})\right\|_{2}>\varepsilon$ . We note that the looseness of the output requirement in our problem for functions $f$ that do not have points $\boldsymbol{x}\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ such that $\left\|\nabla f(\boldsymbol{x})\right\|_{2}<\varepsilon/2$ but do have points $\boldsymbol{x}\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ such that $\left\|\nabla f(\boldsymbol{x})\right\|_{2}\leq\varepsilon$ is introduced for the sole purpose of making the problem lie in $\mathsf{FNP}$ , as otherwise we would not be able to guarantee that the solutions to our search problem have polynomial bit complexity. As we show in Section 4, StationaryPoint is also $\mathsf{FNP}$ -hard, even when $\varepsilon$ is a constant, the constraint set is very simple, namely $\mathcal{P}(\boldsymbol{A},\boldsymbol{b})=[0,1]^{d}$ , and $G,L$ are both polynomial in $d$ .

Next, we define the computational problems associated with local minimum and local min-max equilibrium. Recall that the first is guaranteed to have a solution, because, in particular, a global minimum exists due to the continuity of $f$ and the compactness of $\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ .

Bonus Problems: Fixed Points of Gradient Descent/Gradient Descent-Ascent

Next we present a couple of bonus problems, GDFixedPoint and GDAFixedPoint, which respectively capture the computation of fixed points of the (projected) gradient descent and the (projected) gradient descent-ascent dynamics, with learning rate $=1$ . As we see in Section 5, these problems are intimately related, indeed equivalent under polynomial-time reductions, to problems LocalMin and LocalMinMax respectively, in certain regimes of the approximation parameters. Before stating problems GDFixedPoint and GDAFixedPoint, we define the mappings $F_{GD}$ and $F_{GDA}$ whose fixed points these problems are targeting.

for all $(\boldsymbol{x},\boldsymbol{y})\in K$ , where $K(\boldsymbol{y})=\{\boldsymbol{x}^{\prime}\mid(\boldsymbol{x}^{\prime},\boldsymbol{y})\in K\}$ and $K(\boldsymbol{x})=\{\boldsymbol{y}^{\prime}\mid(\boldsymbol{x},\boldsymbol{y}^{\prime})\in K\}$ .

Note that $F_{GDA}$ is called “unsafe” because the projection happens individually for $\boldsymbol{x}-\nabla_{\boldsymbol{x}}f(\boldsymbol{x},\boldsymbol{y})$ and $\boldsymbol{y}+\nabla_{\boldsymbol{y}}f(\boldsymbol{x},\boldsymbol{y})$ , thus $F_{GDA}(\boldsymbol{x},\boldsymbol{y})$ may not lie in $K$ . We also define the “safe” version $F_{sGDA}$ , which projects the pair $(\boldsymbol{x}-\nabla_{\boldsymbol{x}}f(\boldsymbol{x},\boldsymbol{y}),\boldsymbol{y}+\nabla_{\boldsymbol{y}}f(\boldsymbol{x},\boldsymbol{y}))$ jointly onto $K$ . As we show in Section 5 (in particular inside the proof of Theorem 5.2), the problems of computing fixed points of $F_{GDA}$ and of $F_{sGDA}$ are computationally equivalent, so we stick to $F_{GDA}$ , which makes the presentation slightly cleaner.
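As an illustration of the map $F_{GDA}$ , the following Python sketch applies one step of projected gradient descent-ascent with learning rate $1$ in the simplest setting where $K$ is a box, so that each of the two projections reduces to coordinate-wise clipping. The box domain and the names `clip01`, `gda_map`, and `fixed_point_gap` are assumptions made only for this example. For a box the separate and the joint projection coincide, so the sketch does not exhibit the difference between $F_{GDA}$ and $F_{sGDA}$ .

```python
import math


def clip01(v):
    """Euclidean projection onto the box [0, 1]^k, applied coordinate-wise."""
    return [min(1.0, max(0.0, t)) for t in v]


def gda_map(grad_x, grad_y, x, y):
    """One application of projected gradient descent-ascent with step size 1."""
    gx, gy = grad_x(x, y), grad_y(x, y)
    new_x = clip01([xi - gi for xi, gi in zip(x, gx)])  # descent step in x
    new_y = clip01([yi + gi for yi, gi in zip(y, gy)])  # ascent step in y
    return new_x, new_y


def fixed_point_gap(grad_x, grad_y, x, y):
    """Euclidean distance between (x, y) and its image under gda_map."""
    new_x, new_y = gda_map(grad_x, grad_y, x, y)
    moved = [a - b for a, b in zip(list(x) + list(y), new_x + new_y)]
    return math.sqrt(sum(t * t for t in moved))
```

A point is an $\alpha$ -approximate fixed point in the sense used later exactly when `fixed_point_gap` returns a value below $\alpha$ . For the bilinear function $f(x,y)=(x-1/2)(y-1/2)$ the gap is zero at $(1/2,1/2)$ .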

We are now ready to define GDFixedPoint and GDAFixedPoint. As per earlier discussions, we define these computational problems as promise problems, the promise being that the Turing machine provided in the input to these problems outputs function values and gradient values that are consistent with a smooth and Lipschitz function whose smoothness and Lipschitzness are as prescribed in the input to these problems.

Summary of Results

In this section we summarize our results for the optimization problems that we defined in the previous section. We start with our theorem about the complexity of finding approximate stationary points, which we show to be $\mathsf{FNP}$ -complete even for large values of the approximation.

The computational problem StationaryPoint is $\mathsf{FNP}$ -complete, even when $\varepsilon$ is set to any value $\leq 1/24$ , and even when $\mathcal{P}(\boldsymbol{A},\boldsymbol{b})=[0,1]^{d}$ , $G=\sqrt{d}$ , $L=d$ , and $B=1$ .

The complexity of LocalMin and LocalMinMax is more difficult to characterize, as the nature of these problems changes drastically depending on the relationship of $\delta$ with $\varepsilon$ , $G$ , $L$ and $d$ , which determines whether these problems ask for a globally vs locally approximately optimal solution. In particular, there are two regimes wherein the complexity of both problems is simple to characterize.

Global Regime. When $\delta\geq\sqrt{d}$ then both LocalMin and LocalMinMax ask for a globally optimal solution. In this regime it is not difficult to see that both problems are $\mathsf{FNP}$ -hard to solve even when $\varepsilon=\Theta(1)$ and $G$ , $L$ are $O(d)$ (see Section 10).

Trivial Regime. When $\delta$ satisfies $\delta<\varepsilon/G$ , then for every point $\boldsymbol{z}\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ it holds that $\left|f(\boldsymbol{z})-f(\boldsymbol{z}^{\prime})\right|<\varepsilon$ for every $\boldsymbol{z}^{\prime}\in\mathsf{B}_{d}(\delta;\boldsymbol{z})$ with $\boldsymbol{z}^{\prime}\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ . Thus, every point $\boldsymbol{z}$ in the domain $\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ is a solution to both LocalMin and LocalMinMax.
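The claim in the trivial regime is a one-line consequence of the $G$ -Lipschitzness of $f$ . Spelled out, for any feasible $\boldsymbol{z}^{\prime}\in\mathsf{B}_{d}(\delta;\boldsymbol{z})$ and $\delta<\varepsilon/G$ :

$$\left|f(\boldsymbol{z})-f(\boldsymbol{z}^{\prime})\right|\;\leq\;G\left\|\boldsymbol{z}-\boldsymbol{z}^{\prime}\right\|_{2}\;\leq\;G\,\delta\;<\;G\cdot\frac{\varepsilon}{G}\;=\;\varepsilon .$$

Both conditions of a local minimum and of a local min-max equilibrium only compare $f$ at points within distance $\delta$ of each other, so they hold at every feasible point.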

It is clear from our discussion above, and in earlier sections, that, to really capture the complexity of finding local as opposed to global minima/min-max equilibria, we should restrict the value of $\delta$ . We identify the following regime, which we call the “local regime.” As we argue shortly, this regime is markedly different from the global regime identified above in that (i) a solution is guaranteed to exist for both our problems of interest, whereas in the global regime only LocalMin is guaranteed to have a solution; and (ii) their computational complexity transitions to lower complexity classes.

Local Regime. Our main focus in this paper is the regime defined by $\delta<\sqrt{2\varepsilon/L}$ . In this regime it is well known that Projected Gradient Descent can solve LocalMin in time $O(B\cdot L/\varepsilon)$ (see Appendix E). Our main interest is understanding the complexity of LocalMinMax, which is not well understood in this regime. We note that the use of the constant $2$ in the constraint $\delta<\sqrt{2\varepsilon/L}$ which defines the local regime has a natural motivation: consider a point $\boldsymbol{z}$ where an $L$ -smooth function $f$ has $\nabla f(\boldsymbol{z})=0$ ; it follows from the definition of smoothness that $\boldsymbol{z}$ is both an $(\varepsilon,\delta)$ -local min and an $(\varepsilon,\delta)$ -local min-max equilibrium, as long as $\delta<\sqrt{2\varepsilon/L}$ .
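The motivation for the constant $2$ can be made explicit with the standard quadratic bound implied by $L$ -smoothness on a convex domain. A sketch of the calculation, for any feasible $\boldsymbol{z}^{\prime}$ with $\left\|\boldsymbol{z}^{\prime}-\boldsymbol{z}\right\|_{2}\leq\delta$ and $\nabla f(\boldsymbol{z})=0$ :

$$\left|f(\boldsymbol{z}^{\prime})-f(\boldsymbol{z})\right|\;=\;\left|f(\boldsymbol{z}^{\prime})-f(\boldsymbol{z})-\langle\nabla f(\boldsymbol{z}),\boldsymbol{z}^{\prime}-\boldsymbol{z}\rangle\right|\;\leq\;\frac{L}{2}\left\|\boldsymbol{z}^{\prime}-\boldsymbol{z}\right\|_{2}^{2}\;\leq\;\frac{L}{2}\,\delta^{2}\;<\;\frac{L}{2}\cdot\frac{2\varepsilon}{L}\;=\;\varepsilon .$$

Applying this bound to deviations in all coordinates gives the local minimum condition, and applying it separately to deviations in $\boldsymbol{x}$ and in $\boldsymbol{y}$ gives the two conditions of a local min-max equilibrium.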

The following theorems provide tight upper and lower bounds on the computational complexity of solving LocalMinMax in the local regime. For compactness, we define the following problem:

We define the local-regime local min-max equilibrium computation problem, in short LR-LocalMinMax, to be the search problem LocalMinMax restricted to instances in the local regime, i.e. satisfying $\delta<\sqrt{2\varepsilon/L}$ .

The computational problem LR-LocalMinMax belongs to $\mathsf{PPAD}$ . As a byproduct, if some function $f$ is $G$ -Lipschitz and $L$ -smooth, then an $(\varepsilon,\delta)$ -local min-max equilibrium is guaranteed to exist when $\delta<\sqrt{2\varepsilon/L}$ , i.e. in the local regime.

An important property of our reduction in the proof of Theorem 4.4 is that it is a black-box reduction. We can hence prove the following unconditional lower bound in the black-box model.

Our main goal in the rest of the paper is to provide the proofs of Theorems 4.3, 4.4 and 4.5. In Section 5, we show how to use Brouwer’s fixed point theorem to prove the existence of approximate local min-max equilibrium in the local regime. Moreover, we establish an equivalence between LocalMinMax and GDAFixedPoint, in the local regime, and show that both belong to $\mathsf{PPAD}$ . In Sections 6 and 7, we provide a detailed proof of our main result, i.e. Theorem 4.4. Finally, in Section 9, we show how our proof from Section 7 produces as a byproduct the black-box, unconditional lower bound of Theorem 4.5. In Section 8, we outline a useful interpolation technique which allows us to interpolate a function given its values and the values of its gradient on a hypergrid, so as to enforce the Lipschitzness and smoothness of the interpolating function. We make heavy use of this technically involved result in all our hardness proofs.

Existence of Approximate Local Min-Max Equilibrium

In this section, we establish the totality of LR-LocalMinMax, i.e. LocalMinMax for instances satisfying $\delta<\sqrt{2\varepsilon/L}$ as defined in Definition 4.2. In particular, we prove that every $G$ -Lipschitz and $L$ -smooth function admits an $(\varepsilon,\delta)$ -local min-max equilibrium, as long as $\delta<\sqrt{2\varepsilon/L}$ . A byproduct of our proof is in fact that LR-LocalMinMax lies inside $\mathsf{PPAD}$ . Specifically the main tool that we use to prove our result is a computational equivalence between the problem of finding fixed points of the Gradient Descent/Ascent dynamic, i.e. GDAFixedPoint, and the problem LR-LocalMinMax. A similar equivalence between GDFixedPoint and LocalMin also holds, but the details of that are left to the reader as a simple exercise. Next, we first present the equivalence between GDAFixedPoint and LR-LocalMinMax, and we then show that GDAFixedPoint is in $\mathsf{PPAD}$ , which then also establishes that LR-LocalMinMax is in $\mathsf{PPAD}$ .

For arbitrary $\varepsilon>0$ and $0<\delta<\sqrt{2\varepsilon/L}$ , suppose that $(\boldsymbol{x}^{\ast},\boldsymbol{y}^{\ast})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ is an $\alpha$ -approximate fixed point of $F_{GDA}$ , i.e., $\left\|(\boldsymbol{x}^{\ast},\boldsymbol{y}^{\ast})-F_{GDA}(\boldsymbol{x}^{\ast},\boldsymbol{y}^{\ast})\right\|_{2}<\alpha$ , where $\alpha\leq\frac{\sqrt{(G+\delta)^{2}+4(\varepsilon-\frac{L}{2}\delta^{2})}-(G+\delta)}{2}$ . Then $(\boldsymbol{x}^{\ast},\boldsymbol{y}^{\ast})$ is also an $(\varepsilon,\delta)$ -local min-max equilibrium of $f$ .

For arbitrary $\alpha>0$ , suppose that $(\boldsymbol{x}^{\ast},\boldsymbol{y}^{\ast})$ is an $(\varepsilon,\delta)$ -local min-max equilibrium of $f$ for $\varepsilon=\frac{\alpha^{2}\cdot L}{(5L+2)^{2}}$ and $\delta=\sqrt{\varepsilon/L}$ . Then $(\boldsymbol{x}^{\ast},\boldsymbol{y}^{\ast})$ is also an $\alpha$ -approximate fixed point of $F_{GDA}$ .
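The parameter translations in the two statements above are simple closed-form expressions, and the following Python sketch evaluates them. The function names `alpha_threshold` and `eps_delta_from_alpha` are hypothetical. The first function returns the largest $\alpha$ allowed by the first statement, which is the positive root of $\alpha^{2}+(G+\delta)\alpha-(\varepsilon-\frac{L}{2}\delta^{2})=0$ ; the second returns the pair $(\varepsilon,\delta)$ prescribed by the second statement.

```python
import math


def alpha_threshold(G, L, eps, delta):
    """Largest alpha for which an alpha-approximate fixed point of F_GDA is
    guaranteed to be an (eps, delta)-local min-max equilibrium."""
    if not 0 < delta < math.sqrt(2 * eps / L):
        raise ValueError("outside the local regime: need 0 < delta < sqrt(2*eps/L)")
    s = G + delta
    slack = eps - 0.5 * L * delta ** 2  # positive exactly in the local regime
    return (math.sqrt(s * s + 4 * slack) - s) / 2


def eps_delta_from_alpha(alpha, L):
    """Parameters (eps, delta) for which a local min-max equilibrium is
    guaranteed to be an alpha-approximate fixed point of F_GDA."""
    eps = alpha ** 2 * L / (5 * L + 2) ** 2
    delta = math.sqrt(eps / L)
    return eps, delta
```

Note that the second choice gives $\delta=\sqrt{\varepsilon/L}=\alpha/(5L+2)$ , which indeed satisfies $\delta<\sqrt{2\varepsilon/L}$ .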

The proof of Theorem 5.1 is presented in Appendix B.1. As already discussed, we use GDAFixedPoint as an intermediate step to establish the totality of LR-LocalMinMax and to show its inclusion in $\mathsf{PPAD}$ . This leads to the following theorem.

The computational problems GDAFixedPoint and LR-LocalMinMax are both total search problems and they both lie in $\mathsf{PPAD}$ .

Observe that Theorem 4.3 is implied by Theorem 5.2 whose proof is presented in Appendix B.2.

Hardness of Local Min-Max Equilibrium – Four-Dimensions

In Section 5, we established that LR-LocalMinMax belongs to $\mathsf{PPAD}$ . Our proof is via the intermediate problem GDAFixedPoint, which we showed to be computationally equivalent to LR-LocalMinMax. Our next step is to prove the $\mathsf{PPAD}$ -hardness of LR-LocalMinMax, again using GDAFixedPoint as an intermediate problem.

The problem GDAFixedPoint is $\mathsf{PPAD}$ -complete even in dimension $d=4$ and $B=2$ . Therefore, LR-LocalMinMax is $\mathsf{PPAD}$ -complete even in dimension $d=4$ and $B=2$ .

The first color of every vertex is either $1^{-}$ or $1^{+}$ and the second color is either $2^{-}$ or $2^{+}$ .

The first color of all vertices on the left boundary of the grid is $1^{+}$ .

The first color of all vertices on the right boundary of the grid is $1^{-}$ .

The second color of all vertices on the bottom boundary of the grid is $2^{+}$ .

The second color of all vertices on the top boundary of the grid is $2^{-}$ .

Consider the function $g(\boldsymbol{x})=M(\boldsymbol{x})-\boldsymbol{x}$ . Since $M$ is $L$ -Lipschitz, the function $g:[0,1]^{2}\to[-1,1]^{2}$ is $(L+1)$ -Lipschitz. Additionally, $g$ can be easily computed via a polynomial-time Turing machine $\mathcal{C}_{g}$ that uses $\mathcal{C}_{M}$ as a subroutine. We construct a proper coloring of a fine grid of $[0,1]^{2}$ using the signs of the outputs of $g$ . Namely, we set $n=\left\lceil\log(L/\gamma)+2\right\rceil$ , which defines a $2^{n}\times 2^{n}$ grid over $[0,1]^{2}$ that is indexed by $\{0,1\}^{n}\times\{0,1\}^{n}$ . Let $g_{\eta}:[0,1]^{2}\to[-1,1]^{2}$ be the function that the Turing machine $\mathcal{C}_{g}$ evaluates when the requested accuracy is $\eta>0$ . Now we can define the circuit $\mathcal{C}_{l}$ as follows. Recall that we abuse notation and use a coordinate $i\in\{0,1\}^{n}$ both as a binary string and as a number in $\left(\left[2^{n}-1\right]-1\right)$ ; it is clear from the context which of the two we use.
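To illustrate how a coloring can be read off from the signs of $g$ , here is a minimal Python sketch. It is an assumption-laden illustration rather than the exact circuit $\mathcal{C}_{l}$ : it uses exact values of $g$ instead of $g_{\eta}$ , takes the logarithm in base $2$ , places grid vertex $(i,j)$ at $(i/(2^{n}-1),j/(2^{n}-1))$ , breaks ties at zero towards $+1$ , and identifies the left, right, bottom, and top boundaries with $i=0$ , $i=2^{n}-1$ , $j=0$ , and $j=2^{n}-1$ . The names `grid_bits`, `make_coloring`, and `panchromatic_cells` are hypothetical.

```python
import math


def grid_bits(L, gamma):
    """n = ceil(log(L / gamma) + 2), with the logarithm taken in base 2 (assumed)."""
    return math.ceil(math.log2(L / gamma) + 2)


def make_coloring(g, n):
    """Return color(i, j) -> (c1, c2) in {+1, -1}^2 from the signs of g."""
    N = 2 ** n

    def color(i, j):
        g1, g2 = g((i / (N - 1), j / (N - 1)))
        c1 = 1 if g1 >= 0 else -1
        c2 = 1 if g2 >= 0 else -1
        # Boundary conditions of a proper coloring override the signs of g.
        if i == 0:
            c1 = 1       # left boundary: first color 1+
        if i == N - 1:
            c1 = -1      # right boundary: first color 1-
        if j == 0:
            c2 = 1       # bottom boundary: second color 2+
        if j == N - 1:
            c2 = -1      # top boundary: second color 2-
        return c1, c2

    return color


def panchromatic_cells(color, n):
    """List the cells whose corners carry all four colors 1+, 1-, 2+, 2-."""
    N = 2 ** n
    cells = []
    for i in range(N - 1):
        for j in range(N - 1):
            corners = [color(a, b) for a in (i, i + 1) for b in (j, j + 1)]
            if {c[0] for c in corners} == {1, -1} and {c[1] for c in corners} == {1, -1}:
                cells.append((i, j))
    return cells
```

For the constant map $M(\boldsymbol{x})=(1/2,1/2)$ and $n=4$ , the only panchromatic cell is the one with lower-left vertex $(7,7)$ , which contains the fixed point $(1/2,1/2)$ .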

Let $\eta=\frac{\gamma}{2\sqrt{2}}$ . Then there exists $\boldsymbol{x}\in R$ such that $\left|g_{1}(\boldsymbol{x})\right|\leq\frac{\gamma}{2\sqrt{2}}$ and $\boldsymbol{y}\in R$ such that $\left|g_{2}(\boldsymbol{y})\right|\leq\frac{\gamma}{2\sqrt{2}}$ .

We will prove the existence of $\boldsymbol{x}$ ; the existence of $\boldsymbol{y}$ follows using an identical argument. If there exists a corner $\boldsymbol{x}$ of $R$ such that $g_{1}(\boldsymbol{x})$ is in the range $[-\eta,\eta]$ then the claim follows. Suppose not. Using this together with the fact that the first color of one of the corners of $R$ is $1^{-}$ and also the first color of one of the corners of $R$ is $1^{+}$ we conclude that there exist corners $\boldsymbol{x},\boldsymbol{x}^{\prime}$ of $R$ such that $g_{\eta,1}(\boldsymbol{x})\geq 0$ and $g_{\eta,1}(\boldsymbol{x}^{\prime})\leq 0$ . (The latter is inaccurate for the cases where the vertex $(0,j)$ belongs to either of the facets $i=0$ or $i=2^{n}-1$ . Notice that the coloring in such vertices does not depend on the value of $g_{\eta}$ . However, in case the color of such a corner is not consistent with the value of $g_{\eta}$ , i.e. $g_{\eta,1}(0,j)<0$ and $\mathcal{C}_{l}^{1}(0,j)=1$ , this means that $|g_{1}(0,j)|\leq\eta$ . This is due to the fact that $g_{1}(0,j)\geq 0$ and $|g_{1}(0,j)-g_{\eta,1}(0,j)|\leq\eta$ .) But we have that $\left\|g_{\eta}-g\right\|_{2}\leq\eta$ . This together with the fact that $g_{1}(\boldsymbol{x})\not\in[-\eta,\eta]$ and $g_{1}(\boldsymbol{x}^{\prime})\not\in[-\eta,\eta]$ implies that $g_{1}(\boldsymbol{x})\geq 0$ and also $g_{1}(\boldsymbol{x}^{\prime})\leq 0$ . But because of the $L$ -Lipschitzness of $g$ and because the distance between $\boldsymbol{x}$ and $\boldsymbol{x}^{\prime}$ is at most $\sqrt{2}\frac{\gamma}{4L}$ we conclude that $\left|g_{1}(\boldsymbol{x})-g_{1}(\boldsymbol{x}^{\prime})\right|\leq\frac{\gamma}{2\sqrt{2}}$ . Hence due to the signs of $g_{1}(\boldsymbol{x})$ and $g_{1}(\boldsymbol{x}^{\prime})$ we conclude that $\left|g_{1}(\boldsymbol{x})\right|\leq\frac{\gamma}{2\sqrt{2}}$ . In the same way we can prove that $\left|g_{2}(\boldsymbol{y})\right|\leq\frac{\gamma}{2\sqrt{2}}$ and the claim follows. ∎

Using Claim 6.3 and the $L$ -Lipschitzness of $g$ we get that for every $\boldsymbol{z}\in R$

where we have also used the fact that for any two points $\boldsymbol{z},\boldsymbol{w}\in R$ it holds that $\left\|\boldsymbol{z}-\boldsymbol{w}\right\|_{2}\leq\sqrt{2}\frac{\gamma}{4L}$ , which follows from the definition of the size of the grid. Therefore we have that $\left\|g(\boldsymbol{z})\right\|_{2}\leq\gamma$ and hence $\left\|M(\boldsymbol{z})-\boldsymbol{z}\right\|_{2}\leq\gamma$ , which implies that any point $\boldsymbol{z}\in R$ is a $\gamma$ -approximate fixed point of $M$ , and the lemma follows. ∎
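One way to spell out the arithmetic behind the bound $\left\|g(\boldsymbol{z})\right\|_{2}\leq\gamma$ , using only the quantities stated above, is the following. Let $\boldsymbol{x},\boldsymbol{y}\in R$ be the points given by Claim 6.3. Then for every $\boldsymbol{z}\in R$ ,

$$\left|g_{1}(\boldsymbol{z})\right|\;\leq\;\left|g_{1}(\boldsymbol{x})\right|+L\left\|\boldsymbol{z}-\boldsymbol{x}\right\|_{2}\;\leq\;\frac{\gamma}{2\sqrt{2}}+L\cdot\sqrt{2}\,\frac{\gamma}{4L}\;=\;\frac{\gamma}{2\sqrt{2}}+\frac{\gamma}{2\sqrt{2}}\;=\;\frac{\gamma}{\sqrt{2}},$$

and the same bound holds for $\left|g_{2}(\boldsymbol{z})\right|$ with $\boldsymbol{y}$ in place of $\boldsymbol{x}$ . Combining the two coordinates,

$$\left\|g(\boldsymbol{z})\right\|_{2}\;=\;\sqrt{g_{1}(\boldsymbol{z})^{2}+g_{2}(\boldsymbol{z})^{2}}\;\leq\;\sqrt{2\cdot\frac{\gamma^{2}}{2}}\;=\;\gamma .$$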

From 2D Bi-Sperner to Fixed Points of Gradient Descent/Ascent

The simplest way to achieve this is to define the function $f$ locally close to $(\boldsymbol{x},\boldsymbol{y})$ to be equal to

Similarly, if $\boldsymbol{x}$ is on a vertex of the $N\times N$ grid, and the coloring of this vertex is $(1^{-},2^{-})$ , i.e. the output of $\mathcal{C}_{l}$ on this vertex is $(-1,-1)$ , then we would like to have

The simplest way to achieve this is to define the function $f$ locally close to $(\boldsymbol{x},\boldsymbol{y})$ to be equal to

In Figure 5 we show pictorially the correspondence of the colors of the vertices of the grid with the gradient of the function $f$ that we design. As shown in the figure, any set of vertices that share at least one of the colors $1^{+}$ , $1^{-}$ , $2^{+}$ , $2^{-}$ , agree on the direction of the gradient with respect to the horizontal or the vertical axis. This observation is one of the main ingredients in the proof of correctness of our reduction that we present later in this section.

When $\boldsymbol{x}$ is not on a vertex of the $N\times N$ grid then our goal is to define $f$ via interpolating the functions corresponding to the corners of the cell in which $\boldsymbol{x}$ belongs. The reason that this interpolation is challenging is that we need to make sure the following properties are satisfied

the resulting function $f$ is both Lipschitz and smooth inside every cell,

the resulting function $f$ is both Lipschitz and smooth even at the boundaries of every cell, where two different cells stick together,

For the low dimensional case, that we explore in this section, satisfying the first two properties is not a very difficult task, whereas for the third property we need to be careful and achieving this property is the main technical contribution of this section. On the contrary, for the high-dimensional case that we explore in Section 7 even achieving the first two properties is very challenging and technical.

As we will see in Section 6.2.1, if we accomplish a construction of a function $f$ with the aforementioned properties, then the fixed points of the projected Gradient Descent/Ascent can only appear inside cells that have all of the colors $\{1^{-},1^{+},2^{-},2^{+}\}$ at their corners. To see this consider a cell that misses some color, e.g. $1^{+}$ . Then all the corners of this cell have as first color $1^{-}$ . Since $f$ is defined as interpolation of the functions in the corners of the cells, with the aforementioned properties, inside that cell there is always a direction with respect to $x_{1}$ and $y_{1}$ for which the gradient is large enough. Hence any point inside that cell cannot be a fixed point of the projected Gradient Descent/Ascent. Of course this example provides just an intuition of our construction and ignores the case where the cell is on the boundary of the grid. We provide a detailed explanation of this case in Section 6.2.1.

The above neat idea needs some technical adjustments in order to work. First, the interpolation of the function in the interior of the cell must be smooth enough so that the resulting function is both Lipschitz and smooth. In order to satisfy this, we need to choose appropriate coefficients of the interpolation that interpolate smoothly not only the value of the function but also its derivatives. For this purpose we use the following smooth step function of order $1$ .

We define $S_{1}:[0,1]\to[0,1]$ to be the smooth step function of order $1$ that is equal to $S_{1}(x)=3x^{2}-2x^{3}$ . Observe that the following hold: $S_{1}(0)=0$ , $S_{1}(1)=1$ , $S_{1}^{\prime}(0)=0$ , and $S_{1}^{\prime}(1)=0$ .
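The stated properties of $S_{1}$ are easy to check numerically. The Python sketch below, with the hypothetical names `S1` and `dS1`, implements the function and its derivative $S_{1}^{\prime}(x)=6x-6x^{2}$ .

```python
def S1(x):
    """Smooth step function of order 1 on [0, 1]: S1(x) = 3x^2 - 2x^3."""
    return 3.0 * x ** 2 - 2.0 * x ** 3


def dS1(x):
    """Derivative of S1: S1'(x) = 6x - 6x^2 = 6x(1 - x)."""
    return 6.0 * x - 6.0 * x ** 2


# Endpoint conditions that make the interpolation glue smoothly across cells.
endpoint_values = (S1(0.0), S1(1.0), dS1(0.0), dS1(1.0))

# Largest slope on a fine grid of [0, 1]; attained at x = 1/2.
max_slope = max(abs(dS1(k / 1000.0)) for k in range(1001))
```

Since $S_{1}^{\prime}(x)=6x(1-x)$ is maximized at $x=1/2$ , where it equals $3/2$ , any larger constant is a valid, if loose, bound on $\left|S_{1}^{\prime}\right|$ over $[0,1]$ .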

As we have discussed, another issue is that since the interpolation coefficients depend on the value of $\boldsymbol{x}$ it could be that the derivatives of these coefficients overpower the derivatives of the functions that we interpolate. In this case we could be potentially creating fixed points of Gradient Descent/Ascent even in non-panchromatic squares. As we will see later, the magnitude of the derivatives from the interpolation coefficients depends on the differences $x_{1}-y_{1}$ and $x_{2}-y_{2}$ . Hence if we ensure that these differences are small, then the derivatives of the interpolation coefficients remain small and can never overpower the derivatives from the corners of every cell. This is the place in our reduction where we add the constraints $\boldsymbol{A}\cdot(\boldsymbol{x},\boldsymbol{y})\leq\boldsymbol{b}$ that define the domain of the function $f$ as we describe in Section 3.

Now that we have summarized the main ideas of our construction we are ready for the formal definition of $f$ based on the coloring circuit $\mathcal{C}_{l}$ .

where the coefficients $\alpha_{1}(\boldsymbol{x}),\alpha_{2}(\boldsymbol{x})\in[-1,1]$ are defined as follows

where $\delta\triangleq 1/(N-1)=1/(2^{n}-1)$ .

In Figure 6 we present an example of the application of Definition 6.5 to a specific cell with some given coloring on the corners.

An important property of the definition of the function $f_{\mathcal{C}_{l}}$ is that the coefficients used in the definition of $\alpha_{i}$ have the following two properties

Hence the function $f_{\mathcal{C}_{l}}$ inside a cell is a smooth convex combination of the functions on the corners of the cell, as is suggested from Figure 6. Of course there are many ways to define such convex combination but in our case we use the smooth step function $S_{1}$ to ensure the Lipschitz continuous gradient of the overall function $f_{\mathcal{C}_{l}}$ . We prove this formally in the next lemma.

Let $f_{\mathcal{C}_{l}}$ be the function defined based on a coloring circuit $\mathcal{C}_{l}$ , as per Definition 6.5. Then $f_{\mathcal{C}_{l}}$ is continuous and differentiable at any point $(\boldsymbol{x},\boldsymbol{y})\in[0,1]^{4}$ . Moreover, $f_{\mathcal{C}_{l}}$ is $\Theta(1/\delta)$ -Lipschitz and $\Theta(1/\delta^{2})$ -smooth in the whole 4-dimensional hypercube $[0,1]^{4}$ , where $\delta=1/(N-1)=1/(2^{n}-1)$ .

Clearly from Definition 6.5, $f_{\mathcal{C}_{l}}$ is differentiable at any point $(\boldsymbol{x},\boldsymbol{y})\in[0,1]^{4}$ in which $\boldsymbol{x}$ lies in the strict interior of its respective cell. In this case the derivative with respect to $x_{1}$ is

where for $\partial\alpha_{1}(\boldsymbol{x})/\partial x_{1}$ we have that

Now since $\max_{z\in[0,1]}\left|S^{\prime}_{1}(z)\right|\leq 6$ , we can conclude that $\left|\frac{\partial\alpha_{1}(\boldsymbol{x})}{\partial x_{1}}\right|\leq 24/\delta$ . Similarly we can prove that $\left|\frac{\partial\alpha_{2}(\boldsymbol{x})}{\partial x_{1}}\right|\leq 24/\delta$ , which combined with $\left|\alpha_{1}(\boldsymbol{x})\right|\leq 1$ implies $\left|\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})}{\partial x_{1}}\right|\leq O(1/\delta)$ . Using similar reasoning we can prove that $\left|\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})}{\partial x_{2}}\right|\leq O(1/\delta)$ and that $\left|\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})}{\partial y_{i}}\right|\leq 1$ for $i=1,2$ . Hence

The only thing we are missing to prove the Lipschitzness of $f_{\mathcal{C}_{l}}$ is to prove its continuity on the boundaries of the cells of our subdivision. Suppose $\boldsymbol{x}$ lies on the boundary of some cell, e.g. let $\boldsymbol{x}$ lie on edge $(C,D)$ of one cell that is the same as the edge $(A^{\prime},B^{\prime})$ of the cell to the right of that cell. Since $S_{1}(0)=0$ , $S^{\prime}_{1}(0)=0$ and $S^{\prime}_{1}(1)=0$ it holds that $\partial\alpha_{1}(\boldsymbol{x})/\partial x_{1}=0$ and the same for $\alpha_{2}$ . Therefore the value of $\partial f_{\mathcal{C}_{l}}/\partial x_{1}$ remains the same no matter the cell according to which it was calculated. As a result, $f_{\mathcal{C}_{l}}$ is differentiable with respect to $x_{1}$ even if $\boldsymbol{x}$ belongs to the boundary of its cell. Using the exact same reasoning for the rest of the variables, one can show that the function $f_{\mathcal{C}_{l}}$ is differentiable at any point $(\boldsymbol{x},\boldsymbol{y})\in[0,1]^{4}$ and because of the aforementioned bound on the gradient $\nabla f_{\mathcal{C}_{l}}$ we can conclude that $f_{\mathcal{C}_{l}}$ is $O(1/\delta)$ -Lipschitz.

Using very similar calculations, we can compute the closed formulas of the second derivatives of $f_{\mathcal{C}_{l}}$ and using the bounds $\left|f_{\mathcal{C}_{l}}(\cdot)\right|\leq 2$ , $\left|S_{1}(\cdot)\right|\leq 1$ , $\left|S^{\prime}_{1}(\cdot)\right|\leq 6$ , and $\left|S^{\prime\prime}_{1}(\cdot)\right|\leq 6$ , we can prove that each entry of the Hessian $\nabla^{2}f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})$ is bounded by $O(1/\delta^{2})$ and thus

which implies the $\Theta(1/\delta^{2})$ -smoothness of $f_{\mathcal{C}_{l}}$ . ∎

$\boldsymbol{(+)}$ Construction of Instance for Fixed Points of Gradient Descent/Ascent.

Our construction can be described via the following properties.

The payoff function is the real-valued function $f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})$ from the Definition 6.5.

The domain is the polytope $\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ that we described in Section 3. The matrix $\boldsymbol{A}$ and the vector $\boldsymbol{b}$ have constant size and they are computed so that the following inequalities hold

where $\Delta=\delta/12$ and $\delta=1/(N-1)=1/(2^{n}-1)$ .

The parameter $\alpha$ is set to be equal to $\Delta/3$ .

The parameters $G$ and $L$ are set to be equal to the upper bounds on the Lipschitzness and the smoothness of $f_{\mathcal{C}_{l}}$ respectively that we derived in Lemma 6.6. Namely we have that $G=O(1/\delta)=O(2^{n})$ and $L=O(1/\delta^{2})=O(2^{2n})$ .

We prove this last statement in Lemma 6.8, but before that we need the following technical lemma that will be useful to argue about solutions on the boundary of $\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ .

If $x_{i}^{\star}\in(\alpha,1-\alpha)$ and $x_{i}^{\star}\in(y_{i}^{\star}-\Delta+\alpha,y_{i}^{\star}+\Delta-\alpha)$ then $\left|\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{i}}\right|\leq\alpha$ .

If $x^{\star}_{i}\leq\alpha$ or $x^{\star}_{i}\leq y^{\star}_{i}-\Delta+\alpha$ then $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{i}}\geq-\alpha$ .

If $x^{\star}_{i}\geq 1-\alpha$ or $x^{\star}_{i}\geq y^{\star}_{i}+\Delta-\alpha$ then $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{i}}\leq\alpha$ .

The symmetric statements for $y_{i}^{\star}$ hold. For $i\in\{1,2\}$ :

If $y_{i}^{\star}\in(\alpha,1-\alpha)$ and $y_{i}^{\star}\in(x_{i}^{\star}-\Delta+\alpha,x_{i}^{\star}+\Delta-\alpha)$ then $\left|\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{i}}\right|\leq\alpha$ .

If $y^{\star}_{i}\leq\alpha$ or $y^{\star}_{i}\leq x^{\star}_{i}-\Delta+\alpha$ then $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{i}}\leq\alpha$ .

If $y^{\star}_{i}\geq 1-\alpha$ or $y^{\star}_{i}\geq x^{\star}_{i}+\Delta-\alpha$ then $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{i}}\geq-\alpha$ .

For this proof it is convenient to define $\hat{\boldsymbol{x}}=\boldsymbol{x}^{\star}-\nabla_{x}f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ , $K(\boldsymbol{y}^{\star})=\{\boldsymbol{x}\mid(\boldsymbol{x},\boldsymbol{y}^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})\}$ , and $\boldsymbol{z}=\Pi_{K(\boldsymbol{y}^{\star})}\hat{\boldsymbol{x}}$ .

For the second case, we assume for the sake of contradiction that $x^{\star}_{i}\leq\alpha$ and $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{i}}<-\alpha$ . These imply that $\hat{x}_{i}>x_{i}^{\star}+\alpha$ and that $z_{i}=\min(y_{i}^{\star}+\Delta,\hat{x}_{i},1)>\min(\Delta,\hat{x}_{i},1)\geq\min(3\alpha,x_{i}^{\star}+\alpha)$ . As a result, $\left|x_{i}^{\star}-z_{i}\right|=z_{i}-x_{i}^{\star}>\min(3\alpha,x_{i}^{\star}+\alpha)-x_{i}^{\star}$ , and the right-hand side equals $\alpha$ because $x_{i}^{\star}\leq\alpha$ . Hence $\left|x_{i}^{\star}-z_{i}\right|>\alpha$ , which contradicts the assumption that $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ is a solution to the GDAFixedPoint problem. Also, if we assume that $x_{i}^{\star}\leq y_{i}^{\star}-\Delta+\alpha$ , using the same reasoning we get that $z_{i}=\min(\hat{x}_{i},y_{i}^{\star}+\Delta-\alpha,1)$ . From this we can again prove that $\left|x_{i}^{\star}-z_{i}\right|>\alpha$ , which contradicts the fact that $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ is a solution to GDAFixedPoint.

The third case can be proved using the same arguments as the second case. Then using the corresponding arguments we can prove the corresponding statements for the $y$ variables. ∎
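The mechanism behind the second case can be reproduced in a few lines of Python. The sketch below assumes that the constraints defining $\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ amount to $0\leq x_{i},y_{i}\leq 1$ and $\left|x_{i}-y_{i}\right|\leq\Delta$ , which is how the domain is used in the proof above; under this assumption the projection onto $K(\boldsymbol{y}^{\star})$ is a coordinate-wise clipping. The function names are hypothetical, and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction


def instance_parameters(n):
    """Return (delta, Delta, alpha) with delta = 1/(2^n - 1), Delta = delta/12, alpha = Delta/3."""
    delta = Fraction(1, 2 ** n - 1)
    Delta = delta / 12
    alpha = Delta / 3
    return delta, Delta, alpha


def project_onto_K_of_y(x_hat, y, Delta):
    """Project x_hat onto K(y) = {x : 0 <= x_i <= 1, |x_i - y_i| <= Delta} (assumed form)."""
    return [min(max(t, max(0, yi - Delta)), min(1, yi + Delta))
            for t, yi in zip(x_hat, y)]


def x_displacement(x, y, grad_x, Delta):
    """Per-coordinate distance |x_i - z_i| moved by one projected descent step in x."""
    x_hat = [xi - gi for xi, gi in zip(x, grad_x)]
    z = project_onto_K_of_y(x_hat, y, Delta)
    return [abs(xi - zi) for xi, zi in zip(x, z)]
```

For example, with $x_{i}^{\star}=\alpha/2$ , $y_{i}^{\star}=0$ and partial derivative $-2\alpha<-\alpha$ , the step moves $x_{i}$ by $2\alpha>\alpha$ , so such a point cannot be an $\alpha$ -approximate fixed point.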

We are now ready to prove that solutions of GDAFixedPoint can only occur in cells that are either panchromatic or violate the boundary conditions of a proper coloring. For convenience in the rest of this section we define $R(\boldsymbol{x})$ to be the cell of the $2^{n}\times 2^{n}$ grid that contains $\boldsymbol{x}$ .

for $i,j$ such that $x_{1}\in\left[\frac{i}{2^{n}-1},\frac{i+1}{2^{n}-1}\right]\text{ and }x_{2}\in\left[\frac{j}{2^{n}-1},\frac{j+1}{2^{n}-1}\right]$ . If there are multiple pairs $i$ , $j$ that satisfy the above condition, then we choose $R(\boldsymbol{x})$ to be the cell that corresponds to the lexicographically first pair $(i,j)$ satisfying it. We also define the corners $R_{c}(\boldsymbol{x})$ of $R(\boldsymbol{x})$ as

where $R(\boldsymbol{x})=\left[\frac{i}{2^{n}-1},\frac{i+1}{2^{n}-1}\right]\times\left[\frac{j}{2^{n}-1},\frac{j+1}{2^{n}-1}\right]$ .

$x_{1}^{\star}\geq 1/(2^{n}-1)$ and, for all $\boldsymbol{v}\in R_{c}(\boldsymbol{x}^{\star})$ , it holds that $\mathcal{C}_{l}^{1}(\boldsymbol{v})=-1$ .

$x_{1}^{\star}\leq(2^{n}-2)/(2^{n}-1)$ and, for all $\boldsymbol{v}\in R_{c}(\boldsymbol{x}^{\star})$ , it holds that $\mathcal{C}_{l}^{1}(\boldsymbol{v})=+1$ .

$x_{2}^{\star}\geq 1/(2^{n}-1)$ and, for all $\boldsymbol{v}\in R_{c}(\boldsymbol{x}^{\star})$ , it holds that $\mathcal{C}_{l}^{2}(\boldsymbol{v})=-1$ .

$x_{2}^{\star}\leq(2^{n}-2)/(2^{n}-1)$ and, for all $\boldsymbol{v}\in R_{c}(\boldsymbol{x}^{\star})$ , it holds that $\mathcal{C}_{l}^{2}(\boldsymbol{v})=+1$ .

We prove that there is no solution $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ of GDAFixedPoint that satisfies statement 1; the fact that $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ cannot satisfy the other statements follows similarly. It is convenient for us to define $\hat{\boldsymbol{x}}=\boldsymbol{x}^{\star}-\nabla_{x}f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ , $K(\boldsymbol{y}^{\star})=\{\boldsymbol{x}\mid(\boldsymbol{x},\boldsymbol{y}^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})\}$ , $\boldsymbol{z}=\Pi_{K(\boldsymbol{y}^{\star})}\hat{\boldsymbol{x}}$ , and $\hat{\boldsymbol{y}}=\boldsymbol{y}^{\star}+\nabla_{y}f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ , $K(\boldsymbol{x}^{\star})=\{\boldsymbol{y}\mid(\boldsymbol{x}^{\star},\boldsymbol{y})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})\}$ , $\boldsymbol{w}=\Pi_{K(\boldsymbol{x}^{\star})}\hat{\boldsymbol{y}}$ .
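The quantities $\hat{\boldsymbol{x}}$, $\boldsymbol{z}$, $\hat{\boldsymbol{y}}$, $\boldsymbol{w}$ can be illustrated with a minimal sketch. It assumes, as the constraints $\left|x_{i}-y_{i}\right|\leq\Delta$ used in this section suggest, that the polytope consists of the box constraints on $[0,1]$ together with these band constraints, so that $K(\boldsymbol{y}^{\star})$ is a product of intervals and the projection reduces to coordinate-wise clipping. The function names and the use of the $\ell_2$ distance as the fixed-point gap are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def project_box_band(p, other, delta):
    """Project p onto {u in [0,1]^d : |u_i - other_i| <= delta} (assumed polytope slice)."""
    lo = np.maximum(0.0, other - delta)   # lower end of each interval
    hi = np.minimum(1.0, other + delta)   # upper end of each interval
    return np.clip(p, lo, hi)             # a box, so the projection is coordinate-wise

def gda_fixed_point_gap(grad_x, grad_y, x, y, delta):
    """Return (||x - z||_2, ||y - w||_2) for one projected descent/ascent step of size 1."""
    x_hat = x - grad_x(x, y)              # descent step for the min player
    y_hat = y + grad_y(x, y)              # ascent step for the max player
    z = project_box_band(x_hat, y, delta)
    w = project_box_band(y_hat, x, delta)
    return np.linalg.norm(x - z), np.linalg.norm(y - w)
```

Under these assumptions, a pair $(\boldsymbol{x},\boldsymbol{y})$ is an approximate fixed point when both returned gaps are at most $\alpha$.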

For the sake of contradiction we assume that there exists a solution $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ such that $x_{1}^{\star}\geq 1/(2^{n}-1)$ and for all $\boldsymbol{v}\in R_{c}(\boldsymbol{x}^{\star})$ it holds that $\mathcal{C}_{l}^{1}(\boldsymbol{v})=-1$ . Using the fact that the first color of all the corners of $R(\boldsymbol{x}^{\star})$ is $1^{-}$ , we will prove that (1) $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{1}}\geq 1/2$ , and (2) $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{1}}=-1$ .

Let $R(\boldsymbol{x}^{\star})=\left[\frac{i}{2^{n}-1},\frac{i+1}{2^{n}-1}\right]\times\left[\frac{j}{2^{n}-1},\frac{j+1}{2^{n}-1}\right]$ , then since all the corners $\boldsymbol{v}\in R_{c}(\boldsymbol{x}^{\star})$ have $\mathcal{C}_{l}^{1}(\boldsymbol{v})=-1$ , from the Definition 6.5 we have that

where $(x_{1}^{A},x_{2}^{A})=(i/(2^{n}-1),j/(2^{n}-1))$ , $(x_{1}^{B},x_{2}^{B})=(i/(2^{n}-1),(j+1)/(2^{n}-1))$ , $(x_{1}^{C},x_{2}^{C})=((i+1)/(2^{n}-1),(j+1)/(2^{n}-1))$ , and $(x_{1}^{D},x_{2}^{D})=((i+1)/(2^{n}-1),j/(2^{n}-1))$ . If we differentiate this with respect to $y_{1}$ we immediately get that $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{1}}=-1$ . On the other hand if we differentiate with respect to $x_{1}$ we get

where the last inequality follows from the fact that $\left|S^{\prime}_{1}(\cdot)\right|\leq 3/2$ and the fact that, due to the constraints that define the polytope $\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ , it holds that $\left|x_{2}-y_{2}\right|\leq\Delta$ .

Hence we have established that if $x_{1}^{\star}\geq 1/(2^{n}-1)$ and for all $\boldsymbol{v}\in R_{c}(\boldsymbol{x}^{\star})$ it holds that $\mathcal{C}_{l}^{1}(\boldsymbol{v})=-1$ , then (1) $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{1}}\geq 1/2$ , and (2) $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{1}}=-1$ . Now it is easy to see that the only way to satisfy both $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{1}}\geq 1/2$ and $\left|z_{1}-x_{1}^{\star}\right|\leq\alpha$ is that either $x_{1}^{\star}\leq\alpha$ or $x_{1}^{\star}\leq y_{1}^{\star}-\Delta+\alpha$ . The first case is excluded by the assumption in the first statement of our lemma and our choice of $\alpha=\Delta/3=1/(36\cdot(2^{n}-1))$ ; thus it holds that $x_{1}^{\star}\leq y_{1}^{\star}-\Delta+\alpha$ . But then we can use case 3 for the $y$ variables of Lemma 6.7 and we get that $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{1}}\geq-\alpha$ , which cannot be true since we proved that $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{1}}=-1$ . Therefore we have a contradiction and the first statement of the lemma holds. Using the same reasoning we prove the rest of the statements. ∎

The computation presented in (6.4) is the precise point where an attempt to prove the hardness of minimization problems would fail. In particular, if our goal were to construct a hard minimization instance, then the function $f_{\mathcal{C}_{l}}$ would need to have the terms $x_{i}+y_{i}$ instead of $x_{i}-y_{i}$ so that the fixed points of gradient descent coincide with approximate local minima of $f_{\mathcal{C}_{l}}$ . In that case we cannot lower bound the derivative in (6.4) by $1/2$ , because the term $\left|x^{\star}_{2}+y^{\star}_{2}\right|$ will be the dominant one and hence the sign of the derivative can change depending on the value of $\left|x^{\star}_{2}+y^{\star}_{2}\right|$ . For a more intuitive explanation of the reason why we cannot prove hardness of minimization problems we refer to Section 1.2 of the Introduction.

We have now all the ingredients to prove Theorem 6.1.

Hardness of Local Min-Max Equilibrium – High-Dimensions

The $i$ th color of every vertex is either the color $i^{+}$ or the color $i^{-}$ .

All the vertices whose $i$ th coordinate is $0$ , i.e. they are at the lower boundary of the $i$ th direction, should have the $i$ th color equal to $i^{+}$ .

All the vertices whose $i$ th coordinate is $1$ , i.e. they are at the higher boundary of the $i$ th direction, should have the $i$ th color equal to $i^{-}$ .

$\gamma$ -SuccinctBrouwer is $\mathsf{PPAD}$ -complete for any fixed constant $\gamma>0$ .

Consider the function $g(\boldsymbol{x})=M(\boldsymbol{x})-\boldsymbol{x}$ . Since $M$ is $1/\gamma$ -Lipschitz, $g:[0,1]^{d}\to[-1,1]^{d}$ is also $(1+1/\gamma)$ -Lipschitz. Additionally $g$ can be easily computed via a polynomial-time Turing machine $\mathcal{C}_{g}$ that uses $\mathcal{C}_{M}$ as a subroutine. We construct the coloring sequences of every vertex of a $d$ -dimensional grid with $N=\Theta(d/\gamma^{2})$ points in every direction using $g$ . Let $g_{\eta}:[0,1]^{d}\to[-1,1]^{d}$ be the function that the Turing machine $\mathcal{C}_{g}$ evaluates when the requested accuracy is $\eta>0$ . For each vertex $\boldsymbol{v}=(v_{1},\ldots,v_{d})\in\left(\left[N\right]-1\right)^{d}$ of the $d$ -dimensional grid its coloring sequence $\mathcal{C}_{l}(\boldsymbol{v})\in\{-1,1\}^{d}$ is constructed as follows: For each coordinate $j=1,\ldots,d$ ,

Let $\boldsymbol{v}$ be any vertex on the same cubelet as the output vertices $\boldsymbol{v}^{(1)}$ , $\ldots$ , $\boldsymbol{v}^{(d)}$ , $\boldsymbol{u}^{(1)}$ , $\ldots$ , $\boldsymbol{u}^{(d)}$ . From the guarantees on the colors of the vertices $\boldsymbol{v}^{(1)}$ , $\ldots$ , $\boldsymbol{v}^{(d)}$ , $\boldsymbol{u}^{(1)}$ , $\ldots$ , $\boldsymbol{u}^{(d)}$ we have that either $\mathcal{C}_{l}^{j}(\boldsymbol{v})\cdot\mathcal{C}_{l}^{j}(\boldsymbol{v}^{(j)})=-1$ or $\mathcal{C}_{l}^{j}(\boldsymbol{v})\cdot\mathcal{C}_{l}^{j}(\boldsymbol{u}^{(j)})=-1$ . Let $\overline{\boldsymbol{v}}^{(j)}$ be whichever of the vertices $\boldsymbol{v}^{(j)}$ , $\boldsymbol{u}^{(j)}$ has a $j$ th color whose product with $\mathcal{C}_{l}^{j}(\boldsymbol{v})$ is equal to $-1$ . Now let $\eta=\frac{2\sqrt{d}}{\gamma N}$ . If $g_{j}\left(\frac{\boldsymbol{v}}{N-1}\right)\in[-\eta,\eta]$ then the wanted inequality follows. On the other hand, if $g_{j}\left(\frac{\boldsymbol{v}}{N-1}\right)\notin[-\eta,\eta]$ then, using the fact that $\left\|g\left(\frac{\boldsymbol{v}}{N-1}\right)-g_{\eta}\left(\frac{\boldsymbol{v}}{N-1}\right)\right\|_{\infty}\leq\eta$ and that from the definition of the colors we have that either $g_{\eta,j}\left(\frac{\boldsymbol{v}}{N-1}\right)\geq 0$ , $g_{\eta,j}\left(\frac{\overline{\boldsymbol{v}}^{(j)}}{N-1}\right)<0$ or $g_{\eta,j}\left(\frac{\boldsymbol{v}}{N-1}\right)<0$ , $g_{\eta,j}\left(\frac{\overline{\boldsymbol{v}}^{(j)}}{N-1}\right)\geq 0$ , we conclude that $g_{j}\left(\frac{\boldsymbol{v}}{N-1}\right)\geq 0$ , $g_{j}\left(\frac{\overline{\boldsymbol{v}}^{(j)}}{N-1}\right)<0$ or $g_{j}\left(\frac{\boldsymbol{v}}{N-1}\right)<0$ , $g_{j}\left(\frac{\overline{\boldsymbol{v}}^{(j)}}{N-1}\right)\geq 0$ and thus,

where in the second inequality we have used the $(1+1/\gamma)$ -Lipschitzness of $g$ . As a result, the point $\hat{\boldsymbol{v}}=\boldsymbol{v}/(N-1)\in[0,1]^{d}$ satisfies $\left\|M(\hat{\boldsymbol{v}})-\hat{\boldsymbol{v}}\right\|_{2}\leq 2d/(\gamma N)$ and thus, if we pick $N=\Theta(d/\gamma^{2})$ , then any vertex $\boldsymbol{v}$ of the panchromatic cell is a solution for $\gamma$ -SuccinctBrouwer. ∎
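The coloring used in this reduction can be sketched in a few lines. The sketch below assumes the sign rule suggested by the proof (color $+1$ in coordinate $j$ when the $j$ th displacement is non-negative, $-1$ otherwise), with the boundary conditions of the Bi-Sperner coloring enforced explicitly. For simplicity it evaluates $g$ exactly instead of through the approximate $g_{\eta}$ , and the name `coloring_from_map` is hypothetical.

```python
import numpy as np

def coloring_from_map(M, v, N):
    """Coloring sequence in {-1,+1}^d of grid vertex v, from the displacement g(x) = M(x) - x.

    Assumed rule: color j is +1 iff g_j >= 0, overridden on the boundary so that
    v_j = 0 gets +1 and v_j = N-1 gets -1.
    """
    v = np.asarray(v)
    x = v / (N - 1)                      # grid vertex mapped into [0,1]^d
    g = M(x) - x                         # displacement (exact, not the eta-approximation)
    colors = np.where(g >= 0, 1, -1)
    colors[v == 0] = 1                   # lower boundary of direction j
    colors[v == N - 1] = -1              # upper boundary of direction j
    return colors
```

For instance, with the constant map $M(\boldsymbol{x})=(1/2,\ldots,1/2)$ the colors flip exactly around the unique fixed point at the center of the hypercube.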

7.2 From High Dimensional Bi-Sperner to Fixed Points of Gradient Descent/Ascent

where $\boldsymbol{c}\in\left(\left[N-1\right]-1\right)^{d}$ is such that $\boldsymbol{x}\in\left[\frac{c_{1}}{N-1},\frac{c_{1}+1}{N-1}\right]\times\cdots\times\left[\frac{c_{d}}{N-1},\frac{c_{d}+1}{N-1}\right]$ . If there are multiple corners $\boldsymbol{c}$ that satisfy this condition, then we choose $R(\boldsymbol{x})$ to be the cell that corresponds to the $\boldsymbol{c}$ that is lexicographically first among those that satisfy the condition. We also define $R_{c}(\boldsymbol{x})$ to be the set of vertices that are corners of the cubelet $R(\boldsymbol{x})$ , namely

where $\boldsymbol{c}\in\left(\left[N-1\right]-1\right)^{d}$ is such that $R(\boldsymbol{x})=\left[\frac{c_{1}}{N-1},\frac{c_{1}+1}{N-1}\right]\times\cdots\times\left[\frac{c_{d}}{N-1},\frac{c_{d}+1}{N-1}\right]$ . Every $\boldsymbol{y}$ that belongs to the cubelet $R(\boldsymbol{x})$ can be written as a convex combination of the vectors $\boldsymbol{v}/(N-1)$ where $\boldsymbol{v}\in R_{c}(\boldsymbol{x})$ . The value of the function $f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})$ that we construct in this section is determined by the coloring sequences $\mathcal{C}_{l}(\boldsymbol{v})$ of the vertices $\boldsymbol{v}\in R_{c}(\boldsymbol{x})$ . One of the main challenges that we face, though, is that the size of $R_{c}(\boldsymbol{x})$ is $2^{d}$ . Hence, if we want to be able to compute the value of $f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})$ efficiently, then we have to find a consistent rule to pick a subset of the vertices of $R_{c}(\boldsymbol{x})$ whose coloring sequences we need in order to define the function value $f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})$ . Although there are traditional ways to overcome this difficulty using the canonical simplicization of the cubelet $R(\boldsymbol{x})$ , these techniques lead only to functions that are continuous and Lipschitz; they are not enough to guarantee continuity of the gradient, and hence the resulting functions are not smooth.
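The cubelet $R(\boldsymbol{x})$ and its corner set $R_{c}(\boldsymbol{x})$ can be computed as in the following sketch, which also makes the $2^{d}$ blow-up of the corner set concrete. The function names are illustrative; the tie-breaking implements the lexicographically-first rule by taking the smallest admissible index in every coordinate.

```python
import itertools
import math

def cubelet_corner(x, N):
    """Down-left corner c of R(x) on the grid with N points per direction.

    When x_j lies on a grid line, both adjacent cells contain it; choosing the
    smaller index in each coordinate gives the lexicographically first corner.
    """
    return tuple(max(math.ceil(xj * (N - 1)) - 1, 0) for xj in x)

def cubelet_corners(c):
    """All 2^d vertices of the cubelet whose down-left corner is c."""
    return list(itertools.product(*[(cj, cj + 1) for cj in c]))
```

Enumerating `cubelet_corners` takes time exponential in $d$ , which is why an efficient construction may only inspect a small, consistently chosen subset of these vertices.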

The problem of finding a computationally efficient way to define a continuous function as an interpolation of some fixed function in the corners of a cubelet, so that the resulting function is both Lipschitz and smooth, is surprisingly difficult to solve. For this reason we introduce in this section the smooth and efficient interpolation coefficients (SEIC) that, as we will see in Section 7.2.2, are the main technical tool to implement such an interpolation. Our novel interpolation coefficients are of independent interest and we believe that they will serve as a main technical tool for proving other hardness results in continuous optimization in the future.

In this section we only give a high-level description of the smooth and efficient interpolation coefficients via their properties that we use in Section 7.2.2 to define the function $f_{\mathcal{C}_{l}}$ . The actual construction of the coefficients is very challenging and technical and hence we postpone a detailed exposition to Section 8.

For all vertices $\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}$ , the coefficient $\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})$ is a twice-differentiable function and satisfies

$\left|\frac{\partial\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})}{\partial x_{i}}\right|\leq\Theta(d^{12}/\delta)$ .

For all $\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}$ , it holds that $\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})\geq 0$ and $\sum_{\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}}\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})=\sum_{\boldsymbol{v}\in R_{c}(\boldsymbol{x})}\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})=1$ .

For all $\boldsymbol{x}\in[0,1]^{d}$ , if $x_{i}\leq 1/(N-1)$ for some $i\in[d]$ then there exists $\boldsymbol{v}\in R_{+}(\boldsymbol{x})$ such that $v_{i}=0$ . Respectively, if $x_{i}\geq 1-1/(N-1)$ then there exists $\boldsymbol{v}\in R_{+}(\boldsymbol{x})$ such that $v_{i}=1$ .

An intuitive explanation of the properties of the SEIC coefficients is the following

The coefficients $\mathsf{P}_{\boldsymbol{v}}$ are both Lipschitz and smooth with Lipschitzness and smoothness parameters that depend polynomially on $d$ and $N=1/\delta+1$ .

The coefficients $\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})$ define a convex combination of the vertices $R_{c}(\boldsymbol{x})$ .

For every $\boldsymbol{x}\in[0,1]^{d}$ , out of the $N^{d}$ coefficients $\mathsf{P}_{\boldsymbol{v}}$ only $d+1$ have non-zero value, or non-zero gradient or non-zero Hessian when evaluated at the point $\boldsymbol{x}$ . Moreover, given $\boldsymbol{x}\in[0,1]^{d}$ we can identify these $d+1$ coefficients efficiently.

For every $\boldsymbol{x}\in[0,1]^{d}$ that is in a cubelet that touches the boundary there is at least one of the vertices in $R_{+}(\boldsymbol{x})$ that is on the boundary of the continuous hypercube $[0,1]^{d}$ .

In Section 10, in the proof of Theorem 10.4, we present a simple application of the existence of the SEIC coefficients for proving black box oracle lower bounds for the global minimization problem.

Based on the existence of these coefficients we are now ready to define the function $f_{\mathcal{C}_{l}}$ which is the main construction of our reduction.

7.2.2 Definition of a Lipschitz and Smooth Function Based on a BiSperner Instance

In this section our goal is to formally define the function $f_{\mathcal{C}_{l}}$ and prove its Lipschitzness and smoothness properties in Lemma 7.5.

where $\alpha_{j}(\boldsymbol{x})=-\sum_{\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}}\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})\cdot\mathcal{C}^{j}_{l}(\boldsymbol{v})$ , and $\mathsf{P}_{\boldsymbol{v}}$ are the coefficients defined in Definition 7.3.

We first prove that the function $f_{\mathcal{C}_{l}}$ constructed in Definition 7.4 is $G$ -Lipschitz and $L$ -smooth for some appropriately selected parameters $G$ , $L$ that are polynomial in the dimension $d$ and in the discretization parameter $N$ . We use this property to establish that $f_{\mathcal{C}_{l}}$ is a valid input to the promise problem GDAFixedPoint.

The function $f_{\mathcal{C}_{l}}$ of Definition 7.4 is $O(d^{15}/\delta)$ -Lipschitz and $O(d^{27}/\delta^{2})$ -smooth.

If we take the derivative with respect to $x_{i}$ and $y_{i}$ and using property (B) of the coefficients $\mathsf{P}_{\boldsymbol{v}}$ we get the following relations,

Now by property (C) of Definition 7.3 there are at most $d+1$ vertices $\boldsymbol{v}$ of $R_{c}(\boldsymbol{x})$ with the property $\nabla\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})\neq 0$ . Then if we also use property (A) we get $\left|\frac{\partial\alpha_{j}(\boldsymbol{x})}{\partial x_{i}}\right|\leq\Theta(d^{13}/\delta)$ and using property (B) we get $\left|\alpha_{i}(\boldsymbol{x})\right|\leq 1$ . Thus $\left|\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})}{\partial x_{i}}\right|\leq\Theta(d^{14}/\delta)$ and $\left|\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})}{\partial y_{i}}\right|\leq\Theta(d)$ . Therefore we can conclude that $\left\|\nabla f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})\right\|_{2}\leq\Theta(d^{15}/\delta)$ , which proves that the function $f_{\mathcal{C}_{l}}$ is Lipschitz continuous with Lipschitz constant $\Theta(d^{15}/\delta)$ .
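The chain of bounds in this paragraph can be laid out step by step. The sketch below assumes that the function of Definition 7.4 has the form $f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})=\sum_{j=1}^{d}(x_{j}-y_{j})\cdot\alpha_{j}(\boldsymbol{x})$ , which is consistent with the expression for $\alpha_{j}$ and with the derivatives used later in this section; it also uses $\left|\mathcal{C}_{l}^{j}(\boldsymbol{v})\right|=1$ and $\left|x_{j}-y_{j}\right|\leq 1$ .

```latex
\begin{align*}
\left|\frac{\partial \alpha_{j}(\boldsymbol{x})}{\partial x_{i}}\right|
  &\leq \sum_{\boldsymbol{v}\,:\,\nabla\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})\neq 0}
        \left|\frac{\partial \mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})}{\partial x_{i}}\right|
   \leq (d+1)\cdot\Theta(d^{12}/\delta)
   = \Theta(d^{13}/\delta), \\
\left|\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})}{\partial x_{i}}\right|
  &= \left|\alpha_{i}(\boldsymbol{x})
     + \sum_{j=1}^{d}(x_{j}-y_{j})\,\frac{\partial \alpha_{j}(\boldsymbol{x})}{\partial x_{i}}\right|
   \leq 1 + d\cdot\Theta(d^{13}/\delta)
   = \Theta(d^{14}/\delta), \\
\left\|\nabla f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})\right\|_{2}
  &\leq \sqrt{2d}\cdot\max_{i}\max\left\{
        \left|\frac{\partial f_{\mathcal{C}_{l}}}{\partial x_{i}}\right|,
        \left|\frac{\partial f_{\mathcal{C}_{l}}}{\partial y_{i}}\right|\right\}
   \leq \Theta(d^{15}/\delta).
\end{align*}
```

The last step is loose by a factor of roughly $\sqrt{d}$ , which is harmless since only a polynomial bound is needed.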

To prove the smoothness of $f_{\mathcal{C}_{l}}$ , we use the property (B) of the Definition 7.3 and we have

7.2.3 Description and Correctness of the Reduction – Proof of Theorem 4.4

$\boldsymbol{(}\star)$ Construction of Instance for Fixed Points of Gradient Descent/Ascent.

The payoff function is the real-valued function $f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})$ from the Definition 7.4.

The domain is the polytope $\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ that we described in Section 3. The matrix $\boldsymbol{A}$ and the vector $\boldsymbol{b}$ are computed so that the following inequalities hold

The parameter $\alpha$ is set to be equal to $\Delta/3$ .

The parameters $G$ and $L$ are set to be equal to the upper bounds on the Lipschitzness and the smoothness of $f_{\mathcal{C}_{l}}$ respectively that we derived in Lemma 7.5. Namely we have that $G=O(d^{15}/\delta)$ and $L=O(d^{27}/\delta^{2})$ .

The first thing to observe is that the reduction described above is polynomial-time. For this, observe that all of $\alpha$ , $G$ , $L$ , $\boldsymbol{A}$ , and $\boldsymbol{b}$ have representations that are polynomial in $d$ even if we use unary instead of binary representation. So the only thing that remains is the existence of a Turing machine $\mathcal{C}_{f_{\mathcal{C}_{l}}}$ that computes the function and the gradient value of $f_{\mathcal{C}_{l}}$ in time polynomial in the size of $\mathcal{C}_{l}$ and the requested accuracy. To prove this we need a detailed description of the SEIC coefficients and for this reason we postpone the proof to Appendix D. Here we formally state the result that we prove in Appendix D, which together with the discussion above proves that our reduction is indeed polynomial-time.

Moreover the running time of $\mathcal{C}_{f_{\mathcal{C}_{l}}}$ is polynomial in the binary representation of $\boldsymbol{x}$ , $\boldsymbol{y}$ , and $\log(1/\varepsilon)$ .

If $x^{\star}_{i}\leq\alpha$ or $x^{\star}_{i}\leq y^{\star}_{i}-\Delta+\alpha$ then $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{i}}\geq-\alpha$ .

The symmetric statements for $y_{i}^{\star}$ hold.

If $y^{\star}_{i}\leq\alpha$ or $y^{\star}_{i}\leq x^{\star}_{i}-\Delta+\alpha$ then $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{i}}\leq\alpha$ .

The proof of this lemma is identical to the proof of Lemma 6.7 and for this reason we skip the details of the proof here. ∎

$x_{i}^{\star}\geq 1/(N-1)$ and for any $\boldsymbol{v}\in R_{+}(\boldsymbol{x}^{\star})$ , it holds that $\mathcal{C}_{l}^{i}(\boldsymbol{v})=-1$ .

$x_{i}^{\star}\leq 1-1/(N-1)$ and for any $\boldsymbol{v}\in R_{+}(\boldsymbol{x}^{\star})$ , it holds that $\mathcal{C}_{l}^{i}(\boldsymbol{v})=+1$ .

We prove that there is no solution $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ of GDAFixedPoint that satisfies statement 1; the fact that $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ cannot satisfy statement 2 follows similarly. It is convenient for us to define $\hat{\boldsymbol{x}}=\boldsymbol{x}^{\star}-\nabla_{x}f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ , $K(\boldsymbol{y}^{\star})=\{\boldsymbol{x}\mid(\boldsymbol{x},\boldsymbol{y}^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})\}$ , $\boldsymbol{z}=\Pi_{K(\boldsymbol{y}^{\star})}\hat{\boldsymbol{x}}$ , and $\hat{\boldsymbol{y}}=\boldsymbol{y}^{\star}+\nabla_{y}f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ , $K(\boldsymbol{x}^{\star})=\{\boldsymbol{y}\mid(\boldsymbol{x}^{\star},\boldsymbol{y})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})\}$ , $\boldsymbol{w}=\Pi_{K(\boldsymbol{x}^{\star})}\hat{\boldsymbol{y}}$ .

For the sake of contradiction we assume that there exists a solution $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ such that $x_{i}^{\star}\geq 1/(N-1)$ and for any $\boldsymbol{v}\in R_{+}(\boldsymbol{x}^{\star})$ it holds that $\mathcal{C}_{l}^{i}(\boldsymbol{v})=-1$ . Using this fact, we will prove that (1) $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{i}}\geq 1/2$ , and (2) $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{i}}=-1$ .

Let $R(\boldsymbol{x}^{\star})=\left[\frac{c_{1}}{N-1},\frac{c_{1}+1}{N-1}\right]\times\cdots\times\left[\frac{c_{d}}{N-1},\frac{c_{d}+1}{N-1}\right]$ , then since all the corners $\boldsymbol{v}\in R_{+}(\boldsymbol{x}^{\star})$ have $\mathcal{C}_{l}^{i}(\boldsymbol{v})=-1$ , from the Definition 7.4 we have that

If we differentiate this with respect to $y_{i}$ we immediately get that $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{i}}=-1$ . On the other hand if we differentiate with respect to $x_{i}$ we get

where the above follows from the following facts: (1) $\left|\frac{\partial\alpha_{j}(\boldsymbol{x})}{\partial x_{l}}\right|\leq\Theta(d^{13}/\delta)$ , which is proved in the proof of Lemma 7.5, (2) $\left|x_{j}-y_{j}\right|\leq\Delta$ , and (3) the definition of $\Delta$ . Now it is easy to see that the only way to satisfy both $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{i}}\geq 1/2$ and $\left|z_{i}-x_{i}^{\star}\right|\leq\alpha$ is that either $x_{i}^{\star}\leq\alpha$ or $x_{i}^{\star}\leq y_{i}^{\star}-\Delta+\alpha$ . The first case is excluded by the assumption of the first statement of our lemma and our choice of $\alpha=\Delta/3<1/(N-1)$ ; thus it holds that $x_{i}^{\star}\leq y_{i}^{\star}-\Delta+\alpha$ . But then we can use case 3 for the $y$ variables of Lemma 6.7 and we get that $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{i}}\geq-\alpha$ , which cannot be true since we proved that $\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial y_{i}}=-1$ . Therefore we have a contradiction and the first statement of the lemma holds. Using the same reasoning we prove the second statement too. ∎
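For orientation, the three facts combine as in the following sketch. It assumes the form $f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})=\sum_{j}(x_{j}-y_{j})\cdot\alpha_{j}(\boldsymbol{x})$ for Definition 7.4, the value $\Delta=t\cdot\delta/d^{14}$ for a sufficiently small constant $t>0$ , and that $\alpha_{i}(\boldsymbol{x}^{\star})=1$ because every vertex with a non-zero coefficient has $i$ th color $-1$ and the coefficients sum to one. It is a reconstruction under these assumptions, not the paper's displayed computation.

```latex
\begin{align*}
\frac{\partial f_{\mathcal{C}_{l}}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})}{\partial x_{i}}
  &= \alpha_{i}(\boldsymbol{x}^{\star})
     + \sum_{j=1}^{d}(x_{j}^{\star}-y_{j}^{\star})\,
       \frac{\partial \alpha_{j}(\boldsymbol{x}^{\star})}{\partial x_{i}} \\
  &\geq 1 - d\cdot\Delta\cdot\Theta(d^{13}/\delta) \\
  &= 1 - t\cdot\Theta(1) \;\geq\; \frac{1}{2}.
\end{align*}
```

The same form gives $\frac{\partial f_{\mathcal{C}_{l}}}{\partial y_{i}}=-\alpha_{i}(\boldsymbol{x}^{\star})=-1$ , matching claim (2).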

Let $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ be a solution to the GDAFixedPoint problem with input a Turing machine that represents the function $f_{\mathcal{C}_{l}}$ , $\alpha=\Delta/3$ , where $\Delta=t\cdot\delta/d^{14}$ , $G=\Theta(d^{15}/\delta)$ , $L=\Theta(d^{27}/\delta^{2})$ , and $\boldsymbol{A}$ , $\boldsymbol{b}$ as described in ( $\star$ ).

For each coordinate $i$ , there exist the following three mutually exclusive cases,

$\boldsymbol{\frac{1}{N-1}\leq x_{i}^{\star}\leq 1-\frac{1}{N-1}}$ : Since $\left|R_{+}(\boldsymbol{x}^{\star})\right|\geq 1$ , it follows directly from Lemma 7.8 that there exists $\boldsymbol{v}\in R_{+}(\boldsymbol{x}^{\star})$ such that $\mathcal{C}_{l}^{i}(\boldsymbol{v})=-1$ and $\boldsymbol{v}^{\prime}\in R_{+}(\boldsymbol{x}^{\star})$ such that $\mathcal{C}_{l}^{i}(\boldsymbol{v}^{\prime})=+1$ .

Smooth and Efficient Interpolation Coefficients

In this section we describe the construction of the smooth and efficient interpolation coefficients (SEIC) that we introduced in Section 7.2.1. After the description of the construction we present the statements of the lemmas that prove the properties (A) - (D) of their Definition 7.3, and we refer to Appendix C for the proofs. We first recall the definition of the SEIC coefficients.

Our main goal in this section is to prove the following theorem.

One important component of the construction of the SEIC coefficients is the smooth-step functions which we introduce in Section 8.1. These functions also provide a toy example of smooth and efficient interpolation coefficients in $1$ dimension. Then in Section 8.2 we present the construction of the SEIC coefficients in multiple dimensions and in Section 8.3 we state the main lemmas that lead to the proof of Theorem 8.1.

For every $x\leq 0$ it holds that $g(x)=0$ , for every $x\geq 1$ it holds that $g(x)=1$ and for every $x\in[0,1]$ it holds that $g(x)\in[0,1]$ .

For some $k$ it holds that $g$ is $k$ times continuously differentiable and its $k$ th derivative satisfies $g^{(k)}(0)=0$ and $g^{(k)}(1)=0$ .

The largest number $k$ such that the smoothness property from above holds characterizes the order of smoothness of the smooth step function $g$ .

In Section 6 we have already defined and used the smooth step function of order $1$ . For the construction of the SEIC coefficients we use the smooth step function of order $2$ and the smooth step function of order $\infty$ defined as follows.

We note that we use the notation $S$ instead of $S_{2}$ for the smooth step function of order $2$ for simplicity of the exposition of the paper.

We present a plot of these step functions in Figure 7, and we summarize some of their properties in Lemma 8.3. A more detailed lemma with additional properties of $S_{\infty}$ that are useful for the proof of Theorem 8.1 is presented in Lemma C.5 in Appendix C.

The calculations for $S_{\infty}$ are more complicated. We have that

We set $h(x)\triangleq\left(\exp\left(\frac{\ln(2)}{x}\right)+\exp\left(\frac{\ln(2)}{1-x}\right)\right)\left(1-x\right)^{2}x^{2}$ for $x\in(0,1)$ and doing simple calculations we get that for $x\leq 1/2$ it holds that $h(x)\geq\frac{1}{4}\exp\left(\frac{\ln(2)}{x}\right)x^{2}$ . But the latter can be easily lower bounded by $1/4$ . Applying the same argument for $x\geq 1/2$ we get that in general $h(x)\geq 1/4$ . Also it is not hard to see that for $x\leq 1/2$ it holds that $\exp\left(\frac{\ln(2)}{x(1-x)}\right)\leq 4\exp\left(\frac{\ln(2)}{x}\right)$ , whereas for $x\geq 1/2$ it holds that $\exp\left(\frac{\ln(2)}{x(1-x)}\right)\leq 4\exp\left(\frac{\ln(2)}{1-x}\right)$ . Combining all these we can conclude that $\left|S^{\prime}_{\infty}(x)\right|\leq 16$ . Using a similar argument we can prove that $\left|S^{\prime\prime}_{\infty}(x)\right|\leq 32$ . For all the derivatives of $S_{\infty}$ we can inductively prove that

where $h_{0}(1)=0$ and all the functions $h_{i}(x)$ are bounded. Then the fact that all the derivatives of $S_{\infty}$ vanish at $0$ and at $1$ follows by a simple inductive argument. ∎
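The properties established above can be checked numerically. The sketch below assumes the standard closed forms $S(x)=6x^{5}-15x^{4}+10x^{3}$ for the order-$2$ step and $S_{\infty}(x)=2^{-1/x}/\left(2^{-1/x}+2^{-1/(1-x)}\right)$ for the order-$\infty$ step on $(0,1)$ ; the latter is consistent with the exponentials appearing in the derivative computation above, but both formulas should be read as assumptions since the defining display is not reproduced here.

```python
def smooth_step_2(x):
    """Assumed order-2 smooth step: 6x^5 - 15x^4 + 10x^3, clamped outside [0,1]."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    return x**3 * (10 - 15 * x + 6 * x * x)

def smooth_step_inf(x):
    """Assumed order-infinity smooth step, clamped outside (0,1)."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    a = 2.0 ** (-1.0 / x)            # vanishes faster than any polynomial as x -> 0
    b = 2.0 ** (-1.0 / (1.0 - x))    # vanishes faster than any polynomial as x -> 1
    return a / (a + b)

def max_abs_slope(f, steps=10_000):
    """Largest finite-difference slope of f over a uniform grid on [0,1]."""
    h = 1.0 / steps
    return max(abs(f((k + 1) * h) - f(k * h)) / h for k in range(steps))
```

Under these assumptions `max_abs_slope(smooth_step_inf)` stays well below the bound of $16$ proved above, and the symmetry $1-S_{\infty}(x)=S_{\infty}(1-x)$ holds up to floating-point error.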

Using the smooth step functions that we described above we can get a construction of SEIC coefficients for the single dimensional case. Unfortunately the extension to multiple dimensions is substantially harder and involves new ideas that we explore later in this section. For the single dimensional problem of this example we have the interval $[0,1]$ divided with $N$ discrete points and our goal is to design $N$ functions $\mathsf{P}_{1},\ldots,\mathsf{P}_{N}$ that satisfy the properties (A) - (D) of Definition 7.3. A simple construction of such functions is the following

Based on Lemma 8.3 it is not hard then to see that $\mathsf{P}_{i}$ is twice differentiable and it has bounded first and second derivatives, hence it satisfies property (A) of Definition 8. Using the fact that $1-S_{\infty}(x)=S_{\infty}(1-x)$ we can also prove property (B). Finally properties (C) and (D) can be proved via the definition of the coefficient $\mathsf{P}_{i}$ from above. In Figure 7 we can see the plot of $\mathsf{P}_{3}$ for $N=5$ . We leave the exact proofs of this example as an exercise for the reader.
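One way to realize such a one-dimensional family is sketched below. It is a hypothetical construction in the spirit of the example, not necessarily the paper's formula: the coefficient attached to grid point $i/(N-1)$ is $S_{\infty}\left(1-\left|(N-1)x-i\right|\right)$ , with $0$-based index $i$ and the assumed closed form of $S_{\infty}$ repeated so the block is self-contained.

```python
def smooth_step_inf(x):
    """Assumed order-infinity smooth step, clamped outside (0,1)."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    a = 2.0 ** (-1.0 / x)
    b = 2.0 ** (-1.0 / (1.0 - x))
    return a / (a + b)

def coeff_1d(i, x, N):
    """Hypothetical 1-D coefficient for grid point i/(N-1), i = 0, ..., N-1."""
    t = (N - 1) * x - i                  # signed distance to the grid point, in cell widths
    return smooth_step_inf(1.0 - abs(t)) # bump equal to 1 at the grid point, 0 beyond its neighbours

def coeffs_1d(x, N):
    """All N coefficients at x; at most two of them are non-zero."""
    return [coeff_1d(i, x, N) for i in range(N)]
```

Inside a cell the two active coefficients are $S_{\infty}(1-t)$ and $S_{\infty}(t)$ , so they sum to one by the symmetry $1-S_{\infty}(x)=S_{\infty}(1-x)$ , and at most $d+1=2$ coefficients are non-zero at any point.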

8.2 Construction of SEIC Coefficients in High-Dimensions

The goal of this section is to present the construction of the family $\mathcal{I}_{d,N}$ of smooth and efficient interpolation coefficients for every number of dimensions $d$ and any discretization parameter $N$ . Before diving into the details of our construction, observe that even the 2-dimensional case with $N=2$ is not trivial. In particular, the first attempt would be to define the SEIC coefficients based on the simple split of the square $[0,1]^{2}$ into two triangles divided by the diagonal of $[0,1]^{2}$ . Then, using any soft-max function that is twice continuously differentiable, we define a convex combination on every triangle. Unfortunately this approach cannot work since the resulting coefficients have discontinuous gradients along the diagonal of $[0,1]^{2}$ . We leave the precise calculations of this example as an exercise to the reader.

We start with some definitions about the orientation and the representation of the cubelets of the grid $\left(\left[N\right]-1\right)^{d}$ . Then we proceed with the definition of the $Q_{\boldsymbol{v}}$ functions in Definition 8.7. Finally using $Q_{\boldsymbol{v}}$ we can proceed with the construction of the SEIC coefficients.

Each cubelet $\left[\frac{c_{1}}{N-1},\frac{c_{1}+1}{N-1}\right]\times\cdots\times\left[\frac{c_{d}}{N-1},\frac{c_{d}+1}{N-1}\right]$ , where $\boldsymbol{c}\in\left(\left[N-1\right]-1\right)^{d}$ admits a source vertex $\boldsymbol{s}^{\boldsymbol{c}}=(s_{1},\ldots,s_{d})\in\left(\left[N\right]-1\right)^{d}$ and a target vertex $\boldsymbol{t}^{\boldsymbol{c}}=(t_{1},\ldots,t_{d})\in\left(\left[N\right]-1\right)^{d}$ defined as follows,

Notice that the source $\boldsymbol{s}^{\boldsymbol{c}}$ and the target $\boldsymbol{t}^{\boldsymbol{c}}$ are vertices of the cubelet whose down-left corner is $\boldsymbol{c}$ .

(Canonical Representation) Let $\boldsymbol{x}\in[0,1]^{d}$ and $R(\boldsymbol{x})=\left[\frac{c_{1}}{N-1},\frac{c_{1}+1}{N-1}\right]\times\cdots\times\left[\frac{c_{d}}{N-1},\frac{c_{d}+1}{N-1}\right]$ where $\boldsymbol{c}\in\left(\left[N-1\right]-1\right)^{d}$ . The canonical representation of $\boldsymbol{x}$ under the cubelet with down-left corner $\boldsymbol{c}$ , denoted by $\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=(p_{1},\ldots,p_{d})$ , is defined as follows,

where $\boldsymbol{t}^{\boldsymbol{c}}=(t_{1},\ldots,t_{d})$ and $\boldsymbol{s}^{\boldsymbol{c}}=(s_{1},\ldots,s_{d})$ are respectively the target and the source of $R(\boldsymbol{x})$ .

Let $\boldsymbol{x}\in[0,1]^{d}$ lie in the cubelet

with corners $R_{c}(\boldsymbol{x})=\{c_{1},c_{1}+1\}\times\cdots\times\{c_{d},c_{d}+1\}$ , where $\boldsymbol{c}\in\left(\left[N-1\right]-1\right)^{d}$ . Let also $\boldsymbol{s}^{\boldsymbol{c}}=(s_{1},\ldots,s_{d})$ be the source vertex of $R(\boldsymbol{x})$ and $\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=(p_{1},\ldots,p_{d})$ be the canonical representation of $\boldsymbol{x}$ . Then for each vertex $\boldsymbol{v}\in R_{c}(\boldsymbol{x})$ we define the following partition of the set of coordinates $[d]$ ,

where $S_{\infty}(x)$ and $S(x)$ are the smooth step functions defined in Definition 8.2.

To provide a better understanding of the Definitions 8.5, 8.6, and 8.7 we present the following $3$ -dimensional example.

We consider a case where $d=3$ and $N=4$ , so that each cubelet has side length $1/(N-1)=1/3$ . Let $\boldsymbol{x}=(1.3/3,2.5/3,0.3/3)$ lie in the cubelet $R(\boldsymbol{x})=\left[\frac{1}{3},\frac{2}{3}\right]\times\left[\frac{2}{3},1\right]\times\left[0,\frac{1}{3}\right]$ , and let $\boldsymbol{c}=(1,2,0)$ . Then the source of $R(\boldsymbol{x})$ is $\boldsymbol{s}^{\boldsymbol{c}}=(2,2,0)$ and the target is $\boldsymbol{t}^{\boldsymbol{c}}=(1,3,1)$ (Definition 8.5). The canonical representation of $\boldsymbol{x}$ is $\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=(0.7,0.5,0.3)$ (Definition 8.6). The only vertices with non-zero coefficients $Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})$ are those belonging to the set $R_{+}(\boldsymbol{x})=\{(1,3,1),(1,3,0),(1,2,0),(2,2,0)\}$ and again by Definition 8.7 we have that

$Q_{(1,3,1)}(\boldsymbol{x})=S_{\infty}(S(0.3))\cdot S_{\infty}(S(0.5))\cdot S_{\infty}(S(0.7))$ ,

$Q_{(1,3,0)}(\boldsymbol{x})=S_{\infty}(S(0.5)-S(0.3))\cdot S_{\infty}(S(0.7)-S(0.3))$ ,

$Q_{(1,2,0)}(\boldsymbol{x})=S_{\infty}(S(0.7)-S(0.3))\cdot S_{\infty}(S(0.7)-S(0.5))$ ,

$Q_{(2,2,0)}(\boldsymbol{x})=S_{\infty}(1-S(0.3))\cdot S_{\infty}(1-S(0.5))\cdot S_{\infty}(1-S(0.7))$ .
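The numbers in this example can be reproduced with a few lines of code. The following is a minimal sketch under three assumptions that are read off from the example rather than quoted from Definitions 8.5 to 8.7: (i) an odd $c_{j}$ gives $s_{j}=c_{j}+1$ and $t_{j}=c_{j}$ , while an even $c_{j}$ gives $s_{j}=c_{j}$ and $t_{j}=c_{j}+1$ ; (ii) $p_{j}$ is the position of $x_{j}$ on the edge from $s_{j}$ to $t_{j}$ , rescaled to $[0,1]$ ; (iii) when the $p_{j}$ are distinct and strictly between $0$ and $1$ , the vertices with non-zero coefficient are obtained by starting at the source and switching coordinates to their target values in decreasing order of $p_{j}$ . All function names are hypothetical.

```python
from fractions import Fraction

def source_target(c):
    """Source and target of the cubelet with down-left corner c (assumed parity rule)."""
    s = tuple(cj + 1 if cj % 2 else cj for cj in c)
    t = tuple(cj if cj % 2 else cj + 1 for cj in c)
    return s, t

def canonical_representation(x, c, side_count):
    """Assumed form p_j = (side_count * x_j - s_j) / (t_j - s_j).

    side_count is the number of cubelets per axis (3 in the example).
    """
    s, t = source_target(c)
    return tuple((side_count * xj - sj) / (tj - sj) for xj, sj, tj in zip(x, s, t))

def support_chain(p, s, t):
    """Walk from s to t, switching coordinates in decreasing order of p_j."""
    order = sorted(range(len(p)), key=lambda j: p[j], reverse=True)
    v = list(s)
    chain = [tuple(v)]
    for j in order:
        v[j] = t[j]
        chain.append(tuple(v))
    return chain

# The point x = (1.3/3, 2.5/3, 0.3/3) of the example, in exact arithmetic.
x = (Fraction(13, 30), Fraction(25, 30), Fraction(3, 30))
c = (1, 2, 0)
s, t = source_target(c)
p = canonical_representation(x, c, side_count=3)
chain = support_chain(p, s, t)
```

Under these assumptions the chain has $d+1=4$ vertices, which is the size of $R_{+}(\boldsymbol{x})$ in the example.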

Now based on the Definitions 8.5, 8.6, and 8.7 we are ready to present the construction of the smooth and efficient interpolation coefficients.

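Let $\boldsymbol{x}\in[0,1]^{d}$ be a point lying in the cubelet $R(\boldsymbol{x})=\left[\frac{c_{1}}{N-1},\frac{c_{1}+1}{N-1}\right]\times\cdots\times\left[\frac{c_{d}}{N-1},\frac{c_{d}+1}{N-1}\right]$ . Then for each vertex $\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}$ the coefficient $\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})$ is defined as follows,

$\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})=\begin{cases}Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})\big/\sum_{\boldsymbol{v}^{\prime}\in R_{c}(\boldsymbol{x})}Q_{\boldsymbol{v}^{\prime}}^{\boldsymbol{c}}(\boldsymbol{x})&\text{if }\boldsymbol{v}\in R_{c}(\boldsymbol{x}),\\ 0&\text{if }\boldsymbol{v}\notin R_{c}(\boldsymbol{x}),\end{cases}$

where the functions $Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})\geq 0$ are defined in Definition 8.7 for any $\boldsymbol{v}\in R_{c}(\boldsymbol{x})$ .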
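To make the normalisation concrete, the sketch below computes the coefficients for the $3$ -dimensional example. It rests on assumptions that are not taken from this section: the step functions of Definition 8.2 are replaced by stand-ins (a cubic smoothstep for $S$ and a function that is $0$ for $x\leq 0$ , $1$ for $x\geq 1$ and smooth in between for $S_{\infty}$ ), and the product form of $Q_{\boldsymbol{v}}^{\boldsymbol{c}}$ is inferred from the four expressions listed in the example. The names `S`, `S_inf`, `Q` and `P_coefficients` are hypothetical.

```python
from itertools import product

def S(x):
    """Assumed stand-in for S of Definition 8.2: cubic smoothstep on [0, 1]."""
    x = min(max(x, 0.0), 1.0)
    return 3 * x**2 - 2 * x**3

def S_inf(x):
    """Assumed stand-in for S_infinity: 0 for x <= 0, 1 for x >= 1, smooth in between."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    a, b = 2.0 ** (-1.0 / x), 2.0 ** (-1.0 / (1.0 - x))
    return a / (a + b)

def Q(v, s, p):
    """Unnormalised coefficient of corner v, following the pattern of the example."""
    A = [j for j in range(len(v)) if v[j] == s[j]]  # coordinates still at the source value
    B = [j for j in range(len(v)) if v[j] != s[j]]  # coordinates switched to the target value
    if not B:
        terms = [S_inf(1.0 - S(p[j])) for j in A]
    elif not A:
        terms = [S_inf(S(p[j])) for j in B]
    else:
        terms = [S_inf(S(p[i]) - S(p[j])) for i in B for j in A]
    out = 1.0
    for term in terms:
        out *= term
    return out

def P_coefficients(s, t, p):
    """Normalise Q over the 2^d corners of the cubelet; other grid vertices get 0."""
    corners = list(product(*[(sj, tj) for sj, tj in zip(s, t)]))
    q = {v: Q(v, s, p) for v in corners}
    total = sum(q.values())
    return {v: qv / total for v, qv in q.items()}

s, t, p = (2, 2, 0), (1, 3, 1), (0.7, 0.5, 0.3)
P = P_coefficients(s, t, p)
support = {v for v, value in P.items() if value > 0}
```

With these stand-ins, `support` contains the four vertices of $R_{+}(\boldsymbol{x})$ from the example and the coefficients sum to $1$ up to rounding.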

3 Sketch of the Proof of Theorem 8.1

First it is necessary to argue that $\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})$ is a continuous function, since it could be the case that $Q^{\boldsymbol{c}}_{\boldsymbol{v}}(\boldsymbol{x})/(\sum_{\boldsymbol{v}\in R_{\boldsymbol{c}}(\boldsymbol{x})}Q^{\boldsymbol{c}}_{\boldsymbol{v}}(\boldsymbol{x}))\neq Q^{\boldsymbol{c}^{\prime}}_{\boldsymbol{v}}(\boldsymbol{x})/(\sum_{\boldsymbol{v}\in V_{\boldsymbol{c}^{\prime}}}Q^{\boldsymbol{c}^{\prime}}_{\boldsymbol{v}}(\boldsymbol{x}))$ for some point $\boldsymbol{x}$ that lies on the boundary of two adjacent cubelets with down-left corners $\boldsymbol{c}$ and $\boldsymbol{c}^{\prime}$ respectively. We specifically design the coefficients $Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})$ so that this does not occur, and this is the main reason that the definition of the function $Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})$ is slightly complicated. For this reason we prove the following lemma.

For any vertex $\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}$ , $\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})$ is a continuous and twice differentiable function and for any $\boldsymbol{v}\notin R_{c}(\boldsymbol{x})$ it holds that $\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})=\nabla\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})=\nabla^{2}\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})=0$ . Moreover, for every $\boldsymbol{x}\in[0,1]^{d}$ the set $R_{+}(\boldsymbol{x})$ of vertices $\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}$ such that $\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})>0$ satisfies $\left|R_{+}(\boldsymbol{x})\right|=d+1$ .

Based on Lemma 8.10 and the expression of $\mathsf{P}_{\boldsymbol{v}}$ we can prove that the $\mathsf{P}_{\boldsymbol{v}}$ coefficients defined in Definition 8.9 satisfy properties (B) and (C) of Definition 7.3. To prove properties (A) and (D) we also need the following two lemmas.

For any vertex $\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}$ , it holds that

Let $\boldsymbol{x}\in[0,1]^{d}$ be a point and $R_{+}(\boldsymbol{x})$ the set of vertices with $\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})>0$ ; then we have that

If $0\leq x_{i}<1/(N-1)$ then there always exists a vertex $\boldsymbol{v}\in R_{+}(\boldsymbol{x})$ such that $v_{i}=0$ .

If $1-1/(N-1)<x_{i}\leq 1$ then there always exists a vertex $\boldsymbol{v}\in R_{+}(\boldsymbol{x})$ such that $v_{i}=1$ .

The proofs of Lemmas 8.10, 8.11, and 8.12 can be found in Appendix C. Based on Lemmas 8.10, 8.11, and 8.12 we are now ready to prove Theorem 8.1.

The fact that the coefficients $\mathsf{P}_{\boldsymbol{v}}$ satisfy the property (A) follows directly from Lemma 8.11. Property (B) follows directly from the definition of $\mathsf{P}_{\boldsymbol{v}}$ in Definition 8.9 and the simple fact that $Q^{\boldsymbol{c}}_{\boldsymbol{v}}(\boldsymbol{x})\geq 0$ . Property (C) follows from the second part of Lemma 8.10. Finally Property (D) follows directly from Lemma 8.12. ∎

Unconditional Black-Box Lower Bounds

In this section our goal is to prove Theorem 4.5 based on Theorem 4.4, which we proved in Section 7, and the known black-box lower bounds for $\mathsf{PPAD}$ by [HPV89]. In this section we assume that all real number operations are performed with infinite precision.

Assume that there exists an algorithm $A$ that has black-box oracle access to the value of a function $M:[0,1]^{d}\to[0,1]^{d}$ and outputs $\boldsymbol{w}^{\star}\in[0,1]^{d}$ . There exists a universal constant $c>0$ such that if $M$ is $2$ -Lipschitz and $\left\|M(\boldsymbol{w}^{\star})-\boldsymbol{w}^{\star}\right\|_{2}\leq 1/(2c)$ , then $A$ has to make at least $2^{d}$ different oracle calls to the function value of $M$ .

It is easy to observe that the reduction in the proof of Theorem 7.2 is a black-box reduction and that every evaluation of the constructed circuit $\mathcal{C}_{l}$ requires only one evaluation of the input function $M$ . Therefore the proof of Theorem 7.2 together with Theorem 9.1 implies the following corollary.

Based on Corollary 9.2 and the reduction that we presented in Section 7, we are now ready to prove Theorem 4.5.

Hardness in the Global Regime

In this section our goal is to prove that the complexity of the problems LocalMinMax and LocalMin is significantly increased when $\varepsilon$ , $\delta$ lie outside the local regime, in the global regime. We start with the following theorem, which shows the $\mathsf{FNP}$ -hardness of LocalMinMax.

LocalMinMax is $\mathsf{FNP}$ -hard even when $\varepsilon$ is set to any value $\leq 1/384$ , $\delta$ is set to any value $\geq 1$ , and even when $\mathcal{P}(\boldsymbol{A},\boldsymbol{b})=[0,1]^{d}$ , $G=\sqrt{d}$ , $L=d$ , and $B=d$ .

We now present a reduction from 3-SAT(3) to LocalMinMax that proves Theorem 10.1. First we remind the definition of the problem 3-SAT(3).

where each $w_{j},z_{j}$ are additional variables associated with clause $\phi_{j}$ . The player that wants to minimize $f$ controls $\boldsymbol{x},\boldsymbol{w}$ vectors while the maximizing player controls the $\boldsymbol{z}$ variables.

The formula $\phi$ admits a satisfying assignment if and only if there exists an $(\varepsilon,\delta)$ -local min-max equilibrium $\left((\boldsymbol{x},\boldsymbol{w}),\boldsymbol{z}\right)\in[0,1]^{n+2m}$ of $f$ with $\varepsilon\leq 1/384$ and $\delta=1$ .

Let us assume that there exists a satisfying assignment. Given such a satisfying assignment we will construct $\left((\boldsymbol{x}^{\star},\boldsymbol{w}^{\star}),\boldsymbol{z}^{\star}\right)$ that is a $(0,1)$ -local min-max equilibrium of $f$ . We set each variable $x_{i}^{\star}\triangleq 1$ if and only if the respective boolean variable is true. Observe that this implies that $P_{j}(\boldsymbol{x}^{\star})=0$ for all $j$ , meaning that the strategy profile $\left((\boldsymbol{x}^{\star},\boldsymbol{w}^{\star}),\boldsymbol{z}^{\star}\right)$ is a global Nash equilibrium no matter the values of $\boldsymbol{w}^{\star},\boldsymbol{z}^{\star}$ .

In the opposite direction, let us assume that there exists an $(\varepsilon,\delta)$ -local min-max equilibrium of $f$ with $\varepsilon=1/384$ and $\delta=1$ . In this case we first prove that $P_{j}(\boldsymbol{x}^{\star})\leq 16\cdot\varepsilon=1/24$ for each $j=1,\ldots,m$ .

Fix any clause $j$ . In case $\left|w_{j}^{\star}-z_{j}^{\star}\right|\geq 1/4$ then the minimizing player can further decrease $f$ by at least $P_{j}(\boldsymbol{x}^{\star})/16$ by setting $w_{j}^{\star}\triangleq z_{j}^{\star}$ . On the other hand in case $\left|w_{j}^{\star}-z_{j}^{\star}\right|\leq 1/4$ then the maximizing player can increase $f$ by at least $P_{j}(\boldsymbol{x}^{\star})/16$ by moving $z_{j}^{\star}$ either to $0$ or to $1$ . We remark that both of the options are feasible since $\delta=1$ .
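The case analysis can be checked numerically. The sketch below assumes that the contribution of clause $j$ to $f$ has the quadratic form $P_{j}(\boldsymbol{x})\cdot(w_{j}-z_{j})^{2}$ ; this form is an assumption, chosen because it is consistent with the $1/16$ thresholds in the argument. The helper name `best_deviation_gain` is hypothetical.

```python
def best_deviation_gain(P_j, w_j, z_j):
    """Largest change one player can force in the term P_j * (w_j - z_j)**2.

    Assumes the quadratic per-clause term and delta = 1, so each player may
    move its own coordinate anywhere in [0, 1].
    """
    current = P_j * (w_j - z_j) ** 2
    # Minimising player: setting w_j = z_j drives the term to zero.
    decrease = current
    # Maximising player: moving z_j to the endpoint of [0, 1] farther from w_j.
    increase = P_j * max(w_j, 1.0 - w_j) ** 2 - current
    return max(decrease, increase)

# Scan a grid of (w_j, z_j) pairs with P_j = 1 and record the smallest gain.
P_j = 1.0
grid = [i / 100 for i in range(101)]
worst = min(best_deviation_gain(P_j, w, z) for w in grid for z in grid)
```

On this grid `worst` stays above $1/16$ , so whenever $P_{j}(\boldsymbol{x}^{\star})/16>\varepsilon$ one of the two players has a profitable deviation, in line with the argument above.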

Now consider the probability distribution over the boolean assignments where each boolean variable $x_{i}$ is independently selected to be true with probability $x_{i}^{\star}$ . Then,

Since each $\phi_{j}$ shares variables with at most $6$ other clauses, the event of $\phi_{j}$ not being satisfied is dependent with at most $6$ other events. By the Lovász Local Lemma [EL73], we get that the probability none of these events occur is positive. As a result, there exists a satisfying assignment. ∎
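The application of the Local Lemma reduces to one inequality. A minimal sketch, assuming the symmetric form of the lemma (each bad event has probability at most $p$ , depends on at most $D$ other events, and $e\cdot p\cdot(D+1)\leq 1$ ) and the per-clause bound $p=1/24$ ; the function name is hypothetical.

```python
import math

def symmetric_lll_holds(p, max_dependencies):
    """Symmetric Lovasz Local Lemma condition: e * p * (D + 1) <= 1."""
    return math.e * p * (max_dependencies + 1) <= 1.0

# Each clause is falsified with probability at most 1/24 and shares
# variables with at most 6 other clauses.
holds = symmetric_lll_holds(p=1 / 24, max_dependencies=6)
slack = 1.0 - math.e * (1 / 24) * 7
```

With $p=1/24$ and $D=6$ the left-hand side is $7e/24$ , which is below $1$ , so the condition holds with some slack.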

Next we show the $\mathsf{FNP}$ -hardness of LocalMin. As we can see there is a gap between Theorem 10.1 and Theorem 10.3. In particular, the $\mathsf{FNP}$ -hardness result of LocalMinMax is stronger since it holds for any $\delta\geq 1$ whereas for the $\mathsf{FNP}$ -hardness of LocalMin our proof needs $\delta\geq\sqrt{d}$ when the rest of the parameters remain the same.

LocalMin is $\mathsf{FNP}$ -hard even when $\varepsilon$ is set to any value $\leq 1/24$ , $\delta$ is set to any value $\geq\sqrt{d}$ , and even when $\mathcal{P}(\boldsymbol{A},\boldsymbol{b})=[0,1]^{d}$ , $G=\sqrt{d}$ , $L=d$ , and $B=d$ .

We follow the same proof as in the proof of Theorem 10.1 but we instead set $f(\boldsymbol{x})=\sum_{j=1}^{m}P_{j}(\boldsymbol{x})$ where $\boldsymbol{x}\in[0,1]^{n}$ (the number of variables is $d:=n$ ). We then get that if the initial formula is satisfiable then there exists $\boldsymbol{x}\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ , such that $f(\boldsymbol{x})=0$ . On the other hand if there exists $\boldsymbol{x}\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ such that $f(\boldsymbol{x})\leq 1/24$ then the formula is satisfiable due to the Lovász Local Lemma [EL73]. Therefore the $\mathsf{FNP}$ -hardness follows again from the constructive proof of the Lovász Local Lemma [Mos09, MT10]. Setting $\delta\geq\sqrt{n}$ , which equals the diameter of the feasibility set, implies that in case there exists $\hat{\boldsymbol{x}}$ with $f(\hat{\boldsymbol{x}})=0$ then all $(\varepsilon,\delta)$ -LocalMin $\boldsymbol{x}^{\ast}$ must admit value $f(\boldsymbol{x}^{\ast})\leq 1/24$ and thus a satisfying assignment is implied. ∎

Next we prove a black-box lower bound for minimization in the global regime. The proof of the following lower bound illustrates the strength of the SEIC coefficients presented in Section 8. The next theorem can also be used to prove the $\mathsf{FNP}$ -hardness of LocalMin in the global regime, but with worse Lipschitzness and smoothness parameters than the ones in Theorem 10.3, and for this reason we present both of them.

In the worst case, $\Omega\left(2^{d}/d\right)$ value/gradient black-box queries are needed to determine an $(\varepsilon,\delta)$ -LocalMin for functions $f(\boldsymbol{x}):[0,1]^{d}\to[0,1]$ with $G=\Theta(d^{15})$ , $L=\Theta(d^{22})$ , $\varepsilon<1$ , $\delta=\sqrt{d}$ .

The proof is based on the fact that given just black-box access to a boolean formula $\phi:\{0,1\}^{d}\mapsto\{0,1\}$ , at least $\Omega(2^{d})$ queries are needed in order to determine whether $\phi$ admits a satisfying assignment. The term black-box access refers to the fact that the clauses of the formula are not given and the only way to determine whether a specific boolean assignment is satisfying is by querying the specific binary string.

Given such a black-box oracle for a formula $\phi$ on $d$ variables, we construct the function $f_{\phi}(\boldsymbol{x}):[0,1]^{d}\mapsto[0,1]$ as follows:

for each corner $\boldsymbol{v}\in V$ of the $[0,1]^{d}$ hypercube, i.e. $\boldsymbol{v}\in\{0,1\}^{d}$ , we set $f_{\phi}(\boldsymbol{v}):=1-\phi(\boldsymbol{v})$ .

for the rest of the points $\boldsymbol{x}\in[0,1]^{d}\setminus V$ , $f_{\phi}(\boldsymbol{x}):=\sum_{\boldsymbol{v}\in V}P_{\boldsymbol{v}}(\boldsymbol{x})\cdot f_{\phi}(\boldsymbol{v})$ where $P_{\boldsymbol{v}}$ are the coefficients of Definition 8.9.

Recall that by Lemma 8.11, we get that $\left\|\nabla f_{\phi}(\boldsymbol{x})\right\|_{2}\leq\Theta(d^{12})$ and $\left\|\nabla^{2}f_{\phi}(\boldsymbol{x})\right\|_{2}\leq\Theta(d^{25})$ , meaning that $f_{\phi}(\cdot)$ is $\Theta(d^{12})$ -Lipschitz and $\Theta(d^{25})$ -smooth. Moreover by Lemma 8.10, for any $\boldsymbol{x}\in[0,1]^{d}$ the set $V(\boldsymbol{x})=\{\boldsymbol{v}\in V:P_{\boldsymbol{v}}(\boldsymbol{x})\neq 0\}$ has cardinality at most $d+1$ , while at the same time $\sum_{\boldsymbol{v}\in V}P_{\boldsymbol{v}}(\boldsymbol{x})=1$ .

In case $\phi$ is not satisfiable then $f_{\phi}(\boldsymbol{x})=1$ for all $\boldsymbol{x}\in[0,1]^{d}$ since $f_{\phi}(\boldsymbol{v})=1$ for all $\boldsymbol{v}\in V$ . In case there exists a satisfying assignment $\boldsymbol{v}^{\ast}$ then $f_{\phi}(\boldsymbol{v}^{\ast})=0$ . Since $\delta\geq\sqrt{d}$ , which is the diameter of $[0,1]^{d}$ , any $(\varepsilon,\delta)$ -LocalMin $\boldsymbol{x}^{\ast}$ must have $f_{\phi}(\boldsymbol{x}^{\ast})\leq\varepsilon<1$ . Since $f_{\phi}(\boldsymbol{x}^{\ast})\triangleq\sum_{\boldsymbol{v}\in V(\boldsymbol{x}^{\ast})}P_{\boldsymbol{v}}(\boldsymbol{x}^{\ast})\cdot f_{\phi}(\boldsymbol{v})<1$ , there exists at least one vertex $\hat{\boldsymbol{v}}\in V(\boldsymbol{x}^{\ast})$ with $f_{\phi}(\hat{\boldsymbol{v}})=0$ , meaning that $\phi(\hat{\boldsymbol{v}})=1$ . As a result, given an $(\varepsilon,\delta)$ -LocalMin $\boldsymbol{x}^{\ast}$ with $f_{\phi}(\boldsymbol{x}^{\ast})<1$ , we can find a satisfying $\hat{\boldsymbol{v}}$ by querying $\phi(\boldsymbol{v})$ for each vertex $\boldsymbol{v}\in V(\boldsymbol{x}^{\ast})$ . Since $\left|V(\boldsymbol{x}^{\ast})\right|\leq d+1$ , this will take at most $d+1$ additional queries.

Up next, we argue that in case an $(\varepsilon,\delta)$ -LocalMin could be determined with fewer than $O(2^{d}/d)$ value/gradient queries, then determining whether $\phi$ admits a satisfying assignment could be done with fewer than $O(2^{d})$ queries on $\phi$ (the latter is obviously impossible). Notice that for any value/gradient query, both $f_{\phi}(\boldsymbol{x})$ and $\nabla f_{\phi}(\boldsymbol{x})$ can be computed by querying the value $f_{\phi}(\boldsymbol{v})$ of the vertices $\boldsymbol{v}\in V(\boldsymbol{x})$ . Since $\left|V(\boldsymbol{x})\right|\leq d+1$ , any value/gradient query of $f_{\phi}$ can be simulated by $d+1$ queries on $\phi$ . ∎
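The simulation of a value query by at most $d+1$ queries on $\phi$ can be sketched as follows. The sketch makes several assumptions: the grid is the hypercube itself (one cubelet, source $(0,\ldots,0)$ , target $(1,\ldots,1)$ , so that $\boldsymbol{p}=\boldsymbol{x}$ ); the step functions `S` and `S_inf` are stand-ins for those of Definition 8.2; the product form of `Q` is inferred from the example of Section 8; and every vertex with non-zero coefficient lies on the chain obtained by switching coordinates from $0$ to $1$ in decreasing order of $x_{j}$ . All names are hypothetical, and only the value (not the gradient) is simulated.

```python
def S(x):
    """Assumed stand-in for S of Definition 8.2: cubic smoothstep on [0, 1]."""
    x = min(max(x, 0.0), 1.0)
    return 3 * x**2 - 2 * x**3

def S_inf(x):
    """Assumed stand-in for S_infinity: 0 for x <= 0, 1 for x >= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    a, b = 2.0 ** (-1.0 / x), 2.0 ** (-1.0 / (1.0 - x))
    return a / (a + b)

def Q(v, p):
    """Unnormalised coefficient of corner v of [0,1]^d (source = all zeros)."""
    A = [j for j, vj in enumerate(v) if vj == 0]
    B = [j for j, vj in enumerate(v) if vj == 1]
    if not B:
        terms = [S_inf(1.0 - S(p[j])) for j in A]
    elif not A:
        terms = [S_inf(S(p[j])) for j in B]
    else:
        terms = [S_inf(S(p[i]) - S(p[j])) for i in B for j in A]
    out = 1.0
    for term in terms:
        out *= term
    return out

class CountingOracle:
    """Wraps a boolean formula phi and counts how often it is queried."""
    def __init__(self, phi):
        self.phi, self.calls = phi, 0
    def __call__(self, v):
        self.calls += 1
        return self.phi(v)

def value_query(x, oracle):
    """Evaluate f_phi(x) using d + 1 oracle calls, one per chain vertex."""
    d = len(x)
    order = sorted(range(d), key=lambda j: x[j], reverse=True)
    v = [0] * d
    chain = [tuple(v)]
    for j in order:
        v[j] = 1
        chain.append(tuple(v))
    q = [Q(u, x) for u in chain]
    total = sum(q)
    return sum(qu / total * (1 - oracle(u)) for u, qu in zip(chain, q))

unsat = CountingOracle(lambda v: 0)          # phi with no satisfying assignment
value_unsat = value_query((0.2, 0.9, 0.4), unsat)
```

For an unsatisfiable $\phi$ the returned value is $1$ up to rounding, and `unsat.calls` equals $d+1=4$ after this single value query.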

Acknowledgements

This work was supported by NSF Awards IIS-1741137, CCF-1617730 and CCF-1901292, by a Simons Investigator Award, by the DOE PhILMs project (No. DE-AC05-76RL01830), and by the DARPA award HR00111990021. M.Z. was also supported by Google Ph.D. Fellowship. S.S. was supported by NRF 2018 Fellowship NRF-NRFF2018-07.

References

Appendix A Proof of Theorem 4.1

We first remind the definition of the 3-SAT(3) problem that we will use for our reduction.

It is well known that 3-SAT(3) is $\mathsf{FNP}$ -complete, for details see $\S 9.2$ of [Pap94a]. To prove Theorem 4.1, we reduce 3-SAT(3) to $\varepsilon$ -StationaryPoint.

There exists a satisfying assignment for the clauses $\phi_{1},\ldots,\phi_{m}$ if and only if the constructed StationaryPoint instance with $\varepsilon=1/24$ admits a solution $(\boldsymbol{x}^{\star},\boldsymbol{w}^{\star})\in[0,1]^{n+m}$ such that $\left\|\nabla f(\boldsymbol{x}^{\star},\boldsymbol{w}^{\star})\right\|_{2}<1/24$ .

By the definition of StationaryPoint, in case there exists a pair of points $(\hat{\boldsymbol{x}},\hat{\boldsymbol{w}})\in[0,1]^{n+m}$ with $\left\|\nabla f(\hat{\boldsymbol{x}},\hat{\boldsymbol{w}})\right\|_{2}<\varepsilon/2=1/48$ , then a pair of points $(\boldsymbol{x}^{\star},\boldsymbol{w}^{\star})$ with $\left\|\nabla f(\boldsymbol{x}^{\star},\boldsymbol{w}^{\star})\right\|_{2}<\varepsilon=1/24$ must be returned. In case $\left\|\nabla f(\boldsymbol{x},\boldsymbol{w})\right\|_{2}>\varepsilon=1/24$ for all $(\boldsymbol{x},\boldsymbol{w})\in[0,1]^{n+m}$ , the null symbol $\bot$ is returned.

Let us assume that there exists a satisfying assignment of $\phi$ . Consider the solution $(\hat{\boldsymbol{x}},\hat{\boldsymbol{w}})$ constructed as follows: each variable $\hat{x}_{i}$ is set to $1$ iff the respective boolean variable is true and $\hat{w}_{j}=0$ for all $j=1,\ldots,m$ . Since the assignment satisfies the CNF-formula $\phi$ , there exists at least one true literal in each clause $\phi_{j}$ which means that $P_{j}(\hat{\boldsymbol{x}})=0$ for all $j=1,\ldots,m$ . As a result $\frac{\partial f(\hat{\boldsymbol{x}},\hat{\boldsymbol{w}})}{\partial w_{j}}=P_{j}(\hat{\boldsymbol{x}})=0$ for all $j=1,\ldots,m$ . At the same time, $\frac{\partial f(\hat{\boldsymbol{x}},\hat{\boldsymbol{w}})}{\partial x_{i}}=0$ since $\hat{w}_{j}=0$ for all $j=1,\ldots,m$ . Overall we have that $\left\|\nabla f(\hat{\boldsymbol{x}},\hat{\boldsymbol{w}})\right\|_{2}=0<1/48=\varepsilon/2$ . As a result, the constructed StationaryPoint instance must return a solution $(\boldsymbol{x}^{\star},\boldsymbol{w}^{\star})$ with $\left\|\nabla f(\boldsymbol{x}^{\star},\boldsymbol{w}^{\star})\right\|_{2}<\frac{1}{24}=\varepsilon$ .
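The vanishing of the gradient at $(\hat{\boldsymbol{x}},\hat{\boldsymbol{w}})$ can be verified on a toy instance. The sketch below assumes that the objective is $f(\boldsymbol{x},\boldsymbol{w})=\sum_{j}w_{j}\cdot P_{j}(\boldsymbol{x})$ (consistent with $\partial f/\partial w_{j}=P_{j}$ above) and that $P_{j}(\boldsymbol{x})$ is the product over the literals of clause $\phi_{j}$ of $(1-x_{i})$ for a positive literal and $x_{i}$ for a negated one. Both forms are assumptions, as are the function names and the example formula; clauses are encoded as lists of signed $1$ -based variable indices, each variable appearing at most once per clause.

```python
def P(clause, x):
    """Probability that `clause` is falsified when variable i is true w.p. x[i-1]."""
    out = 1.0
    for lit in clause:
        xi = x[abs(lit) - 1]
        out *= (1.0 - xi) if lit > 0 else xi
    return out

def grad_f(clauses, x, w):
    """Gradient of the assumed objective f(x, w) = sum_j w_j * P_j(x)."""
    grad_x = [0.0] * len(x)
    for wj, clause in zip(w, clauses):
        for lit in clause:
            rest = [other for other in clause if other != lit]
            # d/dx_i of (1 - x_i) is -1 and of x_i is +1; the other factors remain.
            sign = -1.0 if lit > 0 else 1.0
            grad_x[abs(lit) - 1] += wj * sign * P(rest, x)
    grad_w = [P(clause, x) for clause in clauses]
    return grad_x, grad_w

# (x1 or not x2 or x3) and (not x1 or x2) and (x2 or x3)
clauses = [[1, -2, 3], [-1, 2], [2, 3]]
x_hat = [1.0, 1.0, 0.0]        # a satisfying assignment: x1 = x2 = true, x3 = false
w_hat = [0.0, 0.0, 0.0]
grad_x, grad_w = grad_f(clauses, x_hat, w_hat)
```

Every entry of `grad_x` and `grad_w` is zero here: the $w$ -partials vanish because each clause contains a true literal, and the $x$ -partials vanish because $\hat{\boldsymbol{w}}=0$ .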

In the opposite direction, the existence of a pair of points $(\boldsymbol{x}^{\star},\boldsymbol{w}^{\star})$ with $\left\|\nabla f(\boldsymbol{x}^{\star},\boldsymbol{w}^{\star})\right\|_{2}<1/24$ implies $P_{j}(\boldsymbol{x}^{\star})<1/24$ for all $j=1,\ldots,m$ . Consider the probability distribution over the boolean assignments in which each boolean variable $x_{i}$ is independently selected to be true with probability $x_{i}^{\star}$ . Then,

Since $\phi_{j}$ shares variables with at most $6$ other clauses, the bad event of $\phi_{j}$ not being satisfied is dependent with at most $6$ other bad events. By Lovász Local Lemma [EL73], we get that the probability none of the events occurs is positive. As a result, there exists a satisfying assignment. ∎

Using Lemma A.1 we can conclude that $\phi$ is satisfiable if and only if $f$ has a $1/24$ -approximate stationary point. What is left to prove the $\mathsf{FNP}$ -hardness is to show how we can find a satisfying assignment of $\phi$ given an approximate stationary point of $f$ . This can be done using the celebrated results that provide constructive proofs of the Lovász Local Lemma [Mos09, MT10]. Finally, we remind that the constructed function $f$ is $\Theta\left(\sqrt{d}\right)$ -Lipschitz and $\Theta\left(d\right)$ -smooth, where $d$ is the number of variables that is equal to $n+m$ .

Appendix B Missing Proofs from Section 5

In this section we give proofs for the statements presented in Section 5. These statements establish the totality and inclusion to $\mathsf{PPAD}$ of LR-LocalMinMax and GDAFixedPoint.

We start with establishing claim “1.” in the statement of the theorem. It will be clear that our proof will provide a polynomial-time reduction from LR-LocalMinMax to GDAFixedPoint. Suppose that $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ is an $\alpha$ -approximate fixed point of $F_{GDA}$ , where $\alpha$ is the function of $\delta$ , $G$ and $L$ specified in the theorem statement. To simplify our proof, we abuse notation and define $f(\boldsymbol{x})\triangleq f(\boldsymbol{x},\boldsymbol{y}^{\star})$ , $\nabla f(\boldsymbol{x})\triangleq\nabla_{x}f(\boldsymbol{x},\boldsymbol{y}^{\star})$ , $K\triangleq\{\boldsymbol{x}\mid(\boldsymbol{x},\boldsymbol{y}^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})\}$ and $\hat{\boldsymbol{x}}\triangleq\Pi_{K}(\boldsymbol{x}^{\star}-\nabla f(\boldsymbol{x}^{\star}))$ . Because $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ is an $\alpha$ -approximate fixed point of $F_{GDA}$ , it follows that $\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}\right\|_{2}<\alpha$ .

$\langle\nabla f(\boldsymbol{x}^{\star}),\boldsymbol{x}^{\star}-\boldsymbol{x}\rangle<(G+\delta+\alpha)\cdot\alpha,\text{ for all }\boldsymbol{x}\in K\cap B_{d_{1}}(\delta;\boldsymbol{x}^{\star})$ .

Using the fact that $\hat{\boldsymbol{x}}=\Pi_{K}(\boldsymbol{x}^{\star}-\nabla f(\boldsymbol{x}^{\star}))$ and that $K$ is a convex set we can apply Theorem 1.5.5 (b) of [FP07] to get that

Next, we do some simple algebra to get that, for all $\boldsymbol{x}\in K\cap B_{d_{1}}(\delta;\boldsymbol{x}^{\star})$ ,

where the second to last inequality follows from Cauchy–Schwarz inequality and the triangle inequality, and the last inequality follows from the triangle inequality and the following facts: (1) $\left\|\boldsymbol{x}^{\star}-\hat{\boldsymbol{x}}\right\|_{2}<\alpha$ , (2) $\boldsymbol{x}\in B_{d_{1}}(\delta;\boldsymbol{x}^{\star})$ , and (3) $\left\|\nabla f(\boldsymbol{x},\boldsymbol{y})\right\|_{2}\leq G$ for all $(\boldsymbol{x},\boldsymbol{y})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ . ∎

For all $\boldsymbol{x}\in K\cap B_{d_{1}}(\delta;\boldsymbol{x}^{\star})$ , from the $L$ -smoothness of $f$ we have that

$f(\boldsymbol{x}^{\star})\leq f(\boldsymbol{x})$ : In this case we stop, remembering that

$f(\boldsymbol{x}^{\star})>f(\boldsymbol{x})$ : In this case, we consider two further sub-cases:

$\langle\nabla f(\boldsymbol{x}^{*}),\boldsymbol{x}-\boldsymbol{x}^{\star}\rangle\geq 0$ : in this sub-case, Eq (B.2) gives

where for the last inequality we used that $\boldsymbol{x}\in B_{d_{1}}(\delta;\boldsymbol{x}^{\star})$ , and that $\delta<\sqrt{2\varepsilon/L}$ .

$\langle\nabla f(\boldsymbol{x}^{*}),\boldsymbol{x}-\boldsymbol{x}^{\star}\rangle<0$ : in this sub-case, Eq (B.2) gives

where the second inequality follows from the fact that $\boldsymbol{x}\in B_{d_{1}}(\delta;\boldsymbol{x}^{\star})$ , the third inequality follows from Claim B.1, and the last inequality follows from the constraints $\delta<\sqrt{2\varepsilon/L}$ and $\alpha\leq\frac{\sqrt{(G+\delta)^{2}+4(\varepsilon-\frac{L}{2}\delta^{2})}-(G+\delta)}{2}$ .

In all cases, we get from (B.3), (B.4) and (B.5) that $f(\boldsymbol{x}^{\star})<f(\boldsymbol{x})+\varepsilon$ , for all $x\in K\cap B_{d_{1}}(\delta;\boldsymbol{x}^{\star})$ . Thus, lifting our abuse of notation, we get that $f(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})<f(\boldsymbol{x},\boldsymbol{y}^{\star})+\varepsilon$ , for all $\boldsymbol{x}\in\{\boldsymbol{x}\mid\boldsymbol{x}\in B_{d_{1}}(\delta;\boldsymbol{x}^{\star})\text{ and }(\boldsymbol{x},\boldsymbol{y}^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})\}$ . Using an identical argument we can also show that $f(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})>f(\boldsymbol{x}^{\star},\boldsymbol{y})-\varepsilon$ for all $\boldsymbol{y}\in\{\boldsymbol{y}\mid\boldsymbol{y}\in B_{d_{2}}(\delta;\boldsymbol{y}^{\star})\text{ and }(\boldsymbol{x}^{\star},\boldsymbol{y})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})\}$ . The first part of the theorem follows.
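The constraint on $\alpha$ used in the last sub-case is the positive root of a quadratic. A minimal sketch, assuming that the final bound on $f(\boldsymbol{x}^{\star})-f(\boldsymbol{x})$ has the form $(G+\delta+\alpha)\cdot\alpha+\frac{L}{2}\delta^{2}$ (the Claim B.1 term plus the smoothness term); this form is an assumption, and the function names and numerical values are hypothetical.

```python
import math

def alpha_max(G, L, eps, delta):
    """Largest alpha allowed by the stated constraint; needs delta < sqrt(2 * eps / L)."""
    if not delta < math.sqrt(2.0 * eps / L):
        raise ValueError("delta must be below sqrt(2 * eps / L)")
    slack = eps - 0.5 * L * delta**2
    return (math.sqrt((G + delta) ** 2 + 4.0 * slack) - (G + delta)) / 2.0

def bound(G, L, delta, alpha):
    """Assumed upper bound: Claim B.1 term plus the L-smoothness term."""
    return (G + delta + alpha) * alpha + 0.5 * L * delta**2

G, L, eps, delta = 2.0, 3.0, 0.1, 0.1
alpha = alpha_max(G, L, eps, delta)
at_limit = bound(G, L, delta, alpha)       # equals eps up to rounding
below_limit = bound(G, L, delta, alpha / 2)  # strictly smaller than eps
```

At the largest admissible $\alpha$ the assumed bound equals $\varepsilon$ , and any smaller $\alpha$ keeps it strictly below $\varepsilon$ .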

Now let us establish claim “2.” in the theorem statement. It will be clear that our proof will provide a polynomial-time reduction from GDAFixedPoint to LR-LocalMinMax. For the choice of parameters $\varepsilon$ and $\delta$ described in the theorem statement, we will show that, if $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ is an $(\varepsilon,\delta)$ -local min-max equilibrium of $f$ , then $\left\|F_{GDAx}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})-\boldsymbol{x}^{\star}\right\|_{2}<\alpha/2$ and $\left\|F_{GDAy}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})-\boldsymbol{y}^{\star}\right\|_{2}<\alpha/2$ . The second part of the theorem will then follow. We only prove that $\left\|F_{GDAx}(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})-\boldsymbol{x}^{\star}\right\|_{2}<\alpha/2$ , as the argument for $\boldsymbol{y}^{\star}$ is identical. In the argument below we abuse notation in the same way we described earlier. With that notation we will show that $\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}\right\|_{2}<\alpha/2$ .

Proof that $\boldsymbol{\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}\right\|<\alpha/2}$ . From our choice of $\varepsilon$ and $\delta$ , it is easy to see that $\delta=\alpha/(5L+2)<\alpha/2$ . Thus, if $\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}\right\|<\delta$ , then we automatically get $\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}\right\|<\alpha/2$ . So it remains to handle the case $\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}\right\|\geq\delta$ . We choose $\boldsymbol{x}_{c}\triangleq\boldsymbol{x}^{\star}+\delta\frac{\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}}{\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}\right\|_{2}}$ . It is easy to see that $\boldsymbol{x}_{c}\in B_{d_{1}}(\delta;\boldsymbol{x}^{\star})$ and hence we get that

where the first inequality follows from the fact that $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ is an $(\varepsilon,\delta)$ -local min-max equilibrium, the second inequality follows from the $L$ -smoothness of $f$ , and the third inequality follows from $\left\|\boldsymbol{x}_{c}-\boldsymbol{x}^{\star}\right\|\leq\delta$ and our choice of $\delta=\sqrt{\varepsilon/L}$ . The above implies:

Since $\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}=\left(\boldsymbol{x}_{c}-\boldsymbol{x}^{\star}\right)\cdot\left\|\hat{\boldsymbol{x}}-\boldsymbol{x}^{\star}\right\|_{2}/\delta$ we get that $\left\langle\nabla f(\boldsymbol{x}^{\star}),\boldsymbol{x}^{\star}-\hat{\boldsymbol{x}}\right\rangle<\frac{3\varepsilon}{2\delta}\left\|\boldsymbol{x}^{\star}-\hat{\boldsymbol{x}}\right\|_{2}$ . Therefore

where in the above inequality we have also used (B.1). As a result, $\left\|\boldsymbol{x}^{\star}-\hat{\boldsymbol{x}}\right\|_{2}<\frac{3\varepsilon}{2\delta}<\alpha/2$ .
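The two parameter inequalities used in this argument can be checked directly. A minimal sketch, assuming the choice $\delta=\alpha/(5L+2)$ and $\varepsilon=L\delta^{2}$ (equivalently $\delta=\sqrt{\varepsilon/L}$ ), as read off from the proof; the function name and the sampled values of $L$ and $\alpha$ are hypothetical.

```python
def local_regime_parameters(alpha, L):
    """Assumed parameter choice: delta = alpha / (5L + 2), eps = L * delta**2."""
    delta = alpha / (5.0 * L + 2.0)
    eps = L * delta**2  # equivalent to delta = sqrt(eps / L)
    return eps, delta

checks = []
for L in (0.5, 1.0, 10.0, 1000.0):
    for alpha in (1e-3, 0.1, 1.0):
        eps, delta = local_regime_parameters(alpha, L)
        # delta < alpha / 2 and 3 * eps / (2 * delta) < alpha / 2
        checks.append(delta < alpha / 2 and 3 * eps / (2 * delta) < alpha / 2)
```

Since $3\varepsilon/(2\delta)=3L\delta/2=3L\alpha/(2(5L+2))$ , the second inequality amounts to $3L<5L+2$ , which holds for every $L>0$ .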

B.2 Proof of Theorem 5.2

We provide a polynomial-time reduction from GDAFixedPoint to Brouwer. This establishes both the totality of GDAFixedPoint and its inclusion to $\mathsf{PPAD}$ , since Brouwer is both total and lies in $\mathsf{PPAD}$ , as per Lemma 2.5. It also establishes the totality and inclusion to $\mathsf{PPAD}$ of LR-LocalMinMax, since LR-LocalMinMax is polynomial-time reducible to GDAFixedPoint, as shown in Theorem 5.1.

We proceed to describe our reduction. Suppose that $f$ is the $G$ -Lipschitz and $L$ -smooth function provided as input to GDAFixedPoint. Suppose also that $\alpha$ is the approximation parameter provided as input to GDAFixedPoint. Given $f$ and $\alpha$ , we define function $M:\mathcal{P}(\boldsymbol{A},\boldsymbol{b})\rightarrow\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ , which serves as input to Brouwer, as follows:

Given that $f$ is $L$ -smooth, it follows that $M$ is $(L+1)$ -Lipschitz. We set the approximation parameter provided as input to Brouwer to be $\gamma=\alpha^{2}/\left(4(G+2\sqrt{d})\right)$ .
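The reduction can be illustrated on a toy instance. The sketch below assumes that $M$ is one projected gradient descent/ascent step, $M(\boldsymbol{x},\boldsymbol{y})=\Pi_{\mathcal{P}(\boldsymbol{A},\boldsymbol{b})}\left(\boldsymbol{x}-\nabla_{x}f(\boldsymbol{x},\boldsymbol{y}),\boldsymbol{y}+\nabla_{y}f(\boldsymbol{x},\boldsymbol{y})\right)$ , and it further simplifies the polytope to the box $[0,1]^{d}$ , where the projection is coordinate-wise clipping. The objective $f(x,y)=(x-\frac{1}{2})(y-\frac{1}{2})$ and all function names are hypothetical.

```python
import math

def clip01(z):
    """Euclidean projection onto the box [0, 1]^k."""
    return [min(1.0, max(0.0, zi)) for zi in z]

def M(x, y, grad_x, grad_y):
    """One projected descent/ascent step on the box (assumed form of M)."""
    gx, gy = grad_x(x, y), grad_y(x, y)
    new_x = clip01([xi - gi for xi, gi in zip(x, gx)])
    new_y = clip01([yi + gi for yi, gi in zip(y, gy)])
    return new_x, new_y

def gamma(alpha, G, d):
    """Approximation parameter passed to Brouwer: alpha^2 / (4 (G + 2 sqrt(d)))."""
    return alpha**2 / (4.0 * (G + 2.0 * math.sqrt(d)))

def residual(x, y, grad_x, grad_y):
    """Euclidean distance between (x, y) and M(x, y)."""
    new_x, new_y = M(x, y, grad_x, grad_y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(new_x + new_y, x + y)))

# Toy objective f(x, y) = (x - 1/2) * (y - 1/2) with one x- and one y-coordinate.
grad_x = lambda x, y: [y[0] - 0.5]
grad_y = lambda x, y: [x[0] - 0.5]
r_center = residual([0.5], [0.5], grad_x, grad_y)   # exact fixed point
r_corner = residual([0.0], [0.0], grad_x, grad_y)   # not a fixed point
```

The point $(\frac{1}{2},\frac{1}{2})$ is an exact fixed point of this map, whereas the corner $(0,0)$ has residual $\frac{1}{2}$ and would be rejected for any $\gamma<\frac{1}{2}$ .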

To show the validity of the afore-described reduction, we prove that every feasible point $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ that is a $\gamma$ -approximate fixed point of $M$ , i.e. $\left\|M(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})-(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})\right\|_{2}<\gamma$ , is also an $\alpha$ -approximate fixed point of $F_{GDA}$ . Observe that since $\mathcal{P}(\boldsymbol{A},\boldsymbol{b})\subseteq[0,1]^{d}$ it holds that $\left\|(\boldsymbol{x},\boldsymbol{y})-(\boldsymbol{x}^{\prime},\boldsymbol{y}^{\prime})\right\|_{2}\leq\sqrt{d}$ for all $(\boldsymbol{x},\boldsymbol{y}),(\boldsymbol{x}^{\prime},\boldsymbol{y}^{\prime})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ . Hence, if $\gamma>\sqrt{d}$ , then finding $\gamma$ -approximate fixed points of $M$ is trivial and the same is true for finding $\alpha$ -approximate fixed points of $F_{GDA}$ , since $\gamma=\alpha^{2}/\left(4(G+2\sqrt{d})\right)$ , which implies that, if $\gamma>\sqrt{d}$ , then $\alpha>\sqrt{d}$ . Thus, we may assume that $\gamma\leq\sqrt{d}$ .

Next, to simplify notation we define $(\boldsymbol{x}_{\Delta},\boldsymbol{y}_{\Delta})=(x^{\star}-\nabla_{x}f(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}),\boldsymbol{y}^{\star}+\nabla_{y}f(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}))$ and $(\hat{\boldsymbol{x}},\hat{\boldsymbol{y}})=\operatorname*{argmin}_{(\boldsymbol{x},\boldsymbol{y})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})}\left\|(\boldsymbol{x}_{\Delta},\boldsymbol{y}_{\Delta})-(\boldsymbol{x},\boldsymbol{y})\right\|_{2}$ . Given that $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ is a $\gamma$ -approximate fixed point of $M$ , we have that

Using Theorem 1.5.5 (b) of [FP07], we get that

For all $(\boldsymbol{x},\boldsymbol{y})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ , $\left\langle(\boldsymbol{x}_{\Delta},\boldsymbol{y}_{\Delta})-(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star}),(\boldsymbol{x},\boldsymbol{y})-(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})\right\rangle<(G+2\sqrt{d})\cdot\gamma$ .

Now let $\boldsymbol{x}^{\prime}=\operatorname*{argmin}_{\boldsymbol{x}\in K(\boldsymbol{y}^{\star})}\left\|\boldsymbol{x}-\boldsymbol{x}_{\Delta}\right\|_{2}$ where $K(\boldsymbol{y}^{\star})=\{\boldsymbol{x}\mid(\boldsymbol{x},\boldsymbol{y}^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})\}$ . Using Theorem 1.5.5 (b) of [FP07] for $\boldsymbol{x}^{\prime}$ we get that $\left\langle\boldsymbol{x}_{\Delta}-\boldsymbol{x}^{\prime},\boldsymbol{x}^{\star}-\boldsymbol{x}^{\prime}\right\rangle\leq 0$ . Using Claim B.2 for vector $(\boldsymbol{x}^{\prime},\boldsymbol{y}^{\star})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})$ we get that $\left\langle\boldsymbol{x}^{\star}-\boldsymbol{x}_{\Delta},\boldsymbol{x}^{\star}-\boldsymbol{x}^{\prime}\right\rangle<(G+2\sqrt{d})\gamma$ . Adding the last two inequalities and using the fact that $\gamma=\alpha^{2}/\left(4(G+2\sqrt{d})\right)$ we get the following

Using the exact same reasoning we can also prove that

where $K(\boldsymbol{x}^{\star})=\{\boldsymbol{y}\mid(\boldsymbol{x}^{\star},\boldsymbol{y})\in\mathcal{P}(\boldsymbol{A},\boldsymbol{b})\}$ . Combining the last two inequalities we get that $(\boldsymbol{x}^{\star},\boldsymbol{y}^{\star})$ is an $\alpha$ -approximate fixed point of $F_{GDA}$ .

Appendix C Missing Proofs from Section 8

In this section we present the missing proofs from Section 8; more precisely, in the following sections we prove Lemmas 8.10, 8.11, and 8.12. For the rest of the proofs in this section we define $L(\boldsymbol{c})$ to be the cubelet which has the down-left corner equal to $\boldsymbol{c}$ , formally

and we also define $L_{c}(\boldsymbol{c})$ to be the set of corners of the cubelet $L(\boldsymbol{c})$ , or more formally

We start with a lemma about the differentiability properties of the functions $Q_{\boldsymbol{v}}^{\boldsymbol{c}}$ which we defined in Definition 8.7.

1st order differentiability: Recall from Definition 8.7 that we let $\boldsymbol{s}^{\boldsymbol{c}}=(s_{1},\ldots,s_{d})$ be the source vertex of $R(\boldsymbol{x})$ and $\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=(p_{1},\ldots,p_{d})$ be the canonical representation of $\boldsymbol{x}$ . Then for each vertex $\boldsymbol{v}\in R_{c}(\boldsymbol{x})$ we define the following partition of the set of coordinates $[d]$ ,

Now in case $B_{\boldsymbol{v}}^{\boldsymbol{c}}=\varnothing$ , which corresponds to $\boldsymbol{v}$ being the source vertex $\boldsymbol{s}^{\boldsymbol{c}}$ , then $Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=\prod_{j=1}^{d}S_{\infty}(1-S(p_{j}))$ , which is clearly differentiable as a product of compositions of differentiable functions. The exact same holds for $A_{\boldsymbol{v}}^{\boldsymbol{c}}=\varnothing$ , which corresponds to $\boldsymbol{v}$ being the target vertex $\boldsymbol{t}^{\boldsymbol{c}}$ of the cubelet $R(\boldsymbol{x})$ . We thus focus on the case where $A_{\boldsymbol{v}}^{\boldsymbol{c}},B_{\boldsymbol{v}}^{\boldsymbol{c}}\neq\varnothing$ . To simplify notation we denote $Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})$ by $Q(\boldsymbol{x})$ , $A_{\boldsymbol{v}}^{\boldsymbol{c}}$ by $A$ and $B_{\boldsymbol{v}}^{\boldsymbol{c}}$ by $B$ for the rest of this proof. We prove that in case $i\in B$ then $\frac{\partial Q(\boldsymbol{x})}{\partial x_{i}}$ always exists. The case $i\in A$ then follows symmetrically. We have the following cases

$p_{i}>p_{j}$ for all $j\in A$ : Then $\frac{\partial Q(\boldsymbol{x})}{\partial x_{i}}$ exists since both $S_{\infty}(\cdot)$ and $S(\cdot)$ are differentiable.

$p_{i}<p_{j}$ for some $j\in A$ : By Definition 8.7, if $\varepsilon$ is sufficiently small then $Q(x_{i}-\varepsilon,\boldsymbol{x}_{-i})=Q(x_{i}+\varepsilon,\boldsymbol{x}_{-i})=Q(x_{i},\boldsymbol{x}_{-i})=0$ . Thus $\frac{\partial Q(\boldsymbol{x})}{\partial x_{i}}$ exists and equals $0$ .

$p_{i}=p_{j}$ for some $j\in A$ and $p_{i}\geq p_{j^{\prime}}$ for all $j^{\prime}\in A\setminus\{j\}$ : By Definition 8.7, if $\varepsilon$ is sufficiently small then $Q(x_{i}-\varepsilon,\boldsymbol{x}_{-i})=0$ and also $Q(x_{i},\boldsymbol{x}_{-i})=0$ , thus

since both $S_{\infty}(\cdot)$ and $S(\cdot)$ are differentiable functions, $S_{\infty}(S(p_{i})-S(p_{j}))=S_{\infty}(0)=0$ , and $S^{\prime}_{\infty}(S(p_{i})-S(p_{j}))=S^{\prime}_{\infty}(0)=0$ .

$p_{i}>p_{j}$ for all $j\in A$ : Then $\frac{\partial Q^{\prime}(\boldsymbol{x})}{\partial x_{i}}\triangleq\frac{\partial^{2}Q(\boldsymbol{x})}{\partial x_{i}\partial x_{k}}$ exists since both $S_{\infty}(\cdot)$ and $S(\cdot)$ are twice differentiable.

$p_{i}<p_{j}$ for some $j\in A$ . By Definition 8.7, $Q^{\prime}(x_{i}-\varepsilon,\boldsymbol{x}_{-i})=Q^{\prime}(x_{i}+\varepsilon,\boldsymbol{x}_{-i})=Q^{\prime}(x_{i},\boldsymbol{x}_{-i})=0$ . Thus $\frac{\partial Q^{\prime}(\boldsymbol{x})}{\partial x_{i}}\triangleq\frac{\partial^{2}Q(\boldsymbol{x})}{\partial x_{i}\partial x_{k}}$ exists and equals $0$ .

$p_{i}=p_{j}$ for some $j\in A$ and $p_{i}>p_{j^{\prime}}$ for all $j^{\prime}\in A\setminus\{j\}$ . By Definition 8.7, if $\varepsilon$ is sufficiently small then $Q^{\prime}(x_{i}-\varepsilon,\boldsymbol{x}_{-i})=0$ and thus

At the same time $\lim_{\varepsilon\rightarrow 0^{+}}\frac{Q^{\prime}(x_{i}+\varepsilon,\boldsymbol{x}_{-i})-Q^{\prime}(x_{i},\boldsymbol{x}_{-i})}{\varepsilon}$ exists since both $S_{\infty}(\cdot)$ and $S(\cdot)$ are twice differentiable. Moreover, it equals $0$ since $S_{\infty}(S(p_{i})-S(p_{j}))=S_{\infty}(0)=0$ and $S^{\prime}_{\infty}(S(p_{i})-S(p_{j}))=S^{\prime}_{\infty}(0)=S^{\prime\prime}_{\infty}(0)=S(0)=0$ .

In every step of the above proof where we use properties of $S_{\infty}$ and $S$ we use Lemma 8.3. ∎

So far we have established that the functions $Q^{\boldsymbol{c}}_{\boldsymbol{v}}(\boldsymbol{x})$ are twice differentiable when $\boldsymbol{x}$ moves within the same cubelet. Next we show that when $\boldsymbol{x}$ moves from one cubelet to another, the corresponding $Q^{\boldsymbol{c}}_{\boldsymbol{v}}$ functions change value smoothly.

Let $\boldsymbol{x}\in[0,1]^{d}$ be such that there exists a coordinate $i\in[d]$ with the property $R(x_{i}+\varepsilon,\boldsymbol{x}_{-i})=\left[\frac{c_{1}}{N-1},\frac{c_{1}+1}{N-1}\right]\times\cdots\times\left[\frac{c_{d}}{N-1},\frac{c_{d}+1}{N-1}\right]$ and $R(x_{i}-\varepsilon,\boldsymbol{x}_{-i})=\left[\frac{c^{\prime}_{1}}{N-1},\frac{c^{\prime}_{1}+1}{N-1}\right]\times\cdots\times\left[\frac{c^{\prime}_{d}}{N-1},\frac{c^{\prime}_{d}+1}{N-1}\right]$ , with $\boldsymbol{c},\boldsymbol{c}^{\prime}\in\left(\left[N-1\right]-1\right)^{d}$ and $\varepsilon$ sufficiently small, i.e. $\boldsymbol{x}$ lies on the boundary of two cubelets. Then the following statements hold.

For all vertices $\boldsymbol{v}\in R_{c}(x_{i}+\varepsilon,\boldsymbol{x}_{-i})\cap R_{c}(x_{i}-\varepsilon,\boldsymbol{x}_{-i})$ , it holds that

$Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x})$ ,

$\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial x_{i}}=\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x})}{\partial x_{i}}$ for all $i\in[d]$ , and

Lemma C.2 is crucial since it establishes that $\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})$ is continuous and twice differentiable even when $\boldsymbol{x}$ moves from one cubelet to another. Since the proof of Lemma C.2 is very long and contains the proofs of some sublemmas, we postpone it to the end of this section, in Section C.1.1. We now proceed with the proof of Lemma 8.10.

We first prove that $\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})$ is a continuous function. Let $\boldsymbol{x}\in[0,1]^{d}$ lie on the boundary of the following cubelets

$\boldsymbol{v}\notin\cup_{i=1}^{m}R_{c}(x_{j_{i}}+\eta_{i},\boldsymbol{x}_{-j_{i}})$ . By Definition 8.9, in all the $m$ aforementioned cubelets, the coefficient $\mathsf{P}_{\boldsymbol{v}}$ takes value $0$ and hence it is continuous in this part of the space.

$\boldsymbol{v}\in\cap_{i\in U}R_{c}(x_{j_{i}}+\eta_{i},\boldsymbol{x}_{-j_{i}})$ and $\boldsymbol{v}\notin\cup_{i\in\overline{U}}R_{c}(x_{j_{i}}+\eta_{i},\boldsymbol{x}_{-j_{i}})$ , for some $U\subseteq[m]$ with $\overline{U}=[m]\setminus U$ . In this case $\mathsf{P}_{\boldsymbol{v}}(x_{j_{i}}+\eta_{i},\boldsymbol{x}_{-j_{i}})$ was computed according to a cubelet with $\boldsymbol{v}\in R_{c}(x_{j_{i}}+\eta_{i},\boldsymbol{x}_{-j_{i}})$ . Then Lemma C.2 implies that $Q^{\boldsymbol{c}^{(i)}}_{\boldsymbol{v}}(\boldsymbol{x})=0$ since $\boldsymbol{v}\in R_{c}(x_{j_{i}}+\eta_{i},\boldsymbol{x}_{-j_{i}})\setminus R_{c}(x_{j_{i^{\prime}}}+\eta_{i^{\prime}},\boldsymbol{x}_{-j_{i^{\prime}}})$ where $i^{\prime}\in[m]$ and $i\neq i^{\prime}$ . Therefore we conclude that $\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})=0$ and

$\boldsymbol{v}\in\cap_{i=1}^{m}R_{c}(x_{j_{i}}+\eta_{i},\boldsymbol{x}_{-j_{i}})$ . By Lemma C.2 for all $i\in[m]$ it holds that

which again implies the continuity of $\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})$ at $\boldsymbol{x}$ .

To prove that $\frac{\partial\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})}{\partial x_{i}}$ always exists, we consider the following $3$ mutually exclusive cases.

$\boldsymbol{v}\in L_{c}(\boldsymbol{c}^{(1)})$ for $\boldsymbol{c}^{(1)}\in C^{+}$ and $\boldsymbol{v}\in L_{c}(\boldsymbol{c}^{(2)})$ for $\boldsymbol{c}^{(2)}\in C^{-}$ . Since the coefficient $\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})$ is a continuous function, we have that

$\lim_{\varepsilon\rightarrow 0^{+}}\frac{\mathsf{P}_{\boldsymbol{v}}(x_{i}+\varepsilon,\boldsymbol{x}_{-i})-\mathsf{P}_{\boldsymbol{v}}(x_{i},\boldsymbol{x}_{-i})}{\varepsilon}=\frac{\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{(1)}}(\boldsymbol{x})}{\partial x_{i}}\sum_{\boldsymbol{v}^{\prime}\in L_{c}(\boldsymbol{c}^{(1)})}Q_{\boldsymbol{v}^{\prime}}^{\boldsymbol{c}^{(1)}}(\boldsymbol{x})-Q_{\boldsymbol{v}}^{\boldsymbol{c}^{(1)}}(\boldsymbol{x})\sum_{\boldsymbol{v}^{\prime}\in L_{c}(\boldsymbol{c}^{(1)})}\frac{\partial Q_{\boldsymbol{v}^{\prime}}^{\boldsymbol{c}^{(1)}}(\boldsymbol{x})}{\partial x_{i}}}{\left(\sum_{\boldsymbol{v}^{\prime}\in L_{c}(\boldsymbol{c}^{(1)})}Q_{\boldsymbol{v}^{\prime}}^{\boldsymbol{c}^{(1)}}(\boldsymbol{x})\right)^{2}}$

$\lim_{\varepsilon\rightarrow 0^{+}}\frac{\mathsf{P}_{\boldsymbol{v}}(x_{i},\boldsymbol{x}_{-i})-\mathsf{P}_{\boldsymbol{v}}(x_{i}-\varepsilon,\boldsymbol{x}_{-i})}{\varepsilon}=\frac{\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{(2)}}(\boldsymbol{x})}{\partial x_{i}}\sum_{\boldsymbol{v}^{\prime}\in L_{c}(\boldsymbol{c}^{(2)})}Q_{\boldsymbol{v}^{\prime}}^{\boldsymbol{c}^{(2)}}(\boldsymbol{x})-Q_{\boldsymbol{v}}^{\boldsymbol{c}^{(2)}}(\boldsymbol{x})\sum_{\boldsymbol{v}^{\prime}\in L_{c}(\boldsymbol{c}^{(2)})}\frac{\partial Q_{\boldsymbol{v}^{\prime}}^{\boldsymbol{c}^{(2)}}(\boldsymbol{x})}{\partial x_{i}}}{\left(\sum_{\boldsymbol{v}^{\prime}\in L_{c}(\boldsymbol{c}^{(2)})}Q_{\boldsymbol{v}^{\prime}}^{\boldsymbol{c}^{(2)}}(\boldsymbol{x})\right)^{2}}$

Both of the above limits exist due to the fact that $Q^{\boldsymbol{c}}_{\boldsymbol{v}}(\boldsymbol{x})$ is differentiable (Lemma C.1). Moreover, since $\boldsymbol{v}\in L_{c}(\boldsymbol{c}^{(1)})\cap L_{c}(\boldsymbol{c}^{(2)})$ , Case $1$ of Lemma C.2 implies that the two limits above have exactly the same value and hence $\mathsf{P}_{\boldsymbol{v}}$ is differentiable at $\boldsymbol{x}$ .

$\boldsymbol{v}\notin L_{c}(\boldsymbol{c}^{(1)})$ for all $\boldsymbol{c}^{(1)}\in C^{+}$ . In the case where $\boldsymbol{v}\notin L_{c}(\boldsymbol{c})$ for all the down-left corners $\boldsymbol{c}$ of the cubelets at which $\boldsymbol{x}$ lies, then by Definition 8.9 $\mathsf{P}_{\boldsymbol{v}}(x_{i},\boldsymbol{x}_{-i})=\mathsf{P}_{\boldsymbol{v}}(x_{i}+\varepsilon,\boldsymbol{x}_{-i})=\mathsf{P}_{\boldsymbol{v}}(x_{i}-\varepsilon,\boldsymbol{x}_{-i})=0$ . Thus $\frac{\partial\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})}{\partial x_{i}}$ exists and equals $0$ . Therefore we may assume that $\boldsymbol{v}\in L_{c}(\boldsymbol{c})$ for some down-left corner $\boldsymbol{c}$ of a cubelet at which $\boldsymbol{x}$ lies. Due to the fact that $\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})$ is a continuous function and that $\boldsymbol{v}\notin L_{c}(\boldsymbol{c}^{(1)})$ for all $\boldsymbol{c}^{(1)}\in C^{+}$ , we get that

We also have that $\boldsymbol{v}\in L_{c}(\boldsymbol{c})\setminus L_{c}(\boldsymbol{c}^{(1)})$ where $\boldsymbol{c}$ , $\boldsymbol{c}^{(1)}$ are the down-left corners of the cubelets at which $\boldsymbol{x}$ and $(x_{i}+\varepsilon,\boldsymbol{x}_{-i})$ lie respectively. Therefore we get by Case $1$ of Lemma C.2 that $Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=0$ implying that $\mathsf{P}_{\boldsymbol{v}}(x_{i},\boldsymbol{x}_{-i})=0$ . As a result,

We now need to argue that $\lim_{\varepsilon\to 0^{+}}\frac{\mathsf{P}_{\boldsymbol{v}}(x_{i},\boldsymbol{x}_{-i})-\mathsf{P}_{\boldsymbol{v}}(x_{i}-\varepsilon,\boldsymbol{x}_{-i})}{\varepsilon}$ exists and equals $0$ . First observe that $0\leq x_{i}-c_{i}\leq\delta$ since $\boldsymbol{x}$ lies in the cubelet with down-left corner $\boldsymbol{c}$ . In case $x_{i}-c_{i}<\delta$ then $(x_{i}+\varepsilon,\boldsymbol{x}_{-i})$ lies in $L(\boldsymbol{c})$ for all sufficiently small $\varepsilon$ , meaning that $\boldsymbol{c}\in C^{+}$ . The latter contradicts the fact that $\boldsymbol{v}\notin L_{c}(\boldsymbol{c}^{(1)})$ for all $\boldsymbol{c}^{(1)}\in C^{+}$ . As a result, $x_{i}-c_{i}=\delta$ which implies that $\boldsymbol{c}\in C^{-}$ and hence

The above limit equals $0$ since $Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial x_{i}}=0$ by applying Lemma C.2 due to the fact that $\boldsymbol{v}\in L_{c}(\boldsymbol{c})\setminus L_{c}(\boldsymbol{c}^{(1)})$ .

$\boldsymbol{v}\notin L_{c}(\boldsymbol{c}^{(2)})$ for all $\boldsymbol{c}^{(2)}\in C^{-}$ . This case is symmetric to the previous one.

The second order differentiability of $\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})$ can be established using exactly the same arguments for computing the following limit

since for all the others $\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}$ it holds that $Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=0$ , $\nabla Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=0$ , and $\nabla^{2}Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=0$ . These vertices $\boldsymbol{v}\in R_{+}(\boldsymbol{x})$ can be computed in polynomial time as follows: i) the coordinates $p_{1},\ldots,p_{d}$ are sorted in increasing order, and ii) for each $m=0,\ldots,d$ compute the vertex $\boldsymbol{v}^{(m)}\in L_{c}(\boldsymbol{c})$ ,

To finish the proof of Lemma 8.10 we only need the proof of Lemma C.2 which we present in the following section.

Let $\boldsymbol{x}\in[0,1]^{d}$ be a point lying on the boundary of the cubelets with down-left corners $\boldsymbol{c}=(c_{1},\ldots,c_{m-1},c_{m},c_{m+1},\ldots,c_{d})$ and $\boldsymbol{c}^{\prime}=(c_{1},\ldots,c_{m-1},c_{m}+1,c_{m+1},\ldots,c_{d})$ . Then the canonical representation of $\boldsymbol{x}$ in the cubelet $L(\boldsymbol{c})$ is the same as the canonical representation of $\boldsymbol{x}$ in the cubelet $L(\boldsymbol{c}^{\prime})$ . More precisely, $\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}^{\prime}}$ .

Let $c_{m}$ be even. By the definition of the canonical representation in Definition 8.6, the source and target of the cubelets $L(\boldsymbol{c})$ and $L(\boldsymbol{c}^{\prime})$ are respectively,

$\boldsymbol{s}^{\boldsymbol{c}}=(s_{1},\ldots,s_{m-1},c_{m},s_{m+1},\ldots,s_{d})$ ,

$\boldsymbol{t}^{\boldsymbol{c}}=(t_{1},\ldots,t_{m-1},c_{m}+1,t_{m+1},\ldots,t_{d})$ ,

$\boldsymbol{s}^{\boldsymbol{c}^{\prime}}=(s_{1},\ldots,s_{m-1},c_{m}+2,s_{m+1},\ldots,s_{d})$ ,

$\boldsymbol{t}^{\boldsymbol{c}^{\prime}}=(t_{1},\ldots,t_{m-1},c_{m}+1,t_{m+1},\ldots,t_{d})$ .

Hence we get that $p_{j}=p_{j}^{\prime}$ for $j\neq m$ . Since $\boldsymbol{x}$ belongs to the boundary of both cubelets $L(\boldsymbol{c})$ and $L(\boldsymbol{c}^{\prime})$ we get that $x_{m}=c_{m}+1$ which implies that $p_{m}=p_{m}^{\prime}=1$ . In case $c_{m}$ is odd we get that $\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}^{\prime}}$ but with $p_{m}=p_{m}^{\prime}=0$ . ∎
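The parity argument above can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: it works in integer grid units (as the vertices listed in the proof do), it reads the source/target rule off those vertices (an even $c_{j}$ gives $s_{j}=c_{j}$ , $t_{j}=c_{j}+1$ , an odd $c_{j}$ gives the reverse), and it uses $p_{j}=\frac{x_{j}-s_{j}}{t_{j}-s_{j}}$ . The helper names `source_target` and `canonical_representation` are hypothetical.

```python
from fractions import Fraction

def source_target(c):
    """Source and target corners (grid units) of the cubelet with down-left corner c.

    Assumed parity rule, read off the vertices listed in the proof:
    even c_j -> (s_j, t_j) = (c_j, c_j + 1); odd c_j -> (c_j + 1, c_j).
    """
    s = [cj if cj % 2 == 0 else cj + 1 for cj in c]
    t = [cj + 1 if cj % 2 == 0 else cj for cj in c]
    return s, t

def canonical_representation(x, c):
    """Exact rational p_j = (x_j - s_j) / (t_j - s_j)."""
    s, t = source_target(c)
    return [(Fraction(xj) - sj) / (tj - sj) for xj, sj, tj in zip(x, s, t)]

# Even case: c_m = 2, so the shared face is x_m = 3 and p_m should be 1.
x_even = [Fraction(2, 5), Fraction(3)]
p_even = canonical_representation(x_even, [0, 2])
p_even_shifted = canonical_representation(x_even, [0, 3])

# Odd case: c_m = 1, so the shared face is x_m = 2 and p_m should be 0.
x_odd = [Fraction(2, 5), Fraction(2)]
p_odd = canonical_representation(x_odd, [0, 1])
p_odd_shifted = canonical_representation(x_odd, [0, 2])
```

Exact rationals are used so that the two representations can be compared with plain equality rather than a floating-point tolerance.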

Let $\boldsymbol{x}\in[0,1]^{d}$ lie at the intersection of the cubelets $L(\boldsymbol{c})$ , $L(\boldsymbol{c}^{\prime})$ with down-left corners $\boldsymbol{c}=(c_{1},\ldots,c_{m-1},c_{m},c_{m+1},\ldots,c_{d})$ , and $\boldsymbol{c}^{\prime}=(c_{1},\ldots,c_{m-1},c_{m}+1,c_{m+1},\ldots,c_{d})$ . Then the following statements are true.

For all vertices $\boldsymbol{v}\in L_{c}(\boldsymbol{c})\cap L_{c}(\boldsymbol{c}^{\prime})$ it holds that

$Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x})$ ,

$\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial x_{i}}=\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x})}{\partial x_{i}}$ ,

Let $\boldsymbol{v}\in L_{c}(\boldsymbol{c})\cap L_{c}(\boldsymbol{c}^{\prime})$ then we have that

$Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x})$ . By Lemma C.3 we get that the canonical representation $\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}^{\prime}}$ . Since $Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})$ is a function of the canonical representation $\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}$ (see Definition 8.9), it holds that $Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x})$ for all vertices $\boldsymbol{v}\in L_{c}(\boldsymbol{c})\cap L_{c}(\boldsymbol{c}^{\prime})$ .

$\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial x_{i}}=\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x})}{\partial x_{i}}$ . For $i\neq m$ , we get that $\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial x_{i}}=\frac{1}{t_{i}-s_{i}}\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial p_{i}}=\frac{1}{t^{\prime}_{i}-s^{\prime}_{i}}\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x})}{\partial p^{\prime}_{i}}=\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x})}{\partial x_{i}}$ since $t_{i}=t_{i}^{\prime}$ and $s_{i}=s_{i}^{\prime}$ for all $i\neq m$ . The latter argument cannot be applied to the $m$ -th coordinate since $t_{m}-s_{m}=-(t_{m}^{\prime}-s_{m}^{\prime})$ . However, since $\boldsymbol{x}$ belongs to the boundary of both the cubelets $L(\boldsymbol{c})$ and $L(\boldsymbol{c}^{\prime})$ , it follows that $p_{m}=p_{m}^{\prime}$ is either $0$ or $1$ , meaning that $\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial x_{m}}=\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{\prime}}(\boldsymbol{x})}{\partial x_{m}}=0$ since $S^{\prime}(0)=S^{\prime}(1)=0$ from Lemma 8.3.

This case follows by the same reasoning as the previous case $2$ .

Let $\boldsymbol{v}\in L_{c}(\boldsymbol{c})\cap L_{c}(\boldsymbol{c}^{\prime})$ . There exists a sequence of corners

such that $\left\|\boldsymbol{c}^{(j)}-\boldsymbol{c}^{(j+1)}\right\|_{1}=1$ and $\boldsymbol{v}\in L_{c}(\boldsymbol{c}^{(j)})$ for all $j\in[m]$ . By Lemma C.4 we get that,

$Q_{\boldsymbol{v}}^{\boldsymbol{c}^{(j)}}(\boldsymbol{x})=Q_{\boldsymbol{v}}^{\boldsymbol{c}^{(j+1)}}(\boldsymbol{x})$ .

$\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{(j)}}(\boldsymbol{x})}{\partial x_{i}}=\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}^{(j+1)}}(\boldsymbol{x})}{\partial x_{i}}$ .

C.2 Proof of Lemma 8.11

We start this section with some fundamental properties of the smooth step function $S_{\infty}$ that are more fine-grained than the properties we presented in Lemma 8.3.

For $d\geq 10$ there exists a universal constant $c>0$ such that the following statements hold.

If $x\geq 1/d$ then $S_{\infty}(x)\geq c\cdot 2^{-d}$ .

If $x\leq 1/d$ then $S_{\infty}^{\prime}(x)\leq c\cdot d^{2}\cdot 2^{-d}$ .

If $x\geq 1/d$ then $\frac{S_{\infty}^{\prime}(x)}{S_{\infty}(x)}\leq c\cdot d^{2}$ .

If $x\leq 1/d$ then $\left|S^{\prime\prime}_{\infty}(x)\right|\leq c\cdot d^{4}\cdot 2^{-d}$ .

If $x\geq 1/d$ then $\frac{\left|S^{\prime\prime}_{\infty}(x)\right|}{S_{\infty}(x)}\leq c\cdot d^{4}$ .

We compute the derivative of $S_{\infty}$ and we have that

from which we immediately get $S^{\prime}_{\infty}(x)\geq 0$ . Then we can compute the second derivative of $S_{\infty}$ as follows

We next want to prove that $S^{\prime\prime}_{\infty}(x)\geq 0$ for $x\leq 1/10$ . To see this observe that $1-2\cdot S_{\infty}(x)\geq 1/2$ for $x\leq 1/d$ and therefore

hence for $x\leq 4/\ln(2)$ it holds that $S^{\prime\prime}_{\infty}(x)\geq 0$ . By similar but more tedious calculations we can conclude that $S^{\prime\prime\prime}_{\infty}(x)\geq 0$ for $x\leq 1/10$ . Hence in the interval $x\in[0,1/10]$ the functions $S_{\infty}$ , $S^{\prime}_{\infty}$ , $S^{\prime\prime}_{\infty}$ are all increasing functions of $x$ .

Next we show that the function $h(x)=2^{-1/x}+2^{-1/(1-x)}$ is upper and lower bounded. First observe that $h(x)\geq\max\{2^{-1/x},2^{-1/(1-x)}\}$ . Now if we set $t(x)=2^{-1/x}$ then $t^{\prime}(x)=\ln(2)t(x)/x^{2}$ and hence $t(x)\geq t(1/2)=1/4$ for $x\geq 1/2$ . In the same way we can prove that $2^{-1/(1-x)}\geq 1/4$ for $x\leq 1/2$ . Therefore $h(x)\geq 1/4$ for all $x\in[0,1]$ . Also it is not hard to see that $2^{-1/x}\leq 1/2$ and $2^{-1/(1-x)}\leq 1/2$ which implies $h(x)\leq 1$ . Hence overall we have that $h(x)\in[1/4,1]$ for all $x\in[0,1]$ . We are now ready to prove the statements.

We have shown that $S^{\prime}_{\infty}(x)\geq 0$ for all $x\in[0,1]$ . Hence $S_{\infty}$ is an increasing function and therefore $S_{\infty}(x)\geq S_{\infty}(1/d)$ for $x\geq 1/d$ . Now we have that $S_{\infty}(1/d)=2^{-d}/h(1/d)\geq 2^{-d}$ .

Since $S^{\prime}_{\infty}(x)$ is increasing for $x\in[0,1/10]$ , we have that $S^{\prime}_{\infty}(x)\leq S^{\prime}_{\infty}(1/d)$ for $x\leq 1/d$ and therefore

This statement follows directly from statement 1., the fact that $S^{\prime\prime}_{\infty}(x)$ is increasing for $x\in[0,1/10]$ , and the above expression of $S^{\prime\prime}_{\infty}$ .

This statement follows using the same reasoning as statement 3.
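As a quick numerical sanity check of the bounds on $h$ and of statement 1, the following Python sketch evaluates $h$ and $S_{\infty}(x)=2^{-1/x}/h(x)$ on a grid. The grid resolution, the range of $d$ , the endpoint conventions and the function names are illustrative assumptions, not part of the proof, and the check uses the constant $c=1$ suggested by $S_{\infty}(1/d)\geq 2^{-d}$ .

```python
def h(x):
    """h(x) = 2^(-1/x) + 2^(-1/(1-x)) on the open interval (0, 1)."""
    return 2.0 ** (-1.0 / x) + 2.0 ** (-1.0 / (1.0 - x))

def s_inf(x):
    """S_inf(x) = 2^(-1/x) / h(x), extended by 0 and 1 at the endpoints."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return 2.0 ** (-1.0 / x) / h(x)

# Interior grid points 0.001, 0.002, ..., 0.999.
grid = [k / 1000.0 for k in range(1, 1000)]

# Observed range of h on the grid; the proof claims it lies in [1/4, 1].
h_min = min(h(x) for x in grid)
h_max = max(h(x) for x in grid)

# Statement 1 with c = 1: S_inf(x) >= 2^(-d) whenever x >= 1/d.
statement_1_holds = all(
    s_inf(x) >= 2.0 ** (-d)
    for d in range(10, 31)
    for x in grid
    if x >= 1.0 / d
)
```

A grid check of this kind does not replace the monotonicity argument; it only guards against transcription errors in the formulas.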

In order to prove Lemma 8.11, we first introduce several technical lemmas.

Let $\boldsymbol{x}\in[0,1]^{d}$ lie in the cubelet $L(\boldsymbol{c})$ , with $\boldsymbol{c}\in\left(\left[N\right]-1\right)^{d}$ and let $\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=(p_{1},\ldots,p_{d})$ be the canonical representation of $\boldsymbol{x}$ . Then for all vertices $\boldsymbol{v}\in L_{c}(\boldsymbol{c})$ , it holds that

where the last inequality follows by the fact that $\left|S^{\prime}(\cdot)\right|\leq 6$ . Since $\left|A\right|\leq d$ the proof of the lemma will be completed if we are able to show that for any $j\in A$ , it holds that

In case $S(p_{i})-S(p_{j})\geq 1/d^{5}$ then by case $3.$ of Lemma C.5 we get that $\left|S_{\infty}^{\prime}(S(p_{i})-S(p_{j}))\right|\leq c\cdot d^{10}\cdot S_{\infty}(S(p_{i})-S(p_{j}))$ , which implies the following

Now consider the case where $S(p_{i})-S(p_{j})\leq 1/d^{5}$ . Using case $2.$ of Lemma C.5, we have that

Combining the latter with the discussion in the rest of the proof, the lemma follows. ∎

For any vertex $\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}$ it holds that $\left|\frac{\partial\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})}{\partial x_{i}}\right|\leq\Theta\left(d^{12}/\delta\right)$ .

To simplify notation we use $Q_{\boldsymbol{v}}(\boldsymbol{x})$ instead of $Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})$ for the rest of the proof. Without loss of generality we assume that $\boldsymbol{x}$ lies on a cubelet $L(\boldsymbol{c})$ with $\boldsymbol{c}\in\left(\left[N\right]-1\right)^{d}$ and $\boldsymbol{v}\in L_{c}(\boldsymbol{c})$ , since otherwise $\frac{\partial\mathsf{P}_{\boldsymbol{v}}(\boldsymbol{x})}{\partial x_{i}}=0$ . Let $\boldsymbol{p}_{\boldsymbol{x}}^{\boldsymbol{c}}=(p_{1},\ldots,p_{d})$ be the canonical representation of $\boldsymbol{x}$ in the cubelet $L(\boldsymbol{c})$ . Then it holds that

where the last inequality follows by Lemma C.6 and the fact that at most $d+1$ vertices $\boldsymbol{v}$ of $L_{c}(\boldsymbol{c})$ have non-zero gradient as we have proved in Lemma 8.10. Then the proof of Lemma C.7 follows by the fact that $p_{i}=\frac{x_{i}-s_{i}}{t_{i}-s_{i}}$ . ∎

If additionally it holds that $S(p_{i})-S(p_{m_{1}})\leq 1/d^{5}$ or $S(p_{j})-S(p_{m_{2}})\leq 1/d^{5}$ , then by the case $2.$ of Lemma C.5, we have that

In case $S(p_{i})-S(p_{j})\geq 1/d^{5}$ then by case $4.$ of Lemma C.5, we get that $\left|CS^{{}^{\prime\prime}}(p_{i}-p_{j})\right|\leq cd^{20}\cdot CS(p_{i}-p_{j})$ which implies that $Q^{{}^{\prime\prime}}\leq\Theta(d^{20})\cdot Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})$ .

On the other hand if $S(p_{i})-S(p_{j})\leq 1/d^{5}$ then by case $5.$ of Lemma C.5, we get that $Q^{{}^{\prime\prime}}\leq\left|CS^{{}^{\prime\prime}}(p_{i}-p_{j})\right|\leq c\cdot d^{20}e^{-d^{5}}$ . As in the proof of Lemma C.6, there exists a vertex $\boldsymbol{v}^{\ast}\in R_{\boldsymbol{c}}(\boldsymbol{x})$ such that $Q_{\boldsymbol{v}^{\ast}}^{\boldsymbol{c}}(\boldsymbol{x})\geq c^{d^{2}}e^{-(d+1)^{2}d^{2}}$ and thus $Q^{{}^{\prime\prime}}\leq\Theta(d^{20})\sum_{\boldsymbol{v}\in L_{c}(\boldsymbol{c})}Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})$ . Overall we get that

If we combine all the above cases then the Lemma follows. ∎

Finally using Lemma C.7 and Lemma C.9 we get the proof of Lemma 8.11.

C.3 Proof of Lemma 8.12

Let $0\leq x_{i}<1/(N-1)$ and let $\boldsymbol{c}=(c_{1},\ldots,c_{i},\ldots,c_{d})$ denote the down-left corner of the cubelet $R(\boldsymbol{x})$ in which $\boldsymbol{x}\in[0,1]^{d}$ lies, i.e. $\boldsymbol{x}\in L(\boldsymbol{c})$ . Since $x_{i}<1/(N-1)$ , this means that $c_{i}=0$ . By the definition of sources and targets in Definition 8.6, we have that $s_{i}=0$ and $t_{i}=1/(N-1)$ , where $s_{i}$ , $t_{i}$ are respectively the $i$ -th coordinate of the source $\boldsymbol{s}_{\boldsymbol{c}}$ and the target $\boldsymbol{t}_{\boldsymbol{c}}$ vertex. Let $p_{\boldsymbol{x}}^{\boldsymbol{c}}=(p_{1},\ldots,p_{d})$ be the canonical representation of $\boldsymbol{x}$ in the cubelet $L(\boldsymbol{c})$ . Now partition the coordinates $[d]$ into the following sets

If $B=\varnothing$ then notice that $\mathsf{P}_{\boldsymbol{s}_{\boldsymbol{c}}}(\boldsymbol{x})>0$ , since $p_{i}<1$ , by the fact that $x_{i}<1/(N-1)$ . Thus the lemma follows since $s_{i}=0$ . So we may assume that $B\neq\varnothing$ . In this case consider the corner $\boldsymbol{v}=(v_{1},\ldots,v_{d})$ defined as follows

Observe that $Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})>0$ and thus $\boldsymbol{v}\in R_{+}(\boldsymbol{x})$ . Moreover the coordinate $i\in A$ and therefore it holds that $v_{i}=s_{i}=0$ . This proves the first statement of the Lemma.

For the second statement let $1-1/(N-1)<x_{i}\leq 1$ and let $\boldsymbol{c}=(c_{1},\ldots,c_{i},\ldots,c_{d})$ denote the down-left corner of the cubelet $R(\boldsymbol{x})$ in which $\boldsymbol{x}\in[0,1]^{d}$ lies, i.e. $\boldsymbol{x}\in L(\boldsymbol{c})$ . This means that $c_{i}=\frac{N-2}{N-1}$ .

Let $N$ be odd. In this case by the definition of sources and targets in Definition 8.6, we have that $s_{i}=1-1/(N-1)$ and $t_{i}=1$ , where $s_{i}$ , $t_{i}$ are respectively the $i$ -th coordinate of the source and target vertex. Let $p_{\boldsymbol{x}}^{\boldsymbol{c}}=(p_{1},\ldots,p_{d})$ be the canonical representation of $\boldsymbol{x}$ in the cubelet $L(\boldsymbol{c})$ . Now partition the coordinates $[d]$ as follows,

If $A=\varnothing$ then notice that for the target vertex $\boldsymbol{t}_{\boldsymbol{c}}$ , $\mathsf{P}_{\boldsymbol{t}_{\boldsymbol{c}}}(\boldsymbol{x})>0$ , since $p_{i}>0$ , by the fact that $x_{i}>1-1/(N-1)$ . Thus the lemma follows since $t_{i}=1$ . So we may assume that $A\neq\varnothing$ . In this case consider the corner $\boldsymbol{v}=(v_{1},\ldots,v_{d})$ defined as follows,

Observe that $Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})>0$ and thus $\boldsymbol{v}\in R_{+}(\boldsymbol{x})$ . Moreover the coordinate $i\in B$ and thus $v_{i}=t_{i}=1$ .

Let $N$ be even. In this case we have that $t_{i}=1-1/(N-1)$ and $s_{i}=1$ . Now partition the coordinates $[d]$ as follows,

If $B=\varnothing$ then notice that for the source vertex $\boldsymbol{s}_{\boldsymbol{c}}$ , $\mathsf{P}_{\boldsymbol{s}_{\boldsymbol{c}}}(\boldsymbol{x})>0$ , since $p_{i}<1$ , by the fact that $x_{i}>1-1/(N-1)$ . Thus the lemma follows since $s_{i}=1$ . In case $B\neq\varnothing$ consider the corner $\boldsymbol{v}=(v_{1},\ldots,v_{d})$ defined as follows,

Observe that $Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})>0$ and thus $\boldsymbol{v}\in R_{+}(\boldsymbol{x})$ . Moreover the coordinate $i\in A$ and thus $v_{i}=s_{i}=1$ .

If we put together the last two cases then this implies the second statement of the lemma.

Appendix D Constructing the Turing Machine – Proof of Theorem 7.6

In this section we prove Theorem 7.6, establishing that both the function $f_{\mathcal{C}_{l}}(\boldsymbol{x},\boldsymbol{y})$ of Definition 7.4 and its gradient are computable by a polynomial-time Turing Machine. We prove Theorem 7.6 through a series of Lemmas. To simplify notation we set $b\triangleq\log 1/\varepsilon$ .

There exist Turing Machines $M_{S_{\infty}}$ , $M_{S^{\prime}_{\infty}}$ that given input $x\in[0,1]$ and $\varepsilon$ in binary form, compute $\left[S_{\infty}(x)\right]_{b}$ and $\left[S^{\prime}_{\infty}(x)\right]_{b}$ in time polynomial in $b=\log(1/\varepsilon)$ and the binary representation of $x$ .

The Turing Machine $M_{S_{\infty}}$ outputs the first $b$ bits of the following quantity,

where $b^{\prime}$ will be selected sufficiently large. Notice it is possible to compute the above quantity due to the fact that all functions $\frac{1}{\gamma}+\frac{1}{\gamma-1}$ , $2^{\gamma}$ and $\frac{1}{1+\gamma}$ can be computed with accuracy $2^{-b^{\prime}}$ in polynomial time with respect to $b^{\prime}$ and the binary representation of $\gamma$ [Bre76]. Moreover,

where the first inequality follows from triangle inequality and the second follows from the facts that $1/(1+\gamma)$ is a $1$ -Lipschitz function of $\gamma$ for $\gamma\geq 0$ , and $1/(1+2^{\gamma})$ is an $\ln(2)$ -Lipschitz function of $\gamma$ for $\gamma\geq 0$ . The last inequality follows from the definition of $\left[\cdot\right]_{b^{\prime}}$ . Hence $W(x)$ is indeed equal to $\left[S_{\infty}(x)\right]_{b}$ if we choose $b^{\prime}=b+2$ .
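The error accounting above can be illustrated with a small Python sketch. It is a floating-point illustration under explicit assumptions, not the Turing Machine itself: the nested form of `w` is reconstructed from the three functions named in the proof ( $\frac{1}{\gamma}+\frac{1}{\gamma-1}$ , $2^{\gamma}$ , $\frac{1}{1+\gamma}$ ) together with the identity $S_{\infty}(x)=1/(1+2^{1/x+1/(x-1)})$ , and `trunc` models $\left[\cdot\right]_{b^{\prime}}$ as keeping $b^{\prime}$ fractional bits. Both names are hypothetical.

```python
import math

def trunc(value, bits):
    """Keep `bits` fractional binary digits of value (rounding toward -infinity)."""
    scale = 2 ** bits
    return math.floor(value * scale) / scale

def s_inf(x):
    """S_inf(x) = 1 / (1 + 2^(1/x + 1/(x-1))) for 0 < x < 1."""
    return 1.0 / (1.0 + 2.0 ** (1.0 / x + 1.0 / (x - 1.0)))

def w(x, b):
    """Nested finite-precision evaluation with working precision b' = b + 2 (assumed form)."""
    bp = b + 2
    gamma = trunc(1.0 / x + 1.0 / (x - 1.0), bp)   # inner rounding
    power = trunc(2.0 ** gamma, bp)                # middle rounding
    return trunc(1.0 / (1.0 + power), bp)          # outer rounding

b = 10
sample_points = (0.1, 0.3, 0.5, 0.7, 0.9)
errors = [abs(w(x, b) - s_inf(x)) for x in sample_points]
```

Each of the three roundings contributes at most a constant multiple of $2^{-b^{\prime}}$ , so with $b^{\prime}=b+2$ the total error stays below $2^{-b}$ at the sample points.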

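Next we explain how $M_{S^{\prime}_{\infty}}$ computes $\left[S^{\prime}_{\infty}(x)\right]_{b}$ . First notice that $S^{\prime}_{\infty}(x)$ is equal to

$\ln 2\cdot\frac{\frac{1}{x^{2}}2^{-\frac{1}{x}+\frac{1}{x-1}}+\frac{1}{(x-1)^{2}}2^{-\frac{1}{x}+\frac{1}{x-1}}}{\left(2^{-\frac{1}{x}}+2^{\frac{1}{x-1}}\right)^{2}}$ .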

To describe how to compute $S^{\prime}_{\infty}(x)$ we first assume that we have computed the following quantities. Then based on these quantities we show how $S^{\prime}_{\infty}(x)$ can be computed and finally we consider the computation of these quantities.

$A\leftarrow\left[\frac{1}{x^{2}}2^{-\frac{1}{x}+\frac{1}{x-1}}\right]_{b^{\prime}}$ ,

$B\leftarrow\left[\frac{1}{(x-1)^{2}}2^{-\frac{1}{x}+\frac{1}{x-1}}\right]_{b^{\prime}}$ ,

$C\leftarrow\left[\left(2^{-\frac{1}{x}}+2^{\frac{1}{x-1}}\right)^{2}\right]_{b^{\prime}}$ .

Then $M_{S^{\prime}_{\infty}}$ outputs the first $b$ bits of the quantity $\left[\left[\ln 2\right]_{b^{\prime}}\cdot\left[\frac{A+B}{C}\right]_{b^{\prime}}\right]_{b^{\prime}}$ . We now prove that

Consider the function $g(\alpha,\beta,\gamma)=\frac{\alpha+\beta}{\gamma}$ where $\left|\alpha\right|,\left|\beta\right|\leq c_{1}$ and $\left|\gamma\right|\geq c_{2}$ where $c_{1},c_{2}$ are universal constants. Notice that $g(\alpha,\beta,\gamma)$ is $c$ -Lipschitz for $c=\sqrt{\frac{2}{c_{2}^{2}}+\frac{2c_{1}}{c_{2}^{2}}}$ . Since for sufficiently large $b^{\prime}$ all the quantities $\left|A\right|,\left|B\right|,\left|\frac{1}{x^{2}}2^{-\frac{1}{x}+\frac{1}{x-1}}\right|,\left|\frac{1}{(x-1)^{2}}2^{-\frac{1}{x}+\frac{1}{x-1}}\right|\leq c_{1}$ and $\left|C\right|,\left(2^{-\frac{1}{x}}+2^{\frac{1}{x-1}}\right)^{2}\geq c_{2}$ where $c_{1},c_{2}$ are universal constants we get that

Now consider the function $g(\alpha,\beta)=\alpha\cdot\beta$ where $\left|\alpha\right|,\left|\beta\right|\leq c$ where $c$ is a universal constant. In this case $g(\alpha,\beta)$ is $\sqrt{2}c$ -Lipschitz continuous. Since for $b^{\prime}$ sufficiently large all the quantities $\left|[\ln 2]_{b^{\prime}}\right|,\left|\left[\frac{A+B}{C}\right]_{b^{\prime}}\right|,\ln 2,\left|\frac{A+B}{C}\right|$ are bounded by a universal constant $c$ , we have that,

Next we explain how the values $A,B$ and $C$ are computed, while $\left[\ln 2\right]_{b^{\prime}}$ can easily be computed via standard techniques [Bre76].
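Before turning to the finite-precision details, the exact (unrounded) expression $\ln 2\cdot\frac{A+B}{C}$ can be compared against a finite-difference approximation of $S^{\prime}_{\infty}$ . The Python sketch below does this in double precision; the step size, the sample points and the function names are illustrative assumptions, and the rounding operators $\left[\cdot\right]_{b^{\prime}}$ are deliberately omitted.

```python
import math

def s_inf(x):
    """S_inf(x) = u / (u + w) with u = 2^(-1/x) and w = 2^(1/(x-1)), for 0 < x < 1."""
    u = 2.0 ** (-1.0 / x)
    w = 2.0 ** (1.0 / (x - 1.0))
    return u / (u + w)

def s_inf_prime(x):
    """ln(2) * (A + B) / C with A, B, C as defined in the proof, without rounding."""
    uw = 2.0 ** (-1.0 / x + 1.0 / (x - 1.0))
    a = uw / x ** 2
    b = uw / (x - 1.0) ** 2
    c = (2.0 ** (-1.0 / x) + 2.0 ** (1.0 / (x - 1.0))) ** 2
    return math.log(2.0) * (a + b) / c

def central_difference(f, x, step=1e-6):
    """Second-order finite-difference approximation of f'(x)."""
    return (f(x + step) - f(x - step)) / (2.0 * step)

sample_points = (0.2, 0.35, 0.5, 0.65, 0.8)
max_gap = max(
    abs(s_inf_prime(x) - central_difference(s_inf, x)) for x in sample_points
)
```

At $x=1/2$ the closed form evaluates to $2\ln 2$ , which gives a convenient spot check.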

Computation of $\boldsymbol{A}$ . The Turing Machine $M_{S^{\prime}_{\infty}}$ will compute $A$ by taking the first $b^{\prime}$ bits of the following quantity,

where $b^{\prime\prime}$ will be taken sufficiently large. We remark that both the exponentiation and the natural logarithm can be computed in polynomial time with respect to the number of accuracy bits and the binary representation of the input [Bre76]. The function $\frac{1}{x^{2}}2^{-\frac{1}{x}+\frac{1}{x-1}}=2^{-\frac{1}{x}+\frac{1}{x-1}-2\ln x/\ln 2}$ is $c$ -Lipschitz where $c$ is a universal constant. Thus,

Computation of $\boldsymbol{B}$ . Using the same arguments as for $A$ .

Computation of $\boldsymbol{C}$ . To compute $C$ we first compute $b^{\prime\prime}$ bits of the following quantity,

The latter follows by applying the triangle inequality and the following $3$ inequalities.

this holds since for $b^{\prime\prime}>1$ we have

are both upper-bounded by $2$ while the function $g(\alpha)=\alpha^{2}$ is $4$ -Lipschitz for $\left|\alpha\right|\leq 2$ .

The latter follows since for $b^{\prime\prime}$ larger than a universal constant, both $\left[2^{-\left[\frac{1}{x}\right]_{b^{\prime\prime}}}\right]_{b^{\prime\prime}}+\left[2^{\left[\frac{1}{x-1}\right]_{b^{\prime\prime}}}\right]_{b^{\prime\prime}}$ and $2^{-\left[\frac{1}{x}\right]_{b^{\prime\prime}}}+2^{\left[\frac{1}{x-1}\right]_{b^{\prime\prime}}}$ are greater than a universal constant $c$ , while the function $g(\alpha,\beta)=1/(\alpha+\beta)^{2}$ is $\Theta\left(1/c^{3}\right)$ -Lipschitz for $\alpha+\beta\geq c$ .

The latter follows since for $b^{\prime\prime}$ larger than a universal constant it holds that both the quantities in the left hand side are greater than a positive universal constant $c$ , while the function $g(\alpha,\beta)=1/(2^{-\alpha}+2^{\beta})$ for $2^{-\alpha}+2^{\beta}\geq c$ , $\alpha\geq 0$ , and $\beta\leq 0$ is $\Theta\left(1/c^{3}\right)$ -Lipschitz.

There exist Turing Machines $M_{Q}$ and $M_{Q^{\prime}}$ that given $\boldsymbol{x}\in[0,1]^{d}$ and $\varepsilon>0$ in binary form, respectively compute $\left[Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})\right]_{b}$ and $\left[\nabla Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})\right]_{b}$ for all vertices $\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}$ with $Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})>0$ , where $b=\log(1/\varepsilon)$ . There are at most $d+1$ such vertices. Moreover both $M_{Q}$ and $M_{Q^{\prime}}$ run in polynomial time with respect to $b$ , $d$ and the binary representation of $\boldsymbol{x}$ .

Both $M_{Q}$ and $M_{Q^{\prime}}$ first compute the canonical representation $p_{\boldsymbol{x}}^{\boldsymbol{c}}\in^{d}$ with respect to the cell $R(\boldsymbol{x})$ in which $\boldsymbol{x}$ lies. Such a cell $R(\boldsymbol{x})$ can be computed by taking the first $(\log N+1)$ bits of each coordinate of $\boldsymbol{x}$. The source vertex $\boldsymbol{s}^{\boldsymbol{c}}=(s_{1},\ldots,s_{d})$ and the target vertex $\boldsymbol{t}^{\boldsymbol{c}}=(t_{1},\ldots,t_{d})$ with respect to $R(\boldsymbol{x})$ are also computed. Once this is done, we are only interested in vertices $\boldsymbol{v}\in R_{\boldsymbol{c}}(\boldsymbol{x})$ for which

since for all the other $\boldsymbol{v}\in\left(\left[N\right]-1\right)^{d}$ both $Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=0$ and $\nabla Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})=0$. These vertices, which are denoted by $R_{+}(\boldsymbol{x})$, are at most $d+1$ and can be computed in polynomial time.

The vertices $\boldsymbol{v}\in R_{+}(\boldsymbol{x})$ can be computed in polynomial time as follows: (i) the coordinates $p_{1},\ldots,p_{d}$ are sorted in increasing order; (ii) for each $m=0,\ldots,d$, compute the vertex $\boldsymbol{v}^{m}\in R_{\boldsymbol{c}}(\boldsymbol{x})$,

By Definition 8.7 it immediately follows that $R_{+}(\boldsymbol{x})\subseteq\bigcup_{m=0}^{d}\{\boldsymbol{v}^{m}\}$, which also establishes that $\left|R_{+}(\boldsymbol{x})\right|\leq d+1$.
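The two-step procedure for enumerating the candidate vertices $\boldsymbol{v}^{0},\ldots,\boldsymbol{v}^{d}$ can be sketched in a few lines. The defining formula for $\boldsymbol{v}^{m}$ is not reproduced in this text, so the sketch below makes an explicit assumption: $\boldsymbol{v}^{m}$ takes the source coordinate $s_{j}$ on the $m$ coordinates with the smallest values of $p_{j}$ and the target coordinate $t_{j}$ on the rest. The actual convention is fixed by Definition 8.7 and may differ. The function name `candidate_vertices` is hypothetical.

```python
def candidate_vertices(p, s, t):
    """Enumerate d+1 candidate vertices from a canonical representation p.

    ASSUMPTION (not taken from the paper): v^m uses the source coordinate s[j]
    on the m coordinates with the smallest p[j], and the target coordinate t[j]
    on all remaining coordinates.
    """
    d = len(p)
    # Step (i): sort coordinate indices by increasing value of p.
    order = sorted(range(d), key=lambda j: p[j])
    vertices = []
    # Step (ii): one vertex for each threshold m = 0, ..., d.
    for m in range(d + 1):
        low = set(order[:m])  # indices of the m smallest coordinates
        vertices.append(tuple(s[j] if j in low else t[j] for j in range(d)))
    return vertices


# Example in d = 3 with source (0, 0, 0) and target (1, 1, 1).
example = candidate_vertices([0.2, 0.7, 0.5], [0, 0, 0], [1, 1, 1])
```

Sorting dominates the cost, so the enumeration runs in $O(d\log d)$ comparisons plus $O(d^{2})$ work to write out the $d+1$ vertices, which is consistent with the polynomial-time claim.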

where $b^{\prime}$ is selected sufficiently large. We next prove that this computation indeed outputs $\left[Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})\right]_{b}$ accurately.

We can now use the above inequality to bound $\left|\left[\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial p_{i}}\right]_{b^{\prime}}-\frac{\partial Q_{\boldsymbol{v}}^{\boldsymbol{c}}(\boldsymbol{x})}{\partial p_{i}}\right|$ . More precisely,

Thus the analysis is completed by selecting $b^{\prime}=b+\Theta(\log d)+\Theta(\log N)$. ∎

For accuracy $b^{\prime}\geq\Theta(d^{2}\log d)$ we get that,

Consider the function $g(\boldsymbol{y})=y_{i}/(\sum_{j=1}^{d+1}y_{j})$. Notice that if $\boldsymbol{y}\in^{d+1}$ and $\sum_{j=1}^{d+1}y_{j}\geq\mu$, then $\left\|\nabla g(\boldsymbol{y})\right\|_{2}\leq\Theta(d^{3/2}/\mu^{2})$. The latter implies that for $\boldsymbol{y},\boldsymbol{z}\in^{d+1}$ such that $\sum_{j=1}^{d+1}y_{j}\geq\mu$ and $\sum_{j=1}^{d+1}z_{j}\geq\mu$, it holds that

Since there are at most $d+1$ vertices $\boldsymbol{v}^{\prime}\in R_{+}(\boldsymbol{x})$ while both the term $\sum_{\boldsymbol{v}^{\prime}\in R_{+}(\boldsymbol{x})}\left[Q_{\boldsymbol{v}^{\prime}}^{\boldsymbol{c}}(\boldsymbol{x})\right]_{b^{\prime}}$ and the term $\sum_{\boldsymbol{v}^{\prime}\in R_{+}(\boldsymbol{x})}Q_{\boldsymbol{v}^{\prime}}^{\boldsymbol{c}}(\boldsymbol{x})$ are greater than $\Theta\left((1/d)^{d^{2}}\right)$ , we can apply the above inequality with $\mu=\Theta\left((1/d)^{d^{2}}\right)$ and we get the following

The proof is completed by selecting $b^{\prime}=b+\Theta(d^{2}\log d)$.
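As a supplementary check, the gradient bound for $g(\boldsymbol{y})=y_{i}/(\sum_{j=1}^{d+1}y_{j})$ used in this argument can be derived directly. The sketch below assumes that every coordinate satisfies $0\leq y_{j}\leq 1$ (an assumption made here for illustration, since the domain of $\boldsymbol{y}$ is not legible in this text) and writes $S=\sum_{j=1}^{d+1}y_{j}$, with $\delta_{ik}$ the Kronecker delta:

```latex
\[
\frac{\partial g}{\partial y_{k}}(\boldsymbol{y})
  = \frac{\delta_{ik}\,S-y_{i}}{S^{2}},
\qquad
\left|\delta_{ik}\,S-y_{i}\right|\leq S\leq d+1,
\qquad
S^{2}\geq\mu^{2},
\]
\[
\text{hence}\quad
\left|\frac{\partial g}{\partial y_{k}}(\boldsymbol{y})\right|\leq\frac{d+1}{\mu^{2}}
\quad\text{and}\quad
\left\|\nabla g(\boldsymbol{y})\right\|_{2}
  \leq\sqrt{d+1}\cdot\frac{d+1}{\mu^{2}}
  =\Theta\!\left(\frac{d^{3/2}}{\mu^{2}}\right).
\]
```

Because the set of such $\boldsymbol{y}$ with $S\geq\mu$ is convex, the mean value theorem converts this into a Lipschitz estimate of the same order between any two points $\boldsymbol{y},\boldsymbol{z}$ of that set.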

We next prove that the above computation is correct.

Setting $b^{\prime}=b+\Theta\left(\log d\right)$ we get the desired result. Similarly for $\frac{\partial f_{\mathcal{C}_{l}(\boldsymbol{x},\boldsymbol{y})}}{\partial x_{i}}$ and $\frac{\partial f_{\mathcal{C}_{l}(\boldsymbol{x},\boldsymbol{y})}}{\partial y_{i}}$ . ∎

Appendix E Convergence of PGD to Approximate Local Minimum

where $\eta=1/L$ and $\boldsymbol{x}^{\star}$ is a global minimum of $f$ .
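The Projected Gradient Descent iteration with step size $\eta=1/L$ can be illustrated with a minimal sketch. Several choices below are assumptions made only for illustration and are not taken from the paper: the feasible set is the box $[0,1]^{d}$ (so the projection is coordinate-wise clipping), the stationarity measure is $L\left\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t}\right\|_{2}$, and the objective `f` is a hypothetical $4$-smooth quadratic. The names `project_box`, `pgd` and `step_norms` are likewise hypothetical.

```python
import math


def project_box(x, lo=0.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^d (coordinate-wise clipping)."""
    return [min(max(xi, lo), hi) for xi in x]


def pgd(grad, x0, L, T, lo=0.0, hi=1.0):
    """Run T steps of x_{t+1} = Proj(x_t - (1/L) * grad(x_t)); return all iterates."""
    eta = 1.0 / L
    xs = [project_box(x0, lo, hi)]
    for _ in range(T):
        x = xs[-1]
        g = grad(x)
        xs.append(project_box([xi - eta * gi for xi, gi in zip(x, g)], lo, hi))
    return xs


def step_norms(xs, L):
    """L * ||x_{t+1} - x_t||_2 for every step (assumed stationarity measure)."""
    return [L * math.dist(a, b) for a, b in zip(xs, xs[1:])]


# Hypothetical objective: Hessian diag(4, 1), hence L-smooth with L = 4.
def f(x):
    return 2.0 * (x[0] - 0.3) ** 2 + 0.5 * (x[1] - 1.5) ** 2


def grad_f(x):
    return [4.0 * (x[0] - 0.3), x[1] - 1.5]


L, eps = 4.0, 0.1
x0 = [1.0, 0.0]
f_star = f([0.3, 1.0])  # minimum of f over the box [0, 1]^2
T = math.ceil(2 * L * (f(x0) - f_star) / eps ** 2)  # iteration budget from the text
xs = pgd(grad_f, x0, L, T)
best = min(step_norms(xs, L))  # smallest stationarity measure along the run
```

With this budget $T$, the smallest value of $L\left\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t}\right\|_{2}$ along the run should fall below $\varepsilon$, which matches the shape of the bound $T\geq 2L\left(f(\boldsymbol{x}_{0})-f(\boldsymbol{x}^{\star})\right)/\varepsilon^{2}$ stated in this appendix.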

If we run the Projected Gradient Descent algorithm on $f$ then we have

then due to the $L$ -smoothness of $f$ we have that

We can now apply Theorem 1.5.5 (b) of [FP07] to get that

If we sum all the above inequalities and divide by $T$, then we get

Therefore for $T\geq\frac{2L\left(f(\boldsymbol{x}_{0})-f(\boldsymbol{x}^{\star})\right)}{\varepsilon^{2}}$ we have that