The Complexity of Constrained Min-Max Optimization
Constantinos Daskalakis, Stratis Skoulakis, Manolis Zampetakis
Introduction
Min-Max Optimization has played a central role in the development of Game Theory [vN28], Convex Optimization [Dan51, Adl13], and Online Learning [Bla56, CBL06, SS12, BCB12, SSBD14, Haz16]. In its general constrained form, it can be written down as follows:
The goal in (1.1) is to find a feasible pair , i.e., , that satisfies the following
Unfortunately, our ability to solve Problem (1.1) remains rather poor in settings where our objective function is not convex-concave. This is emerging as a major challenge in Deep Learning, where min-max optimization has recently found many important applications, such as training Generative Adversarial Networks (see e.g. [GPM+14, ACB17]), and robustifying deep neural network-based models against adversarial attacks (see e.g. [MMS+18]). These applications are indicative of a broader deep learning paradigm wherein robustness properties of a deep learning system are tested and enforced by another deep learning system. In these applications, it is very common to encounter min-max problems with objectives that are nonconvex-nonconcave, and thus evade treatment by the classical algorithmic toolkit targeting convex-concave objectives.
Indeed, the optimization challenges posed by objectives that are nonconvex-nonconcave are not just theoretical frustration. Practical experience with first-order methods is rife with frustration as well. A common experience is that the training dynamics of first-order methods is unstable, oscillatory or divergent, and the quality of the points encountered in the course of training can be poor; see e.g. [Goo16, MPPSD16, DISZ18, MGN18, DP18, MR18, MPP18, ADLH19]. This experience is in stark contrast to minimization (resp. maximization) problems, where even for nonconvex (resp. nonconcave) objectives, first-order methods have been found to efficiently converge to approximate local optima or stationary points (see e.g. [AAZB+17, JGN+17, LPP+19]), while practical methods such Stochastic Gradient Descent, Adagrad, and Adam [DHS11, KB14, RKK18] are driving much of the recent progress in Deep Learning.
The goal of this paper is to shed light on the complexity of min-max optimization problems, and elucidate its difference to minimization and maximization problems—as far as the latter is concerned without loss of generality we focus on minimization problems, as maximization problems behave exactly the same; we will also think of minimization problems in the framework of (1.1), where the variable is absent, that is . An important driver of our comparison between min-max optimization and minimization is, of course, the nature of the objective. So let us discuss:
Convex-Concave Objective. The benign setting for min-max optimization is that where the objective function is convex-concave, while the benign setting for minimization is that where the objective function is convex. In their corresponding benign settings, the two problems behave quite similarly from a computational perspective in that they are amenable to convex programming, as well as first-order methods which only require gradient information about the objective function. Moreover, in their benign settings, both problems have guaranteed existence of a solution under compactness of the constraint set. Finally, it is clear how to define approximate solutions. We just relax the inequalities on the left hand side of (1.2) and (1.3) by some .
Nonconvex-Nonconcave Objective. By contrapositive, the challenging setting for min-max optimization is that where the objective is not convex-concave, while the challenging setting for minimization is that where the objective is not convex. In these challenging settings, the behavior of the two problems diverges significantly. The first difference is that, while a solution to a minimization problem is still guaranteed to exist under compactness of the constraint set even when the objective is not convex, a solution to a min-max problem is not guaranteed to exist when the objective is not convex-concave, even under compactness of the constrained set. A trivial example is this: . Unsurprisingly, we show that checking whether a min-max optimization problem has a solution is -hard. In fact, we show that checking whether there is an approximate min-max solution is -hard, even when the function is Lispchitz and smooth and the desired approximation error is an absolute constant (see Theorem 10.1).
Since min-max solutions may not exist, what could we plausibly hope to compute? There are two obvious targets:
approximate stationary points of , as considered e.g. by [ALW19]; and
some type of approximate local min-max solution.
Unfortunately, as far as (I) is concerned, it is still possible that (even approximate) stationary points may not exist, and we show that checking if there is one is -hard, even when the constraint set is , the objective has Lipschitzness and smoothness polynomial in , and the desired approximation is an absolute constant (Theorem 4.1). So we focus on (II), i.e. (approximate) local min-max solutions. Several kinds of those have been proposed in the literature [DP18, MR18, JNJ19]. We consider a generalization of the concept of local min-max equilibria, proposed in [DP18, MR18], that also accommodates approximation.
Given , as above, and , some point is an -local min-max solution of (1.1), or a -local min-max equilibrium, if it is feasible, i.e. , and satisfies:
In words, is an -local min-max equilibrium, whenever the min player cannot update to a feasible point within of to reduce by at least , and symmetrically the max player cannot change locally to increase by at least .
We show that the existence and complexity of computing such approximate local min-max equilibria depends on the relationship of and with the smoothness, , and the Lipschitzness, , of the objective function . We distinguish the following regimes, also shown in Figure 1 together with a summary of our associated results.
Trivial Regime. This occurs when . This regime is trivial because the -Lipschitzness of guarantees that all feasible points are -local min-max solutions.
Local Regime. This occurs when , and it represents the interesting regime for min-max optimization. In this regime, we use the smoothness of to show that -local min-max solutions always exist. Indeed, we show (Theorem 5.1) that computing them is computationally equivalent to the following variant of (I) which is more suitable for the constrained setting:
(approximate) fixed points of the projected gradient descent-ascent dynamics (Section 3.3).
We show via an application of Brouwer’s fixed point theorem to the iteration map of the projected gradient descent-ascent dynamics that (I)’ are guaranteed to exist. In fact, not only do they exist, but computing them is in , as can be shown by bounding the Lipschitzness of the projected gradient descent-ascent dynamics (Theorem 5.2).
Global Regime. This occurs when is comparable to the diameter of the constraint set. In this case, the existence of -local min-max solutions is not guaranteed, and determining their existence is -hard, even if is an absolute constant (Theorem 10.1).
The main results of this paper, summarized in Figure 1, are to characterize the complexity of computing local min-max solutions in the local regime. Our first main theorem is the following:
Computing -local min-max solutions of Lipschitz and smooth objectives over convex compact domains in the local regime is -complete. The hardness holds even when the constraint set is a polytope that is a subset of , the objective takes values in $1/\varepsilon1/\delta\alpha\mathsf{PPAD}^{d}[-d,d]1/\alpha$ are polynomial in the dimension.
For the above complexity result we assume that we have “white box” access to the objective function. An important byproduct of our proof, however, is to also establish an unconditional hardness result in the Nemirovsky-Yudin [NY83] oracle optimization model, wherein we are given black-box access to oracles computing the objective function and its gradient. Our second main result is informally stated in Informal Theorem 2.
Assume that we have black-box access to an oracle computing a -Lipschitz and -smooth objective function , where is a known polytope, and its gradient . Then, computing an -local min-max solution in the local regime (i.e., when ) requires a number of oracle queries that is exponential in at least one of the following: , , , or . In fact, exponential in -many queries are required even when , , and are all polynomial in .
Importantly, the above lower bounds, in both the white-box and the black-box setting, come in sharp contrast to minimization problems, given that finding approximate local minima of smooth non-convex objectives ranging in in the local regime can be done using first-order methods using time/queries (see Section E). Our results are the first to show an exponential separation between these two fundamental problems in optimization in the black-box setting, and a super-polynomial separation in the white-box setting assuming .
We very briefly outline some of the main ideas for the -hardness proof that we present in Sections 6 and 7. Our starting point as in many -hardness results is a discrete analog of the problem of finding Brouwer fixed points of a continuous map. Departing from previous work, however, we do not use Sperner’s lemma as the discrete analog of Brouwer’s fixed point theorem. Instead, we define a new problem, called BiSperner, which is useful for showing our hardness results. BiSperner is closely related to the problem of finding panchromatic simplices guaranteed by Sperner’s lemma except, roughly speaking, that the vertices of the simplicization of a -dimensional hypercube are colored with rather than colors, every point of the simplicization is colored with colors rather than one, and we are seeking a vertex of the simplicization so that the union of colors on the vertices in its neighborhood covers the full set of colors. The first step of our proof is to show that BiSperner is -hard. This step follows from the hardness of computing Brouwer fixed points.
The step that we describe next is only implicitly done by our proof, but it serves as useful intuition for reading and understanding it. We want to define a discrete two-player zero-sum game whose local equilibrium points correspond to solutions of a given BiSperner instance. Our two players, called “minimizer” and “maximizer,” each choose a vertex of the simplicization of the BiSperner instance. For every pair of strategies in our discrete game, i.e. vertices, chosen by our players, we define a function value and gradient values. Note that, at this point, we treat these values at different vertices of the simplicization as independent choices, i.e. are not defining a function over the continuum whose function values and gradient values are consistent with these choices. It is our intention, however, that in the continuous two-player zero-sum game that we obtain in the next paragraph via our interpolation scheme, wherein the minimizer and maximizer may choose any point in the continuous hypercube, the function value determines the payment of the minimizer to the maximizer, and the gradient value determines the direction of the best-response dynamics of the game. Before getting to that continuous game in the next paragraph, the main technical step of this discrete part of our construction is showing that every local equilibrium of the discrete game corresponds to a solution of the BiSperner instance we are reducing from. In order to achieve this we need to add some constraints to couple the strategies of the minimizer and the maximizer player. This step is the reason that the constraints appear in the final min-max problem that we produce.
The third and quite challenging step of the proof is to show that we can interpolate in a smooth and computationally efficient way the discrete zero-sum game of the previous step. In low dimensions (treated in Section 6) such smooth and efficient interpolation can be done in a relatively simple way using single-dimensional smooth step functions. In high dimensions, however, the smooth and efficient interpolation becomes a challenging problem and to the best of our knowledge no simple solution exists. For this reason we construct our novel smooth and efficient interpolation coefficients of Section 8. These are a technically involved construction that we believe will prove to be very useful for characterizing the complexity of approximate solutions of other optimization problems.
The last part of our proof is to show that all the previous steps can be implemented in an efficient way both with respect to computational but also with respect to query complexity. This part is essential for both our white-box and black-box results. Although this seems like a relatively easy step, it becomes more difficult due to the complicated expressions in our smooth and efficient interpolation coefficients used in our previous step.
Closing this section we mention that all our -hardness results are proven using a cute application of Lovász Local Lemma [EL73], which provides a powerful rounding tool that can drive the inapproximability all the way up to an absolute constant.
2 Local Minimization vs Local Min-Max Optimization
Because our proof is convoluted, involving multiple steps, it is difficult to discern from it why finding local min-max solutions is so much harder than finding local minima. For this reason, we illustrate in this section a fundamental difference between local minimization and local min-max optimization. This provides good intuition about why our hardness construction would fail if we tried to apply it to prove hardness results for finding local minima (which we know don’t exist).
Ultimately a key difference between min-min and min-max optimization is that best-response paths in min-max optimization problems can be closed, i.e., can form a cycle, as shown in Figure 2, Panel (b). On the other hand, this is impossible in min-min problems as the function value must monotonically decrease along best-response paths, thus cycles may not exist.
The above discussion offers qualitative differences between min-min and min-max optimization, which lie in the heart of why our computational intractability results are possible to prove for min-max but not min-min problems. For the precise step in our construction that breaks if we were to switch from a min-max to a min-min problem we refer the reader to Remark 6.9.
3 Further Related Work
There is a broad literature on the complexity of equilibrium computation. Virtually all these results are obtained within the computational complexity formalism of total search problems in , which was spearheaded by [JPY88, MP89, Pap94b] to capture the complexity of search problems that are guaranteed to have a solution. Some key complexity classes in this landscape are shown in Figure 3. We give a non-exhaustive list of intractability results for equilibrium computation: [FPT04] prove that computing pure Nash equilibria in congestion games is -complete; [DGP09] and later [CDT09] show that computing approximate Nash equilibria in normal-form games is -complete; [EY10] study the complexity of computing exact Nash equilibria (which may use irrational probabilities), introducing the complexity class ;
[VY11, CPY17] consider the complexity of computing Market equilibria; [Das13, Rub15, Rub16] consider the complexity of computing approximate Nash equilibria of constant approximation; [KM18] establish a connection between approximate Nash equilibrium computation and the SoS hierarchy; [Meh14, DFS20] study the complexity of computing Nash equilibria in specially structured games. A result that is particularly useful for our work is the result of [HPV89] which shows black-box query lower bounds for computing Brouwer fixed points of a continuous function. We use this result in Section 9 as an ingredient for proving our black-box lower bounds for computing approximate local min-max solutions.
Beyond equilibrium computation and its applications to Economics and Game Theory, the study of total search problems has found profound connections to many scientific fields, including continuous optimization [DP11, DTZ18], combinatorial optimization [SY91], query complexity [BCE+95], topology [GH19], topological combinatorics and social choice theory [FG18, FG19, FRHSZ20b, FRHSZ20a], algebraic combinatorics [BIQ+17, GKSZ19], and cryptography [Jeř16, BPR15, SZZ18]. For a more extensive overview of total search problems we refer the reader to the recent survey by Daskalakis [Das18].
As already discussed, min-max optimization has intimate connections to the foundations of Game Theory, Mathematical Programming, Online Learning, Statistics, and several other fields. Recent applications of min-max optimization to Machine Learning, such as Generative Adversarial Networks and Adversarial Training, have motivated a slew of recent work targeting first-order (or other light-weight online learning) methods for solving min-max optimization problems for convex-concave, nonconvex-concave, as well as nonconvex-nonconcave objectives. Work on convex-concave and nonconvex-concave objectives has focused on obtaining online learning methods with improved rates [KM19, LJJ19, TJNO19, NSH+19, LTHC19, OX19, Zha19, ADSG19, AMLJG20, GPDO20, LJJ20] and last-iterate convergence guarantees [DISZ18, DP18, MR18, MPP18, RLLY18, HA18, ADLH19, DP19, LS19, GHP+19, MOP19, ALW19], while work on nonconvex-nonconcave problems has focused on identifying different notions of local min-max solutions [JNJ19, MV20] and studying the existence and (local) convergence properties of learning methods at these points [WZB19, MV20, MSV20].
Preliminaries
We study optimization problems involving real-valued functions, considering two access models to such functions.
Black Box Model. In this model we are given access to an oracle such that given a point the oracle returns the values and . In this model we assume that we can perform real number arithmetic operations. This is the traditional model used to prove lower bounds in Optimization and Machine Learning [NY83].
Promise Problems. To simplify the exposition of our paper, make the definitions of our computational problems and theorem statements clearer, and make our intractability results stronger, we choose to enforce the following constraints on our function access, or , as a promise, rather than enforcing these constraints in some syntactic manner.
Consistency of Function Values and Gradient Values. Given some oracle or Turing machine , it is difficult to determine by querying the oracle or examining the description of the Turing machine whether the function and gradient values output on different inputs are consistent with some differentiable function. In all our computational problems, we will only consider instances where this is promised to be the case. Moreover, for all our computational hardness results, the instances of the problems arising from our reductions satisfy these constraints, which are guaranteed syntactically by our reduction.
Lipschitzness, Smoothness and Boundedness. Similarly, given some oracle or Turing machine , it is difficult to determine, by querying the oracle or examining the description of the Turing machine, whether the function and gradient values output by or are consistent with some Lipschitz, smooth and bounded function with some prescribed Lipschitzness, smoothness, and bound on its absolute value. In all our computational problems, we only consider instances where the -Lipschitzness, -smoothness and -boundedness of the function are promised to hold for the prescribed, in the input of the problem, parameters , and . Moreover, for all our computational hardness results, the instances of the problems arising from our reductions satisfy this constraint, which is guaranteed syntactically by our reduction.
In summary, in the rest of this paper, whenever we prove an upper bound for some computational problem, namely an upper bound on the number of steps or queries to the function oracle required to solve the problem in the black-box model, or the containment of the problem in some complexity class in the white-box model, we assume that the afore-described properties are satisfied by the or provided in the input. On the other hand, whenever we prove a lower bound for some computational problem, namely a lower bound on the number of steps/queries required to solve it in the black-box model, or its hardness for some complexity class in the white-box model, the instances arising in our lower bounds are guaranteed to satisfy the above properties syntactically by our constructions. As such, our hardness results will not exploit the difficulty in checking whether or satisfy the above constraints in order to infuse computational complexity into our problems, but will faithfully target the computational problems pertaining to min-max optimization of smooth and Lipschitz objectives that we aim to understand in this paper.
1 Complexity Classes and Reductions
In this section we define the main complexity classes that we use in this paper, namely , and , as well as the notion of reduction used to show containment or hardness of a problem for one of these complexity classes.
To define the complexity class we first define the notion of polynomial-time reductions between search problemsIn this paper we only define and consider Karp-reductions between search problems., and the computational problem End-of-a-LineThis problem is sometimes called End-of-the-Line, but we adopt the nomenclature proposed by [Rub16] since we agree that it describes the problem better..
A search problem is polynomial-time reducible to a search problem if there exist polynomial-time computable functions and with the following properties: (i) if is an input to , then is an input to ; and (ii) if is a solution to on input , then is a solution to on input .
To make sense of the above definition, we envision that the circuits and implicitly define a directed graph, with vertex set , such that the directed edge belongs to the graph if and only if and . As such, all vertices in the implicitly defined graph have in-degree and out-degree at most . The above problem permits an output of if has equal in-degree and out-degree in this graph. Otherwise it permits an output such that has in-degree or out-degree equal to . It follows by the parity argument on directed graphs, namely that in every directed graph the sum of in-degrees equals the sum of out-degrees, that End-of-a-Line is a total problem, i.e. that for any possible binary circuits and there exists a solution of the “0.” kind or the “1.” kind in the definition of our problem (or both). Indeed, if has unequal in- and out-degrees, there must exist another vertex with unequal in- and out-degrees, thus one of these degrees must be (as all vertices in the graph have in- and out-degrees bounded by ).
We are finally ready to define the complexity class introduced by [Pap94b].
The complexity class contains all search problems that are polynomial time reducible to the End-of-a-Line problem.
The complexity class is of particular importance, since it contains lots of fundamental problems in Game Theory, Economics, Topology and several other fields [DGP09, Das18]. A particularly important -complete problem is finding fixed points of continuous functions, whose existence is guaranteed by Brouwer’s fixed point theorem.
While not stated exactly in this form, the following is a straightforward implication of the results presented in [CDT09].
Computational Problems of Interest
In this section, we define the computational problems that we study in this paper and discuss our main results, postponing formal statements to Section 4. We start in Section 3.1 by defining the mathematical objects of our study, and proceed in Section 3.2 to define our main computational problems, namely: (1) finding approximate stationary points; (2) finding approximate local minima; and (3) finding approximate local min-max equilibria. In Section 3.3, we present some bonus problems, which are intimately related, as we will see, to problems (2) and (3). As discussed in Section 2, for ease of presentation, we define our problems as promise problems.
We define the concepts of stationary points, local minima, and local min-max equilibria of real valued functions, and make some remarks about their existence, as well as their computational complexity. The formal discussion of the latter is postponed to Sections 3.2 and 4.
Now that we have defined the domain of the real-valued functions that we consider in this paper we are ready to define a notion of approximate stationary points.
It is easy to see that there exist continuously differentiable functions that do not have any (approximate) stationary points, e.g. linear functions. As we will see later in this paper, deciding whether a given function has a stationary point is -hard and, in fact, it is even -hard to decide whether a function has an approximate stationary point of a very gross approximation. At the same time, verifying whether a given point is (approximately) stationary can be done efficiently given access to a polynomial-time Turing machine that computes , so the problem of deciding whether an (approximate) stationary point exists lies in , as long as we can guarantee that, if there is such a point, there will also be one with polynomial bit complexity. We postpone a formal discussion of the computational complexity of finding (approximate) stationary points or deciding their existence until we have formally defined our corresponding computational problem and settled the bit complexity of its solutions.
For the definition of local minima and local min-max equilibria we need the notion of closed -dimensional Euclidean balls.
To be clear, using the term “local minimum” in Definition 3.5 is a bit of a misnomer, since for large enough values of the definition captures global minima as well. As ranges from large to small, our notion of -local minimum transitions from being an -globally optimal point to being an -locally optimal point. Importantly, unlike (approximate) stationary points, a -local minimum is guaranteed to exist for all due to the compactness of and the continuity of . Thus the problem of finding an -local minimum is total for arbitrary values of and . On the negative side, for arbitrary values of and , there is no polynomial-size and polynomial-time verifiable witness for certifying that a point is an -local minimum. Thus the problem of finding an -local minimum is not known to lie in . As we will see in Section 4, this issue can be circumvented if we focus on particular settings of and , in relationship to the Lipschitzness and smoothness of and the dimension .
Finally we define -local min-max equilibrium as follows, recasting Definition 1.1 to the constraint set .
for every with ; and
for every with .
Similarly to Definition 3.5, for large enough values of , Definition 3.6 captures global min-max equilibria as well. As ranges from large to small, our notion of -local min-max equilibrium transitions from being an -approximate min-max equilibrium to being an -approximate local min-max equilibrium. Moreover, in comparison to local minima and stationary points, the problem of finding an -local min-max equilibrium is neither total nor can its solutions be verified efficiently for all values of and , even when . Again, this issue can be circumvented if we focus on particular settings of and values, as we will see in Section 4.
2 First-Order Local Optimization Computational Problems
In this section, we define the search problems associated with our aforementioned definitions of approximate stationary points, local minima, and local min-max equilibria. We state our problems in terms of white-box access to the function and its gradient. Switching to the black-box variants of our computational problems amounts to simply replacing the Turing machines provided in the input of the problems with oracle access to the function and its gradient, as discussed in Section 2. As per our discussion in the same section, we define our computational problems as promise problems, the promise being that the Turing machine (or oracle) provided in the input to our problems outputs function values and gradient values that are consistent with a smooth and Lipschitz function with the prescribed in the input smoothness and Lipschitzness. Besides making the presentation cleaner, as we discussed in Section 2, the motivation for doing so is to prevent the possibility that computational complexity is tacked into our problems due to the possibility that the Turing machines/oracles provided in the input do not output function and gradient values that are consistent with a Lipschitz and smooth function. Importantly, all our computational hardness results syntactically guarantee that the Turing machines/oracles provided as input to our constructed hard instances satisfy these constraints.
Before stating our main computational problems below, we note that, for each problem, the dimension (in unary representation) is also an implicit input, as the description of the Turing machine (or the interface to the oracle in the black-box counterpart of each problem below) has size at least linear in . We also refer to Remark 2.6 for how we may formally study complexity problems that take a polynomial-time Turing Machine in their input.
It is easy to see that StationaryPoint lies in . Indeed, if there exists some point such that , then by the -smoothness of there must exist some point of bit complexity polynomial in the size of the input such that . On the other hand, it is clear that no such point exists if for all , . We note that the looseness of the output requirement in our problem for functions that do not have points such that but do have points such that is introduced for the sole purpose of making the problem lie in , as otherwise we would not be able to guarantee that the solutions to our search problem have polynomial bit complexity. As we show in Section 4, StationaryPoint is also -hard, even when is a constant, the constraint set is very simple, namely , and are both polynomial in .
Next, we define the computational problems associated with local minimum and local min-max equilibrium. Recall that the first is guaranteed to have a solution, because, in particular, a global minimum exists due to the continuity of and the compactness of .
3 Bonus Problems: Fixed Points of Gradient Descent/Gradient Descent-Ascent
Next we present a couple of bonus problems, GDFixedPoint and GDAFixedPoint, which respectively capture the computation of fixed points of the (projected) gradient descent and the (projected) gradient descent-ascent dynamics, with learning rate . As we see in Section 5, these problems are intimately related, indeed equivalent under polynomial-time reductions, to problems LocalMin and LocalMinMax respectively, in certain regimes of the approximation parameters. Before stating problems GDFixedPoint and GDAFixedPoint, we define the mappings and whose fixed points these problems are targeting.
for all , where and .
Note that is called “unsafe” because the projection happens individually for and , thus may not lie in . We also define the “safe” version , which projects the pair jointly onto . As we show in Section 5 (in particular inside the proof of Theorem 5.2), computing fixed points of and are computationally equivalent so we stick to which makes the presentation slightly cleaner.
We are now ready to define GDFixedPoint and GDAFixedPoint. As per earlier discussions, we define these computational problems as promise problems, the promise being that the Turing machine provided in the input to these problems outputs function values and gradient values that are consistent with a smooth and Lipschitz function with the prescribed, in the input to these problems, smoothness and Lipschitzness.
Summary of Results
In this section we summarize our results for the optimization problems that we defined in the previous section. We start with our theorem about the complexity of finding approximate stationary points, which we show to be -complete even for large values of the approximation.
The computational problem StationaryPoint is -complete, even when is set to any value , and even when , , , and .
The complexity of LocalMin and LocalMinMax is more difficult to characterize, as the nature of these problems changes drastically depending on the relationship of with with , , and , which determines whether these problems ask for a globally vs locally approximately optimal solution. In particular, there are two regimes wherein the complexity of both problems is simple to characterize.
Global Regime. When then both LocalMin and LocalMinMax ask for a globally optimal solution. In this regime it is not difficult to see that both problems are -hard to solve even when and , are (see Section 10).
Trivial Regime. When satisfies , then for every point it holds that for every with . Thus, every point in the domain is a solution to both LocalMin and LocalMinMax.
It is clear from our discussion above, and in earlier sections, that, to really capture the complexity of finding local as opposed to global minima/min-max equilibria, we should restrict the value of . We identify the following regime, which we call the “local regime.” As we argue shortly, this regime is markedly different from the global regime identified above in that (i) a solution is guaranteed to exist for both our problems of interest, where in the global regime only LocalMin is guaranteed to have a solution; and (ii) their computational complexity transitions to lower complexity classes.
Local Regime. Our main focus in this paper is the regime defined by . In this regime it is well known that Projected Gradient Descent can solve LocalMin in time (see Appendix E). Our main interest is understanding the complexity of LocalMinMax, which is not well understood in this regime. We note that the use of the constant in the constraint which defines the local regime has a natural motivation: consider a point where a -smooth function has ; it follows from the definition of smoothness that is both an -local min and an -local min-max equilibrium, as long as .
The following theorems provide tight upper and lower bounds on the computational complexity of solving LocalMinMax in the local regime. For compactness, we define the following problem:
We define the local-regime local min-max equilibrium computation problem, in short LR-LocalMinMax, to be the search problem LocalMinMax restricted to instances in the local regime, i.e. satisfying .
The computational problem LR-LocalMinMax belongs to . As a byproduct, if some function is -Lipschitz and -smooth, then an -local min-max equilibrium is guaranteed to exist when , i.e. in the local regime.
An important property of our reduction in the proof of Theorem 4.4 is that it is a black-box reduction. We can hence prove the following unconditional lower bound in the black-box model.
Our main goal in the rest of the paper is to provide the proofs of Theorems 4.3, 4.4 and 4.5. In Section 5, we show how to use Brouwer’s fixed point theorem to prove the existence of approximate local min-max equilibrium in the local regime. Moreover, we establish an equivalence between LocalMinMax and GDAFixedPoint, in the local regime, and show that both belong to . In Sections 6 and 7, we provide a detailed proof of our main result, i.e. Theorem 4.4. Finally, in Section 9, we show how our proof from Section 7 produces as a byproduct the black-box, unconditional lower bound of Theorem 4.5. In Section 8, we outline a useful interpolation technique which allows as to interpolate a function given its values and the values of its gradient on a hypergrid, so as to enforce the Lipschitzness and smoothness of the interpolating function. We make heavy use of this technically involved result in all our hardness proofs.
Existence of Approximate Local Min-Max Equilibrium
In this section, we establish the totality of LR-LocalMinMax, i.e. LocalMinMax for instances satisfying as defined in Definition 4.2. In particular, we prove that every -Lipschitz and -smooth function admits an -local min-max equilibrium, as long as . A byproduct of our proof is in fact that LR-LocalMinMax lies inside . Specifically the main tool that we use to prove our result is a computational equivalence between the problem of finding fixed points of the Gradient Descent/Ascent dynamic, i.e. GDAFixedPoint, and the problem LR-LocalMinMax. A similar equivalence between GDFixedPoint and LocalMin also holds, but the details of that are left to the reader as a simple exercise. Next, we first present the equivalence between GDAFixedPoint and LR-LocalMinMax, and we then show that GDAFixedPoint is in , which then also establishes that LR-LocalMinMax is in .
For arbitrary and , suppose that is an -approximate fixed point of , i.e., , where . Then is also a -local min-max equilibrium of .
For arbitary , suppose that is an -local min-max equilibrium of for and . Then is also an -approximate fixed point of .
The proof of Theorem 5.1 is presented in Appendix B.1. As already discussed, we use GDAFixedPoint as an intermediate step to establish the totality of LR-LocalMinMax and to show its inclusion in . This leads to the following theorem.
The computational problems GDAFixedPoint and LR-LocalMinMax are both total search problems and they both lie in .
Observe that Theorem 4.3 is implied by Theorem 5.2 whose proof is presented in Appendix B.2.
Hardness of Local Min-Max Equilibrium – Four-Dimensions
In Section 5, we established that LR-LocalMinMax belongs to . Our proof is via the intermediate problem GDAFixedPoint which we showed that it is computationally equivalent to LR-LocalMinMax. Our next step is to prove the -hardness of LR-LocalMinMax using again GDAFixedPoint as an intermediate problem.
The problem GDAFixedPoint is -complete even in dimension and . Therefore, LR-LocalMinMax is -complete even in dimension and .
The first color of every vertex is either or and the second color is either or .
The first color of all vertices on the left boundary of the grid is .
The first color of all vertices on the right boundary of the grid is .
The second color of all vertices on the bottom boundary of the grid is .
The second color of all vertices on the top boundary of the grid is .
Consider the function . Since is -Lipschitz, the function is also -Lipschitz. Additionally can be easily computed via a polynomial-time Turing machine that uses as a subroutine. We construct a proper coloring of a fine grid of using the signs of the outputs of . Namely we set and this defines a grid over that is indexed by . Let be the function that the Turing Machine evaluate when the requested accuracy is . Now we can define the circuit as follows, We remind that we abuse the notation and we use a coordinate both as a binary string and as a number in and it is clear from the context which of the two we use.
Let , there exists such that and such that .
We will prove the existence of and the existence of follows using an identical argument. If there exists a corner of such that is in the range then the claim follows. Suppose not. Using this together with the fact that the first color of one of the corners of is and also the first color of one of the corners of is we conclude that there exist points such that and The latter is inaccurate for the cases where the vertex belongs to either facets or . Notice that the coloring in such vertices does not depend on the value of . However in case where the color of such a corner is not consistent with the value of , i.e. and then this means that . This is due to the fact that and .. But we have that . This together with the fact that and implies that and also . But because of the -Lipschitzness of and because the distance between and is at most we conclude that . Hence due to the signs of and we conclude that . The same way we can prove that and the claim follows. ∎
Using the Claim 6.3 and the -Lipschitzness of we get that for every
where we have used also the fact that for any two points it holds that which follows from the definition of the size of the grid. Therefore we have that and hence which implies that any point is a -approximate fixed point of and the lemma follows. ∎
2 From 2D Bi-Sperner to Fixed Points of Gradient Descent/Ascent
The simplest way to achieve this is to define the function locally close to to be equal to
Similarly, if is on a vertex of the grid, and the coloring of this vertex is , i.e. the output of on this vertex is , then we would like to have
The simplest way to achieve this is to define the function locally close to to be equal to
In Figure 5 we show pictorially the correspondence of the colors of the vertices of the grid with the gradient of the function that we design. As shown in the figure, any set of vertices that share at least one of the colors , , , , agree on the direction of the gradient with respect the horizontal or the vertical axis. This observation is one of the main ingredients in the proof of correctness of our reduction that we present later in this section.
When is not on a vertex of the grid then our goal is to define via interpolating the functions corresponding to the corners of the cell in which belongs. The reason that this interpolation is challenging is that we need to make sure the following properties are satisfied
the resulting function is both Lipschitz and smooth inside every cell,
the resulting function is both Lipschitz and smooth even at the boundaries of every cell, where two differect cells stick together,
For the low dimensional case, that we explore in this section, satisfying the first two properties is not a very difficult task, whereas for the third property we need to be careful and achieving this property is the main technical contribution of this section. On the contrary, for the high-dimensional case that we explore in Section 7 even achieving the first two properties is very challenging and technical.
As we will see in Section 6.2.1, if we accomplish a construction of a function with the aforementioned properties, then the fixed points of the projected Gradient Descent/Ascent can only appear inside cells that have all of the colors at their corners. To see this consider a cell that misses some color, e.g. . Then all the corners of this cell have as first color . Since is defined as interpolation of the functions in the corners of the cells, with the aforementioned properties, inside that cell there is always a direction with respect to and for which the gradient is large enough. Hence any point inside that cell cannot be a fixed point of the projected Gradient Descent/Ascent. Of course this example provides just an intuition of our construction and ignores case where the cell is on the boundary of the grid. We provide a detailed explanation of this case in Section 6.2.1.
The above neat idea needs some technical adjustments in order to work. At first, the interpolation of the function in the interior of the cell must be smooth enough so that the resulting function is both Lipschitz and smooth. In order to satisfy this, we need to choose appropriate coefficients of the interpolation that interpolate smoothly not only the value of the function but also its derivatives. For this purpose we use the following smooth step function of order .
We define to be the smooth step function of order that is equal to . Observe that the following hold , , , and .
As we have discussed, another issue is that since the interpolation coefficients depend on the value of it could be that the derivatives of these coefficients overpower the derivatives of the functions that we interpolate. In this case we could be potentially creating fixed points of Gradient Descent/Ascent even in non panchromatic squares. As we will see later the magnitude of the derivatives from the interpolation coefficients depends on the differences and . Hence if we ensure that these differences are small then the derivatives of the interpolation coefficients will have to remain small and hence they can never overpower the derivatives from the corners of every cell. This is the place in our reduction where we add the constraints that define the domain of the function as we describe in Section 3.
Now that we have summarized the main ideas of our construction we are ready for the formal definition of based on the coloring circuit .
where the coefficients are defined as follows
where .
In Figure 6 we present an example of the application of Definition 6.5 to a specific cell with some given coloring on the corners.
An important property of the definition of the function is that the coefficients used in the definition of have the following two properties
Hence the function inside a cell is a smooth convex combination of the functions on the corners of the cell, as is suggested from Figure 6. Of course there are many ways to define such convex combination but in our case we use the smooth step function to ensure the Lipschitz continuous gradient of the overall function . We prove this formally in the next lemma.
Let be the function defined based on a coloring circuit , as per Definition 6.5. Then is continuous and differentiable at any point . Moreover, is -Lipschitz and -smooth in the whole 4-dimensional hypercube , where .
Clearly from Definition 6.5, is differentiable at any point in which lies on the strict interior of its respective cell. In this case the derivative with respect to is
where for we have that
Now since , we can conclude that . Similarly we can prove that , which combined with implies . Using similar reasoning we can prove that and that for . Hence
The only thing we are missing to prove the Lipschitzness of is to prove its continuity on the boundaries of the cells of our subdivision. Suppose lies on the boundary of some cell, e.g. let lie on edge of one cell that is the same as the edge of the cell to the right of that cell. Since , and it holds that and the same for . Therefore the value of remains the same no matter the cell according to which it was calculated. As a result, is differentiable with respect to even if belongs in the boundary of its cell. Using the exact same reasoning for the rest of the variables, one can show that the function is differentiable at any point and because of the aforementioned bound on the gradient we can conclude that is -Lipschitz.
Using very similar calculations, we can compute the closed formulas of the second derivatives of and using the bounds , , , and , we can prove that each entry of the Hessian is bounded by and thus
which implies the -smoothness of . ∎
Construction of Instance for Fixed Points of Gradient Descent/Ascent.
Our construction can be described via the following properties.
The payoff function is the real-valued function from the Definition 6.5.
The domain is the polytope that we described in Section 3. The matrix and the vector have constant size and they are computed so that the following inequalities hold
where and .
The parameter is set to be equal to .
The parameters and are set to be equal to the upper bounds on the Lipschitzness and the smoothness of respectively that we derived in Lemma 6.6. Namely we have that and .
We prove this last statement in Lemma 6.8, but before that we need the following technical lemma that will be useful to argue about solution on the boundary of .
If and then .
If or then .
If or then .
The symmetric statements for hold. For :
If and then .
If or then .
If or then .
For this proof it is convenient to define , , and .
For the second case, we assume for the sake of contradiction that and . These imply that and that . As a result, which is greater than . The latter is a contradiction with the assumption that is a solution to the GDAFixedPoint problem. Also if we assume that using the same reasoning we get that . From this we can again prove that which contradicts the fact that is a solution to GDAFixedPoint.
The third case can be proved using the same arguments as the second case. Then using the corresponding arguments we can prove the corresponding statements for the variables. ∎
We are now ready to prove that solutions of GDAFixedPoint can only occur in cells that are either panchromatic or violate the boundary conditions of a proper coloring. For convenience in the rest of this section we define to be the cell of the grid that contains .
for such that if there are multiple , that satisfy the above condition then we choose to be the cell that corresponds to the , such that the pair it the lexicographically first such that , satisfy the above condition. We also define the corners of as
where .
and, for all , it holds that .
and, for all , it holds that .
and, for all , it holds that .
and, for all , it holds that .
We prove that there is no solution of GDAFixedPoint that satisfies the statement 1. and the fact that cannot satisfy the other statements follows similarly. It is convenient for us to define , , , and , , .
For the sake of contradiction we assume that there exists a solution of such that and for all it holds that . Using the fact that the first color of all the corners of is , we will prove that (1) , and (2) .
Let , then since all the corners have , from the Definition 6.5 we have that
where , , , and . If we differentiate this with respect to we immediately get that . On the other hand if we differentiate with respect to we get
where the last inequality follows from the fact that and the fact that, due to the constraints that define the polytope , it holds that .
Hence we have established that if and for all it holds that then it holds that that (1) , and (2) . Now it is easy to see that the only way to satisfy both and is that either or . The first case is excluded by the assumption in the first statement of our lemma and our choice of thus it holds that . But then we can use the case 3 for the variables of Lemma 6.7 and we get that , which cannot be true since we proved that . Therefore we have a contradiction and the first statement of the lemma holds. Using the same reasoning we prove the rest of the statements. ∎
The computations presented in (6.4) is the precise point where an attempt to prove the hardness of minimization problems would fail. In particular, if our goal was to construct a hard minimization instance then the function would need to have the terms instead of so that the fixed points of gradient descent coincide with approximate local minimum of . In that case we cannot lower bound the gradient of (6.4) below from because the term will be the dominant one and hence the sign of the derivative can change depending on the value . For a more intuitive explanation of the reason why we cannot prove hardness of minimization problems we refer to the Introduction, at Section 1.2.
We have now all the ingredients to prove Theorem 6.1.
Hardness of Local Min-Max Equilibrium – High-Dimensions
The th color of every vertex is either the color or the color .
All the vertices whose th coordinate is , i.e. they are at the lower boundary of the th direction, should have the th color equal to .
All the vertices whose th coordinate is , i.e. they are at the higher boundary of the th direction, should have the th color equal to .
-SuccinctBrouwer is -complete for any fixed constant .
Consider the function . Since is -Lipschitz, is also -Lipschitz. Additionally can be easily computed via a polynomial-time Turing machine that uses as a subroutine. We construct the coloring sequences of every vertex of a -dimensional grid with points in every direction using . Let be the function that the Turing Machine evaluate when the requested accuracy is . For each vertex of the -dimensional grid its coloring sequence is constructed as follows: For each coordinate ,
Let be any vertex on the same cubelet with the output vertices , , , , , . From the guarantees of colors of the sequences , , , , , we have that either or , let be the vertex or depending on which one the th color has product equal to with . Now let if then the wanted inequality follows. On the other hand if then using the fact that and that from the definition of the colors we have that either , or , we conclude that , or , and thus,
where in the second inequality we have used the -Lipschitzness of . As a result, the point satisfies and thus for if we pick then any vertex of the panchromatic cell is a solution for -SuccinctBrouwer. ∎
2 From High Dimensional Bi-Sperner to Fixed Points of Gradient Descent/Ascent
where such that and if there are multiple corners that satisfy this condition then we choose to be the cell that corresponds to the that is lexicographically first among those that satisfy the condition. We also define to be the set of vertices that are corners of the cublet , namely
where such that Every that belongs to the cubelet can be written as a convex combination of the vectors where . The value of the function that we construct in this section is determined by the coloring sequences of the vertices . One of the main challenges that we face though is that the size of is and hence if we want to be able to compute the value of efficiently then we have to find a consistent rule to pick a subset of the vertices of whose coloring sequence we need to define the function value . Although there are traditional ways to overcome this difficulty using the canonical simplicization of the cubelet , these technique leads only to functions that are continuous and Lipschitz but they are not enough to guarantee continuity of the gradient and hence the resulting functions are not smooth.
The problem of finding a computationally efficient way to define a continuous function as an interpolation of some fixed function in the corners of a cubelet so that the resulting function is both Lischitz and smooth is surprisingly difficult to solve. For this reason we introduce in this section the smooth and efficient interpolation coefficients (SEIC) that as we will see in Section 7.2.2, is the main technical tool to implement such an interpolation. Our novel interpolation coefficients are of independent interest and we believe that they will serve as a main technical tool for proving other hardness results in continuous optimization in the future.
In this section we only give a high level description of the smooth and efficient interpolation coefficients via their properties that we use in Section 7.2.2 to define the function . The actual construction of the coefficients is very challenging and technical and hence we postpone a detail exposition for Section 8.
For all vertices , the coefficient is a twice-differentiable function and satisfies
.
For all , it holds that and .
For all , if for some then there exists such that . Respectively, if then there exists such that .
An intuitive explanation of the properties of the SEIC coefficients is the following
The coefficients are both Lipschitz and smooth with Lipschitzness and smoothness parameters that depends polynomially in and .
The coefficients define a convex combination of the vertices .
For every , out of the coefficients only have non-zero value, or non-zero gradient or non-zero Hessian when evaluated at the point . Moreover, given we can identify these coefficients efficiently.
For every that is in a cubelet that touches the boundary there is at least one of the vertices in that is on the boundary of the continuous hypercube .
In Section 10 in the proof of Theorem 10.4 we present a simple application of the existence of the SEIC coefficients for proving very simple black box oracle lower bounds for the global minimization problem.
Based on the existence of these coefficients we are now ready to define the function which is the main construction of our reduction.
2.2 Definition of a Lipschitz and Smooth Function Based on a BiSperner Instance
In this section our goal is to formally define the function and prove its Lipschitzness and smoothness properties in Lemma 7.5.
where , and are the coefficients defined in Definition 7.3.
We first prove that the function constructed in Definition 7.4 is -Lipschitz and -smooth for some appropriately selected parameters , that are polynomial in the dimension and in the discretization parameter . We use this property to establish that is a valid input to the promise problem GDAFixedPoint.
The function of Definition 7.4 is -Lipschitz and -smooth.
If we take the derivative with respect to and and using property (B) of the coefficients we get the following relations,
Now by the property (C) of Definition 7.3 there are most vertices of with the property . Then if we also use property (A) we get and using the property (B) we get . Thus and . Therefore we can conclude that and hence this proves that the function is Lipschitz continuous with Lipschitz constant .
To prove the smoothness of , we use the property (B) of the Definition 7.3 and we have
2.3 Description and Correctness of the Reduction – Proof of Theorem 4.4
Construction of Instance for Fixed Points of Gradient Descent/Ascent.
The payoff function is the real-valued function from the Definition 7.4.
The domain is the polytope that we described in Section 3. The matrix and the vector are computed so that the following inequalities hold
The parameter is set to be equal to .
The parameters and are set to be equal to the upper bounds on the Lipschitzness and the smoothness of respectively that we derived in Lemma 7.5. Namely we have that and .
The first thing to observe is that the afore-described reduction is polynomial-time. For this observe that all of , , , , and have representation that is polynomial in even if we use unary instead of binary representation. So the only thing that remains is the existence of a Turing machine that computes the function and the gradient value of in time polynomial to the size of and the requested accuracy. To prove this we need a detailed description of the SEIC coefficients and for this reason we postpone the proof of this to the Appendix D. Here we state the formally the result that we prove in the Appendix D which together with the discussion above proves that our reduction is indeed polynomial-time.
Moreover the running time of is polynomial in the binary representation of , , and .
If and then .
If or then .
If or then .
The symmetric statements for hold.
If and then .
If or then .
If or then .
The proof of this lemma is identical to the proof of Lemma 6.7 and for this reason we skip the details of the proof here. ∎
and for any , it holds that .
and for any , it holds that .
We prove that there is no solution of GDAFixedPoint that satisfies the statement 1. and the fact that cannot satisfy the statement 2. follows similarly. It is convenient for us to define , , , and , , .
For the sake of contradiction we assume that there exists a solution of such that and for any it holds that . Using this fact, we will prove that (1) , and (2) .
Let , then since all the corners have , from the Definition 7.4 we have that
If we differentiate this with respect to we immediately get that . On the other hand if we differentiate with respect to we get
where the above follows from the following facts: (1) that , which is proved in the proof of Lemma 7.5, (2) , and (3) the definition of . Now it is easy to see that the only way to satisfy both and is that either or . The first case is excluded by the assumption of the first statement of our lemma and our choice of , thus it holds that . But then we can use the case 3. for the variables of Lemma 6.7 and we get that , which cannot be true since we proved that . Therefore we have a contradiction and the first statement of the lemma holds. Using the same reasoning we prove the second statement too. ∎
Let be a solution to the GDAFixedPoint problem with input a Turing machine that represents the function , , where , , , and , as described in ().
For each coordinate , there exist the following three mutually exclusive cases,
: Since , it follows directly from Lemma 7.8 that there exists such that and such that .
Smooth and Efficient Interpolation Coefficients
In this section we describe the construction of the smooth and efficient interpolation coefficients (SEIC) that we introduced in Section 7.2.1. After the description of the construction we present the statements of the lemmas that prove the properties (A) - (D) of their Definition 7.3 and we refer to the Appendix C. We first remind the definition of the SEIC coefficients.
Our main goal in this section is to prove the following theorem.
One important component of the construction of the SEIC coefficients is the smooth-step functions which we introduce in Section 8.1. These functions also provide a toy example of smooth and efficient interpolation coefficients in dimension. Then in Section 8.2 we present the construction of the SEIC coefficients in multiple dimensions and in Section 8.3 we state the main lemmas that lead to the proof of Theorem 8.1.
For every it holds that , for every it holds that and for every it holds that .
For some it holds that is times continuously differentiable and its th derivative satisfies and .
The largest number such that the smoothness property from above holds is characterizes the order of smoothness of the smooth step function .
In Section 6 we have already defined and used the smooth step function of order . For the construction of the SEIC coefficients we use the smooth step function of order and the smooth step function of order defined as follows.
We note that we use the notation instead of for the smooth step function of order for simplicitly of the exposition of the paper.
We present a plot of these step function in Figure 7, and we summarize some of their properties in Lemma 8.3. A more detailed lemma with additional properties of that are useful for the proof of Theorem 8.1 is presented in Lemma C.5 in the Appendix C.
The calculations for are more complicated. We have that
We set for and doing simple calculations we get that for it holds that . But the later can be easily lower bounded by . Applying the same argument for we get that in general . Also it is not hard to see that for it holds that , whereas for it holds that . Combining all these we can conclude that . Using similar argument we can prove that . For all the derivatives of we can inductively prove that
where and all the functions are bounded. Then the fact that all the derivatives of vanish at and at follows by a simple inductive argument. ∎
Using the smooth step functions that we described above we can get a construction of SEIC coefficients for the single dimensional case. Unfortunately the extension to multiple dimensions is substantially harder and invokes new ideas that we explore later in this section. For the single dimensional problem of this example we have the interval $NN\mathsf{P}_{1}\mathsf{P}_{N}$ that satisfy the properties (A) - (D) of Definition 7.3. A simple construction of such functions is the following
Based on Lemma 8.3 it is not hard then to see that is twice differentiable and it has bounded first and second derivatives, hence it satisfies property (A) of Definition 8. Using the fact that we can also prove property (B). Finally properties (C) and (D) can be proved via the definition of the coefficient from above. In Figure 7 we can see the plot of for . We leave the exact proofs of this example as an exercise for the reader.
2 Construction of SEIC Coefficients in High-Dimensions
The goal of this section is to present the construction of the family of smooth and efficient interpolation coefficients for every number of dimensions and any discretization parameter . Before diving into the details of our construction observe that even the 2-dimensional case with is not trivial. In particular, the first attempt would be to define the SEIC coefficients based on the simple split of the square to two triangles divided by the diagonal of . Then using any soft-max function that is twice continuously differentiable we define a convex combination at every triangle. Unfortunately this approach cannot work since the resulting coefficients have discontinuous gradients along the diagonal of . We leave the presice calculations of this example as an exercise to the reader.
We start with some definitions about the orientation and the representation of the cubelets of the grid . Then we proceed with the definition of the functions in Definition 8.7. Finally using we can proceed with the construction of the SEIC coefficients.
Each cubelet , where admits a source vertex and a target vertex defined as follows,
Notice that the source and the target are vertices of the cubelet whose down-left corner is .
(Canonical Representation) Let and where . The canonical representation of under cubelet with down-left corner , denoted by is defined as follows,
where and are respectively the target and the source of .
Let lying in the cublet
with corners , where . Let also be the source vertex of and be the canonical representation of . Then for each vertex we define the following partition of the set of coordinates ,
where and are the smooth step function defined in Definition 8.2.
To provide a better understanding of the Definitions 8.5, 8.6, and 8.7 we present the following -dimensional example.
We consider a case where and . Let lying in the cubelet , and let . Then the source of is and the target (Definition 8.5). The canonical representation of is (Definition 8.6). The only vertices with no-zero coefficients are those belonging in the set and again by Definition 8.7 we have that
,
,
,
.
Now based on the Definitions 8.5, 8.6, and 8.7 we are ready to present the construction of the smooth and efficient interpolation coefficients.
Let lying in the cubelet . Then for each vertex the coefficient is defined as follows,
where the functions are defined in Definition 8.7 for any .
3 Sketch of the Proof of Theorem 8.1
First it is necessary to argue that is a continuous function since it could be the case that for some point that lies in the boundary of two adjacent cubelets with down-left corners and respectively. We specifically design the coefficients such as the latter does not occur and this is the main reason that the definition of the function is slightly complicated. For this reason we prove the following lemma.
For any vertex , is a continuous and twice differentiable function and for any it holds that . Moreover, for every the set of vertices such that satisfies .
Based on Lemma 8.10 and the expression of we can prove that the coefficients defined in Definition 8.9 satisfy the properties (B) and (C) of the definition 7.3. To prove the properties (A) and (D) we also need the following two lemmas.
For any vertex , it holds that
Let a point and the set of vertices with , then we have that
If then there always exists a vertex such that .
If then there always exists a vertex such that .
The proofs of Lemmas 8.10, 8.11, and 8.12 can be found in Appendix C. Based on Lemmas 8.10, 8.11, and 8.12 we are now ready to prove Theorem 8.1.
The fact that the coefficients satisfy the property (A) follows directly from Lemma 8.11. Property (B) follows directly from the definition of in Definition 8.9 and the simple fact that . Property (C) follows from the second part of Lemma 8.10. Finally Property (D) follows directly from Lemma 8.12. ∎
Unconditional Black-Box Lower Bounds
In this section our goal is to prove Theorem 4.5 based on the Theorem 4.4 that we proved in Section 7 and the known black box lower bounds that we know for by [HPV89]. In this section we assume that all the real number operation are performed with infinite precision.
Assume that there exists an algorithm that has black-box oracle access to the value of a function and outputs . There exists a universal constant such that if is -Lipschitz and , then has to make at least different oracle calls to the function value of .
It is easy to observe in the reduction in the proof of Theorem 7.2 is a black-box reduction and in every evaluation of the constructed circuit only requires one evaluation of the input function . Therefore the proof of Theorem 7.2 together with the Theorem 9.1 imply the following corollary.
Based on Corollary 9.2 and the reduction that we presented in Section 7, we are now ready to prove Theorem 4.5.
Hardness in the Global Regime
In this section our goal is to prove that the complexity of the problems LocalMinMax and LocalMin is significantly increased when , lie outside the local regime, in the global regime. We start with the following theorem where we show that -hardness of LocalMinMax.
LocalMinMax is -hard even when is set to any value , is set to any value , and even when , , , and .
We now present a reduction from 3-SAT(3) to LocalMinMax that proves Theorem 10.1. First we remind the definition of the problem 3-SAT(3).
where each are additional variables associated with clause . The player that wants to minimize controls vectors while the maximizing player controls the variables.
The formula admits a satisfying assignment if and only if there exist an -local min-max equilibrium of with , and .
Let us assume that there exists a satisfying assignment. Given such a satisfying assignment we will construct that is a -local min-max equilibrium of . We set each variable if and only if the respective boolean variable is true. Observe that this implies that for all , meaning that the strategy profile is a global Nash equilibrium no matter the values of .
On the opposite direction, let us assume that there exists an -local min-max equilibrium of with and . In this case we first prove that for each
Fix any clause . In case then the minimizing player can further decrease by at least by setting . On the other hand in case then the maximizing player can increase by at least by moving either to or to . We remark that both of the options are feasible since .
Now consider the probability distribution over the boolean assignments where each boolean variable is independently selected to be true with probability . Then,
Since each shares variables with at most other clauses, the event of not being satisfied is dependent with at most other events. By the Lovász Local Lemma [EL73], we get that the probability none of these events occur is positive. As a result, there exists a satisfying assignment. ∎
Next we show the -hardness of LocalMin. As we can see there is a gap between Theorem 10.1 and Theorem 10.3. In particular, the -hardness result of LocalMinMax is stronger since it holds for any whereas for the -hardness of LocalMin our proof needs when the rest of the parameters remain the same.
LocalMin is -hard even when is set to any value , is set to any value , and even when , , , and .
We follow the same proof as in the proof of Theorem 10.1 but we instead set where (the number of variables is ). We then get that if the initial formula is satisfiable then there exist , such that . On the other hand if there exist such that then the formula is satisfiable due to the Lovász Local Lemma [EL73]. Therefore the -hardness follows again from the constructive proof of the Lovász Local Lemma [Mos09, MT10]. Setting which equals the diameter of the feasibility set implies that in case there exists with then all -LocalMin must admit value and thus a satisfying assignment is implied. ∎
Next we prove a black box lower bound for minimization in the global regime. The proof of following lower bound illustrates the strength of the SEIC coefficients presented in Section 8. The next Theorem can also be used to prove the -hardness of LocalMin in the global regime but with worse Lipschitzness and smoothness parameters than the once at Theorem 10.3 and for this reason we present both of them.
In the worst case, value/gradient black-box queries are needed to determine a -LocalMin for functions with , , , .
The proof is based on the fact that given just black-box access to a boolean formula , at least queries are needed in order to determine whether admits a satisfying assignment. The term black-box access refers to the fact that the clauses of the formula are not given and the only way to determine whether a specific boolean assignment is satisfying is by quering the specific binary string.
Given such a black-box oracle for a satisfying assignment , we construct the function as follows:
for each corner of the hypercube, i.e. , we set .
for the rest of the points , where are the coefficients of Definition 8.9.
We remind that by Lemma 8.11, we get that and , meaning that is -Lipschitz and -smooth. Moreover by Lemma 8.7 , for any the set has cardinality at most , while at the same time .
In case is not satisfiable then for all since for all . In case there exists a satisfying assignment then . Since that is the diameter of , any -LocalMin must have . Since , there exists at least one vertex with , meaning that . As a result, given an -LocalMin with , we can find a satisfying by querying for each vertex . Since , this will take at most additional queries.
Up next, we argue that in case an -LocalMin could be determined with less than value/gradient queries, then determining whether admits a satisfying assignment could be done with less that queries on (the latter is obviously impossible). Notice that any value/gradient query both and can be computed by querying the value of the vertices . Since , any value/gradient query of can be simulated by queries on . ∎
Acknowledgements
This work was supported by NSF Awards IIS-1741137, CCF-1617730 and CCF-1901292, by a Simons Investigator Award, by the DOE PhILMs project (No. DE-AC05-76RL01830), and by the DARPA award HR00111990021. M.Z. was also supported by Google Ph.D. Fellowship. S.S. was supported by NRF 2018 Fellowship NRF-NRFF2018-07.
References
Appendix A Proof of Theorem 4.1
We first remind the definition of the 3-SAT(3) problem that we will use for our reduction.
It is well known that 3-SAT(3) is -complete, for details see of [Pap94a]. To prove Theorem 4.1, we reduce 3-SAT(3) to -StationaryPoint.
There exists a satisfying assignment for the clauses if and only if there solution of the constructed StationaryPoint with a admits solution such that .
By the definition of StationaryPoint, in case there exists a pair of points with , then a pair of points with must be returned. In case for all , the null symbol is returned.
Let us assume that there exists a satisfying assignment of . Consider the solution constructed as follows: each variable is set to iff the respective boolean variable is true and for all . Since the assignment satisfies the CNF-formula , there exists at least one true literal in each clause which means that for all . As a result for all . At the same time, since for all . Overall we have that . As a result, the constructed StationaryPoint instance must return a solution with .
On the opposite direction, the existence of a pair of points with implies for all . Consider the probability distribution over the boolean assignments in which each boolean variable is independently selected to be true with probability . Then,
Since shares variables with at most other clauses, the bad event of not being satisfied is dependent with at most other bad events. By Lovász Local Lemma [EL73], we get that the probability none of the events occurs is positive. As a result, there exists a satisfying assignment. ∎
Using Lemma A.1 we can conclude that is satisfiable if and only if has a -approximate stationary point. What is left to prove the -hardness is to show how we can find a satisfying assignment of given an approximate stationary point of . This can be done using the celebrated results that provide constructive proofs of the Lovász Local Lemma [Mos09, MT10]. Finally, we remind that the constructed function is -Lipschitz and -smooth, where is the number of variables that is equal to .
Appendix B Missing Proofs from Section 5
In this section we give proofs for the statements presented in Section 5. These statements establish the totality and inclusion to of LR-LocalMinMax and GDAFixedPoint.
We start with establishing claim “1.” in the statement of the theorem. It will be clear that our proof will provide a polynomial-time reduction from LR-LocalMinMax to GDAFixedPoint. Suppose that is an -approximate fixed point of , where is the specified in the theorem statement function of , and . To simplify our proof, we abuse notation and define , , and . Because is an -approximate fixed point of , it follows that .
.
Using the fact that and that is a convex set we can apply Theorem 1.5.5 (b) of [FP07] to get that
Next, we do some simple algebra to get that, for all ,
where the second to last inequality follows from Cauchy–Schwarz inequality and the triangle inequality, and the last inequality follows from the triangle inequality and the following facts: (1) , (2) , and (3) for all . ∎
For all , from the -smoothness of we have that
: In this case we stop, remembering that
: In this case, we consider two further sub-cases:
: in this sub-case, Eq (B.2) gives
where for the last inequality we used that , and that .
: in this sub-case, Eq (B.2) gives
where the second inequality follows from the fact that , the third inequality follows from Claim B.1, and the last inequality follows from the constraints and .
In all cases, we get from (B.3), (B.4) and (B.5) that , for all . Thus, lifting our abuse of notation, we get that , for all . Using an identical argument we can also show that for all . The first part of the theorem follows.
Now let us establish claim “2.” in the theorem statement. It will be clear that our proof will provide a polynomial-time reduction from GDAFixedPoint to LR-LocalMinMax. For the choice of parameters and described in the theorem statement, we will show that, if is an -local min-max equilibrium of , then and . The second part of the theorem will then follow. We only prove that , as the argument for is identical. In the argument below we abuse notation in the same way we described earlier. With that notation we will show that .
Proof that . From our choice of and , it is easy to see that . Thus, if , then we automatically get . So it remains to handle the case . We choose . It is easy to see that and hence we get that
where the first inequality follows from the fact that is an -local min-max equilibrium, the second inequality follows from the -smoothness of , and the third inequality follows from and our choice of . The above implies:
Since we get that . Therefore
where in the above inequality we have also used (B.1). As a result, .
B.2 Proof of Theorem 5.2
We provide a polynomial-time reduction from GDAFixedPoint to Brouwer. This establishes both the totality of GDAFixedPoint and its inclusion to , since Brouwer is both total and lies in , as per Lemma 2.5. It also establishes the totality and inclusion to of LR-LocalMinMax, since LR-LocalMinMax is polynomial-time reducible to GDAFixedPoint, as shown in Theorem 5.1.
We proceed to describe our reduction. Suppose that is the -Lipschitz and -smooth function provided as input to GDAFixedPoint. Suppose also that is the approximation parameter provided as input to GDAFixedPoint. Given and , we define function , which serves as input to Brouwer, as follows:
Given that is -smooth, it follows that is -Lipschitz. We set the approximation parameter provided as input to Brouwer be .
To show the validity of the afore-described reduction, we prove that every feasible point that is a -approximate fixed point of , i.e. is also an -approximate fixed point of . Observe that since it holds that for all . Hence, if , then finding -approximate fixed points of is trivial and the same is true for fiding -approximate fixed points of , since which implies that, if , then . Thus, we may assume that .
Next, to simplify notation we define and . Given that is a -approximate fixed point of , we have that
Using Theorem 1.5.5 (b) of [FP07], we get that
For all , .
Now let where . Using Theorem 1.5.5 (b) of [FP07] for we get that . Using Claim B.2 for vector we get that . Adding the last two inequalities and using the fact that we get the following
Using the exact same reasoning we can also prove that
where . Combining the last two inequalities we get that is an -approximate fixed point of .
Appendix C Missing Proofs from Section 8
In this section we present the missing proofs from Section 8 and more precisely in the following sections we prove the Lemmas 8.10, 8.11, and 8.12. For the rest of the proofs in this section we define to be the cubelet which has the down-left corner equal to , formaly
and we also define to be the set of corners of the cubelet , or more formally
We start with a lemma about the differentiability properties of the functions which we defined in Definition 8.7.
1st order differentiability: We remind from the Definition 8.7 that if we let be the source vertex of and be the canonical representation of . Then for each vertex we define the following partition of the set of coordinates ,
Now in case , which corresponds to being the source node then which is clearly differentiable as product of compositions of differentiable functions. The exact same holds for which corresponds to being the target vertex of the cubelet . We thus focus on the case where . To simplify notation we denote by , by and by for the rest of this proof. We prove that in case then always exits. The case follows then symmetrically. We have the following cases
for all : Then exists since both and are differentiable.
for some : By Definition 8.7, if is sufficiently small then . Thus exists and equals .
for some and for all : By Definition 8.7, if is sufficiently small then and also , thus
since both and are differentiable functions, , and .
for all : Then exists since both and are twice differentiable.
for some . By Definition 8.7, . Thus exists and equals .
for some and for all . By Definition 8.7, if is sufficiently small then and thus
At the same time exists since both and are twice differentiable. Moreover equals since and .
In every step of the above proof where we use properties of and we use Lemma 8.3. ∎
So far we have established the fact that the functions are twice differentiable when moves within the same cubelet. Next we will show that when moves from one cubelet to another then the corresponding functions changes value smoothly.
Let such that there exists a coordinate with the property and , with and sufficiently small, i.e. lies in the boundary of two cubelets. Then the following statements hold.
For all vertices , it holds that
,
for all , and
Lemma C.2 is crucial since it establishes that is a continuous and twice differentiable even when moves from one cubelet to another. Since the proof of Lemma C.2 is very long and contains the proof of some sublemmas, we postpone it for the end of this section in Section C.1.1. We now proceed with the proof of Lemma 8.10.
We first prove that is a continuous function. Let lying on the boundary of the following cubelets
. By Definition 8.9, in all the aforementioned cubelets, the coefficient takes value and hence it is continuous in this part of the space.
and , for some with . In this case was computed according to a cubelet with . Then Lemma C.2 implies that since where and . Therefore we conclude that and
. By Lemma C.2 for all it holds that
which again implies the continuity of at .
To prove that always exists, we consider the following mutually exclusive cases.
for and for . Since the coefficient is a continuous function, we have that
Both of the above limits exists due to the fact that is differentiable (Lemma C.1). Moreover, since , Case of Lemma C.2 implies that the two limits above have exactly the same value and hence is differentiable at .
for all . In the case where for all the down-left corners of the cubelets at which lies, then by Definition 8.9 . Thus exists and equals . Therefore we may assume that for some down-left corner of a cubelet at which lies. Due to the fact that is a continuous function and that for all , we get that
We also have that where , are down-left corners of cubelets at which lies and lies respectively. Therefore we get by Case of Lemma C.2 that implying that . As a result,
We now need to argue that exists and equals . At first observe that since lies in the cubelet with down-left corner . In case then lies in for arbitrarily small , meaning that . The latter contradicts the fact that for all . As a result, which implies that and hence
The above limit equals to since by applying Lemma C.2 due to the fact that .
for all . Symmetrically with the previous case.
The second order differentiability of can be established using exactly the same arguments for computing the following limit
since for all the others it holds that , , and . These vertices can be computed in polynomial time as follows: i) the coordinates are sorted in increasing order, and ii) for each compute the vertex ,
To finish the proof of Lemma 8.10 we only need the proof of Lemma C.2 which we present in the following section.
Let a point lying in the boundary of the cubelets with down-left corners and . Then the canonical representation of in the cubelet is the same with the the canonical representation of in the cubelet . More precisely, .
Let be even. By the definition of the canonical representation in Definition 8.6, the source and target of the cubelets and are respectively,
,
,
,
.
Hence we get that for . Since belongs to the boundary of both cublets and we get that which implies that . In case is odd we get that but with . ∎
Let lying at the intersection of the cubelets , with down-left corners , and . Then the following statements are true.
For all vertices it holds that
,
,
Let then we have that
. By Lemma C.3 we get that the canonical representation . Since is a function of the canonical representation (see Definition 8.9), it holds that for all vertices .
. For , we get that since and for all . The latter argument cannot be applied for the -th coordinate since . However since belongs to the boundary of both the cubelets and it is implied that is either or , meaning that since from Lemma 8.3.
This case follows with the same reasoning with previous case .
Let . There exists a sequence of corners
such that and for all . By Lemma C.4 we get that,
.
.
C.2 Proof of Lemma 8.11
We start this section with some fundamental properties of the smooth step function that are more fine-grained than the properties we presented in Lemma 8.3.
For there exists a universal constant such that the following statements hold.
If then .
If then .
If then .
If then .
If then .
We compute the derivative of and we have that
from which we immediately get . Then we can compute the second derivative of as follows
We next want to prove that for . To see this observe that for and therefore
hence for it holds that . By similar but more tedious calculations we can conclude that for . Hence in the interval all the functions , , are all increasing functions of .
Next we show that the function is upper and lower bounded. First observe that . Now if we set then and hence for . The same way we can prove that for . Therefore for all . Also it is not hard to see that and which implies . Hence overall we have that for all . We are now ready to prove the statements.
We have shown that for all . Hence is an increasing function and therefore for . Now we have that .
Since is increasing for , we have that for and therefore
Follows directly from the statement 1., the fact that is increasing for and the above expression of this statement follows.
This statement follows using the same reasoning with statement 3.
In order to prove Lemma 8.11. We first introduce several technical lemmas.
Let lying in cublet , with and let be the canonical representation of . Then for all vertices , it holds that
where the last inequality follows by the fact that . Since the proof of the lemma will be completed if we are able to show that for any , it holds that
In case then by case of Lemma C.5 we get that , which implies gthe following
Now consider the case where . Using case of Lemma C.5, we have that
Combining the later with the discussion in the rest of the proof the lemma follows. ∎
For any vertex it holds that .
To simplify notation we use instead of for the rest of the proof. Without loss of generality we assume that lies on a cubelet with and , since otherwise . Let be the canonical representation of in the cubelet . Then it holds that
where the last inequality follows by Lemma C.6 and the fact that at most vertices of have non-zero gradient as we have proved in Lemma 8.10. Then the proof of Lemma C.7 follows by the fact that . ∎
If additionally it holds that or , then by the case of Lemma C.5, we have that
In case then by case of Lemma C.5, we get that which implies that .
On the other hand if then by case of Lemma C.5, we get that . As in the proof of Lemma C.6, there exists a vertex such that and thus . Overall we get that
If we combine all the above cases then the Lemma follows. ∎
Finally using Lemma C.7 and Lemma C.9 we get the proof of Lemma 8.11.
C.3 Proof of Lemma 8.12
Let and denote down-left corner of the cubelet at which lies, i.e. . Since , this means that . By the definition of sources and targets in Definition 8.6, we have that and , where , are respectively the -th coordinate of the source and the target vertex. Let the canonical representation of in the cubelet . Now partition the coordinates in the following sets
If then notice that , since , by the fact that . Thus the lemma follows since . So we may assume that . In this case consider the corner defined as follows
Observe that and thus . Moreover the coordinate and therefore it holds that . This proves the first statement of the Lemma.
For the second statement let and denote down-left corner of the cubelet at which lies, i.e. . This means that .
Let be odd. In this case by the definition of sources and targets in Definition 8.6, we have that and , where , are respectively the -th coordinate of the source and target vertex. Let be the canonical representation of under in the cubelet . Now partition the coordinates as follows,
If then notice that for the target vertex , , since , by the fact that . Thus the lemma follows since . So we may assume that . In this case consider the corner defined as follows,
Observe that and thus . Moreover the coordinate and thus .
Let be even. In this case we have that and . Now partition the coordinates as follows,
If then notice that for the source vertex , , since , by the fact that . Thus the lemma follows since . In case consider the corner defined as follows,
Observe that and thus . Moreover the coordinate and thus .
If we put together the last two cases then this implies the second statement of the lemma.
Appendix D Constructing the Turing Machine – Proof of Theorem 7.6
In this section we prove Theorem 7.6 establishing that both the function of Definition 7.4 and its gradient, is computable by a polynomial-time Turing Machine. We prove Theorem 7.6 through a series of Lemmas. To simplify notation we set .
There exist Turing Machines , that given input and in binary form, compute and in time polynomial in and the binary representation of .
The Turing Machine outputs the fist bits of the following quantity,
where will be selected sufficiently large. Notice it is possible to compute the above quantity due to the fact that all functions , and can be computed with accuracy in polynomial time with respect to and the binary representation of [Bre76]. Moreover,
where the first inequality follows from triangle inequality and the second follows from the facts that is a -Lipschitz function of for , and is an -Lipschitz function of for . The last inequality follows from the definition of . Hence is indeed equal to if we choose .
Next we explain how computes . First notice that is equal to
To describe how to compute we first assume that we have computed the following quantities. Then based on these quantities we show how can be computed and finally we consider the computation of these quantities.
,
,
.
Then outputs the fist bits of the quantity . We now prove that
Consider the function where and where are universal constants. Notice that is -Lipschitz for . Since for sufficiently large all the quantities and where are universal constants we get that
Now consider the function where where is a universal constant. In this case is -Lipschitz continuous. Since for sufficiently large all the quantities are bounded by a universal constant , we have that,
Next we explain how the values and are computed while can easily be computed via standard techniques [Bre76].
Computation of . The Turing Machine will compute by taking the first bits of the following quantity,
where will be taken sufficiently large. We remark that both where both the exponentiation and the natural logarithm can be computed in polynomial-time with respect to the number of accuracy bits and the binary representation of the input [Bre76]. The function is -Lipschitz where is a universal constant. Thus,
Computation of . Using the same arguments as for .
Computation of . To compute we first compute bits of the following quantity,
The latter follows by applying the triangle inequality and the following inequalities.
this holds since for we have
are both upper-bounded by while the function is -Lipschitz for .
The latter follows since for larger than a universal constant, both and are greater than a universal constant , while the function is -Lipschitz for .
The latter follows since for larger than a universal constant it holds that both the quantities in the left hand side are greater than a positive universal constant , while the function for , , and is -Lipschitz.
There exist Turing Machines and that given and in binary form, respectively compute and for all vertices with , where . These vertices are most . Moreover both and run in polynomial time with respect to , and the binary representation of .
Both , firsts compute the canonical representation with the respect to the cell in which lies. Such a cell can be computed by taking the first -bits at each coordinate of . The source vertex and the target vertex with respect to are also computed. Once this is done we are only interested in vertices for which
since for all the other both and . These vertices, that are denoted by , are at most and can be computed in polynomial time.
The vertices can be computed in polynomial time as follows: (i) the coordinates are sorted in increasing order ii) for each compute the vertex ,
By Definition 8.7 it immediately follows that which also establish that .
where is selected sufficiently large. We next prove that this computation indeed outputs accurately.
We can now use the above inequality to bound . More precisely,
Thus the analysis is completed by selecting + . ∎
For accuracy we get that,
Consider the function . Notice that for and then . The latter implies that for such that and that , it holds that
Since there are at most vertices while both the term and the term are greater than , we can apply the above inequality with and we get the following
The proof is completed via selecting .
we next prove that the above computation is correct.
Setting we get the desired result. Similarly for and . ∎
Appendix E Convergence of PGD to Approximate Local Minimum
where and is a global minimum of .
If we run the Projected Gradient Descent algorithm on then we have
then due to the -smoothness of we have that
We can now apply Theorem 1.5.5 (b) of [FP07] to get that
If sum all the above inequalities and divide by then we get
Therefore for we have that