Solving a Class of Non-Convex Min-Max Games Using Iterative First Order Methods

Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D. Lee, Meisam Razaviyayn

Introduction

Recent years have witnessed a wide range of machine learning and robust optimization applications being formulated as a min-max saddle point game; see and the references therein. Examples of problems that are formulated under this framework include generative adversarial networks (GANs) , reinforcement learning , adversarial learning , learning exponential families , fair statistical inference , generative adversarial imitation learning , distributed non-convex optimization and many others. These applications require solving an optimization problem of the form

This optimization problem can be viewed as a zero-sum game between two players. The goal of the first player is to minimize f(θ,α)f(\bm{\theta},\bm{\alpha}) by tuning θ\bm{\theta}, while the other player’s objective is to maximize f(θ,α)f(\bm{\theta},\bm{\alpha}) by tuning α\bm{\alpha}. Gradient-based methods, especially gradient descent-ascent (GDA), are widely used in practice to solve these problems. GDA alternates between a gradient ascent steps on α\bm{\alpha} and a gradient descent steps on θ\bm{\theta}. Despite its popularity, this algorithm fails to converge even for simple bilinear zero-sum games . This failure was fixed by adding negative momentum or by using primal-dual methods proposed by .

When the objective ff is convex in θ\bm{\theta} and concave in α\bm{\alpha}, the corresponding variational inequality becomes monotone. This setting has been extensively studied and different algorithms have been developed for finding a Nash equilibrium . Moreover, proposed an algorithm for solving a more general setting that covers both monotone and psuedo-monotone variational problems.

While the convex-concave setting has been extensively studied in the literature, recent machine learning applications urge the necessity of moving beyond these classical settings. For example, in a typical GAN problem formulation, two neural networks (generator and discriminator) compete in a non-convex zero-sum game framework . For general non-convex non-concave games, [28, Proposition 10] provides an example for which local Nash equilibrium does not exist. Similarly, one can show that even second-order Nash equilibrium may not exist for non-convex games, see Section 2 for more details. Therefore, a well-justified objective is to find first order Nash equilibria of such games ; see definitions and discussion in Section 2. The first order Nash equilibrium can be viewed as a direct extension of the concept of first order stationarity in optimization to the above min-max game. While ε\varepsilon–first order stationarity in the context of optimization can be found efficiently in O(ε2){\cal O}(\varepsilon^{-2}) iterations with gradient descent algorithm , the question of whether it is possible to design a gradient-based algorithm that can find an ε\varepsilon–first order Nash equilibrium for general non-convex saddle point games remains open.

Several recent results provided a partial answer to the problem of finding first-order stationary points of a non-convex min-max game. For instance, proposed a stochastic gradient descent algorithm for the case when f(,)f(\cdot,\cdot) is strongly concave in α\bm{\alpha} and show convergence of the algorithm to an ε\varepsilon–first-order Nash equilibrium with O~(ε2)\widetilde{\cal O}(\varepsilon^{-2}) gradient evaluations. Also, the work analyzes the gradient descent algorithm with Max-oracle and shows O(ϵ4)O(\epsilon^{-4}) gradient evaluations and max-oracle calls for solving min-max problems where the inner problem can be solved in one iteration using an existing oracle. More recently, considered the case where ff is concave in α\bm{\alpha}. They developed a descent-ascent algorithm with iteration complexity O~(ε4)\widetilde{\cal O}(\varepsilon^{-4}). In this non-convex concave setting, proposed a stochastic sub-gradient descent method with worst-case complexity O~(ε6)\widetilde{\cal O}(\varepsilon^{-6}). Under the same concavity assumption on ff, in this paper, we propose an alternative multi-step framework that finds an ε\varepsilon–first order Nash equilibrium/stationary with O~(ε3.5)\widetilde{\cal O}(\varepsilon^{-3.5}) gradient evaluations.

In an effort to solve the more general non-convex non-concave setting, developed a framework that converges to ε\varepsilon-first order stationarity/Nash equilibrium under the assumption that there exists a solution to the Minty variational inequality at each iteration. Although among the first algorithms with have theoretical convergence guarantees in the non-convex non-concave setting, the conditions required are strong and difficult to check. To the best of our knowledge, there is no practical problem for which the Minty variational inequality condition has been proven. With the motivation of exploring the non-convex non-concave setting, we propose a simple multi-step gradient descent ascent algorithm for the case where the objective of one of the players satisfies the Polyak-Łojasiewicz (PL) condition. We show the worst-case complexity of O~(ε2)\widetilde{\cal O}(\varepsilon^{-2}) for our algorithm. This rate is optimal in terms of dependence on ε\varepsilon up to logarithmic factors as discussed in Section 3. Compared to Minty variational inequality condition used in , the PL condition is very well studied in the literature and has been theoretically verified for objectives of optimization problems arising in many practical problems. For example, it has been proven to be true for objectives of over-parameterized deep networks , learning LQR models , phase retrieval , and many other simple problems discussed in . In the context of min-max games, it has also been proven useful in generative adversarial imitation learning with LQR dynamics , as discussed in Section 3.

The rest of this paper is organized as follows. In Section 2 we define the concepts of First-order Nash equilibrium (FNE) and ε\varepsilon–FNE. In Section 3, we describe our algorithm designed for min-max games with the objective of one player satisfying the PL condition. Finally, in Section 4 we describe our method for solving games in which the function f(θ,α)f(\theta,\alpha) is concave in α\alpha (or convex in θ\theta).

Two-player Min-Max Games and First-Order Nash Equilibrium

Consider the two-player zero sum min-max game

where Θ\Theta and A\mathcal{A} are both convex sets, and f(θ,α)f(\bm{\theta},\bm{\alpha}) is a continuously differentiable function. We say (θ,α)Θ×A(\bm{\theta}^{*},\bm{\alpha}^{*})\in\Theta\times\mathcal{A} is a Nash equilibrium of the game if

In convex-concave games, such a Nash equilibrium always exists and several algorithms were proposed to find Nash equilibria . However, in the non-convex non-concave regime, computing these points is in general NP-Hard. In fact, even finding local Nash equilibria is NP-hard in the general non-convex non-concave regime.In addition, as shown by [28, Proposition 10], local Nash equilibria for general non-convex non-concave games may not exist. Thus, in this paper we aim for the less ambitious goal of finding first-order Nash equilibrium which is defined in the sequel.

A point (θ,α)Θ×A(\bm{\theta}^{*},\bm{\alpha}^{*})\in\Theta\times\mathcal{A} is a first order Nash equilibrium (FNE) of the game (2) if

Notice that this definition, which is also used in , contains the first order necessary optimality conditions of the objective function of each player . Thus they are necessary conditions for local Nash equilibrium. Moreover, in the absence of constraints, the above definition simplifies to θf(θ,α)=0\nabla_{\bm{\theta}}f(\bm{\theta}^{*},\bm{\alpha}^{*})=0 and αf(θ,α)=0\nabla_{\bm{\alpha}}f(\bm{\theta}^{*},\bm{\alpha}^{*})=0, which are the well-known unconstrained first-order optimality conditions. Based on this observation, it is tempting to think that the above first-order Nash equilibrium condition does not differentiate between the min-max type solutions of (2) and min-min solutions of the type minθΘ,αAf(θ,α)\min_{\bm{\theta}\in\Theta,\bm{\alpha}\in\mathcal{A}}f(\bm{\theta},\bm{\alpha}). However, the direction of the second inequality in (3) would be different if we have considered the min-min problem instead of min-max problem. This different direction makes the problem of finding a FNE non-trivial. The following theorem guarantees the existence of first-order Nash equilibria under some mild assumptions.

Suppose the sets Θ\Theta and A\mathcal{A} are no-empty, compact, and convex. Moreover, assume that the function f(,)f(\cdot,\cdot) is twice continuously differentiable. Then there exists a feasible point (θˉ,αˉ)(\bar{\bm{\theta}},\bar{\bm{\alpha}}) that is first-order Nash equilibrium.

The above theorem guarantees existence of FNE points even when (local) Nash equilibria may not exist. The next natural question is about the computability of such methods. Since in practice we use iterative methods for computation, we need to define the notion of approximate–FNE.

A point (θ,α)(\bm{\theta}^{*},\bm{\alpha}^{*}) is said to be an ε\varepsilon–first-order Nash equilibrium (ε\varepsilon–FNE) of the game (2) if

In the absence of constraints, ε\varepsilon–FNE in Definition 2.3 reduces to θf(θ,α)εandαf(θ,α)ε.\|\nabla_{\bm{\theta}}f(\bm{\theta}^{*},\bm{\alpha}^{*})\|\leq\varepsilon\quad{\rm and}\quad\|\nabla_{\bm{\alpha}}f(\bm{\theta}^{*},\bm{\alpha}^{*})\|\leq\varepsilon.

The ε\varepsilon–FNE definition above is based on the first order optimality measure of the objective of each player. Such first-order optimality measure has been used before in the context of optimization; see . Such a condition guarantees that each player cannot improve their objective function using first order information. Similar to the optimization setting, one can define the second-order Nash equilibrium as a point that each player cannot improve their objective further by using first and second order information of their objectives. However, the use of second order Nash equilibria is more subtle in the context of games. The following example shows that such a point may not exist. Consider the game

Then (0,0)(0,0) is the only first-order Nash equilibrium and is not a second-order Nash equilibrium.

In this paper, our goal is to find an ε\varepsilon–FNE of the game (2) using iterative methods. To proceed, we make the following standard assumptions about the smoothness of the objective function ff.

The function ff is continuously differentiable in both θ\bm{\theta} and α\bm{\alpha} and there exists constants L11L_{11}, L22L_{22} and L12L_{12} such that for every α,α1,α2A\bm{\alpha},\bm{\alpha}_{1},\bm{\alpha}_{2}\in\mathcal{A}, and θ,θ1,θ2Θ\bm{\theta},\bm{\theta}_{1},\bm{\theta}_{2}\in\Theta, we have

Non-Convex PL-Game

In this section, we consider the problem of developing an “efficient" algorithm for finding an ε\varepsilon–FNE of (2) when the objective of one of the players satistys Polyak-Łojasiewicz (PL) condition. To proceed, let us first formally define the Polyak-Łojasiewicz (PL) condition.

A differentiable function h(x)h({\mathbf{x}}) with the minimum value h=minx  h(x)h^{*}=\min_{x}\;h({\mathbf{x}}) is said to be μ\mu-Polyak-Łojasiewicz (μ\mu-PL) if

The PL-condition has been established and utilized for analyzing many practical modern problems . Moreover, it is well-known that a function can be non-convex and still satisfy the PL condition . Based on the definition above, we define a class of min-max PL-games.

A simple example of a practical PL-game is detailed next.

Imitation learning is a paradigm that aims to learn from an expert’s demonstration of performing a task . It is known that this learning process can be formulated as a min-max game . In such a game the minimization is performed over all the policies and the goal is to minimize the discrepancy between the accumulated reward for expert’s policy and the proposed policy. On the other hand, the maximization is done over the parameters of the reward function and aims at maximizing this discrepancy over the parameters of the reward function. This approach is also referred to as generative adversarial imitation learning (GAIL) . The problem of generative adversarial imitation learning for linear quadratic regulators refers to solving this problem for the specific case where the underlying dynamic and the reward function come from a linear quadratic regulator . To be more specific, this problem can be formulated as minKmaxθΘm(K,θ)\min_{K}\max_{\bm{\theta}\in\Theta}m(K,\bm{\theta}), where KK represents the choice of the policy and θ\bm{\theta} represents the parameters of the dynamic and the reward functions. Under the discussed setting, mm is strongly concave in θ\bm{\theta} and PL in KK (see for more details). Note that since mm is strongly concave in θ\bm{\theta} and PLPL in KK, any FNE of the game would also be a Nash equilibrium point. Also note that the notion of FNE does not depend on the ordering of the min\min and max\max. Thus, to be consistent with our notion of PL-games, we can formulate the problem as

Thus, generative adversarial imitation learning of linear quadratic regulators is an example of finding a FNE for a min-max PL-game.

In what follows, we present a simple iterative method for computing an ε\varepsilon–FNE of PL games.

In this section, we propose a multi-step gradient descent ascent algorithm that finds an ε\varepsilon–FNE point for PL-games. At each iteration, our method runs multiple projected gradient ascent steps to estimate the solution of the inner maximization problem. This solution is then used to estimate the gradient of the inner maximization value function, which directly provides a descent direction. In a nutshell, our proposed algorithm is a gradient descent-like algorithm on the inner maximization value function. To present the ideas of our multi-step algorithm, let us re-write (2) as

A famous classical result in optimization is Danskin’s theorem which provides a sufficient condition under which the gradient of the value function maxαAf(θ,α)\max_{\bm{\alpha}\in\mathcal{A}}f(\bm{\theta},\bm{\alpha}) can be directly evaluated using the gradient of the objective f(θ,α)f(\bm{\theta},\bm{\alpha}^{*}) evaluated at the optimal solution α\bm{\alpha}^{*}. This result requires the optimizer α\bm{\alpha}^{*} to be unique. Under our PL assumption on f(θ,)f(\bm{\theta},\cdot), the inner maximization problem (9) may have multiple optimal solutions. Hence, Danskin’s theorem does not directly apply. However, as we will show in Lemma A.5 in the supplementary, under the PL assumption, we still can show the following result

despite the non-uniqueness of the optimal solution.

Motivated by this result, we propose a Multi-step Gradient Descent Ascent algorithm that solves the inner maximization problem to “approximate” the gradient of the value function gg. This gradient direction is then used to descent on θ\bm{\theta}. More specifically, the inner loop (Step 4) in Algorithm 1 solves the maximization problem (9) for a given fixed value θ=θt\bm{\theta}=\bm{\theta}_{t}. The computed solution of this optimization problem provides an approximation for the gradient of the function g(θ)g(\bm{\theta}), see Lemma A.6 in Appendix A. This gradient is then used in Step 7 to descent on θ\bm{\theta}.

2 Convergence analysis of Multi-Step Gradient Descent Ascent Algorithm for PL games

Throughout this section, we make the following assumption.

The constraint set Θ\Theta is convex and compact. Moreover, there exists a ball with radius RR, denoted by BR{\cal B}_{R}, such that ΘBR\Theta\subseteq{\cal B}_{R}.

We are now ready to state the main result of this section.

Under Assumptions 2.5 and 3.3, for any given scalar ε(0,1)\varepsilon\in(0,1), if we choose KK and TT large enough such that

then there exists an iteration t{0,,T}t\in\{0,\cdots,T\} such that (θt,αt+1)(\bm{\theta}_{t},\bm{\alpha}_{t+1}) is an ε\varepsilon–FNE of (2).

Under Assumption 2.5 and Assumption 3.3, Algorithm 1 finds an ε\varepsilon-FNE of the game (2) with O(ε2){\cal O}(\varepsilon^{-2}) gradient evaluations of the objective with respect to θ\bm{\theta} and O(ε2log(ε1)){\cal O}(\varepsilon^{-2}\log(\varepsilon^{-1})) gradient evaluations with respect to α\bm{\alpha}. If the two gradient oracles have the same complexity, the overall complexity of the method would be O(ε2log(ε1)){\cal O}(\varepsilon^{-2}\log(\varepsilon^{-1})).

The iteration complexity order O(ε2log(ε1))\mathcal{O}(\varepsilon^{-2}\log(\varepsilon^{-1})) in Theorem 3.4 is tight (up to logarithmic factors). This is due to the fact that for general non-convex smooth problems, finding an ε\varepsilon–stationary solution requires at least Ω(ε2)\Omega(\varepsilon^{-2}) gradient evaluations . Clearly, this lower bound is also valid for finding an ε\varepsilon–FNE of PL-games. This is because we can assume that the function f(θ,α)f(\bm{\theta},\bm{\alpha}) does not depend on α\bm{\alpha} (and thus PL in α\bm{\alpha}).

Theorem 3.4 shows that under the PL assumption, the pair (θt,αK(θt))(\bm{\theta}_{t},\bm{\alpha}_{K}(\theta_{t})) computed by Algorithm 1 is an ε\varepsilon–FNE of the game (2). Since αK(θt)\bm{\alpha}_{K}(\bm{\theta}_{t}) is an approximate solution of the inner maximization problem, we get that θt\bm{\theta}_{t} is concurrently an ε\varepsilon–first order stationary solution of the optimization problem (8).

In [51, Theorem 4.2], a similar result was shown for the case when f(θ,α)f(\bm{\theta},\bm{\alpha}) is strongly concave in α\bm{\alpha}. Hence, Theorem 3.4 can be viewed as an extension of [51, Theorem 4.2]. Similar to [51, Theorem 4.2], one can easily extend the result of Theorem 3.4 to the stochastic setting by replacing the gradient of ff with respect to θ\theta in Step 7 by the stochastic version of the gradient.

In the next section we consider the non-convex concave min-max saddle game. It is well-known that convexity/concavity does not imply the PL condition and PL condition does not imply convexity/concavity . Therefore, the problems we consider in the next section are neither restriction nor extension of our results on PL games.

Non-Convex Concave Games

In this section, we focus on “non-convex concave" games satisfying the following assumption:

The objective function f(θ,α)f(\bm{\theta},\bm{\alpha}) is concave in α\bm{\alpha} for any fixed value of θ\bm{\theta}. Moreover, the set A\mathcal{A} is convex and compact, and there exists a ball with radius RR that contains the feasible set A\mathcal{A}.

One major difference of this case with the PL-games is that in this case the function g(θ)=maxαAf(θ,α)g(\bm{\theta})=\max_{\bm{\alpha}\in\mathcal{A}}\,\,f(\bm{\theta},\bm{\alpha}) might not be differentiable. To see this, consider the example g(α)=max0α1(2α1)θg(\alpha)=\max_{0\leq\alpha\leq 1}(2\alpha-1)\theta which is concave in α\alpha. However, the value function g(θ)=θg(\theta)=|\theta| is non-smooth.

Using a small regularization term, we approximate the function g()g(\cdot) by a differentiable function

where fλ(θ,α)f(θ,α)λ2ααˉ2.f_{\lambda}(\bm{\theta},\bm{\alpha})\triangleq f(\bm{\theta},\bm{\alpha})-\dfrac{\lambda}{2}\|\bm{\alpha}-\bar{\bm{\alpha}}\|^{2}. Here αˉA\bar{\bm{\alpha}}\in\mathcal{A} is some given fixed point and λ>0\lambda>0 is a regularization parameter that we will specify later. Since f(θ,α)f(\bm{\theta},\bm{\alpha}) is concave in α\bm{\alpha}, fλ(θ,)f_{\lambda}(\bm{\theta},\cdot) is λ\lambda-strongly concave. Thus, the function gλ()g_{\lambda}(\cdot) becomes smooth with Lipschitz gradient; see Lemma B.1 in the supplementary. Using this property, we propose an algorithm that runs at each iteration multiple steps of Nesterov accelerated projected gradient ascent to estimate the solution of (10). This solution is then used to estimate the gradient of gλ(θ)g_{\lambda}(\bm{\theta}) which directly provides a descent direction on θ\bm{\theta}. Our algorithm computes an ε\varepsilon–FNE point for non-convex concave games with O~(ε3.5)\widetilde{{\cal O}}(\varepsilon^{-3.5}) gradient evaluations. Then for sufficiently small regularization coefficient, we show that the computed point is an ε\varepsilon-FNE.

Notice that since fλf_{\lambda} is Lipschitz smooth and based on the compactness assumption, we can define

where α(θ)arg maxαA  fλ(θ,α)\bm{\alpha}^{*}(\bm{\theta})\triangleq\operatorname*{arg\,max}_{\bm{\alpha}\in\mathcal{A}}\;f_{\lambda}(\bm{\theta},\bm{\alpha}). We are now ready to describe our proposed algorithm.

Our proposed method is outlined in Algorithm 2. This algorithm has two steps: step 2 and step 3. In step 2, KK steps of accelerated gradient ascent method is run over the variable α\bm{\alpha} to find an approximate maximizer of the problem maxα  fλ(θt,α)\max_{\bm{\alpha}}\;f_{\lambda}(\bm{\theta}_{t},\bm{\alpha}). Then using approximate maximizer αt+1\bm{\alpha}_{t+1}, we update θ\bm{\theta} variable using one step of first order methods in step 3.

In step 2, we run KK step of accelerated gradient ascent algorithm over the variable α\alpha with restart every NN iterations. The details of this subroutine can be found in subsection B.1 of the supplementary materials. In step 3 of Algorithm 2, we can either use projected gradient descent update rule

or Frank-Wolfe update rule described in subsection B.2 in the supplementary material. We show convergence of the algorithm to ε\varepsilon–FNE in Theorems 4.2.

Given a scalar ε(0,1)\varepsilon\in(0,1). Assume that Step 7 in Algorithm 2 sets either runs projected gradient descent or Frank-Wolfe iteration. Under Assumptions 4.1 and 2.5,

then there exists t{0,,T}t\in\{0,\ldots,T\} such that (θt,αt+1)(\bm{\theta}_{t},\bm{\alpha}_{t+1}) is an ε\varepsilon–FNE of problem (2).

The proof is relegated to Appendix B.4. ∎

Under Assumptions 2.5 and 4.1, Algorithm 2 finds an ε\varepsilon-first-order stationary solution of the game (2) with O(ε3){\cal O}(\varepsilon^{-3}) gradient evaluations of the objective with respect θ\bm{\theta} and O(ε0.5log(ε1)){\cal O}(\varepsilon^{-0.5}\log(\varepsilon^{-1})) gradient evaluations with respect to α\bm{\alpha}. If the two oracles have the same complexity, the overall complexity of the method would be O(ε3.5log(ε1)){\cal O}(\varepsilon^{-3.5}\log(\varepsilon^{-1})).

Numerical Results

We evaluate the numerical performance of Algorithm 2 in the following two applications:

We conduct two experiment on the Fashion MNIST dataset . This dataset consists of 28×2828\times 28 arrays of grayscale pixel images classified into 10 categories of clothing. It includes 60,00060,000 training images and 10,00010,000 testing images.

Experimental Setup: The recent work in observed that training a logisitic regression model to classify the images of the Fashion MNIST dataset can be biased against certain categories. To remove this bias, proposed to minimize the maximum loss incurred by the different categories. We repeat the experiment when using a more complex non-convex Convolutional Neural Network (CNN) model for classification. Similar to , we limit our experiment to the three categories T-shirt/top, Coat, and Shirts, that correspond to the lowest three testing accuracies achieved by the trained classifier. To minimize the maximum loss over these three categories, we train the classifier to minimize

where W\mathbf{W} represents the parameters of the CNN; and L1{\cal L}_{1}, L2{\cal L}_{2}, and L3{\cal L}_{3} correspond to the loss incurred by samples in T-shirt/top, Coat, and Shirt categories. Problem (12) can be re-written as

Clearly the inner maximization problem is concave; and thus our theory can be applied. To empirically evaluate the regularization scheme proposed in Section 4, we implement two versions of Algorithm 2. The first version solves at each iteration the regularized strongly concave sub-problem

and use the optimum tt to perform a gradient descent step on W\mathbf{W} (notice that fixing the value of W\mathbf{W}, the optimum tt can be computed using KKT conditions and a simple sorting or bisection procedure). The second version of Algorithm 2 solves at each iteration the concave inner maximization problem without the regularization term. Then uses the computed solution to perform a descent step on W\mathbf{W}. Notice that in both cases, the optimization with respect to tt variable can be done in (almost) closed-form update. Although regularization is required to have theoretical convergence guarantees, we compare the two versions of the algorithm on empirical data to determine whether we lose by adding such regularization. We further compare these two algorithms with normal training that uses gradient descent to minimize the average loss among the three categories. We run all algorithms for 55005500 epochs and record the test accuracy of the categories. To reduce the effect of random initialization, we run our methods with 5050 different random initializations and record the average and standard deviation of the test accuracy collected. For fair comparison, the same initialization is used for all methods in each run. The results are summarized in Tables 1. To test our framework in stochastic settings, we repeat the experiment running all algorithms for 12,00012,000 iterations with Adam and SGD optimizer with a bath size of 600600 images (200200 from each category). The results of the second experiment with Adam optimizer are summarized in Table 2. The model architecture and parameters are detailed in Appendix F. The choice of Adam optimizer is mainly because it is more robust to the choice of the step-size and thus can be easily tuned. In fact, the use of SGD or Adam does not change the overall takeaways of the experiments. The results of using SGD optimizer are relegated to Appendix C.

Results: Tables 1 and 2 show the average and standard deviation of the number of correctly classified samples. The average and standard deviation are taken over 50 runs. For each run 1000 testing samples are considered for each category. The results show that when using MinMax and MinMax with regularization, the accuracies across the different categories are more balanced compared to normal training. Moreover, the tables show that Algorithm 2 with regularization provides a slightly better worst-case performance compared to the unregularized approach. Note that the empirical advantages due to regularization appears more in the stochastic setting. To see this compare the differences between MinMax and MinMax with Regularization in Tables 1 and 2. Figure 1 depicts a sample trajectory of deterministic algorithm applied to the regularized and regularized formulations. This figures shows that regularization provides a smoother and slightly faster convergence compared to the unregularized approach. In addition, we apply our algorithm to the exact similar logistic regression setup as in . Results of this experiment can be found in Appendix D.

2 Robust Neural Network Training

Experimental Setup: Neural networks have been widely used in various applications, especially in the field of image recognition. However, these neural networks are vulnerable to adversarial attacks, such as Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attack . These adversarial attacks show that a small perturbation in the data input can significantly change the output of a neural network. To train a robust neural network against adversarial attacks, researchers reformulate the training procedure into a robust min-max optimization formulation , such as

Here w\mathbf{w} is the parameter of the neural network, the pair (xi,yi)(x_{i},y_{i}) denotes the ii-th data point, and δi\delta_{i} is the perturbation added to data point ii. As discussed in this paper, solving such a non-convex non-concave min-max optimization problem is computationally challenging. Motivated by the theory developed in this work, we approximate the above optimization problem with a novel objective function which is concave in the parameters of the (inner) maximization player. To do so, we first approximate the inner maximization problem with a finite max problem

where each x^ij(w)\hat{x}_{ij}(\mathbf{w}) is the result of a targeted attack on sample xix_{i} aiming at changing the output of the network to label jj. These perturbed inputs, which are explained in details in Appendix E, are the function of the weights of the network. Then we replace this finite max inner problem with a concave problem over a probability simplex. Such a concave inner problem allows us to use the multi-step gradient descent-ascent method. The structure of the network and the details of the formulation is detailed in Appendix E.

Results: We compare our results with . Note is the state-of-the-art algorithm and has won the first place, out of 2000\approx 2000 submissions, in the NeurIPS 2018 Adversarial Vision Challenge. The accuracy of our formulation against popular attacks, FGSM and PGD , are summarized in Table 3. This table shows that our formulation leads to a comparable results against state-of-the-art algorithms (while in some cases it also outperform those methods by as much as 15%\approx 15\% accuracy).

Links to code and pre-trained models of above two simulations are available at Appendix G.

References

Appendix A Proofs for results in Section 3

Before proceeding to the proofs of the main results, we need some intermediate lemmas and preliminary definitions.

A function h(x)h({\mathbf{x}}) is said to satisfy the Quadratic Growth (QG) condition with constant γ>0\gamma>0 if

where hh^{*} is the minimum value of the function, and dist(x){\rm dist}({\mathbf{x}}) is the distance of the point xx to the optimal solution set.

The following lemma shows that PL implies QG .

If function ff is PL with constant μ\mu, then ff satisfies the quadratic growth condition with constant γ=4μ\gamma=4\mu.

The next Lemma shows the stability of argmaxαf(θ,α)\arg\max_{\bm{\alpha}}f(\bm{\theta},\bm{\alpha}) with respect to θ\bm{\theta} under PL condition.

Assume that {hθ(α)=f(θ,α)  θ}\{h_{\bm{\theta}}(\bm{\alpha})=-f(\bm{\theta},\bm{\alpha})~{}|~{}\bm{\theta}\} is a class of μ\mu-PL functions in α\bm{\alpha}. Define A(θ)=argmaxαf(θ,α)A(\bm{\theta})=\arg\max_{\bm{\alpha}}f(\bm{\theta},\bm{\alpha}) and assume A(θ)A(\bm{\theta}) is closed. Then for any θ1\bm{\theta}_{1}, θ2\bm{\theta}_{2} and α1A(θ1)\bm{\alpha}_{1}\in A(\bm{\theta}_{1}), there exists an α2A(θ2)\bm{\alpha}_{2}\in A(\bm{\theta}_{2}) such that

Based on the Lipchitzness of the gradients, we have that αf(θ2,α1)L12θ1θ2\|\nabla_{\bm{\alpha}}f(\bm{\theta}_{2},\bm{\alpha}_{1})\|\leq L_{12}\|\bm{\theta}_{1}-\bm{\theta}_{2}\|. Then using the PL condition, we know that

Now we use the result of Lemma A.2 to show that there exists α2=argminαA(θ2)αα12 A(θ2)\bm{\alpha}_{2}=\arg\min_{\bm{\alpha}\in A(\bm{\theta}_{2})}\|\bm{\alpha}-\bm{\alpha}_{1}\|^{2}~{}\in A(\bm{\theta}_{2}) such that

re-arranging the terms, we get the desired result that

Finally, the following lemma would be useful in the proof of Theorem 3.4.

Assume h(x)h({\mathbf{x}}) is μ\mu-PL and LL-smooth. Then, by applying gradient descent with step-size 1/L1/L from point x0x_{0} for KK iterations we get an xKx_{K} such that

We are now ready to prove the results in Section 3.

Under Assumption 2.5 and PL-game assumption,

Moreover, gg is LL-Lipschitz smooth with L=L11+L1222μL=L_{11}+\dfrac{L_{12}^{2}}{2\mu}.

Let αarg maxαAf(θ,α)\bm{\alpha}^{*}\in\operatorname*{arg\,max}_{\bm{\alpha}\in{\cal A}}f(\bm{\theta},\bm{\alpha}). By Lemma A.3, for any scalar τ\tau and direction dd, there exists α(τ)arg maxα  f(θ+τd,α)\bm{\alpha}^{*}(\tau)\in\operatorname*{arg\,max}_{\bm{\alpha}}\;f(\bm{\theta}+\tau d,\bm{\alpha}) such that

To find the directional derivative of g()g(\cdot), we compute

where the second equality holds by writing the Taylor series expansion of f()f(\cdot). Thus, by definition of the directional derivative of g()g(\cdot), we obtain

Note that this relationship holds for any dd. Thus, g(θ)=θf(θ,α)\nabla g(\bm{\theta})=\nabla_{\bm{\theta}}f(\bm{\theta},\bm{\alpha}^{*}) for any αarg maxαAf(θ,α)=A(θ)\bm{\alpha}^{*}\in\operatorname*{arg\,max}_{\bm{\alpha}\in{\cal A}}f(\bm{\theta},\bm{\alpha})=A(\bm{\theta}). Interestingly, the directional derivative does not depend on the choice of α\bm{\alpha}^{*}. This means that θf(θ,α1)=θf(θ,α2)\nabla_{\bm{\theta}}f(\bm{\theta},\bm{\alpha}_{1})=\nabla_{\bm{\theta}}f(\bm{\theta},\bm{\alpha}_{2}) for any α1\bm{\alpha}_{1} and α2\bm{\alpha}_{2} in arg maxαAf(θ,α)\operatorname*{arg\,max}_{\bm{\alpha}\in{\cal A}}f(\bm{\theta},\bm{\alpha}).

We finally show that function gg is Lipschitz smooth. Let α1A(θ1)\bm{\alpha}_{1}^{*}\in A(\bm{\theta}_{1}) and α2=argminαA(θ2)αα12 A(θ2)\bm{\alpha}_{2}^{*}=\arg\min_{\bm{\alpha}\in A(\bm{\theta}_{2})}\|\bm{\alpha}-\bm{\alpha}^{*}_{1}\|^{2}~{}\in A(\bm{\theta}_{2}), then

where the last inequality holds by Lemma A.3.

A.2 Proof of Theorem 3.4

Using Lemma A.5 and Assumption 3.3, we can define

The next result shows that the inner loop in Algorithm 1 computes an approximate gradient of g()g(\cdot). In other words, θf(θt,αt+1)g(θt)\nabla_{\bm{\theta}}f(\bm{\theta}_{t},\bm{\alpha}_{t+1})\approx\nabla g(\bm{\theta}_{t}).

Define κ=L22μ1\kappa=\frac{L_{22}}{\mu}\geq 1 and ρ=11κ<1\rho=1-\frac{1}{\kappa}<1 and assume g(θt)f(θt,α0(θt))<Δg(\bm{\theta}_{t})-f(\bm{\theta}_{t},\bm{\alpha}_{0}(\bm{\theta}_{t}))<\Delta, then for any prescribed ε(0,1)\varepsilon\in(0,1) if we choose KK large enough such that

where Lˉ=max{L12,L22,L,gmax,1}\bar{L}=\max\{L_{12},L_{22},L,g_{max},1\} and Rˉ=max{R,1}\bar{R}=\max\{R,1\}, then the error etθf(θt,αK(θt))g(θt)e_{t}\triangleq\nabla_{\bm{\theta}}f(\bm{\theta}_{t},\bm{\alpha}_{K}(\bm{\theta}_{t}))-\nabla g(\bm{\theta}_{t}) has a norm

Thus, using the QG result of Lemma A.2, we know that there exists an αA(θt)\bm{\alpha}^{*}\in A(\bm{\theta}_{t}) such that

where the last inequality holds by our choice of KK which yields

Here the second inequality holds since Rˉ1\bar{R}\geq 1, and the third inequality holds since gmaxLˉg_{max}\leq\bar{L}.

To prove the argument of the Lemma, note that

where the last inequality holds by our choice of KK which yields

Here the second inequality holds since ε<1\varepsilon<1, Lˉ,Rˉ1\bar{L},\bar{R}\geq 1, and LLˉL\leq\bar{L}.

The above lemma implies that Algorithm 1 behaves similar to the simple vanilla gradient descent method applied to problem (8).

Notice that the assumption g(θt)f(θt,α0(θt))Δ,  tg(\bm{\theta}_{t})-f(\bm{\theta}_{t},\bm{\alpha}_{0}(\bm{\theta}_{t}))\leq\Delta,\;\forall t could be justified by Lemma A.3. More specifically, by Lemma A.3,

where αt+1arg maxαf(θt+1,α)\bm{\alpha}_{t+1}\triangleq\operatorname*{arg\,max}_{\bm{\alpha}}\,f(\bm{\theta}_{t+1},\bm{\alpha}) and αtarg maxαf(θt,α)\bm{\alpha}_{t}\triangleq\operatorname*{arg\,max}_{\bm{\alpha}}\,f(\bm{\theta}_{t},\bm{\alpha}). Hence, the difference between consecutive optimal solutions computed by the inner loop of the algorithm, are upper bounded by the difference between corresponding θ\bm{\theta}’s. Since Θ\Theta is a compact set, we can find an upper bound Δ\Delta such that g(θt)f(θt,α0(θt))Δg(\bm{\theta}_{t})-f(\bm{\theta}_{t},\alpha_{0}(\bm{\theta}_{t}))\leq\Delta, for all tt. We are now ready to show Theorem 3.4

where gminθ  g(θ)g^{*}\triangleq\min_{\bm{\theta}}\;g(\bm{\theta}) is the optimal value of gg. Note that by the compactness assumption of the set Θ\Theta, we have Δg=g(θ0)g<\Delta_{g}=g(\theta_{0})-g^{*}<\infty.

Based on the projection property, we know that

Therefore, by setting θ=θt\bm{\theta}=\bm{\theta}_{t}, we get

where α(θt)arg maxαAf(θt,α)\bm{\alpha}^{*}(\bm{\theta}_{t})\in\operatorname*{arg\,max}_{\bm{\alpha}\in\mathcal{A}}\,f(\bm{\theta}_{t},\bm{\alpha}) and e_{t}\triangleq\nabla_{\bm{\theta}}f\big{(}\bm{\theta}_{t},\bm{\alpha}_{t+1}\big{)}-\nabla_{\bm{\theta}}f\big{(}\bm{\theta}_{t},\bm{\alpha}^{*}(\bm{\theta}_{t})\big{)}. By Taylor expansion, we have

where the last inequality holds by (27). Moreover, by the projection property, we know that

Here the second inequality holds by Cauchy-Schwartz, the definition of ete_{t} and our assumption that ΘBR\Theta\subseteq{\cal B}_{R}. Moreover, the last inequality holds by our choice of KK in Lemma A.6 which yields

where the inequality holds by using Cauchy Schwartz and our assumption that Θ\Theta is in a ball of radius RR. Hence,

where the last inequality holds by using Lemma A.6 and choosing KK and TT:

Therefore, using Lemma A.6, there exists at least one index t^\widehat{t} for which

Appendix B Algorithmic details and proofs for the results in Section 4

B.2 Frank–Wolfe update rule for Step 3 in Algorithm 2

In Step 3 of Algorithm 2, instead of projected gradient descent discussed in the main body, we can also run one step of Frank–Wolfe method. More precisely, we can set

is the first order descent direction. In the unconstrained case, the descent direction is s^t=θfλ(θt,αt+1)\widehat{s}_{t}=-\nabla_{\bm{\theta}}f_{\lambda}(\bm{\theta}_{t},\bm{\alpha}_{t+1}), which becomes the same as the gradient descent step.

Under Assumption 2.5 and Assumption 4.1, the function gλg_{\lambda} is LL-Lipschitz smooth with L=L11+L122λL=L_{11}+\dfrac{L_{12}^{2}}{\lambda}.

First notice that the differentiability of the function gλ()g_{\lambda}(\cdot) follows directly from Danskin’s Theorem . It remains to show that gλg_{\lambda} is a Lipschitz smooth function. Let

Then by strong convexity of fλ(θ,)-f_{\lambda}(\bm{\theta},\cdot), we have

Moreover, due to optimality of α1\bm{\alpha}_{1}^{*}, we have

where the last inequality holds by Cauchy-Schwartz and the Lipschtizness assumption. We finally show that gλg_{\lambda} is Lipschitz smooth.

where the last inequality holds by (39). ∎

Algorithm 2 solves the inner maximization problem using accelerated projected gradient descent (outlined in Algorithm 3). The next lemma is known for accelerated projected gradient descent when applied to strongly convex functions.

Assume h(x)h({\mathbf{x}}) is λ\lambda-strongly convex and LL-smooth. Then, applying accelerated projected gradient descent algorithm with step-size 1/L1/L and restart parameter N8L/λ1N\triangleq\sqrt{8L/\lambda}-1 for KK iterations, we get xK{\mathbf{x}}_{K} such that

where xarg minxFh(x){\mathbf{x}}^{*}\triangleq\operatorname*{arg\,min}_{{\mathbf{x}}\in{\cal F}}h({\mathbf{x}}).

where the second inequality holds by strong convexity of hh and the optimality condition of xx^{*}, and the last inequality holds by our choice of NN. This yields,

B.4 Proof of Theorem 4.2

We first show that the inner loop in Algorithm 2 computes an approximate gradient of gλ()g_{\lambda}(\cdot). In other words, θfλ(θt,αt+1)gλ(θt)\nabla_{\bm{\theta}}f_{\lambda}(\bm{\theta}_{t},\bm{\alpha}_{t+1})\approx\nabla g_{\lambda}(\bm{\theta}_{t}).

Define κ=L22λ1\kappa=\dfrac{L_{22}}{\lambda}\geq 1 and assume gλ(θt)fλ(θt,α0(θt))<Δg_{\lambda}(\bm{\theta}_{t})-f_{\lambda}(\bm{\theta}_{t},\bm{\alpha}_{0}(\bm{\theta}_{t}))<\Delta, then for any prescribed ε(0,1)\varepsilon\in(0,1) if we choose KK large enough such that

where Lˉmax{L12,L22,L,gmax,1}\bar{L}\triangleq\max\{L_{12},L_{22},L,g_{max},1\} and Rˉ=max{R,1}\bar{R}=\max\{R,1\}, then the error etθfλ(θt,αK(θt))gλ(θ)e_{t}\triangleq\nabla_{\bm{\theta}}f_{\lambda}(\bm{\theta}_{t},\bm{\alpha}_{K}(\bm{\theta}_{t}))-\nabla g_{\lambda}(\bm{\theta}) has a norm

Let α(θt)arg maxαA  fλ(θt,α)\bm{\alpha}^{*}(\bm{\theta}_{t})\triangleq\operatorname*{arg\,max}_{\bm{\alpha}\in\mathcal{A}}\;f_{\lambda}(\bm{\theta}_{t},\bm{\alpha}). Then by strong convexity of f(θt,)-f(\bm{\theta}_{t},\cdot), we get

Combined with the Lipschitz smoothness property of the objective, we obtain

where the second inequality uses (47), and the third inequality uses the choice of KK in (43) which yields yields

Here the second inequality holds since Rˉ1\bar{R}\geq 1, and the third inequality holds since gmaxLˉg_{max}\leq\bar{L}. To prove the second argument of the lemma, we also use the Lipschitz smoothness property of the objective to get

where the second inequality holds by our Lipschitzness assumption and the last inequality holds by our assumption that L22/λ1L_{22}/\lambda\geq 1. Moreover,

where the last equality holds since α(θt)\bm{\alpha}^{*}(\bm{\theta}_{t}) is optimal and α(θt)αK(θt)1.\|\bm{\alpha}^{*}(\bm{\theta}_{t})-\bm{\alpha}_{K}(\bm{\theta}_{t})\|\leq 1. Combining (49) and (50), we get

where the second inequality uses (47), and the last inequality holds by our choice of KK in (43) and since ε(0,1)\varepsilon\in(0,1). ∎

The above lemma implies that θfλ(θt,αK(θt))gλ(θt)δLε264Rˉ3Lˉ2\|\nabla_{\bm{\theta}}f_{\lambda}(\bm{\theta}_{t},\bm{\alpha}_{K}(\bm{\theta}_{t}))-\nabla g_{\lambda}(\bm{\theta}_{t})\|\leq\delta\triangleq\dfrac{L\varepsilon^{2}}{64\bar{R}^{3}\bar{L}^{2}}. We now show that our assumption g(θt)f(θt,α0(θt))Δg(\bm{\theta}_{t})-f(\bm{\theta}_{t},\bm{\alpha}_{0}(\bm{\theta}_{t}))\leq\Delta for all t in the above Lemma holds. Let

Then by strong convexity of fλ(θ,)-f_{\lambda}(\bm{\theta},\cdot), we have

Moreover, due to optimality of αt\bm{\alpha}_{t}^{*}, we have

Hence, the difference between consecutive optimal solutions computed by the inner loop of the algorithm, are upper bounded by the difference between corresponding θ\bm{\theta}’s. Since Θ\Theta is a compact set, we can find an upper bound Δ\Delta such that g(θt)f(θt,α0(θt))Δg(\bm{\theta}_{t})-f(\bm{\theta}_{t},\alpha_{0}(\bm{\theta}_{t}))\leq\Delta, for all tt.

We are now ready to show the main theorem that implies convergence of our proposed algorithm to an ε\varepsilon–first-order stationary solution of problem (2). In particular, we show that using θfλ(θt,αK(θt))\nabla_{\bm{\theta}}f_{\lambda}(\bm{\theta}_{t},\bm{\alpha}_{K}(\bm{\theta}_{t})) instead of gλ(θt)\nabla g_{\lambda}(\bm{\theta}_{t}) for a small enough λ\lambda in the Frank-Wolfe or projected descent algorithms applied to gλg_{\lambda}, finds an ε\varepsilon–FNE. We are now ready to show Theorem 4.2.

Frank-Wolfe Steps: We now show the result when Step 7 of Algorithm 2 sets

Using descent lemma on gλg_{\lambda} and the definition of L~\widetilde{L} in Algorithm 2, we have

where s^t\widehat{s}_{t} and Xt{\cal X}_{t} are defined in equations (35) and (36) of the manuscript, and the second and last inequalities use the fact that s^t1\|\widehat{s}_{t}\|\leq 1.

Summing up these inequalities for all values of tt leads to

Here the first inequality in (57) holds by (11), Cauchy-Schwartz, and the fact that s^t1\|\widehat{s}_{t}\|\leq 1. The last inequality holds by our choice of KK in Lemma B.3

which yields et1gmax\|e_{t}\|\leq 1\leq g_{max} and by choosing TT such that

Therefore, using Lemma B.3, there exists at least one index t^\widehat{t} for which

where the first inequality uses Cauchy Shwartz and the fact that s1\|s\|\leq 1, and the last inequality holds due to (58), the choice of λ\lambda in the theorem and our assumption that αK(θt^)αˉ2R\|\bm{\alpha}_{K}(\bm{\theta}_{\widehat{t}})-\bar{\bm{\alpha}}\|\leq 2R.

Projected Gradient Descent: We start by defining

where gλminθ  gλ(θ)g_{\lambda}^{*}\triangleq\min_{\bm{\theta}}\;g_{\lambda}(\bm{\theta}) is the optimal value of gλg_{\lambda}. Note that by the compactness assumption of the set Θ\Theta, we have Δg=gλ(θ0)gλ<\Delta_{g}=g_{\lambda}(\theta_{0})-g_{\lambda}^{*}<\infty.

We now show the result when Step 7 of Algorithm 2 sets

Based on the projection property, we know that

Therefore, by setting θ=θt\bm{\theta}=\bm{\theta}_{t}, we get

where α(θt)arg maxαAfλ(θt,α)\bm{\alpha}^{*}(\bm{\theta}_{t})\triangleq\operatorname*{arg\,max}_{\bm{\alpha}\in\mathcal{A}}\,f_{\lambda}(\bm{\theta}_{t},\bm{\alpha}) and e_{t}\triangleq\nabla_{\bm{\theta}}f\big{(}\bm{\theta}_{t},\bm{\alpha}_{t+1}\big{)}-\nabla_{\bm{\theta}}f\big{(}\bm{\theta}_{t},\bm{\alpha}^{*}(\bm{\theta}_{t})\big{)}.

Moreover, by the projection property, we know that

Here the second inequality holds by Cauchy-Schwartz, the definition of ete_{t} and our assumption that ΘBR\Theta\subseteq{\cal B}_{R}. Moreover, the last inequality holds by our choice of KK in Lemma A.6 which yields

where the inequality holds by using Cauchy Schwartz and our assumption that Θ\Theta is in a ball of radius RR. Hence,

where the last inequality holds by using Lemma B.3 and choosing KK and TT:

Therefore, using Lemma B.3, there exists at least one index t^\widehat{t} for which

where the first inequality uses Cauchy Shwartz and the fact that s1\|s\|\leq 1, and the last inequality holds due to (67), the choice of λ\lambda in the theorem and our assumption that αK(θt^)αˉ2R\|\bm{\alpha}_{K}(\bm{\theta}_{\widehat{t}})-\bar{\bm{\alpha}}\|\leq 2R.

Appendix C Numerical Results on Fashion MNIST with SGD

The results of using SGD optimizer are summarized in Table 4 and Table 5. Note SGD optimizer requires more tuning and therefore the results when batch-size=3000\text{batch-size}=3000 is also included here.

Appendix D Numerical Results on Fashion MNIST with Logistic Rgression Model

Table 6 shows that the proposed formulation gives better accuracies under the worst category (Shirts), and the accuracies over three categories are more balanced. Note that this model is trained by gradient descent. The standard derivations not equal to is due to the early termination of the simulation.

Appendix E Numerical Results on Robust Neural Network Training

Neural networks have been widely used in various applications, especially in the field of image recognition. However, these neural networks are vulnerable to adversarial attacks, such as Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attack . These adversarial attacks show that a small perturbation in the data input can significantly change the output of a neural network. To train a robust neural network against adversarial attacks, researchers reformulate the training procedure into a robust min-max optimization formulation , such as

Here w\mathbf{w} is the parameter of the neural network, the pair (xi,yi)(x_{i},y_{i}) denotes the ii-th data point, and δi\delta_{i} is the perturbation added to data point ii. As discussed in this paper, solving such a non-convex non-concave min-max optimization problem is computationally challenging. Motivated by the theory developed in this work, we approximate the above optimization problem with a novel min-max objective function which has concave inner optimization problem. To do so, we first approximate the inner maximization problem with a finite max problem

where each x^ij(w)\hat{x}_{ij}(\mathbf{w}) is the result of a targeted attack on sample xix_{i} aiming at changing the output of the network to label jj. More specifically, x^ij(w)\hat{x}_{ij}(\mathbf{w}) is obtained through the following procedure:

In the one but last layer of the neural network architecture for learning classification on MNIST we have 10 different neurons, each corresponding with one category of classification. For any sample (xi,yi)(x_{i},y_{i}) in the dataset and any j=0,,9j=0,\cdots,9, starting from xij0=xix_{ij}^{0}=x_{i}, we run gradient ascent to obtain the following chain of points:

where ZjZ_{j} is the network logit before softmax corresponding to label jj; α>0\alpha>0 is the step-size; and ProjB(x,ε)[]\text{Proj}_{B(x,\varepsilon)}[\cdot] is the projection to the infinity ball with radius ε\varepsilon centered at xx. Finally, we set x^ij(w)=xijK\hat{x}_{ij}(\mathbf{w})=x_{ij}^{K} in (69).

Clearly, we can replace the finite max problem (69) with a concave problem over a probability simplex, i.e.,

which is non-convex in ww, but concave in t\mathbf{t}. Hence we can apply Algorithm 2 to solve this opimization problem. We test (70) on MNIST dataset with a Convolutional Neural Network(CNN) with the architecture detailed in Table 7. The result of our experiment is presented in Table 8.

We would like to note that there is a mismatch between our theory and this numerical experiment. In particular, we assume smoothness of the objective function in our theory. However, in this experiment, the ReLu activation functions and the projection operator make the objective function non-smooth. We also did not include regularizer (strongly concave term) while solving (70) as the optimal regularizer was very small (and almost zero).

The main take away from this experiment is to demonstrate the practicality of the following idea: when solving general challenging non-convex min-max problems, it might be possible to approximate it with one-sided non-convex min-max problems where the objective function is solvable with respect to one of the player’s variable. Such a reformulation leads to computationally tractable problems and (possibly) no loss in the performance.

Appendix F Experimental Setup of Fair Classifier

Appendix G Links

Robust NN Training: https://github.com/optimization-for-data-driven-science/Robust-NN-Training Fair Classifier: https://github.com/optimization-for-data-driven-science/FairFashionMNIST