Solving a Class of Non-Convex Min-Max Games Using Iterative First Order Methods

Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D. Lee, Meisam Razaviyayn

Introduction

Recent years have witnessed a wide range of machine learning and robust optimization applications being formulated as a min-max saddle point game; see and the references therein. Examples of problems that are formulated under this framework include generative adversarial networks (GANs) , reinforcement learning , adversarial learning , learning exponential families , fair statistical inference , generative adversarial imitation learning , distributed non-convex optimization and many others. These applications require solving an optimization problem of the form

This optimization problem can be viewed as a zero-sum game between two players. The goal of the first player is to minimize $f(\bm{\theta},\bm{\alpha})$ by tuning $\bm{\theta}$ , while the other player’s objective is to maximize $f(\bm{\theta},\bm{\alpha})$ by tuning $\bm{\alpha}$ . Gradient-based methods, especially gradient descent-ascent (GDA), are widely used in practice to solve these problems. GDA alternates between a gradient ascent steps on $\bm{\alpha}$ and a gradient descent steps on $\bm{\theta}$ . Despite its popularity, this algorithm fails to converge even for simple bilinear zero-sum games . This failure was fixed by adding negative momentum or by using primal-dual methods proposed by .

When the objective $f$ is convex in $\bm{\theta}$ and concave in $\bm{\alpha}$ , the corresponding variational inequality becomes monotone. This setting has been extensively studied and different algorithms have been developed for finding a Nash equilibrium . Moreover, proposed an algorithm for solving a more general setting that covers both monotone and psuedo-monotone variational problems.

While the convex-concave setting has been extensively studied in the literature, recent machine learning applications urge the necessity of moving beyond these classical settings. For example, in a typical GAN problem formulation, two neural networks (generator and discriminator) compete in a non-convex zero-sum game framework . For general non-convex non-concave games, [28, Proposition 10] provides an example for which local Nash equilibrium does not exist. Similarly, one can show that even second-order Nash equilibrium may not exist for non-convex games, see Section 2 for more details. Therefore, a well-justified objective is to find first order Nash equilibria of such games ; see definitions and discussion in Section 2. The first order Nash equilibrium can be viewed as a direct extension of the concept of first order stationarity in optimization to the above min-max game. While $\varepsilon$ –first order stationarity in the context of optimization can be found efficiently in ${\cal O}(\varepsilon^{-2})$ iterations with gradient descent algorithm , the question of whether it is possible to design a gradient-based algorithm that can find an $\varepsilon$ –first order Nash equilibrium for general non-convex saddle point games remains open.

Several recent results provided a partial answer to the problem of finding first-order stationary points of a non-convex min-max game. For instance, proposed a stochastic gradient descent algorithm for the case when $f(\cdot,\cdot)$ is strongly concave in $\bm{\alpha}$ and show convergence of the algorithm to an $\varepsilon$ –first-order Nash equilibrium with $\widetilde{\cal O}(\varepsilon^{-2})$ gradient evaluations. Also, the work analyzes the gradient descent algorithm with Max-oracle and shows $O(\epsilon^{-4})$ gradient evaluations and max-oracle calls for solving min-max problems where the inner problem can be solved in one iteration using an existing oracle. More recently, considered the case where $f$ is concave in $\bm{\alpha}$ . They developed a descent-ascent algorithm with iteration complexity $\widetilde{\cal O}(\varepsilon^{-4})$ . In this non-convex concave setting, proposed a stochastic sub-gradient descent method with worst-case complexity $\widetilde{\cal O}(\varepsilon^{-6})$ . Under the same concavity assumption on $f$ , in this paper, we propose an alternative multi-step framework that finds an $\varepsilon$ –first order Nash equilibrium/stationary with $\widetilde{\cal O}(\varepsilon^{-3.5})$ gradient evaluations.

In an effort to solve the more general non-convex non-concave setting, developed a framework that converges to $\varepsilon$ -first order stationarity/Nash equilibrium under the assumption that there exists a solution to the Minty variational inequality at each iteration. Although among the first algorithms with have theoretical convergence guarantees in the non-convex non-concave setting, the conditions required are strong and difficult to check. To the best of our knowledge, there is no practical problem for which the Minty variational inequality condition has been proven. With the motivation of exploring the non-convex non-concave setting, we propose a simple multi-step gradient descent ascent algorithm for the case where the objective of one of the players satisfies the Polyak-Łojasiewicz (PL) condition. We show the worst-case complexity of $\widetilde{\cal O}(\varepsilon^{-2})$ for our algorithm. This rate is optimal in terms of dependence on $\varepsilon$ up to logarithmic factors as discussed in Section 3. Compared to Minty variational inequality condition used in , the PL condition is very well studied in the literature and has been theoretically verified for objectives of optimization problems arising in many practical problems. For example, it has been proven to be true for objectives of over-parameterized deep networks , learning LQR models , phase retrieval , and many other simple problems discussed in . In the context of min-max games, it has also been proven useful in generative adversarial imitation learning with LQR dynamics , as discussed in Section 3.

The rest of this paper is organized as follows. In Section 2 we define the concepts of First-order Nash equilibrium (FNE) and $\varepsilon$ –FNE. In Section 3, we describe our algorithm designed for min-max games with the objective of one player satisfying the PL condition. Finally, in Section 4 we describe our method for solving games in which the function $f(\theta,\alpha)$ is concave in $\alpha$ (or convex in $\theta$ ).

Two-player Min-Max Games and First-Order Nash Equilibrium

Consider the two-player zero sum min-max game

where $\Theta$ and $\mathcal{A}$ are both convex sets, and $f(\bm{\theta},\bm{\alpha})$ is a continuously differentiable function. We say $(\bm{\theta}^{*},\bm{\alpha}^{*})\in\Theta\times\mathcal{A}$ is a Nash equilibrium of the game if

In convex-concave games, such a Nash equilibrium always exists and several algorithms were proposed to find Nash equilibria . However, in the non-convex non-concave regime, computing these points is in general NP-Hard. In fact, even finding local Nash equilibria is NP-hard in the general non-convex non-concave regime.In addition, as shown by [28, Proposition 10], local Nash equilibria for general non-convex non-concave games may not exist. Thus, in this paper we aim for the less ambitious goal of finding first-order Nash equilibrium which is defined in the sequel.

A point $(\bm{\theta}^{*},\bm{\alpha}^{*})\in\Theta\times\mathcal{A}$ is a first order Nash equilibrium (FNE) of the game (2) if

Notice that this definition, which is also used in , contains the first order necessary optimality conditions of the objective function of each player . Thus they are necessary conditions for local Nash equilibrium. Moreover, in the absence of constraints, the above definition simplifies to $\nabla_{\bm{\theta}}f(\bm{\theta}^{*},\bm{\alpha}^{*})=0$ and $\nabla_{\bm{\alpha}}f(\bm{\theta}^{*},\bm{\alpha}^{*})=0$ , which are the well-known unconstrained first-order optimality conditions. Based on this observation, it is tempting to think that the above first-order Nash equilibrium condition does not differentiate between the min-max type solutions of (2) and min-min solutions of the type $\min_{\bm{\theta}\in\Theta,\bm{\alpha}\in\mathcal{A}}f(\bm{\theta},\bm{\alpha})$ . However, the direction of the second inequality in (3) would be different if we have considered the min-min problem instead of min-max problem. This different direction makes the problem of finding a FNE non-trivial. The following theorem guarantees the existence of first-order Nash equilibria under some mild assumptions.

Suppose the sets $\Theta$ and $\mathcal{A}$ are no-empty, compact, and convex. Moreover, assume that the function $f(\cdot,\cdot)$ is twice continuously differentiable. Then there exists a feasible point $(\bar{\bm{\theta}},\bar{\bm{\alpha}})$ that is first-order Nash equilibrium.

The above theorem guarantees existence of FNE points even when (local) Nash equilibria may not exist. The next natural question is about the computability of such methods. Since in practice we use iterative methods for computation, we need to define the notion of approximate–FNE.

A point $(\bm{\theta}^{*},\bm{\alpha}^{*})$ is said to be an $\varepsilon$ –first-order Nash equilibrium ( $\varepsilon$ –FNE) of the game (2) if

In the absence of constraints, $\varepsilon$ –FNE in Definition 2.3 reduces to $\|\nabla_{\bm{\theta}}f(\bm{\theta}^{*},\bm{\alpha}^{*})\|\leq\varepsilon\quad{\rm and}\quad\|\nabla_{\bm{\alpha}}f(\bm{\theta}^{*},\bm{\alpha}^{*})\|\leq\varepsilon.$

The $\varepsilon$ –FNE definition above is based on the first order optimality measure of the objective of each player. Such first-order optimality measure has been used before in the context of optimization; see . Such a condition guarantees that each player cannot improve their objective function using first order information. Similar to the optimization setting, one can define the second-order Nash equilibrium as a point that each player cannot improve their objective further by using first and second order information of their objectives. However, the use of second order Nash equilibria is more subtle in the context of games. The following example shows that such a point may not exist. Consider the game

Then $(0,0)$ is the only first-order Nash equilibrium and is not a second-order Nash equilibrium.

In this paper, our goal is to find an $\varepsilon$ –FNE of the game (2) using iterative methods. To proceed, we make the following standard assumptions about the smoothness of the objective function $f$ .

The function $f$ is continuously differentiable in both $\bm{\theta}$ and $\bm{\alpha}$ and there exists constants $L_{11}$ , $L_{22}$ and $L_{12}$ such that for every $\bm{\alpha},\bm{\alpha}_{1},\bm{\alpha}_{2}\in\mathcal{A}$ , and $\bm{\theta},\bm{\theta}_{1},\bm{\theta}_{2}\in\Theta$ , we have

Non-Convex PL-Game

In this section, we consider the problem of developing an “efficient" algorithm for finding an $\varepsilon$ –FNE of (2) when the objective of one of the players satistys Polyak-Łojasiewicz (PL) condition. To proceed, let us first formally define the Polyak-Łojasiewicz (PL) condition.

A differentiable function $h({\mathbf{x}})$ with the minimum value $h^{*}=\min_{x}\;h({\mathbf{x}})$ is said to be $\mu$ -Polyak-Łojasiewicz ( $\mu$ -PL) if

The PL-condition has been established and utilized for analyzing many practical modern problems . Moreover, it is well-known that a function can be non-convex and still satisfy the PL condition . Based on the definition above, we define a class of min-max PL-games.

A simple example of a practical PL-game is detailed next.

Imitation learning is a paradigm that aims to learn from an expert’s demonstration of performing a task . It is known that this learning process can be formulated as a min-max game . In such a game the minimization is performed over all the policies and the goal is to minimize the discrepancy between the accumulated reward for expert’s policy and the proposed policy. On the other hand, the maximization is done over the parameters of the reward function and aims at maximizing this discrepancy over the parameters of the reward function. This approach is also referred to as generative adversarial imitation learning (GAIL) . The problem of generative adversarial imitation learning for linear quadratic regulators refers to solving this problem for the specific case where the underlying dynamic and the reward function come from a linear quadratic regulator . To be more specific, this problem can be formulated as $\min_{K}\max_{\bm{\theta}\in\Theta}m(K,\bm{\theta})$ , where $K$ represents the choice of the policy and $\bm{\theta}$ represents the parameters of the dynamic and the reward functions. Under the discussed setting, $m$ is strongly concave in $\bm{\theta}$ and PL in $K$ (see for more details). Note that since $m$ is strongly concave in $\bm{\theta}$ and $PL$ in $K$ , any FNE of the game would also be a Nash equilibrium point. Also note that the notion of FNE does not depend on the ordering of the $\min$ and $\max$ . Thus, to be consistent with our notion of PL-games, we can formulate the problem as

Thus, generative adversarial imitation learning of linear quadratic regulators is an example of finding a FNE for a min-max PL-game.

In what follows, we present a simple iterative method for computing an $\varepsilon$ –FNE of PL games.

In this section, we propose a multi-step gradient descent ascent algorithm that finds an $\varepsilon$ –FNE point for PL-games. At each iteration, our method runs multiple projected gradient ascent steps to estimate the solution of the inner maximization problem. This solution is then used to estimate the gradient of the inner maximization value function, which directly provides a descent direction. In a nutshell, our proposed algorithm is a gradient descent-like algorithm on the inner maximization value function. To present the ideas of our multi-step algorithm, let us re-write (2) as

A famous classical result in optimization is Danskin’s theorem which provides a sufficient condition under which the gradient of the value function $\max_{\bm{\alpha}\in\mathcal{A}}f(\bm{\theta},\bm{\alpha})$ can be directly evaluated using the gradient of the objective $f(\bm{\theta},\bm{\alpha}^{*})$ evaluated at the optimal solution $\bm{\alpha}^{*}$ . This result requires the optimizer $\bm{\alpha}^{*}$ to be unique. Under our PL assumption on $f(\bm{\theta},\cdot)$ , the inner maximization problem (9) may have multiple optimal solutions. Hence, Danskin’s theorem does not directly apply. However, as we will show in Lemma A.5 in the supplementary, under the PL assumption, we still can show the following result

despite the non-uniqueness of the optimal solution.

Motivated by this result, we propose a Multi-step Gradient Descent Ascent algorithm that solves the inner maximization problem to “approximate” the gradient of the value function $g$ . This gradient direction is then used to descent on $\bm{\theta}$ . More specifically, the inner loop (Step 4) in Algorithm 1 solves the maximization problem (9) for a given fixed value $\bm{\theta}=\bm{\theta}_{t}$ . The computed solution of this optimization problem provides an approximation for the gradient of the function $g(\bm{\theta})$ , see Lemma A.6 in Appendix A. This gradient is then used in Step 7 to descent on $\bm{\theta}$ .

2 Convergence analysis of Multi-Step Gradient Descent Ascent Algorithm for PL games

Throughout this section, we make the following assumption.

The constraint set $\Theta$ is convex and compact. Moreover, there exists a ball with radius $R$ , denoted by ${\cal B}_{R}$ , such that $\Theta\subseteq{\cal B}_{R}$ .

We are now ready to state the main result of this section.

Under Assumptions 2.5 and 3.3, for any given scalar $\varepsilon\in(0,1)$ , if we choose $K$ and $T$ large enough such that

then there exists an iteration $t\in\{0,\cdots,T\}$ such that $(\bm{\theta}_{t},\bm{\alpha}_{t+1})$ is an $\varepsilon$ –FNE of (2).

Under Assumption 2.5 and Assumption 3.3, Algorithm 1 finds an $\varepsilon$ -FNE of the game (2) with ${\cal O}(\varepsilon^{-2})$ gradient evaluations of the objective with respect to $\bm{\theta}$ and ${\cal O}(\varepsilon^{-2}\log(\varepsilon^{-1}))$ gradient evaluations with respect to $\bm{\alpha}$ . If the two gradient oracles have the same complexity, the overall complexity of the method would be ${\cal O}(\varepsilon^{-2}\log(\varepsilon^{-1}))$ .

The iteration complexity order $\mathcal{O}(\varepsilon^{-2}\log(\varepsilon^{-1}))$ in Theorem 3.4 is tight (up to logarithmic factors). This is due to the fact that for general non-convex smooth problems, finding an $\varepsilon$ –stationary solution requires at least $\Omega(\varepsilon^{-2})$ gradient evaluations . Clearly, this lower bound is also valid for finding an $\varepsilon$ –FNE of PL-games. This is because we can assume that the function $f(\bm{\theta},\bm{\alpha})$ does not depend on $\bm{\alpha}$ (and thus PL in $\bm{\alpha}$ ).

Theorem 3.4 shows that under the PL assumption, the pair $(\bm{\theta}_{t},\bm{\alpha}_{K}(\theta_{t}))$ computed by Algorithm 1 is an $\varepsilon$ –FNE of the game (2). Since $\bm{\alpha}_{K}(\bm{\theta}_{t})$ is an approximate solution of the inner maximization problem, we get that $\bm{\theta}_{t}$ is concurrently an $\varepsilon$ –first order stationary solution of the optimization problem (8).

In [51, Theorem 4.2], a similar result was shown for the case when $f(\bm{\theta},\bm{\alpha})$ is strongly concave in $\bm{\alpha}$ . Hence, Theorem 3.4 can be viewed as an extension of [51, Theorem 4.2]. Similar to [51, Theorem 4.2], one can easily extend the result of Theorem 3.4 to the stochastic setting by replacing the gradient of $f$ with respect to $\theta$ in Step 7 by the stochastic version of the gradient.

In the next section we consider the non-convex concave min-max saddle game. It is well-known that convexity/concavity does not imply the PL condition and PL condition does not imply convexity/concavity . Therefore, the problems we consider in the next section are neither restriction nor extension of our results on PL games.

Non-Convex Concave Games

In this section, we focus on “non-convex concave" games satisfying the following assumption:

The objective function $f(\bm{\theta},\bm{\alpha})$ is concave in $\bm{\alpha}$ for any fixed value of $\bm{\theta}$ . Moreover, the set $\mathcal{A}$ is convex and compact, and there exists a ball with radius $R$ that contains the feasible set $\mathcal{A}$ .

One major difference of this case with the PL-games is that in this case the function $g(\bm{\theta})=\max_{\bm{\alpha}\in\mathcal{A}}\,\,f(\bm{\theta},\bm{\alpha})$ might not be differentiable. To see this, consider the example $g(\alpha)=\max_{0\leq\alpha\leq 1}(2\alpha-1)\theta$ which is concave in $\alpha$ . However, the value function $g(\theta)=|\theta|$ is non-smooth.

Using a small regularization term, we approximate the function $g(\cdot)$ by a differentiable function

where $f_{\lambda}(\bm{\theta},\bm{\alpha})\triangleq f(\bm{\theta},\bm{\alpha})-\dfrac{\lambda}{2}\|\bm{\alpha}-\bar{\bm{\alpha}}\|^{2}.$ Here $\bar{\bm{\alpha}}\in\mathcal{A}$ is some given fixed point and $\lambda>0$ is a regularization parameter that we will specify later. Since $f(\bm{\theta},\bm{\alpha})$ is concave in $\bm{\alpha}$ , $f_{\lambda}(\bm{\theta},\cdot)$ is $\lambda$ -strongly concave. Thus, the function $g_{\lambda}(\cdot)$ becomes smooth with Lipschitz gradient; see Lemma B.1 in the supplementary. Using this property, we propose an algorithm that runs at each iteration multiple steps of Nesterov accelerated projected gradient ascent to estimate the solution of (10). This solution is then used to estimate the gradient of $g_{\lambda}(\bm{\theta})$ which directly provides a descent direction on $\bm{\theta}$ . Our algorithm computes an $\varepsilon$ –FNE point for non-convex concave games with $\widetilde{{\cal O}}(\varepsilon^{-3.5})$ gradient evaluations. Then for sufficiently small regularization coefficient, we show that the computed point is an $\varepsilon$ -FNE.

Notice that since $f_{\lambda}$ is Lipschitz smooth and based on the compactness assumption, we can define

where $\bm{\alpha}^{*}(\bm{\theta})\triangleq\operatorname*{arg\,max}_{\bm{\alpha}\in\mathcal{A}}\;f_{\lambda}(\bm{\theta},\bm{\alpha})$ . We are now ready to describe our proposed algorithm.

Our proposed method is outlined in Algorithm 2. This algorithm has two steps: step 2 and step 3. In step 2, $K$ steps of accelerated gradient ascent method is run over the variable $\bm{\alpha}$ to find an approximate maximizer of the problem $\max_{\bm{\alpha}}\;f_{\lambda}(\bm{\theta}_{t},\bm{\alpha})$ . Then using approximate maximizer $\bm{\alpha}_{t+1}$ , we update $\bm{\theta}$ variable using one step of first order methods in step 3.

In step 2, we run $K$ step of accelerated gradient ascent algorithm over the variable $\alpha$ with restart every $N$ iterations. The details of this subroutine can be found in subsection B.1 of the supplementary materials. In step 3 of Algorithm 2, we can either use projected gradient descent update rule

or Frank-Wolfe update rule described in subsection B.2 in the supplementary material. We show convergence of the algorithm to $\varepsilon$ –FNE in Theorems 4.2.

Given a scalar $\varepsilon\in(0,1)$ . Assume that Step 7 in Algorithm 2 sets either runs projected gradient descent or Frank-Wolfe iteration. Under Assumptions 4.1 and 2.5,

then there exists $t\in\{0,\ldots,T\}$ such that $(\bm{\theta}_{t},\bm{\alpha}_{t+1})$ is an $\varepsilon$ –FNE of problem (2).

The proof is relegated to Appendix B.4. ∎

Under Assumptions 2.5 and 4.1, Algorithm 2 finds an $\varepsilon$ -first-order stationary solution of the game (2) with ${\cal O}(\varepsilon^{-3})$ gradient evaluations of the objective with respect $\bm{\theta}$ and ${\cal O}(\varepsilon^{-0.5}\log(\varepsilon^{-1}))$ gradient evaluations with respect to $\bm{\alpha}$ . If the two oracles have the same complexity, the overall complexity of the method would be ${\cal O}(\varepsilon^{-3.5}\log(\varepsilon^{-1}))$ .

Numerical Results

We evaluate the numerical performance of Algorithm 2 in the following two applications:

We conduct two experiment on the Fashion MNIST dataset . This dataset consists of $28\times 28$ arrays of grayscale pixel images classified into 10 categories of clothing. It includes $60,000$ training images and $10,000$ testing images.

Experimental Setup: The recent work in observed that training a logisitic regression model to classify the images of the Fashion MNIST dataset can be biased against certain categories. To remove this bias, proposed to minimize the maximum loss incurred by the different categories. We repeat the experiment when using a more complex non-convex Convolutional Neural Network (CNN) model for classification. Similar to , we limit our experiment to the three categories T-shirt/top, Coat, and Shirts, that correspond to the lowest three testing accuracies achieved by the trained classifier. To minimize the maximum loss over these three categories, we train the classifier to minimize

where $\mathbf{W}$ represents the parameters of the CNN; and ${\cal L}_{1}$ , ${\cal L}_{2}$ , and ${\cal L}_{3}$ correspond to the loss incurred by samples in T-shirt/top, Coat, and Shirt categories. Problem (12) can be re-written as

Clearly the inner maximization problem is concave; and thus our theory can be applied. To empirically evaluate the regularization scheme proposed in Section 4, we implement two versions of Algorithm 2. The first version solves at each iteration the regularized strongly concave sub-problem

and use the optimum $t$ to perform a gradient descent step on $\mathbf{W}$ (notice that fixing the value of $\mathbf{W}$ , the optimum $t$ can be computed using KKT conditions and a simple sorting or bisection procedure). The second version of Algorithm 2 solves at each iteration the concave inner maximization problem without the regularization term. Then uses the computed solution to perform a descent step on $\mathbf{W}$ . Notice that in both cases, the optimization with respect to $t$ variable can be done in (almost) closed-form update. Although regularization is required to have theoretical convergence guarantees, we compare the two versions of the algorithm on empirical data to determine whether we lose by adding such regularization. We further compare these two algorithms with normal training that uses gradient descent to minimize the average loss among the three categories. We run all algorithms for $5500$ epochs and record the test accuracy of the categories. To reduce the effect of random initialization, we run our methods with $50$ different random initializations and record the average and standard deviation of the test accuracy collected. For fair comparison, the same initialization is used for all methods in each run. The results are summarized in Tables 1. To test our framework in stochastic settings, we repeat the experiment running all algorithms for $12,000$ iterations with Adam and SGD optimizer with a bath size of $600$ images ( $200$ from each category). The results of the second experiment with Adam optimizer are summarized in Table 2. The model architecture and parameters are detailed in Appendix F. The choice of Adam optimizer is mainly because it is more robust to the choice of the step-size and thus can be easily tuned. In fact, the use of SGD or Adam does not change the overall takeaways of the experiments. The results of using SGD optimizer are relegated to Appendix C.

Results: Tables 1 and 2 show the average and standard deviation of the number of correctly classified samples. The average and standard deviation are taken over 50 runs. For each run 1000 testing samples are considered for each category. The results show that when using MinMax and MinMax with regularization, the accuracies across the different categories are more balanced compared to normal training. Moreover, the tables show that Algorithm 2 with regularization provides a slightly better worst-case performance compared to the unregularized approach. Note that the empirical advantages due to regularization appears more in the stochastic setting. To see this compare the differences between MinMax and MinMax with Regularization in Tables 1 and 2. Figure 1 depicts a sample trajectory of deterministic algorithm applied to the regularized and regularized formulations. This figures shows that regularization provides a smoother and slightly faster convergence compared to the unregularized approach. In addition, we apply our algorithm to the exact similar logistic regression setup as in . Results of this experiment can be found in Appendix D.

2 Robust Neural Network Training

Experimental Setup: Neural networks have been widely used in various applications, especially in the field of image recognition. However, these neural networks are vulnerable to adversarial attacks, such as Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attack . These adversarial attacks show that a small perturbation in the data input can significantly change the output of a neural network. To train a robust neural network against adversarial attacks, researchers reformulate the training procedure into a robust min-max optimization formulation , such as

Here $\mathbf{w}$ is the parameter of the neural network, the pair $(x_{i},y_{i})$ denotes the $i$ -th data point, and $\delta_{i}$ is the perturbation added to data point $i$ . As discussed in this paper, solving such a non-convex non-concave min-max optimization problem is computationally challenging. Motivated by the theory developed in this work, we approximate the above optimization problem with a novel objective function which is concave in the parameters of the (inner) maximization player. To do so, we first approximate the inner maximization problem with a finite max problem

where each $\hat{x}_{ij}(\mathbf{w})$ is the result of a targeted attack on sample $x_{i}$ aiming at changing the output of the network to label $j$ . These perturbed inputs, which are explained in details in Appendix E, are the function of the weights of the network. Then we replace this finite max inner problem with a concave problem over a probability simplex. Such a concave inner problem allows us to use the multi-step gradient descent-ascent method. The structure of the network and the details of the formulation is detailed in Appendix E.

Results: We compare our results with . Note is the state-of-the-art algorithm and has won the first place, out of $\approx 2000$ submissions, in the NeurIPS 2018 Adversarial Vision Challenge. The accuracy of our formulation against popular attacks, FGSM and PGD , are summarized in Table 3. This table shows that our formulation leads to a comparable results against state-of-the-art algorithms (while in some cases it also outperform those methods by as much as $\approx 15\%$ accuracy).

Links to code and pre-trained models of above two simulations are available at Appendix G.

References

Appendix A Proofs for results in Section 3

Before proceeding to the proofs of the main results, we need some intermediate lemmas and preliminary definitions.

A function $h({\mathbf{x}})$ is said to satisfy the Quadratic Growth (QG) condition with constant $\gamma>0$ if

where $h^{*}$ is the minimum value of the function, and ${\rm dist}({\mathbf{x}})$ is the distance of the point $x$ to the optimal solution set.

The following lemma shows that PL implies QG .

If function $f$ is PL with constant $\mu$ , then $f$ satisfies the quadratic growth condition with constant $\gamma=4\mu$ .

The next Lemma shows the stability of $\arg\max_{\bm{\alpha}}f(\bm{\theta},\bm{\alpha})$ with respect to $\bm{\theta}$ under PL condition.

Assume that $\{h_{\bm{\theta}}(\bm{\alpha})=-f(\bm{\theta},\bm{\alpha})~{}|~{}\bm{\theta}\}$ is a class of $\mu$ -PL functions in $\bm{\alpha}$ . Define $A(\bm{\theta})=\arg\max_{\bm{\alpha}}f(\bm{\theta},\bm{\alpha})$ and assume $A(\bm{\theta})$ is closed. Then for any $\bm{\theta}_{1}$ , $\bm{\theta}_{2}$ and $\bm{\alpha}_{1}\in A(\bm{\theta}_{1})$ , there exists an $\bm{\alpha}_{2}\in A(\bm{\theta}_{2})$ such that

Based on the Lipchitzness of the gradients, we have that $\|\nabla_{\bm{\alpha}}f(\bm{\theta}_{2},\bm{\alpha}_{1})\|\leq L_{12}\|\bm{\theta}_{1}-\bm{\theta}_{2}\|$ . Then using the PL condition, we know that

Now we use the result of Lemma A.2 to show that there exists $\bm{\alpha}_{2}=\arg\min_{\bm{\alpha}\in A(\bm{\theta}_{2})}\|\bm{\alpha}-\bm{\alpha}_{1}\|^{2}~{}\in A(\bm{\theta}_{2})$ such that

re-arranging the terms, we get the desired result that

Finally, the following lemma would be useful in the proof of Theorem 3.4.

Assume $h({\mathbf{x}})$ is $\mu$ -PL and $L$ -smooth. Then, by applying gradient descent with step-size $1/L$ from point $x_{0}$ for $K$ iterations we get an $x_{K}$ such that

We are now ready to prove the results in Section 3.

Under Assumption 2.5 and PL-game assumption,

Moreover, $g$ is $L$ -Lipschitz smooth with $L=L_{11}+\dfrac{L_{12}^{2}}{2\mu}$ .

Let $\bm{\alpha}^{*}\in\operatorname*{arg\,max}_{\bm{\alpha}\in{\cal A}}f(\bm{\theta},\bm{\alpha})$ . By Lemma A.3, for any scalar $\tau$ and direction $d$ , there exists $\bm{\alpha}^{*}(\tau)\in\operatorname*{arg\,max}_{\bm{\alpha}}\;f(\bm{\theta}+\tau d,\bm{\alpha})$ such that

To find the directional derivative of $g(\cdot)$ , we compute

where the second equality holds by writing the Taylor series expansion of $f(\cdot)$ . Thus, by definition of the directional derivative of $g(\cdot)$ , we obtain

Note that this relationship holds for any $d$ . Thus, $\nabla g(\bm{\theta})=\nabla_{\bm{\theta}}f(\bm{\theta},\bm{\alpha}^{*})$ for any $\bm{\alpha}^{*}\in\operatorname*{arg\,max}_{\bm{\alpha}\in{\cal A}}f(\bm{\theta},\bm{\alpha})=A(\bm{\theta})$ . Interestingly, the directional derivative does not depend on the choice of $\bm{\alpha}^{*}$ . This means that $\nabla_{\bm{\theta}}f(\bm{\theta},\bm{\alpha}_{1})=\nabla_{\bm{\theta}}f(\bm{\theta},\bm{\alpha}_{2})$ for any $\bm{\alpha}_{1}$ and $\bm{\alpha}_{2}$ in $\operatorname*{arg\,max}_{\bm{\alpha}\in{\cal A}}f(\bm{\theta},\bm{\alpha})$ .

We finally show that function $g$ is Lipschitz smooth. Let $\bm{\alpha}_{1}^{*}\in A(\bm{\theta}_{1})$ and $\bm{\alpha}_{2}^{*}=\arg\min_{\bm{\alpha}\in A(\bm{\theta}_{2})}\|\bm{\alpha}-\bm{\alpha}^{*}_{1}\|^{2}~{}\in A(\bm{\theta}_{2})$ , then

where the last inequality holds by Lemma A.3.

A.2 Proof of Theorem 3.4

Using Lemma A.5 and Assumption 3.3, we can define

The next result shows that the inner loop in Algorithm 1 computes an approximate gradient of $g(\cdot)$ . In other words, $\nabla_{\bm{\theta}}f(\bm{\theta}_{t},\bm{\alpha}_{t+1})\approx\nabla g(\bm{\theta}_{t})$ .

Define $\kappa=\frac{L_{22}}{\mu}\geq 1$ and $\rho=1-\frac{1}{\kappa}<1$ and assume $g(\bm{\theta}_{t})-f(\bm{\theta}_{t},\bm{\alpha}_{0}(\bm{\theta}_{t}))<\Delta$ , then for any prescribed $\varepsilon\in(0,1)$ if we choose $K$ large enough such that

where $\bar{L}=\max\{L_{12},L_{22},L,g_{max},1\}$ and $\bar{R}=\max\{R,1\}$ , then the error $e_{t}\triangleq\nabla_{\bm{\theta}}f(\bm{\theta}_{t},\bm{\alpha}_{K}(\bm{\theta}_{t}))-\nabla g(\bm{\theta}_{t})$ has a norm

Thus, using the QG result of Lemma A.2, we know that there exists an $\bm{\alpha}^{*}\in A(\bm{\theta}_{t})$ such that

where the last inequality holds by our choice of $K$ which yields

Here the second inequality holds since $\bar{R}\geq 1$ , and the third inequality holds since $g_{max}\leq\bar{L}$ .

To prove the argument of the Lemma, note that

where the last inequality holds by our choice of $K$ which yields

Here the second inequality holds since $\varepsilon<1$ , $\bar{L},\bar{R}\geq 1$ , and $L\leq\bar{L}$ .

The above lemma implies that Algorithm 1 behaves similar to the simple vanilla gradient descent method applied to problem (8).

Notice that the assumption $g(\bm{\theta}_{t})-f(\bm{\theta}_{t},\bm{\alpha}_{0}(\bm{\theta}_{t}))\leq\Delta,\;\forall t$ could be justified by Lemma A.3. More specifically, by Lemma A.3,

where $\bm{\alpha}_{t+1}\triangleq\operatorname*{arg\,max}_{\bm{\alpha}}\,f(\bm{\theta}_{t+1},\bm{\alpha})$ and $\bm{\alpha}_{t}\triangleq\operatorname*{arg\,max}_{\bm{\alpha}}\,f(\bm{\theta}_{t},\bm{\alpha})$ . Hence, the difference between consecutive optimal solutions computed by the inner loop of the algorithm, are upper bounded by the difference between corresponding $\bm{\theta}$ ’s. Since $\Theta$ is a compact set, we can find an upper bound $\Delta$ such that $g(\bm{\theta}_{t})-f(\bm{\theta}_{t},\alpha_{0}(\bm{\theta}_{t}))\leq\Delta$ , for all $t$ . We are now ready to show Theorem 3.4

where $g^{*}\triangleq\min_{\bm{\theta}}\;g(\bm{\theta})$ is the optimal value of $g$ . Note that by the compactness assumption of the set $\Theta$ , we have $\Delta_{g}=g(\theta_{0})-g^{*}<\infty$ .

Based on the projection property, we know that

Therefore, by setting $\bm{\theta}=\bm{\theta}_{t}$ , we get

where $\bm{\alpha}^{*}(\bm{\theta}_{t})\in\operatorname*{arg\,max}_{\bm{\alpha}\in\mathcal{A}}\,f(\bm{\theta}_{t},\bm{\alpha})$ and $e_{t}\triangleq\nabla_{\bm{\theta}}f\big{(}\bm{\theta}_{t},\bm{\alpha}_{t+1}\big{)}-\nabla_{\bm{\theta}}f\big{(}\bm{\theta}_{t},\bm{\alpha}^{*}(\bm{\theta}_{t})\big{)}$ . By Taylor expansion, we have

where the last inequality holds by (27). Moreover, by the projection property, we know that

Here the second inequality holds by Cauchy-Schwartz, the definition of $e_{t}$ and our assumption that $\Theta\subseteq{\cal B}_{R}$ . Moreover, the last inequality holds by our choice of $K$ in Lemma A.6 which yields

where the inequality holds by using Cauchy Schwartz and our assumption that $\Theta$ is in a ball of radius $R$ . Hence,

where the last inequality holds by using Lemma A.6 and choosing $K$ and $T$ :

Therefore, using Lemma A.6, there exists at least one index $\widehat{t}$ for which

Appendix B Algorithmic details and proofs for the results in Section 4

B.2 Frank–Wolfe update rule for Step 3 in Algorithm 2

In Step 3 of Algorithm 2, instead of projected gradient descent discussed in the main body, we can also run one step of Frank–Wolfe method. More precisely, we can set

is the first order descent direction. In the unconstrained case, the descent direction is $\widehat{s}_{t}=-\nabla_{\bm{\theta}}f_{\lambda}(\bm{\theta}_{t},\bm{\alpha}_{t+1})$ , which becomes the same as the gradient descent step.

Under Assumption 2.5 and Assumption 4.1, the function $g_{\lambda}$ is $L$ -Lipschitz smooth with $L=L_{11}+\dfrac{L_{12}^{2}}{\lambda}$ .

First notice that the differentiability of the function $g_{\lambda}(\cdot)$ follows directly from Danskin’s Theorem . It remains to show that $g_{\lambda}$ is a Lipschitz smooth function. Let

Then by strong convexity of $-f_{\lambda}(\bm{\theta},\cdot)$ , we have

Moreover, due to optimality of $\bm{\alpha}_{1}^{*}$ , we have

where the last inequality holds by Cauchy-Schwartz and the Lipschtizness assumption. We finally show that $g_{\lambda}$ is Lipschitz smooth.

where the last inequality holds by (39). ∎

Algorithm 2 solves the inner maximization problem using accelerated projected gradient descent (outlined in Algorithm 3). The next lemma is known for accelerated projected gradient descent when applied to strongly convex functions.

Assume $h({\mathbf{x}})$ is $\lambda$ -strongly convex and $L$ -smooth. Then, applying accelerated projected gradient descent algorithm with step-size $1/L$ and restart parameter $N\triangleq\sqrt{8L/\lambda}-1$ for $K$ iterations, we get ${\mathbf{x}}_{K}$ such that

where ${\mathbf{x}}^{*}\triangleq\operatorname*{arg\,min}_{{\mathbf{x}}\in{\cal F}}h({\mathbf{x}})$ .

where the second inequality holds by strong convexity of $h$ and the optimality condition of $x^{*}$ , and the last inequality holds by our choice of $N$ . This yields,

B.4 Proof of Theorem 4.2

We first show that the inner loop in Algorithm 2 computes an approximate gradient of $g_{\lambda}(\cdot)$ . In other words, $\nabla_{\bm{\theta}}f_{\lambda}(\bm{\theta}_{t},\bm{\alpha}_{t+1})\approx\nabla g_{\lambda}(\bm{\theta}_{t})$ .

Define $\kappa=\dfrac{L_{22}}{\lambda}\geq 1$ and assume $g_{\lambda}(\bm{\theta}_{t})-f_{\lambda}(\bm{\theta}_{t},\bm{\alpha}_{0}(\bm{\theta}_{t}))<\Delta$ , then for any prescribed $\varepsilon\in(0,1)$ if we choose $K$ large enough such that

where $\bar{L}\triangleq\max\{L_{12},L_{22},L,g_{max},1\}$ and $\bar{R}=\max\{R,1\}$ , then the error $e_{t}\triangleq\nabla_{\bm{\theta}}f_{\lambda}(\bm{\theta}_{t},\bm{\alpha}_{K}(\bm{\theta}_{t}))-\nabla g_{\lambda}(\bm{\theta})$ has a norm

Let $\bm{\alpha}^{*}(\bm{\theta}_{t})\triangleq\operatorname*{arg\,max}_{\bm{\alpha}\in\mathcal{A}}\;f_{\lambda}(\bm{\theta}_{t},\bm{\alpha})$ . Then by strong convexity of $-f(\bm{\theta}_{t},\cdot)$ , we get

Combined with the Lipschitz smoothness property of the objective, we obtain

where the second inequality uses (47), and the third inequality uses the choice of $K$ in (43) which yields yields

Here the second inequality holds since $\bar{R}\geq 1$ , and the third inequality holds since $g_{max}\leq\bar{L}$ . To prove the second argument of the lemma, we also use the Lipschitz smoothness property of the objective to get

where the second inequality holds by our Lipschitzness assumption and the last inequality holds by our assumption that $L_{22}/\lambda\geq 1$ . Moreover,

where the last equality holds since $\bm{\alpha}^{*}(\bm{\theta}_{t})$ is optimal and $\|\bm{\alpha}^{*}(\bm{\theta}_{t})-\bm{\alpha}_{K}(\bm{\theta}_{t})\|\leq 1.$ Combining (49) and (50), we get

where the second inequality uses (47), and the last inequality holds by our choice of $K$ in (43) and since $\varepsilon\in(0,1)$ . ∎

The above lemma implies that $\|\nabla_{\bm{\theta}}f_{\lambda}(\bm{\theta}_{t},\bm{\alpha}_{K}(\bm{\theta}_{t}))-\nabla g_{\lambda}(\bm{\theta}_{t})\|\leq\delta\triangleq\dfrac{L\varepsilon^{2}}{64\bar{R}^{3}\bar{L}^{2}}$ . We now show that our assumption $g(\bm{\theta}_{t})-f(\bm{\theta}_{t},\bm{\alpha}_{0}(\bm{\theta}_{t}))\leq\Delta$ for all t in the above Lemma holds. Let

Then by strong convexity of $-f_{\lambda}(\bm{\theta},\cdot)$ , we have

Moreover, due to optimality of $\bm{\alpha}_{t}^{*}$ , we have

Hence, the difference between consecutive optimal solutions computed by the inner loop of the algorithm, are upper bounded by the difference between corresponding $\bm{\theta}$ ’s. Since $\Theta$ is a compact set, we can find an upper bound $\Delta$ such that $g(\bm{\theta}_{t})-f(\bm{\theta}_{t},\alpha_{0}(\bm{\theta}_{t}))\leq\Delta$ , for all $t$ .

We are now ready to show the main theorem that implies convergence of our proposed algorithm to an $\varepsilon$ –first-order stationary solution of problem (2). In particular, we show that using $\nabla_{\bm{\theta}}f_{\lambda}(\bm{\theta}_{t},\bm{\alpha}_{K}(\bm{\theta}_{t}))$ instead of $\nabla g_{\lambda}(\bm{\theta}_{t})$ for a small enough $\lambda$ in the Frank-Wolfe or projected descent algorithms applied to $g_{\lambda}$ , finds an $\varepsilon$ –FNE. We are now ready to show Theorem 4.2.

Frank-Wolfe Steps: We now show the result when Step 7 of Algorithm 2 sets

Using descent lemma on $g_{\lambda}$ and the definition of $\widetilde{L}$ in Algorithm 2, we have

where $\widehat{s}_{t}$ and ${\cal X}_{t}$ are defined in equations (35) and (36) of the manuscript, and the second and last inequalities use the fact that $\|\widehat{s}_{t}\|\leq 1$ .

Summing up these inequalities for all values of $t$ leads to

Here the first inequality in (57) holds by (11), Cauchy-Schwartz, and the fact that $\|\widehat{s}_{t}\|\leq 1$ . The last inequality holds by our choice of $K$ in Lemma B.3

which yields $\|e_{t}\|\leq 1\leq g_{max}$ and by choosing $T$ such that

Therefore, using Lemma B.3, there exists at least one index $\widehat{t}$ for which

where the first inequality uses Cauchy Shwartz and the fact that $\|s\|\leq 1$ , and the last inequality holds due to (58), the choice of $\lambda$ in the theorem and our assumption that $\|\bm{\alpha}_{K}(\bm{\theta}_{\widehat{t}})-\bar{\bm{\alpha}}\|\leq 2R$ .

Projected Gradient Descent: We start by defining

where $g_{\lambda}^{*}\triangleq\min_{\bm{\theta}}\;g_{\lambda}(\bm{\theta})$ is the optimal value of $g_{\lambda}$ . Note that by the compactness assumption of the set $\Theta$ , we have $\Delta_{g}=g_{\lambda}(\theta_{0})-g_{\lambda}^{*}<\infty$ .

We now show the result when Step 7 of Algorithm 2 sets

Based on the projection property, we know that

Therefore, by setting $\bm{\theta}=\bm{\theta}_{t}$ , we get

where $\bm{\alpha}^{*}(\bm{\theta}_{t})\triangleq\operatorname*{arg\,max}_{\bm{\alpha}\in\mathcal{A}}\,f_{\lambda}(\bm{\theta}_{t},\bm{\alpha})$ and $e_{t}\triangleq\nabla_{\bm{\theta}}f\big{(}\bm{\theta}_{t},\bm{\alpha}_{t+1}\big{)}-\nabla_{\bm{\theta}}f\big{(}\bm{\theta}_{t},\bm{\alpha}^{*}(\bm{\theta}_{t})\big{)}$ .

Moreover, by the projection property, we know that

where the inequality holds by using Cauchy Schwartz and our assumption that $\Theta$ is in a ball of radius $R$ . Hence,

where the last inequality holds by using Lemma B.3 and choosing $K$ and $T$ :

Therefore, using Lemma B.3, there exists at least one index $\widehat{t}$ for which

where the first inequality uses Cauchy Shwartz and the fact that $\|s\|\leq 1$ , and the last inequality holds due to (67), the choice of $\lambda$ in the theorem and our assumption that $\|\bm{\alpha}_{K}(\bm{\theta}_{\widehat{t}})-\bar{\bm{\alpha}}\|\leq 2R$ .

Appendix C Numerical Results on Fashion MNIST with SGD

The results of using SGD optimizer are summarized in Table 4 and Table 5. Note SGD optimizer requires more tuning and therefore the results when $\text{batch-size}=3000$ is also included here.

Appendix D Numerical Results on Fashion MNIST with Logistic Rgression Model

Table 6 shows that the proposed formulation gives better accuracies under the worst category (Shirts), and the accuracies over three categories are more balanced. Note that this model is trained by gradient descent. The standard derivations not equal to is due to the early termination of the simulation.

Appendix E Numerical Results on Robust Neural Network Training

Neural networks have been widely used in various applications, especially in the field of image recognition. However, these neural networks are vulnerable to adversarial attacks, such as Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attack . These adversarial attacks show that a small perturbation in the data input can significantly change the output of a neural network. To train a robust neural network against adversarial attacks, researchers reformulate the training procedure into a robust min-max optimization formulation , such as

Here $\mathbf{w}$ is the parameter of the neural network, the pair $(x_{i},y_{i})$ denotes the $i$ -th data point, and $\delta_{i}$ is the perturbation added to data point $i$ . As discussed in this paper, solving such a non-convex non-concave min-max optimization problem is computationally challenging. Motivated by the theory developed in this work, we approximate the above optimization problem with a novel min-max objective function which has concave inner optimization problem. To do so, we first approximate the inner maximization problem with a finite max problem

where each $\hat{x}_{ij}(\mathbf{w})$ is the result of a targeted attack on sample $x_{i}$ aiming at changing the output of the network to label $j$ . More specifically, $\hat{x}_{ij}(\mathbf{w})$ is obtained through the following procedure:

In the one but last layer of the neural network architecture for learning classification on MNIST we have 10 different neurons, each corresponding with one category of classification. For any sample $(x_{i},y_{i})$ in the dataset and any $j=0,\cdots,9$ , starting from $x_{ij}^{0}=x_{i}$ , we run gradient ascent to obtain the following chain of points:

where $Z_{j}$ is the network logit before softmax corresponding to label $j$ ; $\alpha>0$ is the step-size; and $\text{Proj}_{B(x,\varepsilon)}[\cdot]$ is the projection to the infinity ball with radius $\varepsilon$ centered at $x$ . Finally, we set $\hat{x}_{ij}(\mathbf{w})=x_{ij}^{K}$ in (69).

Clearly, we can replace the finite max problem (69) with a concave problem over a probability simplex, i.e.,

which is non-convex in $w$ , but concave in $\mathbf{t}$ . Hence we can apply Algorithm 2 to solve this opimization problem. We test (70) on MNIST dataset with a Convolutional Neural Network(CNN) with the architecture detailed in Table 7. The result of our experiment is presented in Table 8.

We would like to note that there is a mismatch between our theory and this numerical experiment. In particular, we assume smoothness of the objective function in our theory. However, in this experiment, the ReLu activation functions and the projection operator make the objective function non-smooth. We also did not include regularizer (strongly concave term) while solving (70) as the optimal regularizer was very small (and almost zero).

The main take away from this experiment is to demonstrate the practicality of the following idea: when solving general challenging non-convex min-max problems, it might be possible to approximate it with one-sided non-convex min-max problems where the objective function is solvable with respect to one of the player’s variable. Such a reformulation leads to computationally tractable problems and (possibly) no loss in the performance.

Appendix F Experimental Setup of Fair Classifier

Appendix G Links

Robust NN Training: https://github.com/optimization-for-data-driven-science/Robust-NN-Training Fair Classifier: https://github.com/optimization-for-data-driven-science/FairFashionMNIST