Universal Stagewise Learning for Non-Convex Problems with Convergence on Averaged Solutions

Zaiyi Chen, Zhuoning Yuan, Jinfeng Yi, Bowen Zhou, Enhong Chen, Tianbao Yang

Introduction

Non-convex optimization has recently received increasing attention due to its popularity in emerging machine learning tasks, particularly for learning deep neural networks. One of the keys to the success of deep learning for big data problems is the employment of simple stochastic algorithms such as Sgd or AdaGrad . Analysis of these stochastic algorithms for non-convex optimization is an important and interesting research topic, which already attracts much attention from the community of theoreticians . However, one issue that has been largely ignored in existing theoretical results is that the employed algorithms in practice usually differ from their plain versions that are well understood in theory. Below, we will mention several important heuristics used in practice that have not been well understood for non-convex optimization, which motivates this work.

First, a heuristic for setting the step size in training deep neural networks is to change it in a stagewise manner from a large value to a small value (i.e., a constant step size is used in a stage for a number of iterations and is decreased for the next stage) , which lacks theoretical analysis to date. In existing literature , Sgd with an iteratively decreasing step size or a small constant step size has been well analyzed for non-convex optimization problems with guaranteed convergence to a stationary point. For example, the existing theory usually suggests an iteratively decreasing step size proportional to $1/\sqrt{t}$ at the $t$ -th iteration or a small constant step size, e.g., proportional to $\epsilon^{2}$ with $\epsilon\ll 1$ for finding an $\epsilon$ -stationary solution whose gradient’s magnitude (in expectation) is small than $\epsilon$ .

Second, the averaging heuristic is usually used in practice, i.e., an averaged solution is returned for prediction , which could yield improved stability and generalization . However, existing theory for many stochastic non-convex optimization algorithms only provides guarantee on a uniformly sampled solution or a non-uniformly sampled solution with decreasing probabilities for latest solutions . In particular, if an iteratively decreasing step size proportional to $1/\sqrt{t}$ at the $t$ -th iteration is employed, the convergence guarantee was provided for a random solution that is non-uniformly selected from all iterates with a sampling probability proportional to $1/\sqrt{t}$ for the $t$ -th iterate. This means that the latest solution always has the smallest probability to be selected as the final solution, which contradicts to the common wisdom. If a small constant step size is used, then usually a uniformly sampled solution is returned with convergence guarantee. However, both options are seldomly used in practice.

A third common heuristic in practice is to use adaptive coordinate-wise step size of AdaGrad . Although adaptive step size has been well analyzed for convex problems (i.e., when it can yield faster convergence than Sgd) , it still remains an mystery for non-convex optimization with missing insights from theory. Several recent studies have attempted to analyze AdaGrad for non-convex problems . Nonetheless, none of them are able to exhibit the adaptive convergence of AdaGrad to data as in the convex case and its advantage over Sgd for non-convex problems.

To overcome the shortcomings of existing theories for stochastic non-convex optimization, this paper analyzes new algorithms that employ some or all of these commonly used heuristics in a systematic framework, aiming to fill the gap between theory and practice. The main results and contributions are summarized below:

We propose a universal stagewise optimization framework for solving a family of non-convex problems, i.e., weakly convex problems, which is broader than smooth non-convex problems and includes some non-smooth non-convex problems. At each stage, any suitable stochastic convex optimization algorithms (e.g., Sgd, AdaGrad) with a constant step size parameter can be employed for optimizing a regularized convex problem with a number of iterations, which usually return an averaged solution. The step size parameter is decreased in a stagewise manner following a polynomial decaying scheme.

We analyze several variants of the proposed framework by employing different basic algorithms, including Sgd, stochastic heavy-ball (Shb) method, stochastic Nesterov’s accelerated gradient (Snag) method, stochastic alternating direction methods of multipliers (ADMM), and AdaGrad. We prove the convergence of their stagewise versions for an averaged solution that is randomly selected from all stagewise averaged solutions.

To justify a heuristic approach that returns the last averaged solution in stagewise learning, we present and analyze a non-uniform sampling strategy over stagewise averaged solutions with sampling probabilities increasing as the stage number.

Regarding the convergence results, for stagewise Sgd, Shb, Snag, we establish the same order of iteration complexity for finding a nearly stationary point as the existing theories of their non-stagewise variants. For stagewise AdaGrad, we establish an adaptive convergence for finding a nearly stationary point, which is provably better than (stagewise) Sgd, Shb, and Snag when the cumulative growth of stochastic gradient is slow.

Besides theoretical contributions, we also empirically verify the effectiveness of the proposed stagewise algorithms. In particular, our empirical studies show that (i) the stagewise AdaGrad dramatically improves the generalization performance of existing variants of AdaGrad, (ii) stagewise Sgd, Shb, Snag also outperform their plain variants with an iteratively decreasing step size; (iii) the proposed stagewise algorithms achieve similar if not better generalization performance than their heuristic variants implemented in existing libraries on standard benchmark datasets.

Related Work

We review some theoretical results for stochastic non-convex optimization in this section.

Although adaptive variants of Sgd, e.g., AdaGrad , Adam , were widely used for training deep neural networks, there are few studies on theoretical analysis of these algorithms for non-convex problems. Several recent studies attempted to analyze AdaGrad for non-convex problems . Ward et al. only analyzed a variant of AdaGrad that uses a global adaptive step size instead of coordinate-wise adaptive step size as in the original AdaGrad used in practice. Li and Orabona gave two results about the convergence of variants of AdaGrad. One is given in terms of asymptotic convergence for coordinate-wise adaptive step size, and another one is given in terms of non-asymptotic convergence for global adaptive step size. When we prepare this manuscript, we note that two recent studies appeared online, which also analyzed the convergence of AdaGrad with coordinate-wise adaptive step size and its momentum variants. Although all of these studies established an iteration complexity of $O(1/\epsilon^{4})$ for different variants of AdaGrad for finding an $\epsilon$ -stationary solution of a stochastic non-convex optimization problem, none of them can exhibit the potential adaptive advantage of AdaGrad over Sgd as in the convex case. To the best of our knowledge, our result is the first one that explicitly shows that coordinate-wise adaptive step size could yield faster convergence than using non-adaptive step size for non-convex problems similar to that in the convex case. Besides that, these studies also suffer from the following shortcomings: (i) they all assume smoothness of the problem, while we consider non-smooth and non-convex problems; (ii) their convergence is provided on a solution with minimum magnitude of gradient that is expensive to compute, though their results also imply a convergence on a random solution selected from all iterates with decreasing sampling probabilities. In contrast, these shortcomings do not exist in this paper.

Statewise step size has been employed in stochastic algorithms and analyzed for convex optimization problems . Hazan and Kale proposed an epoch-GD method for stochastic strongly convex problems, in which a stagewise step size is used that decreases geometrically and the number of iteration for each stage increases geometrically. Xu et al. proposed an accelerated stochastic subgradient method for optimizing convex objectives satisfying a local error bound condition, which also employs a stagewise scheme with a constant number of iterations per-stage and geometrically decreasing stagewise step size. The difference from the present work is that they focus on convex problems.

The proposed stagewise algorithm is similar to several existing algorithms in design , which are originated from the proximal point algorithm . I.e., at each stage a proximal strongly convex subproblem is formed and then a stochastic algorithm is employed for optimizing the proximal subproblem inexactly with a number of iterations. Xu et al. used this idea for solving problems that satisfy a local error bound condition, aiming to achieve faster convergence than vanilla Sgd. Davis and Grimmer followed this idea to solve weakly convex problems. At each stage, Sgd with decreasing step sizes for a strongly convex problem is employed for solving the proximal subproblem in these two papers. Our stagewise algorithm is developed following the similar idea. The key differences from are that (i) we focus on weakly convex problems instead of convex problems considered in ; (ii) we use non-uniform sampling probabilities that increase as the stage number to select an averaged solution as the final solution, unlike the uniform sampling used in ; (iii) we present a unified algorithmic framework and convergence analysis, which enable one to employ any suitable stochastic convex optimization algorithms at each stage. It gives us several interesting variants including stagewise stochastic momentum methods, stagewise AdaGrad, and stagewise stochastic ADMM. For stagewise AdaGrad that employs AdaGrad as the basic algorithm for solving the proximal subproblem, we derive an adaptive convergence that is faster than Sgd when the cumulative growth of stochastic gradients is slow.

Finally, we refer readers to several recent papers for other algorithms for weakly convex problems . For example, Drusvyatskiy and Paquette studied a subclass of weakly convex problems whose objective consists of a composition of a convex function and a smooth map, and proposed a prox-linear method that could enjoy a lower iteration complexity than $O(1/\epsilon^{4})$ by smoothing the objective of each subproblem. Davis and Drusvyatskiy studied a more general algorithm that successively minimizes a proximal regularized stochastic model of the objective function. When the objective function is smooth and has a finite form, variance-reduction based methods are also studied , which have provable faster convergence than Sgd in terms of $\epsilon$ . However, in all of these studies the convergence is provided on an impractical solution, which is either a solution that gives the minimum value of the (proximal) subgradient’s norm or on a uniformly sampled solution from all iterations .

Preliminaries

The problem of interest in this paper is:

where $\Omega$ is a closed convex set, $\xi\in\mathcal{U}$ is a random variable, $\phi(\mathbf{x})$ and $\phi(\mathbf{x};\xi)$ are non-convex functions, with the basic assumptions on the problem given in Assumption 1.

To state the convergence property of an algorithm for solving the above problem. We need to introduce some definitions. These definitions can be also found in related literature, e.g., . In the sequel, we let $\|\cdot\|$ denote an Euclidean norm, $[S]=\{1,\ldots,S\}$ denote a set, and $\delta_{\Omega}(\cdot)$ denote the indicator function of the set $\Omega$ .

(Fréchet subgradient) For a non-smooth and non-convex function $f(\cdot)$ ,

(First-order stationarity) For problem (1), a point $\mathbf{x}\in\Omega$ is a first-order stationary point if

where $\delta_{\Omega}$ denotes the indicator function of $\Omega$ . Moreover, a point $\mathbf{x}$ is said to be $\epsilon$ -stationary if

where dist denotes the Euclidean distance from a point to a set.

(Moreau Envelope and Proximal Mapping) For any function $f$ and $\lambda>0$ , the following function is called a Moreau envelope of $f$

Further, the optimal solution to the above problem denoted by

(Weakly convex) A function $f$ is $\rho$ -weakly convex, if $f(\mathbf{x})+\frac{\rho}{2}\|\mathbf{x}\|^{2}$ is convex.

It is known that if $f(\mathbf{x})$ is $\rho$ -weakly convex and $\lambda<\rho^{-1}$ , then its Moreau envelope $f_{\lambda}(\mathbf{x})$ is $C^{1}$ -smooth with the gradient given by (see e.g. )

This means that a point $\mathbf{x}$ satisfying $\|\nabla f_{\lambda}(\mathbf{x})\|\leq\epsilon$ is close to a point in distance of $O(\epsilon)$ that is $\epsilon$ -stationary.

It is notable that for a non-smooth non-convex function $f(\cdot)$ , there could exist a sequence of solutions $\{\mathbf{x}_{k}\}$ such that $\nabla f_{\lambda}(\mathbf{x}_{k})$ converges while $\text{dist}(0,\partial f(\mathbf{x}_{k}))$ may not converge . To handle such a challenging issue for non-smooth non-convex problems, we will follow existing works to prove the near stationarity in terms of $\nabla f_{\lambda}(\mathbf{x})$ . In the case when $f$ is smooth, $\|\nabla f_{\lambda}(\mathbf{x})\|$ is closely related to the magnitude of the projected gradient $\mathcal{G}_{\lambda}(\mathbf{x})$ defined below, which has been used as a criterion for constrained non-convex optimization ,

It was shown that when $f(\cdot)$ is smooth with $L$ -Lipschitz continuous gradient :

Thus, the near stationarity in terms of $\nabla f_{\lambda}(\mathbf{x})$ implies the near stationarity in terms of $\mathcal{G}_{\lambda}(\mathbf{x})$ for a smooth function $f(\cdot)$ .

Now, we are ready to state the basic assumptions of the considered problem (1).

Objective function $\phi$ is $\mu$ -weakly convex.

there exists $\Delta_{\phi}>0$ such that $\phi(\mathbf{x})-\min_{\mathbf{z}\in\Omega}\phi(\mathbf{z})\leq\Delta_{\phi}$ for any $\mathbf{x}\in\Omega$ .

Remark: Assumption 1-(A1), 1-(A2) assume a stochastic subgradient is available for the objective function and its Euclidean norm square is bounded in expectation, which are standard assumptions for non-smooth optimization. Assumption (A3) assumes weak convexity of the objective function, which is weaker than assuming smoothness. Assumption (A4) assumes that the objective value with respect to the optimal value is bounded. Below, we present some examples of objective functions in machine learning that are weakly convex.

If $\phi(\cdot)$ is a $L$ -smooth function (i.e., its gradient is $L$ -Lipschitz continuous), then it is $L$ -weakly convex.

Ex. 2: Additive Composition.

Ex. 3: Convex and Smooth Composition

Ex. 4: Smooth and Convex Composition

Ex. 5: Weakly Convex Sparsity-Promoting Regularizers

where $f(\mathbf{x};\xi)$ is a convex or a weakly-convex function, and $g(\mathbf{x})$ is a weakly-convex sparsity-promoting regularizer. Examples of weakly-convex sparsity-promoting regularizers include:

Smoothly Clipped Absolute Deviation (SCAD) penalty : $g(\mathbf{x})=\sum_{i=1}^{d}g_{\lambda}(x_{i})$ and

where $a>2$ is fixed and $\lambda>0$ . It can be shown that SCAD penalty is $(1/(a-1))$ -weakly convex .

Minimax Convex Penalty (MCP) : $g(\mathbf{x})=\sum_{i=1}^{d}g_{\lambda}(x_{i})$ and

where $b>0$ is fixed and $\lambda>0$ . MCP is $1/b$ -weakly convex .

Stagewise Optimization: Algorithms and Analysis

In this section, we will present the proposed algorithms and the analysis of their convergence. We will first present a Meta algorithmic framework highlighting the key features of the proposed algorithms and then present several variants of the Meta algorithm by employing different basic algorithms.

The Meta algorithmic framework is described in Algorithm 1. There are several key features that differentiate Algorithm 1 from existing stochastic algorithms that come with theoretical guarantee. First, the algorithm is run with multiple stages. At each stage, a stochastic algorithm (SA) is called to optimize a proximal problem $f_{s}(\mathbf{x})$ inexactly that consists of the original objective function and a quadratic term, which is guaranteed to be convex due to the weak convexity of $\phi$ and $\gamma<\mu^{-1}$ . The convexity of $f_{s}$ allows one to employ any suitable existing stochastic algorithms (cf. Theorem 1) that have convergence guarantee for convex problems. It is notable that SA usually returns an averaged solution $\mathbf{x}_{s}$ at each stage. Second, a decreasing sequence of step size parameters $\eta_{s}$ is used. At each stage, the SA uses a constant step size parameter $\eta_{s}$ and runs the updates for a number of $T_{s}$ iterations. We do not initialize $T_{s}$ as it might be adaptive to the data as in stagewise AdaGrad. Third, the final solution is selected from the stagewise averaged solutions $\{\mathbf{x}_{s}\}$ with non-uniform sampling probabilities proportional to a sequence of non-decreasing positive weights $\{w_{s}\}$ . In the sequel, we are particularly interested in $w_{s}=s^{\alpha}$ with $\alpha>0$ . The setup of $\eta_{s}$ and $T_{s}$ will depend on the specific choice of SA, which will be exhibited later for different variants.

To illustrate that Algorithm 1 is a universal framework such that any suitable SA algorithm can be employed, we present the following result by assuming that SA has an appropriate convergence for a convex problem.

Let $f(\cdot)$ be a convex function, $\mathbf{x}_{*}=\arg\min_{\mathbf{x}\in\Omega}f(\mathbf{x})$ and $\Theta$ denote some problem dependent parameters. Suppose for $\mathbf{x}_{+}=\textsc{SA}(f,\mathbf{x}_{0},\eta,T)$ , we have

Under assumption 1-(A1), (A3) and (A4), by running Algorithm 1 with $\gamma=1/(2\mu)$ , $w_{s}=s^{\alpha},\alpha>0$ , and with $\eta_{s},T_{s}$ satisfying $\varepsilon_{1}(\eta_{s},T_{s},\Theta)\leq 1/(48\gamma),\varepsilon_{2}(\eta_{s},T_{s},\Theta)\leq 1/2$ , we have

where $\tau$ is randomly selected from $\{0,\ldots,S\}$ with probabilities $p_{\tau}\propto w_{\tau+1},\tau=0,\ldots,S$ . If $\varepsilon_{3}(\eta_{s},T_{s},\Theta)\leq c_{3}/s$ for some constant $c_{3}\geq 0$ that may depend on $\Theta$ , we have

Remark: It is notable that the convergence guarantee is provided on a stagewise average solution $\mathbf{x}_{\tau}$ . To justify a heuristic approach that returns the final average solution for prediction, we analyze a new sampling strategy that samples a solution among all stagewise average solutions with sampling probabilities increasing as the stage number increases. This sampling strategy is better than uniform sampling strategy or a strategy with decreasing sampling probabilities in the existing literature. The convergence upper bound in (7) of SA covers the results of a broad family of stochastic convex optimization algorithms. When $\varepsilon_{2}(\eta_{s},T_{s},\Theta)=0$ (as in Sgd), the upper bound can be improved by a constant factor. Moreover, we do not optimize the value of $\gamma$ . Indeed, any $\gamma<1/\mu$ will work, which only has an effect on constant factor in the convergence upper bound.

Next, we present several variants of the Meta algorithm by employing Sgd, AdaGrad, and stochastic momentum methods as the basic SA algorithm, to which we refer as stagewise Sgd, stagewise AdaGrad, and stagewise stochastic momentum methods, respectively. It is worth mentioning that one can follow similar analysis to analyze other stagewise algorithms by using their basic convergence for stochastic convex optimization, including RMSProp , AMSGrad , which is omitted in this paper.

Then $\nabla\phi_{\gamma}(\mathbf{x}_{s-1})=\gamma^{-1}(\mathbf{x}_{s-1}-\mathbf{z}_{s})$ . Then we have $\phi(\mathbf{x}_{s})\geq\phi(\mathbf{z}_{s+1})+\frac{1}{2\gamma}\|\mathbf{x}_{s}-\mathbf{z}_{s+1}\|^{2}$ . Next, we apply Lemma 1 to each call of Sgd in stagewise Sgd,

where the inequality follows from the Young’s inequality with $0<\alpha_{s}<1$ . Thus we have that

Next, we bound $f_{s}(\mathbf{x}_{s-1})-f_{s}(\mathbf{z}_{s})$ given that $\mathbf{x}_{s-1}$ is fixed. According to the definition of $f_{s}(\cdot)$ , we have

Taking expectation over randomness in the $s$ -th stage on both sides, we have

Assuming that $\varepsilon_{2}(\eta_{s},T_{s},\Theta)\leq 1/2$ , we have

Plugging this upper bound into (4), we have

By setting $\alpha_{s}=1/2,\gamma=1/(2\mu)$ and assuming $\varepsilon_{1}(\eta_{s},T_{s},\Theta)\leq 1/(48\gamma)$ , we have

Multiplying both sides by $w_{s}$ , we have that

By summing over $s=1,\ldots,S+1$ , we have

Taking the expectation w.r.t. $\tau\in\{0,\ldots,S\}$ , we have that

For the first term on the R.H.S, we have that

Combining these facts and the assumption $\varepsilon_{3}(\eta_{s},T_{s},\Theta)\leq c/s$ , we have that

Next, we present several variants of the Meta algorithm by employing Sgd, stochastic momentum methods, and AdaGrad as the basic SA algorithm, to which we refer as stagewise Sgd, stagewise stochastic momentum methods, and stagewise AdaGrad, respectively.

In this subsection, we analyze the convergence of stagewise Sgd, in which Sgd shown in Algorithm 2 is employed in the Meta framework. Besides Assumption 1, we impose the following assumption in this subsection.

the domain $\Omega$ is bounded, i.e., there exists $D>0$ such that $\|\mathbf{x}-\mathbf{y}\|\leq D$ for any $\mathbf{x},\mathbf{y}\in\Omega$ .

To state the convergence, we introduce a notation

which is the gradient of the Moreau envelope of the objective function $\phi+\delta_{\Omega}$ . The following theorem exhibits the convergence of stagewise Sgd

Suppose Assumption 1 and 2 hold. By setting $\gamma=1/(2\mu),w_{s}=s^{\alpha},\alpha>0$ , $\eta_{s}=c/s,T_{s}=12\gamma s/c$ where $c>0$ is a free parameter, then stagewise Sgd (Algorithm 1 employing Sgd) returns a solution $\mathbf{x}_{\tau}$ satisfying

where $\hat{G}^{2}=2G^{2}+2\gamma^{-2}D^{2}$ , and $\tau$ is randomly selected from $\{0,\ldots,S\}$ with probabilities $p_{\tau}\propto w_{\tau+1},\tau=0,\ldots,S$ .

2 Stagewise stochastic momentum (SM) methods

To present the analysis of stagewise SM methods, we first provide a convergence result for minimizing $f_{s}(\mathbf{x})$ at each stage.

Suppose Assumption 1 holds. By setting $\gamma=1/(2\mu),w_{s}=s^{\alpha},\alpha>0$ , $\eta_{s}=(1-\beta)\gamma/(96s(\rho\beta+1))$ , $T_{s}\geq 2304(\rho\beta+1)s$ , then we have

where $\tau$ is randomly selected from $\{0,\ldots,S\}$ with probabilities $p_{\tau}\propto w_{\tau+1},\tau=0,\ldots,S$ .

3 Stagewise AdaGrad

In this subsection, we analyze stagewise AdaGrad and establish its adaptive complexity. In particular, we consider the Meta algorithm that employs AdaGrad in Algorithm 4. The key difference of stagewise AdaGrad from stagewise Sgd and stagewise SM is that the number of iterations $T_{s}$ at each stage is adaptive to the history of learning. It is this adaptiveness that makes the proposed stagewise AdaGrad achieve adaptive convergence. It is worth noting that such adaptive scheme has been also considered in for solving stochastic strongly convex problems. In contrast, we consider stochastic weakly convex problems. Similar to previous analysis of AdaGrad , we assume $\|g(\mathbf{x};\xi)\|_{\infty}\leq G,\forall\mathbf{x}\in\Omega$ in this subsection. Note that this is stronger than Assumption 1-(A1). We formally state this assumption required in this subsection below.

$\|g(\mathbf{x};\xi)\|_{\infty}\leq G$ for any $\mathbf{x}\in\Omega$ .

The convergence analysis of stagewise AdaGrad is build on the following lemma, which is attributed to .

Let $f(\mathbf{x})$ be a convex function, $H_{0}=GI$ with $G\geq\max_{t}\|\mathbf{g}_{t}\|_{\infty}$ , and iteration number $T$ satisfy $T\geq M\max\{\frac{G+\max_{i}\|g_{1:T,i}\|}{2},\sum_{i=1}^{d}\|g_{1:T,i}\|\}$ . Algorithm 4 returns an averaged solution $\widehat{\mathbf{x}}_{T}$ such that

where $\mathbf{x}_{*}=\arg\min_{\mathbf{x}\in\Omega}f(\mathbf{x})$ , $g_{1:t}=(\mathbf{g}(\mathbf{x}_{1}),\ldots,\mathbf{g}(\mathbf{x}_{t}))$ and $g_{1:t,i}$ denotes the $i$ -th row of $g_{1:t}$ .

The convergence property of stagewise AdaGrad is described by following theorem.

Suppose Assumption 1, Assumption 2 and Assumption 3 hold. By setting $\gamma=1/(2\mu),w_{s}=s^{\alpha},\alpha>0$ , $\eta_{s}=c/\sqrt{s}$ , $T_{s}\geq M_{s}\max\{\hat{G}+\max_{i}\|g^{s}_{1:T_{s},i}\|,\sum_{i=1}^{d}\|g^{s}_{1:T_{s},i}\|\}$ where $c>0$ is a free parameter, and $M_{s}\eta_{s}\geq 24\gamma$ , then we have

where $\hat{G}=G+\gamma^{-1}D$ , and $g^{s}_{1:t,i}$ denotes the cumulative stochastic gradient of the $i$ -th coordinate at the $s$ -th stage.

4 Stagewise Stochastic ADMM for Solving Problems with Structured Regularizers

In this subsection, we consider solving a regularized problem with a structured regularizer, i.e.,

where $\beta>0$ is the penalty parameter of ADMM, $\|\mathbf{x}\|^{2}_{C}=\mathbf{x}^{\top}C\mathbf{x}$ , and $C=\alpha I-\eta\beta A^{\top}A\succeq I$ with some appropriate $\alpha>0$ .

A convergence upper bound of stochastic ADMM for solving

(Corollary 3 ) For Algorithm 5, assume $f(\mathbf{x})$ is a convex function and $\psi(\cdot)$ is a $\rho$ -Lipschitz continuous convex function, $\mathbf{g}(\mathbf{x}_{t};\xi)\in\partial_{F}f(\mathbf{x}_{t};\xi_{t})$ is used in the update, $C=\alpha I-\eta\beta A^{\top}A\succeq I$ , and Assumption 2 holds. Then,

Suppose Assumption 1 and Assumption 2 hold and SADMM $(f_{s},\mathbf{x}_{s-1},\eta_{s},\beta_{s},T_{s})$ is employed in the Meta Algorithm 1. By setting $\gamma=1/(2\mu),w_{s}=s^{\alpha},\alpha>0$ , $\eta_{s}=c_{1}/s$ , $\beta_{s}=c_{2}s$ , $T_{s}\geq 24s\gamma\max(\alpha/c_{1},c_{2}\|A\|_{2}^{2})$ , where $c_{1},c_{2}>0$ , then we have

where $\tau$ is randomly selected from $\{0,\ldots,S\}$ with probabilities $p_{\tau}\propto w_{\tau+1},\tau=0,\ldots,S$ , and $C$ is some constant depending on $c_{1},c_{2},\rho,G,D,\|A\|_{2}$ .

Remark: The above result can be easily proved. Therefore, the proof is omitted.

Experiments

In this section, we present some empirical results to verify the effectiveness of the proposed stagewise algorithms. We use two benchmark datasets, namely CIFAR-10 and CIFAR-100 for our experiments. We implement the proposed stagewise algorithms in TensorFlow. We compare different algorithms for learning ResNet-20 with batch normalization adopted after each convolution and before ReLU activation.

Baselines. We compare the proposed stagewise algorithms with their variants implemented in TensorFlow. It is notable that AdaGrad has a step size (aka learning rate) parameter note it is not equivalent to the step size in Sgd., which is a constant in theory . However, in the deep learning community a heuristic fixed frequency decay scheme for the step size parameter is commonly adopted . We thus compare two implementations of AdaGrad - one with a constant learning rate parameter and another one with a fixed frequency decay scheme, which are referred to as AdaGrad (theory) and AdaGrad (heuristic). For each baseline algorithms of Sgd, Shb, Snag, we also implement two versions - a theory version with iteratively decreasing size $\eta_{0}/\sqrt{t}$ suggested by previous theories and a heuristic approach with fixed frequency decay scheme used in practice, using (theory) and (heuristic) to indicate them. The fixed frequency decay scheme used in the heuristic variants is similar to that in , i.e., the step size parameter is decreased by 10 at 40k, 60k iterations. We also compare stagwise AdaGrad with AMSGrad - a corrected version of Adam.

Parameters. The stagewise step size $\eta_{s}=\eta_{0}/\sqrt{s}$ is used in stagwise AdaGrad, the number of iterations $T_{s}$ in stagwise AdaGrad is set according to Theorem 4 with some simplifications for dealing with unknown $\hat{G}$ , in particular we set $T_{s}$ to the smallest value larger than $T_{0}\sqrt{s\max_{i}\|g^{s}_{1:T_{s},i}\|\sum_{i}\|g^{s}_{1:T_{s},i}\|}$ . For stagewise Sgd, Shb, Snag, the stagewise step size and iteration number is set to $\eta_{s}=\eta_{0}/s$ and $T_{s}=T_{0}s$ , respectively. For parameter tuning, the initial step sizes of all algorithms are tuned in $\{0.1,0.3,0.5,0.7,0.9\}$ . The value of $\gamma$ of stagewise algorithms is tuned in $\{1,10,100,500,1000,1500,2000,3000\}$ . The initial value $T_{0}$ for stagewise Sgd, Shb, Snag is tuned in $\{10,100,1\text{k},5\text{k},6\text{k},7\text{k},10\text{k},20\text{k}\}$ , and that for stagewise AdaGrad is tuned in $\{1,10,15,20,25,50,100\}$ .

Conclusion

In this paper, we have proposed a universal stagewise learning framework for solving stochastic non-convex optimization problems, which employs well-known tricks in practice that have not been well analyzed theoretically. We provided theoretical convergence results for the proposed algorithms for non-smooth non-convex optimization problems. We also established an adaptive convergence of a stochastic algorithm using data adaptive coordinate-wise step size of AdaGrad, and exhibited its faster convergence than non-adaptive stepsize when the growth of cumulative stochastic gradients is slow similar to that in the convex case. For future work, one may consider developing more variants of the proposed meta algorithm, e.g., stagewise AMSGrad, stagewise RMSProp, etc.

Acknowledgement

T. Yang are partially supported by National Science Foundation (IIS-1545995). Part of this work was done when Chen is interning at JD AI Research and Yang is visiting JD AI Research.

Appendix A Proof of Theorem 2

where the inequality follows from the Young’s inequality with $0<\alpha_{s}<1$ . Thus we have that

Combining the above inequalities, we have that

Multiplying both sides by $w_{s}$ , we have that

By setting $\alpha_{s}=1/2$ and $\gamma=1/(2\mu)$ , $T_{s}\eta_{s}\geq 12\gamma$ , we have

By summing over $s=1,\ldots,S+1$ , we have

Taking the expectation w.r.t. $\tau\in\{0,\ldots,S\}$ , we have that

For the first term on the R.H.S, we have that

Appendix B Proof of Theorem 3

According to the definition of $\mathbf{z}_{s}$ in (20) and Lemma 2, we have that

Similar to the proof of Theorem 2, we have

Rearranging above inequality, we have that

On the other hand, the $\mu$ -weakly convexity of $\phi$ gives that

where $\mathbf{g}(\mathbf{x}_{s-1})\in\partial_{F}\phi(\mathbf{x}_{s-1})$ . Combing these two inequalities we have that

where the second inequality follows from Jensen’s inequality for $\|\cdot\|$ and Young’s inequality. Combining above inequalities and multiplying both side by $w_{s}$ , we have that

By setting $\alpha_{s}=1/2$ , $\eta_{s}(T_{s}+1)\geq 24(1-\beta)\gamma$ , we have that

Summing over $s=1,\ldots,S+1$ and rearranging, we have

Following similar analysis as in the proof of Theorem 2, we can finish the proof. ∎

Appendix C Proof of Theorem 4

Applying Lemma 3 with $T_{s}\geq M_{s}\max\{\frac{\hat{G}+\max_{i}\|g^{s}_{1:T_{s},i}\|}{2},\sum_{i=1}^{d}\|g^{s}_{1:T_{s},i}\|\}$ $M_{s}>0$ , and the fact that $\phi(\mathbf{x}_{s-1})\geq\phi(\mathbf{z}_{s})+\frac{1}{2\gamma}\|\mathbf{x}_{s-1}-\mathbf{z}_{s}\|^{2}$ in $s$ th stage, we have that

Rearranging above inequality then multiplying both side by $w_{s}$ , we have that

By using $M_{s}\eta_{s}\geq 24\gamma$ and summing over $s=1,\ldots,S+1$ , we have that

By the definition of $p_{s}$ in the theorem, taking expectation of $\|\nabla\phi_{\gamma}(\mathbf{x}_{\tau})\|^{2}$ w.r.t. $\tau\in\{0,\ldots,S\}$ we have that

Appendix D Proof of Lemma 2

Following the analysis in , we directly have the following inequality,

where the first two inequalities are due to the strong convexity of $f(\cdot)$ and the last three inequalities are due to the boundness assumption. Thus

By summarizing the above inequality over $k=0,\ldots,T$ , we have

When $\eta\leq(1-\beta)\gamma^{2}\lambda/(8\rho\beta+4)$ , we have

Appendix E Proof of Lemma 3

The proof is almost a duplicate of the proof of Proposition 1 in . For completeness, we present a proof here.

Let $\psi_{0}(\mathbf{x})=0$ and $\|\mathbf{x}\|_{H}=\sqrt{\mathbf{x}^{\top}H\mathbf{x}}$ . First, we can see that $\psi_{t+1}(\mathbf{x})\geq\psi_{t}(\mathbf{x})$ for any $t\geq 0$ . Define $\zeta_{t}=\sum_{\tau=1}^{t}\mathbf{g}_{t}$ and $\Delta_{\tau}=(\partial F(\mathbf{x}_{t})-\mathbf{g}_{t})^{\top}(\mathbf{x}_{t}-\mathbf{x})$ . Let $\psi_{t}^{*}$ be defined by

Taking the summation of objective gap in all iterations, we have

where the last inequality uses the fact that $\psi_{t}(\mathbf{x})$ is 1-strongly convex w.r.t $\|\cdot\|_{\psi_{t}}=\|\cdot\|_{H_{t}}$ and consequentially $\psi_{t}^{*}(\mathbf{x})$ is $\eta$ -smooth wr.t. $\|\cdot\|_{\psi^{*}_{t}}=\|\cdot\|_{H_{t}^{-1}}$ . Thus, we have

Now by the value of $T\geq M\max\{\frac{G+\max_{i}\|g_{1:T,i}\|}{2},\sum_{i=1}^{d}\|g_{1:T,i}\|\}$ , we have

Dividing by $T$ on both sides and setting $\mathbf{x}=\mathbf{x}_{*}$ , following the inequality (3) and the convexity of $f(\mathbf{x})$ we have