Non-convex Finite-Sum Optimization Via SCSG Methods

Lihua Lei, Cheng Ju, Jianbo Chen, Michael I. Jordan

Introduction

We study smooth non-convex finite-sum optimization problems of the form

where each component fi(x)f_{i}(x) is possibly non-convex with a Lipschitz gradient. This generic form captures numerous statistical learning problems, ranging from generalized linear models to deep neural networks .

As in the convex case, gradient methods have better dependence on ϵ\epsilon in the non-convex case but worse dependence on nn. This is due to the requirement of computing a full gradient. Comparing the complexity of SGD and the best achievable rate for stochastic gradient methods, achieved via variance-reduction methods, the dependence on ϵ\epsilon is significantly improved in the latter case. However, unless ϵ< ⁣ ⁣<n1/2\epsilon<\!\!<n^{-1/2}, SGD has similar or even better theoretical complexity than gradient methods and existing variance-reduction methods. In practice, it is often the case that nn is very large (10510910^{5}\sim 10^{9}) while the target accuracy is moderate (10110310^{-1}\sim 10^{-3}). In this case, SGD has a meaningful advantage over other methods, deriving from the fact that it does not require a full gradient computation. This motivates the following research question: Is there an algorithm that

achieves/beats the theoretical complexity of SGD in the regime of modest target accuracy;

and achieves/beats the theoretical complexity of existing variance-reduction methods in the regime of high target accuracy?

The question has been partially answered in the convex case by in their formulation of the stochastically controlled stochastic gradient (SCSG) methods. When the target accuracy is low, SCSG has the same O(ϵ2)O\left(\epsilon^{-2}\right) rate as SGD but with a much smaller data-dependent constant factor (which does not even require bounded gradients). When the target accuracy is high, SCSG achieves the same rate as the best non-accelerated methods, O(nϵ)O(\frac{n}{\epsilon}). Despite the gap between this and the optimal rate, SCSG is the first known algorithm that provably achieves the desired performance in both regimes.

In this paper, we show how to generalize SCSG to the non-convex setting and, surprisingly, provide a completely affirmative answer to the question raised above. Even though we only assume smoothness of each component, we show that SCSG is always O(ϵ1/3)O\left(\epsilon^{-1/3}\right) faster than SGD and is never worse than recently developed variance-reduction methods. When ϵ> ⁣ ⁣>1n\epsilon>\!\!>\frac{1}{n}, SCSG is at least O((ϵn)2/3)O((\epsilon n)^{2/3}) faster than the best variance-reduction algorithm. Comparing with gradient methods, SCSG has a better convergence rate provided ϵ> ⁣ ⁣>n6/5\epsilon>\!\!>n^{-6/5}, which is the common setting in practice. Interestingly, there is a parallel to recent advances in gradient methods; improved the classical O(ϵ1)O(\epsilon^{-1}) rate of gradient descent to O(ϵ5/6)O(\epsilon^{-5/6}); this parallels the improvement of SCSG over SGD from O(ϵ2)O(\epsilon^{-2}) to O(ϵ5/3)O(\epsilon^{-5/3}).

Beyond the theoretical advantages of SCSG, we also show that SCSG yields good empirical performance for the training of multi-layer neural networks. It is worth emphasizing that the mechanism by which SCSG achieves acceleration (variance reduction) is qualitatively different from other speed-up techniques, including momentum and adaptive stepsizes . It will be of interest in future work to explore combinations of these various approaches in the training of deep neural networks.

The rest of paper is organized as follows: In Section 2 we discuss our notation and assumptions and we state the basic SCSG algorithm. We present the theoretical convergence analysis in Section 3. Experimental results are presented in Section 4. All the technical proofs are relegated to the Appendices. Our code is available at https://github.com/Jianbo-Lab/SCSG.

Notation, Assumptions and Algorithm

To formulate our complexity bounds, we define

Further we define H\mathcal{H}^{*} as an upper bound on the variance of the stochastic gradients:

Assumption A1 on the smoothness of individual functions will be made throughout the paper.

for some L<L<\infty and for all i{1,,n}i\in\{1,\ldots,n\}.

In this paper, we also consider the Polyak-Lojasiewicz (P-L) condition . It is weaker than strong convexity as well as other popular conditions that appear in the optimization literature; see for an extensive discussion.

f(x)f(x) satisfies the P-L condition with μ>0\mu>0 if

where xx^{*} is the global minimum of ff.

The algorithm we propose in this paper is similar to that of except (critically) the number of inner loops is a geometric random variable. This is an essential component in the analysis of SCSG, and, as we will show below, it is key in allowing us to extend the complexity analysis for SCSG to the non-convex case. Moreover, that algorithm that we present here employs a mini-batch procedure in the inner loop and outputs a random sample instead of an average of the iterates. The pseudo-code is shown in Algorithm 1.

Define T(ϵ)T(\epsilon) as the minimum number of epochs such that all outputs afterwards are ϵ\epsilon-accurate solutions, i.e.

2 Parameter settings

The generic form (Algorithm 1) allows for flexibility in both stepsize, ηj\eta_{j}, and batch/mini-batch size, (Bj,bj)(B_{j},b_{j}). In order to minimize the amount of tuning needed in practice, we provide several default settings which have theoretical support. The settings and the corresponding complexity results are summarized in Table 2. Note that all settings fix bj=1b_{j}=1 since this yields the best rate as will be shown in Section 3. However, in practice a reasonably large mini-batch size bjb_{j} might be favorable due to the acceleration that could be achieved by vectorization; see Section 4 for more discussions on this point.

Convergence Analysis

First we present the analysis for a single epoch. Given jj, we define

As shown in , the gradient update νk(j)\nu^{(j)}_{k} is a biased estimate of the gradient f(xk(j))\nabla f(x^{(j)}_{k}) conditioning on the current random index iki_{k}. Specifically, within the jj-th epoch,

This reveals the basic qualitative difference between SVRG and SCSG. Most of the novelty in our analysis lies in dealing with the extra term eje_{j}. Unlike , we do not assume xk(j)x\|x^{(j)}_{k}-x^{*}\| to be bounded since this is invalid in unconstrained problems, even in convex cases.

By careful analysis of primal and dual gaps , we find that the stepsize ηj\eta_{j} should scale as (Bj/bj)23(B_{j}/b_{j})^{-\frac{2}{3}}. Then same phenomenon has also been observed in when bj=1b_{j}=1 and Bj=nB_{j}=n.

Let ηjL=γ(Bj/bj)23\eta_{j}L=\gamma(B_{j}/b_{j})^{-\frac{2}{3}}. Suppose γ13\gamma\leq\frac{1}{3} and Bj8bjB_{j}\geq 8b_{j} for all jj, then under Assumption A1,

The proof is presented in Appendix B. It is not surprising that a large mini-batch size will increase the theoretical complexity as in the analysis of mini-batch SGD. For this reason we restrict most of our subsequent analysis to bj1b_{j}\equiv 1.

2 Convergence analysis for smooth non-convex objectives

Under the specifications of Theorem 3.1 and Assumption A1,

However, both of the above settings are suboptimal since they either set the batch sizes BjB_{j} too large or set the mini-batch sizes bjb_{j} too large. By Theorem 3.2, SCSG can be regarded as an interpolation between SGD and SVRG. By leveraging these two parameters, SCSG is able to outperform both methods.

We start from considering a constant batch/mini-batch size BjB,bj1B_{j}\equiv B,b_{j}\equiv 1. Similar to SGD and SCSG, BB should be at least O(Hϵ)O(\frac{\mathcal{H}^{*}}{\epsilon}). In applications like the training of neural networks, the required accuracy is moderate and hence a small batch size suffices. This is particularly important since the gradient can be computed without communication overhead, which is the bottleneck of SVRG-type algorithms. As shown in Corollary 3.3 below, the complexity of SCSG beats both SGD and SVRG.

Assume that LΔf,H=O(1)L\Delta_{f},\mathcal{H}^{*}=O(1), the above bound can be simplified to

When the target accuracy is high, one might consider a sequence of increasing batch sizes. Heuristically, a large batch is wasteful at the early stages when the iterates are inaccurate. Fixing the batch size to be nn as in SVRG is obviously suboptimal. Via an involved analysis, we find that Bjj32B_{j}\sim j^{\frac{3}{2}} gives the best complexity among the class of SCSG algorithms.

The proofs of both Corollary 3.3 and Corollary 3.4 are presented in Appendix C. To simplify the bound (8), we assume that Δf,H=O(1)\Delta_{f},\mathcal{H}^{*}=O(1) in order to highlight the dependence on ϵ\epsilon and nn. Then (8) can be simplified to

The log-factor log5(1ϵ)\log^{5}\left(\frac{1}{\epsilon}\right) is purely an artifact of our proof. It can be reduced to log32+μ(1ϵ)\log^{\frac{3}{2}+\mu}\left(\frac{1}{\epsilon}\right) for any μ>0\mu>0 by setting Bjj32(logj)32+μB_{j}\sim j^{\frac{3}{2}}(\log j)^{\frac{3}{2}+\mu}; see remark 1 in Appendix C.

3 Convergence analysis for P-L objectives

Let λj=5Lbj13μγBj13+5Lbj13\lambda_{j}=\frac{5Lb_{j}^{\frac{1}{3}}}{\mu\gamma B_{j}^{\frac{1}{3}}+5Lb_{j}^{\frac{1}{3}}}. Then under the same settings of Theorem 3.2,

The proofs and additional discussion are presented in Appendix D. Again, Theorem 3.5 covers existing complexity bounds for both SGD and SVRG. In fact, when Bj=bjBB_{j}=b_{j}\equiv B as in SGD, via some calculation, we obtain that

By leveraging the batch and mini-batch sizes, we obtain a counterpart of Corollary 3.3 as below.

Recall the results from Table 1, SCSG is O(1μ+1(μϵ)1/3)O\left(\frac{1}{\mu}+\frac{1}{(\mu\epsilon)^{1/3}}\right) faster than SGD and is never worse than SVRG. When both μ\mu and ϵ\epsilon are moderate, the acceleration of SCSG over SVRG is significant. Unlike the smooth case, we do not find any possible choice of setting that can achieve a better rate than Corollary 3.6.

Experiments

We evaluate SCSG and mini-batch SGD on the MNIST dataset with (1) a three-layer fully-connected neural network with 512512 neurons in each layer (FCN for short) and (2) a standard convolutional neural network LeNet (CNN for short), which has two convolutional layers with 3232 and 6464 filters of size 5×55\times 5 respectively, followed by two fully-connected layers with output size 10241024 and 1010. Max pooling is applied after each convolutional layer. The MNIST dataset of handwritten digits has 50,00050,000 training examples and 10,00010,000 test examples. The digits have been size-normalized and centered in a fixed-size image. Each image is 2828 pixels by 2828 pixels. All experiments were carried out on an Amazon p2.xlarge node with a NVIDIA GK210 GPU with algorithms implemented in TensorFlow 1.0.

Due to the memory issues, sampling a chunk of data is costly. We avoid this by modifying the inner loop: instead of sampling mini-batches from the whole dataset, we split the batch Ij\mathcal{I}_{j} into Bj/bjB_{j}/b_{j} mini-batches and run SVRG-type updates sequentially on each. Despite the theoretical advantage of setting bj=1b_{j}=1, we consider practical settings bj>1b_{j}>1 to take advantage of the acceleration obtained by vectorization. We initialized parameters by TensorFlow’s default Xavier uniform initializer. In all experiments below, we show the results corresponding to the best-tuned stepsizes.

We consider three algorithms: (1) SGD with a fixed batch size B{512,1024}B\in\{512,1024\}; (2) SCSG with a fixed batch size B{512,1024}B\in\{512,1024\} and a fixed mini-batch size b=32b=32; (3) SCSG with time-varying batch sizes Bj=j3/2nB_{j}=\lceil j^{3/2}\wedge n\rceil and bj=Bj/32b_{j}=\lceil B_{j}/32\rceil. To be clear, given TT epochs, the IFO complexity of the three algorithms are TBTB, 2TB2TB and 2j=1TBj2\sum_{j=1}^{T}B_{j}, respectively. We run each algorithm with 20 passes of data. It is worth mentioning that the largest batch size in Algorithm 3 is 2751.5=4561\lceil 275^{1.5}\rceil=4561, which is relatively small compared to the sample size 5000050000.

We plot in Figure 1 the training and the validation loss against the IFO complexity—i.e., the number of passes of data—for fair comparison. In all cases, both versions of SCSG outperform SGD, especially in terms of training loss. SCSG with time-varying batch sizes always has the best performance and it is more stable than SCSG with a fixed batch size. For the latter, the acceleration is more significant after increasing the batch size to 10241024. Both versions of SCSG provide strong evidence that variance reduction can be achieved efficiently without evaluating the full gradient.

Given 2B2B IFO calls, SGD implements updates on two fresh batches while SCSG replaces the second batch by a sequence of variance reduced updates. Thus, Figure 1 shows that the gain due to variance reduction is significant when the batch size is fixed. To further explore this, we compare SCSG with time-varying batch sizes to SGD with the same sequence of batch sizes. The results corresponding to the best-tuned constant stepsizes are plotted in Figure 3(a). It is clear that the benefit from variance reduction is more significant when using time-varying batch sizes.

We also compare the performance of SGD with that of SCSG with time-varying batch sizes against wall clock time, when both algorithms are implemented in TensorFlow and run on a Amazon p2.xlarge node with a NVIDIA GK210 GPU. Due to the cost of computing variance reduction terms in SCSG, each update of SCSG is slower per iteration compared to SGD. However, SCSG makes faster progress in terms of both training loss and validation loss compared to SCD in wall clock time. The results are shown in Figure 2.

Finally, we examine the effect of Bj/bjB_{j}/b_{j}, namely the number of mini-batches within an iteration, since it affects the efficiency in practice where the computation time is not proportional to the batch size. Figure 3(b) shows the results for SCSG with Bj=j3/2nB_{j}=\lceil j^{3/2}\wedge n\rceil and Bj/bj{2,5,10,16,32}\lceil B_{j}/b_{j}\rceil\in\{2,5,10,16,32\}. In general, larger Bj/bjB_{j}/b_{j} yields better performance. It would be interesting to explore the tradeoff between computation efficiency and this ratio on different platforms.

Discussion

We have presented the SCSG method for smooth, non-convex, finite-sum optimization problems. SCSG is the first algorithm that achieves a uniformly better rate than SGD and is never worse than SVRG-type algorithms. When the target accuracy is low, SCSG significantly outperforms the SVRG-type algorithms. Unlike various other variants of SVRG, SCSG is clean in terms of both implementation and analysis. Empirically, SCSG outperforms SGD in the training of multi-layer neural networks.

Compared to momentum-based methods and methods with adaptive stepsizes , the mechanism whereby SCSG achieves acceleration is qualitatively different: while momentum aims at balancing primal and dual gaps , adaptive stepsizes aim at balancing the scale of each coordinate, and variance reduction aims at removing the noise. We believe that an algorithm that combines these three techniques is worthy of further study, especially in the training of deep neural networks where the target accuracy is moderate.

Acknowledgments

The authors thank Zeyuan Allen-Zhu, Chi Jin, Nilesh Tripuraneni, Yi Xu, Tianbao Yang, Shenyi Zhao and anonymous reviewers for helpful discussions.

References

Appendix A Technical Lemmas

In this section we present several technical lemmas that facilitate the proofs of our main results.

We start with a lemma on the variance of the sample mean (without replacement).

Further let J\mathcal{J} be a uniform random subset of {1,,M}\{1,\ldots,M\} with size mm. Then

Proof Let Wj=I(jJ)W_{j}=I(j\in\mathcal{J}), then it is easy to see that

Since the geometric random variable NjN_{j} plays an important role in our analysis, we recall its key properties below.

For any η>1\eta>1 and z>0z>0, define gη(x)g_{\eta}(x) and x(z)x(z) as

Proof For any xx(z)x\geq x(z), denote α=x/z1η\alpha=x/z^{-\frac{1}{\eta}}. Then

Taking the logarithm of both sides, we obtain that

Appendix B One-Epoch Analysis

Assume that ηjL1/4Bj2/3\eta_{j}L\leq 1/4B_{j}^{2/3}. Then for any jj,

where the last line uses Assumption A1. Therefore,

Based on Lemma A.2, Lemma B.1, Lemma B.2 and Lemma B.3, we can derive bounds for primal and dual gaps respectively.

Suppose ηjL<1\eta_{j}L<1, then under Assumption A1,

Let k=Njk=N_{j} in (12). By taking expectation with respect to NjN_{j} and using Fubini’s theorem, we arrive at

Suppose ηj2L2Bj<bj2\eta_{j}^{2}L^{2}B_{j}<b_{j}^{2}, then under Assumption A1,

Proof Since xk+1(j)=xk(j)ηjνk(j)x^{(j)}_{k+1}=x^{(j)}_{k}-\eta_{j}\nu^{(j)}_{k}, we have

Let k=Njk=N_{j} in (15). By taking expectation with respect to NjN_{j} and using Fubini’s theorem, we arrive at

Proof Let Mk(j)=ej,xk(j)x0(j)M^{(j)}_{k}=\langle e_{j},x^{(j)}_{k}-x^{(j)}_{0}\rangle. By definition, we have

Since NjN_{j} is independent of (x0(j),ej)(x^{(j)}_{0},e_{j}), this implies that

Also we have M0(j)=0M^{(j)}_{0}=0. On the other hand,

Let k=Njk=N_{j} in (18). By taking an expectation with respect to NjN_{j} and using Lemma A.2 and B.1, we obtain that

Proof [Theorem 3.1] Multiplying equation (10) by 22, equation (14) by bjηjBj\frac{b_{j}}{\eta_{j}B_{j}} and summing them, we obtain that

By Lemma B.6, the second row can be simplified as

Using the fact that 2a,bβa2+1βb22\langle a,b\rangle\leq\beta\|a\|^{2}+\frac{1}{\beta}\|b\|^{2} for any β>0\beta>0, we have

Putting the pieces together, we conclude that

Since ηjL=θj=γ(bj/Bj)23\eta_{j}L=\theta_{j}=\gamma(b_{j}/B_{j})^{\frac{2}{3}} and bj1,Bj8bj8b_{j}\geq 1,B_{j}\geq 8b_{j}\geq 8,

Since Bj8bj,γ13B_{j}\geq 8b_{j},\gamma\leq\frac{1}{3}, we have

Proof [of Lemma B.1] We prove the first two claims by induction. The first inequality in the proof of Lemma B.4 yields

Using the fact that 2a,b=a2/c+cb2acb2/c2\langle a,b\rangle=\|a\|^{2}/c+c\|b\|^{2}-\|a-cb\|^{2}/c for any c>0c>0, we have

Since ηjL1/4Bj2/31/4\eta_{j}L\leq 1/4B_{j}^{2/3}\leq 1/4, we obtain that

On the other hand, the first inequality in the proof of Lemma B.5 implies that

Using the fact that 2a,b=a2/c+cb2acb2/c2\langle a,b\rangle=\|a\|^{2}/c+c\|b\|^{2}-\|a-cb\|^{2}/c for any c>0c>0, we have

where the last line uses the condition that ηjL1/4\eta_{j}L\leq 1/4. Plugging (23) into (24), we obtain that

Taking expectation in (25) then entails that

This together with the induction hypothesis entail that

To other claims are simple consequences of the first two. In fact, the third claim is followed by (23) and the last two claims followed by the first two claims and the fact that a,ba2/2+b2/2\langle a,b\rangle\leq\|a\|^{2}/2+\|b\|^{2}/2.

Appendix C Convergence Analysis for Smooth Objectives

The corollary is then proved by substituting for BB.

Let T=n23T_{*}=\lfloor n^{\frac{2}{3}}\rfloor. First we prove that W(T)W(T) is strictly decreasing.

When TTT\geq T_{*}, the numerator is a constant and the denominator is strictly increasing. Thus, W(T)W(T) is strictly decreasing on [T,)[T_{*},\infty);

When T<TT<T_{*}, let a1j=6Hja_{1j}=\frac{6\mathcal{H}^{*}}{j} and a2j=j12a_{2j}=j^{\frac{1}{2}}. Further let

It is obvious that U(T)U(T) is strictly decreasing. Noticing that a1ja2j=6Hj32\frac{a_{1j}}{a_{2j}}=\frac{6\mathcal{H}^{*}}{j^{\frac{3}{2}}} is strictly decreasing, we also conclude that V(T)V(T) is strictly decreasing. Therefore, W(T)W(T) is strictly decreasing on [1,T][1,T_{*}].

In summary, W(T)W(T) is strictly decreasing.

Now we derive the bound for T(ϵ)T(\epsilon). To do so, we distinguish two cases to analyze W(T)W(T).

Since 1j\frac{1}{j} is decreasing, we have

Similarly, since j12j^{\frac{1}{2}} is increasing, we have

Putting the pieces together, we obtain that

It is easy to see that both W1(T)W_{1}(T) and W2(T)W_{2}(T) are strictly decreasing and limTW1(T)=limTW2(T)=0\lim_{T\rightarrow\infty}W_{1}(T)=\lim_{T\rightarrow\infty}W_{2}(T)=0. Let

Recall that W(T)W(T) is also strictly decreasing, we have

On the other hand, it is straightforward to derive a bound for T2(ϵ)T_{2}(\epsilon) as

Finally, to obtain the bound for the computation complexity, we notice that

Let x+x_{+} denote the positive part of xx; i.e., x+=max{x,0}x_{+}=\max\{x,0\}. Therefore,

The log-factor log5(1ϵ)\log^{5}\left(\frac{1}{\epsilon}\right) can be reduced to log32+μ(1ϵ)\log^{\frac{3}{2}+\mu}\left(\frac{1}{\epsilon}\right) for any μ>0\mu>0 by setting Bj=j32(logj)32+μnB_{j}=\lceil j^{\frac{3}{2}}(\log j)^{\frac{3}{2}+\mu}\wedge n\rceil. In this case,

Using similar arguments and treating Δf\Delta_{f} and H\mathcal{H}^{*} as O(1)O(1) for simplicity, we can obtain that

If BT(ϵ)nB_{T(\epsilon)}\geq n, we obtain the same bound as in Corollary 3.4.

Appendix D Convergence Analysis for P-L Objectives

Proof [Theorem 3.5] By equation (22) in the proof of Theorem 3.2 (see p. 22) and the P-L condition,

By definition of λj\lambda_{j}, this can be reformulated as

Applying the above inequality iteratively for j=T,T1,,1j=T,T-1,\ldots,1, we prove the result.

Proof [Corollary 3.6] When BjB,bj1B_{j}\equiv B,b_{j}\equiv 1 and γ=16\gamma=\frac{1}{6}, (30) in the proof of Theorem 3.5 can be reformulated as