Stochastic Variance Reduction for Nonconvex Optimization

Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, Alex Smola

Introduction

We study nonconvex finite-sum problems of the form

where neither ff nor the individual fif_{i} (i[n]i\in[n]) are necessarily convex; just Lipschitz smooth (i.e., Lipschitz continuous gradients). We use Fn\mathcal{F}_{n} to denote all functions of the form (1). We optimize such functions in the Incremental First-order Oracle (IFO) framework (Agarwal & Bottou, 2014) defined below.

IFO based complexity analysis was introduced to study lower bounds for finite-sum problems. Algorithms that use IFOs are favored in large-scale applications as they require only a small amount first-order information at each iteration. Two fundamental models in machine learning that profit from IFO algorithms are (i) empirical risk minimization, which typically uses convex finite-sum models; and (ii) deep learning, which uses nonconvex ones.

The prototypical IFO algorithm, stochastic gradient descent (Sgd)We use ‘incremental gradient’ and ‘stochastic gradient’ interchangeably, though we are only interested in finite-sum problems. has witnessed tremendous progress in the recent years. By now a variety of accelerated, parallel, and faster converging versions are known. Among these, of particular importance are variance reduced (VR) stochastic methods (Schmidt et al., 2013; Johnson & Zhang, 2013; Defazio et al., 2014a), which have delivered exciting progress such as linear convergence rates (for strongly convex functions) as opposed to sublinear rates of ordinary Sgd (Robbins & Monro, 1951; Nemirovski et al., 2009). Similar (but not same) benefits of VR methods can also be seen in smooth convex functions. The Svrg algorithm of (Johnson & Zhang, 2013) is particularly attractive here because of its low storage requirement in comparison to the algorithms in (Schmidt et al., 2013; Defazio et al., 2014a).

Despite the meteoric rise of VR methods, their analysis for general nonconvex problems is largely missing. Johnson & Zhang (2013) remark on convergence of Svrg when fFnf\in\mathcal{F}_{n} is locally strongly convex and provide compelling experimental results (Fig. 4 in (Johnson & Zhang, 2013)). However, problems encountered in practice are typically not even locally convex, let alone strongly convex. The current analysis of Svrg does not extend to nonconvex functions as it relies heavily on convexity for controlling the variance. Given the dominance of stochastic gradient methods in optimizing deep neural nets and other large nonconvex models, theoretical investigation of faster nonconvex stochastic methods is much needed.

Convex VR methods are known to enjoy the faster convergence rate of GradientDescent but with a much weaker dependence on nn, without compromising the rate like Sgd. However, it is not clear if these benefits carry beyond convex problems, prompting the central question of this paper:

For nonconvex functions in Fn\mathcal{F}_{n}, can one achieve convergence rates faster than both Sgd and GradientDescent using an IFO? If so, then how does the rate depend on nn and on the number of iterations performed by the algorithm?

Perhaps surprisingly, we provide an affirmative answer to this question by showing that a careful selection of parameters in Svrg leads to faster convergence than both Sgd and GradientDescent. To our knowledge, ours is the first work to improve convergence rates of Sgd and GradientDescent for IFO-based nonconvex optimization.

Main Contributions. We summarize our main contributions below and also list the key results in Table 1.

We analyze nonconvex stochastic variance reduced gradient (Svrg), and prove that it has faster rates of convergence than GradientDescent and ordinary Sgd. We show that Svrg is faster than GradientDescent by a factor of n1/3n^{1/3} (see Table 1).

We provide new theoretical insights into the interplay between step-size, iteration complexity and convergence of nonconvex Svrg (see Corollary 2).

For an interesting nonconvex subclass of Fn\mathcal{F}_{n} called gradient dominated functions Polyak (1963); Nesterov & Polyak (2006), we propose a variant of Svrg that attains a global linear rate of convergence. We improve upon many prior results for this subclass of functions (see Section 3.1). To the best of our knowledge, ours is the first work that shows a stochastic method with linear convergence for gradient dominated functions.

We analyze mini-batch nonconvex Svrg and show that it provably benefits from mini-batching. Specifically, we show theoretical linear speedups in parallel settings for large mini-batch sizes. By using a mini-batch of size bb (<n2/3)(<n^{2/3}), we show that mini-batch nonconvex Svrg is faster by a factor of bb (Theorem 7). We are not aware of any prior work on mini-batch first-order stochastic methods that shows linear speedup in parallel settings for nonconvex optimization.

Our analysis yields as a byproduct a direct convergence analysis for Svrg for smooth convex functions (Section 4).

We examine a variant of Svrg (called Msvrg) that has faster rates than both GradientDescent and Sgd.

Nonconvex. Sgd dates at least to the seminal work (Robbins & Monro, 1951); and since then it has been developed in several directions (Poljak & Tsypkin, 1973; Ljung, 1977; Bottou, 1991; Kushner & Clark, 2012). In the (nonsmooth) finite-sum setting, Sra (2012) considers proximal splitting methods, and analyzes asymptotic convergence with nonvanishing gradient errors. Hong (2014) studies a distributed nonconvex incremental ADMM algorithm.

These works, however, only prove expected convergence to stationary points and often lack analysis of rates. The first nonasymptotic convergence rate analysis for Sgd is in (Ghadimi & Lan, 2013), who show that Sgd ensures f2ϵ\|\nabla f\|^{2}\leq\epsilon in O(1/ϵ2)O(1/\epsilon^{2}) iterations. A similar rate for parallel and distributed Sgd was shown recently in (Lian et al., 2015). GradientDescent is known to ensure f2ϵ\|\nabla f\|^{2}\leq\epsilon in O(1/ϵ)O(1/\epsilon) iterations (Nesterov, 2003, Chap. 1.2.3).

The first analysis of nonconvex Svrg seems to be due to Shamir (2014), who considers the special problem of computing a few leading eigenvectors (e.g., for PCA); see also the follow up work (Shamir, 2015). Finally, we note another interesting example, stochastic optimization of locally quasi-convex functions (Hazan et al., 2015), wherein actually a O(1/ϵ2)O(1/\epsilon^{2}) convergence in function value is shown.

Background & Problem Setup

We say ff is LL-smooth if there is a constant LL such that

Throughout, we assume that the functions fif_{i} in (1) are LL-smooth, so that fi(x)fi(y)Lxy\|\nabla f_{i}(x)-\nabla f_{i}(y)\|\leq L\|x-y\| for all i[n]i\in[n]. Such an assumption is very common in the analysis of first-order methods. Here the Lipschitz constant LL is assumed to be independent of nn. A function ff is called λ\lambda-strongly convex if there is λ0\lambda\geq 0 such that

The quantity κ:=L/λ\kappa:=L/\lambda is called the condition number of ff, whenever ff is LL-smooth and λ\lambda-strongly convex. We say ff is non-strongly convex when ff is -strongly convex.

where xx^{*} is a global minimizer of ff. Note that such a function ff need not be convex; it is also easy to show that a λ\lambda-strongly convex function is 1/2λ1/2\lambda-gradient dominated.

We analyze convergence rates for the above classes of functions. Following Nesterov (2003); Ghadimi & Lan (2013) we use f(x)2ϵ\|\nabla f(x)\|^{2}\leq\epsilon to judge when is iterate xx approximately stationary. Contrast this with Sgd for convex ff, where one uses [f(x)f(x)][f(x)-f(x^{*})] or xx2\|x-x^{*}\|^{2} as a convergence criterion. Unfortunately, such criteria cannot be used for nonconvex functions due to the hardness of the problem. While the quantities f(x)2\|\nabla f(x)\|^{2} and f(x)f(x)f(x)-f(x^{*}) or xx2\|x-x^{*}\|^{2} are not comparable in general (see (Ghadimi & Lan, 2013)), they are typically assumed to be of similar magnitude. Throughout our analysis, we do not assume nn to be constant, and report dependence on it in our results. For our analysis, we need the following definition.

We introduce one more definition useful in the analysis of Sgd methods for bounding the variance.

Stochastic gradient descent (Sgd) is one of the simplest algorithms for solving (1); Algorithm 1 lists its pseudocode.

By using a uniformly randomly chosen (with replacement) index iti_{t} from [n][n], Sgd uses an unbiased estimate of the gradient at each iteration. Under appropriate conditions, Ghadimi & Lan (2013) establish convergence rate of Sgd to a stationary point of ff. Their results include the following theorem.

Suppose ff has σ\sigma-bounded gradient; let ηt=η=c/T\eta_{t}=\eta=c/\sqrt{T} where c=2(f(x0)f(x))Lσ2c=\sqrt{\tfrac{2(f(x^{0})-f(x^{*}))}{L\sigma^{2}}}, and xx^{*} is an optimal solution to (1). Then, the iterates of Algorithm 1 satisfy

For completeness we present a proof in the appendix. Note that our choice of step size η\eta requires knowing the total number of iterations TT in advance. A more practical approach is to use a ηt1/t\eta_{t}\propto 1/\sqrt{t} or 1/t1/t. A bound on IFO calls made by Algorithm 1 follows as a corollary of Theorem 1.

Suppose function ff has σ\sigma-bounded gradient, then the IFO complexity of Algorithm 1 to obtain an ϵ\epsilon-accurate solution is O(1/ϵ2)O(1/\epsilon^{2}).

As seen in Theorem 1, Sgd has a convergence rate of O(\nicefrac1T)O(\nicefrac{{1}}{{\sqrt{T}}}). This rate is not improvable in general even when the function is (non-strongly) convex (Nemirovski & Yudin, 1983). This barrier is due to the variance introduced by the stochasticity of the gradients, and it is not clear if better rates can be obtained Sgd even for convex fFnf\in\mathcal{F}_{n}.

Nonconvex SVRG

We now turn our focus to variance reduced methods. We use Svrg (Johnson & Zhang, 2013), an algorithm recently shown to be very effective for reducing variance in convex problems. As a result, it has gained considerable interest in both machine learning and optimization communities. We seek to understand its benefits for nonconvex optimization. For reference, Algorithm 2 presents Svrg’s pseudocode.

Algorithm 2 attains linear convergence for strongly convex ff (Johnson & Zhang, 2013); for non-strongly convex functions, rates faster than Sgd can be shown by using an indirect perturbation argument—see e.g., (Konečný & Richtárik, 2013; Xiao & Zhang, 2014).

We first state an intermediate result for the iterates of nonconvex Svrg. To ease exposition, we define

for some parameters ct+1c_{t+1} and βt\beta_{t} (to be defined shortly).

Our first main result is the following theorem that provides convergence rate of Algorithm 2.

Let fFnf\in\mathcal{F}_{n}. Let cm=0c_{m}=0, ηt=η>0\eta_{t}=\eta>0, βt=β>0\beta_{t}=\beta>0, and ct=ct+1(1+ηβ+2η2L2)+η2L3c_{t}=c_{t+1}(1+\eta\beta+2\eta^{2}L^{2})+\eta^{2}L^{3} such that Γt>0\Gamma_{t}>0 for 0    t    m10{\;\leq\;}t{\;\leq\;}m-1. Define the quantity γn:=mintΓt\gamma_{n}:=\min_{t}\Gamma_{t}. Further, let pi=0p_{i}=0 for 0    i  <  m0{\;\leq\;}i{\;<\;}m and pm=1p_{m}=1, and let TT be a multiple of mm. Then for the output xax_{a} of Algorithm 2 we have

where xx^{*} is an optimal solution to (1).

Furthermore, we can also show that nonconvex Svrg exhibits expected descent (in objective) after every epoch. The condition that TT is a multiple of mm is solely for convenience and can be removed by slight modification of the theorem statement. Note that the value γn\gamma_{n} above can depend on nn. To obtain an explicit dependence, we simplify it using specific choices for η\eta and β\beta, as formalized below.

Suppose fFnf\in\mathcal{F}_{n}. Let η=μ0/(Lnα)\eta=\mu_{0}/(Ln^{\alpha}) (0<μ0<10<\mu_{0}<1 and 0<α10<\alpha\leq 1), β=L/nα/2\beta=L/n^{\alpha/2}, m=n3α/2/(3μ0)m=\lfloor n^{3\alpha/2}/(3\mu_{0})\rfloor and TT is some multiple of mm. Then there exists universal constants μ0,ν>0\mu_{0},\nu>0 such that we have the following: γnνLnα\gamma_{n}\geq\frac{\nu}{Ln^{\alpha}} in Theorem 2 and

where xx^{*} is an optimal solution to the problem in (1) and xax_{a} is the output of Algorithm 2.

By rewriting the above result in terms IFO calls, we get the following general corollary for nonconvex Svrg.

Suppose fFnf\in\mathcal{F}_{n}. Then the IFO complexity of Algorithm 2 (with parameters from Theorem 3) for achieving an ϵ\epsilon-accurate solution is:

Corollary 2 shows the interplay between step size and the IFO complexity. We observe that the number of IFO calls is minimized in Corollary 2 when α=2/3\alpha=2/3. This gives rise to the following key results of the paper.

Suppose fFnf\in\mathcal{F}_{n}. Let η=μ1/(Ln2/3)\eta=\mu_{1}/(Ln^{2/3}) (0<μ1<10<\mu_{1}<1), β=L/n1/3\beta=L/n^{1/3}, m=n/(3μ1)m=\lfloor n/(3\mu_{1})\rfloor and TT is some multiple of mm. Then there exists universal constants μ1,ν1>0\mu_{1},\nu_{1}>0 such that we have the following: γnν1Ln2/3\gamma_{n}\geq\frac{\nu_{1}}{Ln^{2/3}} in Theorem 2 and

where xx^{*} is an optimal solution to the problem in (1) and xax_{a} is the output of Algorithm 2.

If fFnf\in\mathcal{F}_{n}, then the IFO complexity of Algorithm 2 (with parameters in Corollary 3) to obtain an ϵ\epsilon-accurate solution is O(n+(n2/3/ϵ))O(n+(n^{2/3}/\epsilon)).

Note the rate of O(1/T)O(1/T) in the above results, as opposed to slower O(1/T)O(1/\sqrt{T}) rate of Sgd (Theorem 1). For a more comprehensive comparison of the rates, refer to Section 6.

Before ending our discussion on convergence of nonconvex Svrg, we prove a linear convergence rate for the class of τ\tau-gradient dominated functions (2). For ease of exposition, assume that τ>n1/3\tau>n^{1/3}, a property analogous to the “high condition number regime” for strongly convex functions typical in machine learning. Note that gradient dominated functions can be nonconvex.

Suppose ff is τ\tau-gradient dominated where τ>n1/3\tau>n^{1/3}. Then, the iterates of Algorithm 3 with T=2Lτn2/3/ν1T=\lceil 2L\tau n^{2/3}/\nu_{1}\rceil, m=n/(3μ1)m=\lfloor n/(3\mu_{1})\rfloor, ηt=μ1/(Ln2/3)\eta_{t}=\mu_{1}/(Ln^{2/3}) for all 0tm10\leq t\leq m-1 and pm=1p_{m}=1 and pi=0p_{i}=0 for all 0i<m0\leq i<m satisfy

Here μ1\mu_{1} and ν1\nu_{1} are the constants used in Corollary 3.

In fact, for τ\tau-gradient dominated functions we can prove a stronger result of global linear convergence.

If ff is τ\tau-gradient dominated (τ>n1/3\tau>n^{1/3}), then with T=2Lτn2/3/ν1T=\lceil 2L\tau n^{2/3}/\nu_{1}\rceil, m=n/(3μ1)m=\lfloor n/(3\mu_{1})\rfloor, ηt=μ1/(Ln2/3)\eta_{t}=\mu_{1}/(Ln^{2/3}) for 0tm10\leq t\leq m-1 and pm=1p_{m}=1 and pi=0p_{i}=0 for all 0i<m0\leq i<m, the iterates of Algorithm 3 satisfy

Here μ1\mu_{1}, ν1\nu_{1} are as in Corollary 3; xx^{*} is an optimal solution.

An immediate consequence is the following.

If ff is τ\tau-gradient dominated, the IFO complexity of Algorithm 3 (with parameters from Theorem 4) to compute an ϵ\epsilon-accurate solution is O((n+τn2/3)log(1/ϵ))O((n+\tau n^{2/3})\log(1/\epsilon)).

Note that GradientDescent can also achieve linear convergence rate for gradient dominated functions (Polyak, 1963). However, GradientDescent requires O(n+nτlog(1/ϵ))O(n+n\tau\log(1/\epsilon)) IFO calls to obtain an ϵ\epsilon-accurate solution as opposed to O(n+n2/3τlog(1/ϵ))O(n+n^{2/3}\tau\log(1/\epsilon)) for Svrg. Similar (but not the same) gains can be seen for Svrg for strongly convex functions (Johnson & Zhang, 2013). Also notice that we did not assume anything except smoothness on the individual functions fif_{i} in the above results. In particular, the following corollary is also an immediate consequence.

If ff is λ\lambda-strongly convex and the functions {fi}i=1n\{f_{i}\}_{i=1}^{n} are possibly nonconvex, then the number of IFO calls made by Algorithm 3 (with parameters from Theorem 4) to compute an ϵ\epsilon-accurate solution is O((n+n2/3κ)log(1/ϵ))O((n+n^{2/3}\kappa)\log(1/\epsilon)).

Recall that here κ\kappa denotes the condition number L/λL/\lambda for a λ\lambda-strongly convex function. Corollary 6 follows from Corollary 5 upon noting that λ\lambda-strongly convex function is 1/2λ1/2\lambda-gradient dominated. Theorem 5 generalizes the linear convergence result in (Johnson & Zhang, 2013) since it allows nonconvex fif_{i}. Observe that Corollary 6 also applies when fif_{i} is strongly convex for all i[n]i\in[n], though in this case a more refined result can be proved (Johnson & Zhang, 2013).

Finally, we note that our result also improves on a recent result on Sdca in the setting of Corollary 6 when the condition number κ\kappa is reasonably large – a case that typically arises in machine learning. More precisely, for l2l_{2}-regularized empirical loss minimization, Shalev-Shwartz (2015) show that Sdca requires O((n+κ2)log(1/ϵ)O((n+\kappa^{2})\log(1/\epsilon) iterations when the fif_{i}’s are possibly nonconvex but their sum ff is strongly convex. In comparison, we show that Algorithm 3 requires O((n+n2/3κ)log(1/ϵ))O((n+n^{2/3}\kappa)\log(1/\epsilon)) iterations, which is an improvement over Sdca when κ>n2/3\kappa>n^{2/3}.

Convex Case

In the previous section, we showed nonconvex Svrg converges to a stationary point at the rate O(n2/3/T)O(n^{2/3}/T). A natural question is whether this rate can be improved if we assume convexity? We provide an affirmative answer. For non-strongly convex functions, this yields a direct analysis (i.e., not based on strongly convex perturbations) for Svrg. While we state our results in terms of stationarity gap f(x)2\|\nabla f(x)\|^{2} for the ease of comparison, our analysis also provides rates with respect to the optimality gap [f(x)f(x)][f(x)-f(x^{*})] (see the proof of Theorem 6 in the appendix).

If fif_{i} is convex for all i[n]i\in[n], pi=1/mp_{i}=1/m for 0    i    m10{\;\leq\;}i{\;\leq\;}m-1, and pm=0p_{m}=0, then for Algorithm 2, we have

where xx^{*} is optimal for (1) and xax_{a} is the output of Algorithm 2.

We now state corollaries of this theorem that explicitly show the dependence on nn in the convergence rates.

If m=nm=n and η=1/(8Ln)\eta=1/(8L\sqrt{n}) in Theorem 6, then we have the following bound:

where xx^{*} is optimal for (1) and xax_{a} is the output of Algorithm 2.

The above result uses a step size that depends on nn. For the convex case, we can also use step sizes independent of nn. The following corollary states the associated result.

If m=nm=n and η=1/(8L)\eta=1/(8L) in Theorem 6, then we have the following bound:

where xx^{*} is optimal for (1) and xax_{a} is the output of Algorithm 2.

We can rewrite these corollaries in terms of IFO complexity to get the following corollaries.

If fif_{i} is convex for all i[n]i\in[n], then the IFO complexity of Algorithm 2 (with parameters from Corollary 7) to compute an ϵ\epsilon-accurate solution is O(n+(n/ϵ))O(n+(\sqrt{n}/\epsilon)).

If fif_{i} is convex for all i[n]i\in[n], then the IFO complexity of Algorithm 2 (with parameters from Corollary 8) to compute ϵ\epsilon-accurate solution is O(n/ϵ)O(n/\epsilon).

These results follow from Corollary 7 and Corollary 8 and noting that for m=O(n)m=O(n) the total IFO calls made by Algorithm 2 is O(n)O(n). It is instructive to quantitatively compare Corollary 9 and Corollary10. With a step size independent of nn, the convergence rate of Svrg has a dependence that is in the order of nn (Corollary 8). But this dependence can be reduced to n\sqrt{n} by either carefully selecting a step size that diminishes with nn (Corollary 7) or by using a good initial point x0x^{0} obtained by, say, running O(n)O(n) iterations of Sgd.

We emphasize that the convergence rate for convex case can be improved significantly by slightly modifying the algorithm (either by adding an appropriate strongly convex perturbation (Xiao & Zhang, 2014) or by using a choice of mm that changes with epoch (Zhu & Yuan, 2015)). However, it is not clear if these strategies provide any theoretical gains for the general nonconvex case.

Mini-batch Nonconvex SVRG

In this section, we study the mini-batch version of Algorithm 2. Mini-batching is a popular strategy, especially in multicore and distributed settings as it greatly helps one exploit parallelism and reduce the communication costs. The pseudocode for mini-batch nonconvex Svrg (Algorithm 4) is provided in the supplement due to lack of space. The key difference between the mini-batch Svrg and Algorithm 2 lies in lines 6 to 8. To use mini-batches we replace line 6 with sampling (with replacement) a mini-batch It[n]I_{t}\subset[n] of size bb; lines 7 to 8 are replaced with the following updates:

When b=1b=1, this reduces to Algorithm 2. Mini-batch is typically used to reduce the variance of the stochastic gradient and increase the parallelism. Lemma 4 (in Section G of the appendix) shows the reduction in the variance of stochastic gradients with mini-batch size bb. Using this lemma, one can derive the mini-batch equivalents of Lemma 1, Theorem 2 and Theorem 3. However, for the sake of brevity, we directly state the following main result for mini-batch Svrg.

Let γn\overline{\gamma}_{n} denote the following quantity:

where cm=0\overline{c}_{m}=0, ct=ct+1(1+ηβ+\nicefrac2η2L2b)+\nicefracηt2L3b\overline{c}_{t}=\overline{c}_{t+1}(1+\eta\beta+\nicefrac{{2\eta^{2}L^{2}}}{{b}})+\nicefrac{{\eta_{t}^{2}L^{3}}}{{b}} for 0    t    m10{\;\leq\;}t{\;\leq\;}m-1. Suppose η=μ2b/(Ln2/3)\eta=\mu_{2}b/(Ln^{2/3}) (0<μ2<10<\mu_{2}<1), β=L/n1/3\beta=L/n^{1/3}, m=n/(3bμ2)m=\lfloor n/(3b\mu_{2})\rfloor and TT is some multiple of mm. Then for the mini-batch version of Algorithm 2 with mini-batch size b<n2/3b<n^{2/3}, there exists universal constants μ2,ν2>0\mu_{2},\nu_{2}>0 such that we have the following: γnν2bLn2/3\overline{\gamma}_{n}\geq\frac{\nu_{2}b}{Ln^{2/3}} and

It is important to compare this result with mini-batched Sgd. For a batch size of bb, Sgd obtains a rate of O(1/bT)O(1/\sqrt{bT}) Dekel et al. (2012) (obtainable by a simple modification of Theorem 1). Specifically, Sgd has a 1/b1/\sqrt{b} dependence on the batch size. In contrast, Theorem 7 shows that Svrg has a much better dependence of 1/b1/b on the batch size. Hence, compared to Sgd, Svrg allows more efficient mini-batching. More formally, in terms of IFO queries we have the following result.

If fFnf\in\mathcal{F}_{n}, then the IFO complexity of the mini-batch version of Algorithm 2 (with parameters from Theorem 7 and mini-batch size b<n2/3b<n^{2/3}) to obtain an ϵ\epsilon-accurate solution is O(n+(n2/3/ϵ))O(n+(n^{2/3}/\epsilon)).

Corollary 11 shows an interesting property of mini-batch Svrg. First, note that bb IFO calls are required for calculating the gradient on a mini-batch of size bb. Hence, Svrg does not gain on IFO complexity by using mini-batches. However, if the bb gradients are calculated in parallel, then this leads to a theoretical linear speedup in multicore and distributed settings. In contrast, Sgd does not yield an efficient mini-batch strategy as it requires O(b1/2/ϵ2)O(b^{1/2}/\epsilon^{2}) IFO calls for achieving an ϵ\epsilon-accurate solution (Li et al., 2014). Thus, the performance of Sgd degrades with mini-batching.

Comparison of the convergence rates

In this section, we give a comprehensive comparison of results obtained in this paper. In particular, we compare key aspects of the convergence rates for Sgd, GradientDescent, and Svrg. The comparison is based on IFO complexity to achieve an ϵ\epsilon-accurate solution.

Dependence on nn: The number of IFO calls of Svrg and GradientDescent depend explicitly on nn. In contrast, the number of oracle calls of Sgd is independent of nn (Theorem 1). However, this comes at the expense of worse dependence on ϵ\epsilon. The number of IFO calls in GradientDescent is proportional to nn. But for Svrg this dependence reduces to n1/2n^{1/2} for convex (Corollary 7) and n2/3n^{2/3} for nonconvex (Corollary 3) problems. Whether this difference in dependence on nn is due to nonconvexity or just an artifact of our analysis is an interesting open problem.

Dependence on ϵ\epsilon: The dependence on ϵ\epsilon (or alternatively TT) follows from the convergence rates of the algorithms. Sgd is seen to depend as O(1/ϵ2)O(1/\epsilon^{2}) on ϵ\epsilon, regardless of convexity or nonconvexity. In contrast, for both convex and nonconvex settings, Svrg and GradientDescent converge as O(1/ϵ)O(1/\epsilon). Furthermore, for gradient dominated functions, Svrg and GradientDescent have global linear convergence. This speedup in convergence over Sgd is especially significant when medium to high accuracy solutions are required (i.e., ϵ\epsilon is small).

Assumptions used in analysis: It is important to understand the assumptions used in deriving the convergence rates. All algorithms assume Lipschitz continuous gradients. However, Sgd requires two additional subtle but important assumptions: σ\sigma-bounded gradients and advance knowledge of TT (since its step sizes depend on TT). On the other hand, both Svrg and GradientDescent do not require these assumptions, and thus, are more flexible.

Step size / learning rates: It is valuable to compare the step sizes used by the algorithms. The step sizes of Sgd shrink as the number of iterations TT increases—an undesirable property. On the other hand, the step sizes of Svrg and GradientDescent are independent of TT. Hence, both these algorithms can be executed with a fixed step size. However, Svrg uses step sizes that depend on nn (see Corollary 3 and Corollary 7). A step size independent of nn can be used for Svrg for convex ff, albeit at cost of worse dependence on nn (Corollary 8). GradientDescent does not have this issue as its step size is independent of both nn and TT.

Dependence on initial point and mini-batch: Svrg is more sensitive to the initial point in comparison to Sgd. This can be seen by comparing Corollary 3 (of Svrg) to Theorem 1 (of Sgd). Hence, it is important to use a good initial point for Svrg. Similarly, a good mini-batch can be beneficial to Svrg. Moreover, mini-batches not only provides parallelism but also good theoretical guarantees (see Theorem 7). In contrast, the performance gain in Sgd with mini-batches is not very pronounced (see Section 5).

Best of two worlds

We have seen in the previous section that Svrg combines the benefits of both GradientDescent and Sgd. We now show that these benefits of Svrg can be made more pronounced by an appropriate step size under additional assumptions. In this case, the IFO complexity of Svrg is lower than those of Sgd and GradientDescent. This variant of Svrg (Msvrg) chooses a step size based on the total number of iterations TT (or alternatively ϵ\epsilon). For our discussion below, we assume that T>nT>n.

Let fFnf\in\mathcal{F}_{n} have σ\sigma-bounded gradients. Let ηt=η=max{\nicefraccT,\nicefracμ1(Ln2/3)}\eta_{t}=\eta=\max\{\nicefrac{{c}}{{\sqrt{T}}},\nicefrac{{\mu_{1}}}{{(Ln^{2/3})}}\} (μ1\mu_{1} is the universal constant from Corollary 3), m=\nicefracn(3μ1)m=\lfloor\nicefrac{{n}}{{(3\mu_{1})}}\rfloor, and c=f(x0)f(x)2Lσ2c=\sqrt{\frac{f(x^{0})-f(x^{*})}{2L\sigma^{2}}}. Further, let TT be a multiple of mm, pm=1p_{m}=1, and pi=0p_{i}=0 for 0    i  <  m0{\;\leq\;}i{\;<\;}m. Then, the output xax_{a} of Algorithm 2 satisfies

where νˉ\bar{\nu} is a universal constant, ν1\nu_{1} is the universal constant from Corollary 3 and xx^{*} is an optimal solution to (1).

If fFnf\in\mathcal{F}_{n} has σ\sigma-bounded gradients, the IFO complexity of Algorithm 2 (with parameters from Theorem 8) to achieve an ϵ\epsilon-accurate solution is O(min{1/ϵ2,n2/3/ϵ})O(\min\{1/\epsilon^{2},n^{2/3}/\epsilon\}).

An almost identical reasoning can be applied when ff is convex to get the bounds specified in Table 1. Hence, we omit the details and directly state the following result.

Suppose fif_{i} is convex for i[n]i\in[n] and ff has σ\sigma-bounded gradients, then the IFO complexity of Algorithm 2 (with step size η=max{1/(LT),1/(8Ln)}\eta=\max\{1/(L\sqrt{T}),1/(8L\sqrt{n})\}, m=nm=n and pi=1/mp_{i}=1/m for 0im10\leq i\leq m-1 and pm=0p_{m}=0) to achieve an ϵ\epsilon-accurate solution is O(min{1/ϵ2,n/ϵ})O(\min\{1/\epsilon^{2},\sqrt{n}/\epsilon\}).

Msvrg has a convergence rate faster than those of both Sgd and Svrg, though this benefit is not without cost. Msvrg, in contrast to Svrg, uses the additional assumption of σ\sigma-bounded gradients. Furthermore, its step size is not fixed since it depends on the number of iterations TT. While it is often difficult in practice to compute the step size of Msvrg (Theorem 8), it is typical to try multiple step sizes and choose the one with the best results.

Experiments

We present our empirical results in this section. For our experiments, we study the problem of multiclass classification using neural networks. This is a typical nonconvex problem encountered in machine learning.

We compare Sgd (the de-facto algorithm for training neural networks) against nonconvex Svrg. The step size (or learning rate) is critical for Sgd. We set the learning rate of Sgd using the popular tt-inverse schedule ηt=η0(1+ηt/n)1\eta_{t}=\eta_{0}(1+\eta^{\prime}\lfloor t/n\rfloor)^{-1}, where η0\eta_{0} and η\eta^{\prime} are chosen so that Sgd gives the best performance on the training loss. In our experiments, we also use η=0\eta^{\prime}=0; this results in a fixed step size for Sgd. For Svrg, we use a fixed step size as suggested by our analysis. Again, the step size is chosen so that Svrg gives the best performance on the training loss.

Initialization & mini-batching. Initialization is critical to training of neural networks. We use the normalized initialization in (Glorot & Bengio, 2010) where parameters are chosen uniformly from [6/(ni+no)[-\sqrt{6/(n_{i}+n_{o})}, 6/(ni+no)]\sqrt{6/(n_{i}+n_{o})}] where nin_{i} and non_{o} are the number of input and output layers of the neural network, respectively.

For Svrg, we use nn iterations of Sgd for CIFAR-10 and MINST and 2n2n iterations of Sgd for STL-10 before running Algorithm 2. Such initialization is standard for variance reduced schemes even for convex problems (Johnson & Zhang, 2013; Schmidt et al., 2013). As noted earlier in Section 6, Svrg is more sensitive than Sgd to the initial point, so such an initialization is typically helpful. We use mini-batches of size 10 in our experiments. Sgd with mini-batches is common in training neural networks. Note that mini-batch training is especially beneficial for Svrg, as shown by our analysis in Section 5. Along the lines of theoretical analysis provided by Theorem 7, we use an epoch size m=n/10m=n/10 in our experiments.

Results. We report objective function (training loss), test error (classification error on the test set), and f(xt)2\|\nabla f(x^{t})\|^{2} (convergence criterion throughout our analysis) for the datasets. For all the algorithms, we compare these criteria against the number of effective passes through the data, i.e., IFO calls divided by nn. This includes the cost of calculating the full gradient at the end of each epoch of Svrg. Due to the Sgd initialization in Svrg and mini-batching, the Svrg plots start from x-axis value of 10 for CIFAR-10 and MNIST and 20 for STL-10. Figure 1 shows the results for our experiment. It can be seen that the f(xt)2\|\nabla f(x^{t})\|^{2} for Svrg is lower compared to Sgd, suggesting faster convergence to a stationary point. Furthermore, the training loss is also lower compared to Sgd in all the datasets. Notably, the test error for CIFAR-10 is lower for Svrg, indicating better generalization; we did not notice substantial difference in test error for MNIST and STL-10 (see Section H in the appendix). Overall, these results on a network with one hidden layer are promising; it will be interesting to study Svrg for deep neural networks in the future.

Discussion

In this paper, we examined a VR scheme for nonconvex optimization. We showed that by employing VR in stochastic methods, one can perform better than both Sgd and GradientDescent in the context of nonconvex optimization. When the function ff in (1) is gradient dominated, we proposed a variant of Svrg that has linear convergence to the global minimum. Our analysis shows that Svrg has a number of interesting properties that include convergence with fixed step size, descent property after every epoch; a property that need not hold for Sgd. We also showed that Svrg, in contrast to Sgd, enjoys efficient mini-batching, attaining speedups linear in the size of the mini-batches in parallel settings. Our analysis also reveals that the initial point and use of mini-batches are important to Svrg.

Before concluding the paper, we would like to discuss the implications of our work and few caveats. One should exercise some caution while interpreting the results in the paper. All our theoretical results are based on the stationarity gap. In general, this does not necessarily translate to optimality gap or low training loss and test error. One criticism against VR schemes in nonconvex optimization is the general wisdom that variance in the stochastic gradients of Sgd can actually help it escape local minimum and saddle points. In fact, Ge et al. (2015) add additional noise to the stochastic gradient in order to escape saddle points. However, one can reap the benefit of VR schemes even in such scenarios. For example, one can envision an algorithm which uses Sgd as an exploration tool to obtain a good initial point and then uses a VR algorithm as an exploitation tool to quickly converge to a good local minimum. In either case, we believe variance reduction can be used as an important tool alongside other tools like momentum, adaptive learning rates for faster and better nonconvex optimization.

References

Appendix

Appendix A Nonconvex SGD: Convergence Rate

Proof of Theorem 1

Suppose ff has σ\sigma-bounded gradient; let ηt=η=c/T\eta_{t}=\eta=c/\sqrt{T} where c=2(f(x0)f(x))Lσ2c=\sqrt{\tfrac{2(f(x^{0})-f(x^{*}))}{L\sigma^{2}}}, and xx^{*} is an optimal solution to (1). Then, the iterates of Algorithm 1 satisfy

We include the proof here for completeness. Please refer to (Ghadimi & Lan, 2013) for a more general result.

The iterates of Algorithm 1 satisfy the following bound:

Summing Equation (6) from t=0t=0 to T1T-1 and using that ηt\eta_{t} is constant η\eta we obtain

The first step holds because the minimum is less than the average. The second and third steps are obtained from Equation (6) and the fact that f(x)f(xT)f(x^{*})\leq f(x^{T}), respectively. The final inequality follows upon using η=c/T\eta=c/\sqrt{T}. By setting

in the above inequality, we get the desired result. ∎

Appendix B Nonconvex SVRG

In this section, we provide the proofs of the results for nonconvex Svrg. We first start with few useful lemmas and then proceed towards the main results.

For ct,ct+1,βt>0c_{t},c_{t+1},\beta_{t}>0, suppose we have

Let ηt\eta_{t}, βt\beta_{t} and ct+1c_{t+1} be chosen such that Γt>0\Gamma_{t}>0 (in Equation (3)). The iterate xts+1x^{s+1}_{t} in Algorithm 2 satisfy the bound:

Using the Svrg update in Algorithm 2 and its unbiasedness, the right hand side above is further upper bounded by

For bounding it we will require the following:

The second equality follows from the unbiasedness of the update of Svrg. The last inequality follows from a simple application of Cauchy-Schwarz and Young’s inequality. Plugging Equation (7) and Equation (8) into Rt+1s+1R^{s+1}_{t+1}, we obtain the following bound:

The second inequality follows from the definition of ctc_{t} and Rts+1R_{t}^{s+1}, thus concluding the proof. ∎

Proof of Theorem 2

Let fFnf\in\mathcal{F}_{n}. Let cm=0c_{m}=0, ηt=η>0\eta_{t}=\eta>0, βt=β>0\beta_{t}=\beta>0, and ct=ct+1(1+ηβ+2η2L2)+η2L3c_{t}=c_{t+1}(1+\eta\beta+2\eta^{2}L^{2})+\eta^{2}L^{3} such that Γt>0\Gamma_{t}>0 for 0    t    m10{\;\leq\;}t{\;\leq\;}m-1. Define the quantity γn:=mintΓt\gamma_{n}:=\min_{t}\Gamma_{t}. Further, let pi=0p_{i}=0 for 0    i  <  m0{\;\leq\;}i{\;<\;}m and pm=1p_{m}=1, and let TT be a multiple of mm. Then for the output xax_{a} of Algorithm 2 we have

where xx^{*} is an optimal solution to (1).

Since ηt=η\eta_{t}=\eta for t{0,,m1}t\in\{0,\dots,m-1\}, using Lemma 1 and telescoping the sum, we obtain

Proof of Theorem 3

Suppose fFnf\in\mathcal{F}_{n}. Let η=μ0/(Lnα)\eta=\mu_{0}/(Ln^{\alpha}) (0<μ0<10<\mu_{0}<1 and 0<α10<\alpha\leq 1), β=L/nα/2\beta=L/n^{\alpha/2}, m=n3α/2/(3μ0)m=\lfloor n^{3\alpha/2}/(3\mu_{0})\rfloor and TT is some multiple of mm. Then there exists universal constants μ0,ν>0\mu_{0},\nu>0 such that we have the following: γnνLnα\gamma_{n}\geq\frac{\nu}{Ln^{\alpha}} in Theorem 2 and

where xx^{*} is an optimal solution to the problem in (1) and xax_{a} is the output of Algorithm 2.

For our analysis, we will require an upper bound on c0c_{0}. We observe that c0=μ02Ln2α(1+θ)m1θc_{0}=\tfrac{\mu_{0}^{2}L}{n^{2\alpha}}\tfrac{(1+\theta)^{m}-1}{\theta} where θ=2η2L2+ηβ\theta=2\eta^{2}L^{2}+\eta\beta. This is obtained using the relation ct=ct+1(1+ηβ+2η2L2)+η2L3c_{t}=c_{t+1}(1+\eta\beta+2\eta^{2}L^{2})+\eta^{2}L^{3} and the fact that cm=0c_{m}=0. Using the specified values of β\beta and η\eta we have

The above inequality follows since μ01\mu_{0}\leq 1 and n1n\geq 1. Using the above bound on θ\theta, we get

wherein the second inequality follows upon noting that (1+1l)l(1+\frac{1}{l})^{l} is increasing for l>0l>0 and liml(1+1l)l=e\lim_{l\rightarrow\infty}(1+\frac{1}{l})^{l}=e (here ee is the Euler’s number). Now we can lower bound γn\gamma_{n}, as

where ν\nu is a constant independent of nn. The first inequality holds since ctc_{t} decreases with tt. The second inequality holds since (a) c0/βc_{0}/\beta is upper bounded by a constant independent of nn as c0/βμ0(e1)c_{0}/\beta\leq\mu_{0}(e-1) (follows from Equation (Proof.)), (b) η2Lμ0η\eta^{2}L\leq\mu_{0}\eta and (c) 2c0η22μ02(e1)η2c_{0}\eta^{2}\leq 2\mu_{0}^{2}(e-1)\eta (follows from Equation (Proof.)). By choosing μ0\mu_{0} (independent of nn) appropriately, one can ensure that γnν/(Lnα)\gamma_{n}\geq\nu/(Ln^{\alpha}) for some universal constant ν\nu. For example, choosing μ0=1/4\mu_{0}=1/4, we have γnν/(Lnα)\gamma_{n}\geq\nu/(Ln^{\alpha}) with ν=1/40\nu=1/40. Substituting the above lower bound in Equation (11), we obtain the desired result. ∎

Proof of Corollary 2

Suppose fFnf\in\mathcal{F}_{n}. Then the IFO complexity of Algorithm 2 (with parameters from Theorem 3) for achieving an ϵ\epsilon-accurate solution is:

This result follows from Theorem 3 and the fact that m=n3α/2/(3μ0)m=\lfloor n^{3\alpha/2}/(3\mu_{0})\rfloor. Suppose α<2/3\alpha<2/3, then m=o(n)m=o(n). However, nn IFO calls are invested in calculating the average gradient at the end of each epoch. In other words, computation of average gradient requires nn IFO calls for every mm iterations of the algorithm. Using this relationship, we get O\bigl{(}n+(n^{1-\tfrac{\alpha}{2}}/\epsilon)\bigr{)} in this case.

On the other hand, when α2/3\alpha\geq 2/3, the total number of IFO calls made by Algorithm 2 in each epoch is Ω(n)\Omega(n) since m=n3α/2/(3μ0)m=\lfloor n^{3\alpha/2}/(3\mu_{0})\rfloor. Hence, the oracle calls required for calculating the average gradient (per epoch) is of lower order, leading to O\bigl{(}n+(n^{\alpha}/\epsilon)\bigr{)} IFO calls. ∎

Appendix C GD-SVRG

Proof of Theorem 4

Suppose ff is τ\tau-gradient dominated where τ>n1/3\tau>n^{1/3}. Then, the iterates of Algorithm 3 with T=2Lτn2/3/ν1T=\lceil 2L\tau n^{2/3}/\nu_{1}\rceil, m=n/(3μ1)m=\lfloor n/(3\mu_{1})\rfloor, ηt=μ1/(Ln2/3)\eta_{t}=\mu_{1}/(Ln^{2/3}) for all 0tm10\leq t\leq m-1 and pm=1p_{m}=1 and pi=0p_{i}=0 for all 0i<m0\leq i<m satisfy

Here μ1\mu_{1} and ν1\nu_{1} are the constants used in Corollary 3.

Corollary 3 shows that the iterates of Algorithm 3 satisfy

Substituting the specified value of TT in the above inequality, we have

The second inequality follows from τ\tau-gradient dominance of the function ff. ∎

Proof of Theorem 5

If ff is τ\tau-gradient dominated (τ>n1/3\tau>n^{1/3}), then with T=2Lτn2/3/ν1T=\lceil 2L\tau n^{2/3}/\nu_{1}\rceil, m=n/(3μ1)m=\lfloor n/(3\mu_{1})\rfloor, ηt=μ1/(Ln2/3)\eta_{t}=\mu_{1}/(Ln^{2/3}) for 0tm10\leq t\leq m-1 and pm=1p_{m}=1 and pi=0p_{i}=0 for all 0i<m0\leq i<m, the iterates of Algorithm 3 satisfy

Here μ1\mu_{1}, ν1\nu_{1} are as in Corollary 3; xx^{*} is an optimal solution.

The proof mimics that of Theorem 4; now we have the following condition on the iterates of Algorithm 3:

Appendix D Convex SVRG: Convergence Rate

Proof of Theorem 6

If fif_{i} is convex for all i[n]i\in[n], pi=1/mp_{i}=1/m for 0    i    m10{\;\leq\;}i{\;\leq\;}m-1, and pm=0p_{m}=0, then for Algorithm 2, we have

where xx^{*} is optimal for (1) and xax_{a} is the output of Algorithm 2.

Consider the following sequence of inequalities:

The second inequality uses unbiasedness of the Svrg update and convexity of ff. The third inequality follows from Lemma 6. Defining the Lyapunov function

and summing the above inequality over tt, we get

The above equality uses the fact that pm=0p_{m}=0 and pi=1/mp_{i}=1/m for 0i<m0\leq i<m. Summing over all epochs and telescoping we then obtain

The inequality also uses the definition of xax_{a} given in Alg 2. On this inequality we use Lemma 5, which yields

It is easy to see that we can obtain convergence rates for E[f(xa)f(x)]E[f(x_{a})-f(x^{*})] from the above reasoning. This leads to a direct analysis of Svrg for convex functions.

Appendix E Minibatch Nonconvex SVRG

Proof of Theorem 7

The proofs essentially follow along the lines of Lemma 1, Theorem 2 and Theorem 3 with the added complexity of mini-batch. We first prove few intermediate results before proceeding to the proof of Theorem 7.

for  0sS1\ 0\leq s\leq S-1 and 0tm10\leq t\leq m-1 and the parameters ηt,βt\eta_{t},\beta_{t} and ct+1\overline{c}_{t+1} are chosen such that

Then the iterates xts+1x^{s+1}_{t} in the mini-batch version of Algorithm 2 i.e., Algorithm 4 with mini-batch size bb satisfy the bound:

Using essentially the same argument as the proof of Lemma. 1 until Equation (9), we have

The second inequality follows from the definition of ct\overline{c}_{t} and Rts+1\overline{R}^{s+1}_{t}, thus concluding the proof. ∎

Our intermediate key result is the following theorem that provides convergence rate of mini-batch Svrg.

Let γn\overline{\gamma}_{n} denote the following quantity:

Suppose ηt=η\eta_{t}=\eta and βt=β\beta_{t}=\beta for all t{0,,m1}t\in\{0,\dots,m-1\}, cm=0\overline{c}_{m}=0, ct=ct+1(1+ηtβt+2ηt2L2b)+ηt2L3b\overline{c}_{t}=\overline{c}_{t+1}(1+\eta_{t}\beta_{t}+\frac{2\eta_{t}^{2}L^{2}}{b})+\frac{\eta_{t}^{2}L^{3}}{b} for t{0,,m1}t\in\{0,\dots,m-1\} and γn>0\overline{\gamma}_{n}>0. Further, let pm=1p_{m}=1 and pi=0p_{i}=0 for 0i<m0\leq i<m. Then for the output xax_{a} of mini-batch version of Algorithm 2 with mini-batch size bb, we have

where xx^{*} is an optimal solution to (1).

Since ηt=η\eta_{t}=\eta for t{0,,m1}t\in\{0,\dots,m-1\}, using Lemma 2 and telescoping the sum, we obtain

We now present the proof of Theorem 7 using the above results.

Let γn\overline{\gamma}_{n} denote the following quantity:

where cm=0\overline{c}_{m}=0, ct=ct+1(1+ηβ+\nicefrac2η2L2b)+\nicefracηt2L3b\overline{c}_{t}=\overline{c}_{t+1}(1+\eta\beta+\nicefrac{{2\eta^{2}L^{2}}}{{b}})+\nicefrac{{\eta_{t}^{2}L^{3}}}{{b}} for 0    t    m10{\;\leq\;}t{\;\leq\;}m-1. Suppose η=μ2b/(Ln2/3)\eta=\mu_{2}b/(Ln^{2/3}) (0<μ2<10<\mu_{2}<1), β=L/n1/3\beta=L/n^{1/3}, m=n/(3bμ2)m=\lfloor n/(3b\mu_{2})\rfloor and TT is some multiple of mm. Then for the mini-batch version of Algorithm 2 with mini-batch size b<n2/3b<n^{2/3}, there exists universal constants μ2,ν2>0\mu_{2},\nu_{2}>0 such that we have the following: γnν2bLn2/3\overline{\gamma}_{n}\geq\frac{\nu_{2}b}{Ln^{2/3}} and

We first observe that using the specified values of β\beta and η\eta we obtain

The above inequality follows since μ21\mu_{2}\leq 1 and n1n\geq 1. For our analysis, we will require the following bound on c0\overline{c}_{0}:

wherein the first equality holds due to the relation ct=ct+1(1+ηtβt+2ηt2L2b)+ηt2L3b\overline{c}_{t}=\overline{c}_{t+1}(1+\eta_{t}\beta_{t}+\tfrac{2\eta_{t}^{2}L^{2}}{b})+\tfrac{\eta_{t}^{2}L^{3}}{b}, and the inequality follows upon again noting that (1+1/l)l(1+1/l)^{l} is increasing for l>0l>0 and liml(1+1l)l=e\lim_{l\rightarrow\infty}(1+\frac{1}{l})^{l}=e. Now we can lower bound γn\overline{\gamma}_{n}, as

where ν2\nu_{2} is a constant independent of nn. The first inequality holds since ct\overline{c}_{t} decreases with tt. The second one holds since (a) c0/β\overline{c}_{0}/\beta is upper bounded by a constant independent of nn as c0/βμ2(e1)\overline{c}_{0}/\beta\leq\mu_{2}(e-1) (due to Equation (Proof of Theorem 7.)), (b) η2Lμ2η\eta^{2}L\leq\mu_{2}\eta (as b<n2/3b<n^{2/3}) and (c) 2c0η22μ22(e1)η2\overline{c}_{0}\eta^{2}\leq 2\mu_{2}^{2}(e-1)\eta (again due to Equation (Proof of Theorem 7.) and the fact b<n2/3b<n^{2/3}). By choosing an appropriately small constant μ2\mu_{2} (independent of n), one can ensure that γnbν2/(Ln2/3)\overline{\gamma}_{n}\geq{b\nu_{2}}/(Ln^{2/3}) for some universal constant ν2\nu_{2}. For example, choosing μ2=1/4\mu_{2}=1/4, we have γnbν2/(Ln2/3)\overline{\gamma}_{n}\geq{b\nu_{2}}/(Ln^{2/3}) with ν2=1/40\nu_{2}=1/40. Substituting the above lower bound in Theorem 9, we get the desired result. ∎

Appendix F MSVRG: Convergence Rate

Proof of Theorem 8

Let fFnf\in\mathcal{F}_{n} have σ\sigma-bounded gradients. Let ηt=η=max{\nicefraccT,\nicefracμ1(Ln2/3)}\eta_{t}=\eta=\max\{\nicefrac{{c}}{{\sqrt{T}}},\nicefrac{{\mu_{1}}}{{(Ln^{2/3})}}\} (μ1\mu_{1} is the universal constant from Corollary 3), m=\nicefracn(3μ1)m=\lfloor\nicefrac{{n}}{{(3\mu_{1})}}\rfloor, and c=f(x0)f(x)2Lσ2c=\sqrt{\frac{f(x^{0})-f(x^{*})}{2L\sigma^{2}}}. Further, let TT be a multiple of mm, pm=1p_{m}=1, and pi=0p_{i}=0 for 0    i  <  m0{\;\leq\;}i{\;<\;}m. Then, the output xax_{a} of Algorithm 2 satisfies

where νˉ\bar{\nu} is a universal constant, ν1\nu_{1} is the universal constant from Corollary 3 and xx^{*} is an optimal solution to (1).

First, we observe that the step size η\eta is chosen to be max{c/T,μ1/(Ln2/3)}\max\{c/\sqrt{T},\mu_{1}/(Ln^{2/3})\} where

Suppose η=μ1/(Ln2/3)\eta=\mu_{1}/(Ln^{2/3}), we obtain the convergence rate in Corollary 3. Now, lets consider the case where η=c/T\eta=c/\sqrt{T}. In this case, we have the following bound:

The only thing that remains to be proved is that with the step size choice of max{c/T,μ1/(Ln2/3)}\max\{c/\sqrt{T},\mu_{1}/(Ln^{2/3})\}, the minimum of two bounds hold. Consider the case c/T>μ1/(Ln2/3)c/\sqrt{T}>\mu_{1}/(Ln^{2/3}). In this case, we have the following:

where ν1\nu_{1} is the constant in Corollary 3. This inequality holds since c/T>μ1/(Ln2/3)c/\sqrt{T}>\mu_{1}/(Ln^{2/3}). Rearranging the above inequality, we have

in this case. Note that the left hand side of the above inequality is precisely the bound obtained by using step size c/Tc/\sqrt{T} (see Equation (16)). Similarly, when c/Tμ1/(Ln2/3)c/\sqrt{T}\leq\mu_{1}/(Ln^{2/3}), the inequality holds in the other direction. Using these two observations, we have the desired result. ∎

Appendix G Key Lemmatta

For the intermediate iterates vts+1v^{s+1}_{t} computed by Algorithm 2, we have the following:

The proof simply follows from the proof of Lemma 4 with It={it}I_{t}=\{i_{t}\}. ∎

We now present a result to bound the variance of mini-batch Svrg.

Let uts+1u^{s+1}_{t} be computed by the mini-batch version of Algorithm 2 i.e., Algorithm 4 with mini-batch size bb. Then,

For the ease of exposition, we use the following notation:

We use the definition of uts+1u^{s+1}_{t} to get

Appendix H Experiments

Figure 2 shows the remaining plots for MNIST and STL-10 datasets. As seen in the plots, there is no significant difference in the test error of Svrg and Sgd for these datasets.

Appendix I Other Lemmas

We need Lemma 5 for our results in the convex case.

Rewriting in terms of gg we obtain the required result. ∎

Lemma 6 bounds the variance of Svrg for the convex case. Please refer to (Johnson & Zhang, 2013) for more details.

Suppose fif_{i} is convex for all i[n]i\in[n]. For the updates in Algorithm 2 we have the following inequality:

The proof follows upon observing the following:

For random variables z1,,zrz_{1},\dots,z_{r}, we have