A Proximal Stochastic Gradient Method with Progressive Variance Reduction

Lin Xiao, Tong Zhang

Introduction

We consider the problem of minimizing the sum of two convex functions:

where F(x)F(x) is the average of many smooth component functions fi(x)f_{i}(x), i.e.,

and R(x)R(x) is relative simple but can be non-differentiable. We are especially interested in the case where the number of components nn is very large, and it can be advantageous to use incremental methods (such as stochastic gradient method) that operate on a single component fif_{i} at each iteration, rather than on the entire cost function.

The results presented in this paper are based on the following assumptions.

Moreover, we have L(1/n)i=1nLiL\leq(1/n)\sum_{i=1}^{n}L_{i}.

The convexity parameter of a function is the largest μ\mu such that the above condition holds. The strong convexity of P(x)P(x) may come from either F(x)F(x) or R(x)R(x) or both. More precisely, let F(x)F(x) and R(x)R(x) have convexity parameters μF\mu_{F} and μR\mu_{R} respectively, then μμF+μR\mu\geq\mu_{F}+\mu_{R}. We note that it is possible to have μ>L\mu>L although we must have μFL\mu_{F}\leq L.

where ηk\eta_{k} is the step size at the kk-th iteration. Throughout this paper, we use \|\cdot\| to denote the usual Euclidean norm, i.e., 2\|\cdot\|_{2}, unless otherwise specified. With the definition of proximal mapping

the proximal gradient method can be written more compactly as

This method can be viewed as a special case of splitting algorithms [LM79, CR97, Tse00], and its accelerated variants have been proposed and analyzed in [BT09, Nes13].

When the number of components nn is very large, each iteration of (5) can be very expensive since it requires computing the gradients for all the nn component functions fif_{i}, and also their average. For this reason, we refer to (5) as the proximal full gradient (Prox-FG) method. An effective alternative is the proximal stochastic gradient (Prox-SG) method: at each iteration k=1,2,k=1,2,\ldots, we draw iki_{k} randomly from {1,,n}\{1,\ldots,n\} and take the update

(See Appendix B for a proof of this result.) The most interesting case for large-scale applications is when μL\mu\ll L, and the ratio L/μL/\mu is often called the condition number of the problem (1). In this case, the Prox-FG method needs O((L/μ)log(1/ϵ))O\left((L/\mu)\log(1/\epsilon)\right) iterations to ensure P(xk)P(x)ϵP(x_{k})-P(x_{\star})\leq\epsilon. Thus the overall complexity of Prox-FG, in terms of the total number of component gradients evaluated to find an ϵ\epsilon-accurate solution, is O(n(L/μ)log(1/ϵ))O\left(n(L/\mu)\log(1/\epsilon)\right). The accelerated Prox-FG methods in [BT09, Nes13] reduce the complexity to O\bigl{(}n\sqrt{L/\mu}\log(1/\epsilon)\bigr{)}.

On the other hand, with a diminishing step size ηk=1/(μk)\eta_{k}=1/(\mu k), the Prox-SG method converges at a sublinear rate ([DS09, LLZ09]):

Consequently, the total number of component gradient evaluations required by the Prox-SG method to find an ϵ\epsilon-accurate solution (in expectation) is O(1/μϵ)O(1/\mu\epsilon). This complexity scales poorly in ϵ\epsilon compared with log(1/ϵ)\log(1/\epsilon), but it is independent of nn. Therefore, when nn is very large, the Prox-SG method can be more efficient, especially to obtain low-precision solutions.

There is also a vast literature on incremental gradient methods for minimizing the sum of a large number of component functions. The Prox-SG method can be viewed as a variant of the randomized incremental proximal algorithms proposed in [Ber11]. Asymptotic convergence of such methods typically requires diminishing step sizes and only have sublinear convergence rates. A comprehensive survey on this topic can be found in [Ber10].

2 Recent progresses and our contributions

Both the Prox-FG and Prox-SG methods do not fully exploit the problem structure defined by (1) and (2). In particular, Prox-FG ignores the fact that the smooth part F(x)F(x) is the average of nn component functions. On the other hand, Prox-SG can be applied for more general stochastic optimization problems, and it does not exploit the fact that the objective function in (1) is actually a deterministic function. Such inefficiencies in exploiting problem structure leave much room for further improvements.

Several recent work considered various special cases of (1) and (2), and developed algorithms that enjoy the complexity (total number of component gradient evaluations)

More recently, Johnson and Zhang [JZ13] developed another algorithm for the case R(x)0R(x)\equiv 0, called stochastic variance-reduced gradient (SVRG). The SVRG method employs a multi-stage scheme to progressively reduce the variance of the stochastic gradient, and achieves the same low complexity in (9). Moreover, it avoids storage of past gradients for the component functions, and its convergence analysis is considerably simpler than that of SAG. A very similar algorithm was proposed by Zhang et al. [ZMJ13], but with a worse convergence rate analysis. Another recent effort to extend the SVRG method is [KR13].

In this paper, we extend the variance reduction technique of SVRG to develop a proximal SVRG (Prox-SVRG) method for solving the more general class of problems defined in (1) and (2). We show that with uniform sampling of the component functions, the Prox-SVRG method achieves the same complexity in (9). Moreover, our method incorporates a weighted sampling strategy. When the sampling probabilities for fif_{i} are proportional to their Lipschitz constants LiL_{i}, the Prox-SVRG method has complexity

The Prox-SVRG method

Recall that in the Prox-SG method (6), with uniform sampling of iki_{k}, we have unbiased estimate of the full gradient at each iteration. In order to ensure asymptotic convergence, the step size ηk\eta_{k} has to decay to zero to mitigate the effect of variance introduced by random sampling, which leads to slow convergence. However, if we can gradually reduce the variance in estimating the full gradient, then it is possible to use much larger (even constant) step sizes and obtain much faster convergence rate. Several recent work (e.g., [FS12, BCNW12, FG13]) have explored this idea by using mini-batches with exponentially growing sizes, but their overall computational cost is still on the same order as full gradient methods.

then we replace fik(xk1)\nabla f_{i_{k}}(x_{k-1}) in the Prox-SG method (6) with vkv_{k}, i.e.,

Conditioned on xk1x_{k-1}, we can take expectation with respect to iki_{k} and obtain

Figure 1 gives the full description of the Prox-SVRG method with a constant step size η\eta. It allows random sampling from a general distribution {q1,,qn}\{q_{1},\ldots,q_{n}\}, thus is more flexible than the uniform sampling scheme described above. It is not hard to verify that the modified stochastic gradient,

Convergence analysis

Then the Prox-SVRG method in Figure 1 has geometric convergence in expectation:

We have the following remarks regarding the above result:

The ratio LQ/μL_{Q}/\mu can be viewed as a “weighted” condition number of P(x)P(x). Theorem 1 implies that setting mm to be on the same order as LQ/μL_{Q}/\mu is sufficient to have geometric convergence. To see this, let η=θ/LQ\eta=\theta/L_{Q} with 0<θ<1/40<\theta<1/4. When m1m\gg 1, we have

As a result, choosing θ=0.1\theta=0.1 and m=100(LQ/μ)m=100(L_{Q}/\mu) results in ρ5/6\rho\approx 5/6.

Since each stage requires n+2mn+2m component gradient evaluations, and it is sufficient to set m=Θ(LQ/μ)m=\Theta(L_{Q}/\mu), the overall complexity is

For uniform sampling, qi=1/nq_{i}=1/n for all i=1,,ni=1,\ldots,n, so we have LQ=maxiLiL_{Q}=\max_{i}L_{i} and the above complexity bound becomes (9).

The smallest possible value for LQL_{Q} is LQ=(1/n)i=1nLiL_{Q}=(1/n)\sum_{i=1}^{n}L_{i}, achieved at qi=Li/j=1nLjq_{i}=L_{i}/\sum_{j=1}^{n}L_{j}, i.e., when the sampling probabilities for the component functions are proportional to their Lipschitz constants. In this case, the above complexity bound becomes (10).

Thus we have the following high-probability bound.

Suppose the assumptions in Theorem 1 hold. Then for any ϵ>0\epsilon>0 and δ(0,1)\delta\in(0,1), we have

provided that the number of stages ss satisfies

If P(x)P(x) is convex but not strongly convex, then for any ϵ>0\epsilon>0, we can define

It follows that Pϵ(x)P_{\epsilon}(x) is ϵ\epsilon-strongly convex. We can apply the Prox-SVRG method in Figure 1 to Pϵ(x)P_{\epsilon}(x), which replaces the update formula for xkx_{k} by the following update rule:

Suppose Assumption 1 holds and let LQ=maxiLi/(qin)L_{Q}=\max_{i}L_{i}/(q_{i}n). In addition, assume that 0<η<1/(4LQ)0<\eta<1/(4L_{Q}) and mm is sufficiently large so that

Then the Prox-SVRG method in Figure 1, applied to Pϵ(x)P_{\epsilon}(x), achieves

This result means that if we take m=O(LQ/ϵ)m=O(L_{Q}/\epsilon) and slog(1/ϵ)/log(1/ρ)s\geq\log(1/\epsilon)/\log(1/\rho), then

The overall complexity (in terms of the number of component gradient evaluations) is

Similar results for the case of R(x)0R(x)\equiv 0 have been obtained in [RSB12, MZJ13, KR13]. We can also derive a high-probability bound based on Corollary 1, but omit the details here.

Our bound on the variance of the modified stochastic gradient vkv_{k} is a corollary of the following lemma.

Given any i{1,,n}i\in\{1,\ldots,n\}, consider the function

It is straightforward to check that ϕi(x)=0\nabla\phi_{i}(x_{\star})=0, hence minxϕi(x)=ϕi(x)=0\min_{x}\phi_{i}(x)=\phi_{i}(x_{\star})=0. Since ϕi(x)\nabla\phi_{i}(x) is Lipschitz continuous with constant LiL_{i}, we have (see, e.g., [Nes04, Theorem 2.1.5])

By dividing the above inequality by 1/(n2qi)1/(n^{2}q_{i}), and summing over i=1,,ni=1,\ldots,n, we obtain

there exist ξR(x)\xi_{\star}\in\partial R(x_{\star}) such that F(x)+ξ=0\nabla F(x_{\star})+\xi_{\star}=0. Therefore

where in the last inequality, we used convexity of R(x)R(x). This proves the desired result. ∎

Conditioned on xk1x_{k-1}, we take expectation with respect to iki_{k} to obtain

2 Proof of Theorem 1

For convenience, we define the stochastic gradient mapping

so that the proximal gradient step (11) can be written as

We need the following lemmas in the convergence analysis. The first one is on the non-expansiveness of proximal mapping, which is well known (see, e.g., [Roc70, Section 31]).

The next lemma provides a lower bound of the function P(x)P(x) using stochastic gradient mapping. It is a slight generalization of [HKP09, Lemma 3], and we give the proof in Appendix A for completeness.

Now we proceed to prove Theorem 1. We start by analyzing how the distance between xkx_{k} and xx_{\star} changes in each iteration. Using the update rule (15), we have

Applying Lemma 3 with x=xk1x=x_{k-1}, v=vkv=v_{k}, x+=xkx^{+}=x_{k}, g=gkg=g_{k} and y=xy=x_{\star}, we have

where Δk=vkF(xk1)\Delta_{k}=v_{k}-\nabla F(x_{k-1}). Note that the assumption in Theorem 1 implies η<1/(4LQ)<1/L\eta<1/(4L_{Q})<1/L because LQ(1/n)i=1nLiLL_{Q}\geq(1/n)\sum_{i=1}^{n}L_{i}\geq L. Therefore,

Next we upper bound the quantity 2ηΔkT(xkx)-2\eta\Delta_{k}^{T}(x_{k}-x_{\star}). Although not used in the Prox-SVRG algorithm, we can still define the proximal full gradient update as

which is independent of the random variable iki_{k}. Then,

where in the first inequality we used the Cauchy-Schwarz inequality, and in the second inequality we used Lemma 2. Combining with (16), we get

Now we take expectation on both sides of the above inequality with respect to iki_{k} to obtain

Divide both sides of the above inequality by 2η(14LQη)m2\eta(1-4L_{Q}\eta)m, we arrive at

Finally using the definition of ρ\rho in (14), and applying the above inequality recursively, we obtain

Numerical experiments

In this section we present results of several numerical experiments to illustrate the properties of the Prox-SVRG method, and compare its performance with several related algorithms.

We used three publicly available data sets. Their sizes nn, dimensions dd as well as sources as listed in Table 1. For rcv1 and covertype, we used the processed data for binary classification from [FL11]. The table also listed the values of λ2\lambda_{2} and λ1\lambda_{1} that were used in our experiments. These choices are typical in machine learning benchmarks to obtain good classification performance.

We first illustrate the numerical characteristics of Prox-SVRG on the rcv1 dataset. Each example in this dataset has been normalized so that ai2=1\|a_{i}\|_{2}=1 for all i=1,,ni=1,\ldots,n, which leads to the same upper bound on the Lipschitz constants L=Li=ai22/4L=L_{i}=\|a_{i}\|_{2}^{2}/4. In our implementation, we used the splitting in (17) and uniform sampling of the component functions. We choose the number of stochastic gradient steps mm between full gradient evaluations as a small multiple of nn.

Figure 2 shows the behavior of Prox-SVRG with m=2nm=2n when we used three different step sizes. The horizontal axis is the number of effective passes over the data, where each effective pass evaluates nn component gradients. Each full gradient evaluation counts as one effective pass, and appears as a small flat segment of length 1 on the curves. It can be seen that the convergence of Prox-SVRG becomes slow if the step size is either too big or too small. The best choice of η=0.1/L\eta=0.1/L matches our theoretical analysis (see the first remark after Theorem 1). The number of non-zeros (NNZs) in the iterates xkx_{k} converges quickly to 72377237 after about 1010 passes over the data.

Figure 3 shows how the objective gap P(xk)PP(x_{k})-P_{\star} decreases when we vary the period mm of evaluating full gradients. For λ2=104\lambda_{2}=10^{-4}, the fastest convergence per stage is achieved by m=1m=1, but the frequent evaluation of full gradients makes its overall performance slightly worse than m=2m=2. Longer periods leads to slower convergence, due to the lack of effective variance reduction. For λ2=105\lambda_{2}=10^{-5}, the condition number is much larger, thus longer period mm is required to have sufficient reduction during each stage.

2 Comparison with related algorithms

We implemented the following algorithms to compare with Prox-SVRG:

Prox-SG: the proximal stochastic gradient method given in (6). We used a constant step size that gave the best performance among all powers of 1010.

RDA: the regularized dual averaging method in [Xia10]. The step size parameter γ\gamma in RDA is also chosen as the one that gave best performance among all powers of 1010.

Prox-FG: the proximal full gradient method given in (5), with an adaptive line search scheme proposed in [Nes13].

Prox-AFG: an accelerated version of the Prox-FG method that is very similar to FISTA [BT09], also with an adaptive line search scheme.

Prox-SAG: a proximal version of the stochastic average gradient (SAG) method [SRB13, Section 6]. We note that the convergence of this Prox-SAG method has not been established for the general model considered in this paper. Nevertheless it demonstrates good performance in practice.

Prox-SDCA: the proximal stochastic dual coordinate ascent method [SSZ12]. In order to obtain the complexity O((n+L/μ)log(1/ϵ))O\left((n+L/\mu)\log(1/\epsilon)\right), it needs to use the splitting (18).

Figure 5 shows the comparison of different methods on two other data sets listed in Table 1. Here we also included comparison with Prox-SVRG2, which is a hybrid method by performing Prox-SG for one pass over the data and then switch to Prox-SVRG. This hybrid scheme was suggested in [JZ13], and it often improves the performance of Prox-SVRG substantially. Similar hybrid schemes also exist for SDCA [SSZ12] and SAG [SRB13].

The behaviors of the stochastic gradient type of algorithms on covertype (Figure 5, left) are similar to those on rcv1, but the two full gradient methods Prox-FG and Prox-AFG perform worse because of the smaller regularization parameter λ2\lambda_{2} and hence worse condition number. The sido0 data set turns out to be more difficult to optimize, and much slower convergence are observed in Figure 5 (right). The Prox-SAG method performs best on this data set, followed by Prox-SVRG2 and Prox-SVRG.

Conclusions

We developed a new proximal stochastic gradient method, called Prox-SVRG, for minimizing the sum of two convex functions: one is the average of a large number of smooth component functions, and the other is a general convex function that admits a simple proximal mapping. This method exploits the finite average structure of the smooth part by extending the variance reduction technique of SVRG [JZ13], which computes the full gradient periodically to modify the stochastic gradients in order to reduce their variance.

The Prox-SVRG method enjoys the same low complexity as that of SDCA [SSZ13, SSZ12] and SAG [RSB12, SRB13], but applies to a more general class of problems, and does not require the storage of the most recent gradient for each component function. In addition, our method incorporates a weighted sampling scheme, which achieves an improved complexity result for problems where the component functions vary substantially in smoothness.

Appendix A Proof of Lemma 3

The associated optimality condition states that there is a ξR(x+)\xi\in\partial R(x^{+}) such that

Combining with the definition of g=(xx+)/ηg=(x-x^{+})/\eta, we have ξ=gv\xi=g-v.

By smoothness of FF, we can further lower bound F(x)F(x) by

where in the last equality we used P(x+)=F(x+)+R(x+)P(x^{+})=F(x^{+})+R(x^{+}) and x+x=ηgx^{+}-x=-\eta g. Collecting all inner products on the right-hand side, we have

where in the first equality we used ξ=gv\xi=g-v, in the third equality we used Δ=vF(x)\Delta=v-\nabla F(x), and in the last equality we used xx+=ηgx-x^{+}=\eta g. Putting everything together, we obtain

Finally using the assumption 0<η1/L0<\eta\leq 1/L, we arrive at the desired result.

Appendix B Convergence analysis of the Prox-FG method

Here we prove the convergence rate in (7) for the Prox-FG method (5). First we define the full gradient mapping Gk=(xkxk1)/ηG_{k}=(x_{k}-x_{k-1})/\eta and use it to obtain

Applying Lemma 3 with x=xk1x=x_{k-1}, v=F(xk1)v=\nabla F(x_{k-1}), x+=xkx^{+}=x_{k}, g=Gkg=G_{k} and y=xy=x_{\star}, we have Δ=0\Delta=0 and

Rearranging terms in the above inequality yields

Dropping the nonnegative term 2\eta\bigl{(}F(x_{k})-F(x_{\star})\bigr{)} on the left-hand side results in

Dropping the nonnegative term (1+ημR)xkx2(1+\eta\mu_{R})\|x_{k}-x_{\star}\|^{2} on the left-hand side of (19) yields

Setting η=1/L\eta=1/L, the above inequality is equivalent to (7).

References