A Proximal Stochastic Gradient Method with Progressive Variance Reduction
Lin Xiao, Tong Zhang
Introduction
We consider the problem of minimizing the sum of two convex functions:
where is the average of many smooth component functions , i.e.,
and is relative simple but can be non-differentiable. We are especially interested in the case where the number of components is very large, and it can be advantageous to use incremental methods (such as stochastic gradient method) that operate on a single component at each iteration, rather than on the entire cost function.
The results presented in this paper are based on the following assumptions.
Moreover, we have .
The convexity parameter of a function is the largest such that the above condition holds. The strong convexity of may come from either or or both. More precisely, let and have convexity parameters and respectively, then . We note that it is possible to have although we must have .
where is the step size at the -th iteration. Throughout this paper, we use to denote the usual Euclidean norm, i.e., , unless otherwise specified. With the definition of proximal mapping
the proximal gradient method can be written more compactly as
This method can be viewed as a special case of splitting algorithms [LM79, CR97, Tse00], and its accelerated variants have been proposed and analyzed in [BT09, Nes13].
When the number of components is very large, each iteration of (5) can be very expensive since it requires computing the gradients for all the component functions , and also their average. For this reason, we refer to (5) as the proximal full gradient (Prox-FG) method. An effective alternative is the proximal stochastic gradient (Prox-SG) method: at each iteration , we draw randomly from and take the update
(See Appendix B for a proof of this result.) The most interesting case for large-scale applications is when , and the ratio is often called the condition number of the problem (1). In this case, the Prox-FG method needs iterations to ensure . Thus the overall complexity of Prox-FG, in terms of the total number of component gradients evaluated to find an -accurate solution, is . The accelerated Prox-FG methods in [BT09, Nes13] reduce the complexity to O\bigl{(}n\sqrt{L/\mu}\log(1/\epsilon)\bigr{)}.
On the other hand, with a diminishing step size , the Prox-SG method converges at a sublinear rate ([DS09, LLZ09]):
Consequently, the total number of component gradient evaluations required by the Prox-SG method to find an -accurate solution (in expectation) is . This complexity scales poorly in compared with , but it is independent of . Therefore, when is very large, the Prox-SG method can be more efficient, especially to obtain low-precision solutions.
There is also a vast literature on incremental gradient methods for minimizing the sum of a large number of component functions. The Prox-SG method can be viewed as a variant of the randomized incremental proximal algorithms proposed in [Ber11]. Asymptotic convergence of such methods typically requires diminishing step sizes and only have sublinear convergence rates. A comprehensive survey on this topic can be found in [Ber10].
2 Recent progresses and our contributions
Both the Prox-FG and Prox-SG methods do not fully exploit the problem structure defined by (1) and (2). In particular, Prox-FG ignores the fact that the smooth part is the average of component functions. On the other hand, Prox-SG can be applied for more general stochastic optimization problems, and it does not exploit the fact that the objective function in (1) is actually a deterministic function. Such inefficiencies in exploiting problem structure leave much room for further improvements.
Several recent work considered various special cases of (1) and (2), and developed algorithms that enjoy the complexity (total number of component gradient evaluations)
More recently, Johnson and Zhang [JZ13] developed another algorithm for the case , called stochastic variance-reduced gradient (SVRG). The SVRG method employs a multi-stage scheme to progressively reduce the variance of the stochastic gradient, and achieves the same low complexity in (9). Moreover, it avoids storage of past gradients for the component functions, and its convergence analysis is considerably simpler than that of SAG. A very similar algorithm was proposed by Zhang et al. [ZMJ13], but with a worse convergence rate analysis. Another recent effort to extend the SVRG method is [KR13].
In this paper, we extend the variance reduction technique of SVRG to develop a proximal SVRG (Prox-SVRG) method for solving the more general class of problems defined in (1) and (2). We show that with uniform sampling of the component functions, the Prox-SVRG method achieves the same complexity in (9). Moreover, our method incorporates a weighted sampling strategy. When the sampling probabilities for are proportional to their Lipschitz constants , the Prox-SVRG method has complexity
The Prox-SVRG method
Recall that in the Prox-SG method (6), with uniform sampling of , we have unbiased estimate of the full gradient at each iteration. In order to ensure asymptotic convergence, the step size has to decay to zero to mitigate the effect of variance introduced by random sampling, which leads to slow convergence. However, if we can gradually reduce the variance in estimating the full gradient, then it is possible to use much larger (even constant) step sizes and obtain much faster convergence rate. Several recent work (e.g., [FS12, BCNW12, FG13]) have explored this idea by using mini-batches with exponentially growing sizes, but their overall computational cost is still on the same order as full gradient methods.
then we replace in the Prox-SG method (6) with , i.e.,
Conditioned on , we can take expectation with respect to and obtain
Figure 1 gives the full description of the Prox-SVRG method with a constant step size . It allows random sampling from a general distribution , thus is more flexible than the uniform sampling scheme described above. It is not hard to verify that the modified stochastic gradient,
Convergence analysis
Then the Prox-SVRG method in Figure 1 has geometric convergence in expectation:
We have the following remarks regarding the above result:
The ratio can be viewed as a “weighted” condition number of . Theorem 1 implies that setting to be on the same order as is sufficient to have geometric convergence. To see this, let with . When , we have
As a result, choosing and results in .
Since each stage requires component gradient evaluations, and it is sufficient to set , the overall complexity is
For uniform sampling, for all , so we have and the above complexity bound becomes (9).
The smallest possible value for is , achieved at , i.e., when the sampling probabilities for the component functions are proportional to their Lipschitz constants. In this case, the above complexity bound becomes (10).
Thus we have the following high-probability bound.
Suppose the assumptions in Theorem 1 hold. Then for any and , we have
provided that the number of stages satisfies
If is convex but not strongly convex, then for any , we can define
It follows that is -strongly convex. We can apply the Prox-SVRG method in Figure 1 to , which replaces the update formula for by the following update rule:
Suppose Assumption 1 holds and let . In addition, assume that and is sufficiently large so that
Then the Prox-SVRG method in Figure 1, applied to , achieves
This result means that if we take and , then
The overall complexity (in terms of the number of component gradient evaluations) is
Similar results for the case of have been obtained in [RSB12, MZJ13, KR13]. We can also derive a high-probability bound based on Corollary 1, but omit the details here.
Our bound on the variance of the modified stochastic gradient is a corollary of the following lemma.
Given any , consider the function
It is straightforward to check that , hence . Since is Lipschitz continuous with constant , we have (see, e.g., [Nes04, Theorem 2.1.5])
By dividing the above inequality by , and summing over , we obtain
there exist such that . Therefore
where in the last inequality, we used convexity of . This proves the desired result. ∎
Conditioned on , we take expectation with respect to to obtain
2 Proof of Theorem 1
For convenience, we define the stochastic gradient mapping
so that the proximal gradient step (11) can be written as
We need the following lemmas in the convergence analysis. The first one is on the non-expansiveness of proximal mapping, which is well known (see, e.g., [Roc70, Section 31]).
The next lemma provides a lower bound of the function using stochastic gradient mapping. It is a slight generalization of [HKP09, Lemma 3], and we give the proof in Appendix A for completeness.
Now we proceed to prove Theorem 1. We start by analyzing how the distance between and changes in each iteration. Using the update rule (15), we have
Applying Lemma 3 with , , , and , we have
where . Note that the assumption in Theorem 1 implies because . Therefore,
Next we upper bound the quantity . Although not used in the Prox-SVRG algorithm, we can still define the proximal full gradient update as
which is independent of the random variable . Then,
where in the first inequality we used the Cauchy-Schwarz inequality, and in the second inequality we used Lemma 2. Combining with (16), we get
Now we take expectation on both sides of the above inequality with respect to to obtain
Divide both sides of the above inequality by , we arrive at
Finally using the definition of in (14), and applying the above inequality recursively, we obtain
Numerical experiments
In this section we present results of several numerical experiments to illustrate the properties of the Prox-SVRG method, and compare its performance with several related algorithms.
We used three publicly available data sets. Their sizes , dimensions as well as sources as listed in Table 1. For rcv1 and covertype, we used the processed data for binary classification from [FL11]. The table also listed the values of and that were used in our experiments. These choices are typical in machine learning benchmarks to obtain good classification performance.
We first illustrate the numerical characteristics of Prox-SVRG on the rcv1 dataset. Each example in this dataset has been normalized so that for all , which leads to the same upper bound on the Lipschitz constants . In our implementation, we used the splitting in (17) and uniform sampling of the component functions. We choose the number of stochastic gradient steps between full gradient evaluations as a small multiple of .
Figure 2 shows the behavior of Prox-SVRG with when we used three different step sizes. The horizontal axis is the number of effective passes over the data, where each effective pass evaluates component gradients. Each full gradient evaluation counts as one effective pass, and appears as a small flat segment of length 1 on the curves. It can be seen that the convergence of Prox-SVRG becomes slow if the step size is either too big or too small. The best choice of matches our theoretical analysis (see the first remark after Theorem 1). The number of non-zeros (NNZs) in the iterates converges quickly to after about passes over the data.
Figure 3 shows how the objective gap decreases when we vary the period of evaluating full gradients. For , the fastest convergence per stage is achieved by , but the frequent evaluation of full gradients makes its overall performance slightly worse than . Longer periods leads to slower convergence, due to the lack of effective variance reduction. For , the condition number is much larger, thus longer period is required to have sufficient reduction during each stage.
2 Comparison with related algorithms
We implemented the following algorithms to compare with Prox-SVRG:
Prox-SG: the proximal stochastic gradient method given in (6). We used a constant step size that gave the best performance among all powers of .
RDA: the regularized dual averaging method in [Xia10]. The step size parameter in RDA is also chosen as the one that gave best performance among all powers of .
Prox-FG: the proximal full gradient method given in (5), with an adaptive line search scheme proposed in [Nes13].
Prox-AFG: an accelerated version of the Prox-FG method that is very similar to FISTA [BT09], also with an adaptive line search scheme.
Prox-SAG: a proximal version of the stochastic average gradient (SAG) method [SRB13, Section 6]. We note that the convergence of this Prox-SAG method has not been established for the general model considered in this paper. Nevertheless it demonstrates good performance in practice.
Prox-SDCA: the proximal stochastic dual coordinate ascent method [SSZ12]. In order to obtain the complexity , it needs to use the splitting (18).
Figure 5 shows the comparison of different methods on two other data sets listed in Table 1. Here we also included comparison with Prox-SVRG2, which is a hybrid method by performing Prox-SG for one pass over the data and then switch to Prox-SVRG. This hybrid scheme was suggested in [JZ13], and it often improves the performance of Prox-SVRG substantially. Similar hybrid schemes also exist for SDCA [SSZ12] and SAG [SRB13].
The behaviors of the stochastic gradient type of algorithms on covertype (Figure 5, left) are similar to those on rcv1, but the two full gradient methods Prox-FG and Prox-AFG perform worse because of the smaller regularization parameter and hence worse condition number. The sido0 data set turns out to be more difficult to optimize, and much slower convergence are observed in Figure 5 (right). The Prox-SAG method performs best on this data set, followed by Prox-SVRG2 and Prox-SVRG.
Conclusions
We developed a new proximal stochastic gradient method, called Prox-SVRG, for minimizing the sum of two convex functions: one is the average of a large number of smooth component functions, and the other is a general convex function that admits a simple proximal mapping. This method exploits the finite average structure of the smooth part by extending the variance reduction technique of SVRG [JZ13], which computes the full gradient periodically to modify the stochastic gradients in order to reduce their variance.
The Prox-SVRG method enjoys the same low complexity as that of SDCA [SSZ13, SSZ12] and SAG [RSB12, SRB13], but applies to a more general class of problems, and does not require the storage of the most recent gradient for each component function. In addition, our method incorporates a weighted sampling scheme, which achieves an improved complexity result for problems where the component functions vary substantially in smoothness.
Appendix A Proof of Lemma 3
The associated optimality condition states that there is a such that
Combining with the definition of , we have .
By smoothness of , we can further lower bound by
where in the last equality we used and . Collecting all inner products on the right-hand side, we have
where in the first equality we used , in the third equality we used , and in the last equality we used . Putting everything together, we obtain
Finally using the assumption , we arrive at the desired result.
Appendix B Convergence analysis of the Prox-FG method
Here we prove the convergence rate in (7) for the Prox-FG method (5). First we define the full gradient mapping and use it to obtain
Applying Lemma 3 with , , , and , we have and
Rearranging terms in the above inequality yields
Dropping the nonnegative term 2\eta\bigl{(}F(x_{k})-F(x_{\star})\bigr{)} on the left-hand side results in
Dropping the nonnegative term on the left-hand side of (19) yields
Setting , the above inequality is equivalent to (7).