Less than a Single Pass: Stochastically Controlled Stochastic Gradient Method
Lihua Lei, Michael I. Jordan
Introduction
Optimizing the finite-sum convex objectives is ubiquitous in different areas:
where each is a convex function. These problems are often solved by algorithms that either make use of full gradients (obtained by processing the entire dataset) or stochastic gradients (obtained by processing single data points or mini-batches of data points). The use of the former provides guarantees of eventual convergence and the latter yields advantages in terms of rate of convergence rate, scalability and simplicity of implementation . An impactful recent line of research has shown that a hybrid methodology that makes use of both full gradients and stochastic gradients can obtain the best of both worlds—guaranteed convergence at favorable rates, e.g. . The full gradients provide variance control for the stochastic gradients.
While this line of research represents significant progress towards the goal of designing scalable, autonomous learning algorithms, there remain some inefficiencies in terms of computation. With the definition of computation and communication cost in Section 2.1, the methods referred to above require computation to achieve an -approximate solution, where is the number of data points, is a target accuracy and is the dimension of the parameter vector. Some methods incur a storage cost . The linear dependence on is problematic in general. Clearly there will be situations in which accurate solutions can be obtained with less than a single pass through the data; indeed, some problems will require a constant number of steps. This will be the case, for example, if the data in a regression problem consist of a fixed number of pairs repeated a large number of times. For deterministic algorithms, the worst case analysis in shows that scanning at least a fixed proportion of the data is necessary; however, learning algorithms are generally stochastic and real-world learning problems are generally not worst case.
An equally important bottleneck for learning algorithms is the cost of communication. For large data sets that must be stored on disk or distributed across many computing nodes, the communication cost can be significant, even dominating the computation cost. For example, SVRG makes use of full gradient over the whole dataset which can incur prohibitive communication cost. There is an active line of research that focuses on communication costs; see, e.g. .
In this article, we present a variant of the stochastic variance reduced gradient (SVRG) method that we refer to as stochastically controlled stochastic gradient (SCSG). The basic idea behind SCSG—that of approximating the full gradient in SVRG via a subsample—has been explored by others, but we present several innovations that yield significant improvements both in theory and in practice. In contradistinction to SVRG, the theoretical convergence rate of SCSG has a sublinear regime in terms of both computation and communication. This regime is important in machine learning problems, notably in the common situation in which the sample size is large, (), while the required accuracy is low, . The analysis in this article shows that SCSG is able to achieve the target accuracy in this regime with potentially less than a single pass through the data.
In the regime of low accuracy, SCSG is never worse than the classical stochastic gradient descent (SGD). Although SCSG has the same dependence on the target accuracy as SGD, it has a potentially much smaller factor. In fact, the theoretical complexity of SGD depends on the uniform bound of over the domain and the component index. This might be infinite even in the most common least square problems. By contrast, the complexity of SCSG depends on a new measure , defined in Section 2 and discussed in Section 4, which is finite and small for a large class of practical problems. In particular, in many cases where SGD does not have theoretical guarantees to converge. The measure sheds light upon characterizing the difficulty of optimization problems in the form of a finite sum and reveals some intrinsic difference between finite-sum optimization and stochastic approximation, which is considered by other relevant works; e.g., streaming SVRG and dynaSAGA.
The remainder of the paper is organized as follows. In Section 2, we review SVRG, discuss several of its variants and we describe the SCSG algorithm. We provide a theoretical convergence analysis in Section 3. In Section 4, we give a comprehensive discussion on the difficulty measure . The empirical results on real datasets are presented in Section 5. Finally, we conclude our work and discuss potential extensions in Section 6. All technical proofs are relegated to the Appendices.
Notation, Assumptions and Algorithm
The assumption A1 on the smoothness of individual functions will be used throughout this paper.
is convex with -Lipschitz gradient
for some and all ;
The following assumption will be used in the context of strongly-convex objectives.
Note that we only require the strong convexity of instead of each component.
Let denote the minimizer of that minimizes in (2), then can be written as
Then under assumption A1 and A2. A point , possibly random, is called an -approximated solution if
The expectation of the above distributions satisfy that
The stochastic variance reduced gradient (SVRG) method blends gradient descent and stochastic gradient descent, using the former to control the effect of the variance of the latter . We summarize SVRG in Algorithm 1.
Using the definition from Section 2.1, it is easy to see that the computation cost of SVRG is . As shown in the convergence analysis of , is required to be to guarantee convergence. Thus, the computation cost of SVRG is . The costs of the other algorithms considered in Table 1 can be obtain in a similar fashion. For comparison, we only present the results for smooth case (assumption A1).
Much of this work focuses on the strongly convex case. In the non-strongly convex setting one way to proceed is to add a regularizer . Tuning , however, is subtle and requires multiple runs of the algorithm on a grid of . For general convex functions an alternative approach has been presented by (they generate by a different scheme in line 4), which proves a computation complexity . Another approach is discussed by , who improve the complexity to by scaling the stepsize as . However, their algorithm still relies on calculating a full gradient. Other variants of SVRG have been proposed in the distributed computing setting and in the stochastic setting .
2 SCSG
The average computation cost of SCSG is . By the law of large numbers and the expectation formula (4), this is close to . Table 1 summarizes the computation complexity as well as some other details of SCSG and 11 other existing popular algorithms. The table includes the computation cost of optimizing non-strongly-convex functions (column 1) and strongly convex functions (column 2). In practice, the amount of tuning is of major concern. For this reason, a fixed stepsize is usually preferred to a complicated stepsize scheme and it is better that the tuning parameter does not depend on unknown quantities; e.g., or the total number of epochs . These issues are documented in column 3 and column 4. Moreover, many algorithms requires to be bounded, i.e. to be Lipschitz. However, this assumption is not realistic in many cases and it is better to discard it. To address this issue, we document it in column 5. To highlight the dependence on and (or ), we implicitly assume that other parameters, e.g. , are as a convention.
As seen from Table 1, SCSG and SGD are the only two methods which are able to reach an -approximate solution with potentially less than a single pass through the data; moreover, the number of accesses of the data is independent of the sample size . Comparing to SCSG, SGD requires each to be Lipschitz, which is not satisfied by least-square objectives. By contrast, as will be shown in Section 3, the computation cost of SCSG only depends on the quantity , which is relatively small in many cases. Furthermore, SGD either sets the stepsize based on unknown quantities like the total number of epochs or needs to use a time-varying sequence of stepsizes. This involves intensive tuning as opposed to a fixed stepsize.
On the other hand, SCSG is communication-efficient since it only needs to operate on mini-batches as SGD. This is particularly important in modern large-scale tasks. By contrast, those algorithms that require full gradients evaluation either need extra communication for synchronization or need extra computational cost for the asynchronous version to converge; See e.g. .
Convergence Analysis
In this section we present a convergence analysis of SCSG. We first state the following key lemma that connects our algorithm with the measure defined in (2).
The proof, which appears in Appendix B, involves a standard technique for analyzing sampling without replacement. Obviously, if is uniformly bounded as is often assumed in the literature. In section 4 we will present various other situations where .
Note that the extra variation vanishes when and in general is inversely proportional to the batch size. In the rest of this section, we will first discuss the case , which we refer to as R-SVRG (Randomized SVRG), to compare with the original SVRG. Later we will discuss the general case.
Let and assume that , then
Based on Theorem 3.2, we first consider a constant stepsize scaled as .
Let with . Then under the assumption A1, with the output ,
By scaling as , R-SVRG is able to achieve the same complexity of , which is the best bound in the class of SVRG-type algorithms without acceleration techniques.
Let with . Then under the assumption A1, with the output ,
Assume that . Under the assumption A1,
which does not equal in general. Most novelty of our analysis lies in dealing with the extra bias. Fortunately, we found that the extra terms do not worsen the complexity by scaling as .
Assume that , then with the output ,
Corollary 3.6 shows that SCSG is never worse than SGD and SVRG (with constant stepsize scaled as ). Compared with SGD whose complexity is where
SCSG has a factor which is strictly smaller than . It will be shown in Section 4, can be much smaller than even in the case where .
Assume that , then with the output ,
Under the same settings of Corollary 3.8 except that setting
More Details on ℋ(f)ℋ𝑓\mathcal{H}(f)
The first attempt to describe the heterogeneity is through an unrealistic condition, called strong growth condition(), which requires
Under (13), proves that the stochastic gradient methods have the same convergence rate as the full gradient methods. However, (13) is unrealistic since it implies for any minimizer of , is the stationary point of all individual loss functions.
By contrast, our proposed measure is well-behaved in most applications without awkward assumptions such as the bounded domain. Recall that
It can be viewed as a version of which replaces the supremum by the value at a single point, when the optimum of is unique. As a consequence, . In addition, when the strong growth condition (13) holds, for all and hence . These simple facts show that is strictly better than and as a measure of difficulty. We will show in the next two subsections that can be controlled and estimated in almost all problems and is well-behaved in a wide range of applications.
Although being unrealistic, it is often assumed that is uniformly bounded over the domain. This implies the boundedness of directly and hence provides an example where the problem (1) is “easy”.
Let and be defined in Table 2, then
Surprisingly, can be bounded even without any assumption other than A1 by using an arbitrary reference point.
This entails that problem (1) with i.i.d. individual functions is “easy”. This is heuristically reasonable since the i.i.d. assumption, plus the moment conditions, forces the data to be highly homogeneous.
In fact, (17) can be proved under much broader settings. For example, when solving a linear equation , can be set as and no randomness is involved. If we set , then (16) implies that
Then provided .
Another type of problems with involves pairwise comparisons, i.e.
where are independent samples. For example, in preference elicitation or sporting competitions where the data is collected as pairwise-comparisons, one can fit a Bradley-Terry model to obtain the underlying “score” that represents the quality of each unit. The objective function of the Bradley-Terry model is where is the number of times that the unit beats the unit (, ). Other examples that involve a similar structure are metric learning (, ) and convex relaxation of graph cuts (). In these cases, we can also bound under mild conditions.
Finally, it is worth mentioning that (17) cannot be established for unless the domain is compact and more regularity conditions, than the existence of second moment, are imposed to ensure that a certain version of uniform law of large number can be applied.
2 Estimating ℋ(f)ℋ𝑓\mathcal{H}(f) in Generalized Linear Models
Optimzation problems in machine learning are often generalized linear models where , with being the covariates and being the responses, for some convex loss function . Let . Then by definition
If is uniformly bounded with , then
We will show in appendix E that for multi-class logistic regression, regardless of the number of classes. The same bound can also be derived for Huber regression , Probit model (), etc.. When the domain is unbounded, the (penalized) least square regression has an unbounded . However notice that where , one can easily show that
3 ℋ(f)ℋ𝑓\mathcal{H}(f) in Pathological Cases
The first example is due to the high dimension. When the dimension is comparable to , even the i.i.d. assumption cannot guarantee a good behavior of , without further conditions, since the law of large number fails. The second example is due to the severe heterogeneity of components. In fact the -th component reaches its minimum at while the global function reaches its minimum at and thus most components behaves completely different from the global function.
Nevertheless, it is worth emphasizing that SGD also faces with the same issue in these two cases. More importantly, SCSG does not suffer from these undesirable properties since it will choose automatically; See Corollary 3.6 to Corollary 3.9.
Experiments
In this section, we illustrate the performance of SCSG by implementing it for multi-class logistic regression on the MNIST dataset http://yann.lecun.com/exdb/mnist/. We normalize the data into the range $256$. No regularization term is added and so the function to be minimized is
The performance is measured by versus the number of passes of data Although beging ideal to report , it is not feasible in that is unknown. . For each algorithm mentioned later, we selects the best-tuned stepsize and then implement the algorithm for 10 times and report the average to avoid the random effect.
Here we compare SCSG with mini-batch SGD, with the batch size , and SVRG. Moreover, we consider three variants of SCSG:
(SCSGFixN) set , instead of generated from a geometric distribution;
(SCSGNew) randomly pick , instead of from the whole dataset ;
(SCSGNewFixN) set and randomly pick .
The first variant is to check whether geometric random variable is essential in practice; the second variant is to check whether running SGD from the whole dataset is necessary; and the third variant is the combination.
For all the variants of SCSG and SGD, we consider three batch sizes . The results are plotted in Figure 1, from which we make the following observations:
SCSG is able to reach an accurate solution very fast since all versions of SCSG are more efficient than SGD and SVRG in the first 5 passes. This confirms our theory;
SCSG with fixed is slightly more effective than the original SCSG. Thus the geometric random variable might not be essential in practice;
It makes no difference whether sampling from the whole dataset or sampling from the mini-batch when running the SGD steps in SCSG.
Based on our observations, we recommend implementing SCSGNewFixN as the default since 1) the fixed number of SGD steps stablizes the procedure; 2) sampling from the mini-batch reduces the communication cost incurred by accessing data from the whole dataset.
Discussion
We propose SCSG as a member of the SVRG family of algorithms, proving its superior performance in terms of both computation and communication cost. Both complexities are independent of sample size when the required accuracy is low, for various functions which are widely optimized in practice. The real data example also validates our theory.
We plan to explore several variants of SCSG in future work. For example, a non-uniform sampling scheme can be applied to SGD steps to leverage the Lipschitz constants as in SVRG. More interestingly, we can consider a better sampling scheme for by putting more weight on influential observations. The proximal settings are also straightforward extensions of our current work.
As a final comment, we found that the previous complexity analysis focuses on the high-accuracy computation for which the dependence on the sample size and condition number is of major concern. The low-accuracy regime is unfortunately under-studied theoretically even though it is commonly encountered in practice. We advocate taking all three parameters, namely , and , into consideration and distinguishing the analyses for high-accuracy computation and low-accuracy computation as standard practice in the literature.
Acknowledgment
We thank the Chi Jin, Nathan Srebro and anonymous reviewers for their helpful comments, which greatly improved this work.
References
Appendix A Lemmas
Let be a convex function that satisfies the assumption ,
Proof This is the standard Co-coercivity argument; See e.g. , Theorem 2.1.5 of .
Proof An elementary computation shows that
Using the fact that , we have
Appendix B One-Epoch Analysis
First we prove a lemma that generalizes Lemma 3.1.
Further let be a uniform random subset of with size . Then
Proof Let , then it is easy to see that
Then the sampling mean can be reformulated as
Proof [Lemma 3.1] Let . Then
As all other algorithms, we start from deriving a bound for the stochastic gradients . For convenience, we define as the bias of , i.e.
Under the assumption A1 and A2 with possibly equal to ,
By Lemma A.1 with ,
where the last line uses the fact that is independent of . Similarly, by Lemma A.1 with ,
where the last line uses the smoothness of . Putting the pieces together, we conclude that
Assume that . Then for any ,
Similarly, since is also independent of ,
Now let and taking expectation with respect to , by Lemma A.2 and B.4,
The Cauchy-Schwartz inequality implies that
Proof Let . By Lemma B.2 and Lemma B.5, we have
Proof Using the fact that , we have
Noticing that , by Lemma B.1,
On the other hand, by Lemma B.1 again, we obtain that
Putting the pieces together we prove the result.
Putting all the pieces together, we can derive the key inequality on the performance of SCSG within a single epoch.
Suppose and . Then under the assumption A1 and A2,
Using the fact that for any , we have
Then by Corollary B.7 and Lemma B.8, we obtain that
Since , the assumption A2 implies that
Finally, since , we conclude that
Proof [of Lemma B.4] We prove the first claim by induction. When , the claim is obvious. Suppose we prove the claim for , i.e.
Let be another sequence constructed as follows:
On the other hand, showed in the proof of their Theorem 1 that
Since and , we have
Putting (29) and (30) together, and using the fact that , we obtain that
Appendix C Analysis of R-SVRG
Throughout the rest of appendices, we will denote and by
Although we can directly apply Theorem B.9 with , the constants involved in the analysis are compromised. To sharpen the constants, we derive a counterpart of Theorem B.9 for R-SVRG.
Let and assume that . Under the assumption A1 and A2,
Proof In this case, . By Lemma B.5 with and Lemma B.2,
Since , the assumption A2 implies that
Based on Theorem C.1, we can prove the results in Section 3.1.
Summing the above inequality for , we have
The result is then proved by noticing that
Proof [Corollary 3.3] In the non-strongly convex case, by part (1) of Theorem 3.2,
Recalling the definitions of and in (33) and (34), this implies that
In the strongly convex case, by part (2) of Theorem 3.2,
Note that , we obtain that
Proof [Corollary 3.4] By part (1) of Theorem 3.2, we have
Appendix D Analysis of SCSG
Proof [Theorem 3.5] By Theorem B.9, we have
Telescoping the above inequality for , we have
where the last inequality uses . By convexity,
Before proving the results in Section 3.2, we derive the computation complexity for arbitrary batch size with an appropriately scaled stepsize in the non-strongly convex case.
Assume A1 holds. Set with
Under these conditions, is bounded by
Proof [Corollary 3.6] Let . Then
where the last equality uses the fact that and . As a consequence,
D.2 Convergence Analysis for Strongly Convex Objectives
Multiplying both sides of (37) by and summing over , we obtain that
Proof [Corollary 3.8] By definition, , and
and whenever , . Thus,
Proof [Corollary 3.9] Using the same argument as in the proof of Corollary 3.8, . By (10) in Theorem 3.7,
Appendix E Proof of Results in Section 4
Proof [Proposition 4.2] By Lemma A.1 in Appendix A
Averaging the above inequality for all results in
Proof [Proposition 4.3] In this case, the RHS of (16) is a form of order-two V-statistics () which can be written as
The above inequalities imply that in this case. The conclusion can be easily extended to the case where can be written as a higher-order V-statistics.
Denote by the concatenation of as in Section 5. For multi-class logistic regression loss
It is easy to see that for any and