Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm
Deanna Needell, Nathan Srebro, Rachel Ward
Introduction
This paper connects two algorithms which until now have remained remarkably disjoint in the literature: the randomized Kaczmarz algorithm for solving linear systems and the stochastic gradient descent (SGD) method for optimizing a convex objective using unbiased gradient estimates. The connection enables us to make contributions by borrowing from each body of literature to the other. In particular, it helps us highlight the role of weighted sampling for SGD and obtain a tighter guarantee on the linear convergence regime of SGD.
In a seemingly independent line of research, the Kaczmarz method was proposed as an iterative method for solving (usually overdetermined) systems of linear equations . The simplicity of the method makes it useful in a wide array of applications ranging from computer tomography to digital signal processing . Recently, Strohmer and Vershynin proposed a variant of the Kaczmarz method using a random selection method which select rows with probability proportional to their squared norm, and showed that using this selection strategy, a desired accuracy of can be reached in the noiseless setting in a number of steps that scales like and linearly in the condition number.
From a birds-eye perspective, this paper aims to extend the notion of importance sampling from stochastic sampling methods for numerical linear algebra applications, to more general stochastic convex optimization problems. Strohmer and Vershynin’s incorporation of importance sampling into the Kaczmarz setup is just one such example, and most closely related to the SGD set-up. But importance sampling has also been considered in stochastic coordinate-descent methods . There also, the weights are proportional to some power of the Lipschitz constants (of the gradient coordinates).
Importance sampling has also played a key role in designing sampling-based low-rank matrix approximation algorithms – both row/column based sampling and entry-wise sampling – where it goes by the name of leverage score sampling. The resulting sampling methods are again proportional to the squared Euclidean norms of rows and columns of the underlying matrix. See , and references therein for applications to the column subset selection problem and matrix completion. See for applications of importance sampling to the Nyström Method.
Importance sampling has also been introduced to the compressive sensing framework, where it translates to sampling rows of an orthonormal matrix proportionally to their squared inner products with the rows of a second orthonormal matrix in which the underlying signal is assumed sparse. See for more details.
2. Contributions
Inspired by the analysis of Strohmer and Vershynin and Bach and Moulines, we prove convergence results for stochastic gradient descent as well as for SGD variants where gradient estimates are chosen based on a weighted sampling distribution, highlighting the role of importance sampling in SGD.
Finally, in Section 5, we turn to the Kaczmarz algorithm, explain how it is an instantiation of SGD, and how using partially biased sampling improves known guarantees in this context as well. We show that the randomized Kaczmarz method with uniform i.i.d. row selection can be recast as an instance of preconditioned Stochastic Gradient Descent acting on a re-weighted least squares problem and through this connection, provide exponential convergence rates for this algorithm. We also consider the Kaczmarz algorithm corresponding to SGD with hybrid row selection strategy which shares the exponential convergence rate of Strohmer and Vershynin while also sharing a small error residual term of the SGD algorithm. This presents a clear tradeoff between convergence rate and the convergence residual, not present in other results for the method.
SGD for Strongly Convex Smooth Optimization
We consider the problem of minimizing a smooth convex function,
We will instate the following assumptions on the function :
Each is continuously differentiable and the gradient function has Lipschitz constant ; that is, for all vectors and .
has strong convexity parameter ; that is, for all vectors and .
A unbiased gradient estimate for can be obtained by drawing and using as the estimate. The SGD updates with (fixed) step size based on these gradient estimates are then given by:
where are drawn i.i.d. from . We are interested in the distance of the iterates from the unique minimum, and denote the initial distance by .
Bach and Moulines [1, Theorem 1] considered this settingBach and Moulines’s results are somewhat more general. Their Lipschitz requirement is a bit weaker and more complicated, but in terms of yields (2.3). They also study the use of polynomial decaying step-sizes, but these do not lead to improved runtime if the target accuracy is known ahead of time. and established that
where the expectation is with respect to the sampling of .
If we are given a desired tolerance, , and we know the Lipschitz constants and parameters of strong convexity, we may optimize the step-size , and obtain:
For any desired , using a step-size of
Substituting into the second term of (2.4) and simplifying gives the bound
substituting for , and rearranging to solve for , shows that we need such that
Utilizing the fact that for and rearranging again yields the requirement that
Noting that this inequality holds when yields the stated number of steps in (2.5). Since the expression on the right hand side of (2.4) decreases with , the corollary is proven. ∎
The crux of the improvement over Bach and Moulines is in a tighter recursive equation. Bach and Moulines rely on the recursion
whereas we use the Co-Coercivity Lemma A.1, with which we can obtain the recursion
where is the Lipschitz constant of the component used in the current iterate. The significant difference is that one of the factors of (an upper bound on the second derivative), in the third term inside the parenthesis, is replaced by (a lower bound on the second derivative of ). A complete proof can be found in the appendix.
Comparison to results of Bach and Moulines.
Tightness.
Importance Sampling
We will now consider stochastic gradient descent, where gradient estimates are sampled from a weighted distribution.
For a weight function which assigns a non-negative weight to each index , the weighted distribution is defined as the distribution such that
Similarly, for a continuous distribution, this corresponds to multiplying the density by and renormalizing.
One way to construct the weighted distribution , and sample from it, is through rejection sampling: sample , and accept with probability , for some . Otherwise, reject and continue to re-sample until a suggestion is accepted. The accepted samples are then distributed according to .
2. Reweighted SGD
For any normalized weight function , we can weight each component , defining:
The representation (3.3) is an equivalent, and equally valid, stochastic representation of the objective , and we can just as well base SGD on this representation. In this case, at each iteration we sample and then use as an unbiased gradient estimate. SGD iterates based on the representation (3.3), which we will also refer to as -weighted SGD, are then given by
where are drawn i.i.d. from .
The important observation here is that all SGD guarantees are equally valid for the -weighted updates (3.4)–the objective is the same objective , the sub-optimality is the same, and the minimizer is the same. We do need, however, to calculate the relevant quantities controlling SGD convergence with respect to the modified components and the weighted distribution .
3. Strongly Convex Smooth Optimization using Weighted SGD
We now return to the analysis of strongly convex smooth optimization and investigate how re-weighting can yield a better guarantee. To do so, we must analyze the relevant quantities involved.
The Lipschitz constant of each component is now scaled, and we have, . The supremum is given by:
It is easy to verify that (3.5) is minimized by the weights
Before applying Corollary 2.2, we must also calculate:
Now, applying Corollary 2.2 to the -weighted SGD iterates (3.4) with weights (3.6), we have that, with an appropriate stepsize,
4. Partially biased sampling
If , i.e. we are in the “realizable” situation, with true linear convergence, then we also have . In this case, we already obtain the desired guarantee: linear convergence with a linear dependence on the average conditioning , strictly improving over Bach and Moulines. However, the inequality in (3.8) might be tight in the presence of components with very small that contribute towards the residual error (as might well be the case for a small component). When , we therefore get a dissatisfying scaling of the second term, relative to Bach and Moulines, by a factor of .
Fortunately, we can easily overcome this factor. To do so, consider sampling from a distribution which is a mixture of the original source distribution and its re-weighting using the weights (3.6). That is, sampling using the weights:
We refer to this as partially biased sampling. Using these weights, we have
Plugging these into Corollary 2.2 we obtain:
We now obtain the desired linear scaling on , without introducing any additional factor to the residual term, except for a constant factor of two. We thus obtain a result which dominates Bach and Moulines (up to a factor of 2) and substantially improves upon it (with a linear rather than quadratic dependence on the conditioning).
One might also ask whether the previous best known result (2.3) could be improved using weighted sampling. The relevant quantity to consider is the average square Lipschitz constant for the weighted representation: (3.3):
Interestingly, this quantity is minimized by the same weights as , given by (3.6), and with these weights we have:
5. Implementing Importance Sampling
As discussed above, when the magnitudes of are highly variable, importance sampling is necessarily in order to obtain a dependence on the average, rather than worst-case, conditioning. In some applications, especially when the Lipschitz constants are known in advance or easily calculated or bounded, such importance sampling might be possible by directly sampling from . This is the case, for example, in trigonometric approximation problems or linear systems which need to be solved repeatedly, or when the Lipschitz constant is easily computed from the data, and multiple passes over the data are needed anyway. We do acknowledge that in other regimes, when data is presented in an online fashion, or when we only have sampling access to the source distribution (or the implied distribution over gradient estimates), importance sampling might be difficult.
One option that could be considered, in light of the above results, is to use rejection sampling to simulate sampling from . For the weights (3.6), this can be done by accepting samples with probability proportional to . The overall probability of accepting a sample is then , introducing an additional factor of . This results in a sample complexity with a linear dependence on , as in Corollary 2.2 (for the weights (3.10), we can first accept with probability 1/2, and then if we do not accept, perform this procedure). Thus, if we are presented samples from , and the cost of obtaining the sample dominates the cost of taking the gradient step, we do not gain (but do not lose much either) from rejection sampling. We might still gain from rejection sampling if the cost of operating on a sample (calculating the actual gradient and taking a step according to it) dominates the cost of obtaining it and (a bound on) the Lipschitz constant.
6. A Family of Partially Biased Schemes
The choice of weights (3.10) corresponds to an equal mix of uniform and fully biased sampling. More generally, we could consider sampling according to any one of a family of weights which interpolate between uniform and fully biased sampling:
To be concrete, we summarize below the a template algorithm for SGD with partially biased sampling:
[ht] Stochastic Gradient Descent with Partially Biased Sampling
For arbitrary , we have the bounds
Plugging these quantities into Corollary 2.2, we obtain:
In this corollary, even if is close to 1, i.e. we add only a small amount of bias to the sampling, we obtain a bound with a linear dependence on the average conditioning (multiplied by a factor of ), since we can bound .
Importance Sampling for SGD in Other Scenarios
In the previous Section, we considered SGD for smooth and strongly convex objectives, and were particularly interested in the regime where the residual is low, and the linear convergence term is dominant. Weighted SGD could of course be relevant also in other scenarios, and we now briefly survey them, as well as relate them to our main scenario of interest.
When each component is convex, non-negative, and has an -Lipschitz gradient, but the objective is not necessarily strongly convex, then after
iterations of SGD with an appropriately chosen step-size we will have , where is an appropriate averaging of the iterates Srebro et al. . The relevant quantity here determining the iteration complexity is again . Furthermore, Srebro et al. , relying on an example from Foygel and Srebro , point out that the dependence on the supremum is unavoidable and cannot be replaced with the average Lipschitz constant . That is, if we sample gradients according to the source distribution , we must have a linear dependence on .
The only quantity in the bound (4.1) that changes with a re-weighting is —all other quantities (, , and the sub-optimality ) are invariant to re-weightings. We can therefor replace the dependence on with a dependence on by using a weighted SGD as in (3.4). As we already calculated, the optimal weights are given by (3.6), and using them we have . In this case, there is no need for partially biased sampling, and we obtain that with an appropriate step-size,
iterations of weighed SGD updates (3.4) using the weights (3.6) suffice.
We again see that using importance sampling allows us to reduce the dependence on , which is unavoidable without biased sampling, to a dependence on .
2. Non-Smooth Objectives
In parallel work, Zhao and Zhang also consider importance sampling for stochastic optimization for non-smooth objectives. Zhao and Zhang consider a more general setting, with a composite objective that is only partially linearized. But also there, the iteration complexity depends on the second moment of the gradient estimates, and the analysis performed above applies (Zhao and Zhang perform a specialized analysis instead).
3. Non-Realizable Regime
An alternative approach is to bound and so . Taking this bound, we are back to the same quantity as in the non-smooth case, and the optimal weights are proportional to . Note that this is a different weighting then using weights proportional to , which optimize the linear-convergence term as studied in Section 3.3.
To understand how weighting according to and are different, consider a generalized linear objective where , and is a scalar function with and . We have that while . Weighting according to the Lipschitz constants of the gradients, i.e. the “smoothness” parameters, as in (3.6), versus weighting according to the Lipschitz constants of as in (4.4), thus corresponds to weighting according to versus , and are rather different. We can also calculate that weighing by (i.e. following (3.6)), yields . That is, weights proportional to yield a suboptimal gradient-dependent term (the same dependence as if no weighting at all was used). Conversely, using weights proportional to , i.e. proportional to yields – a suboptimal dependence, though better then no weighting at all.
Again, as with partially biased sampling, we can weight by the average, and ensure both terms are optimal up to a factor of two.
The least squares case and the Randomized Kaczmarz Method
A special case of interest is the least squares problem, where
with an -dimensional vector, an matrix with rows , and is the least-squares solution. Writing the least squares problem (5.1) in the form (2.1), we see that the source distribution is uniform over , the components are , the Lipschitz constants are , the average Lipschitz constant is , the strong convexity parameter is , so that , and the residual is . Note that in the case that is not full-rank, one can instead replace with the smallest nonzero eigenvalue of as in [23, Equation (3)]. In that case, we instead write as the appropriate condition number.
The randomized Kaczmarz method for solving the least squares problem (5.1) begins with an arbitrary estimate , and in the th iteration selects a row i.i.d. at random from the matrix and iterates by:
where the step size in the standard method.
Strohmer and Vershynin provided the first non-asymptotic convergence rates, showing that drawing rows proportionally to leads to provable exponential convergence in expectation for the full-rank case . Their method can easily be extended to the case when the matrix is not full-rank to yield convergence to some solution, see e.g. [23, Equation (3)]. Recent works use acceleration techniques to improve convergence rates .
However, one can easily verify that the iterates (5.2) are precisely weighted SGD iterates (3.4) with the fully biased weights (3.6).
The reduction of the quadratic dependence on the conditioning to a linear dependence in Theorem 2.1, as well as the use of biased sampling which we investigate here was motivated by Strohmer and Vershynin’s analysis of the randomized Kaczmarz method. Indeed, applying Theorem 2.1 to the weighted SGD iterates (2.2) for (5.1) with the weights (3.6) and a stepsize of yields precisely the Strohmer and Vershynin guarantee.
Understanding the randomized Kaczmarz method as SGD allows us also to obtain improved methods and results for the randomized Kaczmarz method:
As shown by Strohmer and Vershynin and extended by Needell , the randomized Kaczmarz method with weighted sampling exhibits exponential convergence, but only to within a radius, or convergence horizon, of the least-squares solution. This is because a step-size of is used, and so the second term in (2.4) does not vanish. It has been shown that changing the step size can allow for convergence inside of this convergence horizon, although non-asymptotic results have been difficult to obtain. Our results allow for finite-iteration guarantees with arbitrary step-sizes and can be immediately applied to this setting. Indeed, applying Theorem 2.1 with the weights (3.6) gives
Let be an matrix with rows . Set , where is the minimizer of the problem
Suppose that . Set , . Then the expected error at the iteration of the Kaczmarz method described by (5.2) with row selected with probability satisfies
with . The expectation is taken with respect to the weighted distribution over the rows.
When e.g. , we recover the exponential rate of Strohmer and Vershynin up to a factor of , and nearly the same convergence horizon. For arbitrary , Corollary 5.1 implies a tradeoff between a smaller convergence horizon and a slower convergence rate.
Uniform Row Selection.
The Kaczmarz variant of Strohmer and Vershynin calls for weighted row sampling, and thus requires pre-computing all the row norms. Although certainly possible in some applications, in other cases this might be better avoided. Understanding the randomized Kaczmarz as SGD allows us to apply Theorem 2.1 also with uniform weights (i.e. to the unbiased SGD), and obtain a randomized Kaczmarz using uniform sampling, which converges to the least-squares solution and enjoys finite-iteration guarantees:
Let be an matrix with rows . Let be the diagonal matrix with terms , and consider the composite matrix . Set , where is the minimizer of the weighted least squares problem
Suppose that . Then the expected error after iterations of the Kaczmarz method described by (5.2) with uniform row selection satisfies
where
Note that the randomized Kaczmarz algorithm with uniform row selection converges exponentially to a weighted least-squares solution, to within arbitrary accuracy by choosing sufficiently small stepsize . Thus, in general, the randomized Kaczmarz algorithms with uniform and biased row selection converge (up to a convergence horizon) towards different solutions.
Partially Biased Sampling.
As in our SGD analysis, using the partially biased sampling weights is applicable also for the randomized Kaczmarz method. Applying Theorem 2.1 using weights (3.10) gives
Let be an matrix with rows . Set , where is the minimizer of the problem
Suppose . Then the iterate of the modified Kaczmarz method described by
with row selected with probability satisfies
The partially biased randomized Kaczmarz method described above (which does have modified update equation (5.4) compared to the standard update equation (5.2)) yields the same convergence rate as the fully biased randomized Kaczmarz method (up to a factor of 2), but gives a better dependence on the residual error over the fully biased sampling, as the final term in (5.5) is smaller than the final term in (5.3).
Numerical Experiments
In this section we present some numerical results for the randomized Kaczmarz algorithm with partially biased sampling, that is, applying Algorithm 3.6 to the least squares problem (so ) and considering . Recall that corresponds to the randomized Kaczmarz algorithm of Strohmer and Vershynin with fully weighted sampling . corresponds to the partially biased randomized Kaczmarz algorithm outlined in Corollary 5.3. We demonstrate how the behavior of the algorithm depends on , the conditioning of the system, and the residual error at the least squares solution. We focus on exploring the role of on the convergence rate of the algorithm for various types of matrices . We consider five types of systems, described below, each using a matrix . In each setting, we create a vector with standard normal entries. For the described matrix and residual , we create the system and run the randomized Kaczmarz method with various choices of . Each experiment consists of independent trials and uses the optimal step size as in Corollary 3.2 with ; the plots show the average behavior over these trials. The settings below show the various types of behavior the Kazcmarz method can exhibit.
Each row of the matrix has standard normal entries, except the last row which has normal entries with mean and variance . The residual vector has normal entries with mean and variance .
Each row of the matrix has standard normal entries. The residual vector has normal entries with mean and variance .
The th row of has normal entries with mean and variance . The residual vector has normal entries with mean and variance .
The th row of has normal entries with mean and variance . The residual vector has normal entries with mean and variance .
The th row of has normal entries with mean and variance . The residual vector has normal entries with mean and variance .
Figure 1 shows the convergence behavior of the randomized Kaczmarz method in each of these five settings. As expected, when the rows of are far from normalized, as in Case 1, we see different behavior as varies from to . Here, weighted sampling () significantly outperforms uniform sampling (), and the trend is monotonic in . On the other hand, when the rows of are close to normalized, as in Case 2, the various give rise to similar convergence rates, as is expected. Out of the tested (we tested increments of from to ), the choice gave the worst convergence rate, and again purely weighted sampling gives the best. Still, the worst-case convergence rate was not much worse, as opposed to the situation with uniform sampling in Case 1. Cases 3, 4, and 5 use matrices with varying row norms and cover “high”, “medium”, and “low” noise regimes, respectively. In the high noise regime (Case 3), we find that fully weighted sampling, , is relatively very slow to converge, as the theory suggests, and hybrid sampling outperforms both weighted and uniform selection. In the medium noise regime (Case 4), hybrid sampling still outperforms both weighted and uniform selection. Again, this is not surprising, since hybrid sampling allows a balance between small convergence horizon (important with large residual norm) and convergence rate. As we decrease the noise level (as in Case 5), we see that again weighted sampling is preferred.
Figure 2 shows the number of iterations of the randomized Kaczmarz method needed to obtain a fixed approximation error. For the choice for Case 1, we cut off the number of iterations after 50,000, at which point the desired approximation error was still not attained. As seen also from Figure 1, Case 1 exhibits monotonic improvements as we scale . For Cases 2 and 5, the optimal choice is pure weighted sampling, whereas Cases 3 and 4 prefer intermediate values of .
Summary and outlook
We consider this paper as making three contributions: the improved dependence on the conditioning for smooth and strongly convex SGD, the discussion of importance sampling for SGD, and the connection between SGD and the randomized Kaczmarz method.
For simplicity, we only considered SGD iterates with a fixed step-size . This is enough for getting the optimal iteration complexity if the target accuracy is known in advance, which was our approach in this paper. It is easy to adapt the analysis, using standard techniques, to incorporate decaying step-sizes, which are appropriate if we don’t know in advance.
We suspect that the assumption of strong convexity can be weakened to restricted strong convexity without changing any of the results of this paper; we leave this analysis to future work.
Finally, our discussion of importance sampling is limited to a static reweighting of the sampling distribution. A more sophisticated approach would be to update the sampling distribution dynamically as the method progresses, and as we gain more information about the relative importance of components. Although such dynamic importance sampling is sometimes attempted heuristically, we are not aware of any rigorous analysis of such a dynamic biasing scheme.
Acknowledgements
We would like to thank the anonymous reviewers for their useful feedback which significantly improved the manuscript. We would like to thank Chris White for pointing out a simplified proof of Corollary 2.2. DN was partially supported by a Simons Foundation Collaboration grant, NSF CAREER and an Alfred P. Sloan Fellowship. NS was partially supported by a Google Research Award. RW was supported in part by ONR Grant N00014-12-1-0743, an AFOSR Young Investigator Program Award, and an NSF CAREER award.
References
Appendix A Proofs
Our main results utilize an elementary fact about smooth functions with Lipschitz continuous gradient, called the co-coercivity of the gradient. We state the lemma and recall its proof for completeness.
For a smooth function whose gradient has Lipschitz constant ,
Since has Lipschitz constant , if is the minimizer of , then [see e.g. 32, page 26]
and observe that both have Lipschitz constants and minimizers and , respectively. Applying (A.1) to these functions therefore gives that
Adding these two inequalities and canceling terms yields the desired result.
A.2. Proof of Theorem 2.1
With the notation of Theorem 2.1, and where is the random index chosen at iteration , and , we have
when . Recursively applying this bound over the first iterations yields the desired result,