Minimizing Finite Sums with the Stochastic Average Gradient

Mark Schmidt, Nicolas Le Roux, Francis Bach

Introduction

A plethora of the optimization problems arising in practice involve computing a minimizer of a finite sum of functions measuring misfit over a large number of data points. A classical example is least-squares regression,

In this work, we focus on such finite data problems where each fif_{i} is smooth and convex.

where each lil_{i} is a data-misfit function (as in least-squares and logistic regression) and the positive scalar λ\lambda controls the strength of the regularization. These problems can be put in the framework of (1) by using the choice

The resulting function gg will be strongly-convex provided that the individual loss functions lil_{i} are convex. An extensive list of convex loss functions used in a statistical data-fitting context is given by Teo et al. , and non-smooth loss functions (or regularizers) can also be put in this framework by using smooth approximations (for example, see Nesterov ).

For optimizing problem (1), the standard deterministic or full gradient (FG) method, which dates back to Cauchy , uses iterations of the form

where αk\alpha_{k} is the step size on iteration kk. Assuming that a minimizer xx^{\ast} exists, then under standard assumptions the sub-optimality achieved on iteration kk of the FG method with a constant step size is given by

when gg is convex [see Nesterov, 2004, Corollary 2.1.2]. This results in a sublinear convergence rate. When gg is strongly-convex, the error also satisfies

for some ρ<1\rho<1 which depends on the condition number of gg [see Nesterov, 2004, Theorem 2.1.5]. This results in a linear convergence rate, which is also known as a geometric or exponential rate because the error is cut by a fixed fraction on each iteration. Unfortunately, the FG method can be unappealing when nn is large because its iteration cost scales linearly in nn.

The main appeal of SG methods is that they have an iteration cost which is independent of nn, making them suited for modern problems where nn may be very large. The basic SG method for optimizing (1) uses iterations of the form

where at each iteration an index iki_{k} is sampled uniformly from the set {1,,n}\{1,\dots,n\}. The randomly chosen gradient fik(xk)f_{i_{k}}^{\prime}(x^{k}) yields an unbiased estimate of the true gradient g(xk)g^{\prime}(x^{k}) and one can show under standard assumptions (see [Nemirovski et al., 2009]) that, for a suitably chosen decreasing step-size sequence {αk}\{\alpha_{k}\}, the SG iterations have an expected sub-optimality for convex objectives of

and an expected sub-optimality for strongly-convex objectives of

In these rates, the expectations are taken with respect to the selection of the iki_{k} variables. These sublinear rates are slower than the corresponding rates for the FG method, and under certain assumptions these convergence rates are optimal in a model of computation where the algorithm only accesses the function through unbiased measurements of its objective and gradient (see Nemirovski and Yudin , Nemirovski et al. , Agarwal et al. ). Thus, we should not expect to be able to obtain the convergence rates of the FG method if the algorithm only relies on unbiased gradient measurements. Nevertheless, by using the stronger assumption that the functions are sampled from a finite dataset, in this paper we show that we can achieve the convergence rates of FG methods while preserving the iteration complexity of SG methods.

The primary contribution of this work is the analysis of a new algorithm that we call the stochastic average gradient (SAG) method, a randomized variant of the incremental aggregated gradient (IAG) method of Blatt et al. . The SAG method has the low iteration cost of SG methods, but achieves the convergence rates stated above for the FG method. The SAG iterations take the form

where at each iteration a random index iki_{k} is selected and we set

That is, like the FG method, the step incorporates a gradient with respect to each function. But, like the SG method, each iteration only computes the gradient with respect to a single example and the cost of the iterations is independent of nn. Despite the low cost of the SAG iterations, we show in this paper that with a constant step-size the SAG iterations have an O(1/k)O(1/k) convergence rate for convex objectives and a linear convergence rate for strongly-convex objectives, like the FG method. That is, by having access to iki_{k} and by keeping a memory of the most recent gradient value computed for each index ii, this iteration achieves a faster convergence rate than is possible for standard SG methods. Further, in terms of effective passes through the data, we will also see that for many problems the convergence rate of the SAG method is also faster than is possible for standard FG methods.

One of the main contexts where minimizing the sum of smooth convex functions arises is machine learning. In this context, gg is often an empirical risk (or a regularized empirical risk), which is a sample average approximation to the true risk that we are interested in. It is known that with nn training examples the empirical risk minimizer (ERM) has an error for the true risk of O(1/n)O(1/\sqrt{n}) in the convex case and O(1/n)O(1/n) in the strongly-convex case. Since these rates are achieved by doing one pass through the data with an SG method, in the worst case the SAG algorithm applied to the empirical risk cannot improve the convergence rate in terms of the true risk over this simple method. Nevertheless, Srebro and Sridharan note that “overwhelming empirical evidence shows that for almost all actual data, the ERM is better. However, we have no understanding of why this happens”. Although our analysis does not give insight into the better performance of ERM, our analysis shows that the SAG algorithm will be preferable to SG methods for finding the ERM and hence for many machine learning applications.

The next section reviews several closely-related algorithms from the literature, including previous attempts to combine the appealing aspects of FG and SG methods. However, despite 6060 years of extensive research on SG methods, with a significant portion of the applications focusing on finite datasets, we believe that this is the first general method that achieves the convergence rates of FG methods while preserving the iteration cost of standard SG methods. Section 3 states the (standard) assumptions underlying our analysis and gives our convergence rate results. Section 4 discusses practical implementation issues including how we adaptively set the step size and how we can reduce the storage cost needed by the algorithm. For example, we can reduce the memory requirements from O(np)O(np) to O(n)O(n) in the common scenario where each fif_{i} only depends on a linear function of xx, as in least-squares and logistic regression. Section 5 presents a numerical comparison of an implementation based on SAG to competitive SG and FG methods, indicating that the method may be very useful for problems where we can only afford to do a few passes through a data set.

A preliminary conference version of this work appears in Le Roux et al. , and we extend this work in various ways. Most notably, the analysis in the prior work focuses only on showing linear convergence rates in the strongly-convex case while the present work also gives an O(1/k)O(1/k) convergence rate for the general convex case. In the prior work we show (Proposition 1) that a small step-size gives a slow linear convergence rate (comparable to the rate of FG methods in terms of effective passes through the data), while we also show (Proposition 2) that a much larger step-size yields a much faster convergence rate, but this requires that nn is sufficiently large compared to the condition number of the problem. In the present work (Section 3) our analysis yields a very fast convergence rate using a large step-size (Theorem 1), even when this condition required by the prior work is not satisfied. Surprisingly, for ill-conditioned problems our new analysis shows that using SAG iterations can be nearly nn times as fast as the standard gradient method. To prove this stronger result, Theorem 1 employs a Lyapunov function that generalizes the Lyapunov functions used in Propositions 1 and 2 of the previous work. This new Lyapunov function leads to a unified proof for both the convex and the strongly-convex cases, and for both well-conditioned and ill-conditioned problems. However, this more general Lyapunov function leads to a more complicated analysis. To significantly simplify the formal proof, we use a computed-aided strategy to verify the non-negativity of certain polynomials that arise in the proof. Beyond this significantly strengthened result, in this work we also argue that yet-faster convergence rates may be achieved by non-uniform sampling (Section 4.8) and present numerical results showing that this can lead to drastically improved performance (Section 5.5).

Related Work

There are a large variety of approaches available to accelerate the convergence of SG methods, and a full review of this immense literature would be outside the scope of this work. Below, we comment on the relationships between the new method and several of the most closely-related ideas.

Momentum: SG methods that incorporate a momentum term use iterations of the form

see Tseng . It is common to set all βk=β\beta_{k}=\beta for some constant β\beta, and in this case we can rewrite the SG with momentum method as

We can re-write the SAG updates (5) in a similar form as

where the selection function S(j,i1:k)S(j,i_{1:k}) is equal to 1/n1/n if jj is the maximum iteration number where example iji_{j} was selected and is set to otherwise. Thus, momentum uses a geometric weighting of previous gradients while the SAG iterations select and average the most recent evaluation of each previous gradient. While momentum can lead to improved practical performance, it still requires the use of a decreasing sequence of step sizes and is not known to lead to a faster convergence rate.

Gradient Averaging: Closely related to momentum is using the sample average of all previous gradients,

which is similar to the SAG iteration in the form (5) but where all previous gradients are used. This approach is used in the dual averaging method of Nesterov and, while this averaging procedure and its variants lead to convergence for a constant step size and can improve the constants in the convergence rate [Xiao, 2010], it does not improve on the sublinear convergence rates for SG methods.

Iterate Averaging: Rather than averaging the gradients, some authors propose to perform the basic SG iteration but use an average over certain xkx^{k} values as the final estimator. With a suitable choice of step-sizes, this gives the same asymptotic efficiency as Newton-like second-order SG methods and also leads to increased robustness of the convergence rate to the exact sequence of step sizes [Polyak and Juditsky, 1992, Bach and Moulines, 2011]. Bather’s method [Kushner and Yin, 2003, §1.3.4] combines gradient averaging with online iterate averaging and also displays appealing asymptotic properties. Several authors have recently shown that suitable iterate averaging schemes obtain an O(1/k)O(1/k) rate for strongly-convex optimization even for non-smooth objectives [Hazan and Kale, 2011, Rakhlin et al., 2012]. However, none of these methods improve on the O(1/k)O(1/\sqrt{k}) and O(1/k)O(1/k) rates for SG methods.

Stochastic versions of FG methods: Various options are available to accelerate the convergence of the FG method for smooth functions, such as the accelerated full gradient (AFG) method of Nesterov , as well as classical techniques based on quadratic approximations such as diagonally-scaled FG methods, non-linear conjugate gradient, quasi-Newton, and Hessian-free Newton methods (see [Nocedal and Wright, 2006]). There has been a substantial amount of work on developing stochastic variants of these algorithms, with several of the notable recent examples including Bordes et al. , Hu et al. , Sunehag et al. , Ghadimi and Lan , Martens , Xiao . Duchi et al. have recently shown an improved regret bound using a diagonal scaling that takes into account previous gradient magnitudes. Alternately, if we split the convergence rate into a deterministic and stochastic part, these methods can improve the dependency of the convergence rate of the deterministic part [Hu et al., 2009, Ghadimi and Lan, 2010, Xiao, 2010]. However, we are not aware of any existing method of this flavor that improves on the O(1/k)O(1/\sqrt{k}) and O(1/k)O(1/k) dependencies on the stochastic part. Further, many of these methods typically require carefully setting parameters (beyond the step size) and often aren’t able to take advantage of sparsity in the gradients fif_{i}^{\prime}.

Constant step size: If the SG iterations are used for strongly-convex optimization with a constant step size (rather than a decreasing sequence), then Nedic and Bertsekas [2000, Proposition 3.4] showed that the convergence rate of the method can be split into two parts. The first part depends on kk and converges linearly to . The second part is independent of kk and does not converge to . Thus, with a constant step size, the SG iterations have a linear convergence rate up to some tolerance, and in general after this point the iterations do not make further progress. Indeed, up until the recent work of Bach and Moulines , convergence of the basic SG method with a constant step size had only been shown for the strongly-convex quadratic case (with averaging of the iterates) [Polyak and Juditsky, 1992], or under extremely strong assumptions about the relationship between the functions fif_{i} [Solodov, 1998]. This contrasts with the method we present in this work which converges to the optimal solution using a constant step size and does so with a linear rate (without additional assumptions).

Accelerated methods: Accelerated SG methods, which despite their name are not related to the aforementioned AFG method, take advantage of the fast convergence rate of SG methods with a constant step size. In particular, accelerated SG methods use a constant step size by default, and only decrease the step size on iterations where the inner-product between successive gradient estimates is negative [Kesten, 1958, Delyon and Juditsky, 1993]. This leads to convergence of the method and allows it to potentially achieve periods of faster convergence where the step size stays constant. However, the overall convergence rate of the method is not improved.

Hybrid Methods: Some authors have proposed variants of the SG method for problems of the form (1) that seek to gradually transform the iterates into the FG method in order to achieve a faster convergence rate. Bertsekas proposes to go through the data cyclically with a specialized weighting that allows the method to achieve a linear convergence rate for strongly-convex quadratic functions. However, the weighting is numerically unstable and the linear convergence rate presented treats full passes through the data as iterations. A related strategy is to group the functions fif_{i} into ‘batches’ of increasing size and perform SG iterations on the batches. Friedlander and Schmidt give conditions under which this strategy achieves the O(1/k)O(1/k) and O(ρk)O(\rho^{k}) convergence rates of FG methods. However, in both cases the iterations that achieve the faster rates have a cost that is not independent of nn, as opposed to SAG iterations.

Incremental Aggregated Gradient: Blatt et al. present the most closely-related algorithm to the SAG algorithm, the IAG method. This method is identical to the SAG iteration (5), but uses a cyclic choice of iki_{k} rather than sampling the iki_{k} values. This distinction has several important consequences. In particular, Blatt et al. are only able to show that the convergence rate is linear for strongly-convex quadratic functions (without deriving an explicit rate), and their analysis treats full passes through the data as iterations. Using a non-trivial extension of their analysis and a novel proof technique involving bounding the gradient and iterates simultaneously by a Lyapunov potential function, in this work we give an O(1/k)O(1/k) rate for general convex functions and an explicit linear convergence rate for general strongly-convex functions using the SAG iterations that only examine a single function. Further, as our analysis and experiments show, the SAG iterations allow a much larger step size than is required for convergence of the IAG method. This leads to more robustness to the selection of the step size and also, if suitably chosen, leads to a faster convergence rate and substantially improved practical performance. This shows that the simple change (random selection vs. cycling) can dramatically improve optimization performance.

Special Problem Classes: For certain highly-restricted classes of problems, it is possible to show faster convergence rates for methods that only operate on a single function fif_{i}. For example, Strohmer and Vershynin show that the randomized Kaczmarz method with a particular sampling scheme achieves a linear convergence rate for the problem of solving consistent linear systems. It is also known that the SG method with a constant step-size has the O(1/k)O(1/k) and O(ρk)O(\rho^{k}) convergence rates of FG methods if fi(x)\|f_{i}^{\prime}(x)\| is bounded by a linear function of g(x)\|g^{\prime}(x)\| for all ii and xx Schmidt and Le Roux . This is the strong condition required by Solodov to show convergence of the SG method with a constant step size. Unlike these previous works, our analysis in the next section applies to general fif_{i} that satisfy standard assumptions, and only requires gradient evaluations of the functions fif_{i} rather than dual block-coordinate steps.

Subsequent work: Since the first version of this work was released, there has been an explosion of research into stochastic gradient methods with faster convergence rates. It has been shown that similar rates can be achieved for certain constrained and non-smooth problems, that similar rates can be achieved without the memory requirements, that Newton-like variants of the method may be possible, and that similar rates can be achieved with other algorithms. In Section 6, we survey these recent developments.

Convergence Analysis

This is a fairly weak assumption on the fif_{i} functions, and in cases where the fif_{i} are twice-differentiable it is equivalent to saying that the eigenvalues of the Hessians of each fif_{i} are bounded above by LL. We will also assume the existence of at least one minimizer xx^{\ast} that achieves the optimal function value. We denote the average iterate by xˉk=1ki=0k1xi\bar{x}^{k}=\frac{1}{k}\sum_{i=0}^{k-1}x^{i}, and the variance of the gradient norms at the optimum xx^{\ast} by σ2=1nifi(x)2\sigma^{2}=\frac{1}{n}\sum_{i}\|f_{i}^{\prime}(x^{\ast})\|^{2}. Our convergence results consider two different initializations for the yi0y_{i}^{0} variables: setting yi0=0y_{i}^{0}=0 for all ii, or setting them to the centered gradient at the initial point x0x^{0} given by yi0=fi(x0)g(x0)y_{i}^{0}=f_{i}^{\prime}(x^{0})-g^{\prime}(x^{0}). We note that all our convergence results are expressed in terms of expectations with respect to the internal randomization of the algorithm (the selection of the random variables iki_{k}), and not with respect to the data which is assumed to be deterministic and fixed.

Under these standard assumptions, we now state our convergence result.

With a constant step size of αk=116L\alpha_{k}=\frac{1}{16L}, the SAG iterations satisfy for k1k\geq 1:

where if we initialize with yi0=0y_{i}^{0}=0 we have

and if we initialize with yi0=fi(x0)g(x0)y_{i}^{0}=f_{i}^{\prime}(x^{0})-g^{\prime}(x^{0}) we have

Further, if gg is μ\mu-strongly convex we have

Note that while the first part of Theorem (1) is stated for the average xˉk\bar{x}^{k}, with a trivial change to the proof technique it can be shown to also hold for any iterate xkx^{k} where g(xk)g(x^{k}) is lower than the average function value up to iteration kk, 1ki=0k1g(xi)\frac{1}{k}\sum_{i=0}^{k-1}g(x^{i}). Thus, in addition to xˉk\bar{x}^{k} the result also holds for the best iterate. We also note that our bounds are valid for any LL greater than or equal to the minimum LL satisfying (8), implying an O(1/k)O(1/k) and linear convergence rate for any αk1/16L\alpha_{k}\leqslant 1/16L (but the bound becomes worse as LL grows). Although initializing each yi0y_{i}^{0} with the centered gradient may have an additional cost and slightly worsens the dependency on the initial sub-optimality (g(x0)g(x))(g(x^{0})-g(x^{\ast})), it removes the dependency on the variance σ2\sigma^{2} of the gradients at the optimum. While we have stated Theorem 1 in terms of the function values, in the strongly-convex case we also obtain a convergence rate on the iterates because we have

Theorem 1 shows that the SAG iterations are advantageous over SG methods in later iterations because they obtain a faster convergence rate. However, the SAG iterations have a worse constant factor because of the dependence on nn. We can improve the dependence on nn using an appropriate choice of x0x^{0}. In particular, following Le Roux et al. we can set x0x^{0} to the result of nn iterations of an appropriate SG method. In this setting, the expectation of g(x0)g(x)g(x^{0})-g(x^{*}) is O(1/n)O(1/\sqrt{n}) in the convex setting, while both g(x0)g(x)g(x^{0})-g(x^{*}) and x0x2\|x^{0}-x^{\ast}\|^{2} would be in O(1/n)O(1/n) in the strongly-convex setting. If we use this initialization of x0x^{0} and set yi0=fi(x0)g(x0)y_{i}^{0}=f_{i}^{\prime}(x^{0})-g^{\prime}(x^{0}), then in terms of nn and kk the SAG convergence rates take the form O(n/k)O(\sqrt{n}/k) and O(ρk/n)O(\rho^{k}/n) in the convex and strongly-convex settings, instead of the O(n/k)O(n/k) and O(ρk)O(\rho^{k}) rates implied by Theorem 1. However, in our experiments we do not use an SG initialization but rather use a minor variant of SAG in the early iterations (discussed in the next section), which appears more difficult to analyze but which gives better empirical performance.

An interesting consequence of using a step-size of αk=1/16L\alpha_{k}=1/16L is that it makes the method adaptive to the strong-convexity constant μ\mu. That is, for problems with a higher degree of local strong-convexity around the solution xx^{\ast}, the algorithm will automatically take advantage of this and yield a faster local rate. This can even lead to a local linear convergence rate if the problem is strongly-convex near the optimum but not globally strongly-convex. This adaptivity to the problem difficulty is in contrast to SG methods whose sequence of step sizes typically depend on global constants and thus do not adapt to local strong-convexity.

We have observed in practice that the IAG method with a step size of αk=116L\alpha_{k}=\frac{1}{16L} may diverge. While the step-size needed for convergence of the IAG iterations is not precisely characterized, we have observed that it requires a step-size of approximately 1/nL1/nL in order to converge. Thus, the SAG iterations can tolerate a step size that is roughly nn times larger, which leads to increased robustness to the selection of the step size. Further, as our analysis and experiments indicate, the ability to use a large step size leads to improved performance of the SAG iterations. Note that using randomized selection with a larger step-size leading to vastly improved performance is not an unprecedented phenomenon; the analysis of Nedic and Bertsekas shows that the iterations of the basic stochastic gradient method with a constant step-size can achieve the same error bound as full cycles through the data of the cyclic variant of the method by using steps that are nn times larger (see the discussion after Proposition 3.4). Related results also appear in Collins et al. , Lacoste-Julien et al. showing the advantage of stochastic optimization strategies over deterministic optimization strategies in the context of certain dual optimization problems.

The convergence rate of the SAG iterations in the strongly-convex case takes a somewhat surprising form. For ill-conditioned problems where n2Lμn\leqslant\frac{2L}{\mu}, nn does not appear in the convergence rate and the SAG algorithm has nearly the same convergence rate as the FG method with a step size of 1/16L1/16L, even though it uses iterations which are nn times cheaper. This indicates that the basic gradient method (under a slightly sub-optimal step-size) is not slowed down by using out-of-date gradient measurements for ill-conditioned problems. Although nn appears in the convergence rate in the well-conditioned setting where n>2Lμn>\frac{2L}{\mu}, if we perform nn iterations of SAG (i.e., one effective pass through the data), the error is multiplied by (11/8n)nexp(1/8)(1-1/8n)^{n}\leqslant\exp(-1/8), which is independent of nn. Thus, in this setting each pass through the data reduces the excess objective by a constant multiplicative factor that is independent of the problem.

It is interesting to compare the convergence rate of SAG in the strongly-convex case with the known convergence rates for first-order methods [see Nesterov, 2004, §2]. In Table 1, we use two examples to compare the convergence rate of SAG to the convergence rates of the standard FG method, the faster AFG method, and the lower-bound for any first-order strategy (under certain dimensionality assumptions) for optimizing a function gg satisfying our assumptions. In this table, we compare the rate obtained for these FG methods to the rate obtained by running nn iterations of SAG, since this requires the same number of evaluations of fif_{i}^{\prime}. Case 11 in this table focuses on a well-conditioned case where the rate of SAG is (11/8n)(1-1/8n), while Case 22 focuses on an ill-conditioned example where the rate is (1μ/16L)(1-\mu/16L). Note that in the latter case the O(1/k)O(1/k) rate for the method may be faster.

Implementation Details

In Algorithm 1 we give pseudo-code for an implementation of the basic method, where we use a variable dd to track the quantity (i=1nyi)(\sum_{i=1}^{n}y_{i}). This section focuses on further implementation details that are useful in practice. In particular, we discuss modifications that lead to better practical performance than the basic Algorithm 1, including ways to reduce the storage cost, how to handle regularization, how to set the step size, using mini-batches, and using non-uniform sampling. Note that an implementation of the algorithm that incorporates many of these aspects is available from the first author’s webpage.

For many problems the storage cost of O(np)O(np) for the yiky_{i}^{k} vectors is prohibitive, but we can often use the structure of the gradients fif_{i}^{\prime} to reduce this cost. For example, a commonly-used specialization of (1) is linearly-parameterized models which take form

Since each aia_{i} is constant, for these problems we only need to store the scalar fik(uik)f_{i_{k}}^{\prime}(u_{i}^{k}) for uik=aikxku_{i}^{k}=a_{i_{k}}^{\top}x^{k} rather than the full gradient aifi(uik)a_{i}f_{i}^{\prime}(u_{i}^{k}). This reduces the storage cost from O(np)O(np) down to O(n)O(n).

For problems where the vectors aia_{i} are sparse, an individual gradient fif_{i}^{\prime} will inherit the sparsity pattern of the corresponding aia_{i}. However, the update of xx in Algorithm 1 appears unappealing since in general dd will be dense, resulting in an iteration cost of O(p)O(p). Nevertheless, we can take advantage of the simple form of the SAG updates to implement a ‘just-in-time’ variant of the SAG algorithm where the iteration cost is proportional to the number of non-zeroes in aika_{i_{k}}. In particular, we do this by not explicitly storing the full vector xkx^{k} after each iteration. Instead, on each iteration we only compute the elements xjkx_{j}^{k} corresponding to non-zero elements of aika_{i_{k}}, by applying the sequence of updates to each variable xjkx_{j}^{k} since the last iteration where it was non-zero in aika_{i_{k}}. This sequence of updates can be applied efficiently since it simply involves changing the step size. For example, if variable jj has been zero in aika_{i_{k}} for 55 iterations, then we can compute the needed value xjkx_{j}^{k} using

This update allows SAG to be efficiently applied to sparse data sets where nn and pp are both in the millions or higher but the number of non-zeros is much less than npnp.

2 Re-weighting on early iterations

In the update of xx in Algorithm 1, we normalize the direction dd by the total number of data points nn. When initializing with yi0=0y_{i}^{0}=0 we believe this leads to steps that are too small on early iterations of the algorithm where we have only seen a fraction of the data points, because many yiy_{i} variables contributing to dd are set to the uninformative zero-vector. Following Blatt et al. , the more logical normalization is to divide dd by mm, the number of data points that we have seen at least once (which converges to nn once we have seen the entire data set), leading to the update x=xαmdx=x-\frac{\alpha}{m}d. Although this modified SAG method appears more difficult to analyze, in our experiments we found that running the basic SAG algorithm from the very beginning with this modification outperformed the basic SAG algorithm as well as the SG/SAG hybrid algorithm mentioned in the Section 3. In addition to using the gradient information collected during the first kk iterations, this modified SAG algorithm is also advantageous over hybrid SG/SAG algorithms because it only requires estimating a single constant step size.

3 Exact and efficient regularization

If the loss function gradients lil_{i}^{\prime} are sparse as in Section 4.1, then these modifications lead to a reduced storage requirement even though the gradient of the regularizer is dense. Further, although the update of xx again appears to require dense vector operations, we can implement the algorithm efficiently if the aia_{i} are sparse. In particular, to allow efficient multiplication of xx by the scalar (1αλ)(1-\alpha\lambda), it is useful to represent xx in the form x=κzx=\kappa z, where κ\kappa is a scalar and zz is a vector (as done by Shalev-Shwartz et al. ). Under this representation, we can multiply xx by a scalar in O(1)O(1) by simply updating κ\kappa (though to prevent κ\kappa becoming too large or too small we may need to occasionally re-normalize by setting z=κzz=\kappa z and κ=1\kappa=1). To efficiently implement the vector subtraction operation, we can use a variant of the just-in-time updates from Section 4.1. In Algorithm 2, we give pseudo-code for a variant of SAG that includes all of these modifications, and thus uses no full-vector operations. This code uses a vector yy to keep track of the scalars li(uik)l_{i}^{\prime}(u_{i}^{k}), a vector CC to determine whether a data point has previously been visited, a vector VV to track the last time a variable was updated, and a vector SS to keep track of the cumulative sums needed to implement the just-in-time updates.

4 Warm starting

In many scenarios we may need to solve a set of closely-related optimization problems. For example, we may want to apply Algorithm 2 to a regularized objective of the form (2) for several values of the regularization parameter λ\lambda. Rather than solving these problems independently, we might expect to obtain better performance by warm-starting the algorithm. Although initializing xx with the solution of a related problem can improve performance, we can expect an even larger performance improvement if we also use the gradient information collected from a run of SAG for a close value of λ\lambda. For example, in Algorithm 2 we could initialize xx, yiy_{i}, dd, mm, and CiC_{i} based on a previous run of the SAG algorithm. In this scenario, Theorem 1 suggests that it may be beneficial in this setting to center the yiy_{i} variables around dd.

5 Larger step sizes

In our experiments we have observed that utilizing a step size of 1/L1/L, as in standard FG methods, always converged and often performed better than the step size of 1/16L1/16L suggested by our analysis. Thus, in our experiments we used αk=1/L\alpha_{k}=1/L even though we do not have a formal analysis of the method under this step size. We also found that a step size of 2/(L+nμ)2/(L+n\mu), which in the strongly-convex case corresponds to the best fixed step size for the FG method in the special case of n=1n=1 [see Nesterov, 2004, Theorem 2.1.15], sometimes yields even better performance (though in other cases it performs poorly).

6 Line-search when L𝐿L is not known

In general the Lipschitz constant LL will not be known, but we may obtain a reasonable approximation of a valid LL by evaluating fif_{i} values while running the algorithm. In our experiments, we used a basic line-search where we start with an initial estimate L0L^{0}, and double this estimate whenever we do not satisfy the inequality

Note that the cost of this line-search is independent of nn, making it suitable for large problems. Further, for linearly-parameterized models of the form fi(aiTx)f_{i}(a_{i}^{T}x), it is also possible to implement the line-search so that its cost is also independent of the number of variables pp. To see why, if we use δk=aikTxk\delta^{k}=a_{i_{k}}^{T}x^{k} and the structure in the gradient then the left side is given by

Thus, if we pre-compute the squared norms ai2\|a_{i}\|^{2} and note that δk\delta^{k} and fik(δk)f_{i_{k}}^{\prime}(\delta^{k}) are already needed by the SAG update, then each iteration only involves operations on scalar values and the single-variable function fikf_{i_{k}}.

7 Mini-batches for vectorized computation and reduced storage

Because of the use of vectorization and parallelism in modern architectures, practical SG implementations often group functions into ‘mini-batches’ and perform SG iterations on the mini-batches. We can also use mini-batches within the SAG iterations to take advantage of the same vectorization and parallelism. Additionally, for problems with dense gradients mini-batches can dramatically decrease the storage requirements of the algorithm, since we only need to store a vector yiy_{i} for each mini-batch rather than for each example. Thus, for example, using a mini-batch of size 100100 leads to a 100100-fold reduction in the storage cost.

A subtle issue that arises when using mini-batches is that the value of LL in the Lipschitz condition (8) is based on the mini-batches instead of the original functions fif_{i}. For example, consider the case where we have a batch B\mathcal{B} and the minimum value of LL in (8) for each ii is given by LiL_{i}. In this case, a valid value of LL for the function x1BiBfi(x)x\mapsto\frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}f_{i}(x) would be maxiB{Li}\max_{i\in\mathcal{B}}\{L_{i}\}. We refer to this as LmaxL_{\textrm{max}}. But we could also consider using Lmean=1BiBLiL_{\textrm{mean}}=\frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}L_{i}. The value LmeanL_{\textrm{mean}} is still valid and will be smaller than LmaxL_{\textrm{max}} unless all LiL_{i} are equal. We could even consider the minimum possible value of LL, which we refer to as LHessianL_{\textrm{Hessian}} because (if each fif_{i} is twice-differentiable) it is equal to the maximum eigenvalue of 1BiBfi(x)\frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}f_{i}^{\prime\prime}(x) across all xx. Note that LHessianLmeanLmaxL_{\textrm{Hessian}}\leq L_{\textrm{mean}}\leq L_{\textrm{max}}, although LHessianL_{\textrm{Hessian}} will typically be more difficult to compute than LmeanL_{\textrm{mean}} or LmaxL_{\textrm{max}} (although a line-search as discussed in the previous section can reduce this cost). Due to the potential of using a smaller LL, we may obtain a faster convergence rate by using larger mini-batches. However, in terms of passes through the data this faster convergence may be offset by the higher iteration cost associated with using mini-batches.

8 Non-uniform example selection

In standard SG methods, it is crucial to sample the functions fif_{i} uniformly, at least asymptotically, in order to yield an unbiased gradient estimate and subsequently achieve convergence to the optimal value (alternately, the bias induced by non-uniform sampling would need to be asymptotically corrected). In SAG iterations, however, the weight of each gradient is constant and equal to 1/n1/n, regardless of the frequency at which the corresponding function is sampled. We might thus consider sampling the functions fif_{i} non-uniformly, without needing to correct for this bias. Though we do not yet have any theoretical proof as to why a non-uniform sampling might be beneficial, intuitively we would expect that we do not need to sample functions fif_{i} whose gradient changes slowly as often as functions fif_{i} whose gradient changes more quickly. Indeed, we provide here an argument to justify a non-uniform sampling strategy based on the Lipschitz constants of the individual gradients fif_{i}^{\prime} and we note that in subsequent works this intuition has proved correct for related algorithms [Xiao and Zhang, 2014, Schmidt et al., 2015].

Let LiL_{i} again be the Lipschitz constant of fif_{i}^{\prime}, and assume that the functions are placed in increasing order of Lipschitz constants, so that L1L2LnL_{1}\leqslant L_{2}\leqslant\ldots\leqslant L_{n}. In the ill-conditioned setting where the convergence rate depends on μL\frac{\mu}{L}, a simple way to improve the rate by decreasing LL is to replace fnf_{n} by two functions fn1f_{n1} and fn2f_{n2} such that

We have thus replaced the original problem by a new, equivalent problem where:

LiL_{i} for i(n1)i\leqslant(n-1) is Li(n+1)n\frac{L_{i}(n+1)}{n},

LnL_{n} and Ln+1L_{n+1} are equal to Ln(n+1)2n\frac{L_{n}(n+1)}{2n}.

Hence, if Ln1<nLnn+1L_{n-1}<\frac{nL_{n}}{n+1}, this problem has the same μ\mu but a smaller LL than the original one, improving the bound on the convergence rate. By duplicating fnf_{n}, we increase its probability of being sampled from 1n\frac{1}{n} to 2n+1\frac{2}{n+1}, but we also replace ynky_{n}^{k} by a noisier version, i.e. yn1k+yn2ky_{n1}^{k}+y_{n2}^{k}. Using a noisier version of the gradient appears detrimental, so we assume that the improvement comes from increasing the frequency at which fnf_{n} is sampled, and that logically we might obtain a better rate by simply sampling fnf_{n} more often in the original problem and not explicitly duplicating the data.

We now consider the extreme case of duplicating each function fif_{i} a number of times equal to the Lipschitz constant of their gradient, assuming that these constants are integers. The new problem becomes

The function gg is now written as the sum of kLk\sum_{k}L_{k} functions, each with a gradient with Lipschitz constant kLkn\frac{\sum_{k}L_{k}}{n}. The new problem has the same μ\mu as before, but now has an LL equal to the average of the Lipschitz constants across the fif_{i}^{\prime}, rather than their maximum, thus improving the bound on the convergence rate. Sampling these functions uniformly is now equivalent to sampling the original fif_{i}’s according to their Lipschitz constant. Thus, we might expect to obtain better performance by, instead of creating a larger problem by duplicating the functions in proportion to their Lipschitz constant, simply sampling the functions from the original problem in proportion to their Lipschitz constants.

Sampling in proportion to the Lipschitz constants of the gradients was explored by Nesterov in the context of coordinate descent methods, and is also somewhat related to the sampling scheme used by Strohmer and Vershynin in the context of their randomized Kaczmarz algorithm. Since the first version of this work was released, Needell et al. have analyzed sampling according to the Lipschitz constant in the context of SG iterations. Such a sampling scheme makes the iteration cost depend on nn, due to the need to generate samples from a general discrete distribution over nn variables. However, after an initial preprocessing cost of O(n)O(n) we can sample from such distributions in O(logn)O(\log n) using a simple binary search [see Robert and Casella, 2004, Example 2.10].

Unfortunately, sampling the functions according to the Lipschitz constants and using a step size of αk=niLi\alpha_{k}=\frac{n}{\sum_{i}L_{i}} does not seem to converge in general. However, by changing the number of times we duplicate each fif_{i}, we can interpolate between the Lipschitz sampling and the uniform sampling. In particular, if each function fif_{i} is duplicated Li+cL_{i}+c times, where LiL_{i} is the Lipschitz constant of fif_{i}^{\prime} and cc a positive number, then the new problem becomes

Unlike in the previous case, these k(Lk+c)\sum_{k}(L_{k}+c) functions have gradients with different Lipschitz constants. Denoting L=maxiLiL=\max_{i}L_{i}, the maximum Lipschitz constant is equal to k(Lk+c)nLL+c\frac{\sum_{k}(L_{k}+c)}{n}\frac{L}{L+c} and we must thus use the step size α=L+cL(kLkn+c)\alpha=\frac{L+c}{L\left(\frac{\sum_{k}L_{k}}{n}+c\right)}.

Experimental Results

In this section we perform empirical evaluations of the SAG iterations. We first compare the convergence of an implementation of the SAG iterations to a variety of competing methods available. We then seek to evaluate the effect of different algorithmic choices such as the step size, mini-batches, and non-uniform sampling.

The theoretical convergence rates suggest the following strategies for deciding on whether to use an FG or an SG method:

If we can only afford one pass through the data, then an SG method should be used.

If we can afford to do many passes through the data (say, several hundred), then an FG method should be used.

We expect that the SAG iterations will be most useful between these two extremes, where we can afford to do more than one pass through the data but cannot afford to do enough passes to warrant using FG algorithms like the AFG or L-BFGS methods. To test whether this is indeed the case in practice, we perform a variety of experiments evaluating the performance of the SAG algorithm in this scenario.

as a canonical problem satisfying our assumptions. In our experiments we set the regularization parameter λ\lambda to 1/n1/n, which is in the range of the smallest values that would typically be used in practice, and thus which results in the most ill-conditioned problems of this form that would be encountered. Our experiments focus on the freely-available benchmark binary classification data sets listed in Table 2. The quantum and protein data set was obtained from the KDD Cup 2004 website;http://osmot.cs.cornell.edu/kddcup the covertype (based on the datset of Blackard, Jock, and Dean), rcv1, news, and rcv1Full data sets were obtained from the LIBSVM Data website; http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets; the sido data set was obtained from the Causality Workbench website, http://www.causality.inf.ethz.ch/home.php the spam data set was prepared by [see Carbonetto, 2009, §2.6.5] using the TREC 2005 corpushttp://plg.uwaterloo.ca/~gvcormac/treccorpus; and the alpha data set was obtained from the Pascal Large Scale Learning Challenge websitehttp://largescale.ml.tu-berlin.de. We added a (regularized) bias term to all data sets, and for dense features we standardized so that they would have a mean of zero and a variance of one. To obtain results that are independent of the practical implementation of the algorithm, we measure the objective as a function of the number of effective passes through the data, measured as the number of times we evaluate lil_{i}^{\prime} divided by the number of examples nn. If they are implemented to take advantage of the sparsity present in the data sets, the runtimes of all algorithms examined in this section differ by at most a constant times this measure.

In our first experiment we compared the following variety of competitive FG and SG methods:

L-BFGS: A publicly-available limited-memory quasi-Newton method that has been tuned for log-linear models such as logistic regression [Schmidt, 2005]. This method is the most complicated method we considered.

SG: The stochastic gradient method described by iteration (4). Since setting the step-size in this method is a tenuous issue, we chose the constant step size that gave the best performance (in hindsight) among all powers of 1010 (we found that this constant step-size strategies gave better performance than the variety of decreasing step-size strategies that we experimented with).

ASG: The average of the iterations generated by the SG method above, where again we choose the best step size among all powers of 1010.Note that we also compared to a variety of other SG methods including the popular Pegasos SG method of Shalev-Shwartz et al. , SG with momentum, SG with gradient averaging, the regularized dual averaging method of Xiao (a stochastic variant of the primal-dual subgradient method of Nesterov for regularized objectives), the accelerated SG method of Delyon and Juditsky , SG methods that only average the later iterations as in the ‘optimal’ SG method for non-smooth optimization of Rakhlin et al. , the epoch SG method of Hazan and Kale , the ‘nearly-optimal’ SG method of Ghadimi and Lan , a diagonally-scaled SG method using the inverse of the coordinate-wise Lipshitz constants as the diagonal terms, and the adaptive diagonally-scaled AdaGrad method of Duchi et al. . However, we omit results obtained using these algorithms since they never performed substantially better than the minimum between the SG and ASG methods when their step-size was chosen in hindsight.

IAG: The incremental aggregated gradient method of Blatt et al. described by iteration (5) with a cyclic choice of iki_{k}. We used the re-weighting described in Section 4.2, we used the exact regularization as described in Section 4.3, and we chose the step-size that gave the best performance among all powers of 1010.

SAG-LS: The proposed stochastic average gradient method described by iteration (5). We used the re-weighting described in Section 4.2, the exact regularization as described in Section 4.3, and we used a step size of αk=1/Lk\alpha_{k}=1/L^{k} where LkL^{k} is an approximation of the Lipschitz constant for the negative log-likelihoods li(x)=log(1+exp(biaix))l_{i}(x)=\log(1+\exp(-b_{i}a_{i}^{\top}x)). Although this Lipschitz constant is given by 0.25maxi{ai2}0.25\max_{i}\{\|a_{i}\|^{2}\}, we used the line-search described in Section 4.6 to estimate it, to test the ability to use SAG as a black-box algorithm (in addition to avoiding this calculation and potentially obtaining a faster convergence rate by obtaining an estimate that could be smaller than the global value). To initialize the line-search we set L0=1L^{0}=1.

We plot the results of the different methods for the first 5050 effective passes through the data in Figure 1. For the stochastic methods, we plot the mean performance as well as the minimum and maximum function values across 1010 choices for the initial random seed. We can observe several trends across the experiments:

FG vs. SG: Although the performance of SG methods is known to be catastrophic if the step size is not chosen carefully, after giving the SG methods (SG and ASG) an unfair advantage (by allowing them to choose the best step-size in hindsight), the SG methods always do substantially better than the FG methods (AFG and L-BFGS) on the first few passes through the data. However, the SG methods typically make little progress after the first few passes. In contrast, the FG methods make steady progress and eventually the faster FG method (L-BFGS) typically passes the SG methods.

(FG and SG) vs. SAG: The SAG iterations seem to achieve the best of both worlds. They start out substantially better than FG methods, often obtaining similar performance to an SG method with the best step-size chosen in hindsight. But the SAG iterations continue to make steady progress even after the first few passes through the data. This leads to better performance than SG methods on later iterations, and on most data sets the sophisticated FG methods do not catch up to the SAG method even after 5050 passes through the data.

IAG vs. SAG: Even though these two algorithms differ in only the seemingly-minor detail of selecting data points at random (SAG) compared to cycling through the data (IAG), the performance improvement of SAG over its deterministic counterpart IAG is striking (even though the IAG method was allowed to choose the best step-size in hindsight). We believe this is due to the larger step sizes allowed by the SAG iterations, which would cause the IAG iterations to diverge.

2 Comparison to Coordinate Optimization Methods

PCD: The randomized primal coordinate-descent method of Nesterov , using a step-size of 1/Lj1/L_{j}, where LjL_{j} is the Lipschitz-constant with respect to coordinate jj of gg^{\prime}. Here, we sampled the coordinates uniformly.

PCD-L: The same as above, but sampling coordinates according to their Lipschitz constant, which can lead to an improved convergence rate [Nesterov, 2010].

DCA: Applying randomized coordinate ascent to the dual, with uniform example selection and an exact line-search [Shalev-Schwartz and Zhang, 2013b].

As with comparing FG and SG methods, it is difficult to compare coordinate-wise methods to FG and SG methods in an implementation-independent way. To do this in a way that we believe is fair (when discussing convergence rates), we measure the number of effective passes of the DCA method as the number of iterations of the method divided by nn (since each iteration accesses a single example as in SG and SAG iterations). We measure the number of effective passes for the PCD and PCD-L methods as the number of iterations multiplied by n/pn/p so that 11 effective pass for this method has a cost of O(np)O(np) as in FG and SG methods. We ignore the additional cost associated with the Lipschitz sampling for the PCD-L method (as well as the expense incurred because the PCD-L method tended to favour updating the bias variable for sparse data sets) and we also ignore the cost of numerically computing the optimal step-size for the DCA method.

We compare the performance of the randomized coordinate optimization methods to several of the best methods from the previous experiment in Figure 2. In these experiments we observe the following trends:

PCD vs. PCD-L: For the problems with n>pn>p (top and bottom rows of Figure 2), there is little difference between uniform and Lipschitz sampling of the coordinates. For the problems with p>np>n (middle row of Figure 2), sampling according to the Lipschitz constant leads to a large performance improvement over uniform sampling.

PCD vs. DCA: For the problems with p>np>n, DCA and PCD-L have similar performance. For the problems with n>pn>p, the performance of the methods typically differed but neither strategy tended to dominate the other.

(PCD and DCA) vs. (SAG): For some problems, the PCD and DCA methods have performance that is similar to SAG-LS and the DCA method even gives better performance than SAG-LS on one data set. However, for many data sets either the PCD-L or the DCA method (or both) perform poorly while the SAG-LS iterations are among the best or substantially better than all other methods on every data set. This suggests that, while coordinate optimization methods are clearly extremely effective for some problems, the SAG method tends to be a more robust choice across problems.

3 Comparison of Step-Size Strategies

In our prior work we analyzed the step-sizes αk=1/2nL\alpha_{k}=1/2nL and αk=1/2nμ\alpha_{k}=1/2n\mu [Le Roux et al., 2012], while Section 3 considers the choice αk=1/16L\alpha_{k}=1/16L and Section 4.5 discusses the choices αk=1/L\alpha_{k}=1/L and αk=2/(L+nμ)\alpha_{k}=2/(L+n\mu) as in FG methods. In Figure 3 we compare the performance of these various strategies to the performance of the SAG algorithm with our proposed line-search as well as the IAG and SAG algorithms when the best step-size is chosen in hindsight. In this plot we see the following trends:

Proposition 1 of Le Roux et al. : Using a step-size of αk=1/2nL\alpha_{k}=1/2nL performs poorly, and makes little progress compared to the other methods. This makes sense because Proposition 1 in Le Roux et al. implies that the convergence rate (in terms of effective passes through the data) under this step size will be similar to the basic gradient method, which is known to perform poorly unless the problem is very well-conditioned.

Proposition 2 of Le Roux et al. : Using a step-size of αk=1/2nμ\alpha_{k}=1/2n\mu performs extremely well on the data sets with p>np>n (middle row). In contrast, for the data sets with n>pn>p it often performs very poorly, and in some cases appears to diverge. This is consistent with Proposition 2 in Le Roux et al. , which shows a fast convergence rate under this step size only if certain conditions on {n,μ,L}\{n,\mu,L\} hold.

Theorem 1: Using a step-size of αk=1/16L\alpha_{k}=1/16L performs consistently better than the smaller step size αk=1/2nL\alpha_{k}=1/2nL, but in some cases it performs worse than αk=1/2nμ\alpha_{k}=1/2n\mu. However, in contrast to αk=1/2nμ\alpha_{k}=1/2n\mu, the step size αk=1/16L\alpha_{k}=1/16L always has reasonable performance.

Section 4.5: The step size of αk=1/L\alpha_{k}=1/L performs performs extremely well on the data sets with p>np>n, and performs better than the step sizes discussed above on all but one of the remaining data sets. The step size of αk=2/(L+nμ)\alpha_{k}=2/(L+n\mu) seems to perform the same or slightly better than using αk=1/L\alpha_{k}=1/L except on one data set where it performs poorly.

Line-Search: Using the line-search from Section 4.6 tends to perform as well or better than the various constant step size strategies, and tends to have similar performance to choosing the best step size in hindsight.

IAG vs. SAG: When choosing the best step size in hindsight, the SAG iterations tend to choose a much larger step size than the IAG iterations. The step sizes chosen for SAG were 100100 to 1000010000 times larger than the step sizes chosen by IAG, and always lead to better performance by several orders of magnitude.

4 Effect of mini-batches

As we discuss in Section 4.7, when using mini-batches within the SAG iterations there is a trade-off between the higher iteration cost of using mini-batches and the faster convergence rate obtained using mini-batches due to the possibility of using a smaller value of LL. In Figure 4, we compare (on the dense data sets) the excess sub-optimality as a function of the number of examples seen for various mini-batch sizes and the three step-size strategies 1/Lmax1/L_{\textrm{max}}, 1/Lmean1/L_{\textrm{mean}}, and 1/LHessian1/L_{\textrm{Hessian}} discussed in Section 4.7.

Several conclusions may be drawn from these experiments:

Even though Theorem 1 hints at a maximum mini-batch size of nμ2L\frac{n\mu}{2L} without loss of convergence speed, this is a very conservative estimate. In our experiments, the original value of nμL\frac{n\mu}{L} was on the order of 10510^{-5} and mini-batch sizes of up to 500 could be used without a loss in performance. Not only does this yield large memory storage gains, it would increase the computational efficiency of the algorithm when taking into account vectorization.

To achieve fast convergence, it is essential to use a larger step-size when larger mini-batches are used. For instance, in the case of the quantum dataset with a mini-batch size of 2000020000, we have 1Lmax=4.4104\frac{1}{L_{\textrm{max}}}=4.4\cdot 10^{-4}, 1Lmean=5.5102\frac{1}{L_{\textrm{mean}}}=5.5\cdot 10^{-2} and 1LHessian=3.7101\frac{1}{L_{\textrm{Hessian}}}=3.7\cdot 10^{-1}. In the case of the covertype dataset and for a mini-batch size of 2000020000, we have 1Lmax=2.1105\frac{1}{L_{\textrm{max}}}=2.1\cdot 10^{-5}, 1Lmean=6.2102\frac{1}{L_{\textrm{mean}}}=6.2\cdot 10^{-2} and 1LHessian=4.1101\frac{1}{L_{\textrm{Hessian}}}=4.1\cdot 10^{-1}.

5 Effect of non-uniform sampling

In our final experiment, we explored the effect of using the non-uniform sampling strategy discussed in Section 4.8. In Figure 5, we compare several of the SAG variants with uniform sampling to the following two methods based on non-uniform sampling:

SAG (Lipschitz): In this method we sample the functions in proporition to Li+cL_{i}+c, where LiL_{i} is the global Lipschitz constant of the corresponding fif_{i}^{\prime} and we set cc to the average of these constants, c=Lmean=(1/n)iLic=L_{\textrm{mean}}=(1/n)\sum_{i}L_{i}. Plugging in these values into the formula at the end of Section 4.8 and using LmaxL_{\textrm{max}} to denote the maximum LiL_{i} value, we set the step-size to αk=1/2Lmax+1/2Lmean\alpha_{k}=1/2L_{\textrm{max}}+1/2L_{\textrm{mean}}.

SAG-LS (Lipschitz): In this method we formed estimates of the quantities LiL_{i}, LmaxL_{\textrm{max}}, and LmeanL_{\textrm{mean}}. The estimator LmaxkL^{k}_{\textrm{max}} is computed in the same way as the SAG-LS method. To estimate each LiL_{i}, we keep track of an estimate LikL_{i}^{k} for each ii and we set LmeankL^{k}_{\textrm{mean}} to the average of the LikL_{i}^{k} values among the fif_{i} that we have sampled at least once. We set Lik=Lik1L_{i}^{k}=L_{i}^{k-1} if example ii was not selected and otherwise we initialize to Lik=Lik1/2L_{i}^{k}=L_{i}^{k-1}/2 and perform the line-search until we have a valid LikL_{i}^{k} (this means that LmeankL^{k}_{\textrm{mean}} will be approximately halved if we perform a full pass through the data set and never violate the inequality). To ensure that we do not ignore important unseen data points for too long, in this method we sample a previously unseen function with probability (nm)/n(n-m)/n, and otherwise we sample from the previously seen fif_{i} in proportion to Lik+LmeankL_{i}^{k}+L^{k}_{\textrm{mean}}. To prevent relying too much on our initially-poor estimate of LmeanL_{\textrm{mean}}, we use a step size of αk=nmnαmax+mnαmean\alpha_{k}=\frac{n-m}{n}\alpha_{\textrm{max}}+\frac{m}{n}\alpha_{\textrm{mean}}, where αmax=1/Lmaxk\alpha_{\textrm{max}}=1/L^{k}_{\textrm{max}} is the step-size we normally use with uniform sampling and αmean=1/2Lmaxk+1/2Lmeank\alpha_{\textrm{mean}}=1/2L^{k}_{\textrm{max}}+1/2L^{k}_{\textrm{mean}} is the step-size we use with the non-uniform sampling method, so that the method interpolates between these extremes until the entire data set has been sampled.

We make the following observations from these experiments:

SAG (1/L) vs. SAG (Lipschitz): With access to global quantities and a constant step size, the difference between uniform and non-uniform sampling was typically quite small. However, in some cases the non-uniform sampling method behaved much better (top row of Figure 5).

SAG-LS vs. SAG-LS (Lipschitz): When estimating the Lipschitz constants of the individual functions, the non-uniform sampling strategy often gave better performance. Indeed, the adaptive non-uniform sampling strategy gave solutions that are orders of magnitude more accurate than any method we examined for several of the data sets (e.g., the protein, covertype, and sido data sets) In the context of logistic regression, it makes sense that an adaptive sampling scheme could lead to better performance, as many correctly-classified data samples might have a very slowly-changing gradient near the solution, and thus they do not need to be sampled often.

Discussion

Since the first version of this work was published [Le Roux et al., 2012], there has been an explosion of interest in stochastic methods with improved convergence rates. In this section we first review other algorithms that have been discovered to have this property, and then we discuss the many possible variants on these basic algorithms that have been explored. As this is a very quickly-evolving area there are likely to be many new developments in the near future, but we note that this literature review is up to date as of January, 2015.

where each fif_{i} is convex and each fif_{i}^{\prime} is Lipschitz-continuous, by optimizing its Fenchel dual,

They consider applying exact coordinate optimization to a randomly selected coordinate. Using the primal-dual relationship x=1nλi=1Nxiaix=\frac{1}{n\lambda}\sum_{i=1}^{N}x_{i}a_{i}, they show a linear convergence rate in terms of the duality gap. Since there is one dual variable associated with each example ii, the iteration cost is independent of nn and thus the strategy has similar convergence properties to SAG (the memory requirements are identical in this context, see Section 4.1).

This result is related to the earlier result of [Nesterov, 2010], who shows a similar convergence rate for randomized coordinate descent.Local linear convergence rates of deterministic coordinate descent methods had been established much earlier [Luo and Tseng, 1992]. However, the result of Nesterov cannot be directly applied to the dual problem in general since the dual does not necessarily have a Lipschitz-continuous gradient. A similar result was also reported even earlier by Collins et al. without requiring that the gradient of the dual is Lipschitz-continuous, but this result again only applied to the dual objective. More recently, Shalev-Schwartz and Zhang [2013a] show linear convergence of SDCA in the more general setting

where AiA_{i} are matrices and rr is 11-strongly convex. They also relax the requirement of exact coordinate optimization, providing a variety of more practical alternatives. Further, in subsequent work they obtain a convergence rate in the convex case by adding an explicit strongly-convex regularizer to the problem [Shalev-Schwartz and Zhang, 2014].

A disadvantage of SDCA compared to the SAG algorithm is that the SDCA convergence rates depend on λ\lambda rather than the strong-convexity constant μ\mu. In the worst case we have μ=λ\mu=\lambda, but if μ\mu is much larger then the convergence rate of SAG is much faster. Further, even in cases where μ=λ\mu=\lambda, the convergence rate of SAG might be much faster if the iterates stay in local region with a higher strong-convexity constant. As an extreme example, due to local strong-convexity SAG might have a linear convergence rate in scenarios where SDCA has a sub-linear convergence. This subtle but practically important issue was a key focus in the recent work of [Agarwal and Bottou, 2014], and indeed the performance of SDCA was very poor on three of the test problems in our experiments. See Figure 2 (top left, top right, bottom right).

MISO: Mairal analyzes a very general surrogate optimization framework, that includes a wide vareity of existing algorithms. He also considers incremental algorithms in this framework, and specialized to the smooth and unconstraeind setting (with a ‘Lipschitz surrogate’) obtains an algorithm (MISO) that is very similar to the SAG algorithm,

where yiky^{k}_{i} is defined as in the SAG algorithm 6, and xikx_{i}^{k} is the parameter vector used to compute the corresponding yiky_{i}^{k}. Thus, instead of applying the SAG step to xkx^{k}, MISO applies the step to the average of the previous xikx_{i}^{k} values used to form the current yiky_{i}^{k} variables. Mairal shows that this algorithm also achieves an O(1/k)O(1/k) rate for convex objectives and a linear convergence rate for strongly-convex objectives.

However, MISO has the disadvantage that it not only requires storing the nn gradient values but also storing nn previous iterations (which are less likely to have a nice structure). Further, the linear convergence rate shown for MISO is substantially slower than the convergence rate shown in Theorem 1. In particular, the rate is more similar to the substantially slower Proposition 1 in our prior work [Le Roux et al., 2012]. Subsequent work on the MISO algorithm has shown an analogous result to Proposition 2 in our prior work; if the nn is sufficiently larger than LμL\mu, then using a step-size proportional to 1/μ1/\mu yields a faster convergence rate [Mairal, 2014, Defazio et al., 2014b]. However, using this step-size causes divergence if μ\mu is not sufficiently large.

SVRG: Another interesting framework that has been considered is known as ‘mixed optimization’, ‘stochastic variance-reduced gradient’ (SVRG), or ‘semi-stochastic gradient descent’ (S2GD) [Mahdavi and Jin, 2013, Johnson and Zhang, 2013, Zhang et al., 2013, Konečnỳ and Richtárik, 2013].Wang et al. consider a related algorithm that maintains an easily-computable approximation to the current gradient g(xk)g^{\prime}(x^{k}). Although this can improve the constants in the sublinear SG convergence rates, it does not improve the rates. Unlike FG methods that utilize the full gradient gg^{\prime} and SG methods that consider individual gradients fif_{i}^{\prime}, these mixed optimization methods combine the two. In particular, they use a (possibly regularized) iteration of the form

A disadvantage of the SVRG method is that it requires setting two parameters (rather than one) and the convergence rate depends in a non-trivial way on their interaction both with each other and with both the Lipschitz constant and the strong-convexity. As with SDCA, the dependence on the strong-convexity constant is notable. In particular, the algorithm can’t be applied without modification to general convex problems, and (although the effect is not as severe as it is with SDCA) the convergence rate may be substantially slower than SAG if there exists hidden strong convexity.

Self-Concordant Objectives: A surprising recent development is that Bach and Moulines have shown that the finite sum assumption is not required to obtain the O(1/k)O(1/k) convergence rate in the special case of least squares. They show that an averaged SG method achieves this rate, and give a 2-phase Newton-like SG method that achieves this rate under an assumption similar to self-concordance. However, this assumption does not hold in general for the class of problems considered here.

SAGA: One of the most recent developments in this area is the SAGA algorithm Defazio et al. [2014a]. This algorithm intelligently combines the updates used in the SAG and SVRG algorithms. This method maintains the appealing properties of SAG, but yields a simpler proof. However, the proof technique used in that work does not yield a simpler analysis of the original SAG algorithm.

Lower Bounds: There has been recent work on determining a lower bound on the convergence rate that can be expected for minimizing finite sums. Defazio et al. [2014b] show that the rate must be at least (11/n)(1-1/n) in the strongly-convex case, while Agarwal and Bottou establish a bound that also depends on the condition number L/μL/\mu.

2 Generalizations and Other Issues

Accelerated gradient: AFG methods are variants of the basic FG method that can obtain an O(1/k2)O(1/k^{2}) convergence rate for the general convex case, and faster linear convergence rates that depend on the square root of the condition number (L/μ)(\sqrt{L/\mu}) rather than the condition number (L/μ)(L/\mu) in the strongly-convex case [see Nesterov, 2004, §2.2.1]. For strongly-convex objectives, it has been shown that a mini-batch strategy can obtain a better dependence on the condition number in certain regimes for SDCA [Shalev-Shwartz and Zhang, 2013] and more recently for SVRG [Nitanda, 2014]. It is possible that similar arguments could hold for SAG algorithms, which could be advantageous over these methods for reasons discussed in the previous section.

Shalev-Schwartz and Zhang have also given an accelerated version of the SDCA method for ill-conditioned problems, that uses SDCA to solve a sequence of regularized problems up to a prescribed optimality. Since this procedure ultimately relies on the sequence of primal solutions, we can also accelerate SAG using a procedure like this. However, a difficulty with this procedure (whether SAG or SDCA are used) is the cost of measuring the optimality of the sub-problems.

Rather than using this inner-outer procedure, Lin et al. show that using a deterministic primal iteration (based on the full data set) allows one to construct a primal solution that has the accelerated rate from the result of an accelerated dual coordinate ascent method. Zhang and Xiao give a coordinate-wise variant of the accelerated primal-dual method of Chambolle and Pock that also achieves this rate. Based on these results, it is possible that accelerated versions of SAG could be developed that do not rely on an inner-outer procedure.

The convergence rates of these accelerated methods have the same form as the lower bound established by Agarwal and Bottou . However, Agarwal and Bottou point out that these accelerated convergence rates depend on λ\lambda rather than μ\mu. Thus, they conclude that the basic SAG algorithm may still be much faster on some problems than accelerated SDCA methods, and indeed show how to construct a problem where SAG is arbitrarily faster than accelerated SDCA. The possibility of developing an accelerated SAG algorithm that is adaptive to μ\mu remains open.

Proximal gradient and ADMM: It is becoming increasingly common to address problems of the form

where fif_{i} and gg satisfy our assumptions and rr is a general convex function that could be non-smooth or enforce that constraints are satisfied. Proximal-gradient methods for problems with this structure use iterations of the form

where the proxαk[y]{}_{\alpha_{k}}[y] operator is the solution to the proximity problem

Proximal-gradient and accelerated proximal-gradient methods are appealing for solving non-smooth optimization problems because they achieve the same convergence rates as FG methods for smooth optimization problems [Nesterov, 2007, Schmidt et al., 2011]. We have explored a variant of the proximal-gradient method where the average over the fi(xk)f_{i}^{\prime}(x^{k}) values is replaced by the SAG approximation (6). Although our analysis does not directly apply to this scenario, we believe that this proximal-SAG algorithm for composite non-smooth optimization achieves the same convergence rates as the SAG algorithm for smooth optimization; this is supported by the experiments of Xiao and Zhang . Indeed, there now exist proximal-gradient variants of SDCA [Shalev-Schwartz and Zhang, 2013a], MISO [Mairal, 2013], SVRG [Xiao and Zhang, 2014], and SAGA [Defazio et al., 2014a].

In cases where rr is composed with a linear function, we can consider approaches based on the alternating direction method of multipliers (ADMM). This has been explored for SDCA [Suzuki, 2014] and MISO [Zhong and Kwok, 2014], and a variant based on SAG is also likely to be possible.

Coordinate-wise: The key advantage of SG and SAG methods is that the iteration cost is independent of the number of functions nn. However, in many applications we may also be concerned with the dependence of the iteration cost on the number of variables pp. Randomized coordinate-wise methods offer linear convergence rates with an iteration cost that is linear in nn but independent of pp [Nesterov, 2010]. We can consider a variant of SAG whose iteration cost is independent of both nn and pp by using the update

to each coordinate jj of each vector yiy_{i} in the SAG algorithm, and where jkj_{k} is a sample from the set {1,2,,p}\{1,2,\dots,p\}. Konečnỳ et al. recently proposed an SVRG algorithm of this flavour.

Newton-like: In cases where gg is twice-differentiable, we can also consider Newton-like variants of the SAG algorithm,

where BkB^{k} is a positive-definite approximation to the inverse Hessian g(xk)g^{\prime\prime}(x^{k}). We would expect to obtain a faster convergence rate by using an appropriate choice of the sequence {Bk}\{B^{k}\}. However, in order to not increase the iteration cost these matrices should be designed to allow fast multiplication. For example, we could choose BkB^{k} to be diagonal, which would also preserve any sparsity present in the updates. Sohl-Dickstein et al. propose a quasi-Newton method in this class that shows impressive empirical results, although the iteration cost is much higher.

Relaxing Convexity Assumptions: It is likely that the convexity assumptions made in this work could be relaxed. For example, Gong and Ye show SVRG can obtain a linear convergence under weaker assumptions. Since SAG is adaptive to hidden strong-convexity, the assumptions needed for its linear convergence are likely even weaker. Further, the SAG algorithm may also be useful even if gg is non-convex (which is different than SDCA). In this case we expect that, similar to the IAG method [Blatt et al., 2007], the algorithm converges to a stationary point under very general conditions.

Non-Uniform Sampling: We have given an argument that non-uniform sampling should benefit the SAG algorithm, and shown empirically that it can lead to a substantial improvement. However, we have not yet given a full analysis of this scheme. Subsequent works have shown that the type of dependency we conjecture here (e.g., dependence on the average Lispschitz constant) can be achieved with non-uniform sampling in the context of SDCA [Qu et al., 2014, Zhao and Zhang, 2014], SVRG [Xiao and Zhang, 2014], and SAGA [Schmidt et al., 2015]

Step-size selection and termination criteria: The three major disadvantages of SG methods are: (i) the slow convergence rate, (ii) deciding when to terminate the algorithms, and (iii) choosing the step size while running the algorithm. This work shows that the SAG iterations achieve a much faster convergence rate, but the SAG iterations may also be advantageous in terms of termination criteria and choosing step sizes. In particular, the SAG iterations suggest a natural termination criterion; since the quantity dd in Algorithm 1 converges to f(xk)f^{\prime}(x^{k}) as xkxk1\|x^{k}-x^{k-1}\| converges to zero, we can use d\|d\| as an approximation of the optimality of xkx^{k} as the iterates converge. Regarding choosing the step-size, a disadvantage of a constant step-size strategy is that a step-size that is too large may cause divergence. But, we expect that it is possible to design line-search or trust-region strategies that avoid this issue. Such strategies might even lead to faster convergence for functions that are locally well-behaved around their optimum, as indicated in our experiments. Further, while SG methods require specifying a sequence of step sizes and mis-specifying this sequence can have a disastrous effect on the convergence rate [see Nemirovski et al., 2009, §2.1], our theory shows that the SAG iterations achieve a fast convergence rate for any sufficiently small constant step size, and our experiments indicate that a simple line-search gives strong performance.

Acknowledgements

We would like to thank the anonymous reviewers for their many useful comments. This work was partially supported by the European Research Council (SIERRA-ERC-239993) and a Google Research Award. Mark Schmidt is also supported by a postdoctoral fellowship from the Natural Sciences and Engineering Research Council of Canada.

Appendix A: Comparison of convergence rates

where to apply the SG or SAG method we can use

If we use bb to denote a vector containing the values bib_{i} and AA to denote a matrix withs rows aia_{i}, we can re-write this problem as

We can obtain the primal variables from the dual variables by the formula x=(1/λ)Ayx=(-1/\lambda)A^{\top}y. Convergence rates of different primal and dual algorithms are often expressed in terms of the following Lipschitz constants:

Here, we use MσM_{\sigma} to denote the maximum eigenvalue of AAA^{\top}A, MiM_{i} to denote the maximum squared row-norm maxi{ai2}\max_{i}\{\|a_{i}\|^{2}\}, and MjM_{j} to denote the maximum squared column-norm maxj{i=1n(ai)j2}\max_{j}\{\sum_{i=1}^{n}(a_{i})^{2}_{j}\}. We use gjg_{j}^{\prime} to refer to element of jj of gg^{\prime}, and similarly for did_{i}^{\prime}. The convergence rates will also depend on the primal and dual strong-convexity constants:

Here, mσm_{\sigma} is the minimum eigenvalue of AAA^{\top}A, and mσm_{\sigma}^{\prime} is the minimum eigenvalue of AAAA^{\top}.

Using a similar argument to [Nesterov, 2004, Theorem 2.1.15], if we use the basic FG method with a step size of 1/Lg1/L_{g}, then (f(xk)f(x))(f(x^{k})-f(x^{\ast})) converges to zero with rate

while a larger step-size of 2/(Lg+μg)2/(L_{g}+\mu_{g}) gives a faster rate of

and we see that the speed improvement is determined by how much smaller mσm_{\sigma} is than MσM_{\sigma}.

If we use the basic FG method on the dual problem with a step size of 1/Ld1/L_{d}, then (d(xk)d(x))(d(x^{k})-d(x^{\ast})) converges to zero with rate

and with a step-size of 2/(Ld+μd)2/(L_{d}+\mu_{d}) the rate is

Thus, whether we can solve the primal or dual method faster depends on mσm_{\sigma} and mσm_{\sigma}^{\prime}. In the over-determined case where AA has independent columns, a primal method should be preferred. In the under-determined case where AA has independent rows, we can solve the dual more efficiently. However, we note that a convergence rate on the dual objective does not necessarily yield the same rate in the primal objective. If AA is invertible (so that mσ=mσm_{\sigma}=m_{\sigma}^{\prime}) or it has neither independent columns nor independent rows (so that mσ=mσ=0m_{\sigma}=m_{\sigma}^{\prime}=0), then there is no difference between the primal and dual rates.

The AFG method achieves a faster rate. Applied to the primal with a step-size of 1/Lg1/L_{g} it has a rate of [Nesterov, 2004, Theorem 2.2.2]

and applied to the dual with a step-size of 1/Ld1/L_{d} it has a rate of

A.4 Coordinate-Descent Methods

The cost of applying one iteration of an FG method is O(np)O(np). For this same cost we could apply pp iterations of a coordinate descent method to the primal, assuming that selecting the coordinate to update has a cost of O(1)O(1). If we select coordinates uniformly at random, then the convergence rate for pp iterations of coordinate descent with a step-size of 1/Lgj1/L_{g}^{j} is [Nesterov, 2010, Theorem 2]

Here, we see that applying a coordinate-descent method can be much more efficient than an FG method if Mj<<MσM_{j}<<M_{\sigma}. This can happen, for example, when the number of variables pp is much larger than the number of examples nn. Further, it is possible for coordinate descent to be faster than the AFG method if the difference between MσM_{\sigma} and MjM_{j} is sufficiently large.

For the O(np)O(np) cost of one iteration of the FG method, we could alternately perform nn iterations of coordinate descent on the dual problem. With a step size of 1/Ldi1/L_{d}^{i} this would obtain a rate on the dual objective of

which will be faster than the dual FG method if Mi<<MσM_{i}<<M_{\sigma}. This can happen, for example, when the number of examples nn is much larger than the number of variables pp. The difference between the primal and dual coordinate methods depends on MiM_{i} compared to MjM_{j} and mσm_{\sigma} compared to mσm_{\sigma}^{\prime}.

The analysis of SDCA gives a convergence rate in terms of the duality gap (and hence the primal) rather than simply in terms of the dual Shalev-Schwartz and Zhang [2013b]. Using the SDCA analysis, we obtain a rate in the duality gap of

which is the same as the dual rate given above when mσ=0m_{\sigma}^{\prime}=0, and is slower otherwise.

A.5 Stochastic Average Gradient

For the O(np)O(np) cost of one iteration of the FG method, we can perform nn iterations of SAG. With a step size of 1/16L1/16L, performing nn iterations of the SAG algorithm has a rate of

In the case where n2Lgi/μgn\leqslant 2L_{g}^{i}/\mu_{g}, this is most similar to the rate obtained with the dual coordinate descent method, though there is a constant factor of 1616 and the rate depends on the primal strong convexity constant mσm_{\sigma} rather than the dual mσm_{\sigma}^{\prime}. However, in this case the SAG rate will often be faster because the term nλn\lambda in the denominator is replaced by λ\lambda. An interesting aspect of the SAG rate is that unlike other methods the convergence rate of SAG reaches a limit: the convergence rate improves as nn grows and as the condition number Lgi/μgL_{g}^{i}/\mu_{g} decreases but no further improvement in the convergence rate is obtained beyond the point where n=2Lgi/μgn=2L_{g}^{i}/\mu_{g}. Thus, while SAG may not be the method of choice for very well-conditioned problems, it is a robust choice as it will always tend to be among the best methods in most situations.

Appendix B: Proof of the theorem

In this Appendix, we give the proof of Theorem 1.

Recall that the SAG algorithm performs the recursion (for k1k\geqslant 1):

where an integer iki_{k} is selected uniformly at random from {1,,n}\{1,\dots,n\} and we set

and we will also find it convenient to use

With this notation, note that g(x)=1nef(x)g^{\prime}(x)=\frac{1}{n}e^{\top}f^{\prime}(x) and xk=xk1αneykx^{k}=x^{k-1}-\frac{\alpha}{n}e^{\top}y^{k}.

For a square n×nn\times n matrix MM, we use diag(M)\operatorname{diag}(M) to denote a vector of size nn composed of the diagonal of MM, while for a vector mm of dimension nn, Diag(m)\mathop{\rm Diag}(m) is the n×nn\times n diagonal matrix with mm on its diagonal. Thus Diag(diag(M))\mathop{\rm Diag}(\operatorname{diag}(M)) is a diagonal matrix with the diagonal elements of MM on its diagonal, and diag(Diag(m))=m\operatorname{diag}(\mathop{\rm Diag}(m))=m.

In addition, if MM is a tp×tptp\times tp matrix and mm is a tp×ptp\times p matrix, then we will use the convention that:

diag(M)\operatorname{diag}(M) is the tp×ptp\times p matrix being the concatenation of the tt (p×pp\times p)-blocks on the diagonal of MM;

Diag(m)\mathop{\rm Diag}(m) is the tp×tptp\times tp block-diagonal matrix whose (p×pp\times p)-blocks on the diagonal are equal to the (p×pp\times p)-blocks of mm.

Finally, Fk\mathcal{F}_{k} will denote the σ\sigma-field of information up to (and including) time kk. In other words, Fk\mathcal{F}_{k} is the σ\sigma-field generated by i1,,iki_{1},\dots,i_{k}. Given Fk1\mathcal{F}_{k-1}, we can write the expected values of the yky^{k} and xkx^{k} variables in the SAG algorithm as

The Lyapunov function contains a term of the form (\theta^{k}-\theta^{*})^{\top}\left(\begin{array}[]{cc}A&B\\ B^{\top}&C\end{array}\right)(\theta^{k}-\theta^{*}) for some values of AA, BB and CC. Our analysis makes use of the following lemma, derived in [Le Roux et al., 2012, Appendix A.4], showing how this quadratic form evolves through the SAG recursion in terms of the elements of θk1\theta^{k-1}.

Our proof also uses the following lemma, giving the inverse of a highly-structured matrix that arises in the analysis.

It is sufficient to verify that the inverse of the left side times the right side is equal to the identity matrix,

where we use that eTe=nIe^{T}e=nI, giving (1neeT)(1neeT)=1neeT\left(\frac{1}{n}ee^{T}\right)\left(\frac{1}{n}ee^{T}\right)=\frac{1}{n}ee^{T}. ∎

B.4 General Lyapunov function

For some h0h\geqslant 0, we consider a Lyapunov function of the form

To achieve the desired convergence rate, our goal is to show for appropriate values of δ0\delta\geqslant 0 and γ0\gamma\geqslant 0 that

almost surely. Thus, in addition to the algorithm parameter α\alpha, there are 2 parameters of the result {γ\gamma, δ\delta} and 6 parameters of the Lyapunov function {a1a_{1}, a2a_{2}, bb, cc, dd, hh}.

B.5 Lyapunov upper bound

To show (a), we derive an upper bound on the quantity

Lemma 1 gives an expression for the expectation over the quadratic term in the last line, and we will have

We also use that strong convexity of gg implies

Further, using the Lipschitz-continuity of gg^{\prime} and the identity xk+deyk=xk1+(dαn)eykx^{k}+de^{\top}y^{k}=x^{k-1}+(d-\frac{\alpha}{n})e^{\top}y^{k}, followed by applying Lemma 1 using A=eeA=ee^{\top} to expand the last term, gives us the bound

We combine the expressions from Table 3 using the stated groups in the last column. For example, we add together all the terms in group to obtain a scalar coefficient B0B_{0} for terms in this group. We do this in the straightforward way (using that g(x)=1nef(x)g^{\prime}(x)=\frac{1}{n}e^{\top}f^{\prime}(x) and g(x)=0g^{\prime}(x^{\ast})=0) for groups , 11, 22, 55, 88, and 99. For groups 66 and 77 we split into terms that have an identity matrix MM (B6B_{6}) and terms with an eeee^{\top} term (B7B_{7}). Similarly, B3B_{3} comes from terms with an identity matrix in MM and B4B_{4} correspond to terms with an eeee^{\top} term. Combining expressions in this way (and adding/subtracting (B3/n)ee(B_{3}/n)ee^{\top}) gives:

In this expression, yk1yy^{k-1}-y^{\ast} only appears through a quadratic form, with a quadratic term -(y^{k-1}-y^{\ast})^{\top}\bigg{[}B_{3}(I-\frac{1}{n}ee^{\top})+B_{4}\frac{1}{n}ee^{\top}\bigg{]}(y^{k-1}-y^{\ast}). If B30B_{3}\geqslant 0 and B40B_{4}\geqslant 0, then this quadratic term is non-positive and we may maximize in closed form with respect to yk1y^{k-1} to obtain an upper bound. Assuming that B30B_{3}\neq 0 and B40B_{4}\neq 0, we use Lemma 2, \bigg{[}B_{3}(I-\frac{1}{n}ee^{\top})+B_{4}\frac{1}{n}ee^{\top}\bigg{]}^{-1}=\bigg{[}B_{3}^{-1}(I-\frac{1}{n}ee^{\top})+B_{4}^{-1}\frac{1}{n}ee^{\top}\bigg{]}, to obtain:

By using Lipschitz continuity of each fif_{i}^{\prime} to observe that f(xk1)f(x)2=i=1nfi(xk1)fi(x)2Li=1n(fi(xk1)fi(x))(xk1x)=nLg(xk1)(xk1x)\|f^{\prime}(x^{k-1})-f^{\prime}(x^{\ast})\|^{2}=\sum_{i=1}^{n}\|f_{i}^{\prime}(x^{k-1})-f_{i}^{\prime}(x^{\ast})\|^{2}\leqslant L\sum_{i=1}^{n}(f_{i}^{\prime}(x^{k-1})-f_{i}^{\prime}(x^{\ast}))^{\top}(x^{k-1}-x^{\ast})=nLg^{\prime}(x^{k-1})^{\top}(x^{k-1}-x^{\ast}), we obtain:

In order to show the decrease of the Lyapunov function, we need to show that C0xk1x2+C1(xk1x)g(xk1)+C2g(xk1)20C_{0}\|x^{k-1}-x^{\ast}\|^{2}+C_{1}(x^{k-1}-x^{\ast})^{\top}g^{\prime}(x^{k-1})+C_{2}\|g^{\prime}(x^{k-1})\|^{2}\leqslant 0 for all xk1x^{k-1}.

If we assume that C10C_{1}\leqslant 0 and C20C_{2}\leqslant 0, then we have by strong-convexity

By again using the strong convexity of gg (as in (18)), we have

using the smoothness of gg and the assumption that 2hγ02h-\gamma\leqslant 0.

Thus, in order for L(θk)\mathcal{L}(\theta^{k}) to dominate g(xk)g(x)g(x^{k})-g(x^{*}) we require the additional constraints

B.7 Finding constants

Given the algorithm and result parameters (α,γ,δ)(\alpha,\gamma,\delta), we have the following constraints:

We also require C10C_{1}\leqslant 0, but this will follow from Constraint 8 since we will have C30C_{3}\geqslant 0. All of the constraints above are convex in a1a_{1}, a2a_{2}, bb, cc, dd, hh. Thus, given candidate values for the remaining values, the feasibility of these constraints may be checked using a numerical toolbox (as a second-order cone program). However, these parameters should be functions of nn and should be valid for all μ\mu. Thus, given a sampling of values of μ\mu and nn, representing these parameters as polynomials in 1/n1/n, the candidate functions may be found through a second-order cone programs.

By experimenting with this strategy, we have come up with the following values for the constants:

We will assume n>1n>1, since the result for n=1n=1 follows because SAG is equivalent to gradient descent in this case. Under this assumption, we see that Constraints 1 and 2 are satisfied by the above parameterizaiton. Below we verify that Constraints 3-9 are also satisfied using symbolic computations in Matlab (the code used in this section is available on the first author’s webpage). Specifically, we parameterize the constraints as polynomials and then verify that the polynomials are positive over an appropriate interval using the function below (which simply checks that no roots of the polynomial PP lie in the interval (x1,x2)(x_{1},x_{2}), and that PP is positive at the mid-point of the interval).

B.8 Verifying the result

To verify the result under these constants, we consider whether δ=1/8n\delta=1/8n or μ/16L\mu/16L. In B4B_{4}, we discard the term of the form (1δ)μhd2(1-\delta)\mu hd^{2}, which does not impact the validity of our result. Moreover, without loss of generality, we may assume L=1L=1 and μ\mu\in.

In this situation, it suffices to show the result for μ=2/n\mu=2/n since, (a) if a function is μ\mu-strongly convex, then it is μ\mu^{\prime}-strongly convex for all smaller μ\mu^{\prime}, and (b) the final inequality does not involve μ\mu. All constraints (when properly multiplied where appropriate by B3B4B_{3}B_{4}) can be written as rational functions of x=1/nx=1/n:

We only need to check the positivity of these polynomials in xx. We do this using the function above, via the script below. Note that we also have B3>0B_{3}>0 and B4>0B_{4}>0 for n>1n>1.

B.8.2 Ill-conditioned problems (μ⩽2/n𝜇2𝑛\mu\leqslant 2/n)

We consider the variables x=1/n[0,1/2]x=1/n\in[0,1/2] and y=nμ/2y=n\mu/2\in, so that μ=2y/n\mu=2y/n. We may express all quantities using xx and yy. The difficulty here is that we have two variables. We first check the dependency in terms of yy of the expressions B3B_{3}, B4B_{4}, and B6+B7B_{6}+B_{7}. These are univariate polynomials with an affine dependence on yy, so their positivity can be checked by checking positivity for y=1y=1 or y=0y=0. Again making use of symbolic computation, we can deduce that

We include below the additional commands to verify these properties:

Given the monotonicity of the bounds in B3B_{3} and B4B_{4}, we only need to check our results for the smaller values of B3B_{3} and B4B_{4}, i.e., for y=1y=1. Similarly, because of the monotonicity in the term (B6+B7)2(B_{6}+B_{7})^{2}, we may replace (B6+B7)2(B_{6}+B_{7})^{2} by (B6+B7)(B_{6}+B_{7}) times its upper-bound (i.e., its value at y=1y=1). Also note that B5B_{5} is divisble by yy, and we have B3>0B_{3}>0 and B4>0B_{4}>0 for n>1n>1. Using this, we can show that Constraints 5-9 are satisfied by checking the positivity of the following polynomials:

Constraints 5-7 are affine in yy and we thus only need to check positivity for y=0y=0 and y=1y=1.

The other two expressions (Constraints 8 and 9) are more complicated as there are second-order polynomials in yy. If n5n\geqslant 5, they have positive second derivatives, negative derivatives at y=0y=0 and y=1y=1, and positive values at y=1y=1. They are thus positive, which is verified by the code below.

For the remaining scenarios where n{2,3,4}n\in\{2,3,4\}, we have a fixed xx so we can check the positivity of the polynomials in yy. A Matlab script that does the steps above in addition to checking these not-particularly-interesting cases is available from the first author’s web page.

B.9 Convergence Rate

In the strongly-convex case (δ>0\delta>0), combining these gives us the linear convergence rate

In the convex case (μ=0\mu=0 and δ=0\delta=0), we have by convexity that

Finally, by Jensen’s inequality note that

so with xˉk=1ki=0k1xi\bar{x}^{k}=\frac{1}{k}\sum_{i=0}^{k-1}x^{i} we have

B.10 Intial values of Lyapunov function

To obtain the results of Theorem 1, all that remains is computing the initial value of the Lyapunov function for the two initializations.

Plugging in the values of our parameters,

and using σ2=1nifi(x)2\sigma^{2}=\frac{1}{n}\sum_{i}\|f_{i}^{\prime}(x^{\ast})\|^{2} we obtain (noting that ef(x)=0e^{\top}f^{\prime}(x^{\ast})=0)

B.10.2 Initialization with average gradient

If we initialize with yi0=yi0=fi(x0)(1/n)ifi(x0)y_{i}^{0}=y_{i}^{0}=f_{i}^{\prime}(x^{0})-(1/n)\sum_{i}f_{i}^{\prime}(x^{0}) we have, by noting that we still have (y0)e=0(y^{0})^{\top}e=0,

By observing that y0=f(x0)eg(x0)y^{0}=f^{\prime}(x^{0})-eg^{\prime}(x^{0}) and using [Nesterov, 2004, Equations 2.17], we can bound the norm in the second term as follows:

Using this in the Lyapunov function we obtain

References