Non-Uniform Stochastic Average Gradient Method for Training Conditional Random Fields

Mark Schmidt, Reza Babanezhad, Mohamed Osama Ahmed, Aaron Defazio, Ann Clifton, Anoop Sarkar

Introduction

Conditional random fields (CRFs) (Lafferty et al., 2001) are a ubiquitous tool in natural language processing. They are used for part-of-speech tagging (McCallum et al., 2003), semantic role labeling (Cohn and Blunsom, 2005), topic modeling (Zhu and Xing, 2010), information extraction (Peng and McCallum, 2006), shallow parsing (Sha and Pereira, 2003), named-entity recognition (Settles, 2004), as well as a host of other applications in natural language processing and in other fields such as computer vision (Nowozin and Lampert, 2011). Similar to generative Markov random field (MRF) models, CRFs allow us to model probabilistic dependencies between output variables. The key advantage of discriminative CRF models is the ability to use a very high-dimensional feature set, without explicitly building a model for these features (as required by MRF models). Despite the widespread use of CRFs, a major disadvantage of these models is that they can be very slow to train and the time needed for numerical optimization in CRF models remains a bottleneck in many applications.

Due to the high cost of evaluating the CRF objective function on even a single training example, it is now common to train CRFs using stochastic gradient methods (Vishwanathan et al., 2006). These methods are advantageous over deterministic methods because on each iteration they only require computing the gradient of a single example (and not all example as in deterministic methods). Thus, if we have a data set with $n$ training examples, the iterations of stochastic gradient methods are $n$ times faster than deterministic methods. However, the number of stochastic gradient iterations required might be very high. This has been studied in the optimization community, which considers the problem of finding the minimum number of iterations $t$ so that we can guarantee that we reach an accuracy of $\epsilon$ , meaning that

For problems with a finite number of training examples, Le Roux et al. (2012) recently proposed the stochastic average gradient (SAG) algorithm which combines the advantages of deterministic and stochastic methods: it only requires evaluating a single randomly-chosen training example on each iteration, and only requires $O(\log(1/\epsilon))$ iterations to reach an accuracy of $\epsilon$ . Beyond this faster convergence rate, the SAG method also allows us to address two issues that have traditionally frustrated users of stochastic gradient methods: setting the step-size and deciding when to stop. Implementations of the SAG method use both an adaptive step-size procedure and a cheaply-computable criterion for deciding when to stop. Le Roux et al. (2012) show impressive empirical performance of the SAG algorithm for binary classification.

This is the first work to apply a SAG algorithm to train CRFs. We show that tracking marginals in the CRF can drastically reduce the SAG method’s huge memory requirement. We also give a non-uniform sampling (NUS) strategy that adaptively estimates how frequently we should sample each data point, and we show that the SAG-like algorithm of Defazio et al. (2014) converges under any NUS strategy while a particular NUS strategy achieves a faster rate. Our experiments compare the SAG algorithm with a variety of competing deterministic, stochastic, and semi-stochastic methods on benchmark data sets for four common tasks: part-of-speech tagging, named entity recognition, shallow parsing, and optical character recognition. Our results indicate that the SAG algorithm with NUS often outperforms previous methods by an order of magnitude in terms of the training objective and, despite not requiring us to tune the step-size, performs as well or better than optimally tuned stochastic gradient methods in terms of the test error.

Conditional Random Fields

CRFs model the conditional probability of a structured output $y\in\mathcal{Y}$ (such as a sequence of labels) given an input $x\in\mathcal{X}$ (such as a sequence of words) based on features $F(x,y)$ and parameters $w$ using

where $\lambda>0$ is the strength of the regularization parameter. Unfortunately, evaluating $\log p(y_{i}|x_{i},w)$ is expensive due to the summation over all possible configurations $y^{\prime}$ . For example, in chain-structured models the forward-backward algorithm is used to compute $\log p(y_{i}|x_{i},w)$ and its gradient. A second problem with solving (2) is that the number of training examples $n$ in applications is constantly-growing, and thus we would like to use methods that only require a few passes through the data set.

Related Work

Lafferty et al. (2001) proposed an iterative scaling algorithm to solve problem (2), but this proved to be inferior to generic deterministic optimization strategies like the limited-memory quasi-Newton algorithm L-BFGS (Wallach, 2002; Sha and Pereira, 2003). The bottleneck in these methods is that we must evaluate $\log p(y_{i}|x_{i},w)$ and its gradient for all $n$ training examples on every iteration. This is very expensive for problems where $n$ is very large, so to deal with this problem stochastic gradient methods were examined (Vishwanathan et al., 2006; Finkel et al., 2008). However, traditional stochastic gradient methods require $O(1/\epsilon)$ iterations rather than the much smaller $O(\log(1/\epsilon))$ required by deterministic methods.

There have been several attempts at improving the cost of deterministic methods or the convergence rate of stochastic methods. For example, the exponentiated gradient method of Collins et al. (2008) processes the data online and only requires $O(\log(1/\epsilon))$ iterations to reach an accuracy of $\epsilon$ in terms of the dual objective. However, this does not guarantee good performance in terms of the primal objective or the weight vector. Although this method is highly-effective if $\lambda$ is very large, our experiments and the experiments of others show that the performance of online exponentiated gradient can degrade substantially if a small value of $\lambda$ is used (which may be required to achieve the best test error), see Collins et al. (2008, Figures 5-6 and Table 3) and Lacoste-Julien et al. (2013, Figure 1). In contrast, SAG degrades more gracefully as $\lambda$ becomes small, even achieving a convergence rate faster than classic SG methods when $\lambda=0$ (Schmidt et al., 2013). Lavergne et al. (2010) consider using multiple processors and vectorized computation to reduce the high iteration cost of quasi-Newton methods, but when $n$ is enormous these methods still have a high iteration cost. Friedlander and Schmidt (2012) explore a hybrid deterministic-stochastic method that slowly grows the number of examples that are considered in order to achieve an $O(\log(1/\epsilon))$ convergence rate with a decreased cost compared to deterministic methods.

Below we state the convergence rates of different methods for training CRFs, including the fastest known rates for deterministic algorithms (like L-BFGS and accelerated gradient) (Nesterov, 2004), stochastic algorithms (like [averaged] stochastic gradient and AdaGrad) (Ghadimi and Lan, 2012), online exponentiated gradient, and SAG. Here $L$ is the Lipschitz constant of the gradient of the objective, $\mu$ is the strong-convexity constant (and we have $\lambda\leq\mu\leq L$ ), and $\sigma^{2}$ bounds the variance of the gradients.

Stochastic Average Gradient

Le Roux et al. (2012) introduce the SAG algorithm, a simple method with the low iteration cost of stochastic gradient methods but that only requires $O(\log(1/\epsilon))$ iterations. To motivate this new algorithm, we write the classic gradient descent iteration as

where $\alpha$ is the step-size and at each iteration we set the ‘slope’ variables $s_{i}^{t}$ to the gradient with respect to training example $i$ at $w^{t}$ , so that $s_{i}^{t}=-\nabla\log p(y_{i}|x_{i},w^{t})+\lambda w^{t}$ . The SAG algorithm uses this same iteration, but instead of updating $s_{i}^{t}$ for all $n$ data points on every iterations, it simply sets $s_{i}^{t}=-\nabla\log p(y_{i}|x_{i},w^{t})+\lambda w^{t}$ for one randomly chosen data point and keeps the remaining $s_{i}^{t}$ at their value from the previous iteration. Thus the SAG algorithm is a randomized version of the gradient algorithm where we use the gradient of each example from the last iteration where it was selected. The surprising aspect of the work of Le Roux et al. (2012) is that this simple delayed gradient algorithm achieves a similar convergence rate to the classic full gradient algorithm despite the iterations being $n$ times faster.

Unfortunately, a major problem with applying (3) to CRFs is the requirement to store the $s_{i}^{t}$ . While the CRF gradients $\nabla\log p(y_{i}|x_{i},w^{t})$ have a nice structure (see Section 4.2), $s_{i}^{t}$ includes $\lambda w^{t}$ for some previous $t$ , which is dense and unstructured. To get around this issue, instead of using (3) we use the following SAG-like update (Le Roux et al., 2012, Section 4)

where $g_{i}^{t}$ is the value of $-\nabla\log p(y_{i}|x_{i},w^{k})$ for the last iteration $k$ where $i$ was selected and $d$ is the sum of the $g_{i}^{t}$ over all $i$ . Thus, this update uses the exact gradient of the regularizer and only uses an approximation for the (structured) CRF log-likelihood gradients. Since we don’t yet have any information about these log-likelihoods at the start, we initialize the algorithm by setting $g_{i}^{0}=0$ . But to compensate for this, we track the number of examples seen $m$ , and normalize $d$ by $m$ in the update (instead of $n$ ). In Algorithm 1, we summarize this variant of the SAG algorithm for training CRFs.If we solve the problem for a sequence of regularization parameters, we can obtain better performance by warm-starting $g_{i}^{0}$ , $d$ , and $m$ .

In many applications of CRFs the $g_{i}^{t}$ are very sparse, and we would like to take advantage of this as in stochastic gradient methods. Fortunately, we can implement (4) without using dense vector operations by using the representation $w^{t}=\beta^{t}v^{t}$ for a scalar $\beta^{t}$ and a vector $v^{t}$ , and using ‘lazy updates’ that apply $d$ repeatedly to an individual variable when it is needed (Le Roux et al., 2012).

Also following Le Roux et al. (2012), we set the step-size to $\alpha=1/L$ , where $L$ is an approximation to the maximum Lipschitz constant of the gradients. This is the smallest number $L$ such that

for all $i$ , $w$ , and $v$ . This quantity is a bound on how fast the gradient can change as we change the weight vector. The Lipschitz constant with respect to the gradient of the regularizer is simply $\lambda$ . This gives $L\leq L_{g}+\lambda$ , where $L_{g}$ is the Lipschitz constant of the gradient of the log-likelihood. Unfortunately, $L_{g}$ depends on the covariance of the CRF and is typically too expensive to compute. To avoid this computation, as in Le Roux et al. (2012) we approximate $L_{g}$ in an online fashion using the standard backtracking line-search given by Algorithm 2 (Beck and Teboulle, 2009). The test used in this algorithm is faster than testing (5), since it uses function values (which only require the forward algorithm for CRFs) rather than gradient values (which require the forward and backward steps). Algorithm 2 monotonically increases $L_{g}$ , but we also slowly decrease it in Algorithm 1 in order to allow the possibility that we can use a more aggressive step-size as we approach the solution.

Since the solution is the only stationary point, we must have $\nabla f(w^{t})=0$ at the solution. Further, the value $\frac{1}{n}d+\lambda w^{t}$ converges to $\nabla f(w^{t})$ so we can use the size of this value to decide when to stop the algorithm (although we also require that $m=n$ to avoid premature stopping before we have seen the full data set). This is in contrast to classic stochastic gradient methods, where the step-size must go to zero and it is therefore difficult to decide if the algorithm is close to the optimal value or if we simply require a small step-size to continue making progress.

2 Reducing the Memory Requirements

Even if the gradients $g_{i}^{t}$ are not sparse, we can often reduce the memory requirements of Algorithm 1 because it is known that the CRF gradients only depend on $w$ through marginals of the features. Specifically, the gradient of the log-likelihood under model (1) with respect to feature $j$ is given by

Notice that Algorithm 1 only depends on the old gradient through its difference with the new gradient (line 10), which in this example gives

where $w$ is the current parameter vector and $w_{\text{old}}$ is the old parameter vector. Thus, to perform this calculation the only thing we need to know about $w_{\text{old}}$ is the unary marginal $p(y_{k}=s|x,w_{\text{old}})$ , which will be shared across features that only depend on the event that $y_{k}=s$ . Similarly, features that depend on pairs of values in $y$ will need to store pairwise marginals, $p(y_{k}=s,y_{k}^{\prime}=s^{\prime}|x,w_{\text{old}})$ . For general pairwise graphical model structures, the memory requirements to store these marginals will thus be $O(VK+EK^{2})$ , where $V$ is the number of vertices and $E$ is the number of edges. This can be an enormous reduction since it does not depend on the number of features. Further, since computing these marginals is a by-product of computing the gradient, this potentially-enormous reduction in the memory requirements comes at no extra computational cost.

Non-Uniform Sampling

Recently, several works show that we can improve the convergence rates of randomized optimization algorithms by using non-uniform sampling (NUS) schemes. This includes randomized Kaczmarz (Strohmer and Vershynin, 2009), randomized coordinate descent (Nesterov, 2012), and stochastic gradient methods (Needell et al., 2014). The key idea behind all of these NUS strategies is to bias the sampling towards the Lipschitz constants of the gradients, so that gradients that change quickly get sampled more often and gradients that change slowly get sampled less often. Specifically, we maintain a Lipschitz constant $L_{i}$ for each training example $i$ and, instead of the usual sampling strategy $p_{i}=1/n$ , we bias towards the distribution $p_{i}=L_{i}/\sum_{j}L_{j}$ . In these various contexts, NUS allows us to improve the dependence on the values $L_{i}$ in the convergence rate, since the NUS methods depend on $\bar{L}=(1/n)\sum_{j}L_{j}$ , which may be substantially smaller than the usual dependence on $L=\max_{j}\{L_{j}\}$ . Schmidt et al. (2013) argue that faster convergence rates might be achieved with NUS for SAG since it allows a larger step size $\alpha$ that depends on $\bar{L}$ instead of $L$ .An interesting difference between the SAG update with NUS and NUS for stochastic gradient methods is that the SAG update does not seem to need to decrease the step-size for frequently-sampled examples (since the SAG update does not rely on using an unbiased gradient estimate).

The scheme for SAG proposed by Schmidt et al. (2013, Section 5.5) uses a fairly complicated adaptive NUS scheme and step-size, but the key ingredient is estimating each constant $L_{i}$ using Algorithm 2. Our experiments show this method often already improves on state of the art methods for training CRFs by a substantial margin, but we found we could obtain improved performance for training CRFs using the following simple NUS scheme for SAG: as in Needell et al. (2014), with probability $0.5$ choose $i$ uniformly and with probability $0.5$ sample $i$ with probability $L_{i}/(\sum_{j}L_{j})$ (restricted to the examples we have previously seen).Needell et al. (2014) analyze the basic stochastic gradient method and thus require $O(1/\epsilon)$ iterations. We also use a step-size of $\alpha=\frac{1}{2}\left(1/L+1/\bar{L}\right)$ , since the faster convergence rate with NUS is due to the ability to use a larger step-size than $1/L$ . This simple step-size and sampling scheme contrasts with the more complicated choices described by Schmidt et al. (2013, Section 5.5), that make the degree of non-uniformity grow with the number of examples seen $m$ . This prior work initializes each $L_{i}$ to $1$ , and updates $L_{i}$ to $0.5L_{i}$ each subsequent time an example is chosen. In the context of CRFs, this leads to a large number of expensive backtracking iterations. To avoid this, we initialize $L_{i}$ with $0.5\bar{L}$ the first time an example is chosen, and decrease $L_{i}$ to $0.9L_{i}$ each time it is subsequently chosen. Allowing the $L_{i}$ to decrase seems crucial to obtaining the best practical performance of the method, as it allows the algorithm to take bigger step sizes if the values of $L_{i}$ are small near the solution.

Schmidt et al. (2013) give an intuitive but non-rigorous motivation for using NUS in SAG. More recently, Xiao and Zhang (2014) show that NUS gives a dependence on $\bar{L}$ in the context of a related algorithm that uses occasional full passes through the data (which substantially simplifies the analysis). Below, we analyze a NUS extension of the SAGA algorithm of Defazio et al. (2014), which does not require full passes through the data and has similar performance to SAG in practice but is much easier to analyze.

Let the sequences $\{w^{t}\}$ and $\{s_{j}^{t}\}$ be defined by

where $j_{t}$ is chosen with probability $p_{j}$ .

(a) If $r_{t}$ is set to $j_{t}$ , then with $\alpha=\frac{np_{\text{min}}}{4L+n\mu}$ we have

where $p_{\text{min}}=\min_{i}\{p_{i}\}$ and

(b) If $p_{j}=\frac{L_{j}}{\sum_{i=1}^{n}L_{i}}$ and $r_{t}$ is chosen uniformly at random, then with $\alpha=\frac{1}{4\bar{L}}$ we have

This result (which we prove in Appendix A and B) shows that SAGA has (a) a linear convergence rate for any NUS scheme where $p_{i}>0$ for all $i$ , and (b) a rate depending on $\bar{L}$ by sampling proportional to the Lipschitz constants and also generating a uniform sample. However, (a) achieves the fastest rate when $p_{i}=1/n$ while (b) requires two samples on each iteration. We were not able to show a faster rate using only one sample on each iteration as used in our implementation.

2 Line-Search Skipping

To reduce the number of function evaluations required by the NUS strategy, we also explored a line-search skipping strategy. The general idea is to consider skipping the line-search for example $i$ if the line-search criterion was previously satisfied for example $i$ without backtracking. Specifically, if the line-search criterion was satisfied $\xi$ consecutive times for example $i$ (without backtracking), then we do not do the line-search on the next $2^{\xi-1}$ times example $i$ is selected (we also do not multiply $L_{i}$ by $0.9$ on these iterations). This drastically reduces the number of function evaluations required in the later iterations.

Experiments

We compared a wide variety of approaches on four CRF training tasks: the optical character recognition (OCR) dataset of Taskar et al. (2003), the CoNLL-2000 shallow parse chunking dataset,http://www.cnts.ua.ac.be/conll2000/chunking the CoNLL-2002 Dutch named-entity recognition dataset,http://www.cnts.ua.ac.be/conll2002/ner and a part-of-speech (POS) tagging task using the Penn Treebank Wall Street Journal data (POS-WSJ). The optimal character recognition dataset labels the letters in images of words. Chunking segments a sentence into syntactic chunks by tagging each sentence token with a chunk tag corresponding to its constituent type (e.g., ‘NP’, ‘VP’, etc.) and location (e.g., beginning, inside, ending, or outside any constituent). We use standard n-gram and POS tag features (Sha and Pereira, 2003). For the named-entity recognition task, the goal is to identify named entities and correctly classify them as persons, organizations, locations, times, or quantities. We again use standard n-gram and POS tag features, as well as word shape features over the case of the characters in the token. The POS-tagging task assigns one of 45 syntactic tags to each token in each of the sentences in the data. For this data, we follow the standard division of the WSJ data given by Collins (2002), using sections 0-18 for training, 19-21 for development, and 22-24 for testing. We use the standard set of features following Ratnaparkhi (1996) and Collins (2002): n-gram, suffix, and shape features. As is common on these tasks, our pairwise features do not depend on $x$ .

Figure 1 shows the result of our experiments on the training objective and Figure 2 shows the result of tracking the test error. Here we measure the number of ‘effective passes’, meaning $(1/n)$ times the number of times we performed the bottleneck operation of computing $\log p(y_{i}|x_{i},w)$ and its gradient. This is an implementation-independent way to compare the convergence of the different algorithms (most of whose runtimes differ only by a small constant), but we have included the performance in terms of runtime in Appendix E. For the different SAG methods that use a line-search we count the extra ‘forward’ operations used by the line-search as full evaluations of $\log p(y_{i}|x_{i},w)$ and its gradient, even though these operations are cheaper because they do not require the backward pass nor computing the gradient. In these experiments we used $\lambda=1/n$ , which yields a value close to the optimal test error across all data sets. The objective is strongly-convex and thus has a unique minimum value. We approximated this value by running L-BFGS for up to $1000$ iterations, which always gave a value of $w$ satisfying $\|\nabla f(w)\|_{\infty}\leq 1.4\times 10^{-7}$ , indicating that this is a very accurate approximation of the true solution. In the test error plots, we have excluded the SAG and SAG-NUS methods to keep the plots interpretable (while Pegasos does not appear becuase it performs very poorly), but Appendx C includes these plots with all methods added. In the test error plots, we have also plotted as dotted lines the performance of the classic stochastic gradient methods when the second-best step-size is used.

We make several observations based on these experiments:

SG outperformed Pegasos. Pegasos is known to move exponentially away from the solution in the early iterations (Bach and Moulines, 2011), meaning that $\|w^{t}-w^{*}\|\geq\rho^{t}\|w^{0}-w^{*}\|$ for some $\rho>1$ , while SG moves exponentially towards the solution ( $\rho<1$ ) in the early iterations (Nedic and Bertsekas, 2000).

ASG outperformed AdaGrad and SMD (in addition to SG). ASG methods are known to achieve the same asymptotic efficiency as an optimal stochastic Newton method (Polyak and Juditsky, 1992), while AdaGrad and SMD can be viewed as approximations to a stochastic Newton method. Vishwanathan et al. (2006) did not compare to ASG, because applying ASG to large/sparse data requires the recursion of Xu (2010).

Hybrid outperformed L-BFGS. The hybrid algorithm processes fewer data points in the early iterations, leading to cheaper iterations.

None of the three algorithms ASG/Hybrid/SAG dominated the others: the relative ranks of these methods changed based on the data set and whether we could choose the optimal step-size.

OEG performed very well on the first two datasets, but was less effective on the second two. By experimenting with various initializations, we found that we could obtain much better performance with OEG on these two datasets. We report these results in the Appendix D, although Appendix E shows that OEG was less competitive in terms of runtime.

Both SAG-NUS methods outperform all other methods (except OEG) by a substantial margin based on the training objective, and are always among the best methods in terms of the test error. Further, our proposed SAG-NUS* always outperforms SAG-NUS.

On three of the four data sets, the best classic stochastic gradient methods (AdaGrad and ASG) seem to reach the optimal test error with a similar speed to the SAG-NUS* method, although they require many passes to reach the optimal test error on the OCR data. Further, we see that the good test error performance of the AdaGrad and ASG methods is very sensitive to choosing the optimal step-size, as the methods perform much worse if we don’t use the optimal step-size (dashed lines in Figure 2). In contrast, SAG uses an adaptive step-size and has virtually identical performance even if the initial value of $L_{g}$ is too small by several orders of magnitude (the line-search quickly increases $L_{g}$ to a reasonable value on the first training example, so the dashed black line in Figure 2 would be on top of the solid line).

To quantify the memory savings given by the choices in Section 4, below we report the size of the memory required for these datasets under different memory-saving strategies divided by the memory required by the naive SAG algorithm. Sparse refers to only storing non-zero gradient values, Marginals refers to storing all unary and pairwise marginals, and Mixed refers to storing node marginals and the gradient with respect to pairwise features (recall that the pairwise features do not depend on $x$ in our models).

Discussion

Due to its memory requirements, it may be difficult to apply the SAG algorithm for natural language applications involving complex features that depend on a large number of labels. However, grouping training examples into mini-batches can also reduce the memory requirement (since only the gradients with respect to the mini-batches would be needed). An alternative strategy for reducing the memory is to use the algorithm of Johnson and Zhang (2013) or Zhang et al. (2013). These require evaluating the chosen training example twice on each iteration, and occasionally require full passes through the data, but do not have the memory requirements of SAG (in our experiments, these performed similar to or slightly worse than running SAG at half speed).

Acknowledgments

We would like to thank the anonymous reviewers as well as Simon Lacoste-Julien for their helpful comments. This research was supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN 262313, RGPAS 446348, and CRDPJ 412844 which was made possible by a generous contribution from The Boeing Company and AeroInfo Systems). Travel support to attend the conference was provided by the Institute for Computing, Information and Cognitive Systems (ICICS).

Appendx A: Proof of Part (a) of Proposition 1

In this section we consider the minimization problem

where each $f_{i}^{\prime}$ is $L$ -Lipschitz continuous and each $f_{i}$ is $\mu$ -strongly-convex. We will define Algorithm 2, a variant of SAGA, by the sequences $\{x^{k}\}$ , $\{\nu_{k}\}$ , and $\{\phi_{j}^{k}\}$ given by

where $j_{k}=j$ with probability $p_{j}$ . In this section we’ll use the convention that $x=x^{k}$ , that $\phi_{j}=\phi_{j}^{k}$ , and that $x^{*}$ is the minimizer of $f$ . We first show that $\nu_{k}$ is an unbiased gradient estimator and derive a bound on its variance.

which follows from Defazio et al. (2014, Lemma 1) using that $f^{\prime}(x^{*})=0$ and the non-positivity of $\frac{L-\mu}{L}[f(x^{*})-f(x)]$ . We now give the proof of part (a) of Proposition 1, which we state below.

If $\eta=\frac{4L+n\mu}{np_{m}}$ and $p_{m}=\min_{j}\{p_{j}\}$ , then Algorithm 2 has

We denote the Lyapunov function $T^{k}$ at iteration $k$ by

We now use Lemma 1 followed by Inequality (6),

We use this to bound the expected improvement in the Lyapunov function,

where in $(*)$ we add and subtract $\frac{1}{\kappa}T^{k}$ and in the last line we assumed $c\geq 0$ and used $p_{i}\geq p_{m}$ . The terms in square brackets are positive, and if we can choose the constants $\{c,\kappa,\eta\}$ to make the round brackets non-positive, we have

For the first expression, choosing $\kappa=\frac{\eta}{\mu}$ makes it zero. We can make the third expression zero under this choice of $\kappa$ by choosing $c=\frac{\eta^{2}p_{m}}{2}-\frac{\mu\eta}{2}$ . This follows because

For the second expression, note that with our choice of $c$ we have

which (multiplying by $n$ ) is negative if we have

We will also require that $c\geq 0$ to complete the proof, but this follows because $\eta\geq\frac{\mu}{p_{m}}$ . By using that

and chaining the expectations while using the definition of $\eta$ we obtain

Appendix B: Proof of Part (b) of Proposition 1

In this section we consider the minimization problem

where each $f_{i}^{\prime}$ is $L_{i}$ -Lipschitz continuous and $f$ is $\mu$ -strongly-convex. We will define Algorithm 3 by the sequences $\{x^{k}\}$ , $\{\nu_{k}\}$ , and $\{\phi_{j}^{k}\}$ given by

From our assumptions about $f$ and the $f_{i}$ , we have (Nesterov, 2004, see Chapter 2).

We use these to derive several useful inequalities that we will use in the analysis. Adding the former times $\frac{1}{2n}$ for all $i$ to the latter times $\frac{1}{2}$ for $y=x^{*}$ gives the inequality

Also by applying (8) with $y=x^{*}$ and $x=\phi_{i}$ , for each $f_{i}$ and summing, we have that for all $\phi_{i}$ and $x^{*}$ :

Further, by both minimizing sides of (9) we obtain

We next derive a bound on the variance of the gradient estimate.

It holds that for any $\phi_{i}$ that with $x^{k+1}$ and $x^{k}$ as given by Algorithm 2 we have

We again follow the SAGA argument closely here

We can expand those expectations as follows:

We now give the proof of part (b) of Proposition 1, which we state below.

If $\gamma=\frac{1}{4L}$ , then Algorithm 3 has

The expectations of the first terms in $T^{k+1}$ are straightforward to simplify:

We can now combine the bounds we have derived for each term in $T$ , and pull out a fraction $\frac{1}{\kappa}$ of $T^{k}$ (for any $\kappa$ at this point). Together with (12) this yields:

Note that the term in square brackets in the second row is positive in light of (11). We now attempt to find constants that satisfy the required relations. We start with naming the constants that we need to be non-positive:

Recall that we are using the step size $\gamma=1/4\bar{L}$ , and thus $c_{4}=0$ . Setting $c_{1}$ to zero gives

which is positive since $\gamma\mu<1$ . Now we look at the restriction that $c_{2}\leq 0$ places on $\kappa$ :

We also have the restriction from $c_{3}=\frac{1}{\kappa}-\frac{1}{2}\gamma\mu$ of

Note that $c\left\|x^{k}-x^{*}\right\|^{2}\leq T^{k}$ , and therefore by chaining expectations and plugging in constants we get:

Appendix C: Test Error Plots for All Methods

In the main body we only plotted test error for a subset of the methods. In Figure 3 we plot the test error of all methods considered in Figure 1. Note that Pegasos does not appear on the plot (despite being in the legend) because its values exceed the maximum plotted values. In these plots we see that the SAG-NUS methods perform similarly to the best among the optimally-tuned stochastic gradient methods in terms of test error, despite the lack of tuning required to apply these methods.

Appendix D: Improved Results for OEG

Owing to the high variance of the performance of the OEG method, we explored whether better performance could be obtained with the OEG method. The two most salient observations from these experiments where that (i) utilizing a random permutation on the first pass through the data seems to be crucial to performance, and (ii) that better performance could be obtained on the two datasets where OEG performed poorly by using a different initialization. In particular, better performance could be obtained by initializing the parts with the correct labels to a larger value, such as 10. In Figure 4, we plot the performance of the OEG method without using the random permutation (OEG-noRP) as well as OEG with this initialization (OEG-10). Removing the random permutation makes OEG perform much worse on one of the datasets, while using the different initialization makes OEG perform nearly as well as SAG-NUS* on the datasets where previously it performed poorly (although it does not make up the performance gap on the remaining data set). Performance did not further improve by using even larger values in the initialization, and using a value that was too large lead to numerical problems.

Appendix E: Runtime Plots

In the main body we plot the performance against the effective number of passes as an implementation-independent way of comparing the different algorithms. In all cases except SMD, we implemented a C version of the method and also compared the running times of our different implementations. This ties the results to the hardware used to perform the experiments and to our specific implementation, and thus says little about the runtime in different hardware settings or different implementations, but does show the practical performance of the methods in this particular setting. We plot the training objective against runtime in Figure 5 and the test error in Figure 6. In general, the runtime plots show the exact same trends as the plots against the effective number of passes. However, we note several small differences:

AdaGrad performs slightly worse in terms of runtime. This seems to be due to the extra square root operators needed to implement the method.

Hybrid performs worse in terms of runtime, although it was still faster than the L-BFGS method. This seems to be due to the higher relative cost of applying the L-BFGS update when the batch size is small.

OEG performed much worse in terms of runtime, even with the better initialization from the previous section.

Finally, we note that these implementations are available on the first author’s webpage: http://www.cs.ubc.ca/~schmidtm/Software/SAG4CRF.html