Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)

Francis Bach, Eric Moulines

Introduction

Large-scale machine learning problems are becoming ubiquitous in many areas of science and engineering. Faced with large amounts of data, practitioners typically prefer algorithms that process each observation only once, or a few times. Stochastic approximation algorithms such as stochastic gradient descent (SGD) and its variants, although introduced more than 60 years ago , still remain the most widely used and studied method in this context (see, e.g., ).

We partially understand the properties of ff that affect the problem difficulty. Strong convexity (i.e., when ff is twice differentiable, a uniform strictly positive lower-bound μ\mu on Hessians of ff) is a key property. Indeed, after nn observations and with the proper step-sizes, averaged SGD achieves the rate of O(1/μn)O(1/\mu n) in the strongly-convex case , while it achieves only O(1/n)O(1/\sqrt{n}) in the non-strongly-convex case , with matching lower-bounds .

The main issue with strong convexity is that typical machine learning problems are high dimensional and have correlated variables so that the strong convexity constant μ\mu is zero or very close to zero, and in any case smaller than O(1/n)O(1/\sqrt{n}). This then makes the non-strongly convex methods better. In this paper, we aim at obtaining algorithms that may deal with arbitrarily small strong-convexity constants, but still achieve a rate of O(1/n)O(1/n).

Smoothness plays a central role in the context of deterministic optimization. The known convergence rates for smooth optimization are better than for non-smooth optimization (e.g., see ). However, for stochastic optimization the use of smoothness only leads to improvements on constants (e.g., see ) but not on the rate itself, which remains O(1/n)O(1/\sqrt{n}) for non-strongly-convex problems.

We show that for the square loss and for the logistic loss, we may use the smoothness of the loss and obtain algorithms that have a convergence rate of O(1/n)O(1/n) without any strong convexity assumptions. More precisely, for least-squares regression, we show in Section 2 that averaged stochastic gradient descent with constant step-size achieves the desired rate. For logistic regression this is achieved by a novel stochastic gradient algorithm that (a) constructs successive local quadratic approximations of the loss functions, while (b) preserving the same running time complexity as stochastic gradient descent (see Section 3). For these algorithms, we provide a non-asymptotic analysis of their generalization error (in expectation, and also in high probability for least-squares), and run extensive experiments on standard machine learning benchmarks showing in Section 4 that they often outperform existing approaches.

Constant-step-size least-mean-square algorithm

In this section, we consider stochastic approximation for least-squares regression, where SGD is often referred to as the least-mean-square (LMS) algorithm. The novelty of our convergence result is the use of the constant step-size with averaging, leading to O(1/n)O(1/n) rate without strong convexity.

H{\mathcal{H}} is a dd-dimensional Euclidean space, with d1d\geqslant 1.

The observations (xn,zn)H×H(x_{n},z_{n})\in{\mathcal{H}}\times{\mathcal{H}} are independent and identically distributed.

We study the stochastic gradient (a.k.a. least mean square) recursion defined as

started from θ0H\theta_{0}\in{\mathcal{H}}. We also consider the averaged iterates θˉn=(n+1)1k=0nθk\bar{\theta}_{n}=(n+1)^{-1}\sum_{k=0}^{n}\theta_{k}.

Assume (A1-6). For any constant step-size γ<1R2\gamma<\frac{1}{R^{2}}, we have

Proof technique.

We adapt and extend a proof technique from which is based on non-asymptotic expansions in powers of γ\gamma. We also use a result from which studied the recursion in Eq. (1), with xnxnx_{n}\otimes x_{n} replaced by its expectation HH. See the appendix for details.

Optimality of bounds.

Our bound in Eq. (2) leads to a rate of O(1/n)O(1/n), which is known to be optimal for least-squares regression (i.e., under reasonable assumptions, no algorithm, even more complex than averaged SGD can have a better dependence in nn) . The term σ2d/n\sigma^{2}d/n is also unimprovable.

Initial conditions.

If γ\gamma is small, then the initial condition is forgotten more slowly. Note that with additional strong convexity assumptions, the initial condition would be forgotten faster (exponentially fast without averaging), which is one of the traditional uses of constant-step-size LMS .

Specificity of constant step-sizes.

2 Convergence in higher orders

We are now going to consider an extra assumption in order to bound the pp-th moment of the excess risk and then get a high-probability bound. Let pp be a real number greater than 11.

There exists R>0R>0, κ>0\kappa>0 and τσ>0\tau\geqslant\sigma>0 such that, for all n1n\geqslant 1, xn2R2\|x_{n}\|^{2}\leqslant R^{2} a.s., and

The last condition in Eq. (4) says that the kurtosis of the projection of the covariates xnx_{n} on any direction zHz\in{\mathcal{H}} is bounded. Note that computing the constant κ\kappa happens to be equivalent to the optimization problem solved by the FastICA algorithm , which thus provides an estimate of κ\kappa. In Table 1, we provide such an estimate for the non-sparse datasets which we have used in experiments, while we consider only directions zz along the axes for high-dimensional sparse datasets. For these datasets where a given variable is equal to zero except for a few observations, κ\kappa is typically quite large. Adapting and analyzing normalized LMS techniques to this set-up is likely to improve the theoretical robustness of the algorithm (but note that results in expectation from Theorem 1 do not use κ\kappa). The next theorem provides a bound for the pp-th moment of the excess risk.

Assume (A1-7). For any real p1p\geqslant 1, and for a step-size γ1/(12pκR2)\gamma\leqslant 1/(12p\kappa R^{2}), we have:

Note that to control the pp-th order moment, a smaller step-size is needed, which scales as 1/p1/p. We can now provide a high-probability bound; the tails decay polynomially as 1/(nδ12γκR2)1/(n\delta^{12\gamma\kappa R^{2}}) and the smaller the step-size γ\gamma, the lighter the tails.

For any step-size such that γ1/(12κR2)\gamma\leqslant 1/(12\kappa R^{2}), any δ(0,1)\delta\in(0,1),

Beyond least-squares: M-estimation

Although this algorithm has shown to be the most efficient in practice (see Section 4) we currently have no proof of convergence. Given that the behavior of the algorithms does not change much when the support point is updated less frequently than each iteration, there may be some connections to two-time-scale algorithms (see, e.g., ). In Section 4, we also consider several other strategies based on doubling tricks.

Interestingly, for non-quadratic functions, our algorithm imposes a new bias (by replacing the true gradient by an approximation which is only valid close to θˉn1\bar{\theta}_{n-1}) in order to reach faster convergence (due to the linearity of the underlying gradients).

Relationship with one-step-estimators.

One-step estimators (see, e.g., ) typically take any estimator with O(1/n)O(1/n)-convergence rate, and make a full Newton step to obtain an efficient estimator (i.e., one that achieves the Cramer-Rao lower bound). Although our novel algorithm is largely inspired by one-step estimators, our situation is slightly different since our first estimator has only convergence rate O(n1/2)O(n^{-1/2}) and is estimated on different observations.

1 Self-concordance and logistic regression

H{\mathcal{H}} is a dd-dimensional Euclidean space, with d1d\geqslant 1.

The observations (xn,yn)H×{1,1}(x_{n},y_{n})\in{\mathcal{H}}\times\{-1,1\} are independent and identically distributed.

We denote by θ\theta_{\ast} a global minimizer of ff, which we thus assume to exist, and we denote by H=f(θ)H=f^{\prime\prime}(\theta_{\ast}) the Hessian operator at a global optimum θ\theta_{\ast}.

We assume that there exists R>0R>0, κ>0\kappa>0 and ρ>0\rho>0 such that xn2R2\mboxalmostsurely\|x_{n}\|^{2}\leqslant R^{2}\mbox{ almost surely}, and

We get an O(1/n)O(1/n) convergence rate without assuming strong convexity, even locally, thus improving on results from where the the rate is proportional to 1/(nλmin(H))1/(n\lambda_{\min}(H)). The proof relies on self-concordance properties and the sharp analysis of the Newton step (see appendix).

Experiments

We consider normally distributed inputs, with covariance matrix HH that has random eigenvectors and eigenvalues 1/k1/k, k=1,,dk=1,\dots,d. The outputs are generated from a linear function with homoscedastic noise with unit signal to noise-ratio. We consider d=20d=20 and the least-mean-square algorithm with several settings of the step size γn\gamma_{n}, constant or proportional to 1/n1/\sqrt{n}. Here R2R^{2} denotes the average radius of the data, i.e., R2=trHR^{2}=\mathop{\rm tr}H. In the left plot of Figure 1, we show the results, averaged over 10 replications.

Without averaging, the algorithm with constant step-size does not converge pointwise (it oscillates), and its average excess risk decays as a linear function of γ\gamma (indeed, the gap between each values of the constant step-size is close to log10(4)\log_{10}(4), which corresponds to a linear function in γ\gamma).

With averaging, the algorithm with constant step-size does converge at rate O(1/n)O(1/n), and for all values of the constant γ\gamma, the rate is actually the same. Moreover (although it is not shown in the plots), the standard deviation is much lower.

With decaying step-size γn=1/(2R2n)\gamma_{n}=1/(2R^{2}\sqrt{n}) and without averaging, the convergence rate is O(1/n)O(1/\sqrt{n}), and improves to O(1/n)O(1/n) with averaging.

Logistic regression.

We consider the same input data as for least-squares, but now generates outputs from the logistic probabilistic model. We compare several algorithms and display the results in Figure 1 (middle and right plots).

On the middle plot, we consider SGD. Without averaging, the algorithm with constant step-size does not converge and its average excess risk reaches a constant value which is a linear function of γ\gamma (indeed, the gap between each values of the constant step-size is close to log10(4)\log_{10}(4)). With averaging, the algorithm does converge, but as opposed to least-squares, to a point which is not the optimal solution, with an error proportional to γ2\gamma^{2} (the gap between curves is twice as large).

On the right plot, we consider various variations of our Newton-approximation scheme. The “2-step” algorithm is the one for which our convergence rate holds (nn being the total number of examples, we perform n/2n/2 steps of averaged SGD, then n/2n/2 steps of LMS). Not surprisingly, it is not the best in practice (in particular at n/2n/2, when starting the constant-size LMS, the performance worsens temporarily). It is classical to use doubling tricks to remedy this problem while preserving convergence rates , this is done in “2-step-dbl.”, which avoids the previous erratic behavior.

We have also considered getting rid of the first stage where plain averaged stochastic gradient is used to obtain a support point for the quadratic approximation. We now consider only Newton-steps but change only these support points. We consider updating the support point at every iteration, i.e., the recursion from Eq. (8), while we also consider updating it every dyadic point (“dbl.-approx”). The last two algorithms perform very similarly and achieve the O(1/n)O(1/n) early. In all experiments on real data, we have considered the simplest variant (which corresponds to Eq. (8)).

2 Standard benchmarks

We have considered 6 benchmark datasets which are often used in comparing large-scale optimization methods. The datasets are described in Table 1 and vary in values of dd, nn and sparsity levels. These are all finite binary classification datasets with outputs in {1,1}\{-1,1\}. For least-squares and logistic regression, we have followed the following experimental protocol: (1) remove all outliers (i.e., sample points xnx_{n} whose norm is greater than 5 times the average norm), (2) divide the dataset in two equal parts, one for training, one for testing, (3) sample within the training dataset with replacement, for 100 times the number of observations in the training set (this corresponds to 100100 effective passes; in all plots, a black dashed line marks the first effective pass), (4) compute averaged cost on training and testing data (based on 10 replications). All the costs are shown in log-scale, normalized to that the first iteration leads to f(θ0)f(θ)=1f(\theta_{0})-f(\theta_{\ast})=1.

All algorithms that we consider (ours and others) have a step-size, and typically a theoretical value that ensures convergence. We consider two settings: (1) one when this theoretical value is used, (2) one with the best testing error after one effective pass through the data (testing powers of 44 times the theoretical step-size).

Here, we only consider covertype, alpha, sido and news, as well as test errors. For all training errors and the two other datasets (quantum, rcv1), see the appendix.

We compare three algorithms: averaged SGD with constant step-size, averaged SGD with step-size decaying as C/R2nC/R^{2}\sqrt{n}, and the stochastic averaged gradient (SAG) method which is dedicated to finite training data sets , which has shown state-of-the-art performance in this set-upThe original algorithm from is considering only strongly convex problems, we have used the step-size of 1/16R21/16R^{2}, which achieves fast convergence rates in all situations (see \urlhttp://research.microsoft.com/en-us/um/cambridge/events/mls2013/downloads/stochastic_gradient.pdf).. We show the results in the two left plots of Figure 2 and Figure 3.

Averaged SGD with decaying step-size equal to C/R2nC/R^{2}\sqrt{n} is slowest (except for sido). In particular, when the best constant CC is used (right columns), the performance typically starts to increase significantly. With that step size, even after 100 passes, there is no sign of overfitting, even for the high-dimensional sparse datasets.

SAG and constant-step-size averaged SGD exhibit the best behavior, for the theoretical step-sizes and the best constants, with a significant advantage for constant-step-size SGD. The non-sparse datasets do not lead to overfitting, even close to the global optimum of the (unregularized) training objectives, while the sparse datasets do exhibit some overfitting after more than 10 passes.

Logistic regression.

We also compare two additional algorithms: our Newton-based technique and “Adagrad” , which is a stochastic gradient method with a form a diagonal scalingSince a bound on θ\|\theta_{\ast}\| is not available, we have used step-sizes proportional to 1/supnxn1/\sup_{n}\|x_{n}\|_{\infty}. that allows to reduce the convergence rate (which is still in theory proportional to O(1/n)O(1/\sqrt{n})). We show results in the two right plots of Figure 2 and Figure 3.

Averaged SGD with decaying step-size proportional to 1/R2n1/R^{2}\sqrt{n} has the same behavior than for least-squares (step-size harder to tune, always inferior performance except for sido).

SAG, constant-step-size SGD and the novel Newton technique tend to behave similarly (good with theoretical step-size, always among the best methods). They differ notably in some aspects: (1) SAG converges quicker for the training errors (shown in the appendix) while it is a bit slower for the testing error, (2) in some instances, constant-step-size averaged SGD does underfit (covertype, alpha, news), which is consistent with the lack of convergence to the global optimum mentioned earlier, (3) the novel Newton approximation is consistently better.

On the non-sparse datasets, Adagrad performs similarly to the Newton-type method (often better in early iterations and worse later), except for the alpha dataset where the step-size is harder to tune (the best step-size tends to have early iterations that make the cost go up significantly). On sparse datasets like rcv1, the performance is essentially the same as Newton. On the sido data set, Adagrad (with fixed steps size, left column) achieves a good testing loss quickly then levels off, for reasons we cannot explain. On the news dataset, it is inferior without parameter-tuning and a bit better with. Adagrad uses a diagonal rescaling; it could be combined with our technique, early experiments show that it improves results but that it is more sensitive to the choice of step-size.

Overall, even with dd and κ\kappa very large (where our bounds are vacuous), the performance of our algorithm still achieves the state of the art, while being more robust to the selection of the step-size: finer quantities likes degrees of freedom should be able to quantify more accurately the quality of the new algorithms.

Conclusion

In this paper, we have presented two stochastic approximation algorithms that can achieve rates of O(1/n)O(1/n) for logistic and least-squares regression, without strong-convexity assumptions. Our analysis reinforces the key role of averaging in obtaining fast rates, in particular with large step-sizes. Our work can naturally be extended in several ways: (a) an analysis of the algorithm that updates the support point of the quadratic approximation at every iteration, (b) proximal extensions (easy to implement, but potentially harder to analyze); (c) adaptive ways to find the constant-step-size; (d) step-sizes that depend on the iterates to increase robustness, like in normalized LMS , and (e) non-parametric analysis to improve our theoretical results for large values of dd.

This work was partially supported by the European Research Council (SIERRA Project 239993). The authors would like to thank Simon Lacoste-Julien and Mark Schmidt for discussions related to this work.

In the appendix we provide proofs of all three theorems, as well as additional experimental results (all training objectives, and two additional datasets quantum and rcv1).

Appendix A Proof of Theorem 1

We first denote by ηn=θnθH\eta_{n}=\theta_{n}-\theta_{\ast}\in{\mathcal{H}} the deviation to θ\theta_{\ast}. Since we consider quadratic functions, it satisfies a simplified recursion:

We also consider ηˉn=1n+1k=0nηk=θˉnθ\bar{\eta}_{n}=\frac{1}{n+1}\sum_{k=0}^{n}\eta_{k}=\bar{\theta}_{n}-\theta_{\ast} the averaged iterate. We have f(θn)f(θ)=12ηn,Hηnf(\theta_{n})-f(\theta_{\ast})=\frac{1}{2}\langle\eta_{n},H\eta_{n}\rangle and f(θˉn)f(θ)=12ηˉn,Hηnf(\bar{\theta}_{n})-f(\theta_{\ast})=\frac{1}{2}\langle\bar{\eta}_{n},H\eta_{n}\rangle.

The crux of the proof is to consider the same recursion as Eq. (12), but replacing xnxnx_{n}\otimes x_{n} by its expectation HH (which is related to fixed design analysis in linear regression). This is of course only an approximation, and thus one has to study the remainder term; it happens to satisfy a similar recursion, on which we can apply the same technique, and so on. This proof technique is taken from . Here we push it to arbitrary orders with explicit constants for averaged constant-step-size stochastic gradient descent.

We denote by Fn{\mathcal{F}}_{n} the σ\sigma-algebra generated by (x1,z1,,xn,zn)(x_{1},z_{1},\dots,x_{n},z_{n}). Both θn\theta_{n} and θˉn\bar{\theta}_{n} are Fn{\mathcal{F}}_{n}-measurable.

A.1 Two main lemmas

The proof relies on two lemmas, one that provides a weak result essentially equivalent (but more specific and simpler because the step-size is constant) to non-strongly-convex results from , and one that replaces xnxnx_{n}\otimes x_{n} by its expectation HH in Eq. (12), which may then be seen as a non-asymptotic counterpart to the similar set-tup in .

Proof. We follow the proof technique of (which relies only on smoothness) and get:

This leads to the desired result, because, by convexity, αˉn1,Hαˉn11nk=0n1αk,Hαk\langle\bar{\alpha}_{n-1},H\bar{\alpha}_{n-1}\rangle\leqslant\frac{1}{n}\sum_{k=0}^{n-1}\langle{\alpha}_{k},H{\alpha}_{k}\rangle.

Proof. The proof relies on the fact that cost functions are quadratic and our recursions are thus linear, allowing to obtain αn\alpha_{n} in closed form. The sequence (αn)(\alpha_{n}) satisfies a linear recursion, from which we get, for all n1n\geqslant 1:

which leads to the first result using classical martingale second moment expansions (which amount to considering ξi\xi_{i}, i=1,,ni=1,\dots,n independent, so that the variance of the sum is the sum of variances). Moreover, using the identity \sum_{k=0}^{n-1}(I-\gamma H)^{k}=\big{(}I-(I-\gamma H)^{n}\big{)}\big{(}\gamma H\big{)}^{-1}, we get:

We then get, using standard martingale square moment inequalities (which here also amount to considering ξi\xi_{i}, i=1,,ni=1,\dots,n independent, so that the variance of the sum is the sum of variances):

because for all uu\in, (1(1u)n)2nu1\frac{(1-(1-u)^{n})^{2}}{nu}\leqslant 1 (see Lemma 3 in Section A.6), and the second term is the sum of terms which are all less than trH1C\mathop{\rm tr}H^{-1}C.

Note that we may replace the term 1nγα02\frac{1}{n\gamma}\|\alpha_{0}\|^{2} by 1n2γ2α0,H1α0\displaystyle\frac{1}{n^{2}\gamma^{2}}\langle\alpha_{0},H^{-1}\alpha_{0}\rangle, which is only interesting when α0,H1α0\langle\alpha_{0},H^{-1}\alpha_{0}\rangle is small.

A.2 Proof principle

The proof relies on an expansion of ηn\eta_{n} and ηˉn1\bar{\eta}_{n-1} as polynomials in γ\gamma due to . This expansion is done separately for the noise process (i.e., when assuming η0=0\eta_{0}=0) and for the noise-free process that depends only on the initial conditions (i.e., when assuming that σ=0\sigma=0). The bounds may then be added.

Indeed, we have ηn=M1nη0+γk=1nMk+1nξk\eta_{n}=M_{1}^{n}\eta_{0}+\gamma\sum_{k=1}^{n}M_{k+1}^{n}\xi_{k}, with Mij=(Iγxjxj)(Iγxixi)M_{i}^{j}=(I-\gamma x_{j}\otimes x_{j})\cdots(I-\gamma x_{i}\otimes x_{i}) and Mii1=IM_{i}^{i-1}=I, and thus \bar{\eta}_{n}=\frac{1}{n+1}\sum_{i=0}^{n}\bigg{[}M_{1}^{i}\eta_{0}+\gamma\sum_{k=1}^{i}M_{k+1}^{i}\xi_{k}\bigg{]}=\frac{1}{n+1}\sum_{i=0}^{n}M_{1}^{i}\eta_{0}+\gamma\sum_{k=1}^{n}\bigg{(}\sum_{i=k}^{n}M_{k+1}^{i}\bigg{)}\xi_{k}, leading to

for any p2p\geqslant 2 for which it is defined: the left term depends only on initial conditions and the right term depends only on the noise process (note the similarity with bias-variance decompositions).

A.3 Initial conditions

In this section, we assume that ξn\xi_{n} is uniformly equal to zero, and that γR21\gamma R^{2}\leqslant 1.

We thus have ηn=(Iγxnxn)ηn1\eta_{n}=(I-\gamma x_{n}\otimes x_{n})\eta_{n-1} and thus

By taking expectations (first given Fn1{\mathcal{F}}_{n-1}, then unconditionally), we get:

from which we obtain, by summing from 11 to nn and using convexity (note that Lemma 1 could be used directly as well):

Here, it would be interesting to explore conditions under which the initial conditions may be forgotten at a rate O(1/n2)O(1/n^{2}), as obtained by in the strongly convex case.

A.4 Noise process

In this section, we assume that η0=θ0θ=0\eta_{0}=\theta_{0}-\theta_{\ast}=0 and γR21\gamma R^{2}\leqslant 1 (which implies γHI\gamma H\preccurlyeq I). Following , we recursively define the sequences (ηnr)n0(\eta_{n}^{r})_{n\geqslant 0} for r0r\geqslant 0 (and their averaged counterparts ηˉnr\bar{\eta}_{n}^{r}):

The sequence (ηn0)(\eta_{n}^{0}) is defined as η00=η0=0\eta^{0}_{0}=\eta_{0}=0 and for n1n\geqslant 1, ηn0=(IγH)ηn10+γξn\eta^{0}_{n}=(I-\gamma H)\eta^{0}_{n-1}+\gamma\xi_{n}.

The sequence (ηnr)(\eta_{n}^{r}) is defined from (ηnr1)(\eta_{n}^{r-1}) as η0r=0\eta_{0}^{r}=0 and, for all n1n\geqslant 1:

We now show that the sequence ηni=0rηni\eta_{n}-\sum_{i=0}^{r}\eta_{n}^{i} then satisfies the following recursion, for any r0r\geqslant 0 (which is of the same type than (ηn)(\eta_{n})):

In order to prove Eq. (16) by recursion, we have, for r=0r=0,

Bound on covariance operators.

We now show that we also have a bound on the covariance operator of ηn1r\eta_{n-1}^{r}, for any r0r\geqslant 0 and n2n\geqslant 2:

In order to prove Eq. (17) by recursion, we get for r=0r=0:

In order to go from rr to r+1r+1, we have, using Lemma 2 and the fact that ηk1r\eta_{k-1}^{r} and xkx_{k} are independent:

Putting things together.

We may apply Lemma 1 to the sequence \big{(}\eta_{n}-\sum_{i=0}^{r}\eta_{n}^{i}\big{)}, to get

We may now apply Lemma 2 to Eq. (15), to get, with a noise process ξnr=(Hxnxn)ηn1r1\xi_{n}^{r}=(H-x_{n}\otimes x_{n})\eta_{n-1}^{r-1} which is such that

We thus get, using Minkowski’s inequality (i.e., triangle inequality for the norms p\|\cdot\|_{p}):

This implies that for any γR2<1\gamma R^{2}<1, we obtain, by letting rr tend to ++\infty:

A.5 Final bound

We can now take results from Appendices A.3 and A.4, to get

A.6 Proof of Lemma 3

In this section, we state and prove a simple lemma.

For any uu\in and n>0n>0, (1(1u)n)2nu{(1-(1-u)^{n})^{2}}{}\leqslant nu.

Proof. Since uu\in, we have, 1(1u)n11-(1-u)^{n}\leqslant 1. Moreover, n(1u)n1nn(1-u)^{n-1}\leqslant n, and by integrating between and uu, we get 1(1u)nnu1-(1-u)^{n}\leqslant nu. By multiplying the two previous inequalities, we get the desired result.

Appendix B Proof of Theorem 2

We use the same notations than the proof of Theorem 1, and the same proof principle: (a) splitting the contributions of the initial conditions and the noise, (b) providing a direct argument for the initial condition, and (c) performing an expansion for the noise contribution.

B.1 Contribution of initial conditions

When the noise is assumed to be zero, we have ηn=(Iγxnxn)ηn1\eta_{n}=(I-\gamma x_{n}\otimes x_{n})\eta_{n-1} almost surely, and thus, since 0γxnxnI0\preccurlyeq\gamma x_{n}\otimes x_{n}\preccurlyeq I, ηnη0\|\eta_{n}\|\leqslant\|\eta_{0}\| almost surely, and

and Mnγη02R2|M_{n}|\leqslant\gamma\|\eta_{0}\|^{2}R^{2}. We may now apply the Burkholder-Rosenthal-Pinelis inequality in Eq. (B), to get:

We have used above that (a) k=1nηk1,Hηk1Anγ\sum_{k=1}^{n}\langle\eta_{k-1},H\eta_{k-1}\rangle\leqslant\frac{A_{n}}{\gamma} and that (b) \big{\|}A_{n}\big{\|}_{p/2}\leqslant\big{\|}A_{n}\big{\|}_{p}. This leads to

Finally, we obtain, for any p2p\geqslant 2

i.e., by a change of variable pp2p\rightarrow\frac{p}{2}, for any p4p\geqslant 4, we get

By using monotonicity of norms, we get, for any pp\in:

Note that the constants in the bound above could be improved by using a proof by recursion.

B.2 Contribution of the noise

We follow the same proof technique than for Theorem 1 and consider the expansion based on the sequences (ηnr)n(\eta_{n}^{r})_{n}, for r0r\geqslant 0. We need (a) bounds on ηn0\eta_{n}^{0}, (b) a recursion on the magnitude (in p\|\cdot\|_{p} norm) of ηnr\eta_{n}^{r} and (c) a control of the error made in the expansions.

We start by a lemma similar to Lemma 2 but for all moments. This will show a bound for the sequence ηˉn0\bar{\eta}_{n}^{0}.

Proof. We have, from the proof of Lemma 2:

with \beta_{j}=\big{(}{I-(I-\gamma H)^{n-j}}\big{)}\big{(}{\gamma H}\big{)}^{-1}H^{1/2}\xi_{j}. We have

using Lemma 3 in Section A.6, and assumption (A7).

Using Burkholder-Rosenthal-Pinelis inequality in Eq. (B), we then obtain

Following the same proof technique as above, we have

from which we get, for any positive semidefinite operator MM such that trM=1\mathop{\rm tr}M=1, using BRP’s inequality:

We introduce the following quantity to control the deviations of ηnr\eta_{n}^{r}:

We have from Eq. (20), A01RpγR2(σ+τpγR2)A_{0}\leqslant\frac{1}{R}\sqrt{p\gamma R^{2}}(\sigma+\tau\sqrt{p\gamma R^{2}}).

Since ηnr=(IγH)ηn1r+γ(Hxnxn)ηn1r1\eta_{n}^{r}=(I-\gamma H)\eta_{n-1}^{r}+\gamma(H-x_{n}\otimes x_{n})\eta_{n-1}^{r-1}, for all n1n\geqslant 1, we have the closed form expression

and we may use BRP’s inequality in Eq. (B) to get, for any MM such that trM=1\mathop{\rm tr}M=1:

This implies that A_{r}\leqslant A_{0}\big{(}\sqrt{p\gamma R^{2}\kappa}+p\gamma R^{2}+p(\gamma R^{2})^{1-1/p}\big{)}^{r}, which in turn implies, from Eq. (20),

The condition on γ\gamma will come from the requirement that pγR2κ+pγR2+p(γR2)11/p<1\sqrt{p\gamma R^{2}\kappa}+p\gamma R^{2}+p(\gamma R^{2})^{1-1/p}<1.

leading to, using BRP’s inequality in Eq. (B), similar arguments than in the previous bounds, \big{(}{I-(I-\gamma H)^{n-j}}\big{)}^{2}\big{(}{\gamma H}\big{)}^{-2}H\preccurlyeq H^{-1}\gamma and \big{(}{I-(I-\gamma H)^{n-j}}\big{)}^{2}\big{(}{\gamma H}\big{)}^{-2}H\preccurlyeq\frac{n}{\gamma}I:

We may then impose a restriction on γR2\gamma R^{2}, i.e., γR21ακp\gamma R^{2}\leqslant\frac{1}{\alpha\kappa p} with α>1\alpha>1. We then have

With α=12\alpha=12, we obtain a bound of 0.7818100.781\leqslant\frac{8}{10} above.

From Eq. (16) and the fact that 0IγxnxnI0\preccurlyeq I-\gamma x_{n}\otimes x_{n}\preccurlyeq I almost surely, we get:

Putting things together.

We get by combining Lemma 4 with Eq. (22) and Eq. (23) and then letting rr tends to infinity,

B.3 Final bound

For γ112κpR2\gamma\leqslant\frac{1}{12\kappa pR^{2}}, we obtain, from the last equations of Section B.1 and Section B.2,

Moreover, when γ=112κpR2\gamma=\frac{1}{12\kappa pR^{2}}, we have:

B.4 Proof of Corollary 1

We have from the previous proposition, for γ112κpR2\gamma\leqslant\frac{1}{12\kappa pR^{2}}:

with η=12γκR21/p\eta=12\gamma\kappa R^{2}\leqslant 1/p, and =7τd+Rθ0θ3\square=7\tau\sqrt{d}+R\|\theta_{0}-\theta_{\ast}\|\sqrt{3} and =Rθ0θ24κ\triangle=R\|\theta_{0}-\theta_{\ast}\|\sqrt{24\kappa}.

This leads to, using Markov’s inequality:

Thus the large deviations decay as power of tt, with a power that decays as 1/(12γκR2)1/(12\gamma\kappa R^{2}). If γ\gamma is small, the deviations are lighter.

In order to get the desired result, we simply take t=112γκR2δ12κγR2t=\frac{1}{12\gamma\kappa R^{2}}\delta^{-12\kappa\gamma R^{2}}.

Appendix C Proof of Theorem 3

In terms of convergence rates, θ1\theta_{1} will be (1/n)(1/\sqrt{n})-optimal, while θ3\theta_{3} will be (1/n)(1/n)-optimal for the quadratic problem because of previous results on averaged LMS. A classical property is that a single Newton step squares the error. Therefore, the full Newton step should have an error which is the square of the one of θ1\theta_{1}, i.e., O(1/n)O(1/n). Overall, since θ3\theta_{3} approaches the full Newton step with rate O(1/n)O(1/n), this makes a bound of O(1/n)O(1/n).

In Section C.1, we provide a general deterministic result on the Newton step, while in Section C.2, we combine with two stochastic approximation results, making the informal reasoning above more precise.

In this section, we study the effect of an approximate Newton step. We consider θ1H\theta_{1}\in{\mathcal{H}}, the Newton iterate θ2=θ1f(θ1)1f(θ1)\theta_{2}=\theta_{1}-f^{\prime\prime}(\theta_{1})^{-1}f^{\prime}(\theta_{1}), and an approximation θ3\theta_{3} of θ2\theta_{2}. In the next proposition, we provide a bound on f(θ3)f(θ)f(\theta_{3})-f(\theta_{\ast}), under different conditions, whether θ1\theta_{1} is close to optimal for ff, and/or θ3\theta_{3} is close to optimal for the quadratic approximation around θ1\theta_{1} (i.e., close to θ2\theta_{2}). Eq. (25) corresponds to the least-favorable situations where both errors are small, while Eq. (26) and Eq. (27) consider cases where θ1\theta_{1} is sufficiently good. See proof in Section E.3. These three cases are necessary for the probabilistic control.

Assume (B3-4), and θ1,θ2,θ3H\theta_{1},\theta_{2},\theta_{3}\in{\mathcal{H}} such that f(θ1)f(θ)=ε1f(\theta_{1})-f(\theta_{\ast})=\varepsilon_{1}, θ2=θ1f(θ1)1f(θ1)\theta_{2}=\theta_{1}-f^{\prime\prime}(\theta_{1})^{-1}f^{\prime}(\theta_{1}) and 12θ3θ2,f(θ1)(θ3θ2)=ε2\frac{1}{2}\langle\theta_{3}-\theta_{2},f^{\prime\prime}(\theta_{1})(\theta_{3}-\theta_{2})\rangle=\varepsilon_{2}. Then, if t2=ε1κρt^{2}=\varepsilon_{1}\kappa\rho,

If t=ε1κρ1/16t=\sqrt{\varepsilon_{1}\kappa\rho}\leqslant 1/16, then

Moreover, if t=ε1κρ1/16t=\sqrt{\varepsilon_{1}\kappa\rho}\leqslant 1/16 and ε2κρ1/16\varepsilon_{2}\kappa\rho\leqslant 1/16, then

Note that in the favorable situation in Eq. (26), we get error of the form O(ε12+ε2)O(\varepsilon_{1}^{2}+\varepsilon_{2}). It essentially suffices now to show that in our set-up, in a probabilistic sense to be determined, ε1=O(1/n)\varepsilon_{1}=O(1/\sqrt{n}) and ε2=O(1/n)\varepsilon_{2}=O(1/{n}), while controlling the unfavorable situations.

C.2 Stochastic analysis

We consider the following two-step algorithm:

Starting from any initialization θ0\theta_{0}, run nn iterations of averaged stochastic gradient descent to get θ1\theta_{1},

Run from θ1\theta_{1} nn steps of LMS on the quadratic approximation around θ1\theta_{1}, to get θ3\theta_{3}, which is an approximation of the Newton step θ2\theta_{2}.

We denote by G1\mathcal{G}_{1} the σ\sigma-field generated by the first nn observations (the ones used to define θ1\theta_{1}). We have, by separating all events, i.e., using 1=1A11A2+1A11A2c+1A1c1=1_{A_{1}}1_{A_{2}}+1_{A_{1}}1_{A_{2}^{\sf c}}+1_{A_{1}^{\sf c}}:

We now need to control ε2\varepsilon_{2}, i.e., the error made by the LMS algorithm started from θ1\theta_{1}.

We consider the quadratic approximation around θ1H\theta_{1}\in{\mathcal{H}}, and write is as an expectation, i.e.,

xn2R2/4\|x_{n}\|^{2}\leqslant R^{2}/4 almost surely.

We may thus apply the previous results, i.e., Theorem 1, to obtain with the LMS algorithm a θ3H\theta_{3}\in{\mathcal{H}} such that, with γ=1R2\gamma=\frac{1}{R^{2}}:

which are such that 2(t)0.6\triangle_{2}(t)\leqslant 0.6 and 3(t)1.2\triangle_{3}(t)\leqslant 1.2 if t1/16t\leqslant 1/16.

The last inequalities may be checked graphically.

We now need to use bounds on the behavior of the first nn steps of regular averaged stochastic gradient descent.

Fine results on averaged stochastic gradient descent.

In order to get error bounds on θ1\theta_{1}, we run nn steps of averaged stochastic gradient descent with constant-step size γ=1/(2R2n)\gamma=1/(2R^{2}\sqrt{n}). We need the following bounds from [21, Appendix E and Prop. 1]:

However, we need a finer control of the deviations in order to bound quantities of the form eαε1e^{\alpha\varepsilon_{1}}. In Section D, extending results from , we show in Prop. 3 that if α(10+2R2θ0θ2)n12e\frac{\alpha(10+2R^{2}\|\theta_{0}-\theta_{\ast}\|^{2})}{\sqrt{n}}\leqslant\frac{1}{2e}, then Eeα(f(θ1)f(θ))1Ee^{\alpha(f(\theta_{1})-f(\theta_{\ast}))}\leqslant 1.

Putting things together.

From Eq. (30), we then get, if 6(10+2R2θ0θ2)n12e\frac{6(10+2R^{2}\|\theta_{0}-\theta_{\ast}\|^{2})}{\sqrt{n}}\leqslant\frac{1}{2e}:

Using the notation D=(R2θ0θ2+32)D=(R^{2}\|\theta_{0}-\theta_{\ast}\|^{2}+\frac{3}{2}), we obtain:

The condition 6(10+2R2θ0θ2)n12e\frac{6(10+2R^{2}\|\theta_{0}-\theta_{\ast}\|^{2})}{\sqrt{n}}\leqslant\frac{1}{2e} is implied by n(19+9Rθ0θ)4n\geqslant(19+9R\|\theta_{0}-\theta_{\ast}\|)^{4}.

Appendix D Higher-order bounds for stochastic gradient descent

In this section, we provide high-order bounds for averaged stochastic gradient for logistic regression. The first proposition gives a finer result than , with a simpler proof, while the second proposition is new.

Proof. Following , we have the recursion:

Thus if B=\bigg{\|}\sup_{k\in\{0,\dots,n\}}A_{k}\bigg{\|}_{p}, we have

By solving this quadratic inequality, we get:

The previous statement leads to the desired result if p2p\geqslant 2. For pp\in, we may bound it by the value at p=2p=2, and a direct calculation shows that the bound is still correct.

Proof. Using that almost surely, θˉn1θθ0θ+nγR\|\bar{\theta}_{n-1}-\theta_{\ast}\|\leqslant\|\theta_{0}-\theta_{\ast}\|+n\gamma R we obtain that almost surely f(θˉn1)f(θ)Rθ0θ+nγR2f(\bar{\theta}_{n-1})-f(\theta_{\ast})\leqslant R\|\theta_{0}-\theta_{\ast}\|+n\gamma R^{2}.

Moreover, from the previous proposition, we have for pn4p\leqslant\frac{n}{4},

For γ=12R2n\gamma=\frac{1}{2R^{2}\sqrt{n}}, we get:

This leads to the bound valid for all pp:

Appendix E Properties of self-concordance functions

In this section, we review various properties of self-concordant functions, that will prove useful in proving Theorem 3. All these properties rely on bounding the third-order derivatives by second-order derivatives. More precisely, from assumptions (B3-4), we have for any θ,δ,ηH\theta,\delta,\eta\in{\mathcal{H}}, where f(r)[δ1,,δk]f^{(r)}[\delta_{1},\dots,\delta_{k}] denotes the kk-th order differential of ff:

In this section, we derive global non-asymptotic Taylor expansions for self-concordant functions, which show that they behave similarly to like quadratic functions.

The following proposition shows that having a small excess risk f(θ)f(θ)f(\theta)-f(\theta_{\ast}) implies that the weighted distance to optimum θθ,H(θθ)\langle\theta-\theta_{\ast},H(\theta-\theta_{\ast})\rangle is small. Note that for quadratic functions, these two quantities are equal and that throughout this section, we always consider norms weighted by the matrix HH (Hessian at optimum).

Assume (B3-4). Then, for any θH\theta\in{\mathcal{H}}:

Proof. Let \varphi:t\mapsto f\big{[}\theta_{\ast}+t(\theta-\theta_{\ast})\big{]}. Denoting d=θθ,f(θ)(θθ)d=\sqrt{\langle\theta-\theta_{\ast},f^{\prime\prime}(\theta_{\ast})(\theta-\theta_{\ast})\rangle}, we have:

from which we obtain φ(t)φ(0)eκρdt\varphi^{\prime\prime}(t)\geqslant\varphi^{\prime\prime}(0)e^{-\sqrt{\kappa\rho}dt}. Following , by integrating twice (and noting that φ(0)=0\varphi^{\prime}(0)=0 and φ(0)=d2\varphi^{\prime\prime}(0)=d^{2}), we get

The identity κ1(v)3v+v2\kappa^{-1}(v)\leqslant\sqrt{3v+v^{2}} is equivalent to eu+u1u2+α2αe^{-u}+u-1\geqslant\sqrt{u^{2}+\alpha^{2}}-\alpha, for α=32\alpha=\frac{3}{2}. It then suffices to show that 1euuu2+α21-e^{-u}\geqslant\frac{u}{\sqrt{u^{2}+\alpha^{2}}}. This can be shown by proving the monotonicity of ueu+u1u2+α2+αu\mapsto e^{-u}+u-1-\sqrt{u^{2}+\alpha^{2}}+\alpha, and we leave this exercise to the reader.

The next proposition shows that Hessians between two points which are close in weighted distance are close to each other, for the order between positive semi-definite matrices.

Assume (B3-4). Then, for any θ1,θ2H\theta_{1},\theta_{2}\in{\mathcal{H}}:

Proof. Let zHz\in{\mathcal{H}} and ψ(t)=zf(θ1+t(θ2θ1))z\psi(t)=z^{\top}f^{\prime\prime}(\theta_{1}+t(\theta_{2}-\theta_{1}))z. We have:

with d12=θ2θ1,H(θ2θ1)d_{12}=\sqrt{\langle\theta_{2}-\theta_{1},H(\theta_{2}-\theta_{1})\rangle}. Thus ψ(0)eκρd12tψ(t)ψ(0)eκρd12t\psi(0)e^{\sqrt{\kappa\rho}d_{12}t}\geqslant\psi(t)\geqslant\psi(0)e^{-\sqrt{\kappa\rho}d_{12}t}. This implies, for t=1t=1, that

which implies the desired results. op\|\cdot\|_{\rm op} denotes the operator norm (largest singular value).

The following proposition gives an approximation result bounding the first order expansion of gradients by the first order expansion of function values.

Assume (B3-4). Then, for any θ1,θ2H\theta_{1},\theta_{2}\in{\mathcal{H}} and ΔH\Delta\in{\mathcal{H}}:

Proof. Let φ(t)=Δ,f(θ1+t(θ2θ1))f(θ1)tf(θ1)(θ2θ1)\varphi(t)=\langle\Delta,f^{\prime}(\theta_{1}+t(\theta_{2}-\theta_{1}))-f^{\prime}(\theta_{1})-tf^{\prime\prime}(\theta_{1})(\theta_{2}-\theta_{1})\rangle. We have φ(t)=Δ,f(θ1+t(θ2θ1))(θ1θ2Δ,f(θ1)(θ1θ2)\varphi^{\prime}(t)=\langle\Delta,f^{\prime\prime}(\theta_{1}+t(\theta_{2}-\theta_{1}))(\theta_{1}-\theta_{2}\rangle-\langle\Delta,f^{\prime\prime}(\theta_{1})(\theta_{1}-\theta_{2})\rangle and φ(t)=f(θ1+t(θ2θ1))[θ2θ1,θ2θ1,Δ]κρΔ,HΔ1/2θ1θ2,f(θ1+t(θ2θ1))(θ1θ2)|\varphi^{\prime\prime}(t)|=|f^{\prime\prime\prime}(\theta_{1}+t(\theta_{2}-\theta_{1}))[\theta_{2}-\theta_{1},\theta_{2}-\theta_{1},\Delta]|\leqslant\sqrt{\kappa\rho}\langle\Delta,H\Delta\rangle^{1/2}\langle\theta_{1}-\theta_{2},f^{\prime\prime}(\theta_{1}+t(\theta_{2}-\theta_{1}))(\theta_{1}-\theta_{2})\rangle. This leads to

The following proposition considers a global Taylor expansion of function values. Note that when κρθ2θ1,H(θ2θ1){\kappa\rho}{\langle\theta_{2}-\theta_{1},H(\theta_{2}-\theta_{1})\rangle} tends to zero, we obtain exactly the second-order Taylor expansion. For more details, see . This is followed by a corrolary that upper bounds excess risk by distance to optimum (this is thus the other direction than Prop. 4).

Assume (B3-4). Then, for any θ1,θ2H\theta_{1},\theta_{2}\in{\mathcal{H}} and ΔH\Delta\in{\mathcal{H}}:

Proof. Let \varphi(t)=f\big{[}\theta_{1}+t(\theta_{2}-\theta_{1})\big{]}-f(\theta_{1})-t\langle f^{\prime}(\theta_{1}),\theta_{2}-\theta_{1}\rangle. We have \varphi^{\prime}(t)=\langle f^{\prime}\big{[}\theta_{1}+t(\theta_{2}-\theta_{1})\big{]},\theta_{2}-\theta_{1}\rangle-\langle f^{\prime}(\theta_{1}),\theta_{2}-\theta_{1}\rangle and \varphi^{\prime\prime}(t)=\langle\theta_{2}-\theta_{1},f^{\prime\prime}\big{[}\theta_{1}+t(\theta_{2}-\theta_{1})\big{]}(\theta_{2}-\theta_{1})\rangle. Moreover, φ(t)κρφ(t)θ2θ1,H(θ2θ1)\varphi^{\prime\prime\prime}(t)\leqslant\sqrt{\kappa\rho}\varphi^{\prime\prime}(t)\sqrt{\langle\theta_{2}-\theta_{1},H(\theta_{2}-\theta_{1})\rangle}, leading to φ(t)eκρtθ2θ1,H(θ2θ1)φ(0)\varphi^{\prime\prime}(t)\leqslant e^{\sqrt{\kappa\rho}t\sqrt{\langle\theta_{2}-\theta_{1},H(\theta_{2}-\theta_{1})\rangle}}\varphi^{\prime\prime}(0). Integrating twice between 0 and 1 leads to the desired result.

Assume (B3-4), and θ1H\theta_{1}\in{\mathcal{H}} and θ2=θ1f(θ1)1f(θ1)\theta_{2}=\theta_{1}-f^{\prime\prime}(\theta_{1})^{-1}f^{\prime}(\theta_{1}). Then

where d=θθ,H(θθ)d=\sqrt{\langle\theta-\theta_{\ast},H(\theta-\theta_{\ast})\rangle}.

Proof. Applying Prop. 7 to θ2=θ\theta_{2}=\theta and θ1=θ\theta_{1}=\theta_{\ast}, we get the desired result.

The following proposition looks at a similar type of bounds than Prop. 7; it is weaker when θ2\theta_{2} and θ1\theta_{1} are close (it does not converge to the second-order Taylor expansion), but stronger for large values (it does not grow exponentially fast).

Assume (B3-4), and θ1,θ2H\theta 1,\theta_{2}\in{\mathcal{H}}. Then

E.2 Analysis of Newton step

Self-concordance has been traditionally used in the analysis of Newton’s method (see ). In this section, we adapt classical results to our specific notion of self-concordance (see also ). A key quantity is the so-called “Newton decrement” at a certain point θ1\theta_{1}, equal to f(θ1),f(θ1)1f(θ1)\langle f^{\prime}(\theta_{1}),f^{\prime\prime}(\theta_{1})^{-1}f^{\prime}(\theta_{1})\rangle, which governs the convergence behavior of Newton methods (this is the quantity which is originally shown to be quadratically convergent). In this paper, we consider a slightly different version where the Hessian is chosen to be the one at θ\theta_{\ast}, i.e., f(θ1),H1f(θ1)\langle f^{\prime}(\theta_{1}),H^{-1}f^{\prime}(\theta_{1})\rangle.

The following proposition shows how a full Newton step improves the Newton decrement (by taking a square).

Assume (B3-4), and θ1H\theta_{1}\in{\mathcal{H}} and θ2=θ1f(θ1)1f(θ1)\theta_{2}=\theta_{1}-f^{\prime\prime}(\theta_{1})^{-1}f^{\prime}(\theta_{1}). Then

where d122=θ2θ1,H(θ2θ1)eκρd1f(θ1),H1f(θ1)d_{12}^{2}={\langle\theta_{2}-\theta_{1},H(\theta_{2}-\theta_{1})\rangle}\leqslant e^{\sqrt{\kappa\rho}d_{1}}\langle f^{\prime}(\theta_{1}),H^{-1}f^{\prime}(\theta_{1})\rangle and d1=θ1θ,H(θ1θ)1/2d_{1}=\langle\theta_{1}-\theta_{\ast},H(\theta_{1}-\theta_{\ast})\rangle^{1/2}.

Proof. When applying the two previous propositions to the Newton step θ2=θ1f(θ1)1f(θ1)\theta_{2}=\theta_{1}-f^{\prime\prime}(\theta_{1})^{-1}f^{\prime}(\theta_{1}), we get:

We then optimize with respect to Δ\Delta to obtain the desired result.

The following proposition shows how the Newton decrement is upper bounded by a function of the excess risk.

Assume (B3-4), and θ1H\theta_{1}\in{\mathcal{H}}, then,

with d1=θ1θ,H(θ1θ)d_{1}=\sqrt{\langle\theta_{1}-\theta_{\ast},H(\theta_{1}-\theta_{\ast})\rangle} and Δ1=f(θ1)f(θ)\Delta_{1}=f(\theta_{1})-f(\theta_{\ast}).

Proof. We may bound the Newton decrement as follows:

The following proposition provides a bound on a quantity which is not the Newton decrement. Indeed, this is (up to the difference in the Hessians), the norm of the Newton step. This will be key in the following proofs.

Assume (B3-4), and θ1H\theta_{1}\in{\mathcal{H}}, then,

with d1=θ1θ,H(θ1θ)d_{1}=\sqrt{\langle\theta_{1}-\theta_{\ast},H(\theta_{1}-\theta_{\ast})\rangle}.

The next proposition shows that having a small Newton decrement implies that the weighted distance to optimum is small.

Assume (B3-4). If we have κρeκρdf(θ),H1f(θ)1/212{\sqrt{\kappa\rho}e^{\sqrt{\kappa\rho}d}\langle f^{\prime}(\theta),H^{-1}f^{\prime}(\theta)\rangle^{1/2}}\leqslant\frac{1}{2}, with d=θθ,H(θθ)d=\sqrt{\langle\theta-\theta_{\ast},H(\theta-\theta_{\ast})\rangle}, then

Proof. For any ΔH\Delta\in{\mathcal{H}} such that Δ,HΔ=1\langle\Delta,H\Delta\rangle=1, and t0t\geqslant 0, we have, following the same reasoning than for Prop. 7:

with v=κρΔ,HΔ=κρv=\sqrt{\kappa\rho}\sqrt{\langle\Delta,H\Delta\rangle}=\sqrt{\kappa\rho} and

This implies that if s1/2s\leqslant 1/2, for t=2κρ1s1st=\frac{2\sqrt{\kappa\rho}^{-1}s}{1-s}, f(θ2+tΔ)f(θ2)f(\theta_{2}+t\Delta)\geqslant f(\theta_{2}). Thus,

Note that the quantity dd appears twice in the result above.

E.3 Proof of Prop. 1

In this section, we prove Prop. 1 using tools from self-concordance analysis. These tools are described in the previous Sections E.1 and E.2. In order to understand the proof, it is preferable to read these sections first.

We use the notation t2=κρε1t^{2}=\kappa\rho\varepsilon_{1}. We then get d12=defθ1θ,H(θ1θ)(3+t2)ε1d_{1}^{2}\stackrel{{\scriptstyle\rm def}}{{=}}\langle\theta_{1}-\theta_{\ast},H(\theta_{1}-\theta_{\ast})\rangle\leqslant(3+t^{2})\varepsilon_{1} from Prop. 4.

Moreover, we have, also from Prop. 8, f(θ2)f(θ)ρH1/2f(θ1)1f(θ1)f(\theta_{2})-f(\theta_{\ast})\leqslant\sqrt{\rho}\|H^{1/2}f^{\prime\prime}(\theta_{1})^{-1}f^{\prime}(\theta_{1})\|, and using Prop. 5, we get

We may now use Prop. 10 and use the bound:

Combining with Eq. (46) and Eq. (47), we get

Proof of Eq. (26) and Eq. (27).

For these two inequalities, the starting point is the same. Using Eq. (48) (i.e., the Newton decrement at θ1\theta_{1}), we first show that the distances d12d_{12} and d2d_{2} are bounded. Using f(θ1)eκρd1Hf^{\prime\prime}(\theta_{1})\succcurlyeq e^{-\sqrt{\kappa\rho}d_{1}}H (Prop. 5):

Now, we can bound the Newton decrement at θ2\theta_{2}, using Prop. 9:

Thus, using Prop. 12, if κρe2κρd2f(θ2),H1f(θ2)t4e2t[3+t2+2(t)]3(t)14\kappa\rho e^{2\sqrt{\kappa\rho}d_{2}}\langle f^{\prime}(\theta_{2}),H^{-1}f^{\prime}(\theta_{2})\rangle\leqslant t^{4}e^{2t[\sqrt{3+t^{2}}+\square_{2}(t)]}\square_{3}(t)\leqslant\frac{1}{4}, then

where ε2κρu2\varepsilon_{2}\kappa\rho\leqslant u^{2}.

We now have two separate paths to obtain Eq. (26) and Eq. (27).

If we assume that ε2\varepsilon_{2} is bounded, i.e., with t=1/16t=1/16 and u=1/4u=1/4, then, one can check computationally that we obtain d3κρ0.41d_{3}\sqrt{\kappa\rho}\leqslant 0.41 and thus b=0.576b=0.576 below:

If we only assume ε1\varepsilon_{1} bounded, then we have (from the beginning of the proof):

because we may use the earlier reasoning with ε3=0\varepsilon_{3}=0. This is Eq. (26).

Appendix F Additional experiments

In Table 2, we describe the datasets we have used in experiments and where they were downloaded from.

In Figure 4, we provide similar results than in Section 4, for two additional datasets, quantum and rcv1, while in Figure 5, Figure 6 and Figure 7, we provide training objectives for all methods. We can make the following observations:

For non-sparse datasets, SAG manages to get the smallest training error, confirming the results of .

For the high-dimensional sparse datasets, constant step-size SGD is performing best (note that as shown in Section 3, it is not converging to the optimal value in general, this happens notably for the alpha dataset).

References