Convergence rates of sub-sampled Newton methods

Murat A. Erdogdu, Andrea Montanari

Introduction

in a batch setting, where $n$ is assumed to be much larger than $p$ . Most machine learning models can be expressed as above, where each function $f_{i}$ corresponds to an observation. Examples include logistic regression, support vector machines, neural networks and graphical models.

Many optimization algorithms have been developed to solve the above minimization problem using iterative methods [Bis95, BV04, Nes04]. In this paper, we consider the iterations of the following form

where, as $\lim_{t\to\infty}\xi^{t}_{1,\mbox{\tiny GD}}=1-(\lambda_{p}^{*}/\lambda_{1}^{*})$ , and $\lambda^{*}_{i}$ is the $i$ -th largest eigenvalue of the Hessian of $f(\theta)$ at minimizer $\theta_{*}$ .

A key challenge is that the sub-sampled Hessian is close to the actual Hessian along the directions corresponding to large eigenvalues (large curvature directions in $f(\theta)$ ), but is a poor approximation in the directions corresponding to small eigenvalues (flatter directions in $f(\theta)$ ). In order to overcome this problem, we use low-rank approximation. More precisely, we treat all the eigenvalues below the $r$ -th as if they were equal to the $(r+1)$ -th. This yields the desired stability with respect to the sub-sample: we call our algorithm NewSamp . In this paper, we establish the following:

NewSamp has a composite convergence rate: quadratic at start and linear near the minimizer, as illustrated in Figure 1. Formally, we prove a bound of the form

with coefficient that are explicitly given (and are computable from data).

The asymptiotic behavior of the linear convergence coefficient is $\lim_{t\to\infty}\xi_{1}^{t}=1-(\lambda^{*}_{p}/\lambda^{*}_{r+1})+\delta$ , for $\delta$ small. The condition number $(\lambda^{*}_{1}/\lambda^{*}_{p})$ which controls the convergence of GD, has been replaced by the milder $(\lambda^{*}_{r+1}/\lambda^{*}_{p})$ . For datasets with strong spectral features, this can be a large improvement, as shown in Figure 1.

The above results are achived without tuning the step-size, in particular, by setting ${\eta}_{t}=1$ .

The complexity per iteration of NewSamp is ${\mathcal{O}}(np+|S|p^{2})$ with $|S|$ the sample size.

Our theoretical results can be used to obtain convergence rates of previously proposed sub-sampling algorithms.

We demonstrate the performance of NewSamp on four datasets, and compare it to the well-known optimization methods.

The rest of the paper is organized as follows: Section 1.1 surveys the related work. In Section 2, we describe the proposed algorithm and provide the intuition behind it. Next, we present our theoretical results in Section 3, i.e., convergence rates corresponding to different sub-sampling schemes, followed by a discussion on how to choose the algorithm parameters. Two applications of the algorithm are discussed in Section 4. We compare our algorithm with several existing methods on various datasets in Section 5. Finally, in Section 6, we conclude with a brief discussion.

Even a synthetic review of optimization algorithms for large-scale machine learning would go beyond the page limits of this paper. Here, we emphasize that the method of choice depends crucially on the amount of data to be used, and their dimensionality (i.e., respectively, on the parameters $n$ and $p$ ). In this paper, we focus on a regime in which $p$ is large but not so large as to make matrix manipulations (of order $p^{2}$ to $p^{3}$ ) impossible. Also $n$ is large but not so large as to make batch gradient computation (of order $np$ ) prohibitive. On the other hand, our aim is to avoid ${\mathcal{O}}(np^{2})$ calculations required by standard Newton method. Examples of this regime are given in Section 4.

In contrast, online algorithms are the option of choice for very large $n$ since the computation per update is independent of $n$ . In the case of Stochastic Gradient Descent (SGD), the descent direction is formed by a randomly selected gradient [RM51]. Improvements to SGD have been developed by incorporating the previous gradient directions in the current update [SRB13, SHRY13, Bot10, DHS11].

Batch algorithms, on the other hand, can achieve faster convergence and exploit second order information. They are competitive for intermediate $n$ . Several methods in this category aim at quadratic, or at least super-linear convergence rates. In particular, Quasi-Newton methods have proven effective [Bis95, Nes04]. Another approach towards the same goal is to utilize sub-sampling to form an approximate Hessian [Mar10, BCNN11, VP12, QRTF15, EM15, Erd15a]. If the sub-sampled Hessian is close to the true Hessian, these methods can approach NM in terms of convergence rate, nevertheless, they enjoy much smaller complexity per update. No convergence rate analysis is available for these methods: this analysis is the main contribution of our paper. To the best of our knowledge, the best result in this direction is proven in [BCNN11] that estabilishes asymptotic convergence without quantitative bounds (exploiting general theory from [GNS09]).

Further improvements have been suggested either by utilizing Conjugate Gradient (CG) methods and/or using Krylov sub-spaces [Mar10, BCNN11, VP12]. Sub-sampling can be also used to obtain an approximate solution, if an exact solution is not required [DLFU13]. Lastly, there are various hybrid algorithms that combine two or more techniques to gain improvement. Examples include, sub-sampling and Quasi-Newton [SYG07, SDPG13, BHNS14], SGD and GD [FS12], NGD and NM [RF10], NGD and low-rank approximation [RaMB08].

NewSamp : A Newton method via sub-sampling and eigenvalue thresholding

In the regime we consider, $n\gg p\gg 1$ , there are two main drawbacks associated with the classical second order methods such as Newton’s method. The predominant issue in this regime is the computation of the Hessian matrix, which requires ${\mathcal{O}}(np^{2})$ operations, and the other issue is finding the inverse of the Hessian, which requires ${\mathcal{O}}(p^{3})$ computation. Sub-sampling is an effective and efficient way of addressing the first issue, by forming an approximate Hessian to exploit curvature information. Recent empirical studies show that sub-sampling the Hessian provides significant improvement in terms of computational cost, yet preserves the fast convergence rate of second order methods [Mar10, VP12, Erd15b]. If a uniform sub-sample is used, the sub-sampled Hessian will be a random matrix with expected value at the true Hessian, which can be considered as a sample estimator to the mean. Recent advances in statistics have shown that the performance of various estimators can be significantly improved by simple procedures such as shrinkage and/or thresholding [CCS10, DGJ13, GD14, GD14]. To this extent, we use a specialized low-rank approximation as the important second order information is generally contained in the largest few eigenvalues/vectors of the Hessian. We will see in Section 3, how this procedure provides faster convergence rates compared to the bare sub-sampling methods.

which is simply the sum of a scaled identity matrix and a rank- $r$ matrix. Note that the low-rank approximation that is suggested to improve the curvature estimation has been further utilized to reduce the cost of computing the inverse matrix. Final per-iteration cost of NewSamp will be ${\mathcal{O}}\left(np+(|S_{t}|+r)p^{2}\right)\approx{\mathcal{O}}\left(np+|S_{t}|p^{2}\right)$ . NewSamp takes the parameters $\{{\eta}_{t},|S_{t}|\}_{t}$ and $r$ as inputs. We discuss in Section 3.4, how to choose these parameters near-optimally, based on the theory we develop in Section 3.

Operator $\mathcal{P}_{\mathcal{C}}$ projects the current iterate to the feasible set ${\mathcal{C}}$ using Euclidean projection. Throughout, we assume that this projection can be done efficiently. In general, most unconstrained optimization problems do not require this step, and can be omitted. The purpose of projected iterations in our algorithm is mostly theoretical, and will be clear in Section 3.

Theoretical results

In this section, we provide the convergence analysis of NewSamp based on two different sub-sampling schemes:

Independent sub-sampling: At each iteration $t$ , $S_{t}$ is uniformly sampled from $[n]=\{1,2,...,n\}$ , independently from the sets $\{S_{\tau}\}_{\tau<t}$ , with or without replacement.

Sequentially dependent sub-sampling: At each iteration $t$ , $S_{t}$ is sampled from $[n]$ , based on a distribution which might depend on the previous sets $\{S_{\tau}\}_{\tau<t}$ , but not on any randomness in the data.

The first sub-sampling scheme is simple and commonly used in optimization. One drawback is that the sub-sampled set at the current iteration is independent of the previous sub-samples, hence does not consider which of the samples were previously used to form the approximate curvature information. In order to prevent cycles and obtain better performance near the optimum, one might want to increase the sample size as the iteration advances [Mar10], including previously unused samples. This process results in a sequence of dependent sub-samples which falls into the sub-sampling scheme S2. In our theoretical analysis, we make the following assumptions:

For any subset $S\subset[n]$ , there exists a constant $M_{|S|}$ depending on the size of $S$ , such that $\forall{\theta},{\theta}^{\prime}\in{\mathcal{C}}$ ,

$\forall i=1,2,...,n$ , the Hessian of the function $f_{i}({\theta})$ , $\boldsymbol{\nabla}_{\theta}^{2}f_{i}({\theta})$ , is upper bounded by an absolute constant $K$ , i.e.,

Assume that the parameter set ${\mathcal{C}}$ is convex and $S_{t}\subset[n]$ is based on sub-sampling scheme S1. Further, let the Assumptions 1 and 2 hold and ${\theta}_{*}\in{\mathcal{C}}$ . Then, for an absolute constant $c>0$ , with probability at least $1-2/p$ , the updates of the form Eq. (1.2) satisfy

for coefficients $\xi_{1}^{t}$ and $\xi_{2}^{t}$ defined as

If the initial point $\hat{\theta}^{0}$ is close to ${\theta}_{*}$ , the algorithm will start with a quadratic rate of convergence which will transform into linear rate later in the close neighborhood of the optimum.

then we have, with probability at least $1-2/p$ ,

for an absolute constant $c>0$ , for the coefficients $\xi_{1}^{t}$ and $\xi_{2}^{t}$ are defined as

NewSamp has a composite convergence rate where $\xi_{1}^{t}$ and $\xi_{2}^{t}$ are the coefficients of the linear and the quadratic terms, respectively (See the right plot in Figure 1). We observe that the sub-sampling size has a significant effect on the linear term, whereas the quadratic term is governed by the Lipschitz constant. We emphasize that the case ${\eta}_{t}=1$ is feasible for the conditions of Theorem 3.2. In the case of quadratic functions, since the Lipschitz constant is 0 , we obtain $\xi_{2}^{t}=0$ and the algorithm converges linearly. Following corollary summarizes this case.

2 Sequentially dependent sub-sampling

Here, we assume that the sub-sampling scheme S2 is used to generate $\{S_{\tau}\}_{\tau\geq 1}$ . Distribution of sub-sampled sets may depend on each other, but not on any randomness in the dataset. Examples include fixed sub-samples as well as sub-samples of increasing size, sequentially covering unused data. In addition to Assumptions 1-2, we assume the following.

Most statistical learning algorithms can be formulated as above, e.g., in classification problems, one has access to i.i.d. samples $\{(y_{i},x_{i})\}_{i=1}^{n}$ where $y_{i}$ and $x_{i}$ denote the class label and the covariate, and $\varphi$ measures the classification error (See Section 4 for examples). For the sub-sampling scheme S2, an analogue of Lemma 3.1 is stated in Appendix as Lemma A.1, which immediately leads to the following theorem.

for the coefficients $\xi_{1}^{t}$ and $\xi_{2}^{t}$ defined as

Compared to the Theorem 3.2, we observe that the coefficient of the quadratic term does not change. This is due to Assumption 1. However, the bound on the linear term is worse, since we use the uniform bound over the convex parameter set ${\mathcal{C}}$ . The same order of magnitude is also observed by [Erd15b], which relies on a similar proof technique. Similar to Corollary 3.3, we have the following result for the quadratic functions.

Let the assumptions of Theorem 3.4 hold. Further assume that $\forall i\in[n]$ , the functions ${\theta}\to f_{i}({\theta})$ are quadratic. Then, conditioned on the event $\mathcal{E}$ , with probability at least $1-c_{\mathcal{E}}\,e^{-p}$ , NewSamp iterates satisfy

for coefficient $\xi_{1}^{t}$ defined as in Theorem 3.4.

3 Dependence of coefficients on t𝑡t and convergence guarantees

The coefficients $\xi_{1}^{t}$ and $\xi_{2}^{t}$ depend on the iteration step which is an undesirable aspect of the above results. However, these constants can be well approximated by their analogues $\xi_{1}^{*}$ and $\xi_{2}^{*}$ evaluated at the optimum which are defined by simply replacing $\lambda_{j}^{t}$ with $\lambda_{j}^{*}$ in their definition, where the latter is the $j$ -th eigenvalue of full-Hessian at $\theta_{*}$ . For the sake of simplicity, we only consider the case where the functions ${\theta}\to f_{i}({\theta})$ are quadratic.

Assume that the functions $f_{i}({\theta})$ are quadratic, $S_{t}$ is based on scheme S1 and ${\eta}_{t}=1$ . Let the full Hessian at ${\theta}_{*}$ be lower bounded by a constant $k$ . Then for sufficiently large $|S_{t}|$ , we have, with probability $1-2/p$

for some absolute constants $c_{1},c_{2}$ .

Theorem 3.6 implies that, when the sub-sampling size is sufficiently large, $\xi_{1}^{t}$ will concentrate around $\xi_{1}^{*}$ . Generalizing the above theorem to non-quadratic functions is straightforward, in which case, one would get additional terms involving the difference $\|\hat{\theta}^{t}-{\theta}_{*}\|_{2}$ . In the case of scheme S2, if one uses fixed sub-samples, i.e., $\forall t$ , $S_{t}=S$ , then the coefficient $\xi_{1}^{t}$ does not depend on $t$ . The following corollary gives a sufficient condition for convergence. A detailed discussion on the number of iterations until convergence and further local convergence properties can be found in Appendix B.

Assume that $\xi_{1}^{t}$ and $\xi_{2}^{t}$ are well-approximated by $\xi_{1}^{*}$ and $\xi_{2}^{*}$ with an error bound of $\delta$ , i.e., $\xi_{i}^{t}\leq\xi_{i}^{*}+\delta$ for $i=1,2$ , as in Theorem 3.6. For the initial point $\hat{\theta}^{0}$ , a sufficient condition for convergence is

4 Choosing the algorithm parameters

Algorithm parameters play a crucial role in most optimization methods. Based on the theoretical results from previous sections, we discuss procedures to choose the optimal values for the step size ${\eta}_{t}$ , sub-sample size $|S_{t}|$ and rank threshold.

Step size: For the step size of NewSamp at iteration $t$ , we suggest

where $\gamma={\mathcal{O}}(\log(p)/|S_{t}|)$ . Note that ${\eta}_{t}(0)$ is the upper bound in Theorems 3.2 and 3.4 and it minimizes the first component of $\xi_{1}^{t}$ . The other terms in $\xi_{1}^{t}$ and $\xi^{t}_{2}$ linearly depend on ${\eta}_{t}$ . To compensate for that, we shrink ${\eta}_{t}(0)$ towards 1. Contrary to most algorithms, optimal step size of NewSamp is larger than 1. See Appendix C for a rigorous derivation of Eq. 3.3.

Sample size: By Theorem 3.2, a sub-sample of size ${\mathcal{O}}((K/\lambda^{*}_{p})^{2}\log(p))$ should be sufficient to obtain a small coefficient for the linear phase. Also note that sub-sample size $|S_{t}|$ scales quadratically with the condition number.

Rank threshold: For a full-Hessian with effective rank $R$ (trace divided by the largest eigenvalue), it suffices to use ${\mathcal{O}}(R\log(p))$ samples [Ver10, Ver12]. Effective rank is upper bounded by the dimension $p$ . Hence, one can use $p\log(p)$ samples to approximate the full-Hessian and choose a rank threshold which retains the important curvature information.

Examples

Finding the maximum likelihood estimator in Generalized Linear Models (GLMs) is equivalent to minimizing the negative log-likelihood $f({\theta})$ ,

The gradient and the Hessian of the above function can be written as:

We note that the Hessian of the GLM problem is always positive definite. This is because the second derivative of the cumulant generating function is simply the variance of the observations. Using the results from Section 3, we perform a convergence analysis of our algorithm on a GLM problem.

Let $S_{t}\subset[n]$ be a uniform sub-sample, and ${\mathcal{C}}$ be a convex parameter set. Assume that the second derivative of the cumulant generating function, ${\Phi^{(2)}}$ is bounded by $1$ , and it is Lipschitz continuous with Lipschitz constant $L$ . Further, assume that the covariates are contained in a ball of radius $\sqrt{R_{x}}$ , i.e. $\max_{i\in[n]}\|x_{i}\|_{2}\leq\sqrt{R_{x}}.$ Then, for $\hat{\theta}^{t}$ given by NewSamp with constant step size ${\eta}_{t}=1$ at iteration $t$ , with probability at least $1-2/p$ , we have

for constants $\xi^{t}_{1}$ and $\xi^{t}_{2}$ defined as

Proof of Corollary 4.1 can be found in Appendix A. Note that the bound on the second derivative is quite loose for Poisson regression due to exponentially fast growing cumulant generating function.

2 Support Vector Machines

A linear Support Vector Machine (SVM) provides a separating hyperplane which maximizes the margin, i.e., the distance between the hyperplane and the support vectors. Although the vast majority of the literature focuses on the dual problem [Vap98, SS02], SVMs can be trained using the primal as well. Since the dual problem does not scale well with the number of data points (some approaches get ${\mathcal{O}}(n^{3})$ complexity, [WG11]), the primal might be better-suited for optimization of linear SVMs [KD05, Cha07].

The primal problem for the linear SVM can be written as

For the sake of simplicity, we will focus on Hinge-2 loss. Denote by $SV_{t}$ , the set of indices of all the support vectors at iteration $t$ , i.e.,

When the loss is set to be the Hinge-2 loss, the Hessian of the SVM problem, normalized by the number of support vectors, can be written as

When $|SV_{t}|$ is large, the problem falls into our setup and can be solved efficiently using NewSamp . Note that unlike the GLM setting, Lipschitz condition of our Theorems do not apply here. However, we empirically demonstrate that NewSamp works regardless of such assumptions.

Experiments

In this section, we validate the performance of NewSamp through extensive numerical studies. We experimented on two optimization problems, namely, Logistic Regression (LR) and Support Vector Machines (SVM) with quadratic loss. LR minimizes Eq. 4.1 for the logistic function, whereas SVM minimizes Eq. 4.3 for the Hinge-2 loss.

In the following, we briefly describe the algorithms that are used in the experiments:

Gradient Descent (GD), at each iteration, takes a step proportional to negative of the full gradient evaluated at the current iterate. Under certain regularity conditions, GD exhibits a linear convergence rate.

Accelerated Gradient Descent (AGD) is proposed by Nesterov [Nes83], which improves over the gradient descent by using a momentum term. Performance of AGD strongly depends of the smoothness of the function $f$ and decreasing step size adjustments may be necessary for convergence.

Newton’s Method (NM) achieves a quadratic convergence rate by utilizing the inverse Hessian evaluated at the current iterate. However, the computation of Hessian makes it impractical for large-scale datasets.

Broyden-Fletcher-Goldfarb-Shanno (BFGS) is the most popular and stable Quasi-Newton method. Scaling matrix is formed by accumulating the information from iterates and gradients, satisfying Quasi-Newton rule. The convergence rate is locally super-linear and per-iteration cost is comparable to first order methods.

Limited Memory BFGS (L-BFGS) is a variant of BFGS, which uses only the recent iterates and gradients to form the approximate Hessian, providing significant improvement in terms of memory usage.

Stochastic Gradient Descent (SGD) is a simplified version of GD where, at each iteration, instead of the full gradient, a randomly selected gradient is used. Per-iteration cost is independent of $n$ , yet the convergence rate is significantly slower compared to batch algorithms. We follow the guidelines of [Bot10, SHRY13] for the step size,, i.e.,

Adaptive Gradient Scaling (AdaGrad) is an online algorithm which uses an adaptive learning rate based on the previous gradients. AdaGrad significantly improves the performance and stability of SGD [DHS11]. This is achieved by scaling each entry of gradient differently. , i.e., at iteration step $t$ , step size for the $j$ -th coordinate is

For each of the batch algorithms, we used constant step size, and for all the algorithms, we choose the step size that provides the fastest convergence. For the stochastic algorithms, we optimized over the parameters that define the step size. Parameters of NewSamp are selected following the guidelines described in Section 3.4.

The effects of sub-sampling size $|S_{t}|$ and rank threshold are demonstrated in Figure 1. A thorough comparison of the aforementioned optimization techniques is presented in Figure 2. In the case of LR, we observe that stochastic algorithms enjoy fast convergence at start, but slows down later as they get close to the true minimizer. The algorithm that comes close to NewSamp in terms of performance is BFGS. In the case of SVM, Newton’s method is the closest algorithm to NewSamp, yet in all scenarios, NewSamp outperforms its competitors. Note that the global convergence of BFGS is not better than that of GD [Nes04]. The condition for super-linear rate is $\sum_{t}\|\theta^{t}-\theta_{*}\|_{2}<\infty$ for which, an initial point close to the optimum is required [DM77]. This condition can be rarely satisfied in practice, which also affects the performance of the other second order methods. For NewSamp , even though the rank thresholding provides a certain level of robustness, we observed that the choice of a good starting point is still an important factor. Details about Figure 2 can be found in Table 3 in Appendix. For additional experiments and a detailed discussion, see Appendix D.

Conclusion

In this paper, we proposed a sub-sampling based second order method utilizing low-rank Hessian estimation. The proposed method has the target regime $n\gg p$ and has ${\mathcal{O}}\left(np+|S|p^{2}\right)$ complexity per-iteration. We showed that the convergence rate of NewSamp is composite for two widely used sub-sampling schemes, i.e., starts as quadratic convergence and transforms to linear convergence near the optimum. Convergence behavior under other sub-sampling schemes is an interesting line of research. Numerical experiments on both real and synthetic datasets demonstrate the performance of the proposed algorithm which we compared to the classical optimization methods.

Acknowledgments

We are grateful to Mohsen Bayati for stimulating conversations on the topic of this work. We would like to thank Robert M. Gower for carefully reading this manuscript and providing valuable feedback. A.M. was partially supported by NSF grants CCF-1319979 and DMS-1106627 and the AFOSR grant FA9550-13-1-0036.

References

Appendix A Proofs of Theorems and Lemmas

Note that the first term on the right hand side governs the convergence behavior of the algorithm.

By the triangle inequality, the governing term that determines the convergence rate can be bounded as

In the following, we will use some matrix concentration results to bound the right hand side of Eq. (A.1). The result for sampling with replacement can be obtained by matrix Hoeffding’s inequality given in [Tro12]. Note that this explicitly assumes that the samples are independent. For the concentration bounds under sampling without replacement (see i.e. [GN10, Gro11, MJC+14]), we will use the Operator-Bernstein inequality given in [GN10] which is provided in Section E as Lemma E.3 for convenience.

Using any indexing over the elements of sub-sample $S$ , we denote the each element in $S$ by $s_{i}$ , i.e.,

Next, we apply the matrix Bernstein’s inequality given in Lemma E.3. For ${\epsilon}\leq 4K$ , and ${\theta}\in{\mathcal{C}}$ ,

Therefore, to obtain a convergence rate of ${\mathcal{O}}(1/p)$ , we let

where $C=6K$ is sufficient. We also note that the condition on ${\epsilon}$ is trivially satisfied by our choice of ${\epsilon}$ in the target regime.

First inequality follows from the fact that norm of an integral is less than or equal to the integral of the norm. Second inequality follows from the Lipschitz property.

Combining the above results, we obtain the following for the governing term in Eq.(A.1): For some absolute constants $c,C>0$ , with probability at least $1-2/p$ , we have

Assume that the parameter set ${\mathcal{C}}$ is convex and $S_{t}\subset[n]$ is based on sub-sampling scheme S2. Further, let the Assumptions 1, 2 and 3 hold, almost surely. Then, for some absolute constants $c,C>0$ , with probability at least $1-e^{-p}$ , the updates of the form stated in Eq. (1.2) satisfy

for coefficients $\xi_{1}^{t},\xi_{2}^{t}$ defined as

The first part of the proof is the same as Lemma 3.1. We carry our analysis from Eq.(A.1). Note that in this general set-up, the iterates are random variables that depend on the random functions. Therefore, we use a uniform bound for the right hand side in Eq.(A.1). That is,

By the Assumption 1, given ${\theta},{\theta}^{\prime}\in{\mathcal{C}}$ such that $\|{\theta}-{\theta}^{\prime}\|_{2}\leq\Delta$ , we have,

Next, we will use a covering net argument to obtain a bound on the matrix empirical process. Note that similar bounds on the matrix forms can be obtained through other approaches like chaining as well [DE15]. Let ${\mathcal{T}}_{\Delta}$ be a $\Delta$ -net over the convex set ${\mathcal{C}}$ . By the above inequality, we obtain

Now we will argue that the right hand side is small with high probability using the matrix Hoeffding’s inequality from [Tro12]. By the union bound over ${\mathcal{T}}_{\Delta}$ , we have

For the first term on the right hand side, by Lemma E.1, we write:

By the Assumption (2), we have the same bounds as in Eq.(A.2). Hence, for ${\epsilon}>0$ and ${\theta}\in{\mathcal{C}}$ , by the matrix Hoeffding’s inequality [Tro12],

We would like to obtain an exponential decay with a rate of at least ${\mathcal{O}}(p)$ . Hence, we require,

which gives the optimal value of ${\epsilon}$ as

Therefore, we conclude that for the above choice of ${\epsilon}$ , with probability at least $1-e^{-p}$ , we have

Applying this result to the inequality in Eq.(A.5), we obtain that with probability at least $1-e^{-p}$ ,

The right hand side of the above inequality depends on the net covering diameter $\Delta$ . We optimize over $\Delta$ using Lemma E.5 which provides for

we obtain that with probability at least $1-e^{-p}$ ,

Combining this with the bound stated in Eq.(A.1), we conclude the proof. ∎

By the Weyl’s and matrix Hoeffding’s [Tro12] inequalities (See Eq. (A.3) for details), we can write

for some constants $c$ and $c^{\prime\prime\prime}$ . ∎

Observe that $f_{i}({\theta})=\Phi(\langle x_{i},{\theta}\rangle)-y_{i}\langle x_{i},{\theta}\rangle$ , and $\boldsymbol{\nabla}^{2}_{\theta}f_{i}({\theta})=x_{i}x_{i}^{T}{\Phi^{(2)}}(\langle x_{i},{\theta}\rangle)$ . For an index set $S$ , we have $\forall{\theta},{\theta}^{\prime}\in{\mathcal{C}}$

Therefore, the Assumption 1 is satisfied with the Lipschitz constant $M_{|S_{t}|}\coloneqq LR_{x}^{3/2}.$ Moreover, by the inequality

the Assumption 2 is satisfied for $K\coloneqq R_{x}.$ We conclude the proof by applying Theorem 3.2. ∎

Appendix B Properties of composite convergence

and assume that the algorithm gets a composite convergence rate as

where $\xi_{1},\xi_{2}>0$ denote the coefficients of linearly and quadratically converging terms, respectively.

We state the following theorem on the local convergence properties of compositely converging algorithms.

For a compositely converging algorithm as in Eq. (B.1) with coefficients $1>\xi_{1},\xi_{2}>0$ , if the initial distance $\Delta_{0}$ satisfies $\Delta_{0}<(1-\xi_{1})/\xi_{2}$ , then we have

The above theorem states that the local convergence of a compositely converging algorithm will be dominated by the linear term.

The condition on the initial point implies that $\Delta_{t}\to 0$ as $t\to\infty$ . Hence, for any given $\delta>0$ , there exists a positive integer $T$ such that $\forall t\geq T$ , we have $\Delta_{t}<\delta/\xi_{2}$ . For such values of $t$ , we write

Taking the limit on both sides concludes the proof. ∎

B.2 Number of iterations

The total number of iterations, combined with the per-iteration cost, determines the total complexity of an algorithm. Therefore, it is important to derive an upper bound on the total number of iterations of a compositely converging algorithm.

For a compositely converging algorithm as in Eq. (B.1) with coefficients $\xi_{1},\xi_{2}\in(0,1)$ , assume that the initial distance $\Delta_{0}$ satisfies $\Delta_{0}<(1-\xi_{1})/\xi_{2}$ and for a given tolerance $\epsilon$ , define the interval

Then the total number of iterations needed to approximate the true minimizer with $\epsilon$ tolerance is upper bounded by $T(\delta_{*})$ , where

We have $\Delta_{t}\to 0$ as $t\to\infty$ by the condition on initial point $\Delta_{0}$ . Let $\delta\in D$ be a real number and $t_{1}$ be the last iteration step such that $\Delta_{t}>\delta$ . Then $\forall t\geq t_{1}$ ,

Therefore, in this regime, the convergence rate of the algorithm is dominated by a quadratically converging term with coefficient $(\xi_{1}/\delta+\xi_{2})$ . The total number of iterations needed to attain a tolerance of $\delta$ is upper bounded by

When $\Delta_{t}<\delta$ , namely $t>t_{1}$ , we have

In this regime, the convergence rate is dominated by a linearly converging term with coefficient $\left(\xi_{1}+\xi_{2}\delta\right)$ . Therefore, the total number of iterations since $t_{1}$ until a tolerance of $\epsilon$ is reached can be upper bounded by

Hence, the total number of iterations needed for a composite algorithm as in Eq. B.1 to reach a tolerance of $\epsilon$ is upper bounded by

The above statement holds for any $\delta\in D$ . Therefore, we minimize $T(\delta)$ over the set $D$ . ∎

In most optimization algorithms, step size plays a crucial role. If the dataset is so large that one cannot try out many values of the step size. In this section, we describe an efficient and adaptive way for this purpose by using the theoretical results derived in the previous sections.

In the proof of Lemma 3.1, we observe that the convergence rate of NewSamp is governed by the term

If we optimize the above quantity over ${\eta}_{t}$ , we obtain the optimal step size as

By the Lipschitz continuity of eigenvalues , we write

Similarly, for the minimum eigenvalue, we can write

We are more interested in the case where $p/2>r$ . In this case, we suggest the step size for the iteration step $t$ as

Appendix D Further experiments and details

In this section, we present the details of the experiments presented in Figure 2 and provide additional simulation results.

We first start with additional experiments. The goal of this experiment is to further analyze the effect of rank in the performance of NewSamp . We experimented using $r$ -spiked model for $r=3,10,20$ . The case $r=3$ was already presented in Figure 2, which is included in Figure 3 to ease the comparison. The results are presented in Figures 3 and the details are summarized in Table 2. In the case of LR optimization, we observe through Figure 3 that stochastic algorithms enjoy fast convergence in the beginning but slows down later as they get close to the true minimizer. The algorithms that come closer to NewSamp in terms of performance are BFGS and LBFGS. Especially when $r=20$ , performance of BFGS and that of NewSamp are similar, yet NewSamp still does better. In the case of SVM optimization, the algorithm that comes closer to NewSamp is Newton’s method.

We further demonstrate how the algorithm coefficients $\xi_{1}$ and $\xi_{2}$ between datasets in Figure 4.

Appendix E Useful lemmas

A similar proof appears in [VdVW96]. The set ${\mathcal{C}}$ can be contained in a $p$ -dimensional cube of size ${{\rm diam}({\mathcal{C}})}$ . Consider a grid over this cube with mesh width $2\epsilon/\sqrt{p}$ . Then ${\mathcal{C}}$ can be covered with at most $({{\rm diam}({\mathcal{C}})}/(2\epsilon/\sqrt{p}))^{p}$ many cubes of edge length $2\epsilon/\sqrt{p}$ . If ones takes the projection of the centers of such cubes onto ${\mathcal{C}}$ and considers the circumscribed balls of radius $\epsilon$ , we may conclude that ${\mathcal{C}}$ can be covered with at most

Let $X$ be a symmetric $p\times p$ matrix, and let $T_{\epsilon}$ be an $\epsilon$ -net over $S^{p-1}$ . Then,

Given its size, let $S$ denote a uniformly random sample from $\{1,2,...,|\mathcal{X}|\}$ with or without replacement. Then we have

Let $Z$ be a random variable with a density function $f$ and cumulative distribution function $F$ . If $F^{C}=1-F$ , then,

Since $\lim_{z\to\infty}zF^{C}(z)=\lim_{z\to-\infty}zF(z)=0$ , we have

we have ${\epsilon}^{2}\geq a\log(b/{\epsilon})$ .

Since $a,b>0$ and $x\to e^{x}$ is a monotone increasing function, the above inequality condition is equivalent to

Now, we define the function $f(w)=we^{w}$ for $w>0$ . $f$ is continuous and invertible on $[0,\infty)$ . Note that $f^{-1}$ is also a continuous and increasing function for $w>0$ . Therefore, we have

Observe that the smallest possible value for ${\epsilon}$ would be simply the square root of ${a}f^{-1}\left({2b^{2}}/{a}\right)/{2}$ . For simplicity, we will obtain a more interpretable expression for ${\epsilon}$ . By the definition of $f^{-1}$ , we have

Since the condition on $a$ and $b$ enforces $f^{-1}(y)$ to be larger than 1, we obtain the simple inequality that

Using the above inequality, if ${\epsilon}$ satisfies