Competing with the Empirical Risk Minimizer in a Single Pass

Roy Frostig, Rong Ge, Sham M. Kakade, Aaron Sidford

Introduction

Consider the following optimization problem:

is small, where the expectation is over the estimator w^N\widehat{w}_{N} (which depends on the sampled functions).

Stochastic approximation algorithms, such as stochastic gradient descent (SGD) (Robbins and Monro, 1951), are the most widely used in practice, due to their ease of implementation and their efficiency with regards to runtime and memory. Without consideration for computational constraints, we often wish to compute the empirical risk minimizer (ERM; or, equivalently, the MM-estimator):

In the context of statistical modeling, the ERM is the maximum likelihood estimator (MLE). Under certain regularity conditions, and under correct model specification,A well specified statistical model is one where the data is generated under some model in the parametric class. See the linear regression Section 3.1. the MLE is asymptotically efficient, in that no unbiased estimator can have a lower variance in the limit (see Lehmann and Casella (1998); van der Vaart (2000)).However, note that biased estimators, such as the James-Stein estimator, can outperform the MLE (Lehmann and Casella, 1998). Analogous arguments have been made in the stochastic approximation setting, where we do not necessarily have a statistical model of the distribution D\mathcal{D} (see Kushner and Yin (2003)).

The question we aim to address is as follows. Consider the ratio:

We seek an algorithm to compute w^N\widehat{w}_{N} in which: (1) under sufficient regularity conditions, this ratio approaches 11 on every problem D\mathcal{D} and (2) it does so quickly, at a rate quantifiable in terms of the number of samples, the dependence on the initial error (and other relevant quantities), and the computational time and space usage.

Under certain smoothness assumptions on ψ\psi and strong convexity assumptions on PP (applicable to linear and logistic regression, generalized linear models, smoothed Huber losses, and various other MM-estimation problems), we provide an algorithm where:

The algorithm achieves the same statistical rate of convergence as the ERM on every problem, even considering constant factors, and we quantify the sample size at which this occurs.

The algorithm can be implemented in linear time with a single pass of the observed data, using space linear in the size of a single sample.

The algorithm decreases the standard notion of initial error at a super-polynomial rate.A function is super-polynomial if grows faster than any polynomial.

The algorithm is trivially parallelizable (see Remark 3).

Table 1 compares previous (and concurrent) algorithms that enjoy the first two guarantees; this work is the first with a finite-sample analysis handling the more general class of problems. Our algorithm is a variant of the stochastic variance reduced gradient procedure of Johnson and Zhang (2013).

Importantly, we quantify how fast we obtain a rate comparable to that of the ERM. For the case of linear regression, we have non-trivial guarantees when the sample size NN is larger than a constant times what can be interpreted as a condition number, κ=L/μ\kappa=L/\mu, where μ\mu is a strong convexity parameter of PP and where LL is a smoothness parameter of each ψ\psi. Critically, after NN is larger than κ\kappa, the initial error is divided by a factor that can be larger than any polynomial in N/κN/\kappa.

Finally, in order to address this question on a per-problem basis, we provide both upper and lower bounds for the rate of convergence of the ERM.

2 Related work

Stochastic optimization dates back to the work of Robbins and Monro (1951) and has seen much subsequent work (Kushner and Clark, 1978; Kushner and Yin, 2003; Nemirovski and Yudin, 1983). More recently, questions of how to quantify and compare rates of estimation procedures – with implications to machine learning problems in the streaming and large dataset settings – have been raised and discussed several times (see Bottou and Bousquet (2008); Agarwal and Bottou (2014)).

The pioneering work of Polyak and Juditsky (1992) and Ruppert (1988) provides an asymptotically optimal streaming algorithm, by averaging the iterates of an SGD procedure. It is unclear how quickly these algorithms converge to the rate of the ERM in finite sample; the relevant dependencies, such as the dependence on the initial error – that is, P(w0)P(w)P(w_{0})-P(w_{*}) where w0w_{0} is the starting point of the algorithm – are not specified. In particular, they characterize the limiting distribution of N(w^Nw)\sqrt{N}(\widehat{w}_{N}-w_{*}), essentially arguing that the variance of the iterate-averaging procedure matches the asymptotic distribution of the ERM (see Kushner and Yin (2003)).

In a series of papers, Bach and Moulines (2011), Bach and Moulines (2013), Dieuleveut and Bach (2014), and Defossez and Bach (2015) provide non-asymptotic analysis of the same averaging schemes. Of these, for the specific case of linear least-squares regression, Dieuleveut and Bach (2014) and Defossez and Bach (2015) provide rates which are competitive with the ERM, concurrently and independent of results presented herein. The work in Bach and Moulines (2011) and Bach and Moulines (2013) either does not achieve the ERM rate or has a dependence on the initial error which is not lower in order; it is rather in Dieuleveut and Bach (2014) and Defossez and Bach (2015) that dependence on the initial error decaying as 1/N21/N^{2} is shown.

For the special case of least squares, one could adapt the algorithm and guarantees of Dieuleveut and Bach (2014); Defossez and Bach (2015), by replacing global averaging with random restarts, to obtain super-polynomial rates (results comparable to ours when specializing to linear regression). For more general problems, it is unclear how such an adaptation would work – using constant step sizes alone may not suffice. In contrast, as shown in Table 1, our algorithm is identical for a wide variety of cases and does not need decaying rates (whose choices may be difficult in practice).

We should also note that much work has characterized rates of convergence under various assumptions on PP and ψ\psi different than our own. Our case of interest is when PP is strongly convex. For such PP, the rates of convergence of many algorithms are O(1/N)O(1/N), often achieved by averaging the iterates in some way (Nemirovski et al., 2009; Juditsky and Nesterov, 2010; Rakhlin et al., 2012; Hazan and Kale, 2014). These results do not achieve a constant competitive ratio, for a variety of reasons (they have a leading order dependencies on various quantities, including the initial error along with strong convexity and smoothness parameters). Solely in terms of the dependence on the sample size NN, these rates are known to be optimal (Nemirovski and Yudin, 1983; Nesterov, 2004; Agarwal et al., 2012).

In statistics, it is classically argued that the MLE, under certain restrictions, is an asymptotically efficient estimator for well-specified statistical models (Lehmann and Casella, 1998; van der Vaart, 2000). Analogously, in an optimization context, applicable to mis-specified models, similar asymptotic arguments have been made: under certain restrictions, the asymptotically optimal estimator is one which has a limiting variance that is equivalent to that of the ERM (Anbar, 1971; Fabian, 1973; Kushner and Clark, 1978).

With regards to finite-sample rates, Agarwal et al. (2012) provide information-theoretic lower bounds (for any strategy) for certain stochastic convex optimization problems. This result does not imply our bounds as they do not consider the same smoothness assumptions on ψ\psi. For the special case of linear least-squares regression, there are several upper bounds (for instance, Caponnetto and De Vito (2007); Hsu et al. (2014)). Recently, Shamir (2014) provides lower bounds specifically for the least-squares estimator, applicable under model mis-specification, and sharp only for specific problems.

There are numerous algorithms for optimizing sums of convex functions that converge linearly, i.e. that depend only logarithmically on the target precision. Notably, several recently developed such algorithms are applicable in the setting where the sample size NN becomes large, due to their stochastic nature (Strohmer and Vershynin, 2009; Le Roux et al., 2012; Shalev-Shwartz and Zhang, 2013; Johnson and Zhang, 2013). These procedures minimize a sum of NN losses in time (near to) linear in NN, provided NN is sufficiently large relative to the dimension and the condition number.

A second natural question is: can one naively use a doubling trick with an extant algorithm to compete with the ERM? By this we mean to iteratively run such a linearly convergent optimization algorithm, on increasingly larger subsets of the data, with the hope of cutting the error at each iteration by a constant fraction, eventually down to that of the ERM. There are two points to note for this approach. First, the approach is not implementable in a streaming model as one would eventually have to run the algorithm on a constant fraction of the entire dataset size, thus essentially holding the entire dataset in memory. Second, proving such an algorithm succeeds would similarly involve the aforementioned type of generalization argument.

We conjecture that these tight generalization arguments described are attainable, although with a somewhat involved analysis. For linear regression, the bounds in Hsu et al. (2014) may suffice. More generally, we believe the detailed ERM analysis provided herein could be used.

In contrast, the statistical convergence analysis of our single-pass algorithm is self-contained and does not go through any generalization arguments about the ERM. In fact, it avoids matrix concentration arguments entirely.

To our knowledge, this work provides the first streaming algorithm guaranteed to have a rate that approaches that of the ERM (under certain regularity assumptions on D\mathcal{D}), where the initial error is decreased at a super-polynomial rate. The previous work, in the general case that we consider, only provides asymptotic convergence guarantees (Polyak and Juditsky, 1992). For the special case of linear least-squares regression, the concurrent and independent work presented in Dieuleveut and Bach (2014) and Defossez and Bach (2015) also converges to the rate of the ERM, with a lower-order dependence on the initial error of Ω(1/N2)\Omega(1/N^{2}). Furthermore, even if we ignored memory constraints and focused solely on computational complexity, our algorithm compares favorably to using state-of-the-art algorithms for minimizing sums of functions (such as the linearly convergent algorithms in Le Roux et al. (2012); Shalev-Shwartz and Zhang (2013); Johnson and Zhang (2013)); as discussed above, obtaining a convergence rate with these algorithms would entail some further generalization analysis.

It would be interesting if one could quantify an approach of restarting the algorithm of Polyak and Juditsky (1992) to obtain guarantees comparable to our streaming algorithm. Such an analysis could be delicate in settings other than linear regression, as their learning rates do not decay too quickly or too slowly (they must decay strictly faster than 1/N1/\sqrt{N}, yet more slowly than 1/N1/N). In contrast, our algorithm takes a constant learning rate to obtain its constant competitive ratio. Furthermore, our algorithm is easily parallelizable and its analysis, we believe, is relatively transparent.

3 Organization

Section 2 summarizes our main results, and Section 3 provides applications to a few standard statistical models. Section 4 provides the main technical claims for our algorithm, Streaming SVRG (Algorithm 1). Section 5 provides finite-sample rates for the ERM, along with proofs for these rates. The Appendix contains various technical lemmas and proofs of our corollaries.

Main results

This section summarizes our main results, as corollaries of more general theorems provided later. After providing our assumptions in Section 2.1, Section 2.2 provides the algorithm, along with performance guarantees. Then Section 2.3 provides upper and lower bounds of the statistical rate of the empirical risk minimizer.

This quantity governs the precise (problem dependent) convergence rate of the ERM. Namely, under certain restrictions on D\mathcal{D}, we have

This limiting rate is well-established in asymptotic statistics (see, for instance, van der Vaart (2000)), whereas Section 2.3 provides upper and lower bounds on this rate for finite sample sizes NN. Analogous to the Cramér-Rao lower bound, under certain restrictions, σ2/N\sigma^{2}/N is the asymptotically efficient rate for stochastic approximation problems (Anbar, 1971; Fabian, 1973; Kushner and Yin, 2003).Though, as with Cramér-Rao, this may be improvable with biased estimators.

The problem dependent rate of σ2/N\sigma^{2}/N sets the benchmark. Statistically, we hope to achieve a leading order dependency of σ2/N\sigma^{2}/N quickly, with rapidly-decaying dependence on the initial error.

We now provide two assumptions under which we analyze the convergence rate of our streaming algorithm, Algorithm 1. Our first assumption is relatively standard. It provides upper and lower quadratic approximations (the lower approximation is on the full objective PP).

The objective PP is twice differentiable.

(Strong convexity) The objective PP is μ\mu-strongly convex, i.e. for all w,wSw,w^{\prime}\in\mathcal{S},

(Smoothness) Each loss ψ\psi is LL-smooth (with probability one), i.e. for all w,wSw,w^{\prime}\in\mathcal{S},

Our results in fact hold under a slightly weaker version of this assumption – see Remark 9. Define:

The quantity κ\kappa can be interpreted as the condition number of the optimization objective (1). The following definition quantifies a global bound on the Hessian.

Let α1\alpha\geq 1 be the smallest value (if it exists) such that for all wSw\in\mathcal{S}, 2P(w)α2P(w)\nabla^{2}P(w_{*})\preceq\alpha\nabla^{2}P(w).

Under Assumption 2.1, we have ακ\alpha\leq\kappa, because LL-smoothness implies 2P(w)LI\nabla^{2}P(w_{*})\preceq LI and μ\mu-strong convexity implies μI2P(w)\mu I\preceq\nabla^{2}P(w). However, α\alpha could be much smaller. For instance, α=1\alpha=1 in linear regression, whereas κ\kappa is the maximum to minimum eigenvalue ratio of the design matrix.

PP is MM-self concordant (or that the weaker condition in Equation (30) holds).

Note that these two assumptions are also standard assumptions in the analysis of the two phases of Newton’s method (aside from the kurtosis condition): the first phase of Newton’s method gets close to the minimizer quickly (based on a global strong convexity assumption) and the second phase obtains quadratic convergence (based on local curvature assumptions on how fast the local Hessian changes, e.g. self-concordance). Moreover, our proof of the streaming algorithm follows a similar structure; we use Assumption 2.1 to analyze the progress of our algorithm when the current point is far away from optimality and Assumption 2.2 when it is close.

2 Algorithm

The remainder of this section shows that, for suitable choices of ksk_{s} and mm, Algorithm 1 achieves desirable convergence rates under the aforementioned assumptions.

Algorithm 1 is not implementable in the standard stochastic first-order oracle model, e.g. that which is assumed in order to obtain the lower bounds in Nemirovski and Yudin (1983) and Agarwal et al. (2012). Streaming SVRG computes the gradient of the randomly drawn ψ\psi at two points, while the oracle model only allows gradient queries at one point.

We have the following algorithmic guarantee under only Assumption 2.1, which is a corollary of Theorem 4.1 (also see the Appendix).

(NsN_{s} is an upper bound on the number of samples drawn up to the end of stage ss).

Let w^Ns\widehat{w}_{N_{s}} be the parameter returned at iteration ss by Algorithm 1. For Nsbp2+6pκN_{s}\geq b^{p^{2}+6p}\kappa (and so s>p2+6ps>p^{2}+6p), we have

When α=1\alpha=1 (such as for least squares regression), the above bound achieves the ERM rate of σ2/N\sigma^{2}/N (up to a constant factor, which can be driven to one, as discussed later). Furthermore, under self-concordance, we can drive the competitive ratio (3) down from α\alpha to arbitrarily near to 11. The following is a corollary of Theorem 4.2 (also see the Appendix):

Note that Algorithm 1 is simple to implement and requires little space. In each iteration, the space usage is linear in the size of a single sample (along with needing to count to ksk_{s} and mm). Furthermore, the algorithm is easily parallelizable once we have run enough stages. In both Theorem 4.1 and Theorem 4.2 as ss increases ksk_{s} grows geometrically, whereas mm remains constant. Hence, the majority of the computation time is spent averaging the gradient, i.e. (9), which is easily parallelizable.

Note that the constants in the parameter settings for the Algorithm have not been optimized. Furthermore, we have not attempted to fully optimize the time it takes the algorithm to enter the second phase (in which self-concordance is relevant), and we conjecture that the algorithm in fact enjoys even better dependencies. Our emphasis is on an analysis that is flexible in that it allows for a variety of assumptions in driving the competitive ratio to 11 (as is done in the case of logistic regression in Section 3, where we use a slight variant of self-concordance).

Before providing statistical rates for the ERM, let us remark that the above achieves super-polynomial convergence rates and that the competitive ratio can be driven to 11 (recall that σ2/N\sigma^{2}/N is the rate of the ERM).

By choosing bb sufficiently large, the competitive ratio (3) can be made close to 11 (on every problem). Furthermore, we can ensure this constant goes to 11 by altering the parameter choices adaptively: let ks=4s(s!)k0k_{s}=4^{s}(s!)k_{0}, and let ηs=η/2s\eta_{s}=\eta/2^{s}, ms=m4sm_{s}=m\cdot 4^{s}. Intuitively, kk grows so fast that limsks/Ns=1\lim_{s\to\infty}k_{s}/N_{s}=1; ηs\eta_{s} and msm_{s} are also changing fast enough so the initial error vanishes very quickly.

3 Competing with the ERM

Now we provide a finite-sample characterization of the rate of convergence of the ERM under regularity conditions. This essentially gives the numerator of (3), allowing us to compare the rate of the ERM against the rate achieved by Streaming SVRG. We provide the more general result in Theorem 5.1; this section focuses on a corollary.

In the following, we constrain the domain S\mathcal{S}; so the ERM, as defined in (2), is taken over this restricted set. Further discussion appears in Theorem 5.1 and the comments thereafter.

Suppose ψ1,ψ2,,ψN\psi_{1},\psi_{2},\ldots,\psi_{N} are an independently drawn sample from D\mathcal{D}. Assume the following regularity conditions hold; see Theorem 5.1 for weaker conditions.

ww_{*} is an interior point of S\mathcal{S}, and 2P(w)\nabla^{2}P(w_{*}) exists and is positive definite.

(Smoothness) Assume the first, second, and third derivatives of ψ\psi exist and are uniformly bounded on S\mathcal{S}.

Then, for the ERM w^NERM\widehat{w}^{\textrm{ERM}}_{N} (as defined in (2)), we have

In particular, the following lower and upper bounds hold. With problem dependent constants C0C_{0} and C1C_{1} (polynomial in the relevant quantities, as specified in Theorem 5.1), we have for all p2p\geq 2, if NN satisfies plogdNNC0\frac{p\log dN}{N}\leq C_{0}, then

Applications: one pass learning and generalization

This section provides applications to a few standard statistical models, in part providing a benchmark for comparison on concrete problems. For the widely studied problem of least-squares regression, we also instantiate upper and lower bounds for the ERM. The applications in this section can be extended to include generalized linear models, some MM-estimation problems, and other loss functions (e.g. the Huber loss).

Using that α=1\alpha=1, the following corollary illustrates that Algorithm 1 achieves the rate of the ERM,

Suppose that X2L\|X\|^{2}\leq L. Define μ=λ+λmin(Σ)\mu=\lambda+\lambda_{\min}(\Sigma). Using the parameter settings of Theorem 2.1 and supposing that Nbp2+6pκN\geq b^{p^{2}+6p}\kappa,

If the sample size is less than κ\kappa and λ=0\lambda=0, there exist distributions on XX in which the ERM is not unique (as the sample matrix 1NXiXi\frac{1}{N}\sum X_{i}X_{i}^{\top} will not be invertible, with reasonable probability, on these distributions by construction).

Algorithm 1 is competitive with the performance of the ERM when the sample size NN is slightly larger than a constant times κ\kappa. In particular, as the sample NN size grows larger than κ\kappa, then the initial error is decreased at an arbitrary polynomial rate in N/κN/\kappa.

In other words, Corollary 3.1 recovers the classical rate in this case.

where the last equality exposes the effects of the approximation error:

In the regularized setting (a.k.a. ridge regression) – also not necessarily well-specified – we have

1.2 Statistical upper and lower bounds

For comparison, the following corollary (of Theorem 5.1) provides lower and upper bounds for the statistical rate of the ERM.

where Z=ψ(w)(2P(w))1=(YwX)X+λw(Σ+λI)1Z=\left\|\nabla\psi(w_{*})\right\|_{(\nabla^{2}P(w_{*}))^{-1}}=\|(Y-w_{*}^{\top}X)X+\lambda w_{*}\|_{(\Sigma+\lambda I)^{-1}}.

(Unregularized case) Suppose λ=0\lambda=0. Assume that we constrain the ERM to lie in some compact set S\mathcal{S} (and supposing wSw_{*}\in\mathcal{S}). Then for all p>0p>0, if plogNNcκ~\frac{p\log N}{N}\leq\frac{c}{\widetilde{\kappa}}, we have

(Regularized case) Suppose λ>0\lambda>0. Then for all p>0p>0, if plogNNcκ~\frac{p\log N}{N}\leq\frac{c}{\widetilde{\kappa}}, we have

(this last equation follows from a modification of the argument in Equation (5)).

Interestingly, for the upper bound (when λ=0\lambda=0), we see no way to avoid constraining the ERM to lie in some compact set; this allows us to bound the loss PP in the event of some extremely low probability failure (see Theorem 5.1). The ERM upper bound has a term comparable to the initial error of our algorithm. In contrast, the lower bound is for the usual unconstrained least-squares estimator.

2 Logistic regression

Under this definition of σ2\sigma^{2}, by Theorem 2.2 together with the following defined quantities, the single-pass estimator of Algorithm 1 achieves a rate competitive with that of the ERM:

The corollary uses Lemma 10, a straightforward lemma to handle self-concordance for logistic regression, which is included for completeness. See Bach (2010) for techniques for analyzing the self-concordance of logistic regression.

Analysis of Streaming SVRG

Here we analyze Algorithm 1. Section 4.1 provides useful common lemmas. Section 4.2 uses these lemmas to characterize the behavior of the Algorithm 1. These are then used to prove convergence in terms of both α\alpha-bounded Hessians (Section 4.3) and MM-self-concordance (Section 4.4).

Our first lemma is a consequence of smoothness. It is the same observation made in Johnson and Zhang (2013).

If ψ\psi is smooth (with probability one), then

Instead of the smoothness Assumption 2.1 in Equation 7, it suffices to directly assume (38) and still have all results hold as presented. In doing so, we incur an additional factor of 22 as in this case we have 2P(w)2LI\nabla^{2}P(w_{*})\preceq 2LI by Lemma 9. For further explanation see Appendix A.

Since ψ\psi is LL-smooth (with probability one) gg is LL-smooth (with probability one) and it follows that:

Our second lemma bounds the variance of ψD\psi\sim\mathcal{D} in the (2P(w))1(\nabla^{2}P(w_{*}))^{-1} norm.

Combining and using the definition of κ\kappa yields the result. ∎

2 Progress of the algorithm

The following bounds the progress of one step of Algorithm 1.

For all tt let αt\alpha_{t} be such that

(note that such an αt\alpha_{t} exists by Assumption 2.1, as αtκ\alpha_{t}\leq\kappa).

and recalling the definition of wt+1w_{t+1} and Δ\Delta we have

Using Cauchy-Schwarz and that 2aba2+b22a\cdot b\leq a^{2}+b^{2} for scalar aa and bb, we have

Combining (26), (27), (28), and (29) yields

and multiplying both sides by LL yields the result. ∎

Finally we bound the progress of one stage of Algorithm 1.

Taking an unconditional expectation with respect to {ψt}\{\psi_{t}\} and summing (25) from Lemma 3 from t=m1t=m-1 down to t=0t=0 yields

3 With α𝛼\alpha-bounded Hessians

Here we prove the progress made by Algorithm 1 in a single stage under only Assumption 2.1.

Under Assumption 2.1, for Algorithm 1, we have for all ss:

By definition of α\alpha, we have αtα\alpha_{t}\leq\alpha for all tt in Lemma 4 and therefore

where we have also used Lemma 2 and Jensen’s inequality. ∎

4 With M𝑀M-self-concordance

Our main result in the self-concordant case follows.

Suppose Assumption 2.1 and 2.2 hold. Under Algorithm 1, for η18\eta\leq\frac{1}{8}, k10Ck\geq 10C, and all ss, we have

The proof utilizes the following lemmas. First, we show how self concordance implies that there is a better effective strong convexity parameter in 2P(w)\nabla^{2}P(w_{*}) norm when we are close to ww_{*}.

First we use the property of self-concordant functions: if ff is MM-self-concordant, then

Apply this property to the function PP restricted to the line between wtw_{t} and ww_{*}, where the point is at wtw_{t} and tt is wtw2P(wt)\|w_{t}-w_{*}\|_{\nabla^{2}P(w_{t})}, then we have

In order to convert 2P(wt)\nabla^{2}P(w_{t}) norm to 2P(w)\nabla^{2}P(w_{*}) norm, we use another property of self-concordant function:

Again we restrict to the line between ww_{*} and wtw_{t}, where point corresponds to ww_{*}, and tt is wtw\|w_{t}-w_{*}\|, and we get

Now consider the function let h(x)=xln(1+x)h(x)=x-\ln(1+x). The function has the following two properties: When x0x\geq 0, h(x)h(x) is monotone and h(x)x2/2(1+x)h(x)\geq x^{2}/2(1+x). This claim can be verified directly by taking derivatives.

Essentially, this means when wtw2P(w)2\|w_{t}-w_{*}\|_{\nabla^{2}P(w_{*})}^{2} is small the effective strong convexity in 2P(w)\left\|{\cdot}\right\|_{\nabla^{2}P(w_{*})} is small. In particular,

Thus we need to bound the residual error wtw2P(w)2\|w_{t}-w_{*}\|_{\nabla^{2}P(w_{*})}^{2}.

Suppose the same assumptions in Lemma 3 hold and that η18\eta\leq\frac{1}{8}. Then for all tt, we have

Since αtκ\alpha_{t}\leq\kappa and by Lemma 3 we have

Using that by strong convexity P(wt)P(w)μ2wtw2P(w_{t})-P(w_{*})\geq\frac{\mu}{2}\|w_{t}-w_{*}\|^{2} we have

Solving for the maximum value of Lwtw22L\|w_{t}-w_{*}\|_{2}^{2} in this recurrence we have, for all tt,

Using that 2P(w)LI\nabla^{2}P(w_{*})\preceq LI yields the result. ∎

Finally, we end up needing to bound higher moments of the error from Δ\Delta. For this we provide two technical lemmas.

Recalling the definition of σ2\sigma^{2} yields the result. ∎

Using these lemmas, we are ready to provide the proof.

of Theorem 4.2. We analyze stage ss of the algorithm. Let us define the variance term (A) as

Our main goal in the proof is to bound (A)(A). First, for all α,x,y\alpha,x,y and positive semidefinite HH we have

We use that min{a,b+c}b+min{a,c}\min\{a,b+c\}\leq b+\min\{a,c\} (for positive aa, bb, and cc) by Lemma 1, the definition of σ2\sigma^{2}, as well as (32)

where (D) is defined below. Using Lemma 6 and the independence of the different types of ψ\psi

where (E) is defined below. Using kurtosis,

Using that x+yx+y\sqrt{|x|+|y|}\leq\sqrt{|x|}+\sqrt{|y|} this implies

Using this bound in Lemma 4 then yields the result. ∎

Empirical risk minimization (M𝑀M-estimation) for smooth functions

We now provide finite-sample rates for the ERM. We take the domain S\mathcal{S} to be compact in (1) (see Remark 11). Throughout this section, define:

for a matrix AA (of appropriate dimensions).

Suppose ψ1,ψ2,\psi_{1},\psi_{2},\ldots are an independently drawn sample from D\mathcal{D}. Assume:

(Convexity of ψ\psi) Assume that ψ\psi is convex (with probability one).

(Smoothness of ψ\psi) Assume that ψ\psi is smooth in the following sense: the first, second, and third derivatives exist at all interior points of S\mathcal{S} (with probability one).

S\mathcal{S} is compact (so P(w)P(w) is bounded on S\mathcal{S}).

ww_{*} is an interior point of S\mathcal{S}.

2P(w)\nabla^{2}P(w_{*}) is positive definite (and, thus, is invertible).

There exists a neighborhood BB of ww_{*} and a constant L3L_{3}, such that (with probability one) 2ψ(w)\nabla^{2}\psi(w) is L3L_{3}-Lipschitz, namely 2ψ(w)2ψ(w)L3ww2P(w)\|\nabla^{2}\psi(w)-\nabla^{2}\psi(w^{\prime})\|_{*}\leq L_{3}\|w-w^{\prime}\|_{\nabla^{2}P(w_{*})}, for w,ww,w^{\prime} in this neighborhood.

(Concentration at ww_{*}) Suppose ψ(w)(2P(w))1L1\|\nabla\psi(w_{*})\|_{(\nabla^{2}P(w_{*}))^{-1}}\leq L_{1} and 2ψ(w)L2\|\nabla^{2}\psi(w_{*})\|_{*}\leq L_{2} hold with probability one. Suppose the dimension dd is finite (or, in the infinite dimensional setting, the intrinsic dimension is bounded, as in Remark 10).

In particular, the following lower and upper bounds hold. Define

where cc is an appropriately chosen universal constant. Also, let cc^{\prime} be another appropriately chosen universal constant. We have that for all p2p\geq 2, if NN is large enough so that plogdNNcmin{1L2,1L1L3,1diameter(B)L1}\sqrt{\frac{p\log dN}{N}}\leq c^{\prime}\min\left\{\frac{1}{\sqrt{L_{2}}},\frac{1}{L_{1}L_{3}},\frac{1\cdot\textrm{diameter}(B)}{L_{1}}\right\}, then

The lower bound holds even if S\mathcal{S} is not compact. For the upper bound, the proof technique uses the compactness of S\mathcal{S} to bound the contribution to the expected regret due to a (low probability) failure event that the ERM may not lie in the ball BB (or even the interior of S\mathcal{S}). If PP is regularized then this last term can be improved, as S\mathcal{S} need not be compact.

The basic idea of the proof follows that of Hsu et al. (2014), along with various arguments based on Taylor’s theorem.

Throughout the proof use w^N\widehat{w}_{N} to denote the ERM w^NERM\widehat{w}^{\textrm{ERM}}_{N}. Define:

which is convex as it is the average of convex functions.

Throughout the proof we take t=cplog(dN)t=cp\log(dN) in the tail probability bounds in Appendix D (for some universal constant cc). This implies a probability of error less than 1Np\frac{1}{N^{p}}.

For all wBw\in B, the empirical function 2P^N(w)\nabla^{2}\widehat{P}_{N}(w) is L3L_{3}-Lipschitz. In Lemma 12 in Appendix D, we may take v2L2v\leq 2\sqrt{L_{2}} (as all eigenvalues of of 2P(w)\nabla^{2}P(w_{*}) are one, under the choice of norm). Using Lemma 12 in Appendix D, for wBw\in B, we have:

for some (other) universal constant cc. Now we seek to ensure that P^N(w)\widehat{P}_{N}(w) is a constant spectral approximation to 2P(w)\nabla^{2}P(w_{*}). By choosing a sufficiently smaller ball B1B_{1} (choose B1B_{1} to have radius of min{1/(10L3),diameter(B)}\min\{1/(10L_{3}),\textrm{diameter}(B)\}), the first term can be made small for wB1w\in B_{1}. Also, for sufficiently large NN, the second term can be made arbitrarily small (smaller than 1/101/10), which occurs if plogdNNcL2\sqrt{\frac{p\log dN}{N}}\leq\frac{c^{\prime}}{\sqrt{L_{2}}}. Hence, for such large enough NN, we have for wB1w\in B_{1}:

Suppose NN is at least this large from now on.

Hence, for all wB1w\in B_{1} and if Equation 34 holds,

Observe that if the right hand side is positive for some wB1w\in B_{1}, then ww is not a local minimum. Also, since P^N(w)0\|\nabla\widehat{P}_{N}(w_{*})\|\xrightarrow{}0, for a sufficiently small value of P^N(w)\|\nabla\widehat{P}_{N}(w_{*})\|, all points on the boundary of B1B_{1} will have values greater than that of ww_{*}. Hence, we must have a local minimum of P^N(w)\widehat{P}_{N}(w) that is strictly inside B1B_{1} (for NN large enough). We can ensure this local minimum condition is achieved by choosing an NN large enough so that plogNNcmin{1L1L3,diameter(B)L1}\sqrt{\frac{p\log N}{N}}\leq c^{\prime}\min\left\{\frac{1}{L_{1}L_{3}},\frac{\textrm{diameter}(B)}{L_{1}}\right\}, using Lemma 11 (and our bound on the diameter of B1B_{1}). By convexity, we have that this is the global minimum, w^N\widehat{w}_{N}, and so w^NB1\widehat{w}_{N}\in B_{1} for NN large enough. Assume now that NN is this large from here on.

For the ERM, 0=P^N(w^N)0=\nabla\widehat{P}_{N}(\widehat{w}_{N}). Again, by Taylor’s theorem if w^N\widehat{w}_{N} is an interior point, we have:

where the invertibility is guaranteed by Equation 34 and the positive definiteness of P(w)\nabla P(w_{*}). Using Lemma 11 in Appendix D,

Here the universal constant cc is chosen so that:

(using standard matrix perturbation results).

where we have used the ERM expression in Equation 35.

Let I(E)I(\mathcal{E}) be the indicator that the desired previous events hold, which we can ensure with probability greater than 1cNp1-\frac{c}{N^{p}}. We have:

(for a universal constant cc^{\prime}). Now define the random variable Z=P^N(w)(2P(w))1Z=\left\|\nabla\widehat{P}_{N}(w_{*})\right\|_{(\nabla^{2}P(w_{*}))^{-1}}. With a failure event probability of less than 12Np\frac{1}{2N^{p}}, for any z0z_{0}, we have:

since the probability of not E\textrm{not }\mathcal{E} is less than 1Np\frac{1}{N^{p}}.

For an upper bound of the first term, observe that:

This completes the proof (using a different universal constant cc^{\prime} in εN\varepsilon_{N}). ∎

Acknowledgments

The authors would like to thank Jonathan Kelner, Yin Tat Lee, and Boaz Barak for helpful discussion. Part of this work was done while RF and AS were at Microsoft Research, New England, and another part done while AS was visiting the Simons Institute for the Theory of Computing, UC Berkeley. This work was partially supported by NSF awards 0843915 and 1111109, NSF Graduate Research Fellowship (grant no. 1122374).

References

Appendix A A weaker smoothness assumption

Instead, of the smoothness Assumption 2.1 in Equation 7, we could instead directly assume that:

Our proofs only use this condition, as well as an upper bound on the Hessian of PP at ww_{*}. However, we can show that this weaker assumption implies such an upper bound as follows.

If (38) holds then 2P(w)2LI\nabla^{2}P(w_{*})\preceq 2LI.

First we note that for all ww, by (38), the convexity of PP, and Jensen’s inequality, we have:

Combining (39) and (40) and using Cauchy-Schwarz yields that for all ww,

Consequently, by definition and the fact that PP is twice differentiable at ww_{*} we have

Applying Cauchy-Schwarz and (41) yields that

Since Δ\Delta was arbitrary we have the desired result. ∎

Appendix B Proofs of Corollaries 2.1 and 2.2

of Corollary 2.1. From Theorem 4.1, we have:

We do this using Theorem 4.1 and some explicit calculations as follows. We shall make use of that for x1x\leq 1, 1x1x\sqrt{1-x}\geq 1-x, and for 0x120\leq x\leq\frac{1}{2}, 11x1+x+2x21+2x\frac{1}{1-x}\leq 1+x+2x^{2}\leq 1+2x. We have:

This completes the claim, by substitution into Theorem 4.1.

We do this by induction. The claim is true for s=1s=1. For the inductive argument,

We now relate ksk_{s} and b(p+1)s\sqrt{b}^{(p+1)s} to the sample size NsN_{s}. First, observe that mm is bounded as:

where we have used that, for s>p2+6ps>p^{2}+6p, bs(bp+5)p1\frac{b^{s}}{(b^{p+5})^{p}}\geq 1. Hence,

The proof is completed substituting (43), (44) in (42). ∎

of Corollary 2.2. Under the choice of parameters and using Theorem 4.1, we have

On the other hand, suppose t0t_{0} is the first time that kt0(Mσ+1)2400κ2b2p+3=(Mσ+1)2k0k_{t_{0}}\geq(M\sigma+1)^{2}400\kappa^{2}b^{2p+3}=(M\sigma+1)^{2}k_{0}. When st0s\geq t_{0}, we can use Theorem 4.2, and under the choice of parameters we have:

Here E0E_{0} is the initial error. When s=1s=1 the statement is true. When st0s\leq t_{0} we use the first recursion (from Theorem 4.1), clearly

Here the second step uses induction hypothesis, and the third step uses the fact that ks1/ks2=bk_{s-1}/k_{s-2}=b and (1+2/b)/bp+11+2/b(1+2/b)/\sqrt{b}^{p}+1\leq 1+2/b.

When s>t0s>t_{0} we can use the second recursion (from Theorem 4.2), now we have

Again, the second step uses induction hypothesis, and final step uses ks1/ks2=bk_{s-1}/k_{s-2}=b and (1+2/b)/bp+1+1/b1+2/b(1+2/b)/\sqrt{b}^{p}+1+1/b\leq 1+2/b. This concludes the induction.

Finally, we need to relate the values ksk_{s}, b(p+1)(s+1)\sqrt{b}^{(p+1)(s+1)}, b(p+1)(st0)\sqrt{b}^{(p+1)(s-t_{0})} with NsN_{s}.

First, it is clear that ksbmk_{s}\geq bm for all ss, therefore

Therefore we can substitute 1/ks11/\sqrt{k_{s-1}} with 1+3/b/Ns\sqrt{1+3/b}/\sqrt{N_{s}}. Also, we know Ns/2k0bs1N_{s}/2k_{0}\leq b^{s-1}, therefore b(p+1)sb(p+1)(s1)(N/2k0)(p+1)/2\sqrt{b}^{-(p+1)s}\leq\sqrt{b}^{-(p+1)(s-1)}\leq(N/2k_{0})^{-(p+1)/2}.

Finally, since kk increase by a factor of bb, we know kt0k_{t_{0}} is at most b(Mσ+1)2k0b(M\sigma+1)^{2}k_{0}. Therefore Ns2(Mσ+1)2k0bNs/2kt0ks/kt0=bst0\frac{N_{s}}{2(M\sigma+1)^{2}k_{0}}\leq bN_{s}/2k_{t_{0}}\leq k_{s}/k_{t_{0}}=b^{s-t_{0}}, which means bp(st0)(Ns2(Mσ+1)2k0)p/2\sqrt{b}^{-p(s-t_{0})}\leq\left(\frac{N_{s}}{2(M\sigma+1)^{2}k_{0}}\right)^{p/2}. ∎

Appendix C Self-concordance for logistic regression

The following straightforward lemma, to handle self-concordance for logistic regression, is included for completeness (see Bach (2010) for a more detailed treatment for analyzing the self-concordance of logistic regression).

where z1z_{1} is between ww_{*} and wtw_{t}. Again, by Taylor’s theorem,

where z2z_{2} is between ww_{*} and z1z_{1}.

Using the definition of α\alpha, shows that Equation 45 holds.

Now for Z>0Z>0, consider the quantity max{1α,1Z}\max\{\frac{1}{\alpha},1-Z\}, and observe that the 1Z1-Z term achieves the max when 1Z1α1-Z\geq\frac{1}{\alpha} or equivalently when 1+ααZ0-1+\alpha-\alpha Z\geq 0. Hence,

Appendix D Probability tail inequalities

The following probability tail inequalities are used in our analysis.

The first tail inequality is for sums of bounded random vectors; it is a standard application of Bernstein’s inequality.

Let x1,x2,,xnx_{1},x_{2},\dotsc,x_{n} be independent random vectors such that

for all i=1,2,,ni=1,2,\dotsc,n, almost surely. Let s:=x1+x2++xns:=x_{1}+x_{2}+\dotsb+x_{n}. For all t>0t>0,

The next tail inequality concerns the spectral accuracy of an empirical second moment matrix, where we do not assume the dimension is finite.

If t2.6t\geq 2.6, then t(ett1)1et/2t(e^{t}-t-1)^{-1}\leq e^{-t/2}.