Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization

Roy Frostig, Rong Ge, Sham M. Kakade, Aaron Sidford

Introduction

This problem underlies supervised learning (e.g. the training of logistic regressors when ϕi(z)=log(1+ezbi)\phi_{i}(z)=\log(1+e^{-zb_{i}}), or their regularized form when ϕi(z)=log(1+ezbi)+γ2nx22\phi_{i}(z)=\log(1+e^{-zb_{i}})+\tfrac{\gamma}{2n}\|{x}\|_{2}^{2} for a scalar γ>0\gamma>0) and captures the widely-studied problem of linear least-squares regression when ϕi(z)=12(zbi)2\phi_{i}(z)=\tfrac{1}{2}(z-b_{i})^{2}.

Over the past five years, problems such as (1) have received increased attention, with a recent burst of activity in the design of fast randomized algorithms. Iterative methods that randomly sample the ϕi\phi_{i} have been shown to outperform standard first-order methods under mild assumptions (Bottou and Bousquet, 2008; Johnson and Zhang, 2013; Xiao and Zhang, 2014; Defazio et al., 2014; Shalev-Shwartz and Zhang, 2014).

Despite the breadth of these recent results, their running time guarantees when solving the ERM problem (1) are sub-optimal in terms of their dependence on a natural notion of the problem’s condition number (See Section 1.1). This dependence can, however, significantly impact their guarantees on running time. High-dimensional problems encountered in practice are often poorly conditioned. In large-scale machine learning applications, the condition number of the ERM problem (1) captures notions of data complexity arising from variable correlation in high dimensions and is hence prone to be very large.

More specifically, among the recent randomized algorithms, each one either:

Solves the ERM problem (1), under an assumption of strong convexity, with convergence that depends linearly on the problem’s condition number (Johnson and Zhang, 2013; Defazio et al., 2014).

Solves only an explicitly regularized ERM problem, minx{F(x)+λr(x)}\min_{x}\{{F(x)+\lambda r(x)}\} where the regularizer rr is a known 1-strongly convex function and λ\lambda must be strictly positive, even when FF is itself strongly convex. One such result is due to Shalev-Shwartz and Zhang (2014) and is the first to achieve acceleration for this problem, i.e. dependence only on the square root of the regularized problem’s condition number, which scales inversely with λ\lambda. Hence, taking small λ\lambda to solve the ERM problem (where λ=0\lambda=0 in effect) is not a viable option.

In this paper we show how to bridge this gap via black-box reductions. Namely, we develop algorithms to solve the ERM problem (1) – under a standard assumption of strong convexity – through repeated, approximate minimizations of the regularized ERM problem minx{F(x)+λr(x)}\min_{x}\{{F(x)+\lambda r(x)}\} for fairly large λ\lambda. Instantiating our framework with known randomized algorithms that solve the regularized ERM problem, we achieve accelerated running time guarantees for solving the original ERM problem.

The key to our reductions are approximate variants of the classical proximal point algorithm (PPA) (Rockafellar, 1976; Parikh and Boyd, 2014). We show how both PPA and the inner minimization procedure can then be accelerated and our analysis gives precise approximation requirements for either option. Furthermore, we show further practical improvements when the inner minimizer operates by a dual ascent method. In total, this provides at least three different algorithms for achieving an improved accelerated running time for solving the ERM problem (1) under the standard assumption of strongly convex FF and smooth ϕi\phi_{i}. (Table 1 summarizes our improvements in comparison to existing minimization procedures.)

Perhaps the strongest and most general theoretical reduction we provide in this paper is encompassed by the following theorem which we prove in Section 3.

in time Tc{\mathcal{T}}_{c}. Then given any x0x_{0}, c>0c>0, λ2μ\lambda\geq 2\mu, we can compute x1x_{1} such that

in time O(T4(2λ+μμ)3/2λ/μlogc)O\left(\mathcal{T}_{4(\frac{2\lambda+\mu}{\mu})^{3/2}}\sqrt{\lceil\lambda/\mu\rceil}\log c\right).

This theorem essentially states that we can use a linearly convergent algorithm for minimizing f(x)+λxx022f(x)+\lambda\|{x-x_{0}}\|_{2}^{2} in order to minimize ff, while incurring a multiplicative overhead of only O(λ/μ\polylog(λ/μ))O(\sqrt{\lceil\lambda/\mu\rceil}\polylog(\lambda/\mu)). Applying this theorem to previous state-of-the-art algorithms improves both the running time for solving (1), as well as the following more general ERM problem:

Problem (2) is fundamental in the theory of convex optimization and covers ERM problems for multiclass and structured prediction.

There are a variety of additional extensions to the ERM problem to which some of our analysis easily applies. For instance, we could work in more general normed spaces, allow non-uniform smoothness of the ϕ\phi, add an explicit regularizer, etc. However, to simplify exposition and comparison to related work, we focus on (1) and make clear the extensions to (2) in Section 3. These cases capture the core of the arguments presented and illustrate the generality of this approach.

Several of the algorithmic tools and analysis techniques in this paper are similar in principle to (and sometimes appear indirectly in) work scattered throughout the machine learning and optimization literature – from classical treatments of error-tolerant PPA (Rockafellar, 1976; Guler, 1992) to the effective proximal term used by Accelerated Proximal SDCA Shalev-Shwartz and Zhang (2014) in enabling its acceleration.

By analyzing these as separate tools, and by bookkeeping the error requirements that they impose, we are able to assemble them into algorithms with improved guarantees. We believe that the presentation of Accelerated APPA (Algorithm 2) arising from this view simplifies, and clarifies in terms of broader convex optimization theory, the “outer loop” steps employed by Accelerated Proximal SDCA. More generally, we hope that disentangling the relevant algorithmic components into this general reduction framework will lead to further applications both in theory and in practice.

We consider the ERM problem (1) in the following common setting:

Although many algorithms are designed for special cases of the ERM objective FF where there is some known, exploitable structure to the problem, our aim is to study the most general case subject to Assumption 1.2. To standardize the comparison among algorithms, we consider the following generic model of interaction with FF:

We refer to these operations, as well as to the evaluation of ϕi(aiTx)\phi_{i}(a_{i}^{\mathsf{T}}x), as single accesses to ϕi\phi_{i}, and assume that these operations can be performed in O(d)O(d) time.

Regularization and duality

In such context, we let xs,λopt=defargminxfs,λ(x)x^{\text{opt}}_{s,\lambda}\stackrel{{\scriptstyle\rm def}}{{=}}\operatorname*{argmin}_{x}f_{s,\lambda}(x) and we call

where G(y)=defi=1nϕi(yi)G(y)\stackrel{{\scriptstyle\rm def}}{{=}}\sum_{i=1}^{n}\phi_{i}^{*}(y_{i}). Similar to the above, we let ys,λopt=defargminygs,λ(y)y^{\text{opt}}_{s,\lambda}\stackrel{{\scriptstyle\rm def}}{{=}}\operatorname*{argmin}_{y}g_{s,\lambda}(y).

To make corresponding primal progress, dual-based algorithms make use of the dual-to-primal mapping, given by

and the primal-to-dual mapping, given entrywise by

for i=1,,ni=1,\dots,n. (See Appendix B for a derivation of these facts and further properties of the dual.)

2 Running times and related work

In Table 1 we compare our results with the running time of both classical and recent algorithms for solving the ERM problem (1) and linear least-squares regression. Here we briefly explain these running times and related work.

In the context of the ERM problem, GD refers to canonical gradient descent on FF, Accel. GD is Nesterov’s accelerated gradient decent (Nesterov, 1983, 2004), SVRG is the stochastic variance-reduced gradient of Johnson and Zhang (2013), SAG is the stochastic average gradient of Roux et al. (2012) and Defazio et al. (2014), SDCA is the stochastic dual coordinate ascent of Shalev-Shwartz and Zhang (2013), AP-SDCA is the Accelerated Proximal SDCA of Shalev-Shwartz and Zhang (2014) and APCG is the accelerated coordinate algorithm of Lin et al. (2014). The latter three algorithms are more restrictive in that they only solve the explicitly regularized problem F+λrF+\lambda r, even if FF is itself strongly convex (such algorithms run in time inversely proportional to λ\lambda).

The running times of the algorithms are presented based on the setting considered in this paper, i.e. under Assumptions 1.2 and 1.3. Many of the algorithms can be applied in more general settings (e.g. even if the function FF is not strongly convex) and have different convergence guarantees in those cases. The running times are characterized by four parameters: dd is the data dimension, nn is the number of samples, κ=LR2/μ\kappa=\lceil LR^{2}/\mu\rceil is the condition number (for F+λrF+\lambda r minimizers, the condition number κλ=LR2/λ\kappa_{\lambda}=\lceil LR^{2}/\lambda\rceil is used) and ϵ0/ϵ\epsilon_{0}/\epsilon is the ratio between the initial and desired accuracy. Running times are stated per O~\widetilde{O}-notation; factors that depend polylogarithmically on nn, κ\kappa, and κλ\kappa_{\lambda} are ignored.

Linear least-squares regression

For the linear least-squares regression problem, there is greater variety in the algorithms that apply. For comparison, Table 1 includes Moore-Penrose pseudoinversion – computed via naive matrix multiplication and inversion routines, as well as by their asymptotically fastest counterparts – in order to compute a closed-form solution via the standard normal equations. The table also lists algorithms based on the randomized Kaczmarz method (Strohmer and Vershynin, 2009; Needell et al., 2014) and their accelerated variant (Lee and Sidford, 2013), as well as algorithms based on subspace embedding (OSNAP) or row sampling (Nelson and Nguyen, 2013; Li et al., 2013; Cohen et al., 2015). Some Kaczmarz-based methods only apply to the more restrictive problem of solving a consistent system (finding xx satisfying Ax=bAx=b) rather than minimize the squared loss Axb22\|Ax-b\|^{2}_{2}. The running times depend on the same four parameters n,d,κ,ϵ0/ϵn,d,\kappa,\epsilon_{0}/\epsilon as before, except for computing the closed-form pseudoinverse, which for simplicity we consider “exact,” independent of initial and target errors ϵ0/ϵ\epsilon_{0}/\epsilon.

Approximate proximal point

The key to our improved running times is a suite of approximate proximal point algorithms that we propose and analyze. We remark that notions of error-tolerance in the typical proximal point algorithm – for both its plain and accelerated variants – have been defined and studied in prior work (Rockafellar, 1976; Guler, 1992). However, these mainly consider the cumulative absolute error of iterates produced by inner minimizers, assuming that such a sequence is somehow produced. Since essentially any procedure of interest begins at some initial point – and has runtime that depends on the relative error ratio between its start and end – such a view does not yield fully concrete algorithms, nor does it yield end-to-end runtime upper bounds such as those presented in this paper.

Additional related work

There is an immense body of literature on proximal point methods and alternating direction method of multipliers (ADMM) that are relevant to the approach in this paper; see Boyd et al. (2011); Parikh and Boyd (2014) for modern surveys. We also note that the independent work of Lin et al. (2015) contains results similar to some of those in this paper.

3 Main results

All formal results in this paper are obtained through a framework that we develop for iteratively applying and accelerating various minimization algorithms. When instantiated with recently-developed fast minimizers we obtain, under Assumptions 1.2 and 1.3, algorithms guaranteed to solve the ERM problem in time O~(ndκlog(1/ϵ))\widetilde{O}(nd\sqrt{\kappa}\log(1/\epsilon)).

Our framework stems from a critical insight of the classical proximal point algorithm (PPA) or proximal iteration: to minimize FF (or more generally, any convex function) it suffices to iteratively minimize

and converges to the minimizer of FF. The minimization in the update is known as the proximal operator (Parikh and Boyd, 2014), and we refer to it in the sequel as the inner minimization problem.

In this paper we provide three distinct approximate proximal point algorithms, i.e. algorithms that do not require full inner minimization. Each enables the use of existing fast algorithm as its inner minimizer, in turn yielding several ways to obtain our improved ERM running time:

In Section 2 we develop a basic approximate proximal point algorithm (APPA). The algorithm is essentially PPA with a relaxed requirement of inner minimization by only a fixed multiplicative constant in each iteration. Instantiating this algorithm with an accelerated, regularized ERM solver – such as APCG (Lin et al., 2014) – as its inner minimizer yields the improved accelerated running time for the ERM problem (1).

In Section 3 we develop Accelerated APPA. Instantiating this algorithm with SVRG (Johnson and Zhang, 2013) as its inner minimizer yields the improved accelerated running time for both the ERM problem (1) as well as the general ERM problem (2).

In Section 4 we develop Dual APPA: an algorithm whose approximate inner minimizers operate on the dual fs,λf_{s,\lambda}, with warm starts between iterations. Dual APPA enables several inner minimizers that are a priori incompatible with APPA. Instantiating this algorithm with an accelerate, regularized ERM solver – such as APCG (Lin et al., 2014) – as its inner minimizer yields the improved accelerated running time for the ERM problem (1).

Each of the three algorithms exhibits a slight advantage over the others in different regimes. APPA has by far the simplest and most straightforward analysis, and applies directly to any μ\mu-strongly convex function FF (not only FF given by (1)). Accelerated APPA is more complicated, but in many regimes is a more efficient reduction than APPA; it too applies to any μ\mu-strongly convex function FF and in turn proves Theorem 1.1.

Our third algorithm, Dual APPA, is the least general in terms of the assumptions on which it relies. It is the only reduction we develop that requires the ERM structure of FF. However, this algorithm is a natural choice in conjunction with inner minimizers that operate on a popular dual objective. In Section 5 we demonstrate moreover that this algorithm has properties that make it desirable in practice.

4 Paper organization

The remainder of this paper is organized as follows. In Section 2, Section 3, and Section 4 we state and analyze the approximate proximal point algorithms described above. In Section 5 we discuss practical concerns and cover numerical experiments involving Dual APPA and related stochastic algorithms. In Appendix A we prove general technical lemmas used throughout the paper and in Appendix B we provide a derivation of regularized ERM duality and related technical lemmas.

Approximate proximal point algorithm (APPA)

In this section we describe our approximate proximal point algorithm (APPA). This algorithm is perhaps the simplest, both in its description and in its analysis, in comparison to the others described in this paper. This section also introduces technical machinery that is used throughout the sequel.

We first present our formal abstraction of inner minimizers (Section 2.1), then we present our algorithm (Section 2.2), and finally we step through its analysis (Section 2.3).

To design APPA, we first quantify the error that can be tolerated of an inner minimizer, while accounting for the computational cost of ensuring such error. The abstraction we use is the following notion of inner approximation:

In other words, a primal oracle is an algorithm initialized at xx that reduces the error of fx,λf_{x,\lambda} by a 1/c1/c fraction, in time that depends on λ\lambda, and cc, and regularity properties of FF. Typical iterative first-order algorithms, such as those in Table 1, yield primal (c,λ)(c,\lambda)-oracles with runtimes TP\mathcal{T}_{\mathcal{P}} that scale inversely in λ\lambda or λ\sqrt{\lambda}, and logarithmically in cc. For instance:

SVRG (Johnson and Zhang, 2013) is a primal (c,λ)(c,\lambda)-oracle with runtime complexity TP=O(ndmin{κ,κλ}logc)\mathcal{T}_{\mathcal{P}}=O(nd\min\{{\kappa,\kappa_{\lambda}}\}\log c) for both the ERM problem (1) and the general ERM problem (2).

Using APCG (Lin et al., 2014) we can obtain a primal (c,λ)(c,\lambda)-oracle with runtime complexity TP=O~(ndκλlogc)\mathcal{T}_{\mathcal{P}}=\widetilde{O}(nd\sqrt{\kappa_{\lambda}}\log c) for the ERM problem (1).AP-SDCA could likely also serve as a primal oracle with the same guarantees. However, the results in Shalev-Shwartz and Zhang (2014) are stated assuming initial primal and dual variables are zero. It is not directly clear how one can provide a generic relative decrease in error from this specific initial primal-dual pair.

Corollary B.3 implies that, given a primal point xx, we can obtain, in O(nd)O(nd) time, a corresponding dual point yy such that the duality gap fx,λ(x)+gx,λ(y)f_{x,\lambda}(x)+g_{x,\lambda}(y) (and thus the dual error) is at most O(\poly(κλ))O(\poly(\kappa_{\lambda})) times the primal error. Lemma B.1 implies that decreasing the dual error by a factor O(\poly(κλ)c)O(\poly(\kappa_{\lambda})c) decreases the induced primal error by cc. Therefore, applying APCG to the dual and performing the primal and dual mappings yield the theorem. ∎

2 Algorithm

Our Approximate Proximal Point Algorithm (APPA) is given by the following Algorithm 1.

The central goal of this section is to prove the following lemma, which guarantees a geometric convergence rate for the iterates produced in this manner

This lemma immediately implies the following running-time bounds for APPA.

Given a primal (2(μ+λ)μ,λ)(\frac{2(\mu+\lambda)}{\mu},\lambda)-oracle P\mathcal{P}, Algorithm 1 minimizes the general ERM problem (2) to within accuracy ϵ\epsilon in time O(TPλ/μlog(ϵ0/ϵ))O({\mathcal{T}}_{\mathcal{P}}\lceil\lambda/\mu\rceil\log(\epsilon_{0}/\epsilon)).When the oracle is a randomized algorithm, the expected accuracy is at most ϵ\epsilon.

Combining Theorem 2.5 and Theorem 2.3 immediately yields our desired running time for solving (1).

Instantiating Algorithm 1 with the Theorem 2.3 as the primal oracle and taking λ=μ\lambda=\mu yields the running time of O~(ndκlog(ϵ0/ϵ))\widetilde{O}(nd\sqrt{\kappa}\log(\epsilon_{0}/\epsilon)) for solving (1).

3 Analysis

This section gives a proof of Lemma 2.4. Throughout, no assumption is made on FF aside from μ\mu-strong convexity. Namely, we need not have FF be smooth or at all differentiable.

First, we consider the effect of an exact inner minimizer. Namely, we prove the following lemma relating the minimum of the inner problem fs,λf_{s,\lambda} to FoptF^{\text{opt}}.

Let xopt=argminxF(x)x^{\text{opt}}=\operatorname*{argmin}_{x}F(x) and for all α\alpha\in let xα=(1α)s+αxoptx_{\alpha}=(1-\alpha)s+\alpha x^{\text{opt}}. The μ\mu-strong convexity of FF implies that, for all α\alpha\in,

Consequently, by the definition of fs,λoptf^{\text{opt}}_{s,\lambda},

Choosing α=μμ+λ\alpha=\frac{\mu}{\mu+\lambda} yields the result. ∎

This immediately implies contraction for the exact PPA, as it implies that in every iteration of PPA the error in FF decreases by a multiplicative λ/(λ+μ)\lambda/(\lambda+\mu). Using this we prove Lemma 2.4.

Let x=P(x)x^{\prime}=P(x). By definition of primal oracle PP we have

Using that clearly for all zz we have F(z)fx,λ(z)F(z)\leq f_{x,\lambda}(z) we see that F(x)fx,λ(x)F(x^{\prime})\leq f_{x,\lambda}(x^{\prime}) and Foptfx,λoptF^{\text{opt}}\leq f^{\text{opt}}_{x,\lambda}. Combining with the fact that fx,λ(x)=F(x)f_{x,\lambda}(x)=F(x) yields the result. ∎

Accelerated APPA

In this section we show how generically accelerate the APPA algorithm of Section 2. Accelerated APPA (Algorithm 2) uses inner minimizers more efficiently, but requires a smaller minimization factor when compared to APPA. The algorithm and its analysis immediately prove Theorem 1.1 and in turn yield another means by which we achieve the accelerated running time guarantees for solving (1).

We first present the algorithm and state its running time guarantees (Section 3.1), then prove the guarantees as part of analysis (Section 3.2).

Our accelerated APPA algorithm is given by Algorithm 2. In every iteration it still makes a single call to a primal oracle, but rather than requiring a fixed constant minimization the minimization factor depends polynomial on the ratio of λ\lambda and μ\mu.

The central goal is to prove the following theorem regarding the running time of APPA.

Given a primal (4(2λ+μμ)3/2,λ)(4(\frac{2\lambda+\mu}{\mu})^{3/2},\lambda)-oracle P\mathcal{P} for λ2μ\lambda\geq 2\mu, Algorithm 2 minimizes the general ERM problem (2) to within accuracy ϵ\epsilon in time O(TPλ/μlog(ϵ0/ϵ))O(\mathcal{T}_{\mathcal{P}}\sqrt{\lceil\lambda/\mu\rceil}\log(\epsilon_{0}/\epsilon)).

This theorem is essentially a restatement of Theorem 1.1 and by instantiating it with Theorem 2.2 we obtain the following.

Instantiating Theorem 3.1 with SVRG (Johnson and Zhang, 2013) as the primal oracle and taking λ=2μ+LR2\lambda=2\mu+LR^{2} yields the running time bound O~(ndκlog(ϵ0/ϵ))\widetilde{O}(nd\sqrt{\kappa}\log(\epsilon_{0}/\epsilon)) for the general ERM problem (2).

2 Analysis

Here we establish the convergence rate of Algorithm 2, Accelerated APPA, and prove Theorem 3.1. Note that as in Section 2 the results in this section use nothing about the structure of FF other than strong convexity and thus they apply to the general ERM problem (2).

We remark that aspects of the proofs in this section bear resemblance to the analysis in Shalev-Shwartz and Zhang (2014), which achieves similar results in a more specialized setting.

Our proof is split into the following parts.

In Lemma 3.3 we show that applying a primal oracle to the inner minimization problem gives us a quadratic lower bound on F(x)F(x).

In Lemma 3.4 we use this lower bound to construct a series of lower bounds for the main objective function ff, and accelerate the APPA algorithm, comprising the bulk of the analysis.

In Lemma 3.5 we show that the requirements of Lemma 3.4 can be met by using a primal oracle that decreases the error by a constant factor.

In Lemma 3.6 we analyze the initial error requirements of Lemma 3.4.

The proof of Theorem 3.1 follows immediately from these lemmas.

Note that as μ=μ/2\mu^{\prime}=\mu/2 we are only losing a factor of 2 in the strong convexity parameter for our lower bound. This allows us to account for errors without sacrificing in our ultimate asymptotic convergence rates.

Since FF is μ\mu-strongly convex clearly fx0,λf_{x_{0},\lambda} is μ+λ\mu+\lambda strongly convex, by Lemma A.1

By Cauchy-Schwartz and Young’s Inequality we know that

On the other hand, since fx0,λ(x+)fx0,λ(xx0,λopt)+ϵf_{x_{0},\lambda}(x^{+})\leq f_{x_{0},\lambda}(x^{\text{opt}}_{x_{0},\lambda})+\epsilon by assumption we have λ+μ2x+xopt22ϵ\frac{\lambda+\mu}{2}\|x^{+}-{x^{\text{opt}}}\|_{2}^{2}\leq\epsilon and therefore

and using the fact that fx0,λ(x)=F(x)+λ2xx022f_{x_{0},\lambda}(x)=F(x)+\frac{\lambda}{2}\|x-x_{0}\|_{2}^{2}, we have

The right hand side of the above equation is a quadratic function. Looking at its gradient with respect to xx we see that it obtains its minimum when x=x0(1μ+1λ)gx=x_{0}-(\frac{1}{\mu^{\prime}}+\frac{1}{\lambda})g and has a minimum value of F(x+)12μg22λ+2μμϵF(x^{+})-\frac{1}{2\mu^{\prime}}\|g\|_{2}^{2}-\frac{\lambda+2\mu^{\prime}}{\mu^{\prime}}\epsilon. ∎

Suppose that in each iteration tt we have ψt=defψtopt+μ2xv(t)22\psi_{t}\stackrel{{\scriptstyle\rm def}}{{=}}\psi^{\text{opt}}_{t}+\frac{\mu^{\prime}}{2}\|x-v^{{({t})}}\|_{2}^{2} such that F(x)ψt(x)F(x)\geq\psi_{t}(x) for all xx. Let ρ=defμ+λμ\rho\stackrel{{\scriptstyle\rm def}}{{=}}\frac{\mu^{\prime}+\lambda}{\mu^{\prime}} for λ3μ\lambda\geq 3\mu^{\prime}, and let

y(t)=def(11+ρ1/2)x(t)+(ρ1/21+ρ1/2)v(t)y^{{({t})}}\stackrel{{\scriptstyle\rm def}}{{=}}\left(\frac{1}{1+\rho^{-1/2}}\right)x^{{({t})}}+\left(\frac{\rho^{-1/2}}{1+\rho^{-1/2}}\right)v^{{({t})}},

g(t)=defλ(y(t)x(t+1))g^{{({t})}}\stackrel{{\scriptstyle\rm def}}{{=}}\lambda(y^{{({t})}}-x^{{({t+1})}}),

v(t+1)=def(1ρ1/2)v(t)+ρ1/2[y(t)(1μ+1λ)g(t)]v^{{({t+1})}}\stackrel{{\scriptstyle\rm def}}{{=}}(1-\rho^{-1/2})v^{{({t})}}+\rho^{-1/2}\left[y^{{({t})}}-\left(\frac{1}{\mu^{\prime}}+\frac{1}{\lambda}\right)g^{{({t})}}\right].

Thus, for β=1ρ1/2\beta=1-\rho^{-1/2} we can let

where in the last line we used Lemma A.3. Again, by Lemma A.3 we know that

In the second step we used the following fact:

Furthermore, expanding the term μ2(xy(t))+γμg(t)22\frac{\mu}{2}\|(x-y^{{({t})}})+\frac{\gamma}{\mu}g^{{({t})}}\|_{2}^{2} and instantiating xx with x(t)x^{{({t})}} in (9) yields

Note that we have chosen y(t)y^{{({t})}} so that the inner product term equals , and we choose β=1ρ1/212\beta=1-\rho^{-1/2}\geq\frac{1}{2} which ensures

In the final step we are using the fact that λ+2μμ2ρ\frac{\lambda+2\mu^{\prime}}{\mu^{\prime}}\leq 2\rho and ρ1\rho\geq 1. ∎

We will try to show the lower bound fy(t),λoptf^{\text{opt}}_{y^{{({t})}},\lambda} is larger than ψtopt\psi^{\text{opt}}_{t} by the same amount. This is because for all xx we have

The right hand side is a quadratic function, whose optimal point is at x=μv(t)+λy(t)μ+λx=\frac{\mu^{\prime}v^{{({t})}}+\lambda y^{{({t})}}}{\mu^{\prime}+\lambda} and whose optimal value is equal to

By definition of ρ1\rho^{-1}, we know μλ2(μ+λ)1(1+ρ1/2)2x(t)v(t)22\frac{\mu^{\prime}\lambda}{2(\mu^{\prime}+\lambda)}\cdot\frac{1}{(1+\rho^{-1/2})^{2}}\|x^{{({t})}}-v^{{({t})}}\|_{2}^{2} is exactly equal to λ2ρ1(1+ρ1/2)2x(t)v(t)22\frac{\lambda}{2}\cdot\frac{\rho^{-1}}{(1+\rho^{-1/2})^{2}}\|x^{{({t})}}-v^{{({t})}}\|_{2}^{2}, therefore fy(t),λ(x(t))fy(t),λoptF(x(t))ψtoptf_{y^{{({t})}},\lambda}(x^{{({t})}})-f^{\text{opt}}_{y^{{({t})}},\lambda}\leq F(x^{{({t})}})-\psi^{\text{opt}}_{t}. ∎

In the next lemma we show that moving to the regularized problem has the same effect on the primal function value and the lower bound. This is a result of the choice of β\beta in the proof of Lemma 3.4. However, this does not mean that the choice of β\beta is very fragile. We can choose any β\beta^{\prime} that is between the current β\beta and 11; the effect on this lemma will be that the increase in primal function becomes smaller than the increase in the lower bound (so the lemma continues to hold).

Let ψ0opt=F(x(0))λ+2μμ(F(x(0))fopt)\psi^{\text{opt}}_{0}=F(x^{{({0})}})-\frac{\lambda+2\mu^{\prime}}{\mu^{\prime}}(F(x^{{({0})}})-f^{\text{opt}}), and v(0)=x(0)v^{{({0})}}=x^{{({0})}}, then ψ0=defψ0opt+μ2xv02\psi_{0}\stackrel{{\scriptstyle\rm def}}{{=}}\psi^{\text{opt}}_{0}+\frac{\mu^{\prime}}{2}\|x-v_{0}\|^{2} is a valid lower bound for FF. In particular when λ=LR2\lambda=LR^{2} then F(x(0))ψ0opt2κ(F(x(0))fopt)F(x^{{({0})}})-\psi^{\text{opt}}_{0}\leq 2\kappa(F(x^{{({0})}})-f^{\text{opt}}).

This lemma is a direct corollary of Lemma 3.3 with x+=x(0)x^{+}=x^{{({0})}}. ∎

Dual APPA

In this section we develop Dual APPA (Algorithm 3), a natural approximate proximal point algorithm that operates entirely in the regularized ERM dual. Our focus here is on theoretical properties of Dual APPA; Section 5 later explores aspects of Dual APPA more in practice.

We first present an abstraction for dual-based inner minimizers (Section 4.1), then present the algorithm (Section 4.2), and finally step through its runtime analysis (Section 4.3).

Our primary goal in this section is to quantify how much objective function progress an algorithm needs to make in the dual problem, gs,λg_{s,\lambda} (See Section 1.1) in order to ensure primal progress at a rate similar to that in APPA (Algorithm 1).

Here, similar to Section 2.1, we formally define our requirements for an approximate dual-based inner dual minimize. In particular, we use the following notion of dual oracle.

Dual based algorithms for regularized ERM and variants of coordinate descent typically can be used as such a dual oracle. In particular we note that APCG is such a dual oracle.

APCG (Lin et al., 2014) is a dual (c,λ)(c,\lambda)-oracle with runtime complexity TD=O~(ndκλlogc)\mathcal{T}_{\mathcal{D}}=\widetilde{O}(nd\sqrt{\kappa_{\lambda}}\log c).As in Theorem 2.3, AP-SDCA could likely also serve as a dual oracle with the same guarantees, provided it is modified to allow for the more general primal-dual initialization.

2 Algorithm

Our dual APPA is given by the following Algorithm 3.

Dual APPA (Algorithm 3) repeatedly queries a dual oracle while producing primal iterates via the dual-to-primal mapping (5) along the way. We show that it obtains the following running time bound:

Given a dual (σ,λ)\left(\sigma,\lambda\right)-oracle D\mathcal{D}, where

Algorithm 3 minimizes the ERM problem (1) to within accuracy ϵ\epsilon in time O~(TDλ/μlog(ϵ0/ϵ))\widetilde{O}(\mathcal{T}_{\mathcal{D}}\lceil\lambda/\mu\rceil\log(\epsilon_{0}/\epsilon)).As in Theorem 2.5, when the oracle is a randomized algorithm, the expected accuracy is at most ϵ\epsilon.

Combining Theorem 4.3 and Theorem 4.2 immediately yields another way to achieve our desired running time for solving (1).

Instantiating Theorem 4.3 with Theorem 4.2 as the dual oracle and taking λ=μ\lambda=\mu yields the running time bound O~(ndκlog(ϵ0/ϵ))\widetilde{O}(nd\sqrt{\kappa}\log(\epsilon_{0}/\epsilon)).

While both this result and the results in Section 2 show that APCG can be used to achieve our fastest running times for solving (1), note that the algorithms they suggest are in fact different. In every invocation of APCG in Algorithm 1, we need to explicitly compute both the primal-to-dual and dual-to-primal mappings (in O(nd)O(nd) time). However, here we only need to compute the primal-to-dual mapping once upfront, in order to initialize the algorithm. Every subsequent invocation of APCG then only requires a single dual-to-primal mapping computation, which can often be streamlined. From a practical viewpoint, this can be seen as a natural “warm start” scheme for the dual-based inner minimizer.

3 Analysis

Here we proves Theorem 4.3. We begin by bounding the error of the dual regularized ERM problem when the center of regularization changes. This characterizes the initial error at the beginning of each Dual APPA iteration.

In other words, the dual error gs,λ(y)gs,λoptg_{s,\lambda}(y)-g^{\text{opt}}_{s,\lambda} is bounded across a re-centering step by multiples of previous sub-optimality measurements (namely, dual error and gradient norm).

By the definition of gx,λg_{x,\lambda} and xx^{\prime} we have, for all zz,

Furthermore, since gg is 1L\frac{1}{L}-strongly convex we can invoke Lemma A.2 obtaining

Finally, since FF is μ\mu-strongly convex, by Lemma A.1, we have

Combining and recalling the definition of κ\kappa yields the result. ∎

The following lemma establishes the rate of convergence of the primal iterates {x(t)}\{{x^{({t})}}\} produced over the course of Dual APPA, and in turn implies Theorem 4.3.

Let c(0,1)c^{\prime}\in(0,1) be arbitrary and suppose that σ(40/c)n2κλ2max{κ,κλ}λ/μ\sigma\geq(40/c^{\prime})n^{2}\kappa_{\lambda}^{2}\max\{\kappa,\kappa_{\lambda}\}\lceil\lambda/\mu\rceil in Dual APPA (Algorithm 3). Then in every iteration t1t\geq 1 of Dual APPA (Algorithm 3) the following invariants hold:

For notational convenience we let r=def(λ+cμλ+μ)r\stackrel{{\scriptstyle\rm def}}{{=}}(\frac{\lambda+c^{\prime}\mu}{\lambda+\mu}), gt=defgx(t),λg_{t}\stackrel{{\scriptstyle\rm def}}{{=}}g_{x^{({t})},\lambda}, ft=deffx(t),λf_{t}\stackrel{{\scriptstyle\rm def}}{{=}}f_{x^{({t})},\lambda}, and ϵt=defF(x(t))Fopt\epsilon_{t}\stackrel{{\scriptstyle\rm def}}{{=}}F(x^{({t})})-F^{\text{opt}} for all t0t\geq 0. Thus, we wish to show that ϵt1rt1ϵ0\epsilon_{t-1}\leq r^{t-1}\epsilon_{0} (equivalent to (11)) and we wish to show that gt1(y(t))gt1optrt1ϵ0g_{t-1}(y^{({t})})-g^{\text{opt}}_{t-1}\leq r^{t-1}\epsilon_{0} (equivalent to (10)) for all t1t\geq 1.

By definition of a dual oracle we have, for all t1t\geq 1,

and by Lemma 2.7 we know that for all t1t\geq 1

Furthermore, by Corollary B.3, the definition of y(0)y^{({0})}, and the facts that f0(x(0))=F(x(0))f_{0}(x^{({0})})=F(x^{({0})}) and ft(z)F(z)f_{t}(z)\geq F(z) we have

We show that combining these and applying strong induction on tt yields the desired result.

We begin with our base cases. When t=1t=1 the invariant (11) holds immediately by definition. Furthermore, when t=1t=1 we see that the invariant (10) holds, since σ2κλ\sigma\geq 2\kappa_{\lambda} and

were we used (12) and (16) respectively. Finally we show that invariant (11) holds for t=2t=2:

Now consider t3t\geq 3 for the second invariant (11). We show this holds assuming the invariants hold for all smaller tt.

Since σ20n2κλ2κ/(cλ/(μ+λ))\sigma\geq 20n^{2}\kappa_{\lambda}^{2}\kappa/(c^{\prime}\lambda/(\mu+\lambda)) combining yields that

and the result follows by the inductive hypothesis on ϵt2\epsilon_{t-2}.

Finally we show that invariant (10) holds for any t2t\geq 2 given that it holds for all smaller tt and invariant (11) holds for that tt and all smaller tt.

Implementation

In the following two subsections, respectively, we discuss implementation details and report on an empirical evaluation of the APPA framework.

While theoretical convergence rates lay out a broad-view comparison of the algorithms in the literature, we briefly remark on some of the finer-grained differences between algorithms, which inform their implementation or empirical behavior. To match the terminology used for SVRG in Johnson and Zhang (2013), we refer to a “stage” as a single step of APPA, i.e. the time spent executing the inner minimization of fx(t),λf_{x^{({t})},\lambda} or gx(t),λg_{x^{({t})},\lambda} (as in (3) and (4)).

At the end of every one of its stages, SVRG pauses to compute an exact gradient by a complete pass over the dataset (costing Θ(nd)\Theta(nd) time during which nn gradients are computed). Although an amortized runtime analysis hides this cost, this operation cannot be carried out in-step with the iterative updates of the previous stage, since the exact gradient is computed at a point that is only selected at the stage’s end.

Meanwhile, if each stage in Dual APPA is initialized with a valid primal-dual pair for the inner problem, Dual APPA can update the current primal point together with every dual coordinate update, in time O(d)O(d), i.e. with negligible increase in the overhead of the update. When doing so, the corresponding data row remains fresh in cache and, unlike SVRG, no additional gradient need be computed.

Moreover, initializing each stage with a valid such primal-dual pair can be done in only O(d)O(d) time. At the end of a stage where ss was the center point, Dual APPA holds a primal-dual pair (x,y)(x,y) where x=x^s(y)x=\widehat{x}_{s}(y). The next stage is centered at xx and the dual variables initialized at yy, so it remains to set up a corresponding primal point x=x^x(y)=x1λATyx^{\prime}=\widehat{x}_{x}(y)=x-\frac{1}{\lambda}A^{\mathsf{T}}y. This can be done by computing x2xsx^{\prime}\leftarrow 2x-s, since we know that xs=1λATyx-s=-\frac{1}{\lambda}A^{\mathsf{T}}y.

Decreasing λ𝜆\lambda

APPA and Dual APPA enjoy the nice property that, as long as the inner problems are solved with enough accuracy, the algorithm does not diverge even for large choice of λ\lambda. In practice this allows us to start with a large λ\lambda and make faster inner minimizations. If we heuristically observe that the function error is not decreasing rapidly enough, we can switch to a smaller λ\lambda. Figure 3 (Section 5.2) demonstrates this empirically. This contrasts with algorithm parameters such as step size choices in stochastic optimizers (that may still appear in inner minimization). Such parameters are typically more sensitive, and can suddenly lead to divergence when taken too large, making them less amenable to mid-run parameter tuning.

Stable update steps

When used as inner minimizers, dual coordinate-wise methods such as SDCA typically provide a convenient framework in which to derive parameter updates with data-dependent step sizes, or sometimes enables closed-form updates altogether (i.e. optimal solutions to each single-coordinate maximization sub-problem). For example, when Dual APPA is used together with SDCA to solve a problem of least-squares or ridge regression, the locally optimal SDCA updates can be performed efficiently in closed form. This decreases the number of algorithmic parameters requiring tuning, improves the overall the stability of the end-to-end optimizer and, in turn, makes it easier to use out of the box.

2 Empirical analysis

We experiment with Dual APPA in comparison with SDCA, SVRG, and SGD on several binary classification tasks.

Algorithms

Each algorithm is parameterized by a scalar value λ\lambda analogous to the λ\lambda used in proximal iteration: λ\lambda is the step size for SVRG, λt1/2\lambda t^{-1/2} is the decaying step size for SGD, and λ2x22\frac{\lambda}{2}\|{x}\|_{2}^{2} is the ridge penalty for SDCA. (See Johnson and Zhang (2013) for a comparison of SVRG to a more thoroughly tuned SGD under different decay schemes.) We use Dual APPA (Algorithm 3) with SDCA as the inner minimizer. For the algorithms with a notion of a stage – i.e. Dual APPA’s time spent invoking the inner minimizer, SVRG’s period between computing exact gradients – we set the stage size equal to the dataset size for simplicity.Such a choice is justified by the observation that doubling the stage size does not have noticeable effect on the results discussed. SVRG is given an advantage in that we choose not to count its gradient computations when it computes the exact gradient between stages. All algorithms are initialized at x=0x=0. Each algorithm was run under λ=10i\lambda=10^{i} for i=8,7,,8i=-8,-7,\dots,8, and plots report the trial that best minimized the original ERM objective.

Convergence and bias

Figure 1 also shows dashed lines corresponding to the ERM performance of the least-squares fit and of fully-optimized ridge regression, using λ\lambda as that of the best APPA and SDCA runs. These appear in the legend as “ls(λ\lambda).” They indicate lower bounds on the ERM value attainable by any algorithm that minimizes the corresponding regularized ERM objective. Lastly, test set classification accuracy demonstrates the extent to which a shrinkage bias is statistically desirable. In the MNIST and CIFAR holdout, we want only the small bias taken explicitly by SDCA (and effectively achieved by APPA). In the Protein holdout, we want no bias at all (again effectively achieved by APPA).

Parameter sensitivity

By solving only regularized ERM inner problems, SDCA and APPA enjoy a stable response to poor specification of the biasing parameter λ\lambda. Figure 3 plots the algorithms’ final value after 20 stages, against different choices of λ\lambda. Overestimating the step size in SGD or SVRG incurs a sharp transition into a regime of divergence. Meanwhile, APPA and SDCA always converge, with solution quality degrading more smoothly. APPA then exhibits an even better degradation as it overcomes an overaggressive biasing by the 20th stage.

Acknowledgments

Part of this work took place while RF and AS were at Microsoft Research, New England, and another part while AS was visiting the Simons Institute for the Theory of Computing, UC Berkeley. This work was partially supported by NSF awards 0843915 and 1111109, NSF Graduate Research Fellowship (grant no. 1122374).

References

Appendix A Technical lemmas

In this section we provide several stand-alone technical lemmas we use throughout the paper. First we provide Lemma A.1 some common inequalities regarding smooth or strongly convex functions, then Lemma A.2 which shows the effect of adding a linear term to a convex function, and then Lemma A.3 a small technical lemma regarding convex combinations of quadratic functions.

Apply the definition of smoothness and strong convexity at the points xx and xoptx^{\text{opt}} and minimize the resulting quadratic form. ∎

Let xopt=argminxf(x)x^{\text{opt}}=\operatorname*{argmin}_{x}f(x). Since ff is μ\mu-strongly convex by Lemma A.1 we have f(x)f(xopt)+μ2xxopt22f(x)\geq f(x^{\text{opt}})+\frac{\mu}{2}\|{x-x^{\text{opt}}}\|_{2}^{2} for all xx. Consequently, for all xx

Minimizing with respect to xx yields that faoptfa(xopt)12μa22f^{\text{opt}}_{a}\geq f_{a}(x^{\text{opt}})-\frac{1}{2\mu}\|{a}\|_{2}^{2}. Consequently, by Cauchy Schwarz, and Young’s Inequality we have

Setting the gradient of αf1(x)+(1α)f2(x)\alpha f_{1}(x)+(1-\alpha)f_{2}(x) to we know that vαv_{\alpha} must satisfy

and thus vα=αv1+(1α)v2v_{\alpha}=\alpha v_{1}+(1-\alpha)v_{2}. Finally,

Appendix B Regularized ERM duality

In this section we derive the dual (4) to the problem of computing proximal operator for the ERM objective (3) (Section B.1) and prove several bounds on primal and dual errors (Section B.2). Throughout this section we assume FF is given by the ERM problem (1) and we make extensive use of the notation and assumptions in Section 1.1.

We can rewrite the primal problem, minxfs,λ(x)\min_{x}f_{s,\lambda}(x), as

it follows that the optimization problem is in turn equivalent to

This negated problem is precisely the dual formulation.

The first problem is a Lagrangian saddle-point problem, where the Lagrangian is defined as

The dual-to-primal mapping (5) and primal-to-dual mapping (6) are implied by the KKT conditions under L\mathcal{L}, and can be derived by solving for xx, yy, and zz in the system L(x,y,z)=0\nabla\mathcal{L}(x,y,z)=0.

The duality gap in this context is defined as

B.2 Error bounds

Finally, since each ϕi\phi_{i}^{*} is 1/L1/L-strongly convex, GG is 1/L1/L-strongly convex and hence so is gs,λg_{s,\lambda}. Therefore by Lemma A.1 we have

Substituting (22) in (21) and recalling that κλ1\kappa_{\lambda}\geq 1 yields the result. ∎

To prove the first identity (23), let y^=y^(x)\widehat{y}=\widehat{y}(x) for brevity. Recall that

by definition, and hence xTaiy^iϕi(y^i)=ϕi(aiTx)x^{\mathsf{T}}a_{i}\widehat{y}_{i}-\phi_{i}^{*}(\widehat{y}_{i})=\phi_{i}(a_{i}^{\mathsf{T}}x). Observe that

Now clearly F(x)=fx,λ(x)\nabla F(x)=\nabla f_{x,\lambda}(x). Furthermore, since fx,λ(x)f_{x,\lambda}(x) is (nLR2+λ)(nLR^{2}+\lambda)-smooth by Lemma A.1 we have fx,λ(x)2(nLR2+λ)(fx,λ(x)fx,λopt)\|{\nabla f_{x,\lambda}(x)}\|\leq 2(nLR^{2}+\lambda)(f_{x,\lambda}(x)-f^{\text{opt}}_{x,\lambda}). Consequently,

Recalling the definition of κλ\kappa_{\lambda} and the fact that 1κλ1\leq\kappa_{\lambda} yields the result. ∎