Mini-Batch Primal and Dual Methods for SVMs

Martin Takáč, Avleen Bijral, Peter Richtárik, Nathan Srebro

Introduction

Stochastic optimization approaches have been shown to have significant theoretical and empirical advantages in training linear Support Vector Machines (SVMs), as well as in many other learning applications, and are often the methods of choice in practice. Such methods use a single, randomly chosen, training example at each iteration. In the context of SVMs, approaches of this form include primal stochastic gradient descent (SGD) methods (e.g., Pegasos, Shalev-Shwartz et al. 2011, NORMA, Zhang 2004) and dual stochastic coordinate ascent (Hsieh et al., 2008).

However, the inherent sequential nature of such approaches becomes a problematic limitation for parallel and distributed computations as the predictor must be updated after each training point is processed, providing very little opportunity for parallelization. A popular remedy is to use mini-batches. That is, to use several training points at each iteration, instead of just one, calculating the update based on each point separately and aggregating the updates. The question is then whether basing each iteration on several points can indeed reduce the number of required iterations, and thus yield parallelization speedups.

In this paper, we consider using mini-batches with Pegasos (SGD on the primal objective) and with Stochastic Dual Coordinate Ascent (SDCA). We show that for both methods, the quantity that controls the speedup obtained using mini-batching/parallelization is the spectral norm of the data.

In Section 3 we provide the first analysis of mini-batched Pegasos (with the original, non-smooth, SVM objective) that provably leads to parallelization speedups (Theorem 1). The idea of using mini-batches with Pegasos is not new, and is discussed already by Shalev-Shwartz et al. (2011), albeit without a theoretical justification. The original Pegasos theoretical analysis does not benefit from using mini-batches—the same number of iterations is required even when large mini-batches are used, there is no speedup, and the serial runtime (overall number of operations, in this case data accesses) increases linearly with the mini-batch size. In fact, no parallelization speedup can be guaranteed based only on a bound on the radius of the data, as in the original Pegasos analysis. Instead, we provide a refined analysis based on the spectral norm of the data.

We then move on to SDCA (Section 4). We show the situation is more involved, and a modification to the method is necessary. SDCA has been consistently shown to outperform Pegasos in practice (Hsieh et al., 2008; Shalev-Shwartz et al., 2011), and is also popular as it does not rely on setting a step-size as in Pegasos. It is thus interesting and useful to obtain mini-batch variants of SDCA as well. We first show that a naive mini-batching approach for SDCA can fail, in particular when the mini-batch size is large relative to the spectral norm (Section 4.1). We then present a “safe” variant of mini-batched SDCA, which depends on the spectral norm, and an analysis for this safe variant that establishes the same spectral-norm-dependent parallelization speedups as for Pegasos (Section 4.2). Similar to a recent analysis of non-mini-batched SDCA by Shalev-Shwartz & Zhang (2012), we establish a guarantee on the duality gap, and thus also on the sub-optimality of the primal SVM objective, when using mini-batched SDCA (Theorem 2). We then go on to describe a more aggressive, adaptive, method for mini-batched SDCA, which is based on the analysis of the “safe” approach, and which we show often outperforms it in practice (Section 4.3, with experiments in Section 5).

For simplicity of presentation we focus on the hinge loss, as in the SVM objective. However, all our results for both Pegasos and SDCA are valid for any Lipschitz continuous loss function.

Several recent papers consider the use of mini-batches in stochastic gradient descent, as well as stochastic dual averaging and stochastic mirror descent, when minimizing a smooth loss function (Dekel et al., 2012; Agarwal & Duchi, 2011; Cotter et al., 2011). These papers establish parallelization speedups for smooth loss minimization with mini-batches, possibly with the aid of some “acceleration” techniques, and without relying on, or considering, the spectral norm of the data. However, these results do not apply to SVM training, where the objective to be minimized is the non-smooth hinge loss. In fact, the only data assumption in these papers is an assumption on the radius of the data, which is not enough for obtaining parallelization guarantees when the loss is non-smooth. Our contribution is thus orthogonal to these papers, showing that it is possible to obtain parallelization speedups even for non-smooth objectives, but only with a dependence on the spectral norm. We also analyze SDCA, which is a substantially different method from the methods analyzed in these papers. It is interesting to note that a bound of the spectral norm could perhaps indicate that it is easier to “smooth” the objective, and thus allow obtaining results similar to ours (i.e. on the suboptimality of the original non-smooth objective) by smoothing the objective and relying on mini-batched smooth SGD, where the spectral norm might control how well the smoothed loss captures the original loss. But we are not aware of any analysis of this nature, nor whether such an analysis is possible.

Support Vector Machines

where λ>0\lambda>0 is a regularization trade-off parameter. It is also useful to consider the dual of (1):

is the Gram matrix of the (labeled) data. The (primal) optimum of (1) is given by w=1λni=1nαiyixi{\bf w}^{*}=\frac{1}{\lambda n}\sum_{i=1}^{n}{\boldsymbol{\alpha}}^{*}_{i}y_{i}{\bf x}_{i}, where α{\boldsymbol{\alpha}}^{*} is the (dual) optimum of (2). It is thus natural to associate with each dual solution α{\boldsymbol{\alpha}} a primal solution (i.e., a linear predictor)

Mini-Batches in Primal Stochastic Gradient Descent Methods

Pegasos is an SGD approach to solving (1), where at each iteration the iterate w(t){\bf w}^{(t)} is updated based on an unbiased estimator of a sub-gradient of the objective P(w)\mathbf{P}({\bf w}). Whereas in a “pure” stochastic setting, the sub-gradient is estimated based on only a single training example, in our mini-batched variation (Algorithm 1) at each iteration we consider the partial objective:

where AtRand(b)A_{t}\in\operatorname{Rand}(b). We then calculate the subgradient of the partial objective Pt\mathbf{P}_{t} at w(t){\bf w}^{(t)}:

and χi(w):=1\chi_{i}({\bf w}):=1 if yiw,xi<1y_{i}\left\langle{\bf w},{\bf x}_{i}\right\rangle<1 and otherwise (indicator for not classifying example ii correctly with a margin). The next iterate is obtained by setting w(t+1)=w(t)ηt(t){\bf w}^{(t+1)}={\bf w}^{(t)}-\eta_{t}\boldsymbol{\nabla}^{(t)}. We can now write

Analysis of mini-batched Pegasos rests on bounding the norm of the subgradient estimates (t)\boldsymbol{\nabla}^{(t)}. An unconditional bound on this norm, used in the standard Pegasos analysis, follows from bounding

From (7) we then get (t)λw(t)+1\|\boldsymbol{\nabla}^{(t)}\|\leq\lambda\|{\bf w}^{(t)}\|+1; the standard Pegasos analysis follows. This bound relies only on the assumption maxixi1\max_{i}\|{\bf x}_{i}\|\leq 1, and is the tightest bound without further assumptions on the data.

The core novel observation here is that the expected (square) norm of L^A\nabla\hat{L}_{A} can be bounded in terms of (an upper bound on) the spectral norm of the data:

where \left\lVert{\cdot}\right\rVert denotes the spectral norm (largest singular value) of a matrix. In order to bound L^A\nabla\hat{L}_{A}, we first perform the following calculation, introducing the key quantity βb\beta_{b}, useful also in the analysis of SDCA.

We can now apply Lemma 1 to L^A\nabla\hat{L}_{A}:

Using the by-now standard analysis of SGD for strongly convex functions, we obtain the main result of this section:

After TT iterations of Pegasos with mini-batches (Algorithm 1), we have that for the averaged iterate wˉ(T)=2Tt=T/2+1Tw(t)\bar{{\bf w}}^{(T)}=\frac{2}{T}\sum_{t=\lfloor T/2\rfloor+1}^{T}{\bf w}^{(t)}:

Unrolling (9) with ηt=1/(λt)\eta_{t}=1/(\lambda t) yields

where g(τ):=L^Aτ(w(τ))g^{(\tau)}:=\nabla\hat{L}_{A_{\tau}}({\bf w}^{(\tau)}). Using the inequality τ=1t1g(τ)2(t1)τ=1t1g(τ)2\|\sum_{\tau=1}^{t-1}g^{(\tau)}\|^{2}\leq(t-1)\sum_{\tau=1}^{t-1}\|g^{(\tau)}\|^{2}, we now get

The performance guarantee is now given by the analysis of SGD with tail averaging (Theorem 5 of Rakhlin et al. 2012, with α=12\alpha=\tfrac{1}{2} and G2=4βbbG^{2}=4\tfrac{\beta_{b}}{b}). ∎

Parallelization speedup. When b=1b=1 we have βb=1\beta_{b}=1 (see (11)) and Theorem 1 agrees with the standard (serial) Pegasos analysisExcept that we avoid the logarithmic factor by relying on tail averaging and a more modern SGD analysis. (Shalev-Shwartz et al., 2011). For larger mini-batches, the guarantee depends on the quantity βb\beta_{b}, which in turn depends on the spectral norm σ2\sigma^{2}. Since 1nσ21\tfrac{1}{n}\leq\sigma^{2}\leq 1, we have 1βbb1\leq\beta_{b}\leq b.

However, when σ2<1\sigma^{2}<1, and so βb<1\beta_{b}<1, we see a benefit in using mini-batches in Theorem 1, corresponding to a parallelization speedup of bβb\tfrac{b}{\beta_{b}}. The best situation is when σ2=1n\sigma^{2}=\tfrac{1}{n}, and so βb=1\beta_{b}=1, which happens when all training points are orthogonal. In this case there is never any interaction between points in the mini-batch, and using a mini-batch of size bb is just as effective as making bb single-example steps. When βb=1\beta_{b}=1 we indeed see that the speedup speedup is equal to the number of mini-batches, and that the behavior in terms of the number of data accesses (equivalently, serial runtime) bTbT, does not depend on bb; that is, even with larger mini-batches, we require no more data accesses, and we gain linearly from being able to perform the accesses in parallel. The case σ2=1n\sigma^{2}=\tfrac{1}{n} is rather extreme, but even for intermediate values 1n<σ2<1\tfrac{1}{n}<\sigma^{2}<1 we get speedup. In particular, as long as b1σ2b\leq\tfrac{1}{\sigma^{2}}, we have βb2\beta_{b}\leq 2, and an essentially linear speedup. Roughly speaking, 1σ2\tfrac{1}{\sigma^{2}} captures the number of examples in the mini-batch beyond which we start getting significant interactions between points.

Mini-Batches in Dual Stochastic Coordinate Ascent Methods

An alternative stochastic method to Pegasos is Stochastic Dual Coordinate Ascent (SDCA, Hsieh et al. 2008), aimed to solve the dual problem (2). At each iteration we choose a single training example (xi,yi)({\bf x}_{i},y_{i}), uniformly at random, corresponding to a single dual variable (coordinate) αi=eiα{\boldsymbol{\alpha}}_{i}=e_{i}^{\top}{\boldsymbol{\alpha}}. Subsequently, αi{\boldsymbol{\alpha}}_{i} is updated so as to maximize the (dual) objective, keeping all other coordinates of α{\boldsymbol{\alpha}} unchanged and maintaining the box constraints. At iteration tt, the update δi(t){\boldsymbol{\delta}}_{i}^{(t)} to αi(t){\boldsymbol{\alpha}}_{i}^{(t)} is computed via

where clipI\textrm{clip}_{I} is projection onto the interval II. Variables αj(t){\boldsymbol{\alpha}}_{j}^{(t)} for jij\neq i are unchanged. Hence, a single iteration has the form α(t+1)=α(t)+δi(t)ei{\boldsymbol{\alpha}}^{(t+1)}={\boldsymbol{\alpha}}^{(t)}+{\boldsymbol{\delta}}_{i}^{(t)}e_{i}. Similar to a Pegasos update, at each iteration a single, random, training point is considered, the “response” yiw(α(t)),xiy_{i}\left\langle{\bf w}({\boldsymbol{\alpha}}^{(t)}),{\bf x}_{i}\right\rangle is calculated (this operation dominates the computational effort), and based on the response, a multiple of xi{\bf x}_{i} is added to the weight vector w{\bf w} (corresponding to changing αi{\boldsymbol{\alpha}}_{i}). The two methods thus involve fairly similar operations at each iteration, with essentially identical computational costs. They differ in that in Pegasos, αi{\boldsymbol{\alpha}}_{i} is changed according to some pre-determined step-size, while SDCA changes it optimally so as to maximize the dual objective (and maintain dual feasibility); there is no step-size parameter.

A naive approach to parallelizing SDCA using mini-batches is to compute δi(t){\boldsymbol{\delta}}_{i}^{(t)} in parallel, according to (14), for all iAti\in A_{t}, all based on the current iterate α(t){\boldsymbol{\alpha}}^{(t)}, and then update αi(t+1)=αi(t)+δi(t){\boldsymbol{\alpha}}^{(t+1)}_{i}={\boldsymbol{\alpha}}^{(t)}_{i}+{\boldsymbol{\delta}}^{(t)}_{i} for iAti\in A_{t}, and keep αj(t+1)=αj(t){\boldsymbol{\alpha}}^{(t+1)}_{j}={\boldsymbol{\alpha}}^{(t)}_{j} for j∉Atj\not\in A_{t}. However, not only might this approach not reduce the number of required iterations, it might actually increase the number of required iterations. This is because the dual objective need not improve monotonically (as it does for “pure” SDCA), and even not converge.

To see this, consider an extreme situation with only two identical training examples: Q=[1111]{\bf Q}=\begin{bmatrix}1&1\\ 1&1\end{bmatrix}, λ=1n=12\lambda=\tfrac{1}{n}=\tfrac{1}{2} and mini-batch size b=2b=2 (i.e., in each iteration we use both examples). If we start with α(0)=0{\boldsymbol{\alpha}}^{(0)}={\bf 0} with D(α(0))=0\mathbf{D}({\boldsymbol{\alpha}}^{(0)})=0 then δ1(0)=δ2(0)=1{\boldsymbol{\delta}}_{1}^{(0)}={\boldsymbol{\delta}}_{2}^{(0)}=1 and following the naive approach we have α(1)=(1,1)T{\boldsymbol{\alpha}}^{(1)}=(1,1)^{T} with objective value D(α(1))=0\mathbf{D}({\boldsymbol{\alpha}}^{(1)})=0. In the next iteration δ1(1)=δ2(1)=1{\boldsymbol{\delta}}_{1}^{(1)}={\boldsymbol{\delta}}_{2}^{(1)}=-1 which brings us back to α(2)=0{\boldsymbol{\alpha}}^{(2)}={\bf 0}. So the algorithm will alternate between those two solutions with objective value D(α)=0\mathbf{D}({\boldsymbol{\alpha}})=0, while at the optimum D(α)=D((0.5,0.5))=0.25\mathbf{D}({\boldsymbol{\alpha}}^{*})=\mathbf{D}((0.5,0.5)^{\top})=0.25.

This is of course a simplistic toy example, but the same phenomenon will occur when a large number of training examples are identical or highly correlated. This can also be observed empirically in some of our experiments discussed later, e.g., in Figure 2.

The problem here is that since we update each αi{\boldsymbol{\alpha}}_{i} independently to its optimal value as if all other coordinates were fixed, we are ignoring interactions between the updates. As we see in the extreme example above, two different i,jAti,j\in A_{t}, might suggest essentially the same change to w(α(t)){\bf w}({\boldsymbol{\alpha}}^{(t)}), but we would then perform this update twice, overshooting and yielding a new iterate which is actually worse then the previous one.

2 Safe Mini-Batching

Instead, we propose a “safe” variant, where the term δAQAδA{\boldsymbol{\delta}}_{A}^{{\top}}{\bf Q}_{A}{\boldsymbol{\delta}}_{A} is approximately bounded by the separable surrogate βδA2\beta\left\lVert{{\boldsymbol{\delta}}_{A}}\right\rVert^{2}, for some β>0\beta>0 which we will discuss later. That is, the update is given by:

with αi(t+1)=αi(t)+δi(t){\boldsymbol{\alpha}}^{(t+1)}_{i}={\boldsymbol{\alpha}}^{(t)}_{i}+{\boldsymbol{\delta}}^{(t)}_{i} for iAti\in A_{t}, and αj(t+1)=αj(t){\boldsymbol{\alpha}}^{(t+1)}_{j}={\boldsymbol{\alpha}}^{(t)}_{j} for j∉Atj\not\in A_{t}. In essence, 1β\tfrac{1}{\beta} serves as a step-size, where we are now careful not to take steps so big that they will accumulate together and overshoot the objective. If handling only a single point at each iteration, such a short-step approach is not necessary, we do not need a step-size, and we can take a “full step”, setting αi{\boldsymbol{\alpha}}_{i} optimally (β=1\beta=1). But with the potential for interaction between coordinates updated in parallel, we must use a smaller step, depending on the potential for such interactions.

We will first rely on the bound (10), and establish that the choice β=βb\beta=\beta_{b} as in (11) provides for a safe step size. To do so, we consider the dual objective at α+δ{\boldsymbol{\alpha}}+{\boldsymbol{\delta}},

and the following separable approximation to it:

in which βbδ2\beta_{b}\left\lVert{{\boldsymbol{\delta}}}\right\rVert^{2} replaces δQδ{\boldsymbol{\delta}}^{\top}{\bf Q}{\boldsymbol{\delta}}. Our update (15) with β=βb\beta=\beta_{b} can be written as δ=argmaxδ:0α+δ1H(δ,α)\displaystyle{\boldsymbol{\delta}}=\arg\max_{{\boldsymbol{\delta}}:0\leq{\boldsymbol{\alpha}}+{\boldsymbol{\delta}}\leq 1}\mathbf{H}({\boldsymbol{\delta}},{\boldsymbol{\alpha}}) (we then use the coordinates δi{\boldsymbol{\delta}}_{i} for iAi\in A and ignore the rest). We are essentially performing parallel coordinate ascent on the separable approximation H(δ,α)\mathbf{H}({\boldsymbol{\delta}},{\boldsymbol{\alpha}}) instead of on D(α+δ)\mathbf{D}({\boldsymbol{\alpha}}+{\boldsymbol{\delta}}). To understand this approximation, we note that H(0,α)=D(α)\mathbf{H}({\bf 0},{\boldsymbol{\alpha}})=\mathbf{D}({\boldsymbol{\alpha}}), and show that H(δ,α)\mathbf{H}({\boldsymbol{\delta}},{\boldsymbol{\alpha}}) provides an approximate expected lower bound on D(α+δ)\mathbf{D}({\boldsymbol{\alpha}}+{\boldsymbol{\delta}}):

Inequalities of this general type are also studied in (Richtárik & Takáč, 2012) (see Sections 3 and 4). Based on the above lemma, we can modify the analysis of Shalev-Shwartz & Zhang (2012) to obtain (see complete proof in the appendix):

Consider the SDCA updates given by (15), with AtRand(b)A_{t}\in\operatorname{Rand}(b), starting from α(0)=0{\boldsymbol{\alpha}}^{(0)}={\bf 0} and with β=βb\beta=\beta_{b} (given in eq. (11)). For any ϵ>0\epsilon>0 and

The number of iterations of mini-batched SDCA, sufficient to reach primal suboptimality ϵ\epsilon, is by Theorem 2 equal to

We observe the same speedup as in the case of mini-batched Pegasos: factor of bβb\tfrac{b}{\beta_{b}}, with an essentially linear speedup when b1σ2b\leq\tfrac{1}{\sigma^{2}}. It is interesting to note that the quantity βb\beta_{b} only affects the second, ϵ\epsilon-dependent, term in (22). The “fixed cost” term, which essentially requires a full pass over the data, is not affected by βb\beta_{b}, and is always scaled down by bb.

3 Aggressive Mini-Batching

Using β=βσ\beta=\beta_{\sigma} is safe, but might be too safe/conservative. In particular, we used the spectral norm to bound δQδQδ2{\boldsymbol{\delta}}^{\top}{\bf Q}{\boldsymbol{\delta}}\leq\left\lVert{{\bf Q}}\right\rVert\left\lVert{{\boldsymbol{\delta}}}\right\rVert^{2} in Lemma 3 (through Lemma 1), but this is a worst case bound over all possible vectors, and might be loose for the relevant vectors δ{\boldsymbol{\delta}}. Relying on a worst-case bound might mean we are taking much smaller steps then we could be. Furthermore, the approach we presented thus far relies on knowing the spectral norm of the data, or at least a bound on the spectral norm (recall (10)), in order to set the step-size. Although it is possible to estimate this quantity by sampling, this can certainly be inconvenient.

Instead, we suggest a more aggressive variant of mini-batched SDCA which gradually adapts β\beta based on the actual values of δ[At](t)2\|{\boldsymbol{\delta}}_{[A_{t}]}^{(t)}\|^{2} and δ[At](t)Qδ[At](t){\boldsymbol{\delta}}^{(t)}_{[A_{t}]}{\bf Q}{\boldsymbol{\delta}}^{(t)}_{[A_{t}]}.

In Section 5 one can observe advantages of this aggressive strategy.

Experiments

In Figure 2 we demonstrate the evolution of solutions using the various methods for two specific data sets. Here we can again see the relative behaviour of the methods, as well as clearly see the failure of the naive approach, which past some point causes the objective to deteriorate and does not converge to the optimal solution.

Conclusion

Contribution. Our contribution in this paper is twofold: (i) we identify the spectral norm of the data, and through it the quantity βb\beta_{b}, as the important quantity controlling guarantees for mini-batched/parallelized Pegasos (primal method) and SDCA (dual method). We provide the first analysis of mini-batched Pagasos, with the non-smooth hinge-loss, that shows speedups, and we analyze for the first time mini-batched SDCA with guarantees expressed in terms of the primal problem (hence, our mini-batched SDCA is a primal-dual method); (ii) based on our analysis, we present novel variants of mini-batched SDCA which are necessary for achieving speedups similar to those of Pegasos, and thus open the door to effective mini-batching using the often-empirically-better SDCA.

Related work. Our safe SDCA mini-batching approach is similar to the parallel coordinate descent methods of Bradley et al. (2011) and Richtárik & Takáč (2012), but we provide an analysis in terms of the primal SVM objective, which is the more relevant object of interest. Furthermore, Bradley et al.’s analysis does not use a step-size and is thus limited only to small enough mini-batches—if the spectral norm is unknown and too large a mini-batch is used, their method might not converge. Richtárik & Takáč’s method does incorporate a fixed step-size, similar to our safe variant, but as we discuss this step-size might be too conservative for achieving the true potential of mini-batching.

Generality. We chose to focus on Pegasos and SDCA with regularized hinge-loss minimization, but all our results remain unchanged for any Lipschitz loss functions. Furthermore, Lemma 2 can also be used to establish identical speedups for mini-batched SGD optimization of minwBL^(w)\min_{\left\lVert{{\bf w}}\right\rVert\leq B}\hat{L}({\bf w}), as well as for direct stochastic approximation of the population objective (generalization error) minL(w)\min L({\bf w}). In considering the population objective, the sample size is essentially infinite, we sample with replacements (from the population), σ2\sigma^{2} is a bound on the second moment of the data distribution, and βb=1+(b1)σ2\beta_{b}=1+(b-1)\sigma^{2}.

Experiments. Our experiments confirm the empirical advantages of SDCA over Pegasos, previously observed without mini-batching. However, we also point out that in order to perform mini-batched SDCA effectively, a step-size is needed, detracting from one of the main advantages of SDCA over Pegasos. Furthermore, in the safe variant, this stepsize needs to be set according to the spectral norm (or bound on the spectral norm), with too small a setting for β\beta (i.e., too large steps) possibly leading to non-convergence, and too large a setting for β\beta yielding reduced speedups. In contrast, the Pegasos stepsize is independent of the spectral norm, and in a sense Pegasos adapts implicitly (see, e.g., its behavior compared to aggressive SDCA in the experiments). We do provide a more aggressive variant of SDCA, which does match Pegasos’s speedups empirically, but this requires an explicit heuristic adaptation of the stepsize.

Parallel Implementation. In this paper we analyzed the iteration complexity, and behavior of the iterates, of mini-batched Pegasos and SDCA. Unlike “pure” (bb=1) Pegasos and SDCA, which are not amenable to parallelization, using mini-batches does provide opportunities for it. Of course, actually achieving good parallelization speedups on a specific architecture in practice requires an efficient parallel, possibly distributed, implementation of the iterations. In this regard, we point out that the core computation required for both Pegasos and SDCA is that of computing iAgi(w,xi)xi\sum_{i\in A}g_{i}(\left\langle{\bf w},{\bf x}_{i}\right\rangle){\bf x}_{i}, where gg is some scalar function. Parallelizing such computations efficiently in a distributed environment has been studied by e.g., Dekel et al. (2012); Hsu et al. (2011); their methods can be used here too. Alternatively, one could also consider asynchronous or delayed updates (Agarwal & Duchi, 2011; Niu et al., 2011).

References

Appendix A Proof of Theorem 2

The proof of Theorem 2 follows mostly along the path of Shalev-Shwartz & Zhang (2012), crucially using Lemma 3, and with a few other required modifications detailed below.

The separable approximation H(δ,α)\mathbf{H}({\boldsymbol{\delta}},{\boldsymbol{\alpha}}) defined in (17) now has the more general form:

and all the properties mentioned in Section 4, including Lemma 3, still hold.

Our goal here is to get a bound on the duality gap, which we will denote by

The analysis now rests on the following lemma, paralleling Lemma 1 of Shalev-Shwartz & Zhang (2012), which bounds the expected improvement in the dual objective after a single iteration in terms of the duality gap:

The situation here is trickier then in the case b=1b=1 considered by Shalev-Shwartz & Zhang, and we will first bound the right hand side of (26) by HD\mathbf{H}-\mathbf{D} and then use the fact that δ(t){\boldsymbol{\delta}}^{(t)} is a minimizer of H(,α)\mathbf{H}(\cdot,{\boldsymbol{\alpha}}):

We will bound the change in the dual sub-optimality ϵD(t):=D(α)D(α(t))\epsilon_{\mathbf{D}}^{(t)}:=\mathbf{D}({\boldsymbol{\alpha}}^{*})-\mathbf{D}({\boldsymbol{\alpha}}^{(t)}):

Following the proof of Shalev-Shwartz & Zhang we will now show by induction that

Clearly, (30) implies that (31) holds for t=t0t=t_{0}. Now, if it holds for some tt0t\geq t_{0}, we show that it also holds for t+1t+1. Using s=2n2n+b(tt0)s=\frac{2n}{2n+b(t-t_{0})} in (28) we have:

where in the last inequality we used the arithmetic-geometric mean inequality. This establishes (31).

Now, for the average αˉ\bar{{\boldsymbol{\alpha}}} defined in (21) we have:

Now, we can ensure the above is at most ϵ\epsilon if we require:

Combining the requirements (29), (34) and (35) with Tnb+T0T\geq\lceil\frac{n}{b}\rceil+T_{0} and T0t0T_{0}\geq t_{0}, and recalling that for the hinge loss G=1G=1 and with α(0)=0{\boldsymbol{\alpha}}^{(0)}={\bf 0} we have ϵD(0)=D(α)D(0)10=1\epsilon^{(0)}_{\mathbf{D}}=\mathbf{D}({\boldsymbol{\alpha}}^{*})-\mathbf{D}({\bf 0})\leq 1-0=1 gives the requirements in Theorem 2. ∎