On Optimal Probabilities in Stochastic Coordinate Descent Methods

Peter Richtárik, Martin Takáč

Introduction

In this work we consider the optimization problem

where ϕ\phi is strongly convex and smooth. We propose a new algorithm, and call it ‘NSync (Nonuniform SYNchronous Coordinate descent).

In ‘NSync, we first assign a probability pS0p_{S}\geq 0 to every subset SS of [n]:={1,,n}[n]:=\{1,\dots,n\}, with SpS=1\sum_{S}p_{S}=1, and pick stepsize parameters wi>0w_{i}>0, i=1,2,,ni=1,2,\dots,n. At every iteration, a random set S^\hat{S} is generated, independently from previous iterations, following the law Prob(S^=S)=pS\mathbf{Prob}(\hat{S}=S)=p_{S}, and then coordinates iS^i\in\hat{S} are updated in parallel by moving in the direction of the negative partial derivative with stepsize 1/wi1/w_{i}. The updates are synchronized: no processor/thread is allowed to proceed before all updates are applied, generating the new iterate xk+1x^{k+1}. We specifically study samplings S^\hat{S} which are non-uniform in the sense that pi:=Prob(iS^)=S:iSpSp_{i}:=\mathbf{Prob}(i\in\hat{S})=\sum_{S:i\in S}p_{S} is allowed to vary with ii. By iϕ(x)\nabla_{i}\phi(x) we mean ϕ(x),ei\langle\nabla\phi(x),e^{i}\rangle, where eiRne^{i}\in\mathbf{R}^{n} is the ii-th unit coordinate vector.

Serial stochastic coordinate descent methods were proposed and analyzed in , and more recently in various settings in . Parallel methods were considered in , and more recently in . A memory distributed method scaling to big data problems was recently developed in . A nonuniform coordinate descent method updating a single coordinate at a time was proposed in , and one updating two coordinates at a time in . To the best of our knowledge, ‘NSync is the first nonuniform parallel coordinate descent method.

Analysis

Our analysis of ‘NSync is based on two assumptions. The first assumption generalizes the ESO concept introduced in and later used in to nonuniform samplings. The second assumption requires that ϕ\phi be strongly convex.

Notation: For x,y,uRnx,y,u\in\mathbf{R}^{n} we write xu2:=iuixi2\|x\|_{u}^{2}:=\sum_{i}u_{i}x_{i}^{2}, x,yu:=i=1nuiyixi\langle x,y\rangle_{u}:=\sum_{i=1}^{n}u_{i}y_{i}x_{i}, xy:=(x1y1,,xnyn)x\bullet y:=(x_{1}y_{1},\dots,x_{n}y_{n}) and u1:=(1/u1,,1/un)u^{-1}:=(1/u_{1},\dots,1/u_{n}). For S[n]S\subseteq[n] and hRnh\in\mathbf{R}^{n}, let h[S]:=iShieih_{[S]}:=\sum_{i\in S}h_{i}e^{i}.

Assume p=(p1,,pn)T>0p=(p_{1},\dots,p_{n})^{T}>0 and that for some positive vector wRnw\in\mathbf{R}^{n} and all x,hRnx,h\in\mathbf{R}^{n},

Inequalities of type (2), in the uniform case (pi=pjp_{i}=p_{j} for all i,ji,j), were studied in .

We assume that ϕ\phi is γ\gamma-strongly convex with respect to the norm v\|\cdot\|_{v}, where v=(v1,,vn)T>0v=(v_{1},\dots,v_{n})^{T}>0 and γ>0\gamma>0. That is, we require that for all x,hRnx,h\in\mathbf{R}^{n},

We can now establish a bound on the number of iterations sufficient for ‘NSync to approximately solve (1) with high probability.

Let Assumptions 1 and 2 be satisfied. Choose x0Rnx^{0}\in\mathbf{R}^{n}, 0<ϵ<ϕ(x0)ϕ0<\epsilon<\phi(x^{0})-\phi^{*} and 0<ρ<10<\rho<1, where ϕ:=minxϕ(x)\phi^{*}:=\min_{x}\phi(x). Let

If {xk}\{x^{k}\} are the random iterates generated by ‘NSync, then

Moreover, we have the lower bound Λ(iwivi)/E[S^]\Lambda\geq(\sum_{i}\tfrac{w_{i}}{v_{i}})/\mathbf{E}[|\hat{S}|].

We first claim that ϕ\phi is μ\mu-strongly convex with respect to the norm wp1\|\cdot\|_{w\bullet p^{-1}}, i.e.,

where μ:=γ/Λ\mu:=\gamma/\Lambda. Indeed, this follows by comparing (3) and (6) in the light of (4). Let xx^{*} be such that ϕ(x)=ϕ\phi(x^{*})=\phi^{*}. Using (6) with h=xxh=x^{*}-x,

Let hk:=(Diag(w))1ϕ(xk)h^{k}:=-(\operatorname*{Diag}(w))^{-1}\nabla\phi(x^{k}). Then xk+1=xk+(hk)[S^]x^{k+1}=x^{k}+(h^{k})_{[\hat{S}]}, and utilizing Assumption 1, we get

Taking expectations in the last inequality and rearranging the terms, we obtain E[ϕ(xk+1)ϕ](1μ)E[ϕ(xk)ϕ](1μ)k+1(ϕ(x0)ϕ)\mathbf{E}[\phi(x^{k+1})-\phi^{*}]\leq(1-\mu)\mathbf{E}[\phi(x^{k})-\phi^{*}]\leq(1-\mu)^{k+1}(\phi(x^{0})-\phi^{*}). Using this, Markov inequality, and the definition of KK, we finally get Prob(ϕ(xK)ϕϵ)E[ϕ(xK)ϕ]/ϵ(1μ)K(ϕ(x0)ϕ)/ϵρ\mathbf{Prob}(\phi(x^{K})-\phi^{*}\geq\epsilon)\leq\mathbf{E}[\phi(x^{K})-\phi^{*}]/\epsilon\leq(1-\mu)^{K}(\phi(x^{0})-\phi^{*})/\epsilon\leq\rho. Let us now establish the last claim. First, note that (see [16, Sec 3.2] for more results of this type),

Letting Δ:={pRn:p0,ipi=E[S^]}\Delta:=\{p^{\prime}\in\mathbf{R}^{n}:p^{\prime}\geq 0,\sum_{i}p_{i}^{\prime}=\mathbf{E}[|\hat{S}|]\}, we have

where the last equality follows since optimal pip_{i}^{\prime} is proportional to wi/viw_{i}/v_{i}. ∎

Theorem 3 is generic in the sense that we do not say when Assumptions 1 and 2 are satisfied, how should one go about to choose the stepsizes ww and probabilities {pS}\{p_{S}\}. In the next section we address these issues. On the other hand, this abstract setting allowed us to write a brief complexity proof.

Change of variables. Consider the change of variables y=Diag(d)xy=\operatorname*{Diag}(d)x, where d>0d>0. Defining ϕd(y):=ϕ(x)\phi^{d}(y):=\phi(x), we get ϕd(y)=(Diag(d))1ϕ(x)\nabla\phi^{d}(y)=(\operatorname*{Diag}(d))^{-1}\nabla\phi(x). It can be seen that (2), (3) can equivalently be written in terms of ϕd\phi^{d}, with ww replaced by wd:=wd2w^{d}:=w\bullet d^{-2} and vv replaced by vd:=vd2v^{d}:=v\bullet d^{-2}. By choosing di=vid_{i}=\sqrt{v_{i}}, we obtain vid=1v^{d}_{i}=1 for all ii, recovering standard strong convexity.

Nonuniform samplings and ESO

Consider now problem (1) with ϕ\phi of the form

where v>0v>0. Note that Assumption 2 is satisfied. We further make the following two assumptions.

ff has Lipschitz gradient with respect to the coordinates, with positive constants L1,,LnL_{1},\dots,L_{n}. That is, if(x)if(x+tei)Lit|\nabla_{i}f(x)-\nabla_{i}f(x+te_{i})|\leq L_{i}|t| for all xRnx\in\mathbf{R}^{n} and tRt\in\mathbf{R}.

f(x)=JJfJ(x)f(x)=\sum_{J\in\mathcal{J}}f_{J}(x), where J\mathcal{J} is a finite collection of nonempty subsets of [n][n] and fJf_{J} are differentiable convex functions such that fJf_{J} depends on coordinates iJi\in J only. Let ω:=maxJJ\omega:=\max_{J}|J|. We say that ff is separable of degree ω\omega.

Uniform parallel coordinate descent methods for regularized problems with ff of the above structure were analyzed in .

Let f(x)=12Axb22f(x)=\tfrac{1}{2}\|Ax-b\|_{2}^{2}, where ARm×nA\in\mathbf{R}^{m\times n}. Then Li=A:i22L_{i}=\|A_{:i}\|_{2}^{2} and f(x)=12j=1m(Aj:xbj)2f(x)=\tfrac{1}{2}\sum_{j=1}^{m}(A_{j:}x-b_{j})^{2}, whence ω\omega is the maximum # of nonzeros in a row of AA.

Instead of considering the general case of arbitrary pSp_{S} assigned to all subsets of [n][n], here we consider a special kind of sampling having two advantages: i) sets can be generated easily, ii) it leads to larger stepsizes 1/wi1/w_{i} and hence improved convergence rate. Fix τ[n]\tau\in[n] and c1c\geq 1 and let S1,,ScS_{1},\dots,S_{c} be a collection of (possibly overlapping) subsets of [n][n] such that Sjτ|S_{j}|\geq\tau for all ii and j=1cSj=[n]\cup_{j=1}^{c}S_{j}=[n]. Moreover, let q=(q1,,qc)>0q=(q_{1},\dots,q_{c})>0 be a probability vector. Let S^j\hat{S}_{j} be τ\tau-nice sampling from SjS_{j}; that is, S^j\hat{S}_{j} picks subsets of SjS_{j} having cardinality τ\tau, uniformly at random. We assume these samplings are independent. Now, S^\hat{S} is generated as follows. We first pick j{1,,c}j\in\{1,\dots,c\} with probability qjq_{j}, and then draw S^j\hat{S}_{j}. Note that we do not need to compute the quantities pSp_{S}, S[n]S\subseteq[n], to execute ‘NSync. In fact, it is much easier to implement the sampling via the two-tier procedure explained above. Sampling S^\hat{S} is a nonuniform variant of the τ\tau-nice sampling studied in , which here arises as a special case for c=1c=1. Note that

where δij=1\delta_{ij}=1 if iSji\in S_{j}, and otherwise.

Let Assumptions 4 and 5 be satisfied, and let S^\hat{S} be the sampling described above. Then Assumption 1 is satisfied with pp given by (12) and any w=(w1,,wn)Tw=(w_{1},\dots,w_{n})^{T} for which

where ωj:=maxJJJSjω\omega_{j}:=\max_{J\in\mathcal{J}}|J\cap S_{j}|\leq\omega.

Since ff is separable of degree ω\omega, so is ϕ\phi (because 12xv2\frac{1}{2}\|x\|_{v}^{2} is separable). Now,

where the last inequality follows from the ESO for τ\tau-nice samplings established in [16, Theorem 15]. The claim now follows by comparing the above expression and (2). ∎

Optimal probabilities

Observe that formula (13) can be used to design a sampling (characterized by the sets SjS_{j} and probabilities qjq_{j}) that minimizes Λ\Lambda, which in view of Theorem 3 optimizes the convergence rate of the method.

Serial setting. Consider the serial version of ‘NSync (Prob(S^=1)=1\mathbf{Prob}(|\hat{S}|=1)=1). We can model this via c=nc=n, with Si={i}S_{i}=\{i\} and pi=qip_{i}=q_{i} for all i[n]i\in[n]. In this case, using (12) and (13), we get wi=wi=Li+viw_{i}=w_{i}^{*}=L_{i}+v_{i}. Minimizing Λ\Lambda in (4) over the probability vector pp gives the optimal probabilities (we refer to this as the optimal serial method) and optimal complexity

respectively. Note that the uniform sampling, pi=1/np_{i}=1/n for all ii, leads to ΛUS:=n+nmaxjLj/vj\Lambda_{US}:=n+n\max_{j}L_{j}/v_{j} (we call this the uniform serial method), which can be much larger than ΛOS\Lambda_{OS}. Moreover, under the change of variables y=Diag(d)xy=\operatorname*{Diag}(d)x, the gradient of fd(y):=f(Diag(d1)y)f^{d}(y):=f(\operatorname*{Diag}(d^{-1})y) has coordinate Lipschitz constants Lid=Li/di2L_{i}^{d}=L_{i}/d_{i}^{2}, while the weights in (11) change to vid=vi/di2v_{i}^{d}=v_{i}/d_{i}^{2}. Hence, the condition numbers Li/viL_{i}/v_{i} can not be improved via such a change of variables.

Optimal serial method can be faster than the fully parallel method. To model the fully parallel setting (i.e., the variant of ‘NSync updating all coordinates at every iteration), we can set c=1c=1 and τ=n\tau=n, which yields ΛFP=ω+ωmaxjLj/vj\Lambda_{FP}=\omega+\omega\max_{j}L_{j}/v_{j}. Since ωn\omega\leq n, it is clear that ΛUSΛFP\Lambda_{US}\geq\Lambda_{FP}. However, for large enough ω\omega it will be the case that ΛFPΛOS\Lambda_{FP}\geq\Lambda_{OS}, implying, surprisingly, that the optimal serial method can be faster than the fully parallel method.

Parallel setting. Fix τ\tau and sets SjS_{j}, j=1,2,,cj=1,2,\dots,c, and define θ:=maxj(1+(τ1)(ωj1)max{1,Sj1})\theta:=\max_{j}\left(1+\tfrac{(\tau-1)(\omega_{j}-1)}{\max\{1,|S_{j}|-1\}}\right). Consider running ‘NSync with stepsizes wi=θ(Li+vi)w_{i}=\theta(L_{i}+v_{i}) (note that wiwiw_{i}\geq w_{i}^{*}, so we are fine). From (4), (12) and (13) we see that the complexity of ‘NSync is determined by

The probability vector qq minimizing this quantity can be computed by solving a linear program with c+1c+1 variables (q1,,qc,αq_{1},\dots,q_{c},\alpha), 2n2n linear inequality constraints and a single linear equality constraint:

where biRcb^{i}\in\mathbf{R}^{c}, i[n]i\in[n], are given by bji=vi(Li+vi)δijSjb^{i}_{j}=\tfrac{v_{i}}{(L_{i}+v_{i})}\tfrac{\delta_{ij}}{|S_{j}|}.

Experiments

We now conduct 2 preliminary small scale experiments to illustrate the theory; the results are depicted below. All experiments are with problems of the form (11) with ff chosen as in Example 6.

In the left plot we chose AR2×30A\in\mathbf{R}^{2\times 30}, γ=1\gamma=1, v1=0.05v_{1}=0.05, vi=1v_{i}=1 for i1i\neq 1 and Li=1L_{i}=1 for all ii. We compare the US method (pi=1/np_{i}=1/n, blue) with the OS method (pip_{i} given by (16), red). The dashed lines show 95% confidence intervals (we run the methods 100 times, the line in the middle is the average behavior). While OS can be faster, it is sensitive to over/under-estimation of the constants Li,viL_{i},v_{i}. In the right plot we show that a nonuniform serial (NS) method can be faster than the fully parallel (FP) variant (we have chosen m=8m=8, n=10n=10 and 3 values of ω\omega). On the horizontal axis we display the number of epochs, where 1 epoch corresponds to updating nn coordinates (for FP this is a single iteration, whereas for NS it corresponds to nn iterations).

References