Newton-like method with diagonal correction for distributed optimization

Dragana Bajovic, Dusan Jakovetic, Natasa Krejic, Natasa Krklec Jerinkic

Introduction

Problems of this form arise in many emerging applications like big data analytics, e.g., , distributed inference in sensor networks , and distributed control, .

Various methods for solving (1) in a distributed manner are available in the literature. A class of methods based on gradient descent at each node and exchange of information between neighboring nodes is particularly popular, see . Assuming that the local costs fif_{i}’s are strongly convex and have Lipschitz continuous gradients and that a constant step size α\alpha is used, these methods converge linearly to a solution neighborhood. With such methods, step size α\alpha controls the tradeoff between the convergence speed towards a solution neighborhood and the distance of the limit point from the actual solution, larger α\alpha means faster convergence but larger distance from the solution in the limit; see, e.g., , . Distributed first order (gradient) methods allow for a penalty interpretation, where the distributed method is interpreted as a (centralized) gradient method applied on a carefully constructed penalty reformulation of the original problem (1); see , for details.

Given the existence of well developed theory and efficient implementations of higher order methods in centralized optimization in general, there is a clear need to investigate the possibilities of employing higher order methods in distributed optimization as well. More specifically, for additive cost structures (1) we study here, a further motivation for developing distributed higher order methods comes from their previous success when applied to similar problems in the context of centralized optimization. For example, additive cost (1) is typical in machine learning applications where second order methods play an important role, see, e.g., . Another similar class of problems arise in stochastic optimization, where the objective function is given in the form of mathematical expectation. Again, second order methods are successfully applied in centralized optimization, .

In this paper, we extend and propose a different family of distributed Newton-like methods for solving (1). We refer to the proposed family as Distributed Quasi Newton methods (DQN). The methods are designed to exploit the specific structure of the penalty reformulation , as is done in , but with a different Hessian inverse approximation, for which the idea originates in . Specifically, the Hessian matrix is approximated by its block diagonal part, while the remaining part of the Hessian is used to correct the right hand side of the quasi Newton equation. The methods exhibit linear convergence to a solution neighborhood under a set of standard assumptions for the functions fif_{i} and the network architecture – each fif_{i} is strongly convex and has Lipschitz continuous gradient, and the underlying network is connected. Simulation examples on (strongly convex) quadratic and logistic losses demonstrate that DQN compares favorably with NN proposed in .

This paper is organized as follows. In Section 2 we give the problem statement and some preliminaries needed for the definition of the method and convergence analysis. Section 3 contains the description of the proposed class of Newton-like methods and convergence results. Specific choices of the diagonal matrix that specifies the method completely are presented in Section 4. Some simulation results are presented in Section 5 while Section 6 discusses extensions of embedding DQN in the PMM framework. Finally, conclusions are drawn in Section 7, while Appendix provides some auxiliary derivations.

Preliminaries

Let us first give some preliminaries about the problem (1), its penalty interpretation in , as well as the decentralized gradient descent algorithm in that will be used later on.

The following assumption on the fif_{i}’s is imposed.

Here, II denotes the p×pp\times p identity matrix, and notation MNM\preceq N means that the matrix NMN-M is positive semi-definite.

This assumption implies that the functions fi,  i=1,,nf_{i},\;i=1,\ldots,n are strongly convex with modulus μ>0,\mu>0,

and the gradients are Lipschitz continuous with the constant LL i.e.

Assume that the network of nodes is an undirected network G=(V,E),{\mathcal{G}}=({\mathcal{V}},{\mathcal{E}}), where V{\mathcal{V}} is the set of nodes and E{\mathcal{E}} is the set of all edges, i.e., all pairs {i,j}\{i,j\} of nodes which can exchange information through a communication link.

Assumption A2. The network G=(V,E){\mathcal{G}}=({\mathcal{V}},{\mathcal{E}}) is connected, undirected and simple (no self-loops nor multiple links).

Let us denote by OiO_{i} the set of nodes that are connected with the node ii (open neighborhood of node ii) and let Oˉi=Oi{i}\bar{O}_{i}=O_{i}\bigcup\{i\} (closed neighborhood of node ii). We associate with G{\mathcal{G}} a symmetric, (doubly) stochastic n×nn\times n matrix W.W. The elements of WW are all nonnegative and rows (and columns) sum up to one. More precisely, we assume the following.

and there are constants wminw_{min} and wmaxw_{max} such that for i=1,,ni=1,\ldots,n

Denote by λ1λn\lambda_{1}\geq\ldots\geq\lambda_{n} the eigenvalues of W.W. Then it can be easily seen that λ1=1.\lambda_{1}=1. Furthermore, the null space of IWI-W is spanned by e:=(1,,1).e:=(1,\ldots,1).

The corresponding penalty reformulation of (1) is given by

Applying the standard gradient method to (4) with the unit step size we get

which, denoting the ii-th p×1p\times 1 block of xkx^{k} by xik,x_{i}^{k}, and after rearranging terms, yields the Decentralized Gradient Descent (DGD) method

The convergence of (6) towards xx^{*} is linear, i.e., the following estimate holds ,

where ξ(0,1)\xi\in(0,1) is a constant depending on Φ,α\Phi,\alpha and W.W.

Distributed Quasi Newton method

The problem under consideration has a specific structure as the Hessian is sparse if the underlying network is sparse. However its inverse is dense. Furthermore, the matrix inversion (i.e. linear system solving) is not suitable for decentralized computation. One possibility of exploiting the structure of 2Φ(x)\nabla^{2}\Phi(x) in a distributed environment is presented in where the Newton step is approximated through a number of inner iterations. We present here a different possibility. Namely, we keep the diagonal part of 2Φ(x)\nabla^{2}\Phi(x) as the Hessian approximation but at the same time use the non-diagonal part of 2Φ(x)\nabla^{2}\Phi(x) to correct the right hand side vector in the (Quasi)-Newton equation. Let us define the splitting

where 2F(x)\nabla^{2}F(x) is the block diagonal matrix with the iith diagonal block 2fi(xi)\nabla^{2}f_{i}(x_{i}).

Then, following a typical Quasi-Newton scheme, the next iteration is defined by

In summary, the proposed distributed algorithm for solving (4) is given below.

For the sake of clarity, the proposed algorithm, from the perspective of each node ii in the network, is presented in Algorithm 2.

Algorithm 2: DQN – distributed implementation At each node ii, require ρ,ε>0.\rho,\varepsilon>0.

Each node ii transmits xikx_{i}^{k} to all its neighbors jOij\in O_{i} and receives xjkx_{j}^{k} from all jOij\in O_{i}.

Each node ii transmits dikd_{i}^{k} to all its neighbors jOij\in O_{i} and receives djkd_{j}^{k} from all jOij\in O_{i}.

Each node ii chooses a diagonal p×pp\times p matrix Λik\Lambda_{i}^{k}, such that Λik2ρ.\|\Lambda_{i}^{k}\|_{2}\leq\rho.

Each node ii updates its solution estimate as:

Calculation of Λik\Lambda_{i}^{k} in step 6 will be specified in the next section, and, for certain algorithm variants, will involve an additional inter-neighbor communication of a pp-dimensional vector. Likewise, choices of tuning parameters ε,ρ,θ\varepsilon,\rho,\theta are discussed throughout the remaining of this section and Section 4.

2 Global linear convergence rate

In this subsection, the global linear convergence rate of Algorithm DQN is established. The convergence analysis consists of two parts. First, we demonstrate that sks^{k} is a descent direction. Then we determine a suitable interval for the step size ε\varepsilon that ensures linear convergence of the iterative sequence.

The following Gershgorin type theorem for block matrices is needed for the first part of convergence analysis.

Using the above theorem we can prove the following lower bound for all eigenvalues of a symmetric block matrix.

where λmin(Cii)\lambda_{\min}(C_{ii}) is the smallest eigenvalue of Cii.C_{ii}.

where λj(Cii)\lambda_{j}(C_{ii}) is the jj-th eigenvalue of Cii.C_{ii}. Thus

Hence, for each μ\mu for which (17) holds for some fixed i,i, we have

We are now ready to prove that the search direction (11) is descent.

for some δ(0,1/(αL+(1+θ)(1wmin)))\delta\in(0,1/(\alpha L+(1+\theta)(1-w_{min}))). Then sks^{k} defined by (11) is a descent direction and satisfies

Proof. Let us first show that sks^{k} is descent search direction. As

The next lemma corresponds to the standard property of descent direction methods that establish the relationship between the search vector and the gradient.

This can be shown similarly as the upper bound on (Aik)12\|({A}^{k}_{i})^{-1}\|_{2} below (19). Furthermore,

Assume that the conditions of Theorem 3.2 are satisfied. Define

with β\beta given by (22). Then Algorithm DQN generates a sequence {xk}\{x^{k}\} such that

and the convergence is at least linear with

Proof. The Mean Value Theorem, Lipschitz property of Φ\nabla\Phi, Theorem 3.2 and Lemma 3.1 yield

3 Local linear convergence

We have proved global linear convergence for the specific step length ε\varepsilon given in Theorem 3.3. However, local linear convergence can be obtained for the full step size, using the theory developed for Inexact Newton methods . The step sks^{k} can be considered as an Inexact Newton step and we are able to estimate the residual in Newton equation as follows.

Suppose that A1-A3 hold. Let xkx^{k} be such that Φ(xk)0.\nabla\Phi(x^{k})\neq 0. Assume that sks^{k} is generated in Step 2 of Algorithm 1 with

Proof. First, notice that the interval for ρ\rho is well defined. The definition of the search direction (11) and the splitting of the Hessian (9) yield

Recalling (19) and the fact that the expression above is decreasing with respect to wjj,w_{jj}, we get

Theorem 3.4 introduces an upper bound on the safeguard parameter ρ\rho different than the one considered in Theorem 3.2. The relation between the two bounds depends on the choice of δ\delta in Theorem 3.2. Taking a sufficiently small δ\delta in Theorem 3.2, we obtain that ρ\rho in Theorem 3.2 is larger. However, taking δ<1αL+(1+θ)(1wmin)\delta<\frac{1}{\alpha L+(1+\theta)(1-w_{min})} sufficiently close to 1αL+(1+θ)(1wmin)\frac{1}{\alpha L+(1+\theta)(1-w_{min})}, ρ\rho in Theorem 3.4 eventually becomes larger.

One way to interpret the relation between Theorems 3.2 and 3.3 on one hand, and Theorem 3.4 on the other hand, as far as ρ\rho is concerned, is as follows. Taking a very small δ\delta, Theorem 3.3 allows for a quite large ρ\rho but on the other hand it significantly decreases the admissible step size ε\varepsilon. At the same time, Theorem 3.4 corresponds in a sense to an opposite situation where ε\varepsilon is allowed to be quite large (in fact, equal to one), while ρ\rho is quite restricted. Therefore, the two results exploit the allowed “degrees of freedom” in a different way.

For the sake of completeness we list here the conditions for local convergence of Inexact Newton methods.

Assume that A1 holds and that sks^{k} satisfies the inequality

for some t<1.t<1. Furthermore, assume that xk+1=xk+sk,  k=0,1,.x^{k+1}=x^{k}+s^{k},\;k=0,1,\ldots. Then there exists η>0\eta>0 such that for all x0xη,\|x^{0}-x^{*}\|\leq\eta, the sequence {xk}\{x^{k}\} converges to x.x^{*}. The convergence is linear,

where y=2Φ(x)y.\|y\|_{*}=\|\nabla^{2}\Phi(x^{*})y\|.

The two previous theorems imply the following Corollary.

Assume that the conditions of Theorem 3.4 hold. Then there exists η>0\eta>0 such that for every x0x^{0} satisfying x0xη,\|x^{0}-x^{*}\|\leq\eta, the sequence {xk}\{x^{k}\} generated by Algorithm DQN and ε=1\varepsilon=1 converges linearly to xx^{*} and

For (strongly convex) quadratic functions fif_{i}, i=1,,ni=1,\ldots,n we can also claim global linear convergence as follows.

Assume that all loss functions fif_{i} are strongly convex quadratic and that the conditions of Theorem 3.4 are satisfied. Let {xk}\{x^{k}\} be a sequence generated by Algorithm DQN with ε=1.\varepsilon=1. Then limkxk=x\lim_{k\to\infty}x^{k}=x^{*} and

Proof. Given that the penalty term in (4) is convex quadratic, if all local cost functions fif_{i} are strongly convex quadratic, then the objective function Φ\Phi is also strongly convex quadratic, i.e., it can be written as

Variants of the general DQN

but (37) involves the inverse Hessian. Thus we approximate (2Φ(xk))1(\nabla^{2}\Phi(x^{k}))^{-1} by the Taylor expansion as follows. Clearly,

Assume that α<(1+λn)/L,\alpha<(1+\lambda_{n})/L, with λn\lambda_{n} being the smallest eigenvalue of W.W. Then

Each node ii transmits uiku_{i}^{k} to all its neighbors jOij\in O_{i} and receives ujku_{j}^{k} from all jOij\in O_{i}.

Each node ii calculates Λik{\Lambda}_{i}^{k} – the solution to the following system of equations (where the only unknown is the p×pp\times p diagonal matrix Λik\Lambda_{i}^{k}):

Each node ii projects each diagonal entry of Λik{\Lambda}_{i}^{k} onto the interval [ρ,ρ][-\rho,\,\rho].

Note that step 6 with algorithm DQN-22 requires an additional pp-dimensional communication per each node, per each kk (the communication of the uiku_{i}^{k}’s.) Hence, overall, with algorithm DQN-22 each node transmits three pp-dimensional vectors per kkxikx_{i}^{k}, dikd_{i}^{k}, and uiku_{i}^{k}.

We next show that algorithm DQN-22 exhibits local linear convergence even when safeguarding (Step 6.4 in Algorithm 3) is not used.

Suppose that A1-A3 hold and let xkx^{k} be an arbitrary point such that Φ(xk)0.\nabla\Phi(x^{k})\neq 0. Assume that

and sks^{k} is generated by (11) and Algorithm 2, Steps 6.1 -6.3. Then there exists t(0,1)t\in(0,1) such that

Since 2fi(xi)L,\|\nabla^{2}f_{i}(x_{i})\|\leq L, there follows 2F(xk)L\|\nabla^{2}F(x^{k})\|\leq L and the previous equality implies

Furthermore, the assumption α<wmin/(2L)\alpha<w_{min}/(2L) implies

Moreover, 2fj(xjk)μI\nabla^{2}f_{j}(x_{j}^{k})\succeq\mu I and

where h(α)=12αμ+α2L2h(\alpha)=1-2\alpha\mu+\alpha^{2}L^{2}. This function is convex and nonnegative since μL\mu\leq L and therefore

Moreover, h(0)=h(2μL2)=1h(0)=h\left(\frac{2\mu}{L^{2}}\right)=1 and we conclude that for all α(0,2μ/L2)\alpha\in(0,2\mu/L^{2}) there holds h(α)(0,1)h(\alpha)\in(0,1). As

Putting together (42)-(46), for θ0\theta\geq 0 and α\alpha satisfying (41) we obtain

Applying Theorem 3.5 once again, we get the local linear convergence as stated in the following corollary.

Assume that the conditions of Theorem 4.1 hold. Then there exists η\eta such that for every x0x^{0} satisfying x0xη\|x^{0}-x^{*}\|\leq\eta the sequence {xk},\{x^{k}\}, generated by DQN-2 method with Steps 6.1-6.3 of Algorithm 3 and ε=1,\varepsilon=1, converges linearly to xx^{*} and

We remark that, for strongly convex quadratic fif_{i}’s, the result analogous to Theorem 3.6 holds in the sense of global linear convergence, i.e., inequality (48) holds for all kk and arbitrary initial point x0x^{0}.

Remark. ***An interesting future research direction is to adapt and analyze convergence of the DQN methods in asynchronous environments, as it has been already numerically studied recently in . Therein, it is shown that the studied second order methods still converge in an asynchronous setting, though with a lower convergence speed.

2 Discussion on the tuning parameters

Matrix WW only needs to satisfy that: 1) the underlying support network is connected and 2) all diagonal entries wiiw_{ii} lie between wminw_{min} and wmaxw_{max}, where 0<wminwmax<10<w_{min}\leq w_{max}<1. Regarding the latter condition, it is standard and rather mild; it is only required for (7) to hold, i.e., to ensure that solving (4) gives an approximate solution to the desired problem (1). Regarding the second condition, it can be easily fulfilled through simple weight assignments, e.g., through the Metropolis weights choice; see, e.g., .

Discussion on distributed implementation. The algorithm’s tuning parameters need to be set beforehand in a distributed way. Regarding weight matrix WW, each node ii needs to store beforehand the weights wiiw_{ii} and wijw_{ij}, jOij\in O_{i}, for all its neighbors. The weights can be set according to the Metropolis rule, e.g., , where each node ii needs to know only the degrees of its immediate neighbors. Such weight choice, as noted before, satisfies the imposed assumptions.

In order to set the scalar tuning parameters α,\alpha, θ\theta, ε\varepsilon, and ρ\rho, each node ii needs to know beforehand global quantities wminw_{min}, wmaxw_{max}, μ\mu and LL. Each of these parameters represent either a maximum or a minimum of nodes’ local quantities. For example, wmaxw_{max} is the maximum of the wiiw_{ii}’s over i=1,...,ni=1,...,n, where node ii holds quantity wiiw_{ii}. Hence, each node can obtain wmaxw_{max} by running a distributed algorithm for maximum computation beforehand; for example, nodes can utilize the algorithm in .

Simulations

This section shows numerical performance of the proposed methods on two examples, namely the strongly convex quadratic cost functions and the logistic loss functions.

and τ\tau is a positive regularization parameter. Note that, in this example, we have

With both the proposed methods and the methods in , the step size ε=1\varepsilon=1 is used. Step size ε=1\varepsilon=1 has also been used in . Note that both classes of methods – NN and DQN – guarantee global convergence with ε=1\varepsilon=1 for quadratic costs, while neither of the two groups of methods have guaranteed global convergence with logistic losses. For the proposed methods, safeguarding is not used with quadratic costs. With logistic costs, the safeguarding is not used with DQN- and 22 but it is used with DQN-11, which diverges without the safeguard on the logistic costs. The safeguard parameter ρ\rho defined as the upper bound in (18) with δ=0\delta=0 is employed. Further, with all DQNs, θ=0\theta=0 is used. With all the algorithms, each node’s solution estimate is initialized by a zero vector.

is used and refered to as the relative error at iteration kk.

for small αμ\alpha\,\mu. On the other hand, we have that the DQN-0’s remainder is:

Extensions

As noted before, DQN methods do not converge to the exact solution of (1) but to a solution neighborhood controlled by the step size α\alpha. As such, for high solution accuracies required, they may not be competitive with distributed second order methods which converge to the exact solution .

Initialization: Each node ii sets k=0k=0 and \mathaccent866xi0=\mathaccent866qi0=0\mathaccent 866{x}_{i}^{0}=\mathaccent 866{q}_{i}^{0}=0.

Each node ii transmits \mathaccent866dik\mathaccent 866{d}_{i}^{\,k} to all its neighbors jOij\in O_{i} and receives \mathaccent866djk\mathaccent 866{d}_{j}^{\,k} from all jOij\in O_{i}.

Each node ii chooses a diagonal p×pp\times p matrix \mathaccent866Λik\mathaccent 866{\Lambda}_{i}^{\,k}, such that \mathaccent866Λikρ.\|\mathaccent 866{\Lambda}_{i}^{\,k}\|\leq\rho.

Each node ii updates its solution estimate as:

Each node ii transmits \mathaccent866xik+1\mathaccent 866{x}_{i}^{\,k+1} to all its neighbors jOij\in O_{i} and receives \mathaccent866xjk+1\mathaccent 866{x}_{j}^{\,k+1} from all jOij\in O_{i}.

Each node ii updates the dual variable \mathaccent866qik\mathaccent 866{q}_{i}^{k} as follows:

Each node ii transmits \mathaccent866uik\mathaccent 866{u}_{i}^{\,k} to all its neighbors jOij\in O_{i} and receives \mathaccent866ujk\mathaccent 866{u}_{j}^{\,k} from all jOij\in O_{i}.

Each node ii calculates \mathaccent866Λik\mathaccent 866{\Lambda}_{i}^{\,k} – the solution to the following system of equations (where the only unknown is the p×pp\times p diagonal matrix \mathaccent866Λi\mathaccent 866{\Lambda}_{i}):

Each node ii projects each diagonal entry of \mathaccent866Λik\mathaccent 866{\Lambda}_{i}^{k} onto the interval [ρ,ρ][-\rho,\,\rho].

Conclusions

The cost in terms of computational effort and communication of these three methods correspond to the costs of the state-of-the-art Network Newton methods, NN-0, NN-1 and NN-2, which are used as the benchmark class in this paper. The simulation results on two relevant problems, the quadratic loss and the logistic loss, demonstrate the efficiency of the proposed methods and compare favorably with the benchmark methods. Finally, applying the recent contributions of , the proposed distributed second order methods were extended to the framework of proximal multiplier methods. Unlike DQN, the modified methods converge to the exact solution and further enhance the performance when high solution accuracies are required.

Acknowledgement. We are grateful to the anonymous referees whose comments and suggestions helped us to improve the quality of this paper.

References

Appendix

Following , we briefly explain how we derive the PMM-DQN methods. The starting point for PMM-DQN is the quadratically approximated PMM method, which takes the following form (see for details and derivations):