Newton-like method with diagonal correction for distributed optimization

Dragana Bajovic, Dusan Jakovetic, Natasa Krejic, Natasa Krklec Jerinkic

Introduction

Problems of this form arise in many emerging applications like big data analytics, e.g., , distributed inference in sensor networks , and distributed control, .

Various methods for solving (1) in a distributed manner are available in the literature. A class of methods based on gradient descent at each node and exchange of information between neighboring nodes is particularly popular, see . Assuming that the local costs $f_{i}$ ’s are strongly convex and have Lipschitz continuous gradients and that a constant step size $\alpha$ is used, these methods converge linearly to a solution neighborhood. With such methods, step size $\alpha$ controls the tradeoff between the convergence speed towards a solution neighborhood and the distance of the limit point from the actual solution, larger $\alpha$ means faster convergence but larger distance from the solution in the limit; see, e.g., , . Distributed first order (gradient) methods allow for a penalty interpretation, where the distributed method is interpreted as a (centralized) gradient method applied on a carefully constructed penalty reformulation of the original problem (1); see , for details.

Given the existence of well developed theory and efficient implementations of higher order methods in centralized optimization in general, there is a clear need to investigate the possibilities of employing higher order methods in distributed optimization as well. More specifically, for additive cost structures (1) we study here, a further motivation for developing distributed higher order methods comes from their previous success when applied to similar problems in the context of centralized optimization. For example, additive cost (1) is typical in machine learning applications where second order methods play an important role, see, e.g., . Another similar class of problems arise in stochastic optimization, where the objective function is given in the form of mathematical expectation. Again, second order methods are successfully applied in centralized optimization, .

In this paper, we extend and propose a different family of distributed Newton-like methods for solving (1). We refer to the proposed family as Distributed Quasi Newton methods (DQN). The methods are designed to exploit the specific structure of the penalty reformulation , as is done in , but with a different Hessian inverse approximation, for which the idea originates in . Specifically, the Hessian matrix is approximated by its block diagonal part, while the remaining part of the Hessian is used to correct the right hand side of the quasi Newton equation. The methods exhibit linear convergence to a solution neighborhood under a set of standard assumptions for the functions $f_{i}$ and the network architecture – each $f_{i}$ is strongly convex and has Lipschitz continuous gradient, and the underlying network is connected. Simulation examples on (strongly convex) quadratic and logistic losses demonstrate that DQN compares favorably with NN proposed in .

This paper is organized as follows. In Section 2 we give the problem statement and some preliminaries needed for the definition of the method and convergence analysis. Section 3 contains the description of the proposed class of Newton-like methods and convergence results. Specific choices of the diagonal matrix that specifies the method completely are presented in Section 4. Some simulation results are presented in Section 5 while Section 6 discusses extensions of embedding DQN in the PMM framework. Finally, conclusions are drawn in Section 7, while Appendix provides some auxiliary derivations.

Preliminaries

Let us first give some preliminaries about the problem (1), its penalty interpretation in , as well as the decentralized gradient descent algorithm in that will be used later on.

The following assumption on the $f_{i}$ ’s is imposed.

Here, $I$ denotes the $p\times p$ identity matrix, and notation $M\preceq N$ means that the matrix $N-M$ is positive semi-definite.

This assumption implies that the functions $f_{i},\;i=1,\ldots,n$ are strongly convex with modulus $\mu>0,$

and the gradients are Lipschitz continuous with the constant $L$ i.e.

Assume that the network of nodes is an undirected network ${\mathcal{G}}=({\mathcal{V}},{\mathcal{E}}),$ where ${\mathcal{V}}$ is the set of nodes and ${\mathcal{E}}$ is the set of all edges, i.e., all pairs $\{i,j\}$ of nodes which can exchange information through a communication link.

Assumption A2. The network ${\mathcal{G}}=({\mathcal{V}},{\mathcal{E}})$ is connected, undirected and simple (no self-loops nor multiple links).

Let us denote by $O_{i}$ the set of nodes that are connected with the node $i$ (open neighborhood of node $i$ ) and let $\bar{O}_{i}=O_{i}\bigcup\{i\}$ (closed neighborhood of node $i$ ). We associate with ${\mathcal{G}}$ a symmetric, (doubly) stochastic $n\times n$ matrix $W.$ The elements of $W$ are all nonnegative and rows (and columns) sum up to one. More precisely, we assume the following.

and there are constants $w_{min}$ and $w_{max}$ such that for $i=1,\ldots,n$

Denote by $\lambda_{1}\geq\ldots\geq\lambda_{n}$ the eigenvalues of $W.$ Then it can be easily seen that $\lambda_{1}=1.$ Furthermore, the null space of $I-W$ is spanned by $e:=(1,\ldots,1).$

The corresponding penalty reformulation of (1) is given by

Applying the standard gradient method to (4) with the unit step size we get

which, denoting the $i$ -th $p\times 1$ block of $x^{k}$ by $x_{i}^{k},$ and after rearranging terms, yields the Decentralized Gradient Descent (DGD) method

The convergence of (6) towards $x^{*}$ is linear, i.e., the following estimate holds ,

where $\xi\in(0,1)$ is a constant depending on $\Phi,\alpha$ and $W.$

Distributed Quasi Newton method

The problem under consideration has a specific structure as the Hessian is sparse if the underlying network is sparse. However its inverse is dense. Furthermore, the matrix inversion (i.e. linear system solving) is not suitable for decentralized computation. One possibility of exploiting the structure of $\nabla^{2}\Phi(x)$ in a distributed environment is presented in where the Newton step is approximated through a number of inner iterations. We present here a different possibility. Namely, we keep the diagonal part of $\nabla^{2}\Phi(x)$ as the Hessian approximation but at the same time use the non-diagonal part of $\nabla^{2}\Phi(x)$ to correct the right hand side vector in the (Quasi)-Newton equation. Let us define the splitting

where $\nabla^{2}F(x)$ is the block diagonal matrix with the $i$ th diagonal block $\nabla^{2}f_{i}(x_{i})$ .

Then, following a typical Quasi-Newton scheme, the next iteration is defined by

In summary, the proposed distributed algorithm for solving (4) is given below.

For the sake of clarity, the proposed algorithm, from the perspective of each node $i$ in the network, is presented in Algorithm 2.

Algorithm 2: DQN – distributed implementation At each node $i$ , require $\rho,\varepsilon>0.$

Each node $i$ transmits $x_{i}^{k}$ to all its neighbors $j\in O_{i}$ and receives $x_{j}^{k}$ from all $j\in O_{i}$ .

Each node $i$ transmits $d_{i}^{k}$ to all its neighbors $j\in O_{i}$ and receives $d_{j}^{k}$ from all $j\in O_{i}$ .

Each node $i$ chooses a diagonal $p\times p$ matrix $\Lambda_{i}^{k}$ , such that $\|\Lambda_{i}^{k}\|_{2}\leq\rho.$

Each node $i$ updates its solution estimate as:

Calculation of $\Lambda_{i}^{k}$ in step 6 will be specified in the next section, and, for certain algorithm variants, will involve an additional inter-neighbor communication of a $p$ -dimensional vector. Likewise, choices of tuning parameters $\varepsilon,\rho,\theta$ are discussed throughout the remaining of this section and Section 4.

2 Global linear convergence rate

In this subsection, the global linear convergence rate of Algorithm DQN is established. The convergence analysis consists of two parts. First, we demonstrate that $s^{k}$ is a descent direction. Then we determine a suitable interval for the step size $\varepsilon$ that ensures linear convergence of the iterative sequence.

The following Gershgorin type theorem for block matrices is needed for the first part of convergence analysis.

Using the above theorem we can prove the following lower bound for all eigenvalues of a symmetric block matrix.

where $\lambda_{\min}(C_{ii})$ is the smallest eigenvalue of $C_{ii}.$

where $\lambda_{j}(C_{ii})$ is the $j$ -th eigenvalue of $C_{ii}.$ Thus

Hence, for each $\mu$ for which (17) holds for some fixed $i,$ we have

We are now ready to prove that the search direction (11) is descent.

for some $\delta\in(0,1/(\alpha L+(1+\theta)(1-w_{min})))$ . Then $s^{k}$ defined by (11) is a descent direction and satisfies

Proof. Let us first show that $s^{k}$ is descent search direction. As

The next lemma corresponds to the standard property of descent direction methods that establish the relationship between the search vector and the gradient.

This can be shown similarly as the upper bound on $\|({A}^{k}_{i})^{-1}\|_{2}$ below (19). Furthermore,

Assume that the conditions of Theorem 3.2 are satisfied. Define

with $\beta$ given by (22). Then Algorithm DQN generates a sequence $\{x^{k}\}$ such that

and the convergence is at least linear with

Proof. The Mean Value Theorem, Lipschitz property of $\nabla\Phi$ , Theorem 3.2 and Lemma 3.1 yield

3 Local linear convergence

We have proved global linear convergence for the specific step length $\varepsilon$ given in Theorem 3.3. However, local linear convergence can be obtained for the full step size, using the theory developed for Inexact Newton methods . The step $s^{k}$ can be considered as an Inexact Newton step and we are able to estimate the residual in Newton equation as follows.

Suppose that A1-A3 hold. Let $x^{k}$ be such that $\nabla\Phi(x^{k})\neq 0.$ Assume that $s^{k}$ is generated in Step 2 of Algorithm 1 with

Proof. First, notice that the interval for $\rho$ is well defined. The definition of the search direction (11) and the splitting of the Hessian (9) yield

Recalling (19) and the fact that the expression above is decreasing with respect to $w_{jj},$ we get

Theorem 3.4 introduces an upper bound on the safeguard parameter $\rho$ different than the one considered in Theorem 3.2. The relation between the two bounds depends on the choice of $\delta$ in Theorem 3.2. Taking a sufficiently small $\delta$ in Theorem 3.2, we obtain that $\rho$ in Theorem 3.2 is larger. However, taking $\delta<\frac{1}{\alpha L+(1+\theta)(1-w_{min})}$ sufficiently close to $\frac{1}{\alpha L+(1+\theta)(1-w_{min})}$ , $\rho$ in Theorem 3.4 eventually becomes larger.

One way to interpret the relation between Theorems 3.2 and 3.3 on one hand, and Theorem 3.4 on the other hand, as far as $\rho$ is concerned, is as follows. Taking a very small $\delta$ , Theorem 3.3 allows for a quite large $\rho$ but on the other hand it significantly decreases the admissible step size $\varepsilon$ . At the same time, Theorem 3.4 corresponds in a sense to an opposite situation where $\varepsilon$ is allowed to be quite large (in fact, equal to one), while $\rho$ is quite restricted. Therefore, the two results exploit the allowed “degrees of freedom” in a different way.

For the sake of completeness we list here the conditions for local convergence of Inexact Newton methods.

Assume that A1 holds and that $s^{k}$ satisfies the inequality

for some $t<1.$ Furthermore, assume that $x^{k+1}=x^{k}+s^{k},\;k=0,1,\ldots.$ Then there exists $\eta>0$ such that for all $\|x^{0}-x^{*}\|\leq\eta,$ the sequence $\{x^{k}\}$ converges to $x^{*}.$ The convergence is linear,

where $\|y\|_{*}=\|\nabla^{2}\Phi(x^{*})y\|.$

The two previous theorems imply the following Corollary.

Assume that the conditions of Theorem 3.4 hold. Then there exists $\eta>0$ such that for every $x^{0}$ satisfying $\|x^{0}-x^{*}\|\leq\eta,$ the sequence $\{x^{k}\}$ generated by Algorithm DQN and $\varepsilon=1$ converges linearly to $x^{*}$ and

For (strongly convex) quadratic functions $f_{i}$ , $i=1,\ldots,n$ we can also claim global linear convergence as follows.

Assume that all loss functions $f_{i}$ are strongly convex quadratic and that the conditions of Theorem 3.4 are satisfied. Let $\{x^{k}\}$ be a sequence generated by Algorithm DQN with $\varepsilon=1.$ Then $\lim_{k\to\infty}x^{k}=x^{*}$ and

Proof. Given that the penalty term in (4) is convex quadratic, if all local cost functions $f_{i}$ are strongly convex quadratic, then the objective function $\Phi$ is also strongly convex quadratic, i.e., it can be written as

Variants of the general DQN

but (37) involves the inverse Hessian. Thus we approximate $(\nabla^{2}\Phi(x^{k}))^{-1}$ by the Taylor expansion as follows. Clearly,

Assume that $\alpha<(1+\lambda_{n})/L,$ with $\lambda_{n}$ being the smallest eigenvalue of $W.$ Then

Each node $i$ transmits $u_{i}^{k}$ to all its neighbors $j\in O_{i}$ and receives $u_{j}^{k}$ from all $j\in O_{i}$ .

Each node $i$ calculates ${\Lambda}_{i}^{k}$ – the solution to the following system of equations (where the only unknown is the $p\times p$ diagonal matrix $\Lambda_{i}^{k}$ ):

Each node $i$ projects each diagonal entry of ${\Lambda}_{i}^{k}$ onto the interval $[-\rho,\,\rho]$ .

Note that step 6 with algorithm DQN- $2$ requires an additional $p$ -dimensional communication per each node, per each $k$ (the communication of the $u_{i}^{k}$ ’s.) Hence, overall, with algorithm DQN- $2$ each node transmits three $p$ -dimensional vectors per $k$ – $x_{i}^{k}$ , $d_{i}^{k}$ , and $u_{i}^{k}$ .

We next show that algorithm DQN- $2$ exhibits local linear convergence even when safeguarding (Step 6.4 in Algorithm 3) is not used.

Suppose that A1-A3 hold and let $x^{k}$ be an arbitrary point such that $\nabla\Phi(x^{k})\neq 0.$ Assume that

and $s^{k}$ is generated by (11) and Algorithm 2, Steps 6.1 -6.3. Then there exists $t\in(0,1)$ such that

Since $\|\nabla^{2}f_{i}(x_{i})\|\leq L,$ there follows $\|\nabla^{2}F(x^{k})\|\leq L$ and the previous equality implies

Furthermore, the assumption $\alpha<w_{min}/(2L)$ implies

Moreover, $\nabla^{2}f_{j}(x_{j}^{k})\succeq\mu I$ and

where $h(\alpha)=1-2\alpha\mu+\alpha^{2}L^{2}$ . This function is convex and nonnegative since $\mu\leq L$ and therefore

Moreover, $h(0)=h\left(\frac{2\mu}{L^{2}}\right)=1$ and we conclude that for all $\alpha\in(0,2\mu/L^{2})$ there holds $h(\alpha)\in(0,1)$ . As

Putting together (42)-(46), for $\theta\geq 0$ and $\alpha$ satisfying (41) we obtain

Applying Theorem 3.5 once again, we get the local linear convergence as stated in the following corollary.

Assume that the conditions of Theorem 4.1 hold. Then there exists $\eta$ such that for every $x^{0}$ satisfying $\|x^{0}-x^{*}\|\leq\eta$ the sequence $\{x^{k}\},$ generated by DQN-2 method with Steps 6.1-6.3 of Algorithm 3 and $\varepsilon=1,$ converges linearly to $x^{*}$ and

We remark that, for strongly convex quadratic $f_{i}$ ’s, the result analogous to Theorem 3.6 holds in the sense of global linear convergence, i.e., inequality (48) holds for all $k$ and arbitrary initial point $x^{0}$ .

Remark. ***An interesting future research direction is to adapt and analyze convergence of the DQN methods in asynchronous environments, as it has been already numerically studied recently in . Therein, it is shown that the studied second order methods still converge in an asynchronous setting, though with a lower convergence speed.

2 Discussion on the tuning parameters

Matrix $W$ only needs to satisfy that: 1) the underlying support network is connected and 2) all diagonal entries $w_{ii}$ lie between $w_{min}$ and $w_{max}$ , where $0<w_{min}\leq w_{max}<1$ . Regarding the latter condition, it is standard and rather mild; it is only required for (7) to hold, i.e., to ensure that solving (4) gives an approximate solution to the desired problem (1). Regarding the second condition, it can be easily fulfilled through simple weight assignments, e.g., through the Metropolis weights choice; see, e.g., .

Discussion on distributed implementation. The algorithm’s tuning parameters need to be set beforehand in a distributed way. Regarding weight matrix $W$ , each node $i$ needs to store beforehand the weights $w_{ii}$ and $w_{ij}$ , $j\in O_{i}$ , for all its neighbors. The weights can be set according to the Metropolis rule, e.g., , where each node $i$ needs to know only the degrees of its immediate neighbors. Such weight choice, as noted before, satisfies the imposed assumptions.

In order to set the scalar tuning parameters $\alpha,$ $\theta$ , $\varepsilon$ , and $\rho$ , each node $i$ needs to know beforehand global quantities $w_{min}$ , $w_{max}$ , $\mu$ and $L$ . Each of these parameters represent either a maximum or a minimum of nodes’ local quantities. For example, $w_{max}$ is the maximum of the $w_{ii}$ ’s over $i=1,...,n$ , where node $i$ holds quantity $w_{ii}$ . Hence, each node can obtain $w_{max}$ by running a distributed algorithm for maximum computation beforehand; for example, nodes can utilize the algorithm in .

Simulations

This section shows numerical performance of the proposed methods on two examples, namely the strongly convex quadratic cost functions and the logistic loss functions.

and $\tau$ is a positive regularization parameter. Note that, in this example, we have

With both the proposed methods and the methods in , the step size $\varepsilon=1$ is used. Step size $\varepsilon=1$ has also been used in . Note that both classes of methods – NN and DQN – guarantee global convergence with $\varepsilon=1$ for quadratic costs, while neither of the two groups of methods have guaranteed global convergence with logistic losses. For the proposed methods, safeguarding is not used with quadratic costs. With logistic costs, the safeguarding is not used with DQN- and $2$ but it is used with DQN- $1$ , which diverges without the safeguard on the logistic costs. The safeguard parameter $\rho$ defined as the upper bound in (18) with $\delta=0$ is employed. Further, with all DQNs, $\theta=0$ is used. With all the algorithms, each node’s solution estimate is initialized by a zero vector.

is used and refered to as the relative error at iteration $k$ .

for small $\alpha\,\mu$ . On the other hand, we have that the DQN-0’s remainder is:

Extensions

As noted before, DQN methods do not converge to the exact solution of (1) but to a solution neighborhood controlled by the step size $\alpha$ . As such, for high solution accuracies required, they may not be competitive with distributed second order methods which converge to the exact solution .

Initialization: Each node $i$ sets $k=0$ and $\mathaccent 866{x}_{i}^{0}=\mathaccent 866{q}_{i}^{0}=0$ .

Each node $i$ transmits $\mathaccent 866{d}_{i}^{\,k}$ to all its neighbors $j\in O_{i}$ and receives $\mathaccent 866{d}_{j}^{\,k}$ from all $j\in O_{i}$ .

Each node $i$ chooses a diagonal $p\times p$ matrix $\mathaccent 866{\Lambda}_{i}^{\,k}$ , such that $\|\mathaccent 866{\Lambda}_{i}^{\,k}\|\leq\rho.$

Each node $i$ updates its solution estimate as:

Each node $i$ transmits $\mathaccent 866{x}_{i}^{\,k+1}$ to all its neighbors $j\in O_{i}$ and receives $\mathaccent 866{x}_{j}^{\,k+1}$ from all $j\in O_{i}$ .

Each node $i$ updates the dual variable $\mathaccent 866{q}_{i}^{k}$ as follows:

Each node $i$ transmits $\mathaccent 866{u}_{i}^{\,k}$ to all its neighbors $j\in O_{i}$ and receives $\mathaccent 866{u}_{j}^{\,k}$ from all $j\in O_{i}$ .

Each node $i$ calculates $\mathaccent 866{\Lambda}_{i}^{\,k}$ – the solution to the following system of equations (where the only unknown is the $p\times p$ diagonal matrix $\mathaccent 866{\Lambda}_{i}$ ):

Each node $i$ projects each diagonal entry of $\mathaccent 866{\Lambda}_{i}^{k}$ onto the interval $[-\rho,\,\rho]$ .

Conclusions

The cost in terms of computational effort and communication of these three methods correspond to the costs of the state-of-the-art Network Newton methods, NN-0, NN-1 and NN-2, which are used as the benchmark class in this paper. The simulation results on two relevant problems, the quadratic loss and the logistic loss, demonstrate the efficiency of the proposed methods and compare favorably with the benchmark methods. Finally, applying the recent contributions of , the proposed distributed second order methods were extended to the framework of proximal multiplier methods. Unlike DQN, the modified methods converge to the exact solution and further enhance the performance when high solution accuracies are required.

Acknowledgement. We are grateful to the anonymous referees whose comments and suggestions helped us to improve the quality of this paper.

References

Appendix

Following , we briefly explain how we derive the PMM-DQN methods. The starting point for PMM-DQN is the quadratically approximated PMM method, which takes the following form (see for details and derivations):