Newton-like method with diagonal correction for distributed optimization
Dragana Bajovic, Dusan Jakovetic, Natasa Krejic, Natasa Krklec Jerinkic
Introduction
Problems of this form arise in many emerging applications like big data analytics, e.g., , distributed inference in sensor networks , and distributed control, .
Various methods for solving (1) in a distributed manner are available in the literature. A class of methods based on gradient descent at each node and exchange of information between neighboring nodes is particularly popular, see . Assuming that the local costs ’s are strongly convex and have Lipschitz continuous gradients and that a constant step size is used, these methods converge linearly to a solution neighborhood. With such methods, step size controls the tradeoff between the convergence speed towards a solution neighborhood and the distance of the limit point from the actual solution, larger means faster convergence but larger distance from the solution in the limit; see, e.g., , . Distributed first order (gradient) methods allow for a penalty interpretation, where the distributed method is interpreted as a (centralized) gradient method applied on a carefully constructed penalty reformulation of the original problem (1); see , for details.
Given the existence of well developed theory and efficient implementations of higher order methods in centralized optimization in general, there is a clear need to investigate the possibilities of employing higher order methods in distributed optimization as well. More specifically, for additive cost structures (1) we study here, a further motivation for developing distributed higher order methods comes from their previous success when applied to similar problems in the context of centralized optimization. For example, additive cost (1) is typical in machine learning applications where second order methods play an important role, see, e.g., . Another similar class of problems arise in stochastic optimization, where the objective function is given in the form of mathematical expectation. Again, second order methods are successfully applied in centralized optimization, .
In this paper, we extend and propose a different family of distributed Newton-like methods for solving (1). We refer to the proposed family as Distributed Quasi Newton methods (DQN). The methods are designed to exploit the specific structure of the penalty reformulation , as is done in , but with a different Hessian inverse approximation, for which the idea originates in . Specifically, the Hessian matrix is approximated by its block diagonal part, while the remaining part of the Hessian is used to correct the right hand side of the quasi Newton equation. The methods exhibit linear convergence to a solution neighborhood under a set of standard assumptions for the functions and the network architecture – each is strongly convex and has Lipschitz continuous gradient, and the underlying network is connected. Simulation examples on (strongly convex) quadratic and logistic losses demonstrate that DQN compares favorably with NN proposed in .
This paper is organized as follows. In Section 2 we give the problem statement and some preliminaries needed for the definition of the method and convergence analysis. Section 3 contains the description of the proposed class of Newton-like methods and convergence results. Specific choices of the diagonal matrix that specifies the method completely are presented in Section 4. Some simulation results are presented in Section 5 while Section 6 discusses extensions of embedding DQN in the PMM framework. Finally, conclusions are drawn in Section 7, while Appendix provides some auxiliary derivations.
Preliminaries
Let us first give some preliminaries about the problem (1), its penalty interpretation in , as well as the decentralized gradient descent algorithm in that will be used later on.
The following assumption on the ’s is imposed.
Here, denotes the identity matrix, and notation means that the matrix is positive semi-definite.
This assumption implies that the functions are strongly convex with modulus
and the gradients are Lipschitz continuous with the constant i.e.
Assume that the network of nodes is an undirected network where is the set of nodes and is the set of all edges, i.e., all pairs of nodes which can exchange information through a communication link.
Assumption A2. The network is connected, undirected and simple (no self-loops nor multiple links).
Let us denote by the set of nodes that are connected with the node (open neighborhood of node ) and let (closed neighborhood of node ). We associate with a symmetric, (doubly) stochastic matrix The elements of are all nonnegative and rows (and columns) sum up to one. More precisely, we assume the following.
and there are constants and such that for
Denote by the eigenvalues of Then it can be easily seen that Furthermore, the null space of is spanned by
The corresponding penalty reformulation of (1) is given by
Applying the standard gradient method to (4) with the unit step size we get
which, denoting the -th block of by and after rearranging terms, yields the Decentralized Gradient Descent (DGD) method
The convergence of (6) towards is linear, i.e., the following estimate holds ,
where is a constant depending on and
Distributed Quasi Newton method
The problem under consideration has a specific structure as the Hessian is sparse if the underlying network is sparse. However its inverse is dense. Furthermore, the matrix inversion (i.e. linear system solving) is not suitable for decentralized computation. One possibility of exploiting the structure of in a distributed environment is presented in where the Newton step is approximated through a number of inner iterations. We present here a different possibility. Namely, we keep the diagonal part of as the Hessian approximation but at the same time use the non-diagonal part of to correct the right hand side vector in the (Quasi)-Newton equation. Let us define the splitting
where is the block diagonal matrix with the th diagonal block .
Then, following a typical Quasi-Newton scheme, the next iteration is defined by
In summary, the proposed distributed algorithm for solving (4) is given below.
For the sake of clarity, the proposed algorithm, from the perspective of each node in the network, is presented in Algorithm 2.
Algorithm 2: DQN – distributed implementation At each node , require
Each node transmits to all its neighbors and receives from all .
Each node transmits to all its neighbors and receives from all .
Each node chooses a diagonal matrix , such that
Each node updates its solution estimate as:
Calculation of in step 6 will be specified in the next section, and, for certain algorithm variants, will involve an additional inter-neighbor communication of a -dimensional vector. Likewise, choices of tuning parameters are discussed throughout the remaining of this section and Section 4.
2 Global linear convergence rate
In this subsection, the global linear convergence rate of Algorithm DQN is established. The convergence analysis consists of two parts. First, we demonstrate that is a descent direction. Then we determine a suitable interval for the step size that ensures linear convergence of the iterative sequence.
The following Gershgorin type theorem for block matrices is needed for the first part of convergence analysis.
Using the above theorem we can prove the following lower bound for all eigenvalues of a symmetric block matrix.
where is the smallest eigenvalue of
where is the -th eigenvalue of Thus
Hence, for each for which (17) holds for some fixed we have
We are now ready to prove that the search direction (11) is descent.
for some . Then defined by (11) is a descent direction and satisfies
Proof. Let us first show that is descent search direction. As
The next lemma corresponds to the standard property of descent direction methods that establish the relationship between the search vector and the gradient.
This can be shown similarly as the upper bound on below (19). Furthermore,
Assume that the conditions of Theorem 3.2 are satisfied. Define
with given by (22). Then Algorithm DQN generates a sequence such that
and the convergence is at least linear with
Proof. The Mean Value Theorem, Lipschitz property of , Theorem 3.2 and Lemma 3.1 yield
3 Local linear convergence
We have proved global linear convergence for the specific step length given in Theorem 3.3. However, local linear convergence can be obtained for the full step size, using the theory developed for Inexact Newton methods . The step can be considered as an Inexact Newton step and we are able to estimate the residual in Newton equation as follows.
Suppose that A1-A3 hold. Let be such that Assume that is generated in Step 2 of Algorithm 1 with
Proof. First, notice that the interval for is well defined. The definition of the search direction (11) and the splitting of the Hessian (9) yield
Recalling (19) and the fact that the expression above is decreasing with respect to we get
Theorem 3.4 introduces an upper bound on the safeguard parameter different than the one considered in Theorem 3.2. The relation between the two bounds depends on the choice of in Theorem 3.2. Taking a sufficiently small in Theorem 3.2, we obtain that in Theorem 3.2 is larger. However, taking sufficiently close to , in Theorem 3.4 eventually becomes larger.
One way to interpret the relation between Theorems 3.2 and 3.3 on one hand, and Theorem 3.4 on the other hand, as far as is concerned, is as follows. Taking a very small , Theorem 3.3 allows for a quite large but on the other hand it significantly decreases the admissible step size . At the same time, Theorem 3.4 corresponds in a sense to an opposite situation where is allowed to be quite large (in fact, equal to one), while is quite restricted. Therefore, the two results exploit the allowed “degrees of freedom” in a different way.
For the sake of completeness we list here the conditions for local convergence of Inexact Newton methods.
Assume that A1 holds and that satisfies the inequality
for some Furthermore, assume that Then there exists such that for all the sequence converges to The convergence is linear,
where
The two previous theorems imply the following Corollary.
Assume that the conditions of Theorem 3.4 hold. Then there exists such that for every satisfying the sequence generated by Algorithm DQN and converges linearly to and
For (strongly convex) quadratic functions , we can also claim global linear convergence as follows.
Assume that all loss functions are strongly convex quadratic and that the conditions of Theorem 3.4 are satisfied. Let be a sequence generated by Algorithm DQN with Then and
Proof. Given that the penalty term in (4) is convex quadratic, if all local cost functions are strongly convex quadratic, then the objective function is also strongly convex quadratic, i.e., it can be written as
Variants of the general DQN
but (37) involves the inverse Hessian. Thus we approximate by the Taylor expansion as follows. Clearly,
Assume that with being the smallest eigenvalue of Then
Each node transmits to all its neighbors and receives from all .
Each node calculates – the solution to the following system of equations (where the only unknown is the diagonal matrix ):
Each node projects each diagonal entry of onto the interval .
Note that step 6 with algorithm DQN- requires an additional -dimensional communication per each node, per each (the communication of the ’s.) Hence, overall, with algorithm DQN- each node transmits three -dimensional vectors per – , , and .
We next show that algorithm DQN- exhibits local linear convergence even when safeguarding (Step 6.4 in Algorithm 3) is not used.
Suppose that A1-A3 hold and let be an arbitrary point such that Assume that
and is generated by (11) and Algorithm 2, Steps 6.1 -6.3. Then there exists such that
Since there follows and the previous equality implies
Furthermore, the assumption implies
Moreover, and
where . This function is convex and nonnegative since and therefore
Moreover, and we conclude that for all there holds . As
Putting together (42)-(46), for and satisfying (41) we obtain
Applying Theorem 3.5 once again, we get the local linear convergence as stated in the following corollary.
Assume that the conditions of Theorem 4.1 hold. Then there exists such that for every satisfying the sequence generated by DQN-2 method with Steps 6.1-6.3 of Algorithm 3 and converges linearly to and
We remark that, for strongly convex quadratic ’s, the result analogous to Theorem 3.6 holds in the sense of global linear convergence, i.e., inequality (48) holds for all and arbitrary initial point .
Remark. ***An interesting future research direction is to adapt and analyze convergence of the DQN methods in asynchronous environments, as it has been already numerically studied recently in . Therein, it is shown that the studied second order methods still converge in an asynchronous setting, though with a lower convergence speed.
2 Discussion on the tuning parameters
Matrix only needs to satisfy that: 1) the underlying support network is connected and 2) all diagonal entries lie between and , where . Regarding the latter condition, it is standard and rather mild; it is only required for (7) to hold, i.e., to ensure that solving (4) gives an approximate solution to the desired problem (1). Regarding the second condition, it can be easily fulfilled through simple weight assignments, e.g., through the Metropolis weights choice; see, e.g., .
Discussion on distributed implementation. The algorithm’s tuning parameters need to be set beforehand in a distributed way. Regarding weight matrix , each node needs to store beforehand the weights and , , for all its neighbors. The weights can be set according to the Metropolis rule, e.g., , where each node needs to know only the degrees of its immediate neighbors. Such weight choice, as noted before, satisfies the imposed assumptions.
In order to set the scalar tuning parameters , , and , each node needs to know beforehand global quantities , , and . Each of these parameters represent either a maximum or a minimum of nodes’ local quantities. For example, is the maximum of the ’s over , where node holds quantity . Hence, each node can obtain by running a distributed algorithm for maximum computation beforehand; for example, nodes can utilize the algorithm in .
Simulations
This section shows numerical performance of the proposed methods on two examples, namely the strongly convex quadratic cost functions and the logistic loss functions.
and is a positive regularization parameter. Note that, in this example, we have
With both the proposed methods and the methods in , the step size is used. Step size has also been used in . Note that both classes of methods – NN and DQN – guarantee global convergence with for quadratic costs, while neither of the two groups of methods have guaranteed global convergence with logistic losses. For the proposed methods, safeguarding is not used with quadratic costs. With logistic costs, the safeguarding is not used with DQN- and but it is used with DQN-, which diverges without the safeguard on the logistic costs. The safeguard parameter defined as the upper bound in (18) with is employed. Further, with all DQNs, is used. With all the algorithms, each node’s solution estimate is initialized by a zero vector.
is used and refered to as the relative error at iteration .
for small . On the other hand, we have that the DQN-0’s remainder is:
Extensions
As noted before, DQN methods do not converge to the exact solution of (1) but to a solution neighborhood controlled by the step size . As such, for high solution accuracies required, they may not be competitive with distributed second order methods which converge to the exact solution .
Initialization: Each node sets and .
Each node transmits to all its neighbors and receives from all .
Each node chooses a diagonal matrix , such that
Each node updates its solution estimate as:
Each node transmits to all its neighbors and receives from all .
Each node updates the dual variable as follows:
Each node transmits to all its neighbors and receives from all .
Each node calculates – the solution to the following system of equations (where the only unknown is the diagonal matrix ):
Each node projects each diagonal entry of onto the interval .
Conclusions
The cost in terms of computational effort and communication of these three methods correspond to the costs of the state-of-the-art Network Newton methods, NN-0, NN-1 and NN-2, which are used as the benchmark class in this paper. The simulation results on two relevant problems, the quadratic loss and the logistic loss, demonstrate the efficiency of the proposed methods and compare favorably with the benchmark methods. Finally, applying the recent contributions of , the proposed distributed second order methods were extended to the framework of proximal multiplier methods. Unlike DQN, the modified methods converge to the exact solution and further enhance the performance when high solution accuracies are required.
Acknowledgement. We are grateful to the anonymous referees whose comments and suggestions helped us to improve the quality of this paper.
References
Appendix
Following , we briefly explain how we derive the PMM-DQN methods. The starting point for PMM-DQN is the quadratically approximated PMM method, which takes the following form (see for details and derivations):