Approximate Gaussian Elimination for Laplacians: Fast, Sparse, and Simple

Rasmus Kyng, Sushant Sachdeva

Introduction

A symmetric matrix LL is called Symmetric and Diagonally Dominant (SDD) if for all i,i, L(i,i)jiL(i,j).L(i,i)\geq\sum_{j\neq i}|L(i,j)|. An SDD matrix LL is a Laplacian if L(i,j)0L(i,j)\leq 0 for ij,i\neq j, and for all i,i, jL(i,j)=0.\sum_{j}L(i,j)=0. A Laplacian matrix is naturally associated with a graph on its vertices, where i,ji,j are adjacent if L(i,j)0.L(i,j)\neq 0. The problem of solving systems of linear equations Lx=b,Lx=b, where LL is an SDD matrix (and often a Laplacian), is a fundamental primitive and arises in varied applications in both theory and practice. Example applications include solutions of partial differential equations via the finite element method [Str86, BHV08], semi-supervised learning on graphs [ZGL03, ZS04, ZBL+04], and computing maximum flows in graphs [DS08, CKM+11, Mad13, LS14]. It has also been used as a primitive in the design of several fast algorithms [KM09, OSV12, KMP12, LKP12, KRS15]. It is known that solving SDD linear systems can be reduced to solving Laplacian systems [Gre96].

A natural approach to solving systems of linear equations is Gaussian elimination, or its variant for symmetric matrices, Cholesky factorization. Cholesky factorization of a matrix LL produces a factorization L=LDL,L=\mathcal{L}\mathcal{D}\mathcal{L}^{\top}, where L\mathcal{L} is a lower-triangular matrix, and D\mathcal{D} is a diagonal matrix. Such a factorization allows us to solve a system Lx=bLx=b by computing x=L1b=(L1)D1L1b,x=L^{-1}b=(\mathcal{L}^{-1})^{\top}\mathcal{D}^{-1}\mathcal{L}^{-1}b, where the inverse of L\mathcal{L}, and D\mathcal{D} can be applied quickly since they are lower-triangular and diagonal, respectively.

The fundamental obstacle to using Cholesky factorization for quickly solving systems of linear equations is that L\mathcal{L} can be a dense matrix even if the original matrix LL is sparse. The reason is that the key step in Cholesky factorization, eliminating a variable, say xi,x_{i}, from a system of equations, creates a new coefficient L(j,k)L^{\prime}({j,k}) for every pair j,kj,k such that L(j,i)L({j,i}) and L(i,k)L({i,k}) are non-zero. This phenomenon is called fill-in. For Laplacian systems, eliminating the first variable corresponds to eliminating the first vertex in the graph, and the fill-in corresponds to adding a clique on all the neighbors of the first vertex. Sequentially eliminating variables often produces a sequence of increasingly-dense systems, resulting in an O(n3)O(n^{3}) worst-case time even for sparse L.L. Informally, the algorithm for generating the Cholesky factorization for a Laplacian can be expressed as follows:

Use equation ii to express the variable for vertex ii in terms of the remaining variables.

Eliminate vertex ii, adding a clique on the neighbors of i.i.

Eliminating the vertices in an order given by a permutation π\pi generates a factorization L=PπLDLPπ,L=P_{\pi}\mathcal{L}\mathcal{D}\mathcal{L}^{\top}P^{\top}_{\pi}, where PπP_{\pi} denotes the permutation matrix of π\pi, i.e., (Pπz)i=zπ(i)(P_{\pi}z)_{i}=z_{\pi(i)} for all z.z. Though picking a good order of elimination can significantly reduce the running time of Cholesky factorization, it gives no guarantees for general systems, e.g., for sparse expander graphs, every ordering results in an Ω(n3)\Omega(n^{3}) running time [RJL79].

Our Results.

In this paper, we present the first nearly linear time algorithm that generates a sparse approximate Cholesky decomposition for Laplacian matrices, with provable approximation guarantees. Our algorithm SparseCholesky can be described informally as follows (see Section 3 for a precise description):

Use equation ii to express the variable for vertex ii in terms of the remaining variables.

Eliminate vertex ii, adding random samples from the clique on the neighbors of i.i.

We prove the following theorem about our algorithm, where for symmetric matrices A,B,A,B, we write ABA\preceq B if BAB-A is positive semidefinite (PSD).

where Z=PπLDLPπ,Z=P_{\pi}\mathcal{L}\mathcal{D}\mathcal{L}^{\top}P_{\pi}^{\top}, i.e., ZZ has a sparse Cholesky factorization.

The sparse approximate Cholesky factorization for LL given by Theorem C.1 immediately implies fast solvers for Laplacian systems. We can use the simplest iterative method, called iterative refinement [Hig02, Chapter 12] to solve the system Lx=bLx=b as follows. We let,

For all Laplacian systems Lx=bLx=b with 1b=0,\bm{1}^{\top}b=0, and all ϵ>0,\epsilon>0, using the sparse approximate Cholesky factorization ZZ given by Theorem C.1, the above iterate for t=3log\nicefrac1ϵt=3\log\nicefrac{{1}}{{\epsilon}} satisfies x(t)L+bLϵL+bL.\left\|x^{(t)}-L^{+}b\right\|_{L}\leq\epsilon\left\|L^{+}b\right\|_{L}. We can compute such an x(t)x^{(t)} in time O(mlog3nlog\nicefrac1ϵ).O(m\log^{3}n\log\nicefrac{{1}}{{\epsilon}}).

In our opinion, this is the simplest nearly-linear time solver for Laplacian systems. Our algorithm only uses random sampling, and no graph-theoretic constructions, in contrast with all previous Laplacian solvers. The analysis is also entirely contained in this paper. We also remark that there is a possibility that our analysis is not tight, and that the bounds can be improved by a stronger matrix concentration result.

Technical Contributions.

There are several key ideas that are central to our result: The first is randomizing the order of elimination. At step ii of the algorithm, if we eliminate a fixed vertex and sample the resulting clique, we do not know useful bounds on the sample variances that would allow us to prove concentration. Randomizing over the choice of vertex to eliminate allows us to bound the sample variance by roughly \nicefrac1n\nicefrac{{1}}{{n}} times the Laplacian at the previous step.

The second key idea is our approach to estimating effective resistances: A core element in all nearly linear time Laplacian solvers is a procedure for estimating effective resistances (or leverage scores) for edges in order to compute sampling probabilities. In previous solvers, these estimates are obtained using fairly involved procedures (e.g. low-stretch trees, ultrasparsifiers, or the subsampling procedure due to Cohen et al. [CLM+15]). In contrast, our solver starts with the crudest possible estimates of 11 for every edge, and then uses the triangle inequality for effective resistances (Lemma 5.1) to obtain estimates for the new edges generated. We show that these estimates suffice for constructing a nearly linear time Laplacian solver.

Finally, we develop new concentration bounds for a class of matrix martingales that we call bags-of-dice martingales. The name is motivated by a simple scalar model: at each step, we pick a random bag of dice from a collection of bags (in the algorithm, this corresponds to picking a random vertex to eliminate), and then we independently roll each die in the selected bag (corresponding to drawing independent samples from the clique added). The guarantees obtained from existing matrix concentration results are too weak for our application. The concentration bound gives us a powerful tool for handling conditionally independent sums of variables. We defer a formal description of the martingales and the concentration bound to Section 4.2.

Comparison to other Laplacian solvers.

In contrast, our algorithm requires no graph-theoretic construction, and is based purely on random sampling. Our result only uses two algebraic facts about Laplacian matrices: 1. They are closed under taking Schur complements, and 2. They satisfy the effective resistance triangle inequality (Lemma 5.1).

[KLP+16] presented the first nearly linear time solver for block Diagonally Dominant (bDD) systems – a generalization of SDD systems. If bDD matrices satisfy the effective resistance triangle inequality (we are unaware if they do), then the algorithm in the main body of this paper immediately applies to bDD systems, giving a sparse approximate block Cholesky decomposition and a nearly linear time solver for bDD matrices.

In Section C, we sketch a near-linear time algorithm for computing a sparse approximate block Cholesky factorization for bDD matrices. It combines the approach of SparseCholesky with a recursive approach for estimating effective resistances, as in [KLP+16], using the subsampling procedure [CLM+15]. Though the algorithm is more involved than SparseCholesky, it runs in time O(mlog3n+nlog5n),O(m\log^{3}n+n\log^{5}n), and produces a sparse approximate Cholesky decomposition with only O(mlog2n+nlog4n)O(m\log^{2}n+n\log^{4}n) entries. The algorithm only uses that bDD matrices are closed under taking Schur complements, and that the Schur complements have a clique structure similar to Laplacians (see Section 2).

Comparison to Incomplete Cholesky Factorization.

A popular approach to tackling fill-in is Incomplete Cholesky factorization, where we throw away most of the new entries generated when eliminating variables. The hope is that the resulting factorization is still an approximation to the original matrix L,L, in which case such an approximate factorization can be used to quickly solve systems in L.L. Though variants of this approach are used often in practice, and we have approximation guarantees for some families of Laplacians [Gus78, Gua97, BGH+06], there are no known guarantees for general Laplacians to the best of our knowledge.

Preliminaries

A weighted multi-graph GG is not uniquely defined by its Laplacian, since the Laplacian only depends on the sum of the weights of the multi-edges on each edge. We want to establish a one-to-one correspondence between a weighted multi-graph GG and its Laplacian LL, so from now on, we will consider every Laplacian to be maintained explicitly as a sum of Laplacians of multi-edges, and we will maintain this multi-edge decomposition as part of our algorithms.

If GG is connected, then the kernel of LL is the span of the vector 1.\bm{1}.

Cholesky Factorization in Sum and Product Forms.

We now formally introduce Cholesky factorization. Rather than the usual perspective where we factor out lower triangular matrices at every step of the algorithm, we present an equivalent perspective where we subtract a rank one term from the current matrix, obtaining its Schur complement. The lower triangular structure follows from the fact that the matrix effectively become smaller at every step.

Let LL be any symmetric positive-semidefinite matrix. Let L(\mathchar58,i)L(\mathrel{\mathop{\mathchar 58\relax}},i) denote the ithi^{\text{th}} column of L.L. Using the first equation in the system Lx=bLx=b to eliminate the variable x1x_{1} produces another system S(1)x=b,S^{(1)}x^{\prime}=b^{\prime}, where b1=0,xb^{\prime}_{1}=0,x^{\prime} is xx with x1x_{1} replaced by 0, and

For computing the Cholesky factorization, we perform a sequence of such operations, where in the kthk^{th} step, we select an index vkV{v1,,vk1}v_{k}\in V\setminus\left\{v_{1},\ldots,v_{k-1}\right\} and eliminate the variable vk.v_{k}. We define

If at some step kk, S(k1)(vk,vk)=0S^{(k-1)}(v_{k},v_{k})=0, then we define αk=0\alpha_{k}=0, and ck=0\boldsymbol{\mathit{c}}_{k}=0. Continuing until k=n1k=n-1, S(k)S^{(k)} will have at most one non-zero entry, which will be on the vnv_{n} diagonal. We define αn=S(k)\alpha_{n}=S^{(k)} and cn=evn\boldsymbol{\mathit{c}}_{n}=\boldsymbol{\mathit{e}}_{v_{n}}.

Let C\mathcal{C} be the n×nn\times n matrix with ci\boldsymbol{\mathit{c}}_{i} as its ithi^{\textrm{th}} column, and D\mathcal{D} be the n×nn\times n diagonal matrix D(i,i)=αi\mathcal{D}(i,i)=\alpha_{i}, then L=i=1nαicici=CDC.L=\sum_{i=1}^{n}\alpha_{i}\boldsymbol{\mathit{c}}_{i}\boldsymbol{\mathit{c}}_{i}^{\top}=\mathcal{C}\mathcal{D}\mathcal{C}^{\top}. Define the permutation matrix PP by Pei=evi.P\boldsymbol{\mathit{e}}_{i}=\boldsymbol{\mathit{e}}_{v_{i}}. Letting L=PC,\mathcal{L}=P^{\top}\mathcal{C}, we have L=PLDLP.L=P\mathcal{L}\mathcal{D}\mathcal{L}^{\top}P^{\top}. This decomposition is known as Cholesky factorization. Crucially, L\mathcal{L} is lower triangular, since L(i,j)=(Pcj)(i)=cj(vi),\mathcal{L}(i,j)=(P^{\top}\boldsymbol{\mathit{c}}_{j})(i)=\boldsymbol{\mathit{c}}_{j}(v_{i}), and for i<j,i<j, we have cj(vi)=0\boldsymbol{\mathit{c}}_{j}(v_{i})=0.

Clique Structure of the Schur Complement.

For example, we denote the first column of LL by (da),\begin{pmatrix}d\\ -\boldsymbol{\mathit{a}}\end{pmatrix}, then L1=[daaDiag(a)].L_{1}=\begin{bmatrix}d&-\boldsymbol{\mathit{a}}^{\top}\\ -\boldsymbol{\mathit{a}}&{\bf Diag}\left({\boldsymbol{\mathit{a}}}\right)\end{bmatrix}. We can write the Schur complement S(1)S^{(1)} as S(1)=L(L)v+(L)v1L(v,v)L(\mathchar58,v)L(\mathchar58,v).S^{(1)}=L-\left(L\right)_{v}+\left(L\right)_{v}-\frac{1}{L(v,v)}L(\mathrel{\mathop{\mathchar 58\relax}},v)L(\mathrel{\mathop{\mathchar 58\relax}},v)^{\top}. It is immediate that L(L)vL-\left(L\right)_{v} is a Laplacian matrix, since L(L)v=eE\mathchar58e∌vw(e)bebeL-\left(L\right)_{v}=\sum_{e\in E\mathrel{\mathop{\mathchar 58\relax}}e\not\ni v}w(e)\boldsymbol{\mathit{b}}_{e}\boldsymbol{\mathit{b}}_{e}^{\top}. A more surprising (but well-known) fact is that

is also a Laplacian, and its edges form a clique on the neighbors of vv. It suffices to show it for v=1.v=1. We write iji\sim j to denote (i,j)E.(i,j)\in E. Then

Thus S(1)S^{(1)} is a Laplacian since it is a sum of two Laplacians. By induction, for all k,k, S(k)S^{(k)} is a Laplacian.

The SparseCholesky Algorithm

Algorithm 1 gives the pseudo-code for our algorithm SparseCholesky. Our main result, Theorem 3.1 (a more precise version of Theorem C.1), shows that the algorithm computes an approximate sparse Cholesky decomposition in nearly linear time. We assume the Real RAM model. We prove the theorem in Section 4.

The expected number of non-zero entries in L\mathcal{L} is O(δ2ϵ2mlog3n)O(\frac{\delta^{2}}{\epsilon^{2}}m\log^{3}n). The algorithm runs in expected time O(δ2ϵ2mlog3n)O(\frac{\delta^{2}}{\epsilon^{2}}m\log^{3}n).

Algorithm 2 gives the pseudo-code for our CliqueSample algorithm.

The most significant obstacle to making Cholesky factorization of Laplacians efficient is the fill-in phenomenon, namely that each clique Cv(S)C_{v}(S) has roughly (degS(v))2(\deg_{S}(v))^{2} non-zero entries. To solve this problem, we develop a sampling procedure CliqueSample that produces a sparse Laplacian matrix which approximates the clique Cv(S)C_{v}(S). As input, the procedure requires a Laplacian matrix SS, maintained as a sum of Laplacians of multi-edges, and a vertex vv. It then computes a sampled matrix that approximates Cv(S)C_{v}(S). The elimination step in SparseCholesky removes the degS(v)\deg_{S}(v) edges incident on vv, and \textscCliqueSample(S,v)\textsc{CliqueSample}(S,v) only adds at most degS(v)\deg_{S}(v) multi-edges. This means the total number of multi-edges does not increase with each elimination step, solving the fill-in problem. The sampling procedure is also very fast: It takes O(degS(v))O(\deg_{S}(v)) time, much faster than the order (degS(v))2(\deg_{S}(v))^{2} time required to even write down the clique Cv(S)C_{v}(S).

Although it is notationally convenient for us to pass the whole matrix SS to CliqueSample, the procedure only relies on multi-edges incident on vv, so we will only pass these multi-edges.

Theorem 3.1 only provides guarantees only on the expected running time. In fact, if we make a small change to the algorithm, we can get O(δ2ϵ2mlog3n)O(\frac{\delta^{2}}{\epsilon^{2}}m\log^{3}n) running time w.h.p. At the kthk^{th} elimination, instead of picking the vertex π(k)\pi(k) uniformly at random among the remaining vertices, we pick the vertex uniformly at random among the remaining vertices with at most twice the average multi-edge degree in S(k)S^{(k)}. In Appendix B, we sketch a proof of this.

Analysis of the Algorithm using Matrix Concentration

In this section, we analyze the SparseCholesky algorithm, and prove Theorem 3.1. To prove the theorem, we need several intermediate results which we will now present. In Section 4.1, we show how the output the SparseCholesky and CliqueSample algorithms can be used to define a matrix martingale. In Section 4.2, we introduce a new type of martingale, called a bags-of-dice martingale, and a novel matrix concentration result for these martingales. In Section 4.3, we show how to apply our new matrix concentration results to the SparseCholesky martingale and prove Theorem 3.1. We defer proofs of the lemmas that characterize CliqueSample to Section 5, and proofs of our matrix concentration results to Section 6.

Throughout this section, we will study matrices that arise in the when using SparseCholesky to produce a sparse approximate Cholesky factorization of the Laplacian LL of a multi-graph GG. We will very frequently need to refer to matrices that are normalized by LL. We adopt the following notation: Given a symmetric matrix SS s.t. ker(L)ker(S)\ker(L)\subseteq\ker(S),

We will only use this notation for matrices SS that satisfy the condition ker(L)ker(S)\ker(L)\subseteq\ker(S). Note that L=Π,\overline{L}=\Pi, and ABA\preceq B iff AB.\overline{A}\preceq\overline{B}. Normalization is always done with respect to the Laplacian LL input to SparseCholesky. We say a multi-edge ee is 1/ρ1/\rho-bounded if

Given a Laplacian SS that corresponds to a multi-graph GSG_{S}, and a scalar ρ>0\rho>0, we say that SS is 1/ρ1/\rho-bounded if every multi-edge of SS is 1/ρ1/\rho-bounded. Since every multi-edge of LL is trivially 11-bounded, we can obtain a 1/ρ1/\rho-bounded Laplacian that corresponds to the same matrix, by splitting each multi-edge into ρ\left\lceil\rho\right\rceil identical copies, with a fraction 1/ρ1/\left\lceil\rho\right\rceil of the initial weight. The resulting Laplacian has at most ρm\left\lceil\rho\right\rceil m multi-edges.

Our next lemma describes some basic properties of the samples output by CliqueSample. We prove the lemma in Section 5.

YeY_{e} is or the Laplacian of a multi-edge with endpoints u1,u2u_{1},u_{2}, where u1,u2u_{1},u_{2} are neighbors of vv in SS.

Ye1/ρ\left\|\overline{Y}_{e}\right\|\leq 1/\rho, i.e. YeY_{e} is 1/ρ1/\rho-bounded w.r.t. LL.

The algorithm runs in time O(degS(v))O(\deg_{S}(v)).

The lemma tells us that the samples in expectation behave like the clique Cv(S)C_{v}(S), and that each sample is 1/ρ1/\rho-bounded. This will be crucial to proving concentration properties of our algorithm. We will use the fact that the expectation of the CliqueSample algorithm output equals the matrix produced by standard Cholesky elimination, to show that in expectation, the sparse approximate Cholesky decomposition produced by our SparseCholesky algorithm equals the original Laplacian. We will also see how we can use this expected behaviour to represent our sampling process as a martingale. We define the kthk^{\textrm{th}} approximate Laplacian as

Thus our final output equals L(n)L^{(n)}. Note that Line (9) of the SparseCholesky algorithm does not introduce any sampling error, and so L(n)=L(n1)L^{(n)}=L^{(n-1)}. The only significance of Line (9) is that it puts the matrix in the form we need for our factorization. Now

This is a martingale. To prove multiplicative concentration bounds, we need to normalize the martingale by LL, and so instead we consider

This martingale has considerable structure beyond a standard martingale. Conditional on the choices of the SparseCholesky algorithm until step k1k-1, and conditional on π(k)\pi(k), the terms Xe(k)\overline{X^{(k)}_{e}} are independent.

In Section 4.2 we define a type of martingale that formalizes the important aspects of this structure.

2 Bags-of-Dice Martingales and Matrix Concentration Results

We use the following shorthand notation: Given a sequence of random variables (r1,R(1),r2,R(2),,rk,R(k))(r_{1},R^{(1)},r_{2},R^{(2)},\ldots,r_{k},R^{(k)}), for every i,i, we write

Extending this notation to conditional expectations, we write,

Note that lil_{i} is allowed to be random, as long as it is fixed conditional on (i1)(i-1) and rir_{i}. The martingale given in Equation (6) is a bags-of-dice martingale, with ri=π(i)r_{i}=\pi(i), and Re(i)=Ye(i)R^{(i)}_{e}=Y^{(i)}_{e}. The name is motivated by a simple model: At each step of the martingale we pick a random bag of dice from a collection of bags (this corresponds to the outcome of rir_{i}) and then we independently roll each die in the bag (corresponds to the outcomes Re(i)R^{(i)}_{e}).

Suppose Z=i=1ke=1liZe(i){Z=\sum_{i=1}^{k}\sum_{e=1}^{l_{i}}Z^{(i)}_{e}} is a bags-of-dice martingale of d×dd\times d matrices that satisfies

Every sample Ze(i)Z^{(i)}_{e} satisfies Ze(i)2σ12,\left\|Z^{(i)}_{e}\right\|^{2}\leq\sigma^{2}_{1},

We remark that this theorem, and all the results in this section extend immediately to Hermitian matrices. We prove the above theorem in Section 6. This result is based on the techniques introduced by Tropp [Tro12] for using Lieb’s Concavity Theorem to prove matrix concentration results. Tropp’s result improved on earlier work, such as Ahlswede and Winter [AW02] and Rudelson and Vershynin [RV07]. These earlier matrix concentration results required IID sampling, making them unsuitable for our purposes.

Unfortunately, we cannot apply Theorem 4.3 directly to the bags-of-dice martingale in Equation (6). As we will see later, the variance of eXe(i)\sum_{e}\overline{X^{(i)}_{e}} can have norm proportional to L(i)\left\|\overline{L^{(i)}}\right\|, which can grow large.

However, we expect that the probability of L(i)\left\|\overline{L^{(i)}}\right\| growing large is very small. Our next construction allows us to leverage this idea, and avoid the small probability tail events that prevent us from directly applying Theorem 4.3 to the bags-of-dice martingale in Equation (6).

Given a bags-of-dice martingale of d×dd\times d matrices Z=i=1ke=1liZe(i){Z=\sum_{i=1}^{k}\sum_{e=1}^{l_{i}}Z^{(i)}_{e}}, and a scalar ϵ>0\epsilon>0, we define for each h{1,2,,k+1}h\in\left\{1,2,\ldots,k+1\right\} the event

We also define the ϵ\epsilon-truncated martingale:

The truncated martingale is derived from another martingale by forcing the martingale to get “stuck” if it grows too large. This ensures that so long as the martingale is not stuck, it is not too large. On the other hand, as our next result shows, the truncated martingale fails more often than the original martingale, and so it suffices to prove concentration of the truncated martingale to prove concentration of the original martingale. The theorem stated below is proven in Section 6.

Given a bags-of-dice martingale of d×dd\times d matrices Z=i=1ke=1liZe(i){Z=\sum_{i=1}^{k}\sum_{e=1}^{l_{i}}Z^{(i)}_{e}}, a scalar ϵ>0\epsilon>0, the associated ϵ\epsilon-truncated martingale Z~\widetilde{Z} is also a bags-of-dice martingale, and

3 Analyzing the SparseCholesky Algorithm Using Bags-of-Dice Martingales

By taking Ze(k)=Xe(k)Z^{(k)}_{e}=\overline{X^{(k)}_{e}}, ri=π(i)r_{i}=\pi(i) and Re(i)=Ye(i)R^{(i)}_{e}=Y^{(i)}_{e}, we obtain a bags-of-dice martingale Z=i=1n1eZe(i)Z=\sum_{i=1}^{n-1}\sum_{e}Z^{(i)}_{e}. Let Z~\widetilde{Z} denote the corresponding ϵ\epsilon-truncated bags-of-dice martingale. The next lemma shows that Z~\widetilde{Z} is well-behaved. The lemma is proven in Section 5.

Proof of Theorem 3.1: We have L(n)=Π+Z\overline{L^{(n)}}=\Pi+Z. Since for all k,ek,e, ker(L)ker(Ye(k))\ker(L)\subseteq\ker(Y^{(k)}_{e}), the statement (1ϵ)LL(n)(1+ϵ)L(1-\epsilon)L\preceq L^{(n)}\preceq(1+\epsilon)L is equivalent to ϵΠZϵΠ-\epsilon\Pi\preceq Z\preceq\epsilon\Pi. Further, ΠZΠ=Z\Pi Z\Pi=Z, and so it is equivalent to ϵIZϵI-\epsilon I\preceq Z\preceq\epsilon I. By Theorem 4.5, we have,

Thus, we can pick σ2=32ρ.\sigma_{2}=\sqrt{\frac{3}{2\rho}}. Finally, Lemma 4.1 also gives,

Thus, we can pick Ωk=3ρ(n+1k)I,\Omega_{k}=\frac{3}{\rho(n+1-k)}I, and

Similarly, we obtain concentration for Z~-\widetilde{Z} with the same parameters. Thus, by Theorem 4.3,

Picking ρ=12(1+δ)2ϵ2ln2n,\rho=\left\lceil 12(1+\delta)^{2}\epsilon^{-2}\ln^{2}n\right\rceil, we get σ2ϵ24(1+δ)lnn,\sigma^{2}\leq\frac{\epsilon^{2}}{4(1+\delta)\ln n}, and

Combining this with Equation ( ‣ 4.3) establishes Equation (11).

Finally, we need to bound the expected running time of the algorithm. We start by observing that the algorithm maintains the two following invariants:

Every multi-edge in S^(k1)\widehat{S}^{(k-1)} is 1/ρ1/\rho-bounded.

The total number of multi-edges is at most ρm\rho m.

We establish the first invariant inductively. The invariant holds for S^(0)\widehat{S}^{(0)}, because of the splitting of original edges into ρ\rho copies with weight 1/ρ1/\rho. The invariant thus also holds for S^(0)(S^(0))π(1)\widehat{S}^{(0)}-\left(\widehat{S}^{(0)}\right)_{\pi(1)}, since the multi-edges of this Laplacian are a subset of the previous ones. By Lemma 4.1, every multi-edge YeY_{e} output by CliqueSample is 1/ρ1/\rho-bounded, so S^(1)=S^(0)(S^(0))π(1)+C^1\widehat{S}^{(1)}=\widehat{S}^{(0)}-\left(\widehat{S}^{(0)}\right)_{\pi(1)}+\widehat{C}_{1} is 1/ρ1/\rho-bounded. If we apply this argument repeatedly for k=1,,n1k=1,\ldots,n-1 we get invariant (1).

Invariant (2) is also very simple to establish: It holds for S^(0)\widehat{S}^{(0)}, because splitting of original edges into ρ\rho copies does not produce more than ρm\rho m multi-edges in total. When computing S^(k)\widehat{S}^{(k)}, we subtract (S^(k1))π(k)\left(\widehat{S}^{(k-1)}\right)_{\pi(k)}, which removes exactly degS^(k1)(π(k))\deg_{\widehat{S}^{(k-1)}}(\pi(k)) multi-edges, while we add the multi-edges produced by the call to \textscCliqueSample(S^(k1),π(k))\textsc{CliqueSample}(\widehat{S}^{(k-1)},\pi(k)), which is at most degS^(k1)(π(k))\deg_{\widehat{S}^{(k-1)}}(\pi(k)). So the number of multi-edges is not increasing.

Clique Sampling Proofs

In this section, we prove Lemmas 4.1 and 4.6 that characterize the behaviour of our algorithm CliqueSample, which is used in SparseCholesky to approximate the clique generated by eliminating a variable.

A important element of the CliqueSample algorithm is our very simple approach to leverage score estimation. Using the well-known result that effective resistance in Laplacians is a distance (see Lemma 5.2), we give a bound on the leverage scores of all edges in a clique that arises from elimination. We let

Note that the factor \nicefrac12\nicefrac{{1}}{{2}} accounts for the fact that every pair is double counted.

Suppose multi-edges e,eve,e^{\prime}\ni v are 1/ρ1/\rho-bounded w.r.t. LL, and have endpoints v,uv,u and v,zv,z respectively, and zuz\neq u, then w(e)w(e)bu,zbu,zw(e)w(e^{\prime})\boldsymbol{\mathit{b}}_{u,z}\boldsymbol{\mathit{b}}_{u,z}^{\top} is w(e)+w(e)ρ\frac{w(e)+w(e^{\prime})}{\rho}-bounded.

To prove Lemma 5.1, we need the following result about Laplacians:

Given a connected weighted multi-graph G=(V,E,w)G=(V,E,w) with associated Laplacian matrix LL in GG, consider three distinct vertices u,v,zVu,v,z\in V, and the pair-vectors bu,vb_{u,v}, bv,zb_{v,z} and bu,zb_{u,z}.

This is known as phenomenon that Effective Resistance is a distance [KR93].

Proof of Lemma 5.1: Using the previous lemma:

To prove Lemma 4.1, we need the following result of Walker [Wal77] (see Bringmann and Panagiotou [BP12] for a modern statement of the result).

The time required for each sample is O(1)O(1).

We note that there are simpler sampling constructions than that of Lemma 5.3 that need O(logn)O(\log n) time per sample, and using such a method would only worsen our running time by a factor O(logn)O(\log n).

Proof of Lemma 4.1: From Lines (5) and (6), YiY_{i} is or the Laplacian of a multi-edge with endpoints u1,u2u_{1},u_{2}. To upper bound the running time, it is important to note that we do not need access to the entire matrix SS. We only need the multi-edges incident on vv. When calling CliqueSample, we only pass a copy of just these multi-edges.

We observe that the uniform samples in Line (3) can be done in O(1)O(1) time each, provided we count the number of multi-edges incident on vv to find degS(v)\deg_{S}(v). We can compute degS(v)\deg_{S}(v) in O(degS(v))O(\deg_{S}(v)) time. Using Lemma 5.3, if we do O(degS(v))O(\deg_{S}(v)) time preprocessing, we can compute each sample in Line (2) in time O(1)O(1). Since we do O(degS(v))O(\deg_{S}(v)) samples, the total time for sampling is hence O(degS(v))O(\deg_{S}(v)).

Now we determine the expected value of the sum of the samples. Note that in the sum below, each pair of multi-edges appears twice, with different weights.

Proof of Lemma 4.6: Throughout the proof of this lemma, all the random variables considered are conditional on the choices of the SparseCholesky algorithm up to and including step k1k-1.

Now, Cπ(k)(S^(k1))C_{\pi(k)}(\widehat{S}^{(k-1)}) is PSD since it is a Laplacian, so Cπ(k)(S^(k1))=λmax(Cπ(k)(S^(k1)))\left\|\overline{C_{\pi(k)}(\widehat{S}^{(k-1)})}\right\|=\lambda_{\max}(\overline{C_{\pi(k)}(\widehat{S}^{(k-1)})}). By Equation (2), we get Cπ(k)(S^(k1))(S^(k1))π(k){C_{\pi(k)}(\widehat{S}^{(k-1)})\preceq\left(\widehat{S}^{(k-1)}\right)_{\pi(k)}} and by Equation (1) we get (S^(k1))π(k)S^(k1){\left(\widehat{S}^{(k-1)}\right)_{\pi(k)}\preceq\widehat{S}^{(k-1)}}, finally by Equation (5) we get S^(k1)L(k1){\widehat{S}^{(k-1)}\preceq L^{(k-1)}} so

Again, using Cπ(k)(S^(k1))(S^(k1))π(k)C_{\pi(k)}(\widehat{S}^{(k-1)})\preceq\left(\widehat{S}^{(k-1)}\right)_{\pi(k)}, we get

Matrix Concentration Analysis

To prove Theorem 4.3, we need the following lemma, which is the main technical result of this section.

Given a bags-of-dice martingale of d×dd\times d matrices Z=i=1ke=1liZe(i){Z=\sum_{i=1}^{k}\sum_{e=1}^{l_{i}}Z^{(i)}_{e}} that is for all θ\theta such that 0<θ2min{1σ12,512σ22}0<\theta^{2}\leq\min\{\frac{1}{\sigma_{1}^{2}},\frac{5}{12\sigma_{2}^{2}}\}, we have,

Before proving this lemma, we will see how to use it to prove Theorem 4.3.

Proof of Theorem 4.3: Given Lemma 6.1, we can show Theorem 4.3 using the following bound via trace exponentials, which was first developed by Ahlswede and Winter [AW02].

To show Lemma 6.1 will need the following result by Tropp [Tro12], which is a corollary of Lieb’s Concavity Theorem [Lie73].

Given a random symmetric matrix ZZ, and a fixed symmetric matrix HH,

We will use the following claim to control the above trace by an inductive argument.

For all j=1,,k,j=1,\ldots,k, and all θ\theta such that 0<θ2min{1σ12,512σ22}0<\theta^{2}\leq\min\{\frac{1}{\sigma_{1}^{2}},\frac{5}{12\sigma_{2}^{2}}\}, we have,

Before proving this claim, we see that it immediately implies Lemma 6.1.

Proof of Lemma 6.1: We chain the inequalities given by Claim 6.3 for j=k,k1,,1j=k,k-1,\ldots,1 to obtain,

where the last inequality follows from Trexp(A)dexp(A)\operatorname*{Tr}\exp\left({A}\right)\leq d\exp\left({\left\|A\right\|}\right) for symmetric AA and i=1kΩiσ32.\left\|\sum_{i=1}^{k}\Omega_{i}\right\|\leq\sigma_{3}^{2}. \Box

We will also need the next two lemmas, which essentially appear in the work of Tropp [Tro12]. For completeness, we also prove these lemmas in Appendix A.

We also need the following well-known fact (see for example [Tro12]):

Given symmetric matrices A,BA,B s.t. ABA\preceq B, Trexp(A)Trexp(B)\operatorname*{Tr}{\exp\left({A}\right)}\preceq\operatorname*{Tr}{\exp\left({B}\right)}.

For all θ\theta such that 0<θ21σ12,0<\theta^{2}\leq\frac{1}{\sigma^{2}_{1}}, all j=1,,kj=1,\ldots,k and for all symmetric H~\widetilde{H} that are fixed given (r1,R1),,(rj1,Rj1),(r_{1},R_{1}),\ldots,(r_{j-1},R_{j-1}),

Now, using Fact 6.6 which states that Trexp()\operatorname*{Tr}{\exp\left({\cdot}\right)} is monotone increasing with respect to the PSD order, we obtain the lemma. \Box

Proof of Claim 6.3: Since 0<θ21σ12,0<\theta^{2}\leq\frac{1}{\sigma_{1}^{2}}, using Lemma 6.7 with H~=θ2(i=j+1kΩi)+θi=1j1eZe(i),\widetilde{H}=\theta^{2}\left({\sum_{i=j+1}^{k}\Omega_{i}}\right)+\theta\sum_{i=1}^{j-1}\sum_{e}Z^{(i)}_{e}, we obtain,

Now, using Fact 6.6, namely that Trexp()\operatorname*{Tr}{\exp\left({\cdot}\right)} is monotone increasing with respect to the PSD order, we obtain the lemma. \Box

2 Truncating Bags-of-Dice Martingales

To prove this theorem, we use the next lemma, which we will prove later in this section:

Acknowledgements

We thank Daniel Spielman for suggesting this project and for helpful comments and discussions.

References

Appendix A Conditions for Bounding Matrix Moment Generating Functions

Proof of Lemma 6.4: We define f(z)=ezz1z2f(z)=\frac{e^{z}-z-1}{z^{2}}. Note that f(1)4/5f(1)\leq 4/5. The function ff is positive and increasing in zz for all real zz. This means for every symmetric matrix AA, f(A)f(A)If(A)\preceq f(\left\|A\right\|)I, and so for any symmetric matrix BB, Bf(A)Bf(A)B2Bf(A)B\preceq f(\left\|A\right\|)B^{2}. Thus

Proof of Lemma 6.5: We define g(z)=ez1zg(z)=\frac{e^{z}-1}{z}. The function gg is positive and increasing in zz for all real z0z\geq 0. This means for every symmetric matrix AA, g(A)g(A)Ig(A)\preceq g(\left\|A\right\|)I, and so for any symmetric matrix BB, Bg(A)Bg(A)B2Bg(A)B\preceq g(\left\|A\right\|)B^{2}. Also g(1/3)6/5g(1/3)\leq 6/5. Thus

The lemma now follows using the fact that log\log is operator monotone (increasing), log(1+z)z\log(1+z)\leq z for all real z>0,z>0, and C0.C\succeq 0. \Box

Appendix B Obtaining Concentration of Running Time

As indicated in Remark 3.2, we can obtain a version of Theorem 3.1 that provides running time guarantees with high probability instead of in expectation, by making a slight change to the SparseCholesky algorithm. In this appendix, we briefly sketch how to prove this. We refer to the modified algorithm as LowDegreeSparseCholesky. The algorithm is requires only two small modifications: Firstly, instead of initially choosing a permutation at random, we choose the kthk^{\text{th}} vertex to eliminate by sampling it uniformly at random amongst the remaining vertices that have degree at most twice the average multi-edge degree in S^(k1)\widehat{S}^{(k-1)}. We can do this by keeping track of all vertex degrees, and sampling a remaining vertex at random, and resampling if the degree is too high, until we get a low degree vertex. Secondly, to make up for the slight reduction in the randomization of the choice of vertex, we double the value of ρ\rho used in Line 1.

The number of non-zero entries in L\mathcal{L} is O(δ2ϵ2mlog3n)O(\frac{\delta^{2}}{\epsilon^{2}}m\log^{3}n). With high probability the algorithm runs in time O(δ2ϵ2mlog3n)O(\frac{\delta^{2}}{\epsilon^{2}}m\log^{3}n).

Appendix C Sparse Cholesky Factorization for bDD Matrices

In this appendix, we sketch a version of our approximate Cholesky factorization algorithm, that is also applicable to BDD matrices, which include the class of Connection Laplacian matrices. We call this algorithm BDDSparseCholesky. It follows closely the algorithmic structure used in [KLP+16]. Like [KLP+16], we need the input matrix to be non-singular, which we can achieve using the approach described in Claims 2.4 and 2.5 of [KLP+16].

Our algorithm replace their expander-based Schur complement approximation routine with our simple one-by-one vertex elimination, while still using their recursive subsampling based framework for estimation of leverage scores. The constants in this appendix are not optimized.

We study bDD matrices as defined in [KLP+16], with r×rr\times r blocks, where rr is a constant (see their Section 1.1). The class of bDD matrices is Hermitian, rather than symmetric, but our notion of spectral approximation and our matrix concentration results extend immediately to Hermitian matrices. Throughout this section, we use ()(\cdot)^{{\dagger}} to conjugate transpose. Our algorithm will still compute a Cholesky composition, except we do not factor the individual r×rr\times r block matrices. We will sketch a proof of the following result:

where Z=PπLDLPπ,Z=P_{\pi}\mathcal{L}\mathcal{D}\mathcal{L}^{{\dagger}}P_{\pi}^{{\dagger}}, i.e., ZZ has a sparse Cholesky factorization.

We choose a fixed ϵ=1/2\epsilon=1/2, but the algorithms can be adapted to produce approximate Cholesky decompositions with ϵ1/2\epsilon\leq 1/2 spectral approximation and running time dependence ϵ2\epsilon^{-2}.

For BDD matrices, we do not know a result analogous to the fact that effective resistance is a distance in Laplacians (see Lemma 5.2). Instead, we use a result that is weaker by a factor 22: Given two bDD multi-edge matrices ee, ee^{\prime} w(e)Bu,vBu,vw(e)B_{u,v}B_{u,v}^{{\dagger}} and w(e)Bv,zBv,zw(e^{\prime})B_{v,z}B_{v,z}^{{\dagger}} that are incident on vertex blocks v,uv,u and v,zv,z respectively, if we eliminate vertex block vv, this creates a multi-edge ee^{\prime\prime} with BDD matrix w(e)Bu,zBu,zw(e^{\prime\prime})B_{u,z}B_{u,z}^{{\dagger}} satisfying

We will sketch how to modify the LowDegreeSparseCholesky algorithm to solve BDD matrices. We call this BDDSparseCholesky. This algorithm is similar to SparseCholesky and LowDegreeSparseCholesky, except that the number of multi-edges in the approximate factorization will be slowly increasing with each elimination and to counter this we will need to occasionally sparsify the matrices we produce. First we will assume an oracle procedure Sparsify, which we will later see how to construct using a boot-strapping approach that recursively calls BDDSparseCholesky on smaller matrices.

\textscBDDSparseCholesky(S,ρ)\textsc{BDDSparseCholesky}(S,\rho) should be identical to LowDegreeSparseCholesky, except

The sampling rate ρ\rho is an explicit parameter to BDDSparseCholesky.

BDDSparseCholesky should not split the initial input multi-edge into ρ\rho smaller copies. This will be important because we use BDDSparseCholesky recursively, and we will only split edges at the top level.

We adapt the CliqueSample routine to sample from a bDD elimination clique, and we produce 2degS(v)2\deg_{S}(v) samples, and scale all samples by a factor 1/21/2.

After eliminating 910n\frac{9}{10}n vertices, it calls \textscSparsify(S^(9n/10),ρ)\textsc{Sparsify}(\widehat{S}^{(9n/10)},\rho) to produce a sampled matrix SS^{\prime} of dimension nr/10×nr/10{nr/10\times nr/10}, and then calls \textscBDDSparseCholesky(S,ρ)\textsc{BDDSparseCholesky}(S^{\prime},\rho) to recursively produce an approximate Cholesky decomposition of SS^{\prime}.

To compute an approximate Cholesky decomposition of LL, we set ρ=103log2n\rho=\left\lceil 10^{3}\log^{2}n\right\rceil. Form S(0)S^{(0)} by splitting each edge of LL into ρ\rho copies with 1/ρ1/\rho of their initial weight. Call \textscBDDSparseCholesky(S(0),ρ)\textsc{BDDSparseCholesky}(S^{(0)},\rho).

The clique sampling routine for bDD matrices uses more conservative sampling than CliqueSample , because we use the weaker Equation (12). The sparsification step then becomes necessary because our clique sampling routine now causes the total number of number of multi-edges to increase with each elimination. However, the increase will not exceed a factor (1+8/n)(1+8/n), so after 910n\frac{9}{10}n eliminations, the total number of multi-edges has not grown by more than 21042\cdot 10^{4}.

We can use a truncated martingale to analyze the entire approximate Cholesky decomposition produced by BDDSparseCholesky and its recursive calls using Theorems 4.3 and 4.5 (these theorems extend to Hermitian matrices immediately). The calls to Sparsify will cause our bound on the martingale variance σ3\sigma_{3} to grow larger, but only by a constant factor. By increasing ρ\rho by an appropriate constant, we still obtain concentration.

On the other hand, the calls to Sparsify will ensure that the time spent in the recursive calls to BDDSparseCholesky is only a constant fraction of the time spent in the initial call, assuming the total number of multi-edges in SS exceeds ρ22106nr\rho^{2}\cdot 2\cdot 10^{6}nr (if not, we can always split edges to achieve this). This corresponds to assuming that before the initial edge splitting at the start of the algorithm, we have at least ρ2106nr\rho\cdot 2\cdot 10^{6}nr edges.

This means the total time to compute the approximate Cholesky decomposition of LL using BDDSparseCholesky will only be O(ρ(m+ρn))O(\rho(m+\rho n)), excluding calls to Sparsify. The decomposition will have O(mlog2n+nlog4n)O(m\log^{2}n+n\log^{4}n) non-zeros, and its approximate inverse can be applied in O(mlog2n+nlog4n)O(m\log^{2}n+n\log^{4}n) time.

We now sketch briefly how to implement the Sparsify routine. It closely resembles the Sparsify routine of [KLP+16] (see their Lemma H.1).

Implementing the sparsification routine.

\textscSparsify(S^(9n/10),ρ)\textsc{Sparsify}(\widehat{S}^{(9n/10)},\rho) uses the subsampling-based techniques of [CLM+15]. It is identical to the sparsification routine of [KLP+16], except the recursive call to a linear solver uses BDDSparseCholesky. The routine first samples each multi-edge with probability 12105ρ\frac{1}{2\cdot 10^{5}\rho} to produce a sparse matrix SS^{\prime\prime}, and then uses Johnson-Lindenstrauss-based leverage score estimation (see [CLM+15]) to compute IID samples with the desired 1/ρ1/\rho-bound w.r.t. LL. The IID samples are summed to give the output matrix SS^{\prime}. This requires approximately solving ρ\sqrt{\rho} systems of linear equations in the sparse matrix SS^{\prime\prime}. To do so, Sparsify first splits every edge of SS^{\prime\prime} into ρ\rho copies (increasing the number of multi-edges by a factor ρ\rho), then makes a single recursive call to \textscBDDSparseCholesky(S,ρ)\textsc{BDDSparseCholesky}(S^{\prime\prime},\rho), and then uses the resulting Cholesky decomposition ρ\sqrt{\rho} times to compute approximate solutions to systems of linear equations. One issue requires some care: The subsampling guarantees provided by [CLM+15] are with respect to S^(9n/10)\widehat{S}^{(9n/10)} and not LL, however, by using a truncated martingale in our analysis, we can assume that S^(9n/10)2L\widehat{S}^{(9n/10)}\preceq 2L.

Running time including sparsification.

Finally, if we take account of time spent on calls to Sparsify and its recursive calls to \textscBDDSparseCholesky(S,ρ)\textsc{BDDSparseCholesky}(S^{\prime\prime},\rho), we get a time recursion for BDDSparseCholesky of

(assuming initially for LL that mρ2106nm\geq\rho 2\cdot 10^{6}n), which can be solved to give a running time bound of O(mρ1.5+nρ2.5)=O(mlog3n+nlog5n)O(m\rho^{1.5}+n\rho^{2.5})=O(m\log^{3}n+n\log^{5}n).

Fixing ρ\rho at the top level ensures that a union bound across all recursive calls in BDDSparseCholesky will give that the approximate Cholesky decomposition obtains a 1/21/2 factor spectral approximation with high probability.

By applying the sparsification approach described in this section once, we can compute an approximate Cholesky decomposition of Laplacian matrices in time O(mlog2nloglogn)O(m\log^{2}n\log\log n) time w.h.p.

We run LowDegreeSparseCholesky with a modification: after eliminating all but n/log100nn/log^{100}n vertices, we do sparsification on the remaining graph with ρm\rho m multi-edges and n/log100nn/log^{100}n vertices. The call to LowDegreeSparseCholesky until sparsification will take O(mlog2nloglogn)O(m\log^{2}n\log\log n) time w.h.p. The sparsification is done using a modified version of the Sparsify routine described above. Instead of sampling each multi-edge with probability 12105ρ\frac{1}{2\cdot 10^{5}\rho}, we use a probability of 1log8n\frac{1}{\log^{8}n}. The recursive linear solve in Sparsify can be done using unmodified LowDegreeSparseCholesky, and O(logn)O(\log n) linear system solves for Johnson-Lindenstrauss leverage score estimation can be done using this decomposition, all in O(m+n)O(m+n) time. The output graph SS^{\prime} from Sparsify can be Cholesky decomposed using unmodified LowDegreeSparseCholesky as well, and this will take time O(m+n)O(m+n). In total, we get a running time and number of non-zeros bounded by O(mlog2nloglogn)O(m\log^{2}n\log\log n) time w.h.p.