Approximate Gaussian Elimination for Laplacians: Fast, Sparse, and Simple

Rasmus Kyng, Sushant Sachdeva

Introduction

A symmetric matrix $L$ is called Symmetric and Diagonally Dominant (SDD) if for all $i,$ $L(i,i)\geq\sum_{j\neq i}|L(i,j)|.$ An SDD matrix $L$ is a Laplacian if $L(i,j)\leq 0$ for $i\neq j,$ and for all $i,$ $\sum_{j}L(i,j)=0.$ A Laplacian matrix is naturally associated with a graph on its vertices, where $i,j$ are adjacent if $L(i,j)\neq 0.$ The problem of solving systems of linear equations $Lx=b,$ where $L$ is an SDD matrix (and often a Laplacian), is a fundamental primitive and arises in varied applications in both theory and practice. Example applications include solutions of partial differential equations via the finite element method [Str86, BHV08], semi-supervised learning on graphs [ZGL03, ZS04, ZBL+04], and computing maximum flows in graphs [DS08, CKM+11, Mad13, LS14]. It has also been used as a primitive in the design of several fast algorithms [KM09, OSV12, KMP12, LKP12, KRS15]. It is known that solving SDD linear systems can be reduced to solving Laplacian systems [Gre96].
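The two definitions above can be checked mechanically. The following is a minimal sketch; the function names `is_sdd` and `is_laplacian` are illustrative and not part of the paper.

```python
def is_sdd(M):
    """Symmetric and diagonally dominant: M(i,i) >= sum_{j != i} |M(i,j)|."""
    n = len(M)
    symmetric = all(M[i][j] == M[j][i] for i in range(n) for j in range(n))
    dominant = all(
        M[i][i] >= sum(abs(M[i][j]) for j in range(n) if j != i)
        for i in range(n)
    )
    return symmetric and dominant


def is_laplacian(M):
    """SDD, non-positive off-diagonal entries, and every row sums to zero."""
    n = len(M)
    off_diag_ok = all(M[i][j] <= 0 for i in range(n) for j in range(n) if i != j)
    rows_zero = all(sum(row) == 0 for row in M)
    return is_sdd(M) and off_diag_ok and rows_zero


# Laplacian of the path graph 0 - 1 - 2 with unit edge weights.
path = [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
# SDD but not a Laplacian: a positive off-diagonal entry and non-zero row sums.
sdd_only = [[3, 1, 0], [1, 3, -1], [0, -1, 2]]
```

Here `path` passes both checks, while `sdd_only` is SDD without being a Laplacian.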

A natural approach to solving systems of linear equations is Gaussian elimination, or its variant for symmetric matrices, Cholesky factorization. Cholesky factorization of a matrix $L$ produces a factorization $L=\mathcal{L}\mathcal{D}\mathcal{L}^{\top},$ where $\mathcal{L}$ is a lower-triangular matrix, and $\mathcal{D}$ is a diagonal matrix. Such a factorization allows us to solve a system $Lx=b$ by computing $x=L^{-1}b=(\mathcal{L}^{-1})^{\top}\mathcal{D}^{-1}\mathcal{L}^{-1}b,$ where the inverses of $\mathcal{L}$ and $\mathcal{D}$ can be applied quickly since these matrices are lower-triangular and diagonal, respectively.
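As an illustration of why such a factorization makes solving easy, here is a small sketch in exact rational arithmetic. It assumes a non-singular SDD matrix and a unit lower-triangular factor (a common normalization, chosen here for convenience); the names `ldl` and `solve_ldl` are hypothetical.

```python
from fractions import Fraction


def ldl(A):
    """Return (Lo, D) with A = Lo * diag(D) * Lo^T, Lo unit lower-triangular."""
    n = len(A)
    Lo = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    D = [Fraction(0)] * n
    for j in range(n):
        D[j] = A[j][j] - sum(Lo[j][k] ** 2 * D[k] for k in range(j))
        for i in range(j + 1, n):
            s = sum(Lo[i][k] * Lo[j][k] * D[k] for k in range(j))
            Lo[i][j] = (A[i][j] - s) / D[j]
    return Lo, D


def solve_ldl(Lo, D, b):
    """Solve Lo D Lo^T x = b: forward substitution, scaling, back substitution."""
    n = len(b)
    y = [Fraction(0)] * n
    for i in range(n):                       # Lo y = b
        y[i] = b[i] - sum(Lo[i][k] * y[k] for k in range(i))
    z = [y[i] / D[i] for i in range(n)]      # D z = y
    x = [Fraction(0)] * n
    for i in reversed(range(n)):             # Lo^T x = z
        x[i] = z[i] - sum(Lo[k][i] * x[k] for k in range(i + 1, n))
    return x


# A strictly diagonally dominant (hence non-singular) SDD matrix.
A = [[Fraction(v) for v in row]
     for row in [[4, -1, -1], [-1, 3, -1], [-1, -1, 5]]]
b = [Fraction(1), Fraction(2), Fraction(3)]
Lo, D = ldl(A)
x = solve_ldl(Lo, D, b)
residual = [sum(A[i][j] * x[j] for j in range(3)) - b[i] for i in range(3)]
```

Each substitution pass touches every non-zero of the factor once, so the cost of a solve is proportional to the number of non-zeros in $\mathcal{L}$.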

The fundamental obstacle to using Cholesky factorization for quickly solving systems of linear equations is that $\mathcal{L}$ can be a dense matrix even if the original matrix $L$ is sparse. The reason is that the key step in Cholesky factorization, eliminating a variable, say $x_{i},$ from a system of equations, creates a new coefficient $L^{\prime}({j,k})$ for every pair $j,k$ such that $L({j,i})$ and $L({i,k})$ are non-zero. This phenomenon is called fill-in. For Laplacian systems, eliminating the first variable corresponds to eliminating the first vertex in the graph, and the fill-in corresponds to adding a clique on all the neighbors of the first vertex. Sequentially eliminating variables often produces a sequence of increasingly-dense systems, resulting in an $O(n^{3})$ worst-case time even for sparse $L.$ Informally, the algorithm for generating the Cholesky factorization for a Laplacian can be expressed as follows:

Use equation $i$ to express the variable for vertex $i$ in terms of the remaining variables.

Eliminate vertex $i$ , adding a clique on the neighbors of $i.$
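Fill-in is easy to observe on a star graph, where eliminating the center turns five edges into a clique on five vertices. The sketch below is illustrative only; `eliminate` and `off_diagonal_nonzeros` are hypothetical helper names.

```python
from fractions import Fraction


def eliminate(L, v):
    """One elimination step: L - L(:,v) L(:,v)^T / L(v,v)."""
    n = len(L)
    return [[L[i][j] - L[i][v] * L[v][j] / L[v][v] for j in range(n)]
            for i in range(n)]


def off_diagonal_nonzeros(M):
    return sum(1 for i, row in enumerate(M) for j, x in enumerate(row)
               if i != j and x != 0)


# Star graph: vertex 0 is joined to vertices 1..5 by unit-weight edges.
n = 6
star = [[Fraction(0)] * n for _ in range(n)]
for u in range(1, n):
    star[0][u] = star[u][0] = Fraction(-1)
    star[0][0] += 1
    star[u][u] += 1

after = eliminate(star, 0)
before_count = off_diagonal_nonzeros(star)   # two entries per edge of the star
after_count = off_diagonal_nonzeros(after)   # two entries per edge of the clique
```

The star has 5 edges, while the clique on its 5 leaves has 10, so the off-diagonal count doubles after a single elimination.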

Eliminating the vertices in an order given by a permutation $\pi$ generates a factorization $L=P_{\pi}\mathcal{L}\mathcal{D}\mathcal{L}^{\top}P^{\top}_{\pi},$ where $P_{\pi}$ denotes the permutation matrix of $\pi$ , i.e., $(P_{\pi}z)_{i}=z_{\pi(i)}$ for all $z.$ Though picking a good order of elimination can significantly reduce the running time of Cholesky factorization, it gives no guarantees for general systems, e.g., for sparse expander graphs, every ordering results in an $\Omega(n^{3})$ running time [RJL79].

Our Results.

In this paper, we present the first nearly linear time algorithm that generates a sparse approximate Cholesky decomposition for Laplacian matrices, with provable approximation guarantees. Our algorithm SparseCholesky can be described informally as follows (see Section 3 for a precise description):

Use equation $i$ to express the variable for vertex $i$ in terms of the remaining variables.

Eliminate vertex $i$ , adding random samples from the clique on the neighbors of $i.$

We prove the following theorem about our algorithm, where for symmetric matrices $A,B,$ we write $A\preceq B$ if $B-A$ is positive semidefinite (PSD).

where $Z=P_{\pi}\mathcal{L}\mathcal{D}\mathcal{L}^{\top}P_{\pi}^{\top},$ i.e., $Z$ has a sparse Cholesky factorization.

The sparse approximate Cholesky factorization for $L$ given by Theorem C.1 immediately implies fast solvers for Laplacian systems. We can use the simplest iterative method, called iterative refinement [Hig02, Chapter 12] to solve the system $Lx=b$ as follows. We let,

For all Laplacian systems $Lx=b$ with $\bm{1}^{\top}b=0,$ and all $\epsilon>0,$ using the sparse approximate Cholesky factorization $Z$ given by Theorem C.1, the above iterate for $t=3\log\nicefrac{{1}}{{\epsilon}}$ satisfies $\left\|x^{(t)}-L^{+}b\right\|_{L}\leq\epsilon\left\|L^{+}b\right\|_{L}.$ We can compute such an $x^{(t)}$ in time $O(m\log^{3}n\log\nicefrac{{1}}{{\epsilon}}).$
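To make the iterative refinement concrete, here is a hedged sketch. It assumes the standard preconditioned update $x^{(i+1)}=x^{(i)}-\frac{1}{2}Z^{+}(Lx^{(i)}-b)$ with $x^{(0)}=0$, and it uses a stand-in for $Z$ (the same graph with each edge weight rescaled by a factor in $[1/2,3/2]$, so that $\frac{1}{2}L\preceq Z\preceq\frac{3}{2}L$) rather than the output of SparseCholesky. The helper `pinv_apply` is hypothetical and applies the pseudo-inverse by grounding one vertex.

```python
from fractions import Fraction


def laplacian(n, edges):
    """Laplacian of a weighted graph given as (u, v, weight) triples."""
    L = [[Fraction(0)] * n for _ in range(n)]
    for u, v, w in edges:
        L[u][u] += w
        L[v][v] += w
        L[u][v] -= w
        L[v][u] -= w
    return L


def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def pinv_apply(M, r):
    """Apply M^+ to r (entries of r sum to zero), M a connected-graph Laplacian.

    Grounds the last vertex, solves the reduced system by Gauss-Jordan
    elimination, then shifts the solution to be orthogonal to the ones vector.
    """
    n = len(M) - 1
    A = [list(M[i][:n]) + [r[i]] for i in range(n)]
    for c in range(n):
        p = next(i for i in range(c, n) if A[i][c] != 0)
        A[c], A[p] = A[p], A[c]
        for i in range(n):
            if i != c and A[i][c] != 0:
                f = A[i][c] / A[c][c]
                A[i] = [a - f * b for a, b in zip(A[i], A[c])]
    x = [A[i][n] / A[i][i] for i in range(n)] + [Fraction(0)]
    mean = sum(x) / len(x)
    return [xi - mean for xi in x]


def energy(L, x):
    """Squared L-norm x^T L x."""
    return sum(a * b for a, b in zip(x, matvec(L, x)))


edges = [(0, 1, Fraction(1)), (1, 2, Fraction(2)), (2, 3, Fraction(1)),
         (3, 0, Fraction(3)), (0, 2, Fraction(1))]
# Stand-in for Z (an assumption, not SparseCholesky output): per-edge factors
# in [1/2, 3/2] guarantee (1/2) L <= Z <= (3/2) L in the PSD order.
factors = [Fraction(1, 2), Fraction(3, 2), Fraction(1),
           Fraction(3, 4), Fraction(5, 4)]
L = laplacian(4, edges)
Z = laplacian(4, [(u, v, w * f) for (u, v, w), f in zip(edges, factors)])

b = [Fraction(1), Fraction(-2), Fraction(0), Fraction(1)]   # sums to zero
exact = pinv_apply(L, b)

x = [Fraction(0)] * 4
errors = []   # squared L-norm of the error after each iteration
for _ in range(12):
    residual = [a - c for a, c in zip(matvec(L, x), b)]
    step = pinv_apply(Z, residual)
    x = [xi - s / 2 for xi, s in zip(x, step)]
    errors.append(energy(L, [xi - ei for xi, ei in zip(x, exact)]))
```

Under the stated assumption on $Z$, the error operator $I-\frac{1}{2}Z^{+}L$ has eigenvalues in $[0,2/3]$ on the range of $L$, so the $L$-norm of the error shrinks by a factor of at least $2/3$ per step; this matches the choice $t=3\log\nicefrac{1}{\epsilon}$, since $(2/3)^{3}<1/e$.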

In our opinion, this is the simplest nearly-linear time solver for Laplacian systems. Our algorithm only uses random sampling, and no graph-theoretic constructions, in contrast with all previous Laplacian solvers. The analysis is also entirely contained in this paper. We also remark that there is a possibility that our analysis is not tight, and that the bounds can be improved by a stronger matrix concentration result.

Technical Contributions.

There are several key ideas that are central to our result: The first is randomizing the order of elimination. At step $i$ of the algorithm, if we eliminate a fixed vertex and sample the resulting clique, we do not know useful bounds on the sample variances that would allow us to prove concentration. Randomizing over the choice of vertex to eliminate allows us to bound the sample variance by roughly $\nicefrac{{1}}{{n}}$ times the Laplacian at the previous step.

The second key idea is our approach to estimating effective resistances: A core element in all nearly linear time Laplacian solvers is a procedure for estimating effective resistances (or leverage scores) for edges in order to compute sampling probabilities. In previous solvers, these estimates are obtained using fairly involved procedures (e.g. low-stretch trees, ultrasparsifiers, or the subsampling procedure due to Cohen et al. [CLM+15]). In contrast, our solver starts with the crudest possible estimates of $1$ for every edge, and then uses the triangle inequality for effective resistances (Lemma 5.1) to obtain estimates for the new edges generated. We show that these estimates suffice for constructing a nearly linear time Laplacian solver.

Finally, we develop new concentration bounds for a class of matrix martingales that we call bags-of-dice martingales. The name is motivated by a simple scalar model: at each step, we pick a random bag of dice from a collection of bags (in the algorithm, this corresponds to picking a random vertex to eliminate), and then we independently roll each die in the selected bag (corresponding to drawing independent samples from the clique added). The guarantees obtained from existing matrix concentration results are too weak for our application. The concentration bound gives us a powerful tool for handling conditionally independent sums of variables. We defer a formal description of the martingales and the concentration bound to Section 4.2.

Comparison to other Laplacian solvers.

In contrast, our algorithm requires no graph-theoretic construction, and is based purely on random sampling. Our result only uses two algebraic facts about Laplacian matrices: 1. They are closed under taking Schur complements, and 2. They satisfy the effective resistance triangle inequality (Lemma 5.1).

[KLP+16] presented the first nearly linear time solver for block Diagonally Dominant (bDD) systems – a generalization of SDD systems. If bDD matrices satisfy the effective resistance triangle inequality (we are unaware if they do), then the algorithm in the main body of this paper immediately applies to bDD systems, giving a sparse approximate block Cholesky decomposition and a nearly linear time solver for bDD matrices.

In Appendix C, we sketch a near-linear time algorithm for computing a sparse approximate block Cholesky factorization for bDD matrices. It combines the approach of SparseCholesky with a recursive approach for estimating effective resistances, as in [KLP+16], using the subsampling procedure [CLM+15]. Though the algorithm is more involved than SparseCholesky, it runs in time $O(m\log^{3}n+n\log^{5}n),$ and produces a sparse approximate Cholesky decomposition with only $O(m\log^{2}n+n\log^{4}n)$ entries. The algorithm only uses that bDD matrices are closed under taking Schur complements, and that the Schur complements have a clique structure similar to Laplacians (see Section 2).

Comparison to Incomplete Cholesky Factorization.

A popular approach to tackling fill-in is Incomplete Cholesky factorization, where we throw away most of the new entries generated when eliminating variables. The hope is that the resulting factorization is still an approximation to the original matrix $L,$ in which case such an approximate factorization can be used to quickly solve systems in $L.$ Though variants of this approach are used often in practice, and we have approximation guarantees for some families of Laplacians [Gus78, Gua97, BGH+06], there are no known guarantees for general Laplacians to the best of our knowledge.

Preliminaries

A weighted multi-graph $G$ is not uniquely defined by its Laplacian, since the Laplacian only depends on the sum of the weights of the multi-edges on each edge. We want to establish a one-to-one correspondence between a weighted multi-graph $G$ and its Laplacian $L$ , so from now on, we will consider every Laplacian to be maintained explicitly as a sum of Laplacians of multi-edges, and we will maintain this multi-edge decomposition as part of our algorithms.

If $G$ is connected, then the kernel of $L$ is the span of the vector $\bm{1}.$

Cholesky Factorization in Sum and Product Forms.

We now formally introduce Cholesky factorization. Rather than the usual perspective where we factor out lower triangular matrices at every step of the algorithm, we present an equivalent perspective where we subtract a rank one term from the current matrix, obtaining its Schur complement. The lower triangular structure follows from the fact that the matrix effectively becomes smaller at every step.

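Let $L$ be any symmetric positive-semidefinite matrix. Let $L(\mathrel{\mathop{\mathchar 58\relax}},i)$ denote the $i^{\text{th}}$ column of $L.$ Using the first equation in the system $Lx=b$ to eliminate the variable $x_{1}$ produces another system $S^{(1)}x^{\prime}=b^{\prime},$ where $b^{\prime}_{1}=0,$ $x^{\prime}$ is $x$ with $x_{1}$ replaced by 0, and

$S^{(1)}=L-\frac{1}{L(1,1)}L(\mathrel{\mathop{\mathchar 58\relax}},1)L(\mathrel{\mathop{\mathchar 58\relax}},1)^{\top}$

is the Schur complement of $L$ with respect to variable 1.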

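For computing the Cholesky factorization, we perform a sequence of such operations, where in the $k^{\text{th}}$ step, we select an index $v_{k}\in V\setminus\left\{v_{1},\ldots,v_{k-1}\right\}$ and eliminate the variable $v_{k}.$ Starting from $S^{(0)}=L,$ we define

$\alpha_{k}=S^{(k-1)}(v_{k},v_{k}),\quad\boldsymbol{\mathit{c}}_{k}=\frac{1}{\alpha_{k}}S^{(k-1)}(\mathrel{\mathop{\mathchar 58\relax}},v_{k}),\quad S^{(k)}=S^{(k-1)}-\alpha_{k}\boldsymbol{\mathit{c}}_{k}\boldsymbol{\mathit{c}}_{k}^{\top}.$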

If at some step $k$ , $S^{(k-1)}(v_{k},v_{k})=0$ , then we define $\alpha_{k}=0$ , and $\boldsymbol{\mathit{c}}_{k}=0$ . Continuing until $k=n-1$ , $S^{(k)}$ will have at most one non-zero entry, which will be on the $v_{n}$ diagonal. We define $\alpha_{n}=S^{(n-1)}(v_{n},v_{n})$ and $\boldsymbol{\mathit{c}}_{n}=\boldsymbol{\mathit{e}}_{v_{n}}$ .

Let $\mathcal{C}$ be the $n\times n$ matrix with $\boldsymbol{\mathit{c}}_{i}$ as its $i^{\textrm{th}}$ column, and $\mathcal{D}$ be the $n\times n$ diagonal matrix $\mathcal{D}(i,i)=\alpha_{i}$ , then $L=\sum_{i=1}^{n}\alpha_{i}\boldsymbol{\mathit{c}}_{i}\boldsymbol{\mathit{c}}_{i}^{\top}=\mathcal{C}\mathcal{D}\mathcal{C}^{\top}.$ Define the permutation matrix $P$ by $P\boldsymbol{\mathit{e}}_{i}=\boldsymbol{\mathit{e}}_{v_{i}}.$ Letting $\mathcal{L}=P^{\top}\mathcal{C},$ we have $L=P\mathcal{L}\mathcal{D}\mathcal{L}^{\top}P^{\top}.$ This decomposition is known as Cholesky factorization. Crucially, $\mathcal{L}$ is lower triangular, since $\mathcal{L}(i,j)=(P^{\top}\boldsymbol{\mathit{c}}_{j})(i)=\boldsymbol{\mathit{c}}_{j}(v_{i}),$ and for $i<j,$ we have $\boldsymbol{\mathit{c}}_{j}(v_{i})=0$ .
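The sum form can be traced on a small example. The sketch below assumes the natural reading of the definitions in this section, namely $\alpha_{k}=S^{(k-1)}(v_{k},v_{k})$, $\boldsymbol{\mathit{c}}_{k}=S^{(k-1)}(:,v_{k})/\alpha_{k}$ and $S^{(k)}=S^{(k-1)}-\alpha_{k}\boldsymbol{\mathit{c}}_{k}\boldsymbol{\mathit{c}}_{k}^{\top}$; the function name `cholesky_sum_form` is hypothetical.

```python
from fractions import Fraction


def cholesky_sum_form(L, order):
    """Return (alphas, cols) with L = sum_k alphas[k] * cols[k] cols[k]^T."""
    n = len(L)
    S = [row[:] for row in L]
    alphas, cols = [], []
    for k, v in enumerate(order):
        alpha = S[v][v]
        if k == n - 1:                       # last step: c_n = e_{v_n}
            c = [Fraction(int(i == v)) for i in range(n)]
        elif alpha == 0:                     # convention: alpha_k = 0, c_k = 0
            c = [Fraction(0)] * n
        else:
            c = [S[i][v] / alpha for i in range(n)]
        # Subtract the rank-one term to obtain the next Schur complement.
        S = [[S[i][j] - alpha * c[i] * c[j] for j in range(n)]
             for i in range(n)]
        alphas.append(alpha)
        cols.append(c)
    return alphas, cols


# Weighted path 0 - 1 - 2 - 3 with edge weights 2, 1, 3.
L = [[Fraction(x) for x in row] for row in
     [[2, -2, 0, 0], [-2, 3, -1, 0], [0, -1, 4, -3], [0, 0, -3, 3]]]
order = [2, 0, 3, 1]                         # elimination order v_1, ..., v_n
alphas, cols = cholesky_sum_form(L, order)
n = len(L)
rebuilt = [[sum(a * c[i] * c[j] for a, c in zip(alphas, cols))
            for j in range(n)] for i in range(n)]
# Entry (i, j) of the permuted factor is c_j(v_i).
factor = [[cols[j][order[i]] for j in range(n)] for i in range(n)]
```

In this example `rebuilt` reproduces `L` exactly, and `factor` is lower triangular with unit diagonal, as the argument above predicts.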

Clique Structure of the Schur Complement.

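Given a Laplacian $L$ and a vertex $v,$ let $\left(L\right)_{v}=\sum_{e\in E\mathrel{\mathop{\mathchar 58\relax}}e\ni v}w(e)\boldsymbol{\mathit{b}}_{e}\boldsymbol{\mathit{b}}_{e}^{\top}$ denote the Laplacian of the edges incident on $v.$ For example, if we denote the first column of $L$ by $\begin{pmatrix}d\\ -\boldsymbol{\mathit{a}}\end{pmatrix},$ then $\left(L\right)_{1}=\begin{bmatrix}d&-\boldsymbol{\mathit{a}}^{\top}\\ -\boldsymbol{\mathit{a}}&{\bf Diag}\left({\boldsymbol{\mathit{a}}}\right)\end{bmatrix}.$ We can write the Schur complement $S^{(1)}$ obtained by eliminating $v$ as $S^{(1)}=L-\left(L\right)_{v}+\left(L\right)_{v}-\frac{1}{L(v,v)}L(\mathrel{\mathop{\mathchar 58\relax}},v)L(\mathrel{\mathop{\mathchar 58\relax}},v)^{\top}.$ It is immediate that $L-\left(L\right)_{v}$ is a Laplacian matrix, since $L-\left(L\right)_{v}=\sum_{e\in E\mathrel{\mathop{\mathchar 58\relax}}e\not\ni v}w(e)\boldsymbol{\mathit{b}}_{e}\boldsymbol{\mathit{b}}_{e}^{\top}$ . A more surprising (but well-known) fact is that

$C_{v}(L)=\left(L\right)_{v}-\frac{1}{L(v,v)}L(\mathrel{\mathop{\mathchar 58\relax}},v)L(\mathrel{\mathop{\mathchar 58\relax}},v)^{\top}$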

is also a Laplacian, and its edges form a clique on the neighbors of $v$ . It suffices to show it for $v=1.$ We write $i\sim j$ to denote $(i,j)\in E.$ Then

Thus $S^{(1)}$ is a Laplacian since it is a sum of two Laplacians. By induction, for all $k,$ $S^{(k)}$ is a Laplacian.
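The clique structure can be verified numerically. The sketch below compares the matrix $\left(L\right)_{v}-\frac{1}{L(v,v)}L(:,v)L(:,v)^{\top}$ with the Laplacian of a clique on the neighbors of $v$ in which the edge between neighbors $i$ and $j$ has weight $w(v,i)w(v,j)/w(v)$, where $w(v)$ is the total weight incident on $v$. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def laplacian(n, edges):
    L = [[Fraction(0)] * n for _ in range(n)]
    for u, v, w in edges:
        L[u][u] += w
        L[v][v] += w
        L[u][v] -= w
        L[v][u] -= w
    return L


def clique_via_schur(n, edges, v):
    """(L)_v - L(:,v) L(:,v)^T / L(v,v), computed directly from matrices."""
    L = laplacian(n, edges)
    Lv = laplacian(n, [e for e in edges if v in e[:2]])   # edges incident on v
    return [[Lv[i][j] - L[i][v] * L[v][j] / L[v][v] for j in range(n)]
            for i in range(n)]


def clique_explicit(n, edges, v):
    """Clique on the neighbors of v with weights w(v,i) w(v,j) / w(v)."""
    to_nbr = {}
    for a, b, w in edges:
        if v in (a, b):
            u = b if a == v else a
            to_nbr[u] = to_nbr.get(u, 0) + w
    wv = sum(to_nbr.values())
    return laplacian(n, [(a, b, to_nbr[a] * to_nbr[b] / wv)
                         for a, b in combinations(sorted(to_nbr), 2)])


edges = [(0, 1, Fraction(1)), (0, 2, Fraction(2)), (0, 3, Fraction(3)),
         (1, 2, Fraction(1)), (3, 4, Fraction(2))]
from_schur = clique_via_schur(5, edges, 0)
from_formula = clique_explicit(5, edges, 0)
```

Both constructions agree on this example, and the resulting matrix has zero row sums and non-positive off-diagonal entries, so it is a Laplacian supported on the neighbors of the eliminated vertex.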

The SparseCholesky Algorithm

Algorithm 1 gives the pseudo-code for our algorithm SparseCholesky. Our main result, Theorem 3.1 (a more precise version of Theorem C.1), shows that the algorithm computes an approximate sparse Cholesky decomposition in nearly linear time. We assume the Real RAM model. We prove the theorem in Section 4.

The expected number of non-zero entries in $\mathcal{L}$ is $O(\frac{\delta^{2}}{\epsilon^{2}}m\log^{3}n)$ . The algorithm runs in expected time $O(\frac{\delta^{2}}{\epsilon^{2}}m\log^{3}n)$ .

Algorithm 2 gives the pseudo-code for our CliqueSample algorithm.

The most significant obstacle to making Cholesky factorization of Laplacians efficient is the fill-in phenomenon, namely that each clique $C_{v}(S)$ has roughly $(\deg_{S}(v))^{2}$ non-zero entries. To solve this problem, we develop a sampling procedure CliqueSample that produces a sparse Laplacian matrix which approximates the clique $C_{v}(S)$ . As input, the procedure requires a Laplacian matrix $S$ , maintained as a sum of Laplacians of multi-edges, and a vertex $v$ . It then computes a sampled matrix that approximates $C_{v}(S)$ . The elimination step in SparseCholesky removes the $\deg_{S}(v)$ edges incident on $v$ , and $\textsc{CliqueSample}(S,v)$ only adds at most $\deg_{S}(v)$ multi-edges. This means the total number of multi-edges does not increase with each elimination step, solving the fill-in problem. The sampling procedure is also very fast: It takes $O(\deg_{S}(v))$ time, much faster than the order $(\deg_{S}(v))^{2}$ time required to even write down the clique $C_{v}(S)$ .

Although it is notationally convenient for us to pass the whole matrix $S$ to CliqueSample, the procedure only relies on multi-edges incident on $v$ , so we will only pass these multi-edges.
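The pseudo-code of Algorithm 2 is not reproduced in this text, so the following is only a sketch under assumptions drawn from Section 5: each of the $\deg_{S}(v)$ samples draws one incident multi-edge with probability proportional to its weight and a second one uniformly, and, if their other endpoints differ, emits a multi-edge of weight $\frac{w(e_{1})w(e_{2})}{w(e_{1})+w(e_{2})}$ between them. The names `clique_sample`, `expected_clique` and `exact_clique` are hypothetical, and `random.choices` is used for the weighted draw instead of the $O(1)$ method of Lemma 5.3.

```python
import random
from fractions import Fraction
from itertools import combinations


def clique_sample(incident, rng):
    """incident: list of (neighbor, weight) multi-edges at the eliminated vertex.

    Returns at most len(incident) sampled multi-edges (u1, u2, weight).
    """
    weights = [float(w) for _, w in incident]
    samples = []
    for _ in range(len(incident)):
        u1, w1 = rng.choices(incident, weights=weights)[0]  # prob. w(e)/w(v)
        u2, w2 = rng.choice(incident)                        # uniform
        if u1 != u2:
            samples.append((u1, u2, w1 * w2 / (w1 + w2)))
    return samples


def expected_clique(incident):
    """Exact expected total weight per vertex pair over all d samples."""
    d = len(incident)
    wv = sum(w for _, w in incident)
    out = {}
    for u1, w1 in incident:            # weighted draw
        for u2, w2 in incident:        # uniform draw
            if u1 != u2:
                key = (min(u1, u2), max(u1, u2))
                prob = (w1 / wv) * Fraction(1, d)
                out[key] = out.get(key, 0) + d * prob * w1 * w2 / (w1 + w2)
    return out


def exact_clique(incident):
    """Clique weights w(e) w(e') / w(v), summed over pairs of multi-edges."""
    wv = sum(w for _, w in incident)
    out = {}
    for (u1, w1), (u2, w2) in combinations(incident, 2):
        if u1 != u2:
            key = (min(u1, u2), max(u1, u2))
            out[key] = out.get(key, 0) + w1 * w2 / wv
    return out


# Multi-edges at the eliminated vertex; vertex 2 is reached by two of them.
incident = [(1, Fraction(1)), (2, Fraction(2)),
            (2, Fraction(1, 2)), (3, Fraction(3))]
one_run = clique_sample(incident, random.Random(0))
```

Under these assumptions, each unordered pair of multi-edges is counted twice in the expectation, once per role, and the two contributions add up to exactly $w(e)w(e^{\prime})/w(v)$, which is the clique weight.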

Theorem 3.1 provides guarantees only on the expected running time. In fact, if we make a small change to the algorithm, we can get $O(\frac{\delta^{2}}{\epsilon^{2}}m\log^{3}n)$ running time w.h.p. At the $k^{\text{th}}$ elimination, instead of picking the vertex $\pi(k)$ uniformly at random among the remaining vertices, we pick the vertex uniformly at random among the remaining vertices with at most twice the average multi-edge degree in $S^{(k)}$ . In Appendix B, we sketch a proof of this.

Analysis of the Algorithm using Matrix Concentration

In this section, we analyze the SparseCholesky algorithm, and prove Theorem 3.1. To prove the theorem, we need several intermediate results which we will now present. In Section 4.1, we show how the output of the SparseCholesky and CliqueSample algorithms can be used to define a matrix martingale. In Section 4.2, we introduce a new type of martingale, called a bags-of-dice martingale, and a novel matrix concentration result for these martingales. In Section 4.3, we show how to apply our new matrix concentration results to the SparseCholesky martingale and prove Theorem 3.1. We defer proofs of the lemmas that characterize CliqueSample to Section 5, and proofs of our matrix concentration results to Section 6.

Throughout this section, we will study matrices that arise when using SparseCholesky to produce a sparse approximate Cholesky factorization of the Laplacian $L$ of a multi-graph $G$ . We will very frequently need to refer to matrices that are normalized by $L$ . We adopt the following notation: Given a symmetric matrix $S$ s.t. $\ker(L)\subseteq\ker(S)$ ,

$\overline{S}=(L^{+})^{1/2}S(L^{+})^{1/2}.$

We will only use this notation for matrices $S$ that satisfy the condition $\ker(L)\subseteq\ker(S)$ . Note that $\overline{L}=\Pi,$ and $A\preceq B$ iff $\overline{A}\preceq\overline{B}.$ Normalization is always done with respect to the Laplacian $L$ input to SparseCholesky. We say a multi-edge $e$ is $1/\rho$ -bounded if

$\left\|w(e)\overline{\boldsymbol{\mathit{b}}_{e}\boldsymbol{\mathit{b}}_{e}^{\top}}\right\|\leq 1/\rho.$

Given a Laplacian $S$ that corresponds to a multi-graph $G_{S}$ , and a scalar $\rho>0$ , we say that $S$ is $1/\rho$ -bounded if every multi-edge of $S$ is $1/\rho$ -bounded. Since every multi-edge of $L$ is trivially $1$ -bounded, we can obtain a $1/\rho$ -bounded Laplacian that corresponds to the same matrix, by splitting each multi-edge into $\left\lceil\rho\right\rceil$ identical copies, with a fraction $1/\left\lceil\rho\right\rceil$ of the initial weight. The resulting Laplacian has at most $\left\lceil\rho\right\rceil m$ multi-edges.
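The splitting step is simple enough to write out. In this sketch, multi-edges are `(u, v, weight)` triples and `split_multi_edges` is a hypothetical helper name.

```python
from fractions import Fraction
from math import ceil


def split_multi_edges(edges, rho):
    """Replace each multi-edge by ceil(rho) copies carrying equal weight."""
    copies = ceil(rho)
    return [(u, v, w / copies) for (u, v, w) in edges for _ in range(copies)]


edges = [(0, 1, Fraction(3)), (1, 2, Fraction(1))]
split = split_multi_edges(edges, 2.5)        # ceil(2.5) = 3 copies per edge
total_01 = sum(w for u, v, w in split if (u, v) == (0, 1))
```

The Laplacian is unchanged because the copies of each edge sum to the original weight, while every individual multi-edge now carries a $1/\left\lceil\rho\right\rceil$ fraction of it.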

Our next lemma describes some basic properties of the samples output by CliqueSample. We prove the lemma in Section 5.

$Y_{e}$ is $0$ or the Laplacian of a multi-edge with endpoints $u_{1},u_{2}$ , where $u_{1},u_{2}$ are neighbors of $v$ in $S$ .

$\left\|\overline{Y}_{e}\right\|\leq 1/\rho$ , i.e. $Y_{e}$ is $1/\rho$ -bounded w.r.t. $L$ .

The algorithm runs in time $O(\deg_{S}(v))$ .

The lemma tells us that the samples in expectation behave like the clique $C_{v}(S)$ , and that each sample is $1/\rho$ -bounded. This will be crucial to proving concentration properties of our algorithm. We will use the fact that the expectation of the CliqueSample algorithm output equals the matrix produced by standard Cholesky elimination, to show that in expectation, the sparse approximate Cholesky decomposition produced by our SparseCholesky algorithm equals the original Laplacian. We will also see how we can use this expected behaviour to represent our sampling process as a martingale. We define the $k^{\textrm{th}}$ approximate Laplacian as

Thus our final output equals $L^{(n)}$ . Note that Line (9) of the SparseCholesky algorithm does not introduce any sampling error, and so $L^{(n)}=L^{(n-1)}$ . The only significance of Line (9) is that it puts the matrix in the form we need for our factorization. Now

This is a martingale. To prove multiplicative concentration bounds, we need to normalize the martingale by $L$ , and so instead we consider

This martingale has considerable structure beyond a standard martingale. Conditional on the choices of the SparseCholesky algorithm until step $k-1$ , and conditional on $\pi(k)$ , the terms $\overline{X^{(k)}_{e}}$ are independent.

In Section 4.2 we define a type of martingale that formalizes the important aspects of this structure.

4.2 Bags-of-Dice Martingales and Matrix Concentration Results

We use the following shorthand notation: Given a sequence of random variables $(r_{1},R^{(1)},r_{2},R^{(2)},\ldots,r_{k},R^{(k)})$ , for every $i,$ we write

Extending this notation to conditional expectations, we write,

Note that $l_{i}$ is allowed to be random, as long as it is fixed conditional on $(i-1)$ and $r_{i}$ . The martingale given in Equation (6) is a bags-of-dice martingale, with $r_{i}=\pi(i)$ , and $R^{(i)}_{e}=Y^{(i)}_{e}$ . The name is motivated by a simple model: At each step of the martingale we pick a random bag of dice from a collection of bags (this corresponds to the outcome of $r_{i}$ ) and then we independently roll each die in the bag (corresponds to the outcomes $R^{(i)}_{e}$ ).

Suppose ${Z=\sum_{i=1}^{k}\sum_{e=1}^{l_{i}}Z^{(i)}_{e}}$ is a bags-of-dice martingale of $d\times d$ matrices that satisfies

Every sample $Z^{(i)}_{e}$ satisfies $\left\|Z^{(i)}_{e}\right\|^{2}\leq\sigma^{2}_{1},$

We remark that this theorem, and all the results in this section extend immediately to Hermitian matrices. We prove the above theorem in Section 6. This result is based on the techniques introduced by Tropp [Tro12] for using Lieb’s Concavity Theorem to prove matrix concentration results. Tropp’s result improved on earlier work, such as Ahlswede and Winter [AW02] and Rudelson and Vershynin [RV07]. These earlier matrix concentration results required IID sampling, making them unsuitable for our purposes.

Unfortunately, we cannot apply Theorem 4.3 directly to the bags-of-dice martingale in Equation (6). As we will see later, the variance of $\sum_{e}\overline{X^{(i)}_{e}}$ can have norm proportional to $\left\|\overline{L^{(i)}}\right\|$ , which can grow large.

However, we expect that the probability of $\left\|\overline{L^{(i)}}\right\|$ growing large is very small. Our next construction allows us to leverage this idea, and avoid the small probability tail events that prevent us from directly applying Theorem 4.3 to the bags-of-dice martingale in Equation (6).

Given a bags-of-dice martingale of $d\times d$ matrices ${Z=\sum_{i=1}^{k}\sum_{e=1}^{l_{i}}Z^{(i)}_{e}}$ , and a scalar $\epsilon>0$ , we define for each $h\in\left\{1,2,\ldots,k+1\right\}$ the event

We also define the $\epsilon$ -truncated martingale:

The truncated martingale is derived from another martingale by forcing the martingale to get “stuck” if it grows too large. This ensures that so long as the martingale is not stuck, it is not too large. On the other hand, as our next result shows, the truncated martingale fails more often than the original martingale, and so it suffices to prove concentration of the truncated martingale to prove concentration of the original martingale. The theorem stated below is proven in Section 6.

Given a bags-of-dice martingale of $d\times d$ matrices ${Z=\sum_{i=1}^{k}\sum_{e=1}^{l_{i}}Z^{(i)}_{e}}$ , a scalar $\epsilon>0$ , the associated $\epsilon$ -truncated martingale $\widetilde{Z}$ is also a bags-of-dice martingale, and

4.3 Analyzing the SparseCholesky Algorithm Using Bags-of-Dice Martingales

By taking $Z^{(k)}_{e}=\overline{X^{(k)}_{e}}$ , $r_{i}=\pi(i)$ and $R^{(i)}_{e}=Y^{(i)}_{e}$ , we obtain a bags-of-dice martingale $Z=\sum_{i=1}^{n-1}\sum_{e}Z^{(i)}_{e}$ . Let $\widetilde{Z}$ denote the corresponding $\epsilon$ -truncated bags-of-dice martingale. The next lemma shows that $\widetilde{Z}$ is well-behaved. The lemma is proven in Section 5.

Proof of Theorem 3.1: We have $\overline{L^{(n)}}=\Pi+Z$ . Since for all $k,e$ , $\ker(L)\subseteq\ker(Y^{(k)}_{e})$ , the statement $(1-\epsilon)L\preceq L^{(n)}\preceq(1+\epsilon)L$ is equivalent to $-\epsilon\Pi\preceq Z\preceq\epsilon\Pi$ . Further, $\Pi Z\Pi=Z$ , and so it is equivalent to $-\epsilon I\preceq Z\preceq\epsilon I$ . By Theorem 4.5, we have,

Thus, we can pick $\sigma_{2}=\sqrt{\frac{3}{2\rho}}.$ Finally, Lemma 4.1 also gives,

Thus, we can pick $\Omega_{k}=\frac{3}{\rho(n+1-k)}I,$ and

Similarly, we obtain concentration for $-\widetilde{Z}$ with the same parameters. Thus, by Theorem 4.3,

Picking $\rho=\left\lceil 12(1+\delta)^{2}\epsilon^{-2}\ln^{2}n\right\rceil,$ we get $\sigma^{2}\leq\frac{\epsilon^{2}}{4(1+\delta)\ln n},$ and

Combining this with the bound obtained from Theorem 4.5 above establishes Equation (11).

Finally, we need to bound the expected running time of the algorithm. We start by observing that the algorithm maintains the two following invariants:

Every multi-edge in $\widehat{S}^{(k-1)}$ is $1/\rho$ -bounded.

The total number of multi-edges is at most $\rho m$ .

We establish the first invariant inductively. The invariant holds for $\widehat{S}^{(0)}$ , because of the splitting of original edges into $\rho$ copies with weight $1/\rho$ . The invariant thus also holds for $\widehat{S}^{(0)}-\left(\widehat{S}^{(0)}\right)_{\pi(1)}$ , since the multi-edges of this Laplacian are a subset of the previous ones. By Lemma 4.1, every multi-edge $Y_{e}$ output by CliqueSample is $1/\rho$ -bounded, so $\widehat{S}^{(1)}=\widehat{S}^{(0)}-\left(\widehat{S}^{(0)}\right)_{\pi(1)}+\widehat{C}_{1}$ is $1/\rho$ -bounded. If we apply this argument repeatedly for $k=1,\ldots,n-1$ we get invariant (1).

Invariant (2) is also very simple to establish: It holds for $\widehat{S}^{(0)}$ , because splitting of original edges into $\rho$ copies does not produce more than $\rho m$ multi-edges in total. When computing $\widehat{S}^{(k)}$ , we subtract $\left(\widehat{S}^{(k-1)}\right)_{\pi(k)}$ , which removes exactly $\deg_{\widehat{S}^{(k-1)}}(\pi(k))$ multi-edges, while we add the multi-edges produced by the call to $\textsc{CliqueSample}(\widehat{S}^{(k-1)},\pi(k))$ , which is at most $\deg_{\widehat{S}^{(k-1)}}(\pi(k))$ . So the number of multi-edges is not increasing.
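Invariant (2) can be observed on a toy run. The sketch below tracks only the multi-edge list (no factorization is produced) and reuses an assumed form of CliqueSample: one weighted draw, one uniform draw, and a new multi-edge of weight $\frac{w_{1}w_{2}}{w_{1}+w_{2}}$ when the endpoints differ. All function names are hypothetical.

```python
import random
from fractions import Fraction


def clique_sample(incident, rng):
    """Assumed sampling rule; returns at most len(incident) multi-edges."""
    weights = [float(w) for _, w in incident]
    out = []
    for _ in range(len(incident)):
        u1, w1 = rng.choices(incident, weights=weights)[0]
        u2, w2 = rng.choice(incident)
        if u1 != u2:
            out.append((u1, u2, w1 * w2 / (w1 + w2)))
    return out


def multi_edge_counts(n, edges, rho, seed=0):
    """Number of multi-edges before and after each of the n - 1 eliminations."""
    rng = random.Random(seed)
    # Split every original edge into rho copies of weight w / rho.
    current = [(a, b, Fraction(w) / rho) for a, b, w in edges for _ in range(rho)]
    counts = [len(current)]
    order = list(range(n))
    rng.shuffle(order)                        # random elimination order
    for v in order[:-1]:
        incident = [(b if a == v else a, w) for a, b, w in current if v in (a, b)]
        current = [e for e in current if v not in e[:2]]
        if incident:
            current += clique_sample(incident, rng)
        counts.append(len(current))
    return counts


edges = [(0, 1, 1), (1, 2, 2), (2, 3, 1), (3, 4, 3), (4, 0, 1), (0, 2, 2)]
counts = multi_edge_counts(5, edges, rho=3)
```

Each elimination removes all multi-edges incident on the chosen vertex and adds back at most as many, so the recorded counts never increase.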

Clique Sampling Proofs

In this section, we prove Lemmas 4.1 and 4.6 that characterize the behaviour of our algorithm CliqueSample, which is used in SparseCholesky to approximate the clique generated by eliminating a variable.

An important element of the CliqueSample algorithm is our very simple approach to leverage score estimation. Using the well-known result that effective resistance in Laplacians is a distance (see Lemma 5.2), we give a bound on the leverage scores of all edges in a clique that arises from elimination. We let

Note that the factor $\nicefrac{{1}}{{2}}$ accounts for the fact that every pair is double counted.

Suppose multi-edges $e,e^{\prime}\ni v$ are $1/\rho$ -bounded w.r.t. $L$ , and have endpoints $v,u$ and $v,z$ respectively, and $z\neq u$ , then $w(e)w(e^{\prime})\boldsymbol{\mathit{b}}_{u,z}\boldsymbol{\mathit{b}}_{u,z}^{\top}$ is $\frac{w(e)+w(e^{\prime})}{\rho}$ -bounded.

To prove Lemma 5.1, we need the following result about Laplacians:

Given a connected weighted multi-graph $G=(V,E,w)$ with associated Laplacian matrix $L$ in $G$ , consider three distinct vertices $u,v,z\in V$ , and the pair-vectors $b_{u,v}$ , $b_{v,z}$ and $b_{u,z}$ .

This is known as the phenomenon that effective resistance is a distance [KR93].
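The triangle inequality for effective resistances can be checked exactly on small graphs. The sketch below computes $\boldsymbol{\mathit{b}}_{u,v}^{\top}L^{+}\boldsymbol{\mathit{b}}_{u,v}$ by grounding one vertex and solving the reduced system; `effective_resistance` is a hypothetical helper, and the example graphs are arbitrary.

```python
from fractions import Fraction
from itertools import permutations


def laplacian(n, edges):
    L = [[Fraction(0)] * n for _ in range(n)]
    for u, v, w in edges:
        L[u][u] += w
        L[v][v] += w
        L[u][v] -= w
        L[v][u] -= w
    return L


def effective_resistance(L, u, v):
    """b_{u,v}^T L^+ b_{u,v} for the Laplacian L of a connected graph."""
    n = len(L) - 1                      # ground the last vertex
    rhs = [Fraction(0)] * (n + 1)
    rhs[u] += 1
    rhs[v] -= 1
    A = [list(L[i][:n]) + [rhs[i]] for i in range(n)]
    for c in range(n):                  # Gauss-Jordan elimination
        p = next(i for i in range(c, n) if A[i][c] != 0)
        A[c], A[p] = A[p], A[c]
        for i in range(n):
            if i != c and A[i][c] != 0:
                f = A[i][c] / A[c][c]
                A[i] = [a - f * b for a, b in zip(A[i], A[c])]
    x = [A[i][n] / A[i][i] for i in range(n)] + [Fraction(0)]
    return x[u] - x[v]                  # potential difference for unit current


# Unit-weight 4-cycle, where the resistances are easy to compute by hand.
cycle = laplacian(4, [(0, 1, Fraction(1)), (1, 2, Fraction(1)),
                      (2, 3, Fraction(1)), (3, 0, Fraction(1))])
# An arbitrary weighted graph on 5 vertices.
graph = laplacian(5, [(0, 1, Fraction(2)), (1, 2, Fraction(1, 3)),
                      (2, 3, Fraction(5)), (3, 4, Fraction(1)),
                      (4, 0, Fraction(1, 2)), (1, 3, Fraction(2))])
R = {(u, v): effective_resistance(graph, u, v)
     for u, v in permutations(range(5), 2)}
triangle_ok = all(R[u, z] <= R[u, v] + R[v, z]
                  for u, v, z in permutations(range(5), 3))
```

On the 4-cycle, adjacent vertices see a unit resistor in parallel with a path of three, and opposite vertices see two paths of two in parallel.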

Proof of Lemma 5.1: Using the previous lemma:

To prove Lemma 4.1, we need the following result of Walker [Wal77] (see Bringmann and Panagiotou [BP12] for a modern statement of the result).

The time required for each sample is $O(1)$ .

We note that there are simpler sampling constructions than that of Lemma 5.3 that need $O(\log n)$ time per sample, and using such a method would only worsen our running time by a factor $O(\log n)$ .
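One such simpler construction is sampling by binary search over prefix sums, sketched below. This is not the method of Lemma 5.3; the class name `PrefixSumSampler` is hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate


class PrefixSumSampler:
    """Sample index i with probability weights[i] / sum(weights)."""

    def __init__(self, weights):
        self.prefix = list(accumulate(weights))      # O(n) preprocessing

    def sample(self, rng):
        r = rng.random() * self.prefix[-1]           # uniform in [0, total)
        # First index whose prefix sum exceeds r: O(log n) binary search.
        return min(bisect_right(self.prefix, r), len(self.prefix) - 1)


sampler = PrefixSumSampler([1.0, 0.0, 3.0])
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(2000):
    counts[sampler.sample(rng)] += 1
```

Preprocessing is linear in the number of items, and each draw costs one binary search, which is the source of the extra $O(\log n)$ factor mentioned above.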

Proof of Lemma 4.1: From Lines (5) and (6), $Y_{i}$ is $0$ or the Laplacian of a multi-edge with endpoints $u_{1},u_{2}$ . To upper bound the running time, it is important to note that we do not need access to the entire matrix $S$ . We only need the multi-edges incident on $v$ . When calling CliqueSample, we only pass a copy of just these multi-edges.

We observe that the uniform samples in Line (3) can be done in $O(1)$ time each, provided we count the number of multi-edges incident on $v$ to find $\deg_{S}(v)$ . We can compute $\deg_{S}(v)$ in $O(\deg_{S}(v))$ time. Using Lemma 5.3, if we do $O(\deg_{S}(v))$ time preprocessing, we can compute each sample in Line (2) in time $O(1)$ . Since we do $O(\deg_{S}(v))$ samples, the total time for sampling is hence $O(\deg_{S}(v))$ .

Now we determine the expected value of the sum of the samples. Note that in the sum below, each pair of multi-edges appears twice, with different weights.

Proof of Lemma 4.6: Throughout the proof of this lemma, all the random variables considered are conditional on the choices of the SparseCholesky algorithm up to and including step $k-1$ .

Now, $C_{\pi(k)}(\widehat{S}^{(k-1)})$ is PSD since it is a Laplacian, so $\left\|\overline{C_{\pi(k)}(\widehat{S}^{(k-1)})}\right\|=\lambda_{\max}(\overline{C_{\pi(k)}(\widehat{S}^{(k-1)})})$ . By Equation (2), we get ${C_{\pi(k)}(\widehat{S}^{(k-1)})\preceq\left(\widehat{S}^{(k-1)}\right)_{\pi(k)}}$ and by Equation (1) we get ${\left(\widehat{S}^{(k-1)}\right)_{\pi(k)}\preceq\widehat{S}^{(k-1)}}$ , finally by Equation (5) we get ${\widehat{S}^{(k-1)}\preceq L^{(k-1)}}$ so

Again, using $C_{\pi(k)}(\widehat{S}^{(k-1)})\preceq\left(\widehat{S}^{(k-1)}\right)_{\pi(k)}$ , we get

Matrix Concentration Analysis

To prove Theorem 4.3, we need the following lemma, which is the main technical result of this section.

Given a bags-of-dice martingale of $d\times d$ matrices ${Z=\sum_{i=1}^{k}\sum_{e=1}^{l_{i}}Z^{(i)}_{e}}$ that is for all $\theta$ such that $0<\theta^{2}\leq\min\{\frac{1}{\sigma_{1}^{2}},\frac{5}{12\sigma_{2}^{2}}\}$ , we have,

Before proving this lemma, we will see how to use it to prove Theorem 4.3.

Proof of Theorem 4.3: Given Lemma 6.1, we can show Theorem 4.3 using the following bound via trace exponentials, which was first developed by Ahlswede and Winter [AW02].

To show Lemma 6.1, we will need the following result by Tropp [Tro12], which is a corollary of Lieb’s Concavity Theorem [Lie73].

Given a random symmetric matrix $Z$ , and a fixed symmetric matrix $H$ ,

We will use the following claim to control the above trace by an inductive argument.

For all $j=1,\ldots,k,$ and all $\theta$ such that $0<\theta^{2}\leq\min\{\frac{1}{\sigma_{1}^{2}},\frac{5}{12\sigma_{2}^{2}}\}$ , we have,

Before proving this claim, we see that it immediately implies Lemma 6.1.

Proof of Lemma 6.1: We chain the inequalities given by Claim 6.3 for $j=k,k-1,\ldots,1$ to obtain,

where the last inequality follows from $\operatorname*{Tr}\exp\left({A}\right)\leq d\exp\left({\left\|A\right\|}\right)$ for symmetric $A$ and $\left\|\sum_{i=1}^{k}\Omega_{i}\right\|\leq\sigma_{3}^{2}.$ $\Box$

We will also need the next two lemmas, which essentially appear in the work of Tropp [Tro12]. For completeness, we also prove these lemmas in Appendix A.

We also need the following well-known fact (see for example [Tro12]):

Given symmetric matrices $A,B$ s.t. $A\preceq B$ , $\operatorname*{Tr}{\exp\left({A}\right)}\leq\operatorname*{Tr}{\exp\left({B}\right)}$ .

For all $\theta$ such that $0<\theta^{2}\leq\frac{1}{\sigma^{2}_{1}},$ all $j=1,\ldots,k$ and for all symmetric $\widetilde{H}$ that are fixed given $(r_{1},R_{1}),\ldots,(r_{j-1},R_{j-1}),$

Now, using Fact 6.6 which states that $\operatorname*{Tr}{\exp\left({\cdot}\right)}$ is monotone increasing with respect to the PSD order, we obtain the lemma. $\Box$

Proof of Claim 6.3: Since $0<\theta^{2}\leq\frac{1}{\sigma_{1}^{2}},$ using Lemma 6.7 with $\widetilde{H}=\theta^{2}\left({\sum_{i=j+1}^{k}\Omega_{i}}\right)+\theta\sum_{i=1}^{j-1}\sum_{e}Z^{(i)}_{e},$ we obtain,

Now, using Fact 6.6, namely that $\operatorname*{Tr}{\exp\left({\cdot}\right)}$ is monotone increasing with respect to the PSD order, we obtain the lemma. $\Box$

6.2 Truncating Bags-of-Dice Martingales

To prove this theorem, we use the next lemma, which we will prove later in this section:

Acknowledgements

We thank Daniel Spielman for suggesting this project and for helpful comments and discussions.

References

Appendix A Conditions for Bounding Matrix Moment Generating Functions

Proof of Lemma 6.4: We define $f(z)=\frac{e^{z}-z-1}{z^{2}}$ . Note that $f(1)\leq 4/5$ . The function $f$ is positive and increasing in $z$ for all real $z$ . This means for every symmetric matrix $A$ , $f(A)\preceq f(\left\|A\right\|)I$ , and so for any symmetric matrix $B$ , $Bf(A)B\preceq f(\left\|A\right\|)B^{2}$ . Thus

Proof of Lemma 6.5: We define $g(z)=\frac{e^{z}-1}{z}$ . The function $g$ is positive and increasing in $z$ for all real $z\geq 0$ . This means for every symmetric matrix $A$ , $g(A)\preceq g(\left\|A\right\|)I$ , and so for any symmetric matrix $B$ , $Bg(A)B\preceq g(\left\|A\right\|)B^{2}$ . Also $g(1/3)\leq 6/5$ . Thus

The lemma now follows using the fact that $\log$ is operator monotone (increasing), $\log(1+z)\leq z$ for all real $z>0,$ and $C\succeq 0.$ $\Box$
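The numerical constants used in the two proofs above are easy to confirm. The sketch below evaluates $f(1)$ and $g(1/3)$ in floating point and spot-checks monotonicity on a grid; the grid is an arbitrary choice and is no substitute for the analytic argument.

```python
import math


def f(z):
    """f(z) = (e^z - z - 1) / z^2, for z != 0."""
    return (math.exp(z) - z - 1) / z ** 2


def g(z):
    """g(z) = (e^z - 1) / z, for z != 0."""
    return (math.exp(z) - 1) / z


f_at_1 = f(1.0)              # equals e - 2
g_at_third = g(1.0 / 3.0)    # equals 3 (e^{1/3} - 1)

grid = [k / 10 for k in range(-30, 31) if k != 0]
f_increasing = all(f(a) < f(b) for a, b in zip(grid, grid[1:]))
positive = [z for z in grid if z > 0]
g_increasing = all(g(a) < g(b) for a, b in zip(positive, positive[1:]))
```

Both constants fall comfortably below the thresholds $4/5$ and $6/5$ used in the proofs.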

Appendix B Obtaining Concentration of Running Time

As indicated in Remark 3.2, we can obtain a version of Theorem 3.1 that provides running time guarantees with high probability instead of in expectation, by making a slight change to the SparseCholesky algorithm. In this appendix, we briefly sketch how to prove this. We refer to the modified algorithm as LowDegreeSparseCholesky. The algorithm requires only two small modifications: Firstly, instead of initially choosing a permutation at random, we choose the $k^{\text{th}}$ vertex to eliminate by sampling it uniformly at random amongst the remaining vertices that have degree at most twice the average multi-edge degree in $\widehat{S}^{(k-1)}$. We can do this by keeping track of all vertex degrees, sampling a remaining vertex at random, and resampling if the degree is too high, until we get a low-degree vertex. Secondly, to make up for the slight reduction in the randomization of the choice of vertex, we double the value of $\rho$ used in Line 1.
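The vertex-selection rule can be sketched as rejection sampling. The function name and the dictionary representation of degrees below are assumptions made for illustration; an actual implementation would maintain the vertex list and the degree sum incrementally rather than recomputing them on each call. By Markov's inequality, fewer than half of the remaining vertices can have degree above twice the average, so each trial succeeds with probability more than $1/2$.

```python
import random

def sample_low_degree_vertex(degree, rng):
    """Pick a uniformly random remaining vertex of degree <= 2 * average.

    degree: dict mapping each remaining vertex to its multi-edge degree.
    rng:    a random.Random instance.
    """
    vertices = list(degree)
    average = sum(degree.values()) / len(vertices)
    while True:
        v = rng.choice(vertices)      # uniform over all remaining vertices
        if degree[v] <= 2 * average:  # accept only low-degree vertices
            return v
        # otherwise resample

# Usage: vertex 3 has degree far above twice the average and is never chosen.
rng = random.Random(0)
example_degree = {0: 1, 1: 1, 2: 1, 3: 100}
picks = [sample_low_degree_vertex(example_degree, rng) for _ in range(200)]
```

Conditioned on acceptance, the returned vertex is uniform over the low-degree vertices, which matches the sampling rule described above.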

The number of non-zero entries in $\mathcal{L}$ is $O(\frac{\delta^{2}}{\epsilon^{2}}m\log^{3}n)$ . With high probability the algorithm runs in time $O(\frac{\delta^{2}}{\epsilon^{2}}m\log^{3}n)$ .

Appendix C Sparse Cholesky Factorization for bDD Matrices

In this appendix, we sketch a version of our approximate Cholesky factorization algorithm that is also applicable to bDD matrices, which include the class of Connection Laplacian matrices. We call this algorithm BDDSparseCholesky. It follows closely the algorithmic structure used in [KLP+16]. Like [KLP+16], we need the input matrix to be non-singular, which we can achieve using the approach described in Claims 2.4 and 2.5 of [KLP+16].

Our algorithm replaces their expander-based Schur complement approximation routine with our simple one-by-one vertex elimination, while still using their recursive subsampling-based framework for estimation of leverage scores. The constants in this appendix are not optimized.

We study bDD matrices as defined in [KLP+16], with $r\times r$ blocks, where $r$ is a constant (see their Section 1.1). The class of bDD matrices is Hermitian, rather than symmetric, but our notion of spectral approximation and our matrix concentration results extend immediately to Hermitian matrices. Throughout this section, we use $(\cdot)^{{\dagger}}$ to denote the conjugate transpose. Our algorithm will still compute a Cholesky decomposition, except we do not factor the individual $r\times r$ block matrices. We will sketch a proof of the following result:

where $Z=P_{\pi}\mathcal{L}\mathcal{D}\mathcal{L}^{{\dagger}}P_{\pi}^{{\dagger}},$ i.e., $Z$ has a sparse Cholesky factorization.

We choose a fixed $\epsilon=1/2$ , but the algorithms can be adapted to produce approximate Cholesky decompositions with $\epsilon\leq 1/2$ spectral approximation and running time dependence $\epsilon^{-2}$ .

For bDD matrices, we do not know a result analogous to the fact that effective resistance is a distance in Laplacians (see Lemma 5.2). Instead, we use a result that is weaker by a factor $2$: Given two bDD multi-edges $e$, $e^{\prime}$ with matrices $w(e)B_{u,v}B_{u,v}^{{\dagger}}$ and $w(e^{\prime})B_{v,z}B_{v,z}^{{\dagger}}$ that are incident on vertex blocks $v,u$ and $v,z$ respectively, if we eliminate vertex block $v$, this creates a multi-edge $e^{\prime\prime}$ with bDD matrix $w(e^{\prime\prime})B_{u,z}B_{u,z}^{{\dagger}}$ satisfying

We will sketch how to modify the LowDegreeSparseCholesky algorithm to handle bDD matrices. We call the resulting algorithm BDDSparseCholesky. It is similar to SparseCholesky and LowDegreeSparseCholesky, except that the number of multi-edges in the approximate factorization will be slowly increasing with each elimination, and to counter this we will need to occasionally sparsify the matrices we produce. First we will assume an oracle procedure Sparsify, which we will later see how to construct using a boot-strapping approach that recursively calls BDDSparseCholesky on smaller matrices.

$\textsc{BDDSparseCholesky}(S,\rho)$ should be identical to LowDegreeSparseCholesky, except

The sampling rate $\rho$ is an explicit parameter to BDDSparseCholesky.

BDDSparseCholesky should not split the initial input multi-edge into $\rho$ smaller copies. This will be important because we use BDDSparseCholesky recursively, and we will only split edges at the top level.

We adapt the CliqueSample routine to sample from a bDD elimination clique, and we produce $2\deg_{S}(v)$ samples, and scale all samples by a factor $1/2$ .

After eliminating $\frac{9}{10}n$ vertices, it calls $\textsc{Sparsify}(\widehat{S}^{(9n/10)},\rho)$ to produce a sampled matrix $S^{\prime}$ of dimension ${nr/10\times nr/10}$ , and then calls $\textsc{BDDSparseCholesky}(S^{\prime},\rho)$ to recursively produce an approximate Cholesky decomposition of $S^{\prime}$ .

To compute an approximate Cholesky decomposition of $L$ , we set $\rho=\left\lceil 10^{3}\log^{2}n\right\rceil$ . Form $S^{(0)}$ by splitting each edge of $L$ into $\rho$ copies with $1/\rho$ of their initial weight. Call $\textsc{BDDSparseCholesky}(S^{(0)},\rho)$ .
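The top-level preprocessing can be sketched as follows. The edge-list representation, the function names, and the use of the natural logarithm are assumptions made here for illustration; the text does not fix the base of the logarithm, and for bDD matrices each edge would carry an $r\times r$ block rather than a scalar weight.

```python
import math

def sampling_rate(n):
    """rho = ceil(10^3 * log^2 n); natural log is assumed here."""
    return math.ceil(1000 * math.log(n) ** 2)

def split_edges(edges, rho):
    """Replace each edge (u, v, w) by rho multi-edges of weight w / rho.

    The sum of the multi-edge matrices, and hence the matrix L itself,
    is unchanged; only the representation as a multi-graph changes.
    """
    return [(u, v, w / rho) for (u, v, w) in edges for _ in range(rho)]

# Usage on a toy triangle with a small rho (chosen only to keep the output short).
toy_edges = [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 4.0)]
toy_multi_edges = split_edges(toy_edges, rho=4)
```

Splitting is done once, at the top level only; the recursive calls receive $\rho$ as a parameter and do not split again.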

The clique sampling routine for bDD matrices uses more conservative sampling than CliqueSample, because we use the weaker Equation (12). The sparsification step then becomes necessary because our clique sampling routine now causes the total number of multi-edges to increase with each elimination. However, the increase will not exceed a factor $(1+8/n)$, so after $\frac{9}{10}n$ eliminations, the total number of multi-edges has not grown by more than a factor $2\cdot 10^{4}$.
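The growth bound follows from $1+x\leq e^{x}$: taking the stated per-elimination factor $(1+8/n)$ as given, $(1+8/n)^{9n/10}\leq e^{7.2}<2\cdot 10^{4}$. The sketch below checks this numerically for a few values of $n$; the function name and the chosen values are illustrative.

```python
import math

def growth_bound(n):
    """Worst-case growth in multi-edge count after 9n/10 eliminations,
    assuming each elimination multiplies the count by at most (1 + 8/n)."""
    eliminations = 9 * n // 10
    return (1 + 8 / n) ** eliminations

limit = math.exp(7.2)  # upper bound from 1 + x <= e^x
growth = {n: growth_bound(n) for n in (10, 100, 10**4, 10**6)}
```

The constant $2\cdot 10^{4}$ is therefore loose, consistent with the remark that the constants in this appendix are not optimized.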

We can use a truncated martingale to analyze the entire approximate Cholesky decomposition produced by BDDSparseCholesky and its recursive calls using Theorems 4.3 and 4.5 (these theorems extend to Hermitian matrices immediately). The calls to Sparsify will cause our bound on the martingale variance $\sigma_{3}$ to grow larger, but only by a constant factor. By increasing $\rho$ by an appropriate constant, we still obtain concentration.

On the other hand, the calls to Sparsify will ensure that the time spent in the recursive calls to BDDSparseCholesky is only a constant fraction of the time spent in the initial call, assuming the total number of multi-edges in $S$ exceeds $\rho^{2}\cdot 2\cdot 10^{6}nr$ (if not, we can always split edges to achieve this). This corresponds to assuming that before the initial edge splitting at the start of the algorithm, we have at least $\rho\cdot 2\cdot 10^{6}nr$ edges.

This means the total time to compute the approximate Cholesky decomposition of $L$ using BDDSparseCholesky will only be $O(\rho(m+\rho n))$ , excluding calls to Sparsify. The decomposition will have $O(m\log^{2}n+n\log^{4}n)$ non-zeros, and its approximate inverse can be applied in $O(m\log^{2}n+n\log^{4}n)$ time.

We now sketch briefly how to implement the Sparsify routine. It closely resembles the Sparsify routine of [KLP+16] (see their Lemma H.1).

Implementing the sparsification routine.

$\textsc{Sparsify}(\widehat{S}^{(9n/10)},\rho)$ uses the subsampling-based techniques of [CLM+15]. It is identical to the sparsification routine of [KLP+16], except the recursive call to a linear solver uses BDDSparseCholesky. The routine first samples each multi-edge with probability $\frac{1}{2\cdot 10^{5}\rho}$ to produce a sparse matrix $S^{\prime\prime}$, and then uses Johnson-Lindenstrauss-based leverage score estimation (see [CLM+15]) to compute IID samples with the desired $1/\rho$-bound w.r.t. $L$. The IID samples are summed to give the output matrix $S^{\prime}$. This requires approximately solving $\sqrt{\rho}$ systems of linear equations in the sparse matrix $S^{\prime\prime}$. To do so, Sparsify first splits every edge of $S^{\prime\prime}$ into $\rho$ copies (increasing the number of multi-edges by a factor $\rho$), then makes a single recursive call to $\textsc{BDDSparseCholesky}(S^{\prime\prime},\rho)$, and then uses the resulting Cholesky decomposition $\sqrt{\rho}$ times to compute approximate solutions to systems of linear equations. One issue requires some care: The subsampling guarantees provided by [CLM+15] are with respect to $\widehat{S}^{(9n/10)}$ and not $L$. However, by using a truncated martingale in our analysis, we can assume that $\widehat{S}^{(9n/10)}\preceq 2L$.
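To illustrate the Johnson-Lindenstrauss step, the following sketch estimates leverage scores in a heavily simplified setting: an ordinary weighted graph Laplacian instead of a bDD matrix, scores taken with respect to the graph itself, and a dense pseudoinverse standing in for the recursive approximate solver. All function names and the random-sign projection are assumptions made here. It uses the identity $b_{e}^{\top}L^{+}b_{e}=\|W^{1/2}BL^{+}b_{e}\|^{2}$, so $k$ random projections require $k$ linear solves in $L$.

```python
import numpy as np

def incidence_matrix(n, edges):
    """Signed edge-vertex incidence matrix B, one row per edge (u, v, w)."""
    B = np.zeros((len(edges), n))
    for row, (u, v, _) in enumerate(edges):
        B[row, u] = 1.0
        B[row, v] = -1.0
    return B

def leverage_scores_exact(n, edges):
    B = incidence_matrix(n, edges)
    w = np.array([weight for _, _, weight in edges])
    L = B.T @ (w[:, None] * B)
    L_pinv = np.linalg.pinv(L, rcond=1e-10)
    # tau_e = w_e * b_e^T L^+ b_e
    return w * np.einsum("ei,ij,ej->e", B, L_pinv, B)

def leverage_scores_jl(n, edges, k, rng):
    B = incidence_matrix(n, edges)
    w = np.array([weight for _, _, weight in edges])
    L = B.T @ (w[:, None] * B)
    # Random sign projection from edge space down to k dimensions.
    Q = rng.choice([-1.0, 1.0], size=(k, len(edges))) / np.sqrt(k)
    # Each of the k rows of Z is one linear solve in L; the dense
    # pseudoinverse stands in for the approximate Cholesky solver.
    Z = (Q @ (np.sqrt(w)[:, None] * B)) @ np.linalg.pinv(L, rcond=1e-10)
    # Column e of Z B^T is Z b_e, whose squared norm estimates b_e^T L^+ b_e.
    return w * np.sum((Z @ B.T) ** 2, axis=0)

# Usage on a small connected graph with 5 vertices.
toy_graph = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0), (3, 4, 0.5),
             (4, 0, 1.0), (0, 2, 3.0), (1, 3, 1.0)]
exact = leverage_scores_exact(5, toy_graph)
approx = leverage_scores_jl(5, toy_graph, k=400, rng=np.random.default_rng(0))
```

In this toy example $k$ is taken much larger than needed so that the estimates are visibly accurate; constant-factor accuracy, which is what the sampling requires, is achieved with far fewer projections.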

Running time including sparsification.

Finally, if we take account of time spent on calls to Sparsify and its recursive calls to $\textsc{BDDSparseCholesky}(S^{\prime\prime},\rho)$ , we get a time recursion for BDDSparseCholesky of

(assuming initially for $L$ that $m\geq\rho 2\cdot 10^{6}n$ ), which can be solved to give a running time bound of $O(m\rho^{1.5}+n\rho^{2.5})=O(m\log^{3}n+n\log^{5}n)$ .

Fixing $\rho$ at the top level ensures that a union bound across all recursive calls in BDDSparseCholesky will give that the approximate Cholesky decomposition obtains a $1/2$ factor spectral approximation with high probability.

By applying the sparsification approach described in this section once, we can compute an approximate Cholesky decomposition of Laplacian matrices in $O(m\log^{2}n\log\log n)$ time w.h.p.

We run LowDegreeSparseCholesky with a modification: after eliminating all but $n/\log^{100}n$ vertices, we do sparsification on the remaining graph with $\rho m$ multi-edges and $n/\log^{100}n$ vertices. The call to LowDegreeSparseCholesky until sparsification will take $O(m\log^{2}n\log\log n)$ time w.h.p. The sparsification is done using a modified version of the Sparsify routine described above. Instead of sampling each multi-edge with probability $\frac{1}{2\cdot 10^{5}\rho}$, we use a probability of $\frac{1}{\log^{8}n}$. The recursive linear solve in Sparsify can be done using unmodified LowDegreeSparseCholesky, and $O(\log n)$ linear system solves for Johnson-Lindenstrauss leverage score estimation can be done using this decomposition, all in $O(m+n)$ time. The output graph $S^{\prime}$ from Sparsify can be Cholesky decomposed using unmodified LowDegreeSparseCholesky as well, and this will take time $O(m+n)$. In total, we get a running time and number of non-zeros bounded by $O(m\log^{2}n\log\log n)$ w.h.p.