Monte Carlo Markov Chain Algorithms for Sampling Strongly Rayleigh Distributions and Determinantal Point Processes

Nima Anari, Shayan Oveis Gharan, Alireza Rezaei

Introduction

We assign a multi-affine polynomial with variables z1,,znz_{1},\dots,z_{n} to μ\mu,

where for a set S[n]S\subseteq[n], zS=iSziz^{S}=\prod_{i\in S}z_{i}. The polynomial gμg_{\mu} is also known as the generating polynomial of μ\mu. We say μ\mu is kk-homogeneous if gμg_{\mu} is a homogeneous polynomial of degree kk, i.e., if for any Ssupp{μ}S\in\textup{supp}\{\mu\}, we have S=k|S|=k.

Strongly Rayleigh distributions are introduced and deeply studied in the work of [BBL09]. These distributions are natural generalizations of determinantal measures and random spanning tree distributions. It is shown in [BBL09] that strongly Rayleigh distributions satisfy the strongest form of negative dependence properties. These negative dependence properties were recently exploited to design approximation algorithms [OSS11, PP14, AO15].

In this paper we show that the “natural” Monte Carlo Markov Chain (MCMC) method on the support of a homogeneous strongly Rayleigh distribution μ\mu mixes rapidly. Therefore, this Markov Chain can be used to efficiently draw an approximate sample from μ\mu. Since determinantal point processes are special cases of strongly Rayleigh measures, our result implies that the same Markov chain efficiently generates random samples of a kk-determinantal point process (see Subsection 1.1 for the details).

In a state SS, choose an element iSi\in S and jSj\notin S uniformly and independently at random, and let T=Si+jT=S-i+j, then

If Tsupp{μ}T\in\textup{supp}\{\mu\}, move to TT with probability 12min{1,μ(T)/μ(S)}\frac{1}{2}\min\{1,\mu(T)/\mu(S)\};

It is easy to see that Mμ{\cal M}_{\mu} is reversible and μ(.)\mu(.) is the stationary distribution of the chain. In addition, Brändén showed that the support of a (homogeneous) strongly Rayleigh distribution is the set of bases of a matroid [Brä07, Cor 3.4]; so Mμ{\cal M}_{\mu} is irreducible. Lastly, since we stay in each state SS with probability at least 1/21/2, Mμ{\cal M}_{\mu} is a lazy chain.

If XX is a random variable sampled according to ν\nu and νπTVϵ\|\nu-\pi\|_{\operatorname{TV}}\leq\epsilon, then we say XX is an ϵ\epsilon-approximate sample of π\pi.

For a state xΩx\in\Omega and ϵ>0\epsilon>0, the total variation mixing time of a chain started at xx with transition probability matrix PP and stationary distribution π\pi is defined as follows:

where Pt(x,.)P^{t}(x,.) is the distribution of the chain started at xx at time tt.

is at least 12kn\frac{1}{2kn} by construction.

Suppose we have access to a set Ssupp{μ}S\in\textup{supp}\{\mu\} such that μ(S)exp(n)\mu(S)\geq\exp(-n). In addition, we are given an oracle such that for any set T(nk)T\in{n\choose k}, it returns μ(T)\mu(T) if Tsupp{μ}T\in\textup{supp}\{\mu\} and zero otherwise. Then, by the above theorem we can generate an ϵ\epsilon-approximate sample of μ\mu with at most poly(n,k,log(1/ϵ))\operatorname{poly}(n,k,\log(1/\epsilon)) oracle calls.

Borcea, Brändén, and Liggett showed that for any strongly Rayleigh distribution μ\mu, and any integer kk, μk\mu_{k} is also strongly Rayleigh, [BBL09]. Therefore, if we have access to a set S[n]S\subset[n] of size kk, we can use the above theorem to generate an approximate sample of μk\mu_{k}.

where LSL_{S} is the principal submatrix of LL indexed by the elements of SS.

DPPs are one of the fundamental objects used to study a variety of tasks in machine learning, including text summarization, image search, news threading, etc. For more information about DPPs and their applications we refer to a recent survey by Kulesza and Taskar [KT13].

For an integer 0kn0\leq k\leq n, and a DPP μ\mu, the truncation of μ\mu to kk, μk\mu_{k} is called a kk-DPP. It turns out that the family of determinantal point processes are not closed under truncation. Perhaps, the simplest example is the kk-uniform distribution over a set of nn elements. Although the uniform distribution over nn elements is a DPP, for any 2kn22\leq k\leq n-2, the corresponding kk-DPP is not a DPP [KT13, Section 5].

In the past, several spectral algorithms were designed for sampling from kk-DPPs [HKPV06, DR10, KT13], but these algorithms typically need to diagonalize a giant nn-by-nn matrix, so they are inefficient in time and memory We remark that the algorithms in [DR10] are almost linear in nn; however they need access to the Cholesky decomposition of the ensemble matrix of the underlying DPP.. It was asked by [DR10] to generate random samples of a kk-DPP using Markov chain techniques. Markov chain techniques are very appealing in this context because of their simplicity and efficiency. There has been several attempts [Kan13, LJS15, RK15] to upper bound the mixing time of the Markov chain Mμ{\cal M}_{\mu} for a kk-DPP μ\mu; but, to the best of our knowledge, this question is still openWe remark that [Kan13] claimed to have a proof of the rapid mixing time of a similar Markov chain. As it is pointed out in [RK15] the coupling argument of [Kan13] is ill-defined. To be more precise, the chain specified in Algorithm 1 of [Kan13] may not mix in a polynomial time of nn. The chain specified in Algorithm 2 of [Kan13] is similar to Mμ{\cal M}_{\mu}, but the statement of Theorem 2 which upper bounds its mixing time is clearly incorrect even when k=1k=1..

Here, we show that for a kk-DPP μ\mu, Mμ{\cal M}_{\mu} can be used to efficiently generate an approximate sample of μ\mu. [BBL09] show that any DPP is a strongly Rayleigh distribution. Since strongly Rayleigh distributions are closed under truncation, any kk-DPP is a strongly Rayleigh distribution. Therefore, by Theorem 1.1, for any kk-DPP μ\mu, Mμ{\cal M}_{\mu} mixes rapidly to the stationary distribution.

Given access to the ensemble matrix of a kk-DPP, we can use the above theorem to generate ϵ\epsilon-approximate samples of the kk-DPP.

Given an ensemble matrix LL of a kk-DPP μ\mu there is an algorithm for any ϵ>0\epsilon>0 generates an ϵ\epsilon-approximate sample of μ\mu in time poly(k)O(nlog(n/ϵ))\operatorname{poly}(k)O(n\log(n/\epsilon)).

To prove the above theorem, we need an efficient algorithm to generate a set Ssupp{μ}S\in\textup{supp}\{\mu\} such that μ(S)\mu(S) is bounded away from zero, perhaps by an exponentially small function of n,kn,k. We use the greedy algorithm 1 to find such a set, and we show that, in time O(n)poly(k)O(n)\operatorname{poly}(k), it returns a set SS such that

Noting that each transition step of the Markov chain Mμ{\cal M}_{\mu} only takes time that is polynomial in kk, this completes the proof of the above theorem.

The maximum volume submatrix problem is NP-hard to approximate within a factor ckc^{k} for some constanat c>1c>1 [ÇMI13]. Numerous approximation algorithm is given for this problem [ÇMI09, ÇMI13, Nik15]. It was shown in [ÇMI09, Thm 11] that choosing the rows of XX greedily gives a k!k! approximation to the maximum volume submatrix problem. Algorithm 1 is equivalent to the greedy algorithm of [ÇMI09]; it is only described in the language of ensemble matrix LL. Therefore, it returns a set SS such that

2 Proof Overview

In the rest of the paper we prove Theorem 1.1. To prove Theorem 1.1 we lower bound the spectral gap, a.k.a. the Poincaré constant of the chain Mμ{\cal M}_{\mu}. This directly upper bounds the mixing time in total variation distance. To lower bound the spectral gap, we use an extension of the seminal work of [FM92]. Feder and Mihail showed that the bases exchange graph of the bases of a balanced matroid is an expander. This directly lower bounds the spectral gap by Cheeger’s inequality. A matroid is called balanced if the matroid and all of its minors satisfy the property that, the uniform distribution of the bases is negatively associated (see Subsection 2.2 for the definition of negative association).

Our proof can be seen as a weighted variant of [FM92]. As we mentioned earlier, the support of a homogeneous strongly Rayleigh distribution corresponds to the bases of a matroid. Our proof shows that if a distribution μ\mu over the bases of a matroid and all of its conditional measures are negatively associated, then the MCMC algorithm mixes rapidly. To show that μ\mu satisfies the aforementioned property we simply appeal to the negative dependence theory of strongly Rayleigh distributions developed in [BBL09]. Although our proof can be written in the language of [FM92], we work with the more advanced chain decomposition idea of [JSTV04] to prove a tight bound on the Poincaré constant, see Subsection 2.3 for the details.

We remark that, in general, the decomposition idea of [JSTV04] can be also be used to lower bound the log-Sobolev constant. However, it turns out that in our case, the log-Sobolev constant may be no larger than 1log(minSsupp{μ}μ(S))\frac{1}{-\log(\min_{S\in\textup{supp}\{\mu\}}\mu(S))}. Since the latter quantity is not necessarily lower-bounded by a function of kk and nn, the log-Sobolev constant (and hence, the L2L_{2} mixing time) of the chain may be unbounded.

Background

In this section we give a high level overview of Markov chains and their mixing times. We refer the readers for [LPW06, MT06] for details. Let Ω\Omega denote the state space, PP denote the Markov kernel and π(.)\pi(.) denote the stationary distribution of a Markov chain. We say a Markov chain is lazy if for any state xΩx\in\Omega, P(x,x)1/2P(x,x)\geq 1/2.

In particular, fπ=f,fπ\|f\|_{\pi}=\sqrt{\langle f,f\rangle_{\pi}}. For a function fL2(π)f\in L^{2}(\pi), the Dirichlet form Eπ(f,f){\cal E}_{\pi}(f,f) is defined as follows

Next, we overview classical spectral techniques to upper bound the mixing time of Markov chains.

The Poincaré constant of the chain is defined as follows,

where the infimum is over all functions with nonzero variance.

It is easy to see that for any transition probability matrix PP, the second largest eigenvalue of PP is 1λ1-\lambda. If PP is a lazy chain, then 1λ1-\lambda is also the second largest eigenvalue of PP in absolute value. In the following fact we see how to calculate the Poincaré constant of any reversible 2-state chain.

The Poincaré constant of any reversible two state chain with Ω=0,1\Omega={0,1} and P(0,1)=cπ(1)P(0,1)=c\cdot\pi(1) is cc.

To prove Theorem 1.1 we simply calculate the Poincaré constant of the chain Mμ{\cal M}_{\mu} and then we use the following classical theorem of Diaconis and Stroock to upper bound the mixing time.

For any reversible irreducible lazy Markov chain (Ω,P,π)(\Omega,P,\pi) with Poincaré constant λ\lambda, ϵ>0\epsilon>0 and any state xΩx\in\Omega,

Using the above theorem, to prove Theorem 1.1, it is enough to lower bound the Poincaré constant of Mμ{\cal M}_{\mu}.

It is easy to see that Theorem 1.1 follows by the above two theorems.

2 Strongly Rayleigh Measures

Building on [FM92], Borcea, Brändén and Liggett proved that any strongly Rayleigh distribution is negatively associated.

Any strongly Rayleigh probability distribution is negatively associated.

As an example, the above theorem implies that any kk-DPP is negatively associated. The negative association property is the key to our lower bound on the Poincaré constant of the chain Mμ{\cal M}_{\mu}.

For 1in1\leq i\leq n, let YiY_{i} be the random variable indicating whether ii is in a sample of μ\mu. We use

to denote the conditional measure on sets that contain ii and

to denote the conditional measure on sets that do not contain ii. Borcea, Brändén and Ligett showed that strongly Rayleigh distributions are closed under conditioning.

3 Decomposable Markov Chains

In this section we describe the decomposable Markov chain technique due to Jerrum, Son, Tetali and Vigoda [JSTV04]. This will be our main tool to lower bound the Poincaré constant of Mμ{\cal M}_{\mu}. Roughly speaking, they consider Markov chains that can be decomposed into “projection” and “restriction” chains. They lower bound the Poincaré constant of the original chain assuming certain properties of these projection/restriction chains.

Let Ω0Ω1\Omega_{0}\cup\Omega_{1} be a decomposition of the state space of a Markov chain (Ω,P,π)(\Omega,P,\pi) into two disjoint setsHere, we only focus on decomposition into two disjoint sets, although the technique of [JSTV04] is more general.. For i{0,1}i\in\{0,1\} let

The Markov chain ({0,1},Pˉ,πˉ)(\{0,1\},\bar{P},\bar{\pi}) is called a projection chain. Let λˉ\bar{\lambda} be the Poincaré constant of this chain.

We can also define a restriction Markov chain on each Ωi\Omega_{i} as follows. For each i{0,1}i\in\{0,1\},

In other words, for any transition from xx to a state outside of Ωi\Omega_{i}, we remain in xx. Observe that in the stationary distribution of the restriction chain, the probability of xx is proportional to π(x)\pi(x). Let λi\lambda_{i} be the Poincaré constant of the chain (Ωi,Pi,.)(\Omega_{i},P_{i},.). Now, we are ready to explain the main result of [JSTV04].

If for any distinct i,j{0,1}i,j\in\{0,1\}, and any xΩix\in\Omega_{i},

then the Poincaré constant of (Ω,P,π)(\Omega,P,\pi) is at least min{λˉ,λ0,λ1}\min\{\bar{\lambda},\lambda_{0},\lambda_{1}\}.

Inductive Argument

In this section we prove Theorem 2.3. Throughout this section we fix a strongly Rayleigh distribution μ\mu, and we let Ω,P\Omega,P be the state space and the transition probability matrix of Mμ{\cal M}_{\mu}.

It remains to lower bound the Poincaré constant of the projection chain and to prove equation (2.1). Unfortunately, PP does not satisfy (2.1). So, we use an idea of [JSTV04]. We construct a new Markov kernel P^\hat{P} such that (i) P^\hat{P} has the same stationary distribution μ\mu. (ii) The Poincaré constant of P^\hat{P}, λ^\hat{\lambda}, lower-bounds λ\lambda. Then, we use Theorem 2.6 to lower bound λ^\hat{\lambda}.

To make sure that P^\hat{P} satisfies (i), (ii), it is enough that for all distinct states x,yΩx,y\in\Omega,

Equation (3.1) implies (i), i.e., that μ\mu is also the stationary distribution of P^\hat{P}. By an application of the comparison method [DSC93], (i) together with (3.2) implies (ii), i.e.,

So, to prove the induction step, it is enough to show that

For any i{0,1}i\in\{0,1\} and states x,yΩix,y\in\Omega_{i}, P^(x,y)=P(x,y)\hat{P}(x,y)=P(x,y).

The Poincaré constant of the chain (Ω,P^,μ)(\Omega,\hat{P},\mu) projected onto Ω0,Ω1\Omega_{0},\Omega_{1} is at least λ^ˉCμ\bar{\hat{\lambda}}\geq C_{\mu},

For any state Ssupp{μ}S\in\textup{supp}\{\mu\} and distinct i,j{0,1}i,j\in\{0,1\},

Before, proving the above lemma, we use it to finish the proof of the induction. By part (2), P^\hat{P} agrees with PP on the projection chains. Therefore, the Poincaré constants of the chains (Ω0,P^0,.)(\Omega_{0},\hat{P}_{0},.) and (Ω1,P^1,.)(\Omega_{1},\hat{P}_{1},.) are at least λ0,λ1Cμ\lambda_{0},\lambda_{1}\geq C_{\mu}. So, by parts (3) and (4) we can invoke Theorem 2.6 for P^\hat{P} and we get that

This proves (3.4). As we discussed earlier, part (1) implies (3.3) which completes the induction.

In the rest of this section we prove 3.1. Note that the main challenge in proving the lemma is part (4). The transition probability matrix PP already satisfies parts (1)-(3). The key to prove part (4) is to construct a fractional perfect matching between the states of Ω0\Omega_{0} and Ω1\Omega_{1}, see the following lemma for the formal definition. This idea originally was used in [FM92] and it was later extended in [JS02].

We use the negative association property of the strongly Rayleigh distributions to prove the above lemma. But before that let us prove 3.1.

Proof of 3.1. We use ww to construct P^\hat{P}. For any i,j{0,1}i,j\in\{0,1\} and xΩix\in\Omega_{i} and yΩjy\in\Omega_{j}, we let

Note that by definition part (2) is satisfied. First we verify part (1). If iji\neq j, then

and if i=ji=j the same identity holds because P^(x,y)=P(x,y)\hat{P}(x,y)=P(x,y). This proves (3.1). To see (3.2) note that for any iji\neq j and xΩi,yΩjx\in\Omega_{i},y\in\Omega_{j} we have

The first inequality follows by the definition of CμC_{\mu} (see (1.1)), the second inequality follows by the fact that w{x,y}μ(x)μ(Ωi)w_{\{x,y\}}\leq\frac{\mu(x)}{\mu(\Omega_{i})} and w{x,y}μ(y)μ(Ωj)w_{\{x,y\}}\leq\frac{\mu(y)}{\mu(\Omega_{j})}, and the last inequality follows by the detailed balanced condition. This completes the proof of part (1).

Next, we prove part (3). By definition of P^\hat{P}, for distinct i,j{0,1}i,j\in\{0,1\} we have

where the second to last equality follows by (3.5). By 2.1, the Poincaré constant of P^ˉ=Cμ\bar{\hat{P}}=C_{\mu}. This proves part (3).

Finally we prove part (4). Fix distinct i,j{0,1}i,j\in\{0,1\} and zΩiz\in\Omega_{i}. We have,

where we used (3.5). On the other hand, by definition of P^\hat{P} we know that

where the second equality follows by (3.5). This completes the proof of part (4) and 3.1. ∎

It remains to prove 3.2. For a set AΩA\subseteq\Omega let

To prove 3.2 we use a maximum flow-minimum cut argument. To prove the claim we need to show that the support graph of the transition probability matrix PμP_{\mu} satisfies Hall’s condition. This is proved in the following lemma using the negative association property of strongly Rayleigh measures. The proof is simply an extension of the proof of [FM92, Lem 3.1].

Let RμR\sim\mu be a random set. Recall that Ω0={Ssupp{u}:nS}\Omega_{0}=\{S\in\textup{supp}\{u\}:n\notin S\} and Ω1={Ssupp{u}:nS}\Omega_{1}=\{S\in\textup{supp}\{u\}:n\in S\}. Let gg be a random variable indicating whether nRn\in R. Let ff be a indicator random variable which is 11 if there exists TAT\in A such that RT{n}R\supseteq T\setminus\{n\}. It is easy to see that ff and gg are two increasing functions which are supported on two disjoint sets of elements. By the negative association property, Theorem 2.4, we can write

The lemma follows by the fact that the LHS of the above inequality is μ(N(A))μ(Ω0)\frac{\mu(N(A))}{\mu(\Omega_{0})} and the RHS is μ(A)μ(Ω1)\frac{\mu(A)}{\mu(\Omega_{1})}. ∎

Proof of 3.2. Let GG be a bipartite graph on Ω0Ω1\Omega_{0}\cup\Omega_{1} where there is an edge between xΩ1x\in\Omega_{1} and yΩ0y\in\Omega_{0} if P(x,y)>0P(x,y)>0. We prove the lemma by showing there is a unit flow from Ω1\Omega_{1} to Ω0\Omega_{0} such that the amount of the flow going out of any xΩ1x\in\Omega_{1} is μ(x)μ(Ω1)\frac{\mu(x)}{\mu(\Omega_{1})}, and the incoming flow to any yΩ0y\in\Omega_{0} is μ(y)μ(Ω0)\frac{\mu(y)}{\mu(\Omega_{0})}. Then, we simply let w{x,y}w_{\{x,y\}} be the flow on the edge connecting xx to yy.

Add a source ss and a sink tt. For any xΩ1x\in\Omega_{1} add an arc (s,x)(s,x) with capacity cs,x=μ(x)/μ(Ω1)c_{s,x}=\mu(x)/\mu(\Omega_{1}). Similarly, for any yΩ0y\in\Omega_{0} add an arc (y,t)(y,t) with capacity cy,t=μ(y)/μ(Ω0)c_{y,t}=\mu(y)/\mu(\Omega_{0}). Let the capacity of any other edge in the graph be \infty. Since the sum of the capacity of all edges leaving SS is 1, to prove the lemma, it is enough to show that the maximum flow is 1. Equivalently, by the max-flow min-cut theorem, it suffices to show the value of the minimum cut separating ss and tt is at least 11. Let B,BB,\overline{B} be an arbitrary ss-tt cut, i.e., sBs\in B and tBt\in\overline{B}. Let B0=Ω0BB_{0}=\Omega_{0}\cap B and B1=Ω1BB_{1}=\Omega_{1}\cap B. For XΩ1X\subseteq\Omega_{1}, YΩ0Y\subseteq\Omega_{0}, let c(X,Y)=xX,yYcx,yc(X,Y)=\sum_{x\in X,y\in Y}c_{x,y}. We have

where the inequality follows by 3.2. If there is any edge from B1B_{1} to Ω0B0\Omega_{0}\setminus B_{0}, then c(B,B)=c(B,\overline{B})=\infty and we are done. Otherwise, N(B1)B0N(B_{1})\subseteq B_{0}. Therefore, μ(N(B1))μ(B0)\mu(N(B_{1}))\leq\mu(B_{0}), and the RHS of the above inequality is at least 1. So, c(B,B)1c(B,\overline{B})\geq 1 as desired. ∎

References