Monte Carlo Markov Chain Algorithms for Sampling Strongly Rayleigh Distributions and Determinantal Point Processes

Nima Anari, Shayan Oveis Gharan, Alireza Rezaei

Introduction

We assign a multi-affine polynomial with variables $z_{1},\dots,z_{n}$ to $\mu$ ,

$$g_{\mu}(z)=\sum_{S\subseteq[n]}\mu(S)\cdot z^{S},$$

where for a set $S\subseteq[n]$ , $z^{S}=\prod_{i\in S}z_{i}$ . The polynomial $g_{\mu}$ is also known as the generating polynomial of $\mu$ . We say $\mu$ is $k$ -homogeneous if $g_{\mu}$ is a homogeneous polynomial of degree $k$ , i.e., if for any $S\in\textup{supp}\{\mu\}$ , we have $|S|=k$ .

Strongly Rayleigh distributions are introduced and deeply studied in the work of [BBL09]. These distributions are natural generalizations of determinantal measures and random spanning tree distributions. It is shown in [BBL09] that strongly Rayleigh distributions satisfy the strongest form of negative dependence properties. These negative dependence properties were recently exploited to design approximation algorithms [OSS11, PP14, AO15].

In this paper we show that the “natural” Monte Carlo Markov Chain (MCMC) method on the support of a homogeneous strongly Rayleigh distribution $\mu$ mixes rapidly. Therefore, this Markov Chain can be used to efficiently draw an approximate sample from $\mu$ . Since determinantal point processes are special cases of strongly Rayleigh measures, our result implies that the same Markov chain efficiently generates random samples of a $k$ -determinantal point process (see Subsection 1.1 for the details).

In a state $S$ , choose an element $i\in S$ and $j\notin S$ uniformly and independently at random, and let $T=S-i+j$ , then

If $T\in\textup{supp}\{\mu\}$ , move to $T$ with probability $\frac{1}{2}\min\{1,\mu(T)/\mu(S)\}$ ; otherwise, stay in $S$ .

It is easy to see that ${\cal M}_{\mu}$ is reversible and $\mu(.)$ is the stationary distribution of the chain. In addition, Brändén showed that the support of a (homogeneous) strongly Rayleigh distribution is the set of bases of a matroid [Brä07, Cor 3.4]; so ${\cal M}_{\mu}$ is irreducible. Lastly, since we stay in each state $S$ with probability at least $1/2$ , ${\cal M}_{\mu}$ is a lazy chain.
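The transition rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `exchange_step` and the oracle `mu` (returning an unnormalized weight, and zero outside the support) are assumptions made for the example, and the ground set is taken to be $\{0,\dots,n-1\}$.

```python
import random

def exchange_step(S, n, mu, rng=random):
    """One step of the lazy exchange chain on subsets of {0, ..., n-1}.

    S  : frozenset, the current state (assumed to lie in the support of mu)
    mu : hypothetical oracle; mu(T) is the (possibly unnormalized) weight
         of T, and 0 when T is outside the support
    """
    inside = sorted(S)
    outside = [j for j in range(n) if j not in S]
    if not inside or not outside:
        return S                     # no exchange is possible
    i = rng.choice(inside)           # element to remove, uniform over S
    j = rng.choice(outside)          # element to add, uniform over the complement
    T = (S - {i}) | {j}
    weight_T = mu(T)
    if weight_T <= 0:                # T is not in the support: stay put
        return S
    accept = 0.5 * min(1.0, weight_T / mu(S))
    return T if rng.random() < accept else S
```

Because only the ratio $\mu(T)/\mu(S)$ enters the acceptance probability, the oracle does not need to know the normalizing constant of $\mu$.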

If $X$ is a random variable sampled according to $\nu$ and $\|\nu-\pi\|_{\operatorname{TV}}\leq\epsilon$ , then we say $X$ is an $\epsilon$ -approximate sample of $\pi$ .

For a state $x\in\Omega$ and $\epsilon>0$ , the total variation mixing time of a chain started at $x$ with transition probability matrix $P$ and stationary distribution $\pi$ is defined as follows:

$$\tau_{x}(\epsilon):=\min\{t:\|P^{t}(x,.)-\pi\|_{\operatorname{TV}}\leq\epsilon\},$$

where $P^{t}(x,.)$ is the distribution of the chain started at $x$ at time $t$ .
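For a chain small enough to write down explicitly, the mixing time can be computed directly by iterating the distribution $P^{t}(x,.)$. The sketch below reads the mixing time as the smallest $t$ at which the total variation distance to $\pi$ drops to at most $\epsilon$; the function names and the `max_steps` cutoff are illustrative assumptions.

```python
def tv_distance(p, q):
    """Total variation distance between two distributions given as lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def mixing_time(P, pi, x, eps, max_steps=10_000):
    """Smallest t with ||P^t(x, .) - pi||_TV <= eps, or None if not reached."""
    n = len(P)
    dist = [1.0 if y == x else 0.0 for y in range(n)]   # point mass at x
    for t in range(max_steps + 1):
        if tv_distance(dist, pi) <= eps:
            return t
        # one step of the chain: dist <- dist * P
        dist = [sum(dist[z] * P[z][y] for z in range(n)) for y in range(n)]
    return None
```

For the lazy two-state chain with `P = [[0.75, 0.25], [0.25, 0.75]]` and uniform `pi`, the distance from state 0 halves at every step, starting from 1/2.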

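For any $k$ -homogeneous strongly Rayleigh distribution $\mu$ on subsets of $[n]$ , any $S\in\textup{supp}\{\mu\}$ and any $\epsilon>0$ , the mixing time of ${\cal M}_{\mu}$ started at $S$ satisfies

$$\tau_{S}(\epsilon)\leq\frac{1}{C_{\mu}}\cdot\log\left(\frac{1}{\epsilon\cdot\mu(S)}\right),$$

where

$$C_{\mu}:=\min_{S,T\in\textup{supp}\{\mu\}:\,|S\triangle T|=2}\max\{P_{\mu}(S,T),P_{\mu}(T,S)\}\tag{1.1}$$

is at least $\frac{1}{2kn}$ by construction.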

Suppose we have access to a set $S\in\textup{supp}\{\mu\}$ such that $\mu(S)\geq\exp(-n)$ . In addition, we are given an oracle such that for any set $T\in{[n]\choose k}$ , it returns $\mu(T)$ if $T\in\textup{supp}\{\mu\}$ and zero otherwise. Then, by the above theorem we can generate an $\epsilon$ -approximate sample of $\mu$ with at most $\operatorname{poly}(n,k,\log(1/\epsilon))$ oracle calls.
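To see where the polynomial bound comes from, here is a short worked calculation. It assumes the theorem's bound has the form $\tau_{S}(\epsilon)\leq\frac{1}{C_{\mu}}\log\frac{1}{\epsilon\cdot\mu(S)}$ with $C_{\mu}\geq\frac{1}{2kn}$ , which is what one obtains by combining a Poincaré constant of at least $C_{\mu}$ with the Diaconis and Stroock bound discussed later; the exact constants are therefore an assumption.

```latex
\tau_{S}(\epsilon)
  \;\leq\; \frac{1}{C_{\mu}}\,\log\frac{1}{\epsilon\cdot\mu(S)}
  \;\leq\; 2kn\cdot\log\frac{e^{n}}{\epsilon}
  \;=\; 2kn\,\bigl(n+\log(1/\epsilon)\bigr).
```

Each step of the chain evaluates $\mu$ only at the current set and at the proposed set, so the total number of oracle calls is within a constant factor of this number of steps.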

Borcea, Brändén, and Liggett showed that for any strongly Rayleigh distribution $\mu$ and any integer $k$ , $\mu_{k}$ is also strongly Rayleigh [BBL09]. Therefore, if we have access to a set $S\subset[n]$ of size $k$ , we can use the above theorem to generate an approximate sample of $\mu_{k}$ .

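A determinantal point process (DPP) on $[n]$ is a probability distribution $\mu$ on subsets of $[n]$ identified by a positive semidefinite ensemble matrix $L\in\mathbb{R}^{n\times n}$ such that for any $S\subseteq[n]$ ,

$$\mu(S)\propto\det(L_{S}),$$

where $L_{S}$ is the principal submatrix of $L$ indexed by the elements of $S$ .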

DPPs are one of the fundamental objects used to study a variety of tasks in machine learning, including text summarization, image search, news threading, etc. For more information about DPPs and their applications we refer to a recent survey by Kulesza and Taskar [KT13].

For an integer $0\leq k\leq n$ and a DPP $\mu$ , the truncation of $\mu$ to $k$ , $\mu_{k}$ , is called a $k$ -DPP. It turns out that the family of determinantal point processes is not closed under truncation. Perhaps the simplest example is the $k$ -uniform distribution over a set of $n$ elements. Although the uniform distribution over $n$ elements is a DPP, for any $2\leq k\leq n-2$ , the corresponding $k$ -DPP is not a DPP [KT13, Section 5].
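For very small $n$ a $k$ -DPP can be tabulated by brute force, which is handy for checking a sampler. The sketch below assumes the standard definition in which a set $S$ of size $k$ has probability proportional to $\det(L_{S})$ ; the helper names `det`, `principal_minor` and `k_dpp` are made up for the example, and the enumeration is exponential in general.

```python
from itertools import combinations

def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [list(map(float, row)) for row in M]
    n, d = len(A), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        if abs(A[p][c]) < 1e-12:
            return 0.0               # numerically singular
        if p != c:
            A[c], A[p] = A[p], A[c]  # a row swap flips the sign
            d = -d
        d *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for cc in range(c, n):
                A[r][cc] -= f * A[c][cc]
    return d

def principal_minor(L, S):
    """det(L_S) for the principal submatrix indexed by S."""
    idx = sorted(S)
    return det([[L[a][b] for b in idx] for a in idx])

def k_dpp(L, k):
    """Exact k-DPP probabilities as a dict {frozenset: probability}."""
    n = len(L)
    weights = {frozenset(S): principal_minor(L, S)
               for S in combinations(range(n), k)}
    Z = sum(weights.values())
    if Z <= 0:
        raise ValueError("no k-subset has a positive principal minor")
    return {S: w / Z for S, w in weights.items() if w > 0}
```

With $L$ equal to the identity matrix every principal minor is 1, so `k_dpp` returns the $k$ -uniform distribution mentioned above.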

In the past, several spectral algorithms were designed for sampling from $k$ -DPPs [HKPV06, DR10, KT13], but these algorithms typically need to diagonalize a giant $n$ -by- $n$ matrix, so they are inefficient in time and memory. (We remark that the algorithms in [DR10] are almost linear in $n$ ; however, they need access to the Cholesky decomposition of the ensemble matrix of the underlying DPP.) [DR10] asked whether random samples of a $k$ -DPP can be generated using Markov chain techniques. Markov chain techniques are very appealing in this context because of their simplicity and efficiency. There have been several attempts [Kan13, LJS15, RK15] to upper bound the mixing time of the Markov chain ${\cal M}_{\mu}$ for a $k$ -DPP $\mu$ , but, to the best of our knowledge, this question is still open. (We remark that [Kan13] claimed to have a proof of the rapid mixing of a similar Markov chain. As pointed out in [RK15], the coupling argument of [Kan13] is ill-defined. To be more precise, the chain specified in Algorithm 1 of [Kan13] may not mix in time polynomial in $n$ . The chain specified in Algorithm 2 of [Kan13] is similar to ${\cal M}_{\mu}$ , but the statement of Theorem 2, which upper bounds its mixing time, is clearly incorrect even when $k=1$ .)

Here, we show that for a $k$ -DPP $\mu$ , ${\cal M}_{\mu}$ can be used to efficiently generate an approximate sample of $\mu$ . [BBL09] show that any DPP is a strongly Rayleigh distribution. Since strongly Rayleigh distributions are closed under truncation, any $k$ -DPP is a strongly Rayleigh distribution. Therefore, by Theorem 1.1, for any $k$ -DPP $\mu$ , ${\cal M}_{\mu}$ mixes rapidly to the stationary distribution.

Given access to the ensemble matrix of a $k$ -DPP, we can use the above theorem to generate $\epsilon$ -approximate samples of the $k$ -DPP.

Given an ensemble matrix $L$ of a $k$ -DPP $\mu$ , there is an algorithm that, for any $\epsilon>0$ , generates an $\epsilon$ -approximate sample of $\mu$ in time $\operatorname{poly}(k)\cdot O(n\log(n/\epsilon))$ .

To prove the above theorem, we need an efficient algorithm to generate a set $S\in\textup{supp}\{\mu\}$ such that $\mu(S)$ is bounded away from zero, perhaps by an exponentially small function of $n,k$ . We use the greedy Algorithm 1 to find such a set, and we show that, in time $O(n)\operatorname{poly}(k)$ , it returns a set $S$ such that

Noting that each transition step of the Markov chain ${\cal M}_{\mu}$ only takes time that is polynomial in $k$ , this completes the proof of the above theorem.
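Algorithm 1 is not reproduced in this text, so the following is only a sketch of one standard way to realize a greedy choice of the starting set: repeatedly add the element $j$ that maximizes $\det(L_{S+j})$ . It uses Schur complement updates instead of recomputing determinants; the name `greedy_start`, the tolerance, and the dense $O(n^{2}k)$ update are assumptions of the example and do not match the running time stated above.

```python
def greedy_start(L, k, tol=1e-12):
    """Greedily pick k indices, each time maximizing det(L_{S+j}).

    Returns (S, det(L_S)).  After each pick, M is the Schur complement of
    L_S in L, so its diagonal entry M[i][i] equals det(L_{S+i}) / det(L_S).
    """
    n = len(L)
    if k > n:
        raise ValueError("k exceeds the size of the ground set")
    M = [list(map(float, row)) for row in L]
    S, det_S = [], 1.0
    for _ in range(k):
        candidates = [j for j in range(n) if j not in S]
        j = max(candidates, key=lambda c: M[c][c])
        pivot = M[j][j]
        if pivot <= tol:
            raise ValueError("greedy step found no positive pivot; "
                             "the rank of L appears to be below k")
        col = [M[r][j] for r in range(n)]
        row = M[j][:]
        for r in range(n):               # rank-one Schur complement update
            for c in range(n):
                M[r][c] -= col[r] * row[c] / pivot
        S.append(j)
        det_S *= pivot
    return frozenset(S), det_S
```

The returned determinant is the unnormalized weight of the starting set, so it can be passed directly to a chain that only uses ratios of weights.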

The maximum volume submatrix problem is NP-hard to approximate within a factor $c^{k}$ for some constant $c>1$ [ÇMI13]. Numerous approximation algorithms have been given for this problem [ÇMI09, ÇMI13, Nik15]. It was shown in [ÇMI09, Thm 11] that choosing the rows of $X$ greedily gives a $k!$ approximation to the maximum volume submatrix problem. Algorithm 1 is equivalent to the greedy algorithm of [ÇMI09]; it is only described in the language of the ensemble matrix $L$ . Therefore, it returns a set $S$ such that

1.2 Proof Overview

In the rest of the paper we prove Theorem 1.1. To prove Theorem 1.1 we lower bound the spectral gap, a.k.a. the Poincaré constant, of the chain ${\cal M}_{\mu}$ . This directly upper bounds the mixing time in total variation distance. To lower bound the spectral gap, we use an extension of the seminal work of [FM92]. Feder and Mihail showed that the bases exchange graph of the bases of a balanced matroid is an expander. This directly lower bounds the spectral gap by Cheeger’s inequality. A matroid is called balanced if the matroid and all of its minors satisfy the property that the uniform distribution over the bases is negatively associated (see Subsection 2.2 for the definition of negative association).

Our proof can be seen as a weighted variant of [FM92]. As we mentioned earlier, the support of a homogeneous strongly Rayleigh distribution corresponds to the bases of a matroid. Our proof shows that if a distribution $\mu$ over the bases of a matroid and all of its conditional measures are negatively associated, then the MCMC algorithm mixes rapidly. To show that $\mu$ satisfies the aforementioned property we simply appeal to the negative dependence theory of strongly Rayleigh distributions developed in [BBL09]. Although our proof can be written in the language of [FM92], we work with the more advanced chain decomposition idea of [JSTV04] to prove a tight bound on the Poincaré constant, see Subsection 2.3 for the details.

We remark that, in general, the decomposition idea of [JSTV04] can also be used to lower bound the log-Sobolev constant. However, it turns out that in our case, the log-Sobolev constant may be no larger than $\frac{1}{-\log(\min_{S\in\textup{supp}\{\mu\}}\mu(S))}$ . Since the latter quantity is not necessarily lower-bounded by a function of $k$ and $n$ , the log-Sobolev constant (and hence, the $L_{2}$ mixing time) of the chain may be unbounded.

Background

In this section we give a high-level overview of Markov chains and their mixing times. We refer the reader to [LPW06, MT06] for details. Let $\Omega$ denote the state space, $P$ denote the Markov kernel and $\pi(.)$ denote the stationary distribution of a Markov chain. We say a Markov chain is lazy if for any state $x\in\Omega$ , $P(x,x)\geq 1/2$ .

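In particular, $\|f\|_{\pi}=\sqrt{\langle f,f\rangle_{\pi}}$ . For a function $f\in L^{2}(\pi)$ , the Dirichlet form ${\cal E}_{\pi}(f,f)$ is defined as follows

$${\cal E}_{\pi}(f,f):=\frac{1}{2}\sum_{x,y\in\Omega}(f(x)-f(y))^{2}P(x,y)\pi(x),$$

and the variance of $f$ is $\operatorname{Var}_{\pi}(f):=\|f-\mathbb{E}_{\pi}f\|_{\pi}^{2}$ .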

Next, we overview classical spectral techniques to upper bound the mixing time of Markov chains.

The Poincaré constant of the chain is defined as follows,

$$\lambda:=\inf_{f}\frac{{\cal E}_{\pi}(f,f)}{\operatorname{Var}_{\pi}(f)},$$

where the infimum is over all functions with nonzero variance.

It is easy to see that for any transition probability matrix $P$ , the second largest eigenvalue of $P$ is $1-\lambda$ . If $P$ is a lazy chain, then $1-\lambda$ is also the second largest eigenvalue of $P$ in absolute value. In the following fact we see how to calculate the Poincaré constant of any reversible 2-state chain.

The Poincaré constant of any reversible two-state chain with $\Omega=\{0,1\}$ and $P(0,1)=c\cdot\pi(1)$ is $c$ .
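This fact can be checked by hand. The derivation below assumes the usual definitions of the variance and of the Dirichlet form, ${\cal E}_{\pi}(f,f)=\frac{1}{2}\sum_{x,y}(f(x)-f(y))^{2}P(x,y)\pi(x)$ , and uses reversibility in the form $\pi(1)P(1,0)=\pi(0)P(0,1)$ . On two states every non-constant $f$ gives the same ratio, so the infimum is attained by any of them.

```latex
\begin{aligned}
\operatorname{Var}_{\pi}(f)
  &= \pi(0)\pi(1)\,\bigl(f(0)-f(1)\bigr)^{2},\\
{\cal E}_{\pi}(f,f)
  &= \tfrac{1}{2}\bigl(\pi(0)P(0,1)+\pi(1)P(1,0)\bigr)\bigl(f(0)-f(1)\bigr)^{2}
   = \pi(0)P(0,1)\,\bigl(f(0)-f(1)\bigr)^{2}
   = c\,\pi(0)\pi(1)\,\bigl(f(0)-f(1)\bigr)^{2},\\
\frac{{\cal E}_{\pi}(f,f)}{\operatorname{Var}_{\pi}(f)}
  &= c .
\end{aligned}
```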

To prove Theorem 1.1 we simply calculate the Poincaré constant of the chain ${\cal M}_{\mu}$ and then we use the following classical theorem of Diaconis and Stroock to upper bound the mixing time.

For any reversible irreducible lazy Markov chain $(\Omega,P,\pi)$ with Poincaré constant $\lambda$ , $\epsilon>0$ and any state $x\in\Omega$ ,

$$\tau_{x}(\epsilon)\leq\frac{1}{\lambda}\cdot\log\left(\frac{1}{\epsilon\cdot\pi(x)}\right).$$

Using the above theorem, to prove Theorem 1.1, it is enough to lower bound the Poincaré constant of ${\cal M}_{\mu}$ .

It is easy to see that Theorem 1.1 follows by the above two theorems.

2.2 Strongly Rayleigh Measures

Building on [FM92], Borcea, Brändén and Liggett proved that any strongly Rayleigh distribution is negatively associated.

Any strongly Rayleigh probability distribution is negatively associated.

As an example, the above theorem implies that any $k$ -DPP is negatively associated. The negative association property is the key to our lower bound on the Poincaré constant of the chain ${\cal M}_{\mu}$ .

For $1\leq i\leq n$ , let $Y_{i}$ be the random variable indicating whether $i$ is in a sample of $\mu$ . We use

to denote the conditional measure on sets that contain $i$ and

to denote the conditional measure on sets that do not contain $i$ . Borcea, Brändén and Liggett showed that strongly Rayleigh distributions are closed under conditioning.

2.3 Decomposable Markov Chains

In this section we describe the decomposable Markov chain technique due to Jerrum, Son, Tetali and Vigoda [JSTV04]. This will be our main tool to lower bound the Poincaré constant of ${\cal M}_{\mu}$ . Roughly speaking, they consider Markov chains that can be decomposed into “projection” and “restriction” chains. They lower bound the Poincaré constant of the original chain assuming certain properties of these projection/restriction chains.

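Let $\Omega_{0}\cup\Omega_{1}$ be a decomposition of the state space of a Markov chain $(\Omega,P,\pi)$ into two disjoint sets. (Here, we only focus on decomposition into two disjoint sets, although the technique of [JSTV04] is more general.) For $i\in\{0,1\}$ let

$$\bar{\pi}(i)=\sum_{x\in\Omega_{i}}\pi(x),$$

and let $\bar{P}\in\mathbb{R}^{2\times 2}$ be

$$\bar{P}(i,j)=\bar{\pi}(i)^{-1}\sum_{x\in\Omega_{i},y\in\Omega_{j}}\pi(x)P(x,y).$$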

The Markov chain $(\{0,1\},\bar{P},\bar{\pi})$ is called a projection chain. Let $\bar{\lambda}$ be the Poincaré constant of this chain.

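We can also define a restriction Markov chain on each $\Omega_{i}$ as follows. For each $i\in\{0,1\}$ and $x,y\in\Omega_{i}$ ,

$$P_{i}(x,y)=\begin{cases}P(x,y)&\text{if }x\neq y,\\ 1-\sum_{z\in\Omega_{i}\setminus\{x\}}P(x,z)&\text{if }x=y.\end{cases}$$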

In other words, for any transition from $x$ to a state outside of $\Omega_{i}$ , we remain in $x$ . Observe that in the stationary distribution of the restriction chain, the probability of $x$ is proportional to $\pi(x)$ . Let $\lambda_{i}$ be the Poincaré constant of the chain $(\Omega_{i},P_{i},.)$ . Now, we are ready to explain the main result of [JSTV04].
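The two constructions can be computed mechanically for a small explicit chain. The sketch below assumes the usual form of the projection chain from [JSTV04], in which $\bar{\pi}(i)$ is the total mass of $\Omega_{i}$ and $\bar{P}(i,j)$ is the $\pi$ -weighted average probability of moving from $\Omega_{i}$ to $\Omega_{j}$ ; the function name `decompose` and the list-of-rows matrix format are choices made for the example.

```python
def decompose(P, pi, blocks):
    """Projection and restriction chains for a partition of the states.

    P      : transition matrix as a list of rows
    pi     : stationary distribution as a list
    blocks : disjoint lists of state indices that together cover all states
    Returns (pi_bar, P_bar, restrictions).
    """
    n = len(P)
    pi_bar = [sum(pi[x] for x in B) for B in blocks]
    # P_bar(i, j): pi-weighted average probability of moving from block i to block j
    P_bar = [[sum(pi[x] * P[x][y] for x in Bi for y in Bj) / pi_bar[i]
              for Bj in blocks]
             for i, Bi in enumerate(blocks)]
    restrictions = []
    for B in blocks:
        R = [[P[x][y] for y in B] for x in B]
        for a, x in enumerate(B):
            # moves that would leave the block become self-loops
            R[a][a] += sum(P[x][y] for y in range(n) if y not in B)
        restrictions.append(R)
    return pi_bar, P_bar, restrictions
```

For the lazy random walk on a 4-cycle split into the blocks `[0, 1]` and `[2, 3]`, each restriction chain is a lazy two-state chain and the projection chain moves between the blocks with probability 1/4.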

If for any distinct $i,j\in\{0,1\}$ , and any $x\in\Omega_{i}$ ,

$$\bar{P}(i,j)=\sum_{y\in\Omega_{j}}P(x,y),\tag{2.1}$$

then the Poincaré constant of $(\Omega,P,\pi)$ is at least $\min\{\bar{\lambda},\lambda_{0},\lambda_{1}\}$ .

Inductive Argument

In this section we prove Theorem 2.3. Throughout this section we fix a strongly Rayleigh distribution $\mu$ , and we let $\Omega,P$ be the state space and the transition probability matrix of ${\cal M}_{\mu}$ .

It remains to lower bound the Poincaré constant of the projection chain and to prove equation (2.1). Unfortunately, $P$ does not satisfy (2.1). So, we use an idea of [JSTV04]. We construct a new Markov kernel $\hat{P}$ such that (i) $\hat{P}$ has the same stationary distribution $\mu$ . (ii) The Poincaré constant of $\hat{P}$ , $\hat{\lambda}$ , lower-bounds $\lambda$ . Then, we use Theorem 2.6 to lower bound $\hat{\lambda}$ .

To make sure that $\hat{P}$ satisfies (i), (ii), it is enough that for all distinct states $x,y\in\Omega$ ,

Equation (3.1) implies (i), i.e., that $\mu$ is also the stationary distribution of $\hat{P}$ . By an application of the comparison method [DSC93], (i) together with (3.2) implies (ii), i.e.,

So, to prove the induction step, it is enough to show that

For any $i\in\{0,1\}$ and states $x,y\in\Omega_{i}$ , $\hat{P}(x,y)=P(x,y)$ .

The Poincaré constant $\bar{\hat{\lambda}}$ of the chain $(\Omega,\hat{P},\mu)$ projected onto $\Omega_{0},\Omega_{1}$ satisfies $\bar{\hat{\lambda}}\geq C_{\mu}$ ,

For any state $S\in\textup{supp}\{\mu\}$ and distinct $i,j\in\{0,1\}$ ,

Before proving the above lemma, we use it to finish the proof of the induction. By part (2), $\hat{P}$ agrees with $P$ on the restriction chains. Therefore, the Poincaré constants of the chains $(\Omega_{0},\hat{P}_{0},.)$ and $(\Omega_{1},\hat{P}_{1},.)$ are $\lambda_{0},\lambda_{1}\geq C_{\mu}$ . So, by parts (3) and (4) we can invoke Theorem 2.6 for $\hat{P}$ and we get that

This proves (3.4). As we discussed earlier, part (1) implies (3.3) which completes the induction.

In the rest of this section we prove Lemma 3.1. Note that the main challenge in proving the lemma is part (4). The transition probability matrix $P$ already satisfies parts (1)-(3). The key to proving part (4) is to construct a fractional perfect matching between the states of $\Omega_{0}$ and $\Omega_{1}$ ; see the following lemma for the formal definition. This idea was originally used in [FM92] and was later extended in [JS02].

We use the negative association property of strongly Rayleigh distributions to prove the above lemma. But before that, let us prove Lemma 3.1.

Proof of Lemma 3.1. We use $w$ to construct $\hat{P}$ . For any $i,j\in\{0,1\}$ and $x\in\Omega_{i}$ and $y\in\Omega_{j}$ , we let

Note that by definition part (2) is satisfied. First we verify part (1). If $i\neq j$ , then

and if $i=j$ the same identity holds because $\hat{P}(x,y)=P(x,y)$ . This proves (3.1). To see (3.2) note that for any $i\neq j$ and $x\in\Omega_{i},y\in\Omega_{j}$ we have

The first inequality follows by the definition of $C_{\mu}$ (see (1.1)), the second inequality follows by the fact that $w_{\{x,y\}}\leq\frac{\mu(x)}{\mu(\Omega_{i})}$ and $w_{\{x,y\}}\leq\frac{\mu(y)}{\mu(\Omega_{j})}$ , and the last inequality follows by the detailed balance condition. This completes the proof of part (1).

Next, we prove part (3). By definition of $\hat{P}$ , for distinct $i,j\in\{0,1\}$ we have

where the second to last equality follows by (3.5). By Fact 2.1, the Poincaré constant of $\bar{\hat{P}}$ is $C_{\mu}$ . This proves part (3).

Finally we prove part (4). Fix distinct $i,j\in\{0,1\}$ and $z\in\Omega_{i}$ . We have,

where we used (3.5). On the other hand, by definition of $\hat{P}$ we know that

where the second equality follows by (3.5). This completes the proof of part (4) and of Lemma 3.1. ∎

It remains to prove Lemma 3.2. For a set $A\subseteq\Omega$ let

To prove Lemma 3.2 we use a maximum flow-minimum cut argument. To prove the claim we need to show that the support graph of the transition probability matrix $P_{\mu}$ satisfies Hall’s condition. This is proved in the following lemma using the negative association property of strongly Rayleigh measures. The proof is simply an extension of the proof of [FM92, Lem 3.1].

Let $R\sim\mu$ be a random set. Recall that $\Omega_{0}=\{S\in\textup{supp}\{\mu\}:n\notin S\}$ and $\Omega_{1}=\{S\in\textup{supp}\{\mu\}:n\in S\}$ . Let $g$ be a random variable indicating whether $n\in R$ . Let $f$ be an indicator random variable which is $1$ if there exists $T\in A$ such that $R\supseteq T\setminus\{n\}$ . It is easy to see that $f$ and $g$ are two increasing functions which are supported on two disjoint sets of elements. By the negative association property, Theorem 2.4, we can write

The lemma follows by the fact that the LHS of the above inequality is $\frac{\mu(N(A))}{\mu(\Omega_{0})}$ and the RHS is $\frac{\mu(A)}{\mu(\Omega_{1})}$ . ∎

Proof of Lemma 3.2. Let $G$ be a bipartite graph on $\Omega_{0}\cup\Omega_{1}$ where there is an edge between $x\in\Omega_{1}$ and $y\in\Omega_{0}$ if $P(x,y)>0$ . We prove the lemma by showing there is a unit flow from $\Omega_{1}$ to $\Omega_{0}$ such that the amount of the flow going out of any $x\in\Omega_{1}$ is $\frac{\mu(x)}{\mu(\Omega_{1})}$ , and the incoming flow to any $y\in\Omega_{0}$ is $\frac{\mu(y)}{\mu(\Omega_{0})}$ . Then, we simply let $w_{\{x,y\}}$ be the flow on the edge connecting $x$ to $y$ .

Add a source $s$ and a sink $t$ . For any $x\in\Omega_{1}$ add an arc $(s,x)$ with capacity $c_{s,x}=\mu(x)/\mu(\Omega_{1})$ . Similarly, for any $y\in\Omega_{0}$ add an arc $(y,t)$ with capacity $c_{y,t}=\mu(y)/\mu(\Omega_{0})$ . Let the capacity of any other edge in the graph be $\infty$ . Since the sum of the capacities of all arcs leaving $s$ is 1, to prove the lemma, it is enough to show that the maximum flow is 1. Equivalently, by the max-flow min-cut theorem, it suffices to show the value of the minimum cut separating $s$ and $t$ is at least $1$ . Let $B,\overline{B}$ be an arbitrary $s$ - $t$ cut, i.e., $s\in B$ and $t\in\overline{B}$ . Let $B_{0}=\Omega_{0}\cap B$ and $B_{1}=\Omega_{1}\cap B$ . For $X\subseteq\Omega_{1}$ , $Y\subseteq\Omega_{0}$ , let $c(X,Y)=\sum_{x\in X,y\in Y}c_{x,y}$ . We have