Monte Carlo Markov Chain Algorithms for Sampling Strongly Rayleigh Distributions and Determinantal Point Processes
Nima Anari, Shayan Oveis Gharan, Alireza Rezaei
Introduction
We assign a multi-affine polynomial with variables to ,
where for a set , . The polynomial is also known as the generating polynomial of . We say is -homogeneous if is a homogeneous polynomial of degree , i.e., if for any , we have .
Strongly Rayleigh distributions are introduced and deeply studied in the work of [BBL09]. These distributions are natural generalizations of determinantal measures and random spanning tree distributions. It is shown in [BBL09] that strongly Rayleigh distributions satisfy the strongest form of negative dependence properties. These negative dependence properties were recently exploited to design approximation algorithms [OSS11, PP14, AO15].
In this paper we show that the “natural” Monte Carlo Markov Chain (MCMC) method on the support of a homogeneous strongly Rayleigh distribution mixes rapidly. Therefore, this Markov Chain can be used to efficiently draw an approximate sample from . Since determinantal point processes are special cases of strongly Rayleigh measures, our result implies that the same Markov chain efficiently generates random samples of a -determinantal point process (see Subsection 1.1 for the details).
In a state , choose an element and uniformly and independently at random, and let , then
If , move to with probability ;
It is easy to see that is reversible and is the stationary distribution of the chain. In addition, Brändén showed that the support of a (homogeneous) strongly Rayleigh distribution is the set of bases of a matroid [Brä07, Cor 3.4]; so is irreducible. Lastly, since we stay in each state with probability at least , is a lazy chain.
If is a random variable sampled according to and , then we say is an -approximate sample of .
For a state and , the total variation mixing time of a chain started at with transition probability matrix and stationary distribution is defined as follows:
where is the distribution of the chain started at at time .
is at least by construction.
Suppose we have access to a set such that . In addition, we are given an oracle such that for any set , it returns if and zero otherwise. Then, by the above theorem we can generate an -approximate sample of with at most oracle calls.
Borcea, Brändén, and Liggett showed that for any strongly Rayleigh distribution , and any integer , is also strongly Rayleigh, [BBL09]. Therefore, if we have access to a set of size , we can use the above theorem to generate an approximate sample of .
where is the principal submatrix of indexed by the elements of .
DPPs are one of the fundamental objects used to study a variety of tasks in machine learning, including text summarization, image search, news threading, etc. For more information about DPPs and their applications we refer to a recent survey by Kulesza and Taskar [KT13].
For an integer , and a DPP , the truncation of to , is called a -DPP. It turns out that the family of determinantal point processes are not closed under truncation. Perhaps, the simplest example is the -uniform distribution over a set of elements. Although the uniform distribution over elements is a DPP, for any , the corresponding -DPP is not a DPP [KT13, Section 5].
In the past, several spectral algorithms were designed for sampling from -DPPs [HKPV06, DR10, KT13], but these algorithms typically need to diagonalize a giant -by- matrix, so they are inefficient in time and memory We remark that the algorithms in [DR10] are almost linear in ; however they need access to the Cholesky decomposition of the ensemble matrix of the underlying DPP.. It was asked by [DR10] to generate random samples of a -DPP using Markov chain techniques. Markov chain techniques are very appealing in this context because of their simplicity and efficiency. There has been several attempts [Kan13, LJS15, RK15] to upper bound the mixing time of the Markov chain for a -DPP ; but, to the best of our knowledge, this question is still openWe remark that [Kan13] claimed to have a proof of the rapid mixing time of a similar Markov chain. As it is pointed out in [RK15] the coupling argument of [Kan13] is ill-defined. To be more precise, the chain specified in Algorithm 1 of [Kan13] may not mix in a polynomial time of . The chain specified in Algorithm 2 of [Kan13] is similar to , but the statement of Theorem 2 which upper bounds its mixing time is clearly incorrect even when ..
Here, we show that for a -DPP , can be used to efficiently generate an approximate sample of . [BBL09] show that any DPP is a strongly Rayleigh distribution. Since strongly Rayleigh distributions are closed under truncation, any -DPP is a strongly Rayleigh distribution. Therefore, by Theorem 1.1, for any -DPP , mixes rapidly to the stationary distribution.
Given access to the ensemble matrix of a -DPP, we can use the above theorem to generate -approximate samples of the -DPP.
Given an ensemble matrix of a -DPP there is an algorithm for any generates an -approximate sample of in time .
To prove the above theorem, we need an efficient algorithm to generate a set such that is bounded away from zero, perhaps by an exponentially small function of . We use the greedy algorithm 1 to find such a set, and we show that, in time , it returns a set such that
Noting that each transition step of the Markov chain only takes time that is polynomial in , this completes the proof of the above theorem.
The maximum volume submatrix problem is NP-hard to approximate within a factor for some constanat [ÇMI13]. Numerous approximation algorithm is given for this problem [ÇMI09, ÇMI13, Nik15]. It was shown in [ÇMI09, Thm 11] that choosing the rows of greedily gives a approximation to the maximum volume submatrix problem. Algorithm 1 is equivalent to the greedy algorithm of [ÇMI09]; it is only described in the language of ensemble matrix . Therefore, it returns a set such that
2 Proof Overview
In the rest of the paper we prove Theorem 1.1. To prove Theorem 1.1 we lower bound the spectral gap, a.k.a. the Poincaré constant of the chain . This directly upper bounds the mixing time in total variation distance. To lower bound the spectral gap, we use an extension of the seminal work of [FM92]. Feder and Mihail showed that the bases exchange graph of the bases of a balanced matroid is an expander. This directly lower bounds the spectral gap by Cheeger’s inequality. A matroid is called balanced if the matroid and all of its minors satisfy the property that, the uniform distribution of the bases is negatively associated (see Subsection 2.2 for the definition of negative association).
Our proof can be seen as a weighted variant of [FM92]. As we mentioned earlier, the support of a homogeneous strongly Rayleigh distribution corresponds to the bases of a matroid. Our proof shows that if a distribution over the bases of a matroid and all of its conditional measures are negatively associated, then the MCMC algorithm mixes rapidly. To show that satisfies the aforementioned property we simply appeal to the negative dependence theory of strongly Rayleigh distributions developed in [BBL09]. Although our proof can be written in the language of [FM92], we work with the more advanced chain decomposition idea of [JSTV04] to prove a tight bound on the Poincaré constant, see Subsection 2.3 for the details.
We remark that, in general, the decomposition idea of [JSTV04] can be also be used to lower bound the log-Sobolev constant. However, it turns out that in our case, the log-Sobolev constant may be no larger than . Since the latter quantity is not necessarily lower-bounded by a function of and , the log-Sobolev constant (and hence, the mixing time) of the chain may be unbounded.
Background
In this section we give a high level overview of Markov chains and their mixing times. We refer the readers for [LPW06, MT06] for details. Let denote the state space, denote the Markov kernel and denote the stationary distribution of a Markov chain. We say a Markov chain is lazy if for any state , .
In particular, . For a function , the Dirichlet form is defined as follows
Next, we overview classical spectral techniques to upper bound the mixing time of Markov chains.
The Poincaré constant of the chain is defined as follows,
where the infimum is over all functions with nonzero variance.
It is easy to see that for any transition probability matrix , the second largest eigenvalue of is . If is a lazy chain, then is also the second largest eigenvalue of in absolute value. In the following fact we see how to calculate the Poincaré constant of any reversible 2-state chain.
The Poincaré constant of any reversible two state chain with and is .
To prove Theorem 1.1 we simply calculate the Poincaré constant of the chain and then we use the following classical theorem of Diaconis and Stroock to upper bound the mixing time.
For any reversible irreducible lazy Markov chain with Poincaré constant , and any state ,
Using the above theorem, to prove Theorem 1.1, it is enough to lower bound the Poincaré constant of .
It is easy to see that Theorem 1.1 follows by the above two theorems.
2 Strongly Rayleigh Measures
Building on [FM92], Borcea, Brändén and Liggett proved that any strongly Rayleigh distribution is negatively associated.
Any strongly Rayleigh probability distribution is negatively associated.
As an example, the above theorem implies that any -DPP is negatively associated. The negative association property is the key to our lower bound on the Poincaré constant of the chain .
For , let be the random variable indicating whether is in a sample of . We use
to denote the conditional measure on sets that contain and
to denote the conditional measure on sets that do not contain . Borcea, Brändén and Ligett showed that strongly Rayleigh distributions are closed under conditioning.
3 Decomposable Markov Chains
In this section we describe the decomposable Markov chain technique due to Jerrum, Son, Tetali and Vigoda [JSTV04]. This will be our main tool to lower bound the Poincaré constant of . Roughly speaking, they consider Markov chains that can be decomposed into “projection” and “restriction” chains. They lower bound the Poincaré constant of the original chain assuming certain properties of these projection/restriction chains.
Let be a decomposition of the state space of a Markov chain into two disjoint setsHere, we only focus on decomposition into two disjoint sets, although the technique of [JSTV04] is more general.. For let
The Markov chain is called a projection chain. Let be the Poincaré constant of this chain.
We can also define a restriction Markov chain on each as follows. For each ,
In other words, for any transition from to a state outside of , we remain in . Observe that in the stationary distribution of the restriction chain, the probability of is proportional to . Let be the Poincaré constant of the chain . Now, we are ready to explain the main result of [JSTV04].
If for any distinct , and any ,
then the Poincaré constant of is at least .
Inductive Argument
In this section we prove Theorem 2.3. Throughout this section we fix a strongly Rayleigh distribution , and we let be the state space and the transition probability matrix of .
It remains to lower bound the Poincaré constant of the projection chain and to prove equation (2.1). Unfortunately, does not satisfy (2.1). So, we use an idea of [JSTV04]. We construct a new Markov kernel such that (i) has the same stationary distribution . (ii) The Poincaré constant of , , lower-bounds . Then, we use Theorem 2.6 to lower bound .
To make sure that satisfies (i), (ii), it is enough that for all distinct states ,
Equation (3.1) implies (i), i.e., that is also the stationary distribution of . By an application of the comparison method [DSC93], (i) together with (3.2) implies (ii), i.e.,
So, to prove the induction step, it is enough to show that
For any and states , .
The Poincaré constant of the chain projected onto is at least ,
For any state and distinct ,
Before, proving the above lemma, we use it to finish the proof of the induction. By part (2), agrees with on the projection chains. Therefore, the Poincaré constants of the chains and are at least . So, by parts (3) and (4) we can invoke Theorem 2.6 for and we get that
This proves (3.4). As we discussed earlier, part (1) implies (3.3) which completes the induction.
In the rest of this section we prove 3.1. Note that the main challenge in proving the lemma is part (4). The transition probability matrix already satisfies parts (1)-(3). The key to prove part (4) is to construct a fractional perfect matching between the states of and , see the following lemma for the formal definition. This idea originally was used in [FM92] and it was later extended in [JS02].
We use the negative association property of the strongly Rayleigh distributions to prove the above lemma. But before that let us prove 3.1.
Proof of 3.1. We use to construct . For any and and , we let
Note that by definition part (2) is satisfied. First we verify part (1). If , then
and if the same identity holds because . This proves (3.1). To see (3.2) note that for any and we have
The first inequality follows by the definition of (see (1.1)), the second inequality follows by the fact that and , and the last inequality follows by the detailed balanced condition. This completes the proof of part (1).
Next, we prove part (3). By definition of , for distinct we have
where the second to last equality follows by (3.5). By 2.1, the Poincaré constant of . This proves part (3).
Finally we prove part (4). Fix distinct and . We have,
where we used (3.5). On the other hand, by definition of we know that
where the second equality follows by (3.5). This completes the proof of part (4) and 3.1. ∎
It remains to prove 3.2. For a set let
To prove 3.2 we use a maximum flow-minimum cut argument. To prove the claim we need to show that the support graph of the transition probability matrix satisfies Hall’s condition. This is proved in the following lemma using the negative association property of strongly Rayleigh measures. The proof is simply an extension of the proof of [FM92, Lem 3.1].
Let be a random set. Recall that and . Let be a random variable indicating whether . Let be a indicator random variable which is if there exists such that . It is easy to see that and are two increasing functions which are supported on two disjoint sets of elements. By the negative association property, Theorem 2.4, we can write
The lemma follows by the fact that the LHS of the above inequality is and the RHS is . ∎
Proof of 3.2. Let be a bipartite graph on where there is an edge between and if . We prove the lemma by showing there is a unit flow from to such that the amount of the flow going out of any is , and the incoming flow to any is . Then, we simply let be the flow on the edge connecting to .
Add a source and a sink . For any add an arc with capacity . Similarly, for any add an arc with capacity . Let the capacity of any other edge in the graph be . Since the sum of the capacity of all edges leaving is 1, to prove the lemma, it is enough to show that the maximum flow is 1. Equivalently, by the max-flow min-cut theorem, it suffices to show the value of the minimum cut separating and is at least . Let be an arbitrary - cut, i.e., and . Let and . For , , let . We have
where the inequality follows by 3.2. If there is any edge from to , then and we are done. Otherwise, . Therefore, , and the RHS of the above inequality is at least 1. So, as desired. ∎