Fast Simulation of Hyperplane-Truncated Multivariate Normal Distributions

Yulai Cong, Bo Chen, Mingyuan Zhou

1 Introduction

We investigate the problem of simulation from a multivariate normal (MVN) distribution whose samples are restricted to the intersection of a set of hyperplanes, which is shown to be inherently related to the simulation of a conditional distribution of a MVN distribution. A naive approach, which linearly transforms a random variable drawn from the conditional distribution of a related MVN distribution, requires a large number of intermediate variables that are often computationally expensive to instantiate. To address this issue, we propose a fast and exact simulation algorithm that directly projects a MVN random variable onto the intersection of a set of hyperplanes. We further show that sampling from a MVN distribution, whose covariance (precision) matrix can be decomposed as the sum (difference) of a positive-definite matrix, whose inversion is known or easy to compute, and a low-rank symmetric matrix, may also be made significantly fast by exploiting this newly proposed stimulation algorithm for hyperplane-truncated MVN distributions, avoiding the need of Cholesky decomposition that has a computational complexity of O(k3)O(k^{3}) (Golub and Van Loan, 2012), where kk is the dimension of the MVN random variable.

2 Hyperplane-truncated and conditional MVNs

For the problem under study, we express a kk-dimensional MVN distribution truncated on the intersection of k2<kk_{2}<k hyperplanes as

and \mboxRank(G)=k2\mbox{Rank}({\bf G})=k_{2}. The probability density function can be expressed as

where ZZ is a constant ensuring p(xμ,Σ,G,r)dx=1\int p(\boldsymbol{x}\,|\,\boldsymbol{\mu},\boldsymbol{\Sigma},{\bf G},\boldsymbol{r})d\boldsymbol{x}=1, and δ(x)=1\delta(x)=1 if the condition xx is satisfied and δ(x)=0\delta(x)=0 otherwise. Let us partition G{\bf G}, x\boldsymbol{x}, μ\boldsymbol{\mu}, Σ\boldsymbol{\Sigma}, and Λ=Σ1\boldsymbol{\Lambda}=\boldsymbol{\Sigma}^{-1} as

respectively, where k=k1+k2k=k_{1}+k_{2}, Σ21=Σ12T\boldsymbol{\Sigma}_{21}=\boldsymbol{\Sigma}_{12}^{T}, and Λ21=Λ12T\boldsymbol{\Lambda}_{21}=\boldsymbol{\Lambda}_{12}^{T}.

A special case that frequently arises in real applications is when G1=0k2×k1{\bf G}_{1}=\mathbf{0}_{k_{2}\times k_{1}} and G2=Ik2{\bf G}_{2}={\bf I}_{k_{2}}, which means (0k2×k1,Ik2)x=x2=r(\mathbf{0}_{k_{2}\times k_{1}},{\bf I}_{k_{2}})\boldsymbol{x}=\boldsymbol{x}_{2}=\boldsymbol{r} and the need is to simulate x1\boldsymbol{x}_{1} given x2=r\boldsymbol{x}_{2}=\boldsymbol{r}. For a MVN random variable xN(μ,Σ)\boldsymbol{x}\sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma}), it is well known, e.g.e.g., in Tong (2012), that the conditional distribution of x1\boldsymbol{x}_{1} given x2=r\boldsymbol{x}_{2}=\boldsymbol{r}, i.e.i.e., the distribution of x\boldsymbol{x} restricted to S={x:(0k2×k1,Ik2)x=r}\mathcal{S}=\{\boldsymbol{x}:(\mathbf{0}_{k_{2}\times k_{1}},{\bf I}_{k_{2}})\boldsymbol{x}=\boldsymbol{r}\}, can be expressed as

Alternatively, applying the Woodbury matrix identity to relate the entries of the covariance matrix Σ\boldsymbol{\Sigma} to those of the precision matrix Λ\boldsymbol{\Lambda}, one may obtain the following equivalent expression as

More specifically, denoting Λ=[H1Σ(H1)T]1=HTΣ1H\boldsymbol{\Lambda}=[{\bf H}^{-1}\boldsymbol{\Sigma}({\bf H}^{-1})^{T}]^{-1}={\bf H}^{T}\boldsymbol{\Sigma}^{-1}{\bf H} as the precision matrix for z\boldsymbol{z}, we have

and hence x\boldsymbol{x} truncated on S\mathcal{S} can be naively generated using the following algorithm, whose computational complexity is described in Table 1 of the Appendix.

For illustration, we consider a simple 2-dimensional example with μ=(1,1.2)T\boldsymbol{\mu}=(1,1.2)^{T}, Σ=[(1,0.3)T,(0.3,1)T]\boldsymbol{\Sigma}=[(1,0.3)^{T},(0.3,1)^{T}], G=(1,1){\bf G}=(1,1), and r=1\boldsymbol{r}=1. If we choose H1=(0.7071,0.7071)T{\bf H}_{1}=(-0.7071,0.7071)^{T} and H2=(1.3,1.3)T{\bf H}_{2}=(1.3,1.3)^{T}, then we have z2=(GH2)1r=(2.6)1=0.3846z_{2}=({\bf G}{\bf H}_{2})^{-1}\boldsymbol{r}=(2.6)^{-1}=0.3846, Λ11=1.4285\boldsymbol{\Lambda}_{11}=1.4285, and Λ12=0\boldsymbol{\Lambda}_{12}=0; as shown in Figure 1, we may generate x\boldsymbol{x} using

where z1N(0.1414,0.7)z_{1}\sim\mathcal{N}(0.1414,0.7) and z2=0.3846z_{2}=0.3846.

For high dimensional problems, however, Algorithm 1 in general requires a large number of intermediate variables that could be computationally expensive to compute. In the following discussion, we will show how to completely avoid instantiating these intermediate variables.

3 Fast and exact simulation of MVN distributions

Instead of using Algorithm 1, we first provide a theorem to show how to efficiently and exactly simulate from a hyperplane-truncated MVN distribution. In the Appendix, we provide two different proofs. The first proof facilitates the derivations by employing an existing algorithm of Hoffman and Ribak (1991) and Doucet (2010), which describes how to simulate from the conditional distribution of a MVN distribution shown in (4) without computing Σ11Σ12Σ221Σ21\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} and its Cholesky decomposition. Note it is straightforward to verify that the algorithm in Hoffman and Ribak (1991) and Doucet (2010), as shown in the Appendix, can be considered as a special case of the proposed algorithm with G=[0,I]{\bf G}=[\boldsymbol{0},{\bf I}].

The above algorithm and theorem, whose computational complexity is described in Table 2 of the Appendix, show that one may draw y\boldsymbol{y} from the unconstrained MVN as yN(μ,Σ)\boldsymbol{y}\sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma}) and directly map it to a vector x\boldsymbol{x} on the intersection of hyperplanes using x=ΣGT(GΣGT)1r+[IΣGT(GΣGT)1G]y.\boldsymbol{x}=\boldsymbol{\Sigma}{\bf G}^{T}({\bf G}\boldsymbol{\Sigma}{\bf G}^{T})^{-1}\boldsymbol{r}+\left[{\bf I}-\boldsymbol{\Sigma}{\bf G}^{T}({\bf G}\boldsymbol{\Sigma}{\bf G}^{T})^{-1}{\bf G}\right]\boldsymbol{y}. For illustration, with the same μ\boldsymbol{\mu}, Σ\boldsymbol{\Sigma}, G{\bf G}, and r\boldsymbol{r} as those in Figure 1, we show in Figure 2 a simple two dimensional example, where the unrestricted Gaussian distribution N(μ,Σ)\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma}) is represented with a set of ellipses, and the constrained sample space S\mathcal{S} is represented as a straight line in the two-dimensional setting. With ΣGT(GΣGT)1r=(0.5,0.5)T\boldsymbol{\Sigma}{\bf G}^{T}({\bf G}\boldsymbol{\Sigma}{\bf G}^{T})^{-1}\boldsymbol{r}=(0.5,0.5)^{T}, [IΣGT(GΣGT)1G]=[(0.5,0.5)T,(0.5,0.5)T]\left[{\bf I}-\boldsymbol{\Sigma}{\bf G}^{T}({\bf G}\boldsymbol{\Sigma}{\bf G}^{T})^{-1}{\bf G}\right]=[(0.5,-0.5)^{T},(-0.5,0.5)^{T}], one may directly maps a sample yN(μ,Σ)\boldsymbol{y}\sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma}) to a vector on the constrained space. For example, if y=(1,2)T\boldsymbol{y}=(1,2)^{T}, then it would be mapped to x=(0,1)T\boldsymbol{x}=(0,1)^{T} on the straight line.

For fast simulation of MVN distributions with structured covariance or precision matrices, our idea is to relate them to higher-dimensional hyperplane-truncated MVN distributions, with block-diagonal covariance matrices, that can be efficiently simulated with Algorithm 2. We first introduce an efficient algorithm for the simulation of a MVN distribution, whose covariance matrix is a positive-definite matrix subtracted by a low-rank symmetric matrix. Such kind of covariance matrices commonly arise in the conditional distributions of MVN distributions, as shown in (4). We then further extend this algorithm to the simulation of a MVN distribution whose precision (inverse covariance) matrix is the sum of a positive-definite matrix and a low-rank symmetric matrix. Such kind of precision matrices commonly arise in the conditional posterior distributions of the regression coefficients in both linear regression and generalized linear models.

The probability density function (PDF) of the MVN distribution

is the same as the PDF of the marginal distribution of x1=(x1,,xk1)T\boldsymbol{x}_{1}=(x_{1},\ldots,x_{k_{1}})^{T} in x=(x1T,xk1+1,,xk)T\boldsymbol{x}=(\boldsymbol{x}_{1}^{T},x_{k_{1}+1},\ldots,x_{k})^{T}, whose PDF is expressed as

is the same as the PDF of the marginal distribution of x1\boldsymbol{x}_{1} in x=(x1T,xk1+1,,xk)T\boldsymbol{x}=(\boldsymbol{x}_{1}^{T},x_{k_{1}+1},\ldots,x_{k})^{T}, whose PDF is expressed as

where x2=(xk1+1,,xk)T\boldsymbol{x}_{2}=(x_{k_{1}+1},\ldots,x_{k})^{T}, ZZ is a normalization constant, and

Further applying Theorem 1 to Corollary 3, as described in detail in the Appendix, a MVN random variable x\boldsymbol{x} with a structured covariance matrix can be generated as in Algorithm 3, where there is no need to compute Σ11Σ12Σ221Σ21\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} and its Cholesky decomposition. Suppose the covariance matrix Σ11\boldsymbol{\Sigma}_{11} admits some special structure that makes it easy to invert and computationally efficient to simulate from N(0,Σ11)\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}_{11}), then Algorithm 3 could lead to a significant saving in computation if k2k1k_{2}\ll k_{1}. On the other hand, when k2k1k_{2}\gg k_{1} and Σ22Σ21Σ111Σ12\boldsymbol{\Sigma}_{22}-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12} admits no special structures, Algorithm 3 may not bring any computational advantage and hence one may resort to the naive Cholesky decomposition based procedure. Detailed computational complexity analyses for both methods are provided in Tables 3 and 4 of the Appendix, respectively.

A random variable simulated with Algorithm 3 is distributed as x1N(μ1,Σ11Σ12Σ221Σ21).\boldsymbol{x}_{1}\sim\mathcal{N}(\boldsymbol{\mu}_{1},\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}).

The random variable obtained with Algorithm 4 is distributed as βN(μβ,Σβ)\boldsymbol{\beta}\sim\mathcal{N}(\boldsymbol{\mu}_{\beta},\boldsymbol{\Sigma}_{\beta}), where Σβ=(A+ΦTΩΦ)1\boldsymbol{\Sigma}_{\beta}=({\bf A}+\boldsymbol{\Phi}^{T}\boldsymbol{\Omega}\boldsymbol{\Phi})^{-1}.

4 Illustrations

Below we provide several examples to illustrate Theorem 1, which shows how to efficiently simulate from a hyperplane-truncated MVN distribution, and Corollary 4 (Corollary 5), which shows how to efficiently simulate from a MVN distribution with a structured covariance (precision) matrix. We run all our experiments on a 2.9 GHz computer.

We first compare Algorithms 1 and 2, whose generated random samples follow the same distribution, as suggested by Theorem 1, to highlight the advantages of Algorithm 2 over Algorithm 1. We then employ Algorithm 2 for a real application whose data dimension is high and sample size is large.

First, to verify Theorem 1, we conduct an experiment with k=5000k=5000 data dimension, k2=20k_{2}=20 hyperplanes, and a diagonal Σ\boldsymbol{\Sigma}. Contour plots of two randomly selected dimensions of the 10,000 random samples simulated with Algorithms 1 and 2 are shown in the top and bottom rows of Figure 3, respectively. The clear matches between the contour plots of these two different algorithms suggest the correctness of Theorem 1.

To demonstrate the efficiency of Algorithm 2, we first carry out a series of experiments with the number of hyperplane constraints fixed at k2=20k_{2}=20 and the data dimension increased from k=50k=50 to k=5000k=5000. The computation time of simulating 10,000 samples averaged over five random trials is shown in Figure 4 for non-diagonal Σ\boldsymbol{\Sigma}’s and in Figure 4 for diagonal ones. It is clear that, when the data dimension kk is high, Algorithm 2 has a clear advantage over Algorithm 1 by avoiding computing unnecessary intermediate variables, which is especially evident when Σ\boldsymbol{\Sigma} is diagonal. We then carry out a series of experiments where we vary not only kk, but also k2k_{2} from 0.1k0.1k to 0.9k0.9k for each kk. As shown in Figure 4, it is evident that Algorithm 2 dominates Algorithm 1 in all scenarios, which can be explained by the fact that Algorithm 2 needs to compute much fewer intermediate variables. Also observed is that a larger k2k_{2} leads to slower simulation for both algorithms, but to a much lesser extent for Algorithm 2. Moreover, the curvatures of those curves indicate that Algorithm 2 is more practical in a high dimensional setting. Note that since Algorithm 2 can naturally exploit the structure of the covariance matrix Σ\boldsymbol{\Sigma} for fast simulation, it is clearly more capable of benefiting from having a diagonal or block-diagonal Σ\boldsymbol{\Sigma}, demonstrated by comparing Figures 4 and 4 with Figures 4 and 4. All these observations agree with our computational complexity analyses for Algorithms 1 and 2, as shown in Table 1 and 2 of the Appendix, respectively.

A practical application of Algorithm 2

More specifically, we focus on a big data setting in which the globally shared simplex-constrained model parameters could be linked to some latent counts via the multinomial likelihood. When there are tens of thousands or millions of observations in the dataset, scalable Bayesian inference for the simplex-constrained globally shared model parameters is highly desired, for example, for inferring the topics’ distributions over words in latent Dirichlet allocation (Blei et al., 2003; Hoffman et al., 2010) and Poisson factor analysis (Zhou et al., 2012, 2016).

However, this batch-learning inference procedure would become inefficient and even impractical when the dataset size NN grows to a level that makes it too time consuming to finish even a single iteration of updating all local variables nvjκn_{vj\kappa}. To address this issue, we consider constructing a mini-batch based Bayesian inference procedure that could make substantial progress in posterior simulation while the batch-learning one may still be waiting to finish a single iteration.

Instead of waiting for all nj\boldsymbol{n}_{j} to be updated before performing a single update of ϕ\boldsymbol{\phi}, we develop a mini-batch based Bayesian inference algorithm under a general framework for constructing stochastic gradient Markov chain Monte Carlo (SG-MCMC) (Ma et al., 2015), allowing ϕ\boldsymbol{\phi} to be updated every time a mini-batch of nj\boldsymbol{n}_{j} are processed. For the sake of completeness, we concisely describe the derivation for a SG-MCMC algorithm, as outlined below, for simplex-constrained globally shared model parameters. We refer the readers to Cong et al. (2017) for more details on the derivation and its application to scalable inference for topic modeling.

It is clear that (16) corresponds to simulation of a V1V-1 dimensional truncated MVN distribution with VV inequality constraints. Since the number of constraints is larger than the dimension, previously proposed iterative simulation methods such as the one in Botev (2016) are often inappropriate. Note that, by omitting the non-negative constraints, the update in (17) corresponds to simulation of a hyperplane-truncated MVN simulation with a diagonal covariance matrix, which can be efficiently sampled as described in the following example.

Example 1: Simulation of a hyperplane-truncated MVN distribution as

Return x=y+(11Ty)ϕ\boldsymbol{x}=\boldsymbol{y}+(1-\mathbf{1}^{T}\boldsymbol{y})\boldsymbol{\phi}.

Example 2: Simulation from (16) can be approximately but rapidly realized as

Calculate z=y+(11Ty)ϕt\boldsymbol{z}=\boldsymbol{y}+(1-\mathbf{1}^{T}\boldsymbol{y})\boldsymbol{\phi}_{t};

For comparison, we choose the same SG-MCMC inference procedure but consider simulating (16), as performed every time a mini-batch of data samples are provided, either as in Example 2 or with the Gibbs sampler of Rodriguez-Yam et al. (2004). Simulating (16) with the Gibbs sampler of Rodriguez-Yam et al. (2004) is realized by updating all the VV dimensions, one dimension at a time, in each Gibbs sampling iteration. We set the total number of Gibbs sampling iterations for (16) in each mini-batch based update as 1, 5, or 10. Note that in practice, the nj\boldsymbol{n}_{j} belonging to the current mini-batch are often latent and are updated conditioning on the data samples in the mini-batch and ϕ\boldsymbol{\phi}. For simplicity, all nj\boldsymbol{n}_{j} here are simulated once and then fixed.

Using ϕpost=(j=1Nnj+η)/(j=1Nnj+ηV)\boldsymbol{\phi}_{post}^{*}=(\sum_{j=1}^{N}\boldsymbol{n}_{j}+\eta)/(\sum_{j=1}^{N}n_{\boldsymbol{\cdot}j}+\eta V), the posterior mean of ϕ\boldsymbol{\phi} in a batch-learning setting, as the reference, we show in Figure 5 how the residual errors for the estimated ϕ\boldsymbol{\phi}^{*}, defined as ϕϕ2\left\|\boldsymbol{\phi}^{*}-\boldsymbol{\phi}\right\|_{2}, change both as a function of the number of processed mini-batches and as a function of computation time under various settings of the mini-batch based SG-MCMC algorithm. The curves shown in Figure 5 suggest that for each mini-batch, to simulate (16) with the Gibbs sampler of Rodriguez-Yam et al. (2004), it is necessary to have more than one Gibbs sampling iteration to achieve satisfactory results. It is clear from Figure 5 that the Gibbs sampler with 5 or 10 iterations for each mini-batch, even though each mini-batch has only 10 data samples, provides residual errors that quickly approach that of the batch posterior mean with a tiny gap, indicating the effectiveness of the SG-MCMC updating in (16). While simulating (16) with Gibbs sampling could in theory lead to unbiased samples if the number of Gibbs sampling iterations is large enough, it is much more efficient to simulate (16) with the procedure described in Example 2, which provides a performance that is undistinguishable from those of the Gibbs sampler with as many as 5 or 10 iterations for each mini-batch, but at the expense of a tiny fraction of a single Gibbs sampling iteration.

4.2 Simulation of MVNs with structured covariance matrices

To illustrate Corollary 4, we mimic the truncated MVN simulation in (16) and present the following simulation example with a structured covariance matrix.

Example 3: Simulation of a MVN distribution as

Return x1=μ1+y1(1Ty1+ay2)ϕ1\boldsymbol{x}_{1}=\boldsymbol{\mu}_{1}+\boldsymbol{y}_{1}-(\mathbf{1}^{T}\boldsymbol{y}_{1}+a\boldsymbol{y}_{2})\boldsymbol{\phi}_{1}.

Denoting x=(x1T,xk)T\boldsymbol{x}=(\boldsymbol{x}_{1}^{T},x_{k})^{T}, ϕ=(ϕ1T,ϕk)T\boldsymbol{\phi}=(\boldsymbol{\phi}_{1}^{T},\phi_{k})^{T}, μ=(μ1T,μk)T\boldsymbol{\mu}=(\boldsymbol{\mu}_{1}^{T},\mu_{k})^{T}, and μk=11Tμ1\mu_{k}=1-\mathbf{1}^{T}\boldsymbol{\mu}_{1}, the above sampling steps can also be equivalently expressed as follows.

Return x1=y1+(11Ty)ϕ1\boldsymbol{x}_{1}=\boldsymbol{y}_{1}+(1-\mathbf{1}^{T}\boldsymbol{y})\boldsymbol{\phi}_{1}.

As shown in Figure 6, for the proposed Algorithm 3, the average time of simulating 10,000 random samples increases linearly in the dimension kk. By contrast, for the naive Cholesky decomposition based simulation algorithm, whose computational complexity is O(k3)O(k^{3}) (Golub and Van Loan, 2012), the average simulation time increases at a significantly faster rate as the dimension kk increases.

For explicit verification, with the 10,000 simulated k=104k=10^{4} dimensional random samples in a random trial, we randomly choose two dimensions and display their joint distribution using a contour plot. As in Figure 7, shown in the first row are the contour plots of five different random trials for the naive Cholesky implementation, whereas shown in the second row are the corresponding ones for the proposed Algorithm 3. As expected, the contour lines of the two figures in the same column closely match each other.

To further examine when to apply Algorithm 3 instead of the naive Cholesky decomposition based implementation in a general setting, we present the computational complexity analyses in Tables 3 and 4 of the Appendix for the naive approach and Algorithm 3, respectively. In addition, we mimic the settings in Section 0.4.1 to conduct a set of experiments with randomly generated Σ12\boldsymbol{\Sigma}_{12}, diagonal Σ11\boldsymbol{\Sigma}_{11}, and diagonal Σ22\boldsymbol{\Sigma}_{22}. We fix k1=4000k_{1}=4000 and vary k2k_{2} from 1 to 8000. The computation time for one sample averaged over 50 random trials is presented in Figure 8. It is clear from Tables 3 and 4 and Figure 8 that, as a general guideline, one may choose Algorithm 3 when k2k_{2} is smaller than k1k_{1} and Σ11\boldsymbol{\Sigma}_{11} admits some special structure that makes it easy to invert and computationally efficient to simulate from N(0,Σ11)\mathcal{N}(\bf{0},\boldsymbol{\Sigma}_{11}).

4.3 Simulation of MVNs with structured precision matrices

To examine when to apply Algorithm 4 instead of the naive Choleskey decomposition based procedure, we first consider a series of random simulations in which the sample size nn is fixed while the data dimension pp is varying. We then show that Algorithm 4 can be applied for high-dimensional regression whose pp is often much larger than nn.

We fix n=4000n=4000, vary pp from 1 to 8000, and mimic the settings in Section 0.4.1 to randomly generate Φ\boldsymbol{\Phi}, diagonal A{\bf A}, and diagonal Ω\boldsymbol{\Omega}. As a function of dimensions pp, the computation time for one sample averaged over 50 random trials is shown in Figure 9. It is evident that, identical to the complexity analysis in Tables 5 and 6, Algorithm 4 has a linear complexity with respect to pp under these settings, which will bring significant acceleration in a high-dimensional setting with pnp\gg n. If the sample size nn is large enough that n>pn>p, then one may directly apply the naive Cholesky decomposition based implementation.

Example 4: Simulation of the MVN distribution

Sample y1N(0,A1)\boldsymbol{y}_{1}\sim\mathcal{N}(\mathbf{0},{\bf A}^{-1}) and y2N(0,Ω1)\boldsymbol{y}_{2}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Omega}^{-1}) ;

Return β=y1+A1ΦT(Ω1+ΦA1ΦT)1(tΦy1y2)\boldsymbol{\beta}=\boldsymbol{y}_{1}+{\bf A}^{-1}\boldsymbol{\Phi}^{T}(\boldsymbol{\Omega}^{-1}+\boldsymbol{\Phi}{\bf A}^{-1}\boldsymbol{\Phi}^{T})^{-1}\left(\boldsymbol{t}-\boldsymbol{\Phi}\boldsymbol{y}_{1}-\boldsymbol{y}_{2}\right), which can be realized using

Solve α\boldsymbol{\alpha} such that (Ω1+ΦA1ΦT)α=tΦy1y2(\boldsymbol{\Omega}^{-1}+\boldsymbol{\Phi}{\bf A}^{-1}\boldsymbol{\Phi}^{T})\boldsymbol{\alpha}=\boldsymbol{t}-\boldsymbol{\Phi}\boldsymbol{y}_{1}-\boldsymbol{y}_{2};

Return β=y1+A1ΦTα\boldsymbol{\beta}=\boldsymbol{y}_{1}+{\bf A}^{-1}\boldsymbol{\Phi}^{T}\boldsymbol{\alpha}.

Note that if Ω=In\boldsymbol{\Omega}={\bf I}_{n}, then the simulation algorithm in Example 4 reduces to the one in Proposition 2.1 of Bhattacharya et al. (2016), which is shown there to be significantly more efficient than that of Rue (2001) for high-dimensional regression if pnp\gg n.

5 Conclusions

A fast and exact simulation algorithm is developed for a multivariate normal (MVN) distribution whose sample space is constrained on the intersection of a set of hyperplanes, which is shown to be inherently related to the conditional distribution of a unconstrained MVN distribution. The proposed simulation algorithm is further generalized to efficiently simulate from a MVN distribution, whose covariance (precision) matrix can be decomposed as the sum (difference) of a positive-definite matrix and a low-rank symmetric matrix, using a higher dimensional hyperplane-truncated MVN distribution whose covariance matrix is block-diagonal.

References

Proofs

Let us denote Λ=HTΣ1H\boldsymbol{\Lambda}={\bf H}^{T}\boldsymbol{\Sigma}^{-1}{\bf H} as a precision matrix that can be partitioned as in (7). For Algorithm 1, instead of directly simulating z1\boldsymbol{z}_{1} given z2\boldsymbol{z}_{2} using the conditional distribution of the MVN, we apply Algorithm 5 (Hoffman and Ribak, 1991; Doucet, 2010) to modify its sampling steps as follows.

Therefore, we can equivalently generate x\boldsymbol{x} as follows.

Sample yN(μ,Σ)\boldsymbol{y}\sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma}).

Return x=(H1,H1Λ111Λ12)H1y+(H2H1Λ111Λ12)(GH2)1r\boldsymbol{x}=({\bf H}_{1},{\bf H}_{1}\boldsymbol{\Lambda}_{11}^{-1}\boldsymbol{\Lambda}_{12}){\bf H}^{-1}\boldsymbol{y}+({\bf H}_{2}-{\bf H}_{1}\boldsymbol{\Lambda}_{11}^{-1}\boldsymbol{\Lambda}_{12})({\bf G}{\bf H}_{2})^{-1}\boldsymbol{r}.

The computation can be significantly simplified if Λ12=0\boldsymbol{\Lambda}_{12}=\mathbf{0}, which means

Since H1TGT=0{\bf H}_{1}^{T}{\bf G}^{T}=0 by definition, to make Λ12=0\boldsymbol{\Lambda}_{12}=\mathbf{0}, if and only if we have H2{\bf H}_{2} as

Sample yN(μ,Σ)\boldsymbol{y}\sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma}).

Return x=(H1,0k×k2)H1y+ΣGT(GΣGT)1r\boldsymbol{x}=({\bf H}_{1},\mathbf{0}_{k\times k_{2}}){\bf H}^{-1}\boldsymbol{y}+\boldsymbol{\Sigma}{\bf G}^{T}({\bf G}\boldsymbol{\Sigma}{\bf G}^{T})^{-1}\boldsymbol{r}, or return

Let us denote (0k×k1,ΣGTM)H1=C(\mathbf{0}_{k\times k_{1}},\boldsymbol{\Sigma}{\bf G}^{T}{\bf M}){\bf H}^{-1}={\bf C}. We have

and hence CH1=0{\bf C}{\bf H}_{1}=\mathbf{0} and CΣGTM=ΣGTM{\bf C}\boldsymbol{\Sigma}{\bf G}^{T}{\bf M}=\boldsymbol{\Sigma}{\bf G}^{T}{\bf M}. Since GH1=0{\bf G}{\bf H}_{1}=0, we have C=ΣGT(GΣGT)1G{\bf C}=\boldsymbol{\Sigma}{\bf G}^{T}({\bf G}\boldsymbol{\Sigma}{\bf G}^{T})^{-1}{\bf G}. The proof is completed by substituting (0k×k1,ΣGTM)H1(\mathbf{0}_{k\times k_{1}},\boldsymbol{\Sigma}{\bf G}^{T}{\bf M}){\bf H}^{-1} in (20) with ΣGT(GΣGT)1G\boldsymbol{\Sigma}{\bf G}^{T}({\bf G}\boldsymbol{\Sigma}{\bf G}^{T})^{-1}{\bf G}. ∎

To solve the problem in (3), one may solve an equivalent problem in (6) by defining an invertible transformation matrix H{\bf H} that satisfies GH1=0k2×k1{\bf G}{\bf H}_{1}=\mathbf{0}_{k_{2}\times k_{1}}. Let us denote Λ=HTΣ1H\boldsymbol{\Lambda}={\bf H}^{T}\boldsymbol{\Sigma}^{-1}{\bf H} as a precision matrix that can be partitioned as in (7). To simply the problem in (6), we choose the transformation matrix H{\bf H} to make z1\boldsymbol{z}_{1} and z2\boldsymbol{z}_{2} be independent to each other. Since z\boldsymbol{z} follows a MVN distribution, z1\boldsymbol{z}_{1} and z2\boldsymbol{z}_{2} are independent to each other if and only if

Since H1TGT=0{\bf H}_{1}^{T}{\bf G}^{T}=\mathbf{0} by definition, to make Λ12=0\boldsymbol{\Lambda}_{12}=\mathbf{0}, if and only if we have H2{\bf H}_{2} as

Thus with H{\bf H} satisfying GH1=0k2×k1{\bf G}{\bf H}_{1}=\mathbf{0}_{k_{2}\times k_{1}} and H2=ΣGTM{\bf H}_{2}=\boldsymbol{\Sigma}{\bf G}^{T}{\bf M}, one can transform the original problem in (3) to that in (6), where z1\boldsymbol{z}_{1} and z2\boldsymbol{z}_{2} are independent and the restrictions Gx=r{\bf G}\boldsymbol{x}=\boldsymbol{r} and z2=(GH2)1r\boldsymbol{z}_{2}=({\bf G}{\bf H}_{2})^{-1}\boldsymbol{r} imply each other. Following the naive approach shown in Algorithm 1, one can generate x\boldsymbol{x} from (3) as follows

Find H=(H1,H2){\bf H}=({\bf H}_{1},{\bf H}_{2}) with H2=ΣGTM{\bf H}_{2}=\boldsymbol{\Sigma}{\bf G}^{T}{\bf M} and with H1{\bf H}_{1} satisfying GH1=0k2×k1{\bf G}{\bf H}_{1}=\mathbf{0}_{k_{2}\times k_{1}};

Sample z1N[(Ik1,0k1×k2)H1μ,(H1TΣ1H1)1]\boldsymbol{z}_{1}\sim\mathcal{N}[({\bf I}_{k_{1}},\mathbf{0}_{k_{1}\times k_{2}}){\bf H}^{-1}\boldsymbol{\mu},({\bf H}^{T}_{1}\boldsymbol{\Sigma}^{-1}{\bf H}_{1})^{-1}];

Return \boldsymbol{x}={\bf H}\left[\begin{array}[]{c}\boldsymbol{z}_{1}\\ {\bf M}^{-1}({\bf G}\boldsymbol{\Sigma}{\bf G}^{T})^{-1}\boldsymbol{r}\end{array}\right].

However, this naive approach contains intermediate variables that could be computationally expensive to compute. Below we present a method to bypass these intermediate variables. Since the last step could be reexpressed as

which means zN[H1μ,H1Σ(H1)T]\boldsymbol{z}\sim\mathcal{N}[{\bf H}^{-1}\boldsymbol{\mu},{\bf H}^{-1}\boldsymbol{\Sigma}({\bf H}^{-1})^{T}], and choose M{\bf M} to make

which means GΣGT(Mz2)=GHz{\bf G}\boldsymbol{\Sigma}{\bf G}^{T}({\bf M}\boldsymbol{z}_{2})={\bf G}{\bf H}\boldsymbol{z}. Thus we have

In addition, since if zN[H1μ,H1Σ(H1)T]\boldsymbol{z}\sim\mathcal{N}[{\bf H}^{-1}\boldsymbol{\mu},{\bf H}^{-1}\boldsymbol{\Sigma}({\bf H}^{-1})^{T}], then y=HzN(μ,Σ)\boldsymbol{y}={\bf H}\boldsymbol{z}\sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma}). Therefore, without the need to compute any intermediate variables, one may use Algorithm 2 to generate x\boldsymbol{x} from the hyperplane truncated MVN distribution.

Using the matrix inversion lemma on (8), we have

Using (11), we have Gμ=G1μ1+G2μ2=r{\bf G}\boldsymbol{\mu}={\bf G}_{1}\boldsymbol{\mu}_{1}+{\bf G}_{2}\boldsymbol{\mu}_{2}=\boldsymbol{r}. Since Gx=r{\bf G}\boldsymbol{x}=\boldsymbol{r}, we further have G(xμ)=0{\bf G}(\boldsymbol{x}-\boldsymbol{\mu})=0 and hence G1(x1μ1)=G2(x2μ2){\bf G}_{1}(\boldsymbol{x}_{1}-\boldsymbol{\mu}_{1})=-{\bf G}_{2}(\boldsymbol{x}_{2}-\boldsymbol{\mu}_{2}). Therefore, given the construction of μ2\boldsymbol{\mu}_{2} as in (11), we can replace the equality constraint Gx=r{\bf G}\boldsymbol{x}=\boldsymbol{r} on x\boldsymbol{x} by requiring (x2μ2)=G21G1(x1μ1)(\boldsymbol{x}_{2}-\boldsymbol{\mu}_{2})=-{\bf G}_{2}^{-1}{\bf G}_{1}(\boldsymbol{x}_{1}-\boldsymbol{\mu}_{1}). Using this equivelent constraint together with (3), we have

Applying Theorem 1 to Corollary 3, we can generate x\boldsymbol{x} with

Thus we can let x1=μ1+y1Σ12Σ221(Σ21Σ111y1+y2)\boldsymbol{x}_{1}=\boldsymbol{\mu}_{1}+\boldsymbol{y}^{\prime}_{1}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\left(\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{y}^{\prime}_{1}+\boldsymbol{y}_{2}\right), where y1=y1μ1N(0,Σ11)\boldsymbol{y}^{\prime}_{1}=\boldsymbol{y}_{1}-\boldsymbol{\mu}_{1}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}_{11}). ∎

Using the matrix inversion lemma, we have

The proof is completed by using Corollary 4 with μ1=μβ\boldsymbol{\mu}_{1}=\boldsymbol{\mu}_{\beta}, Σ11=A1\boldsymbol{\Sigma}_{11}={\bf A}^{-1}, Σ12=A1ΦT\boldsymbol{\Sigma}_{12}={\bf A}^{-1}\boldsymbol{\Phi}^{T}, and Σ22=Ω1+ΦA1ΦT\boldsymbol{\Sigma}_{22}=\boldsymbol{\Omega}^{-1}+\boldsymbol{\Phi}{\bf A}^{-1}\boldsymbol{\Phi}^{T}. ∎

Computational Complexity

We present the computational complexities of all proposed algorithms in the following tables, where we highlight with bold the lowest complexity in each row.

Brief derivation of SG-MCMC for a simplex-constrained vector

Based on a comprehensive framework for constructing SG-MCMC algorithms in Ma et al. (2015), we have the mini-batch update rule for a global variable z\boldsymbol{z} as

For simplicity, we adopt the same specifications that lead to the stochastic gradient Riemannian Langevin dynamics (SGRLD) inference algorithm for simplex-constrained model parameters (Patterson and Teh, 2013; Ma et al., 2015), namely D(z)=G(z)1{\bf D}\left(\boldsymbol{z}\right)={\bf G}{\left(\boldsymbol{z}\right)^{-1}}, Q(z)=0{\bf Q}\left(\boldsymbol{z}\right)=\boldsymbol{0}, and ^Bt=0{\hat{}{\bf B}_{t}}=\boldsymbol{0}, where G(z){\bf G}\left(\boldsymbol{z}\right) denotes the Fisher information matrix (FIM) (Girolami and Calderhead, 2011).

Substituting both (Brief derivation of SG-MCMC for a simplex-constrained vector) and (26) into (Brief derivation of SG-MCMC for a simplex-constrained vector), we have (16) as

Next we prove that equation (16) can be equivalently represented as (17), namely