Dependent Multinomial Models Made Easy: Stick Breaking with the Pólya-Gamma Augmentation

Scott W. Linderman, Matthew J. Johnson, Ryan P. Adams

Introduction

It is often desirable to model discrete data in terms of continuous latent structure. In applications involving text corpora, discrete-valued time series, or polling and purchasing decisions, we may want to learn correlations or spatiotemporal dynamics, and leverage these learned structures to improve inferences and predictions. However, adding these continuous latent dependence structures often comes at the cost of significantly complicating inference: such models may require specialized, one-off inference algorithms, such as a nonconjugate variational optimization, or they may only admit very general inference tools like particle MCMC Andrieu et al. (2010) or elliptical slice sampling Murray et al. (2010), which can be inefficient and difficult to scale. Developing, extending, and applying these models has remained a challenge.

In this paper we aim to provide a class of such models that are easy and efficient. We develop models for categorical and multinomial data in which dependencies among the multinomial parameters are modeled via latent Gaussian distributions or Gaussian processes, and we show that this flexible class of models admits a simple auxiliary variable method that makes inference easy, fast, and modular. This construction not only makes these models simple to develop and apply, but also allows the resulting inference methods to use off-the-shelf algorithms and software for Gaussian processes and linear Gaussian dynamical systems.

The paper is organized as follows. After providing background material and defining our general models and inference methods, we demonstrate the utility of this class of models by applying it to three domains as case studies. First, we develop a correlated topic model for text corpora. Second, we study an application to modeling the spatial and temporal patterns in birth names given only sparse data. Finally, we provide a new continuous state-space model for discrete-valued sequences, including text and human DNA. In each case, given our model construction and auxiliary variable method, inference algorithms are easy to develop and very effective in experiments. We conclude with comments on the new kinds of models these methods may enable.

Code to use these models and reproduce all the figures is available at https://github.com/HIPS/pgmult.

Modeling correlations in multinomial parameters

In this section, we discuss an auxiliary variable scheme that allows multinomial observations to appear as Gaussian likelihoods within a larger probabilistic model. The key trick discussed in the proceeding sections is to introduce Pólya-gamma random variables into the joint distribution over data and parameters in such a way that the resulting marginal leaves the original model intact.

The integral identity at the heart of the Pólya-gamma augmentation scheme Polson et al. (2013) is

for some functions $a$ , $b$ , and $c$ . Such likelihoods arise, e.g., in logistic regression and in binomial and negative binomial regression Polson et al. (2013). Using (1) along with a prior $p(\psi)$ , we can write the joint density of $(\psi,x)$ as

The integrand of (3) defines a joint density on $(\psi,x,\omega)$ which admits $p(\psi,x)$ as a marginal density. Conditioned on these auxiliary variables $\omega$ , we have

This augmentation scheme has been used to develop Gibbs sampling and variational inference algorithms for Bernoulli, binomial Polson et al. (2013), and negative binomial regression models Zhou et al. (2012) with logit link functions. In this paper we extend it to the multinomial distribution.

2 A Pólya-gamma augmentation for the multinomial distribution

To develop a Pólya-gamma augmentation for the multinomial, we first rewrite the $K$ -dimensional multinomial density recursively in terms of ${K-1}$ binomial densities:

Next, we rewrite the density into the form required by (1) by substituting $\sigma(\psi_{k})$ for $\widetilde{\pi}_{k}$ :

Choosing ${a_{k}(x)=x_{k}}$ and ${b_{k}(x)=N_{k}}$ for each ${k=1,2,\ldots,K-1}$ , we can then introduce Pólya-gamma auxiliary variables $\omega_{k}$ corresponding to each coordinate $\psi_{k}$ ; dropping terms that do not depend on $\boldsymbol{\psi}$ and completing the square yields

where ${\boldsymbol{\Omega}\equiv\text{diag}(\boldsymbol{\omega})}$ and ${\kappa(\boldsymbol{x})\equiv\boldsymbol{x}-N(\boldsymbol{x})/2}$ . That is, conditioned on $\boldsymbol{\omega}$ , the likelihood of $\boldsymbol{\psi}$ under the augmented multinomial model is proportional to a diagonal Gaussian distribution.

Figure 1 illustrates how a variety of Gaussian densities map to probability densities on the simplex. Correlated Gaussians (left) put most probability mass near the $\pi_{1}=\pi_{2}$ axis of the simplex, and anti-correlated Gaussians (center) put mass along the sides where $\pi_{1}$ is large when $\pi_{2}$ is small and vice-versa. Finally, a nearly spherical Gaussian approximates a symmetric Dirichlet distribution. Appendix A derives a closed-form expression for the density on $\boldsymbol{\pi}$ implied by a Gaussian distribution on $\boldsymbol{\psi}$ , as well an expression for a diagonal Guassian distribution that best approximates, in a moment-matching sense, a Dirichlet distribution on $\boldsymbol{\pi}$ .

3 Alternative models

This stick breaking transformation has been explored in the previous work of Ren et al. (2011) and Khan et al. (2012), but has not been connected with the Pólya-gamma augmentation. The multinomial probit and logistic normal methods are most commonly used, and we describe them here.

The multinomial probit model Albert and Chib (1993) applies to categorical regrssion, and is based upon the following auxiliary variable model: ${z_{k}=\Phi(\psi_{k}+\epsilon_{k})}$ , where $\Phi$ is the probit function and $\epsilon_{k}\sim\mathcal{N}(0,1)$ . Given these auxiliary variables, ${x_{k}=1}$ if ${z_{k}>z_{j}\,\forall j\neq k}$ , and ${x_{k}=0}$ otherwise. This approach has primarily focused on categorical modeling.

Unlike the logistic normal and multinomial probit, the stick-breaking transformation we employ is asymmetric. Our illustrations in Figure 1 and the discussion in Appendix A show that this lack of symmetry does not impair the representational capacity of the model.

Correlated topic models

The Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is a popular model for learning topics from text corpora. The Correlated Topic Model (CTM) (Blei and Lafferty, 2006a) extends LDA by including a Gaussian correlation structure among topics. This correlation model is powerful not only because it reveals correlations among topics but also because inferring such correlations can significantly improve predictions, especially when inferring the remaining words in a document after only a few have been revealed Blei and Lafferty (2006a). However, the addition of this Gaussian correlation structure breaks the Dirichlet-Multinomial conjugacy of LDA, making estimation and particularly Bayesian inference and model-averaged predictions more challenging. An approximate maximum likelihood approach using variational EM Blei and Lafferty (2006a) is often effective, but a fully Bayesian approach which integrates out parameters may be preferable, especially when making predictions based on a small number of revealed words in a document. A recent Bayesian approach based on a Pólya-Gamma augmentation to the logistic normal CTM (LN-CTM) Chen et al. (2013) provides a Gibbs sampling algorithm with conjugate updates, but the Gibbs updates are limited to single-site resampling of one scalar at a time, which can lead to slow mixing in correlated models.

In this section we show that MCMC sampling in a correlated topic model based on the stick breaking construction (SB-CTM) can be significantly more efficient than sampling in the LN-CTM while maintaining the same integration advantage over EM.

In the standard LDA model, each topic $\boldsymbol{\beta}_{t}$ ( $t=1,2,\ldots,T$ ) is a distribution over a vocabulary of $V$ possible words, and each document $d$ has a distribution over topics $\boldsymbol{\theta}_{d}$ ( $d=1,2,\ldots,D$ ). The $n$ th word in document $d$ is denoted $w_{n,d}$ for $d=1,2,\ldots,N_{d}$ . When each $\boldsymbol{\beta}_{t}$ and $\boldsymbol{\theta}_{d}$ is given a symmetric Dirichlet prior with concentration parameters $\alpha_{\beta}$ and $\alpha_{\theta}$ , respectively, the full generative model is

The CTM replaces the Dirichlet prior on each $\boldsymbol{\theta}_{d}$ with a new prior that models the coordinates of $\boldsymbol{\theta}_{d}$ as mutually correlated. This correlation structure on $\boldsymbol{\theta}_{d}$ is induced by first sampling a correlated Gaussian vector $\boldsymbol{\psi}_{d}$ and then applying the logistic normal map:

where the Gaussian parameters $(\boldsymbol{\mu},\boldsymbol{\Sigma})$ can be given a conjugate normal-inverse Wishart (NIW) prior. Analogously, our SB-CTM generates the correlation structure by instead applying the stick-breaking logistic map:

The goal is then to infer the posterior distribution over the topics $\boldsymbol{\beta}_{t}$ , the documents’ topic allocations $\boldsymbol{\psi}_{d}$ , and their mean and correlation structure $(\boldsymbol{\mu},\boldsymbol{\Sigma})$ . (In the case of the EM algorithm of Blei and Lafferty (2006a), the task is to approximate maximum likelihood estimates of the same parameters.) Modeling correlation structure within the topics $\boldsymbol{\beta}$ can be done analogously.

For fully Bayesian MCMC inference in the SB-CTM, we develop a Gibbs sampler that exploits the block conditional Gaussian structure provided by the stick-breaking construction. The Gibbs sampler iteratively samples ${\boldsymbol{z}\,|\,\boldsymbol{w},\boldsymbol{\beta},\boldsymbol{\psi}}$ ; ${\boldsymbol{\beta}\,|\,\boldsymbol{z},\boldsymbol{w}}$ ; ${\boldsymbol{\psi}\,|\,\boldsymbol{z},\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\omega}}$ ; and $\boldsymbol{\mu},\boldsymbol{\Sigma}\,|\,\boldsymbol{\psi}$ ; as well as the auxiliary variables $\boldsymbol{\omega}\,|\,\boldsymbol{\psi},\boldsymbol{z}$ . The first two are standard updates for LDA models, so we focus on the latter three. Using the identities derived in Section 2.2, the conditional density of each $\boldsymbol{\psi}_{d}\,|\,\boldsymbol{z}_{d},\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\omega}$ can be written

We compare the performance of this Gibbs sampling algorithm for the SB-CTM to the Gibbs sampling algorithm of the LN-CTM Chen et al. (2013), which uses a different Pólya-gamma augmentation, as well as the original variational EM algorithm for the CTM and collapsed Gibbs sampling in standard LDA. Figure 2 shows results on both the AP News dataset and the 20 Newsgroups dataset, where models were trained on a random subset of 95% of the complete documents and tested on the remaining 5% by estimating held-out likelihoods of half the words given the other half. The collapsed Gibbs sampler for LDA is fast but because it does not model correlations its ability to predict is significantly constrained. The variational EM algorithm for the CTM is reasonably fast but its point estimate doesn’t quite match the performance from integrating out parameters via MCMC in this setting. The LN-CTM Gibbs sampler continues to improve slowly but is limited by its single-site updates, while the SB-CTM sampler seems to both mix effectively and execute efficiently due to its block Gaussian updating.

The SB-CTM demonstrates that the stick-breaking construction and corresponding Pólya-Gamma augmentation makes inference in correlated topic models both easy to implement and computationally efficient. The block conditional Gaussianity also makes inference algorithms modular and compositional: the construction immediately extends to dynamic topic models (DTMs) Blei and Lafferty (2006b), in which the latent $\boldsymbol{\psi}_{d}$ evolve according to linear Gaussian dynamics, and inference can be implemented simply by applying off-the-shelf code for Gaussian linear dynamical systems (see Section 5). Finally, because LDA is so commonly used as a component of other models (e.g. for images Wang and Grimson (2008)), easy, effective, modular inference for CTMs and DTMs is a promising general tool.

To apply correlated topic models to increasingly large datasets, a stochastic variational inference (SVI) (Hoffman et al., 2013) approach is promising. In Appendix C, we show that the stick-breaking construction enables an algorithm based on the Pólya-gamma augmentation that can work with subsets, or mini-batches, of data in each iteration. As with the Gibbs sampler, the conditionally conjugate structure makes the algorithm easy to derive and implement.

Gaussian processes with multinomial observations

Consider the United States census data, which lists the first names of children born in each state for the years 1910-2013. Suppose we wish to predict the probability of a particular name in New York State in the years 2012 and 2013 given observed names in earlier years. We might reasonably expect that name probabilities vary smoothly over time as names rise and fall in popularity, and that name probability would be similar in neighboring states. A Gaussian process naturally captures these prior intuitions about spatiotemporal correlations, but the observed name counts are most naturally modeled as multinomial draws from latent probability distributions over names for each combination of state and year. We show how efficient inference can be performed in this otherwise difficult model by leveraging the Pólya-gamma augmentation.

To perform inference, introduce auxiliary Pólya-gamma variables, $\omega_{m,k}$ for each $\psi_{m,k}$ . Conditioned on these variables, the conditional distribution of $\boldsymbol{\psi}_{:,k}$ is,

Figure 3 illustrates the power of this approach on U.S. census data. The top two plots show the inferred probabilities under our stick-breaking multinomial GP model for the full dataset. Interesting spatiotemporal correlations in name probability are uncovered. In this large-count regime, the posterior uncertainty is negligible since we observe thousands of names per state and year, and simply modeling the transformed empirical probabilities with a GP works well. However, in the sparse data regime with only ${N_{m}=50}$ observations per input, it greatly improves performance to model uncertainty in the latent probabilities using a Gaussian process with multinomial observations.

The bottom panels compare four methods of predicting future names in the years 2012 and 2013 for a down-sampled dataset with ${N_{m}=50}$ : predicting based on the empirical probability measured in 2011; a standard GP to the empirical probabilities transformed by $\pi_{\mathsf{SB}}^{-1}$ (Raw GP); a GP whose outputs are transformed by the logistic normal function, $\pi_{\mathsf{LN}}$ , to obtain multinomial probabilities (LNM GP) fit using elliptical slice sampling Murray et al. (2010); and our stick-breaking multinomial GP (SBM GP). In terms of ability to predict the top and bottom 10 names, the multinomial models are both comparable and vastly superior to the naive approaches.

In terms of efficiency, our model is considerably faster than the logistic normal version, as shown in the bottom right panel. This difference is due to two reasons. First, our augmented Gibbs sampler is more efficient than the elliptical slice sampling algorithm used to handle the nonconjugacy in the LNM GP. Second, and perhaps most important, we are able to make collapsed predictions in which we compute the predictive distribution test $\boldsymbol{\psi}$ ’s given $\boldsymbol{\omega}$ , integrating out the training $\boldsymbol{\psi}$ . In contrast, the LNM-GP must condition on the training GP values in order to make predictions, and effectively integrate over training samples using MCMC. Appendix B goes into greater detail on how marginal predictions are computed and why they are more efficient than predicting conditioned on a single value of $\boldsymbol{\psi}$ .

Multinomial linear dynamical systems

While discrete-state hidden Markov models (HMMs) are ubiquitous for modeling time series and sequence data, it can be preferable to use a continuous state space model. In particular, while discrete states have no intrinsic structure, continuous states can correspond to a natural embedding or geometry Belanger and Kakade (2015). These considerations are particularly relevant to text, where word embeddings Collobert and Weston (2008) have proven to be a powerful tool.

Gaussian linear dynamical systems (LDS) provide very efficient learning and inference algorithms, but they can typically only be applied when the observations are themselves linear with Gaussian noise. While it is possible to apply a Gaussian LDS to count vectors Belanger and Kakade (2015), the resulting model is misspecified in the sense that, as a continuous density, the model assigns zero probability to training and test data. However, Belanger and Kakade (2015) show that this model can still be used for several machine learning tasks with compelling performance, and that the efficient algorithms afforded by the misspecified Gaussian assumptions confer a significant computational advantage. Indeed, the authors have observed that such a Gaussian model is “worth exploring, since multinomial models with softmax link functions prevent closed-form M step updates and require expensive” computations Belanger and Kakade (2014); this paper aims to help bridge precisely this gap and enable efficient Gaussian LDS computational methods to be applied while maintaining multinomial emissions and an asymptotically unbiased representation of the posteiror. While there are other approximation schemes that effectively extend some of the benefits of LDSs to nonlinear, non-Gaussian settings, such as the extended Kalman filter (EKF) and unscented Kalman filter (UKF) Wan and Van Der Merwe (2000); Thrun et al. (2005), these methods do not allow for asymptotically unbiased Bayesian inference and can have complex behavior. Alternatively, particle MCMC (pMCMC) Andrieu et al. (2010) with ancestor resampling Lindsten et al. (2012) is a very powerful algorithm that provides unbiased Bayesian inference for very general state space models, but it does not enjoy the efficient block updates or conjugacy of LDSs or HMMs.

In this section we show how to use the stick-breaking construction and its Pólya-gamma augmentation to perform efficient inference in LDS with multinomial emissions. We focus on a Gibbs sampler with fully conjugate updates that utilizes standard LDS message passing for efficient block updates.

The stick-breaking multinomial linear dynamical system (SBM-LDS) generates states according to a standard linear Gaussian dynamical system but generates multinomial observations using the stick-breaking logistic map:

Figure 4 (left panels) shows the predictive log likelihood for each method on each experiment, normalized by the number of counts in the test dataset and setting to zero the likelihood under a multinomial model fit to the training data mean. For the DNA data, which has the smallest “vocabulary” size, the HMM achieves the highest predictive likelihood, but the SBM-LDS edges out the other LDS methods. On the two text datasets, the SBM-LDS outperforms the other methods, particularly in Alice where the vocabulary is larger and the document is longer. In terms of run time, the SBM-LDS is orders of magnitude faster than the LNM-LDS with pMCMC (right panel) because it mixes much more efficiently over the latent trajectories.

The SBM-LDS is an easy but powerful linear state space model for multinomial observations. The Gibbs sampler leveraging the Pólya-gamma augmentation appears very efficient, performing comparably to an optimized HMM implementation and orders of magnitude faster than a general pMCMC algorithm. Because the augmentation renders the states’ conditional distribution a Gaussian LDS, it easily interfaces with high-performance LDS software, and extending these models with additional structure or covariates can be similarly modular.

Related Work

The stick-breaking transformation used herein was applied to categorical models by Khan et al. (2012), but they used local variational bound instead of the Pólya-gamma augmentation. Their promising results corroborate our findings of improved performance using this transformation. Their generalized expectation-maximization algorithm is not fully Bayesian, and does not integrate into existing Gaussian modeling code as easily as our augmentation.

Conversely, Chen et al. (2013) used the Pólya-gamma augmentation in conjunction with the logistic normal transformation for correlated topic modeling, exploiting the conditional conjugacy of a single entry ${\psi_{k}\,|\,\omega_{k},\boldsymbol{\psi}_{\neg k}}$ with a Gaussian prior. Unlike our stick-breaking transformation which admits block Gibbs sampling over the entire vector $\boldsymbol{\psi}$ simultaneously, their approach is limited to single-site Gibbs sampling. As shown in our correlated topic model experiments, this has dramatic effects on inferential performance. Moreover, it precludes analytical marginalization and integration with existing Gaussian modeling algorithms. For example, it is not immediately applicable to inference in linear dynamical systems with multinomial observations.

Conclusion

These case studies demonstrate that the stick-breaking multinomial model construction paired with the Pólya-gamma augmentation yields a flexible class of models with easy, efficient, and compositional inference. In addition to making these models easy, the methods developed here can also enable new models for multinomial and mixed data: the latent continuous structures used here to model correlations and state-space structure can be leveraged to explore new models for interpretable feature embeddings, interacting time series, and dependence with other covariates.

We thank the members of the Harvard Intelligent Probabilistic Systems (HIPS) group, especially Yakir Reshef, for many helpful conversations. S.W.L. is supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. M.J.J. is supported by the Harvard/MIT Joint Research Grants Program. R.P.A. is partially supported by NSF IIS-1421780.

References

Appendix A Transforming between p(𝝍)𝑝𝝍p(\boldsymbol{\psi}) and p(𝝅)𝑝𝝅p(\boldsymbol{\pi})

Since the mapping between $\boldsymbol{\pi}$ and $\boldsymbol{\psi}$ is invertible, we can compute the distribution on $\boldsymbol{\pi}$ that is implied by a Gaussian distribution on $\boldsymbol{\psi}$ . Assume $\boldsymbol{\psi}\sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$ . Then,

Since the Jacobian of the inverse transformation is lower diagonal, its determinant is simply the product of its diagonal,

where we have used the fact that the Jacobian of the inverse transformation is simply the inverse of the Jacobian of the forward transformation. We simply need to rewrite the Jacobian in terms of $\psi$ rather than $\pi$ . Note that $1-\sum_{j<k}\pi_{j}$ is the length of the remaining stick and $\sigma(\psi_{k})$ is the fraction of the remaining “stick” allocated to $\pi_{k}$ . Thus, the remaining stick length is equal to,

Moreover, $\pi_{k}=\sigma(\psi_{k})(1-\sum_{j<k}\pi_{j})=\sigma(\psi_{k})\prod_{j<k}\sigma(-\psi_{j})$ . Thus,

Expanding the Dirichlet distribution and substituting $\psi$ for $\pi$ , we conclude that,

This factorized form is unsurprising given that the Dirichlet distribution can be written as a stick-breaking product of beta distributions in the same way that the multinomial can be written as a product of binomials. Each term in the product above corresponds to the transformed beta distribution over $\widetilde{\pi}_{k}$ .

Figure 6 shows the marginal densities on $\psi_{k}$ implied by a $K=9$ dimensional symmetric Dirichlet prior on $\boldsymbol{\pi}$ with $\alpha=1$ . The densities of $\psi_{k}$ become increasingly skewed for small values of $k$ , but they are still well approximate by a Gaussian distribution. In order to approximate a uniform distribution, we numerically compute the mean and variance of these densities to set the parameters of a diagonal Guassian distribution.

Appendix B Marginal Predictions with the Augmented Model

One of the primary advantages offered by the Pólya-gamma augmentation is the ability to make marginal predictions about $\boldsymbol{\psi}_{\mathsf{test}}\,|\,\boldsymbol{x},\boldsymbol{\omega}$ , integrating out the value of $\boldsymbol{\psi}_{\mathsf{train}}$ . For example, in the GP multinomial regression models described in the main text, the methods were evaluated on the accuracy of their predictions about future name probabilities, which were functions of $\boldsymbol{\psi}_{\mathsf{test}}$ . When $p(\boldsymbol{\psi}_{\mathsf{train}})$ and $p(\boldsymbol{\psi}_{\mathsf{test}}\,|\,\boldsymbol{\psi}_{\mathsf{train}})$ are both Gaussian, we can integrate out the latent training variables in order to predict their test values. In a latent Gaussian-multinomial model, the posterior distribution over those latent training variables is non-Gaussian, but after Pólya-gamma augmentation, it is rendered Gaussian.

and perform Monte Carlo integration over $\boldsymbol{\omega}$ in order to compute the predictive distribution. By contrast, in the standard formulation we must perform Monte Carlo integration over $\boldsymbol{\psi}$ ,

Why does the augmented model confer a predictive advantage? It does not come from performing Monte Carlo integration over a smaller dimension since $\boldsymbol{\omega}$ and $\boldsymbol{\psi}_{\mathsf{train}}$ are of the same size. Instead, it comes from the ability of the conjugate Gibbs sampler to efficiently mix over $\boldsymbol{\psi}$ and $\boldsymbol{\omega}$ , and from the ability of a single sample of $\boldsymbol{\omega}$ to render a conditional Gaussian distribution over $\boldsymbol{\psi}$ that captures much of the volume of the true marginal distribution.

This latter point is illustrated in Figure 6. The red shading shows the true marginal density of $\boldsymbol{\psi}$ and the blue ellipses show the conditional density for a fixed value of $\boldsymbol{\omega}$ . Each ellipse capture a significant amount of the marginal distribution, indicating that with a single sample of $\boldsymbol{\omega}$ we can integrate over a substantial amount of the uncertainty in $\boldsymbol{\psi}$ . This example is only for a $K=3$ dimensional multinomial observation, but this intuition should extend to higher dimensions in which the advantages of analytical integration should be more readily apparent.

Appendix C Variational Inference for Correlated Topic Models

We use the following factorized approximation to the posterior distribution,

First let’s consider the variational distribution for $\boldsymbol{\psi}_{d}$ and $\boldsymbol{\omega}_{d}$ . From the conjugacy of the model, we have

The factor for $\boldsymbol{\omega}_{d}$ is not available in closed form. We have,

The updates for the global topic distribution parameters, $\boldsymbol{\mu}_{\theta}$ and $\boldsymbol{\Sigma}_{\theta}$ , depend only on their normal inverse-Wishart prior and the expectations with respect to $q(\boldsymbol{\psi}_{d})$ . These follow their standard form, see, for example, Bishop .

The variational updates for $z_{n,d}$ and $\boldsymbol{\beta}_{t}$ are straightforward.

This implies that $q(z_{n,d})$ is categorical with parameters,

We recognize this as a Dirichlet distribution,

The data local variables, $z_{n,d}$ , $\boldsymbol{\psi}_{d}$ , and $\boldsymbol{\omega}_{d}$ , are conditionally independent across documents. Moreover, since the model is fully conjugate, the expectations required to update the global variables, $\boldsymbol{\beta}_{t}$ , $\boldsymbol{\mu}_{\theta}$ , and $\boldsymbol{\Sigma}_{\theta}$ depend on sufficient statistics that are derived from summmations over documents. Rather than summing over the entire corpus of documents, we can get an unbiased estimate of the sufficient statistics by considering a random subset, or mini-batch, per iteration. This is the key to stochastic variational inference (SVI) algorithms Hoffman et al. , which have been widely successful in scalable topic modeling applications. Those same gains in scalability are readily applicable in this correlated topic model formulation.