Nonparametric Bayesian Negative Binomial Factor Analysis

Mingyuan Zhou

1 Introduction

The need to analyze a covariate-sample count matrix, each of whose elements counts the number of time that a covariate appears in a sample, arises in many different settings, such as text analysis, next-generation sequencing, medical records mining, and consumer behavior studies. The mixed-membership model, widely used for text analysis (Blei et al., 2003) and population genetics (Pritchard et al., 2000), treats each sample as a bag of indices (words), and associates each index with both a covariate that is observed and a subpopulation that is latent. It makes the assumption that there are $K$ latent subpopulations, each of which is characterized by how frequent the covariates are relative to each other within it. Given the total number of indices for a sample, it assigns each index independently to one of the $K$ subpopulations, with a probability proportional to the product of the corresponding covariate’s relative frequency in that subpopulation and that subpopulation’s relative frequency in the sample. A mixed-membership model constructed in this manner, as shown in Zhou et al. (2012) and Zhou and Carin (2015), can also be connected to Poisson factor analysis (PFA) that factorizes the covariate-sample count matrix, under the Poisson likelihood, into the product of a nonnegative covariate-subpopulation factor loading matrix and a nonnegative subpopulation-sample factor score matrix. Each column of the factor loading matrix encodes the relative frequencies of the covariates in a subpopulation, while that of the factor score matrix encodes the weights of the subpopulations in a sample.

Despite the popularity of both approaches in analyzing the covariate-sample count matrix, they both make restrictive assumptions. Given the relative frequencies of the covariates in subpopulations and the relative frequencies of the subpopulations in samples, the mixed-membership model independently assigns each index to both a covariate and a subpopulation, and hence may not sufficiently capture the tendency for an index to excite the other ones in the same sample to take the same or related covariates. Whereas for PFA, given the factor loading and score matrices, it assumes that the variance and mean are the same for each covariate-sample count, and hence is likely to underestimate the variability of overdispersed counts.

In practice, however, highly overdispersed covariate-sample counts are frequently observed due to self- and cross-excitation of covariate frequencies, that is to say, some covariates are particularly intense and also make other related covariates intense. For example, the tendency for a word present in a document to appear repeatedly is a fundamental phenomenon in natural language that is commonly referred to as word burstiness (Church and Gale, 1995; Madsen et al., 2005; Doyle and Elkan, 2009). If a word is bursty in a document, it is also common for it to excite (stimulate) related words to exhibit burstiness. Without capturing the self- and cross-excitation (stimulation) of covariate frequencies or better modeling the overdispersed covariate-sample counts, the ultimate potential of the mixed-membership model and PFA will be limited no matter how the priors on latent parameters are adjusted. In addition, it could be a waste of computation if the model tries to increase the model capacity to better capture overdispersions that could be simply explained with self- and cross-excitations.

To remove these restrictions, we introduce negative binomial factor analysis (NBFA) to factorize the covariate-sample count matrix, in which we replace the Poisson distributions on which PFA is built, with the negative binomial (NB) distributions. As PFA is closely related to the canonical mixed-membership model built on the multinomial distribution, we show that NBFA is closely related to a Dirichlet-multinomial mixed-membership (DMMM) model that uses the Dirichlet-categorical (Dirichlet-multinomial) rather than categorical (multinomial) distributions to assign an index to both a covariate and a factor (subpopulation). From the viewpoint of count modeling, NBFA improves PFA by better modeling overdispersed counts, while from that of mixed-membership modeling, it improves the canonical mixed-membership model by capturing the burstiness at both the covariate and factor levels ( $i.e.$ , for topic modeling, it exhibits a rich-get-richer phenomenon at both the word and topic levels). In addition, we will show NBFA could significantly reduce the computation spent on large covariate-sample counts.

Note that with a different likelihood for counts and a different mechanism to generate both the covariate and factor indices, NBFA and the related DMMM model proposed in the paper complement, rather than compete with, PFA (Zhou et al., 2012; Zhou and Carin, 2015). First, NBFA provides more significant advantages in modeling longer samples, where there is more need to capture both self- and cross-excitation of covariate frequencies. Second, various extensions built on PFA or the multinomial mixed-membership model, such as capturing the correlations between factors (Blei and Lafferty, 2005; Paisley et al., 2012) and learning multilayer deep representations (Ranganath et al., 2015; Gan et al., 2015; Zhou et al., 2016a), could also be applied to extend NBFA. In this paper, we will focus on constructing a nonparametric Bayesian NBFA with a potentially infinite number of factors, and leave a wide variety of potential extensions under this new framework to future research.

To avoid the need of selecting the number of factors $K$ , for PFA and the closely related multinomial mixed-membership model, a number of nonparametric Bayesian priors can be employed to support a potentially infinite number of latent factors, such as the hierarchical Dirichlet process (Teh et al., 2006) and beta-negative binomial process (Zhou et al., 2012; Broderick et al., 2015; Zhou and Carin, 2015). To support countably infinite factors for NBFA, generalizing the gamma-negative binomial process (GNBP) (Zhou and Carin, 2015; Zhou et al., 2016b), we introduce a new nonparametric Bayesian prior: the hierarchical gamma-negative binomial process (hGNBP), where each of the $J$ samples is assigned with a sample-specific GNBP and a globally shared gamma process is mixed with all the $J$ GNBPs. We derive both blocked and collapsed Gibbs sampling for the hGNBP-NBFA, with the number of factors automatically inferred.

The remainder of the paper is organized as follows. In Section 0.2, we review PFA and the multinomial mixed-membership model. In Section 0.3, we introduce NBFA and its representation as a DMMM model, and compare them with related models. In Section 0.4, we propose nonparametric-Bayesian NBFA. In Section 0.5, we derive both blocked and collapsed Gibbs sampling algorithms. In Section 0.6, we first make comparisons between different sampling strategies and then compare the performance of various algorithms on real data. We conclude the paper in Section 7. The proofs and Gibbs sampling update equations are provided in the Supplementary Material.

2 Poisson factor analysis and mixed-membership model

Let ${{\bf N}}$ be a $V\times J$ covariate-sample count matrix for $V$ covariates among $J$ samples, where $n_{vj}$ is the $(v,j)$ element of ${{\bf N}}$ and gives the number of times that sample $j$ has covariate $v$ . PFA factorizes ${{\bf N}}$ under the Poisson likelihood as

As in Zhou et al. (2012), it can also be equivalently constructed by first generating $n_{vj}$ and then assigning them to the latent factors using the multinomial distributions as

2.2 Multinomial mixed-membership model

This alternative representation suggests a potential link of PFA to a standard mixed-membership model for text analysis such as probabilistic latent semantic analysis (PLSA) (Hofmann, 1999) and latent Dirichlet allocation (LDA) (Blei et al., 2003). Given the factors $\boldsymbol{\phi}_{k}$ and factor proportions $\boldsymbol{\theta}_{j}/\theta_{\boldsymbol{\cdot}j}$ , where $\sum_{v=1}^{V}\phi_{vk}=1$ and $\theta_{\boldsymbol{\cdot}j}:=\sum_{k=1}^{K}\theta_{kj}$ , a standard procedure is to associate $x_{ji}\in\{1,\ldots,V\}$ , the $i$ th index (word) of sample $j$ , with factor (topic) $z_{ji}\in\{1,\ldots,K\}$ , and generate a bag of indices $\{x_{j1},\ldots,x_{jn_{j}}\}$ as

where $n_{j}=\sum_{v=1}^{V}n_{vj}$ and $n_{vj}=\sum_{i=1}^{n_{j}}\delta(x_{ji}=v)$ . We refer to (4) as the multinomial mixed-membership model. LDA completes the multinomial mixed-membership model by imposing the Dirichlet priors on both $\{\boldsymbol{\phi}_{k}\}_{k}$ and $\{\boldsymbol{\theta}_{j}/\theta_{\boldsymbol{\cdot}j}\}_{j}$ (Blei et al., 2003).

If in additional to the multinomial mixed-membership model described in (4), one further generates the sample lengths as

then the joint likelihood of $\boldsymbol{x}_{j}:=(x_{j1},\ldots,x_{jn_{j}})$ , $\boldsymbol{z}_{j}:=(z_{j1},\ldots,z_{jn_{j}})$ , and $n_{j}$ given $\boldsymbol{\Phi}$ and $\boldsymbol{\theta}_{j}$ can be expressed as

whose product with the combinatorial coefficient ${n_{j}!}/({\prod_{v=1}^{V}\prod_{k=1}^{K}n_{vjk}!})$ becomes the same as the likelihood $P(\{n_{vj},n_{vj1},\ldots,n_{vjK}\}_{v}\,|\,\boldsymbol{\Phi},\boldsymbol{\theta}_{j})$ of (2).

From the viewpoint of PFA, shown in (2), and its alternative representation constituted by (4) and (5), a wide variety of discrete latent variable models, such as nonnegative matrix factorization (NMF) (Lee and Seung, 2001), PLSA, LDA, the gamma-Poisson model of Canny (2004), and the discrete component analysis of Buntine and Jakulin (2006), all have the same mechanism to model the covariate counts that they generate both the covariate and factor indices using the categorical distributions shown in (4); they mainly differ from each other on how the priors on $\boldsymbol{\phi}_{k}$ and $\boldsymbol{\theta}_{j}$ (or $\boldsymbol{\theta}_{j}/\theta_{\boldsymbol{\cdot}j}$ ) are constructed (Zhou et al., 2012; Zhou and Carin, 2015).

3 Negative binomial factor analysis and Dirichlet-multinomial mixed-membership model

To better model overdispersed counts, instead of following PFA to factorize the covariate-sample count matrix under the Poisson likelihood, we propose negative binomial (NB) factor analysis (NBFA) to factorize it under the NB likelihood as

where $n\sim\mbox{NB}(r,p)$ denote the NB distribution with probability mass function (PMF) $f_{N}(n\,|\,r,p)=\frac{\Gamma(n+r)}{n!\Gamma(r)}p^{n}(1-p)^{r},$ where $n\in\{0,1,\ldots\}$ . NBFA has an augmented representation as

Similar to how the Poisson distribution is related to the multinomial distribution (e.g., Dunson and Herring (2005) and Lemma 4.1 of Zhou et al. (2012)), we reveal how the NB distribution is related to the Dirichlet-multinomial distribution using Theorem 1 shown below, whose proof is provided in Appendix H. As in Mosimann (1962), marginalizing out $\boldsymbol{\theta}$ from $\boldsymbol{z}\sim\prod_{i=1}^{n}\mbox{Cat}(z_{i};\boldsymbol{\theta}),~{}\boldsymbol{\theta}\sim\mbox{Dir}(r_{1},\ldots,r_{K})$ leads to a Dirichlet-categorical (DirCat) distribution with PMF

where $n_{k}=\sum_{i=1}^{n}\delta(z_{i}=k)$ , and a Dirichlet-multinomial (DirMult) distribution with PMF $P[(n_{1},\ldots,n_{K})\,|\,r_{1},\ldots,r_{K}]=\frac{n!}{\prod_{k=1}^{K}n_{k}!}P(\boldsymbol{z}\,|\,r_{1},\ldots,r_{K}).$

Let $\boldsymbol{x}=(x,x_{1},$ $\ldots,x_{K})$ be random variables generated as

Set $r_{\boldsymbol{\cdot}}=\sum_{k=1}^{K}r_{k}$ and let $\boldsymbol{y}=(y,y_{1},\ldots,y_{K})$ be random variables generated as

Then the distribution of $\boldsymbol{x}$ is the same as that of $\boldsymbol{y}$ .

Using Theorem 1, $(n_{vj},n_{vj1},\ldots,n_{vjK})$ in (6) can be equivalently generated as

Clearly, how the factorization under the NB likelihood is related to the Dirichlet-multinomial distribution, as in (6) and (8), mimics how the factorization under the Poisson likelihood is related to the multinomial distribution, as in (2) and (3).

3.2 The Dirichlet-multinomial mixed-membership model

Similar to how we relate PFA in (2) to the multinomial topic model in (4), as in Section 0.2.1, we may relate NBFA in (6) to a mixed-membership model constructed by replacing the categorical distributions in (4) with the Dirichlet-categorical distributions as

where $\{\boldsymbol{\phi}^{[j]}_{k}\}_{k}$ and $\boldsymbol{\theta}^{[j]}$ represent the factors and factor scores specific for sample $j$ , respectively. A graphical representation of the model, including the settings of the hyper-priors to be discussed later, is shown in Figure 1. Introducing $\boldsymbol{\phi}^{[j]}_{k}$ into the hierarchical model allows the same set of factors $\{\boldsymbol{\phi}_{k}\}_{k}$ to be manifested differently in different samples, whereas introducing $\boldsymbol{\theta}^{[j]}$ allows each sample to have two different representations: the factor scores $\boldsymbol{\theta}^{[j]}$ under the sample-specific factors $\{\boldsymbol{\phi}^{[j]}_{k}\}_{k}$ , and the factor scores $\boldsymbol{\theta}_{j}$ under the shared factors $\{\boldsymbol{\phi}_{k}\}_{k}$ . In addition, under this construction, the variance-to-mean ratio of $\phi^{[j]}_{vk}$ given $\boldsymbol{\phi}_{k}$ and $\theta_{kj}$ becomes $({1-\phi_{vk}})/({\theta_{kj}+1}),$ which monotonically decreases as the corresponding factor score $\theta_{kj}$ increases, allowing the variability of $\boldsymbol{\phi}^{[j]}_{k}$ in the prior to be controlled by the popularity of $\boldsymbol{\phi}_{k}$ in the corresponding sample. Moreover, this construction helps simplify the model likelihood and allows the model to be closely related to NBFA, as discussed below.

Explicitly drawing $\{\boldsymbol{\phi}^{[j]}_{k}\}_{k}$ for all samples would be computationally prohibitive, especially if the number of samples is large. Fortunately, this operation is totally unnecessary. By marginalizing out $\boldsymbol{\phi}^{[j]}_{k}$ and $\boldsymbol{\theta}^{[j]}$ in (9), we have

where $\boldsymbol{z}_{j}:=(z_{1},\ldots,z_{jn_{j}})$ , $n_{vjk}:=\sum_{i=1}^{n_{j}}\delta(x_{ji}=v,z_{ji}=k)$ , and $n_{\boldsymbol{\cdot}jk}:=\sum_{v=1}^{V}n_{vjk}$ . Thus the joint likelihood of $\boldsymbol{x}_{j}:=(x_{j1},\ldots,x_{jn_{j}})$ and $\boldsymbol{z}_{j}$ given $\boldsymbol{\Phi}$ , $\boldsymbol{\theta}_{j}$ , and $n_{j}$ can be expressed as

We call the model shown in (9) or (10) as the Dirichlet-multinomial mixed-membership (DMMM) model, whose likelihood given the factors and factor scores is shown in (11).

We introduce the following proposition, with proof provided in Appendix H, to show that one can recover NBFA from the DMMM model by randomizing its sample lengths with the NB distributions, and can reduce NBFA to the DMMM model by conditioning on these lengths. Thus how NBFA and the DMMM are related to each other mimics how PFA and the multinomial mixed-membership model are related to each other.

For the DMMM model that generates the covariate and factor indices using (9) or (10), if the sample lengths are randomized as

then the joint likelihood of $\boldsymbol{x}_{j}$ , $\boldsymbol{z}_{j}$ , and $n_{j}$ given $\boldsymbol{\Phi}$ , $\boldsymbol{\theta}_{j}$ , and $p_{j}$ , multiplied by the combinatorial coefficient ${n_{j}!}/({\prod_{v=1}^{V}\prod_{k=1}^{K}n_{vjk}!})$ , is equal to the likelihood of negative binomial factor analysis (NBFA) described in (6) or (8), expressed as

The DMMM model could model not only the burstiness of the covariates, but also that of the factors via the Dirichlet-categorical distributions, as further explained below when discussing related models. As far as the conditional posteriors of $\boldsymbol{\phi}_{k}$ and $\boldsymbol{\theta}_{j}$ are concerned, the DMMM model with the lengths of its samples randomized via the NB distributions, as shown in (10) and (12), is equivalent to NBFA, as shown in (6). The representational differences, however, lead to different inference strategies, which will be discussed in detail along with their nonparametric Bayesian generalizations.

3.3 Comparisons with related models

Preceding the Dirichlet-multinomial mixed-membership (DMMM) model proposed in this paper, to account for covariate burstiness, Doyle and Elkan (2009) proposed Dirichlet compound multinomial LDA (DCMLDA) that can be expressed as

where the Dirichlet prior is further imposed on $\boldsymbol{\theta}_{j}$ . Note that the sample-specific factor scores $\{\boldsymbol{\theta}_{j}\}_{j}$ are represented under the sample-specific factors $\{\boldsymbol{\phi}^{[j]}_{k}\}_{k}$ in DCMLDA, as shown in (14), whereas they are represented under the same set of factors $\{\boldsymbol{\phi}_{k}\}_{k}$ in the DMMM model, as shown in (10).

Comparing (9) with (14), it is clear that removing $\boldsymbol{\theta}^{[j]}$ from (9) reduces the DMMM model to DCMLDA in (14). Moreover, if we further let

then the joint likelihood of $\boldsymbol{x}_{j}$ , $\boldsymbol{z}_{j}$ , and $n_{j}$ given $\boldsymbol{\Phi}$ , $\boldsymbol{r}$ , and $p_{j}$ can be expressed as

which multiplied by the combinatorial coefficient ${n_{j}!}/({\prod_{v=1}^{V}\prod_{k=1}^{K}n_{vjk}!})$ is equal to the likelihood of $\{n_{vj},n_{vj1},\ldots,n_{vjK}\}_{v=1,V}$ given $\boldsymbol{\Phi}$ , $\boldsymbol{r}$ , and $p_{j}$ in

Thus, as far as the conditional posteriors of $\{\boldsymbol{\phi}_{k}\}_{k}$ and $\{r_{k}\}_{k}$ are concerned, DCMLDA constituted by (14)-(15) is equivalent to a special case of NBFA shown in (17), which is the augmented representation of ${\boldsymbol{n}}_{j}\sim\mbox{NB}(\boldsymbol{\Phi}\boldsymbol{r},p_{j})$ that restricts all samples to have the same factor scores $\{r_{k}\}_{k}$ under the same set of shared factors $\{\boldsymbol{\phi}_{k}\}_{k}$ .

Given the factors $\{\boldsymbol{\phi}_{k}\}_{k}$ and factor scores $\boldsymbol{\theta}_{j}$ , for the multinomial mixed-membership model in (4), both the covariate and factor indices are independently drawn from the categorical distributions; for DCMLDA in (14), the factor indices but not the covariate indices are independently drawn from the categorical distributions; whereas for the DMMM model in (9), neither the factor indices nor covariate indices are independently drawn from the categorical distributions. Denoting $y^{-ji}$ as the variable $y$ obtained by excluding the contribution of the $i$ th word in sample $j$ , we compare in Table 1 these three different models on their predictive distributions of $x_{ji}$ and $z_{ji}$ .

In comparison to the multinomial mixed-membership model, DCMLDA allows the number of times that a covariate appears in a sample to exhibit the “rich get richer” ( $i.e.$ , self-excitation) behavior, leading to a better modeling of covariate burstiness, and the DMMM model further models the burstiness of the factor indices and hence encourages not only self-excitation, but also cross-excitation of covariate frequencies. It is clear from Table 1 that DCMLDA models covariate burstiness but not factor burstiness, and the corresponding NBFA restricts all samples to have the same factor scores under the shared factors $\{\boldsymbol{\phi}_{k}\}_{k}$ . Thus we expect the DMMM model to clearly outperform DCMLDA, as will be confirmed by our experimental results.

4 Hierarchical gamma-negative binomial process negative binomial factor analysis

To support countably infinite factors for the DMMM model, we consider a hierarchical gamma-negative binomial process (hGNBP) as

where $X\sim\mbox{NBP}(\Theta,p)$ is a NB process defined such that $X(A_{i})\sim\mbox{NB}(\Theta(A_{i}),p)$ are independent NB random variables for disjoint partitions $A_{i}$ of $\Omega$ . Given a gamma process random draw $G=\sum_{k=1}^{\infty}r_{k}\delta_{\boldsymbol{\phi}_{k}}$ , we have

where $\theta_{kj}:=\Theta_{j}(\boldsymbol{\phi}_{k})$ measures the weight of factor $k$ in sample $j$ , and

where $n_{j}=X_{j}(\Omega)$ is the length of sample $j$ and $n_{jk}:=X_{j}(\boldsymbol{\phi}_{k})=\sum_{i=1}^{n_{j}}\delta(z_{ji}=k)$ represents the number of times that factor $k$ appears in sample $j$ .

We provide posterior analysis for the proposed hGNBP in Appendix I, where we utilize several additional discrete distributions, including the Chinese restaurant table (CRT), logarithmic, and sum-logarithmic (SumLog) distributions, that will also be used in the following discussion.

where the atoms of the gamma process are drawn from a Dirichlet base distribution $\boldsymbol{\phi}_{k}\sim\mbox{Dir}(\eta,\ldots,\eta).$ We further let $\gamma_{0}\sim\mbox{Gamma}(a_{0},1/b_{0})$ and $c_{0}\sim\mbox{Gamma}(e_{0},1/f_{0})$ . With Proposition 2, the above model can also be represented as a hGNBP-NBFA as

The DMMM model in (18) and NBFA in (19) have the same conditional posteriors for both the factors $\{\boldsymbol{\phi}_{k}\}_{k}$ and factor scores $\{\boldsymbol{\theta}_{j}\}_{j}$ , but lead to different inference strategies. To infer $\{\boldsymbol{\phi}_{k}\}_{k}$ and $\{\boldsymbol{\theta}_{j}\}_{j}$ , we first develop both blocked and collapsed Gibbs sampling with (18), and then develop blocked Gibbs sampling with (19), as described below.

5 Inference via Gibbs sampling

We present in this section three different Gibbs sampling algorithms, as summarized in Algorithm 1 in the Appendix.

As it is impossible to draw all the countably infinite atoms of a gamma process draw, expressed as $G=\sum_{k=1}^{\infty}r_{k}\delta_{\boldsymbol{\phi}_{k}}$ , for the convenience of implementation, it is common to consider truncating the total number of atoms to be $K$ by choosing a discrete base measure as $G_{0}=\sum_{k=1}^{K}\frac{\gamma_{0}}{K}\delta_{\boldsymbol{\phi}_{k}}$ , under which we have $r_{k}\sim\mbox{Gamma}(\gamma_{0}/K,1/c_{0})$ for $k\in\{1,\ldots,K\}$ (Zhou and Carin, 2015). The finite truncation strategy is also commonly used for Dirichlet process mixture models (Ishwaran and James, 2001; Fox et al., 2011). Although fixing $K$ often works well in practice, it may lead to a considerable waste of computation if $K$ is set to be too large. For nonparametric Bayesian mixture models based on the Dirichlet process (Ferguson, 1973; Escobar and West, 1995) or other more general normalized random measures with independent increments (NRMIs) (Regazzini et al., 2003; Lijoi et al., 2007), one may consider slice sampling to adaptively truncate the number of atoms used in each Markov chain Monte Carlo (MCMC) iteration (Walker, 2007; Papaspiliopoulos and Roberts, 2008). Unlike NRMIs whose atoms’ weights have to sum to one and hence are negatively correlated, the weights of the atoms of completely random measures are independent from each other. Exploiting this property, for our models built on completely random measures, we construct a sampling procedure that adaptively truncates the total number of atoms in each iteration.

and set $K:=K^{+}+K_{\star}$ as the total number of atoms to be used for the next iteration.

5.2 Collapsed Gibbs sampling

One common strategy to improve convergence and mixing for multinomial mixed-membership models is to collapse the factors $\{\boldsymbol{\phi}_{k}\}_{k}$ and factor scores $\{\boldsymbol{\theta}_{j}\}_{j}$ in the sampler (Griffiths and Steyvers, 2004; Newman et al., 2009). To apply this strategy to the DMMM model, we first need to transform the likelihood in (16) to make it amenable to marginalization. Using an analogy similar to that for the Chinese restaurant franchise of Teh et al. (2006), if we consider $z_{ji}$ as the index of the “dish” that the $i$ th “customer” in the $j$ th “restaurant” takes, then, to make the likelihood in (16) become fully factorized, we may introduce $b_{ji}$ as the index of the table at which this customer is seated. The following proposition, whose proof is provided in Appendix A, reveals how the CRP can be related to the Dirichlet-multinomial and NB distributions, and shows how to introduce auxiliary variables to make the likelihood of $\boldsymbol{z}\sim\mbox{DirCat}(n,r_{1},\ldots,r_{K})$ , as shown in (7), become fully factorized.

Given the sample length $n$ (number of customers) and $\boldsymbol{r}=(r_{1},\ldots,r_{K})$ , the joint distribution of the “table” indices ${\boldsymbol{b}}=(b_{1},\ldots,b_{n})$ and “dish” indices $\boldsymbol{z}=(z_{1},\ldots,z_{n})$ in

If we further randomize the sample length as

then we have the PMF for the joint distribution of ${\boldsymbol{b}}$ , $\boldsymbol{z}$ , and $n$ given $\boldsymbol{r}$ and $p$ in a fully factorized form as

Using (16) and Proposition 3, introducing the auxiliary variables

we have the joint likelihood for the DMMM model as

whose likelihood $P({\boldsymbol{b}}_{j},\boldsymbol{x}_{j},\boldsymbol{z}_{j},n_{j}\,|\,\boldsymbol{\Phi},\boldsymbol{\theta}_{j},p_{j})$ is the same as the likelihood, as shown in (24), of the DMMM model constituted of (10) and (12) and augmented with (23).

We outline the collapsed Gibbs sampler for the hGNBP-NBFA in Algorithm 1 and provide the derivation and update equations below. This collapsed sampling strategy marginalizes out both the factors $\{\boldsymbol{\phi}_{k}\}$ and factor scores $\{\boldsymbol{\theta}_{j}\}_{j}$ , but at the expense of introducing an auxiliary variable $b_{ji}$ for each index $x_{ji}$ .

Marginalizing out $\boldsymbol{\Phi}$ and $\boldsymbol{\Theta}$ from $\prod_{j}P({\boldsymbol{b}}_{j},\boldsymbol{x}_{j},\boldsymbol{z}_{j},n_{j}\,|\,\boldsymbol{\Phi},\boldsymbol{\theta}_{j},p_{j})$ , we have

5.3 Blocked Gibbs sampling under compound Poisson representation

Rather than representing NBFA in (19) as the DMMM model in (18), we may directly consider its compound Poisson representation as

All the other model parameters can be sampled in the same way as they are sampled in the regular blocked Gibbs sampler, with (J.1)-(J.2).

This new sampling algorithm not only is less expensive in computation, but also may converge much faster as there is no need to worry about the dependencies between the MCMC samples for the factor indices $z_{ji}$ , which are not used at all under the compound Poisson representation. Note that the collapsed Gibbs sampler in Section 0.5.2 removes the need to sample $\boldsymbol{\Phi}$ and $\boldsymbol{\Theta}$ at the expense of having to sample $z_{ji}$ and augment a $b_{ij}$ for each $z_{ij}$ . We will show in Appendix M that maintaining $\boldsymbol{\Phi}$ and $\boldsymbol{\Theta}$ but removing the need to sample $z_{ji}$ leads to a more effective sampler.

5.4 Model comparison

We also consider NBFA based on the GNBP, in which each of the $J$ samples is assigned with a sample-specific NB process and a globally shared gamma process is mixed with all the $J$ NB processes. Given its connection to DCMLDA, as revealed in Section 0.3.3, we call this nonparametric Bayesian model as the GNBP-DCMLDA, which is shown to be restrictive in that although each sample has its own factor scores under the corresponding sample-specific factors, all the samples are enforced to have the same factor scores under the same set of globally shared factors. By contrast, by modeling not only the burstiness of the covariates, but also that of the factors, the hGNBP-NBFA provides sample-specific factor scores under the same set of shared factors, making it suitable for extracting low-dimensional latent representations for high-dimensional covariate count vectors.

We describe in detail in Appendix K how to use the gamma-negative binomial process (GNBP) as a nonparametric Bayesian prior for both PFA and DCMLDA. In the prior, for PFA, we have $n_{vj}\sim\mbox{Pois}(\sum_{k}\phi_{vk}\theta_{kj})$ , whereas for NBFA, we have $n_{vj}\sim\mbox{NB}(\sum_{k}\phi_{vk}\theta_{kj},p_{j})$ , which can be augmented as

Thus we have $\textstyle(\lambda_{vj}\,|\,n_{vj},\boldsymbol{\Phi},\boldsymbol{\theta}_{j},p_{j})\sim\mbox{Gamma}\left(n_{vj}+\sum_{k}\phi_{vk}\theta_{kj},~{}p_{j}\right)$ for NBFA. Similarly, we have $(\lambda_{vj}\,|-)\sim\mbox{Gamma}\left(n_{vj}+\sum_{k}\phi_{vk}r_{k},~{}p_{j}\right)$ for the GNBP-DCMLDA. To better understand the similarities and differences between the GNBP-PFA, GNBP-DCMLDA, and hGNBP-NBFA, in Table 2 we compare their Poisson rates of $n_{vj}$ , estimated with the factors and factor scores in a single MCMC sample, and several other important model properties.

To estimate the latent Poisson rates for each count $n_{vj}$ and hence the smoothed normalized covariate frequencies, it is clear from the second row of Table 2 that PFA (the multinomial mixed-membership model) solely relies on the inferred factors $\{\boldsymbol{\phi}_{k}\}$ and factor scores $\{\theta_{kj}\}$ , DCMLDA adds a sample-invariant smoothing parameter, calculated as $\sum_{v}\phi_{vk}r_{k}$ , into the observed count $n_{vj}$ and weights that sum by a sample-specific probability parameter $p_{j}$ , whereas NBFA (the DMMM model) adds a sample-specific smoothing parameter, calculated as $\sum_{v}\phi_{vk}\theta_{kj}$ , into the observed count $n_{vj}$ and weights that sum by $p_{j}$ . Thus PFA represents an extreme that the observed counts are used to infer the factors and factor scores but are not used to directly estimate the Poisson rates; DCMLDA represents another extreme that the covariate frequencies in all samples are indiscriminately smoothed by the same set of smoothing parameters; whereas NBFA combines the observed counts with the inferred sample-specific smoothing parameters. This unique working mechanism also makes NBFA have reduced hyper-parameter sensitivity, as will be demonstrated with experiments.

6 Example Results

We apply the proposed models to factorize covariate-sample count matrices, each column of which is represented as a $V$ dimensional covariate-frequency count vector, where $V$ is the number of unique covariates. We set the hyper-parameters as $a_{0}=b_{0}=0.01$ and $e_{0}=f_{0}=1$ . We consider the JACM (http://www.cs.princeton.edu/ $\sim$ blei/downloads/), Psychological Review (PsyReview, http://psiexp.ss.uci.edu/research/programs $\_$ data/toolbox.htm), and NIPS12 (http://www.cs.nyu.edu/ $\sim$ roweis/data.html) datasets, choosing covariates that occur in five or more samples. In addition, we consider the 20newsgroups dataset (http://qwone.com/ $\sim$ jason/20Newsgroups/), consisting of 18,774 samples from 20 different categories. It is partitioned into a training set of 11,269 samples and a testing set of 7,505 ones that were collected at later times. We remove a standard list of stopwords and covariates that appear less than five times. As summarized in Table 3 of Appendix N, for the PysReview and JACM datasets, each of whose sample corresponds to the abstract of a research paper, the average sample lengths are only about 56 and 127, respectively. By contrast, a NIPS12 sample that includes the words of all sections of a research paper is in average more than ten times longer. By varying the percentage of covariate indices randomly selected from each sample for training, we construct a set of covariate-sample matrices with a large variation on the average lengths of samples, which will be used to help make comparison between different models. Depending on applications, we either treat the Dirichlet smoothing parameter $\eta$ as a tuning parameter, or sample it via data augmentation, as described in Appendix L.

To learn the factors in all the following experiments, we use the compound Poisson representation based blocked Gibbs sampler for both the hGNBP-NBFA and GNBP-NBFA, and use collapsed Gibbs sampling for the GNBP-PFA. We compare different samplers for the hGNBP-NBFA and provide justifications for choosing these samplers in Appendix M.

We randomly choose a certain percentage of the covariate indices in each sample as training, and use the remaining ones to calculate heldout perplexity. As shown in Zhou and Carin (2015), the GNBP-PFA performs similarly to the hierarchical Dirichlet process LDA of Teh et al. (2006) and outperforms a wide array of discrete latent variable models, thus we choose it for comparison. To demonstrate the importance of modeling both the burstiness of the covariates and that of the factors, we also make comparison to the GNBP-DCMLDA that considers only covariate burstiness. Since the inferred number of factors and hence the performance often depends on the Dirichlet smoothing parameter $\eta$ , we set $\eta$ as $0.005$ , $0.02$ , $0.01$ , $0.05$ , $0.10$ , $0.25$ , or $0.50$ . We vary both the training percentage and $\eta$ to examine how the average sample length and the value of $\eta$ influence the behaviors of the GNBP-PFA, GNBP-DCMLDA, and hGNBP-NBFA and impact their performance relative to each other.

where $s\in\{1,\ldots,S\}$ is the index of a collected MCMC sample, $m^{\text{test}}_{vj}$ is the number of test covariate indices at covariate $v$ in sample $j$ , $m^{\text{test}}_{\boldsymbol{\cdot}\boldsymbol{\cdot}}=\sum_{v}\sum_{j}m^{\text{test}}_{vj}$ , and $\lambda_{vj}^{(s)}$ are computed using the equations shown in the second row of Tabel 2, $e.g.$ , we have $\lambda^{(s)}_{vj}=\left(n_{vj}+\sum_{k=1}^{K^{+}+K_{\star}}\phi_{vk}^{(s)}\theta_{kj}^{(s)}\right)p_{j}^{(s)}$ for the hGNBP-NBFA. For each unique combination of $\eta$ and the training percentage, the results are averaged over five random training/testing partitions. The evaluation method is similar to those used in Wallach et al. (2009b), Paisley et al. (2012), and Zhou and Carin (2015). All algorithms are coded in MATLAB, with the steps of sampling factor and table indices coded in C to optimize speed. We terminate a trial and omit the results for that particular setting if it takes a single core of an Intel Xeon 3.3 GHz CPU more than 24 hours to finish 5000 iterations. The code will be made available in the author’s website for reproducible research.

We first consider the NIPS12 dataset, whose average sample length is about 1323, and present its results in Figures 3-5. We also consider both the PsyReview and JACM datasets, whose average sample lengths are about 56 and 127, respectively, and provide related plots in Appendix N.

General observations

For multinomial mixed-membership models, generally speaking, the smaller the Dirichlet smoothing parameter $\eta$ is, the more sparse and specific the inferred factors are encouraged to be, and the larger the number of inferred factors using a nonparametric Bayesian mixed-membership modeling prior, such as the hierarchical Dirichlet process and the gamma- and beta-negative binomial processes (Paisley et al., 2012; Zhou et al., 2012). As shown in Figures 3(a)-(e), for the hGNBP-NBFA, a nonparametric Bayesian DMMM model, we observe a relationship between the number of inferred factors and $\eta$ similar to that for the GNBP-PFA, a nonparametric Bayesian multinomial mixed-membership model.

In comparison to multinomial mixed-membership models such as the GNBP-PFA, what make the hGNBP-NBFA different and desirable are: 1) its parsimonious representation that uses fewer factors to achieve better heldout prediction, as shown in Figures 3(a)-(e); 2) ts distinct mechanism in adjusting its number of inferred factors according to the lengths of samples, as shown in Figures 5(a)-(f); 3) its significantly lower computational complexity for a covariate-sample matrix with long samples (large column sums), with the differences becoming increasingly more significant as the the average sample length increases, as shown in Figures 3(f)-(j) and 5(a)-(f); 4) its ability to achieve the same predictive power with significantly less time, as shown in Figures 5(g)-(l); 5) and its overall better predictive performance both under various values of $\eta$ while controlling the sample lengths, as shown in Figures 3(f)-(j), and under various sample lengths while controlling $\eta$ , as shown in Figures 5(g)-(l).

Detailed discussions

Distinct behavior and parsimonious representation. When fixing $\eta$ but gradually increasing the average sample length, the number of factors inferred by a nonparametric Bayesian multinomial mixed-membership model such as the GNBP-PFA often increases at a near-constant rate, as shown with the blue curves in Figure 5(a)-(f). The GNBP-DCMLDA, which models covariate burtiness, behaves similarly in the number of inferred factors, as shown with the red curves in Figure 5(a)-(f). Under the same setting, the number of inferred factors by the hGNBP-NBFA often first increases at a similar near-constant rate when the average sample length is short, however, it starts decreasing once the average sample length becomes sufficiently long, and eventually turns around and increases, but at a much lower rate, as the average sample length further increases, as shown with the yellow curves in Figure 5(a)-(f). This distinct behavior implies that by exploiting its ability to model both covariate and factor burstiness, the hGNBP-NBFA could have a parsimonious representation of a dataset with long samples. By contrast, a nonparametric Bayesian multinomial mixed-membership model, such as the GNBP-PFA, models neither covariate nor factor burstiness. Consequently, it has to increase its latent dimension at a near-constant rate as a function of the average sample length, in order to adequately capture self- and cross-excitation of covariate frequencies, which are often more prominent in longer samples. It is clear that by decreasing $\eta$ and hence increasing the number of inferred factors, the GNBP-PFA can gradually approach and eventually outperform the GNBP-DCMLDA, but still clearly underperform the hGNBP-NBFA in most cases, even if using many more factors and consequently significantly more computation.

Combining factorization and the modeling of burstinss. As shown in Figure 3(f), when the training percentage is as small as $2\%$ , the GNBP-DCMLDA, which combines the observed counts $n_{vj}$ and the inferred sample-invariant smoothing parameters $\sum_{k}\phi_{vk}r_{k}$ to estimate the Poisson rates (and hence the smoothed normalized covariate frequencies), achieves the best predictive performance (lowest heldout perplexity); the hGNBP-NBFA tries to improve DCMLDA by combining the observed counts and document-specific smoothing parameters $\sum_{k}\phi_{vk}\theta_{kj}$ , and the GNBP-PFA only relies on $\sum_{k}\phi_{vk}\theta_{kj}$ , yielding slightly and significantly worse performance, respectively, at this relatively extreme setting. This suggests that when the observed counts are too small, using factorization may not provide any advantages than simply smoothing the raw covariate counts with sample-invariant smoothing parameters.

As the training percentage increases, all three algorithms quickly improve their performance, as shown in Figures 3(g)-(j). Given a training percentage that is sufficiently large, e.g., 10% for this dataset ( $i.e.$ , the average training sample length is about 132), all three algorithms tend to increase their numbers of inferred factors $K^{+}$ as $\eta$ decreases, although the hGNBP-NBFA usually has a lower increasing rate. They differ from each other significantly, however, on how the performance improves as the inferred number factors increases, as shown in Figures 3(a)-(e): for the GNBP-DCMLDA, as it relies on $\sum_{k}\phi_{vk}r_{k}$ to smooth the observed counts, its predictive power is almost invariant to the change of $\eta$ and its number of factors; for the GNBP-PFA, by decreasing $\eta$ and hence increasing its number of inferred factors, it can approach and eventually outperform DCMLDA; whereas for the hGNBP-NBFA, it follows DCMLDA closely when $\eta$ is large or the lengths of samples are short, but often reduces its rate of increase for the number of inferred factors as $\eta$ decreases and quickly lowers its perplexity as $K^{+}$ increases, as long as $\eta$ is sufficiently small or the samples are sufficiently long. Thus in general, the hGNBP-NBFA provides the lowest perplexity using the least number of inferred factors.

Note that when the lengths of the training samples are short, setting $\eta$ to be large will make the factors $\boldsymbol{\phi}_{k}$ of NBFA become over-smoothed and hence NBFA becomes essentially the same as DCMLDA. As $\eta$ decreases given the same average sample length, or as the average sample length increases given the same $\eta$ , the factorization of NBFA with sample-dependent factor scores gradually take effect to improve the estimation of the Poisson rates and hence the smoothed normalized covariate frequencies for each sample. Overall, by combining the factorization, as used in PFA, the modeling of covariate burstiness, as used in DCMLDA, and the modeling of factor burstiness, unique to NBFA, the hGNBP-NBFA captures both self- and cross-excitation of covariate frequencies and achieves the best predictive performance with the most parsimonious representation as long as the average sample length is not too short and the value of $\eta$ is not set too large to overly smooth the factors.

Significantly lower computation for sufficiently long samples. For the GNBP-PFA, the collapsed Gibbs sampler samples all the factor indices with a computational complexity of $O(n_{\boldsymbol{\cdot}\boldsymbol{\cdot}}K^{+})$ , whereas for the hGNBP-NBFA, the corresponding computation has a complexity of $O\big{[}\sum_{v}\sum_{j}\ln(n_{vj}+1)K^{+}\big{]}$ and sampling $\{\phi_{k}\}_{k}$ and $\{\boldsymbol{\theta}_{j}\}_{j}$ adds an additional computation of $O(VK^{+}+NK^{+})$ . Thus the computation for the hGNBP-NBFA not only is often lower given the same $K^{+}$ for a dataset consisting of sufficiently long samples, but also becomes much lower because the inferred $K^{+}$ is often much smaller when the sample lengths are sufficiently long. For example, as shown in Figure 3(i), when 30% of the covariate indices in each sample are used for training, which means the average training sample length is about 397, the time for the GNBP-PFA to finish 5000 Gibbs sampling iteration on a 3.3 GHz CPU is about double that for the hGNBP-NBFA when their inferred numbers of factors are similar to each other; and when 20% of the covariate indices in each sample are used for training ( $i.e.$ , the average training sample length is around 265), in comparison to the hGNBP-NBFA, the GNBP-PFA takes about three times more minutes when $\eta=0.1$ , as shown in Figures 5(b), and four times more minutes when $\eta=0.01$ , as shown in Figures 5(e). Overall, for a dataset whose samples are not too short to exhibit self- and cross-excitation of covariate frequencies, the hGNBP-NBFA often takes the least time to finish the computation while controlling the value of $\eta$ and average sample length, has lower computation given the same inferred number of factors $K^{+}$ , and achieves a low perplexity with significantly less computation.

6.2 Unsupervised feature learning for classification

To further verify the advantages of NBFA that models both self- and cross-excitation of covariate frequencies, we use the proposed models to extract low-dimensional feature vectors from high-dimensional covariate-frequency count vectors of the 20newsgroups dataset, and then examine how well the unsupervisedly extracted feature vector of a test sample can be used to correctly classify it to one of the 20 news groups. As the classification accuracy often strongly depends on the dimension of the feature vectors, we truncate the total number of factors at $K=25$ , 50, 100, 200, 400, 600, 800, or 1000. Correspondingly, we slightly modify the gamma process based nonparametric Bayesian models by choosing a discrete base measure for the gamma process as $G_{0}=\sum_{k=1}^{K}\frac{\gamma_{0}}{K}\delta_{\boldsymbol{\phi}_{k}},~{}\boldsymbol{\phi}_{k}\sim\mbox{Dir}(\eta,\ldots,\eta)$ . Thus in the prior we now have $r_{k}=G(\boldsymbol{\phi}_{k})\sim\mbox{Gamma}(\gamma_{0}/K,1/c_{0})$ and consequently the Gibbs sampling update equations for $\{r_{k}\}_{k}$ and $\gamma_{0}$ will also slightly change. We omit these details for brevity, and refer to Zhou and Carin (2015) on how the same type of finite truncation is used in inference for nonparametric Bayesian models.

For this application, we fix the truncation level $K$ but impose the non-informative $\mbox{Gamma}(0.01,1/0.01)$ prior on the Dirichlet smoothing parameter $\eta$ , letting it be inferred from the data using (L.2). The same as before, we consider collapsed Gibbs sampling for the GNBP-PFA and the compound Poisson representation based blocked Gibbs sampler for the hGNBP-NBFA, with the main difference in that a fixed instead of an adaptive truncation is now used for inference. We do not consider the GNBP-DCMLDA here since it does not provide sample specific feature vectors under the same set of shared factors. Note that although we fix $K$ , if $K$ is set to be large enough, not necessarily all factors would be used and hence a truncated model still preserves its ability to infer the number of active factors $K^{+}\leq K$ ; whereas if $K$ is set to be small, a truncated model may lose its ability to infer $K^{+}$ , but it still maintains asymmetric priors (Wallach et al., 2009a) on the factor scores.

We first consider distinguishing between the $alt.atheism$ and $talk.religion.misc$ news groups, and between the $comp.sys.ibm.pc.hardware$ and $comp.sys.mac.hardware$ news groups. For each binary classification task, we remove a standard list of stop words and only consider the covariates that appear at least five times in both newsgroups combined, and report the classification accuracies based on twelve independent runs with random initializations. For the first binary classification task, we have 856 training documents, with 6509 unique terms and about 116K words, while for the second one, we have 1162 training documents, with 4737 unique terms and about 91K words.

In addition to these two binary classification tasks, we consider multi-class classification on the 20newsgroups dataset. After removing stopwords and terms (covariates) that appear less than five times, we obtain $33,420$ unique terms and about 1.4 million words for training, as summarized in Table 3. We use all 11,269 training documents to infer the factors and factor scores, and mimic the same testing procedure used for binary classification to extract low-dimensional feature vectors, with which each testing sample is classified to one of the 20 news groups using the same $L_{2}$ regularized logistic regression. Note the classification accuracies of logistic regression with the raw counts or normalized term frequencies as features are 78.0% and 79.4%, respectively. As shown in Figure 6(f), NBFA generally outperforms PFA in terms of classification accuracies given the same feature dimensions, consistent with our observations for both binary classification tasks. We also observe similar relationship between the $K^{+}$ and inferred $\eta$ as we do in both binary classification tasks.

7 Conclusions

Negative binomial factor analysis (NBFA) is proposed to factorize the covariate-sample count matrix under the NB likelihood. Its equivalent representation as the Dirichlet multinomial mixed-membership model reveals its distinctions from previously proposed discrete latent variable models. The hierarchical gamma-negative binomial process (hGNBP) is further proposed to support NBFA with countably infinite factors, and a compound Poisson representation based blocked Gibbs sampler is shown to converge fast and have low computational complexity. By capturing both self- and cross-excitation of covariate frequencies and by smoothing the observed counts with both sample and covariate specific rates obtained through factorization under the NB likelihood, the hGNBP-NBFA not only infers a parsimonious representation of a covariate-sample count matrix, but also achieves state-of-the-art predictive performance at low computational cost. In addition, the latent feature vectors inferred under the hGNBP-NBFA are better suited for classification than those inferred by the GNBP-PFA. It is of interest to investigate a wide variety of extensions built on Poisson factor analysis under this new modeling framework.

References

H Proofs

where $\boldsymbol{\theta}=(\theta_{1},\ldots,\theta_{K})^{T}$ . Conditioning on $\boldsymbol{\theta}$ and $y$ , we have

conditioning on $\boldsymbol{\theta}$ and $\lambda$ , we have

where $\lambda_{k}=\lambda\theta_{k}$ are independent gamma random variables, as the independent product of the gamma random variable $\lambda$ and the Dirichlet random vector $\boldsymbol{\theta}$ , with the gamma shape parameter and Dirichlet concentration parameter both equal to $r_{\boldsymbol{\cdot}}$ , leads to independent gamma random variables. Further marginalizing out $\lambda_{k}$ , we have

Thus $\boldsymbol{x}$ and $\boldsymbol{y}$ are equal in distribution as their characteristic functions are the same. ∎

Multiplying the likelihood in (11) with the PMF of the NB distribution in (12), we have

which, multiplied by the combinatorial coefficient ${n_{j}!}/({\prod_{v=1}^{V}\prod_{k=1}^{K}n_{vjk}!})$ , becomes the same as (13). ∎

For the first hierarchical model, we have

For the second hierarchical model, we have

For the first hierarchical model, we have

Summing over all ${\boldsymbol{n}}$ in the set $\mathcal{M}_{n,K}=\left\{(n_{1},\ldots,n_{K}):n_{k}\geq 0\text{ and }\sum_{k=1}^{K}n_{k}=n\right\}$ , we have

I Posterior analysis for the hGNBP

Denote $L\sim\mbox{CRTP}(X,G)$ as a CRT process such that $L(A)=\sum_{\omega\in A}L(\omega),L(\omega)\sim\mbox{CRT}[X(\omega),G(\omega)]$ for each $A\subset\Omega$ , and $X\sim\mbox{SumLogP}(L,p)$ as a sum-logarithmic process such that $X(A)\sim\mbox{SumLog}[L(A),p]$ for each $A\subset\Omega$ . As in Zhou and Carin (2015), generalizing the Poisson-logarithmic bivariate distribution, one may show that $X$ and $L$ given $G$ and $p$ in

is equivalent in distribution to those in

where $L\sim\mbox{PP}[-G\ln(1-p)]$ is a Poisson process such that $L(A)\sim\mbox{Pois}[-G(A)\ln(1-p)]$ for each $A\subset\Omega$ . Generalizing the analysis for the GNBP in Zhou and Carin (2015) and Zhou et al. (2016b), with

we can express the conditional posteriors of $G$ and $\Theta_{j}$ as

If we let $\gamma_{0}\sim\mbox{Gamma}(a_{0},1/b_{0})$ , the conditional posterior of $\gamma_{0}$ can be expressed as

If the base measure $G_{0}$ is finite and continuous, we have

which is the number of active atoms that are associated with nonzero counts, otherwise we have $\big{(}\widetilde{\widetilde{L}}(\omega_{k})\,|\,\{\widetilde{L}_{j}\}_{j},G_{0}\big{)}\sim\mbox{CRT}\big{[}\sum_{j}\widetilde{L}_{j}(\omega_{k}),G_{0}(\omega_{k})\big{]}$ for all atoms $\omega_{k}\in\Omega$ . In this paper, we let $K^{+}=\widetilde{\widetilde{L}}(\Omega)$ denote the number of active atoms.

J Additional Gibbs sampling update equations

Sample $r_{k}$ . We first sample latent counts and then sample $\gamma_{0}$ and $c_{0}$ as

and for the absolutely continuous space $\{\boldsymbol{\phi}_{k}\}_{k:n_{\boldsymbol{\cdot}\boldsymbol{\cdot}k}=0}$ , we draw $K_{\star}$ unused atoms, whose weights are sampled as

J.2 Additional update equations for collapsed Gibbs sampling

The other model parameters can all be sampled in the way similar to how they are sampled in Section 0.5.1. Below we highlight the differences. First, we do not need to sample $\{\boldsymbol{\phi}_{k}\}$ . Instead of sampling $\{\theta_{kj}\}_{k}$ , we only need to sample $\theta_{\boldsymbol{\cdot}j}$ as

For the absolutely continuous space, we have

K Gamma-negative binomial process PFA and DCMLDA

We consider the GNBP (Zhou and Carin, 2015) as

The GNBP multinomial mixed-membership model of Zhou and Carin (2015) can be expressed as

which, as far as the conditional posteriors of $\{\boldsymbol{\phi}_{k}\}_{k}$ and $\{\boldsymbol{\theta}_{j}\}_{j}$ are concerned, can be equivalently represented as the GNBP-PFA

Similar to how adaptive truncation is used in blocked Gibbs sampling for the hGNBP-NBFA, one may readily extend the blocked Gibbs sampler for the GNBP multinomial mixed-membership model developed in Zhou and Carin (2015), which has a fixed finite truncation, to a one with adaptive truncation. We omit these details for brevity. We describe a collapsed Gibbs sampler for the GNBP-PFA in Appendix K.1.

As discussed before, the GNBP can also be applied to DCMLDA to support countably infinite factors. We express the GNBP-DCMLDA as

which, as far as the conditional posteriors of $\{\boldsymbol{\phi}_{k}\}_{k}$ and $\{r_{k}\}_{k}$ are concerned, can be equivalently represented as

The restriction is evident from (K.3) as all the samples are enforced to have the same factor scores as $\boldsymbol{r}_{k}$ under the shared factors $\{\boldsymbol{\phi}_{k}\}_{k}$ . Blocked Gibbs sampling with and without sampling $z_{ji}$ for the GNBP-DCMLDA can be similarly derived as those for the hGNBP DMMM model, omitted here for brevity. We describe in detail a collapsed Gibbs sampler for the GNBP-DCMLDA in Appendix K.2.

For the GNBP in (K.1), the conditional likelihood $p(\{X_{j}\}_{1,J}\,|\,G)$ is shown in Appendix B.1 of Zhou et al. (2016b). As there is a one-to-many mapping from $\{X_{j}\}_{1,J}$ to $\boldsymbol{z}=\{z_{11},\ldots,z_{Jm_{J}}\}$ , similar to the analysis in Zhou (2014), we have the joint likelihood of $\boldsymbol{z}$ and the sample lengths ${\boldsymbol{m}}=(m_{1},\ldots,m_{J})$ as

Assuming the $K^{+}$ factors that are associated with nonzero counts are relabeled in an arbitrary oder from $1$ to $K^{+}$ , based on this conditional likelihood, we have a prediction rule conditioning on $G$ as

where $r_{\star}=G(\Omega\backslash\{\boldsymbol{\phi}_{k}\}_{1,K^{+}})$ is the total weight of all the factors assigned with zero count. This prediction rule becomes very similar to the direct assignment sampler of the hierarchical Dirichlet process (Teh et al., 2006) if one writes each $r_{k}$ as the product of a total random mass $\alpha$ and a probability $\pi_{k}$ , with $\alpha=\sum_{k=1}^{\infty}r_{k}$ and $\sum_{k=1}^{\infty}\pi_{k}=1$ . This is as expected since the gamma process can be represented as the independent product of a gamma process and a Dirichlet process, under the condition that the mass parameter of the gamma process is the same as the concentration parameter of the Dirichlet process, and the GNBP is closely related to the hierarchical Dirichlet process for mixed-membership modeling (Zhou and Carin, 2015).

Similar to the derivation of collapsed Gibbs sampling for the mixed-membership model based on the beta-negative binomial process, as shown in Zhou (2014), we can write the collapsed Gibbs sampling update equation for the factor indices as

and if $k=(K^{+})^{-ji}+1$ happens, we draw $\beta\sim\mbox{Beta}(1,\gamma_{0})$ and then let $r_{k}=\beta r_{\star}$ and $r_{\star}=(1-\beta)r_{\star}$ . Gibbs sampling update equations for the other model parameters of the GNBP can be similarly derived as in Zhou and Carin (2015) and Zhou et al. (2016b), omitted here for brevity.

K.2 Collapsed Gibbs sampling for GNBP-DCMLDA

For collapsed Gibbs sampling of (K.2), introducing the auxiliary variables

we have the joint likelihood of ${\boldsymbol{b}}_{j}$ , $\boldsymbol{z}_{j}$ , $\boldsymbol{x}_{j}$ and $n_{j}$ for DCMLDA as

Marginalizing out $\boldsymbol{\Phi}$ from this likelihood, we have

and if $k=(K^{+})^{-ji}+1$ happens, then we draw then we draw $\beta\sim\mbox{Beta}(1,\gamma_{0})$ and then let $r_{k}=\beta r_{\star}$ and $r_{\star}=(1-\beta)r_{\star}$ .

Using the Palm formula (James, 2002; James et al., 2009; Caron et al., 2014), similar to related derivation in Zhou et al. (2016b), we may further marginalize out $G$ from (K.4), leading to

We use the above equation in the collapsed Gibbs sampler for GNBP-DCMLDA.

L Sample the Dirichlet smoothing parameter

For the hGNBP-NBFA, from (24), we have the likelihood for $\{\boldsymbol{\phi}_{k}\}$ as

Marginalizing out $\boldsymbol{\phi}_{k}$ from (L.1), we have the likelihood for $\eta$ as

we can further apply the data augmentation technique for the NB distribution of Zhou and Carin (2015) to derive closed-form update equations for $\eta$ as

M Comparisons of different sampling strategies

We first diagnose the convergence of the regular blocked Gibbs sampler in Section 0.5.1, the collapsed Gibbs sampler in Section 0.5.2, and the compound Poisson representation based blocked Gibbs sampler in Section 0.5.3 for the hGNBP-NBFA (Dirichlet-multinomial mixed-membership model), via the trace plots of the inferred number of active factors $K^{+}$ . We set the Dirichlet smoothing parameter as $\eta=0.05$ , and initialize the number of factors as $K=0$ for the collapsed Gibbs sampler and $K=10$ for both the regular and compound Poisson based blocked Gibbs samplers. We also consider initializing the number of factors as $K=500$ for all three samplers.

As shown in Figure 7 for the PsyReview dataset, both the regular blocked Gibbs sampler and collapsed Gibbs sampler travel relatively slowly to the target distribution of the number of active factors $K^{+}$ , especially when the number of factors is initialized to be large, whereas the compound Poisson based blocked Gibbs sampler travels relatively quickly to the target distribution in both cases. We have also made similar comparisons on both the JACM and NIPS12 datasets, and the experiments on all three datasets consistently suggest that the compound Poisson representation based blocked Gibbs sampler converges the fastest in the number of inferred active factors.

We observe similar differences in convergence between the blocked Gibbs sampler, the collapsed Gibbs sampler, and the compound Poisson representation based blocked Gibbs sampler for the GNBP-DCMLDA. This is as expected since GNBP-DCMLDA can be considered as a special case of the hGNBP-NBFA, and its compound Poisson representation also allows it to eliminate the need of sampling the factor indices $\{z_{ji}\}$ .

For the GNBP-PFA, we find that its blocked Gibbs sampler, presented in Zhou and Carin (2015) and improved in this paper to allow adaptively truncating the number of active factors in each Gibbs sampling iteration, could converge slightly faster if the number of factors is initialized to be large. However, its collapsed Gibbs sampler shown in the Appendix often converges much faster in the number of inferred active factors if the number of factors is initialized with a small value.

Therefore, to learn the factors in all the following experiments, we use the compound Poisson representation based blocked Gibbs sampler for both the hGNBP-NBFA and GNBP-NBFA, and use collapsed Gibbs sampling for the GNBP-PFA.

N Additional table and plots

We show the results of the GNBP-NBFA, GNBP-DCMLDA, and hGNBP-NBFA on both the PsyReview and JACM datasets in the following figures.