Augment-and-Conquer Negative Binomial Processes

Mingyuan Zhou, Lawrence Carin

Introduction

There has been increasing interest in count modeling using the Poisson process, geometric process and recently the negative binomial (NB) process . Notably, it has been independently shown in and that the NB process, originally constructed for count analysis, can be naturally applied for mixture modeling of grouped data x1,,xJ\boldsymbol{x}_{1},\cdots,\boldsymbol{x}_{J}, where each group xj={xji}i=1,Nj\boldsymbol{x}_{j}=\{x_{ji}\}_{i=1,N_{j}}. For a territory long occupied by the hierarchical Dirichlet process (HDP) and related models, the inference of which may require substantial bookkeeping and suffer from slow convergence , the discovery of the NB process for mixture modeling can be significant. As the seemingly distinct problems of count and mixture modeling are united under the NB process framework, new opportunities emerge for better data fitting, more efficient inference and more flexible model constructions. However, neither nor explore the properties of the NB distribution deep enough to achieve fully tractable closed-form inference. Of particular concern is the NB dispersion parameter, which was simply fixed or empirically set , or inferred with a Metropolis-Hastings algorithm . Under these limitations, both papers fail to reveal the connections of the NB process to the HDP, and thus may lead to false assessments on comparing their modeling abilities.

We perform joint count and mixture modeling under the NB process framework, using completely random measures that are simple to construct and amenable for posterior computation. We propose to augment-and-conquer the NB process: by “augmenting” a NB process into both the gamma-Poisson and compound Poisson representations, we “conquer” the unification of count and mixture modeling, the analysis of fundamental model properties, and the derivation of efficient Gibbs sampling inference. We make two additional contributions: 1) we construct a gamma-NB process, analyze its properties and show how its normalization leads to the HDP, highlighting its unique theoretical, structural and computational advantages relative to the HDP. 2) We show that a variety of NB processes can be constructed with distinct model properties, for which the shared random measure can be selected from completely random measures such as the gamma, beta, and beta-Bernoulli processes; we compare their performance on topic modeling, a typical example for mixture modeling of grouped data, and show the importance of inferring both the NB dispersion and probability parameters, which respectively govern the overdispersion level and the variance-to-mean ratio in count modeling.

Before introducing the NB process, we first illustrate how the seemingly distinct problems of count and mixture modeling can be united under the Poisson process. Denote Ω\Omega as a measure space and for each Borel set AΩA\subset\Omega, denote Xj(A)X_{j}(A) as a count random variable describing the number of observations in xj\boldsymbol{x}_{j} that reside within AA. Given grouped data x1,,xJ\boldsymbol{x}_{1},\cdots,\boldsymbol{x}_{J}, for any measurable disjoint partition A1,,AQA_{1},\cdots,A_{Q} of Ω\Omega, we aim to jointly model the count random variables {Xj(Aq)}\{X_{j}(A_{q})\}. A natural choice would be to define a Poisson process Xj\mboxPP(G)X_{j}\sim\mbox{PP}(G), with a shared completely random measure GG on Ω\Omega, such that X_{j}(A)\sim\mbox{Pois}\big{(}G(A)\big{)} for each AΩA\subset\Omega. Denote G(Ω)=q=1QG(Aq)G(\Omega)=\sum_{q=1}^{Q}G(A_{q}) and G~=G/G(Ω)\widetilde{G}=G/G(\Omega). Following Lemma 4.1 of , the joint distributions of Xj(Ω),Xj(A1),,Xj(AQ)X_{j}(\Omega),X_{j}(A_{1}),\cdots,X_{j}(A_{Q}) are equivalent under the following two expressions:

Thus the Poisson process provides not only a way to generate independent counts from each AqA_{q}, but also a mechanism for mixture modeling, which allocates the observations into any measurable disjoint partition {Aq}1,Q\{A_{q}\}_{1,Q} of Ω\Omega, conditioning on Xj(Ω)X_{j}(\Omega) and the normalized mean measure G~\widetilde{G}.

To complete the model, we may place a gamma process prior on the shared measure as G\mboxGaP(c,G0)G\sim\mbox{GaP}(c,G_{0}), with concentration parameter cc and base measure G0G_{0}, such that G(A)\mboxGamma(G0(A),1/c)G(A)\sim\mbox{Gamma}(G_{0}(A),1/c) for each AΩA\subset\Omega, where G0G_{0} can be continuous, discrete or a combination of both. Note that G~=G/G(Ω)\widetilde{G}=G/G(\Omega) now becomes a Dirichlet process (DP) as G~\mboxDP(γ0,G~0)\widetilde{G}\sim\mbox{DP}(\gamma_{0},\widetilde{G}_{0}), where γ0=G0(Ω)\gamma_{0}=G_{0}(\Omega) and G~0=G0/γ0\widetilde{G}_{0}=G_{0}/\gamma_{0}. The normalized gamma representation of the DP is discussed in and has been used to construct the group-level DPs for an HDP . The Poisson processhas an equal-dispersion assumption for count modeling. As shown in (2), the construction of Poisson processes with a shared gamma process mean measure implies the same mixture proportions across groups, which is essentially the same as the DP when used for mixture modeling when the total counts {Xj(Ω)}j\{X_{j}(\Omega)\}_{j} are not treated as random variables. This motivates us to consider adding an additional layer or using a different distribution other than the Poisson to model the counts. As shown below, the NB distribution is an ideal candidate, not only because it allows overdispersion, but also because it can be augmented into both a gamma-Poisson and a compound Poisson representations.

Augment-and-Conquer the Negative Binomial Distribution

The NB distribution m\mboxNB(r,p)m\sim\mbox{NB}(r,p) has the probability mass function (PMF) fM(m)=Γ(r+m)m!Γ(r)(1p)rpmf_{M}(m)=\frac{\Gamma(r+m)}{m!\Gamma(r)}(1-p)^{r}p^{m}. It has a mean μ=rp/(1p)\mu={rp}/(1-p) smaller than the variance σ2=rp/(1p)2=μ+r1μ2\sigma^{2}={rp}/{(1-p)^{2}}=\mu+r^{-1}\mu^{2}, with the variance-to-mean ratio (VMR) as (1p)1(1-p)^{-1} and the overdispersion level (ODL, the coefficient of the quadratic term in σ2\sigma^{2}) as r1r^{-1}. It has been widely investigated and applied to numerous scientific studies . The NB distribution can be augmented into a gamma-Poisson construction as m\mboxPois(λ), λ\mboxGamma(r,p/(1p))m\sim\mbox{Pois}(\lambda),~{}\lambda\sim\mbox{Gamma}\left(r,{p}/{(1-p)}\right), where the gamma distribution is parameterized by its shape rr and scale p/(1p)p/(1-p). It can also be augmented under a compound Poisson representation as m=t=1lut, ut\mboxLog(p), l\mboxPois(rln(1p))m=\sum_{{t}=1}^{l}u_{{t}},~{}u_{{t}}\sim\mbox{Log}(p),~{}l\sim\mbox{Pois}(-r\ln(1-p)), where u\mboxLog(p)u\sim\mbox{Log}(p) is the logarithmic distribution with probability-generating function (PGF) CU(z)=ln(1pz)/ln(1p),  z<p1.C_{U}(z)={\ln(1-pz)}/{\ln(1-p)},~{}~{}|z|<{p^{-1}}. In a slight abuse of notation, but for added conciseness, in the following discussion we use mt=1l\mboxLog(p)m\sim\sum_{t=1}^{l}\mbox{Log}(p) to denote m=t=1lut, ut\mboxLog(p)m=\sum_{{t}=1}^{l}u_{{t}},~{}u_{{t}}\sim\mbox{Log}(p).

The inference of the NB dispersion parameter rr has long been a challenge . In this paper, we first place a gamma prior on it as r\mboxGamma(r1,1/c1)r\sim\mbox{Gamma}(r_{1},1/c_{1}). We then use Lemma 2.1 (below) to infer a latent count ll for each m\mboxNB(r,p)m\sim\mbox{NB}(r,p) conditioning on mm and rr. Since l\mboxPois(rln(1p))l\sim\mbox{Pois}(-r\ln(1-p)) by construction, we can use the gamma Poisson conjugacy to update rr. Using Lemma 2.2 (below), we can further infer an augmented latent count ll^{\prime} for each ll, and then use these latent counts to update r1r_{1}, assuming r1\mboxGamma(r2,1/c2)r_{1}\sim\mbox{Gamma}(r_{2},1/c_{2}). Using Lemmas 2.1 and 2.2, we can continue this process repeatedly, suggesting that we may build a NB process to model data that have subgroups within groups. The conditional posterior of the latent count ll was first derived by us but was not given an analytical form . Below we explicitly derive the PMF of ll, shown in (3), and find that it exactly represents the distribution of the random number of tables occupied by mm customers in a Chinese restaurant process with concentration parameter rr . We denote l\mboxCRT(m,r)l\sim\mbox{CRT}(m,r) as a Chinese restaurant table (CRT) count random variable with such a PMF and as proved in the supplementary material, we can sample it as l=n=1mbn, bn\mboxBernoulli(r/(n1+r))l=\sum_{n=1}^{m}b_{n},~{}b_{n}\sim\mbox{Bernoulli}\left({r}/(n-1+r)\right).

Both the gamma-Poisson and compound-Poisson augmentations of the NB distribution and Lemmas 2.1 and 2.2 are key ingredients of this paper. We will show that these augment-and-concur methods not only unite count and mixture modeling and provide efficient inference, but also, as shown in Section 3, let us examine the posteriors to understand fundamental properties of the NB processes, clearly revealing connections to previous nonparametric Bayesian mixture models.

Denote s(m,j)s(m,j) as Stirling numbers of the first kind . Augment m\mboxNB(r,p)m\sim\emph{\mbox{NB}}(r,p) under the compound Poisson representation as mt=1l\mboxLog(p), l\mboxPois(rln(1p))m\sim\sum_{{t}=1}^{l}\emph{\mbox{Log}}(p),~{}l\sim\emph{\mbox{Pois}}(-r\ln(1-p)), then the conditional posterior of ll has PMF

Denote wjt=1j\mboxLog(p)w_{j}\sim\sum_{{t}=1}^{j}\mbox{Log}(p), j=1,,mj=1,\cdots,m. Since wjw_{j} is the summation of jj iid \mboxLog(p)\mbox{Log}(p) random variables, the PGF of wjw_{j} becomes CWj(z)=CUj(z)=[ln(1pz)/ln(1p)]j, z<p1C_{W_{j}}(z)=C_{U}^{j}(z)=\left[{\ln(1-pz)}/{\ln(1-p)}\right]^{j},~{}|z|<{p^{-1}}. Using the property that [ln(1+x)]j=j!n=js(n,j)xnn![\ln(1+x)]^{j}=j!\sum_{n=j}^{\infty}\frac{s(n,j)x^{n}}{n!} , we have \mboxPr(wj=m)=CWj(m)(0)/m!=(1)mpjj!s(m,j)/(m![ln(1p)]j).\mbox{Pr}(w_{j}=m)={C_{W_{j}}^{(m)}(0)}/{m!}=(-1)^{m}p^{j}j!s(m,j)/(m![\ln(1-p)]^{j}). Thus for 0jm0\leq j\leq m, we have \mboxPr(L=jm,r)\mboxPr(wj=m)\mboxPois(j;rln(1p))s(m,j)rj.\mbox{Pr}(L=j|m,r)\propto\mbox{Pr}(w_{j}=m)\mbox{Pois}(j;-r\ln(1-p))\propto|s(m,j)|r^{j}. Denote Sr(m)=j=0ms(m,j)rjS_{r}(m)=\sum_{j=0}^{m}|s(m,j)|r^{j}, we have Sr(m)=(m1+r)Sr(m1)==n=1m1(r+n)Sr(1)=n=0m1(r+n)=Γ(m+r)Γ(r)S_{r}(m)=(m-1+r)S_{r}(m-1)=\cdots=\prod_{n=1}^{m-1}(r+n)S_{r}(1)=\prod_{n=0}^{m-1}(r+n)=\frac{\Gamma(m+r)}{\Gamma(r)}. ∎

Let m\mboxNB(r,p), r\mboxGamma(r1,1/c1)m\sim\emph{\mbox{NB}}(r,p),~{}r\sim\emph{\mbox{Gamma}}(r_{1},1/c_{1}), denote p=ln(1p)c1ln(1p)p^{\prime}=\frac{-\ln(1-p)}{c_{1}-\ln(1-p)}, then mm can also be generated from a compound distribution as

Augmenting mm leads to mt=1l\mboxLog(p), l\mboxPois(rln(1p))m\sim\sum_{{t}=1}^{l}{\mbox{Log}}(p),~{}l\sim{\mbox{Pois}}(-r\ln(1-p)). Marginalizing out rr leads to l\mboxNB(r1,p)l\sim{\mbox{NB}}\left(r_{1},p^{\prime}\right). Augmenting ll using its compound Poisson representation leads to (4). ∎

Gamma-Negative Binomial Process

We explore sharing the NB dispersion across groups while the probability parameters are group dependent. We define a NB process X\mboxNBP(G,p)X\sim\mbox{NBP}(G,p) as X(A)\mboxNB(G(A),p)X(A)\sim\mbox{NB}(G(A),p) for each AΩA\subset\Omega and construct a gamma-NB process for joint count and mixture modeling as Xj\mboxNBP(G,pj), G\mboxGaP(c,G0)X_{j}\sim\mbox{NBP}(G,p_{j}),~{}G\sim\mbox{GaP}(c,G_{0}), which can be augmented as a gamma-gamma-Poisson process as

In the above PP(\cdot) and GaP(\cdot) represent the Poisson and gamma processes, respectively, as defined in Section 1.1. Using Lemma 2.2, the gamma-NB process can also be augmented as

These three augmentations allow us to derive a sequence of closed-form update equations for inference with the gamma-NB process. Using the gamma Poisson conjugacy on (5), for each AΩA\subset\Omega, we have Λj(A)G,Xj,pj\mboxGamma(G(A)+Xj(A),pj)\Lambda_{j}(A)|G,X_{j},p_{j}\sim\mbox{Gamma}\left(G(A)+X_{j}(A),p_{j}\right), thus the conditional posterior of Λj\Lambda_{j} is

Define T\mboxCRTP(X,G)T\sim\mbox{CRTP}(X,G) as a CRT process that T(A)=ωAT(ω), T(ω)\mboxCRT(X(ω),G(ω))T(A)=\sum_{\omega\in A}T(\omega),~{}T(\omega)\sim\mbox{CRT}(X(\omega),G(\omega)) for each AΩA\subset\Omega. Applying Lemma 2.1 on (6) and (7), we have

If G0G_{0} is a continuous base measure and γ0=G0(Ω)\gamma_{0}=G_{0}(\Omega) is finite, we have G0(ω)0G_{0}(\omega)\rightarrow 0  ωΩ\forall~{}\omega\in\Omega and thus

which is equal to K+K^{+}, the total number of used discrete atoms; if G0G_{0} is discrete as G0=k=1Kγ0KδωkG_{0}=\sum_{k=1}^{K}\frac{\gamma_{0}}{K}\delta_{\omega_{k}}, then L(ωk)=\mboxCRT(L(ωk),γ0K)1L^{\prime}(\omega_{k})=\mbox{CRT}(L(\omega_{k}),\frac{\gamma_{0}}{K})\geq 1 if jXj(ωk)>0\sum_{j}X_{j}(\omega_{k})>0, thus L(Ω)K+L^{\prime}(\Omega)\geq K^{+}. In either case, let γ0\mboxGamma(e0,1/f0)\gamma_{0}\sim{\mbox{Gamma}}(e_{0},1/f_{0}), with the gamma Poisson conjugacy on (6) and (7), we have

Since the data {xji}i\{x_{ji}\}_{i} are exchangeable within group jj, the predictive distribution of a point XjiX_{ji}, conditioning on Xji={Xjn}n:niX_{j}^{-i}=\{X_{jn}\}_{n:n\neq i} and GG, with Λj\Lambda_{j} marginalized out, can be expressed as

Using the equivalence between (1) and (2) and normalizing all the gamma processes in (5), denoting Λ~j=Λj/Λj(Ω)\widetilde{\Lambda}_{j}={\Lambda_{j}}/\Lambda_{j}(\Omega), α=G(Ω)\alpha=G(\Omega), G~=G/α\widetilde{G}=G/\alpha, γ0=G0(Ω)\gamma_{0}=G_{0}(\Omega) and G~0=G0/γ0\widetilde{G}_{0}=G_{0}/\gamma_{0}, we can re-express (5) as

which is an HDP . Thus the normalized gamma-NB process leads to an HDP, yet we cannot return from the HDP to the gamma-NB process without modeling Xj(Ω)X_{j}(\Omega) and Λj(Ω)\Lambda_{j}(\Omega) as random variables. Theoretically, they are distinct in that the gamma-NB process is a completely random measure, assigning independent random variables into any disjoint Borel sets {Aq}1,Q\{A_{q}\}_{1,Q} of Ω\Omega; whereas the HDP is not. Practically, the gamma-NB process can exploit conjugacy to achieve analytical conditional posteriors for all latent parameters. The inference of the HDP is a major challenge and it is usually solved through alternative constructions such as the Chinese restaurant franchise (CRF) and stick-breaking representations . In particular, without analytical conditional posteriors, the inference of concentration parameters α\alpha and γ0\gamma_{0} is nontrivial and they are often simply fixed . Under the CRF metaphor α\alpha governs the random number of tables occupied by customers in each restaurant independently; further, if the base probability measure G~0\widetilde{G}_{0} is continuous, γ0\gamma_{0} governs the random number of dishes selected by tables of all restaurants. One may apply the data augmentation method of to sample α\alpha and γ0\gamma_{0}.However, if G~0\widetilde{G}_{0} is discrete as G~0=k=1K1Kδωk\widetilde{G}_{0}=\sum_{k=1}^{K}\frac{1}{K}\delta_{\omega_{k}}, which is of practical value and becomes a continuous base measure as KK\rightarrow\infty , then using the method of to sample γ0\gamma_{0} is only approximately correct, which may result in a biased estimate in practice, especially if KK is not large enough. By contrast, in the gamma-NB process, the shared gamma process GG can be analytically updated with (12) and G(Ω)G(\Omega) plays the role of α\alpha in the HDP, which is readily available as

and as in (11), regardless of whether the base measure is continuous, the total mass γ0\gamma_{0} has an analytical gamma posterior whose shape parameter is governed by L(Ω)L^{\prime}(\Omega), with L(Ω)=K+L^{\prime}(\Omega)=K^{+} if G0G_{0} is continuous and finite and L(Ω)K+L^{\prime}(\Omega)\geq K^{+} if G0=k=1Kγ0KδωkG_{0}=\sum_{k=1}^{K}\frac{\gamma_{0}}{K}\delta_{\omega_{k}}. Equation (15) also intuitively shows how the NB probability parameters {pj}\{p_{j}\} govern the variations among {Λ~j}\{\widetilde{\Lambda}_{j}\} in the gamma-NB process. In the HDP, pjp_{j} is not explicitly modeled, and since its value becomes irrelevant when taking the normalized constructions in (14), it is usually treated as a nuisance parameter and perceived as pj=0.5p_{j}=0.5 when needed for interpretation purpose. Fixing pj=0.5p_{j}=0.5 is also considered in to construct an HDP, whose group-level DPs are normalized from gamma processes with the scale parameters as pj1pj=1\frac{p_{j}}{1-p_{j}}=1; it is also shown in that improved performance can be obtained for topic modeling by learning the scale parameters with a log Gaussian process prior. However, no analytical conditional posteriors are provided and Gibbs sampling is not considered as a viable option .

2 Augment-and-conquer inference for joint count and mixture modeling

where marginally we have njk\mboxNB(rk,pj)n_{jk}\sim\mbox{NB}(r_{k},p_{j}). Using the equivalence between (1) and (2), we can equivalently express NjN_{j} and njkn_{jk} in the above model as Nj\mboxPois(λj), [nj1,,njK]\mboxMult(Nj;λj1/λj,,λjK/λj)N_{j}\sim\mbox{Pois}\left(\lambda_{j}\right),~{}[n_{j1},\cdots,n_{jK}]\sim\mbox{Mult}\left(N_{j};{\lambda_{j1}}/{\lambda_{j}},\cdots,{\lambda_{jK}}/{\lambda_{j}}\right), where λj=k=1Kλjk\lambda_{j}=\sum_{k=1}^{K}\lambda_{jk}. Since the data {xji}i=1,Nj\{x_{ji}\}_{i=1,N_{j}} are fully exchangeable, rather than drawing [nj1,,njK][n_{j1},\cdots,n_{jK}] once, we may equivalently draw the index

for each xjix_{ji} and then let njk=i=1Njδ(zji=k)n_{jk}=\sum_{i=1}^{N_{j}}\delta(z_{ji}=k). This provides further insights on how the seemingly disjoint problems of count and mixture modeling are united under the NB process framework. Following (8)-(12), the block Gibbs sampling is straightforward to write as

which has similar computational complexity as that of the direct assignment block Gibbs sampling of the CRF-HDP . If g0(ω)g_{0}(\omega) is conjugate to the likelihood F(x;ω)F(x;\omega), then the posterior p(ω)p(\omega|-) would be analytical. Note that when KK\rightarrow\infty, we have (lk)=δ(jljk>0)=δ(jnjk>0)(l^{\prime}_{k}|-)=\delta(\sum_{j}l_{jk}>0)=\delta(\sum_{j}n_{jk}>0).

Using (1) and (2) and normalizing the gamma distributions, (16) can be re-expressed as

which loses the count modeling ability and becomes a finite representation of the HDP, the inference of which is not conjugate and has to be solved under alternative representations . This also implies that by using the Dirichlet process as the foundation, traditional mixture modeling may discard useful count information from the beginning.

The Negative Binomial Process Family and Related Algorithms

The gamma-NB process shares the NB dispersion across groups. Since the NB distribution has two adjustable parameters, we may explore alternative ideas, with the NB probability measure shared across groups as in , or with both the dispersion and probability measures shared as in . These constructions are distinct from both the gamma-NB process and HDP in that Λj\Lambda_{j} has space dependent scales, and thus its normalization Λ~j=Λj/Λj(Ω)\widetilde{\Lambda}_{j}={\Lambda_{j}}/\Lambda_{j}(\Omega) no longer follows a Dirichlet process.

It is natural to let the probability measure be drawn from a beta process , which can be defined by its Lévy measure on a product space ×Ω\times\Omega as ν(dpdω)=cp1(1p)c1dpB0(dω).\nu(dpd\omega)=cp^{-1}(1-p)^{c-1}dpB_{0}(d\omega). A draw from the beta process B\mboxBP(c,B0)B\sim\mbox{BP}(c,B_{0}) with concentration parameter cc and base measure B0B_{0} can be expressed as B=k=1pkδωk.B=\sum_{k=1}^{\infty}p_{k}\delta_{\omega_{k}}. A beta-NB process can be constructed by letting Xj\mboxNBP(rj,B)X_{j}\sim\mbox{NBP}(r_{j},B), with a random draw expressed as Xj=k=1njkδωk, njk\mboxNB(rj,pk).X_{j}=\sum_{k=1}^{\infty}n_{jk}\delta_{\omega_{k}},~{}n_{jk}\sim\mbox{NB}(r_{j},p_{k}). Under this construction, the NB probability measure is shared and the NB dispersion parameters are group dependent. As in , we may also consider a marked-beta-NBWe may also consider a beta marked-gamma-NB process, whose performance is found to be very similar. process that both the NB probability and dispersion measures are shared, in which each point of the beta process is marked with an independent gamma random variable. Thus a draw from the marked-beta process becomes (R,B)=k=1(rk,pk)δωk(R,B)=\sum_{k=1}^{\infty}(r_{k},p_{k})\delta_{\omega_{k}}, and the NB process Xj\mboxNBP(R,B)X_{j}\sim\mbox{NBP}(R,B) becomes Xj=k=1njkδωk, njk\mboxNB(rk,pk).X_{j}=\sum_{k=1}^{\infty}n_{jk}\delta_{\omega_{k}},~{}n_{jk}\sim\mbox{NB}(r_{k},p_{k}). Since the beta and NB processes are conjugate, the posterior of BB is tractable, as shown in . If it is believed that there are excessive number of zeros, governed by a process other than the NB process, we may introduce a zero inflated NB process as Xj\mboxNBP(RZj,pj)X_{j}\sim\mbox{NBP}(RZ_{j},p_{j}), where Zj\mboxBeP(B)Z_{j}\sim\mbox{BeP}(B) is drawn from the Bernoulli process and (R,B)=k=1(rk,πk)δωk(R,B)=\sum_{k=1}^{\infty}(r_{k},\pi_{k})\delta_{\omega_{k}} is drawn from a marked-beta process, thus njk\mboxNB(rkbjk,pj), bjk=\mboxBernoulli(πk)n_{jk}\sim\mbox{NB}(r_{k}b_{jk},p_{j}),~{}b_{jk}=\mbox{Bernoulli}(\pi_{k}). This construction can be linked to the model in with appropriate normalization, with advantages that there is no need to fix pj=0.5p_{j}=0.5 and the inference is fully tractable. The zero inflated construction can also be linked to models for real valued data using the Indian buffet process (IBP) or beta-Bernoulli process spike-and-slab prior .

To show how the NB processes can be diversely constructed and to make connections to previous parametric and nonparametric mixture models, we show in Table 1 a variety of NB processes, which differ on how the dispersion and probability measures are shared. For a deeper understanding on how the counts are modeled, we also show in Table 1 both the VMR and ODL implied by these settings. We consider topic modeling of a document corpus, a typical example of mixture modeling of grouped data, where each a-bag-of-words document constitutes a group, each word is an exchangeable group member, and F(xji;ωk)F(x_{ji};\omega_{k}) is simply the probability of word xjix_{ji} in topic ωk\omega_{k}.

We consider six differently constructed NB processes in Table 1: (i) Related to latent Dirichlet allocation (LDA) and Dirichlet Poisson factor analysis (Dir-PFA) , the NB-LDA is also a parametric topic model that requires tuning the number of topics. However, it uses a document dependent rjr_{j} and pjp_{j} to automatically learn the smoothing of the gamma distributed topic weights, and it lets rj\mboxGamma(γ0,1/c), γ0\mboxGamma(e0,1/f0)r_{j}\sim\mbox{Gamma}(\gamma_{0},1/c),~{}\gamma_{0}\sim\mbox{Gamma}(e_{0},1/f_{0}) to share statistical strength between documents, with closed-form Gibbs sampling inference. Thus even the most basic parametric LDA topic model can be improved under the NB count modeling framework. (ii) The NB-HDP model is related to the HDP , and since pjp_{j} is an irrelevant parameter in the HDP due to normalization, we set it in the NB-HDP as 0.5, the usually perceived value before normalization. The NB-HDP model is comparable to the DILN-HDP that constructs the group-level DPs with normalized gamma processes, whose scale parameters are also set as one. (iii) The NB-FTM model introduces an additionalbeta-Bernoulli process under the NB process framework to explicitly model zero counts. It is the same as the sparse-gamma-gamma-PFA (SγΓ\gamma\Gamma-PFA) in and is comparable to the focused topic model (FTM) , which is constructed from the IBP compound DP. Nevertheless, they apply about the same likelihoods and priors for inference. The Zero-Inflated-NB process improves over them by allowing pjp_{j} to be inferred, which generally yields better data fitting. (iv) The Gamma-NB process explores the idea that the dispersion measure is shared across groups, and it improves over the NB-HDP by allowing the learning of pjp_{j}. It reduces to the HDP by normalizing both the group-level and the shared gamma processes. (v) The Beta-NB process explores sharing the probability measure across groups, and it improves over the beta negative binomial process (BNBP) proposed in , allowing inference of rjr_{j}. (vi) The Marked-Beta-NB process is comparable to the BNBP proposed in , with the distinction that it allows analytical update of rkr_{k}. The constructions and inference of various NB processes and related algorithms in Table 1 all follow the formulas in (16) and (3.2), respectively, with additional details presented in the supplementary material.

Example Results

Figure 1 compares the performance of various algorithms. The Marked-Beta-NB process has the best performance, closely followed by the Gamma-NB process, CRF-HDP and Beta-NB process. With an appropriate KK, the parametric NB-LDA may outperform the nonparametric NB-HDP and NB-FTM as the training data percentage increases, somewhat unexpected but very intuitive results, showing that even by learning both the NB dispersion and probability parameters rjr_{j} and pjp_{j} in a document dependent manner, we may get better data fitting than using nonparametric models that share the NB dispersion parameters rkr_{k} across documents, but fix the NB probability parameters.

It is also interesting to compare the Gamma-NB and NB-HDP. From a mixture-modeling viewpoint, fixing pj=0.5p_{j}=0.5 is natural as pjp_{j} becomes irrelevant after normalization. However, from a count modeling viewpoint, this would make a restrictive assumption that each count vector {njk}k=1,K\{n_{jk}\}_{k=1,K} has the same VMR of 2, and the experimental results in Figure 1 confirm the importance of learning pjp_{j} together with rkr_{k}. It is also interesting to examine (15), which can be viewed as the concentration parameter α\alpha in the HDP, allowing the adjustment of pjp_{j} would allow a more flexible model assumption on the amount of variations between the topic proportions, and thus potentially better data fitting.

Conclusions

We propose a variety of negative binomial (NB) processes to jointly model counts across groups, which can be naturally applied for mixture modeling of grouped data. The proposed NB processes are completely random measures that they assign independent random variables to disjoint Borel sets of the measure space, as opposed to the hierarchical Dirichlet process (HDP) whose measures on disjoint Borel sets are negatively correlated. We discover augment-and-conquer inference methods that by “augmenting” a NB process into both the gamma-Poisson and compound Poisson representations, we are able to “conquer” the unification of count and mixture modeling, the analysis of fundamental model properties and the derivation of efficient Gibbs sampling inference. We demonstrate that the gamma-NB process, which shares the NB dispersion measure across groups, can be normalized to produce the HDP and we show in detail its theoretical, structural and computational advantages over the HDP. We examine the distinct sharing mechanisms and model properties of various NB processes, with connections to existing algorithms, with experimental results on topic modeling showing the importance of modeling both the NB dispersion and probability parameters.

The research reported here was supported by ARO, DOE, NGA, and ONR, and by DARPA under the MSEE and HIST programs.

References

Appendix A Generating a CRT random variable

A CRT random variable l\mboxCRT(m,r)l\sim\emph{\mbox{CRT}}(m,r) can be generated with the summation of independent Bernoulli random variables as

Since ll is the summation of independent Bernoulli random variables, its PGF becomes

Thus we have fL(lm,r)=CL(l)(0)l!=Γ(r)Γ(m+r)s(m,l)rl,  l=0,1,,m.f_{L}(l|m,r)=\frac{C^{(l)}_{L}(0)}{l!}={\frac{\Gamma(r)}{\Gamma(m+r)}}|s(m,l)|r^{l},~{}~{}l=0,1,\cdots,m.

Appendix B Dir-PFA and LDA

The Dirichlet Poisson factor analysis (Dir-PFA) model is constructed as

where η\eta is the Dirichlet smoothing parameter for the topic’s distribution over the vocabulary, njk=i=1Njδ(zji=k)n_{jk}=\sum_{i=1}^{N_{j}}\delta({z_{ji}=k}), and the data likelihood F(xji;ωk)F(x_{ji};\omega_{k}) in topic modeling is ωvjik\omega_{v_{ji}k}, the probability of the iith word in jjth document under topic ωk\omega_{k}.

The Dir-PFA has the same block Gibbs sampling as LDA , expressed as

Appendix C CRF-HDP

Under the CRF metaphor, denote njkn_{jk} as the number of customers eating dish kk in restaurant jj and ljkl_{jk} as the number of tables serving dish kk in restaurant jj, the direct assignment block Gibbs sampling can be expressed as

When KK\rightarrow\infty, the concentration parameter γ0\gamma_{0} can be sampled as

where K+K^{+} is the number of used atoms. Since it is infeasible in practice to let KK\rightarrow\infty, directly using this method to sample γ0\gamma_{0} is only approximately correct, which may result in a biased estimate especially if KK is not set large enough. Thus in the experiments, γ0\gamma_{0} is not sampled and is fixed as one. Note that for implementation convenience, it is also common to fix the concentration parameter α\alpha as one . We find through experiments that learning this parameter usually results in obviously lower per-word perplexity for held out words, thus we allow the learning of α\alpha using the data augmentation method proposed in , which is modified from the one proposed in .

Appendix D NB-LDA

Note that letting rj\mboxGamma(γ0,1/c), γ0\mboxGamma(e0,1/f0)r_{j}\sim\mbox{Gamma}(\gamma_{0},1/c),~{}\gamma_{0}\sim\mbox{Gamma}(e_{0},1/f_{0}) allows different documents to share statistical strength for inferring their NB dispersion parameters.

The block Gibbs sampling can be expressed as

Appendix E NB-HDP

The NB-HDP model is a special case of the Gamma-NB process model with pj=0.5p_{j}=0.5. The hierarchical model and inference for the Gamma-NB process are shown in (16) and (18) of the main paper, respectively.

Appendix F NB-FTM

The NB-FTM model is a special case of zero-inflated NB process with pj=0.5p_{j}=0.5, which is constructed as

The block Gibbs sampling can be expressed as

Appendix G Beta-NB

The beta-NB process model is constructed as

The block Gibbs sampling can be expressed as

Appendix H Marked-Beta-NB

The Marked-Beta-NB process model is constructed as

The block Gibbs sampling can be expressed as

Appendix I Marked-Gamma-NB

The Marked-Gamma-NB process model is constructed as

The block Gibbs sampling can be expressed as