Negative Binomial Process Count and Mixture Modeling

Mingyuan Zhou, Lawrence Carin

Introduction

Count data appear in many settings, such as predicting the number of motor insurance claims , analyzing infectious diseases and modeling topics of document corpora . There has been increasing interest in count modeling using the Poisson process, geometric process and recently the negative binomial (NB) process . It is shown in and further demonstrated in that the NB process, originally constructed for count analysis, can be naturally applied for mixture modeling of grouped data x1,,xJ\boldsymbol{x}_{1},\cdots,\boldsymbol{x}_{J}, where each group xj={xji}i=1,Nj\boldsymbol{x}_{j}=\{x_{ji}\}_{i=1,N_{j}}. For example, in topic modeling (mixed-membership modeling), each document consists of a group of exchangeable words and each word is a group member that is assigned to a topic; the number of times a topic appears in a document is a latent count random variable that could be well modeled with an NB distribution .

Mixture modeling, which infers random probability measures to assign data samples into clusters (mixture components), is a key research area of statistics and machine learning. Although the number of samples assigned to clusters are counts, mixture modeling is not typically considered as a count-modeling problem. It is often addressed under the Dirichlet-multinomial framework, using the Dirichlet process as the prior distribution. With the Dirichlet-multinomial conjugacy, the Dirichlet process mixture model enjoys tractability because the posterior of the random probability measure is still a Dirichlet process. Despite its popularity, the Dirichlet process is inflexible in that a single concentration parameter controls both the variability of the mass around the mean and the distribution of the number of distinct atoms . For mixture modeling of grouped data, the hierarchical Dirichlet process (HDP) has been further proposed to share statistical strength between groups. The HDP inherits the same inflexibility of the Dirichlet process, and due to the non-conjugacy between Dirichlet processes, its inference has to be solved under alternative constructions, such as the Chinese restaurant franchise and stick-breaking representations . To make the number of distinct atoms increase at a rate faster than that of the Dirichlet process, one may consider the Pitman-Yor process or the normalized generalized gamma process that provide extra parameters to add flexibility.

To construct more expressive mixture models with tractable inference, in this paper we consider mixture modeling as a count-modeling problem. Directly modeling the counts assigned to mixture components as NB random variables, we perform joint count and mixture modeling via the NB process, using completely random measures that are easy to construct and amenable to posterior computation. By constructing a bivariate count distribution that connects the Poisson, logarithmic, NB and Chinese restaurant table distributions, we develop data augmentation and marginalization techniques unique to the NB distribution, with which we augment an NB process into both the gamma-Poisson and compound Poisson representations, yielding unification of count and mixture modeling, derivation of fundamental model properties, as well as efficient Bayesian inference.

Under the NB process, we employ a gamma process to model the rate measure of a Poisson process. The normalization of the gamma process provides a random probability measure (not necessarily a Dirichlet process) for mixture modeling, and the marginalization of the gamma process leads to an NB process for count modeling. Since the gamma scale parameters appear as NB probability parameters when the gamma processes are marginalized out, they directly control count distributions on atoms and they could be conveniently inferred with the beta-NB conjugacy. For mixture modeling of grouped data, we construct hierarchical models by employing an NB process for each group and sharing their NB dispersion or probability measures across groups. Different parameterizations of the NB dispersion and probability parameters result in a wide variety of NB processes, which are connected to previously proposed nonparametric Bayesian mixture models. The proposed joint count and mixture modeling framework provides new opportunities for better data fitting, efficient inference and flexible model constructions.

Parts of the work presented here first appeared in . In this paper, we unify related materials scattered in these three conference papers and provide significant expansions. In particular, we construct a Poisson-logarithmic bivariate distribution that tightly connects the NB and Chinese restaurant table distributions, extending the Chinese restaurant process to describe the case that both the numbers of customers and tables are random variables, and we provide necessary conditions to recover the NB process and the gamma-NB process from the Dirichlet process and HDP, respectively.

We mention that a related beta-NB process has been independently investigated in . Our constructions of a wide variety of NB processes, including the beta-NB processes in and as special cases, are built on our thorough investigation of the properties, relationships and inference of the NB and related stochastic processes. In particular, we show that the gamma-Poisson construction of the NB process is key to uniting count and mixture modeling, and there are two equivalent augmentations of the NB process that allow us to develop analytic conditional posteriors and predictive distributions. These insights are not provided in , and the NB dispersion parameters there are empirically set rather than inferred. More distinctions will be discussed along with specific models.

The remainder of the paper is organized as follows. We review some commonly used nonparametric Bayesian priors in Section 2 and study the NB distribution in Section 3. We present the NB process in Section 4, the gamma-NB process in Section 5, and the NB process family in Section 6. We discuss NB process topic modeling in Section 7 and present example results in Section 8.

Preliminaries

where ν(dpdω)ν+π(dpdω)\nu(dpd\omega)\equiv\nu^{+}\pi(dpd\omega). A random signed measure L\mathcal{L} satisfying (1) is called a Lévy random measure. More generally, if the Lévy measure ν(dpdω)\nu({dpd\omega}) satisfies

for each compact SΩS\subset\Omega, the Lévy random measure L\mathcal{L} is well defined, even if the Poisson intensity ν+\nu^{+} is infinite. A nonnegative Lévy random measure L\mathcal{L} satisfying (2) was called a completely random measure , and it was introduced to machine learning in and .

1.2 Gamma Process

where π(drdω)ν+ν(drdω)\pi(drd\omega)\nu^{+}\equiv\nu(drd\omega).

1.3 Beta Process

where π(dpdω)ν+ν(dpdω)\pi(dpd\omega)\nu^{+}\equiv\nu(dpd\omega).

2 Dirichlet and Chinese Restaurant Processes

Denote G~=G/G(Ω)\widetilde{G}=G/G(\Omega), where G\mboxGaP(c,G0)G\sim\mbox{GaP}(c,G_{0}), then for any measurable disjoint partition A1,,AQA_{1},\cdots,A_{Q} of Ω\Omega, we have [G~(A1),,G~(AQ)]\mboxDir(γ0G~0(A1),,γ0G~0(AQ))\left[\widetilde{G}(A_{1}),\cdots,\widetilde{G}(A_{Q})\right]\sim\mbox{Dir}\left(\gamma_{0}\widetilde{G}_{0}(A_{1}),\cdots,\gamma_{0}\widetilde{G}_{0}(A_{Q})\right), where γ0=G0(Ω)\gamma_{0}=G_{0}(\Omega) and G~0=G0/γ0\widetilde{G}_{0}=G_{0}/\gamma_{0}. Therefore, with a space invariant scale parameter 1/c1/c, the normalized gamma process G~=G/G(Ω)\widetilde{G}=G/G(\Omega) is a Dirichlet process with concentration parameter γ0\gamma_{0} and base probability measure G~0\widetilde{G}_{0}, expressed as G~\mboxDP(γ0,G~0)\widetilde{G}\sim\mbox{DP}(\gamma_{0},\widetilde{G}_{0}). Unlike the gamma process, the Dirichlet process is no longer a completely random measure as the random variables {G~(Aq)}\{\widetilde{G}(A_{q})\} for disjoint sets {Aq}\{A_{q}\} are negatively correlated.

A gamma process with a space invariant scale parameter can also be recovered from a Dirichlet process: if a gamma random variable α\mboxGamma(γ0,1/c)\alpha\sim{\mbox{Gamma}}(\gamma_{0},1/c) and a Dirichlet process G~\mboxDP(γ0,G~0)\widetilde{G}\sim{\mbox{DP}}(\gamma_{0},\widetilde{G}_{0}) are independent with γ0=G0(Ω)\gamma_{0}=G_{0}(\Omega) and G~0=G0/γ0\widetilde{G}_{0}=G_{0}/\gamma_{0}, then G=αG~G=\alpha\widetilde{G} becomes a gamma process as G\mboxGaP(c,G0)G\sim{\mbox{GaP}}(c,G_{0}).

2.2 Chinese Restaurant Process

In a Dirichlet process G~\mboxDP(γ0,G~0)\widetilde{G}\sim\mbox{DP}(\gamma_{0},\widetilde{G}_{0}), we assume XiG~X_{i}\sim\widetilde{G}; {Xi}\{X_{i}\} are independent given G~\widetilde{G} and hence exchangeable. The predictive distribution of a new data sample Xm+1X_{m+1}, conditioning on X1,,XmX_{1},\cdots,X_{m}, with G~\widetilde{G} marginalized out, can be expressed as

where {ωk}1,K\{\omega_{k}\}_{1,K} are distinct atoms in Ω\Omega observed in X1,,XmX_{1},\cdots,X_{m} and nk=i=1mXi(ωk)n_{k}=\sum_{i=1}^{m}X_{i}(\omega_{k}) is the number of data samples associated with ωk\omega_{k}. The stochastic process described in (2.2.2) is known as the Pólya urn scheme and also the Chinese restaurant process .

The number of nonempty tables ll in a Chinese restaurant process, with concentration parameter γ0\gamma_{0} and mm customers, is a random variable generated as l=n=1mbn, bn\mboxBernoulli(γ0n1+γ0).l=\sum_{n=1}^{m}b_{n},~{}b_{n}\sim\mbox{{Bernoulli}}\left(\frac{\gamma_{0}}{n-1+\gamma_{0}}\right). This random variable is referred as the Chinese restaurant table (CRT) random variable l\mboxCRT(m,γ0)l\sim\mbox{CRT}(m,\gamma_{0}). As shown in , it has probability mass function (PMF)

where s(m,l)s(m,l) are Stirling numbers of the first kind.

Negative Binomial Distribution

Its mean and variance are both equal to λ\lambda. Due to heterogeneity (difference between individuals) and contagion (dependence between the occurrence of events), count data are usually overdispersed in that the variance is greater than the mean, making the Poisson assumption restrictive. By placing a gamma prior with shape rr and scale p1p\frac{p}{1-p} on λ\lambda as m\mboxPois(λ)m\sim\mbox{Pois}(\lambda), \lambda\sim\mbox{Gamma}\big{(}r,\frac{p}{1-p}\big{)} and marginalizing out λ\lambda, an NB distribution m\mboxNB(r,p)m\sim\mbox{NB}(r,p) is obtained, with PMF

where rr is the nonnegative dispersion parameter and pp is the probability parameter. Thus the NB distribution is also known as the gamma-Poisson mixture distribution . It has a mean μ=rp/(1p)\mu={rp}/(1-p) smaller than the variance σ2=rp/(1p)2=μ+r1μ2\sigma^{2}={rp}/{(1-p)^{2}}=\mu+r^{-1}\mu^{2}, with the variance-to-mean ratio (VMR) as (1p)1(1-p)^{-1} and the overdispersion level (ODL, the coefficient of the quadratic term in σ2\sigma^{2}) as r1r^{-1}, and thus it is usually favored over the Poisson distribution for modeling overdispersed counts.

As shown in , m\mboxNB(r,p)m\sim\mbox{NB}(r,p) can also be generated from a compound Poisson distribution as

where u\mboxLog(p)u\sim\mbox{Log}(p) corresponds to the logarithmic distribution with PMF fU(k)=pk/[kln(1p)],  k=1,2,f_{U}(k)={-p^{k}}/[k\ln(1-p)],~{}~{}k=1,2,\cdots, and probability-generating function (PGF)

One may also show that limr\mboxNB(r,λλ+r)=\mboxPois(λ)\lim_{r\rightarrow\infty}\mbox{NB}(r,\frac{\lambda}{\lambda+r})=\mbox{Pois}(\lambda), and conditioning on m>0m>0, m\mboxNB(r,p)m\sim\mbox{NB}(r,p) becomes m\mboxLog(p)m\sim\mbox{Log}(p) as r0{r\rightarrow 0}.

The NB distribution has been widely investigated and applied to numerous scientific studies . Although inference of the NB probability parameter pp is straightforward with the beta-NB conjugacy, inference of the NB dispersion parameter rr, whose conjugate prior is unknown, has long been a challenge. The maximum likelihood (ML) approach is commonly used to estimate rr, however, it only provides a point estimate and does not allow incorporating prior information; moreover, the ML estimator of rr often lacks robustness and may be severely biased or even fail to converge, especially if the sample size is small . Bayesian approaches are able to model the uncertainty of estimation and incorporate prior information, however, the only available closed-form Bayesian inference for rr relies on approximating the ratio of two gamma functions .

Advancing previous research on the NB distribution in , we construct a Poisson-logarithmic bivariate distribution that assists Bayesian inference of the NB distribution and unites various count distributions.

The Poisson-logarithmic (PoisLog) bivariate distribution with PMF

where \mboxSumLog(m;l,p)\emph{\mbox{SumLog}}(m;l,p) denotes the sum-logarithmic distribution generated as m=t=1lut, utiid\mboxLog(p)m=\sum_{{t}=1}^{l}u_{{t}},~{}u_{{t}}\stackrel{{\scriptstyle iid}}{{\sim}}\emph{\mbox{Log}}(p).

The proof of Theorem 1 is provided in Appendix A. As shown in Fig. 1, this bivariate distribution intuitively describes the joint distribution of two count random variables mm and ll under two equivalent circumstances:

1) There are m\mboxNB(r,p)m\sim\mbox{NB}(r,p) customers seated at l\mboxCRT(m,r)l\sim\mbox{CRT}(m,r) tables;

2) There are l\mboxPois(rln(1p))l\sim\mbox{Pois}(-r\ln(1-p)) tables, each of which with ut\mboxLog(p)u_{t}\sim\mbox{Log}(p) customers, with m=t=1lutm=\sum_{t=1}^{l}u_{t} customers in total.

In a slight abuse of notation, but for added conciseness, in the following discussion we use mt=1l\mboxLog(p)m\sim\sum_{t=1}^{l}\mbox{Log}(p) to denote m=t=1lut, utiid\mboxLog(p)m=\sum_{{t}=1}^{l}u_{{t}},~{}u_{{t}}\stackrel{{\scriptstyle iid}}{{\sim}}\mbox{Log}(p).

Let m\mboxNB(r,p), r\mboxGamma(r1,1/c1)m\sim\emph{\mbox{NB}}(r,p),~{}r\sim\emph{\mbox{Gamma}}(r_{1},1/c_{1}) represent a gamma-NB mixture distribution. It can be augmented as mt=1l\mboxLog(p), l\mboxPois(rln(1p)), r\mboxGamma(r1,1/c1).m\sim\sum_{{t}=1}^{l}\emph{\mbox{Log}}(p),~{}l\sim\emph{\mbox{Pois}}(-r\ln(1-p)),~{}r\sim\emph{\mbox{Gamma}}(r_{1},1/c_{1}). Marginalizing out rr leads to

where the latent count l\mboxNB(r1,p)l\sim\emph{\mbox{NB}}\left(r_{1},p^{\prime}\right) can be augmented as

which, using Theorem 1, is equivalent in distribution to

The connections between various distributions shown in Theorem 1 and Corollary 2 are key ingredients of this paper, which not only allow us to derive efficient inference, but also, as shown below, let us examine the posteriors to understand fundamental properties of various NB processes, clearly revealing connections to previous nonparametric Bayesian mixture models, including those based on the Dirichlet process, HDP and beta-NB processes.

Joint Count and Mixture Modeling

In this Section, we first show the connections between the Poisson and multinomial processes, and then we place a gamma process prior on the Poisson rate measure for joint count and mixture modeling. This construction can be reduced to the Dirichlet process and its restrictions for modeling grouped data are further discussed.

Let X\mboxPP(G)X\sim\emph{\mbox{PP}}(G) be a Poisson process defined on a completely random measure GG such that X(A)\mboxPois(G(A))X(A)\sim\emph{\mbox{Pois}}(G(A)) for each subset AΩA\subset\Omega. Define Y\mboxMP(Y(Ω),GG(Ω))Y\sim\emph{\mbox{MP}}(Y(\Omega),\frac{G}{G(\Omega)}) as a multinomial process, with total count Y(Ω)\mboxPois(G(Ω))Y(\Omega)\sim\emph{\mbox{Pois}}(G(\Omega)) and random probability measure GG(Ω)\frac{G}{G(\Omega)}, such that (Y(A1),,Y(AQ))\mboxMult(Y(Ω);G(A1)G(Ω),,G(AQ)G(Ω))(Y(A_{1}),\cdots,Y(A_{Q}))\sim\emph{\mbox{Mult}}\left(Y(\Omega);\frac{G(A_{1})}{G(\Omega)},\cdots,\frac{G(A_{Q})}{G(\Omega)}\right) for any disjoint partition {Aq}1,Q\{A_{q}\}_{1,Q} of Ω\Omega. According to Lemma 4.1 of , X(A)X(A) and Y(A)Y(A) would have the same Poisson distribution for each AΩA\subset\Omega, thus XX and YY are equivalent in distribution.

Using Corollary 3, we illustrate how the seemingly distinct problems of count and mixture modeling can be united under the Poisson process. For each AΩA\subset\Omega, denote Xj(A)X_{j}(A) as a count random variable describing the number of observations in xj\boldsymbol{x}_{j} that reside within AA. Given grouped data x1,,xJ\boldsymbol{x}_{1},\cdots,\boldsymbol{x}_{J}, for any measurable disjoint partition A1,,AQA_{1},\cdots,A_{Q} of Ω\Omega, we aim to jointly model count random variables {Xj(Aq)}\{X_{j}(A_{q})\}. A natural choice would be to define a Poisson process

with a shared completely random measure GG on Ω\Omega, such that X_{j}(A)\sim\mbox{Pois}\big{(}G(A)\big{)} for each AΩA\subset\Omega and G(Ω)=q=1QG(Aq)G(\Omega)=\sum_{q=1}^{Q}G(A_{q}). Following Corollary 3, with G~=G/G(Ω)\widetilde{G}=G/G(\Omega), letting Xj\mboxPP(G)X_{j}\sim\mbox{PP}(G) is equivalent to letting

Thus the Poisson process provides not only a way to generate independent counts from each AqA_{q}, but also a mechanism for mixture modeling, which allocates the Xj(Ω)X_{j}(\Omega) observations into any measurable disjoint partition {Aq}1,Q\{A_{q}\}_{1,Q} of Ω\Omega, conditioning on the normalized random measure G~\widetilde{G}.

2 Gamma-Poisson Process and Negative Binomial Process

To complete the Poisson process, it is natural to place a gamma process prior on the Poisson rate measure GG as

For a distinct atom ωk\omega_{k}, we have njk\mboxPois(rk)n_{jk}\sim\mbox{Pois}(r_{k}), where njk=Xj(ωk)n_{jk}=X_{j}(\omega_{k}) and rk=G(ωk)r_{k}=G(\omega_{k}). Marginalizing out GG of the gamma-Poisson process leads to an NB process

in which X(A)\mboxNB(G0(A),p)X(A)\sim\mbox{NB}(G_{0}(A),p) for each AΩA\subset\Omega.

where g0(dω):=G0(dω)/γ0g_{0}(d\omega):={G_{0}(d\omega)}/{\gamma_{0}}. Thus the NB probability parameter pp plays a critical role in count and mixture modeling as it directly controls the prior distributions of the number of distinct atoms K+\mboxPois(γ0ln(1p))K^{+}\sim\mbox{Pois}(-\gamma_{0}\ln(1-p)), the number of samples at each of these atoms nk\mboxLog(p)n_{k}\sim\mbox{Log}(p), and the total number of samples X(Ω)\mboxNB(γ0,p)X(\Omega)\sim\mbox{NB}(\gamma_{0},p). However, its value would become irrelevant if one directly works with the normalization of GG, as commonly used in conventional mixture modeling.

Define L\mboxCRTP(X,G0)L\sim\mbox{CRTP}(X,G_{0}) as a CRT process that

for each AΩA\subset\Omega. Under the Chinese restaurant process metaphor, X(A)X(A) and L(A)L(A) represent the customer count and table count, respectively, observed in each AΩA\subset\Omega. A direct generalization of Theorem 1 leads to:

The NB process X\mboxNBP(G0,p)X\sim\emph{\mbox{NBP}}(G_{0},p) augmented under its compound Poisson representation as

3 Posterior Analysis and Predictive Distribution

Imposing a gamma prior \mboxGamma(e0,1/f0)\mbox{Gamma}(e_{0},1/f_{0}) on γ0\gamma_{0} and a beta prior \mboxBeta(a0,1/b0)\mbox{Beta}(a_{0},1/b_{0}) on pp, using conjugacy, we have all conditional posteriors in closed-form as

If the base measure G0G_{0} is finite and continuous, then G0(ω)0G_{0}(\omega)\rightarrow 0 and we have L(ω)\mboxCRT(X(ω),G0(ω))=δ(X(ω)>0)L(\omega)\sim\mbox{CRT}(X(\omega),G_{0}(\omega))=\delta(X(\omega)>0) and thus L(Ω)=ωΩδ(X(ω)>0)L(\Omega)=\sum_{\omega\in\Omega}\delta(X(\omega)>0), i.e., the number of nonempty tables L(Ω)L(\Omega) is equal to K+K^{+}, the number of distinct atoms. The gamma-Poisson process is also well defined with a discrete base measure G0=k=1Kγ0KδωkG_{0}=\sum_{k=1}^{K}\frac{\gamma_{0}}{K}\delta_{\omega_{k}}, for which we have L=k=1Klkδωk, lk\mboxCRT(X(ωk),γ0/K)L=\sum_{k=1}^{K}l_{k}\delta_{\omega_{k}},~{}l_{k}\sim\mbox{CRT}(X(\omega_{k}),\gamma_{0}/K) and hence it is possible that lk>1l_{k}>1 if X(ωk)>1X(\omega_{k})>1, which means L(Ω)K+L(\Omega)\geq K^{+}. As the data {xji}i\{x_{ji}\}_{i} are exchangeable within group jj, conditioning on Xji=X\XjiX^{-ji}=X\backslash{X_{ji}} and G0G_{0}, with GG marginalized out, we have

This prediction rule is similar to that of the Chinese restaurant process described in (2.2.2).

4 Relationship with the Dirichlet Process

Based on Corollary 3 on the multinomial process and Section 2.2.1 on the Dirichlet process, the gamma-Poisson process in (5) can be equivalently expressed as

5 Restrictions of the Gamma-Poisson Process

The Poisson process has an equal-dispersion assumption for count modeling. For mixture modeling of grouped data, the gamma-Poisson (NB) process might be too restrictive in that, as shown in (4.4), it implies the same mixture proportions across groups, and as shown in (6), it implies the same count distribution on each distinct atom. This motivates us to consider adding an additional layer into the gamma-Poisson process or using a different distribution other than the Poisson to model the counts for grouped data. As shown in Section 3, the NB distribution is an ideal candidate, not only because it allows overdispersion, but also because it can be augmented into either a gamma-Poisson or a compound Poisson representations and it can be used together with the CRT distribution to form a bivariate distribution that jointly models the counts of customers and tables.

Joint Count and Mixture Modeling of Grouped Data

In this Section we couple the gamma process with the NB process to construct a gamma-NB process, which is well suited for modeling grouped data. We derive analytic conditional posteriors for this construction and show that it can be reduced to an HDP.

For joint count and mixture modeling of grouped data, e.g., topic modeling where a document consists of a group of exchangeable words, we replace the Poisson processes in (5) with NB processes. Sharing the NB dispersion across groups while making the probability parameters be group dependent, we construct a gamma-NB process as

With G\mboxGaP(c,G0)G\sim\mbox{GaP}(c,G_{0}) expressed as G=k=1rkδωkG=\sum_{k=1}^{\infty}r_{k}\delta_{\omega_{k}}, a draw from \mboxNBP(G,pj)\mbox{NBP}(G,p_{j}) can be expressed as Xj=k=1njkδωk, njk\mboxNB(rk,pj).X_{j}=\sum_{k=1}^{\infty}n_{jk}\delta_{\omega_{k}},~{}n_{jk}\sim\mbox{NB}(r_{k},p_{j}).

The gamma-NB process can be augmented as a gamma-gamma-Poisson process as

and with θjk=Θj(ωk)\theta_{jk}=\Theta_{j}(\omega_{k}), we have njk\mboxPois(θjk), θjk\mboxGamma(rk,pj/(1pj)).n_{jk}\sim\mbox{Pois}(\theta_{jk}),~{}\theta_{jk}\sim\mbox{Gamma}(r_{k},p_{j}/(1-p_{j})). This construction introduces gamma processes {Θj}\{\Theta_{j}\}, whose normalization provide group-specific random probability measures {Θ~j}\{\widetilde{\Theta}_{j}\} for mixture modeling. The gamma-NB process can also be augmented as

according to Corollary 4. These three closely related constructions are graphically presented in Fig. 2.

With Corollaries 2 and 4, p:=jln(1pj)cjln(1pj)p^{\prime}:=\frac{-\sum_{j}\ln(1-p_{j})}{c-\sum_{j}\ln(1-p_{j})} and L:=jLjL:=\sum_{j}L_{j}, we further have two equivalent augmentations:

These augmentations allow us to derive a sequence of closed-form update equations, as described below.

2 Posterior Analysis and Predictive Distribution

With pj\mboxBeta(a0,b0)p_{j}\sim\mbox{Beta}(a_{0},b_{0}) and (10), we have

If G0G_{0} is finite and continuous, we have G0(ω)0G_{0}(\omega)\rightarrow 0  ωΩ\forall~{}\omega\in\Omega and thus L(Ω)L,G0=ωΩδ(L(ω)>0)=ωΩδ(jXj(ω)>0)=K+;L^{\prime}(\Omega)|L,G_{0}=\sum_{\omega\in\Omega}\delta(L(\omega)>0)=\sum_{\omega\in\Omega}\delta(\sum_{j}X_{j}(\omega)>0)=K^{+}; if G0G_{0} is discrete as G0=k=1Kγ0KδωkG_{0}=\sum_{k=1}^{K}\frac{\gamma_{0}}{K}\delta_{\omega_{k}}, then L(ωk)=\mboxCRT(L(ωk),γ0K)1L^{\prime}(\omega_{k})=\mbox{CRT}(L(\omega_{k}),\frac{\gamma_{0}}{K})\geq 1 if jXj(ωk)>0\sum_{j}X_{j}(\omega_{k})>0, thus L(Ω)K+L^{\prime}(\Omega)\geq K^{+}. In either case, let γ0=G0(Ω)\mboxGamma(e0,1/f0)\gamma_{0}=G_{0}(\Omega)\sim{\mbox{Gamma}}(e_{0},1/f_{0}), with the gamma-Poisson conjugacy on (14) and (5.1), we have

Using the gamma-Poisson conjugacy on (11), we have

Since the data {xji}i\{x_{ji}\}_{i} are exchangeable within group jj, conditioning on Xji=Xj\XjiX_{j}^{-i}=X_{j}\backslash X_{ji} and GG, with Θj\Theta_{j} marginalized out, we have

This prediction rule is similar to that of the Chinese restaurant franchise (CRF) .

3 Relationship with Hierarchical Dirichlet Process

With Corollary 3 and Section 2.2.1, we can equivalently express the gamma-gamma-Poisson process in (11) as

where Θj=θjΘ~j\Theta_{j}=\theta_{j}\widetilde{\Theta}_{j}, G=αG~G=\alpha\widetilde{G} and G0=γ0G~0G_{0}=\gamma_{0}\widetilde{G}_{0}. Without modeling Xj(Ω)X_{j}(\Omega) and θj\theta_{j} as random variables, (5.3) becomes an HDP . Thus the augmented and then normalized gamma-NB process leads to an HDP. However, we cannot return from the HDP to the gamma-NB process without modeling Xj(Ω)X_{j}(\Omega) and θj\theta_{j} as random variables. Theoretically, they are distinct in that the gamma-NB process is a completely random measure, assigning independent random variables into any disjoint Borel sets {Aq}1,Q\{A_{q}\}_{1,Q} in Ω\Omega, and the count Xj(A)X_{j}(A) has the distribution as Xj(A)\mboxNB(G(A),pj)X_{j}(A)\sim\mbox{NB}(G(A),p_{j}); by contrast, due to normalization, the HDP is not, and marginally

Practically, the gamma-NB process can exploit Corollary 4 and the gamma-Poisson conjugacy to achieve analytic conditional posteriors. The inference of the HDP is a challenge and it is usually solved through alternative constructions such as the CRF and stick-breaking representations . In particular, both concentration parameters α\alpha and γ0\gamma_{0} are nontrivial to infer and they are often simply fixed . One may apply the data augmentation method of to sample α\alpha and γ0\gamma_{0}. However, if G~0\widetilde{G}_{0} is discrete as G~0=k=1K1Kδωk\widetilde{G}_{0}=\sum_{k=1}^{K}\frac{1}{K}\delta_{\omega_{k}}, which is of practical value and becomes a continuous base measure as KK\rightarrow\infty , then using that method to sample γ0\gamma_{0} is only approximately correct, which may result in a biased estimate in practice, especially if KK is not sufficiently large.

By contrast, in the gamma-NB process, the shared GG can be analytically updated with (20) and G(Ω)G(\Omega) plays the role of α\alpha in the HDP, which is readily available as

and as in (19), regardless of whether the base measure is continuous, the total mass γ0\gamma_{0} has an analytic gamma posterior. Equation (24) also intuitively shows how the NB probability parameters {pj}\{p_{j}\} govern the variations among {Θ~j}\{\widetilde{\Theta}_{j}\} in the gamma-NB process. In the HDP, pjp_{j} is not explicitly modeled, and since its value appears irrelevant when taking the normalized constructions in (5.3), it is usually treated as a nuisance parameter and perceived as pj=0.5p_{j}=0.5 when needed for interpretation.

Another related model is the DILN-HDP in , where group-specific Dirichlet processes are normalized from gamma processes, with the gamma scale parameters either fixed as pj1pj=1\frac{p_{j}}{1-p_{j}}=1 or learned with a log Gaussian process prior. Yet no analytic conditional posteriors are provided and Gibbs sampling is not considered as a viable option. The main purpose of is introducing correlations between mixture components. It would be interesting to compare the differences between learning the {pj}\{p_{j}\} with beta priors and learning the gamma scale parameters with the log Gaussian process prior.

The Negative Binomial Process Family

The gamma-NB process shares the NB dispersion across groups while the NB probability parameters are group dependent. Since the NB distribution has two adjustable parameters, it is natural to wonder whether one can explore sharing the NB probability measure across groups, while making the NB dispersion parameters group specific or atom dependent. That kind of construction would be distinct from both the gamma-NB process and HDP in that Θj\Theta_{j} has space dependent scales, and thus its normalization Θ~j=ΘjΘj(Ω)\widetilde{\Theta}_{j}=\frac{\Theta_{j}}{\Theta_{j}(\Omega)}, still a random probability measure, no longer follows a Dirichlet process.

It is natural to let the NB probability measure be drawn from the beta process . In fact, the first discovered member of the NB process family is a beta-NB process . A main motivation of that construction is observing that the beta and Bernoulli distributions are conjugate and the beta-Bernoulli process is found to be quite useful for dictionary learning , whereas although the beta distribution is also conjugate to the NB distribution, there is apparent lack of exploitation of that relationship .

A beta-NB process is constructed by letting

With B\mboxBP(c,B0)B\sim\mbox{BP}(c,B_{0}) expressed as B=k=1pkδωkB=\sum_{k=1}^{\infty}p_{k}\delta_{\omega_{k}}, a random draw from \mboxNBP(rj,B)\mbox{NBP}(r_{j},B) can be expressed as

Under this construction, the NB probability measure is shared and the NB dispersion parameters are group dependent. Note that if {rj}\{r_{j}\} are fixed as one, then the beta-NB process reduces to the beta-geometric process, related to the one for count modeling discussed in ; if {rj}\{r_{j}\} are empirically set to some other values, then the beta-NB process reduces to the one proposed in . These simplifications are not considered in the paper, as they are often overly restrictive.

The asymptotic behavior of the beta-NB process with respect to the NB dispersion parameter is studied in . Such analysis is not provided here as we infer NB dispersion parameters from the data, which usually do not have large values due to overdispersion. In , the beta-NB process is treated comparable to a gamma-Poisson process and is considered less flexible than the HDP, motivating the construction of a hierarchical-beta-NB process. By contrast, in this paper, with the beta-NB process augmented as a beta-gamma-Poisson process, one can draw group-specific Poisson rate measures for count modeling and then use their normalization to provide group-specific random probability measures for mixture modeling; therefore, the beta-NB process, gamma-NB process and HDP are treated comparable to each other in hierarchical structures and are all considered suitable for mixed-membership modeling.

As in , we may also consider a marked-beta-NB process, with both the NB probability and dispersion measures shared, in which each point of the beta process is marked with an independent gamma random variable. Thus a draw from the marked-beta process becomes (R,B)=k=1(rk,pk)δωk(R,B)=\sum_{k=1}^{\infty}(r_{k},p_{k})\delta_{\omega_{k}}, and a draw from the NB process Xj\mboxNBP(R,B)X_{j}\sim\mbox{NBP}(R,B) becomes

With the beta-NB conjugacy, the posterior of BB is tractable in both the beta-NB and marked-beta-NB processes . Similar to the marked-beta-NB process, we may also consider a marked-gamma-NB process, where each point of the gamma process is marked with an independent beta random variable, whose performances is found to be similar.

If it is believed that there are excessive number of zeros, governed by a process other than the NB process, we may introduce a zero inflated NB process as Xj\mboxNBP(RZj,pj)X_{j}\sim\mbox{NBP}(RZ_{j},p_{j}), where Zj\mboxBeP(B)Z_{j}\sim\mbox{BeP}(B) is drawn from the Bernoulli process and (R,B)=k=1(rk,πk)δωk(R,B)=\sum_{k=1}^{\infty}(r_{k},\pi_{k})\delta_{\omega_{k}} is drawn from a gamma marked-beta process, thus a draw from \mboxNBP(RZj,pj)\mbox{NBP}(RZ_{j},p_{j}) can be expressed as Xj=k=1njkδωkX_{j}=\sum_{k=1}^{\infty}n_{jk}\delta_{\omega_{k}}, with

This construction can be linked to the focused topic model in with appropriate normalization, with advantages that there is no need to fix pj=0.5p_{j}=0.5 and the inference is fully tractable. The zero inflated construction can also be linked to models for real valued data using the Indian buffet process (IBP) or beta-Bernoulli process spike-and-slab prior . Below we apply various NB processes for topic modeling and illustrate the differences between them.

Negative Binomial Process Topic Modeling and Poisson Factor Analysis

For the gamma-NB process described in Section 5, with the gamma process expressed as G=k=1rkδϕkG=\sum_{k=1}^{\infty}r_{k}\delta_{\boldsymbol{\phi}_{k}}, we can express the hierarchical model as

where g0(dϕ)=G0(dϕ)/γ0g_{0}(d\boldsymbol{\phi})=G_{0}(d\boldsymbol{\phi})/\gamma_{0}. With θj=Θj(Ω)=k=1θjk\theta_{j}=\Theta_{j}(\Omega)=\sum_{k=1}^{\infty}\theta_{jk}, nj=(nj1,,nj)T{\boldsymbol{n}}_{j}=(n_{j1},\cdots,n_{j\infty})^{T} and θj=(θj1,,θj)T\boldsymbol{\theta}_{j}=(\theta_{j1},\cdots,\theta_{j\infty})^{T}, using Corollary 3, we can equivalently express NjN_{j} and njkn_{jk} in (29) as

Since {xji}i=1,Nj\{x_{ji}\}_{i=1,N_{j}} are fully exchangeable, rather than drawing nj{\boldsymbol{n}}_{j} as in (30), we may equivalently draw it as

This provides further insights on uniting the seemingly distinct problems of count and mixture modeling.

Denote nvjk=i=1Njδ(zji=k,vji=v)n_{vjk}=\sum_{i=1}^{N_{j}}\delta(z_{ji}=k,v_{ji}=v), nvk=jnvjkn_{v\boldsymbol{\cdot}k}=\sum_{j}n_{vjk} and nk=jnjkn_{\boldsymbol{\cdot}k}=\sum_{j}n_{jk}. For modeling convenience, we place Dirichlet priors on topics ϕk\mboxDir(η,,η)\boldsymbol{\phi}_{k}\sim\mbox{Dir}(\eta,\cdots,\eta), then for the gamma-NB process topic model, we have

which would be the same for the other NB processes, since the gamma-NB process differs from them only on how the gamma priors of θjk\theta_{jk} and consequently the NB priors of njkn_{jk} are constituted. For example, marginalizing out θjk\theta_{jk}, we have njk\mboxNB(rk,pj)n_{jk}\sim\mbox{NB}(r_{k},p_{j}) for the gamma-NB process, njk\mboxNB(rj,pk)n_{jk}\sim\mbox{NB}(r_{j},p_{k}) for the beta-NB process, njk\mboxNB(rk,pk)n_{jk}\sim\mbox{NB}(r_{k},p_{k}) for both the marked-beta-NB and marked-gamma-NB processes, and njk\mboxNB(rkbjk,pj)n_{jk}\sim\mbox{NB}(r_{k}b_{jk},p_{j}) for the zero-inflated-NB process.

As in , we may augment mvj\mboxPois(k=1Kϕvkθjk)m_{vj}\sim\mbox{Pois}(\sum_{k=1}^{K}\phi_{vk}\theta_{jk}) as

If v=1Vϕvk=1\sum_{v=1}^{V}{\phi_{vk}}=1, we have njk\mboxPois(θjk)n_{jk}\sim\mbox{Pois}(\theta_{jk}), and with Corollary 3 and θj=(θj1,,θjK)T\boldsymbol{\theta}_{j}=(\theta_{j1},\cdots,\theta_{jK})^{T}, we also have (n_{vj1},\cdots,n_{vjK}|-)\sim\mbox{Mult}\big{(}m_{vj};\frac{\phi_{v1}\theta_{j1}}{\sum_{k=1}^{K}\phi_{vk}\theta_{jk}}, \cdots,\frac{\phi_{vK}\theta_{jK}}{\sum_{k=1}^{K}\phi_{vk}\theta_{jk}}\big{)}, (nv1,,nvK)(n_{v\boldsymbol{\cdot}1},\cdots,n_{v\boldsymbol{\cdot}K}|-) \mboxMult(nk;ϕk)\sim\mbox{Mult}(n_{\boldsymbol{\cdot}k};\boldsymbol{\phi}_{k}), and (nj1,,njK)(n_{j1},\cdots,n_{jK}|-) \mboxMult(Nj;θj)\sim\mbox{Mult}\left(N_{j};\boldsymbol{\theta}_{j}\right), which would lead to (32) under the assumption that the words {xji}i\{x_{ji}\}_{i} are exchangeable and (33) if ϕk\mboxDir(η,,η)\boldsymbol{\phi}_{k}\sim\mbox{Dir}(\eta,\cdots,\eta). Thus topic modeling with the NB process can be considered as factorization of the term-document count matrix under the Poisson likelihood as M\mboxPois(ΦΘ){\bf M}\sim\mbox{Pois}(\boldsymbol{\Phi}\boldsymbol{\Theta}).

PFA provides a unified framework to connect previously proposed discrete latent variable models, such as those in . As discussed in detail in , these models mainly differ on how the priors of ϕvk\phi_{vk} and θjk\theta_{jk} are constituted and how the inferences are implemented. For example, nonnegative matrix factorization with an objective function of minimizing the Kullback-Leibler (KL) divergence DKL(MΦΘ)D_{\text{KL}}({\bf M}||\boldsymbol{\Phi}\boldsymbol{\Theta}) is equivalent to the ML estimation of Φ\boldsymbol{\Phi} and Θ\boldsymbol{\Theta} under PFA, and latent Dirichlet allocation (LDA) is equivalent to a PFA with Dirichlet priors imposed on both ϕk\boldsymbol{\phi}_{k} and θj\boldsymbol{\theta}_{j}.

2 Negative Binomial Process Topic Modeling

From the point view of PFA, an NB process topic model factorizes the term-document count matrix under the constraints that each factor sums to one and the factor scores are gamma distributed random variables, and consequently, the number of words assigned to a topic (factor/atom) follows an NB distribution. Depending on how the NB distributions are parameterized, as shown in Table I, we can construct a variety of NB process topic models, which can also be connected to a large number of previously proposed parametric and nonparametric topic models. For a deeper understanding on how the counts are modeled, we also show in Table I both the variance-to-mean ratio (VMR) and overdispersion level (ODL) implied by these settings. Eight differently constructed NB processes are considered:

(i) The NB process described in Section 4 is used for topic modeling. It improves over the count-modeling gamma-Poisson process discussed in in that it unites mixture modeling and has closed-form conditional posteriors. Although this is a nonparametric model supporting an infinite number of topics, requiring {θjk}j=1,Jrk\{\theta_{jk}\}_{j=1,J}\equiv r_{k} may be too restrictive.

(ii) Related to LDA and Dir-PFA , the NB-LDA is also a parametric topic model that requires tuning the number of topics. It is constructed by replacing the topic weights of the Gamma-NB process in (29) as θjk\mboxGamma(rj,pj/(1pj))\theta_{jk}\sim\mbox{Gamma}(r_{j},p_{j}/(1-p_{j})). It uses document dependent rjr_{j} and pjp_{j} to learn the smoothing of the topic weights, and it lets rj\mboxGamma(γ0,1/c), γ0\mboxGamma(e0,1/f0)r_{j}\sim\mbox{Gamma}(\gamma_{0},1/c),~{}\gamma_{0}\sim\mbox{Gamma}(e_{0},1/f_{0}) to share statistical strength between documents.

(iii) Related to the HDP , the NB-HDP model is constructed by fixing pj/(1pj)1p_{j}/(1-p_{j})\equiv 1 (i.e., pj0.5p_{j}\equiv 0.5) in (29). It is also comparable to the HDP in that constructs group-specific Dirichlet processes with normalized gamma processes, whose scale parameters are also set as one.

(iv) The NB-FTM model is constructed by replacing the topic weights in (29) as θjk\mboxGamma(rkbjk,pj/(1pj))\theta_{jk}\sim\mbox{Gamma}(r_{k}b_{jk},p_{j}/(1-p_{j})), with pj0.5p_{j}\equiv 0.5 and bjkb_{jk} drawn from a beta-Bernoulli process that is used to explicitly model zero counts. It is the same as the sparse-gamma-gamma-PFA (SγΓ\gamma\Gamma-PFA) in and is comparable to the focused topic model (FTM) , which is constructed from the IBP compound Dirichlet process. The Zero-Inflated-NB process improves over these approaches by allowing {pj}\{p_{j}\} to be inferred, which generally yields better data fitting.

(v) The Gamma-NB process, as shown in (10) and (29), explores sharing the NB dispersion measure across groups, and it improves over the NB-HDP by allowing the learning of {pj}\{p_{j}\}. As shown in (5.3), it reduces to the HDP in without modeling Xj(Ω)X_{j}(\Omega) and θj\theta_{j} as random variables.

(vi) The Beta-Geometric process is constructed by replacing the topic weights in (29) as θjk\mboxGamma(1,pk/(1pk))\theta_{jk}\sim\mbox{Gamma}(1,p_{k}/(1-p_{k})). It explores sharing the NB probability measure across groups, which is related to the one proposed for count modeling in . It is restrictive that the NB dispersion parameters are fixed as one.

(vii) The Beta-NB process is constructed by replacing the topic weights in (29) as θjk\mboxGamma(rj,pk/(1pk))\theta_{jk}\sim\mbox{Gamma}(r_{j},p_{k}/(1-p_{k})). It explores sharing the NB probability measure across groups, which improves over the Beta-Geometric process and the beta-NB process (BNBP) proposed in by providing analytic conditional posteriors of {rj}\{r_{j}\}.

(viii) The Marked-Beta-NB process constructed by replacing the topic weights in (29) as θjk\mboxGamma(rk,pk/(1pk))\theta_{jk}\sim\mbox{Gamma}(r_{k},p_{k}/(1-p_{k})). It is comparable to the BNBP proposed in , with the distinction that it provides analytic conditional posteriors of of {rk}\{r_{k}\}.

3 Approximate and Exact Inference

Although all proposed NB process models have closed-form conditional posteriors, they contain countably infinite atoms that are infeasible to explicitly represent in practice. This infinite dimensional problem can be addressed by using a discrete base measure with KK atoms, i.e., truncating the total number of atoms to be KK, and then doing Bayesian inference via block Gibbs sampling . This is a very general approach and is used in our experiments to make a fair comparison between a wide variety of models. Block gibbs sampling for the Gamma-NB process is described in Appendix B; block gibbs sampling for other NB processes and related algorithms in Table I can be similarly derived, as described in and omitted here for brevity. The infinite dimensional problem can also be addressed by discarding the atoms with weights smaller than a small constant ϵ\epsilon or by modifying the Lévy measure to make its integral over the whole space be finite . A sufficiently large (small) KK (ϵ\epsilon) usually provides a good approximation, however, there is an increasing risk of wasting computation as the truncation level gets larger.

To avoid truncation, the slice sampling scheme of has been utilized for the Dirichlet process and normalized random measure based mixture models . With auxiliary slice latent variables introduced to allow adaptive truncations in each MCMC interaction, the infinite dimensional problem is transformed into a finite one. This method has also been applied to the beta-Bernoulli process and the beta-NB process . It would be interesting to investigate slice sampling for the NB process based count and mixture models, which provide likelihoods that might be more amenable to posterior simulation since no normalization is imposed on the weights of the atoms. As slice sampling is not the focus of this paper, we leave it for future study.

Both the block Gibbs sampler and the slice sampler explicitly represent a finite set of atoms for posterior simulation, and algorithms based on these samplers are commonly referred as “conditional” methods . Another approach of solving the infinite dimensional problem is employing a collapsed inference scheme that marginalizes out the atoms and their weights . Algorithms based on the collapsed inference scheme are usually referred as “marginal” methods . A well-defined prediction rule is usually required to develop a collapsed Gibbs sampler, and the conjugacy between the likelihood and the prior distribution of atoms is desired to avoid numerical integrations. In topic modeling, a word is linked to a Dirichlet distributed atom with a multinomial likelihood, thus the atoms can be analytically marginalized out; since their weights can also be marginalized out as in (5.2), we may develop a collapsed Gibbs sampler for the gamma-NB process based topic models. As the collapsed inference scheme is not the focus of this paper and the prediction rules for other NB processes need further investigation, we leave them for future study.

Example Results and Discussions

Motivated by Table I, we consider topic modeling using a variety of NB processes. We compare them with LDA and CRF-HDP , in which the latent count njkn_{jk} is marginally distributed as

We consider the Psychological Reviewhttp://psiexp.ss.uci.edu/research/programs_\_data/toolbox.htm corpus, restricting the vocabulary to terms that occur in five or more documents. The corpus includes 1281 abstracts from 1967 to 2003, with V=2566V=2566 and 71,279 total word counts. We randomly select 20%20\%, 40%40\%, 60%60\% or 80%80\% of the words from each document to learn a document dependent probability for each term vv and calculate the per-word perplexity on the held-out words as

where fjv=s=1Sk=1Kϕvk(s)θjk(s)s=1Sv=1Vk=1Kϕvk(s)θjk(s)f_{jv}=\frac{\sum_{s=1}^{S}\sum_{k=1}^{K}\phi^{(s)}_{vk}\theta^{(s)}_{jk}}{\sum_{s=1}^{S}\sum_{v=1}^{V}\sum_{k=1}^{K}\phi^{(s)}_{vk}\theta^{(s)}_{jk}}, yjvy_{jv} is the number of words held out at term vv in document jj, y=j=1Jv=1Vyjvy_{\boldsymbol{\cdot}\boldsymbol{\cdot}}=\sum_{j=1}^{J}\sum_{v=1}^{V}y_{jv}, and s=1,,Ss=1,\cdots,S are the indices of collected samples. Note that the per-word perplexity is equal to VV if fjv=1Vf_{jv}=\frac{1}{V}, thus it should be no greater than VV for a topic model that works appropriately. The final results are averaged over five random training/testing partitions. The performance measure is the same as the one used in and similar to those in .

We show in Fig. 3 the NB dispersion and probability parameters learned by various NB process topic models listed in Table I, revealing distinct sharing mechanisms and model properties. In Fig. 4 we compare the per-held-out-word prediction performance of various algorithms. We set the parameters as c=1c=1, η=0.05\eta=0.05 and a0=b0=e0=f0=0.01a_{0}=b_{0}=e_{0}=f_{0}=0.01. For LDA and NB-LDA, we search KK for optimal performance. All the other NB process topic models are nonparametric Bayesian models that can automatically learn the number of active topics K+K^{+} for a given corpus. For fair comparison, all the models considered are implemented with block Gibbs sampling, where K=400K=400 is set as an upper-bound.

When θjkrk\theta_{jk}\equiv r_{k} is used, as in the NB process, different documents are imposed to have the same topic weights, leading to the worst held-out-prediction performance.

When (rj,pj)(r_{j},p_{j}) is used, as in NB-LDA, different documents are weakly coupled with rj\mboxGamma(γ0,1/c)r_{j}\sim\mbox{Gamma}(\gamma_{0},1/c), and the modeling results in Fig. 3 show that a typical document in this corpus usually has a small rjr_{j} and a large pjp_{j}, thus a large overdispersion level (ODL) and a large variance-to-mean ratio (VMR), indicating highly overdispersed counts on its topic usage. NB-LDA is a parametric topic model that requires tuning the number of topics KK. It improves over LDA in that it only has to tune KK, whereas LDA has to tune both KK and α\alpha. With an appropriate KK, the parametric NB-LDA may outperform the nonparametric NB-HDP and NB-FTM as the training data percentage increases, showing that even by learning both the NB parameters rjr_{j} and pjp_{j} in a document dependent manner, we may get better data fitting than using nonparametric models that fix the NB probability parameters.

The NB-HDP is a special case of the Gamma-NB process that pj0.5p_{j}\equiv 0.5. From a mixture modeling viewpoint, fixing pj=0.5p_{j}=0.5 is a natural choice as pjp_{j} appears irrelevant after normalization. However, from a count modeling viewpoint, this would make a restrictive assumption that each count vector {njk}k=1,K\{n_{jk}\}_{k=1,K} has the same VMR of 2. It is also interesting to examine (24), which can be viewed as the concentration parameter α\alpha in the HDP, allowing the adjustment of pjp_{j} would allow a more flexible model assumption on the amount of variations between the topic proportions, and thus potentially better data fitting.

When (rk,πk)(r_{k},\pi_{k}) is used, as in the NB-FTM model, our results in Fig. 3 show that we usually have a small πk\pi_{k} and a large rkr_{k}, indicating topic kk is sparsely used across the documents but once it is used, the amount of variation on usage is small. This property might be helpful when there are excessive number of zeros that might not be well modeled by the NB process alone. In our experiments, the more direct approaches of using pkp_{k} or pjp_{j} generally yield better results, which might not be the case when excessive number of zeros could be better explained with the beta-Bernoulli processes, e.g., when the training words are scarce, the NB-FTM can approach the performance of the Marked-Beta-NB process.

Conclusions

We propose a variety of negative binomial (NB) processes for count modeling, which can be naturally applied for the seemingly disjoint problem of mixture modeling. The proposed NB processes are completely random measures, which assign independent random variables to disjoint Borel sets of the measure space, as opposed to Dirichlet processes, whose measures on disjoint Borel sets are negatively correlated. We reveal connections between various discrete distributions and discover unique data augmentation and marginalization methods for the NB process, with which we are able to unite count and mixture modeling, analyze fundamental model properties, and derive efficient Bayesian inference. We demonstrate that the NB process and the gamma-NB process can be recovered from the Dirichlet process and the HDP, respectively. We show in detail the theoretical, structural and computational advantages of the NB process. We examine the distinct sharing mechanisms and model properties of various NB processes, with connections made to existing discrete latent variable models under the Poisson factor analysis framework. Experimental results on topic modeling show the importance of modeling both the NB dispersion and probability parameters, which respectively govern the overdispersion level and variance-to-mean ratio for count modeling.

Acknowledgments

The authors would like to thank the two anonymous reviewers and the editor for their constructive comments that help improve the manuscript.

References

Appendix A Proof of Theorem 1

With the PMFs of both the NB and CRT distributions, the PMF of the joint distribution of counts mm and ll is fM,L(m,lr,p)=fL(lm,r)fM(mr,p)=Γ(r)s(m,l)rlΓ(m+r)Γ(r+m)(1p)rpmm!Γ(r)=s(m,l)rl(1p)rpmm!f_{M,L}(m,l|r,p)=f_{L}(l|m,r)f_{M}(m|r,p)={\frac{\Gamma(r)|s(m,l)|r^{l}}{\Gamma(m+r)}}\frac{\Gamma(r+m)(1-p)^{r}p^{m}}{m!\Gamma(r)}=\frac{|s(m,l)|r^{l}(1-p)^{r}p^{m}}{m!}, which is the same as (4).

Since m\mboxSumLog(l,p)m\sim{\mbox{SumLog}}(l,p) is the summation of ll iid \mboxLog(p)\mbox{Log}(p) random variables, its PGF becomes CM(z)=CUl(z)=[ln(1pz)/ln(1p)]l, z<p1.C_{M}(z)=C_{U}^{l}(z)=\left[{\ln(1-pz)}/{\ln(1-p)}\right]^{l},~{}|z|<{p^{-1}}. With [ln(1+x)]l=l!n=ls(n,l)xn/n![\ln(1+x)]^{l}=l!\sum_{n=l}^{\infty}{s(n,l)x^{n}}/{n!} and s(m,l)=(1)mls(m,l)|s(m,l)|=(-1)^{m-l}s(m,l) , its PMF can be expressed as

Letting l\mboxPois(rln(1p))l\sim{\mbox{Pois}}(-r\ln(1-p)), the PMF of the joint distribution of counts mm and ll is fM,L(m,lr,p)=fM(ml,p)fL(lr,p)=pml!s(m,l)m![ln(1p)]l(rln(1p))lerln(1p)l!=s(m,l)rl(1p)rpmm!f_{M,L}(m,l|r,p)=f_{M}(m|l,p)f_{L}(l|r,p)=\frac{p^{m}l!|s(m,l)|}{m![-\ln(1-p)]^{l}}\frac{(-r\ln(1-p))^{l}e^{r\ln(1-p)}}{l!}=\frac{|s(m,l)|r^{l}(1-p)^{r}p^{m}}{m!}, which is the same as (4). ∎

Appendix B Block Gibbs Sampling for the Gamma-Negative Binomial Process

With pj\mboxBeta(a0,b0)p_{j}\sim\mbox{Beta}(a_{0},b_{0}), γ0\mboxGamma(e0,1/f0)\gamma_{0}\sim\mbox{Gamma}(e_{0},1/f_{0}) and a discrete base measure as G0=k=1Kγ0KδωkG_{0}=\sum_{k=1}^{K}\frac{\gamma_{0}}{K}\delta_{\omega_{k}}, following Section 5.2, block Gibbs sampling for (29) proceeds as

where p:=jln(1pj)cjln(1pj)p^{\prime}:=\frac{-\sum_{j}\ln(1-p_{j})}{c-\sum_{j}\ln(1-p_{j})}. Note that when KK\rightarrow\infty, we have (lk)=δ(jnjk>0)(l^{\prime}_{k}|-)=\delta(\sum_{j}n_{jk}>0) and thus klk=K+\sum_{k}l^{\prime}_{k}=K^{+}.