Negative Binomial Process Count and Mixture Modeling
Mingyuan Zhou, Lawrence Carin
Introduction
Count data appear in many settings, such as predicting the number of motor insurance claims , analyzing infectious diseases and modeling topics of document corpora . There has been increasing interest in count modeling using the Poisson process, geometric process and recently the negative binomial (NB) process . It is shown in and further demonstrated in that the NB process, originally constructed for count analysis, can be naturally applied for mixture modeling of grouped data , where each group . For example, in topic modeling (mixed-membership modeling), each document consists of a group of exchangeable words and each word is a group member that is assigned to a topic; the number of times a topic appears in a document is a latent count random variable that could be well modeled with an NB distribution .
Mixture modeling, which infers random probability measures to assign data samples into clusters (mixture components), is a key research area of statistics and machine learning. Although the numbers of samples assigned to clusters are counts, mixture modeling is not typically considered as a count-modeling problem. It is often addressed under the Dirichlet-multinomial framework, using the Dirichlet process as the prior distribution. With the Dirichlet-multinomial conjugacy, the Dirichlet process mixture model enjoys tractability because the posterior of the random probability measure is still a Dirichlet process. Despite its popularity, the Dirichlet process is inflexible in that a single concentration parameter controls both the variability of the mass around the mean and the distribution of the number of distinct atoms. For mixture modeling of grouped data, the hierarchical Dirichlet process (HDP) has been further proposed to share statistical strength between groups. The HDP inherits the same inflexibility of the Dirichlet process, and due to the non-conjugacy between Dirichlet processes, its inference has to be solved under alternative constructions, such as the Chinese restaurant franchise and stick-breaking representations. To make the number of distinct atoms increase at a rate faster than that of the Dirichlet process, one may consider the Pitman-Yor process or the normalized generalized gamma process, which provide extra parameters to add flexibility.
To construct more expressive mixture models with tractable inference, in this paper we consider mixture modeling as a count-modeling problem. Directly modeling the counts assigned to mixture components as NB random variables, we perform joint count and mixture modeling via the NB process, using completely random measures that are easy to construct and amenable to posterior computation. By constructing a bivariate count distribution that connects the Poisson, logarithmic, NB and Chinese restaurant table distributions, we develop data augmentation and marginalization techniques unique to the NB distribution, with which we augment an NB process into both the gamma-Poisson and compound Poisson representations, yielding unification of count and mixture modeling, derivation of fundamental model properties, as well as efficient Bayesian inference.
Under the NB process, we employ a gamma process to model the rate measure of a Poisson process. The normalization of the gamma process provides a random probability measure (not necessarily a Dirichlet process) for mixture modeling, and the marginalization of the gamma process leads to an NB process for count modeling. Since the gamma scale parameters appear as NB probability parameters when the gamma processes are marginalized out, they directly control count distributions on atoms and they could be conveniently inferred with the beta-NB conjugacy. For mixture modeling of grouped data, we construct hierarchical models by employing an NB process for each group and sharing their NB dispersion or probability measures across groups. Different parameterizations of the NB dispersion and probability parameters result in a wide variety of NB processes, which are connected to previously proposed nonparametric Bayesian mixture models. The proposed joint count and mixture modeling framework provides new opportunities for better data fitting, efficient inference and flexible model constructions.
Parts of the work presented here first appeared in . In this paper, we unify related materials scattered in these three conference papers and provide significant expansions. In particular, we construct a Poisson-logarithmic bivariate distribution that tightly connects the NB and Chinese restaurant table distributions, extending the Chinese restaurant process to describe the case that both the numbers of customers and tables are random variables, and we provide necessary conditions to recover the NB process and the gamma-NB process from the Dirichlet process and HDP, respectively.
We mention that a related beta-NB process has been independently investigated in . Our constructions of a wide variety of NB processes, including the beta-NB processes in and as special cases, are built on our thorough investigation of the properties, relationships and inference of the NB and related stochastic processes. In particular, we show that the gamma-Poisson construction of the NB process is key to uniting count and mixture modeling, and there are two equivalent augmentations of the NB process that allow us to develop analytic conditional posteriors and predictive distributions. These insights are not provided in , and the NB dispersion parameters there are empirically set rather than inferred. More distinctions will be discussed along with specific models.
The remainder of the paper is organized as follows. We review some commonly used nonparametric Bayesian priors in Section 2 and study the NB distribution in Section 3. We present the NB process in Section 4, the gamma-NB process in Section 5, and the NB process family in Section 6. We discuss NB process topic modeling in Section 7 and present example results in Section 8.
Preliminaries
where . A random signed measure satisfying (1) is called a Lévy random measure. More generally, if the Lévy measure satisfies
for each compact , the Lévy random measure is well defined, even if the Poisson intensity is infinite. A nonnegative Lévy random measure satisfying (2) was called a completely random measure , and it was introduced to machine learning in and .
1.2 Gamma Process
where .
1.3 Beta Process
where .
2 Dirichlet and Chinese Restaurant Processes
Denote , where , then for any measurable disjoint partition of , we have , where and . Therefore, with a space invariant scale parameter , the normalized gamma process is a Dirichlet process with concentration parameter and base probability measure , expressed as . Unlike the gamma process, the Dirichlet process is no longer a completely random measure as the random variables for disjoint sets are negatively correlated.
A gamma process with a space invariant scale parameter can also be recovered from a Dirichlet process: if a gamma random variable and a Dirichlet process are independent with and , then becomes a gamma process as .
2.2 Chinese Restaurant Process
In a Dirichlet process , we assume ; are independent given and hence exchangeable. The predictive distribution of a new data sample , conditioning on , with marginalized out, can be expressed as
where are distinct atoms in observed in and is the number of data samples associated with . The stochastic process described in (2.2.2) is known as the Pólya urn scheme and also the Chinese restaurant process .
The number of nonempty tables in a Chinese restaurant process, with concentration parameter and customers, is a random variable generated as This random variable is referred to as the Chinese restaurant table (CRT) random variable . As shown in , it has probability mass function (PMF)
where are Stirling numbers of the first kind.
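The CRT construction above can be checked numerically. The following Python sketch is illustrative only and is not the authors' implementation. It writes `m` for the number of customers and `r` for the concentration parameter (both names are assumptions of this sketch), draws the table count as a sum of independent Bernoulli variables with success probabilities r/(n-1+r), and evaluates the PMF as Γ(r)/Γ(m+r) |s(m,l)| r^l, with the unsigned Stirling numbers of the first kind computed by their standard recurrence.

```python
import math
import random


def unsigned_stirling_first(m_max):
    """Table s[m][l] of unsigned Stirling numbers of the first kind."""
    s = [[0] * (m_max + 1) for _ in range(m_max + 1)]
    s[0][0] = 1
    for m in range(1, m_max + 1):
        for l in range(1, m + 1):
            # |s(m, l)| = (m - 1) |s(m - 1, l)| + |s(m - 1, l - 1)|
            s[m][l] = (m - 1) * s[m - 1][l] + s[m - 1][l - 1]
    return s


def crt_pmf(m, r):
    """PMF of the table count l = 0..m for m customers and concentration r."""
    s = unsigned_stirling_first(m)
    norm = math.exp(math.lgamma(r) - math.lgamma(m + r))
    return [norm * s[m][l] * r ** l for l in range(m + 1)]


def sample_crt(m, r, rng):
    """Draw the table count as a sum of Bernoulli(r / (n - 1 + r)), n = 1..m."""
    return sum(rng.random() < r / (n - 1 + r) for n in range(1, m + 1))
```

Under these assumptions the PMF sums to one and its mean equals the sum of the Bernoulli success probabilities, which gives a quick consistency check between the two descriptions of the same random variable.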
Negative Binomial Distribution
Its mean and variance are both equal to $\lambda$. Due to heterogeneity (difference between individuals) and contagion (dependence between the occurrence of events), count data are usually overdispersed in that the variance is greater than the mean, making the Poisson assumption restrictive. By placing a gamma prior with shape $r$ and scale $p/(1-p)$ on $\lambda$ as $\lambda\sim\mbox{Gamma}\big(r,\frac{p}{1-p}\big)$ and marginalizing out $\lambda$, an NB distribution $m\sim\mbox{NB}(r,p)$ is obtained, with PMF
$$f_M(m\,|\,r,p)=\frac{\Gamma(r+m)}{m!\,\Gamma(r)}(1-p)^{r}p^{m},\quad m=0,1,2,\ldots,$$
where $r$ is the nonnegative dispersion parameter and $p$ is the probability parameter. Thus the NB distribution is also known as the gamma-Poisson mixture distribution. It has a mean $\mu=rp/(1-p)$ smaller than the variance $\sigma^{2}=rp/(1-p)^{2}=\mu+r^{-1}\mu^{2}$, with the variance-to-mean ratio (VMR) as $(1-p)^{-1}$ and the overdispersion level (ODL, the coefficient of the quadratic term in $\sigma^{2}$) as $r^{-1}$, and thus it is usually favored over the Poisson distribution for modeling overdispersed counts.
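The gamma-Poisson mixture view of the NB distribution can be illustrated with a short Python sketch. It is a hedged illustration, not the authors' code: the function names are hypothetical, the PMF is the standard NB form with dispersion `r` and probability `p`, and the Poisson sampler uses Knuth's multiplication method, which is adequate only for small rates. The sketch computes the mean and variance from the PMF and draws samples by first drawing a gamma rate with shape r and scale p/(1-p).

```python
import math
import random


def nb_pmf(m, r, p):
    """NB(r, p) probability mass at count m."""
    log_f = (math.lgamma(r + m) - math.lgamma(m + 1) - math.lgamma(r)
             + r * math.log1p(-p) + m * math.log(p))
    return math.exp(log_f)


def nb_moments(r, p, m_max=2000):
    """Mean and variance obtained by summing the (truncated) PMF."""
    pmf = [nb_pmf(m, r, p) for m in range(m_max + 1)]
    mean = sum(m * q for m, q in enumerate(pmf))
    var = sum((m - mean) ** 2 * q for m, q in enumerate(pmf))
    return mean, var


def sample_poisson(lam, rng):
    """Knuth's multiplication method; intended for small rates only."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_gamma_poisson(r, p, rng):
    """Draw lambda ~ Gamma(shape r, scale p / (1 - p)), then m ~ Pois(lambda)."""
    lam = rng.gammavariate(r, p / (1.0 - p))
    return sample_poisson(lam, rng)
```

For example, `nb_moments(2.5, 0.6)` can be compared against rp/(1-p) and rp/(1-p)^2, and the ratio of the two against 1/(1-p); the excess variance divided by the squared mean should match 1/r.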
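As shown in , $m\sim\mbox{NB}(r,p)$ can also be generated from a compound Poisson distribution as
$$m=\sum_{t=1}^{l}u_{t},\quad u_{t}\stackrel{iid}{\sim}\mbox{Log}(p),\quad l\sim\mbox{Pois}\big(-r\ln(1-p)\big),$$
where $u\sim\mbox{Log}(p)$ corresponds to the logarithmic distribution with PMF $f_U(k)=-p^{k}/[k\ln(1-p)]$, $k=1,2,\ldots$, and probability-generating function (PGF)
$$C_U(z)=\ln(1-pz)/\ln(1-p),\quad |z|<p^{-1}.$$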
One may also show that , and conditioning on , becomes as .
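The compound Poisson representation of the NB distribution can be verified numerically. The sketch below is an illustration under stated assumptions rather than a definitive implementation: it takes the Poisson rate to be -r ln(1-p) and the logarithmic PMF to be -p^k / (k ln(1-p)), builds the distribution of a Poisson number of logarithmic summands by repeated convolution, and compares the result with the NB PMF. All function names are hypothetical.

```python
import math


def nb_pmf(m, r, p):
    """NB(r, p) probability mass at count m."""
    log_f = (math.lgamma(r + m) - math.lgamma(m + 1) - math.lgamma(r)
             + r * math.log1p(-p) + m * math.log(p))
    return math.exp(log_f)


def log_series_pmf(k, p):
    """Logarithmic distribution on k = 1, 2, ..."""
    return -p ** k / (k * math.log1p(-p))


def compound_poisson_pmf(r, p, m_max):
    """PMF on 0..m_max of a Pois(-r ln(1-p)) number of Log(p) summands."""
    rate = -r * math.log1p(-p)
    u = [0.0] + [log_series_pmf(k, p) for k in range(1, m_max + 1)]
    conv = [1.0] + [0.0] * m_max  # l-fold convolution of u, starting at l = 0
    out = [0.0] * (m_max + 1)
    for l in range(m_max + 1):  # each summand is >= 1, so l <= m_max suffices
        weight = math.exp(-rate + l * math.log(rate) - math.lgamma(l + 1))
        for m in range(m_max + 1):
            out[m] += weight * conv[m]
        # convolve once more with the logarithmic PMF
        conv = [sum(conv[j] * u[m - j] for j in range(m))
                for m in range(m_max + 1)]
    return out
```

Because every logarithmic summand is at least one, the truncation at `m_max` does not affect the probabilities of counts up to `m_max`, so the comparison with `nb_pmf` is exact up to floating-point error.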
The NB distribution has been widely investigated and applied to numerous scientific studies. Although inference of the NB probability parameter $p$ is straightforward with the beta-NB conjugacy, inference of the NB dispersion parameter $r$, whose conjugate prior is unknown, has long been a challenge. The maximum likelihood (ML) approach is commonly used to estimate $r$; however, it only provides a point estimate and does not allow incorporating prior information. Moreover, the ML estimator of $r$ often lacks robustness and may be severely biased or even fail to converge, especially if the sample size is small. Bayesian approaches are able to model the uncertainty of estimation and incorporate prior information; however, the only available closed-form Bayesian inference for $r$ relies on approximating the ratio of two gamma functions.
Advancing previous research on the NB distribution in , we construct a Poisson-logarithmic bivariate distribution that assists Bayesian inference of the NB distribution and unites various count distributions.
The Poisson-logarithmic (PoisLog) bivariate distribution with PMF
where denotes the sum-logarithmic distribution generated as .
The proof of Theorem 1 is provided in Appendix A. As shown in Fig. 1, this bivariate distribution intuitively describes the joint distribution of two count random variables and under two equivalent circumstances:
1) There are customers seated at tables;
2) There are tables, each of which has customers, with customers in total.
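The two equivalent circumstances listed above can be compared numerically. In the Python sketch below, which is a hedged illustration and not the authors' code, `m` denotes the customer count, `l` the table count, `r` the dispersion parameter and `p` the probability parameter (names assumed here). The joint PMF is taken to be |s(m,l)| r^l (1-p)^r p^m / m!, and the sketch evaluates it in three ways: directly, as a CRT table count given an NB customer count, and as a sum-logarithmic customer count given a Poisson table count.

```python
import math


def unsigned_stirling_first(m_max):
    """Table s[m][l] of unsigned Stirling numbers of the first kind."""
    s = [[0] * (m_max + 1) for _ in range(m_max + 1)]
    s[0][0] = 1
    for m in range(1, m_max + 1):
        for l in range(1, m + 1):
            s[m][l] = (m - 1) * s[m - 1][l] + s[m - 1][l - 1]
    return s


def joint_direct(m, l, r, p, s):
    """|s(m,l)| r^l / m! * (1-p)^r p^m."""
    return s[m][l] * r ** l / math.factorial(m) * (1 - p) ** r * p ** m


def joint_customers_first(m, l, r, p, s):
    """NB(m; r, p) times CRT(l; m, r): customers first, then tables."""
    nb = math.exp(math.lgamma(r + m) - math.lgamma(m + 1) - math.lgamma(r)
                  + r * math.log1p(-p) + m * math.log(p))
    crt = math.exp(math.lgamma(r) - math.lgamma(m + r)) * s[m][l] * r ** l
    return nb * crt


def joint_tables_first(m, l, r, p, s):
    """Pois(l; -r ln(1-p)) times SumLog(m; l, p): tables first, then customers."""
    rate = -r * math.log1p(-p)
    pois = math.exp(-rate) * rate ** l / math.factorial(l)
    sumlog = (math.factorial(l) * s[m][l] * p ** m
              / (math.factorial(m) * (-math.log1p(-p)) ** l))
    return pois * sumlog
```

Evaluating the three functions on a grid of small `(m, l)` pairs should give matching values, including zero whenever `l > m` or when `l = 0` with `m > 0`.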
In a slight abuse of notation, but for added conciseness, in the following discussion we use to denote .
Let represent a gamma-NB mixture distribution. It can be augmented as Marginalizing out leads to
where the latent count can be augmented as
which, using Theorem 1, is equivalent in distribution to
The connections between various distributions shown in Theorem 1 and Corollary 2 are key ingredients of this paper, which not only allow us to derive efficient inference, but also, as shown below, let us examine the posteriors to understand fundamental properties of various NB processes, clearly revealing connections to previous nonparametric Bayesian mixture models, including those based on the Dirichlet process, HDP and beta-NB processes.
Joint Count and Mixture Modeling
In this section, we first show the connections between the Poisson and multinomial processes, and then we place a gamma process prior on the Poisson rate measure for joint count and mixture modeling. This construction can be reduced to the Dirichlet process, and its restrictions for modeling grouped data are further discussed.
Let be a Poisson process defined on a completely random measure such that for each subset . Define as a multinomial process, with total count and random probability measure , such that for any disjoint partition of . According to Lemma 4.1 of , and would have the same Poisson distribution for each , thus and are equivalent in distribution.
Using Corollary 3, we illustrate how the seemingly distinct problems of count and mixture modeling can be united under the Poisson process. For each , denote as a count random variable describing the number of observations in that reside within . Given grouped data , for any measurable disjoint partition of , we aim to jointly model count random variables . A natural choice would be to define a Poisson process
with a shared completely random measure on , such that $X_{j}(A)\sim\mbox{Pois}\big(G(A)\big)$ for each and . Following Corollary 3, with , letting is equivalent to letting
Thus the Poisson process provides not only a way to generate independent counts from each , but also a mechanism for mixture modeling, which allocates the observations into any measurable disjoint partition of , conditioning on the normalized random measure .
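The equivalence between independent Poisson counts and a Poisson total allocated by a multinomial can be checked on a finite partition. The Python sketch below is a minimal illustration with hypothetical function names: `rates` plays the role of the measure assigned to each set of a disjoint partition, and the two log-probabilities are computed under the independent-Poisson view and under the total-then-multinomial view with the normalized rates as probabilities.

```python
import math


def poisson_log_pmf(x, lam):
    """Log PMF of Pois(lam) at count x."""
    return -lam + x * math.log(lam) - math.lgamma(x + 1)


def multinomial_log_pmf(counts, probs):
    """Log PMF of Mult(sum(counts); probs) at counts."""
    out = math.lgamma(sum(counts) + 1)
    for x, q in zip(counts, probs):
        out += x * math.log(q) - math.lgamma(x + 1)
    return out


def independent_poisson_log_pmf(counts, rates):
    """Counts on disjoint sets modeled as independent Poisson variables."""
    return sum(poisson_log_pmf(x, lam) for x, lam in zip(counts, rates))


def total_then_multinomial_log_pmf(counts, rates):
    """Poisson total count, then multinomial allocation with normalized rates."""
    total_rate = sum(rates)
    probs = [lam / total_rate for lam in rates]
    return (poisson_log_pmf(sum(counts), total_rate)
            + multinomial_log_pmf(counts, probs))
```

The two functions return the same value for any count vector, which is the finite-partition form of the statement that count modeling with a Poisson process also provides a mechanism for mixture modeling through the normalized measure.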
2 Gamma-Poisson Process and Negative Binomial Process
To complete the Poisson process, it is natural to place a gamma process prior on the Poisson rate measure as
For a distinct atom , we have , where and . Marginalizing out of the gamma-Poisson process leads to an NB process
in which for each .
where . Thus the NB probability parameter plays a critical role in count and mixture modeling as it directly controls the prior distributions of the number of distinct atoms , the number of samples at each of these atoms , and the total number of samples . However, its value would become irrelevant if one directly works with the normalization of , as commonly used in conventional mixture modeling.
Define as a CRT process that
for each . Under the Chinese restaurant process metaphor, and represent the customer count and table count, respectively, observed in each . A direct generalization of Theorem 1 leads to:
The NB process augmented under its compound Poisson representation as
3 Posterior Analysis and Predictive Distribution
Imposing a gamma prior on and a beta prior on , using conjugacy, we have all conditional posteriors in closed-form as
If the base measure is finite and continuous, then and we have and thus , i.e., the number of nonempty tables is equal to , the number of distinct atoms. The gamma-Poisson process is also well defined with a discrete base measure , for which we have and hence it is possible that if , which means . As the data are exchangeable within group , conditioning on and , with marginalized out, we have
This prediction rule is similar to that of the Chinese restaurant process described in (2.2.2).
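The closed-form conditional posteriors discussed in this subsection have a simple scalar analogue, sketched below in Python. This is an illustration under assumptions, not the process-level sampler of the paper: it treats J counts as draws from NB(r, p) with a Gamma(e0, 1/f0) prior on `r` and a Beta(a0, b0) prior on `p`, where the hyperparameter names `a0`, `b0`, `e0`, `f0` are hypothetical. Each sweep draws a CRT table count for every observation, then updates `p` with the beta-NB conjugacy and `r` with the gamma-Poisson conjugacy on the table counts.

```python
import math
import random


def sample_crt(m, r, rng):
    """Table count for m customers: sum of Bernoulli(r / (n - 1 + r))."""
    return sum(rng.random() < r / (n - 1 + r) for n in range(1, m + 1))


def gibbs_nb_scalar(counts, n_iter, rng, a0=1.0, b0=1.0, e0=1.0, f0=1.0):
    """Gibbs sweeps for (r, p) given counts assumed i.i.d. NB(r, p)."""
    num_groups = len(counts)
    total = sum(counts)
    r, p = 1.0, 0.5
    samples = []
    for _ in range(n_iter):
        # augment each count with a latent table count
        tables = sum(sample_crt(m, r, rng) for m in counts)
        # beta-NB conjugacy for the probability parameter
        p = rng.betavariate(a0 + total, b0 + num_groups * r)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        # gamma-Poisson conjugacy: tables ~ Pois(-J r ln(1 - p)) given r
        scale = 1.0 / (f0 - num_groups * math.log1p(-p))
        r = rng.gammavariate(e0 + tables, scale)
        samples.append((r, p))
    return samples
```

A call such as `gibbs_nb_scalar([3, 0, 7, 2, 5, 1, 4, 9], 200, random.Random(1))` returns a list of `(r, p)` draws; the design point is that the dispersion parameter is sampled from a gamma conditional rather than estimated by maximum likelihood.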
4 Relationship with the Dirichlet Process
Based on Corollary 3 on the multinomial process and Section 2.2.1 on the Dirichlet process, the gamma-Poisson process in (5) can be equivalently expressed as
5 Restrictions of the Gamma-Poisson Process
The Poisson process has an equal-dispersion assumption for count modeling. For mixture modeling of grouped data, the gamma-Poisson (NB) process might be too restrictive in that, as shown in (4.4), it implies the same mixture proportions across groups, and as shown in (6), it implies the same count distribution on each distinct atom. This motivates us to consider adding a layer to the gamma-Poisson process or using a distribution other than the Poisson to model the counts for grouped data. As shown in Section 3, the NB distribution is an ideal candidate, not only because it allows overdispersion, but also because it can be augmented into either a gamma-Poisson or a compound Poisson representation, and it can be used together with the CRT distribution to form a bivariate distribution that jointly models the counts of customers and tables.
Joint Count and Mixture Modeling of Grouped Data
In this section, we couple the gamma process with the NB process to construct a gamma-NB process, which is well suited for modeling grouped data. We derive analytic conditional posteriors for this construction and show that it can be reduced to an HDP.
For joint count and mixture modeling of grouped data, e.g., topic modeling where a document consists of a group of exchangeable words, we replace the Poisson processes in (5) with NB processes. Sharing the NB dispersion across groups while making the probability parameters be group dependent, we construct a gamma-NB process as
With expressed as , a draw from can be expressed as
The gamma-NB process can be augmented as a gamma-gamma-Poisson process as
and with , we have This construction introduces gamma processes , whose normalization provides group-specific random probability measures for mixture modeling. The gamma-NB process can also be augmented as
according to Corollary 4. These three closely related constructions are graphically presented in Fig. 2.
With Corollaries 2 and 4, and , we further have two equivalent augmentations:
These augmentations allow us to derive a sequence of closed-form update equations, as described below.
2 Posterior Analysis and Predictive Distribution
With and (10), we have
If is finite and continuous, we have and thus if is discrete as , then if , thus . In either case, let , with the gamma-Poisson conjugacy on (14) and (5.1), we have
Using the gamma-Poisson conjugacy on (11), we have
Since the data are exchangeable within group , conditioning on and , with marginalized out, we have
This prediction rule is similar to that of the Chinese restaurant franchise (CRF) .
3 Relationship with Hierarchical Dirichlet Process
With Corollary 3 and Section 2.2.1, we can equivalently express the gamma-gamma-Poisson process in (11) as
where , and . Without modeling and as random variables, (5.3) becomes an HDP . Thus the augmented and then normalized gamma-NB process leads to an HDP. However, we cannot return from the HDP to the gamma-NB process without modeling and as random variables. Theoretically, they are distinct in that the gamma-NB process is a completely random measure, assigning independent random variables into any disjoint Borel sets in , and the count has the distribution as ; by contrast, due to normalization, the HDP is not, and marginally
Practically, the gamma-NB process can exploit Corollary 4 and the gamma-Poisson conjugacy to achieve analytic conditional posteriors. The inference of the HDP is a challenge and it is usually solved through alternative constructions such as the CRF and stick-breaking representations . In particular, both concentration parameters and are nontrivial to infer and they are often simply fixed . One may apply the data augmentation method of to sample and . However, if is discrete as , which is of practical value and becomes a continuous base measure as , then using that method to sample is only approximately correct, which may result in a biased estimate in practice, especially if is not sufficiently large.
By contrast, in the gamma-NB process, the shared can be analytically updated with (20) and plays the role of in the HDP, which is readily available as
and as in (19), regardless of whether the base measure is continuous, the total mass has an analytic gamma posterior. Equation (24) also intuitively shows how the NB probability parameters govern the variations among in the gamma-NB process. In the HDP, is not explicitly modeled, and since its value appears irrelevant when taking the normalized constructions in (5.3), it is usually treated as a nuisance parameter and perceived as when needed for interpretation.
Another related model is the DILN-HDP in , where group-specific Dirichlet processes are normalized from gamma processes, with the gamma scale parameters either fixed as or learned with a log Gaussian process prior. Yet no analytic conditional posteriors are provided and Gibbs sampling is not considered as a viable option. The main purpose of is introducing correlations between mixture components. It would be interesting to compare the differences between learning the with beta priors and learning the gamma scale parameters with the log Gaussian process prior.
The Negative Binomial Process Family
The gamma-NB process shares the NB dispersion across groups while the NB probability parameters are group dependent. Since the NB distribution has two adjustable parameters, it is natural to wonder whether one can explore sharing the NB probability measure across groups, while making the NB dispersion parameters group specific or atom dependent. That kind of construction would be distinct from both the gamma-NB process and HDP in that has space dependent scales, and thus its normalization , still a random probability measure, no longer follows a Dirichlet process.
It is natural to let the NB probability measure be drawn from the beta process . In fact, the first discovered member of the NB process family is a beta-NB process . A main motivation of that construction is the observation that the beta and Bernoulli distributions are conjugate and the beta-Bernoulli process is found to be quite useful for dictionary learning , whereas, although the beta distribution is also conjugate to the NB distribution, there is an apparent lack of exploitation of that relationship .
A beta-NB process is constructed by letting
With expressed as , a random draw from can be expressed as
Under this construction, the NB probability measure is shared and the NB dispersion parameters are group dependent. Note that if are fixed as one, then the beta-NB process reduces to the beta-geometric process, related to the one for count modeling discussed in ; if are empirically set to some other values, then the beta-NB process reduces to the one proposed in . These simplifications are not considered in the paper, as they are often overly restrictive.
The asymptotic behavior of the beta-NB process with respect to the NB dispersion parameter is studied in . Such analysis is not provided here as we infer NB dispersion parameters from the data, which usually do not have large values due to overdispersion. In , the beta-NB process is treated as comparable to a gamma-Poisson process and is considered less flexible than the HDP, motivating the construction of a hierarchical-beta-NB process. By contrast, in this paper, with the beta-NB process augmented as a beta-gamma-Poisson process, one can draw group-specific Poisson rate measures for count modeling and then use their normalization to provide group-specific random probability measures for mixture modeling; therefore, the beta-NB process, gamma-NB process and HDP are treated as comparable to each other in hierarchical structures and are all considered suitable for mixed-membership modeling.
As in , we may also consider a marked-beta-NB process, with both the NB probability and dispersion measures shared, in which each point of the beta process is marked with an independent gamma random variable. Thus a draw from the marked-beta process becomes , and a draw from the NB process becomes
With the beta-NB conjugacy, the posterior of is tractable in both the beta-NB and marked-beta-NB processes . Similar to the marked-beta-NB process, we may also consider a marked-gamma-NB process, where each point of the gamma process is marked with an independent beta random variable, whose performance is found to be similar.
If it is believed that there is an excessive number of zeros, governed by a process other than the NB process, we may introduce a zero-inflated NB process as , where is drawn from the Bernoulli process and is drawn from a gamma marked-beta process, thus a draw from can be expressed as , with
This construction can be linked to the focused topic model in with appropriate normalization, with advantages that there is no need to fix and the inference is fully tractable. The zero inflated construction can also be linked to models for real valued data using the Indian buffet process (IBP) or beta-Bernoulli process spike-and-slab prior . Below we apply various NB processes for topic modeling and illustrate the differences between them.
Negative Binomial Process Topic Modeling and Poisson Factor Analysis
For the gamma-NB process described in Section 5, with the gamma process expressed as , we can express the hierarchical model as
where . With , and , using Corollary 3, we can equivalently express and in (29) as
Since are fully exchangeable, rather than drawing as in (30), we may equivalently draw it as
This provides further insights on uniting the seemingly distinct problems of count and mixture modeling.
Denote , and . For modeling convenience, we place Dirichlet priors on topics , then for the gamma-NB process topic model, we have
which would be the same for the other NB processes, since the gamma-NB process differs from them only on how the gamma priors of and consequently the NB priors of are constituted. For example, marginalizing out , we have for the gamma-NB process, for the beta-NB process, for both the marked-beta-NB and marked-gamma-NB processes, and for the zero-inflated-NB process.
As in , we may augment as
If , we have , and with Corollary 3 and , we also have $(n_{vj1},\cdots,n_{vjK}|-)\sim\mbox{Mult}\big(m_{vj};\frac{\phi_{v1}\theta_{j1}}{\sum_{k=1}^{K}\phi_{vk}\theta_{jk}},\cdots,\frac{\phi_{vK}\theta_{jK}}{\sum_{k=1}^{K}\phi_{vk}\theta_{jk}}\big)$, , and , which would lead to (32) under the assumption that the words are exchangeable and (33) if . Thus topic modeling with the NB process can be considered as factorization of the term-document count matrix under the Poisson likelihood as .
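The multinomial allocation of a term-document count across topics, as written above, can be sketched in a few lines of Python. This is an illustrative sketch with hypothetical function and argument names: `m_vj` is the count of term v in document j, `phi_v` holds the K values of phi_vk for that term, and `theta_j` holds the K values of theta_jk for that document. The allocation probabilities are proportional to phi_vk * theta_jk.

```python
import random


def allocate_counts(m_vj, phi_v, theta_j, rng):
    """Split m_vj into K topic counts with probabilities ∝ phi_vk * theta_jk."""
    weights = [a * b for a, b in zip(phi_v, theta_j)]
    total = sum(weights)
    probs = [w / total for w in weights]
    counts = [0] * len(probs)
    # one categorical draw per word token
    for k in rng.choices(range(len(probs)), weights=probs, k=m_vj):
        counts[k] += 1
    return counts, probs
```

The returned counts always sum to `m_vj`, so summing them over terms gives the number of words in the document assigned to each topic, which is the latent count modeled by the NB distribution.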
PFA provides a unified framework to connect previously proposed discrete latent variable models, such as those in . As discussed in detail in , these models mainly differ on how the priors of and are constituted and how the inferences are implemented. For example, nonnegative matrix factorization with an objective function of minimizing the Kullback-Leibler (KL) divergence is equivalent to the ML estimation of and under PFA, and latent Dirichlet allocation (LDA) is equivalent to a PFA with Dirichlet priors imposed on both and .
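The stated equivalence between KL-divergence nonnegative matrix factorization and ML estimation under the Poisson likelihood can be illustrated numerically. In the Python sketch below, which uses hypothetical names and small dense lists rather than any particular library, `phi` is a V-by-K factor loading matrix and `theta` is a K-by-J factor score matrix (this orientation is an assumption of the sketch). The generalized KL divergence and the negative Poisson log-likelihood differ only by a term that depends on the counts, so they share the same minimizers.

```python
import math


def rates(phi, theta):
    """Poisson rates lam[v][j] = sum_k phi[v][k] * theta[k][j]."""
    num_topics, num_docs = len(theta), len(theta[0])
    return [[sum(row[k] * theta[k][j] for k in range(num_topics))
             for j in range(num_docs)] for row in phi]


def generalized_kl(counts, lam):
    """Sum of m log(m / lam) - m + lam, with 0 log 0 taken as 0."""
    total = 0.0
    for row_m, row_l in zip(counts, lam):
        for m, l in zip(row_m, row_l):
            total += (m * math.log(m / l) if m > 0 else 0.0) - m + l
    return total


def poisson_neg_log_lik(counts, lam):
    """Negative log-likelihood of counts under independent Pois(lam)."""
    return sum(l - m * math.log(l) + math.lgamma(m + 1)
               for row_m, row_l in zip(counts, lam)
               for m, l in zip(row_m, row_l))


def objective_gap(counts, phi, theta):
    """KL objective minus Poisson objective; constant in (phi, theta)."""
    lam = rates(phi, theta)
    return generalized_kl(counts, lam) - poisson_neg_log_lik(counts, lam)
```

Calling `objective_gap` with the same count matrix and two different factorizations should return the same number, which is the sense in which the two objectives are equivalent.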
2 Negative Binomial Process Topic Modeling
From the point of view of PFA, an NB process topic model factorizes the term-document count matrix under the constraints that each factor sums to one and the factor scores are gamma distributed random variables, and consequently, the number of words assigned to a topic (factor/atom) follows an NB distribution. Depending on how the NB distributions are parameterized, as shown in Table I, we can construct a variety of NB process topic models, which can also be connected to a large number of previously proposed parametric and nonparametric topic models. For a deeper understanding of how the counts are modeled, we also show in Table I both the variance-to-mean ratio (VMR) and overdispersion level (ODL) implied by these settings. Eight differently constructed NB processes are considered:
(i) The NB process described in Section 4 is used for topic modeling. It improves over the count-modeling gamma-Poisson process discussed in in that it unites mixture modeling and has closed-form conditional posteriors. Although this is a nonparametric model supporting an infinite number of topics, requiring may be too restrictive.
(ii) Related to LDA and Dir-PFA , the NB-LDA is also a parametric topic model that requires tuning the number of topics. It is constructed by replacing the topic weights of the Gamma-NB process in (29) as . It uses document dependent and to learn the smoothing of the topic weights, and it lets to share statistical strength between documents.
(iii) Related to the HDP , the NB-HDP model is constructed by fixing (i.e., ) in (29). It is also comparable to the HDP in that constructs group-specific Dirichlet processes with normalized gamma processes, whose scale parameters are also set as one.
(iv) The NB-FTM model is constructed by replacing the topic weights in (29) as , with and drawn from a beta-Bernoulli process that is used to explicitly model zero counts. It is the same as the sparse-gamma-gamma-PFA (S-PFA) in and is comparable to the focused topic model (FTM) , which is constructed from the IBP compound Dirichlet process. The Zero-Inflated-NB process improves over these approaches by allowing to be inferred, which generally yields better data fitting.
(v) The Gamma-NB process, as shown in (10) and (29), explores sharing the NB dispersion measure across groups, and it improves over the NB-HDP by allowing the learning of . As shown in (5.3), it reduces to the HDP in without modeling and as random variables.
(vi) The Beta-Geometric process is constructed by replacing the topic weights in (29) as . It explores sharing the NB probability measure across groups, which is related to the one proposed for count modeling in . It is restrictive that the NB dispersion parameters are fixed as one.
(vii) The Beta-NB process is constructed by replacing the topic weights in (29) as . It explores sharing the NB probability measure across groups, which improves over the Beta-Geometric process and the beta-NB process (BNBP) proposed in by providing analytic conditional posteriors of .
(viii) The Marked-Beta-NB process is constructed by replacing the topic weights in (29) as . It is comparable to the BNBP proposed in , with the distinction that it provides analytic conditional posteriors of .
3 Approximate and Exact Inference
Although all proposed NB process models have closed-form conditional posteriors, they contain countably infinite atoms that are infeasible to explicitly represent in practice. This infinite dimensional problem can be addressed by using a discrete base measure with atoms, i.e., truncating the total number of atoms to be , and then doing Bayesian inference via block Gibbs sampling . This is a very general approach and is used in our experiments to make a fair comparison between a wide variety of models. Block Gibbs sampling for the Gamma-NB process is described in Appendix B; block Gibbs sampling for other NB processes and related algorithms in Table I can be similarly derived, as described in and omitted here for brevity. The infinite dimensional problem can also be addressed by discarding the atoms with weights smaller than a small constant or by modifying the Lévy measure to make its integral over the whole space finite . A sufficiently large (small) () usually provides a good approximation; however, there is an increasing risk of wasting computation as the truncation level gets larger.
To avoid truncation, the slice sampling scheme of has been utilized for the Dirichlet process and normalized random measure based mixture models . With auxiliary slice latent variables introduced to allow adaptive truncations in each MCMC iteration, the infinite dimensional problem is transformed into a finite one. This method has also been applied to the beta-Bernoulli process and the beta-NB process . It would be interesting to investigate slice sampling for the NB process based count and mixture models, which provide likelihoods that might be more amenable to posterior simulation since no normalization is imposed on the weights of the atoms. As slice sampling is not the focus of this paper, we leave it for future study.
Both the block Gibbs sampler and the slice sampler explicitly represent a finite set of atoms for posterior simulation, and algorithms based on these samplers are commonly referred to as “conditional” methods . Another approach to solving the infinite dimensional problem is employing a collapsed inference scheme that marginalizes out the atoms and their weights . Algorithms based on the collapsed inference scheme are usually referred to as “marginal” methods . A well-defined prediction rule is usually required to develop a collapsed Gibbs sampler, and the conjugacy between the likelihood and the prior distribution of atoms is desired to avoid numerical integrations. In topic modeling, a word is linked to a Dirichlet distributed atom with a multinomial likelihood, thus the atoms can be analytically marginalized out; since their weights can also be marginalized out as in (5.2), we may develop a collapsed Gibbs sampler for the gamma-NB process based topic models. As the collapsed inference scheme is not the focus of this paper and the prediction rules for other NB processes need further investigation, we leave them for future study.
Example Results and Discussions
Motivated by Table I, we consider topic modeling using a variety of NB processes. We compare them with LDA and CRF-HDP , in which the latent count is marginally distributed as
We consider the Psychological Review corpus (http://psiexp.ss.uci.edu/research/programsdata/toolbox.htm), restricting the vocabulary to terms that occur in five or more documents. The corpus includes 1281 abstracts from 1967 to 2003, with and 71,279 total word counts. We randomly select , , or of the words from each document to learn a document dependent probability for each term and calculate the per-word perplexity on the held-out words as
where , is the number of words held out at term in document , , and are the indices of collected samples. Note that the per-word perplexity is equal to if , thus it should be no greater than for a topic model that works appropriately. The final results are averaged over five random training/testing partitions. The performance measure is the same as the one used in and similar to those in .
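The perplexity computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that the document dependent term probabilities are obtained by summing, over collected samples and topics, the product of a topic-term matrix and a topic-document weight matrix, and then normalizing over terms. The function name and the array names `heldout_counts`, `phi_samples` and `theta_samples` are hypothetical.

```python
import numpy as np

def per_word_perplexity(heldout_counts, phi_samples, theta_samples):
    """Per-word perplexity on held-out words.

    heldout_counts: (V, J) held-out word counts per term and document.
    phi_samples:    (S, V, K) topic-term probabilities for S collected samples.
    theta_samples:  (S, K, J) topic weights per document for the same samples.
    """
    # Sum over samples and topics of phi[v, k] * theta[k, j], giving (V, J).
    rates = np.einsum("svk,skj->vj", phi_samples, theta_samples)
    # Normalize over terms so each document has a probability vector.
    probs = rates / rates.sum(axis=0, keepdims=True)
    # Only entries with held-out words contribute to the log-likelihood.
    mask = heldout_counts > 0
    log_lik = np.sum(heldout_counts[mask] * np.log(probs[mask]))
    return float(np.exp(-log_lik / heldout_counts.sum()))

# Sanity check: uniform term probabilities give a perplexity equal to the
# vocabulary size, whatever the held-out counts are.
V, J, K, S = 7, 3, 2, 4
phi_uniform = np.full((S, V, K), 1.0 / V)
theta_any = np.ones((S, K, J))
counts = np.arange(V * J).reshape(V, J)
uniform_ppl = per_word_perplexity(counts, phi_uniform, theta_any)
```

The uniform case reproduces the bound noted in the text: a model that predicts every term with equal probability has a perplexity equal to the vocabulary size, so a useful topic model should score below it.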
We show in Fig. 3 the NB dispersion and probability parameters learned by various NB process topic models listed in Table I, revealing distinct sharing mechanisms and model properties. In Fig. 4 we compare the per-held-out-word prediction performance of various algorithms. We set the parameters as , and . For LDA and NB-LDA, we search for optimal performance. All the other NB process topic models are nonparametric Bayesian models that can automatically learn the number of active topics for a given corpus. For fair comparison, all the models considered are implemented with block Gibbs sampling, where is set as an upper-bound.
When is used, as in the NB process, all documents are constrained to have the same topic weights, leading to the worst held-out-prediction performance.
When is used, as in NB-LDA, different documents are weakly coupled with , and the modeling results in Fig. 3 show that a typical document in this corpus usually has a small and a large , thus a large overdispersion level (ODL) and a large variance-to-mean ratio (VMR), indicating highly overdispersed counts in its topic usage. NB-LDA is a parametric topic model that requires tuning the number of topics . It improves over LDA in that it only has to tune , whereas LDA has to tune both and . With an appropriate , the parametric NB-LDA may outperform the nonparametric NB-HDP and NB-FTM as the training data percentage increases, showing that even a parametric model that learns both the NB parameters and in a document-dependent manner may fit the data better than nonparametric models that fix the NB probability parameters.
The NB-HDP is a special case of the gamma-NB process in which . From a mixture modeling viewpoint, fixing is a natural choice as appears irrelevant after normalization. However, from a count modeling viewpoint, this imposes the restrictive assumption that each count vector has the same VMR of 2. It is also interesting to examine (24), which can be viewed as the concentration parameter in the HDP; allowing the adjustment of would permit a more flexible model assumption on the amount of variation between the topic proportions, and thus potentially a better fit to the data.
When is used, as in the NB-FTM model, our results in Fig. 3 show that we usually have a small and a large , indicating that topic is sparsely used across the documents but, once it is used, the amount of variation in its usage is small. This property might be helpful when there is an excessive number of zeros that might not be well modeled by the NB process alone. In our experiments, the more direct approaches of using or generally yield better results, which might not be the case when an excessive number of zeros could be better explained with the beta-Bernoulli processes; e.g., when the training words are scarce, the NB-FTM can approach the performance of the Marked-Beta-NB process.
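The overdispersion level and variance-to-mean ratio referred to in the discussion above can be computed directly from the two NB parameters. The sketch below assumes the parameterization in which the NB PMF is proportional to (1 - p)^r p^m for a count m, with dispersion parameter `r` and probability parameter `p`; the function name `nb_moments` is hypothetical.

```python
def nb_moments(r, p):
    """Mean, variance, VMR and ODL of NB(r, p) with PMF proportional to (1 - p)**r * p**m."""
    mean = r * p / (1.0 - p)
    variance = r * p / (1.0 - p) ** 2
    return {
        "mean": mean,
        "variance": variance,
        "vmr": variance / mean,                 # equals 1 / (1 - p)
        "odl": (variance - mean) / mean ** 2,   # equals 1 / r
    }

# Fixing the probability parameter at 0.5 gives a VMR of 2 for any r.
fixed_p = nb_moments(r=3.0, p=0.5)
# A small r and a large p give a large ODL and a large VMR.
overdispersed = nb_moments(r=0.1, p=0.9)
```

Under this parameterization the VMR depends only on `p` and the ODL only on `r`, which is why a model that fixes the probability parameter at 0.5 forces every count vector to share a VMR of 2.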
Conclusions
We propose a variety of negative binomial (NB) processes for count modeling, which can be naturally applied for the seemingly disjoint problem of mixture modeling. The proposed NB processes are completely random measures, which assign independent random variables to disjoint Borel sets of the measure space, as opposed to Dirichlet processes, whose measures on disjoint Borel sets are negatively correlated. We reveal connections between various discrete distributions and discover unique data augmentation and marginalization methods for the NB process, with which we are able to unite count and mixture modeling, analyze fundamental model properties, and derive efficient Bayesian inference. We demonstrate that the NB process and the gamma-NB process can be recovered from the Dirichlet process and the HDP, respectively. We show in detail the theoretical, structural and computational advantages of the NB process. We examine the distinct sharing mechanisms and model properties of various NB processes, with connections made to existing discrete latent variable models under the Poisson factor analysis framework. Experimental results on topic modeling show the importance of modeling both the NB dispersion and probability parameters, which respectively govern the overdispersion level and variance-to-mean ratio for count modeling.
Acknowledgments
The authors would like to thank the two anonymous reviewers and the editor for their constructive comments, which helped improve the manuscript.
References
Appendix A Proof of Theorem 1
With the PMFs of both the NB and CRT distributions, the PMF of the joint distribution of counts and is , which is the same as (4).
Since is the summation of iid random variables, its PGF becomes With and , its PMF can be expressed as
Letting , the PMF of the joint distribution of counts and is , which is the same as (4). ∎
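The compound Poisson construction used in the proof can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the proof: it assumes the standard representation in which the number of summands is Poisson with rate -r ln(1 - p), each summand is logarithmic with parameter p, and the resulting sum is NB(r, p) with mean rp/(1 - p) and variance rp/(1 - p)^2. The function name is hypothetical.

```python
import numpy as np

def sample_nb_compound_poisson(r, p, size, rng):
    """Draw (m, l) pairs with l ~ Poisson(-r * ln(1 - p)) and m a sum of l Log(p) variables."""
    # Number of logarithmic summands for each draw.
    l = rng.poisson(-r * np.log1p(-p), size=size)
    # Draw every summand at once, then add them up per draw.
    u = rng.logseries(p, size=int(l.sum()))
    owner = np.repeat(np.arange(size), l)
    m = np.bincount(owner, weights=u, minlength=size).astype(np.int64)
    return m, l

rng = np.random.default_rng(0)
r, p = 2.0, 0.5
m, l = sample_nb_compound_poisson(r, p, size=200000, rng=rng)
empirical_mean = m.mean()   # should be close to r * p / (1 - p)
empirical_var = m.var()     # should be close to r * p / (1 - p) ** 2
```

Because every logarithmic summand is at least one, each simulated pair satisfies m >= l, and m is zero exactly when l is zero.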
Appendix B Block Gibbs Sampling for the Gamma-Negative Binomial Process
With , and a discrete base measure as , following Section 5.2, block Gibbs sampling for (29) proceeds as
where . Note that when , we have and thus .
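One building block that such a sampler typically needs is a draw from the CRT distribution used in Theorem 1. The sketch below assumes the representation of a CRT(m, r) variable as a sum of m independent Bernoulli variables with success probabilities r / (n - 1 + r) for n = 1, ..., m; the function names are hypothetical, and the sketch does not reproduce the full set of conditional updates.

```python
import numpy as np

def sample_crt(m, r, rng):
    """Draw l ~ CRT(m, r) as a sum of Bernoulli(r / (n - 1 + r)) variables, n = 1..m."""
    if m == 0:
        return 0
    n_minus_one = np.arange(m)                    # 0, 1, ..., m - 1
    probs = r / (n_minus_one + r)                 # first probability is exactly 1
    return int((rng.random(m) < probs).sum())

def crt_mean(m, r):
    """Expected value of CRT(m, r) under the same representation."""
    return float(np.sum(r / (np.arange(m) + r)))

rng = np.random.default_rng(0)
draws = np.array([sample_crt(20, 1.5, rng) for _ in range(20000)])
```

The first Bernoulli variable always equals one, so a positive count always yields at least one table, and the draw can never exceed the count itself.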