Negative Binomial Process Count and Mixture Modeling

Mingyuan Zhou, Lawrence Carin

Introduction

Count data appear in many settings, such as predicting the number of motor insurance claims , analyzing infectious diseases and modeling topics of document corpora . There has been increasing interest in count modeling using the Poisson process, geometric process and recently the negative binomial (NB) process . It is shown in and further demonstrated in that the NB process, originally constructed for count analysis, can be naturally applied for mixture modeling of grouped data $\boldsymbol{x}_{1},\cdots,\boldsymbol{x}_{J}$ , where each group $\boldsymbol{x}_{j}=\{x_{ji}\}_{i=1,N_{j}}$ . For example, in topic modeling (mixed-membership modeling), each document consists of a group of exchangeable words and each word is a group member that is assigned to a topic; the number of times a topic appears in a document is a latent count random variable that could be well modeled with an NB distribution .

Mixture modeling, which infers random probability measures to assign data samples into clusters (mixture components), is a key research area of statistics and machine learning. Although the number of samples assigned to clusters are counts, mixture modeling is not typically considered as a count-modeling problem. It is often addressed under the Dirichlet-multinomial framework, using the Dirichlet process as the prior distribution. With the Dirichlet-multinomial conjugacy, the Dirichlet process mixture model enjoys tractability because the posterior of the random probability measure is still a Dirichlet process. Despite its popularity, the Dirichlet process is inflexible in that a single concentration parameter controls both the variability of the mass around the mean and the distribution of the number of distinct atoms . For mixture modeling of grouped data, the hierarchical Dirichlet process (HDP) has been further proposed to share statistical strength between groups. The HDP inherits the same inflexibility of the Dirichlet process, and due to the non-conjugacy between Dirichlet processes, its inference has to be solved under alternative constructions, such as the Chinese restaurant franchise and stick-breaking representations . To make the number of distinct atoms increase at a rate faster than that of the Dirichlet process, one may consider the Pitman-Yor process or the normalized generalized gamma process that provide extra parameters to add flexibility.

To construct more expressive mixture models with tractable inference, in this paper we consider mixture modeling as a count-modeling problem. Directly modeling the counts assigned to mixture components as NB random variables, we perform joint count and mixture modeling via the NB process, using completely random measures that are easy to construct and amenable to posterior computation. By constructing a bivariate count distribution that connects the Poisson, logarithmic, NB and Chinese restaurant table distributions, we develop data augmentation and marginalization techniques unique to the NB distribution, with which we augment an NB process into both the gamma-Poisson and compound Poisson representations, yielding unification of count and mixture modeling, derivation of fundamental model properties, as well as efficient Bayesian inference.

Under the NB process, we employ a gamma process to model the rate measure of a Poisson process. The normalization of the gamma process provides a random probability measure (not necessarily a Dirichlet process) for mixture modeling, and the marginalization of the gamma process leads to an NB process for count modeling. Since the gamma scale parameters appear as NB probability parameters when the gamma processes are marginalized out, they directly control count distributions on atoms and they could be conveniently inferred with the beta-NB conjugacy. For mixture modeling of grouped data, we construct hierarchical models by employing an NB process for each group and sharing their NB dispersion or probability measures across groups. Different parameterizations of the NB dispersion and probability parameters result in a wide variety of NB processes, which are connected to previously proposed nonparametric Bayesian mixture models. The proposed joint count and mixture modeling framework provides new opportunities for better data fitting, efficient inference and flexible model constructions.

Parts of the work presented here first appeared in . In this paper, we unify related materials scattered in these three conference papers and provide significant expansions. In particular, we construct a Poisson-logarithmic bivariate distribution that tightly connects the NB and Chinese restaurant table distributions, extending the Chinese restaurant process to describe the case that both the numbers of customers and tables are random variables, and we provide necessary conditions to recover the NB process and the gamma-NB process from the Dirichlet process and HDP, respectively.

We mention that a related beta-NB process has been independently investigated in . Our constructions of a wide variety of NB processes, including the beta-NB processes in and as special cases, are built on our thorough investigation of the properties, relationships and inference of the NB and related stochastic processes. In particular, we show that the gamma-Poisson construction of the NB process is key to uniting count and mixture modeling, and there are two equivalent augmentations of the NB process that allow us to develop analytic conditional posteriors and predictive distributions. These insights are not provided in , and the NB dispersion parameters there are empirically set rather than inferred. More distinctions will be discussed along with specific models.

The remainder of the paper is organized as follows. We review some commonly used nonparametric Bayesian priors in Section 2 and study the NB distribution in Section 3. We present the NB process in Section 4, the gamma-NB process in Section 5, and the NB process family in Section 6. We discuss NB process topic modeling in Section 7 and present example results in Section 8.

Preliminaries

where $\nu(dpd\omega)\equiv\nu^{+}\pi(dpd\omega)$ . A random signed measure $\mathcal{L}$ satisfying (1) is called a Lévy random measure. More generally, if the Lévy measure $\nu({dpd\omega})$ satisfies

for each compact $S\subset\Omega$ , the Lévy random measure $\mathcal{L}$ is well defined, even if the Poisson intensity $\nu^{+}$ is infinite. A nonnegative Lévy random measure $\mathcal{L}$ satisfying (2) was called a completely random measure , and it was introduced to machine learning in and .

1.2 Gamma Process

where $\pi(drd\omega)\nu^{+}\equiv\nu(drd\omega)$ .

1.3 Beta Process

where $\pi(dpd\omega)\nu^{+}\equiv\nu(dpd\omega)$ .

2 Dirichlet and Chinese Restaurant Processes

Denote $\widetilde{G}=G/G(\Omega)$ , where $G\sim\mbox{GaP}(c,G_{0})$ , then for any measurable disjoint partition $A_{1},\cdots,A_{Q}$ of $\Omega$ , we have $\left[\widetilde{G}(A_{1}),\cdots,\widetilde{G}(A_{Q})\right]\sim\mbox{Dir}\left(\gamma_{0}\widetilde{G}_{0}(A_{1}),\cdots,\gamma_{0}\widetilde{G}_{0}(A_{Q})\right)$ , where $\gamma_{0}=G_{0}(\Omega)$ and $\widetilde{G}_{0}=G_{0}/\gamma_{0}$ . Therefore, with a space invariant scale parameter $1/c$ , the normalized gamma process $\widetilde{G}=G/G(\Omega)$ is a Dirichlet process with concentration parameter $\gamma_{0}$ and base probability measure $\widetilde{G}_{0}$ , expressed as $\widetilde{G}\sim\mbox{DP}(\gamma_{0},\widetilde{G}_{0})$ . Unlike the gamma process, the Dirichlet process is no longer a completely random measure as the random variables $\{\widetilde{G}(A_{q})\}$ for disjoint sets $\{A_{q}\}$ are negatively correlated.

A gamma process with a space invariant scale parameter can also be recovered from a Dirichlet process: if a gamma random variable $\alpha\sim{\mbox{Gamma}}(\gamma_{0},1/c)$ and a Dirichlet process $\widetilde{G}\sim{\mbox{DP}}(\gamma_{0},\widetilde{G}_{0})$ are independent with $\gamma_{0}=G_{0}(\Omega)$ and $\widetilde{G}_{0}=G_{0}/\gamma_{0}$ , then $G=\alpha\widetilde{G}$ becomes a gamma process as $G\sim{\mbox{GaP}}(c,G_{0})$ .

2.2 Chinese Restaurant Process

In a Dirichlet process $\widetilde{G}\sim\mbox{DP}(\gamma_{0},\widetilde{G}_{0})$ , we assume $X_{i}\sim\widetilde{G}$ ; $\{X_{i}\}$ are independent given $\widetilde{G}$ and hence exchangeable. The predictive distribution of a new data sample $X_{m+1}$ , conditioning on $X_{1},\cdots,X_{m}$ , with $\widetilde{G}$ marginalized out, can be expressed as

where $\{\omega_{k}\}_{1,K}$ are distinct atoms in $\Omega$ observed in $X_{1},\cdots,X_{m}$ and $n_{k}=\sum_{i=1}^{m}X_{i}(\omega_{k})$ is the number of data samples associated with $\omega_{k}$ . The stochastic process described in (2.2.2) is known as the Pólya urn scheme and also the Chinese restaurant process .

The number of nonempty tables $l$ in a Chinese restaurant process, with concentration parameter $\gamma_{0}$ and $m$ customers, is a random variable generated as $l=\sum_{n=1}^{m}b_{n},~{}b_{n}\sim\mbox{{Bernoulli}}\left(\frac{\gamma_{0}}{n-1+\gamma_{0}}\right).$ This random variable is referred as the Chinese restaurant table (CRT) random variable $l\sim\mbox{CRT}(m,\gamma_{0})$ . As shown in , it has probability mass function (PMF)

where $s(m,l)$ are Stirling numbers of the first kind.

Negative Binomial Distribution

Its mean and variance are both equal to $\lambda$ . Due to heterogeneity (difference between individuals) and contagion (dependence between the occurrence of events), count data are usually overdispersed in that the variance is greater than the mean, making the Poisson assumption restrictive. By placing a gamma prior with shape $r$ and scale $\frac{p}{1-p}$ on $\lambda$ as $m\sim\mbox{Pois}(\lambda)$ , $\lambda\sim\mbox{Gamma}\big{(}r,\frac{p}{1-p}\big{)}$ and marginalizing out $\lambda$ , an NB distribution $m\sim\mbox{NB}(r,p)$ is obtained, with PMF

where $r$ is the nonnegative dispersion parameter and $p$ is the probability parameter. Thus the NB distribution is also known as the gamma-Poisson mixture distribution . It has a mean $\mu={rp}/(1-p)$ smaller than the variance $\sigma^{2}={rp}/{(1-p)^{2}}=\mu+r^{-1}\mu^{2}$ , with the variance-to-mean ratio (VMR) as $(1-p)^{-1}$ and the overdispersion level (ODL, the coefficient of the quadratic term in $\sigma^{2}$ ) as $r^{-1}$ , and thus it is usually favored over the Poisson distribution for modeling overdispersed counts.

As shown in , $m\sim\mbox{NB}(r,p)$ can also be generated from a compound Poisson distribution as

where $u\sim\mbox{Log}(p)$ corresponds to the logarithmic distribution with PMF $f_{U}(k)={-p^{k}}/[k\ln(1-p)],~{}~{}k=1,2,\cdots$ , and probability-generating function (PGF)

One may also show that $\lim_{r\rightarrow\infty}\mbox{NB}(r,\frac{\lambda}{\lambda+r})=\mbox{Pois}(\lambda)$ , and conditioning on $m>0$ , $m\sim\mbox{NB}(r,p)$ becomes $m\sim\mbox{Log}(p)$ as ${r\rightarrow 0}$ .

The NB distribution has been widely investigated and applied to numerous scientific studies . Although inference of the NB probability parameter $p$ is straightforward with the beta-NB conjugacy, inference of the NB dispersion parameter $r$ , whose conjugate prior is unknown, has long been a challenge. The maximum likelihood (ML) approach is commonly used to estimate $r$ , however, it only provides a point estimate and does not allow incorporating prior information; moreover, the ML estimator of $r$ often lacks robustness and may be severely biased or even fail to converge, especially if the sample size is small . Bayesian approaches are able to model the uncertainty of estimation and incorporate prior information, however, the only available closed-form Bayesian inference for $r$ relies on approximating the ratio of two gamma functions .

Advancing previous research on the NB distribution in , we construct a Poisson-logarithmic bivariate distribution that assists Bayesian inference of the NB distribution and unites various count distributions.

The Poisson-logarithmic (PoisLog) bivariate distribution with PMF

where $\emph{\mbox{SumLog}}(m;l,p)$ denotes the sum-logarithmic distribution generated as $m=\sum_{{t}=1}^{l}u_{{t}},~{}u_{{t}}\stackrel{{\scriptstyle iid}}{{\sim}}\emph{\mbox{Log}}(p)$ .

The proof of Theorem 1 is provided in Appendix A. As shown in Fig. 1, this bivariate distribution intuitively describes the joint distribution of two count random variables $m$ and $l$ under two equivalent circumstances:

1) There are $m\sim\mbox{NB}(r,p)$ customers seated at $l\sim\mbox{CRT}(m,r)$ tables;

2) There are $l\sim\mbox{Pois}(-r\ln(1-p))$ tables, each of which with $u_{t}\sim\mbox{Log}(p)$ customers, with $m=\sum_{t=1}^{l}u_{t}$ customers in total.

In a slight abuse of notation, but for added conciseness, in the following discussion we use $m\sim\sum_{t=1}^{l}\mbox{Log}(p)$ to denote $m=\sum_{{t}=1}^{l}u_{{t}},~{}u_{{t}}\stackrel{{\scriptstyle iid}}{{\sim}}\mbox{Log}(p)$ .

Let $m\sim\emph{\mbox{NB}}(r,p),~{}r\sim\emph{\mbox{Gamma}}(r_{1},1/c_{1})$ represent a gamma-NB mixture distribution. It can be augmented as $m\sim\sum_{{t}=1}^{l}\emph{\mbox{Log}}(p),~{}l\sim\emph{\mbox{Pois}}(-r\ln(1-p)),~{}r\sim\emph{\mbox{Gamma}}(r_{1},1/c_{1}).$ Marginalizing out $r$ leads to

where the latent count $l\sim\emph{\mbox{NB}}\left(r_{1},p^{\prime}\right)$ can be augmented as

which, using Theorem 1, is equivalent in distribution to

The connections between various distributions shown in Theorem 1 and Corollary 2 are key ingredients of this paper, which not only allow us to derive efficient inference, but also, as shown below, let us examine the posteriors to understand fundamental properties of various NB processes, clearly revealing connections to previous nonparametric Bayesian mixture models, including those based on the Dirichlet process, HDP and beta-NB processes.

Joint Count and Mixture Modeling

In this Section, we first show the connections between the Poisson and multinomial processes, and then we place a gamma process prior on the Poisson rate measure for joint count and mixture modeling. This construction can be reduced to the Dirichlet process and its restrictions for modeling grouped data are further discussed.

Let $X\sim\emph{\mbox{PP}}(G)$ be a Poisson process defined on a completely random measure $G$ such that $X(A)\sim\emph{\mbox{Pois}}(G(A))$ for each subset $A\subset\Omega$ . Define $Y\sim\emph{\mbox{MP}}(Y(\Omega),\frac{G}{G(\Omega)})$ as a multinomial process, with total count $Y(\Omega)\sim\emph{\mbox{Pois}}(G(\Omega))$ and random probability measure $\frac{G}{G(\Omega)}$ , such that $(Y(A_{1}),\cdots,Y(A_{Q}))\sim\emph{\mbox{Mult}}\left(Y(\Omega);\frac{G(A_{1})}{G(\Omega)},\cdots,\frac{G(A_{Q})}{G(\Omega)}\right)$ for any disjoint partition $\{A_{q}\}_{1,Q}$ of $\Omega$ . According to Lemma 4.1 of , $X(A)$ and $Y(A)$ would have the same Poisson distribution for each $A\subset\Omega$ , thus $X$ and $Y$ are equivalent in distribution.

Using Corollary 3, we illustrate how the seemingly distinct problems of count and mixture modeling can be united under the Poisson process. For each $A\subset\Omega$ , denote $X_{j}(A)$ as a count random variable describing the number of observations in $\boldsymbol{x}_{j}$ that reside within $A$ . Given grouped data $\boldsymbol{x}_{1},\cdots,\boldsymbol{x}_{J}$ , for any measurable disjoint partition $A_{1},\cdots,A_{Q}$ of $\Omega$ , we aim to jointly model count random variables $\{X_{j}(A_{q})\}$ . A natural choice would be to define a Poisson process

with a shared completely random measure $G$ on $\Omega$ , such that $X_{j}(A)\sim\mbox{Pois}\big{(}G(A)\big{)}$ for each $A\subset\Omega$ and $G(\Omega)=\sum_{q=1}^{Q}G(A_{q})$ . Following Corollary 3, with $\widetilde{G}=G/G(\Omega)$ , letting $X_{j}\sim\mbox{PP}(G)$ is equivalent to letting

Thus the Poisson process provides not only a way to generate independent counts from each $A_{q}$ , but also a mechanism for mixture modeling, which allocates the $X_{j}(\Omega)$ observations into any measurable disjoint partition $\{A_{q}\}_{1,Q}$ of $\Omega$ , conditioning on the normalized random measure $\widetilde{G}$ .

2 Gamma-Poisson Process and Negative Binomial Process

To complete the Poisson process, it is natural to place a gamma process prior on the Poisson rate measure $G$ as

For a distinct atom $\omega_{k}$ , we have $n_{jk}\sim\mbox{Pois}(r_{k})$ , where $n_{jk}=X_{j}(\omega_{k})$ and $r_{k}=G(\omega_{k})$ . Marginalizing out $G$ of the gamma-Poisson process leads to an NB process

in which $X(A)\sim\mbox{NB}(G_{0}(A),p)$ for each $A\subset\Omega$ .

where $g_{0}(d\omega):={G_{0}(d\omega)}/{\gamma_{0}}$ . Thus the NB probability parameter $p$ plays a critical role in count and mixture modeling as it directly controls the prior distributions of the number of distinct atoms $K^{+}\sim\mbox{Pois}(-\gamma_{0}\ln(1-p))$ , the number of samples at each of these atoms $n_{k}\sim\mbox{Log}(p)$ , and the total number of samples $X(\Omega)\sim\mbox{NB}(\gamma_{0},p)$ . However, its value would become irrelevant if one directly works with the normalization of $G$ , as commonly used in conventional mixture modeling.

Define $L\sim\mbox{CRTP}(X,G_{0})$ as a CRT process that

for each $A\subset\Omega$ . Under the Chinese restaurant process metaphor, $X(A)$ and $L(A)$ represent the customer count and table count, respectively, observed in each $A\subset\Omega$ . A direct generalization of Theorem 1 leads to:

The NB process $X\sim\emph{\mbox{NBP}}(G_{0},p)$ augmented under its compound Poisson representation as

3 Posterior Analysis and Predictive Distribution

Imposing a gamma prior $\mbox{Gamma}(e_{0},1/f_{0})$ on $\gamma_{0}$ and a beta prior $\mbox{Beta}(a_{0},1/b_{0})$ on $p$ , using conjugacy, we have all conditional posteriors in closed-form as

If the base measure $G_{0}$ is finite and continuous, then $G_{0}(\omega)\rightarrow 0$ and we have $L(\omega)\sim\mbox{CRT}(X(\omega),G_{0}(\omega))=\delta(X(\omega)>0)$ and thus $L(\Omega)=\sum_{\omega\in\Omega}\delta(X(\omega)>0)$ , i.e., the number of nonempty tables $L(\Omega)$ is equal to $K^{+}$ , the number of distinct atoms. The gamma-Poisson process is also well defined with a discrete base measure $G_{0}=\sum_{k=1}^{K}\frac{\gamma_{0}}{K}\delta_{\omega_{k}}$ , for which we have $L=\sum_{k=1}^{K}l_{k}\delta_{\omega_{k}},~{}l_{k}\sim\mbox{CRT}(X(\omega_{k}),\gamma_{0}/K)$ and hence it is possible that $l_{k}>1$ if $X(\omega_{k})>1$ , which means $L(\Omega)\geq K^{+}$ . As the data $\{x_{ji}\}_{i}$ are exchangeable within group $j$ , conditioning on $X^{-ji}=X\backslash{X_{ji}}$ and $G_{0}$ , with $G$ marginalized out, we have

This prediction rule is similar to that of the Chinese restaurant process described in (2.2.2).

4 Relationship with the Dirichlet Process

Based on Corollary 3 on the multinomial process and Section 2.2.1 on the Dirichlet process, the gamma-Poisson process in (5) can be equivalently expressed as

5 Restrictions of the Gamma-Poisson Process

The Poisson process has an equal-dispersion assumption for count modeling. For mixture modeling of grouped data, the gamma-Poisson (NB) process might be too restrictive in that, as shown in (4.4), it implies the same mixture proportions across groups, and as shown in (6), it implies the same count distribution on each distinct atom. This motivates us to consider adding an additional layer into the gamma-Poisson process or using a different distribution other than the Poisson to model the counts for grouped data. As shown in Section 3, the NB distribution is an ideal candidate, not only because it allows overdispersion, but also because it can be augmented into either a gamma-Poisson or a compound Poisson representations and it can be used together with the CRT distribution to form a bivariate distribution that jointly models the counts of customers and tables.

Joint Count and Mixture Modeling of Grouped Data

In this Section we couple the gamma process with the NB process to construct a gamma-NB process, which is well suited for modeling grouped data. We derive analytic conditional posteriors for this construction and show that it can be reduced to an HDP.

For joint count and mixture modeling of grouped data, e.g., topic modeling where a document consists of a group of exchangeable words, we replace the Poisson processes in (5) with NB processes. Sharing the NB dispersion across groups while making the probability parameters be group dependent, we construct a gamma-NB process as

With $G\sim\mbox{GaP}(c,G_{0})$ expressed as $G=\sum_{k=1}^{\infty}r_{k}\delta_{\omega_{k}}$ , a draw from $\mbox{NBP}(G,p_{j})$ can be expressed as $X_{j}=\sum_{k=1}^{\infty}n_{jk}\delta_{\omega_{k}},~{}n_{jk}\sim\mbox{NB}(r_{k},p_{j}).$

The gamma-NB process can be augmented as a gamma-gamma-Poisson process as

and with $\theta_{jk}=\Theta_{j}(\omega_{k})$ , we have $n_{jk}\sim\mbox{Pois}(\theta_{jk}),~{}\theta_{jk}\sim\mbox{Gamma}(r_{k},p_{j}/(1-p_{j})).$ This construction introduces gamma processes $\{\Theta_{j}\}$ , whose normalization provide group-specific random probability measures $\{\widetilde{\Theta}_{j}\}$ for mixture modeling. The gamma-NB process can also be augmented as

according to Corollary 4. These three closely related constructions are graphically presented in Fig. 2.

With Corollaries 2 and 4, $p^{\prime}:=\frac{-\sum_{j}\ln(1-p_{j})}{c-\sum_{j}\ln(1-p_{j})}$ and $L:=\sum_{j}L_{j}$ , we further have two equivalent augmentations:

These augmentations allow us to derive a sequence of closed-form update equations, as described below.

2 Posterior Analysis and Predictive Distribution

With $p_{j}\sim\mbox{Beta}(a_{0},b_{0})$ and (10), we have

If $G_{0}$ is finite and continuous, we have $G_{0}(\omega)\rightarrow 0$ $\forall~{}\omega\in\Omega$ and thus $L^{\prime}(\Omega)|L,G_{0}=\sum_{\omega\in\Omega}\delta(L(\omega)>0)=\sum_{\omega\in\Omega}\delta(\sum_{j}X_{j}(\omega)>0)=K^{+};$ if $G_{0}$ is discrete as $G_{0}=\sum_{k=1}^{K}\frac{\gamma_{0}}{K}\delta_{\omega_{k}}$ , then $L^{\prime}(\omega_{k})=\mbox{CRT}(L(\omega_{k}),\frac{\gamma_{0}}{K})\geq 1$ if $\sum_{j}X_{j}(\omega_{k})>0$ , thus $L^{\prime}(\Omega)\geq K^{+}$ . In either case, let $\gamma_{0}=G_{0}(\Omega)\sim{\mbox{Gamma}}(e_{0},1/f_{0})$ , with the gamma-Poisson conjugacy on (14) and (5.1), we have

Using the gamma-Poisson conjugacy on (11), we have

Since the data $\{x_{ji}\}_{i}$ are exchangeable within group $j$ , conditioning on $X_{j}^{-i}=X_{j}\backslash X_{ji}$ and $G$ , with $\Theta_{j}$ marginalized out, we have

This prediction rule is similar to that of the Chinese restaurant franchise (CRF) .

3 Relationship with Hierarchical Dirichlet Process

With Corollary 3 and Section 2.2.1, we can equivalently express the gamma-gamma-Poisson process in (11) as

where $\Theta_{j}=\theta_{j}\widetilde{\Theta}_{j}$ , $G=\alpha\widetilde{G}$ and $G_{0}=\gamma_{0}\widetilde{G}_{0}$ . Without modeling $X_{j}(\Omega)$ and $\theta_{j}$ as random variables, (5.3) becomes an HDP . Thus the augmented and then normalized gamma-NB process leads to an HDP. However, we cannot return from the HDP to the gamma-NB process without modeling $X_{j}(\Omega)$ and $\theta_{j}$ as random variables. Theoretically, they are distinct in that the gamma-NB process is a completely random measure, assigning independent random variables into any disjoint Borel sets $\{A_{q}\}_{1,Q}$ in $\Omega$ , and the count $X_{j}(A)$ has the distribution as $X_{j}(A)\sim\mbox{NB}(G(A),p_{j})$ ; by contrast, due to normalization, the HDP is not, and marginally

Practically, the gamma-NB process can exploit Corollary 4 and the gamma-Poisson conjugacy to achieve analytic conditional posteriors. The inference of the HDP is a challenge and it is usually solved through alternative constructions such as the CRF and stick-breaking representations . In particular, both concentration parameters $\alpha$ and $\gamma_{0}$ are nontrivial to infer and they are often simply fixed . One may apply the data augmentation method of to sample $\alpha$ and $\gamma_{0}$ . However, if $\widetilde{G}_{0}$ is discrete as $\widetilde{G}_{0}=\sum_{k=1}^{K}\frac{1}{K}\delta_{\omega_{k}}$ , which is of practical value and becomes a continuous base measure as $K\rightarrow\infty$ , then using that method to sample $\gamma_{0}$ is only approximately correct, which may result in a biased estimate in practice, especially if $K$ is not sufficiently large.

By contrast, in the gamma-NB process, the shared $G$ can be analytically updated with (20) and $G(\Omega)$ plays the role of $\alpha$ in the HDP, which is readily available as

and as in (19), regardless of whether the base measure is continuous, the total mass $\gamma_{0}$ has an analytic gamma posterior. Equation (24) also intuitively shows how the NB probability parameters $\{p_{j}\}$ govern the variations among $\{\widetilde{\Theta}_{j}\}$ in the gamma-NB process. In the HDP, $p_{j}$ is not explicitly modeled, and since its value appears irrelevant when taking the normalized constructions in (5.3), it is usually treated as a nuisance parameter and perceived as $p_{j}=0.5$ when needed for interpretation.

Another related model is the DILN-HDP in , where group-specific Dirichlet processes are normalized from gamma processes, with the gamma scale parameters either fixed as $\frac{p_{j}}{1-p_{j}}=1$ or learned with a log Gaussian process prior. Yet no analytic conditional posteriors are provided and Gibbs sampling is not considered as a viable option. The main purpose of is introducing correlations between mixture components. It would be interesting to compare the differences between learning the $\{p_{j}\}$ with beta priors and learning the gamma scale parameters with the log Gaussian process prior.

The Negative Binomial Process Family

The gamma-NB process shares the NB dispersion across groups while the NB probability parameters are group dependent. Since the NB distribution has two adjustable parameters, it is natural to wonder whether one can explore sharing the NB probability measure across groups, while making the NB dispersion parameters group specific or atom dependent. That kind of construction would be distinct from both the gamma-NB process and HDP in that $\Theta_{j}$ has space dependent scales, and thus its normalization $\widetilde{\Theta}_{j}=\frac{\Theta_{j}}{\Theta_{j}(\Omega)}$ , still a random probability measure, no longer follows a Dirichlet process.

It is natural to let the NB probability measure be drawn from the beta process . In fact, the first discovered member of the NB process family is a beta-NB process . A main motivation of that construction is observing that the beta and Bernoulli distributions are conjugate and the beta-Bernoulli process is found to be quite useful for dictionary learning , whereas although the beta distribution is also conjugate to the NB distribution, there is apparent lack of exploitation of that relationship .

A beta-NB process is constructed by letting

With $B\sim\mbox{BP}(c,B_{0})$ expressed as $B=\sum_{k=1}^{\infty}p_{k}\delta_{\omega_{k}}$ , a random draw from $\mbox{NBP}(r_{j},B)$ can be expressed as

Under this construction, the NB probability measure is shared and the NB dispersion parameters are group dependent. Note that if $\{r_{j}\}$ are fixed as one, then the beta-NB process reduces to the beta-geometric process, related to the one for count modeling discussed in ; if $\{r_{j}\}$ are empirically set to some other values, then the beta-NB process reduces to the one proposed in . These simplifications are not considered in the paper, as they are often overly restrictive.

The asymptotic behavior of the beta-NB process with respect to the NB dispersion parameter is studied in . Such analysis is not provided here as we infer NB dispersion parameters from the data, which usually do not have large values due to overdispersion. In , the beta-NB process is treated comparable to a gamma-Poisson process and is considered less flexible than the HDP, motivating the construction of a hierarchical-beta-NB process. By contrast, in this paper, with the beta-NB process augmented as a beta-gamma-Poisson process, one can draw group-specific Poisson rate measures for count modeling and then use their normalization to provide group-specific random probability measures for mixture modeling; therefore, the beta-NB process, gamma-NB process and HDP are treated comparable to each other in hierarchical structures and are all considered suitable for mixed-membership modeling.

As in , we may also consider a marked-beta-NB process, with both the NB probability and dispersion measures shared, in which each point of the beta process is marked with an independent gamma random variable. Thus a draw from the marked-beta process becomes $(R,B)=\sum_{k=1}^{\infty}(r_{k},p_{k})\delta_{\omega_{k}}$ , and a draw from the NB process $X_{j}\sim\mbox{NBP}(R,B)$ becomes

With the beta-NB conjugacy, the posterior of $B$ is tractable in both the beta-NB and marked-beta-NB processes . Similar to the marked-beta-NB process, we may also consider a marked-gamma-NB process, where each point of the gamma process is marked with an independent beta random variable, whose performances is found to be similar.

If it is believed that there are excessive number of zeros, governed by a process other than the NB process, we may introduce a zero inflated NB process as $X_{j}\sim\mbox{NBP}(RZ_{j},p_{j})$ , where $Z_{j}\sim\mbox{BeP}(B)$ is drawn from the Bernoulli process and $(R,B)=\sum_{k=1}^{\infty}(r_{k},\pi_{k})\delta_{\omega_{k}}$ is drawn from a gamma marked-beta process, thus a draw from $\mbox{NBP}(RZ_{j},p_{j})$ can be expressed as $X_{j}=\sum_{k=1}^{\infty}n_{jk}\delta_{\omega_{k}}$ , with

This construction can be linked to the focused topic model in with appropriate normalization, with advantages that there is no need to fix $p_{j}=0.5$ and the inference is fully tractable. The zero inflated construction can also be linked to models for real valued data using the Indian buffet process (IBP) or beta-Bernoulli process spike-and-slab prior . Below we apply various NB processes for topic modeling and illustrate the differences between them.

Negative Binomial Process Topic Modeling and Poisson Factor Analysis

For the gamma-NB process described in Section 5, with the gamma process expressed as $G=\sum_{k=1}^{\infty}r_{k}\delta_{\boldsymbol{\phi}_{k}}$ , we can express the hierarchical model as

where $g_{0}(d\boldsymbol{\phi})=G_{0}(d\boldsymbol{\phi})/\gamma_{0}$ . With $\theta_{j}=\Theta_{j}(\Omega)=\sum_{k=1}^{\infty}\theta_{jk}$ , ${\boldsymbol{n}}_{j}=(n_{j1},\cdots,n_{j\infty})^{T}$ and $\boldsymbol{\theta}_{j}=(\theta_{j1},\cdots,\theta_{j\infty})^{T}$ , using Corollary 3, we can equivalently express $N_{j}$ and $n_{jk}$ in (29) as

Since $\{x_{ji}\}_{i=1,N_{j}}$ are fully exchangeable, rather than drawing ${\boldsymbol{n}}_{j}$ as in (30), we may equivalently draw it as

This provides further insights on uniting the seemingly distinct problems of count and mixture modeling.

Denote $n_{vjk}=\sum_{i=1}^{N_{j}}\delta(z_{ji}=k,v_{ji}=v)$ , $n_{v\boldsymbol{\cdot}k}=\sum_{j}n_{vjk}$ and $n_{\boldsymbol{\cdot}k}=\sum_{j}n_{jk}$ . For modeling convenience, we place Dirichlet priors on topics $\boldsymbol{\phi}_{k}\sim\mbox{Dir}(\eta,\cdots,\eta)$ , then for the gamma-NB process topic model, we have

which would be the same for the other NB processes, since the gamma-NB process differs from them only on how the gamma priors of $\theta_{jk}$ and consequently the NB priors of $n_{jk}$ are constituted. For example, marginalizing out $\theta_{jk}$ , we have $n_{jk}\sim\mbox{NB}(r_{k},p_{j})$ for the gamma-NB process, $n_{jk}\sim\mbox{NB}(r_{j},p_{k})$ for the beta-NB process, $n_{jk}\sim\mbox{NB}(r_{k},p_{k})$ for both the marked-beta-NB and marked-gamma-NB processes, and $n_{jk}\sim\mbox{NB}(r_{k}b_{jk},p_{j})$ for the zero-inflated-NB process.

As in , we may augment $m_{vj}\sim\mbox{Pois}(\sum_{k=1}^{K}\phi_{vk}\theta_{jk})$ as

If $\sum_{v=1}^{V}{\phi_{vk}}=1$ , we have $n_{jk}\sim\mbox{Pois}(\theta_{jk})$ , and with Corollary 3 and $\boldsymbol{\theta}_{j}=(\theta_{j1},\cdots,\theta_{jK})^{T}$ , we also have $(n_{vj1},\cdots,n_{vjK}|-)\sim\mbox{Mult}\big{(}m_{vj};\frac{\phi_{v1}\theta_{j1}}{\sum_{k=1}^{K}\phi_{vk}\theta_{jk}},$ $\cdots,\frac{\phi_{vK}\theta_{jK}}{\sum_{k=1}^{K}\phi_{vk}\theta_{jk}}\big{)}$ , $(n_{v\boldsymbol{\cdot}1},\cdots,n_{v\boldsymbol{\cdot}K}|-)$ $\sim\mbox{Mult}(n_{\boldsymbol{\cdot}k};\boldsymbol{\phi}_{k})$ , and $(n_{j1},\cdots,n_{jK}|-)$ $\sim\mbox{Mult}\left(N_{j};\boldsymbol{\theta}_{j}\right)$ , which would lead to (32) under the assumption that the words $\{x_{ji}\}_{i}$ are exchangeable and (33) if $\boldsymbol{\phi}_{k}\sim\mbox{Dir}(\eta,\cdots,\eta)$ . Thus topic modeling with the NB process can be considered as factorization of the term-document count matrix under the Poisson likelihood as ${\bf M}\sim\mbox{Pois}(\boldsymbol{\Phi}\boldsymbol{\Theta})$ .

PFA provides a unified framework to connect previously proposed discrete latent variable models, such as those in . As discussed in detail in , these models mainly differ on how the priors of $\phi_{vk}$ and $\theta_{jk}$ are constituted and how the inferences are implemented. For example, nonnegative matrix factorization with an objective function of minimizing the Kullback-Leibler (KL) divergence $D_{\text{KL}}({\bf M}||\boldsymbol{\Phi}\boldsymbol{\Theta})$ is equivalent to the ML estimation of $\boldsymbol{\Phi}$ and $\boldsymbol{\Theta}$ under PFA, and latent Dirichlet allocation (LDA) is equivalent to a PFA with Dirichlet priors imposed on both $\boldsymbol{\phi}_{k}$ and $\boldsymbol{\theta}_{j}$ .

2 Negative Binomial Process Topic Modeling

From the point view of PFA, an NB process topic model factorizes the term-document count matrix under the constraints that each factor sums to one and the factor scores are gamma distributed random variables, and consequently, the number of words assigned to a topic (factor/atom) follows an NB distribution. Depending on how the NB distributions are parameterized, as shown in Table I, we can construct a variety of NB process topic models, which can also be connected to a large number of previously proposed parametric and nonparametric topic models. For a deeper understanding on how the counts are modeled, we also show in Table I both the variance-to-mean ratio (VMR) and overdispersion level (ODL) implied by these settings. Eight differently constructed NB processes are considered:

(i) The NB process described in Section 4 is used for topic modeling. It improves over the count-modeling gamma-Poisson process discussed in in that it unites mixture modeling and has closed-form conditional posteriors. Although this is a nonparametric model supporting an infinite number of topics, requiring $\{\theta_{jk}\}_{j=1,J}\equiv r_{k}$ may be too restrictive.

(ii) Related to LDA and Dir-PFA , the NB-LDA is also a parametric topic model that requires tuning the number of topics. It is constructed by replacing the topic weights of the Gamma-NB process in (29) as $\theta_{jk}\sim\mbox{Gamma}(r_{j},p_{j}/(1-p_{j}))$ . It uses document dependent $r_{j}$ and $p_{j}$ to learn the smoothing of the topic weights, and it lets $r_{j}\sim\mbox{Gamma}(\gamma_{0},1/c),~{}\gamma_{0}\sim\mbox{Gamma}(e_{0},1/f_{0})$ to share statistical strength between documents.

(iii) Related to the HDP , the NB-HDP model is constructed by fixing $p_{j}/(1-p_{j})\equiv 1$ (i.e., $p_{j}\equiv 0.5$ ) in (29). It is also comparable to the HDP in that constructs group-specific Dirichlet processes with normalized gamma processes, whose scale parameters are also set as one.

(iv) The NB-FTM model is constructed by replacing the topic weights in (29) as $\theta_{jk}\sim\mbox{Gamma}(r_{k}b_{jk},p_{j}/(1-p_{j}))$ , with $p_{j}\equiv 0.5$ and $b_{jk}$ drawn from a beta-Bernoulli process that is used to explicitly model zero counts. It is the same as the sparse-gamma-gamma-PFA (S $\gamma\Gamma$ -PFA) in and is comparable to the focused topic model (FTM) , which is constructed from the IBP compound Dirichlet process. The Zero-Inflated-NB process improves over these approaches by allowing $\{p_{j}\}$ to be inferred, which generally yields better data fitting.

(v) The Gamma-NB process, as shown in (10) and (29), explores sharing the NB dispersion measure across groups, and it improves over the NB-HDP by allowing the learning of $\{p_{j}\}$ . As shown in (5.3), it reduces to the HDP in without modeling $X_{j}(\Omega)$ and $\theta_{j}$ as random variables.

(vi) The Beta-Geometric process is constructed by replacing the topic weights in (29) as $\theta_{jk}\sim\mbox{Gamma}(1,p_{k}/(1-p_{k}))$ . It explores sharing the NB probability measure across groups, which is related to the one proposed for count modeling in . It is restrictive that the NB dispersion parameters are fixed as one.

(vii) The Beta-NB process is constructed by replacing the topic weights in (29) as $\theta_{jk}\sim\mbox{Gamma}(r_{j},p_{k}/(1-p_{k}))$ . It explores sharing the NB probability measure across groups, which improves over the Beta-Geometric process and the beta-NB process (BNBP) proposed in by providing analytic conditional posteriors of $\{r_{j}\}$ .

(viii) The Marked-Beta-NB process constructed by replacing the topic weights in (29) as $\theta_{jk}\sim\mbox{Gamma}(r_{k},p_{k}/(1-p_{k}))$ . It is comparable to the BNBP proposed in , with the distinction that it provides analytic conditional posteriors of of $\{r_{k}\}$ .

3 Approximate and Exact Inference

Although all proposed NB process models have closed-form conditional posteriors, they contain countably infinite atoms that are infeasible to explicitly represent in practice. This infinite dimensional problem can be addressed by using a discrete base measure with $K$ atoms, i.e., truncating the total number of atoms to be $K$ , and then doing Bayesian inference via block Gibbs sampling . This is a very general approach and is used in our experiments to make a fair comparison between a wide variety of models. Block gibbs sampling for the Gamma-NB process is described in Appendix B; block gibbs sampling for other NB processes and related algorithms in Table I can be similarly derived, as described in and omitted here for brevity. The infinite dimensional problem can also be addressed by discarding the atoms with weights smaller than a small constant $\epsilon$ or by modifying the Lévy measure to make its integral over the whole space be finite . A sufficiently large (small) $K$ ( $\epsilon$ ) usually provides a good approximation, however, there is an increasing risk of wasting computation as the truncation level gets larger.

To avoid truncation, the slice sampling scheme of has been utilized for the Dirichlet process and normalized random measure based mixture models . With auxiliary slice latent variables introduced to allow adaptive truncations in each MCMC interaction, the infinite dimensional problem is transformed into a finite one. This method has also been applied to the beta-Bernoulli process and the beta-NB process . It would be interesting to investigate slice sampling for the NB process based count and mixture models, which provide likelihoods that might be more amenable to posterior simulation since no normalization is imposed on the weights of the atoms. As slice sampling is not the focus of this paper, we leave it for future study.

Both the block Gibbs sampler and the slice sampler explicitly represent a finite set of atoms for posterior simulation, and algorithms based on these samplers are commonly referred as “conditional” methods . Another approach of solving the infinite dimensional problem is employing a collapsed inference scheme that marginalizes out the atoms and their weights . Algorithms based on the collapsed inference scheme are usually referred as “marginal” methods . A well-defined prediction rule is usually required to develop a collapsed Gibbs sampler, and the conjugacy between the likelihood and the prior distribution of atoms is desired to avoid numerical integrations. In topic modeling, a word is linked to a Dirichlet distributed atom with a multinomial likelihood, thus the atoms can be analytically marginalized out; since their weights can also be marginalized out as in (5.2), we may develop a collapsed Gibbs sampler for the gamma-NB process based topic models. As the collapsed inference scheme is not the focus of this paper and the prediction rules for other NB processes need further investigation, we leave them for future study.

Example Results and Discussions

Motivated by Table I, we consider topic modeling using a variety of NB processes. We compare them with LDA and CRF-HDP , in which the latent count $n_{jk}$ is marginally distributed as

We consider the Psychological Reviewhttp://psiexp.ss.uci.edu/research/programs $\_$ data/toolbox.htm corpus, restricting the vocabulary to terms that occur in five or more documents. The corpus includes 1281 abstracts from 1967 to 2003, with $V=2566$ and 71,279 total word counts. We randomly select $20\%$ , $40\%$ , $60\%$ or $80\%$ of the words from each document to learn a document dependent probability for each term $v$ and calculate the per-word perplexity on the held-out words as

where $f_{jv}=\frac{\sum_{s=1}^{S}\sum_{k=1}^{K}\phi^{(s)}_{vk}\theta^{(s)}_{jk}}{\sum_{s=1}^{S}\sum_{v=1}^{V}\sum_{k=1}^{K}\phi^{(s)}_{vk}\theta^{(s)}_{jk}}$ , $y_{jv}$ is the number of words held out at term $v$ in document $j$ , $y_{\boldsymbol{\cdot}\boldsymbol{\cdot}}=\sum_{j=1}^{J}\sum_{v=1}^{V}y_{jv}$ , and $s=1,\cdots,S$ are the indices of collected samples. Note that the per-word perplexity is equal to $V$ if $f_{jv}=\frac{1}{V}$ , thus it should be no greater than $V$ for a topic model that works appropriately. The final results are averaged over five random training/testing partitions. The performance measure is the same as the one used in and similar to those in .

We show in Fig. 3 the NB dispersion and probability parameters learned by various NB process topic models listed in Table I, revealing distinct sharing mechanisms and model properties. In Fig. 4 we compare the per-held-out-word prediction performance of various algorithms. We set the parameters as $c=1$ , $\eta=0.05$ and $a_{0}=b_{0}=e_{0}=f_{0}=0.01$ . For LDA and NB-LDA, we search $K$ for optimal performance. All the other NB process topic models are nonparametric Bayesian models that can automatically learn the number of active topics $K^{+}$ for a given corpus. For fair comparison, all the models considered are implemented with block Gibbs sampling, where $K=400$ is set as an upper-bound.

When $\theta_{jk}\equiv r_{k}$ is used, as in the NB process, different documents are imposed to have the same topic weights, leading to the worst held-out-prediction performance.

When $(r_{j},p_{j})$ is used, as in NB-LDA, different documents are weakly coupled with $r_{j}\sim\mbox{Gamma}(\gamma_{0},1/c)$ , and the modeling results in Fig. 3 show that a typical document in this corpus usually has a small $r_{j}$ and a large $p_{j}$ , thus a large overdispersion level (ODL) and a large variance-to-mean ratio (VMR), indicating highly overdispersed counts on its topic usage. NB-LDA is a parametric topic model that requires tuning the number of topics $K$ . It improves over LDA in that it only has to tune $K$ , whereas LDA has to tune both $K$ and $\alpha$ . With an appropriate $K$ , the parametric NB-LDA may outperform the nonparametric NB-HDP and NB-FTM as the training data percentage increases, showing that even by learning both the NB parameters $r_{j}$ and $p_{j}$ in a document dependent manner, we may get better data fitting than using nonparametric models that fix the NB probability parameters.

The NB-HDP is a special case of the Gamma-NB process that $p_{j}\equiv 0.5$ . From a mixture modeling viewpoint, fixing $p_{j}=0.5$ is a natural choice as $p_{j}$ appears irrelevant after normalization. However, from a count modeling viewpoint, this would make a restrictive assumption that each count vector $\{n_{jk}\}_{k=1,K}$ has the same VMR of 2. It is also interesting to examine (24), which can be viewed as the concentration parameter $\alpha$ in the HDP, allowing the adjustment of $p_{j}$ would allow a more flexible model assumption on the amount of variations between the topic proportions, and thus potentially better data fitting.

When $(r_{k},\pi_{k})$ is used, as in the NB-FTM model, our results in Fig. 3 show that we usually have a small $\pi_{k}$ and a large $r_{k}$ , indicating topic $k$ is sparsely used across the documents but once it is used, the amount of variation on usage is small. This property might be helpful when there are excessive number of zeros that might not be well modeled by the NB process alone. In our experiments, the more direct approaches of using $p_{k}$ or $p_{j}$ generally yield better results, which might not be the case when excessive number of zeros could be better explained with the beta-Bernoulli processes, e.g., when the training words are scarce, the NB-FTM can approach the performance of the Marked-Beta-NB process.

Conclusions

We propose a variety of negative binomial (NB) processes for count modeling, which can be naturally applied for the seemingly disjoint problem of mixture modeling. The proposed NB processes are completely random measures, which assign independent random variables to disjoint Borel sets of the measure space, as opposed to Dirichlet processes, whose measures on disjoint Borel sets are negatively correlated. We reveal connections between various discrete distributions and discover unique data augmentation and marginalization methods for the NB process, with which we are able to unite count and mixture modeling, analyze fundamental model properties, and derive efficient Bayesian inference. We demonstrate that the NB process and the gamma-NB process can be recovered from the Dirichlet process and the HDP, respectively. We show in detail the theoretical, structural and computational advantages of the NB process. We examine the distinct sharing mechanisms and model properties of various NB processes, with connections made to existing discrete latent variable models under the Poisson factor analysis framework. Experimental results on topic modeling show the importance of modeling both the NB dispersion and probability parameters, which respectively govern the overdispersion level and variance-to-mean ratio for count modeling.

Acknowledgments

The authors would like to thank the two anonymous reviewers and the editor for their constructive comments that help improve the manuscript.

References

Appendix A Proof of Theorem 1

With the PMFs of both the NB and CRT distributions, the PMF of the joint distribution of counts $m$ and $l$ is $f_{M,L}(m,l|r,p)=f_{L}(l|m,r)f_{M}(m|r,p)={\frac{\Gamma(r)|s(m,l)|r^{l}}{\Gamma(m+r)}}\frac{\Gamma(r+m)(1-p)^{r}p^{m}}{m!\Gamma(r)}=\frac{|s(m,l)|r^{l}(1-p)^{r}p^{m}}{m!}$ , which is the same as (4).

Since $m\sim{\mbox{SumLog}}(l,p)$ is the summation of $l$ iid $\mbox{Log}(p)$ random variables, its PGF becomes $C_{M}(z)=C_{U}^{l}(z)=\left[{\ln(1-pz)}/{\ln(1-p)}\right]^{l},~{}|z|<{p^{-1}}.$ With $[\ln(1+x)]^{l}=l!\sum_{n=l}^{\infty}{s(n,l)x^{n}}/{n!}$ and $|s(m,l)|=(-1)^{m-l}s(m,l)$ , its PMF can be expressed as

Letting $l\sim{\mbox{Pois}}(-r\ln(1-p))$ , the PMF of the joint distribution of counts $m$ and $l$ is $f_{M,L}(m,l|r,p)=f_{M}(m|l,p)f_{L}(l|r,p)=\frac{p^{m}l!|s(m,l)|}{m![-\ln(1-p)]^{l}}\frac{(-r\ln(1-p))^{l}e^{r\ln(1-p)}}{l!}=\frac{|s(m,l)|r^{l}(1-p)^{r}p^{m}}{m!}$ , which is the same as (4). ∎

Appendix B Block Gibbs Sampling for the Gamma-Negative Binomial Process

With $p_{j}\sim\mbox{Beta}(a_{0},b_{0})$ , $\gamma_{0}\sim\mbox{Gamma}(e_{0},1/f_{0})$ and a discrete base measure as $G_{0}=\sum_{k=1}^{K}\frac{\gamma_{0}}{K}\delta_{\omega_{k}}$ , following Section 5.2, block Gibbs sampling for (29) proceeds as

where $p^{\prime}:=\frac{-\sum_{j}\ln(1-p_{j})}{c-\sum_{j}\ln(1-p_{j})}$ . Note that when $K\rightarrow\infty$ , we have $(l^{\prime}_{k}|-)=\delta(\sum_{j}n_{jk}>0)$ and thus $\sum_{k}l^{\prime}_{k}=K^{+}$ .