The Poisson Gamma Belief Network

Mingyuan Zhou, Yulai Cong, Bo Chen

Introduction

There has been significant recent interest in deep learning. Despite its tremendous success in supervised learning, inferring a multilayer data representation in an unsupervised manner remains a challenging problem . The sigmoid belief network (SBN), which connects the binary units of adjacent layers via the sigmoid functions, infers a deep representation of multivariate binary vectors . The deep belief network (DBN) is a SBN whose top hidden layer is replaced by the restricted Boltzmann machine (RBM) that is undirected. The deep Boltzmann machine (DBM) is an undirected deep network that connects the binary units of adjacent layers using the RBMs . All these deep networks are designed to model binary observations. Although one may modify the bottom layer to model Gaussian and multinomial observations, the hidden units of these networks are still typically restricted to be binary . One may further consider the exponential family harmoniums to construct more general networks with non-binary hidden units, but often at the expense of noticeably increased complexity in training and data fitting.

Moving beyond conventional deep networks using binary hidden units, we construct a deep directed network with gamma distributed nonnegative real hidden units to unsupervisedly infer a multilayer representation of multivariate count vectors, with a simple but powerful mechanism to capture the correlations among the visible/hidden features across all layers and handle highly overdispersed counts. The proposed model is called the Poisson gamma belief network (PGBN), which factorizes the observed count vectors under the Poisson likelihood into the product of a factor loading matrix and the gamma distributed hidden units (factor scores) of layer one; and further factorizes the shape parameters of the gamma hidden units of each layer into the product of a connection weight matrix and the gamma hidden units of the next layer. Distinct from previous deep networks that often utilize binary units for tractable inference and require tuning both the width (number of hidden units) of each layer and the network depth (number of layers), the PGBN employs nonnegative real hidden units and automatically infers the widths of subsequent layers given a fixed budget on the width of its first layer. Note that the budget could be infinite and hence the whole network can grow without bound as more data are being observed. When the budget is finite and hence the ultimate capacity of the network is limited, we find that the PGBN equipped with a narrower first layer could increase its depth to match or even outperform a shallower network with a substantially wider first layer.

The gamma distribution density function has the highly desired strong non-linearity for deep learning, but the existence of neither a conjugate prior nor a closed-form maximum likelihood estimate for its shape parameter makes a deep network with gamma hidden units appear unattractive. Despite seemingly difficult, we discover that, by generalizing the data augmentation and marginalization techniques for discrete data , one may propagate latent counts one layer at a time from the bottom data layer to the top hidden layer, with which one may derive an efficient upward-downward Gibbs sampler that, one layer at a time in each iteration, upward samples Dirichlet distributed connection weight vectors and then downward samples gamma distributed hidden units.

In addition to constructing a new deep network that well fits multivariate count data and developing an efficient upward-downward Gibbs sampler, other contributions of the paper include: 1) combining the gamma-negative binomial process with a layer-wise training strategy to automatically infer the network structure; 2) revealing the relationship between the upper bound imposed on the width of the first layer and the inferred widths of subsequent layers; 3) revealing the relationship between the network depth and the model’s ability to model overdispersed counts; 4) and generating a multivariate high-dimensional random count vector, whose distribution is governed by the PGBN, by propagating the gamma hidden units of the top hidden layer back to the bottom data layer.

The Poisson Gamma Belief Network

and impose $c_{0}\sim\mbox{Gam}(e_{0},1/f_{0})$ and $\gamma_{0}\sim\mbox{Gam}(a_{0},1/b_{0})$ ; and for $t\in\{3,\ldots,T+1\}$ , we let

We expect the correlations between the rows (features) of $(\boldsymbol{x}_{1}^{(1)},\ldots,\boldsymbol{x}_{J}^{(1)})$ to be captured by the columns of $\boldsymbol{\Phi}^{(1)}$ , and the correlations between the rows (latent features) of $(\boldsymbol{\theta}_{1}^{(t)},\ldots,\boldsymbol{\theta}_{J}^{(t)})$ to be captured by the columns of $\boldsymbol{\Phi}^{(t+1)}$ . Even if all $\boldsymbol{\Phi}^{(t)}$ for $t\geq 2$ are identity matrices, indicating no correlations between latent features, our analysis will show that a deep structure with $T\geq 2$ could still benefit data fitting by better modeling the variability of the latent features $\boldsymbol{\theta}_{j}^{(1)}$ .

Sigmoid and deep belief networks. Under the hierarchical model in (1), given the connection weight matrices, the joint distribution of the count observations and gamma hidden units of the PGBN can be expressed, similar to those of the sigmoid and deep belief networks , as

With $\boldsymbol{\phi}_{v:}$ representing the $v$ th row $\boldsymbol{\Phi}$ , for the gamma hidden units $\theta_{vj}^{(t)}$ we have

which are highly nonlinear functions that are strongly desired in deep learning. By contrast, with the sigmoid function $\sigma(x)=1/(1+e^{-x})$ and bias terms $b_{v}^{(t+1)}$ , a sigmoid/deep belief network would connect the binary hidden units $\theta_{vj}^{(t)}\in\{0,1\}$ of layer $t$ (for deep belief networks, $t<T-1$ ) to the product of the connection weights and binary hidden units of the next layer with

Comparing (4) with (5) clearly shows the differences between the gamma nonnegative hidden units and the sigmoid link based binary hidden units. Note that the rectified linear units have emerged as powerful alternatives of sigmoid units to introduce nonlinearity . It would be interesting to use the gamma units to introduce nonlinearity in the positive region of the rectified linear units.

Deep Poisson factor analysis. With $T=1$ , the PGBN specified by (1)-(3) reduces to Poisson factor analysis (PFA) using the (truncated) gamma-negative binomial process , which is also related to latent Dirichlet allocation if the Dirichlet priors are imposed on both $\boldsymbol{\phi}_{k}^{(1)}$ and $\boldsymbol{\theta}_{j}^{(1)}$ . With $T\geq 2$ , the PGBN is related to the gamma Markov chain hinted by Corollary 2 of and realized in , the deep exponential family of , and the deep PFA of . Different from the PGBN, in , it is the gamma scale but not shape parameters that are chained and factorized; in , it is the correlations between binary topic usage indicators but not the full connection weights that are captured; and neither nor provide a principled way to learn the network structure. Below we break the PGBN of $T$ layers into $T$ related submodels that are solved with the same subroutine.

By definition (7) is true for layer $t=1$ . Suppose that (7) is true for layer $t\geq 2$ , then we can augment each count $x^{(t)}_{vj}$ into the summation of $K_{t}$ latent counts that are smaller or equal as

where $v\in\{1,\ldots,K_{t-1}\}$ . With $m^{(t)(t+1)}_{kj}:=x^{(t)}_{\boldsymbol{\cdot}jk}:=\sum_{v=1}^{K_{t-1}}x^{(t)}_{vjk}$ representing the number of times that factor $k\in\{1,\ldots,K_{t}\}$ of layer $t$ appears in observation $j$ and ${\boldsymbol{m}}^{(t)(t+1)}_{j}:=\big{(}x^{(t)}_{\boldsymbol{\cdot}j1},\ldots,x^{(t)}_{\boldsymbol{\cdot}jK_{t}}\big{)}^{\prime}$ , since $\sum_{v=1}^{K_{t-1}}\phi_{vk}^{(t)}=1$ , we can marginalize out $\boldsymbol{\Phi}^{(t)}$ as in , leading to

Further marginalizing out the gamma distributed $\boldsymbol{\theta}_{j}^{(t)}$ from the above Poisson likelihood leads to

The $k$ th element of ${\boldsymbol{m}}^{(t)(t+1)}_{j}$ can be augmented under its compound Poisson representation as

Thus if (7) is true for layer $t$ , then it is also true for layer $t+1$ . ∎

Using Lemma 4.1 of on (8) and Theorem 1 of on (9), we can propagate the latent counts $x^{(t)}_{vj}$ of layer $t$ upward to layer $t+1$ as

As $x^{(t)}_{\boldsymbol{\cdot}j}=m^{(t)(t+1)}_{\boldsymbol{\cdot}j}$ and $x^{(t+1)}_{kj}$ is in the same order as $\ln\big{(}m^{(t)(t+1)}_{kj}\big{)}$ , the total count of layer $t+1$ , expressed as $\sum_{j}x^{(t+1)}_{\boldsymbol{\cdot}j}$ , would often be much smaller than that of layer $t$ , expressed as $\sum_{j}x^{(t)}_{\boldsymbol{\cdot}j}$ . Thus the PGBN may use $\sum_{j}x^{(T)}_{\boldsymbol{\cdot}j}$ as a simple criterion to decide whether to add more layers.

2 Modeling overdispersed counts

In comparison to a single-layer shallow model with $T=1$ that assumes the hidden units of layer one to be independent in the prior, the multilayer deep model with $T\geq 2$ captures the correlations between them. Note that for the extreme case that $\boldsymbol{\Phi}^{(t)}={\bf I}_{K_{t}}$ for $t\geq 2$ are all identity matrices, which indicates that there are no correlations between the features of $\boldsymbol{\theta}_{j}^{(t-1)}$ left to be captured, the deep structure could still provide benefits as it helps model latent counts ${\boldsymbol{m}}_{j}^{(1)(2)}$ that may be highly overdispersed. For example, supposing $\boldsymbol{\Phi}^{(t)}={\bf I}_{K_{2}}$ for all $t\geq 2$ , then from (1) and (9) we have

In comparison to PFA with $m_{kj}^{(1)(2)}\,|\,r_{k}\sim\mbox{NB}(r_{k},p_{j}^{(2)})$ , with a variance-to-mean ratio of $1/(1-p_{j}^{(2)})$ , the PGBN with $T$ hidden layers, which mixes the shape of $m_{kj}^{(1)(2)}\sim\mbox{NB}(\theta_{kj}^{(2)},p_{j}^{(2)})$ with a chain of gamma random variables, increases the variance-to-mean ratio of the latent count $m_{kj}^{(1)(2)}$ given $r_{k}$ by a factor of $1+(T-1)p_{j}^{(2)}$ , and hence could better model highly overdispersed counts.

3 Upward-downward Gibbs sampling

With Lemma 1 and Corollary 2 and the width of the first layer being bounded by $K_{1\max}$ , we develop an upward-downward Gibbs sampler for the PGBN, each iteration of which proceeds as follows: Sample $x_{vjk}^{(t)}$ . We can sample $x_{vjk}^{(t)}$ for all layers using (10). But for the first hidden layer, we may treat each observed count $x_{vj}^{(1)}$ as a sequence of word tokens at the $v$ th term (in a vocabulary of size $V:=K_{0}$ ) in the $j$ th document, and assign the $x_{\boldsymbol{\cdot}j}^{(1)}$ words $\{v_{ji}\}_{i=1,x_{\boldsymbol{\cdot}j}^{(1)}}$ one after another to the latent factors (topics), with both the topics $\boldsymbol{\Phi}^{(1)}$ and topic weights $\boldsymbol{\theta}_{j}^{(1)}$ marginalized out, as

where $z_{ji}$ is the topic index for $v_{ji}$ and $x_{vjk}^{(1)}:=\sum_{i}\delta(v_{ji}=v,z_{ji}=k)$ counts the number of times that term $v$ appears in document $j$ ; we use the $\boldsymbol{\cdot}$ symbol to represent summing over the corresponding index, e.g., $x^{(t)}_{\boldsymbol{\cdot}jk}:=\sum_{v}x^{(t)}_{vjk}$ , and use $x^{-ji}$ to denote the count $x$ calculated without considering word $i$ in document $j$ . The collapsed Gibbs sampling update equation shown above is related to the one developed in for latent Dirichlet allocation, and the one developed in for PFA using the beta-negative binomial process. When $T=1$ , we would replace the terms $\boldsymbol{\phi}^{(2)}_{k:}\boldsymbol{\theta}_{j}^{(2)}$ with $r_{k}$ for PFA built on the gamma-negative binomial process (or with $\alpha\pi_{k}$ for the hierarchical Dirichlet process latent Dirichlet allocation, see and for details), and add an additional term to account for the possibility of creating an additional topic . For simplicity, in this paper, we truncate the nonparametric Bayesian model with $K_{1\max}$ factors and let $r_{k}\sim\mbox{Gam}(\gamma_{0}/K_{1\max},1/c_{0})$ if $T=1$ . Sample $\boldsymbol{\phi}_{k}^{(t)}$ . Given these latent counts, we sample the factors/topics $\boldsymbol{\phi}^{(t)}_{k}$ as

Sample $x_{vj}^{(t+1)}$ . We sample $\boldsymbol{x}_{j}^{(t+1)}$ using (11), replacing $\boldsymbol{\Phi}^{(T+1)}\boldsymbol{\theta}_{j}^{(T+1)}$ with $\boldsymbol{r}:=(r_{1},\ldots,r_{K_{T}})^{\prime}$ . Sample $\boldsymbol{\theta}_{j}^{(t)}$ . Using (7) and the gamma-Poisson conjugacy, we sample $\boldsymbol{\theta}_{j}$ as

Sample $\boldsymbol{r}$ . Both $\gamma_{0}$ and $c_{0}$ are sampled using related equations in . We sample $\boldsymbol{r}$ as

Sample $c_{j}^{(t)}$ . With $\theta_{\boldsymbol{\cdot}j}^{(t)}:=\sum_{k=1}^{K_{t}}\theta_{kj}^{(t)}$ for $t\leq T$ and $\theta_{\boldsymbol{\cdot}j}^{(T+1)}:=r_{\boldsymbol{\cdot}}$ , we sample $p_{j}^{(2)}$ and $\{c_{j}^{(t)}\}_{t\geq 3}$ as

and calculate $c_{j}^{(2)}$ and $\{p_{j}^{(t)}\}_{t\geq 3}$ with (6).

4 Learning the network structure with layer-wise training

As jointly training all layers together is often difficult, existing deep networks are typically trained using a greedy layer-wise unsupervised training algorithm, such as the one proposed in to train the deep belief networks. The effectiveness of this training strategy is further analyzed in . By contrast, the PGBN has a simple Gibbs sampler to jointly train all its hidden layers, as described in Section 2.3, and hence does not require greedy layer-wise training. Yet the same as commonly used deep learning algorithms, it still needs to specify the number of layers and the width of each layer.

In this paper, we adopt the idea of layer-wise training for the PGBN, not because of the lack of an effective joint-training algorithm, but for the purpose of learning the width of each hidden layer in a greedy layer-wise manner, given a fixed budget on the width of the first layer. The proposed layer-wise training strategy is summarized in Algorithm 1. With a PGBN of $T-1$ layers that has already been trained, the key idea is to use a truncated gamma-negative binomial process to model the latent count matrix for the newly added top layer as $m_{kj}^{(T)(T+1)}\sim\mbox{NB}(r_{k},p_{j}^{(T+1)}),~{}r_{k}\sim\mbox{Gam}(\gamma_{0}/K_{T\max},1/c_{0})$ , and rely on that stochastic process’s shrinkage mechanism to prune inactive factors (connection weight vectors) of layer $T$ , and hence the inferred $K_{T}$ would be smaller than $K_{T\max}$ if $K_{T\max}$ is sufficiently large. The newly added layer and the layers below it would be jointly trained, but with the structure below the newly added layer kept unchanged. Note that when $T=1$ , the PGBN would infer the number of active factors if $K_{1\max}$ is set large enough, otherwise, it would still assign the factors with different weights $r_{k}$ , but may not be able to prune any of them.

Experimental Results

We apply the PGBNs for topic modeling of text corpora, each document of which is represented as a term-frequency count vector. Note that the PGBN with a single hidden layer is identical to the (truncated) gamma-negative binomial process PFA of , which is a nonparametric Bayesian algorithm that performs similarly to the hierarchical Dirichlet process latent Dirichlet allocation for text analysis, and is considered as a strong baseline that outperforms a large number of topic modeling algorithms. Thus we will focus on making comparison to the PGBN with a single layer, with its layer width set to be large to approximate the performance of the gamma-negative binomial process PFA. We evaluate the PGBNs’ performance by examining both how well they unsupervisedly extract low-dimensional features for document classification, and how well they predict heldout word tokens. Matlab code will be available in http://mingyuanzhou.github.io/.

We use Algorithm 1 to learn, in a layer-wise manner, from the training data the weight matrices $\boldsymbol{\Phi}^{(1)},\ldots,\boldsymbol{\Phi}^{(T_{\max})}$ and the top-layer hidden units’ gamma shape parameters $\boldsymbol{r}$ : to add layer $T$ to a previously trained network with $T-1$ layers, we use $B_{T}$ iterations to jointly train $\boldsymbol{\Phi}^{(T)}$ and $\boldsymbol{r}$ together with $\{\boldsymbol{\Phi}^{(t)}\}_{1,T-1}$ , prune the inactive factors of layer $T$ , and continue the joint training with another $C_{T}$ iterations. We set the hyper-parameters as $a_{0}=b_{0}=0.01$ and $e_{0}=f_{0}=1$ . Given the trained network, we apply the upward-downward Gibbs sampler to collect 500 MCMC samples after 500 burnins to estimate the posterior mean of the feature usage proportion vector $\boldsymbol{\theta}_{j}^{(1)}/\theta_{\boldsymbol{\cdot}j}^{(1)}$ at the first hidden layer, for every document in both the training and testing sets.

Feature learning for binary classification. We consider the 20 newsgroups dataset (http://qwone.com/ $\sim$ jason/20Newsgroups/) that consists of 18,774 documents from 20 different news groups, with a vocabulary of size $K_{0}=$ 61,188. It is partitioned into a training set of 11,269 documents and a testing set of 7,505 ones. We first consider two binary classification tasks that distinguish between the $comp.sys.ibm.pc.hardware$ and $comp.sys.mac.hardware$ , and between the $sci.electronics$ and $sci.med$ news groups. For each binary classification task, we remove a standard list of stop words and only consider the terms that appear at least five times, and report the classification accuracies based on 12 independent random trials. With the upper bound of the first layer’s width set as $K_{1\max}\in\{25,50,100,200,400,600,800\}$ , and $B_{t}=C_{t}=1000$ and $\eta^{(t)}=0.01$ for all $t$ , we use Algorithm 1 to train a network with $T\in\{1,2,\ldots,8\}$ layers. Denote $\bar{\boldsymbol{\theta}}_{j}$ as the estimated $K_{1}$ dimensional feature vector for document $j$ , where $K_{1}\leq K_{1\max}$ is the inferred number of active factors of the first layer that is bounded by the pre-specified truncation level $K_{1\max}$ . We use the $L_{2}$ regularized logistic regression provided by the LIBLINEAR package to train a linear classifier on $\bar{\boldsymbol{\theta}}_{j}$ in the training set and use it to classify $\bar{\boldsymbol{\theta}}_{j}$ in the test set, where the regularization parameter is five-folder cross-validated on the training set from $(2^{-10},2^{-9},\ldots,2^{15})$ .

As shown in Fig. 2, modifying the PGBN from a single-layer shallow network to a multilayer deep one clearly improves the qualities of the unsupervisedly extracted feature vectors. In a random trial, with $K_{1\max}=800$ , we infer a network structure of $(K_{1},\ldots,K_{8})=(512,154,75,54,47,37,34,29)$ for the first binary classification task, and $(K_{1},\ldots,K_{8})=(491,143,74,49,36,32,28,26)$ for the second one. Figs. 2(c)-(d) also show that increasing the network depth in general improves the performance, but the first-layer width clearly plays an important role in controlling the ultimate network capacity. This insight is further illustrated below.

Feature learning for multi-class classification. We test the PGBNs for multi-class classification on 20newsgroups. After removing a standard list of stopwords and the terms that appear less than five times, we obtain a vocabulary with $K_{0}=33,420$ . We set $C_{t}=500$ and $\eta^{(t)}=0.05$ for all $t$ . If $K_{1\max}\leq 400$ , we set $B_{t}=1000$ for all $t$ , otherwise we set $B_{1}=1000$ and $B_{t}=500$ for $t\geq 2$ . We use all 11,269 training documents to infer a set of networks with $T_{\max}\in\{1,\ldots,5\}$ and $K_{1\max}\in\{50,100,200,400,600,800\}$ , and mimic the same testing procedure used for binary classification to extract low-dimensional feature vectors, with which each testing document is classified to one of the 20 news groups using the $L_{2}$ regularized logistic regression. Fig. 2 shows a clear trend of improvement in classification accuracy by increasing the network depth with a limited first-layer width, or by increasing the upper bound of the width of the first layer with the depth fixed. For example, a single-layer PGBN with $K_{1\max}=100$ could add one or more layers to slightly outperform a single-layer PGBN with $K_{1\max}=200$ , and a single-layer PGBN with $K_{1\max}=200$ could add layers to clearly outperform a single-layer PGBN with $K_{1\max}$ as large as $800$ . We also note that each iteration of jointly training multiple layers costs moderately more than that of training a single layer, e.g., with $K_{1\max}=400$ , a training iteration on a single core of an Intel Xeon 2.7 GHz CPU on average takes about $5.6$ , $6.7$ , $7.1$ seconds for the PGBN with $1$ , $3$ , and $5$ layers, respectively.

Examining the inferred network structure also reveals interesting details. For example, in a random trial with Algorithm 1, the inferred network widths $(K_{1},\ldots,K_{5})$ are $(50,50,50,50,50)$ , $(200,161,130,94,63)$ , $(528,129,109,98,91)$ , and $(608,100,99,96,89)$ , for $K_{1\max}=50,200,600$ , and $800$ , respectively. This indicates that for a network with an insufficient budget on its first-layer width, as the network depth increases, its inferred layer widths decay more slowly than a network with a sufficient or surplus budget on its first-layer width; and a network with a surplus budget on its first-layer width may only need relatively small widths for its higher hidden layers. In the Appendix, we provide comparisons of accuracies between the PGBN and other related algorithms, including these of and , on similar multi-class document classification tasks.

Perplexities for holdout words. In addition to examining the performance of the PGBN for unsupervised feature learning, we also consider a more direct approach that we randomly choose 30% of the word tokens in each document as training, and use the remaining ones to calculate per-heldout-word perplexity. We consider the NIPS12 (http://www.cs.nyu.edu/ $\sim$ roweis/data.html) corpus, limiting the vocabulary to the 2000 most frequent terms. We set $\eta^{(t)}=0.05$ and $C_{t}=500$ for all $t$ , set $B_{1}=1000$ and $B_{t}=500$ for $t\geq 2$ , and consider five random trials. Among the $B_{t}+C_{t}$ Gibbs sampling iterations used to train layer $t$ , we collect one sample per five iterations during the last $500$ iterations, for each of which we draw the topics $\{\boldsymbol{\phi}^{(1)}_{k}\}_{k}$ and topics weights $\boldsymbol{\theta}_{j}^{(1)}$ , to compute the per-heldout-word perplexity using Equation (34) of . As shown in Fig. 3, we observe a clear trend of improvement by increasing both $K_{1\max}$ and $T$ .

Conclusions

The Poisson gamma belief network is proposed to extract a multilayer deep representation for high-dimensional count vectors, with an efficient upward-downward Gibbs sampler to jointly train all its layers and a layer-wise training strategy to automatically infer the network structure. Example results clearly demonstrate the advantages of deep topic models. For big data problems, in practice one may rarely has a sufficient budget to allow the first-layer width to grow without bound, thus it is natural to consider a belief network that can use a deep representation to not only enhance its representation power, but also better allocate its computational resource. Our algorithm achieves a good compromise between the widths of hidden layers and the depth of the network.

Acknowledgements. M. Zhou thanks TACC for computational support. B. Chen thanks the support of the Thousand Young Talent Program of China, NSC-China (61372132), and NCET-13-0945.

References

Appendix A Comparisons of classification accuracies

For comparison, we consider the same $L_{2}$ regularized logistic regression multi-class classifier, trained either on the raw word counts or normalized term-frequencies of the 20newsgroups training documents using five-folder cross-validation. As summarized in Tab. 1, when using the raw term-frequency word counts as covariates, the same classifier achieves $69.8\%$ ( $68.2\%$ ) accuracy on the 20newsgroups test documents if using the top 2000 terms that exclude (include) a standard list of stopwords, achieves $75.8\%$ if using all the $61,188$ terms in the vocabulary, and achieves $78.0\%$ if using the $33,420$ terms remained after removing a standard list of stopwords and the terms that appear less than five times; and when using the normalized term-frequencies as covariates, the corresponding accuracies are $70.8\%$ ( $67.9\%$ ) if using the top 2000 terms excluding (including) stopwords, $77.6\%$ with all the $61,188$ terms, and $79.4\%$ with the $33,420$ selected terms.

As summarized in Tab. 2, for multi-class classification on the same dataset, with a vocabulary size of 2000 that consisits of the 2000 most frequent terms after removing stopwords and stemming, the DocNADE and the over-replicated softmax provide the accuracies of $67.0\%$ and $66.8\%$ , respectively, for a feature dimension of $K=128$ , and provide the accuracies of $68.4\%$ and $69.1\%$ , respectively, for a feature dimension of $K=512$ .

As summarized in Tab. 3, with the same vocabulary size of 2000 (but different terms due to different preprocessing), the proposed PGBN provides $65.9\%$ ( $67.5\%$ ) with $T=1$ ( $T=5$ ) for $K_{1\max}=128$ , and $65.9\%$ ( $69.2\%)$ with $T=1$ ( $T=5)$ for $K_{1\max}=512$ , which may be further improved if we also consider the stemming step, as done in the these two algorithms, for word preprocessing, or if we set the values of $\eta^{(t)}$ to be smaller than 0.05. We also summarize in Tab. 3 the classification accuracies of the PGBNs learned with $V=33,420$ , as shown in Fig. 2.

Appendix B Generating synthetic documents

Below we provide several synthetic documents generated from the PGBN with $(K_{1},\ldots,K_{5})=(608,100,99,96,89)$ , which is trained on the training set of the 20newsgroups corpus with $K_{1\max}=800$ and $\eta^{(t)}=0.05$ for all $t$ . We set $c_{j^{\prime}}^{(t)}$ as the median of the inferred $\{c_{j}^{t}\}_{j}$ of the training documents for all $t$ . Given $\{\boldsymbol{\Phi}^{(t)}\}_{1,T}$ and $\boldsymbol{r}$ , We first generate $\boldsymbol{\theta}_{j^{\prime}}^{(T)}\sim\mbox{Gam}\left(\boldsymbol{r},1\big{/}c_{j^{\prime}}^{(T+1)}\right)$ and then downward pass it through the network by repeatedly drawing nonnegative real random variables from the gamma distribution as in (1). With the simulated $\boldsymbol{\theta}_{j^{\prime}}^{(1)}$ , we calculate the Poisson rates for all the $V$ words using $\boldsymbol{\Phi}^{(1)}\boldsymbol{\theta}_{j^{\prime}}^{(1)}$ and display the top 100 words ranked according to $\boldsymbol{\Phi}^{(1)}\boldsymbol{\theta}_{j^{\prime}}^{(1)}$ . Below are some example synthetic documents generated in this manner, which are all easy to interpret and reflect various aspects of the 20newsgroups corpus used to train the PGBN.

team game games hockey year cup season playoffs edu win pittsburgh nhl toronto detroit stanley teams montreal play jets pens espn division chicago new penguins pick league players devils rangers wings boston islanders playoff ca series winnipeg gm abc tv playing quebec april time round st vancouver fans best gld bruins coach winner calgary leafs player great watch night patrick vs finals conference final just baseball coverage murray minnesota don won gary points mike like ice kings regular mario played louis caps contact washington selanne norris buffalo columbia keenan star people fan th think canadiens said canada canucks york gerald

hall smith players fame career ozzie winfield nolan guys ryan dave baseball eddie murray numbers steve kingman robinson yount morris roger years bsu puckett long joe jackson hung brett garvey deserve robin evans princeton yeah frank ruth kirby rickey pitcher peak yogi hof great sick lee ha aaron johnny darrell santo time greatest stats seasons ron george reardon shortstops henderson hank mays jack liability marginal rogers average compare belong schmidt gibson willie leo ucs sgi bsuvc comment fans honestly deserves cal bell candidates wagner fielding walks ve likely history gee heck consideration mike player bonds lock rating sandberg standards apparent

fbi koresh batf gas compound waco government atf people children tear cult davidians did bd branch agents happened assault warrant david reno tanks killed weapons clinton point country search building federal raid press started reported death proper needed illegal better house protect burned janet outside burn days media stand job arms inside right come cwru equipment followers investigation oldham believe non power kids burning fires women suicide law order cs sick blame initial alive feds agent tank religious automatic davidian deaths knock good hit said military possible died away light fault child witnesses pay instead folks daniel bureau armored going

people government law state israel rights israeli jews right public states war fact political country arab laws article case court human federal american united support society policy civil freedom members national jewish evidence person majority force power legal citizens action crime world act countries issue arabs group police justice non control palestinian live land peace true anti center writes gaza population research constitution death edu org allowed party protection consider actions number adam apc general subject based murder igc considered life military self parties lives personal nation order cpr social question individual religious today situation free responsibility governments palestine innocent

medical health disease doctor pain patients treatment medicine cancer edu hiv blood use years patient writes cause skin don like just aids symptoms number article help diseases drug com effects information doctors infection physician normal chronic think taking care volume condition drugs page says cure people tobacco hicnet know newsletter effective therapy problem common time women prevent surgery children center immune research called april control effect weeks low syndrome hospital physicians states clinical diagnosed day med age good make caused severe reported public safety child said cdc usually diet national studies tissue months way cases causing migraine smokeless infections does

card video drivers cards driver vga mode ati graphics windows diamond vesa bus svga support gateway dx pc modes color isa board version local bit memory vlb ultra pro eisa monitor new does mb stealth hz using based speedstar orchid colors available latest ram know work chip performance resolution fast screen speed tech million trident winbench dcoleman set problems yes et ftp results winmarks plus edu bbs zeos utexas vram bios robert win higher magazine utxvms able high interlaced viper com boards site weitek tseng chipset modem turbo software non resolutions far faster accelerated supports price meg ega mhz true

card windows video drivers monitor com modem vga cards driver port pc mode screen ati serial graphics dos bus board irq support svga diamond vesa using memory problem dx color gateway file version ports local modes pro bit does isa colors mb know vlb mouse ultra win ram new monitors hz work eisa nec problems chip files stealth use set program speedstar orchid plus high based resolution fast software cable hardware display latest used performance ms like baud bbs tech connector run thanks speed just yes million trident winbench dcoleman available pin ibm uart connect sony window switch et disk

nissan electronics wagon altima delcoelect kocrsv station gm subaru sumax delco spiros hughes wax pathfinder legacy kokomo wagons smorris scott toyota seattleu don just like strong silver software luxury derek proof stanza seattle cisco morris cymbal triantafyllopoulos sportscar think people know near fool ugly proud claims flat statistics lincoln sedans bullet karl lee perth puzzled miata sentra maxima acura infiniti corolla mgb untruth verbatim good time consider way based make stand guys writes noticed want ve heavy suggestion eat steven horrible uunet studies armor fisher lust designs study definately lexus remove conversion embodied aesthetic elvis attached honey stole designing wd

mac apple bit mhz ram simms mb like memory just don cpu people chip chips think color board ibm speed does know se video time machines motherboard hardware lc cache meg ns simm need upgrade built vram good quadra want centris price dx run way processor card clock slots make fpu internal did macs cards ve pin power really machine say faster said software intel macintosh right week writes slot going sx performance things edu years nubus possible thing monitor work point expansion rom iisi ll add dram better little slow let sure pc ii didn ethernet lciii case kind

image jpeg gif file color files images format bit display convert quality formats colors programs program tiff picture viewer graphics bmp bits xv screen pixel read compression conversion zip shareware scale view jpg original save quicktime jfif free version best pcx viewing bitmap gifs simtel viewers don mac usenet resolution animation menu scanner pixels sites gray quantization displays better try msdos tga want current black faq converting white setting mirror xloadimage section ppm fractal amiga write algorithm mpeg pict targa arithmetic export scodal archive converted grasp lossless let space human grey directory pictures rgb demo scanned old choice grayscale compress

gun guns edu writes bike com article weapons dod control crime weapon apr used carry criminals police ride nra bikes self firearms use buy firearm laws concealed bmw defense home handgun criminal motorcycle anti problem car people owners ban rider riding shot just armed new don like crimes assault kill violent protect uio handguns ifi evil ama citizens state org know illegal politics texas thomas thomasp cb talk legal shooting pro road carrying abiding think att honda cs stolen defend good purchase ll law individual hp cc permit rifle issue government states parsli property ve killing federal does motorcycles time

gun guns weapons people control government law crime state rights police laws weapon self criminals carry states public nra used defense firearms anti federal right criminal legal firearm citizens country home political case concealed handgun court fact crimes issue protect armed politics kill ban problem buy national individual support shot society violent use civil war property talk owners assault illegal handguns ifi uio united defend action allowed freedom article american amendment person member power force thomasp car human evidence threat thomas murder shooting majority killed carrying members citizen killing pro abiding group act evil texas america justice permit stolen said