WHAI: Weibull Hybrid Autoencoding Inference for Deep Topic Modeling
Hao Zhang, Bo Chen, Dandan Guo, Mingyuan Zhou
Introduction
There is a surge of research interest in multilayer representation learning for documents. To analyze the term-document count matrix of a text corpus, Srivastava et al. (2013) extend the deep Boltzmann machine (DBM) with the replicated softmax topic model of Salakhutdinov & Hinton (2009) to infer a multilayer representation with binary hidden units, but its inference network is not trained to match the true posterior (Mnih & Gregor, 2014) and the higher-layer neurons learned by DBM are difficult to visualize. The deep Poisson factor models of Gan et al. (2015) are introduced to generalize Poisson factor analysis (Zhou et al., 2012), with a deep structure restricted to model binary topic usage patterns. Deep exponential families (DEF) of Ranganath et al. (2015) construct more general probabilistic deep networks with non-binary hidden units, in which a count matrix can be factorized under the Poisson likelihood, with the gamma distributed hidden units of adjacent layers linked via the gamma scale parameters. The Poisson gamma belief network (PGBN) (Zhou et al., 2015; 2016) also factorizes a count matrix under the Poisson likelihood, but factorizes the shape parameters of the gamma distributed hidden units of each layer into the product of a connection weight matrix and the gamma hidden units of the next layer, resulting in strong nonlinearity and readily interpretable multilayer latent representations.
Those multilayer probabilistic models are often characterized by a top-down generative structure, with the distribution of a hidden layer typically acting as a prior for the layer below. Despite being able to infer a multilayer representation of a text corpus with scalable inference (Patterson & Teh, 2013; Ruiz et al., 2016; Cong et al., 2017a), they usually rely on an iterative procedure to infer the latent representation of a new document at the testing stage, regardless of whether variational inference or Markov chain Monte Carlo (MCMC) is used. The potential need for a large number of iterations per testing document makes them unattractive when real-time processing is desired. For example, one may need to rapidly extract the topic-proportion vector of a document and use it for downstream analysis, such as identifying key topics and retrieving related documents. A potential solution is to construct a variational autoencoder (VAE) that learns the parameters of an inference network (recognition model or encoder) jointly with those of the generative model (decoder) (Kingma & Welling, 2014; Rezende et al., 2014). However, most existing VAEs rely on Gaussian latent variables, with the neural networks (NNs) acting as nonlinear transforms between adjacent layers (Sonderby et al., 2016; Dai et al., 2016; Ishaan et al., 2017). A primary reason is that there is a simple reparameterization trick for Gaussian latent variables that allows efficiently computing the noisy gradients of the evidence lower bound (ELBO) with respect to the NN parameters. Unfortunately, Gaussian based distributions often fail to approximate well the posterior distributions of sparse, nonnegative, and skewed document latent representations. For example, Srivastava & Sutton (2017) propose autoencoding variational inference for topic models (AVITM), as shown in Fig. 2b, which utilizes the logistic-normal distribution to approximate the posterior of the latent representation of a document; even though the generative model is latent Dirichlet allocation (LDA) (Blei et al., 2003), a basic single-hidden-layer topic model, due to the insufficient ability of the logistic-normal distribution to model sparsity, AVITM has to rely on some heuristic to force the latent representation of a document to be sparse.
Another common shortcoming of existing VAEs is that they often only provide a point estimate for the global parameters of the generative model, and hence their inference network is optimized to approximate the posteriors of the local parameters conditioned on the data and the point estimate, rather than a full posterior, of the global parameters. In addition, from the viewpoint of probabilistic modeling, the inference network of a VAE is often merely a shallow probabilistic model, whose parameters, though, are deterministically nonlinearly transformed from the observations via a non-probabilistic deep neural network.
Note that we have also tried gamma hybrid autoencoding inference (GHAI), which directly uses the gamma distribution in the probabilistic top-down part of the inference network, while using rejection sampling variational inference (RSVI) of Naesseth et al. to approximately compute the gradient of the ELBO. While RSVI is a very general technique that can be applied to a wide variety of non-reparameterizable distributions, we find that when replacing the reparameterizable Weibull with non-reparameterizable gamma distributions in the inference network, the potential gains are overshadowed by the disadvantages of having to rely on an approximate reparameterization scheme guided by rejection sampling. In the experiments for deep topic modeling, we show that WHAI clearly outperforms GHAI, and both WHAI and GHAI outperform their counterparts that remove the top-down links of the inference network, referred to as WHAI-independent and GHAI-independent, respectively; WHAI is comparable to Gibbs sampling in terms of performance, but is scalable to big training data via mini-batch stochastic-gradient based inference and is considerably fast in out-of-sample prediction via the use of an inference network.
WHAI for multilayer document representation
Below we first describe the decoder and encoder of WHAI, and then provide a hybrid stochastic-gradient MCMC and autoencoding variational inference that is fast in both training and testing.
Despite the attractive properties, neither the Gibbs sampler nor TLASGR-MCMC of DLDA can avoid taking a potentially large number of MCMC iterations to infer the latent representation of a testing document, which hinders real-time processing of the incoming documents and motivates us to construct an inference network with fast out-of-sample prediction, as described below.
2.2 Document encoder: Weibull upward-downward variational encoder
A VAE uses an inference network to map the observations directly to their latent representations. However, the success of VAEs so far is mostly restricted to Gaussian distributed latent variables, and does not generalize well to modeling sparse, nonnegative, and skewed latent document representations. To move beyond latent Gaussian models, below we propose the Weibull upward-downward variational encoder (WUDVE) to efficiently produce a document’s multilayer latent representation under DLDA.
Assuming the global parameters of DLDA shown in (2.1) are given and the task is to infer the local parameters , the usual strategy of mean-field variational Bayes (Jordan et al., 1999) is to maximize the ELBO that can be expressed as
where the expectations are taken with respect to (w.r.t.) a fully factorized distribution as
Instead of using a conventional latent Gaussian based VAE, in order to model sparse and nonnegative latent document representations, it might be more appropriate to use a gamma distribution based inference network whose parameters are produced by two related deep neural networks. However, it is hard to efficiently compute the gradient of the ELBO with respect to the parameters of these networks, owing to the difficulty of reparameterizing a gamma distributed random variable (Kingma & Welling, 2014; Ruiz et al., 2016; Knowles, 2015). This motivates us to identify a surrogate distribution that not only approximates the gamma distribution well, but can also be easily reparameterized. Below we show that the Weibull distribution is an ideal choice.
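A main reason that we choose the Weibull distribution to construct the inference network is that the Weibull and gamma distributions have similar PDFs:
$$\text{Weibull PDF: } P(x \mid k, \lambda) = \frac{k}{\lambda^{k}}\, x^{k-1} e^{-(x/\lambda)^{k}}, \qquad \text{Gamma PDF: } P(x \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x},$$
where $x \in \mathbb{R}_{+}$.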
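Moreover, its KL divergence from the gamma distribution has an analytic expression. For a Weibull distribution with shape $k$ and scale $\lambda$ and a gamma distribution with shape $\alpha$ and rate $\beta$,
$$\mathrm{KL}\big(\mathrm{Weibull}(k,\lambda)\,\|\,\mathrm{Gamma}(\alpha,\beta)\big) = \frac{\gamma\alpha}{k} - \alpha\ln\lambda + \ln k + \beta\lambda\,\Gamma\!\left(1+\frac{1}{k}\right) - \gamma - 1 - \alpha\ln\beta + \ln\Gamma(\alpha),$$
where $\gamma$ is the Euler–Mascheroni constant. Minimizing this KL divergence, one can identify the two parameters of a Weibull distribution to approximate a given gamma one. As shown in Fig. 1, the inferred Weibull distribution in general quite accurately approximates the target gamma one, as long as the gamma shape parameter is neither too close to zero nor too large.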
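The fitting step described above can be illustrated numerically. The following is a minimal sketch, assuming the standard Weibull (shape k, scale lam) and gamma (shape alpha, rate beta) parameterizations; the function names and the crude grid search over k are illustrative choices, not the procedure used to produce Fig. 1. For a fixed k, setting the derivative of the KL divergence with respect to the scale to zero gives lam = alpha / (beta * Gamma(1 + 1/k)), so only k needs to be searched.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def kl_weibull_gamma(k, lam, alpha, beta):
    """KL(Weibull(k, lam) || Gamma(alpha, beta)), with beta a rate parameter."""
    return (EULER_GAMMA * alpha / k
            - alpha * math.log(lam)
            + math.log(k)
            + beta * lam * math.gamma(1.0 + 1.0 / k)
            - EULER_GAMMA - 1.0
            - alpha * math.log(beta)
            + math.lgamma(alpha))


def fit_weibull_to_gamma(alpha, beta, n_grid=501):
    """Illustrative fit: log-spaced grid over k, closed-form optimal lam for each k."""
    best = None
    for i in range(n_grid):
        k = 10.0 ** (-1.0 + 2.5 * i / (n_grid - 1))  # k ranges from 0.1 to about 31.6
        lam = alpha / (beta * math.gamma(1.0 + 1.0 / k))  # minimizes the KL for this k
        kl = kl_weibull_gamma(k, lam, alpha, beta)
        if best is None or kl < best[2]:
            best = (k, lam, kl)
    return best  # (k, lam, kl)
```

A useful sanity check is that Weibull(1, lam) and Gamma(1, 1/lam) are the same exponential distribution, so the divergence between them should vanish.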
2.2.2 Upward-downward information propagation
For the DLDA upward-downward Gibbs sampler sketched in Fig. 2c, the corresponding Gibbs sampling update equation for can be expressed as
where and are latent random variables constituted by information upward propagated to layer , as described in detail in Zhou et al. (2016) and hence omitted here for brevity. It is clear from (5) that the conditional posterior of is related to both the information at the higher (prior) layer, and that upward propagated to the current layer via a series of data augmentation and marginalization steps described in Zhou et al. (2016). Inspired by this instructive upward-downward information propagation in Gibbs sampling, as shown in Fig. 2a, we construct WUDVE, the inference network of our model, as , where
Comparing Figs. 2c and 2a shows that, in each iteration, both Gibbs sampling and WUDVE have not only an upward information propagation (orange arrows) but also a downward one (blue arrows), although their underlying implementations are distinct from each other. Gibbs sampling in Fig. 2c does not have an inference network and needs the local variables to help perform stochastic upward information propagation, whereas WUDVE in Fig. 2a uses its non-probabilistic part to perform deterministic upward information propagation, without relying on the local variables. It is also interesting to notice that the upward-downward structure of WUDVE, motivated by the upward-downward Gibbs sampler of DLDA, is closely related to that used in the ladder VAE of Sonderby et al. (2016). However, to combine the bottom-up and top-down information, ladder VAE relies on some heuristic restricted to Gaussian latent variables.
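To make the deterministic-upward and stochastic-downward structure concrete, here is a toy sketch of one encoder pass. Everything specific in it is an assumption made for illustration: the softplus layers, the weight names (`W_h`, `W_k`, `W_lam`), the log(1 + x) input transform, the helper `toy_encoder`, and in particular the choice to add the top-down term to the Weibull shape parameter. It is not the authors' Theano implementation.

```python
import math
import random


def softplus(v):
    # Numerically stable ln(1 + exp(a)), applied elementwise.
    return [max(a, 0.0) + math.log1p(math.exp(-abs(a))) for a in v]


def affine(W, b, h):
    # W is a list of rows; returns W h + b.
    return [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]


def weibull_sample(k, lam, rng):
    # Reparameterized draw: lam * (-ln(1 - eps))^(1/k) with eps ~ Uniform(0, 1).
    return lam * (-math.log1p(-rng.random())) ** (1.0 / k)


def wudve_pass(x, layers, Phi, r, rng):
    """x: word counts; layers[l]: weights of hidden layer l (0-indexed);
    Phi[l]: matrix mapping layer l+1 down to layer l; r: top-layer prior shape."""
    # Deterministic upward pass: no latent variables are involved.
    h = [math.log1p(c) for c in x]
    ks, lams = [], []
    for p in layers:
        h = softplus(affine(p["W_h"], p["b_h"], h))
        ks.append(softplus(affine(p["W_k"], p["b_k"], h)))
        lams.append(softplus(affine(p["W_lam"], p["b_lam"], h)))
    # Stochastic downward pass: each layer's shape also depends on the layer above.
    thetas = [None] * len(layers)
    prior = r
    for l in reversed(range(len(layers))):
        thetas[l] = [weibull_sample(k + a, lam, rng)
                     for k, a, lam in zip(ks[l], prior, lams[l])]
        if l > 0:
            prior = [sum(w * t for w, t in zip(row, thetas[l])) for row in Phi[l - 1]]
    return thetas


def toy_encoder(vocab_size, sizes, rng, scale=0.1):
    """Hypothetical helper: small random weights for a toy encoder."""
    def mat(rows, cols):
        return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]
    layers, below = [], vocab_size
    for K in sizes:
        layers.append({"W_h": mat(K, below), "b_h": [0.0] * K,
                       "W_k": mat(K, K), "b_k": [0.0] * K,
                       "W_lam": mat(K, K), "b_lam": [0.0] * K})
        below = K
    # Phi[l] has shape sizes[l] x sizes[l + 1] with nonnegative entries.
    Phi = [[[rng.random() for _ in range(sizes[l + 1])] for _ in range(sizes[l])]
           for l in range(len(sizes) - 1)]
    return layers, Phi, [1.0] * sizes[-1]
```

For example, `wudve_pass([2, 0, 1, 0, 3], *toy_encoder(5, [4, 3], rng), rng)` with `rng = random.Random(0)` returns one nonnegative latent vector per hidden layer, sampled from the top layer down.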
2.3 Hybrid MCMC/VAE inference
In Section 2.1, we describe how to use TLASGR-MCMC of Cong et al. (2017a), a stochastic-gradient MCMC algorithm for DLDA, to sample the global parameters; whereas in Section 2.2.2, we describe how to use WUDVE, an autoencoding variational inference network, to approximate the conditional posterior of the local parameters given the global parameters and the observation. Rather than merely finding a point estimate of the global parameters, we describe in Algorithm 1 how to combine TLASGR-MCMC and the proposed WUDVE into a hybrid MCMC/VAE inference algorithm, which infers posterior samples for both the global parameters of the generative network and the corresponding neural network parameters of the inference network. Being able to efficiently evaluate the gradient of the ELBO is important to the success of a variational inference algorithm (Hoffman et al., 2013; Paisley et al., 2012; Kingma & Welling, 2014; Mnih & Gregor, 2014; Ranganath et al., 2015; Ruiz et al., 2016; Rezende et al., 2014). An important step of Algorithm 1 is calculating the gradient of the ELBO in (3) with respect to the NN parameters. Thanks to the choice of the Weibull distribution, the second term of the ELBO in (3) is analytic, and due to the simple reparameterization of the Weibull distribution, the gradient of the first term of the ELBO with respect to the NN parameters can be accurately evaluated, achieving satisfactory performance using even a single Monte Carlo sample, as shown in our experimental results. Thanks to the architecture of WUDVE, using the inference network, for a new mini-batch, we can directly find the conditional posteriors of the local parameters given the global parameters and the stochastically updated NN parameters, with which we can sample the local parameters and then use TLASGR-MCMC to stochastically update the global parameters.
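The single-sample gradient estimate mentioned above relies on writing a Weibull draw as a deterministic function of uniform noise. A minimal sketch follows, assuming the standard inverse-CDF form x = lam * (-ln(1 - eps))^(1/k) with eps ~ Uniform(0, 1), and a toy objective f(x) = c ln x - x, which is a Poisson log-likelihood with rate x up to an additive constant rather than the full ELBO. The function names are illustrative.

```python
import math
import random


def weibull_reparam(eps, k, lam):
    """Map uniform noise eps in [0, 1) to a Weibull(k, lam) draw."""
    return lam * (-math.log1p(-eps)) ** (1.0 / k)


def grad_lam_single_sample(count, k, lam, eps):
    """Pathwise estimate of d/d(lam) E[count * ln(x) - x] from one noise draw.

    Chain rule: df/dx = count / x - 1 and dx/d(lam) = x / lam,
    whose product simplifies to (count - x) / lam.
    """
    x = weibull_reparam(eps, k, lam)
    return (count - x) / lam


def exact_grad_lam(count, k, lam):
    """Closed form for comparison, using E[x] = lam * Gamma(1 + 1/k)."""
    return (count - lam * math.gamma(1.0 + 1.0 / k)) / lam
```

Averaging `grad_lam_single_sample` over many independent noise draws should approach `exact_grad_lam`; in stochastic optimization each update uses only one such draw, or a few.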
2.4 Variations of WHAI
To clearly understand how each component contributes to the overall performance of WHAI, below we consider two different variations: GHAI and WAI. We first consider gamma hybrid autoencoding inference (GHAI). In WUDVE, the inference network for WHAI, we have a deterministic-upward and stochastic-downward structure, where the reparameterizable Weibull distribution is used to connect adjacent stochastic layers. Although we choose to use the Weibull distribution for the reasons specified in Section 2.2.1, one may also choose some other distribution in the downward structure. For example, one may choose the gamma distribution and replace (6) with
To demonstrate the advantages of the proposed hybrid inference for WHAI, which infers posterior samples of the global parameters using TLASGR-MCMC, we also consider Weibull autoencoding inference (WAI), which has the same inference network as WHAI but infers the global parameters using stochastic gradient descent (SGD) (Kingma & Ba, 2015). Note that as argued in Mandt et al. (2017), SGD can also be used for approximate Bayesian inference. We will show in experiments that sampling the global parameters via TLASGR-MCMC provides improved performance in comparison to sampling them via SGD.
To understand the importance of the stochastic-downward structure used in the inference network, and further understand the differences between using the Weibull distribution with simple reparameterization and using the gamma distribution with RSVI, we also consider DLDA-GHAI-Independent and DLDA-WHAI-Independent that remove the stochastic-downward connections of DLDA-GHAI and DLDA-WHAI, respectively. More specifically, they define in (6) as and , respectively, and use variational inference and RSVI, respectively, to infer .
Experimental results
We compare the performance of different algorithms on 20Newsgroups (20News), Reuters Corpus Volume I (RCV1), and Wikipedia (Wiki). 20News consists of 18,845 documents with a vocabulary size of 2,000. RCV1 consists of 804,414 documents with a vocabulary size of 10,000. Wiki, with a vocabulary size of 7,702, consists of 10 million documents randomly downloaded from Wikipedia using the script provided for Hoffman et al. (2010). Similar to Cong et al. (2017a), we randomly select 100,000 documents for testing. To be consistent with previous settings (Gan et al., 2015; Henao et al., 2015; Cong et al., 2017a), no precautions are taken in the Wikipedia downloading script to prevent a testing document from being downloaded into a mini-batch for training. Our code is written in Theano (Theano Development Team, 2016).
For comparison, we consider the deep Poisson factor analysis (DPFA) of Gan et al. (2015), DLDA-Gibbs of Zhou et al. (2016), DLDA-TLASGR of Cong et al. (2017a), and AVITM of Srivastava & Sutton (2017), using the code provided by the authors. Note that as shown in Cong et al. (2017a), DLDA-Gibbs and DLDA-TLASGR are state-of-the-art topic modeling algorithms that clearly outperform a large number of previously proposed ones, such as the replicated softmax of Salakhutdinov & Hinton (2009) and the nested hierarchical Dirichlet process of Paisley et al. (2015).
Per-heldout-word perplexity is a widely used performance measure. Similar to Wallach et al. (2009), Paisley et al. (2011), and Zhou et al. (2012), for each corpus, we randomly select 70% of the word tokens from each document to form a training matrix, holding out the remaining 30% to form a testing matrix. We use the training matrix to train the model and calculate the per-heldout-word perplexity as
where is the total number of collected samples and . For the proposed model, we set the mini-batch size as 200, and use as burn-in 2000 mini-batches for both 20News and RCV1 and 3500 for wiki. We collect 3000 samples after burn-in to calculate perplexity. The hyperparameters of WHAI are set as: , , and .
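As a rough illustration of the measure, the sketch below computes a per-heldout-word perplexity from a matrix of held-out counts and a matrix of predicted nonnegative rates. It assumes that the rates have already been averaged over the collected posterior samples and that each document's rates are normalized over the vocabulary before scoring; the function and argument names are hypothetical.

```python
import math


def heldout_perplexity(Y_test, rates):
    """Y_test[v][n]: held-out count of word v in document n.
    rates[v][n]: predicted nonnegative rate for the same entry
    (assumed to be averaged over collected samples beforehand)."""
    V, N = len(Y_test), len(Y_test[0])
    # Normalizing constant of each document's predictive word distribution.
    col_sums = [sum(rates[v][n] for v in range(V)) for n in range(N)]
    total, loglik = 0, 0.0
    for v in range(V):
        for n in range(N):
            y = Y_test[v][n]
            if y > 0:  # only observed held-out tokens contribute
                loglik += y * math.log(rates[v][n] / col_sums[n])
                total += y
    return math.exp(-loglik / total)
```

Two limiting cases help with interpretation: a uniform prediction over a vocabulary of size V gives a perplexity of V, and a prediction that places all mass on the held-out words gives a perplexity of 1.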
Table 1 lists for various algorithms both the perplexity and the average run time per testing document given a single sample (estimate) of the global parameters. Clearly, given the same generative network structure, DLDA-Gibbs performs the best in terms of predicting heldout word tokens, which is not surprising as this batch algorithm can sample from the true posteriors given enough Gibbs sampling iterations. DLDA-TLASGR is a mini-batch algorithm that is much more scalable in training than DLDA-Gibbs, at the expense of slightly degraded performance in out-of-sample prediction. Both DLDA-WAI, using SGD to infer the global parameters, and DLDA-WHAI, using a stochastic-gradient MCMC to infer the global parameters, slightly underperform DLDA-TLASGR; all mini-batch based algorithms are scalable to a big training corpus, but due to the use of the WUDVE inference network, both DLDA-GHAI and DLDA-WHAI, as well as their variations, are considerably fast in processing a testing document. In terms of perplexity, all algorithms with DLDA as the generative model clearly outperform both DPFA of Gan et al. (2015) and AVITM of Srivastava & Sutton (2017), while in terms of the computational cost for testing, all algorithms with an inference network, such as AVITM, DLDA-GHAI, and DLDA-WHAI, clearly outperform those relying on an iterative procedure for out-of-sample prediction, including DPFA, DLDA-Gibbs, and DLDA-TLASGR.
It is also clear that except for DLDA-GHAI-Independent and DLDA-WHAI-Independent that have no stochastic-downward components in their inference, all the other algorithms with DLDA as the generative model have a clear trend of improvement as the generative network becomes deeper, indicating the importance of having stochastic-downward information propagation during posterior inference; and DLDA-WHAI with a single hidden layer already clearly outperforms AVITM, indicating that using the Weibull distribution is more appropriate than using the logistic-normal distribution to model the document latent representation. Furthermore, thanks to the use of the stochastic gradient based TLASGR-MCMC rather than a simple SGD procedure, DLDA-WHAI consistently outperforms DLDA-WAI. Last but not least, while DLDA-GHAI that relies on RSVI to approximately reparameterize the gamma distributions clearly outperforms AVITM and DPFA, it clearly underperforms DLDA-WHAI that has simple reparameterizations for its Weibull distributions.
Below we examine how various inference algorithms progress over time during training, evaluated with per-heldout-word perplexity. As clearly shown in Fig. 3, DLDA-WHAI outperforms DPFA and AVITM in providing lower perplexity as time progresses, which is not surprising as the DLDA multilayer generative model is good at document representation, while AVITM is only "deep" in the deterministic part of its inference network and DPFA is restricted to model binary topic usage patterns via its deep network. When DLDA is used as the generative model, in comparison to Gibbs sampling and TLASGR-MCMC on two large corpora, RCV1 and Wiki, the mini-batch based WHAI converges slightly slower than TLASGR-MCMC but much faster than Gibbs sampling; WHAI consistently outperforms WAI, which demonstrates the advantage of the hybrid MCMC/VAE inference; in addition, the RSVI based DLDA-GHAI clearly converges more slowly in time than DLDA-WHAI. Note that for all three datasets, the perplexity of TLASGR decreases at a fast rate, followed closely by WHAI, while that of Gibbs sampling decreases slowly, especially for RCV1 and Wiki, as shown in Figs. 3(b-c). This is expected as both RCV1 and Wiki are much larger corpora, for which a mini-batch based inference algorithm can already make significant progress in inferring the global model parameters before a batch-learning Gibbs sampler finishes a single iteration that needs to go through all documents. We also notice that although AVITM is fast for testing via the use of a VAE, its representation power is limited due to not only the use of a shallow topic model, but also the use of a latent Gaussian based inference network that is not naturally suited to model document latent representation.
3.2 Topic hierarchy and manifold
In addition to quantitative evaluations, we have also visually inspected the inferred topics at different layers and the inferred connection weights between the topics of adjacent layers. Distinct from many existing deep learning models that build nonlinearity via “black-box” neural networks, we can easily visualize the whole stochastic network, whose hidden units of adjacent layers are connected by weights that are sparse. In particular, we can understand the meaning of each hidden unit by projecting it back to the original data space. We show in Fig. 4 a subnetwork, originating from units 16, 19, and 24 of the top hidden layer, taken from the generative network of size 128-64-32 inferred on Wiki. The semantic meaning of each topic and the connections between different topics are highly interpretable. We provide several additional topic hierarchies for Wiki in the Appendix.
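One way to carry out such a projection, sketched here under the assumption that a topic at a given layer is mapped to word space by multiplying together the connection weight matrices of all layers below it, is shown next. The list `Phis`, the helper names, and the tiny example matrices are all illustrative and not taken from the trained Wiki model.

```python
def matmul(A, B):
    """Plain-Python product of matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def project_topics(Phis, layer):
    """Phis[0] is V x K1, Phis[1] is K1 x K2, and so on.
    Returns a V x K_layer matrix whose columns are the topics of the
    requested layer (1-indexed) expressed as weights over the vocabulary."""
    out = Phis[0]
    for Phi in Phis[1:layer]:
        out = matmul(out, Phi)
    return out


def top_words(topic_matrix, k, vocab, n=3):
    """Return the n highest-weight vocabulary terms of topic (column) k."""
    weights = [row[k] for row in topic_matrix]
    order = sorted(range(len(vocab)), key=lambda v: weights[v], reverse=True)
    return [vocab[v] for v in order[:n]]
```

If every column of each weight matrix sums to one, as assumed in this sketch, each projected topic is again a distribution over the vocabulary, which is what makes listing its top words meaningful.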
To further illustrate the effectiveness of our multilayer representation in our model, we apply a three-hidden-layer WHAI to MNIST digits and present the learned dictionary atoms. We use the Poisson likelihood directly to model the MNIST digit pixel values that are nonnegative integers ranging from 0 to 255. As shown in Figs. 5a-5c, it is clear that the factors at layers one to three represent localized points, strokes, and digit components, respectively, that cover increasingly larger spatial regions. This type of hierarchical visual representation is difficult to achieve with other types of deep neural networks (Srivastava et al., 2013; Kingma & Welling, 2014; Rezende et al., 2014; Sonderby et al., 2016).
WUDVE, the inference network of WHAI, has a deterministic-upward-stochastic-downward structure, in contrast to a conventional VAE that often has a purely deterministic bottom-up structure. Here, we further visualize the importance of the stochastic-downward part of WUDVE through a simple experiment. We remove the stochastic-downward part of WUDVE shown in (6), so that the inference network ignores the top-down information. As shown in Figs. 5d-5f, although some latent structures are learned, the hierarchical relationships between adjacent layers almost all disappear, indicating the importance of having a stochastic-downward structure together with a deterministic-upward one in the inference network.
As a sanity check for latent representation and overfitting, we show in Fig. 6 the latent space interpolations between the test set examples on the MNIST dataset, and provide related results in the Appendix for the 20News corpus. With the 3-layer model learned before, following Dumoulin et al. (2016), we sample pairs of test set examples and project them into the latent space. We then linearly interpolate between the two latent representations of each pair, and pass the intermediary points through the generative model to generate the input-space interpolations. In Fig. 6, the left and right columns are the digits generated from the latent representations of the two test examples, while the middle ones are generated from the interpolated latent points. We observe smooth transitions between pairs of examples, and the intermediary images remain interpretable. In other words, the latent space the model learned is on a manifold, indicating that WHAI has learned a generalizable latent feature representation rather than concentrating its probability mass exclusively around training examples.
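The interpolation step itself is simple; a minimal sketch is given below. It assumes that the two latent vectors have already been obtained from the encoder and that decoding is handled elsewhere. The function name and the number of steps are illustrative.

```python
def interpolate(z_a, z_b, steps):
    """Return `steps` latent vectors moving linearly from z_a to z_b (inclusive).

    Requires steps >= 2 so that both endpoints are included.
    """
    points = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 to 1.0
        points.append([(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return points
```

Because the latent representations here are nonnegative, every convex combination of two of them is nonnegative as well, so each intermediary point remains a valid input to the generative model.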
Conclusion
To infer hierarchical latent representations of a big corpus, we develop Weibull hybrid autoencoding inference (WHAI) for deep latent Dirichlet allocation (DLDA), a deep probabilistic topic model that factorizes the observed high-dimensional count vectors under the Poisson likelihood and models the latent representation under the gamma likelihood at multiple different layers. WHAI integrates topic-layer-adaptive stochastic gradient Riemannian (TLASGR) MCMC to update the global parameters given the posterior sample of a mini-batch’s local parameters, and a Weibull distribution based upward-downward variational autoencoder to infer the conditional posterior of the local parameters given the stochastically updated global parameters. The use of the Weibull distribution, which resembles the gamma distribution and has a simple reparameterization, makes one part of the evidence lower bound (ELBO) analytic, and makes it efficient to compute the gradient of the non-analytic part of the ELBO with respect to the parameters of the inference network. Moving beyond deep models and inference procedures based on Gaussian latent variables, WHAI provides posterior samples for both the global parameters of the generative model and those of the inference network, yields highly interpretable multilayer latent document representation, is scalable to a big training corpus due to the use of a stochastic-gradient MCMC, and is fast in out-of-sample prediction due to the use of an inference network. Compelling experimental results on big text corpora demonstrate the advantages of WHAI in both quantitative and qualitative analysis.
Acknowledgments
This work is partially supported by the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project) (No. B18039), the Thousand Young Talent Program of China, NSFC (61771361), NSFC for Distinguished Young Scholars (61525105), and Innovation Fund of International Exchange Program for Graduate Student of Xidian University.
Erratum
The camera-ready copy of this paper contained two typos: missing a negative sign in the definition of the probability density function of the Weibull distribution in Page 4 and missing a negative sign in the definition of in Page 5. In addition, it did not include the definition of that is used in computing . This ArXiv update is made to correct both typos and add the definition of as the Euler–Mascheroni constant.
References
Appendix A Hierarchical topics learned from Wiki
Appendix B Manifold on documents
From a sci.medicine document to a sci.space one
1. com, writes, article, edu, medical, pitt, pain, blood, disease, doctor, medicine, treatment, patients, health, ibm
2. com, writes, article, edu, space, medical, pitt, pain, blood, disease, doctor, data, treatment, patients, health
3. space, com, writes, article, edu, data, medical, launch, earth, states, blood, moon, disease, satellite, medicine
4. space, data, com, writes, article, edu, launch, earth, states, moon, satellite, shuttle, nasa, price, lunar
5. space, data, launch, earth, states, moon, satellite, case, com, shuttle, price, nasa, price, lunar, writes
6. space, data, launch, earth, states, moon, orbit, satellite, case, shuttle, price, nasa, system, lunar, spacecraft
From an alt.atheism document to a soc.religion.christian one
1. god, just, want, moral, believe, religion, atheists, atheism, christian, make, atheist, good, say, bible, faith
2. god, just, want, believe, jesus, christian, atheists, bible, atheism, faith, say, make, religious, christians, atheist
3. god, jesus, just, faith, believe, christian, bible, want, church, say, religion, moral, lord, world, writes
4. god, jesus, faith, just, bible, church, christ, believe, say, writes, lord, religion, world, want, sin
5. god, jesus, faith, church, christ, bible, christian, say, write, lord, believe, truth, world, human, holy
6. god, jesus, faith, church, christ, bible, writes, say, christian, lord, sin, human, father, spirit, truth
From a comp.graphics document to a comp.sys.ibm.pc.hardware one
1. image, color, windows, files, image, thanks, jpeg, gif, card, bit, window, win, help, colors, format
2. image, windows, color, files, card, images, jpeg, thanks, gif, bit, window, win, colors, monitor, program
3. windows, image, color, card, files, gov, writes, nasa, article, images, program, jpeg, vidio, display, monitor
4. windows, gov, writes, nasa, article, card, going, program, image, color, memory, files, software, know, screen
5. gov, windows, writes, nasa, article, going, dos, card, memory, know, display, says, screen, work, ram
6. gov, writes, nasa, windows, article, going, dos, program, card, memory, software, says, ram, work, running