Discovering Discrete Latent Topics with Neural Variational Inference

Yishu Miao, Edward Grefenstette, Phil Blunsom

Introduction

Probabilistic models for inducing latent topics from documents are one of the great success stories of unsupervised learning. Starting with latent semantic analysis (LSA (Landauer et al., 1998)), models for uncovering the underlying semantic structure of a document collection have been widely applied in data mining, text processing and information retrieval. Probabilistic topic models (e.g. PLSA (Hofmann, 1999), LDA (Blei et al., 2003) and HDPs (Teh et al., 2006)) provide a robust, scalable, and theoretically sound foundation for document modelling by introducing latent variables for each token to topic assignment.

For the traditional Dirichlet-Multinomial topic model, efficient inference is available by exploiting conjugacy with either Monte Carlo or Variational techniques (Jordan et al., 1999; Attias, 2000; Beal, 2003)). However, as topic models have grown more expressive, in order to capture topic dependencies or exploit conditional information, inference methods have become increasingly complex. This is especially apparent for non-conjugate models (Carlin & Polson, 1991; Blei & Lafferty, 2007; Wang & Blei, 2013).

Deep neural networks are excellent function approximators and have shown great potential for learning complicated non-linear distributions for unsupervised models. Neural variational inference (Kingma & Welling, 2014; Rezende et al., 2014; Mnih & Gregor, 2014) approximates the posterior of a generative model with a variational distribution parameterised by a neural network. This allows both the generative model and the variational network to be jointly trained with backpropagation. For models with continuous latent variables associated with particular distributions, such as Gaussians, there exist reparameterisations (Kingma & Welling, 2014; Rezende et al., 2014) of the distribution permitting unbiased and low-variance estimates of the gradients with respect to the parameters of the inference network. For models with discrete latent variables, Monte-Carlo estimates of the gradient must be employed. Recently, algorithms such as REINFORCE have been used effectively to decrease variance and improve learning (Mnih & Gregor, 2014; Mnih et al., 2014).

In this work we propose and evaluate a range of topic models parameterised with neural networks and trained with variational inference. We introduce three different neural structures for constructing topic distributions: the Gaussian Softmax distribution (GSM), the Gaussian Stick Breaking distribution (GSB), and the Recurrent Stick Breaking process (RSB), all of which are conditioned on a draw from a multivariate Gaussian distribution. The Gaussian Softmax topic model constructs a finite topic distribution with a softmax function applied to the projection of the Gaussian random vector. The Gaussian Stick Breaking model also constructs a discrete distribution from the Gaussian draw, but this time employing a stick breaking construction to provide a bias towards sparse topic distributions. Finally, the Recurrent Stick Breaking process employs a recurrent neural network, again conditioned on the Gaussian draw, to progressively break the stick, yielding a neural analog of a Dirichlet Process topic model (Teh et al., 2006).

Our neural topic models combine the merits of both neural networks and traditional probabilistic topic models. They can be trained efficiently by backpropagation, scaled to large data sets, and easily conditioned on any available contextual information. Further, as probabilistic graphical models, they are interpretable and explicitly represent the dependencies amongst the random variables. Previous neural document models, such as the neural variational document model (NVDM) (Miao et al., 2016), belief networks document model (Mnih & Gregor, 2014), neural auto-regressive document model (Larochelle & Lauly, 2012) and replicated softmax (Hinton & Salakhutdinov, 2009), have not explicitly modelled latent topics. Through evaluations on a range of data sets we compare our models with previously proposed neural document models and traditional probabilistic topic models, demonstrating their robustness and effectiveness.

Parameterising Topic Distributions

In probabilistic topic models, such as LDA (Blei et al., 2003), we use the latent variables $\theta_{d}$ and $z_{n}$ for the topic proportion of document $d$ , and the topic assignment for the observed word $w_{n}$ , respectively. In order to facilitate efficient inference, the Dirichlet distribution (or Dirichlet process (Teh et al., 2006)) is employed as the prior to generate the parameters of the multinomial distribution $\theta_{d}$ for each document. The use of a conjugate prior allows the tractable computation of the posterior distribution over the latent variables’ values. While alternatives have been explored, such as log-normal topic distributions (Blei & Lafferty, 2006, 2007), extra approximation (e.g. the Laplace approximation (Wang & Blei, 2013)) is required for closed form derivations. The generative process of LDA is:

where $\beta_{z_{n}}$ represents the topic distribution over words given topic assignment $z_{n}$ and $N_{d}$ is the number of tokens in document $d$ . $\beta_{z_{n}}$ can be drawn from another Dirichlet distribution, but here we consider it a model parameter. $\alpha_{0}$ is the hyper-parameter of the Dirichlet prior and $N_{d}$ is the total number of words in document $d$ . The marginal likelihood for a document in collection $D$ is:

If we employ mean-field variational inference, the updates for the variational parameters $q(\theta)$ and $q(z_{n})$ can be directly derived in closed form.

In contrast, our proposed models introduce a neural network to parameterise the multinomial topic distribution. The generative process is:

where $G(\mu_{0},\sigma_{0}^{2})$ is composed of a neural network $\theta=g(x)$ conditioned on a isotropic Gaussian $x\sim N(\mu_{0},\sigma_{0}^{2})$ Throughout this presentation we employ diagonal Gaussian distributions. As such we use $N(\mu,\sigma^{2})$ to represent the Gaussian distributions, where $\sigma^{2}$ is the diagonal of the covariance matrix.. The marginal likelihood is:

Compared to Equation (1), here we parameterise the latent variable $\theta$ by a neural network conditioned on a draw from a Gaussian distribution. To carry out neural variational inference (Miao et al., 2016), we construct an inference network $q(\theta|\mu(d),\sigma(d))$ to approximate the posterior $p(\theta|d)$ , where $\mu(d)$ and $\sigma(d)$ are functions of $d$ that are implemented by multilayer perceptrons (MLP). By using a Gaussian prior distribution, we are able to employ the re-parameterisation trick (Kingma & Welling, 2014) to build an unbiased and low-variance gradient estimator for the variational distribution. Without conjugacy, the updates of the parameters can still be derived directly and easily from the variational lower bound. We defer discussion of the inference process until the next section. Here we introduce several alternative neural networks for $g(x)$ which transform a Gaussian sample $x$ into the topic proportions $\theta$ .

In deep learning, an energy-based function is generally used to construct probability distributions (LeCun et al., 2006). Here we pass a Gaussian random vector through a softmax function to parameterise the multinomial document topic distributions. Thus $\theta\sim G_{\text{GSM}}(\mu_{0},\sigma_{0}^{2})$ is defined as:

where $W_{1}$ is a linear transformation, and we leave out the bias terms for brevity. $\mu_{0}$ and $\sigma_{0}^{2}$ are hyper-parameters which we set for a zero mean and unit variance Gaussian.

2 The Gaussian Stick Breaking Construction

For instance assume $K=3$ , $\theta$ is generated by 2 breaks where $\theta_{1}=\eta_{1}$ , $\theta_{2}=\eta_{2}(1-\eta_{1})$ and the remaining stick $\theta_{3}=(1-\eta_{2})(1-\eta_{1})$ . If the model proceeds to break the stick for $K=4$ , the remaining stick $\theta_{3}$ is broken into $(\theta_{3}^{\prime},\theta_{4}^{\prime})$ , where $\theta_{3}^{\prime}=\eta_{3}\cdot\theta_{3}$ , $\theta_{4}^{\prime}=(1-\eta_{3})\cdot\theta_{3}$ and $\theta_{3}=\theta_{3}^{\prime}+\theta_{4}^{\prime}$ . Hence, for different values of $K$ , it always satisfies $\sum_{k=1}^{K}\theta_{k}=1$ . The stick breaking construction $f_{\text{SB}}(\eta)$ is illustrated in Figure 1 and the distribution $\theta\sim G_{\text{GSB}}(\mu_{0},\sigma_{0}^{2})$ is defined as:

Although the Gaussian stick breaking construction breaks exchangeability, compared to the stick breaking definition of the Dirichlet process, it does provide a more amenable form for neural variational inference. More interestingly, this stick breaking construction introduces a non-parametric aspect to neural topic models.

3 The Recurrent Stick Breaking Construction

Recurrent Neural Networks (RNN) are commonly used for modelling sequences of inputs in deep learning. Here we consider the stick breaking construction as a sequential draw from an RNN, thus capturing an unbounded number of breaks with a finite number of parameters. Conditioned on a Gaussian latent variable $x$ , the recurrent neural network $f_{\text{SB}}(x)$ produces a sequence of binomial logits which are used to break the stick sequentially. The $f_{\text{RNN}}(x)$ is decomposed as:

where $h_{k}$ is the output of the $k$ th state, which we feed into the next state of the RNN ${}_{\text{SB}}$ as an input. Figure 2 shows the recurrent neural network structure. Now $\theta\sim G_{\text{RSB}}(\mu_{0},\sigma_{0}^{2})$ is defined as:

where $f_{\text{SB}}(\eta)$ is equivalent to the stick breaking function used in the Gaussian stick breaking construction. Here, the RNN is able to dynamically produce new logits to break the stick ad infinitum. The expressive power of the RNN to model sequences of unbounded length is still bounded by the parametric model’s capacity, but for topic modelling it is adequate to model the countably infinite topics amongst the documents in a truncation-free fashion.

Models

Given the above described constructions for the topic distributions, in this section we introduce our family of neural topic models and corresponding inference methods.

where $q(\theta|d)$ is the variational distribution approximating the true posterior $p(\theta|d)$ . Following the framework of neural variational inference (Miao et al., 2016; Kingma & Welling, 2014; Rezende et al., 2014), we introduce an inference network conditioned on the observed document $d$ to generate the variational parameters $\mu(d)$ and $\sigma(d)$ so that we can estimate the lower bound by sampling $\theta$ from $q(\theta|d)=G(\theta|\mu(d),\sigma^{2}(d))$ . In practise we reparameterise $\hat{\theta}=\mu(d)+\hat{\epsilon}\cdot\sigma(d)$ with the sample $\hat{\epsilon}\in\mathcal{N}(0,I)$ .

Since the generative distribution $p(\theta|\mu_{0},\sigma_{0}^{2})=p(g(x)|\mu_{0},\sigma_{0}^{2})=p(x|\mu_{0},\sigma_{0}^{2})$ and the variational distribution $q(\theta|d)=q(g(x)|d)=q(x|\mu(d),\sigma^{2}(d))$ , the KL term in Equation (3) can be easily integrated as a Gaussian KL-divergence. Note that, the parameterisation network $g(x)$ and its parameters are shared across all the documents. In addition, given a sampled $\hat{\theta}$ , the latent variable $z_{n}$ can be integrated out as:

Thus there is no need to introduce another variational approximation for the topic assignment $z$ . The variational lower bound is therefore:

We can directly derive the gradients of the generative parameters $\Theta$ , including $t$ , $v$ and $g(x)$ . While for the variational parameters $\Phi$ , including $\mu(d)$ and $\sigma(d)$ , we use the gradient estimators:

$\Theta$ and $\Phi$ are jointly updated by stochastic gradient back-propagation. The structure of this variational auto-encoder is illustrated in Figure 3.

2 Recurrent Neural Topic Models

where $\hat{\theta}^{i}$ corresponds to the topic distribution over words $\beta^{t}$ . In order to dynamically increase the number of topics, the model proposes the $i$ th break on the stick to split the $(i+1)$ th topic. In this case, RNN ${}_{\text{Topic}}$ proceeds to the next state and generates topic $t^{(i+1)}$ for $\beta^{(i+1)}$ and the RNN ${}_{\text{SB}}$ generates $\hat{\theta}^{(i+1)}$ by an extra break of the stick. Firstly, we compute the likelihood increase brought by topic $i$ across the documents $D$ :

Then, we employ an acceptance hyper-parameter $\gamma$ to decide whether to generate a new topic. If $\mathcal{I}>\gamma$ , the previous proposed new topic (the $i$ th topic) contributes to the generation of words and we increase the active number of topics $i$ by 1, otherwise we keep the current $i$ unchanged. Thus $\gamma$ controls the rate at which the model generates new topics. In practise, the increase of the lower bound is computed over mini-batches so that the model is able to generate new topics before the current epoch is finished. The details of the algorithm are described in Algorithm 1.

3 Topic vs. Document Models

In most topic models, documents are modelled by a mixture of topics, and each word is associated with a single topic latent variable, e.g. LDA and GSM. However, the NVDM (Miao et al., 2016) is implemented as a VAE (Kingma & Welling, 2014) without modelling topics explicitly. The major difference is that NVDM employs a softmax decoder (Equation 5) to generate all of the words of a document conditioned on the document representation $\hat{\theta}$ :

where both $\hat{\theta}$ and $\beta$ are unnormalised. Hence, it breaks the topic model assumption that each document consists of a mixture of topics. Although the latent variables can still be interpreted as topics, these topics are not modelled explicitly since there is no actual topic distribution over words. Srivastava & Sutton (2016) interprets the above decoder as a weighted product of experts topic model, here however, we refer to such models that do not directly assign topics to words as document models instead of topic models. We can also convert our neural topic models to neural document models by replacing the mixture decoder (Equation 4) with the softmax decoder (Equation 5). For example in the GSM construction, if we remove the softmax function over $\theta$ , and directly apply Equation 5 to generate the words, it reduces to a variant of the NVDM (GSM applies topic and word vectors to compute $\beta$ , while NVDM directly models a projection $\beta$ from the latent variables to words).

Related Work

Topic models have been extensively studied for a variety of applications in document modelling and information retrieval. Beyond LDA, significant extensions have sought to capture topic correlations (Blei & Lafferty, 2007), model temporal dependencies (Blei & Lafferty, 2006) and discover an unbounded number of topics (Teh et al., 2006). Topic models have been extended to capture extra context information such as time (Wang & McCallum, 2006), authorship (Rosen-Zvi et al., 2004), and class labels (Mcauliffe & Blei, 2008). Such extensions often require carefully tailored graphical models, and associated inference algorithms, to capture the desired context. Neural models provide a more generic and extendable option and a number of works have sought to leverage these, such as the Replicated Softmax (Hinton & Salakhutdinov, 2009), the Auto-Regressive Document Model (Larochelle & Lauly, 2012), Sigmoid Belief Document Model (Mnih & Gregor, 2014), Variational Auto-Encoder Document Model (NVDM) (Miao et al., 2016) and TopicRNN Model (Dieng et al., 2016). However, these neural works do not explicitly capture topic assignments.

The recent work of Srivastava & Sutton (2016) also employs neural variational inference to train topic models and is closely related to our work. Their model follows the original LDA formulation in keeping the Dirichlet-Multinomial parameterisation and applies a Laplace approximation to allow gradient to back-propagate to the variational distribution. In contrast, our models directly parameterise the multinomial distribution with neural networks and jointly learn the model and variational parameters during inference. Nalisnick & Smyth (2016) proposes a reparameterisation approach for continuous latent variables with Beta prior, which enables neural variational inference for Dirichlet process. However, Taylor expansion is required to approximate the KL Divergence while having multiple draws from the Kumaraswamy variational distribution. In our case, we can easily apply the Gaussian reparametersation trick with only one draw from the Gaussian distribution.

Experiments

We perform an experimental evaluation employing three datasets: MXMhttp://labrosa.ee.columbia.edu/millionsong/musixmatch (Bertin-Mahieux et al., 2011) song lyrics, 20NewsGroupshttp://qwone.com/ jason/20Newsgroups and Reuters RCV1-v2http://trec.nist.gov/data/reuters/reuters.html news. MXM is the official lyrics collection of the Million Song Dataset with 210,519 training and 27,143 testing datapoints respectively. The 20NewsGroups corpus is divided into 11,314 training and 7,531 testing documents, while the RCV1-v2 corpus is a larger collection with 794,414 training and 10,000 test cases from Reuters newswire stories. We employ the original 5,000 vocabulary provided for MXM, while the other two datasets are processed by stemming, filtering stopwords and taking the most frequent 2,000We use the vocabulary provided by Srivastava & Sutton (2016) for direct comparison. and 10,000 words as the vocabularies.

The hidden dimension of the MLP for constructing $q(\theta|d)$ is 256 for all the neural topic models and the benchmarks that apply neural variational inference (e.g. NVDM, proLDA, NVLDA), and 0.8 dropout is applied on the output of the MLP before parameterising the diagonal Gaussian distribution. Grid search is carried out on learning rate and batch size for achieving the held-out perplexity. For the recurrent stick breaking construction we use a one layer LSTM cell (256 hidden units) for constructing the recurrent neural network. For the finite topic models we set the maximum number of topics $K$ as 50 and 200. The models are trained by Adam (Kingma & Ba, 2015) and only one sample is used for neural variational inference. We follow the optimisation strategy of Miao et al. (2016) by alternately updating the model parameters and the inference network. To alleviate the redundant topics issue, we also apply topic diversity regularisation (Xie et al., 2015) while carrying out neural variational inference (Appendix B).

We use Perplexity as the main metric for assessing the generalisation ability of our generative models. Here we use the variational lower bound to estimate the document perplexity: $\text{exp}(-\frac{1}{D}\sum_{d}^{D}\frac{1}{N_{d}}\log p(d))$ following Miao et al. (2016). Table 1 presents the test document perplexities of the topic models on the three datasets. Amongst the finite topic models, the Gaussian softmax construction (GSM) achieves the lowest perplexity in most cases, while all of the GSM, GSB and RSB models are significantly better than the benchmark LDA and NVLDA models. Amongst our selection of unbounded topic models, we compare our truncation-free RSB model, which applies an RNN to dynamically increase the active topics ( $\gamma$ is empirically set as $5e^{-5}$ ), with the traditional non-parametric HDP topic model (Teh et al., 2006). Here we see that the recurrent neural topic model performs significantly better than the HDP topic model on perplexity.

Next we evaluate our neural network parameterisations as document models with the implicit topic distribution introduced in Section 3.3. Table 2 compares the proposed neural document models with the benchmarks. According to our experimental results, the generalisation abilities of the GSM, GSB and RSB models are all improved by switching to an implicit topic distribution, and their performance is also significantly better than the NVDM and ProdLDA. We hypothesise that this effect is due to the models not needing to infer the topic-word assignments, which makes optimisation much easier. Interestingly, the RSB model performs better than the GSM and GSB on 20NewsGroups in both the 50 and 200 topic settings. This is possibly due to the fact that GSM and GSB apply linear transformations $W_{1}$ and $W_{2}$ to generate the hidden variable $\theta$ and breaking proportions $\eta$ from a Gaussian draw, while the RSB applies recurrent neural networks to produce $\eta$ in a sequence which induces dependencies in $\eta$ and helps escape local minima. It is worth noting that the recurrent neural network uses more parameters than the other two models. As mentioned in Section 3.3, GSM is a variant of NVDM that applies topic and word vectors to construct the topic distribution over words instead of directly modelling a multinomial distribution by a softmax function, which further simplifies optimisation. If it is not necessary to model the explicit topic distribution over words, using an implicit topic distribution may lead to better generalisation.

To further demonstrate the effectiveness of the stick-breaking construction, Figure 5 presents the average probability of each topic by estimating the posterior probability $q(z|d)$ of each document from 20NewsGroups. Here we set the number of topics to 400, which is large enough for this dataset. Figure 5a shows that the topics with higher probability are evenly distributed. While in Figure 5a the higher probability ones are placed in the front, and we can see a small tail on the topics after 300. Due to the sparsity inducing property of the stick-breaking construction, the topics on the tail are less likely to be sampled. This is also the advantage of stick-breaking construction when we apply the RSB-TF as a non-parameteric topic model, since the model activates the topics according to the knowledge learned from data and it becomes less sensitive to the hyperparameter controlling the initial number of topics. Figure 6 shows the impact on test perplexity for the neural topic models when the maximum number of topics is increased. We can see that the performance of the GSM model gets worse if the maximum number of topics exceeds 400, but the GSB and RSB are stable even though the number of topics far outstrips that which the model requires. In addition, the RSB model performs better than GSB when the number of topics is under 200, but it becomes slightly worse than GSB when the number exceeds 400, possibly due to the difficulty of learning long sequences with RNNs.

Figure 7 shows the convergence process of the truncation-free RSB (RSB-TF) model on the 20NewsGroups. With different initial number of topics, 10, 30, and 50. The RSB-TF dynamically increases the number of active topics to achieve a better variational lower bound. We can see the training perplexity keeps decreasing while the RSB-TF activates more topics. The numbers of active topics will stabilise when the convergence point is approaching (normally between 200 and 300 active topics on the 20NewsGroups). Hence, as a non-parametric model, RSB-TF is not sensitive to the initial number of active topics.

In addition since the quality of the discovered topics is not directly reflected by perplexity (i.e. a function of log-likelihood), we evaluate the topic observed coherence by normalised point-wise mutual information (NPMI) (Lau et al., 2014). Table 3 shows the topic observed coherence achieved by the finite neural topic models. According to these results, there does not appear to be a significant difference in topic coherence amongst the neural topic models. We observe that in both the GSB and RSB, the NPMI scores of the former topics in the stick breaking order are higher than the latter ones. It is plausible as the stick-breaking construction implicitly assumes the order of the topics, the former topics obtain more sufficient gradients to update the topic distributions. Likewise we present the results obtained by the neural document models with implicit topic distributions. Though the topic probability distribution over words does not exist, we could rank the words by the positiveness of the connections between the words and each dimension of the latent variable. Interestingly the performance of these document models are significantly better than their topic model counterparts on topic coherence. The results of RSB-TF and HDP are not presented due to the fact that the number of active topics is dynamic, which makes these two models not directly comparable to the others. To further demonstrate the quality of the topics, we produce a t-SNE projection for the estimated topic proportions of each document in Figure 8.

Conclusion

In this paper we have introduced a family of neural topic models using the Gaussian Softmax, Gaussian Stick-Breaking and Recurrent Stick-Breaking constructions for parameterising the latent multinomial topic distributions of each document. With the help of the stick-breaking construction, we are able to build neural topic models which exhibit similar sparse topic distributions as found with traditional Dirichlet-Multinomial models. By exploiting the ability of recurrent neural networks to model sequences of unbounded length, we further present a truncation-free variational inference method that allows the number of topics to dynamically increase. The evaluation results show that our neural models achieve state-of-the-art performance on a range of standard document corpora.

References

Appendix A Discovered Topics

Table 4 presents the topics by the words with highest probability (top-10 words) achieved by different neural topic models on 20NewsGroups dataset.

Appendix B Topic Diversity

An issue that exists in both probabilistic and neural topic models is redundant topics. In neural models, however, we are able to straightforwardly regularise the distance between each of the topic vectors in order to diversify the topics. Following Xie et al. (2015), we apply such topic diversity regularisation during the inference process. We compute the angles between each two topics $a(t_{i},t_{j})=arccos(\frac{|t_{i}\cdot t_{j}|}{||t_{i}||\cdot||t_{j}||})$ . Then, the mean angle of all pairs of $K$ topics is $\zeta=\frac{1}{K^{2}}\sum_{i}\sum_{j}a(t_{i},t_{j})$ , and the variance is $\nu=\frac{1}{K^{2}}\sum_{i}\sum_{j}(a(t_{i},t_{j})-\zeta)^{2}$ . We add the following topic diversity regularisation to the variational objective:

where $\lambda$ is a hyper-parameter for the regularisation that is set as 0.1 in the experiments. During training, the mean angle is encouraged to be larger while the variance is suppressed to be smaller so that all of the topics will be pushed away from each other in the topic semantic space. Though in practice diversity regularisation does not provide a significant improvement to perplexity ( $2\!\!\sim\!\!5$ in most cases), it helps reduce topic redundancy and can be easily applied on topic vectors instead of the simplex over the full vocabulary.