MetaLDA: a Topic Model that Efficiently Incorporates Meta information

He Zhao, Lan Du, Wray Buntine, Gang Liu

I Introduction

With the rapid growth of the internet, huge amounts of text data are generated in social networks, online shopping and news websites, etc. These data create demand for powerful and efficient text analysis techniques. Probabilistic topic models such as Latent Dirichlet Allocation (LDA) are popular approaches for this task, by discovering latent topics from text collections. Many conventional topic models discover topics purely based on the word-occurrences, ignoring the meta information (a.k.a., side information) associated with the content. In contrast, when we humans read text it is natural to leverage meta information to improve our comprehension, which includes categories, authors, timestamps, the semantic meanings of the words, etc. Therefore, topic models capable of using meta information should yield improved modelling accuracy and topic quality.

In practice, various kinds of meta information are available at the document level and the word level in many corpora. At the document level, labels of documents can be used to guide topic learning so that more meaningful topics can be discovered. Moreover, it is highly likely that documents with common labels discuss similar topics, which could further result in similar topic distributions. For example, if we use authors as labels for scientific papers, the topics of the papers published by the same researcher can be closely related.

At the word level, different semantic/syntactic features are also accessible. For example, there are features regarding word relationships, such as synonyms obtained from WordNet, word co-occurrence patterns obtained from a large corpus, and linked concepts from knowledge graphs. It is preferable that words having similar meaning but different morphological forms, like “dog” and “puppy”, are assigned to the same topic, even if they barely co-occur in the modelled corpus. Recently, word embeddings generated by GloVe and word2vec have attracted a lot of attention in natural language processing and related fields. It has been shown that word embeddings can capture both the semantic and syntactic features of words so that similar words are close to each other in the embedding space. It seems reasonable to expect that these word embeddings will improve topic modelling.

Conventional topic models can suffer from a large performance degradation over short texts (e.g., tweets and news headlines) because of insufficient word co-occurrence information. In such cases, meta information of documents and words can play an important role in analysing short texts by compensating for the lost information in word co-occurrences. At the document level, for example, tweets are usually associated with hashtags, users, locations, and timestamps, which can be used to alleviate the data sparsity problem. At the word level, word semantic similarity and embeddings obtained or trained on a large external corpus (e.g., Google News or Wikipedia) have been proven useful in learning meaningful topics from short texts.

The benefit of using document and word meta information separately has been shown in several existing models. However, in existing models this is usually not efficient enough due to non-conjugacy and/or complex model structures. Moreover, only one kind of meta information (either at document level or at word level) is used in most existing models. In this paper, we propose MetaLDA (code at https://github.com/ethanhezhao/MetaLDA/), a topic model that can effectively and efficiently leverage arbitrary document and word meta information encoded in binary form. Specifically, the labels of a document in MetaLDA are incorporated in the prior of the per-document topic distributions. If two documents have similar labels, their topic distributions should be generated with similar Dirichlet priors. Analogously, at the word level, the features of a word are incorporated in the prior of the per-topic word distributions, which encourages words with similar features to have similar weights across topics. Therefore, both document and word meta information, if and when they are available, can be flexibly and simultaneously incorporated using MetaLDA. MetaLDA has the following key properties:

MetaLDA jointly incorporates various kinds of document and word meta information for both regular and short texts, yielding better modelling accuracy and topic quality.

With the data augmentation techniques, the inference of MetaLDA can be done by an efficient and closed-form Gibbs sampling algorithm that benefits from the full local conjugacy of the model.

The simple structure of incorporating meta information and the efficient inference algorithm give MetaLDA an advantage in running speed over other models with meta information.

We conduct extensive experiments with several real datasets including regular and short texts in various domains. The experimental results demonstrate that MetaLDA achieves improved performance in terms of perplexity, topic coherence, and running time.

II Related Work

In this section, we review three lines of related work: models with document meta information, models with word meta information, and models for short texts.

At the document level, Supervised LDA (sLDA) models document labels by learning a generalised linear model with an appropriate link function and exponential family dispersion function. However, sLDA is restricted to one label per document. Labelled LDA (LLDA) assumes that each label has a corresponding topic and a document is generated by a mixture of the topics. Although multiple labels are allowed, LLDA requires that the number of topics equal the number of labels, i.e., exactly one topic per label. As an extension to LLDA, Partially Labelled LDA (PLLDA) relaxes this requirement by assigning multiple topics to a label. The Dirichlet Multinomial Regression (DMR) model incorporates document labels on the prior of the topic distributions like our MetaLDA but with the logistic-normal transformation. As full conjugacy does not exist in DMR, a part of the inference has to be done by numerical optimisation, which is slow for large sets of labels and topics. Similarly, in the Hierarchical Dirichlet Scaling Process (HDSP), conjugacy is broken as well since the topic distributions have to be renormalised. A Poisson factorisation model with hierarchical document labels has also been introduced, but its techniques cannot be applied to regular topic models as the topic proportion vectors are also unnormalised.

Recently, there has been growing interest in incorporating word features in topic models. For example, DF-LDA incorporates word must-links and cannot-links using a Dirichlet forest prior in LDA; MRF-LDA encodes word semantic similarity in LDA with a Markov random field; WF-LDA extends LDA to model word features with the logistic-normal transform; LF-LDA integrates word embeddings into LDA by replacing the topic-word Dirichlet multinomial component with a mixture of a Dirichlet multinomial component and a word embedding component; and, instead of generating word types (tokens), Gaussian LDA (GLDA) directly generates word embeddings with the Gaussian distribution. Despite the exciting applications of the above models, their inference is usually less efficient due to non-conjugacy and/or complicated model structures.

Analysis of short text with topic models has been an active area with the development of social networks. Generally, there are two ways to deal with the sparsity problem in short texts: either using the intrinsic properties of short texts or leveraging meta information. For the first way, one popular approach is to aggregate short texts into pseudo-documents. For example, one model aggregates tweets containing the same word, and, more recently, PTM aggregates short texts into latent pseudo documents. Another approach is to assume one topic per short document, known as mixture of unigrams or Dirichlet Multinomial Mixture (DMM). For the second way, document meta information can be used to aggregate short texts. For example, tweets have been aggregated by their authors, and aggregating tweets by their hashtags has been shown to yield superior performance over other aggregation methods. Closely related to our work are the models that use word features for short texts. For example, one extension of GLDA to short texts samples an indicator variable that chooses to generate either the type of a word or the embedding of a word, and GPU-DMM extends DMM with word semantic similarity obtained from embeddings for short texts. Despite the improved performance, challenges remain for existing models: (1) for aggregation-based models, it is usually hard to choose which meta information to use for aggregation; (2) the “single topic” assumption makes DMM models lose the flexibility to capture different topic ingredients of a document; and (3) the incorporation of meta information in the existing models is usually less efficient.

To our knowledge, attempts to jointly leverage document and word meta information are relatively rare. For example, meta information can be incorporated by first-order logic in Logit-LDA and score functions in SC-LDA. However, the first-order logic and score functions need to be defined for different kinds of meta information and the definition can be infeasible for incorporating both document and word meta information simultaneously.

III The MetaLDA Model

Given a corpus, LDA uses the same Dirichlet prior for all the per-document topic distributions and the same prior for all the per-topic word distributions. In MetaLDA, by contrast, each document has a specific Dirichlet prior on its topic distribution, which is computed from the meta information of the document, and the parameters of the prior are estimated during training. Similarly, each topic has a specific Dirichlet prior computed from the word meta information. Here we elaborate on MetaLDA, in particular on how the meta information is incorporated. Hereafter, we will use labels as document meta information, unless otherwise stated.

Fig. 1 shows the graphical model of MetaLDA and the generative process is as follows:

1. For each topic $k$:
   (a) For each document label $l$: Draw $\lambda_{l,k}$ from a gamma prior with mean 1 controlled by $\mu_{0}$
   (b) For each word feature $l^{\prime}$: Draw $\delta_{l^{\prime},k}$ from a gamma prior with mean 1 controlled by $\nu_{0}$
   (c) For each token $v$: Compute $\beta_{k,v}=\prod_{l^{\prime}=1}^{L_{word}}\delta_{l^{\prime},k}^{g_{v,l^{\prime}}}$
   (d) Draw $\boldsymbol{\phi}_{k}\sim\text{Dir}_{V}(\boldsymbol{\beta}_{k})$
2. For each document $d$:
   (a) For each topic $k$: Compute $\alpha_{d,k}=\prod_{l=1}^{L_{doc}}\lambda_{l,k}^{f_{d,l}}$
   (b) Draw $\boldsymbol{\theta}_{d}\sim\text{Dir}_{K}(\boldsymbol{\alpha}_{d})$
   (c) For each word position $i$ in document $d$:
       i. Draw topic $z_{d,i}\sim\text{Cat}_{K}(\boldsymbol{\theta}_{d})$
       ii. Draw word $w_{d,i}\sim\text{Cat}_{V}(\boldsymbol{\phi}_{z_{d,i}})$

where $\text{Ga}(\cdot,\cdot)$, $\text{Dir}(\cdot)$, and $\text{Cat}(\cdot)$ are the gamma distribution, the Dirichlet distribution, and the categorical distribution respectively. $K$, $\mu_{0}$, and $\nu_{0}$ are the hyper-parameters.
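As an illustration of Steps 1c to 2c, the following Python sketch computes the two priors from binary meta information and then samples a few words. It is a toy example under stated assumptions: the matrices `F` and `G`, the weights `lam` and `delta` (set by hand here rather than drawn from their gamma priors), and all function names are hypothetical and are not taken from the MetaLDA code base.

```python
import math
import random

def dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def prior_from_features(features, weights):
    """prior[i][k] = product over active features l of weights[l][k] (features are binary)."""
    n_topics = len(weights[0])
    prior = []
    for row in features:
        active = [l for l, bit in enumerate(row) if bit == 1]
        prior.append([math.prod(weights[l][k] for l in active) for k in range(n_topics)])
    return prior

rng = random.Random(0)
K = 2
F = [[1, 0], [1, 1]]               # hypothetical document labels, F[d][l]
G = [[1, 0], [0, 1], [1, 1]]       # hypothetical word features, G[v][l']
lam = [[2.0, 0.5], [1.0, 3.0]]     # lambda[l][k], fixed by hand for the example
delta = [[0.5, 1.0], [2.0, 0.1]]   # delta[l'][k], fixed by hand for the example

alpha = prior_from_features(F, lam)                  # Step 2a: alpha[d][k]
beta_vk = prior_from_features(G, delta)              # Step 1c, indexed [v][k]
beta = [[beta_vk[v][k] for v in range(len(G))] for k in range(K)]  # beta[k][v]

phi = [dirichlet(beta[k], rng) for k in range(K)]    # Step 1d
docs = []
for d in range(len(F)):
    theta = dirichlet(alpha[d], rng)                 # Step 2b
    words = []
    for _ in range(5):                               # Step 2c, five tokens per document
        z = rng.choices(range(K), weights=theta)[0]
        words.append(rng.choices(range(len(G)), weights=phi[z])[0])
    docs.append(words)
```

The second document carries both labels, so its prior is the element-wise product of both rows of `lam`, whereas the first document only inherits the first row.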

To incorporate document labels, MetaLDA learns a specific Dirichlet prior over the topics for each document by using the label information. Specifically, the information of document $d$’s labels is incorporated in $\boldsymbol{\alpha}_{d}$, the parameter of the Dirichlet prior on $\boldsymbol{\theta}_{d}$. As shown in Step 2a, $\alpha_{d,k}$ is computed as a log-linear combination of the labels $f_{d,l}$. Since $f_{d,l}$ is binary, $\alpha_{d,k}$ is simply the product of $\lambda_{l,k}$ over all the active labels of document $d$, i.e., $\{l\mid f_{d,l}=1\}$. Drawn from the gamma distribution with mean 1, $\lambda_{l,k}$ controls the impact of label $l$ on topic $k$. If label $l$ has little or no impact on topic $k$, $\lambda_{l,k}$ is expected to be 1 or close to 1, and then $\lambda_{l,k}$ will have little or no influence on $\alpha_{d,k}$, and vice versa. The hyper-parameter $\mu_{0}$ controls the variation of $\lambda_{l,k}$. The incorporation of word features is analogous but in the parameter of the Dirichlet prior on the per-topic word distributions, as shown in Step 1c.

The intuition behind our way of incorporating meta information is as follows. At the document level, if two documents have more labels in common, their Dirichlet parameters $\boldsymbol{\alpha}_{d}$ will be more similar, resulting in more similar topic distributions $\boldsymbol{\theta}_{d}$. At the word level, if two words have similar features, their $\beta_{k,v}$ in topic $k$ will be similar and then we can expect that their $\phi_{k,v}$ will be more or less the same. Finally, the two words will have similar probabilities of showing up in topic $k$. In other words, if a topic “prefers” a certain word, we expect that it will also prefer other words with similar features to that word. Moreover, at both the document and the word level, different labels/features may have different impact on the topics ($\lambda$/$\delta$), which is automatically learnt in MetaLDA.

IV Inference

Unlike most existing methods, our way of incorporating the meta information facilitates the derivation of an efficient Gibbs sampling algorithm. With two data augmentation techniques (i.e., the introduction of auxiliary variables), MetaLDA admits local conjugacy and a closed-form Gibbs sampling algorithm can be derived. Note that MetaLDA incorporates the meta information on the Dirichlet priors, so we can still use LDA’s collapsed Gibbs sampling algorithm for the topic assignment $z_{d,i}$. Moreover, Steps 2a and 1c show that one only needs to consider the non-zero entries of $\mathbf{F}$ and $\mathbf{G}$ in computing the full conditionals, which further reduces the inference complexity.

Similar to LDA, the complete model likelihood (i.e., joint distribution) of MetaLDA is:

where $n_{k,v}=\sum_{d=1}^{D}\sum_{i=1}^{N_{d}}\boldsymbol{1}_{(w_{d,i}=v,z_{d,i}=k)}$, $m_{d,k}=\sum_{i=1}^{N_{d}}\boldsymbol{1}_{(z_{d,i}=k)}$, and $\boldsymbol{1}_{(\cdot)}$ is the indicator function.

To sample $\lambda_{l,k}$, we first marginalise out $\theta_{d,k}$ in the right part of Eq. (1) with the Dirichlet multinomial conjugacy:

where $\alpha_{d,\cdot}=\sum_{k=1}^{K}\alpha_{d,k}$, $m_{d,\cdot}=\sum_{k=1}^{K}m_{d,k}$, and $\Gamma(\cdot)$ is the gamma function. Gamma ratio 1 in Eq. (2) can be augmented with a set of Beta random variables $q_{1:D}$ as:

where for each document $d$, $q_{d}\sim\text{Beta}(\alpha_{d,\cdot},m_{d,\cdot})$. Given a set of $q_{1:D}$ for all the documents, Gamma ratio 1 can be approximated by the product of $q_{1:D}$, i.e., $\prod_{d=1}^{D}q_{d}^{\alpha_{d,\cdot}}$.
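For reference, the two standard identities behind this step can be written in this section's notation as follows. This is a sketch derived from the definitions above; which factor is labelled "Gamma ratio 1" and which "Gamma ratio 2" is inferred from the surrounding text.

```latex
% Dirichlet-multinomial marginal for document d (theta_d integrated out)
\int \prod_{k=1}^{K}\theta_{d,k}^{m_{d,k}}\,
     \text{Dir}_{K}(\boldsymbol{\theta}_{d};\boldsymbol{\alpha}_{d})\,
     d\boldsymbol{\theta}_{d}
= \underbrace{\frac{\Gamma(\alpha_{d,\cdot})}{\Gamma(\alpha_{d,\cdot}+m_{d,\cdot})}}_{\text{Gamma ratio 1}}
  \prod_{k=1}^{K}
  \underbrace{\frac{\Gamma(\alpha_{d,k}+m_{d,k})}{\Gamma(\alpha_{d,k})}}_{\text{Gamma ratio 2}}

% Beta integral that introduces q_d (valid for m_{d,\cdot} > 0)
\frac{\Gamma(\alpha_{d,\cdot})}{\Gamma(\alpha_{d,\cdot}+m_{d,\cdot})}
= \frac{1}{\Gamma(m_{d,\cdot})}
  \int_{0}^{1} q_{d}^{\alpha_{d,\cdot}-1}(1-q_{d})^{m_{d,\cdot}-1}\,dq_{d}
```

The integrand of the second identity is an unnormalised $\text{Beta}(\alpha_{d,\cdot},m_{d,\cdot})$ density, which is why $q_{d}$ is drawn from that distribution.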

Gamma ratio 2 in Eq. (2) is the Pochhammer symbol for a rising factorial, which can be augmented with an auxiliary variable $t_{d,k}$ as follows:

where $S^{m}_{t}$ indicates an unsigned Stirling number of the first kind. Since Gamma ratio 2 is a normalising constant for the probability of the number of tables in the Chinese Restaurant Process (CRP), $t_{d,k}$ can be sampled by a CRP with $\alpha_{d,k}$ as the concentration and $m_{d,k}$ as the number of customers:

where $\text{Bern}(\cdot)$ samples from the Bernoulli distribution. The complexity of sampling $t_{d,k}$ by Eq. (5) is $\mathcal{O}(m_{d,k})$. For large $m_{d,k}$, as the standard deviation of $t_{d,k}$ is $\mathcal{O}(\sqrt{\log m_{d,k}})$, one can sample $t_{d,k}$ in a small window around the current value in complexity $\mathcal{O}(\sqrt{\log m_{d,k}})$.
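The table-count draw can be sketched in Python as a sum of Bernoulli draws. The sketch assumes the standard CRP seating rule, in which the $i$-th customer opens a new table with probability $\alpha_{d,k}/(\alpha_{d,k}+i-1)$; the function name and this indexing convention are assumptions rather than a transcription of Eq. (5).

```python
import random

def sample_num_tables(concentration, n_customers, rng):
    """Number of occupied tables after seating n_customers in a CRP.

    Customer i (1-based) opens a new table with probability
    concentration / (concentration + i - 1), so the first customer always does.
    The loop makes the cost linear in n_customers.
    """
    tables = 0
    for i in range(1, n_customers + 1):
        if rng.random() < concentration / (concentration + i - 1):
            tables += 1
    return tables

rng = random.Random(0)
t_dk = sample_num_tables(concentration=1.5, n_customers=20, rng=rng)
```

This is the plain linear-time version; the windowed variant mentioned above for large counts is not shown.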

By ignoring the terms unrelated to $\alpha$, the augmentation of Eq. (4) can be simplified to a single term $\alpha_{d,k}^{t_{d,k}}$. With auxiliary variables now introduced, we simplify Eq. (2) to:

Replacing $\alpha_{d,k}$ with $\lambda_{l,k}$, we can get:

Recall that all the document labels are binary and $\lambda_{l,k}$ is involved in computing $\alpha_{d,k}$ iff $f_{d,l}=1$. Extracting all the terms related to $\lambda_{l,k}$ in the preceding expression, we get the marginal posterior of $\lambda_{l,k}$:

where $\frac{\alpha_{d,k}}{\lambda_{l,k}}$ is the value of $\alpha_{d,k}$ with $\lambda_{l,k}$ removed when $f_{d,l}=1$. With the data augmentation techniques, the posterior is transformed into a form that is conjugate to the gamma prior of $\lambda_{l,k}$. Therefore, it is straightforward to yield the following sampling strategy for $\lambda_{l,k}$:

We can compute and cache the value of $\alpha_{d,k}$ first. After $\lambda_{l,k}$ is sampled, $\alpha_{d,k}$ can be updated by:

where $\lambda^{\prime}_{l,k}$ is the newly-sampled value of $\lambda_{l,k}$.
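A possible shape of this update, written as a Python sketch, is given below. It assumes a gamma prior with shape $\mu_{0}$ and rate $\mu_{0}$ (one parameterisation with mean 1), so that the conditional has shape $\mu_{0}+\sum_{d:f_{d,l}=1}t_{d,k}$ and rate $\mu_{0}-\sum_{d:f_{d,l}=1}\frac{\alpha_{d,k}}{\lambda_{l,k}}\log q_{d}$. This parameterisation, the function name, and the toy inputs are assumptions made for illustration.

```python
import math
import random

def resample_lambda(l, k, lam, alpha, F, q, t, mu0, rng):
    """Resample lam[l][k] from its gamma conditional and refresh the cached alpha.

    Assumes a Ga(shape=mu0, rate=mu0) prior, i.e. a prior with mean 1.
    """
    docs = [d for d in range(len(F)) if F[d][l] == 1]   # only documents carrying label l
    shape = mu0 + sum(t[d][k] for d in docs)
    # log(q[d]) is negative, so the rate stays positive.
    rate = mu0 - sum(alpha[d][k] / lam[l][k] * math.log(q[d]) for d in docs)
    new_value = rng.gammavariate(shape, 1.0 / rate)     # gammavariate takes (shape, scale)
    for d in docs:
        alpha[d][k] *= new_value / lam[l][k]            # swap the old factor for the new one
    lam[l][k] = new_value
    return new_value

rng = random.Random(1)
mu0 = 1.0
F = [[1, 0], [1, 1]]                 # hypothetical binary labels, F[d][l]
lam = [[2.0, 0.5], [1.0, 3.0]]       # current lambda[l][k]
alpha = [[2.0, 0.5], [2.0, 1.5]]     # cached alpha[d][k], product of the active lambdas
m = [[3, 2], [1, 4]]                 # topic counts m[d][k]
t = [[2, 1], [1, 2]]                 # table counts, assumed to be sampled already
q = [rng.betavariate(sum(alpha[d]), sum(m[d])) for d in range(len(F))]

for l in range(2):
    for k in range(2):
        resample_lambda(l, k, lam, alpha, F, q, t, mu0, rng)
```

Only the documents with $f_{d,l}=1$ are visited, which mirrors the remark that the non-zero entries of $\mathbf{F}$ are the only ones that matter.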

Since the derivation of sampling $\delta_{l^{\prime},k}$ is analogous to $\lambda_{l,k}$, we directly give the sampling formulas:

Given $\boldsymbol{\alpha}_{d}$ and $\boldsymbol{\beta}_{k}$, the collapsed Gibbs sampling of a new topic for a word $w_{d,i}=v$ in MetaLDA is:
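The topic update can be sketched in Python as below. The sketch assumes the usual collapsed conditional for LDA with asymmetric priors, $p(z_{d,i}=k)\propto(\alpha_{d,k}+m_{d,k})\frac{\beta_{k,v}+n_{k,v}}{\beta_{k,\cdot}+n_{k,\cdot}}$, with the current token excluded from the counts; the function name and the toy corpus are hypothetical.

```python
import random

def resample_topic(d, i, docs, z, alpha, beta, beta_sum, m, n, n_sum, rng):
    """One collapsed Gibbs update for token i of document d."""
    v, old = docs[d][i], z[d][i]
    # Remove the token's current assignment from the counts.
    m[d][old] -= 1
    n[old][v] -= 1
    n_sum[old] -= 1
    weights = [
        (alpha[d][k] + m[d][k]) * (beta[k][v] + n[k][v]) / (beta_sum[k] + n_sum[k])
        for k in range(len(beta))
    ]
    new = rng.choices(range(len(beta)), weights=weights)[0]
    # Add the token back under its new topic.
    m[d][new] += 1
    n[new][v] += 1
    n_sum[new] += 1
    z[d][i] = new
    return new

rng = random.Random(2)
docs = [[0, 1, 1, 2], [2, 2, 0]]            # hypothetical word ids per document
alpha = [[2.0, 0.5], [2.0, 1.5]]            # alpha[d][k]
beta = [[0.5, 2.0, 1.0], [1.0, 0.1, 0.1]]   # beta[k][v]
beta_sum = [sum(row) for row in beta]
K, V = len(beta), len(beta[0])

# Random initial assignments and the matching count tables.
z = [[rng.randrange(K) for _ in doc] for doc in docs]
m = [[0] * K for _ in docs]
n = [[0] * V for _ in range(K)]
n_sum = [0] * K
for d, doc in enumerate(docs):
    for i, v in enumerate(doc):
        m[d][z[d][i]] += 1
        n[z[d][i]][v] += 1
        n_sum[z[d][i]] += 1

for _ in range(10):                          # ten Gibbs sweeps over the toy corpus
    for d, doc in enumerate(docs):
        for i in range(len(doc)):
            resample_topic(d, i, docs, z, alpha, beta, beta_sum, m, n, n_sum, rng)
```

Apart from the document-specific `alpha[d]` and the topic-specific `beta[k]`, the update has the same form as in standard LDA.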

V Experiments

In this section, we evaluate the proposed MetaLDA against several recent advances that also incorporate meta information on 6 real datasets including both regular and short texts. The goal of the experimental work is to evaluate the effectiveness and efficiency of MetaLDA’s incorporation of document and word meta information both separately and jointly compared with other methods. We report the performance in terms of perplexity, topic coherence, and running time per iteration.

In the experiments, three regular text datasets and three short text datasets were used:

Reuters is a widely used corpus extracted from the Reuters-21578 dataset, where documents without any labels are removed. (MetaLDA is able to handle documents/words without labels/features, but for fair comparison with other models, we removed the documents without labels and words without features.) There are 11,367 documents and 120 labels. Each document is associated with multiple labels. The vocabulary size is 8,817 and the average document length is 73.

20NG, 20 Newsgroup, is a widely used dataset consisting of 18,846 news articles in 20 categories. The vocabulary size is 22,636 and the average document length is 108.

NYT, New York Times, is extracted from the documents in the category “Top/News/Health” in the New York Times Annotated Corpus (https://catalog.ldc.upenn.edu/ldc2008t19). There are 52,521 documents and 545 unique labels. Each document has multiple labels. The vocabulary contains 21,421 tokens and there are 442 words in a document on average.

WS, Web Snippet, contains 12,237 web search snippets, and each snippet belongs to one of 8 categories. The vocabulary contains 10,052 tokens and there are 15 words in one snippet on average.

TMN, Tag My News, consists of 32,597 English RSS news snippets from Tag My News. With a title and a short description, each snippet belongs to one of 7 categories. There are 13,370 tokens in the vocabulary and the average length of a snippet is 18.

AN, ABC News, is a collection of 12,495 short news descriptions, each of which belongs to several of 194 categories. There are 4,255 tokens in the vocabulary and the average length of a description is 13.

All the datasets were tokenised by Mallet (http://mallet.cs.umass.edu), and we removed the words that occur in fewer than 5 documents or in more than 95% of the documents.
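The document-frequency filter can be illustrated with a few lines of plain Python. This is not the Mallet pipeline used in the experiments; the function name, its parameters, and the toy corpus are hypothetical, and the toy call uses a lower minimum so that the effect is visible on four documents.

```python
def filter_vocabulary(docs, min_docs=5, max_doc_fraction=0.95):
    """Drop words that occur in fewer than min_docs documents
    or in more than max_doc_fraction of all documents."""
    doc_freq = {}
    for doc in docs:
        for word in set(doc):                 # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    limit = max_doc_fraction * len(docs)
    keep = {w for w, df in doc_freq.items() if min_docs <= df <= limit}
    return [[w for w in doc if w in keep] for doc in docs]

toy = [["the", "cat"], ["the", "dog"], ["the", "cat", "dog"], ["the", "bird"]]
filtered = filter_vocabulary(toy, min_docs=2, max_doc_fraction=0.95)
```

In the toy corpus, "the" appears in every document and "bird" in only one, so both are removed.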

V-B Meta Information Settings

Document labels and word features. At the document level, the labels associated with documents in each dataset were used as the meta information. At the word level, we used a set of 100-dimensional binarised word embeddings as word features, which were obtained from the 50-dimensional GloVe word embeddings pre-trained on Wikipedia (https://nlp.stanford.edu/projects/glove/). To binarise word embeddings, we first adopted the following method:

where $\boldsymbol{g}^{\prime\prime}_{v}$ is the original embedding vector for word $v$, $g^{\prime}_{v,j}$ is the binarised value for the $j^{\text{th}}$ element of $\boldsymbol{g}^{\prime\prime}_{v}$, and $\text{Mean}_{+}(\cdot)$ and $\text{Mean}_{-}(\cdot)$ are the average values of all the positive elements and negative elements respectively. The insight is that we only consider features with strong opinions (i.e., large positive or negative value) on each dimension. To transform $g^{\prime}\in\{-1,1\}$ to the final $g\in\{0,1\}$, we use two binary bits to encode one dimension of $g^{\prime}_{v,j}$: the first bit is on if $g^{\prime}_{v,j}=1$ and the second is on if $g^{\prime}_{v,j}=-1$. Besides, MetaLDA can work with other word features such as semantic similarity as well.
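The two-bit encoding can be sketched in Python as follows. The sketch assumes that an element counts as a strong positive (negative) opinion when it is at least (at most) the mean of the positive (negative) elements of the same vector, and that all other elements leave both bits off; the inclusive comparison and the function name are assumptions.

```python
def binarise(embedding):
    """Map a real-valued embedding to 2 * len(embedding) binary features.

    For each dimension, the first bit marks a strongly positive value and
    the second bit marks a strongly negative value.
    """
    pos = [x for x in embedding if x > 0]
    neg = [x for x in embedding if x < 0]
    mean_pos = sum(pos) / len(pos) if pos else float("inf")
    mean_neg = sum(neg) / len(neg) if neg else float("-inf")
    bits = []
    for x in embedding:
        bits.append(1 if x >= mean_pos else 0)
        bits.append(1 if x <= mean_neg else 0)
    return bits

features = binarise([0.9, 0.1, -0.2, -0.8])   # hypothetical 4-dimensional embedding
```

A 50-dimensional embedding therefore yields 100 binary features, matching the dimensionality stated above.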

Default feature. Besides the labels/features associated with the datasets, a default label/feature for each document/word is introduced in MetaLDA, which is always equal to 1. The default can be interpreted as the bias term in $\alpha$/$\beta$, which captures the information unrelated to the labels/features. When there are no document labels or word features, MetaLDA with the default is equivalent in model to asymmetric-asymmetric LDA.

V-C Compared Models and Parameter Settings

We evaluate the performance of the following models:

MetaLDA and its variants: the proposed model and its variants. Here we use MetaLDA to indicate the model considering both document labels and word features. Several variants of MetaLDA with document labels and word features separately were also studied, which are shown in Table I. These variants differ in the method of estimating $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$. All the models listed in Table I were implemented on top of Mallet. The hyper-parameters $\mu_{0}$ and $\nu_{0}$ were set to $1.0$.

LDA: the baseline model. The Mallet implementation of SparseLDA is used.

LLDA, Labelled LDA, and PLLDA, Partially Labelled LDA: two models that make use of multiple document labels. The original implementation (https://nlp.stanford.edu/software/tmt/tmt-0.4/) is used.

DMR, LDA with Dirichlet Multinomial Regression: a model that can use multiple document labels. The Mallet implementation of DMR based on SparseLDA was used. Following Mallet, we set the mean of $\lambda$ to 0.0 and set the variances of $\lambda$ for the default label and the document labels to 100.0 and 1.0 respectively.

WF-LDA, Word Feature LDA: a model with word features. We implemented it on top of Mallet and used the default settings in Mallet for the optimisation.

LF-LDA, Latent Feature LDA: a model that incorporates word embeddings. The original implementation (https://github.com/datquocnguyen/LFTM) was used. Following the paper, we used 1500 and 500 MCMC iterations for initialisation and sampling respectively, set $\lambda$ to 0.6, and used the original 50-dimensional GloVe word embeddings as word features.

GPU-DMM, Generalized Pólya Urn DMM: a model that incorporates word semantic similarity. The original implementation (https://github.com/NobodyWHU/GPUDMM) was used. The word similarity was generated from the distances of the word embeddings. Following the paper, we set the hyper-parameters $\mu$ and $\epsilon$ to 0.1 and 0.7 respectively, and the symmetric document Dirichlet prior to $50/K$.

PTM, Pseudo document based Topic Model: a model for short text analysis. The original implementation (http://ipv6.nlsde.buaa.edu.cn/zuoyuan/) was used. Following the paper, we set the number of pseudo documents to 1000 and $\lambda$ to 0.1.

For all the models, except where noted, the symmetric parameters of the document and the topic Dirichlet priors were set to 0.1 and 0.01 respectively, and 2000 MCMC iterations were used to train the models.

V-D Perplexity Evaluation

Perplexity is a measure that is widely used to evaluate the modelling accuracy of topic models. The lower the score, the higher the modelling accuracy. To compute perplexity, we randomly selected some documents in a dataset as the training set and the remaining as the test set. We first trained a topic model on the training set to get the word distributions of each topic $k$ ($\boldsymbol{\phi}_{k}^{train}$). Each test document $d$ was split into two halves by alternating words (every first and every second word respectively). We then fixed the topics and trained the models on the first half to get the topic proportions ($\boldsymbol{\theta}_{d}^{test}$) of test document $d$ and computed perplexity for predicting the second half. In regard to MetaLDA, we fixed the matrices $\mathbf{\Phi}^{train}$ and $\mathbf{\Lambda}^{train}$ output from the training procedure. On the first half of test document $d$, we computed the Dirichlet prior $\boldsymbol{\alpha}_{d}^{test}$ with $\mathbf{\Lambda}^{train}$ and the labels $\boldsymbol{f}_{d}^{test}$ of test document $d$ (see Step 2a), and then point-estimated $\boldsymbol{\theta}_{d}^{test}$. We ran all the models 5 times with different random number seeds and report the average scores and the standard deviations.
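The evaluation protocol can be sketched in Python as below. The sketch assumes that perplexity is the exponential of the negative average log-likelihood of the held-out tokens, and that $\boldsymbol{\theta}_{d}^{test}$ is point-estimated as $(\alpha_{d,k}+m_{d,k})/(\alpha_{d,\cdot}+m_{d,\cdot})$ from the topic counts on the first half; both choices and all names are assumptions for illustration.

```python
import math

def split_halves(tokens):
    """First half: tokens at even positions; second half: tokens at odd positions."""
    return tokens[0::2], tokens[1::2]

def point_estimate_theta(alpha_d, m_d):
    """Posterior-mean topic proportions from the prior alpha_d and topic counts m_d."""
    total = sum(alpha_d) + sum(m_d)
    return [(a + c) / total for a, c in zip(alpha_d, m_d)]

def perplexity(held_out_docs, thetas, phi):
    """exp of the negative mean log-likelihood of the held-out tokens."""
    log_lik, n_tokens = 0.0, 0
    for tokens, theta in zip(held_out_docs, thetas):
        for w in tokens:
            log_lik += math.log(sum(t_k * phi_k[w] for t_k, phi_k in zip(theta, phi)))
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)

phi = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]   # two uniform toy topics
first, second = split_halves([0, 3, 1, 2, 2, 0])
theta = point_estimate_theta([2.0, 1.5], [2, 1])   # hypothetical topic counts on `first`
score = perplexity([second], [theta], phi)
```

With uniform topics over four words, every held-out token has probability 0.25, so the score equals the vocabulary size up to rounding.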

In testing, we may encounter words that never occur in the training documents (a.k.a., unseen words or out-of-vocabulary words). There are two strategies for handling unseen words for calculating perplexity on test documents: ignoring them or keeping them in computing the perplexity. Here we investigate both strategies:

In this experiment, the perplexity is computed only on the words that appear in the training vocabulary. Here we used 80% of the documents in each dataset as the training set and the remaining 20% as the test set.

Tables II and III show the average perplexity scores with standard deviations for all the models. (For GPU-DMM and PTM, perplexity is not evaluated because the inference code for unseen documents is not publicly available. The random number seeds used in the code of LLDA and PLLDA are pre-fixed in the package, so the standard deviations of the two models are not reported.) Note that: (1) The scores on AN with 150 and 200 topics are not reported due to overfitting observed in all the compared models. (2) Given the size of NYT, the scores of 200 and 500 topics are reported. (3) The number of latent topics in LLDA must equal the number of document labels. (4) For PLLDA, we varied the number of topics per label from 5 to 50 (2 and 5 topics on NYT). The number of topics in PLLDA is the product of the numbers of labels and topics per label.

The results show that MetaLDA outperformed all the competitors in terms of perplexity on nearly all the datasets, showing the benefit of using both document and word meta information. Specifically, we have the following remarks:

By looking at the models using only the document-level meta information, we can see the significant improvement of these models over LDA, which indicates that document labels can play an important role in guiding topic modelling. Although the performance of the two variants of MetaLDA with document labels and DMR is comparable, our models run much faster than DMR, which will be studied later in Section V-F.

It is interesting that PLLDA with 50 topics for each label has better perplexity than MetaLDA with 200 topics in the 20NG dataset. With the 20 unique labels, the actual number of topics in PLLDA is 1000. However, if 10 topics for each label in PLLDA are used, which is equivalent to 200 topics in MetaLDA, PLLDA is outperformed by MetaLDA significantly.

At the word level, MetaLDA-def-wf performed the best among the models with word features only. Moreover, our model has an obvious advantage in running speed (see Table V). Furthermore, comparing MetaLDA-def-wf with MetaLDA-def-def and MetaLDA-0.1-wf with LDA, we can see that using the word features indeed improved perplexity.

The scores show that the improvement gained by MetaLDA over LDA on the short text datasets is larger than that on the regular text datasets. This is as expected because meta information serves as complementary information in MetaLDA and can have more significant impact when the data is sparser.

It can be observed that models usually gained improved perplexity if $\alpha$ is sampled/optimised, in line with previous findings.

On the AN dataset, there is no statistically significant difference between MetaLDA and DMR. On NYT, a similar trend is observed: the improvement in the models with the document labels over LDA is obvious but not in the models with the word features. Given the number of the document labels (194 of AN and 545 of NYT), it is possible that the document labels already offer enough information and the word embeddings have little contribution in the two datasets.

V-D2 Perplexity Computed with Unseen Words

To test the hypothesis that the incorporation of meta information in MetaLDA can significantly improve the modelling accuracy in the cases where the corpus is sparse, we varied the proportion of documents used in training from 20% to 80% and used the remaining for testing. It is natural that when the proportion is small, the number of unseen words in testing documents will be large. Instead of simply excluding the unseen words as in the previous experiments, here we compute the perplexity with unseen words for LDA, DMR, WF-LDA and the proposed MetaLDA. For perplexity calculation, $\phi^{test}_{k,v}$ for each topic $k$ and each token $v$ in the test documents is needed. If $v$ occurs in the training documents, $\phi^{test}_{k,v}$ can be directly obtained. If $v$ is unseen, $\phi^{unseen}_{k,v}$ can be estimated by the prior: $\frac{\beta^{unseen}_{k,v}}{n^{train}_{k,\cdot}+\beta^{train}_{k,\cdot}+\beta^{unseen}_{k,\cdot}}$. For LDA and DMR, which do not use word features, $\beta^{unseen}_{k,v}=\beta^{train}_{k,v}$. For WF-LDA and MetaLDA, which use word features, $\beta^{unseen}_{k,v}$ is computed with the features of the unseen token. Following Step 1c, for MetaLDA, $\beta^{unseen}_{k,v}=\prod_{l^{\prime}=1}^{L_{word}}\delta_{l^{\prime},k}^{g^{unseen}_{v,l^{\prime}}}$.
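For a single topic $k$, the estimate for unseen tokens can be sketched in Python as below. The unseen-token formula follows the text; the treatment of seen tokens as $(n^{train}_{k,v}+\beta^{train}_{k,v})$ over the same denominator is an assumption made so that the distribution sums to one, and all names and toy values are hypothetical.

```python
import math

def beta_from_features(g_v, delta_k):
    """beta_{k,v}: product of delta_{l',k} over the active binary features of token v."""
    return math.prod(dl for bit, dl in zip(g_v, delta_k) if bit == 1)

def phi_with_unseen(n_k, beta_train_k, beta_unseen_k):
    """Word probabilities of one topic over the training plus unseen vocabulary."""
    denom = sum(n_k) + sum(beta_train_k) + sum(beta_unseen_k)
    phi_seen = [(c + b) / denom for c, b in zip(n_k, beta_train_k)]
    phi_unseen = [b / denom for b in beta_unseen_k]   # prior mass only, no counts
    return phi_seen, phi_unseen

delta_k = [0.5, 2.0, 1.5]            # hypothetical delta_{l',k} for one topic
G_train = [[1, 0, 0], [0, 1, 1]]     # features of two training tokens
G_unseen = [[1, 1, 0]]               # features of one unseen token
n_k = [4, 1]                         # training counts n_{k,v} for this topic

beta_train_k = [beta_from_features(g, delta_k) for g in G_train]
beta_unseen_k = [beta_from_features(g, delta_k) for g in G_unseen]
phi_seen, phi_unseen = phi_with_unseen(n_k, beta_train_k, beta_unseen_k)
```

An unseen token that shares features with frequent training tokens thus receives a larger prior weight than it would under a symmetric prior.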

Figure 2 shows the perplexity scores on Reuters, 20NG, TMN and WS with 200, 200, 100 and 50 topics respectively. MetaLDA outperformed the other models significantly with a lower proportion of training documents and relatively higher proportion of unseen words. The gap between MetaLDA and the other three models increases while the training proportion decreases. It indicates that the meta information helps MetaLDA to achieve better modelling accuracy on predicting unseen words.

V-E Topic Coherence Evaluation

We further evaluate the semantic coherence of the words in a topic learnt by LDA, PTM, DMR, LF-LDA, WF-LDA, GPU-DMM and MetaLDA. Here we use the Normalised Pointwise Mutual Information (NPMI) to calculate the topic coherence score for topic $k$ with top $T$ words: $\text{NPMI}(k)=\sum_{j=2}^{T}\sum_{i=1}^{j-1}\frac{\log\frac{p(w_{j},w_{i})}{p(w_{j})p(w_{i})}}{-\log p(w_{j},w_{i})}$, where $p(w_{i})$ is the probability of word $i$, and $p(w_{i},w_{j})$ is the joint probability of words $i$ and $j$ co-occurring within a sliding window. Those probabilities were computed on an external large corpus, i.e., a 5.48GB Wikipedia dump in our experiments. The NPMI score of each topic in the experiments is calculated with top 10 words ($T=10$) by the Palmetto package (http://palmetto.aksw.org). Again, we report the average scores and the standard deviations over 5 random runs.
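The formula above can be transcribed into Python as follows. This is a minimal sketch of the stated sum, not the Palmetto implementation: it assumes all probabilities are strictly between 0 and 1, and the function name, the dictionary layout, and the toy probabilities are hypothetical.

```python
import math

def npmi(top_words, p_single, p_joint):
    """Sum of pairwise NPMI terms over all pairs of a topic's top words.

    p_single maps a word to its probability; p_joint maps a frozenset of two
    words to their co-occurrence probability. Both must lie strictly in (0, 1).
    """
    score = 0.0
    for j in range(1, len(top_words)):
        for i in range(j):
            wi, wj = top_words[i], top_words[j]
            p_ij = p_joint[frozenset((wi, wj))]
            pmi = math.log(p_ij / (p_single[wi] * p_single[wj]))
            score += pmi / -math.log(p_ij)
    return score

p_single = {"a": 0.5, "b": 0.5, "c": 0.5}
p_joint = {
    frozenset(("a", "b")): 0.25,   # independent pair, contributes 0
    frozenset(("a", "c")): 0.5,    # always co-occurring pair, contributes 1
    frozenset(("b", "c")): 0.25,   # independent pair, contributes 0
}
toy_score = npmi(["a", "b", "c"], p_single, p_joint)
```

Each pairwise term lies between -1 and 1, reaching 1 only when the two words always co-occur.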

It is known that conventional topic models directly applied to short texts suffer from low quality topics, caused by the insufficient word co-occurrence information. Here we study whether or not the meta information helps MetaLDA improve topic quality, compared with other topic models that can also handle short texts. Table IV shows the NPMI scores on the three short text datasets. Higher scores indicate better topic coherence. All the models were trained with 100 topics. Besides the NPMI scores averaged over all the 100 topics, we also show the scores averaged over the top 20 topics with the highest NPMI, where “rubbish” topics are eliminated. It is clear that MetaLDA performed significantly better than all the other models on the WS and AN datasets in terms of NPMI, which indicates that MetaLDA can discover more meaningful topics with the document and word meta information. We would like to point out that on the TMN dataset, even though the average score of MetaLDA is still the best, its score overlaps with the others’ within the standard deviation, which indicates the difference is not statistically significant.

V-F Running Time

In this section, we empirically study the efficiency of the models in terms of per-iteration running time. The implementation details of MetaLDA are as follows: (1) The SparseLDA framework reduces the complexity of LDA to sub-linear by breaking the conditional of LDA into three “buckets”, where the “smoothing only” bucket is cached for all the documents and the “document only” bucket is cached for all the tokens in a document. We adopted a similar strategy when implementing MetaLDA. When only the document meta information is used, the Dirichlet parameters $\alpha$ in MetaLDA are asymmetric and differ across documents. Therefore, the “smoothing only” bucket has to be computed for each document, but we can cache it for all the tokens in that document, which still gives a considerable reduction in computational complexity. However, when the word meta information is used, the SparseLDA framework no longer applies to MetaLDA because the $\beta$ parameters differ for each topic and each token. (2) By adapting the DistributedLDA framework, our MetaLDA implementation runs in parallel with multiple threads, which enables MetaLDA to handle larger document collections. The parallel implementation was used on the NYT dataset.
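To make the bucket decomposition concrete, here is a small Python sketch, assuming a symmetric scalar $\beta$ and a per-document asymmetric $\alpha$ (the document-meta-only case described above). It is not the authors' Java implementation; the function names and count arrays are hypothetical. It uses the standard SparseLDA split of the collapsed Gibbs conditional $(\alpha_{d,k}+n_{d,k})(\beta+n_{k,w})/(\beta V+n_{k})$ into three additive terms.

```python
def sparse_buckets(alpha_d, beta, n_dk, n_kw, n_k, vocab_size):
    """Split the LDA conditional for one token into three per-topic buckets.

    alpha_d: per-topic Dirichlet parameters of the current document.
    beta:    symmetric topic-word Dirichlet parameter (assumed scalar).
    n_dk:    per-topic token counts in the current document.
    n_kw:    per-topic counts of the current word type.
    n_k:     per-topic total token counts.
    """
    denom = [beta * vocab_size + nk for nk in n_k]
    # "Smoothing only": depends on the document only through alpha_d, so it
    # can be shared by all tokens of that document.
    s = [a * beta / d for a, d in zip(alpha_d, denom)]
    # "Document only": non-zero only for topics present in the document.
    r = [ndk * beta / d for ndk, d in zip(n_dk, denom)]
    # "Topic word": non-zero only for topics in which the word occurs.
    q = [(a + ndk) * nkw / d
         for a, ndk, nkw, d in zip(alpha_d, n_dk, n_kw, denom)]
    return s, r, q


def full_conditional(alpha_d, beta, n_dk, n_kw, n_k, vocab_size):
    """Unnormalised collapsed Gibbs conditional, computed directly."""
    return [(a + ndk) * (beta + nkw) / (beta * vocab_size + nk)
            for a, ndk, nkw, nk in zip(alpha_d, n_dk, n_kw, n_k)]
```

For every topic the three buckets sum to the direct conditional, while `r` and `q` are zero for most topics, which is where the savings come from. A real sampler would update the cached values incrementally as counts change rather than recomputing them for every token.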

The per-iteration running time of all the models is shown in Table V. Note that: (1) On the Reuters and WS datasets, all the models ran with a single thread on a desktop PC with a 3.40GHz CPU and 16GB RAM. (2) Due to the size of NYT, we report the running time only for the models that are able to run in parallel. All the parallelised models ran with 10 threads on a cluster with a 14-core 2.6GHz CPU and 128GB RAM. (3) All the models were implemented in Java. (4) As the models with meta information add extra complexity to LDA, the per-iteration running time of LDA can be treated as a lower bound.

At the document level, both MetaLDA-df-0.01 and DMR use priors to incorporate the document meta information and both of them were implemented in the SparseLDA framework. However, our variant is about 6 to 8 times faster than DMR on the Reuters dataset and more than 10 times faster on the WS dataset. Moreover, the larger the number of topics, the greater the speed advantage of our variant over DMR. At the word level, similar patterns can be observed: our MetaLDA-0.1-wf ran significantly faster than WF-LDA and LF-LDA, especially when more topics are used (20 to 30 times faster on WS). It is not surprising that GPU-DMM has a running speed comparable to that of our variant, because only one topic is allowed for each document in GPU-DMM. With both document and word meta information, MetaLDA still ran several times faster than DMR, LF-LDA, and WF-LDA. On NYT with the parallel settings, MetaLDA maintains its efficiency advantage as well.

VI Conclusion

In this paper, we have presented a topic modelling framework named MetaLDA that can efficiently incorporate document and word meta information, yielding a significant improvement over other models in terms of perplexity and topic quality. With two data augmentation techniques, MetaLDA enjoys full local conjugacy, allowing efficient Gibbs sampling, as demonstrated by its superior per-iteration running time. Furthermore, without loss of generality, MetaLDA can work with both regular texts and short texts. The improvement of MetaLDA over other models that also use meta information is more pronounced when the word-occurrence information is insufficient. As MetaLDA takes a particular approach to incorporating meta information into topic models, it is possible to apply the same approach to other Bayesian probabilistic models in which Dirichlet priors are used. Moreover, it would be interesting to extend our method to use real-valued meta information directly, which is the subject of future work.

Acknowledgement

Lan Du was partially supported by Chinese NSFC project under grant number 61402312. Gang Liu was partially supported by Chinese PostDoc Fund under grant number LBH-Q15031.

References