Supervised Topic Models

David M. Blei, Jon D. McAuliffe

Introduction

There is a growing need to analyze collections of electronic text. We have unprecedented access to large corpora, such as government documents and news archives, but we cannot take advantage of these collections without being able to organize, understand, and summarize their contents. We need new statistical models to analyze such data, and fast algorithms to compute with those models.

The complexity of document corpora has led to considerable interest in applying hierarchical statistical models based on what are called topics (Blei and Lafferty,, 2009; Blei et al.,, 2003; Griffiths et al.,, 2005; Erosheva et al.,, 2004). Formally, a topic is a probability distribution over terms in a vocabulary. Informally, a topic represents an underlying semantic theme; a document consisting of a large number of words might be concisely modelled as deriving from a smaller number of topics. Such topic models provide useful descriptive statistics for a collection, which facilitates tasks like browsing, searching, and assessing document similarity.

Most topic models, such as latent Dirichlet allocation (LDA) (Blei et al.,, 2003), are unsupervised. Only the words in the documents are modelled, and the goal is to infer topics that maximize the likelihood (or the posterior probability) of the collection. This is compelling—with only the documents as input, one can find patterns of words that reflect the broad themes that run through them—and unsupervised topic modeling has many applications. Researchers have used the topic-based decomposition of the collection to examine interdisciplinarity (Ramage et al.,, 2009), organize large collections of digital books (Mimno and McCallum,, 2007), recommend purchased items (Marlin,, 2003), and retrieve relevant documents to a query (Wei and Croft,, 2006). Researchers have also evaluated the interpretability of the topics directly, such as by correlation to a thesaurus (Steyvers and Griffiths,, 2006) or by human study (Chang et al.,, 2009).

In this work, we focus on document collections where each document is endowed with a response variable, external to its words, that we are interested in predicting. There are many examples: in a collection of movie reviews, each document is summarized by a numerical rating; in a collection of news articles, each document is assigned to a section of the newspaper; in a collection of on-line scientific articles, each document is downloaded a certain number of times. To analyze such collections, we develop supervised topic models, where the goal is to infer latent topics that are predictive of the response. With a fitted model in hand, we can infer the topic structure of an unlabeled document and then form a prediction of its response.

Unsupervised LDA has previously been used to construct features for classification. The hope was that LDA topics would turn out to be useful for categorization, since they act to reduce data dimension (Blei et al.,, 2003; Fei-Fei and Perona,, 2005; Quelhas et al.,, 2005). However, when the goal is prediction, fitting unsupervised topics may not be a good choice. Consider predicting a movie rating from the words in its review. Intuitively, good predictive topics will differentiate words like “excellent”, “terrible”, and “average,” without regard to genre. But topics estimated from an unsupervised model may correspond to genres, if that is the dominant structure in the corpus.

The distinction between unsupervised and supervised topic models is mirrored in existing dimension-reduction techniques. For example, consider regression on unsupervised principal components versus partial least squares and projection pursuit (Hastie et al.,, 2001)— both search for covariate linear combinations most predictive of a response variable. Linear supervised methods have nonparametric analogs, such as an approach based on kernel ICA (Fukumizu et al.,, 2004). In text analysis problems, McCallum et al., (2006) developed a joint topic model for words and categories, and Blei and Jordan, (2003) developed an LDA model to predict caption words from images. In chemogenomic profiling, Flaherty et al., (2005) proposed “labelled LDA,” which is also a joint topic model, but for genes and protein function categories. These models differ fundamentally from the model proposed here.

This paper is organized as follows. We develop the supervised latent Dirichlet allocation model (sLDA) for document-response pairs. We derive parameter estimation and prediction algorithms for the general exponential family response distributions. We show specific algorithms for a Gaussian response and Poisson response, and suggest a general approach to any other exponential family. Finally, we demonstrate sLDA on two real-world problems. First, we predict movie ratings based on the text of the reviews. Second, we predict the political tone of a senate amendment, based on an ideal-point analysis of the roll call data (Clinton et al.,, 2004). In both settings, we find that sLDA provides more predictive power than regression on unsupervised LDA features. The sLDA approach also improves on the lasso (Tibshirani,, 1996), a modern regularized regression technique.

Supervised latent Dirichlet allocation

Topic models are distributions over document collections where each document is represented as a collection of discrete random variables W1:nW_{1:n}, which are its words. In topic models, we treat the words of a document as arising from a set of latent topics, that is, a set of unknown distributions over the vocabulary. Documents in a corpus share the same set of KK topics, but each document uses a mix of topics—the topic proportions—unique to itself. Topic models are a relaxation of classical document mixture models, which associate each document with a single unknown topic. Thus they are mixed-membership models (Erosheva et al.,, 2004). See Steyvers and Griffiths, (2006) and Blei and Lafferty, (2009) for recent reviews.

Here we build on latent Dirichlet allocation (LDA) (Blei et al.,, 2003), a topic model that serves as the basis for many others. In LDA, we treat the topic proportions for a document as a draw from a Dirichlet distribution. We obtain the words in the document by repeatedly choosing a topic assignment from those proportions, then drawing a word from the corresponding topic.

In supervised latent Dirichlet allocation (sLDA), we add to LDA a response variable connected to each document. As mentioned, examples of this variable include the number of stars given to a movie, the number of times an on-line article was downloaded, or the category of a document. We jointly model the documents and the responses, in order to find latent topics that will best predict the response variables for future unlabeled documents. SLDA uses the same probabilistic machinery as a generalized linear model to accommodate various types of response: unconstrained real values, real values constrained to be positive (e.g., failure times), ordered or unordered class labels, nonnegative integers (e.g., count data), and other types.

Fix for a moment the model parameters: KK topics β1:K\beta_{1:K} (each βk\beta_{k} a vector of term probabilities), a Dirichlet parameter α\alpha, and response parameters η\eta and δ\delta. (These response parameters are described in detail below.) Under the sLDA model, each document and response arises from the following generative process:

Draw topic proportions θαDir(α).\theta\,|\,\alpha\sim\textrm{Dir}(\alpha).

Draw topic assignment znθMult(θ).z_{n}\,|\,\theta\sim\textrm{Mult}(\theta).

Draw word wnzn,β1:KMult(βzn).w_{n}\,|\,z_{n},\beta_{1:K}\sim\textrm{Mult}(\beta_{z_{n}}).

Draw response variable yz1:N,η,δGLM(zˉ,η,δ)y\,|\,z_{1:N},\eta,\delta\sim\text{GLM}(\bar{z},\eta,\delta), where we define

Figure 1 illustrates the family of probability distributions corresponding to this generative process as a graphical model.

The distribution of the response is a generalized linear model (McCullagh and Nelder,, 1989),

There are two main ingredients in a generalized linear model (GLM): the “random component” and the “systematic component.” For the random component, we take the distribution of the response to be an exponential dispersion family with natural parameter ηzˉ\eta^{\top}\bar{z} and dispersion parameter δ\delta. For each fixed δ\delta, Equation 2 is an exponential family, with base measure h(y,δ)h(y,\delta), sufficient statistic yy, and log-normalizer A(ηzˉ)A(\eta^{\top}\bar{z}). The dispersion parameter provides additional flexibility in modeling the variance of yy. Note that Equation 2 need not be an exponential family jointly in (ηzˉ,δ)(\eta^{\top}\bar{z},\delta).

In the systematic component of the GLM, we relate the exponential-family parameter of the random component to a linear combination of covariates—the so-called linear predictor. For sLDA, we have already introduced the linear predictor: it is ηzˉ\eta^{\top}\bar{z}. The reader familiar with GLMs will recognize that our choice of systematic component means sLDA uses only canonical link functions. In future work, we will relax this constraint.

The GLM framework gives us the flexibility to model any type of response variable whose distribution can be written in exponential dispersion form Equation 2. This includes many commonly used distributions: the normal (for real response); the binomial (for binary response); the multinomial (for categorical response); the Poisson and negative binomial (for count data); the gamma, Weibull, and inverse Gaussian (for failure time data); and others. Each of these distributions corresponds to a particular choice of h(y,δ)h(y,\delta) and A(ηzˉ)A(\eta^{\top}\bar{z}). For example, the normal distribution corresponds to h(y,δ)=(1/2πδ)exp{y2/2}h(y,\delta)=(1/\sqrt{2\pi\delta})\exp\{-y^{2}/2\} and A(ηzˉ)=(ηzˉ)2/2A(\eta^{\top}\bar{z})=(\eta^{\top}\bar{z})^{2}/2. In this case, the usual Gaussian parameters, mean μ\mu and variance σ2\sigma^{2}, are equal to ηzˉ\eta^{\top}\bar{z} and δ\delta, respectively.

What distinguishes sLDA from the usual GLM is that the covariates are the unobserved empirical frequencies of the topics in the document. In the generative process, these latent variables are responsible for producing the words of the document, and thus the response and the words are tied. The regression coefficients on those frequencies constitute η\eta. Note that a GLM usually includes an intercept term, which amounts to adding a covariate that always equals one. Here, such a term is redundant, because the components of zˉ\bar{z} always sum to one (see Equation 1).

By regressing the response on the empirical topic frequencies, we treat the response as non-exchangeable with the words. The document (i.e., words and their topic assignments) is generated first, under full word exchangeability; then, based on the document, the response variable is generated. In contrast, one could formulate a model in which yy is regressed on the topic proportions θ\theta. This treats the response and all the words as jointly exchangeable. But as a practical matter, our chosen formulation seems more sensible: the response depends on the topic frequencies which actually occurred in the document, rather than on the mean of the distribution generating the topics. Estimating this fully exchangeable model with enough topics allows some topics to be used entirely to explain the response variables, and others to be used to explain the word occurrences. This degrades predictive performance, as demonstrated in Blei and Jordan, (2003). Put a different way, here the latent variables that govern the response are the same latent variables that governed the words. The model does not have the flexibility to infer topic mass that explains the response, without also using it to explain some of the observed words.

Computation with supervised LDA

We need to address three computational problems to analyze data with sLDA. First is posterior inference, computing the conditional distribution of the latent variables at the document level given its words w1:Nw_{1:N} and the corpus-wide model parameters. The posterior is thus a conditional distribution of topic proportions θ\theta and topic assignments z1:Nz_{1:N}. This distribution is not possible to compute. We approximate it with variational inference.

Second is parameter estimation, estimating the Dirichlet parameters α\alpha, GLM parameters η\eta and δ\delta, and topic multinomials β1:K\beta_{1:K} from a data set of observed document-response pairs {wd,1:N,yd}d=1D\{w_{d,1:N},y_{d}\}_{d=1}^{D}. We use maximum likelihood estimation based on variational expectation-maximization.

The final problem is prediction, predicting a response yy from a newly observed document w1:Nw_{1:N} and fixed values of the model parameters. This amounts to approximating the posterior expectation E[yw1:N,α,β1:K,η,δ]\textrm{E}[y\,|\,w_{1:N},\alpha,\beta_{1:K},\eta,\delta].

We treat these problems in turn for the general GLM setting of sLDA, noting where GLM-specific quantities need to be computed or approximated. We then consider the special cases of a Gaussian response and a Poisson response, for which the algorithms have an exact form. Finally, we suggest a general-purpose approximation procedure for other response distributions.

Both parameter estimation and prediction hinge on posterior inference. Given a document and response, the posterior distribution of the latent variables is

The normalizing value is the marginal probability of the observed data, i.e., the document w1:Nw_{1:N} and response yy. This normalizer is also known as the likelihood, or the evidence. As with LDA, it is not efficiently computable (Blei et al.,, 2003). Thus, we appeal to variational methods to approximate the posterior.

Variational methods encompass a number of types of approximation for the normalizing value of the posterior. For reviews, see Wainwright and Jordan, (2008) and Jordan et al., (1999). We use mean-field variational inference, where Jensen’s inequality is used to lower bound the normalizing value. We let π\pi denote the set of model parameters, π={α,β1:K,η,δ}\pi=\{\alpha,\beta_{1:K},\eta,\delta\} and q(θ,z1:N)q(\theta,z_{1:N}) denote a variational distribution of the latent variables. The lower bound is

where all expectations are taken with respect to q(θ,z1:N)q(\theta,z_{1:N}). This bound is called the evidence lower bound (ELBO), which we denote by L()\mathcal{L}(\cdot). The first term is the expectation of the log of the joint probability of hidden and observed variables; the second term is the entropy of the variational distribution, H(q)=E[logq(θ,z1:N)]\textrm{H}(q)=-\textrm{E}[\log q(\theta,z_{1:N})]. In its expanded form, the sLDA ELBO is

This is a function of the observations {w1:N,y}\{w_{1:N},y\} and the variational distribution.

In variational inference we first construct a parameterized family for the variational distribution, and then fit its parameters to tighten Equation 4 as much as possible for the given observations. The parameterization of the variational distribution governs the expense of optimization.

Equation 4 is tight when q(θ,z1:N)q(\theta,z_{1:N}) is the posterior, but specifying a family that contains the posterior leads to an intractable optimization problem. We thus specify a simpler family. Here, we choose the fully factorized distribution,

where γ\gamma is a K-dimensional Dirichlet parameter vector and each ϕn\phi_{n} parametrizes a categorical distribution over KK elements. The latent topic assignment ZnZ_{n} is represented as a KK-dimensional indicator vector; thus E[Zn]=q(zn)=ϕn\textrm{E}[Z_{n}]=q(z_{n})=\phi_{n}. Tightening the bound with respect to this family is equivalent to finding its member that is closest in KL divergence to the posterior (Wainwright and Jordan,, 2008; Jordan et al.,, 1999). Thus, given a document-response pair, we maximize Equation 4 with respect to ϕ1:N\phi_{1:N} and γ\gamma to obtain an estimate of the posterior.

Before turning to optimization, we describe how to compute the ELBO in Equation 4. The first three terms and the entropy of the variational distribution are identical to the corresponding terms in the ELBO for unsupervised LDA (Blei et al.,, 2003). The first three terms are

The entropy of the variational distribution is

We note that the expectation of the log of the Dirichlet random variable is

where Ψ(x)\Psi(x) denotes the digamma function, i.e., the first derivative of the log of the Gamma function.

The variational objective function for sLDA differs from LDA in the fourth term, the expected log probability of the response variable given the latent topic assignments. This term is

We see that computing the sLDA ELBO hinges on two expectations. The first expectation is

which follows from ZnZ_{n} being an indicator vector.

The second expectation is E[A(ηZˉ)]\textrm{E}[A(\eta^{\top}\bar{Z})]. This is computable in some models, such as when the response is Gaussian or Poisson distributed, but in the general case it must be approximated. We will address these settings in detail in Section 3.4. For now, we assume that this issue is resolvable and continue with the procedure for approximate posterior inference.

We use block coordinate-ascent variational inference, maximizing Equation 4 with respect to each variational parameter vector in turn. The terms that involve the variational Dirichlet γ\gamma are identical to those in unsupervised LDA, i.e., they do not involve the response variable yy. Thus, the coordinate ascent update is as in Blei et al., (2003),

The central difference between sLDA and LDA lies in the update for the variational multinomial ϕn\phi_{n}. Given n{1,,N}n\in\{1,\ldots,N\}, the partial derivative of the ELBO with respect to ϕn\phi_{n} is

Optimizing with respect to the variational multinomial depends on the form of {\partial\over\partial\phi_{n}}\left\{\textrm{E}\bigl{[}A(\eta^{\top}\bar{Z})\bigr{]}\right\}. In some cases, such as a Gaussian or Poisson response, we obtain exact coordinate updates. In other cases, we require gradient based optimization methods. Again, we postpone these details to Section 3.4.

Variational inference proceeds by iteratively updating the variational parameters {γ,ϕ1:N}\{\gamma,\phi_{1:N}\} according to Equation 13 and the derivative in Equation 14. This finds a local optimum of the ELBO in Equation 4. The resulting variational distribution qq is used as a proxy for the posterior.

2 Parameter estimation

The parameters of sLDA are the KK topics β1:K\beta_{1:K}, the Dirichlet hyperparameters α\alpha, the GLM coefficients η\eta, and the GLM dispersion parameter δ\delta. We fit these parameters with variational expectation maximization (EM), an approximate form of expectation maximization where the expectation is taken with respect to a variational distribution. As in the usual EM algorithm, maximization proceeds by maximum likelihood estimation under expected sufficient statistics (Dempster et al.,, 1977).

Our data are a corpus of document-response pairs D={wd,1:N,yd}d=1D{\cal D}=\{w_{d,1:N},y_{d}\}_{d=1}^{D}. Variational EM is an optimization of the corpus-level lower bound on the log likelihood of the data, which is the sum of per-document ELBOs. As we are now considering a collection of document-response pairs, in this section we add document indexes to the previous section’s quantities, so response variable yy becomes ydy_{d}, empirical topic assignment frequencies Zˉ\bar{Z} becomes Zˉd\bar{Z}_{d}, and so on. Notice each document is endowed with its own variational distribution. Expectations are taken with respect to that document-specific variational distribution qd(z1:N,θ)q_{d}(z_{1:N},\theta),

In the expectation step (E-step), we estimate the approximate posterior distribution for each document-response pair using the variational inference algorithm described above. In the maximization step (M-step), we maximize the corpus-level ELBO with respect to the model parameters. Variational EM finds a local optimum of Equation 15 by iterating between these steps. The M-step updates are described below.

The M-step updates of the topics β1:K\beta_{1:K} are the same as for unsupervised LDA, where the probability of a word under a topic is proportional to the expected number of times that it was assigned to that topic (Blei et al.,, 2003),

Proportionality means that each β^knew\hat{\beta}_{k}^{\text{new}} is normalized to sum to one. We note that in a fully Bayesian sLDA model, one can place a symmetric Dirichlet prior on the topics and use variational Bayes to approximate their posterior. This adds no additional complexity to the algorithm (Bishop,, 2006).

Estimating the GLM parameters.

The GLM parameters are the coefficients η\eta and dispersion parameter δ\delta. For the corpus-level ELBO, the gradient with respect to GLM coefficients η\eta is

where μ()=EGLM[Y]\mu(\cdot)=\textrm{E}_{\text{GLM}}[Y\,|\,\cdot]. The appearance of this expectation follows from properties of the cumulant generating function A(ηzˉ)A(\eta^{\top}\bar{z}) (Brown,, 1986). This GLM mean response is a known function of ηzˉd\eta^{\top}\bar{z}_{d} in all standard cases. However, E[μ(ηZˉd)Zˉd]\textrm{E}[\mu(\eta^{\top}\bar{Z}_{d})\bar{Z}_{d}] has an exact solution only in some cases, such as Gaussian or Poisson response. In other cases, we approximate the expectation. (See Section 3.4.)

The derivative with respect to δ\delta, evaluated at η^new\hat{\eta}_{\text{new}}, is

Equation 19 can be computed given that the rightmost summation has been evaluated, exactly or approximately, while optimizing the coefficients η\eta. Depending on h(y,δ)h(y,\delta) and its partial derivative with respect to δ\delta, we obtain δ^new\hat{\delta}_{\text{new}} either in closed form or via one-dimensional numerical optimization.

Estimating the Dirichlet parameter.

While we fix the Dirichlet parameter α\alpha to 1/K1/K in Section 4, other applications of topic modeling fit this parameter (Blei et al.,, 2003; Wallach et al.,, 2009). Estimating the Dirichlet parameters follows the standard procedure for estimating a Dirichlet distribution (Ronning,, 1989). In a fully-observed setting, the sufficient statistics of the Dirichlet are the logs of the observed simplex vectors. Here, they are the expected logs of the topic proportions, see Equation 10.

3 Prediction

Our focus in applying sLDA is prediction. Given a new document w1:Nw_{1:N} and a fitted model {α,β1:K,η,δ}\{\alpha,\beta_{1:K},\eta,\delta\}, we want to compute the expected response value,

To perform prediction, we approximate the posterior mean of Zˉ\bar{Z} using variational inference. This is the same procedure as in Section 3.1, but here the terms depending on the response yy are removed from the ELBO in Equation 4. Thus, with the variational distribution taking the same form as Equation 5, we implement the following coordinate ascent updates for the variational parameters,

where the log of a vector is a vector of the log of each of its components, βwn\beta_{w_{n}} is the vector of p(wnβk)p(w_{n}\,|\,\beta_{k}) for each k{1,,K}k\in\{1,\ldots,K\}, and again proportionality means that the vector is normalized to sum to one. Note in this section we distinguish between expectations taken with respect to the model and those taken with respect to the variational distribution. (In other sections all expectations are taken with respect to the variational distribution.)

This coordinate ascent algorithm is identical to variational inference for unsupervised LDA: since we averaged the response variable out of the right-hand side in Equation 20, what remains is the standard unsupervised LDA model for Z1:NZ_{1:N} and θ\theta (Blei et al.,, 2003). Notice this algorithm does not depend on the particular response type.

Thus, given a new document, we first compute q(θ,z1:N)q(\theta,z_{1:N}), the variational posterior distribution of the latent variables θ\theta and ZnZ_{n}. We then estimate the response with

As with parameter estimation, this depends on being able to compute or approximate Eq[μ(ηZˉ)]\textrm{E}_{q}[\mu(\eta^{\top}\bar{Z})].

4 Examples

The previous three sections have outlined the general computational strategy for sLDA: a procedure for approximating the posterior distribution of the topic proportions θ\theta and topic assignments Z1:NZ_{1:N} on a per-document/response basis, a maximum likelihood procedure for fitting the topics β1:K\beta_{1:K} and GLM parameters {η,δ}\{\eta,\delta\} using variational EM, and a procedure for predicting new responses from observed documents using an approximate expectation based on a variational posterior.

Our derivation has remained free of any specific assumptions about the form of the response distribution. We now focus on sLDA for specific response distributions—the Gaussian and Poisson—and suggest a general approximation technique for handling other responses within the GLM framework. The response distribution blocked our derivation at three points: the computation of E[A(ηZˉ)]\textrm{E}[A(\eta^{\top}\bar{Z})] in the per-document ELBO of Equation 4, the computation of {\partial\over\partial\phi_{n}}\left\{\textrm{E}\bigl{[}A(\eta^{\top}\bar{Z})\bigr{]}\right\} in Equation 14 and corresponding update for the variational multinomial ϕn\phi_{n}, and the computation of Eq[μ(ηZˉd)Zˉd]\textrm{E}_{q}[\mu(\eta^{\top}\bar{Z}_{d})\bar{Z}_{d}] for fitting the GLM parameters in a variational EM algorithm. We note that other aspects of working with sLDA are general to any type of exponential family response.

When the response is Gaussian, the dispersed exponential family form can be seen as follows:

Here the natural parameter ηzˉ\eta^{\top}\bar{z} and mean parameter are identical. Thus, μ(ηzˉ)=ηzˉ\mu(\eta^{\top}\bar{z})=\eta^{\top}\bar{z}. The usual variance parameter σ2\sigma^{2} is equal to δ\delta. Moreover, notice that

We first specify the variational inference algorithm. This requires computing the expectation of the log normalizer in Equation 27, which hinges on \textrm{E}\bigl{[}\bar{Z}\bar{Z}^{\top}\bigr{]},

To see Equation 29, notice that for mnm\neq n, E[ZnZm]=E[Zn]E[Zm]=ϕnϕm\textrm{E}[Z_{n}Z_{m}^{\top}]=\textrm{E}[Z_{n}]\textrm{E}[Z_{m}]^{\top}=\phi_{n}\phi_{m}^{\top} because the variational distribution is fully factorized. On the other hand, E[ZnZn]=diag(E[Zn])=diag(ϕn)\textrm{E}[Z_{n}Z_{n}^{\top}]=\text{diag}(\textrm{E}[Z_{n}])=\text{diag}(\phi_{n}) because ZnZ_{n} is an indicator vector.

To take the derivative of the expected log normalizer, we consider it as a function of a single variational parameter ϕj\phi_{j}. Define ϕj:=njϕn\phi_{-j}:=\sum_{n\neq j}\phi_{n}. Expressed as a function of ϕj\phi_{j}, \textrm{E}\bigl{[}A(\eta^{\top}\bar{Z})\bigr{]} is

Substituting this gradient into Equation 14 yields an exact coordinate update for the variational multinomial ϕj\phi_{j},

Exponentiating a vector means forming the vector of exponentials. The proportionality symbol means the components of ϕjnew\phi_{j}^{\text{new}} are computed according to Equation 33, then normalized to sum to one. Note that E[logθi]\textrm{E}[\log\theta_{i}] is given in Equation 10.

Examining this update in the Gaussian setting uncovers further intuitions about the difference between sLDA and LDA. As in LDA, the jjth word’s variational distribution over topics depends on the word’s topic probabilities under the actual model (determined by β1:K\beta_{1:K}). But wjw_{j}’s variational distribution, and those of all other words, affect the probability of the response. To see this, consider the expectation of logp(yzˉ,η,δ)\log p(y\,|\,\bar{z},\eta,\delta) of Equation 24, which is is a term in the document-level ELBO of Equation 4. Notice that variational distribution q(zn)q(z_{n}) plays a role in the expected residual sum of squares E[(yηZˉ)2]\textrm{E}[(y-\eta^{\top}\bar{Z})^{2}]. The end result is that the update Equation 33 also encourages the corresponding variational parameter ϕj\phi_{j} to decrease this expected residual sum of squares.

Further, the update in Equation 33 depends on the variational parameters ϕj\phi_{-j} of all other words. Unlike LDA, the ϕj\phi_{j} cannot be updated in parallel. Distinct occurrences of the same term must be treated separately.

We now turn to parameter estimation for the Gaussian response sLDA model, i.e., the M-step updates for the parameters η\eta and δ\delta. Define y:=y1:Dy:=y_{1:D} as the vector of response values across documents and let XX be the D×KD\times K design matrix, whose rows are the vectors Zˉd\bar{Z}_{d}. Setting to zero the η\eta gradient of the corpus-level ELBO from Equation 18, we arrive at an expected-value version of the normal equations:

Here the expectation is over the matrix XX, using the variational distribution parameters chosen in the previous E-step. Note that the ddth row of E[X]\textrm{E}[X] is just E[Zˉd]\textrm{E}[\bar{Z}_{d}] from Equation 12. Also,

with each term having a fixed value from the previous E-step as well, given by Equation 29.

We now apply the first-order condition for δ\delta of Equation 19. The partial derivative needed is

which can be seen from Equation 27. Using this derivative and the definition of η^new\hat{\eta}_{\text{new}} in Equation 19, we obtain

It is tempting to try a further simplification of Equation 37: in an ordinary least squares (OLS) regression of yy on the columns of E[X]\textrm{E}[X], the analog of Equation 37 would just equal 1/D1/D times the sum of the squared residuals (yE[X]η^new)(y-\textrm{E}[X]\hat{\eta}_{\text{new}}). However, that identity does not hold here, because the inverted matrix in Equation 37 is E[XX]\textrm{E}[X^{\top}X], rather than E[X]E[X]\textrm{E}[X]^{\top}\textrm{E}[X]. This illustrates that the η\eta update Equation 34 is not just OLS regression of yy on E[X]\textrm{E}[X].

Finally, we form predictions just as in Section 3.3. Since the mapping from the natural parameter to the mean parameter is the identity function,

Again, the expectation is taken with respect to variational inference in the unsupervised setting (Equation 21 and Equation 22).

Poisson response

An overdispersed Poisson response provides a natural generalized linear model formulation for count data. Given mean parameter λ\lambda, the overdispersed Poisson has density

This can be put in the overdispersed exponential family form

In the GLM, the natural parameter logλ=ηzˉ\log\lambda=\eta^{\top}\bar{z}, h(y,δ)=1/y!h(y,\delta)=1/y!, and

We first compute the expectation of the log normalizer

The product in Equation 43 also provides E[μ(ηZˉ)]\textrm{E}[\mu(\eta^{\top}\bar{Z})], which is needed for prediction in Equation 20. Denote this product by CC and let CnC_{-n} denote the same product, but over all indices except for the nnth. The derivative of the expected log normalizer needed to update ϕn\phi_{n} is

As for the Gaussian response, this permits an exact update of the variational multinomial.

In the derivative of the corpus-level ELBO with respect to the coefficients η\eta (Equation 18), we need to compute the expected mean parameter times Zˉ\bar{Z},

We turn to the M-step. We cannot find an exact M-step update for the GLM coefficients, but we can compute the gradient for use in a convex optimization algorithm. The gradient of the corpus-level ELBO is

The derivative of h(y,δ)h(y,\delta) with respect to δ\delta is zero. Thus the dispersion parameter M-step is exact,

As for the general case, Poisson sLDA prediction requires computing \textrm{E}\bigl{[}\mu(\eta^{\top}\bar{Z})\bigr{]}. Since μ()\mu(\cdot) and A()A(\cdot) are identical, this is given in Equation 43.

Exponential family response via the delta method

With a general exponential family response one can use the multivariate delta method for moments to approximate difficult expectations (Bickel and Doksum,, 2007), a method which is effective in variational approximations (Braun and McAuliffe,, 2010). In other work, Chang and Blei, (2010) use this method with logistic regression to adapt sLDA to network analysis; Wang et al., (2009) use this method with multinomial regression for image classification. With the multivariate delta method, one can embed any generalized linear model into sLDA.

Empirical study

We studied sLDA on two prediction problems. First, we consider the “sentiment analysis” of newspaper movie reviews. We use the publicly available data introduced in Pang and Lee, (2005), which contains movie reviews paired with the number of stars given. While Pang and Lee, (2005) treat this as a classification problem, we treat it as a regression problem.

Analyzing document data requires choosing an appropriate vocabulary on which to estimate the topics. In topic modeling research, one typically removes very common words, which cannot help discriminate between documents, and very rare words, which are unlikely to be of predictive importance. We select the vocabulary by removing words that occur in more than 25% of the documents and words that occur in fewer than 5 documents. This yielded a vocabulary of 2180 words, a corpus of 5006 documents, and 908,000 observations. For more on selecting a vocabulary in topic models, see Blei and Lafferty, (2009).

Second, we studied the texts of amendments from the 109th and 110th Senates. Here the response variables are the discrimination parameters from an ideal point analysis of the votes (Clinton et al.,, 2004). Ideal point analysis of voting data is used in quantitative political science to map senators to a real-valued point on a political spectrum. Ideal point models posit that the votes of each senator jj are summarized with a real-valued latent variable xjx_{j}. The senator’s vote on an issue ii, which is the binary variable yijy_{ij}, is determined by a probit model,

Thus, each issue is connected to two parameters. For our analysis, we are most interested in the “issue discrimination” βi\beta_{i}. When βi\beta_{i} and xjx_{j} have the same sign, the senator jj is more likely to vote in favor of the issue ii. Just as xjx_{j} can be interpreted as a senator’s point on the political spectrum, βi\beta_{i} can be interpreted as an issue’s place on that spectrum. (The intercept term αi\alpha_{i} is called the “difficulty” and allows for bias in the votes, regardless of the ideal points of the senators.)

We connected the results of an ideal point analysis to texts about the votes. In particular, we study the amendments considered by the 109th and 110th U.S. Senates.We collected our data—the amendment texts and the voting records—from the open government web-site www.govtrack.com. As a preprocessing step, we estimated the discrimination parameters βi\beta_{i} for each amendment, based on the voting record. Each discrimination βi\beta_{i} is used as the response variable for its amendment text. We study the question: How well we can predict the political tone of an amendment based only on its text?

We pre-processed the Senate text in the same way as the reviews data. To obtain the response variables, we used the implementation of ideal point analysis in Simon Jackman’s Political Science Computational Laboratory R package. For the 109th Senate, this yielded 288 amendments, a vocabulary of 2084 words, and 62,000 observations. For the 110th Senate, this yielded 213 amendments, a vocabulary of 1653 words, and 63,000 observations.

For both reviews and senate amendments, we transformed the response to approximate normality by taking logs. This makes the data amenable to the continuous-response model of Section 2; for these two problems, generalized linear modeling turned out to be unnecessary. We initialized β1:K\beta_{1:K} to randomly perturbed uniform topics, σ2\sigma^{2} to the sample variance of the response, and η\eta to a grid on $inincrementsofin increments of2/K$. We ran EM until the relative change in the corpus-level likelihood bound was less than 0.01%. In the E-step, we ran coordinate-ascent variational inference for each document until the relative change in the per-document ELBO was less than 0.01%.

We can examine the patterns of words found by models fit to these data. In Figure 2 we illustrate a 12-topic sLDA model fit to the movie review data. Each topic is plotted as a list of its most likely words, and each is attached to its estimated coefficient in the linear model. Words like “powerful” and “complex” are in a topic with a high positive coefficient; words like “waste” and “unbearable” are in a topic with high negative coefficient. In Figure 3 and Figure 4 we illustrate sLDA models fit to the Senate amendment data. These models are harder to interpret than the models fit to movie review data, as the response variable is likely less governed by the text. (Much more goes into a Senator’s decision of how to vote.) Nonetheless, some patterns are worth noting. The health care amendments in the 110th Senate were a distinctly right wing issue; grants and immigration in the 109th Senate were a left wing issue.

We assessed the quality of the predictions using five fold cross-validation. We measured error in two ways. First, we measured correlation between the out-of-fold predictions with the out-of-fold response variables. Second, we measured “predictive R2.” We defined this quantity as the fraction of variability in the out-of-fold response values which is captured by the out-of-fold predictions:

We compared sLDA to linear regression on the ϕˉd\bar{\phi}_{d} from unsupervised LDA. This is the regression equivalent of using LDA topics as classification features (Blei et al.,, 2003; Fei-Fei and Perona,, 2005). We studied the quality of these predictions across different numbers of topics. Figures 5, 6 and 7 illustrate that sLDA provides improved predictions on all data. The movie review rating is easier to predict that the U.S. Senates discrimination parameters. However, our study shows that there is predictive power in the texts of the amendments.

Finally, we compared sLDA to the lasso, which is L1-regularized least-squares regression. The lasso is a widely used prediction method for high-dimensional problems (Tibshirani,, 1996). We used each document’s empirical distribution over words as its lasso covariates. We report the highest pR2\textrm{pR}^{2} it achieved across different settings of the complexity parameter, and compare to the highest pR2 attained by sLDA across different numbers of topics. For the reviewer data, the best lasso pR2\textrm{pR}^{2} was 0.4260.426 versus 0.4320.432 for sLDA, a modest 2%2\% improvement. On the U.S. Senate data, sLDA provided definitively better predictions. For the 109th U.S. Senate data the best lasso pR2\textrm{pR}^{2} was 0.150.15 versus 0.270.27 for sLDA, an 80%80\% improvement. For the 110th U.S. Senate data, the best lasso pR2\textrm{pR}^{2} was 0.160.16 versus 0.230.23 for sLDA, a 43%43\% improvement. Note moreover that the Lasso provides only a prediction rule, whereas sLDA models latent structure useful for other purposes.

Discussion

We have developed sLDA, a statistical model of labelled documents. The model accommodates the different types of response variable commonly encountered in practice. We presented a variational procedure for approximate posterior inference, which we then incorporated in an EM algorithm for maximum-likelihood parameter estimation. We studied the model’s predictive performance, our main focus, on two real-world problems. In both cases, we found that sLDA improved on two natural competitors: unsupervised LDA analysis followed by linear regression, and the lasso. These results illustrate the benefits of supervised dimension reduction when prediction is the ultimate goal.

We close with remarks on future directions. First, a “semi-supervised” version of sLDA—where some documents have a response and others do not—is straightforward. Simply omit the last two terms in Equation 14 for unlabelled documents and include only labelled documents in Equation 18 and Equation 19. Since partially labelled corpora are the rule rather than the exception, this is a valuable avenue. (Though note, in this setting, that care must be taken that the response data exert sufficient influence on the fit.) Second, if we observe an additional fixed-dimensional covariate vector with each document, we can include it in the analysis just by adding it to the linear predictor. This change will generally require us to add an intercept term as well. Third, the technique we have used to incorporate a response can be applied in existing variants of LDA, such as author-topic models (Rosen-Zvi et al.,, 2004), population-genetics models (Pritchard et al.,, 2000), and survey-data models (Erosheva,, 2002). We have already mentioned that sLDA has been adapted to network models (Chang and Blei,, 2010) and image models (Wang et al.,, 2009). Finally, we are now studying the maximization of conditional likelihood rather than joint likelihood, as well as Bayesian nonparametric methods to explicitly incorporate uncertainty about the number of topics.

References