Deep Exponential Families

Rajesh Ranganath, Linpeng Tang, Laurent Charlin, David M. Blei

Introduction

In this paper we develop deep exponential families (DEFs), a flexible family of probability distributions that reflect the intuitions behind deep unsupervised feature learning. In a DEF, observations arise from a cascade of layers of latent variables. Each layer’s variables are drawn from an exponential family that is governed by the inner product of the previous layer’s variables and a set of weights.

As in deep unsupervised feature learning, a DEF represents hidden patterns, from coarse to fine grained, that compose with each other to form the observations. DEFs also enjoy the advantages of probabilistic modeling. Through their connection to exponential families , they support many kinds of data. Overall DEFs combine the powerful representations of deep networks with the flexibility of graphical models.

Consider the problem of modeling documents. We can represent a document as a vector of term counts modeled with Poisson random variables . In one type of DEF, the rate of each term’s Poisson count is an inner product of a layer of latent variables (one level up from the terms) and a set of weights that are shared across documents. Loosely, we can think of the latent layer above the observations as per-document “topic” activations, each of which ignites a set of related terms via their inner product with the weights. These latent topic are, in turn, modeled in a similar way, conditioned on a layer above of “super topics.” Just as the topics group related terms, the super topics group related topics, again via the inner product.

Figure 1 illustrates an example of a three level DEF uncovered from a large set of articles in The New York Times. (This style of model, though with different details, has been previously studied in the topic modeling literature .) Conditional on the word counts of the articles, the DEF defines a posterior distribution of the per-document cascades of latent variables and the layers of weights. Here we have visualized two third-layer topics which correspond to the concepts of “Government” and “Politics”. We focus on “Government” and notice that the model has discovered, through its second-layer super-topics, the three branches of government: judiciary ( left), legislative (center) and executive (right).

This is just one example. In a DEF, the latent variables can be from any exponential family: Bernoulli latent variables recover the classical sigmoid belief network ; Gamma latent variables give something akin to deep version of nonnegative matrix factorization ; Gaussian latent variables lead to the types of models that have recently been explored in the context of computer vision . DEFs fall into the broad class of stochastic feed forward networks defined by Neal . These networks differ from the undirected deep probabilistic models in that they allow for explaining away, where latent variables compete to explain the observations.

In addition to varying types of latent variables, we can further change the prior on the weights and the observation model. Observations can be real valued, such as those from music and images, binary, such as those in the sigmoid belief network, or multinomial, such as when modeling text. In the language of neural networks, the prior on the weights amounts to choosing a type of regularization; the observation model amounts to choosing a type of loss.

Finally, we can embed the DEF in a more complex model, building ”deep” versions of traditional models from the statistics and machine learning research literature. As examples, the DEF can be made part of a multi-level model of grouped data , time-series model of sequential data , or a factorization model of pairwise data . As a concrete example, we will develop and study the double DEF. The double DEF models a matrix of pairwise observations, such as users rating items. It uses two DEFs, one for the latent representation of a user and one for the representation of an item. The observation of each user/item interaction combines the lowest layer of their individual DEF representations.

In the rest of this paper, we will define, develop, and study deep exponential families. We will explain some of their properties and situate them in the larger contexts of probabilistic models and deep neural networks. We will then develop generic variational inference algorithms for using DEFs. We will s how how to use them to solve real-world problems with large data sets, and we will extensively study many DEFs on the problems of document modeling and collaborative filtering. We show that DEF-variants of existing ”shallow” models give more interesting exploratory structure and better predictive performance. More generally, DEFs are a flexible class of models which, along with our algorithms for computing with them, let us easily explore a rich landscape of solutions for modern data analysis problems.

Deep exponential families

In this section we review exponential families and present deep exponential families.

Exponential families are an important class of distributions with convenient mathematical properties. They take the following form.

where $h$ is the base measure, $\eta$ are the natural parameters, $T$ are the sufficient statistics, and $a$ is the log-normalizer. The expectation of the sufficient statistics of an exponential family is the gradient of the log-normalizer $\textrm{E}[T(x)]=\nabla_{\eta}a(\eta)$ . Exponential families are completely specified by their sufficient statistics and base measure; different choices of $h$ and $T$ lead to different distributions. For example, in the normal distribution the base measure is $h=\sqrt{(2\pi)}$ and the sufficient statistics are $T(x)=[x,x^{2}]$ ; and for the Beta distribution, a distribution with support over $(0,1)$ , the base measure is $h=1$ and sufficient statistics are and $T(x)=[\log x,\log 1-x]$ .

Deep exponential families.

To construct deep exponential families, we will chain exponential families together in a hierarchy, where the draw from one layer controls the natural parameters of the next.

For simplicity, we omit the data index $n$ and describe the distribution of a single data point $x$ . First, the top layer of latent variables are drawn given a hyperparameter $\eta$

where the notation $\textsc{expfam}(x,\eta)$ denotes $x$ is drawn from an exponential family with natural parameter $\eta$ .We are loose with the base measure $h$ as it can be absorbed into the dominating measure.

Next, each latent variable is drawn conditional on the previous layer,

DEFs can also be understood as random effects models where the random variables are controlled by the product of a weight vector and a set of latent covariates.

Likelihood.

The data are drawn conditioned on the lowest layer of the DEF, $p(x_{n,i}\,|\,\bm{\mathbf{z_{n,1}}})$ . Separating the likelihood from the DEF will allow us to compose and embed DEFs in other models. Later, we provide an example where we combine two DEFs to form a model for pairwise data.

In this paper we focus on count data, thus we use the Poisson distribution as the observation likelihood. The Poisson distribution with mean $\lambda$ is

If we let $x_{n,i}$ be the count of type $i$ associated with observation $n$ , then $x_{n,i}$ ’s distribution is

The observation weights $\bm{\mathbf{W}}_{0}$ is matrix where each entry is gamma distributed. We will discuss gamma distribution further in the next section.

Returning to the example from the introduction of modeling documents, the $x_{n}$ are a vector of term counts. This means the observation weights $\bm{\mathbf{W}}_{0}$ put positive mass on groups of terms. Thus, they form “topics.” Similarly, the weights on the second layer represents “super topics,” and the weights on the third layer represent “concepts.” The distribution $p(\bm{\mathbf{z}}_{n,1}\,|\,\bm{\mathbf{z}}_{n,2},\bm{\mathbf{W}}_{1})$ represents the distribution of “topics” given the “super topics” of a document. Figure 1 depicts the compositional and sharing semantics of DEFs.

The link function.

Consider the case of the identity link function, where $g_{l}(x)=x$ . In this case, the expectation of latent variables in deep exponential families is transformed by the log-normalizer at each level. This transformation of the expectation is one source of non-linearity in DEFs. It parallels the non-linearities used in neural networks.

To be clear, here is a representation of how the values and weights of one layer control the expected sufficient statistics at the next:

For example, in the sigmoid belief network , we will see that the identity link function recovers the sigmoid transformation used in neural networks.

Examples

To illustrate the potential of deep exponential families, we present three examples: the sparse gamma DEF, the sigmoid belief network, and a latent Poisson DEF.

The sparse gamma DEF is a DEF with gamma distributed layers. The gamma distribution is an exponential family distribution with support over the positive reals. The probability density of the gamma distribution with natural parameters, $\alpha$ and $\beta$ , is

where $\Gamma$ is the gamma function. The expectation of the gamma distribution is $\textrm{E}[z]=\alpha\beta^{-1}$ .

Through the link function in DEFs, the inner product of the previous layer and the weights control the natural parameters of the next layer. For sparse gamma models, we let components in a layer to control the expected activation of the next layer, while the shape at each layer remains fixed.

As the expectation of gamma variables needs to be positive, we let the weight matrices be gamma distributed as well.

When $\alpha$ is small (less than $1$ ), gamma distribution puts most of its mass near zero, and we call this type of distributions sparse gamma. This type of distribution is akin to a soft spike-slab prior . Spike and slab priors have shown to perform well on feature selection and unsupervised feature discovery . Thus sparse gamma distributions are like spike and slab priors, so we use them as the prior for the observation weights $W_{1}$ and all DEF weights that are constrained to be positive.

The sparse gamma distribution differs from distributions such as the normal and Poisson in how the probability mass moves given a change in the mean. For example, when the expected value is high, draws from the Poisson distribution are likely to be much larger than zero, while in the sparse gamma distribution draws will either be close to zero or very large. This is like growing the slab of our soft spike and slab prior. Figure 3 visually demonstrates this. We plot both the Poisson and sparse gamma distribution in both settings. As an example in the sparse gamma DEF for documents, this means an observation does not have to express every “super topic” in a “concept” it expresses.

We estimate the posterior on this DEF using one to three layers for two large text copora: Science and The New York Times (NYT). We defer the discussion of the details of the corpora to Section 6. The topic hierarchy shown earlier in Figure 1 is from a three layer sparse gamma DEF. In the appendix we present a portion of the Science hierarchy.

Sigmoid belief network.

This is a special case of a deep exponential family with Bernoulli latent layers and the identity link function. To see this, use Eq. 1 to form the Bernoulli conditional of a hidden layer,

In the sigmoid belief network, we allow for the natural parameters to have intercepts. The intercepts provide a baseline activation for each feature independent of the shared weights.

Poisson DEF.

The Poisson distribution is a distribution over counts that models the number of events occurring in an interval. Its exponential family form with parameter $\eta$ is

where the mean of this distribution is $e^{\eta}$ .

We define the Poisson deep exponential family as a DEF with latent Poisson levels and log-link function. We add an intercept to ensure positivity of the rate. This corresponds to the following conditional distribution for the layers

In the case of document modeling, the value of $z_{2,k}$ represents how many times “super topic” k is represented in this example.

Using the link function property of DEFs described earlier, the mean activation at the next layer is given by the gradient of the log-normalizer

Similar to the sigmoid belief network, we allow the natural parameter to have an intercept. This type of Poisson DEFs can be seen as an extension of the sigmoid belief network, where each observation expresses an integer count number of a feature rather than just turning the feature on or off. Table 1 summarizes the DEFs we have described and will study in our experiments.

Related Work

Graphical models and neural nets have a long and distinguished history. A full review is outside of the scope of this article, however we highlight some key results as they relate to DEFs. More generally, deep exponential families fall into the broad class of stochastic feed forward belief networks , but Neal focuses mainly on one example in this class, the sigmoid belief network, which is a binary latent variable model. Several existing stochastic feed forward networks are DEFs, such as latent Gaussian models and the sigmoid belief network with layerwise dependencies .

Undirected graphical models have also been used in inferring compositional hierarchies. Salakhutdinov and Hinton propose deep probabilistic models based on Restricted Boltzmann Machines (RBMs) . RBMs are a two layer undirected probabilistic model with one layer of latent variables and one layer of observations tied together by a weight matrix. Directed models such as DEFs have the property of explaining away, where independent latent variables under the prior become dependent conditioned on the observations. This property makes inference harder than in RBMs, but forces a more parsimonious representation where similar features compete to explain the data rather than work in tandem .

RBMs have been extended to general exponential family conditionals in a model called exponential family harmoniums (EFH) . A certain infinite DEF with tied weights is equivalent to an EFH , but as our weights are not tied, deep exponential families represent a broader class of models than exponential family harmoniums (and RBMs).

The literature of latent variable models relates to DEFs through hierarchical models and Bayesian factor analysis. Latent tree hierarchies have been constructed with specific distributions (Dirchlet) , while Bayesian factor analysis methods such as exponential family PCA and multinomial PCA can be seen as a single layer deep exponential family.

Inference

The central computational problem for working with DEFs is posterior inference. The intractability of the partition function means posterior computations require approximations. Past work on sigmoid belief networks has proposed doing greedy layer-wise learning (for a specific kind of network) . Here, instead, we develop variational methods that are applicable to general DEFs.

Variational inference casts the posterior inference problem as an optimization problem. Variational algorithms seek to minimize the KL divergence to the posterior from an approximating distribution $q$ . This is equivalent to maximizing the following ,

where $z$ denotes all latent variables associated with each observation and $W$ all latent variables shared across observations. This objective function is called the Evidence Lower BOund (ELBO) because it is a lower bound on $\log p(x)$ .

For the approximating distribution, $q$ , we use the mean field variational family. In the mean field approximating family, the distribution over the latent variables factorizes. We focus on the running example of a Poisson likelihood and $n$ observations. The variational family is

where the exponential family is the same one as the model distribution $p$ . Similarly, we choose $q(W)$ to be in the same family as $p(W)$ with parameters $\xi$ .

To maximize the ELBO, we need to compute expectations under the approximation $q$ . These expectations for general DEFs will not have a simple analytic form. Thus we use more recent “black box” variational inference techniques that step around computing this expectation .

Black box variational inference methods use stochastic optimization to avoid the analytic intractability of computing the objective function. Stochastic optimization works by following noisy unbiased gradients. In black box variational inference , the gradient of the ELBO with respect to the parameters of a latent variable can be written as an expectation with respect to the variational approximation.

We compute Monte Carlo estimates of this gradient by averaging the evaluation of the gradient at several samples. To compute the Monte Carlo estimate of the gradient, we need to be able to sample from the approximation to evaluate the Markov blanket for each latent variable, the approximating distribution, and the gradient of the log of the approximating distribution (score functions). We detail the score functions in the appendix. From this equation, we can see that the primary cost in computing the gradients is in evaluating the likelihood and score function on a sample. To speed up our algorithm, we parallelize the likelihood computation across samples.

The Markov blanket for a latent variable in the first layer of a DEF is

The Markov blank for a latent variable in the intermediate layer is

The gradients and Markov blankets for $W$ can be written similarly.

Stochastic optimization requires a learning rate to scale the noisy gradients before applying them to the current parameters. We use RMSProp which scales the gradient by the square root of the online average of the sqaured gradient.www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf RMSProp captures the varying length scales and noise through the sum of squares term used to normalize the gradient step. We present a sketch of the algorithm in Algorithm 1, and present the full algorithm in the appendix.

Experiments

We have introduced DEFs and detailed a procedure for posterior inference in DEFs. We now provide an extensive evaluation of DEFs. We report predictive results from 28 different DEF instances where we explore the number of layers (1, 2 or 3), the latent variable distributions (gamma, Poisson, Bernoulli) and the weight distributions (normal, gamma) using a Poisson observational model. Furthermore, we instantiate and report results using a combination of two DEFs for pairwise data.

Show improvements over strong baselines for both topic modeling and collaborative filtering on a total of four corpora.

Lead us to conclude that deeper DEFs and sparse gamma DEFS display the strongest performance overall.

We consider two large text corpora Science and The New York Times. Science consists of 133K documents and 5.9K terms. The New York Times consists of 166K documents and 8K terms.

As a baseline we consider Latent Dirichlet Allocation a popular topic model, and state-of-the-art DocNADE . DocNADE estimates the probability of a given word in a document given the previously observed words in that document. In DocNADE, the connections between each observation and the latent variables used to generate the observations are shared.

We note that the one layer sparse gamma DEF is equivalent to Poisson matrix factorization but our model is fully Bayesian and our variational distribution is collapsed.

Evaluation.

We compute perplexity on a held out set of 1,000 documents. Held out perplexity is given by

Conditional on the total number of held out words, the distribution of the held out words becomes multinomial. The mean of the conditional multinomial is given by the normalized Poisson rate in each document. We set the rates to the expected value under the variational distribution. Additionally, we let all methods see ten percent of the words in each document; the other ninety percent form the held out set. This similar to the document completion evaluation metric in Wallach et al. except we query the test words independently. We use the observed ten percent to compute the variational distribution for the document specific latent variables, the DEF for the document, while keeping the approximation on the shared weights fixed. In DocNADE, this corresponds to always seeing a fixed set of words first, then evaluating each new word given the first ten percent of the document.

Held out perplexity differs from perplexity computed from the predictive distribution $p(x^{*}\,|\,x)$ . The latter can be a more difficult problem as we only ever condition on a fraction of the document. Additionally computing perplexity from the predictive distribution requires computationally demanding sampling procedures which for most models like LDA only allow testing of only a small number (50) of documents . In contrast our held-out test metric can be quickly computed for 1,000 test documents.

Architectures and hyperparameters.

We build one, two and three layer hierarchies of the sparse gamma DEF, the sigmoid belief network, Poisson DEF, and log-link Poisson DEF. The sizes of the layers are 100, 30, and 15, respectively. We note that while different DEFs may have better predictive performance at different sizes, we consider DEFs of a fixed size as we also seek a compact explorable representation of our corpus. One hundred topics fall into the range of topics searched in the topic modeling literature . We detail the hyperparameters for each DEF in the appendix.

We observe two phases to DEF convergence, it converges quickly to a good held-out perplexity (around 2,000 iterations) and then slowly improves until final convergence (around 10,000 iterations). Each iteration takes approximately 30 seconds on a modern 32-core machine (from Amazon’s AWS).

Results.

Table 2 summarizes the predictive results on both corpora. We note that DEFs outperform the baselines on both datasets. Furthermore moving beyond one layer models generally improves performance as expected. The table also reveals that stacking layers of gamma latent variables leads always leads to similar or better performance. Finally, as shown by the Poisson DEFs with different link functions, we find gamma-distributed weights to outperform normally-distributed weights. Somewhat related, we find sigmoid DEFs (with normal weights) to be more difficult to train and deeper version perform poorly.

2 Matrix Factorization

Previously, we constructed models out of a single DEF, but DEFs can be embedded and composed in more complex models. We now present the double DEF, a factorization model for pairwise data where both the rows and columns are determined by DEFs. The graphical model of this double DEF corresponds to replacing $\bm{\mathbf{W}}_{0}$ in Figure 2 with another DEF.

We focus on factorization of counts (ratings, clicks). The observed data are generated with a Poisson distribution. The observation likelihood for this double DEF is $p(x_{n,i}\,|\,\bm{\mathbf{z}}_{n,1}^{c},\bm{\mathbf{z}}_{i,1}^{r})=\textrm{Poisson}({\bm{\mathbf{z}}_{n,1}^{c}}^{\top}{\bm{\mathbf{z}}_{i,1}^{r}})$ , where $\bm{\mathbf{z}}_{n,1}^{c}$ is the lowest layer of a DEF for the $n$ th observation and $\bm{\mathbf{z}}_{i,1}^{r}$ is the lowest layer of a DEF for the $i$ th item. The double DEF has hierarchies on both users and items.

We infer double DEFs on Netflix movie ratings and click data from the ArXiv (www.arXiv.org) which indicates how many times a user clicked on a paper. Our Netflix collection consists of 50K users and 17.7K movies. The movie ratings range from zero to five stars, where zero means the movie was unrated by the user. The ArXiv collection consists of 18K users and 20K documents. We fit a one, two, and three layer double DEF where the sizes of the row DEF match the sizes of the column DEF at each layer. The sizes of the layers are 100, 30, and 15. We compare double DEFs to $l2$ -regularized (Gaussian) matrix factorization (MF) . We re-use the testing procedure introduced in the previous section (this is referred to as strong-generalization in the recommendation literature ) where the held-out test set contains one thousand users. For performance and computational reasons we subsample zero-observations for MF as is standard . Further, we also report the commonly-used multi-level ranking measure (un-truncated) NDCG for all methods.

Table 3 shows that two-layer DEFs improve performance over the shallow DEF and that all DEFs outperform Gaussian MF. On perplexity the three layer model performs similarly on Netflix and slightly worse on the ArXiv. The table further highlights that when comparing ranking performance, the advantage of deeper models is especially clear on low-activity users (NDCG across all test users is comparable within the three DEFs architectures and is not reported here). This data regime is of particular importance for practical recommender systems. We postulate that this due to the hierarchy in deeper models acting as a more structured prior compared to single-layer models.

Discussion

We develop deep exponential families as a way to describe hierarchical relationships of latent variables to capture compositional semantics of data. We present several instantiations of deep exponential families and achieve improved predictive power and interpretable semantic structures for both problems in text modeling and collaborative filtering.

References

Appendix

Following the notation from the main paper, the general algorithm for mean field variational inference in deep approximating families is given in (Alg. 2).

Properties of q𝑞q.

In our experiments, we use four variational families (Poisson, gamma, Bernoulli, and normal). We detail the necessary score functions here. For the Poisson, the distribution is given by:

For the gamma, we use the shape $\alpha$ and scale $\theta$ as variational parameters. The distribution is given by

For the Bernoulli distribution, we use the natural parameterization with parameter $\eta$ to form the variational approximation. The distribution is

For the normal variational approximation, we use the standard parameterization by mean $\mu$ and variance $\sigma^{2}$ . The distribution is

Parameterizations of Variational Distributions.

Several of our variational parameters like the variance of the normal have positive constraints To enforce positivity constraints, we transform an unconstrained variable by $\log(1+\exp(x))$ . To avoid numerical issues when sampling, we truncate values when appropriate. Gradients of the unconstrained parameters are obtained with the chain rule from the score function and the derivative of softmax: $\exp(x)/(1+\exp(x))$ .

Optimization

We perform gradient ascent step on the ELBO using

$\rho$ is a fixed scalar set to $0.2$ in our experiments. $\nabla_{\theta}\text{ELBO}$ is a noisy gradient estimated using BBVI. $\Gamma$ is a diagonal preconditioning matrix estimated using the RMSProp heuristicDescribed by G. Hinton, RMSprop: Divide the gradient by a running average of its recent magnitude, in Coursera online course: Neural networks for machine learning, lecture 6e, 2012.. A diagonal element of $\Gamma$ is the reciprocal of the squared root of a running average of the squares of historical gradients of that component. We used a window size of $10$ in our experiments.

Hyperparameters and Convergence

We use the same hyperparameters on Gamma distributions on each layer with shape and rate $0.3$ . For the sigmoid belief network we use a prior of $0.1$ to achieve some sparsity as well. We fix the Poisson prior rate to be $0.1$ . For gamma $W$ ’s we use shape $0.1$ and rate $0.3$ . For Gaussian $W$ ’s we use a prior mean of and variance of $1$ . We let the experiments run for 10,000 iterations at which point the validation likelihood is stable.

For the double DEF, we set all shapes to $0.1$ and rates to $0.3$ . We let the Double DEF experiment run for about 10,000 iterations. The validation likelihood had converged for all models by this point.