Dynamic Bernoulli Embeddings for Language Evolution

Maja Rudolph, David Blei

Introduction

Word embeddings are a collection of unsupervised learning methods for capturing latent semantic structure in language. Embedding methods analyze text data, learning distributed representations of the vocabulary to capture its co-occurrence statistics. These learned representations are then useful for reasoning about word usage and meaning (Harris, 1954; Rumelhart et al., 1986). With large data sets and approaches from neural networks, word embeddings have become an important tool for analyzing language (Bengio et al., 2003; Mikolov et al., 2013c, b, a; Pennington et al., 2014; Levy & Goldberg, 2014; Arora et al., 2015).

Recently, Rudolph et al. (2016) developed exponential family embeddings. Exponential family embeddings distill the key assumptions of an embedding problem, generalize them to many types of data, and cast the distributed representations as latent variables in a probabilistic model. They encompass many existing methods for embeddings and open the door to bringing expressive probabilistic modeling (Bishop, 2006; Murphy, 2012) to the problem of learning distributed representations (Bengio et al., 2003).

Here we use exponential family embeddings to develop dynamic word embeddings, a method for learning distributed representations that change over time. Dynamic embeddings analyze long-running texts, e.g., documents that span many years, where the way words are used changes over time. The goal of dynamic embeddings is to characterize and understand those changes.

Figure 1 illustrates the approach. It shows the changing representation of intelligence in two corpora, the collection of computer science abstracts from the ACM 1951–2014 and the U.S. Senate speeches 1858–2009. On the y-axis is “meaning,” a proxy for the dynamic representation of the word; in both corpora, its representation changes dramatically over the years. To understand where it is located, the plots also show similar words (according to their changing representations) at various points. Loosely, in the ACM corpus intelligence changes from government intelligence to cognitive intelligence to artificial intelligence; in the Congressional record intelligence changes from psychological intelligence to government intelligence. Section 3 gives other examples from these corpora, such as iraq, data, and computer.

In more detail, a word embedding uses representation vectors to parameterize the conditional probabilities of words in the context of other words. Dynamic embeddings divide the documents into time slices, e.g., one per year, and cast the embedding vector as a latent variable that drifts via a Gaussian random walk. When fit to data, the dynamic embeddings capture how the representation of each word drifts from slice to slice.

Section 2 describes dynamic embeddings and how to fit them. Section 3 studies this approach on three datasets: 9 years of Arxiv machine learning papers (2007–2015), 64 years of computer science abstracts (1951–2014), and 151 years of U.S. Senate speeches (1858–2009). Dynamic embeddings give better predictive performance than existing approaches and provide an interesting exploratory window into how language changes.

Related work. Language is known to evolve (Aitchison, 2001; Kirby et al., 2007) and there have been several lines of research around capturing semantic shifts. Mihalcea & Nastase (2012); Tang et al. (2016) detect semantic changes of words using features such as part-of-speech tags and entropy and Sagi et al. (2011); Basile et al. (2014) employ latent semantic analysis and temporal semantic indexing for quantifying changes in meaning.

Most closely related to our work are methods for dynamic word embeddings (Kim et al., 2014; Kulkarni et al., 2015; Hamilton et al., 2016). These methods train a separate embedding model for each time slice of the data. While interesting, this requires enough data in each time slice such that a high quality embedding can be trained for each. Further, because each time slice is trained independently, the dimensions of the embeddings are not comparable across time; they must use initialization (Kim et al., 2014) or ad-hoc alignment techniques (Kulkarni et al., 2015; Hamilton et al., 2016; Zhang et al., 2016) to stitch them together.

In contrast, the representations of dynamic embeddings are sequential latent variables. Dynamic embeddings naturally accommodates time slices with sparse data and immediately connect the latent dimensions across time. In Section 3, we found that dynamic embeddings provide quantitative improvements over independently fitting each slice.Two similar models have been independently developed. Bamler & Mandt (2016) model both the embeddings and the context vectors using an Uhlenbeck-Ornstein process (Uhlenbeck & Ornstein, 1930). Yao et al. (2017) factorize the pointwise mutual information (pmi) matrix at different time slices. Their regularization also resembles an Uhlenbeck-Ornstein process.

Dynamic topic modeling also studies text data over time (Blei & Lafferty, 2006; Wang & McCallum, 2006; Wang et al., 2008; Gerrish & Blei, 2010; Wijaya & Yeniterzi, 2011; Yogatama et al., 2014; Mitra et al., 2014, 2015; Frermann & Lapata, 2016). This class of models describes documents in terms of topics, which are distributions over the vocabulary, and then allows the topics to change over the course of the collection. As in dynamic embeddings, some dynamic topic models use a Gaussian random walk to capture drift in the underlying language model; for example, see Blei & Lafferty (2006); Wang et al. (2008); Gerrish & Blei (2010); Frermann & Lapata (2016).

Though topic models and word embeddings are related, they are ultimately different approaches to language modeling. Topic models capture co-occurrence of words at the document level and focus on heterogeneity, i.e., that a document can exhibit multiple topics (Blei et al., 2003). Word embeddings capture co-occurrence in terms of proximity in the text, usually focusing on small neighborhoods around each word (Mikolov et al., 2013c). Combining dynamic topic models and dynamic word embeddings is an area for future study.

Dynamic Embeddings

We develop dynamic embeddings, a type of exponential family embedding (efe) (Rudolph et al., 2016) that captures sequential changes in the representation of the data. We focus on text data and the Bernoulli embedding model.

In this section, we review Bernoulli embeddings for text and show how to include dynamics into the model. We then derive the objective function for dynamic embeddings and develop stochastic gradients to optimize it.

Bernoulli embeddings for text. An exponential family embedding is a conditionally specified model. It has three ingredients: The context, the conditional distribution of each data point, and the parameter sharing structure.

In an efe for text, the data is a corpus of text. It is a collection of words $(x_{1},\ldots,x_{N})$ from a vocabulary of size $V$ . Each word $x_{i}\in\{0,1\}^{V}$ is an indicator vector (also called a “one-hot” vector). It has exactly one nonzero entry at $v$ , where $v$ is the vocabulary term at position $i$ .

In an efe each data point has a context. In text, the context of each word is its neighborhood; thus it is modelled conditional on the words that come before and after it. Typical context sizes range between $2$ and $10$ words. (This is set in advance or by cross-validation.)

We will build on Bernoulli embeddings, which provide a conditional model for the individual entries of the indicator vectors $x_{iv}\in\{0,1\}$ . Let $c_{i}$ be the set of positions in the neighborhood of position $i$ and let $\bm{\mathbf{x}}_{c_{i}}$ denote the collection of data points indexed by those positions. The conditional distribution of $x_{iv}$ is

where $p_{iv}\in(0,1)$ is the Bernoulli probability.Multinomial embeddings (Rudolph et al., 2016) model each indicator vector $x_{i}$ with a categorical conditional distribution, but this requires expensive normalization in form of a softmax function. For computational efficiency, one can replace the softmax with the hierarchical softmax (Mikolov et al., 2013b) or employ approaches related to noise contrastive estimation (Gutmann & Hyvärinen, 2010; Mnih & Kavukcuoglu, 2013). Bernoulli embeddings relax the one-hot constraint of $x_{i}$ , and work well in practice; they relate to the negative sampling (Mikolov et al., 2013b).

This is the inner product between the embedding $\rho_{v}$ and the context vectors of the words that surround position $i$ . (Because $x_{j}$ is an indicator vector, the sum over the vocabulary selects the appropriate context vector $\alpha$ at position $j$ .) The goal is to learn the embeddings and context vectors.

The index on the parameters does not depend on position $i$ , but only on term $v$ ; the embeddings are shared across all positions in the text. This is what Rudolph et al. (2016) call the parameter sharing structure. It ensures, for example, that the embedding vector for intelligence is the same wherever it appears in the corpus. (Dynamic embeddings partially relax this restriction.)

The natural parameter of the conditional likelihood is similar to Equation (2) but with the embedding vector $\rho_{v}$ replaced by the per-time-slice embedding vector $\rho_{v}^{(t_{i})}$ ,

Finally, dynamic embeddings use a Gaussian random walk as a prior on the embedding vectors,

Given data, this leads to smoothly changing estimates of each term’s embedding.Because $\bm{\alpha}$ and $\bm{\rho}$ appear only as inner products in Equation (2), there is some redundancy in placing temporal dynamics on both the embeddings and the context vectors. Exploring dynamics in $\bm{\alpha}$ is a subject for future study.

Figure 2 gives the graphical model for dynamic embeddings. Dynamic embeddings are a conditionally specified model, which in general are not guaranteed to imply a consistent joint distribution. But dynamic Bernoulli embeddings model binary data, and thus a joint exists (Arnold et al., 2001).

Fitting dynamic embeddings. Calculating the joint is computationally intractable. Rather, we fit dynamic embeddings with the pseudo log likelihood, the sum of the log conditionals. This is a commonly used objective for conditionally specified models (Arnold et al., 2001).

In detail, we regularize the pseudo log likelihood with the log priors and then maximize to obtain a pseudo MAP estimate. For dynamic Bernoulli embeddings, this objective is the sum of the log priors and the conditional log likelihoods of the data $x_{iv}$ . We divide the data likelihood into two parts, the contribution of nonzero data entries $\mathcal{L}_{\text{pos}}$ and contribution of zero data entries $\mathcal{L}_{\text{neg}}$ ,

where $\sigma(\cdot)$ is the sigmoid, which maps natural parameters to probabilities.

The parameters $\bm{\rho}$ and $\bm{\alpha}$ appear in the natural parameters $\eta_{iv}$ of Equations (2) and (3) and in the log prior. The random walk prior penalizes consecutive word vectors $\rho_{v}^{(t-1)}$ and $\rho_{v}^{(t)}$ for drifting too far apart. It prioritizes parameter settings for which the norm of their difference is small.

The most expensive term in the objective is $\mathcal{L}_{\text{neg}}$ , the contribution of the zeroes to the conditional log likelihood. The objective is cheaper if we subsample the zeros. Rather than summing over all words which are not at position $i$ , we sum over a subset of negative examples drawn at random. Mikolov et al. (2013b) call this negative sampling and recommend sampling from the unigram distribution raised to the power of $0.75$ .

With negative sampling, we redefine $\mathcal{L}_{\text{neg}}$ in Equation (6). Denote the sampling distribution of zeros as $\hat{p}$ ,

This sum has fewer terms and reduces the contribution of the zeros to the objective. In a sense, this incurs a bias—the expectation with respect to the negative samples is not equal to the original objective—but “downweighting the zeros” can improve prediction accuracy (Hu et al., 2008; Liang et al., 2016).

We fit the objective (Equation (6) with Equation (7)) using stochastic gradients (Robbins & Monro, 1951) and with adaptive learning rates (Duchi et al., 2011). Pseudo code is in Appendix B. To avoid deriving the gradients of Equation (6), we implemented the algorithm in Edward (Tran et al., 2016). Edward is based on tensorflow (Team, 2015) and employs automatic differentiation.Code available at http://github.com/mariru/dynamic_bernoulli_embeddings

Empirical Study

Our empirical study contains two parts. In a quantitative evaluation we benchmark dynamic embeddings against static embeddings (Mikolov et al., 2013a, b; Rudolph et al., 2016). Dynamic embeddings improve over static embeddings in terms of the conditional likelihood of held-out predictions. Further, dynamic embeddings perform better than embeddings trained on the individual time slices (Hamilton et al., 2016). In a qualitative evaluation we use a fitted dynamic embedding model to extract which word vectors change most and we visualize their dynamics. Dynamic embeddings provide a new window into how language changes.

Machine Learning Papers (2007 - 2015): This dataset contains the full text from all machine learning papers (tagged “stat.ML”) published on the Arxiv between April 2007 and June 2015. It spans 9 years and we treat each year as a time slice. The number of Arxiv papers about machine learning has increased over the years. There were $101$ papers in $2007$ ; there were $1,573$ papers in $2014$ .

Computer Science Abstracts (1951 - 2014): This dataset contains abstracts of computer science papers published by the Association of Computing Machinery (ACM) from 1951 to 2014. Again, each year is considered a time slice and here too the amount of data increases over the years. For 1953, there are only around $10$ abstracts and their combined length is only $471$ words; the combined length of the abstracts from 2009 is over 2M.

Senate Speeches (1858 - 2009): This dataset contains all U.S. Senate speeches from 1858 to mid 2009. Here we treat every 2 years as a time slice. In contrast to the other datasets, this corpus is a transcript of spoken language.

For all datasets, we divide the observations into training, validation, and testing. Within each time slice we use $80\%$ for training, $10\%$ for validation, and $10\%$ for testing. Appendix A provides details about preprocessing.

2 Quantitative evaluation

We compare dynamic embeddings (d-emb) to time-binned embeddings (t-emb) (Hamilton et al., 2016) and static embeddings (s-emb) (Rudolph et al., 2016). There are many embedding techniques, without dynamics, that enjoy comparable performance. For the s-emb, we study Bernoulli embeddings (Rudolph et al., 2016), which are similar to continuous bag-of-words (cbow) with negative sampling (Mikolov et al., 2013a, b). For time-binned embeddings, Hamilton et al. (2016) train a separate embedding on each time slice.

Evaluation metric. We evaluate models by held-out Bernoulli probability. Given a model, each held-out word (validation or testing) is associated with a Bernoulli probability. At that position, a better model assigns higher probability to the observed word and lower probability to the others. This metric is straightforward because the competing methods all produce Bernoulli conditional likelihoods (Equation (1)).Since we hold out chunks of consecutive words usually both a word and its context are held out. For all methods we have to use the words in the context to compute the conditional likelihoods. We report $\mathcal{L}_{\text{pos}}$ , which considers only the nonzero held-out data. To make results comparable, all methods are trained with the same number of negative samples.

Model training and hyperparameters. Each method takes a maximum of $10$ passes over the data. (The corresponding number of stochastic gradient steps depends on the size of the minibatches.) The parameters of s-emb are initialized randomly. We initialize both d-emb and t-emb from a fit of s-emb which has been trained from one pass, and then train for $9$ additional passes.

We set the dimension of the embeddings to $100$ and the number of negative samples to $20$ . We experiment with two context sizes, $2$ and $8$ .

Other parameters are set by validation error. All methods use validation error to set the initial learning rate $\eta$ and minibatch sizes $m$ . The model selects $\eta\in[0.01,0.1,1,10]$ and $m\in[0.001N,0.0001N,0.00001N]$ , where $N$ is the size of training data. The only parameter specific to d-emb is the precision of the random drift. To have one less hyper parameter to tune, we fix the precision on the context vectors and the initial dynamic embeddings to $\lambda_{0}=\lambda/1000$ , a constant multiple of the precision on the dynamic embeddings. We choose $\lambda\in$ by validation error.

Results. We train each model on each training set and use each validation set for model selection (e.g., selecting the minibatch size and learning rate). Table 2 reports the results on the test set. Dynamic embeddings consistently achieve higher held-out likelihood.

3 Qualitative exploration

We now show how to use dynamic embeddings to explore the dataset. We use the fitted model to suggest ways that language changes and visualize its discovered dynamic structure.

A word’s embedding neighborhood helps visualize its usage and how it changes over time. It is simply a list of other words with similar usage. For a given query word (e.g., computer) we take its index $v$ and select the top ten words according to

We fit a dynamic embedding fit to the Senate speeches. Table 3 gives the embedding neighborhoods of computer for the years $1858$ and $1986$ . Its usage changed dramatically over the years. In 1858, a computer was a profession, a person who was hired to compute things. Now the profession is obsolete; computer refers to the electronic device.

Table 3 provides another example, bush. In 1858 this word always referred to the plant. A bush still is a plant, but in the 1990’s, in the Senate, it is usually refers to political figures. Unlike computer, where the embedding neighborhoods reveal two mutually exclusive meanings, the embedding neighborhoods of bush reflect which meaning is more prevalent in a given period.

A final example in Table 3 is the word data, from the ACM abstracts. The evolution of the embedding neighborhoods of data reflects how its meaning changes in the computer science literature.

Finding changing words with absolute drift. We have highlighted example words whose usage changes. However, not all words have changing usage. We now define a metric to discover which words change most.

One way to find words that change is to use absolute drift. For word $v$ , it is

This is the Euclidean distance between the word’s embedding at the last time slice and at the first time slice.

In the Senate speeches, Table 4 shows the 16 words that have largest absolute drift. The word iraq has largest drift. Figure 3 highlights iraq’s embedding neighborhood in four time slices: 1858, 1950, 1980, and 2008. (Appendix C gives the entire trajectory of its embedding neighborhood.) At first the neighborhood contains other countries and regions. Later, Arab countries move to the top of the neighborhood, suggesting that the speeches start to use rhetoric more specific to Arab countries. In 1980, Iraq invades Iran and the Iran-Iraq war begins. In these years words such as aggressors, troops, and invasion appear in the embedding neighborhood. Eventually, by 2008, the neighborhood contains terror, terrorism, and saddam.

Four other words with large drift are discipline, values, fine and unemployment (Table 4). Table 5 shows their embedding neighborhoods for selected years. Of these words, discipline, values and, fine have multiple meanings. Their neighborhoods reflect how the dominant meaning changes over time. For example, values can be either a numerical quantity or can be used to refer to moral values and principles. In contrast, iraq and unemployment are both words which have always had the same definition. Yet, the evolution of their neighborhood captures changes in the way they are used.

Dynamic embeddings as a tool to study a text. Our hope is that dynamic embeddings provide a suggestive tool for understanding change in language. For example, researchers interested in unemployment can complement their investigation by looking at the embedding neighborhood of related words such as employment, jobs or labor. In Table 6 we list the neighborhoods of jobs for the years 1858, 1938, and 2008. In 2008 the embedding neighborhood contains words like create and opportunities, suggesting a different outlook on jobs than in earlier years.

Another interesting example is prostitution. It used to be immoral and vile, went to indecent, and in modern days it is considered harassment. We note the word prostitution is not a frequent word. On average, it is used once per time slice and, in two thirds of the time slices, it is not mentioned at all. Yet, the model is able to learn about prostitution and the temporal evolution of the embedding neighborhood reveals how over the years a judgemental stance turns into concern over a social issue.

Summary

We described dynamic embeddings, distributed representations of words that drift over the course of the collection. Building on Rudolph et al. (2016), we formulate word embeddings with conditional probabilistic models and then incorporate dynamics with a Gaussian random walk prior. We fit dynamic embeddings with stochastic optimization.

We used dynamic embeddings to analyze several datasets: 8 years of machine learning papers, 63 years of computer science abstracts, and 151 years of speeches in the U.S. Senate. Dynamic embeddings provided a better fit than static embeddings and other methods that account for time.

Finally, we demonstrated how dynamic embeddings can help identify interesting ways that language changes. A word’s meaning can change (e.g., computer); its dominant meaning can change (e.g., values); or its related subject matter can change (e.g., iraq).

Acknowledgements

We would like to thank Francisco Ruiz and Liping Liu for discussion and helpful suggestiongs, Elliot Ash and Suresh Naidu for access to the Congress speeches, and Aaron Plasek and Matthew Jones for access to the ACM abstracts.

References

Appendix A Data Preprocessing

We fix the vocabulary to the $25000$ most frequent words and remove all words from the documents which are not in the vocabulary. As in (Mikolov et al., 2013b) we additionally remove each word with probability $p=1-\sqrt{(}\frac{10^{-5}}{f_{i}})$ where $f_{i}$ is the frequency of the word. This effectively downsamples especially the frequent words and speeds up training. From each time slice $80\%$ of the words are used for training. A random subsample of $10\%$ of the words is held out for validation and another $10\%$ for testing.

Appendix B Pseudo code

Appendix C Entire Embedding Trajectory of iraq

Here we give the entire trajectory of the embedding neighborhood of iraq. Over the years it drifts smoothly. On average iraq is mentioned only $10.6$ times per time slice and in 64 out of the 76 time slices, iraq is not even mentioned at all. For these years, the prior (Equation (4)) ensures that the embedding at time $t$ is the average of the embeddings at time $t-1$ and $t+1$ . When the embedding vector does not change between two consecutive time slices, the embedding neighborhood might still fluctuate. This is because computing the embedding neighborhoods (Equation (8)) involves also the embedding vectors of the other words in the vocabulary.