Dynamic Word Embeddings

Robert Bamler, Stephan Mandt

Introduction

Language evolves over time and words change their meaning due to cultural shifts, technological inventions, or political events. We consider the problem of detecting shifts in the meaning and usage of words over a given time span based on text data. Capturing these semantic shifts requires a dynamic language model.

Word embeddings are a powerful tool for modeling semantic relations between individual words (Bengio et al., 2003; Mikolov et al., 2013a; Pennington et al., 2014; Mnih & Kavukcuoglu, 2013; Levy & Goldberg, 2014; Vilnis & McCallum, 2014; Rudolph et al., 2016). Word embeddings model the distribution of words based on their surrounding words in a training corpus, and summarize these statistics in terms of low-dimensional vector representations. Geometric distances between word vectors reflect semantic similarity (Mikolov et al., 2013a) and difference vectors encode semantic and syntactic relations (Mikolov et al., 2013c), which shows that they are sensible representations of language. Pre-trained word embeddings are useful for various supervised tasks, including sentiment analysis (Socher et al., 2013b), semantic parsing (Socher et al., 2013a), and computer vision (Fu & Sigal, 2016). As unsupervised models, they have also been used for the exploration of word analogies and linguistics (Mikolov et al., 2013c).

Word embeddings are currently formulated as static models, which assumes that the meaning of any given word is the same across the entire text corpus. In this paper, we propose a generalization of word embeddings to sequential data, such as corpora of historic texts or streams of text in social media.

Current approaches to learning word embeddings in a dynamic context rely on grouping the data into time bins and training the embeddings separately on these bins (Kim et al., 2014; Kulkarni et al., 2015; Hamilton et al., 2016). This approach, however, raises three fundamental problems. First, since word embedding models are non-convex, training them twice on the same data will lead to different results. Thus, embedding vectors at successive times can only be approximately related to each other, and only if the embedding dimension is large (Hamilton et al., 2016). Second, dividing a corpus into separate time bins may lead to training sets that are too small to train a word embedding model. Hence, one runs the risk of overfitting to few data whenever the required temporal resolution is fine-grained, as we show in the experimental section. Third, due to the finite corpus size the learned word embedding vectors are subject to random noise. It is difficult to disambiguate this noise from systematic semantic drifts between subsequent times, in particular over short time spans, where we expect only minor semantic drift.

In this paper, we circumvent these problems by introducing a dynamic word embedding model. Our contributions are as follows:

We derive a probabilistic state space model where word and context embeddings evolve in time according to a diffusion process. It generalizes the skip-gram model (Mikolov et al., 2013b; Barkan, 2017) to a dynamic setup, which allows end-to-end training. This leads to continuous embedding trajectories, smoothes out noise in the word-context statistics, and allows us to share information across all times.

We propose two scalable black-box variational inference algorithms (Ranganath et al., 2014; Rezende et al., 2014) for filtering and smoothing. These algorithms find word embeddings that generalize better to held-out data. Our smoothing algorithm carries out efficient black-box variational inference for structured Gaussian variational distributions with tridiagonal precision matrices, and applies more broadly.

We analyze three massive text corpora that span over long periods of time. Our approach allows us to automatically find the words whose meaning changes the most. It results in smooth word embedding trajectories and therefore allows us to measure and visualize the continuous dynamics of the entire embedding cloud as it deforms over time.

Figure 1 exemplifies our method. The plot shows a fit of our dynamic skip-gram model to Google books (we give details in section 5). We show the ten words whose meaning changed most drastically in terms of cosine distance over the last $150$ years. We thereby automatically discover words such as “computer” or “radio” whose meaning changed due to technological advances, but also words like “peer” and “notably” whose semantic shift is less obvious.

Our paper is structured as follows. In section 2 we discuss related work, and we introduce our model in section 3. In section 4 we present two efficient variational inference algorithms for our dynamic model. We show experimental results in section 5. Section 6 summarizes our findings.

Related Work

Probabilistic models that have been extended to latent time series models are ubiquitous (Blei & Lafferty, 2006; Wang et al., 2008; Sahoo et al., 2012; Gultekin & Paisley, 2014; Charlin et al., 2015; Ranganath et al., 2015; Jerfel et al., 2017), but none of them relate to word embeddings. The closest of these models is the dynamic topic model (Blei & Lafferty, 2006; Wang et al., 2008), which learns the evolution of latent topics over time. Topic models are based on bag-of-word representations and thus treat words as symbols without modelling their semantic relations. They therefore serve a different purpose.

Mikolov et al. (2013a, b) proposed the skip-gram model with negative sampling (word2vec) as a scalable word embedding approach that relies on stochastic gradient descent. This approach has been formulated in a Bayesian setup (Barkan, 2017), which we discuss separately in section 3.1. These models, however, do not allow the word embedding vectors to change over time.

Several authors have analyzed different statistics of text data to analyze semantic changes of words over time (Mihalcea & Nastase, 2012; Sagi et al., 2011; Kim et al., 2014; Kulkarni et al., 2015; Hamilton et al., 2016). None of them explicitly model a dynamical process; instead, they slice the data into different time bins, fit the model separately on each bin, and further analyze the embedding vectors in post-processing. By construction, these static models can therefore not share statistical strength across time. This limits the applicability of static models to very large corpora.

Most related to our approach are methods based on word embeddings. Kim et al. (2014) fit word2vec separately on different time bins, where the word vectors obtained for the previous bin are used to initialize the algorithm for the next time bin. The bins have to be sufficiently large and the found trajectories are not as smooth as ours, as we demonstrate in this paper. Hamilton et al. (2016) also trained word2vec separately on several large corpora from different decades. If the embedding dimension is large enough (and hence the optimization problem less non-convex), the authors argue that word embeddings at nearby times approximately differ by a global rotation in addition to a small semantic drift, and they approximately compute this rotation. As the latter does not exist in a strict sense, it is difficult to distinguish artifacts of the approximate rotation from a true semantic drift. As discussed in this paper, both variants result in trajectories which are noisier. Rudolph & Blei (2017) independently developed a similar model, using a different likelihood model. Their approach uses a non-Bayesian treatment of the latent embedding trajectories, which makes the approach less robust to noise when the data per time step is small.

Model

We propose the dynamic skip-gram model, a generalization of the skip-gram model (word2vec) (Mikolov et al., 2013b) to sequential text data. The model finds word embedding vectors that continuously drift over time, allowing to track changes in language and word usage over short and long periods of time. Dynamic skip-gram is a probabilistic model which combines a Bayesian version of the skip-gram model (Barkan, 2017) with a latent time series. It is jointly trained end-to-end and scales to massive data by means of approximate Bayesian inference.

The Bayesian skip-gram model (Barkan, 2017) is a probabilistic version of word2vec (Mikolov et al., 2013b) and forms the basis of our approach. The graphical model is shown in Figure 2a). For each pair of words $i,j$ in the vocabulary, the model assigns probabilities that word $i$ appears in the context of word $j$ . This probability is $\sigma(u_{i}^{\top}v_{j})$ with the sigmoid function $\sigma(x)=1/(1+e^{-x})$ . Let $z_{ij}\in\{0,1\}$ be an indicator variable that denotes a draw from that probability distribution, hence $p(z_{ij}=1)=\sigma(u_{i}^{\top}v_{j})$ . The generative model assumes that many word-word pairs $(i,j)$ are uniformly drawn from the vocabulary and tested for being a word-context pair; hence a separate random indicator $z_{ij}$ is associated with each drawn pair.

Focusing on words and their neighbors in a context window, we collect evidence of word-word pairs for which $z_{ij}=1$ . These are called the positive examples. Denote $n^{+}_{ij}$ the number of times that a word-context pair $(i,j)$ is observed in the corpus. This is a sufficient statistic of the model, and its contribution to the likelihood is $p(n^{+}_{ij}|u_{i},v_{j})=\sigma(u_{i}^{\top}v_{j})^{n^{+}_{ij}}$ . However, the generative process also assumes the possibility to reject word-word pairs if $z_{ij}=0$ . Thus, one needs to construct a fictitious second training set of rejected word-word pairs, called negative examples. Let the corresponding counts be $n^{-}_{ij}$ . The total likelihood of both positive and negative examples is then

Defining $n^{\pm}=(n^{+},n^{-})$ the combination of both positive and negative examples, the resulting log likelihood is

This is exactly the objective of the (non-Bayesian) skip-gram model, see (Mikolov et al., 2013b). The count matrices $n^{+}$ and $n^{-}$ are either pre-computed for the entire corpus, or estimated based on stochastic subsamples from the data in a sequential way, as done by word2vec. Barkan (2017) gives an approximate Bayesian treatment of the model with Gaussian priors on the embeddings.

2 Dynamic Skip-Gram Model

The key extension of our approach is to use a Kalman filter as a prior for the time-evolution of the latent embeddings (Welch & Bishop, 1995). This allows us to share information across all times while still allowing the embeddings to drift.

Model.

The graphical model is shown in Figure 2b). We consider a diffusion process of the embedding vectors over time. The variance $\sigma_{t}^{2}$ of the transition kernel is

where $D$ is a global diffusion constant and $(\tau_{t+1}-\tau_{t})$ is the time between subsequent observations (Welch & Bishop, 1995). At every time step $t$ , we add an additional Gaussian prior with zero mean and variance $\sigma_{0}^{2}$ which prevents the embedding vectors from growing very large, thus

Computing the normalization, this results in

In practice, $\sigma_{0}\gg\sigma_{t}$ , so the damping to the origin is very weak. This is also called Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930). We recover the Wiener process for $\sigma_{0}\rightarrow\infty$ , but $\sigma_{0}<\infty$ prevents the latent time series from diverging to infinity. At time index $t=1$ , we define $p(U_{1}|U_{0})\equiv{\cal N}(0,\sigma_{0}^{2}I)$ and do the same for $V_{1}$ .

Our joint distribution factorizes as follows:

The prior model enforces that the model learns embedding vectors which vary smoothly across time. This allows to associate words unambiguously with each other and to detect semantic changes. The model efficiently shares information across the time domain, which allows to fit the model even in setups where the data at every given point in time are small, as long as the data in total are large.

Inference

We discuss two scalable approximate inference algorithms. Filtering uses only information from the past; it is required in streaming applications where the data are revealed to us sequentially. Smoothing is the other inference method, which learns better embeddings but requires the full sequence of documents ahead of time.

In Bayesian inference, we start by formulating a joint distribution (Eq. 7) over observations $n^{\pm}$ and parameters $U$ and $V$ , and we are interested in the posterior distribution over parameters conditioned on observations,

The problem is that the normalization is intractable. In variational inference (VI) (Jordan et al., 1999; Blei et al., 2016) one sidesteps this problem and approximates the posterior with a simpler variational distribution $q_{\lambda}(U,V)$ by minimizing the Kullback-Leibler (KL) divergence to the posterior. Here, $\lambda$ summarizes all parameters of the variational distribution, such as the means and variances of a Gaussian, see below. Minimizing the KL divergence is equivalent to optimizing the evidence lower bound (ELBO) (Blei et al., 2016),

For a restricted class of models, the ELBO can be computed in closed-form (Hoffman et al., 2013). Our model is non-conjugate and requires instead black-box VI using the reparameterization trick (Rezende et al., 2014; Kingma & Welling, 2014).

In many applications such as streaming, the data arrive sequentially. Thus, we can only condition our model on past and not on future observations. We will first describe inference in such a (Kalman) filtering setup (Kalman et al., 1960; Welch & Bishop, 1995).

In the filtering scenario, the inference algorithm iteratively updates the variational distribution $q$ as evidence from each time step $t$ becomes available. We thereby use a variational distribution that factorizes across all times, $q(U,V)=\prod_{t=1}^{T}q(U_{t},V_{t})$ and we update the variational factor at a given time $t$ based on the evidence at time $t$ and the approximate posterior of the previous time step. Furthermore, at every time $t$ we use a fully-factorized distribution:

We now describe how we sequentially compute $q(U_{t},V_{t})$ and use the result to proceed to the next time step. As other Markovian dynamical systems, our model assumes the following recursion,

Within our variational approximation, the ELBO (Eq. 9) therefore separates into a sum of $T$ terms, $\mathcal{L}=\sum_{t}\mathcal{L}_{t}$ with

2 Skip-Gram Smoothing

In contrast to filtering, where inference is conditioned on past observations until a given time $t$ , (Kalman) smoothing performs inference based on the entire sequence of observations $n^{\pm}_{1{:}T}$ . This approach results in smoother trajectories and typically higher likelihoods than with filtering, because evidence is used from both future and past observations.

Besides the new inference scheme, we also use a different variational distribution. As the model is fitted jointly to all time steps, we are no longer restricted to a variational distribution that factorizes in time. For simplicity we focus here on the variational distribution for the word embeddings $U$ ; the context embeddings $V$ are treated identically. We use a factorized distribution over both embedding space and vocabulary space,

In the time domain, our variational approximation is structured. To simplify the notation we now drop the indices for words $i$ and embedding dimension $k$ , hence we write $q(u_{1:T})$ for $q(u_{ik,1:T})$ where we focus on a single factor. This factor is a multivariate Gaussian distribution in the time domain with tridiagonal precision matrix $\Lambda$ ,

We fit the variational parameters by training the model jointly on all time steps, using black-box VI and the reparameterization trick. As the computational complexity of an update step scales as $\Theta(L^{2})$ , we first pretrain the model by drawing minibatches of $L^{\prime}<L$ random words and $L^{\prime}$ random contexts from the vocabulary (Hoffman et al., 2013). We then switch to the full batch to reduce the sampling noise. Since the variational distribution does not factorize in the time domain we always include all time steps $\{1,\ldots,T\}$ in the minibatch.

We also derive an efficient algorithm that allows us to estimate the reparametrization gradient using $\Theta(T)$ time and memory, while a naive implementation of black-box variational inference with our structured variational distribution would require $\Theta(T^{2})$ of both resources. The main idea is to parametrize $\Lambda=B^{\top}B$ in terms of its Cholesky decomposition $B$ , which is bidiagonal (Kılıç & Stanica, 2013), and to express gradients of $B^{-1}$ in terms of gradients of $B$ . We use mirror ascent (Ben-Tal et al., 2001; Beck & Teboulle, 2003) to enforce positive definiteness of $B$ . The algorithm is detailed in our supplementary material.

Experiments

We evaluate our method on three time-stamped text corpora. We demonstrate that our algorithms find smoother embedding trajectories than methods based on a static model. This allows us to track semantic changes of individual words by following nearest-neighbor relations over time. In our quantitative analysis, we find higher predictive likelihoods on held-out data compared to our baselines.

We report results from our proposed algorithms from section 4 and compare against baselines from section 2:

SGI denotes the non-Bayesian skip-gram model with independent random initializations of word vectors (Mikolov et al., 2013b). We used our own implementation of the model by dropping the Kalman filtering prior and point-estimating embedding vectors. Word vectors at nearby times are made comparable by approximate orthogonal transformations, which corresponds to Hamilton et al. (2016).

SGP denotes the same approach as above, but with word and context vectors being pre-initialized with the values from the previous year, as in Kim et al. (2014).

DSG-F: dynamic skip-gram filtering (proposed).

DSG-S: dynamic skip-gram smoothing (proposed).

Data and preprocessing.

Our three corpora exemplify opposite limits both in the covered time span and in the amount of text per time step.

We used data from the Google books corpushttp://storage.googleapis.com/books/ngrams/books/datasetsv2.html (Michel et al., 2011) from the last two centuries ( $T=209$ ). This amounts to $5$ million digitized books and approximately $10^{10}$ observed words. The corpus consists of $n$ -gram tables with $n\in\{1,\ldots,5\}$ , annotated by year of publication. We considered the years from $1800$ to $2008$ (the latest available). In $1800$ , the size of the data is approximately $\sim 7\cdot 10^{7}$ words. We used the $5$ -gram counts, resulting in a context window size of $4$ .

We used the “State of the Union” (SoU) addresses of U.S. presidents, which spans more than two centuries, resulting in $T=230$ different time steps and approximately $10^{6}$ observed words.http://www.presidency.ucsb.edu/sou.php Some presidents gave both a written and an oral address; if these were less than a week apart we concatenated them and used the average date. We converted all words to lower case and constructed the positive sample counts $n^{+}_{ij}$ using a context window size of $4$ .

We used a Twitter corpus of news tweets for $21$ randomly drawn dates from $2010$ to $2016$ . The median number of tweets per day is $722$ . We converted all tweets to lower case and used a context window size of $4$ , which we restricted to stay within single tweets.

Hyperparameters.

The vocabulary for each corpus was constructed from the $10{,}000$ most frequent words throughout the given time period. In the Google books corpus, the number of words per year grows by a factor of $200$ from the year $1800$ to $2008$ . To avoid that the vocabulary is dominated by modern words we normalized the word frequencies separately for each year before adding them up.

For the Google books corpus, we chose the embedding dimension $d=200$ , which was also used in Kim et al. (2014). We set $d=100$ for SoU and Twitter, as $d=200$ resulted in overfitting on these much smaller corpora. The ratio $\eta=\sum_{ij}n_{ij,t}^{-}/\sum_{ij}n_{ij,t}^{+}$ of negative to positive word-context pairs was $\eta=1$ . The precise construction of the matrices $n^{\pm}_{t}$ is explained in the supplementary material. We used the global prior variance $\sigma_{0}^{2}=1$ for all corpora and all algorithms, including the baselines. The diffusion constant $D$ controls the time scale on which information is shared between time steps. The optimal value for $D$ depends on the application. A single corpus may exhibit semantic shifts of words on different time scales, and the optimal choice for $D$ depends on the time scale in which one is interested. We used $D=10^{-3}$ per year for Google books and SoU, and $D=1$ per year for the Twitter corpus, which spans a much shorter time range. In the supplementary material, we provide details of the optimization procedure.

Qualitative results.

We show that our approach results in smooth word embedding trajectories on all three corpora. We can automatically detect words that undergo significant semantic changes over time.

Figure 1 in the introduction shows a fit of the dynamic skip-gram filtering algorithm to the Google books corpus. Here, we show the ten words whose word vectors change most drastically over the last $150$ years in terms of cosine distance. Figure 3 visualizes word embedding clouds over four subsequent years of Google books, where we compare DSG-F against SGI. We mapped the normalized embedding vectors to two dimensions using dynamic t-SNE (Rauber et al., 2016) (see supplement for details). Lines indicate shifts of word vectors relative to the preceding year. In our model only few words change their position in the embedding space rapidly, while embeddings using SGI show strong fluctuations, making the cloud’s motion hard to track.

Figure 4 visualizes the smoothness of the trajectories directly in the embedding space (without the projection to two dimensions). We consider differences between word vectors in the year 1998 and the subsequent 10 years. In more detail, we compute histograms of the Euclidean distances $||u_{it}-u_{i,t+\delta}||$ over the word indexes $i$ , where $\delta=1,\dots,10$ (as discussed previously, SGI uses a global rotation to optimally align embeddings first). In our model, embedding vectors gradually move away from their original position as time progresses, indicating a directed motion. In contrast, both baseline models show only little directed motion after the first time step, suggesting that most temporal changes are due to finite-size fluctuations of $n^{\pm}_{ij,t}$ . Initialization schemes alone, thus, seem to have a minor effect on smoothness.

Our approach allows us to detect semantic shifts in the usage of specific words. Figures 5 and 1 both show the cosine distance between a given word and its neighboring words (colored lines) as a function of time. Figure 5 shows results on all three corpora and focuses on a comparison across methods. We see that DSG-S and DSG-F (both proposed) result in trajectories which display less noise than the baselines SGP and SGI. The fact that the baselines predict zero cosine distance (no correlation) between the chosen word pairs on the SoU and Twitter corpora suggests that these corpora are too small to successfully fit these models, in contrast to our approach which shares information in the time domain. Note that as in dynamic topic models, skip-gram smoothing (DSG-S) may diffuse information into the past (see ”presidential” to ”clinton-trump” in Fig. 5).

Quantitative results.

We show that our approach generalizes better to unseen data. We thereby analyze held-out predictive likelihoods on word-context pairs at a given time $t$ , where $t$ is excluded from the training set,

The predictive likelihoods as a function of time $\tau_{t}$ are shown in Figure 6. For the Google Books corpus (left panel in figure 6), the predictive log-likelihood grows over time with all four methods. This must be an artifact of the corpus since SGI does not carry any information of the past. A possible explanation is the growing number of words per year from $1800$ to $2008$ in the Google Books corpus. On all three corpora, differences between the two implementations of the static model (SGI and SGP) are small, which suggests that pre-initializing the embeddings with the previous result may improve their continuity but seems to have little impact on the predictive power. Log-likelihoods for the skip-gram filter (DSG-F) grow over the first few time steps as the filter sees more data, and then saturate. The improvement of our approach over the static model is particularly pronounced in the SoU and Twitter corpora, which are much smaller than the massive Google books corpus. There, sharing information between across time is crucial because there is little data at every time slice. Skip-gram smoothing outperforms skip-gram filtering as it shares information in both time directions and uses a more flexible variational distribution.

Conclusions

We presented the dynamic skip-gram model: a Bayesian probabilistic model that combines word2vec with a latent continuous time series. We showed experimentally that both dynamic skip-gram filtering (which conditions only on past observations) and dynamic skip-gram smoothing (which uses all data) lead to smoothly changing embedding vectors that are better at predicting word-context statistics at held-out time steps. The benefits are most drastic when the data at individual time steps is small, such that fitting a static word embedding model is hard. Our approach may be used as a data mining and anomaly detection tool when streaming text on social media, as well as a tool for historians and social scientists interested in the evolution of language.

We would like to thank Marius Kloft, Cheng Zhang, Andreas Lehrmann, Brian McWilliams, Romann Weber, Michael Clements, and Ari Pakman for valuable feedback.