Dynamic Word Embeddings for Evolving Semantic Discovery

Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, Hui Xiong

Introduction

Human language is an evolving construct, with word semantic associations changing over time. For example, apple which was traditionally only associated with fruits, is now also associated with a technology company. Similarly, the association of names of famous personalities (e.g., trump) changes with a change in their roles. For this reason, understanding and tracking word evolution is useful for time-aware knowledge extraction tasks (e.g., public sentiment analysis), and other applications in text mining. To this end, we aim to learn word embeddings with a temporal bent, for capturing time-aware meanings of words.

Word embeddings aim to represent words with low-dimensional vectors, where words with similar semantics are geometrically closer (e.g. red and blue are closer than red and squirrel). Classic word embedding techniques started in the 90s and relied on statistical approaches (Lund and Burgess, 1996; Deerwester et al., 1990). Later, neural network approaches (Bengio et al., 2003), as well as recent advances such as word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b) and GloVE (Pennington et al., 2014) have greatly improved the performance of word representation learning. However, these techniques usually do not consider temporal factors, and assume that the word is static across time.

In this paper, we are interested in computing time-aware embedding of words. Specifically, each word in a different time frame (e.g., years) is represented by a different vector. From these embeddings, we have a better notion of “distance” (the cosine distance between word embedding vectors), and by looking at word “neighborhoods” (defined through this distance), we can better understand word associations as well as word meanings, as they evolve over time.

For instance, by locating the embeddings of personality names such as trump’s closest words, we can see that he can be associated with the trajectory : real estate $\rightarrow$ television $\rightarrow$ republican. Similarly, the trajectory of apple travels from the neighborhood of strawberry, mango to that of iphone, ipad.

A key practical issue of learning different word embeddings for different time periods is alignment. Specifically, most cost functions for training are invariant to rotations, as a byproduct, the learned embeddings across time may not be placed in the same latent space. We call this the alignment problem, which is an issue in general if embeddings are learned independently for each time slice.

Unlike traditional methods, literature on learning temporal word embedding is relatively short: (Zhang et al., 2016; Liao and Cheng, 2016; Kulkarni et al., 2015; Hamilton et al., 2016). In general, the approaches in these works follow a similar two-step pattern: first compute static word embeddings in each time slice separately, then find a way to align the word embeddings across time slices. To achieve alignment, (Kulkarni et al., 2015) finds a linear transformation of words between any two time slices by solving a $d$ -dimensional least squares problem of $k$ nearest neighbor words (where $d$ is the embedding dimension). Additionally, (Zhang et al., 2016) also use the linear transformation approach between a base and target time slices, and computes the linear transformation using anchor words, which does not change meaning between the two time slices. This method requires the prior knowledge of words that are in fact static, which involves additional expert supervision. Finally, (Hamilton et al., 2016) imposes the transformation to be orthogonal, and solves a $d$ -dimensional Procrustes problem between every two adjacent time slices.

Our main novelty in this paper is to learn the word embeddings across time jointly, thus obviating the need to solve a separate alignment problem. Specifically, we propose to learn temporal embeddings in all time slices concurrently, and apply regularization terms to smooth embedding changes across time. There are three main advantages to this proposed joint modeling approach:

First, this can be seen as an improvement over traditional, “single-time” methods such as word2vec.

Second, our experimental results suggest that enforcing alignment through regularization yields better results than two-step methods.

Third, we can share information across time slices for the majority of vocabulary. As a result, our method is robust against data sparsity – we can afford to have time slices where some words are rarely present, or even missing. This is a crucial practical advantage offered by our method.

Since our model requires embeddings across all time slices to be learned at the same time, it can be computationally challenging. To mitigate this, we employ a block coordinate descent method which can handle large vocabulary sizes through decomposition.

In experimental study, we learn temporal embeddings of words from The New York Times articles between 1990 and 2016. In contrast, previous temporal word embedding works have focused on time-stamped novels and magazine collections (such as Google N-Gram and COHA). However, news corpora are naturally advantageous to studying language evolution through the lens of current events. In addition, it allows us to work with a corpus that maintains consistency in narrative style and grammar, as opposed to Facebook and Twitter posts. For evaluating our embeddings, we develop both qualitative and quantitative metrics.

Qualitatively, we illustrate the advantages of temporal embeddings for evolving semantics discovery by 1) plotting word vector trajectories to find evolving meanings and associations, 2) using alignment through time to identify associated words across time, and 3) looking at norms as a representative of concept popularity.

Quantitatively, we first use semantic topics extracted from Sections (e.g., Technology, World News) of news articles as ground truth to evaluate the semantic accuracy of temporal embeddings. Additionally, we provide two testsets to evaluate cross-time alignment quality: one consists of known changing roles (e.g., U.S. presidents), determined objectively, and one of concept replacements (e.g., compact disk to mp3), determined more subjectively. The experimental results show the effectiveness of our proposed model and demonstrate substantial improvements against baseline methods.

The rest of this paper is organized as follows. In Section 2, we present the proposed model for temporal embedding learning, and in Section 3, we describe a scalable algorithm to train it. In Section 4, we describe the news corpus dataset and setup details of experiments. We perform qualitative evaluations in Section 5. Finally, we quantitatively compare our embeddings against other state-of-the-art temporal embeddings in Section 6.

Methodology

We now set up our temporal word embedding model. We consider a text corpus collected across time. These kinds of corpora such as news article collections with published dates or social media discussion with time-stamps are ubiquitous. Formally, we denote by $\mathcal{D}=(\mathcal{D}_{1},\ldots,\mathcal{D}_{T})$ our text corpus where each $\mathcal{D}_{t}$ , $t=1,\ldots T$ , is the corpus of all documents in the $t$ -th time slice. Without loss of generality, we assume the time slices are ordered chronologically. The length of these time slices can be on the order of months, years, or decades. Moreover, the length of all the time slices could be different. We consider an overall vocabulary $\mathcal{V}=\{w_{1},\ldots,w_{V}\}$ of size $V$ . We note that the vocabulary $\mathcal{V}$ consists of words present in the corpus at any point in time, and thus it is possible for some $w\in\mathcal{V}$ to not appear at all in some $\mathcal{D}_{t}$ . This includes emerging words and dying words that are typical in real-world news corpora.

A fundamental observation in static word embedding literature is that semantically similar words often have similar neighboring words in a corpus (Firth, 1957). This is the idea behind learning dense low-dimensional word representations both traditionally (Lund and Burgess, 1996; Deerwester et al., 1990; Bengio et al., 2003) and recently (Mikolov et al., 2013a; Pennington et al., 2014). In several of these methods, the neighboring structure is captured by the frequencies by which pairs of words co-occur within a small local window.

We compute the $V\times V$ pointwise mutual information (PMI) matrix specific to a corpus $\mathcal{D}$ , whose $w,c$ -th entry is:

where $\#(w,c)$ counts the number of times that words $w$ and $c$ co-occur within a window of size $L$ in corpus $\mathcal{D}$ , and $\#(w)$ , $\#(c)$ counts the number of occurrences of words $w$ and $c$ in $\mathcal{D}$ . $|\mathcal{D}|$ is total number of word tokens in the corpus. $L$ is typically around $5$ to $10$ ; we set $L=5$ throughout this paper.

The key idea behind both word2vec (Mikolov et al., 2013a) and GloVE (Pennington et al., 2014) is to find embedding vectors $u_{w}$ and $u_{c}$ such that for any $w,c$ combination,

where each $u_{w}$ has length $d\ll V$ . While both (Mikolov et al., 2013a) and (Pennington et al., 2014) offer highly scalable algorithms such as negative sampling to do this implicitly, (Levy and Goldberg, 2014) shows that these are equivalent to low-rank factorization of $\text{PMI}(\mathcal{D},L)$ with a constant shift that can be zero.. Our approach is primarily motivated by this observation. We note that though the PMI matrices are of size $V\times V$ , in real-world datasets it is typically sparse as observed in (Pennington et al., 2014), for which efficient factorization methods exist (Yu et al., 2012).

2. Temporal word embeddings

A natural extension of the static word embedding intuition is to use this matrix factorization technique on each time slice $\mathcal{D}_{t}$ separately. Specifically, for each time slice $t$ , we define the $w,c$ -th entry of positive PMI matrix ( $\text{PPMI}(t,L)$ ) as We consider the PPMI rather than the PMI because when $\frac{\#(w,c)\cdot|\mathcal{D}|}{\#(w)\cdot\#(c)}$ is very small, taking the log results in large negative values and is thus extremely unstable. Since for most significantly related pairs $w$ and $c$ the log argument is $>1$ , thresholding it in this way will not affect the solution significantly, but will offer much better numerical stability. This approach is not unique to us; (Levy and Goldberg, 2014) also factorizes the PPMI.

The temporal word embeddings $U(t)$ must satisfy

One way to find such $U(t)$ is for each $t$ , factorizing $\text{PPMI}(t,L)$ by either using an eigenvalue method or solving a matrix factorization problem iteratively.

The Alignment Problem: Imposing (4) is not sufficient for a unique embedding, since the solutions are invariant under rotation; that is, for any $d\times d$ orthogonal matrix $R$ , we have the embedding $\widehat{U}(t)=U(t)R$ , the approximation error in (4) is the same since

For this reason, it is important to enforce alignment; if word $w$ did not semantically shift from $t$ to $t+1$ , then we additionally require $u_{w}(t)\approx u_{w}(t+1)$ .

To do this, (Kulkarni et al., 2015; Hamilton et al., 2016) propose two-step procedures; first, they factorize each $Y(t)$ separately, and afterwards enforce alignment using local linear mapping (Kulkarni et al., 2015) or solving an orthogonal procrustes problem (Hamilton et al., 2016). Note that in these methods, aligning $U(t)$ to $U(t^{\prime})$ assumes that we desire $U(t)\approx U(t^{\prime})$ . If we only pick $t^{\prime}=t+1$ (as done in (Hamilton et al., 2016)), this assumption is reasonable because between any two years, only a few words experience semantic shift, emergence, or death. However, this becomes problematic if $U(t)$ was a result of a time period with extremely sparse data (and hence poorly learned); all subsequent year embeddings and previous year embeddings will be poorly aligned.

3. Our model

We propose finding temporal word embeddings as the solution of the following joint optimization problem:

where $Y(t)=\text{PPMI}(t,L)$ and $\lambda,\tau>0$ . Here the penalty term $\|U(t)\|^{2}_{F}$ enforces the low-rank data-fidelity as widely adopted in previous literature. The key smoothing term $\|U(t-1)-U(t)\|^{2}_{F}$ encourages the word embeddings to be aligned. The parameter $\tau$ controls how fast we allow the embeddings to change; $\tau=0$ enforces no alignment, and picking $\tau\to\infty$ converges to a static embedding with $U(1)=U(2)=\ldots=U(T)$ . Note that the methods of (Kulkarni et al., 2015; Hamilton et al., 2016) can be viewed as suboptimal solutions of (5), in that they optimize for each term separately. For one, while the strategies in (Kulkarni et al., 2015) and (Hamilton et al., 2016) enforce alignment pairwise, we enforce alignment across all time slices; that is, the final aligned solution $U(t)$ is influenced by not only $U(t-1)$ and $U(t+1)$ , but every other embedding as well. This avoids the propagation of alignment errors caused by a specific time frame’s subsampling. Additionally, consider an extreme case in which word $w$ is absent from $\mathcal{D}_{t}$ but has similar meaning in both $t-1$ and $t+1$ . Directly applying any matrix factorization technique to each time point would enforce $u_{w}(t)\approx\mathbf{0}$ . However, for the right choice of $\tau$ , the solution $u_{w}(t)$ to (5) will be close to $u_{w}(t-1)$ and $u_{w}(t+1)$ . Overall, our method achieves high fidelity embeddings with a much smaller corpus, and in particular, in Section 6, we demonstrate that our embeddings are robust against sudden undersampling of specific time slices.

Optimization

A key challenge in solving (5) is that for large $V$ and $T$ , one might not be able to fit all the PPMI matrices $Y(1),\ldots,Y(T)$ in memory, even though $Y(t)$ is sparse. Therefore, for scalability, an obvious solution is to first decompose the objective across time, using alternating minimization to solve for $U(t)$ at each step:

for a specific $t$ . Note that $f(U(t))$ is quartic in $U(t)$ , and thus even if we only solve for a fixed $t$ , (6) cannot be minimized analytically. Thus, our only option is to solve (6) iteratively using a fast first-order method such as gradient descent. The gradient of the first term alone is given by

Each gradient computation is of the order $O(nnz(Y(t))d+d^{2}V)$ (which is then nested in iteratively optimizing $U(t)$ for each $t$ ). $nnz(\cdot)$ is the number of nonzeros in the matrix. In practical applications, $V$ is in the order of ten-thousands to hundred-thousands, and $T$ is in the order of tens to hundreds.

Let us instead look at a slightly relaxed problem of minimizing

where variables $W(t),t=1,\ldots,T$ are introduced to break the symmetry of factorizing $Y(t)$ . Now, minimizing for each $U(t)$ (and equivalently $W(t)$ ) is just the solution of a ridge regression problem, and can be solved in one step by setting the gradient of the objective of (8) to 0, i.e. $U(t)A=B$ where

for $t=2,\ldots,T-1$ , and constants adjusted for $t=0,T$ . Forming $A$ and $B$ requires $O(Vd^{2}+nnz(Y(t))d)$ , and solving $U(t)A=B$ can be done in $O(d^{3})$ computations, in one step. This can be further decomposed to row-by-row blocks of size $b$ , by minimizing over a row block of $U(t)$ at a time, which reduces the complexity of forming $A$ and $B$ to $O(bd^{2}+nnz(Y(t)[:b,:])d)$ , i.e.independent of $V$ . This allows scaling for very large $V$ , as only one row block of $Y(t)$ must be loaded at a time.

The method described here is commonly referred to as block coordinate descent (BCD) because it minimizes with respect to a single block ( $U(t)$ or $W(t)$ ) at a time, and the block size can be made even smaller (a few rows of $U(t)$ or $W(t)$ ) to maintain scalability. The main appeal of BCD is scalability (Yu et al., 2012); however, a main drawback is lack of convergence guarantees, even in the case of convex optimization (Powell, 1973). In practice, however, BCD is highly successful and has been used in many applications (Wright, 2015). Another choice of optimization is stochastic gradient descent (SGD), which decomposes the objective as a sum of smaller terms. For example, the first term of (8) can be written as a sum of terms, each using only one row of $Y(t)$ :

The complexity at first glance is smaller than that of BCD; however, SGD comes with the well-documented issues of slow progress and hard-to-tune step sizes, and in practice, can be much slower for matrix factorization applications (Rao et al., 2015; Yu et al., 2012). However, we point out that the choice of the optimization method is agnostic to our model; anything that successfully solves (5) should lead to an equally successful embedding.

Experimental Dataset and Setup

In this section we describe the specific procedure used to generate embeddings for the next two sections.

News article dataset: First, we crawl a total of 99,872 articles from the New York Times, published between January 1990 and July 2016.The data is available at: https://sites.google.com/site/zijunyaorutgers/. In addition to the text, we also collected metadata including title, author, release date, and section label (e.g., Business, Sports, Technology); in total, there are 59 such sections. We use yearly time slices, dividing the corpus into $T=27$ partitions. After removing rare words (fewer than 200 occurrences in all articles across time) and stop words, our vocabulary consists of $V=20,936$ unique words. We then compute a co-occurrence matrix for each time slice $t$ with a window size $L=5$ , which is then used to compute the PPMI matrix as outlined in (3). All the embedding methods that we compared against are trained on this same dataset.

Training details for our algorithm: We perform a grid search to find the best regularization and optimization parameters. As a result of our search, we obtain $\lambda=10$ , $\tau=\gamma=50$ , and run for 5 epochs (5 complete pass over all time slices, and all rows and columns of $Y(t)$ ). Interestingly, setting $\lambda=0$ also yielded good results, but required more iterations to converge. The block variable is one matrix ( $U(t)$ or $V(t)$ for a specific $t$ ).

Distance metric: All distances between two words are calculated by the cosine similarity between embedding vectors:

where $u_{a}$ and $u_{b}$ are the embeddings of words $a$ and $b$ .

Qualitative evaluation

The embeddings we learn reveal interesting patterns in the shift of word semantics, cross-time semantic analogy, and popularity trends of concepts from the news corpus.

The trajectory of a word in the (properly aligned) embedded space provides tools to understand the shift in meanings of words over time. This can help broader applications, such as capturing and quantifying linguistic evolution, characterizing brands and people, and analyzing emerging association between certain words.

Figure 1 shows the trajectories of a set of example words. We plot the 2-D t-SNE projection of each word’s temporal embedding across time. We also plot the closest words to the target word from each time slice. We pick four words of interest: apple and amazon as emerging corporate names while originally referring to a fruit and a rainforest, and obama and trump as people with changing professional roles.

In all cases, the embeddings illustrate significant semantic shifts of the words of interest during this 27-year time frame. We see apple shift from a fruit and dessert ingredient to space of technology. Interestingly, there is a spike in 1994 in the trajectory, when Apple led a short tide of discussion because of the replacement of the CEO and a collaboration with IBM; then the association shifted back to neighborhood of fruit and dessert until the recovery by Steve Jobs in early 2000s. Similarly, amazon shifts from a forest to an e-commerce company, finally landing in 2016 as a content creation and online-streaming provider due to the popularity of its Prime Video service. The US president names, obama and trump, are most telling, shifting from their pre-presidential lives (Obama as a civil rights attorney and university professor; Trump as a real estate developer and TV celebrity) to the political sphere. Overall, Figure 1 demonstrates that first, our temporal word embeddings can well capture the semantic shifts of words across time, and second, our model provides high alignment quality in that same-meaning words across different years have geometrically close embeddings, without having to solve a separate optimization problem for alignment.

2. Equivalence searching

Another key advantage of word alignment is the ability to find conceptually “equivalent” items or people over time. We provide examples in the field of technology, official roles, and sports professionals. In this type of test, we create a query consisting of a word-year pair that is particularly the representative of that word in that year, and look for other word-year pairs in its vicinity, for different years.

Table 1 lists the closest words ( $top$ -1) of each year to the query vector. For visualization purpose we lump semantically similar words together. For example, the first column shows that iphone in 2012 is closely associated with smartphones in recent years, but is close to words such as desktop and macintosh in the 90’s; interestingly, telephone never appears, suggesting the iPhone serves people more as a portable computer than a calling device. As another example, by looking at the trajectory of twitter, we see the evolution of news sources, from TV & radio news broadcasts in the 90s to chatrooms, websites, and emails in the early 2000s, blogs in the late 2000s, and finally tweets today. The last example is fairly obvious; mp3 represents the main form of which music is consumed in 2000, replacing disk and stereo in 1990s ( cassette also appears in $top$ -3) and is later replaced by online streaming. We can see a one-year spike of Napster which was shut down because of copyright infringementNapster ended its streaming service in 2001, so our equivalence is captured 2 years late; this delay could be because though the event happened in 2001, the legal ramifications were analyzed heavily in subsequent years., and later a new streaming service - iTunes.

Next, we use embeddings to identify people in political roles. Table 2 attempts to discover who is the U.S. presidentAll data was scraped about half a year before Donald Trump was elected as U.S. president in 2016. and New York City mayorWe intentionally choose New York City because it is the most heavily discussed city in the New York Times. of the time, using as query obama in 2016 and blasio in 2015. For president, only the closest word from each year is listed, and is always correct (accounting for the election years). For mayor, the top-1 closet word is shown unless it is mayor, in which case the second word is shown. We can see that both the roles of US president and NYC mayor have been well searched for different persons in their terms of service. We see that the embedding for the President is consistent, and for the most part, so is that of the mayor of NYC. In 2011, cuomo is also partially relevant since he was the governor of NY state. We did not find any relevant words in query NYC mayor in year 2006.

Finally, we search for equivalences in sports, repeating the experiment for the ATP rank 1 male tennis player as shown in Table 3. In the case of president and mayor, we are heavily assisted by the fact that they are commonly referred to by a title: “President Obama” and “Mayor de Blasio”. Tennis champions, on the other hand, are not referred by titles. Still, a surprising number of correct champions appear as the closest words, and all the names are those of famous tennis players for the given time period. A more exhaustive empirical study of alignment quality is provided in Section 6.

3. Popularity determination

It has often been observed that word embeddings computed by factorizing PMI matrices have norms that grow with word frequency (Pennington et al., 2014; Arora et al., 2015). These word vector norms across time can be viewed as a time series for detecting the trending concepts (e.g., sudden semantic shifts or emergences) behind words, with more robustness than word frequency.

Figures 2 and 3 illustrate the comparison between embedding norm and frequency for determining concept popularity per year, determined by key words in the New York Times corpus. Generally, comparing to frequencies which are much more sporadic and noisy, we note that the norm of our embeddings encourages smoothness and normalization while being indicative of the periods when the corresponding words were making news rounds. In Figure 2, the embedding norms display nearly even 4-year humps corresponding to each president’s term. In every term, the name of each current president becomes a trending concept which plays an important role in the information structure at the time. Two interesting observations can be gleaned. First, since Hillary Clinton continuously served as Secretary of State during 2009-2013, the popularity of clinton was preserved; however it was still not as popular as president obama. Second, because of the presidential campaign, trump in 2016 has a rising popularity that greatly surpasses that of his former role as a business man, and eventually surpasses his opponent clinton in terms of news coverage.

In Figure 3, we can see smooth rises and falls of temporary phenomena (the enron scandal and qaeda rises). For qaeda, we see that there is a jump in 2001, and then it remains steady with a small decline. In contrast, enron shows a sharper decline, as despite its temporal hype, it did not linger in the news long. Note also the stability of using norms to track popularity over frequency, which spikes for enron above qaeda, although 9/11 was far more news-prevalent than the corporation’s scandalous decline. For the basketball star pippen, although his publicity (e.g., frequency) was relatively fewer than business terms, his popularity is still recognized by the enhancement in vector norm. For another term isis, we can see that it begins to replace qaeda as the “trending terrorist organization” in news media.

Quantitative evaluation

In this section, we empirically evaluate our proposed Dynamic Word2Vec model (DW2V) against other temporal word embedding methods.The testsets are available at: https://sites.google.com/site/zijunyaorutgers/. In all cases, we set the embedding dimension to $d=50$ . We have the following baselines:

Static-Word2Vec (SW2V): the standard word2vec embeddings (Mikolov et al., 2013b), trained on the entire corpus and ignoring time information.

Transformed-Word2Vec (TW2V) (Kulkarni et al., 2015): the embeddings $U(t)$ are first trained separately by factorizing PPMI matrix for each year $t$ , and then transformed by optimizing a linear transformation matrix which minimizes the distance between $u_{w}(t)$ and $u_{w}(t^{\prime})$ for the $k=30$ nearest words’ embeddings to the querying word $w$ .

Aligned-Word2Vec (AW2V) (Hamilton et al., 2016): the embeddings $U(t)$ are first trained by factorizing the PPMI matrix for each year $t$ , and then aligned by searching for the best othornormal transformation between $U(t)$ and $U(t+1)$ .

One of the most important properties of a word embedding is how accurately it carries the meaning of words. Therefore, we develop a test to see if words can be categorized by meaning based on embeddings. The news articles we collected are tagged with their “sections” such as Business, Sports. This information can be used to determine temporal word meanings. It is important to note that this information is not used in the word embedding learning. For example, we see that amazon occurs 41% of the time in World in 1995, associating strongly with the rainforest (not part of the USA), and 50% of the time in Technology in 2012, associating strongly with e-commerce. We thus use this to establish a ground truth of word category, by identifying words in years that are exceptionally numerous in one particular news section. That is, if a word is extremely frequent in a particular section, we associate that word with that section and use that as ground truth. We select the 11 most popular and discriminative sectionsArts, Business, Fashion & Style, Health, Home & Garden, Real Estate, Science, Sports, Technology, U.S., World. of the New York Times, and for each section $s$ and each word $w$ in year $t$ , we compute its percentage $p$ of occurrences in each section. To avoid duplicated word-time-section $<w,t,s>$ triplets, for a particular $w$ and $s$ we only keep the year of the largest strength, and additionally filter away any triplet with strength less than $p=35\%$ . Note that a random uniform distribution of words would result in it appearing about $9\%$ of the time in each section, and our threshold is about 4 times that quantity. We do this to say with sufficient confidence that such associations can be treated as ground truth.

To limit the size differences among categories, for every section $s$ with more than 200 qualified triplets, we keep the $top$ -200 words by strength. In total, this results in 1888 triplets across 11 sections, where every word-year pair is strongly associated with a section as its true category.

We then apply spherical k-means, which uses cosine similarity between embeddings as the distance function for clustering, with $K=$ 10, 15, and 20 clusters. We use two metrics to evaluate the clustering results:

Normalized Mutual Information (NMI), defined as

where $L$ represents the set of labels and $C$ the set of clusters. $I(L;C)$ denotes the sum of mutual information between any cluster $c_{i}$ and any label $l_{j}$ , and $H(L)$ and $H(C)$ the entropy for labels and clusters, respectively. This metric evaluates the purity of clustering results from an information-theoretic perspective.

Fβ-measure ( $\mathbf{F_{\beta}}$ ), defined as

where $P=\frac{TP}{TP+FP}$ denotes the precision and $R=\frac{TP}{TP+FN}$ denotes the recall. (TP/FP = true/false positive, TN/FN = true/false negative.) As an alternative method to evaluate clustering, we can view every pair of words as a series of decisions. Pick any two $(w,t)$ pairs. If they are clustered together and additionally have the same section label, this is a correct decision; otherwise, the clustering performed a wrong decision. The metric $F_{\beta}$ measures accuracy as the ( $\beta$ -weighted) harmonic mean of the precision and recall. We set $\beta=5$ to give more weight to recall by penalizing false negative more strongly.

Tables 4 and 5 show the clustering evaluation. We can see that our proposed DW2V consistently outperforms other baselines for all values of $K$ . These results show two advantages. First, the word semantic shift has been captured by the temporal embeddings (for example, by correlating correctly with the section label of amazon, which changes from World to Technology). Second, since embeddings of words of all years are used for clustering, a good clustering result indicates good alignment across years. We can also see that AW2V also performs well, as it also applies alignment between adjacent time slices for all words. However, TW2V does not perform well as others, suggesting that aligning locally (only a few words) is not sufficient for high alignment quality.

2. Alignment quality

We now more directly evaluate alignment quality, i.e. the property that the semantic distribution in temporal embedding space should be consistent over time. For example, if a word such as estate or republican does not change much throughout time, its embedding should remain relatively constant for different $t$ . By the same logic, if a word such as trump does change association throughout time, its embedding should reflect this shift by moving from one position to another (e.g., estate $\rightarrow$ republican). We saw this in the previous section for static words like president or mayor; they do not change meanings, though they are accompanied by names that shift to them every few years.

To examine the quality of embedding alignment, we create a task to query equivalences across years. For example, given obama-2012, we want to query its equivalent word in 2002. As we know obama is the U.S. president in 2012; its equivalent in 2002 is bush, who was the U.S. president at that time. In this way, we create two testsets.

The first one is based on publicly recorded knowledge that for each year lists different names for a particular role, such as U.S. president, U.K. prime minister, NFL superbowl champion team, and so on. For each year (e.g., 2012), we put its word (e.g., obama) into the embedding set of every other year for query its equivalence in top closest words.

The second test is human-generated, for exploring more interesting concepts like emerging technologies, brands and major events (e.g., disease outbreaks and financial crisis). For constructing the test word pairs, we first select emerging terms which have not been popularized before 1994, then query their well known precedents during 1990 to 1994 (e.g., app-2012 can correspond to software-1990). For emerging word (e.g., app) we extract its most popular year (e.g., 2012) with maximum frequency, and put its embedding into each year from 1990 to 1994 for querying its precedent (e.g., software). Each word-year pair now forms a query and an answer; in total we have $N=11028$ such pairs in the first testset, and $N=445$ in the second one.

We use two metrics to evaluate the performance.

For each test $i$ , the correct answer word is identified at position rank $[i]$ for closest words. The Mean Reciprocal Rank (MRR) is defined as

where $\frac{1}{\text{rank}[i]}=0$ if the correct answer is not found in the $top$ -10. Higher MRR means that correct answers appear more closely and unambiguously with the query embedding.

Additionally, for test $i$ consisting of a query and target word-year pair, consider the closest $K$ words to the query embedding in the target year. If the target word is among these $K$ words, then the Precision@K for test $i$ (denoted P@K[ $i$ ]) is 1; else, it is 0. Then the Mean Precision@K is defined as

Higher precision indicates a better ability to acquire correct answers using close embeddings.

Tables 6 and 7 show the evaluation of the alignment test. We can see that our proposed method outperforms others and shows good alignment quality, sometimes by an order of magnitude. Comparing to testset 1 which has a large amount of queries with considerable short range alignments (e.g., from 2012 to 2013), the testset 2 mostly consists of fewer long range alignments (e.g., 2012 to 1990). Therefore, we can see that the performance of SW2V is relatively good in testset 1 since the semantic distribution does not change much in short ranges which makes this test favorable to static embeddings. However, SW2V degrades sharply in testset 2, where the long range alignment is needed more. For TW2V, since it does an individual year-to-year (e.g., 2012-to-1990) transformation by assuming that the local structure of target words does not shift, its overall alignment quality of whole embedding sets is not satisfied in testset 1 with many alignments. However, it does similarly to AW2V in testset 2 because its individual year-to-year transformation makes it more capable for long range alignment. AW2V, which enforces alignment for whole embedding sets between adjacent time slices, provides quite reliable performance. However, its alignment quality is still below ours, suggesting that their two-step approach is not as successful in enforcing global alignment.

3. Robustness

Finally, we explore the robustness of our embedding model against subsampling of words for select years. Table 8 shows the result of the alignment task (testset 1) for vectors computed from subsampled co-occurrence matrices for every three years from 1991 to 2015. To subsample, each element $C_{ij}$ is replaced with a randomly drawn integer $\hat{C}_{ij}$ from a Binomial distribution for rate $r$ and $n=C_{ij}$ trials; this simulates the number of co-occurrences measured if they had been missed with probability $r$ . The new frequency $\hat{f}$ is then renormalized so that $\hat{f}_{i}/f_{i}=\sum_{j}\hat{C}_{ij}/\sum_{j}C_{ij}$ . Listed are the alignment test results for $r=1,0.1,0.01,$ and $0.001$ compared against (Hamilton et al., 2016), which otherwise performs comparably with our embedding. Unsurprisingly, for extreme attacks (leaving only 1% or 0.1% co-occurrences), the performance of (Hamilton et al., 2016) degrades sharply; however, because of our joint optimization approach, the performance of our embeddings seems to hold steady.

Related Work

Temporal effects in natural language processing: There are several studies that investigate the temporal features of natural language. Some are using topic modeling on news corpus (Allan et al., 2001) or time-stamped scientific journals (Sipos et al., 2012; Wang and McCallum, 2006; Blei and Lafferty, 2006) to find spikes and emergences of themes and viewpoints. Simpler word count features are used in (Heyer et al., 2009; Michel et al., 2011; Choi and Varian, 2012) to find hotly discussed concepts and cultural phenomena, in (Merchant, 2001; Schiano et al., 2002; Tagliamonte and Denis, 2008) to analyze teen behavior in chatrooms, and in (Heyer et al., 2009) to discover incidents of influenza.

Word embedding learning: The idea of word embeddings has existed at least since the 90s, with vectors computed as rows of the co-occurrence (Lund and Burgess, 1996), through matrix factorization (Deerwester et al., 1990), and most famously through deep neural networks (Bengio et al., 2003; Collobert and Weston, 2008). They have recently been repopularized with the success of low-dimensional embeddings like GloVE (Pennington et al., 2014) and word2vec (Mikolov et al., 2013b; Mikolov et al., 2013a), which have been shown to greatly improve the performance in key NLP tasks, like document clustering (Kusner et al., 2015), LDA (Petterson et al., 2010), and word similarity (Levy et al., 2015; Baroni et al., 2014). There is a close connection between these recent methods and our proposed method, in that both word2vec and GloVE have been shown to be equivalent to matrix factorization of a shifted PMI matrix (Levy and Goldberg, 2014).

Temporal word embeddings and evaluations: While NLP tools have been used frequently to discover emerging word meanings and societal trends, many of them rely on changes in the co-occurrence or PMI matrix (Michel et al., 2011; Gulordava and Baroni, 2011; Wijaya and Yeniterzi, 2011; Heyer et al., 2009; Mitra et al., 2014), changes in parts of speech, (Mihalcea and Nastase, 2012) or other statistical methods (Tang et al., 2016; Basile et al., 2014; Kulkarni et al., 2015). A few works use low-dimensional word embeddings, but either do no smoothing (Sagi et al., 2011), or use two-step methods (Kim et al., 2014; Hamilton et al., 2016; Kulkarni et al., 2015). Semantic shift and emergence are also evaluated in many different ways. In (Sagi et al., 2011), word shifts are identified by tracking the mean angle between a word and its neighbors. One of the several tests in (Kulkarni et al., 2015) create synthetic data with injected semantic shifts, and quantifies the accuracy of capturing them using various time series metrics. In (Mihalcea and Nastase, 2012), the authors show the semantic meaningfulness of key lexical features by using them to predict the time-stamp of a particular phrase. And, (Mitra et al., 2014) makes the connection that emergent meanings usually coexist with previous meanings, and use dynamic embeddings to discover and identify multisenses, evaluated against WordNet. Primarily, temporal word embeddings are evaluated against human-created databases of known semantically shifted words (Hamilton et al., 2016; Kulkarni et al., 2015; Tang et al., 2016) which is our approach as well.

Conclusion

We studied the evolution of word semantics as a dynamic word embedding learning problem. We proposed a model to learn time-aware word embeddings and used it to dynamically mine text corpora. Our proposed method simultaneously learns the embeddings and aligns them across time, and has several benefits: higher interpretability for embeddings, better quality with less data, and more reliable alignment for across-time querying. We solved the resulting optimization problem using a scalable block coordinate descent method. We designed qualitative and quantitative methods to evaluate temporal embeddings for evolving word semantics, and showed that our dynamic embedding method performs favorably against other temporal embedding approaches.