A Continuously Growing Dataset of Sentential Paraphrases

Wuwei Lan, Siyu Qiu, Hua He, Wei Xu

Introduction

A paraphrase is a restatement of meaning using different expressions Bhagat and Hovy (2013). It is a fundamental semantic relation in human language, as formalized in the Meaning-Text linguistic theory which defines meaning as ‘invariant of paraphrases’ Milićević (2006). Researchers have shown benefits of using paraphrases in a wide range of applications Madnani and Dorr (2010), including question answering Fader et al. (2013), semantic parsing Berant and Liang (2014), information extraction Sekine (2006); Zhang et al. (2015), machine translation Mehdizadeh Seraj et al. (2015), textual entailment Dagan et al. (2006); Bjerva et al. (2014); Marelli et al. (2014); Izadinia et al. (2015), vector semantics Faruqui et al. (2015); Wieting et al. (2015), and semantic textual similarity (Agirre et al., 2015; Li and Srikumar, 2016). Studying paraphrases in Twitter can also help track unfolding events Vosoughi and Roy (2016) or the spread of information Bakshy et al. (2011) on social networks.

In this paper, we address a major challenge in paraphrase research: the lack of parallel corpora. There are only two publicly available datasets of naturally occurring sentential paraphrases and non-paraphrases: the MSRP corpus derived from clustered news articles Dolan and Brockett (2005) and the PIT-2015 corpus from Twitter trending topics Xu et al. (2014, 2015). Meaningful non-paraphrases (pairs of sentences that have similar wordings or topics but different meanings, and that are not randomly or artificially generated) have been very difficult to obtain but are very important, because they serve as necessary distractors in training and evaluation. Our goal is not only to create a new annotated paraphrase corpus, but to identify a new data source and method that can narrow down the search space of paraphrases without using the classifier-biased or human-in-the-loop data selection as in MSRP and PIT-2015. This is so that sentential paraphrases can be conveniently and continuously harvested in large quantities to benefit downstream applications.

We present an effective method to collect sentential paraphrases from tweets that refer to the same URL and contribute a new gold-standard annotated corpus of 51,524 sentence pairs, which is the largest to date (Table 1). We show the different characteristics of this new dataset contrasting the two existing corpora through the first systematic study of paraphrase identification across multiple datasets. Our new corpus is complementary to previous work, as the corpus contains multiple references of both formal well-edited and informal user-generated texts. This is also the first work that provides a continuously growing collection, with more than 30,000 new sentential paraphrases per month automatically labeled at $\sim$ 70% precision. We demonstrate that up-to-date phrasal paraphrases can then be extracted via word alignment (see examples in Table 2). We plan to continue collecting paraphrases using our method and release a constantly updating paraphrase resource.

Existing Paraphrase Corpora and Their Limitations

To date, there exist only two publicly available corpora of both sentential paraphrases and non-paraphrases:

MSRP (Dolan et al., 2004; Dolan and Brockett, 2005): This corpus contains 5,801 pairs of sentences from news articles, with 4,076 for training and the remaining 1,725 for testing. It was created from clustered news articles by using an SVM classifier (using features including string similarity and WordNet synonyms) to gather likely paraphrases, which were then annotated by humans for semantic equivalence. The MSRP corpus has a known deficiency, a skew toward over-identification Das and Smith (2009), because the “purpose was not to evaluate the potential effectiveness of the classifier itself, but to identify a reasonably large set of both positive and plausible ‘near-miss’ negative examples” Dolan and Brockett (2005). It contains a large portion of sentence pairs with many ngrams shared in common.

PIT-2015 (Xu et al., 2014, 2015): This corpus was derived from Twitter’s trending topic data. The training set contains 13,063 sentence pairs on 400 distinct topics, and the test set contains 972 sentence pairs on 20 topics. As numerous Twitter users spontaneously talk about varied topics, this dataset contains many lexically divergent paraphrases. However, this method requires a manual step of selecting topics to ensure the quality of collected paraphrases, because many topics detected automatically are either incorrect or too broad. For example, the topic “New York” relates to tweets with a wide range of information and cannot narrow the search space down enough for human annotation and the subsequent application of classification algorithms.

Constructing the Twitter URL Paraphrase Corpus

For paraphrase acquisition, it has been crucial to find a simple and effective way to locate paraphrase candidates (see related work in Section 6). We show the efficacy of tracking URLs in Twitter. This method does not rely on automatic news clustering as in MSRP or topic detection as in PIT-2015, but it keeps collecting good candidate paraphrase pairs in large quantities.

We extracted the embedded URL in each tweet and used Twitter’s Search API to retrieve all tweets that contain the same URL. Some tweets use shortened URLs, which we resolve as full URLs. We tracked 22 English news accounts in Twitter to create the paraphrase corpus in this paper (see examples in Table 3). We will extend the corpus to include other languages and domains in future work.
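The grouping step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes tweets have already been retrieved and their shortened URLs resolved, and the function names, the `(text, url)` tuple layout, and the choice of pairing the first tweet seen for a URL with all later ones are hypothetical.

```python
from collections import defaultdict


def group_by_url(tweets):
    """Map each resolved URL to the list of tweet texts that embed it."""
    groups = defaultdict(list)
    for text, url in tweets:
        groups[url].append(text)
    return dict(groups)


def candidate_pairs(groups):
    """Pair the first tweet seen for a URL with every other tweet on that URL."""
    pairs = []
    for texts in groups.values():
        seed, rest = texts[0], texts[1:]
        pairs.extend((seed, other) for other in rest)
    return pairs


# Toy input; the URLs are placeholders, not real links.
tweets = [
    ("CO2 levels mark 'new era' in the world's changing climate", "http://example.com/a"),
    ("CO2 levels haven't been this high for 3 to 5 million years", "http://example.com/a"),
    ("An unrelated headline", "http://example.com/b"),
]
pairs = candidate_pairs(group_by_url(tweets))
```

Only tweets that share a URL are ever compared, which is what keeps the candidate space small without news clustering or topic detection.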

As shown in Table 5, nearly all the tweets posted by news agencies have embedded URLs. About 51.17% of posts contain two URLs, usually one pointing to a news article and the other to media such as a photo or video. Although close to half of the tweets in Twitter streaming data contain at least one URL, most of them are very hard to read (see examples in Table 4). We used Twitter’s Streaming API, which provided a real-time stream of public tweets posted on Twitter.

Filtering of Retweets

Retweeting is an important feature in Twitter. There are two types: automatic and manual retweets. An automatic retweet is done by clicking the retweet button on Twitter and is easy to remove using the Twitter API. A manual retweet occurs when the user creates a new tweet by copying and pasting the original tweet and possibly adding some extras, such as hashtags, usernames or comments. It is crucial to remove these redundant tweets with minor variations, which otherwise represent a significant portion of the data (Table 6). We preprocessed the tweets using a tokenizer (http://www.cs.cmu.edu/~ark/TweetNLP/) Gimpel et al. (2011) and an in-house sentence splitter. We then filtered out manual retweets using a set of rules, checking if one tweet was a substring of the other, or if it only differed in punctuation, or the contents of the “twitter:title” or “twitter:description” tag in the linked HTML file of the news article.
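A rough sketch of such filtering rules is given below. The exact rules used for the corpus are not spelled out, so this is one possible reading: tweets are compared after lowercasing and stripping ASCII punctuation, and a tweet is also dropped when it matches the page's title or description text. The function names and the `page_fields` parameter are hypothetical.

```python
import string

# Translation table that deletes ASCII punctuation (including '#' and '@').
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT).split())


def is_manual_retweet(candidate, original, page_fields=()):
    """Return True if the candidate looks like a near-copy of the original.

    page_fields holds text taken from the linked page, for example the
    contents of its "twitter:title" or "twitter:description" tag.
    """
    cand, orig = normalize(candidate), normalize(original)
    if not cand or not orig:
        return True  # degenerate case: nothing left besides punctuation
    # Substring rule; after normalization this also covers pairs that
    # differ only in punctuation.
    if cand in orig or orig in cand:
        return True
    # Assumed reading of the third rule: the tweet repeats the page metadata.
    return any(cand == normalize(field) for field in page_fields)
```

For example, `is_manual_retweet("RT @bbc: Storm hits coast!", "Storm hits coast")` is true, while a genuinely reworded tweet such as "Coastal towns battered by storm" passes the filter.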

Table 6 shows the effectiveness of the filtering. We used PINC, a standard paraphrase metric, to measure ngram-based dissimilarity Chen and Dolan (2011), and Jaccard metric to measure token-based string similarity Jaccard (1912). After filtering, the dataset contains tweets with more significant rephrasing as indicated by higher PINC and lower Jaccard scores.
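For reference, the two metrics can be sketched as below. PINC follows the definition in Chen and Dolan (2011): the average, over n-gram orders 1 to N, of the fraction of candidate n-grams that do not appear in the source, with N = 4. Skipping orders for which the candidate has no n-grams, and the values returned for empty inputs, are assumptions of this sketch.

```python
def ngrams(tokens, n):
    """Set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def pinc(source, candidate, max_n=4):
    """Mean over n of the share of candidate n-grams absent from the source."""
    scores = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        if not cand:
            continue  # candidate is shorter than n tokens
        overlap = len(cand & ngrams(source, n))
        scores.append(1.0 - overlap / len(cand))
    return sum(scores) / len(scores) if scores else 0.0


def jaccard(a, b):
    """Token-set Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0
```

Higher PINC means less n-gram overlap with the source, whereas higher Jaccard means more token overlap, which is why more substantial rephrasing shows up as higher PINC and lower Jaccard.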

Gold Standard Corpus

To get the gold-standard paraphrase corpus, we obtained human labels on Amazon Mechanical Turk. We showed annotators an original sentence, and asked them to select sentences with the same meaning from 10 candidate sentences. For each question, we recruited 6 annotators and paid $0.03 to each worker. The low pricing helps to not attract spammers to this easy-to-finish task. We gave bonuses to workers based on quality, and the average hourly pay for each worker is about $7. On average, each question took about 53 seconds to finish. For each sentence pair, we aggregated the paraphrase and non-paraphrase labels using the majority vote.
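The aggregation step can be illustrated with a short sketch. With an even number of annotators a 3-3 split is possible, and the text does not say how such ties are resolved; returning `None` for a tie is an assumption made here, as is the function name.

```python
def majority_vote(votes):
    """Aggregate binary labels (1 = paraphrase, 0 = non-paraphrase).

    Returns True or False for a clear majority and None for a tie.
    How ties are handled is an assumption of this sketch.
    """
    yes = sum(votes)
    no = len(votes) - yes
    if yes == no:
        return None
    return yes > no
```

For six workers, `majority_vote([1, 1, 1, 1, 0, 0])` yields `True` and `majority_vote([1, 1, 1, 0, 0, 0])` yields `None`.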

We constructed the largest gold standard paraphrase corpus to date, with 42,200 tweets of 4,272 distinct URLs annotated in the training set and 9,324 tweets of 915 distinct URLs in the test set. The training data was collected between 10/10/2016 and 11/22/2016, and testing data between 01/09/2017 and 01/19/2017. In Section 4, we contrast the characteristics of our data against existing paraphrase corpora.

We evaluated the annotation quality of each worker using Cohen’s kappa agreement Artstein and Poesio (2008) against the majority vote of other workers. We asked the best workers (the top 528 out of 876) to label more data by republishing the questions done by workers with low reliability (Cohen’s kappa <0.4).

In addition, we had 300 sampled sentence pairs independently annotated by an expert. The agreement between the expert and the majority vote of 6 crowdsourcing workers is 0.739 by Cohen’s kappa. If we assume the expert annotation is gold, the precision of the worker vote is 0.871, the recall is 0.787, and F1 is 0.827, similar to those of PIT-2015.
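As a small worked check, the sketch below computes Cohen's kappa for two label sequences using the standard definition (observed agreement corrected for chance agreement) and recomputes F1 from the reported precision and recall. The function names are illustrative, and the inputs to `cohen_kappa` are assumed to be non-empty and of equal length.

```python
def cohen_kappa(a, b):
    """Cohen's kappa between two equally long, non-empty label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Chance agreement from each rater's marginal label distribution.
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1.0 - expected)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recompute F1 from the reported precision and recall of the worker vote.
reported_f1 = round(f1_score(0.871, 0.787), 3)
```

Here `reported_f1` evaluates to 0.827, matching the figure above.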

Continuous Harvesting of Sentential Paraphrases

Since our method directly applies to raw tweets, it can continuously extract sentential paraphrases from Twitter. In Section 4, we show that this approach can produce a silver-standard paraphrase corpus at about 70% precision that grows by more than 30,000 new sentential paraphrases per month. Section 5 presents experiments demonstrating the utility of these automatically identified sentential paraphrases.

Comparison of Paraphrase Corpora

Though paraphrasing has been widely studied, supporting analyses and experiments have thus far often only been conducted on a single dataset. In this section, we present a comparative analysis of our newly constructed gold-standard corpus with two existing corpora by 1) individually examining the instances of paraphrase phenomena and 2) benchmarking a range of automatic paraphrase identification approaches.

In order to show the differences across these three datasets, we sampled 100 sentential paraphrases from each training set and counted occurrences of each phenomenon in the following categories: Elaboration (textual pairs can differ in total information content, such as Trump’s ex-wife Ivana and Ivana Trump), Phrasal (alternates of phrases, such as taking over and replaces), Spelling (spelling variants, such as Trump and Trumpf), Synonym (such as said and told), Anaphora (a full noun phrase in one sentence that corresponds to the counterpart, such as @MarkKirk and Kirk) and Reordering (when a word, phrase or the whole sentence reorders, or even logically reordered, such as Matthew Fishbein questioned him and under questioning by Matthew Fishbein). We report the average number of occurrences of each paraphrase type per sentence pair for each corpus in Table 7. As sentences tend to be longer in MSRP and shorter in PIT-2015, we also normalized the numbers by the length of sentences to be more comparable to the URL dataset.

These three datasets exhibit distinct and complementary compositions of paraphrase phenomena. MSRP has more synonyms, because authors of different news articles may use different and rather sophisticated words. PIT-2015 contains many phrasal paraphrases, probably due to the fact that most tweets under the same trending topic are written spontaneously and independently. Our URL dataset shows more elaboration, spelling and anaphora paraphrase phenomena, showing that many URL-embedded tweets are created by users with a conscious intention to rephrase the original news headline.

Automatic Paraphrase Identification

We provide a benchmark on paraphrase identification to better understand various models, as well as the characteristics of our new corpus compared to the existing ones. We focus on binary classification of paraphrase/non-paraphrase, and report the maximum F1 measure of any point on the precision-recall curve.
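The maximum F1 over the precision-recall curve can be obtained by sweeping a decision threshold over the ranked system scores, as in the sketch below. This is an illustrative implementation rather than the evaluation script used for the benchmark; in particular, evaluating only at boundaries between distinct scores is a choice made here.

```python
def max_f1(scores, labels):
    """Best F1 over all thresholds on the system scores.

    scores: higher means more likely paraphrase.
    labels: 1 for paraphrase, 0 for non-paraphrase.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(labels)
    best, tp = 0.0, 0
    for k, (score, label) in enumerate(ranked, start=1):
        tp += label
        # Tied scores cannot be separated by a threshold, so only
        # evaluate after the last item of each run of equal scores.
        if k < len(ranked) and ranked[k][0] == score:
            continue
        if tp == 0:
            continue
        precision = tp / k
        recall = tp / total_pos
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

For scores `[0.9, 0.8, 0.3, 0.1]` with labels `[1, 0, 1, 0]`, the best operating point accepts the top three pairs (precision 2/3, recall 1), giving an F1 of 0.8.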

We chose several representative technical approaches for automatic paraphrase identification:

Glove (Pennington et al., 2014): This is a word representation model trained on aggregated global word-word co-occurrence statistics from a corpus. We used 300-dimensional word vectors trained on Common Crawl and Twitter, summed the vectors for each sentence, and computed the cosine similarity.

The logistic regression (LR) model incorporates 18 features based on 1-3 gram overlaps between two sentences ( $s_{1}$ and $s_{2}$ ) Das and Smith (2009). The features are of the form precision$_{n}$ (number of n-gram matches divided by the number of n-grams in $s_{1}$ ), recall$_{n}$ (number of n-gram matches divided by the number of n-grams in $s_{2}$ ), and F$_{n}$ (harmonic mean of recall and precision). The model also includes lemmatized versions of these features.

Weighted Matrix Factorization (WMF) Guo and Diab (2012) is an unsupervised latent space model. The unobserved words are carefully handled, which results in more robust embeddings for short texts. Orthogonal Matrix Factorization (OrMF) Guo et al. (2014) is the extension of WMF, with an additional objective to obtain nearly orthogonal dimensions in matrix factorization to discount redundant information. Specifically, for the (vec) version, vectors of a pair of sentences $\vec{v_{1}}$ and $\vec{v_{2}}$ are converted into one feature vector, $[\vec{v_{1}}+\vec{v_{2}},|\vec{v_{1}}-\vec{v_{2}}|]$ , by concatenating the element-wise sum $\vec{v_{1}}+\vec{v_{2}}$ and absolute difference $|\vec{v_{1}}-\vec{v_{2}}|$ . We also provide the (sim) variation, which directly uses the single cosine similarity score between two sentence vectors.

LEX-WMF and LEX-OrMF: These are open-sourced adaptations Xu et al. (2014) of LEXDISCRIM Ji and Eisenstein (2013) that have shown comparable performance. They combine WMF/OrMF with n-gram overlap features to train an LR classifier.

MultiP Xu et al. (2014) is a multi-instance learning model suited for short messages on Twitter. The at-least-one-anchor assumption in this model looks for two sentences that have a topical phrase in common, plus at least one pair of anchor words that carry a similar key meaning. This model achieved the best performance in the PIT-2015 Xu et al. (2014) dataset.

DeepPairwiseWord: He and Lin (2016) developed a deep neural network model that focuses on important pairwise word interactions across input sentences. This model innovates in proposing a similarity focus layer and a 19-layer very deep convolutional neural network to guide model attention to important word pairs. It has shown state-of-the-art performance on several textual similarity measurement datasets.
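Three of the feature computations described in this list are simple enough to sketch directly: the summed-vector cosine similarity, the n-gram overlap features of the LR model, and the (vec) pair representation. The code below is an illustration under stated assumptions, not the released implementation: the word vectors are toy 3-dimensional values, out-of-vocabulary tokens are skipped, n-gram matches are counted with clipped (multiset) counts, and the `lemmatize` argument defaults to the identity function in place of a real lemmatizer.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def summed_vector(tokens, vectors, dim):
    """Sum the word vectors of a sentence, skipping unknown tokens."""
    total = [0.0] * dim
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is not None:
            total = [t + x for t, x in zip(total, vec)]
    return total


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_features(s1, s2, max_n=3):
    """precision_n, recall_n and F_n for n = 1..max_n (9 values for max_n=3)."""
    feats = []
    for n in range(1, max_n + 1):
        c1, c2 = ngram_counts(s1, n), ngram_counts(s2, n)
        matches = sum((c1 & c2).values())  # clipped match count
        total1, total2 = sum(c1.values()), sum(c2.values())
        p = matches / total1 if total1 else 0.0
        r = matches / total2 if total2 else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        feats.extend([p, r, f])
    return feats


def lr_features(s1, s2, lemmatize=lambda tok: tok):
    """Surface-form features followed by the same features on lemmas (18 values)."""
    lem1 = [lemmatize(t) for t in s1]
    lem2 = [lemmatize(t) for t in s2]
    return overlap_features(s1, s2) + overlap_features(lem1, lem2)


def vec_features(v1, v2):
    """(vec) representation: element-wise sum, then absolute difference."""
    return [a + b for a, b in zip(v1, v2)] + [abs(a - b) for a, b in zip(v1, v2)]


# Toy vectors and sentences, for illustration only.
toy_vectors = {"said": [1.0, 0.0, 0.0], "told": [0.9, 0.1, 0.0], "storm": [0.0, 0.0, 1.0]}
s1 = "he said the storm hit".split()
s2 = "he told us the storm hit".split()
glove_sim = cosine(summed_vector(s1, toy_vectors, 3), summed_vector(s2, toy_vectors, 3))
features = lr_features(s1, s2)
pair_vec = vec_features([1.0, 2.0], [3.0, 1.0])
```

The (vec) representation is symmetric in its two inputs, so swapping the sentences in a pair leaves the feature vector unchanged, whereas precision$_{n}$ and recall$_{n}$ trade places.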

Model Performance and Dataset Difference

The results on three benchmark paraphrase corpora are shown in Tables 8, 9 and 10. The random baseline reflects that close to 80% of sentence pairs are paraphrases in the MSRP corpus. This is atypical of real-world text data and may cause false positive predictions.

Both the edit distance and the LR models exploit surface word features. In particular, the LR model that uses lemmatization and ngram overlap features achieves very competitive performance on all datasets. Figure 1 shows a closer look at ngram differences across datasets measured by the PINC metric Chen and Dolan (2011), which is the opposite of BLEU Papineni et al. (2002). MSRP consists of paraphrases with more ngram overlap (lower PINC), while PIT-2015 contains shorter and more lexically dissimilar sentences. Our new URL corpus is in between the two, and is more similar to PIT-2015. It includes users’ intentional rephrasing of an original tweet from a news agency with some words untouched, as well as some dramatic paraphrases that are challenging for any automatic identification method, such as CO2 levels mark ‘new era’ in the world’s changing climate and CO2 levels haven’t been this high for 3 to 5 million years.

MultiP exploits a restrictive constraint that the candidate sentence pairs share the same topical phrase. It achieves the best performance on PIT-2015, which naturally contains such phrases. For the MSRP and URL datasets, we used the named entity tagged with the longest span as an approximation of a shared topic phrase, and the model thus suffered a performance drop.

Both Glove and WMF/OrMF utilize the underlying co-occurrence statistics of the text corpus. WMF/OrMF use global matrix factorization to project sentences into a lower dimension and show great advantages in measuring sentence-level semantic similarities over Glove, which focuses on word representations. Figure 2 shows the fine-grained distribution of the OrMF-based cosine similarities and that the URL-linked Twitter data works well with OrMF to yield sentential paraphrases. Once combined with ngram overlap features, LEX-WMF and LEX-OrMF show consistently high performance across different datasets, close to the more complicated DeepPairwiseWord. The similarity focus mechanism on important pairwise word interactions in DeepPairwiseWord is more helpful for the two Twitter datasets, due to the fact that they contain lexically divergent paraphrases while MSRP has an artificial bias toward sentences with high n-gram overlap.

Extracting Phrasal Paraphrases

We can apply paraphrase identification models trained on our gold standard corpus to unlabeled Twitter data and continuously harvest sentential paraphrases in large quantities. We used the open-sourced LEX-OrMF model and obtained 114,025 sentential paraphrases (system predicted probability $\geq$ 0.5 and average precision $=$ 69.08%) from raw 1% free Twitter data between 10/10/2016 and 01/10/2017. To demonstrate the utility, we show that we can extract up-to-date lexical and phrasal paraphrases from this data.
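As a quick arithmetic check of the harvesting rate, the snippet below divides the reported 114,025 pairs by the length of the collection window. Normalizing to a 30-day month is an assumption made here for illustration.

```python
from datetime import date

harvested_pairs = 114_025  # sentential paraphrases reported above
days = (date(2017, 1, 10) - date(2016, 10, 10)).days  # length of the window
per_30_days = harvested_pairs / days * 30
```

The window spans 92 days, so the rate is roughly 37,000 pairs per 30 days, consistent with the stated growth of more than 30,000 per month.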

One of the most successful ideas for obtaining lexical and phrasal paraphrases in large quantities is to use word alignment and then rank the results for better quality. This approach was proposed by Bannard and Callison-Burch (2005) and previously applied to bilingual parallel data to create PPDB Ganitkevitch et al. (2013); Pavlick et al. (2015). There has been little previous work utilizing monolingual parallel data to learn paraphrases, since it is not as naturally available as bitexts.

We used the GIZA++ word aligner in the Moses machine translation toolkit Koehn et al. (2007) and extracted 245,686 phrasal paraphrases. Some examples are shown in Table 2. We additionally explored two supervised monolingual aligners: Jacana aligner Yao et al. (2013) and Md Sultan’s aligner Sultan et al. (2014). We ranked the phrase pairs using four different scores:

Language Model Score Let $w_{-2}w_{-1}pw_{1}w_{2}$ be the context of the phrase $p$ . We considered a phrase $p^{\prime}$ to be a good substitute for $p$ if $w_{-2}w_{-1}p^{\prime}w_{1}w_{2}$ is a likely sequence according to a language model Heafield (2011) trained on Twitter data.

Translation Score Moses provides translation probabilities $\varphi(p|p^{\prime})$ .

Glove Score We used Glove Pennington et al. (2014) pretrained 100-dimensional Twitter word vectors and cosine similarity.

Our Score We trained a supervised SVM regression model using 500 phrase pairs with human ratings. We used the language model, translation, and glove scores as features, and additionally used the inverse phrase translation probability $\varphi(p^{\prime}|p)$ , lexical weighting $lex(p|p^{\prime})$ , and $lex(p^{\prime}|p)$ from Moses.

Figure 3 compares the different ranking methods against the human judgments on 200 phrase pairs randomly sampled from GIZA++.

Paraphrase Quality Evaluation

We compared the quality of paraphrases extracted by our method with the closest previous work (BUCC-2013) Xu et al. (2013), in which a similar phrase table was created using Moses from monolingual parallel tweets that contain the same named entity and calendar date. We randomly sampled 500 phrase pairs from each phrase table and collected human judgements on a 5-point Likert scale, as described in Callison-Burch (2008). Table 11 shows the evaluation results. We focused on the highest-quality paraphrases that were rated as 5 (“all of the meaning of the original phrase is retained, and nothing is added”) and their presence among all extracted paraphrases sorted by ranking scores.

We were also interested in how these phrasal paraphrases compared with those in PPDB. We sampled 420 paraphrase pairs each from our phrase tables and PPDB, and then checked what percentage out of the total 840 could be found in our phrase tables and PPDB, respectively. As shown in Table 12, there is little overlap between URL data and PPDB, only 1.3% (51.3-50%) plus 0.8% (50.8-50%). Our Twitter URL data complements the existing paraphrase resources, such as PPDB, which are primarily derived from well-edited texts.

Related Work

Researchers have found several data sources from which to collect sentential paraphrases: multiple news agencies reporting the same event (MSRP) Dolan et al. (2004); Dolan and Brockett (2005), multiple translated versions of a foreign novel Barzilay and Elhadad (2003); Barzilay and Lee (2003) or other texts Cohn et al. (2008), multiple definitions of the same concept Hashimoto et al. (2011), descriptions of the same video clip from multiple workers Chen and Dolan (2011) or rephrased sentences Burrows et al. (2013); Toutanova et al. (2016). However, all these data collection methods are incapable of obtaining sentential paraphrases on a large scale (i.e. limited number of news agencies or books with multiple translated versions), and/or lack meaningful negative examples. Both of these properties are crucial for developing machine learning models that identify paraphrases and measure semantic similarities.

There are other phrasal and syntactic paraphrase data, such as DIRT Lin and Pantel (2001), POLY Grycner et al. (2016), PATTY Nakashole et al. (2012), DEFIE Bovi et al. (2015), and PPDB Ganitkevitch et al. (2013); Pavlick et al. (2015). Most of these works focus on news or web data. Other earlier works on Twitter paraphrase extraction used unsupervised approaches Xu et al. (2013); Wang et al. (2013) or small datasets Zanzotto et al. (2011); Antoniak et al. (2015).

Conclusion and Future Work

In this paper, we show how a simple method can effectively and continuously collect large-scale sentential paraphrases from Twitter. We rigorously evaluated our data with automatic paraphrase identification models and various measurements. We will share our new dataset with the research community; this dataset includes 51,524 manually labeled sentence pairs and grows monthly by 30,000 automatically labeled sentential paraphrases. Future work could include expanding into the many different languages present in social media and developing language-independent automatic paraphrase identification models.

Acknowledgments

We would like to thank Chris Callison-Burch, Weiwei Guo and Mike White for valuable discussions, as well as the anonymous reviewers for helpful feedback.