Challenges in Data-to-Document Generation
Sam Wiseman, Stuart M. Shieber, Alexander M. Rush
Introduction
Over the past several years, neural text generation systems have shown impressive performance on tasks such as machine translation and summarization. As neural systems begin to move toward generating longer outputs in response to longer and more complicated inputs, however, the generated texts begin to display reference errors, inter-sentence incoherence, and a lack of fidelity to the source material. The goal of this paper is to suggest a particular, long-form generation task in which these challenges may be fruitfully explored, to provide a publicly available dataset for this task, to suggest some automatic evaluation metrics, and finally to establish how current neural text generation methods perform on this task.
A classic problem in natural-language generation (NLG) (Kukich, 1983; McKeown, 1992; Reiter and Dale, 1997) involves taking structured data, such as a table, as input, and producing text that adequately and fluently describes this data as output. Unlike machine translation, which aims for a complete transduction of the sentence to be translated, this form of NLG is typically taken to require addressing (at least) two separate challenges: what to say, the selection of an appropriate subset of the input data to discuss, and how to say it, the surface realization of a generation (Reiter and Dale, 1997; Jurafsky and Martin, 2014). Traditionally, these two challenges have been modularized and handled separately by generation systems. However, neural generation systems, which are typically trained end-to-end as conditional language models (Mikolov et al., 2010; Sutskever et al., 2011, 2014), blur this distinction.
In this context, we believe the problem of generating multi-sentence summaries of tables or database records to be a reasonable next-problem for neural techniques to tackle as they begin to consider more difficult NLG tasks. In particular, we would like this generation task to have the following two properties: (1) it is relatively easy to obtain fairly clean summaries and their corresponding databases for dataset construction, and (2) the summaries should be primarily focused on conveying the information in the database. This latter property ensures that the task is somewhat congenial to a standard encoder-decoder approach, and, more importantly, that it is reasonable to evaluate generations in terms of their fidelity to the database.
One task that meets these criteria is that of generating summaries of sports games from associated box-score data, and there is indeed a long history of NLG work that generates sports game summaries (Robin, 1994; Tanaka-Ishii et al., 1998; Barzilay and Lapata, 2005). To this end, we make the following contributions:
We introduce a new large-scale corpus consisting of textual descriptions of basketball games paired with extensive statistical tables. This dataset is sufficiently large that fully data-driven approaches might be sufficient.
We introduce a series of extractive evaluation models to automatically evaluate output generation performance, exploiting the fact that post-hoc information extraction is significantly easier than generation itself.
We apply a series of state-of-the-art neural methods, as well as a simple templated generation system, to our data-to-document generation task in order to establish baselines and study their generations.
Our experiments indicate that neural systems are quite good at producing fluent outputs and generally score well on standard word-match metrics, but perform quite poorly at content selection and at capturing long-term structure. While the use of copy-based models and additional reconstruction terms in the training loss can lead to improvements in BLEU and in our proposed extractive evaluations, current models are still quite far from producing human-level output, and are significantly worse than templated systems in terms of content selection and realization. Overall, we believe this problem of data-to-document generation highlights important remaining challenges in neural generation systems, and the use of extractive evaluation reveals significant issues hidden by standard automatic metrics.
Data-to-Text Datasets
We consider the problem of generating descriptive text from database records. Following the notation in Liang et al. (2009), let $\mathbf{s} = \{r_j\}_{j=1}^{J}$ be a set of records, where for each $r \in \mathbf{s}$ we define $r.t \in \mathcal{T}$ to be the type of $r$, and we assume each $r$ to be a binarized relation, where $r.e$ and $r.m$ are a record’s entity and value, respectively. For example, a database recording statistics for a basketball game might have a record $r$ such that $r.t$ = points, $r.e$ = Russell Westbrook, and $r.m$ is a number. In this case, $r.e$ gives the player in question, and $r.m$ gives the number of points the player scored. From these records, we are interested in generating descriptive text $\hat{y}_{1:T} = \hat{y}_1, \ldots, \hat{y}_T$ of $T$ words such that $\hat{y}_{1:T}$ is an adequate and fluent summary of $\mathbf{s}$. A dataset for training data-to-document systems typically consists of $(\mathbf{s}, y_{1:T})$ pairs, where $y_{1:T}$ is a document consisting of a gold (i.e., human generated) summary for database $\mathbf{s}$.
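As a rough illustration of the notation above, a record can be represented as an (entity, value, type) triple and a training example as a set of records paired with a tokenized gold summary. The class and field names below are assumptions made for this sketch, and the statistics are invented; they are not taken from the released dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    """A binarized relation with an entity, a value, and a type."""
    entity: str  # e.g. a player or team name
    value: str   # e.g. a number, kept as a string so it can match text
    type: str    # e.g. "points"


@dataclass
class Example:
    """One training pair: a set of records and its gold summary."""
    records: List[Record]
    summary: List[str]  # tokenized gold document


# Toy instance; the numbers are invented for illustration only.
example = Example(
    records=[
        Record("Russell Westbrook", "30", "points"),
        Record("Russell Westbrook", "8", "assists"),
    ],
    summary="Russell Westbrook scored 30 points and added 8 assists .".split(),
)
```

Keeping values as strings is a design choice of this sketch: it makes lexical matching against summary tokens straightforward.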
Several benchmark datasets have been used in recent years for the text generation task, the most popular of these being WeatherGov (Liang et al., 2009) and Robocup (Chen and Mooney, 2008). Recently, neural generation systems have shown strong results on these datasets, with the system of Mei et al. (2016) achieving BLEU scores in the 60s and 70s on WeatherGov, and BLEU scores of almost 30 even on the smaller Robocup dataset. These results are quite promising, and suggest that neural models are a good fit for text generation. However, the statistics of these datasets, shown in Table 1, indicate that these datasets use relatively simple language and record structure. Furthermore, there is reason to believe that WeatherGov is at least partially machine-generated (Reiter, 2017). More recently, Lebret et al. (2016) introduced the WikiBio dataset, which is at least an order of magnitude larger in terms of number of tokens and record types. However, as shown in Table 1, this dataset too only contains short (single-sentence) generations, and relatively few records per generation. As such, we believe that early success on these datasets is not yet sufficient for testing the desired linguistic capabilities of text generation at document scale.
With this challenge in mind, we introduce a new dataset for data-to-document text generation, available at https://github.com/harvardnlp/boxscore-data. The dataset is intended to be comparable to WeatherGov in terms of token count, but to have significantly longer target texts, a larger vocabulary space, and to require more difficult content selection.
The dataset consists of two sources of articles summarizing NBA basketball games, paired with their corresponding box- and line-score tables. The data statistics of these two sources, RotoWire and SBNation, are also shown in Table 1. The first dataset, RotoWire, uses professionally written, medium length game summaries targeted at fantasy basketball fans. The writing is colloquial, but relatively well structured, and targets an audience primarily interested in game statistics. The second dataset, SBNation, uses fan-written summaries targeted at other fans. This dataset is significantly larger, but also much more challenging, as the language is very informal, and often tangential to the statistics themselves. We show some sample text from RotoWire in Figure 1. Our primary focus will be on the RotoWire data.
Evaluating Document Generation
We begin by discussing the evaluation of generated documents, since both the task we introduce and the evaluation methods we propose are motivated by some of the shortcomings of current approaches to evaluation. Text generation systems are typically evaluated using a combination of automatic measures, such as BLEU (Papineni et al., 2002), and human evaluation. While BLEU is perhaps a reasonably effective way of evaluating short-form text generation, we found it to be unsatisfactory for document generation. In particular, we note that it primarily rewards fluent text generation, rather than generations that capture the most important information in the database, or that report the information in a particularly coherent way. While human evaluation, on the other hand, is likely ultimately necessary for evaluating generations (Liu et al., 2016; Wu et al., 2016), it is much less convenient than using automatic metrics. Furthermore, we believe that current text generations are sufficiently bad in sufficiently obvious ways that automatic metrics can still be of use in evaluation, and we are not yet at the point of needing to rely solely on human evaluators.
To address this evaluation challenge, we begin with the intuition that assessing document quality is easier than document generation. In particular, it is much easier to automatically extract information from documents than to generate documents that accurately convey desired information. As such, simple, high-precision information extraction models can serve as the basis for assessing and better understanding the quality of automatic generations. We emphasize that such an evaluation scheme is most appropriate when evaluating generations (such as basketball game summaries) that are primarily intended to summarize information. While many generation problems do not fall into this category, we believe this to be an interesting category, and one worth focusing on because it is amenable to this sort of evaluation.
To see how a simple information extraction system might work, consider the document in Figure 1. We may first extract candidate entity (player, team, and city) and value (number and certain string) pairs that appear in the text, and then predict the type (or none) of each candidate pair. For example, we might extract the entity-value pair (“Miami Heat”, “95”) from the first sentence in Figure 1, and then predict that the type of this pair is points, giving us an extracted record $r$ such that $(r.e, r.m, r.t)$ = (Miami Heat, 95, points). Indeed, many relation extraction systems reduce relation extraction to multi-class classification precisely in this way (Zhang, 2004; Zhou et al., 2008; Zeng et al., 2014; dos Santos et al., 2015).
More concretely, given a document $\hat{y}_{1:T}$, we consider all pairs of word-spans in each sentence that represent possible entities $e$ and values $m$. We then model $p(r.t \mid e, m; \theta)$ for each pair, using $r.t = \epsilon$ to indicate unrelated pairs. We use architectures similar to those discussed in Collobert et al. (2011) and dos Santos et al. (2015) to parameterize this probability; full details are given in the Appendix.
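The candidate-extraction step can be sketched as follows. This is a simplified illustration rather than the extractor used in the paper: it assumes a known set of entity names, treats only digit strings as values, and uses an invented sentence; the function name `candidate_pairs` is hypothetical.

```python
import re
from itertools import product
from typing import List, Set, Tuple


def candidate_pairs(sentence: str, entities: Set[str]) -> List[Tuple[str, str]]:
    """Return every (entity, value) pair that co-occurs in one sentence."""
    # Entities: known names that appear verbatim in the sentence.
    found_entities = [e for e in sorted(entities) if e in sentence]
    # Values: standalone digit strings (number words are ignored in this sketch).
    found_values = re.findall(r"\b\d+\b", sentence)
    return list(product(found_entities, found_values))


# Invented sentence, used only to exercise the function.
sentence = "The Miami Heat defeated the Atlanta Hawks 95 - 88 ."
pairs = candidate_pairs(sentence, {"Miami Heat", "Atlanta Hawks"})
```

Each resulting pair would then be passed to a classifier that predicts its type, or a "none" label when the entity and value are unrelated.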
Importantly, we note that the $(\mathbf{s}, y_{1:T})$ pairs typically used for training data-to-document systems are also sufficient for training the information extraction model presented above, since we can obtain (partial) supervision by simply checking whether a candidate record lexically matches a record in $\mathbf{s}$. (Alternative approaches explicitly align the document with the table for this task; see Liang et al., 2009.) However, since there may be multiple records $r$ with the same $e$ and $m$ but with different types $r.t$, we will not always be able to determine the type of a given entity-value pair found in the text. We therefore train our classifier to minimize a latent-variable loss: for all document spans $e$ and $m$, with observed types $t(e, m) = \{r.t : r \in \mathbf{s},\ r.e = e,\ r.m = m\}$ (possibly $\{\epsilon\}$, i.e., only the label for unrelated pairs), we minimize
$$\mathcal{L}(\theta) = -\sum_{e, m} \log \sum_{t' \in t(e, m)} p(r.t = t' \mid e, m; \theta).$$
We find that this simple system trained in this way is quite accurate at predicting relations. On the RotoWire data it achieves over 90% accuracy on held-out data, and recalls approximately 60% of the relations licensed by the records.
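The lexical-match supervision and the latent-variable loss described above can be sketched for a single entity-value pair as follows. The label name `NONE_TYPE`, the function names, and the toy records and probabilities are all assumptions made for illustration; in practice the probabilities would come from the classifier.

```python
import math
from typing import Dict, List, Set, Tuple

NONE_TYPE = "<none>"  # hypothetical label for unrelated entity-value pairs


def observed_types(entity: str, value: str,
                   records: List[Tuple[str, str, str]]) -> Set[str]:
    """Types of all (entity, value, type) records that lexically match the pair."""
    types = {t for (e, m, t) in records if e == entity and m == value}
    return types or {NONE_TYPE}


def latent_type_loss(probs: Dict[str, float], allowed: Set[str]) -> float:
    """Negative log of the total probability mass on the allowed types."""
    return -math.log(sum(probs[t] for t in allowed))


# Two records share entity and value, so the true type is ambiguous.
records = [("Player A", "4", "assists"), ("Player A", "4", "rebounds")]
allowed = observed_types("Player A", "4", records)
probs = {"assists": 0.5, "rebounds": 0.25, "points": 0.125, NONE_TYPE: 0.125}
loss = latent_type_loss(probs, allowed)
```

Because the loss only asks that probability mass fall somewhere in the allowed set, the classifier is not penalized for choosing either of the ambiguous types.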
Comparing Generations
With a sufficiently precise relation extraction system, we can begin to evaluate how well an automatic generation $\hat{y}_{1:T}$ has captured the information in a set of records $\mathbf{s}$. In particular, since the predictions of a precise information extraction system serve to align entity-mention pairs in the text with database records, this alignment can be used both to evaluate a generation’s content selection (“what the generation says”), as well as content placement (“how the generation says it”).
We consider in particular three induced metrics:
Content Selection (CS): precision and recall of unique relations $r$ extracted from $\hat{y}_{1:T}$ that are also extracted from $y_{1:T}$. This measures how well the generated document matches the gold document in terms of selecting which records to generate.
Relation Generation (RG): precision and number of unique relations $r$ extracted from $\hat{y}_{1:T}$ that also appear in $\mathbf{s}$. This measures how well the system is able to generate text containing factual (i.e., correct) records.
Content Ordering (CO): normalized Damerau-Levenshtein Distance (DLD; Brill and Moore, 2000) between the sequences of records extracted from $y_{1:T}$ and that extracted from $\hat{y}_{1:T}$. This measures how well the system orders the records it chooses to discuss. DLD is a variant of Levenshtein distance that allows transpositions of elements; it is useful in comparing the ordering of sequences that may not be permutations of the same set (which is a requirement for measures like Kendall’s Tau).
We note that CS primarily targets the “what to say” aspect of evaluation, CO targets the “how to say it” aspect, and RG targets both.
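The three metrics can be sketched over already-extracted records as follows. This is an illustrative sketch under several assumptions: records are (entity, value, type) tuples, the RG count is read as the number of extracted relations supported by the database, the distance is the restricted (optimal string alignment) variant of Damerau-Levenshtein, and normalization is taken to be one minus the distance divided by the longer sequence length. The paper's released code may differ on any of these points.

```python
from typing import Sequence, Set, Tuple

Rec = Tuple[str, str, str]  # (entity, value, type)


def content_selection(pred: Set[Rec], gold: Set[Rec]) -> Tuple[float, float]:
    """Precision and recall of predicted relations against gold-text relations."""
    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


def relation_generation(pred: Set[Rec], database: Set[Rec]) -> Tuple[float, int]:
    """Precision and count of predicted relations supported by the database."""
    supported = len(pred & database)
    precision = supported / len(pred) if pred else 0.0
    return precision, supported


def osa_distance(a: Sequence, b: Sequence) -> int:
    """Edit distance allowing adjacent transpositions (restricted DLD)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def content_ordering(pred_seq: Sequence[Rec], gold_seq: Sequence[Rec]) -> float:
    """Similarity in [0, 1]: one minus the length-normalized distance."""
    longest = max(len(pred_seq), len(gold_seq))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(pred_seq, gold_seq) / longest


A, B, C, X = [("p", str(i), "points") for i in range(4)]
cs = content_selection({A, B, X}, {A, B, C})
rg = relation_generation({A, B, X}, {A, B, C})
co = content_ordering([B, A, C], [A, B, C])
```

In the toy example, swapping two adjacent records costs a single edit, which is the behavior that motivates using a transposition-aware distance for ordering.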
We conclude this section by contrasting the automatic evaluation we have proposed with recently proposed adversarial evaluation approaches, which also advocate automatic metrics backed by classification (Bowman et al., 2016; Kannan and Vinyals, 2016; Li et al., 2017). Unlike adversarial evaluation, which uses a black-box classifier to determine the quality of a generation, our metrics are defined with respect to the predictions of an information extraction system. Accordingly, our metrics are quite interpretable, since by construction it is always possible to determine which fact (i.e., entity-value pair) in the generation is determined by the extractor to not match the database or the gold generation.
Neural Data-to-Document Models
In this section we briefly describe the neural generation methods we apply to the proposed task. As a base model we utilize the now standard attention-based encoder-decoder model (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015). We also experiment with several recent extensions to this model, including copy-based generation, and training with a source reconstruction term in the loss (in addition to the standard per-target-word loss).
Copying
There has been a surge of recent work involving augmenting encoder-decoder models to copy words directly from the source material on which they condition (Gu et al., 2016; Gülçehre et al., 2016; Merity et al., 2016; Jia and Liang, 2016; Yang et al., 2016). These models typically introduce an additional binary variable $z_t$ into the per-timestep target word distribution, which indicates whether the target word $\hat{y}_t$ is copied from the source or generated:
$$p(\hat{y}_t \mid \hat{y}_{1:t-1}, \mathbf{s}) = \sum_{z \in \{0, 1\}} p(\hat{y}_t, z_t = z \mid \hat{y}_{1:t-1}, \mathbf{s}).$$
In our case, we assume that target words are copied from the value portion of a record $r$; that is, a copy implies $\hat{y}_t = r.m$ for some $r$ and $t$.
Joint Copy Model
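The models of Gu et al. (2016) and Yang et al. (2016) parameterize the joint distribution table over $\hat{y}_t$ and $z_t$ directly:
$$p(\hat{y}_t, z_t \mid \hat{y}_{1:t-1}, \mathbf{s}) \propto \begin{cases} \mathrm{copy}(\hat{y}_t, \hat{y}_{1:t-1}, \mathbf{s}) & z_t = 1,\ \hat{y}_t \in \mathbf{s} \\ 0 & z_t = 1,\ \hat{y}_t \notin \mathbf{s} \\ \mathrm{gen}(\hat{y}_t, \hat{y}_{1:t-1}, \mathbf{s}) & z_t = 0, \end{cases}$$
where $z_t = 1$ indicates that $\hat{y}_t$ is copied, $z_t = 0$ indicates that it is generated, copy and gen are scoring functions parameterized in terms of the decoder state, and $\hat{y}_t \in \mathbf{s}$ indicates that $\hat{y}_t$ equals $r.m$ for some $r \in \mathbf{s}$.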
Conditional Copy Model
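Gülçehre et al. (2016), on the other hand, decompose the joint probability as:
$$p(\hat{y}_t, z_t \mid \hat{y}_{1:t-1}, \mathbf{s}) = \begin{cases} p_{\mathrm{copy}}(\hat{y}_t \mid z_t, \hat{y}_{1:t-1}, \mathbf{s})\, p(z_t \mid \hat{y}_{1:t-1}, \mathbf{s}) & z_t = 1 \\ p_{\mathrm{gen}}(\hat{y}_t \mid z_t, \hat{y}_{1:t-1}, \mathbf{s})\, p(z_t \mid \hat{y}_{1:t-1}, \mathbf{s}) & z_t = 0, \end{cases}$$
where an MLP is used to model $p(z_t \mid \hat{y}_{1:t-1}, \mathbf{s})$.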
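The difference between the two parameterizations can be seen numerically in a small sketch. Here the generation and copy scores are given as plain arrays over a toy four-word vocabulary, words that cannot be copied receive a score of negative infinity, and the copy-switch probability is a fixed number standing in for the MLP output; all names and values are assumptions for illustration.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax; entries of -inf receive probability 0."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def joint_copy_marginal(gen_scores: np.ndarray, copy_scores: np.ndarray) -> np.ndarray:
    """Normalize copy and generate scores together, then sum out the switch."""
    probs = softmax(np.concatenate([gen_scores, copy_scores]))
    v = len(gen_scores)
    return probs[:v] + probs[v:]


def conditional_copy_marginal(gen_scores: np.ndarray, copy_scores: np.ndarray,
                              p_copy: float) -> np.ndarray:
    """Normalize each distribution separately and mix with the switch probability."""
    return p_copy * softmax(copy_scores) + (1.0 - p_copy) * softmax(gen_scores)


gen_scores = np.zeros(4)                            # uniform generation scores
copy_scores = np.array([-np.inf, 0.0, 0.0, -np.inf])  # only words 1 and 2 are in the source
joint = joint_copy_marginal(gen_scores, copy_scores)
cond = conditional_copy_marginal(gen_scores, copy_scores, p_copy=0.4)
```

In the joint variant the amount of mass assigned to copying is implied by the relative sizes of the scores, whereas in the conditional variant it is controlled explicitly by the switch probability.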
Models with copy-decoders may be trained to minimize the negative log marginal probability, marginalizing out the latent variable $z_t$ (Gu et al., 2016; Yang et al., 2016; Merity et al., 2016). However, if it is known which target words $y_t$ are copied, it is possible to train with a loss that does not marginalize out the latent $z_t$. Gülçehre et al. (2016), for instance, assume that any target word $y_t$ that also appears in the source is copied, and train to minimize the negative joint log-likelihood of the $y_t$ and $z_t$.
Reconstruction Losses
Reconstruction-based techniques can also be applied at the document- or sentence-level during training. One simple approach to this problem is to utilize the hidden states of the decoder to try to reconstruct the database. A fully differentiable approach using the decoder hidden states has recently been successfully applied to neural machine translation by Tu et al. (2017). Unlike copying, this method is applied only at training, and attempts to learn decoder hidden states with broader coverage of the input data.
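Concretely, writing $\mathbf{b}_i$ for a block of decoder hidden states and predicting $K$ records from each block, the reconstruction loss for a particular $\mathbf{b}_i$ is
$$\mathcal{L}(\theta) = \sum_{k=1}^{K} \min_{r \in \mathbf{s}} -\log p_k(r \mid \mathbf{b}_i; \theta) = \sum_{k=1}^{K} \min_{r \in \mathbf{s}} -\sum_{x \in \{e, m, t\}} \log p_k(r.x \mid \mathbf{b}_i; \theta),$$
where $p_k$ is the $k$’th predicted distribution over records, and where we have modeled each component of $r$ independently. This loss attempts to make the most probable record in $\mathbf{s}$ given $\mathbf{b}_i$ more probable. We found that augmenting the above loss with a term that penalizes the total variation distance (TVD) between the $p_k$ to be helpful. (Penalizing the TVD between the $p_k$ might be useful if, for instance, $K$ is too large, and only a smaller number of records can be predicted from $\mathbf{b}_i$. We also experimented with encouraging, rather than penalizing, the TVD between the $p_k$, which might make sense if we were worried about ensuring the $p_k$ captured different records.) Both $\mathcal{L}(\theta)$ and the TVD term are simply added to the standard negative log-likelihood objective at training time.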
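A minimal numerical sketch of this kind of reconstruction loss is given below, assuming that each of the predicted distributions factorizes over a record's entity, value, and type, and that records are given as index triples. The function names and toy probabilities are assumptions for illustration, and the total variation helper shows only the distance itself, not how it is weighted in training.

```python
import numpy as np
from typing import Dict, List, Tuple


def reconstruction_loss(log_probs: List[Dict[str, np.ndarray]],
                        records: List[Tuple[int, int, int]]) -> float:
    """Sum, over predicted distributions, of the smallest negative
    log-probability assigned to any record in the database."""
    total = 0.0
    for lp in log_probs:
        # Components are modeled independently, so log-probabilities add.
        nll = [-(lp["e"][e] + lp["m"][m] + lp["t"][t]) for (e, m, t) in records]
        total += min(nll)
    return total


def total_variation(p: np.ndarray, q: np.ndarray) -> float:
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(p - q).sum())


# One predicted distribution over 2 entities, 2 values, and 1 type.
predicted = [{
    "e": np.log(np.array([0.5, 0.5])),
    "m": np.log(np.array([0.25, 0.75])),
    "t": np.log(np.array([1.0])),
}]
records = [(0, 0, 0), (1, 1, 0)]  # (entity index, value index, type index)
loss = reconstruction_loss(predicted, records)
```

Taking the minimum over records means only the best-matching record contributes a gradient for each predicted distribution, so the model is free to decide which record a block of hidden states should reconstruct.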
Experimental Methods
We train the generation models using SGD and truncated BPTT (Elman, 1990; Mikolov et al., 2010), as in language modeling. That is, we split each $y_{1:T}$ into contiguous blocks of length 100, and backpropagate the gradients both with respect to the current block and with respect to the encoder parameters for each block.
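The block-splitting step can be sketched as below. Only the segmentation is shown; in an actual training loop the decoder state would be carried from one block to the next without backpropagating through the earlier block, which is omitted here. The function name is hypothetical.

```python
from typing import List, Sequence


def split_into_blocks(tokens: Sequence[str], block_len: int = 100) -> List[Sequence[str]]:
    """Split a target document into contiguous blocks of at most block_len tokens."""
    return [tokens[i:i + block_len] for i in range(0, len(tokens), block_len)]


document = ["tok"] * 250
blocks = split_into_blocks(document)
```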
Our extractive evaluator consists of an ensemble of 3 single-layer convolutional and 3 single-layer bidirectional LSTM models. The convolutional models concatenate convolutions with kernel widths 2, 3, and 5, and 200 feature maps in the style of Kim (2014). Both models are trained with SGD.
In addition to neural baselines, we also use a problem-specific, template-based generator. The template-based generator first emits a sentence about the teams playing in the game, using a templatized sentence taken from the training set:
The
Then, 6 player-specific sentences of the following form are emitted (again adapting a simple sentence from the training set):
The 6 highest-scoring players in the game are used to fill in the above template. Finally, a typical end sentence is emitted:
The
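The overall shape of such a template-based generator can be sketched as follows. The three template strings are hypothetical stand-ins written for this sketch, not the sentences used in the paper, and the input format (lists of dictionaries with `name` and `pts` keys) is likewise an assumption.

```python
from typing import Dict, List

# Hypothetical templates; the paper's actual sentences are not reproduced here.
TEAM_TEMPLATE = "The {winner} defeated the {loser} {winner_pts}-{loser_pts}."
PLAYER_TEMPLATE = "{player} scored {pts} points."
END_TEMPLATE = "Both teams return to action in their next scheduled games."


def generate_summary(teams: List[Dict], players: List[Dict],
                     num_players: int = 6) -> str:
    """Emit one team sentence, one sentence per top scorer, and a closing sentence."""
    winner, loser = sorted(teams, key=lambda t: t["pts"], reverse=True)
    sentences = [TEAM_TEMPLATE.format(winner=winner["name"], loser=loser["name"],
                                      winner_pts=winner["pts"], loser_pts=loser["pts"])]
    # Keep only the highest-scoring players in the game.
    top = sorted(players, key=lambda p: p["pts"], reverse=True)[:num_players]
    sentences += [PLAYER_TEMPLATE.format(player=p["name"], pts=p["pts"]) for p in top]
    sentences.append(END_TEMPLATE)
    return " ".join(sentences)


teams = [{"name": "Team B", "pts": 90}, {"name": "Team A", "pts": 100}]
players = [{"name": f"Player {i}", "pts": 10 * i} for i in range(1, 8)]
summary = generate_summary(teams, players)
```

Because every slot is filled directly from the records, a generator of this kind produces few factual errors by construction, at the cost of fixed and repetitive wording.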
Code implementing all models can be found at https://github.com/harvardnlp/data2text. Our encoder-decoder models are based on OpenNMT (Klein et al., 2017).
Results
We found that all models performed quite poorly on the SBNation data, with the best model achieving a validation perplexity of 33.34 and a BLEU score of 1.78. This poor performance is presumably attributable to the noisy quality of the SBNation data, and the fact that many documents in the dataset focus on information not in the box- and line-scores. Accordingly, we focus on RotoWire in what follows.
The main results for the RotoWire dataset are shown in Table 2, which shows the performance of the models in Section 4 in terms of the metrics defined in Section 3.2, as well as in terms of perplexity and BLEU.
There are several interesting relationships in the development portion of Table 2. First we note that the Template model scores very poorly on BLEU, but does quite well on the extractive metrics, providing an upper-bound for how domain knowledge could help content selection and generation. All the neural models make significant improvements in terms of BLEU score, with the conditional copying with beam search performing the best, even though all the neural models achieve roughly the same perplexity.
The extractive metrics provide further insight into the behavior of the models. We first note the precision that the extractive model reaches on the gold documents $y_{1:T}$ (Table 2), which serves as a reference point. With the Joint Copy model, generations have a lower relation generation (RG) precision, indicating that relationships are often generated incorrectly. The best Conditional Copy system improves on this value, a significant improvement and potentially the cause of the improved BLEU score, but it remains far below gold.
Notably, content selection (CS) and content ordering (CO) seem to have no correlation at all with BLEU. There is some improvement with CS for the conditional model or reconstruction loss, but not much change as we move to beam search. CO actually gets worse as beam search is utilized, possibly a side effect of generating more records (RG#). The fact that these scores are much worse than the simple templated model indicates that further research is needed into better copying alone for content selection and better long term content ordering models.
Test results are consistent with development results, indicating that the Conditional Copy model is most effective at BLEU, RG, and CS, and that reconstruction is quite helpful for improving the joint model.
Human Evaluation
We also undertook two human evaluation studies, using Amazon Mechanical Turk. The first study attempted to determine whether generations considered to be more precise by our metrics were also considered more precise by human raters. To accomplish this, raters were presented with a particular NBA game’s box score and line score, as well as with (randomly selected) sentences from summaries generated by our different models for those games. Raters were then asked to count how many facts in each sentence were supported by records in the box or line scores, and how many were contradicted. We randomly selected 20 distinct games to present to raters, and a total of 20 generated sentences per game were evaluated by raters. The left two columns of Table 3 contain the average numbers of supporting and contradicting facts per sentence as determined by the raters, for each model. We see that these results are generally in line with the RG and CS metrics, with the Conditional Copy model having the highest number of supporting facts, and the reconstruction terms significantly improving the Joint Copy models.
Using a Tukey HSD post-hoc analysis of an ANOVA with the number of contradicting facts as the dependent variable and the generating model and rater id as independent variables, we found significant pairwise differences in contradictory facts between the gold generations and all models except “Copy+Rec+TVD,” as well as a significant difference between “Copy+Rec+TVD” and “Copy”. We similarly found a significant pairwise difference between “Copy+Rec+TVD” and “Copy” for number of supporting facts.
Our second study attempted to determine whether generated summaries differed in terms of how natural their ordering of records (as captured, for instance, by the DLD metric) is. To test this, we presented raters with random summaries generated by our models and asked them to rate the naturalness of the ordering of facts in the summaries on a 1-7 Likert scale. 30 random summaries were used in this experiment, each rated 3 times by distinct raters. The average Likert ratings are shown in the rightmost column of Table 3. While it is encouraging that the gold summaries received a higher average score than the generated summaries (and that the reconstruction term again improved the Joint Copy model), a Tukey HSD analysis similar to the one presented above revealed no significant pairwise differences.
Qualitative Example
Figure 2 shows a document generated by the Conditional Copy model, using a beam of size 5. This particular generation evidently has several nice properties: it nicely learns the colloquial style of the text, correctly using idioms such as “19 percent from deep.” It is also partially accurate in its use of the records; we highlight in blue when it generates text that is licensed by a record in the associated box- and line-scores.
At the same time, the generation also contains major logical errors. First, there are basic copying mistakes, such as flipping the teams’ win/loss records. The system also makes obvious semantic errors; for instance, it generates the phrase “the Rockets were able to out-rebound the Rockets.” Finally, we see the model hallucinates factual statements, such as “in front of their home crowd,” which is presumably likely according to the language model, but ultimately incorrect (and not supported by anything in the box- or line- scores). In practice, our proposed extractive evaluation will pick up on many errors in this passage. For instance, “four assists” is an RG error, repeating the Rockets’ rebounds could manifest in a lower CO score, and incorrectly indicating the win/loss records is a CS error.
Related Work
In this section we note additional related work not noted throughout. Natural language generation has been studied for decades (Kukich, 1983; McKeown, 1992; Reiter and Dale, 1997), and generating summaries of sports games has been a topic of interest for almost as long (Robin, 1994; Tanaka-Ishii et al., 1998; Barzilay and Lapata, 2005).
Historically, research has focused on both content selection (“what to say”) (Kukich, 1983; McKeown, 1992; Reiter and Dale, 1997; Duboue and McKeown, 2003; Barzilay and Lapata, 2005), and surface realization (“how to say it”) (Goldberg et al., 1994; Reiter et al., 2005) with earlier work using (hand-built) grammars, and later work using SMT-like approaches (Wong and Mooney, 2007) or generating from PCFGs (Belz, 2008) or other formalisms (Soricut and Marcu, 2006; White et al., 2007). In the late 2000s and early 2010s, a number of systems were proposed that did both (Liang et al., 2009; Angeli et al., 2010; Kim and Mooney, 2010; Lu and Ng, 2011; Konstas and Lapata, 2013).
Within the world of neural text generation, some recent work has focused on conditioning language models on tables (Yang et al., 2016), and generating short biographies from Wikipedia Tables (Lebret et al., 2016; Chisholm et al., 2017). Mei et al. (2016) use a neural encoder-decoder approach on standard record-based generation datasets, obtaining impressive results, and motivating the need for more challenging NLG problems.
Conclusion and Future Work
This work explores the challenges facing neural data-to-document generation by introducing a new dataset, and proposing various metrics for automatically evaluating content selection, generation, and ordering. We see that recent ideas in copying and reconstruction lead to improvements on this task, but that there is a significant gap even between these neural models and templated systems. We hope to motivate researchers to focus further on generation problems that are relevant both to content selection and surface realization, but may not be reflected clearly in the model’s perplexity.
Future work on this task might include approaches that process or attend to the source records in a more sophisticated way, generation models that attempt to incorporate semantic or reference-related constraints, and approaches to conditioning on facts or records that are not as explicit in the box- and line-scores.
Acknowledgments
We gratefully acknowledge the support of a Google Research Award.
References
Appendix
The RotoWire data covers NBA games played between 1/1/2014 and 3/29/2017; some games have multiple summaries. The summaries have been randomly split into training, validation, and test sets consisting of 3398, 727, and 728 summaries, respectively.
The SBNation data covers NBA games played between 11/3/2006 and 3/26/2017; some games have multiple summaries. The summaries have been randomly split into training, validation, and test sets consisting of 7633, 1635, and 1635 summaries, respectively.
All numbers in the box- and line-scores (but not the summaries) are converted to integers; fractional numbers corresponding to percents are multiplied by 100 to obtain integers in $[0, 100]$. We show the types of records in the data in Table 4.
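The conversion described above amounts to the following small sketch. The text does not say whether fractional values are rounded or truncated, so rounding to the nearest integer is an assumption here, as is the function name.

```python
def to_int_stat(value: float, is_fraction: bool = False) -> int:
    """Convert a box-score number to an integer; fractions become percents."""
    if is_fraction:
        # Assumption: round to the nearest integer percent.
        return int(round(value * 100))
    return int(round(value))


three_point_pct = to_int_stat(0.25, is_fraction=True)
points = to_int_stat(12.0)
```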
B. Generation Model Details
Decoder
As mentioned in the body of the paper, we compute two different attention distributions (i.e., using different parameters) at each decoding step. For the Joint Copy model, one of these attention distributions is not normalized on its own; instead, its scores are normalized jointly with all the output-word probabilities.
For the reconstruction loss, we feed blocks (of size at most 100) of the decoder’s LSTM hidden states through a Kim (2014)-style convolutional model. We use kernels of width 3 and 5, 200 filters, a ReLU nonlinearity, and max-over-time pooling. To create the $p_k$, these now 400-dimensional features are then mapped via an MLP with a ReLU nonlinearity into 3 separate 200-dimensional vectors corresponding to the predicted relation’s entity, value, and type, respectively. These 200-dimensional vectors are then fed through (separate) linear decoders and softmax layers in order to obtain distributions over entities, values, and types. We use $K$ distinct $p_k$.
Models are trained with SGD, a learning rate of 1 (which is divided by 2 every time validation perplexity fails to decrease), and a batch size of 16. We use dropout (at a rate of 0.5) between LSTM layers and before the linear decoder.
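The learning-rate schedule can be sketched as follows. The sketch compares each validation perplexity against the best value seen so far; the text could also be read as comparing against the previous epoch, so this choice, along with the function name and the toy perplexities, is an assumption.

```python
from typing import Tuple


def update_learning_rate(lr: float, val_ppl: float, best_ppl: float,
                         decay: float = 0.5) -> Tuple[float, float]:
    """Halve the learning rate whenever validation perplexity fails to decrease."""
    if val_ppl < best_ppl:
        return lr, val_ppl          # improvement: keep the rate, record new best
    return lr * decay, best_ppl     # no improvement: decay the rate


lr, best = 1.0, float("inf")
history = []
for ppl in [50.0, 40.0, 41.0, 39.0, 39.0]:  # invented validation perplexities
    lr, best = update_learning_rate(lr, ppl, best)
    history.append(lr)
```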