Bootstrapping Generators from Noisy Data
Laura Perez-Beltrachini, Mirella Lapata
Introduction
A core step in statistical data-to-text generation concerns learning correspondences between structured data representations (e.g., facts in a database) and paired texts Barzilay and Lapata (2005); Kim and Mooney (2010); Liang et al. (2009). These correspondences describe how data representations are expressed in natural language (content realisation) but also indicate which subset of the data is verbalised in the text (content selection).
Although content selection is traditionally performed by domain experts, recent advances in generation using neural networks Bahdanau et al. (2015); Ranzato et al. (2016) have led to the use of large scale datasets containing loosely related data and text pairs. Prime examples are online data sources like DBPedia Auer et al. (2007) and Wikipedia and their associated texts, which are often independently edited. Other examples are sports databases and related textual resources. Wiseman et al. (2017) recently defined a generation task relating statistics of basketball games with commentaries and a blog written by fans.
In this paper, we focus on short text generation from such loosely aligned data-text resources. We work with the biographical subset of the DBPedia and Wikipedia resources where the data corresponds to DBPedia facts and texts are Wikipedia abstracts about people. Figure 1 shows an example for the film-maker Robert Flaherty, the Wikipedia infobox, and the corresponding abstract. We wish to bootstrap a data-to-text generator that learns to verbalise properties about an entity from a loosely related example text. Given the set of properties in Figure (1a) and the related text in Figure (1b), we want to learn verbalisations for those properties that are mentioned in the text and produce a short description like the one in Figure (1c).
In common with previous work Mei et al. (2016); Lebret et al. (2016); Wiseman et al. (2017), our model draws on insights from neural machine translation Bahdanau et al. (2015); Sutskever et al. (2014), using an encoder-decoder architecture as its backbone. Lebret et al. (2016) introduce the task of generating biographies from Wikipedia data; however, they focus on single sentence generation. We generalize the task to multi-sentence text, and highlight the limitations of the standard attention mechanism which is often used as a proxy for content selection. When exposed to sub-sequences that do not correspond to any facts in the input, the soft attention mechanism will still try to justify the sequence and somehow distribute the attention weights over the input representation Ghader and Monz (2017). The decoder will still memorise high frequency sub-sequences in spite of these not being supported by any facts in the input.
We propose to alleviate these shortcomings via a specific content selection mechanism based on multi-instance learning (MIL; Keeler and Rumelhart, 1992) which automatically discovers correspondences, namely alignments, between data and text pairs. These alignments are then used to modify the generation function during training. We experiment with two frameworks that allow us to incorporate alignment information, namely multi-task learning (MTL; Caruana, 1993) and reinforcement learning (RL; Williams, 1992). In both cases we define novel objective functions using the learnt alignments. Experimental results using automatic and human-based evaluation show that models trained with content-specific objectives improve upon vanilla encoder-decoder architectures which rely solely on soft attention.
The remainder of this paper is organised as follows. We discuss related work in Section 2 and describe the MIL-based content selection approach in Section 3. We explain how the generator is trained in Section 4 and present evaluation experiments in Section 5. Section 7 concludes the paper.
Related Work
Previous attempts to exploit loosely aligned data and text corpora have mostly focused on extracting verbalisation spans for data units. Most approaches work in two stages: initially, data units are aligned with sentences from related corpora using some heuristics and subsequently extra content is discarded in order to retain only text spans verbalising the data. Belz and Kow (2010) obtain verbalisation spans using a measure of strength of association between data units and words, Walter et al. (2013) extract textual patterns from paths in dependency trees, while Mrabet et al. (2016) rely on crowd-sourcing. Perez-Beltrachini and Gardent (2016) learn shared representations for data units and sentences reduced to subject-predicate-object triples with the aim of extracting verbalisations for knowledge base properties. Our work takes a step further: we not only induce data-to-text alignments but also learn generators that produce short texts verbalising a set of facts.
Our work is closest to recent neural network models which learn generators from independently edited data and text resources. Most previous work Lebret et al. (2016); Chisholm et al. (2017); Sha et al. (2017); Liu et al. (2017) targets the generation of single sentence biographies from Wikipedia infoboxes, while Wiseman et al. (2017) generate game summary documents from a database of basketball games where the input is always the same set of table fields. In contrast, in our scenario, the input data varies from one entity (e.g., athlete) to another (e.g., scientist) and properties might be present or not due to data incompleteness. Moreover, our generator is enhanced with a content selection mechanism based on multi-instance learning. MIL-based techniques have been previously applied to a variety of problems including image retrieval Maron and Ratan (1998); Zhang et al. (2002), object detection Carbonetto et al. (2008); Cour et al. (2011), text classification Andrews and Hofmann (2004), image captioning Wu et al. (2015); Karpathy and Fei-Fei (2015), paraphrase detection Xu et al. (2014), and information extraction Hoffmann et al. (2011). The application of MIL to content selection is novel to our knowledge.
We show how to incorporate content selection into encoder-decoder architectures following training regimes based on multi-task learning and reinforcement learning. Multi-task learning aims to improve a main task by incorporating joint learning of one or more related auxiliary tasks. It has been applied with success to a variety of sequence-prediction tasks focusing mostly on morphosyntax. Examples include chunking, tagging Collobert et al. (2011); Søgaard and Goldberg (2016); Bjerva et al. (2016); Plank (2016), name error detection Cheng et al. (2015), and machine translation Luong et al. (2016). Reinforcement learning Williams (1992) has also seen popularity as a means of training neural networks to directly optimize a task-specific metric Ranzato et al. (2016) or to inject task-specific knowledge Zhang and Lapata (2017). We are not aware of any work that compares the two training methods directly. Furthermore, our reinforcement learning-based algorithm differs from previous text generation approaches Ranzato et al. (2016); Zhang and Lapata (2017) in that it is applied to documents rather than individual sentences.
Bidirectional Content Selection
We consider loosely coupled data and text pairs where the data component is a set of property-values and the related text is a sequence of sentences. We define a mention span as a (possibly discontinuous) subsequence of the text containing one or several words that verbalise one or more property-values from the data component. For instance, in Figure 1, the mention span “married to Frances H. Flaherty” verbalises one of the property-values in the infobox.
In traditional supervised data to text generation tasks, data units (e.g., property-value pairs in our particular setting) are either covered by some mention span or do not have any mention span at all in the text. The latter is a case of content selection where the generator will learn which properties to ignore when generating text from such data. In this work, we consider text components which are independently edited, and will unavoidably contain unaligned spans, i.e., text segments which do not correspond to any property-value in the data. The phrase “from 1914” in the text in Figure (1b) is such an example. Similarly, the last sentence talks about Frances’ awards and nominations, and this information is not supported by the properties either.
Our model checks content in both directions; it identifies which properties have a corresponding text span (data selection) and also foregrounds (un)aligned text spans (text selection). This knowledge is then used to discourage the generator from producing text not supported by facts in the property set. We view a property set and its loosely coupled text as a coarse level, imperfect alignment. From this alignment signal, we want to discover a set of finer grained alignments indicating which mention spans in the text align to which properties in the property set. For each data-text pair, we learn an alignment set which contains property-value word pairs. For example, for two of the properties in Figure 1, we would like to derive the alignments in Table 1.
We formulate the task of discovering finer-grained word alignments as a multi-instance learning problem Keeler and Rumelhart (1992). We assume that words from the text are positive labels for some property-values but we do not know which ones. For each data-text pair, we derive one property-set and sentence pair for every sentence in the text. We encode property sets and sentences into a common multi-modal embedding space. While doing this, we discover finer grained alignments between words and property-values. The intuition is that by learning a high similarity score for a property set and sentence pair, we will also learn the contribution of individual elements (i.e., words and property-values) to the overall similarity score. We will then use this individual contribution as a measure of word and property-value alignment. More concretely, we assume a word and property-value pair is aligned (or unaligned) if this individual score is above (or below) a given threshold. Across examples like the one shown in Figure (1a-b), we expect the model to learn an alignment between the text span “married to Frances H. Flaherty” and the corresponding property-value.
In what follows we describe how we encode property-set and sentence pairs and define the similarity function.
Property Set Encoder
As there is no fixed order among the property-value pairs in the property set, we individually encode each one of them. Furthermore, both properties and values may consist of short phrases (see Figure 1 for examples). We therefore consider property-value pairs as concatenated sequences and use a bidirectional Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber, 1997) network for their encoding. Note that the same network is used for all pairs. Each property-value pair is encoded into a vector representation, which is the output of the recurrent network at the final time step. We use addition to combine the forward and backward outputs and generate the encodings for the property-value pairs in the property set.
Sentence Encoder
We also use a biLSTM to obtain a representation for the sentence. Each word is represented by the concatenation of the forward and backward outputs of the networks at the time step corresponding to its position, and each sentence is encoded as the resulting sequence of vectors.
Alignment Objective
Our learning objective seeks to maximise the similarity score between a property set and a sentence Karpathy and Fei-Fei (2015). This similarity score is in turn defined on top of the similarity scores among the property-values in the property set and the words in the sentence. Equation (3) defines this similarity function using the dot product. The function seeks to align each word to the best scoring property-value.
Equation (4) defines our objective, which encourages related properties and sentences to have higher similarity than other, unrelated property sets and sentences.
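The similarity function and ranking objective described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the plain-list vectors, the hinge form of the loss and the margin of 1.0 are all assumptions made for the example.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity(props: Sequence[Vector], words: Sequence[Vector]) -> float:
    """Align each word to its best-scoring property-value and sum the scores."""
    return sum(max(dot(p, w) for p in props) for w in words)


def ranking_loss(props, words, other_props, other_words, margin: float = 1.0) -> float:
    """Hinge-style ranking loss (assumed form, margin is hypothetical).

    The related pair (props, words) should score higher, by at least
    `margin`, than pairs built with an unrelated sentence or property set.
    """
    positive = similarity(props, words)
    wrong_sentence = max(0.0, similarity(props, other_words) - positive + margin)
    wrong_props = max(0.0, similarity(other_props, words) - positive + margin)
    return wrong_sentence + wrong_props
```

The per-word maximum inside `similarity` is what later makes it possible to read off word-level alignments from a model trained only with pair-level supervision.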
Generator Training
In this section we describe the base generation architecture and explain two alternative ways of using the alignments to guide the training of the model. One approach follows multi-task training where the generator learns to output a sequence of words but also to predict alignment labels for each word. The second approach relies on reinforcement learning for adjusting the probability distribution of word sequences learnt by a standard word prediction training algorithm.
We follow a standard attention based encoder-decoder architecture for our generator Bahdanau et al. (2015); Luong et al. (2015). Given a set of properties as input, the model learns to predict an output word sequence which is a verbalisation of (part of) the input. More precisely, the generation of the output sequence is conditioned on the input.
The encoder module constitutes an intermediate representation of the input. For this, we use the property-set encoder described in Section 3 which outputs vector representations for a set of property-value pairs. The decoder uses an LSTM and a soft attention mechanism Luong et al. (2015) to generate one word at a time conditioned on the previous output words and a context vector dynamically created:
The dynamic context vector is the weighted sum of the hidden states of the input property set (Equation (9)); and the weights are determined by a dot product attention mechanism:
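The dot product attention and context vector described above can be illustrated as follows. This is a sketch under stated assumptions: `attention_context` is a hypothetical helper, vectors are plain Python lists, and the decoder state and encoded property-values are assumed to have the same dimensionality.

```python
import math


def attention_context(decoder_state, encoded_props):
    """Return (attention weights, context vector) for one decoding step.

    Weights are a softmax over the dot products between the decoder state
    and each encoded property-value; the context vector is the weighted
    sum of the encoded property-values.
    """
    scores = [sum(h * p for h, p in zip(decoder_state, enc)) for enc in encoded_props]
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoded_props[0])
    context = [sum(w * enc[d] for w, enc in zip(weights, encoded_props)) for d in range(dim)]
    return weights, context
```

Because the softmax always produces weights that sum to one, some attention mass is assigned to the input even when no property supports the word being generated, which is the limitation of soft attention discussed in the introduction.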
We initialise the decoder with the averaged sum of the encoded input representations Vinyals et al. (2016). The model is trained to optimize negative log likelihood:
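For reference, the standard negative log-likelihood of an autoregressive decoder can be written as below. The notation is assumed for this sketch and is not taken from the original equations: X is the input property set, Y = y_1 ... y_{|Y|} is the target word sequence, and the subscript NLL is our own label.

```latex
\begin{align}
p(Y \mid X) &= \prod_{t=1}^{|Y|} p(y_t \mid y_{1:t-1}, X) \\
\mathcal{L}_{\mathrm{NLL}} &= -\sum_{t=1}^{|Y|} \log p(y_t \mid y_{1:t-1}, X)
\end{align}
```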
We extend this architecture to multi-sentence texts in a way similar to Wiseman et al. (2017). We view the abstract as a single sequence, i.e., all sentences are concatenated. When training, we cut the abstracts into blocks of equal size and perform forward-backward iterations for each block (this includes the back-propagation through the encoder). From one block iteration to the next, we initialise the decoder with the last state of the previous block. The block size is a hyperparameter tuned experimentally on the development set.
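The block segmentation can be sketched as below. The helper name is hypothetical, and letting the final block be shorter than the others is an assumption of this example, since the text does not say how a trailing remainder is handled.

```python
def split_into_blocks(tokens, block_size):
    """Cut a concatenated abstract into consecutive blocks of `block_size` tokens.

    The last block may be shorter (an assumption of this sketch).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
```

During training, each block would be processed in turn, with the decoder state at the end of one block used to initialise the decoder for the next.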
Predicting Alignment Labels
The generation of the output sequence is conditioned on the previous words and the input. However, when certain sequences are very common, the language modelling conditional probability will prevail over the input conditioning. For instance, the phrase “from 1914” in our running example is very common in contexts that talk about periods of marriage or club membership, and as a result, the language model will output this phrase often, even in cases where there are no supporting facts in the input. The intuition behind multi-task training Caruana (1993) is that it will smooth the probabilities of frequent sequences when trying to simultaneously predict alignment labels.
Using the set of alignments obtained by our content selection model, we associate each word in the training data with a binary label indicating whether it aligns with some property in the input set. Our auxiliary task is to predict this label given the sequence of previously predicted words and the input. The combined multi-task objective is the weighted sum of both word prediction and alignment prediction losses, where a weighting coefficient controls how much model training will focus on each task. As we will explain in Section 5, we can anneal this value during training in favour of one objective or the other.
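A minimal sketch of the auxiliary labels and the weighted objective follows. Several details are assumptions made for illustration: labels are derived by simple word membership in a set of aligned words, the alignment loss is a binary cross-entropy, and the coefficient `lam` weights the word loss while `1 - lam` weights the alignment loss.

```python
import math


def alignment_labels(words, aligned_words):
    """Binary label per word: 1 if the word aligns with some property, else 0."""
    return [1 if w in aligned_words else 0 for w in words]


def multitask_loss(word_probs, align_probs, labels, lam):
    """Weighted sum of word prediction and alignment prediction losses.

    word_probs:  probability assigned to each target word
    align_probs: predicted probability that each word is aligned
    labels:      gold binary alignment labels
    lam:         weight on the word loss (assumed convention)
    """
    word_nll = -sum(math.log(p) for p in word_probs)
    align_nll = -sum(
        math.log(q if a == 1 else 1.0 - q) for q, a in zip(align_probs, labels)
    )
    return lam * word_nll + (1.0 - lam) * align_nll
```

Annealing then amounts to changing `lam` between epochs so that training emphasises one loss or the other.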
Reinforcement Learning Training
Although the multi-task approach aims to smooth the target distribution, the training process is still driven by the imperfect target text. In other words, at each time step the algorithm feeds the previous word of the target text and evaluates the prediction against the target word at that step.
Alternatively, we propose a training approach based on reinforcement learning (Williams, 1992) which allows us to define an objective function that does not fully rely on the target text but rather on a revised version of it. In our case, the set of alignments obtained by our content selection model provides a revision for the target text. The advantages of reinforcement learning are twofold: (a) it allows us to exploit additional task-specific knowledge Zhang and Lapata (2017) during training, and (b) it enables the exploration of other word sequences through sampling. Our setting differs from previous applications of RL Ranzato et al. (2016); Zhang and Lapata (2017) in that the reward function is not computed on the target text but rather on its alignments with the input.
The encoder-decoder model is viewed as an agent whose action space is defined by the set of words in the target vocabulary. At each time step, the encoder-decoder takes an action (i.e., emits a word) according to the policy defined by the probability in Equation (6). The agent terminates when it emits the End Of Sequence (EOS) token, at which point the sequence of all actions taken yields the output sequence. This sequence in our task is a short text describing the properties of a given entity. After producing the sequence of actions, the agent receives a reward and the policy is updated according to this reward.
We define the reward function on the alignment set. If the output action sequence is precise with respect to the set of alignments, the agent will receive a high reward. Concretely, the reward is the unigram precision of the predicted sequence with respect to the set of words in the alignment set, adjusted by a scaling factor.
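The reward can be sketched as a scaled unigram precision. The function name and the `scale` parameter are hypothetical; the optional penalty mirrors the adjustment mentioned in the pre-processing description (0.025 for YEAR and NUMERIC tokens), and applying it once per occurrence is an assumption of this example.

```python
def reward(predicted, aligned_words, scale=1.0, penalty=0.025,
           penalised_tokens=("YEAR", "NUMERIC")):
    """Scaled unigram precision of `predicted` against the aligned words.

    predicted:     list of generated tokens
    aligned_words: set of words that the content aligner linked to the input
    scale:         hypothetical scaling factor for the precision
    penalty:       amount subtracted per penalised token (assumed per occurrence)
    """
    if not predicted:
        return 0.0
    hits = sum(1 for w in predicted if w in aligned_words)
    precision = hits / len(predicted)
    n_penalised = sum(1 for w in predicted if w in penalised_tokens)
    return scale * precision - penalty * n_penalised
```

A sequence made up only of aligned words receives the maximum reward, while words with no support in the input lower the precision and therefore the reward.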
Training Algorithm
We use the REINFORCE algorithm Williams (1992) to learn an agent that maximises the reward function. As this is a gradient descent method, the training loss of a sequence is defined as the negative expected reward.
Here the agent’s policy is the word distribution produced by the encoder-decoder model (Equation (6)) and the reward function is as defined in Equation (16). The gradient of this loss involves a baseline linear regression model used to reduce the variance of the gradients during training. The baseline predicts the future reward and is trained by minimizing mean squared error. The input to this predictor is the agent’s hidden state; however, we do not back-propagate the error to it. We refer the interested reader to Williams (1992) and Ranzato et al. (2016) for more details.
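In the usual REINFORCE formulation with a baseline, the loss and its gradient estimate take the form below. The symbols are assumed for this sketch rather than recovered from the original equations: \pi is the agent's policy, \hat{y}_{1:T} the sampled output sequence, r the reward, and b_t the baseline prediction at step t.

```latex
\begin{align}
\mathcal{L}_{\mathrm{RL}} &= -\,\mathbb{E}_{\hat{y}_{1:T} \sim \pi(\cdot \mid X)}\big[\, r(\hat{y}_{1:T}) \,\big] \\
\nabla \mathcal{L}_{\mathrm{RL}} &\approx -\sum_{t=1}^{T} \nabla \log \pi(\hat{y}_t \mid \hat{y}_{1:t-1}, X)\,\big[\, r(\hat{y}_{1:T}) - b_t \,\big]
\end{align}
```

Subtracting b_t leaves the expected gradient unchanged but lowers its variance, which is why the baseline is trained separately and its error is not propagated into the agent.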
Document Level Curriculum Learning
Rather than starting from a state given by a random policy, we initialise the agent with a policy learnt by pre-training with the negative log-likelihood objective Ranzato et al. (2016); Zhang and Lapata (2017). The reinforcement learning objective is applied gradually in combination with the log-likelihood objective on each target block subsequence. Recall from Section 4.1 that our document is segmented into blocks of equal size during training, which we denote as MaxBlock. When training begins, only the last few tokens of each block are predicted by the agent, while for the preceding tokens we still use the negative log-likelihood objective. The number of tokens predicted by the agent is incremented by a fixed number of units every 2 epochs until training ends. Since we evaluate the model’s predictions at the block level, the reward function is also evaluated at the block level.
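The curriculum can be sketched as a schedule that decides how many trailing tokens of a block are handled by the agent at a given epoch. The names `agent_token_count` and `split_block`, the zero-based epoch count, and starting with exactly one increment's worth of tokens are assumptions of this example; the step value used in the tests is purely illustrative.

```python
def agent_token_count(epoch, step, max_block, epochs_per_increment=2):
    """Number of trailing tokens per block trained with the RL objective.

    Starts at `step` tokens (assumed) and grows by `step` every
    `epochs_per_increment` epochs, capped at the block size.
    """
    increments = 1 + epoch // epochs_per_increment
    return min(max_block, step * increments)


def split_block(block, n_agent):
    """Split a block into an NLL-trained prefix and an agent-predicted suffix."""
    cut = len(block) - min(n_agent, len(block))
    return block[:cut], block[cut:]
```

As the epoch number grows, the prefix trained with negative log-likelihood shrinks and the suffix sampled from the agent's policy grows.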
Experimental Setup
We evaluated our model on a dataset collated from WikiBio Lebret et al. (2016), a corpus of 728,321 biography articles (their first paragraph) and their infoboxes sampled from the English Wikipedia. We adapted the original dataset in three ways. Firstly, we make use of the entire abstract rather than the first sentence only. Secondly, we reduced the dataset to examples with a rich set of properties and multi-sentential text. We eliminated examples with fewer than six property-value pairs and abstracts consisting of one sentence. We also placed a minimum restriction of 23 words on the length of the abstract. We considered abstracts up to a maximum of 12 sentences and property sets with a maximum of 50 property-value pairs. Finally, we associated each abstract with the set of DBPedia properties corresponding to the abstract’s main entity. As entity classification is available in DBPedia for most entities, we concatenate class information (whenever available) with the property value. In Figure 1, for example, a property value is extended with class information from the DBPedia ontology.
Pre-processing
Numeric date formats were converted to a surface form with month names. Numerical expressions were delexicalised using different tokens created with the property name and position of the delexicalised token in the value sequence. For instance, given the property-value for birth date in Figure (1a), the first sentence in the abstract (Figure (1b)) becomes “ Robert Joseph Flaherty, (February DLX_birth_date_2, DLX_birth_date_4 – July … ”. Years and numbers in the text not found in the values of the property set were replaced with tokens YEAR and NUMERIC. (We exploit these tokens to further adjust the score of the reward function given by Equation (16): each time the predicted output contains some of these symbols, we decrease the reward score by a penalty which we empirically set to 0.025.) In a second phase, when creating the input and output vocabularies, we delexicalised words which were absent from the output vocabulary but were attested in the input vocabulary. Again, we created tokens based on the property name and the position of the word in the value sequence. Words found in neither vocabulary were replaced with the symbol UNK. Vocabulary sizes were limited for both the alignment model and the generator. We discarded examples where the text contained more than three UNKs (for the content aligner) and five UNKs (for the generator); or more than two UNKs in the property-value (for generation). Finally, we added the empty relation to the property sets.
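The first delexicalisation phase can be sketched as follows. The helper is hypothetical and makes several assumptions: positions are 1-based over the tokenised value (punctuation included), only tokens containing digits are replaced, and the birth date value in the usage example is illustrative.

```python
def delexicalise(text_tokens, prop_name, value_tokens):
    """Replace numeric tokens of a property value with DLX_<property>_<position>.

    text_tokens:  tokenised sentence from the abstract
    prop_name:    property name used in the placeholder, e.g. "birth_date"
    value_tokens: tokenised property value, e.g. ["February", "16", ",", "1884"]
    """
    placeholder = {}
    for position, token in enumerate(value_tokens, start=1):
        if any(ch.isdigit() for ch in token):
            # Keep the first position if the same token occurs twice.
            placeholder.setdefault(token, f"DLX_{prop_name}_{position}")
    return [placeholder.get(token, token) for token in text_tokens]
```

With the value tokens `["February", "16", ",", "1884"]`, the day and the year map to positions 2 and 4, which matches the token names in the example sentence above.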
Table 2 summarises the dataset statistics for the generator. We report the number of abstracts in the dataset (size), the average number of sentences and tokens in the abstracts, and the average number of properties and sentence length in tokens (sent.len). For the content aligner (cf. Section 3), each sentence constitutes a training instance, and as a result the sizes of the train and development sets are 796,446 and 153,096, respectively.
Training Configuration
We adjusted all models’ hyperparameters according to their performance on the development set. The encoders for both content selection and generation models were initialised with GloVe Pennington et al. (2014) pre-trained vectors. The input and hidden unit dimension was set to 200 for content selection and 100 for generation. In all models, we used encoder biLSTMs and decoder LSTM (regularised with a dropout rate of 0.3 Zaremba et al. (2014)) with one layer. Content selection and generation models (base encoder-decoder and MTL) were trained for 20 epochs with the ADAM optimiser Kingma and Ba (2014) using a learning rate of 0.001. The reinforcement learning model was initialised with the base encoder-decoder model and trained for 35 additional epochs with stochastic gradient descent and a fixed learning rate of 0.001. Block sizes were set to 40 (base), 60 (MTL) and 50 (RL). Weights for the MTL objective were also tuned experimentally; we used one weight setting for the first four epochs (training focuses on alignment prediction) and switched to another for the remaining epochs.
Content Alignment
We optimized content alignment on the development set against manual alignments. Specifically, two annotators aligned 132 sentences to their infoboxes. We used the Yawat annotation tool Germann (2008) and followed the alignment guidelines (and evaluation metrics) used in Cohn et al. (2008). The inter-annotator agreement using macro-averaged f-score was 0.72 (we treated one annotator as the reference and the other one as hypothetical system output).
Alignment sets were extracted from the model’s output (cf. Section 3) by optimizing a threshold on the similarity between the set of property values and words. The threshold is computed from the mean and standard deviation of the scores across the development set, with a coefficient empirically set to 0.75. Each word was aligned to a property-value if their similarity exceeded a threshold of 0.22. Our best content alignment model (Content-Aligner) obtained an f-score of 0.36 on the development set.
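Alignment extraction by thresholding can be sketched as below. The exact threshold formula is not recoverable from the text, so computing it as the mean plus a coefficient times the (population) standard deviation is an assumption of this sketch, as are the function names and the dictionary representation of the scores.

```python
import statistics


def score_threshold(scores, a=0.75):
    """Assumed form of the threshold: mean + a * standard deviation."""
    return statistics.mean(scores) + a * statistics.pstdev(scores)


def extract_alignments(similarities, threshold):
    """Keep (word, property-value) pairs whose similarity exceeds the threshold.

    similarities: dict mapping (word, property_value) to a similarity score
    """
    return {pair for pair, score in similarities.items() if score > threshold}
```

For instance, calling `extract_alignments` with the threshold of 0.22 reported above keeps only the pairs that score strictly higher than 0.22.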
We also compared our Content-Aligner against a baseline based on pre-trained word embeddings (EmbeddingsBL). For each property set and sentence pair, we computed the dot product between words in the sentence and properties in the property set (properties were represented by the averaged sum of their words’ vectors). Words were aligned to property-values if their similarity exceeded a threshold of 0.4. EmbeddingsBL obtained an f-score of 0.057 against the manual alignments. Finally, we compared the performance of the Content-Aligner at the level of property set and sentence similarity by comparing the average ranking position of correct pairs among 14 distractors, namely rank@15. The Content-Aligner obtained a rank of 1.31, while the EmbeddingsBL model had a rank of 7.99 (lower is better).
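The rank@15 measure can be sketched as the average position of the correct pair when it is scored together with its distractors. The function names are hypothetical, and ranking a tie in favour of the correct pair is an assumption of this example.

```python
def rank_of_correct(correct_score, distractor_scores):
    """1-based rank of the correct pair among its distractors (higher score is better)."""
    return 1 + sum(1 for s in distractor_scores if s > correct_score)


def mean_rank(cases):
    """Average rank over (correct_score, distractor_scores) test cases."""
    ranks = [rank_of_correct(correct, distractors) for correct, distractors in cases]
    return sum(ranks) / len(ranks)
```

With 14 distractors per case the rank ranges from 1 to 15, so an average close to 1 means the correct sentence is almost always scored highest for its property set.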
Results
We compared the performance of an encoder-decoder model trained with the standard negative log-likelihood method (ED), against a model trained with multi-task learning (EDMTL) and reinforcement learning (EDRL). We also included a template baseline system (Templ) in our evaluation experiments.
The template generator used hand-written rules to realise property-value pairs. As an approximation for content selection, we obtained the 50 most frequent property names from the training set and manually defined content ordering rules with the following criteria. We ordered personal life properties based on their most common order of mention in the Wikipedia abstracts. Profession dependent properties were assigned an equal ordering but posterior to the personal properties. We manually lexicalised properties into single sentence templates to be concatenated to produce the final text. For example, one template and its verbalisation for the entity zanetti are “NAME played as POSITION.” and “Zanetti played as defender.” respectively.
Table 3 shows the results of automatic evaluation using BLEU-4 Papineni et al. (2002) against the noisy Wikipedia abstracts. Considering these as a gold standard is, however, not entirely satisfactory for two reasons. Firstly, our models generate considerably shorter text and will be penalized for not generating text they were not supposed to generate in the first place. Secondly, the model might try to re-produce what is in the imperfect reference but not supported by the input properties and as a result will be rewarded when it should not. To alleviate this, we crowd-sourced using AMT a revised version of 200 randomly selected abstracts from the test set. (Recently, a metric that automatically addresses the imperfect target texts was proposed in Dhingra et al. (2019).)
Crowdworkers were shown a Wikipedia infobox with the accompanying abstract and were asked to adjust the text to the content present in the infobox. Annotators were instructed to delete spans which did not have supporting facts and rewrite the remaining parts into a well-formed text. We collected three revised versions for each abstract. Inter-annotator agreement was 81.64 measured as the mean pairwise BLEU-4 amongst AMT workers.
Automatic evaluation results against the revised abstracts are also shown in Table 3. As can be seen, all encoder-decoder based models have a significant advantage over Templ when evaluating against both types of abstracts. The model enabled with the multi-task learning content selection mechanism brings an improvement of 1.29 BLEU-4 over a vanilla encoder-decoder model. Performance of the RL trained model is inferior and close to the ED model. We discuss the reasons for this discrepancy shortly.
To provide a rough comparison with the results reported in Lebret et al. (2016), we also computed BLEU-4 on the first sentence of the text generated by our system (we post-processed system output with Stanford CoreNLP Manning et al. (2014) to extract the first sentence). Recall that their model generates the first sentence of the abstract, whereas we output multi-sentence text. Using the first sentence in the Wikipedia abstract as reference, we obtained a score of 37.29% (ED), 38.42% (EDMTL) and 38.1% (EDRL), which compare favourably with their best performing model (34.7% ± 0.36).
Human-Based Evaluation
We further examined differences among systems in a human-based evaluation study. Using AMT, we elicited 3 judgements for the same 200 infobox-abstract pairs we used in the abstract revision study. We compared the output of the templates, the three neural generators and also included one of the human edited abstracts as a gold standard (reference). For each test case, we showed crowdworkers the Wikipedia infobox and five short texts in random order. The annotators were asked to rank each of the texts according to the following criteria: (1) Is the text faithful to the content of the table? and (2) Is the text overall comprehensible and fluent? Ties were allowed only when texts were identical strings. Table 5 presents examples of the texts (and properties) crowdworkers saw.
Table 4 shows, proportionally, how often crowdworkers ranked each system first, second, and so on. Unsurprisingly, the human authored gold text is considered best (and ranked first 47% of the time). EDMTL is mostly ranked second and third best, followed closely by EDRL. The vanilla encoder-decoder system ED is mostly fourth and Templ is fifth. As shown in the last column of the table (Rank), the ranking of EDMTL is overall slightly better than EDRL. We further converted the ranks to ratings on a scale of 1 to 5 (assigning ratings 5…1 to rank placements 1…5). This allowed us to perform Analysis of Variance (ANOVA), which revealed a reliable effect of system type. Post-hoc Tukey tests showed that all systems were significantly worse than RevAbs (the human-revised reference abstracts) and significantly better than Templ (p < 0.05). EDMTL is not significantly better than EDRL but is significantly (p < 0.05) different from ED.
Discussion
The texts generated by EDRL are shorter compared to the other two neural systems which might affect BLEU-4 scores and also the ratings provided by the annotators. As shown in Table 5 (entity dorsey burnette), EDRL drops information pertaining to dates or chooses to just verbalise birth place information. In some cases, this is preferable to hallucinating incorrect facts; however, in other cases outputs with more information are rated more favourably. Overall, EDMTL seems to be more detail oriented and faithful to the facts included in the infobox (see dorsey burnette, aaron moores, or kirill moryganov). The template system manages in some specific configurations to verbalise appropriate facts (indrani bose), however, it often fails to verbalise infrequent properties (aaron moores) or focuses on properties which are very frequent in the knowledge base but are rarely found in the abstracts (kirill moryganov).
Conclusions
In this paper we focused on the task of bootstrapping generators from large-scale datasets consisting of DBPedia facts and related Wikipedia biography abstracts. We proposed to equip standard encoder-decoder models with an additional content selection mechanism based on multi-instance learning and developed two training regimes, one based on multi-task learning and the other on reinforcement learning. Overall, we find that the proposed content selection mechanism improves the accuracy and fluency of the generated texts. In the future, it would be interesting to investigate a more sophisticated representation of the input Vinyals et al. (2016). It would also make sense for the model to decode hierarchically, taking sequences of words and sentences into account Zhang and Lapata (2014); Lebret et al. (2015).
Acknowledgments
We thank the NAACL reviewers for their constructive feedback. We also thank Xingxing Zhang, Li Dong and Stefanos Angelidis for useful discussions about implementation details. We gratefully acknowledge the financial support of the European Research Council (award number 681760).