Findings of the E2E NLG Challenge
Ondřej Dušek, Jekaterina Novikova, Verena Rieser
Introduction
This paper summarises the first shared task on end-to-end (E2E) natural language generation (NLG) in spoken dialogue systems (SDSs). Shared tasks have become an established way of pushing research boundaries in the field of natural language processing, with NLG benchmarking tasks running since 2007 (Belz and Gatt, 2007). This task is novel in that it poses new challenges for recent end-to-end, data-driven NLG systems for SDSs, which jointly learn sentence planning and surface realisation and do not require costly semantic alignment between meaning representations (MRs) and the corresponding natural language reference texts (e.g. Dušek and Jurčíček, 2015; Wen et al., 2015b; Mei et al., 2016; Wen et al., 2016; Sharma et al., 2016; Dušek and Jurčíček, 2016a; Lampouras and Vlachos, 2016). Note that, as opposed to the “classical” definition of NLG (Reiter and Dale, 2000; Gatt and Krahmer, 2018), generation for dialogue systems does not involve content selection, and its sentence planning stage may be less complex. So far, end-to-end approaches to NLG have been limited to small, delexicalised datasets, e.g. BAGEL (Mairesse et al., 2010), SF Hotels/Restaurants (Wen et al., 2015b), or RoboCup (Chen and Mooney, 2008), whereas the E2E shared task is based on a new crowdsourced dataset of 50k instances in the restaurant domain, which is about 10 times larger and also more complex than previous datasets. For the shared challenge, we received 62 system submissions by 17 institutions from 11 countries, with about 1/3 of these submissions coming from industry. We assess the submitted systems by comparing them to a challenging baseline using automatic as well as human evaluation. We consider this level of participation an unexpected success, which underlines the timeliness of this task. In comparison, the well-established Conference on Machine Translation WMT’17 (running since 2006) received submissions from 31 institutions to a total of 8 tasks (Bojar et al., 2017a).
While there are previous studies comparing a limited number of end-to-end NLG approaches (Novikova et al., 2017a; Wiseman et al., 2017; Gardent et al., 2017), this is the first research to evaluate novel end-to-end generation at scale and using human assessment.
The E2E NLG dataset
In order to maximise the chances for data-driven end-to-end systems to produce high-quality output, we aim to provide training data of high quality and in large quantity. To collect data in large enough quantity, we use crowdsourcing with automatic quality checks. We use MRs consisting of an unordered set of attributes and their values and collect multiple corresponding natural language texts (references): utterances consisting of one or several sentences. An example MR-reference pair is shown in Figure 1; Table 1 lists all the attributes in our domain.
In contrast to previous work (Mairesse et al., 2010; Wen et al., 2015a; Dušek and Jurčíček, 2016), we use different modalities of meaning representation for data collection: textual/logical and pictorial MRs. The textual/logical MRs (see Figure 1) take the form of a sequence with attribute-value pairs provided in a random order. The pictorial MRs (see Figure 2) are semi-automatically generated pictures with a combination of icons corresponding to the appropriate attributes. The icons are located on a background showing a map of a city, which makes it possible to represent the meaning of the attributes area and near (cf. Table 1).
In a pre-study (Novikova et al., 2016), we showed that pictorial MRs provide similar collection speed and utterance length, but are less likely to prime the crowd workers in their lexical choices. Utterances produced using pictorial MRs were considered to be more informative, natural and better phrased. However, while pictorial MRs provide more variety in the utterances, this also introduces noise. Therefore, we decided to use pictorial MRs to collect 20% of the dataset.
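The textual MRs described above can be read into an unordered mapping. The following is a minimal sketch, assuming the `attribute[value]` bracket notation of the released data files (e.g. `name[The Eagle], area[riverside]`); the function name `parse_mr` and the example values are illustrative and not part of the challenge tooling.

```python
import re

# Matches one "attribute[value]" pair; attribute names may contain spaces
# (e.g. "customer rating"), values may contain anything except "]".
_PAIR = re.compile(r"\s*([^,\[\]]+?)\s*\[([^\]]*)\]")


def parse_mr(mr_text):
    """Parse a textual MR into a dict mapping attribute -> value.

    The order of the pairs in the input is irrelevant, which mirrors
    the "unordered set of attributes" view of an MR.
    """
    return dict(_PAIR.findall(mr_text))


if __name__ == "__main__":
    print(parse_mr("name[The Eagle], area[riverside], near[Burger King]"))
```

Because the result is a dictionary, two MRs that list the same pairs in a different order compare as equal.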
Our crowd workers were asked to verbalise all information from the MR; however, they were not penalised for skipping an attribute. This makes the dataset more challenging, as NLG systems need to account for noise in training data. On the other hand, the systems are helped by having multiple human references per MR at their disposal.
Data Statistics
The resulting dataset (Novikova et al., 2017b) contains over 50k references for 6k distinct MRs (cf. Table 2), which is 10 times bigger than previous sets in comparable domains (BAGEL, SF Hotels/Restaurants, RoboCup). The dataset contains more human references per MR (8.27 on average), which should make it more suitable for data-driven approaches. However, it is also more challenging as it uses a larger number of sentences in references (up to 6 compared to 1–2 in other sets) and more attributes in MRs.
For the E2E challenge, we split the data into training, development and test sets (in a roughly 82-9-9 ratio). MRs in the test set are all previously unseen, i.e. none of them overlaps with training/development sets, even if restaurant names are removed. MRs for the test set were only released to participants two weeks before the challenge submission deadline on October 31, 2017. Participants had no access to test reference texts. The whole dataset is now freely available at the E2E NLG Challenge website at:
http://www.macs.hw.ac.uk/InteractionLab/E2E/
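The "previously unseen" property of the test MRs can be checked mechanically. The sketch below is an illustration only: it assumes MRs are already available as attribute-value dictionaries, and the helper names (`delexicalise`, `overlapping_mrs`) and the choice to drop only the `name` attribute are assumptions, not the organisers' actual procedure.

```python
def delexicalise(mr, drop=("name",)):
    """Return a hashable, order-independent key for an MR,
    ignoring the attributes listed in `drop` (here: the restaurant name)."""
    return frozenset((attr, value) for attr, value in mr.items()
                     if attr not in drop)


def overlapping_mrs(seen_mrs, test_mrs, drop=("name",)):
    """List the test MRs that also occur in the training/development
    data once the dropped attributes are removed."""
    seen_keys = {delexicalise(mr, drop) for mr in seen_mrs}
    return [mr for mr in test_mrs if delexicalise(mr, drop) in seen_keys]


if __name__ == "__main__":
    train = [{"name": "The Eagle", "food": "French", "area": "riverside"}]
    test = [
        {"name": "Loch Fyne", "food": "French", "area": "riverside"},
        {"name": "Loch Fyne", "food": "Italian", "area": "riverside"},
    ]
    # Only the first test MR differs from a training MR in name alone.
    print(overlapping_mrs(train, test))
```

A split satisfying the condition stated above would yield an empty list; passing `drop=("name", "near")` would additionally ignore the venue named in the `near` attribute.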
Systems in the Competition
The interest in the E2E Challenge has by far exceeded our expectations. We received a total of 62 submitted systems by 17 institutions (about 1/3 from industry). In accordance with ethical considerations for NLP shared tasks (Parra Escartín et al., 2017), we allowed researchers to withdraw or anonymise their results if their system performed in the lower 50% of submissions. Two groups from industry withdrew their submissions and one group asked to be anonymised after obtaining automatic evaluation results.
We asked each of the remaining teams to identify 1–2 primary systems, which resulted in 20 systems by 14 groups. Each primary system is described in a short technical paper (available on the challenge website) and was evaluated both by automatic metrics and human judges (see Section 4). We compare the primary systems to a baseline based on the TGen generator (Dušek and Jurčíček, 2016a). An overview of all primary systems is given in Table 3, including the main features of their architectures. A more detailed description and comparison of systems will be given in Dušek et al. (2018).
Evaluation Results
Following previous shared tasks in related fields (Bojar et al., 2017b; Chen et al., 2015), we selected a range of metrics measuring word overlap between system output and references, including BLEU, NIST, METEOR, ROUGE-L, and CIDEr. Table 3 summarises the primary system scores. The TGen baseline is very strong in terms of word-overlap metrics: no primary system is able to beat it in terms of all metrics, and only Slug comes very close. Several other systems beat TGen in one of the metrics but not in others. Note, however, that several secondary system submissions perform better than the primary ones (and the baseline) with respect to word-overlap metrics. Overall, seq2seq-based systems show the best word-based metric values, followed by Sheff1, a data-driven system based on imitation learning. Template-based and rule-based systems mostly score at the bottom of the list.
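To illustrate what "word overlap" against multiple references means, the sketch below computes the clipped (modified) n-gram precision that forms the core of BLEU. It is a simplified illustration with whitespace tokenisation and no brevity penalty, not the evaluation script used in the challenge; the function names are chosen for this example.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))


def modified_precision(hypothesis, references, n):
    """Clipped n-gram precision of a hypothesis against several references.

    Each hypothesis n-gram is credited at most as often as it occurs
    in the single reference where it is most frequent.
    """
    hyp_counts = ngram_counts(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in ngram_counts(reference.split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


if __name__ == "__main__":
    refs = ["the eagle is a cheap pub", "the eagle serves cheap food"]
    print(modified_precision("the eagle is cheap", refs, 1))
    print(modified_precision("the eagle is cheap", refs, 2))
```

Having several references per MR makes such scores more forgiving, since an n-gram only needs to appear in one of them to be credited.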
Results of Human Evaluation
However, the human evaluation study provides a different picture. Rank-based Magnitude Estimation (RankME; Novikova et al., 2018) was used for evaluation, where crowd workers compared outputs of 5 systems for the same MR and assigned scores on a continuous scale. We evaluated output naturalness and overall quality in separate tasks; for naturalness evaluation, the source MR was not shown to workers. We collected 4,239 5-way rankings for naturalness and 2,979 for quality, comparing 9.5 systems per MR on average.
The final evaluation results were produced using the TrueSkill algorithm (Herbrich et al., 2006; Sakaguchi et al., 2014), with partial ordering into significance clusters computed using bootstrap resampling (Bojar et al., 2013, 2014; Sakaguchi et al., 2014). For both criteria, this resulted in 5 clusters of systems with significantly different performance and showed a clear winner: Sheff2 for naturalness and Slug for quality. The 2nd clusters are quite large for both criteria: they contain 13 and 11 systems, respectively, and both include the baseline TGen system.
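Ranking algorithms that work on match outcomes need the 5-way judgements broken down into pairs. The sketch below shows one plausible way to do this, assuming each judgement is a mapping from system name to the continuous score a worker assigned; this decomposition and the name `pairwise_outcomes` are assumptions for illustration and not necessarily the preprocessing used in the challenge.

```python
from itertools import combinations


def pairwise_outcomes(scores):
    """Turn one multi-way judgement into pairwise comparisons.

    `scores` maps system name -> magnitude score (higher is better).
    Returns a list of (first, second, is_tie) triples; when is_tie is
    False, `first` is the system with the higher score.
    """
    outcomes = []
    for a, b in combinations(sorted(scores), 2):
        if scores[a] > scores[b]:
            outcomes.append((a, b, False))
        elif scores[a] < scores[b]:
            outcomes.append((b, a, False))
        else:
            outcomes.append((a, b, True))
    return outcomes


if __name__ == "__main__":
    judgement = {"sysA": 80.0, "sysB": 65.5, "sysC": 65.5,
                 "sysD": 20.0, "sysE": 95.0}
    for outcome in pairwise_outcomes(judgement):
        print(outcome)
```

A single 5-way judgement yields 10 pairwise comparisons, so the relative ordering carries over while the absolute magnitudes are discarded.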
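The clustering idea can be sketched in a few lines. Note that this example deliberately swaps TrueSkill for a plain win rate to stay self-contained, so it illustrates only the bootstrap-and-cluster step under that assumption: resample the pairwise outcomes, re-rank systems on each sample, take a rank interval per system, and group systems whose intervals overlap. All function names and the `(winner, loser)` input format are hypothetical.

```python
import random
from collections import defaultdict


def win_rates(outcomes):
    """Fraction of comparisons each system won; outcomes are (winner, loser)."""
    wins, total = defaultdict(int), defaultdict(int)
    for winner, loser in outcomes:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    return {system: wins[system] / total[system] for system in total}


def rank_ranges(outcomes, n_samples=1000, alpha=0.05, seed=0):
    """Bootstrap a (low, high) rank interval for every system."""
    rng = random.Random(seed)
    systems = sorted({s for pair in outcomes for s in pair})
    ranks = {system: [] for system in systems}
    for _ in range(n_samples):
        # Resample the comparisons with replacement and re-rank.
        sample = [rng.choice(outcomes) for _ in outcomes]
        rates = win_rates(sample)
        ordered = sorted(systems, key=lambda s: -rates.get(s, 0.0))
        for rank, system in enumerate(ordered, start=1):
            ranks[system].append(rank)
    low = int(n_samples * alpha / 2)
    high = int(n_samples * (1 - alpha / 2)) - 1
    return {s: (sorted(r)[low], sorted(r)[high]) for s, r in ranks.items()}


def clusters(ranges):
    """Group systems whose rank intervals overlap, best cluster first."""
    result, current, current_high = [], [], 0
    for system, (low, high) in sorted(ranges.items(), key=lambda kv: kv[1]):
        if current and low > current_high:
            result.append(current)
            current, current_high = [], 0
        current.append(system)
        current_high = max(current_high, high)
    if current:
        result.append(current)
    return result


if __name__ == "__main__":
    data = [("A", "B")] * 50 + [("B", "C")] * 50 + [("A", "C")] * 50
    print(clusters(rank_ranges(data, n_samples=200)))
```

Systems that land in the same cluster cannot be reliably separated by the available judgements, which is why large clusters such as the 2nd ones reported above are not unusual.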
The results indicate that seq2seq systems dominate in terms of naturalness of their outputs, while most systems of other architectures score lower. The bottom cluster is filled with template-based systems. The results for quality are, however, more mixed in terms of architectures, with none of them clearly prevailing. Here, seq2seq systems with reranking based on checking output correctness score high while seq2seq systems with no such mechanism occupy the bottom two clusters.
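Reranking by output correctness can be pictured as follows. This is a deliberately naive sketch, assuming MRs as attribute-value dictionaries and candidates as `(text, model_score)` pairs: it counts attribute values that do not appear verbatim in a candidate and prefers candidates with fewer misses. The submitted systems use their own, more sophisticated checks; the names `missing_slots` and `rerank` are hypothetical.

```python
def missing_slots(mr, text):
    """Attributes whose value does not occur verbatim (case-insensitively)
    in the generated text. Plain string matching is an assumption of this
    sketch; it cannot handle paraphrases or yes/no attributes."""
    lowered = text.lower()
    return [attr for attr, value in mr.items()
            if value.lower() not in lowered]


def rerank(mr, candidates):
    """Sort (text, model_score) candidates: fewest missing slots first,
    ties broken by higher model score."""
    return sorted(candidates,
                  key=lambda cand: (len(missing_slots(mr, cand[0])), -cand[1]))


if __name__ == "__main__":
    mr = {"name": "The Eagle", "area": "riverside", "food": "French"}
    beam = [
        ("The Eagle is a nice place.", -0.5),
        ("The Eagle serves French food in the riverside area.", -1.2),
    ]
    print(rerank(mr, beam)[0][0])
```

In this example the second candidate wins despite its lower model score, because it realises all three attributes of the MR.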
Conclusion
This paper presents the first shared task on end-to-end NLG. The aim of this challenge was to assess the capabilities of recent end-to-end, fully data-driven NLG systems, which can be trained from pairs of input MRs and texts, without the need for fine-grained semantic alignments. We created a novel dataset for the challenge, which is an order of magnitude bigger than any previous publicly available dataset for task-oriented NLG. We received 62 system submissions by 17 participating institutions, with a wide range of architectures, from seq2seq-based models to simple templates. We evaluated all the entries in terms of five different automatic metrics; 20 primary submissions (as identified by the 14 remaining participants) underwent crowdsourced human evaluation of naturalness and overall quality of their outputs.
We consider the Slug system (Juraska et al., 2018), a seq2seq-based ensemble system with a reranker, as the overall winner of the E2E NLG challenge. Slug scores best in human evaluations of quality, is placed in the 2nd-best cluster of systems in terms of naturalness, and reaches high automatic scores. While the Sheff2 system (Chen et al., 2018), a vanilla seq2seq setup, won in terms of naturalness, it scores poorly on overall quality: it placed in the last cluster. The TGen baseline system turned out to be hard to beat: it ranked highest on average in word-overlap-based automatic metrics and placed in the 2nd cluster in both quality and naturalness.
The results in general show the seq2seq architecture as very capable, but requiring reranking to reach high-quality results. On the other hand, while rule-based approaches are not able to beat data-driven systems in terms of automatic metrics, they often perform comparably or better in human evaluations.
We are preparing a detailed analysis of the results (Dušek et al., 2018) and a release of all system outputs with user ratings on the challenge website (http://www.macs.hw.ac.uk/InteractionLab/E2E). We plan to use this data for experiments in automatic NLG output quality estimation (Specia et al., 2010; Dušek et al., 2017), where the large amount of data obtained in this challenge allows a wider range of experiments than previously possible.
This research received funding from the EPSRC projects DILiGENt (EP/M005429/1) and MaDrIgAL (EP/N017536/1). The Titan Xp used for this research was donated by the NVIDIA Corporation.