Hurdles to Progress in Long-form Question Answering

Kalpesh Krishna, Aurko Roy, Mohit Iyyer

Introduction

Long-form question answering (LFQA) integrates the retrieval component of open-domain QA, which involves searching a large external knowledge source for documents relevant to a given question, with a text generation component to produce paragraph-length answers. Significant progress has been made on open-domain QA datasets such as Natural Questions (Kwiatkowski et al., 2019), whose questions are answerable with short phrases and entities, by leveraging dense retrieval techniques like ORQA (Lee et al., 2019), REALM (Guu et al., 2020), and DPR (Karpukhin et al., 2020; Lewis et al., 2020c; Izacard and Grave, 2020). Methods inspired by these results have recently been combined with pretrained language models (Lewis et al., 2020b; Petroni et al., 2020) and applied to the Reddit-derived “Explain Like I’m Five” (ELI5) dataset (Fan et al., 2019), which is the only publicly-available large-scale LFQA dataset.

The recently proposed KILT benchmark (Petroni et al., 2020), which compares retrieval-augmented models across a variety of knowledge-intensive tasks including ELI5, automatically evaluates LFQA models by the quality of both generated answers (ROUGE-L against reference answers) and retrieved documents (R-precision against human-annotated relevant documents). In this paper, we build a state-of-the-art system for ELI5 (state-of-the-art as of April 3, 2021; see the “Google Research & UMass Amherst” team entry on https://evalai.cloudcv.org/web/challenges/challenge-page/689/leaderboard/1908) by using a sparse Transformer variant (Roy et al., 2020) to condition over Wikipedia paragraphs returned by a REALM-style retriever (Guu et al., 2020).

However, despite its success on the KILT leaderboard, our system does not actually use the documents that it retrieves! To measure the effect of retrieval on generation quality, we design a control experiment in which retrieved documents are replaced with randomly-sampled documents at inference time. Results from both human A/B tests and automatic metrics like ROUGE-L demonstrate that conditioning on random documents has almost no effect on generated answer quality (Figure 1c). We recommend that future LFQA research report the results of such control experiments in addition to reporting generation and retrieval quality.

How can a system using random retrieval perform well on ELI5? Our analysis reveals that this result is partially due to significant train / validation overlap in the ELI5 dataset (Figure 1a), which eliminates the need for external retrieval. A human study shows that at least 81% of validation questions have a paraphrase in the training set, and almost all validation questions are topically similar to a training set question. While Fan et al. (2019) attempted to identify and remove question overlap using TF-IDF similarity, more complex semantic matching methods & human verification are needed to address this issue in future LFQA datasets.

Digging deeper, we identify fundamental issues with using ROUGE-L to evaluate generated answer quality (Figure 1b). Simple baselines such as just repeatedly copying the question, or choosing a random training set answer, can outperform LFQA systems such as RAG (Lewis et al., 2020c) in terms of ROUGE-L. On the other hand, our system achieves higher ROUGE-L than reference human-written answers, which is misleading since human A/B testers strongly prefer reference answers to our system’s. We conclude that ROUGE-L is not a reliable metric to evaluate LFQA due to its large and relatively unconstrained output space (e.g., compared to translation or summarization), and we offer suggestions for better automatic & human evaluations to enable meaningful progress on this task.

A state-of-the-art LFQA system

The ELI5 task (Fan et al., 2019) asks models to generate paragraph-length answers to open-ended questions in English that often rely on world knowledge (e.g., how do jellyfish function without brains or nervous systems?). LFQA systems thus benefit from conditioning answer generation on relevant documents from the web (such as the Wikipedia article about jellyfish). While large-scale pretrained language models store surprising amounts of world knowledge within their parameters (Petroni et al., 2019; Roberts et al., 2020), external document retrieval not only augments this intrinsic knowledge but also grounds model outputs in a knowledge source, which provides interpretability.

In this section, we describe our proposed LFQA system, which conditions answer generation on Wikipedia articles identified by a pretrained retriever. We use a dense retriever trained by scaling up a distantly supervised algorithm from Jernite (2020). Since retrieved articles can be quite long and often exceed the maximum sequence length of pretrained models like BERT (Devlin et al., 2019), we use a sparse-attention variant of the Transformer to allow modeling over longer sequences. While our system sets a new state-of-the-art on ELI5, we question the significance of this result in Section 3.

We begin by specifying our dense retriever (“contrastive REALM” or c-REALM), which returns documents related to an input question. Consider a corpus of long-form questions and answers, represented by $(q_{i},a_{i})^{N}_{i=1}$ . Our retriever uses $q_{i}$ as a query to retrieve $K$ documents $(r_{i,j})_{j=1}^{K}$ from a knowledge corpus (Wikipedia), which is enabled by an encoder network that projects both questions and candidate documents to a 128- $d$ shared embedding space. Like REALM (Guu et al., 2020), our encoder is a BERT-base Transformer (Devlin et al., 2019) with a final projection layer.

Since the ELI5 dataset does not include gold retrievals, we train our retriever by scaling up a method recently introduced by Jernite (2020) that uses gold answers for distant supervision. The key idea is to push the encoded vector for a question close to a vector representation of its ground-truth answer(s), but away from all other answer vectors in the mini-batch (negative examples). Intuitively, this method works because both ELI5 answers and external documents are of paragraph length (documents are paragraph-length chunks from Wikipedia). Concretely, we optimize the loss

$$\text{loss} = -\sum_{(q_{i},a_{i})\in B}\log\frac{\exp(\mathbf{q}_{i}\cdot\mathbf{a}_{i})}{\sum_{a_{j}\in B}\exp(\mathbf{q}_{i}\cdot\mathbf{a}_{j})},$$

where $B$ is the mini-batch and $\mathbf{q}_{i}$, $\mathbf{a}_{i}$ are the encoded vector representations for $(q_{i},a_{i})$. This objective is based on contrastive learning, a method that has been used effectively for semi-supervised learning (Chen et al., 2020) and dense retriever training (Karpukhin et al., 2020). Scaling up from Jernite (2020), who used a mini-batch size of 512 and initialized their retriever with BERT, we use much larger mini-batches of size 12,288 (and hence, many more negative examples) and initialize our retriever with a strong pretrained retriever, the REALM model (Guu et al., 2020) trained on the Common Crawl News (CC-News) corpus. These design decisions greatly improve retriever quality, as we observe in an ablation study (see Appendix A.2). During inference, we perform a maximum inner-product search (MIPS) with the ScaNN library (Guo et al., 2020) to efficiently find the top $K$ documents. In all our experiments we use $K=7$, following the setup in Guu et al. (2020).
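As a rough illustration of the training objective and the inference-time search described above, the following sketch computes a softmax cross-entropy over in-batch inner products and then ranks documents by inner product. It is a minimal pure-Python sketch under stated assumptions, not the authors' TensorFlow implementation: the function names are hypothetical, the vectors are plain lists of floats, and the exhaustive search stands in for the approximate MIPS that ScaNN performs.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def in_batch_contrastive_loss(question_vecs, answer_vecs):
    """Sum over the batch of -log softmax_j(q_i . a_j) evaluated at j = i.

    Row i of answer_vecs is the gold answer for question i; every other
    answer in the batch acts as a negative example.
    """
    assert len(question_vecs) == len(answer_vecs)
    total = 0.0
    for i, q in enumerate(question_vecs):
        scores = [dot(q, a) for a in answer_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(scores)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_denominator - scores[i]
    return total


def top_k_by_inner_product(query_vec, doc_vecs, k):
    """Exhaustive maximum inner-product search (hypothetical stand-in for ScaNN)."""
    scored = [(dot(query_vec, d), idx) for idx, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]
```

Because each question is contrasted against every other answer in the batch, increasing the batch size directly increases the number of negatives per question, which is the motivation given above for using very large mini-batches.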

2.2 Generator

We next describe our generator model, which conditions its generated answers on retrieved documents returned by c-REALM. We use the Routing Transformer (RT) from Roy et al. (2020), which is the current state-of-the-art in long-form language modeling. The RT is a sparse attention model that employs local attention as well as mini-batch $k$ -means clustering to better model long-range dependencies in sequences (attention maps in Appendix A.1). Long-form language models such as RT are well-suited to ELI5 as the task requires conditioning answer generation not only on a short question but also many lengthy retrieved documents.

We pretrain our RT model on PG-19, a long-form language modeling benchmark (Rae et al., 2020) created from approximately 28,000 Project Gutenberg books published before 1919. PG-19 has 1.9B tokens and an average context size of 69K words. While this data is out-of-domain for ELI5, we choose it to encourage long & coherent generation. Our RT is a 22-layer model with 1032 hidden units (486M parameters), maximum sequence length of $8192$ tokens, and a vocabulary of 98K subwords. (Our hyperparameters have been chosen manually with minimal tuning; see Appendix A.1 for details.) We fine-tune our model in a decoder-only fashion (Liu et al., 2018; Wolf et al., 2018) by concatenating the top $K$ retrieved documents to the question, $[r_{i,K},~r_{i,K-1}~...~r_{i,1},~q_{i},~a_{i}]$, and training the model to predict tokens of the answer $a_{i}$. We do not backpropagate gradients through the retriever. (We tried training the retriever jointly with RT using the attention bias scheme proposed in MARGE (Lewis et al., 2020a). This improved perplexity only in autoencoding settings where the gold answer itself is used as a retrieval query (like the setup in Lewis et al., 2020a), which is not valid in LFQA.) Retrievals slightly improve perplexity (18.1 vs 17.8) as seen in Wang and McAllester (2020), but do not improve generations (§3.1).
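The input layout used for fine-tuning can be sketched as follows. This is a minimal illustration, assuming pre-tokenized inputs and the 288-token segment length reported in Appendix A.1; the `fit` and `build_example` helpers and the `<pad>` symbol are hypothetical, and the special separator symbols mentioned in the appendix are omitted for brevity.

```python
PAD = "<pad>"        # hypothetical padding symbol
SEGMENT_LEN = 288    # per-segment length reported in Appendix A.1


def fit(tokens, length=SEGMENT_LEN):
    """Truncate or right-pad a token list to exactly `length` tokens."""
    tokens = list(tokens)[:length]
    return tokens + [PAD] * (length - len(tokens))


def build_example(retrievals, question, answer, length=SEGMENT_LEN):
    """Lay out [r_K, ..., r_1, q, a] and a loss mask over answer tokens.

    `retrievals` is ordered best-first (retrievals[0] is r_1), so it is
    reversed to place the highest-ranked document next to the question.
    """
    segments = [fit(r, length) for r in reversed(retrievals)]
    segments.append(fit(question, length))
    answer_segment = fit(answer, length)
    segments.append(answer_segment)

    tokens = [tok for seg in segments for tok in seg]
    n_context = length * (len(retrievals) + 1)
    # Loss is computed only on real (non-padding) answer tokens.
    loss_mask = [0] * n_context + [0 if tok == PAD else 1 for tok in answer_segment]
    return tokens, loss_mask
```

With $K=7$ retrievals this layout yields $9 \times 288 = 2592$ tokens per example, which fits within the model's maximum sequence length of 8192 tokens.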

2.3 Main Experiments

Dataset & Evaluation details: We evaluate our model on the KILT validation & test subsets of ELI5 (Petroni et al., 2020), since the original ELI5 dataset does not have human annotations to measure retriever performance. We downloaded the ELI5 dataset (Fan et al., 2019) from the KILT GitHub repository (github.com/facebookresearch/KILT). This version of the dataset has 272,634 training examples, 1,507 validation examples and 600 test examples. The test set answers are hidden, and hosted on a public leaderboard in the EvalAI platform (Yadav et al., 2019).

Answer quality is measured by the maximum overlap of generations with a set of gold answers in terms of unigram F1 score and ROUGE-L (Lin, 2004). Petroni et al. (2020) collected human annotations of Wikipedia articles which support ELI5 gold answers, which enables measuring retrieval quality by computing R-precision (if the top-1 retrieval matches the annotation) and Recall@5 using the top-5 retrievals. Finally, the KILT benchmark combines R-prec. and ROUGE-L to measure the overall performance of the system by “KILT ROUGE-L”. This metric is similar to ROUGE-L, but assigns a score of 0 whenever the top-1 retrieval does not match the gold annotation.
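The interaction between answer quality and retrieval quality in the combined metric can be sketched as below. This is a simplified illustration based only on the description above, not the official KILT evaluation script: ROUGE-L is computed as an LCS-based F-measure over lowercased whitespace tokens (real implementations may add stemming and different tokenization), and all function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """LCS-based F-measure between two strings (simplified ROUGE-L)."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def max_rouge_l(candidate, gold_answers):
    """Answer quality: maximum overlap with any gold answer."""
    return max(rouge_l_f1(candidate, gold) for gold in gold_answers)


def kilt_rouge_l(candidate, gold_answers, top1_doc_id, gold_doc_ids):
    """ROUGE-L, zeroed whenever the top-1 retrieval misses the gold annotation."""
    if top1_doc_id not in gold_doc_ids:
        return 0.0
    return max_rouge_l(candidate, gold_answers)
```

Note that `kilt_rouge_l` never checks whether the generated text depends on the retrieved document: the retrieval and generation scores are computed separately and then combined.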

Baselines: We compare our model with the other entries on the ELI5 KILT leaderboard which are either generation-only, like T5-base (Raffel et al., 2020) and BART (Lewis et al., 2020b), or variants of BART using retrieval such as RAG (Lewis et al., 2020c) and BART + DPR (Petroni et al., 2020). These systems are based on massive pretrained language models, with a similar number of parameters as our model (details in Appendix A.3).

Results: Table 1 contains our results on the test set of ELI5 (also on the public KILT leaderboard). We present four variants of our system, using a different retriever during inference (REALM or c-REALM), and different nucleus sampling $p$ values (Holtzman et al., 2020). All variants outperform prior work in generation quality, with lower-entropy models ($p=0.6$) performing best. (As in Holtzman et al. (2020), a human study reveals that higher entropy ($p=0.9$) answers are slightly more coherent and sensible, but lower entropy answers ($p=0.6$) are more relevant to the question; details in Appendix A.5.) c-REALM is competitive with RAG and DPR despite being only distantly supervised, and outperforms REALM. Our proposed RT+c-REALM system achieves a new state-of-the-art on combined performance (KILT R-L). Generations from our model are provided in Figure 2 and Appendix A.4.

Analysis

In this section, we conduct a thorough analysis of our model’s usage of retrievals (Section 3.1), the impact of overlap in ELI5’s train / validation / test folds (Section 3.2), issues with ROUGE-L and performance bounds (Section 3.3), and the difficulty in human evaluation for this task (Section 3.4). At the end of each section, we provide short takeaways with suggestions for future work.

While our retrieval-augmented system achieves state-of-the-art performance, we find little evidence that it is actually using the retrieved documents. To measure this, we run an ablation study where at inference time we replace retrieved paragraphs with randomly sampled paragraphs from Wikipedia. We compare this Random baseline with our original system (Predicted) in terms of generation quality as well as the $n$ -gram overlap between the generation and the retrieved paragraphs.

Generations are similar irrespective of type of retrievals: We present our results in Table 2. Despite not being conditioned on any meaningful retrievals, the Random retrieval model has similar ROUGE-L scores as our Predicted system. Moreover, generations from the Random and Predicted models have similar amounts of 1-gram and 2-gram overlap with the paragraphs retrieved by c-REALM, despite the fact that the Random model does not actually see the retrieved paragraphs. (Corresponding experiments with the $p=0.9$ variant of our model are presented in Appendix A.7.)

The $n$-gram overlaps are possibly overestimates due to stopwords (e.g., prepositions, punctuation) and entities which are copied from the question. To tackle this issue, in Table 4 we measure the fractions of lemmatized nouns, proper nouns and numbers in the generated answer which are present in the predicted retrievals but not in the question. We notice similar trends as before, with only small differences between the two systems. Finally, there is almost no correlation (Spearman $\rho=0.09$) between the Predicted model’s generation quality and the amount of unigram overlap between its outputs and the retrieved documents (scatter plots in Appendix A.7), strengthening our hypothesis that generations are not grounded in retrievals. (All these trends persist even on questions for which our retriever predicts the ground-truth document; see Appendix A.7.)
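The grounding measurement described above can be approximated with a short sketch. It is a deliberate simplification and not the procedure used for Table 4: it considers all lowercased tokens rather than lemmatized nouns, proper nouns and numbers (which would require a POS tagger and lemmatizer), and the helper names are hypothetical.

```python
PUNCTUATION = ".,!?;:()\"'"


def tokens(text):
    """Lowercased whitespace tokens with surrounding punctuation stripped."""
    stripped = (t.strip(PUNCTUATION).lower() for t in text.split())
    return [t for t in stripped if t]


def novel_overlap(generation, retrievals, question):
    """Fraction of distinct generation tokens, absent from the question,
    that also appear somewhere in the retrieved paragraphs."""
    question_vocab = set(tokens(question))
    retrieval_vocab = set()
    for paragraph in retrievals:
        retrieval_vocab.update(tokens(paragraph))
    candidates = set(tokens(generation)) - question_vocab
    if not candidates:
        return 0.0
    return len(candidates & retrieval_vocab) / len(candidates)
```

To run the control comparison, one would score the Predicted and Random generations against the same c-REALM retrievals; similar values for both systems suggest that the overlap does not come from reading the retrieved paragraphs.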

Human evaluation validates our findings: As ROUGE-L and $n$-gram overlap have major limitations for LFQA (Section 3.3), we perform additional human A/B testing on the output of Random and Predicted. Specifically, we ask human volunteers to choose between answers generated by the two systems, presented in random order (details of our experimental setup are in Appendix A.5). As seen in Table 3, humans struggle to choose which of the two answers is more relevant to the question. For both model variants ($p=0.6,0.9$), there is a less than 7% preference for a particular answer type, with humans preferring answers (by 6%) from the Random model for $p=0.9$!

Other systems also have this issue, possibly due to source-reference divergence and train-validation overlap: We note that this issue is not unique to our system. Other systems on the KILT leaderboard like BART + DPR and RAG actually perform worse than their no-retrieval counterpart (BART) in generation quality, as shown in Table 1. Qualitatively, we found no evidence of retrieval usage in a publicly hosted ELI5 model demo by Jernite (2020) (https://huggingface.co/qa). A possible explanation for this issue is high source-reference divergence, a common problem in table-to-text generation (Wiseman et al., 2017; Tian et al., 2019). In Table 2 and Table 4, we measure the $n$-gram overlap of top-ranked gold validation answers (Gold Ans) with predicted retrievals. This overlap is low and similar to that of our generations, which we suspect encourages our model to ignore retrievals. A second explanation is the large amount of train-validation overlap (Section 3.2), which eliminates the need for retrieval.

Why does our model do well compared to other systems despite not using retrievals? While our model has similar capacity as the BART/RAG baselines (comparison in Appendix A.3), we hypothesize that our improvements in ROUGE-L are due to a different pretraining objective. BART is pretrained on a masked infilling task on short sequences. Instead, we pretrain our model to perform next-word prediction on long sequences from Project Gutenberg, which encourages long & fluent generations. To illustrate this length effect, in Appendix A.6 we show that truncated outputs from our model get lower ROUGE-L scores on ELI5. (While we do not have access to generations from baselines on the KILT leaderboard, example generations from the demo of the BART model in Jernite (2020) are significantly shorter (59 words avg.) than our generations (187 words avg.).) Prior summarization literature (Sun et al., 2019) has also shown that ROUGE scores vary heavily by length. To compare the same systems on shorter length outputs, we also tried finetuning the pretrained model on Wizard of Wikipedia (Dinan et al., 2019), an unconstrained dialogue generation task with single sentence dialogues (much shorter than ELI5). As seen on the public KILT leaderboard (https://eval.ai/web/challenges/challenge-page/689/leaderboard/1909), our system has lower ROUGE-L scores than the BART / RAG baselines on this task. Another possible explanation is issues with ROUGE-L itself, as discussed in Section 3.3.

Takeaway (better evaluation of grounding): For evaluating LFQA, it is important to run control experiments with random retrievals & measure grounding of generations in retrieval. While the KILT benchmark does attempt to measure the combined retrieval + generation performance via KILT-RL, it does not check whether the generations actually used the retrievals. In other words, one can submit independent retrieval & generation systems, but still perform well on the combined score. This may not be an issue for short-form QA tasks like Natural Questions, since the gold answer is often exactly contained as a span in the gold retrieval. Also, as retrieval might be less important for large language models with parametric knowledge (Roberts et al., 2020), the KILT-RL strategy of simply aggregating top-1 retrieval score with ROUGE-L unfairly penalizes systems not relying on retrieval. (Another issue with KILT-RL is that it ignores non-top-1 retrievals, penalizing models that use multiple retrievals together in context.)

3.2 Training / Validation Overlap

Our experiments in Section 3.1 show that model performance is mostly unchanged by conditioning generation on randomly sampled retrievals instead of predictions from c-REALM. Despite not using retrievals, we observe qualitatively that our model displays a large amount of parametric knowledge (“Faraday Cage” in Figure 1c), which is surprising since it was pretrained on novels from Project Gutenberg (not Wikipedia). In this section, we discover that a major reason for ignoring retrievals is the large amount of train / validation overlap in ELI5. While Fan et al. (2019) attempted to fix this issue through TF-IDF overlap, this method is insufficient to identify all question paraphrases, as we find significant overlap between the training set and the KILT validation set of ELI5. (The ELI5 demo from Jernite (2020) also retrieves the top-1 similar training set question. Qualitatively, we found many validation examples had near-identical train paraphrases.) ELI5 is not the only dataset with substantial train / test overlap: Lewis et al. (2020d) identify similar issues with short-form QA datasets like Natural Questions.

Finding similar questions & measuring overlap: We use our retriever c-REALM to retrieve similar questions from the training set, since it has learned to map questions to a feature-rich embedding space. For each validation question, we retrieve the 7 most similar training set questions. We use both human and automatic evaluation to calculate the amount of overlap. For human evaluation, we show annotators on Amazon Mechanical Turk a validation set question and a retrieved training set question, and ask them to annotate the pair as 0: No paraphrase relationship; 1: on similar topics, but different questions; 2: approximately the same question (an adaptation of the paraphrase evaluation of Kok and Brockett, 2010). (We pay workers 4 cents per question pair ($8-12 / hr). We only hire workers from USA, UK and Australia with a 95% or higher approval rating and at least 1000 approved HITs.) We take 300 validation set questions and ask three crowd-workers to rate them against retrieved training questions on this scale, and consider the label with majority rating. To improve quality, we manually verify their annotations.

Table 5 shows that 81% of validation set questions have at least one paraphrase in the training set, while all annotated questions have at least one topically similar question in the training set, which indicates substantial training / validation overlap. The experiment had “fair agreement” with a Fleiss $\kappa$ of 0.29 (Fleiss, 1971; Landis and Koch, 1977).
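The aggregation of crowd ratings and the agreement statistic can be sketched as follows. This is an illustrative sketch, not the authors' analysis code: the function names are hypothetical, and breaking majority ties toward the lower label is an assumption made here (the text only says the majority label is used, followed by manual verification). The agreement function follows the standard Fleiss (1971) formula.

```python
from collections import Counter


def majority_label(ratings):
    """Most common rating; ties go to the lower label (assumption)."""
    counts = Counter(ratings)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)


def fraction_with_paraphrase(ratings_per_question):
    """Share of questions with at least one retrieved pair whose majority label is 2.

    `ratings_per_question[i][j]` holds the ratings given to the j-th retrieved
    training question for the i-th validation question.
    """
    hits = sum(
        any(majority_label(pair) == 2 for pair in pairs)
        for pairs in ratings_per_question
    )
    return hits / len(ratings_per_question)


def fleiss_kappa(items, categories=(0, 1, 2)):
    """Fleiss' kappa for items rated by the same number of raters.

    Undefined (division by zero) if every rating falls in one category.
    """
    n_items, n_raters = len(items), len(items[0])
    counts = [[Counter(item)[c] for c in categories] for item in items]
    # Observed agreement per item, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items
    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    expected = sum(p * p for p in proportions)
    return (observed - expected) / (1 - expected)
```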

As manually annotating question overlap can be expensive and time-consuming, we also experiment with automatic overlap detection methods. In particular, we use a RoBERTa-large binary classifier (Liu et al., 2019) fine-tuned on the Quora Question Paraphrase (QQP) dataset (Iyer et al., 2017) from the GLUE benchmark (Wang et al., 2019). For 43.6% of the ELI5 validation set, this classifier marked at least one retrieved question as a paraphrase (46% for the 300 questions we annotated). Qualitatively, we notice that this classifier often mis-classifies retrieved questions that are valid paraphrases but exhibit significant lexical or syntactic divergence. This observation, along with the smaller fraction of valid paraphrases in the QQP training set (37%), partially explains the gap between automatic & human evaluations.

Using retrieved QA for generation: Since ELI5 contains a significant amount of overlap between the training and validation sets, a system can simply copy the answers of retrieved training set questions instead of actually doing generation. Table 7 shows that by using the longest answer within the top-$K$ retrieved questions, we outperform two prior systems (RAG, BART + DPR) that use retrieval-augmented generation. As an upper bound, we also consider a system which uses the best possible answer to retrieved training set questions in terms of ROUGE-L (best top-K train answer). This system gets 28.5 ROUGE-L, outperforming all others.

ELI5 performance on overlapping QA: Finally, we measure the performance difference between validation questions that overlap with the training set vs. those that do not. Since we only have human annotations for 300 questions (the no-overlap subset has only 53 samples), we present this analysis using the QQP classifier’s outputs as well. In Table 6, we notice large differences of 6.6 RPrec, 8.1 R@5 in retrieval performance favoring the overlap subset, but only a small generation score gain of 0.8 F1, 0.4 R-L (which may be misleading as discussed in Section 3.3).

Takeaway (careful held-out curation): Based on our findings, we suggest that more careful dataset curation for LFQA tasks is needed to prevent duplicates. While we acknowledge the efforts of Fan et al. (2019) to fix this issue, we also suggest alternative methods to control overlap and focus on evaluating generalization in held-out sets: (1) automatically retrieving paraphrases and then running human validation to eliminate them; or (2) holding out entire genres or domains to reduce the possibility of overlap — for example, keeping Q/A on Sports only in the held-out sets. Note that simply pruning the existing splits using these criteria will significantly reduce the size of the held-out datasets; so we suggest re-splitting the train/validation/test splits from the entire pool of collected questions.

3.3 ROUGE-L Bounds on ELI5 Performance

We have seen that simply copying the answer of a close question paraphrase from the training set achieves 28.5 ROUGE-L when the best answer among the retrieved questions is selected, outperforming all computational models. But how “good” is this absolute number? What are some suitable upper & lower bounds to ROUGE-L scores on ELI5? Is ROUGE-L an informative metric for LFQA?

Lower bounds are trivial baselines used to test the vulnerability of datasets or metrics to simple heuristic strategies that do not actually perform the task. Recent examples include hypothesis-only baselines for natural language inference (Gururangan et al., 2018) and passage-only baselines for reading comprehension (Kaushik and Lipton, 2018). We evaluate two ROUGE-L lower bounds on ELI5:

(1) copy the question 5 times and concatenate, as longer outputs boost ROUGE-L (Appendix A.6); (2) retrieve a random training set answer.

Our first baseline contains entities often present in the gold answer, but without actually answering the question. Our second baseline follows the “style” of an answer but is completely off-topic.
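Both lower-bound baselines are simple enough to state in a few lines. The sketch below is a minimal illustration with hypothetical function names; the seeded `random.Random` instance is an assumption added here for reproducibility.

```python
import random


def copy_question_baseline(question, times=5):
    """Lower bound (1): repeat the question `times` times, joined by spaces."""
    return " ".join([question] * times)


def random_train_answer_baseline(train_answers, rng):
    """Lower bound (2): return a training-set answer chosen uniformly at random."""
    return rng.choice(train_answers)
```

For example, `copy_question_baseline("What causes traffic?")` returns the question repeated five times, and `random_train_answer_baseline(train_answers, random.Random(0))` returns an answer unrelated to the question; neither output attempts to answer anything.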

As an upper bound, we estimate the ROUGE-L of gold answers themselves. On average, there are 12 gold answers per question, so we measure the ROUGE-L of the longest gold answer with respect to the other gold answers. We also measure the maximum pairwise ROUGE-L between two gold answers for the same question. (Note that different gold answers were not written independently, as Reddit users writing answers can read existing answers and may want to provide a non-overlapping perspective. Due to the high train/valid overlap, the best top-7 retrieved answer could be a better upper bound since it is from another Reddit post, and it performs better than the best gold answer.) We only calculate upper bounds for the validation set, since the gold answers of the KILT test set are hidden.

Lower bounds beat prior work, upper bounds have low ROUGE-L: We compare our bounds with actual retrieval augmented generation systems in Table 7. Both our lower bounds (random training answer, copy input) are quite competitive, outperforming RAG (Lewis et al., 2020c) and performing close to BART + DPR (Petroni et al., 2020) without actually answering the question! This shows that ROUGE-L is fairly sensitive to simply copying entities from the question as well as stylistic properties of ELI5. On the other hand, upper bounds (longest gold answer) perform worse than our system (21.2 vs 24.4). Suspecting that this result is misleading, we run another human A/B test by showing volunteers a question and asking them to choose between answers generated by our system and the longest gold answer, shuffled at random (human A/B testing details in Appendix A.5). As seen in Table 3, the majority of humans prefer the gold reference answers vs generations (68% vs 14% for $p=0.6$). In interviews with human annotators after completing the task, they reported that both answers were often fluent and stylistically similar, but one eventually veered off-topic.

Takeaway (better automatic metrics needed): Our experiments demonstrate that computing the ROUGE-L of generations against gold answers is not a meaningful way to evaluate LFQA systems, since it is not selective enough to differentiate between valid/invalid answers. There is a very small margin of improvement between trivial lower bounds and strong upper bounds, with the absolute scores of upper bounds being quite low. We suspect this is due to the long length of answers and fairly unconstrained and large output space. The ELI5 dataset has several open-ended questions with many plausible answers (like What causes traffic?), often involving analogies. A possible fix is a sentence-level evaluation and then aggregating scores across generated sentences, but appropriate penalties are needed for lack of diversity (Zhu et al., 2018) and short lengths. Other possible fixes include learning task-specific metrics to measure semantic overlap (Sellam et al., 2020) or metrics to check factual correctness (Zhang et al., 2020) and faithfulness to input (Wang et al., 2020; Durmus et al., 2020; Zhou et al., 2020). Ultimately, all automatic metrics have their limitations, and human evaluation is necessary (Celikyilmaz et al., 2020).

3.4 Difficulty of Human Evaluation

To better understand the inherent difficulty of evaluation in ELI5, we interviewed human annotators (of Table 3) and found two challenges:

(1) Unfamiliarity with question topics: While most annotators found the Q/A interesting, they were often unfamiliar with the technical topics discussed in the questions. This made it hard for them to assess answer correctness. The ELI5 dataset has questions in a wide variety of topics (History, Politics, Biology etc.), while most annotators were Computer Science graduate students. While we did allow annotators to use Wikipedia, they mentioned that domain experts would be better judges of the factual correctness of answers.

(2) Length of Answers: Annotators mentioned the paragraph-long length of answers made the task quite challenging. Annotators reported taking an average of 2 minutes per answer pair, many of which required careful thought & concentration. This was especially difficult when only part of the answer was correct and the rest had contradictions or repetitions, a common theme in our generations.

Takeaway: Human evaluation is challenging but necessary for evaluating LFQA. Crowd-workers are unlikely to spend time reading & analyzing long text (Akoury et al., 2020). Hence, it is imperative to design simpler evaluations. One effort in this direction is Dugan et al. (2020), who reveal one generated sentence at a time and estimate system quality based on the number of sentences which fooled humans. Another promising direction is extrinsic evaluation (Celikyilmaz et al., 2020) where humans actually interact with systems in real-world scenarios such as the Alexa Prize (Ram et al., 2018) or STORIUM (Akoury et al., 2020).

Conclusion

We present a “retrieval augmented” generation system that achieves state-of-the-art performance on the ELI5 long-form question answering dataset. However, an in-depth analysis reveals several issues not only with our model, but also with the ELI5 dataset & evaluation metrics. We hope that the community works towards solving these issues so that we can climb the right hills and make meaningful progress on this important task.

Acknowledgements

First and foremost, we thank the twenty people who volunteered to help out with the human annotation experiments. We are very grateful to Vidhisha Balachandran, Niki Parmar, and Ashish Vaswani for weekly meetings discussing progress and the REALM team (Kenton Lee, Kelvin Guu, Ming-Wei Chang and Zora Tung) for help with their codebase and several useful discussions which helped us improve our experiments. We are grateful to Tu Vu for help with the QQP classifier. We thank Jules Gagnon-Marchand and Sewon Min for suggesting useful experiments on checking ROUGE-L bounds. Finally, we thank Shufan Wang, Andrew Drozdov, Nader Akoury, Andrew McCallum, Rajarshi Das, and the rest of the UMass NLP group for helpful discussions and suggestions at various stages in the project. This work was primarily done during KK’s internship at Google Brain, mentored by AR. MI and KK are supported by award IIS-1955567 from the National Science Foundation (NSF).

Ethical Considerations

Our system faces a similar set of issues as most modern text generation technology, like fabrication of facts (Zellers et al., 2019), potential for misuse (Brown et al., 2020) and reflecting biases prevalent on Reddit (the ELI5 dataset has been built using the r/ELI5 subreddit). In our work, we attempted to make text generators more factually grounded by conditioning generations on retrieved Wikipedia articles, hoping to reduce fact fabrication. Unfortunately, a thorough analysis (Section 3.1) has revealed that our system is still not grounding its generations in retrievals, and we have recommended the design of better metrics to measure factual correctness to tackle this issue.

Our final models were trained using 64 Google Cloud TPUs for a total of 32 hours. As mentioned in the Google 2019 environment report (https://www.gstatic.com/gumdrop/sustainability/google-2019-environmental-report.pdf), “TPUs are highly efficient chips which have been specifically designed for machine learning applications”. These accelerators run on Google Cloud, which has “matched 100% of its electricity consumption with renewable energy purchases, and has committed to fully decarbonize its electricity supply by 2030” (https://cloud.google.com/sustainability). More details on training time are provided in Appendix A.1.

References

Appendix A Appendices for “Hurdles to Progress in Long-form Question Answering”

All our models are developed and trained using TensorFlow 1.15 (Abadi et al., 2016) and Tensor2Tensor (Vaswani et al., 2018). Our implementations are based on the open-source codebases of REALM (https://github.com/google-research/language/tree/master/language/realm) and the Routing Transformer (https://github.com/google-research/google-research/tree/master/routing_transformer). Similar to the REALM implementation, we use separate processes to run the retriever and generate training data (using a MIPS search). Since our retriever is frozen, we do not use the document index refresher available in their codebase.

Retriever: Our retriever is trained on 64 Google Cloud TPUs for a total of 4k steps and a batch size of 12288. We do early stopping on the validation data (with a smaller batch size of 512 due to smaller P100 GPU memory). Our model converges quite fast, reaching its best performance in 1.5k steps (in 43 minutes) and needing 103 minutes for the full set of 4k steps.

Generator: Our generator is trained on 64 Google Cloud TPUs, for a total of 100k steps on the ELI5 training set. We use the pg19_local_cluster8k configuration available in the Routing Transformer implementation. Besides the default hyperparameters, setting 15% input, attention and ReLU dropout was critical to prevent overfitting on the training set. We use a learning rate of 5e-5. Our retrievals, questions and answers are truncated / padded to 288 subword tokens (using the PG19 subword tokenizer). We use a minibatch size of 128 QA pairs, which corresponds to 332k tokens per mini-batch (of which, the loss is computed over the last 288 answer tokens, or 37k total tokens). We do not compute loss over padded tokens, and use special symbols to separate different parts of the input context. We reverse the retrieved paragraphs in context since the model uses local attention layers, and we wanted higher ranked retrievals to appear closer to the answer tokens. Our models take about 30 hours to finish 100k steps (0.92 steps / second).
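The context layout and token counts described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the helper names `fit` and `build_context` and the padding id are hypothetical, separator symbols are omitted for brevity, and the figure of nine 288-token segments per example (for instance seven retrievals plus the question and the answer) is inferred from the reported 332k tokens per minibatch rather than stated in the text.

```python
SEGMENT_LEN = 288   # tokens per retrieval / question / answer (from the text)
BATCH_SIZE = 128    # QA pairs per minibatch (from the text)
PAD = 0             # hypothetical padding id


def fit(tokens, length=SEGMENT_LEN):
    """Truncate or right-pad a token list to exactly `length` ids."""
    return (tokens[:length] + [PAD] * length)[:length]


def build_context(retrievals, question, answer):
    """Lay out retrievals (lowest rank first), then question, then answer.

    `retrievals` is ordered best-first. Reversing it places the top-ranked
    paragraph closest to the answer tokens, which is the motivation given
    for a model with local attention layers.
    """
    segments = [fit(r) for r in reversed(retrievals)]
    segments += [fit(question), fit(answer)]
    tokens = [tok for seg in segments for tok in seg]
    # Loss mask: non-padding positions inside the final (answer) segment only.
    answer_mask = [int(tok != PAD) for tok in segments[-1]]
    mask = [0] * (len(tokens) - SEGMENT_LEN) + answer_mask
    return tokens, mask


# Assumed example: 7 retrievals, so 9 segments of 288 tokens each.
retrievals = [[10 + rank] * 300 for rank in range(7)]
tokens, mask = build_context(retrievals, [5] * 20, [7] * 50)
tokens_per_batch = BATCH_SIZE * len(tokens)        # 128 * 2592
answer_tokens_per_batch = BATCH_SIZE * SEGMENT_LEN  # 128 * 288
```

Under these assumptions a minibatch holds 331,776 tokens, of which 36,864 are answer positions, matching the rounded 332k and 37k figures above.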

Attention Maps: We show the 2D plots of our generator’s attention maps in Figure 3.

Hyperparameter Choices: We experimented with several different pretraining strategies (using Wikipedia), smaller model variants and hyperparameter choices manually in preliminary experiments. All these experiments performed quite poorly on ELI5, producing very short and sometimes incoherent responses. Finally, switching to a Routing Transformer model which was pretrained on a longform language modeling dataset (PG-19) significantly improved generation quality. Hyperparameters for this pretrained model (like hidden size / number of layers) were manually chosen with model capacity in mind. For our final experiments with this pretrained model we did not perform any hyperparameter search during training, primarily due to the expensive setup required to train the system. During inference, we tuned the nucleus sampling value from 0.0 to 1.0 in increments of 0.1, choosing the value with the best validation set performance. Our hyperparameter choices for contrastive learning on the retriever are justified in an ablation study in Appendix A.2. Notably, we use very large minibatches of 12,288 to scale the number of negative examples. To train this model, we used the standard trick of data parallelism across 64 hardware accelerators. This resulted in an effective mini-batch size of 192 per chip, which is small enough to fit a BERT-base sized model on a TPU v3 chip’s memory. To accumulate information across different chips before the final softmax, we used the tf.tpu.cross_replica_sum function (via an open-source wrapper).
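For readers unfamiliar with the inference-time knob tuned above, the following is a minimal sketch of nucleus (top-p) filtering over a single next-token distribution, together with the grid of values that was searched. The function name `nucleus_filter` is hypothetical, and the code illustrates the general technique rather than the decoding code used in our system.

```python
def nucleus_filter(probs, p):
    """Keep the smallest set of most-probable tokens whose mass reaches p.

    Returns a dict mapping kept token ids to renormalized probabilities.
    With p = 0.0 only the single most probable token survives (greedy
    decoding); with p = 1.0 the full distribution is kept.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


# Grid of nucleus values from 0.0 to 1.0 in increments of 0.1.
P_GRID = [round(0.1 * k, 1) for k in range(11)]

example = [0.5, 0.3, 0.15, 0.05]
filtered = {p: nucleus_filter(example, p) for p in P_GRID}
```

Selecting the final value then amounts to generating validation answers at each entry of `P_GRID` and keeping the one with the best validation score.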

A.2 Ablation Study of c-REALM

One of our contributions is scaling up a distantly supervised objective for training retrievers on ELI5, originally described in Jernite (2020). This method uses in-batch negative sampling, making minibatch size a critical hyperparameter for better contrastive learning. We perform controlled experiments initializing our retrievers with REALM-CCNews (Guu et al., 2020) and varying batch size while keeping all other hyperparameters consistent. In Table 8, we notice a steady increase in performance as minibatch size is increased, with the largest gains coming by doubling the batch size in Jernite (2020) from 512 to 1024. Finally, in preliminary experiments we saw no benefit of more intelligent negative sampling schemes.
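To make the role of minibatch size concrete, here is a minimal sketch of a contrastive loss with in-batch negatives. It assumes inner-product scores between question and candidate vectors and a softmax cross-entropy over the batch; the function name and the toy vectors are hypothetical, and the sketch is not the exact objective used to train c-REALM.

```python
import math


def in_batch_contrastive_loss(q_vecs, c_vecs):
    """Mean cross-entropy where candidate i is the positive for question i.

    Every other candidate in the batch serves as a negative, so a batch of
    size B gives each question B - 1 negatives at no extra encoding cost.
    """
    losses = []
    for i, q in enumerate(q_vecs):
        scores = [sum(a * b for a, b in zip(q, c)) for c in c_vecs]
        top = max(scores)  # subtract the max for a stable log-sum-exp
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


# Toy batch of two questions with perfectly aligned candidates.
toy_loss = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])

# Negatives available per question at two batch sizes from the text.
negatives = {batch: batch - 1 for batch in (512, 12288)}
```

An untrained retriever that scores all candidates equally incurs a loss of log B, so larger batches pose a harder discrimination problem per example.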

Next, we investigate the effect of initialization on the training of c-REALM. Unlike Jernite (2020), who initialize their model with BERT, we initialize our retriever with a pretrained self-supervised retriever before training. As a baseline, we initialize our model with ICT, a weaker self-supervised retriever introduced in Lee et al. (2019). Both models are trained with minibatch sizes of 12,288. In Table 9, we notice a large improvement in performance when using a better initialization, confirming our design decisions.

A.3 Number of trainable parameters

In Table 10 we present the number of trainable parameters in our model compared to baselines on the leaderboard. Our generator is slightly larger than the models used in prior work, but we utilize a smaller retriever due to the shared query and candidate encoders in REALM. Overall, our system has a similar total number of parameters as baseline models like RAG and BART + DPR.

A.4 Generations from our System

More generations have been provided (along with retrievals, highlighted to show $n$-gram overlap) in the supplementary material (data) as HTML files. We also present a few samples in Table 16.

A.5 Human Evaluation Setup

We conducted several A/B tests between variants of our model using human annotators. A total of 20 participants volunteered to help with the annotation process. Most participants were English-speaking graduate students in computer science. In every test, participants were shown a question along with two answers (generated by different systems) presented in a random order. They were then asked to choose which generation (1) answered the question better / which answer was more relevant to the question; (2) was more coherent / had less repetition; (3) was more factually correct. Since some annotators had limited time, we asked them to prioritize question (1) over (2) / (3). Annotators were allowed to select “Tie” if they could not choose between the systems. We also permitted them to use search engines, but suggested restricting search to Wikipedia. We present all our results in Table 15. We also interviewed some participants after the annotation process and discuss our findings in Section 3.4. Note that while these A/B tests help us understand which system is relatively better, they do not provide an absolute measure of performance (Celikyilmaz et al., 2020): annotators reported that there were cases where both answers were very good and other cases where both were very poor. This is a limitation of A/B testing.

A.6 Effect of length on ROUGE-L

In this section we measure the effect of output length on ROUGE-L scores. To conduct this experiment, we truncate generations by our system to a fixed fraction of tokens across all instances. As we see in Table 11 in the Truncate column, shorter generations tend to have lower ROUGE-L. To disentangle the effects of length and content, we also measure the generation quality by repeating the truncated generations several times until they match the original generation length. In the Repeat $1/f$ times column, we notice a gap between our model’s original generation (24.4 ROUGE-L) and the equal-length truncated generations with repetition. These results indicate that while length helps improve ROUGE-L scores, simple repetition is insufficient.
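The two length manipulations can be sketched as below. The LCS-based F-measure is a simplified stand-in for the official ROUGE-L scorer (no stemming or tokenization rules), and the function names and the toy token sequence are hypothetical.

```python
import math


def lcs_len(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(candidate), lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


def truncate(tokens, f):
    """Keep the first fraction f of the tokens (at least one token)."""
    return tokens[:max(1, int(len(tokens) * f))]


def truncate_and_repeat(tokens, f):
    """Repeat the truncated prefix until the original length is restored."""
    piece = truncate(tokens, f)
    reps = math.ceil(len(tokens) / len(piece))
    return (piece * reps)[:len(tokens)]


generation = list(range(10))           # toy stand-in for a generated answer
short = truncate(generation, 0.5)
padded = truncate_and_repeat(generation, 0.5)
```

Scoring `short` and `padded` against the same references separates the contribution of length from that of content, since `padded` has the original length but no new content.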

A.7 More experiments on measuring retrieval grounding of generations

In this section we provide some more experiments testing the grounding of generations in retrieved documents. Overall, trends are consistent with our observations in Section 3.1.

Scatter plots between generation quality and unigram overlap with retrievals: We present this scatter plot in Figure 4. There is virtually no correlation between the two quantities, with Spearman $\rho=0.09$.
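For reference, Spearman's $\rho$ is the Pearson correlation computed on ranks. The following is a minimal standard-library sketch of that computation; the function names are hypothetical, and in practice a library routine such as SciPy's `spearmanr` would normally be used instead.

```python
def ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Applied to per-instance pairs of (generation quality, unigram overlap with retrievals), a value near zero indicates that the two quantities are not even monotonically related.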

Instances with correct predicted retrieval: In Table 12, we present results similar to Section 3.1 considering only those instances where at least one retrieved document matched the gold annotation (roughly 23% of instances). We also present a scatter plot on the same set of instances in Figure 5 and note a low correlation of $\rho=0.13$.

Experiments with $p=0.9$: We conduct additional experiments studying our model variant with higher nucleus sampling values. As we saw in Section 2.3, these generations tend to be more fluent and coherent, but less relevant to the question. In Table 13 and Table 14 we find trends consistent with Section 3.1, with very little difference between models conditioned on retrievals from c-REALM and random retrievals.