ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding

Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, Omer Levy

Introduction

Large language models (LLMs) have been improving at an incredible pace, solving problems that seemed out of reach, without any task-specific training examples Wei et al. (2022a); Ouyang et al. (2022); OpenAI (2023). As commercial LLMs are adopted worldwide, it becomes clear that they must also operate successfully over long sequences, such as conversation histories or scientific documents. However, current LLM benchmarks that do evaluate models in a zero-shot setting, such as HELM (Liang et al., 2022) and BigBench (Srivastava et al., 2022), mostly focus on short sequences; BigBench, for example, has an average of 77 words per input. To fill this gap, we introduce ZeroScrolls: Zero-Shot CompaRison Over Long Language Sequences, a benchmark for zero-shot long text reasoning over natural language, and conduct a thorough investigation of state-of-the-art LLMs.

ZeroScrolls extends Scrolls (Shaham et al., 2022), a long text understanding benchmark that enables fine-tuning, adding four additional tasks: query-based summarization, multi-hop question answering, sentiment aggregation, and sorting book chapter summaries. We specifically design the latter two tasks to examine a model’s ability to aggregate and compare information across long sequences, while keeping evaluation simple and accurate. ZeroScrolls is designed to test zero-shot capabilities, and contains test sets with simple natural prompts and private gold references, small validation sets, and no train data. It has a live leaderboard to enable transparent and dynamic progress. Figure 1 shows the state of the leaderboard based on our experiments, and Figure 2 shows a per-task breakdown of a selected subset of models.

We use this new testbed to perform extensive evaluation and analysis across state-of-the-art open and closed models. On question answering tasks, we find that zero-shot LLMs bridge the gap with task-specific fine-tuned models; GPT-4 sets a new state of the art on the challenging QuALITY task Pang et al. (2022), almost reaching human performance. In contrast, LLMs generally struggle to obtain such high scores for summarization tasks without a training set from which to learn the nuances and artifacts of each dataset, even though GPT-4 does approach the fine-tuned state of the art on two of three datasets. We also observe that two of our new tasks, sentiment aggregation and sorting book chapter summaries, prove exceptionally challenging for all LLMs, with only GPT-4 surpassing the naive baseline in each task. Our code is available online at https://github.com/tau-nlp/zero_scrolls.

When analyzing GPT-4 responses, we often find correct answers that do not match the requested format; e.g. producing a full sentence when asked to answer in a single phrase. This problem is not unique to GPT-4, as different models may deviate from the specified format in different tasks. While ZeroScrolls is primarily aimed at facilitating research in understanding long texts, we encourage the community to use this benchmark to advance research in instruction understanding, prompt engineering, and evaluation of generated texts as well.

Background: Scrolls

Scrolls Shaham et al. (2022) was introduced as a long text understanding benchmark. Its datasets were curated, cleaned, and reformatted to a single input-output format allowing for easy and fast usage, with every example containing a single long document, such as a scientific paper or a book. Since its launch, Scrolls has facilitated significant progress, including new pretraining objectives Tay et al. (2023), adaptations of short-text models to long sequences Phang et al. (2022); Xiong et al. (2022); Ivgi et al. (2023); Bertsch et al. (2023), and dedicated long sequence models pretrained from scratch Guo et al. (2022); Ainslie et al. (2023).

All the aforementioned methods eventually fine-tune a specialized model for every single task in Scrolls, a setting that remains important for many applications. However, in the modern era of general purpose, zero-shot reasoning LLMs, a new evaluation setup is required, where this dependence on task-specific fine-tuning is alleviated.

The ZeroScrolls Benchmark

ZeroScrolls is a zero-shot benchmark containing test sets of ten natural language tasks, each one requiring reasoning over a different type of long text. To ensure affordability, we limit every task to a maximum of 500 examples.

We describe the different ZeroScrolls datasets, six of which we adapt from Shaham et al. (2022), and four new tasks. Table 1 provides an overview.

We adopt the three summarization datasets from Scrolls (GovReport, SummScreenFD, and QMSum), and add a fourth (SQuALITY). GovReport and SummScreenFD are full-document summarization tasks, while QMSum and SQuALITY are query-focused.

GovReport (Huang et al., 2021) contains long reports by the Congressional Research Service and the U.S. Government Accountability Office, with their expert-written summaries.

SummScreenFD (Chen et al., 2022) contains episode scripts from TV shows, with community-contributed recaps collected from Wikipedia and TVMaze as their summaries.

QMSum (Zhong et al., 2021) is a query-based summarization dataset over meeting transcripts. It contains academic meetings, industrial product meetings, and Welsh and Canadian parliament transcripts. Alongside the meeting transcript, each instance contains a query, which aims to focus the summary on a particular topic.

SQuALITY (Wang et al., 2022) is a question-focused summarization dataset, where given a story from Project Gutenberg, the task is to produce a summary of the story or aspects of it based on a guiding question. The questions and summaries are original and crowdsourced; experienced writers were told to design questions that require reading significant parts of the story to answer correctly.

3.1.2 Question Answering

We adopt the three question answering datasets from Scrolls (Qasper, NarrativeQA, and QuALITY), and add MuSiQue, which focuses on multi-hop question answering.

Qasper (Dasigi et al., 2021) contains NLP papers from the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al., 2020). NLP practitioners provided questions based on the abstracts, and another set of practitioners answered given the articles.

NarrativeQA (Kočiský et al., 2018) contains questions and answers over books from Project Gutenberg and movie scripts from various websites. To create questions and answers, annotators were provided summaries of the books and movies from Wikipedia, and each question was answered by one or more annotators.

QuALITY (Pang et al., 2022) contains stories and articles from Project Gutenberg, the Open American National Corpus, and more. Each instance contains a story and a multiple choice question; question writers were guided to write questions that require reading large portions of the story to answer correctly.

MuSiQue (Trivedi et al., 2022) is a multi-hop question answering dataset, where the inputs are 20 Wikipedia paragraphs and a question that requires multiple hops between different paragraphs. In the original dataset, each question also has an unanswerable twin question, where the correct answer is not present in the paragraphs. We randomly sample 100 unanswerable and 400 answerable questions for ZeroScrolls.

3.1.3 Aggregation

We create two new tasks that, by construction, require contextualizing and aggregating information from different parts of the input. Despite the inherent complexity required to solve these tasks, we design their evaluation to be simple and accurate.

SpaceDigest is a new sentiment aggregation task. Given 50 hotel reviews (without their ratings) from the Space dataset Angelidis et al. (2021), the task is to determine the percentage of positive reviews. We create one example (50 reviews) per hotel from the 500 most rated hotels in the original dataset, keeping only strictly positive (rating 5 or 4) or negative (rating 2 or 1) reviews, discarding ones with an ambivalent rating of 3. To verify that humans perform this task well, we gave 5 human annotators a shortened version of the examples (containing 10 reviews per example) and asked them to write the percentage of positive reviews. Each annotator was assigned 10 examples (100 reviews per annotator, 500 overall). The annotators aggregated their individual predictions perfectly, and had a total of 8 single-review classification errors out of the 500 reviews seen ($\sim$98.4% accuracy).

BookSumSort is a new task based on the BookSum dataset (Kryscinski et al., 2022), which contains summaries of chapters (or parts) of novels, plays, and long poems from various sources. Given a shuffled list of chapter summaries, the task is to reorder them according to the original order of summaries in BookSum. We create the task by manually selecting the summaries of 125 books from BookSum, retaining only high-quality instances. We manually edit each summary by removing introductions, prefaces, overviews, and so forth, as well as any other information that may indicate the exact position of a summary; for example, “Chapter 8 begins with Jane describing…” is replaced with “This Chapter begins with Jane describing…” and “As the play opens, Hippolytus announces…” becomes “Hippolytus announces…”. Each list of summaries contains between 3 and 86 chapter summaries, with a median of 15 and an average of 18.8 chapters per instance. We select 4 random permutations of each list to create 500 instances.

3.2 Prompting

ZeroScrolls tests the ability to reason over long texts without any explicit training examples (zero-shot). We thus complement each data instance with an instruction that defines both the task and the desired output format Efrat and Levy (2020), without in-context demonstrations. While we invest effort in designing the canonical prompts for ZeroScrolls, the benchmark is open to further zero-shot prompt engineering Radford et al. (2019); Schick and Schütze (2021a, b), such as prompts that encourage chain-of-thought reasoning Wei et al. (2022b). Table 5 contains the prompts for the summarization tasks and Table 6 contains prompts for question answering and aggregation tasks.

Figure 3 illustrates an example from the benchmark. We manually craft a prompt for each dataset, following a generic template composed of instruction, context, query, and response. The instruction describes the task, and ends with the desired output format (e.g. “Answer the query in one or more sentences.” for QMSum). When the total input size is too long for a model’s context window, we trim the context and append a string explicitly stating that the rest of the context is trimmed, to inform the model that it cannot see the entire context. We then concatenate the context with a header describing what kind of context it is, e.g. “Report:”, “Reviews:”, etc. For tasks that have queries, we append the question or query with an appropriate header. The prompt ends with a header indicating the response type (e.g. "Answer:" or "Summary:").
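The template described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_prompt`, the wording of the trim notice, and the use of whitespace-separated words as a stand-in for model tokens are all hypothetical, and the canonical prompts are the ones in Tables 5 and 6.

```python
def build_prompt(instruction, context_header, context, response_header,
                 query_header=None, query=None, max_context_words=None,
                 trim_notice="... [The rest of the context is trimmed]"):
    """Assemble instruction, context, optional query, and response header.

    Assumption: length is counted in whitespace-separated words; a real
    setup would count tokens with the evaluated model's own tokenizer.
    """
    words = context.split()
    if max_context_words is not None and len(words) > max_context_words:
        # Trim the context and say so explicitly, so the model knows
        # that it cannot see the entire context.
        context = " ".join(words[:max_context_words]) + " " + trim_notice

    parts = [instruction, f"{context_header}\n{context}"]
    if query is not None:
        # Only tasks that have queries get a query section.
        parts.append(f"{query_header}\n{query}")
    parts.append(response_header)  # e.g. "Answer:" or "Summary:"
    return "\n\n".join(parts)


prompt = build_prompt(
    instruction="Answer the query in one or more sentences.",
    context_header="Transcript:",
    context="alpha beta gamma delta epsilon",
    response_header="Answer:",
    query_header="Query:",
    query="What was decided?",
    max_context_words=3,
)
```

In this sketch the trim notice is appended to the context itself, so it sits immediately before the query and response headers.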

Chat LLMs, such as ChatGPT and Claude, are designed to interact with humans through a chat interface. We therefore adapt our canonical prompts to accommodate these models. Specifically, we omit the response header (e.g. “Summary:” or “Answer:”) as it is clear, in dialogue, that the input sequence has ended. In addition, we append “Do not provide any explanation.” to the instructions of question answering and aggregation tasks. For Claude, we wrap each prompt with “Human:” and “Assistant:” dialogue indicators, and for the question answering and aggregation tasks also add the instruction to “please highlight your final answer with <{response_type}> tags”, as recommended by Anthropic’s documentation (https://console.anthropic.com/docs/prompt-design/classification).

3.3 Automatic Evaluation

ZeroScrolls evaluation is fully automatic. Given a model’s response to every test instance, we apply per-task automatic evaluation metrics. These are then averaged across tasks to produce the model’s ZeroScrolls score. For existing datasets, we follow Shaham et al. (2022) and use the metrics provided by each dataset’s authors. For our newly proposed tasks (SpaceDigest and BookSumSort), we use two new automatic metrics.

(GovReport, SummScreenFD, QMSum, SQuALITY) ROUGE (Lin, 2004) measures ngram overlap between generated and reference summaries. For each instance, we combine ROUGE-1, ROUGE-2, and ROUGE-L into a single score by computing their geometric mean. For SQuALITY, where there are multiple references, we take the maximal value of each ROUGE type before computing the geometric mean.
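As a minimal sketch of the combination step only, the following assumes that ROUGE-1, ROUGE-2, and ROUGE-L have already been computed against each reference by some ROUGE implementation. The function name and the dictionary keys are hypothetical.

```python
def combined_rouge(per_reference_scores):
    """Geometric mean of ROUGE-1/2/L, taking the max over references per type.

    per_reference_scores: one dict per reference summary with the
    (hypothetical) keys "rouge1", "rouge2", "rougeL", each in [0, 1].
    """
    keys = ("rouge1", "rouge2", "rougeL")
    # For multi-reference data, keep the best value of each ROUGE type.
    best = [max(ref[k] for ref in per_reference_scores) for k in keys]
    product = best[0] * best[1] * best[2]
    return product ** (1.0 / 3.0)


single = combined_rouge([{"rouge1": 0.8, "rouge2": 0.1, "rougeL": 0.1}])
multi = combined_rouge([
    {"rouge1": 0.8, "rouge2": 0.05, "rougeL": 0.1},
    {"rouge1": 0.4, "rouge2": 0.1, "rougeL": 0.05},
])
```

Because the maximum is taken per ROUGE type, the three values entering the geometric mean may come from different references.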

(Qasper, NarrativeQA, MuSiQue) F1 computes unigram overlap between generated and reference answers, after normalizing white-spaces, lowercasing, omitting stopwords and punctuation Rajpurkar et al. (2016), and transliterating any Unicode text to ASCII characters. For Qasper and NarrativeQA, where there are multiple reference answers, we take the maximal F1 score per instance.
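The unigram F1 described above might look roughly like the sketch below. Two parts are assumptions rather than the benchmark's actual code: the stopword list is reduced to English articles, and transliteration is approximated with the standard library's NFKD decomposition, which drops characters that have no ASCII decomposition.

```python
import string
import unicodedata
from collections import Counter

ARTICLES = {"a", "an", "the"}  # assumption: stand-in for the stopword list


def normalize(text):
    """Lowercase, map to ASCII, strip punctuation and articles, split on whitespace."""
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return [tok for tok in text.split() if tok not in ARTICLES]


def f1(prediction, reference):
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def max_f1(prediction, references):
    """Multi-reference case: take the best F1 over the reference answers."""
    return max(f1(prediction, ref) for ref in references)
```

Under this sketch, a full-sentence answer such as “It was the butler.” scores 0.5 against the reference “the butler” even though it is correct, because the extra words lower precision.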

(QuALITY) For multiple choice questions, we compare the predicted letter (A, B, C, or D) to the reference. We use the first valid option letter surrounded by word boundaries.
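A rough sketch of this letter extraction uses a regular expression with word boundaries. The function names are hypothetical, and matching only uppercase letters is an assumption.

```python
import re


def parse_choice(output):
    """Return the first standalone A/B/C/D in the output, or None if there is none."""
    match = re.search(r"\b([ABCD])\b", output)
    return match.group(1) if match else None


def accuracy(outputs, golds):
    """Fraction of outputs whose parsed letter equals the gold letter."""
    correct = sum(parse_choice(out) == gold for out, gold in zip(outputs, golds))
    return correct / len(golds)
```

The word-boundary requirement skips the “A” in “Answer: C”, but a response that opens with the article “A” would still be parsed as option A, which is one way a correct but verbose answer can be scored as wrong.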

(SpaceDigest) Assuming that the output is a percentage, we compute the exponential similarity between the gold reference percentage $p$ and the predicted scalar $\hat{p}$ (both expressed as fractions in $[0,1]$). If the output is not a percentage, we score 0%. We parse the first appearance of a percentage; e.g. for the output “Out of 50 reviews, 20 are positive and 30 are negative, so 40% of the reviews are positive 60% are negative.” we automatically parse 40% as the answer. The exponential similarity is defined as:

$$\mathrm{ES}(p,\hat{p}) = d^{-c\cdot|p-\hat{p}|}$$

We use $d=2$ and $c=10$ , which means that, intuitively, the score gets cut by half for every 10 point deviation from the correct answer.
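To make the halving behaviour concrete, here is a small sketch that assumes the score has the form $d^{-c\cdot|p-\hat{p}|}$ with $p$ and $\hat{p}$ as fractions in $[0,1]$, which is consistent with the description of $d=2$ and $c=10$. The regular expression for parsing a percentage and both function names are assumptions.

```python
import re


def parse_percentage(output):
    """Return the first number directly followed by '%', or None if absent."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", output)
    return float(match.group(1)) if match else None


def exponential_similarity(gold_pct, output, d=2.0, c=10.0):
    """Score a model output against a gold percentage given in points (0-100)."""
    pred_pct = parse_percentage(output)
    if pred_pct is None:
        return 0.0  # outputs without a percentage score 0
    # Convert the deviation from percentage points to a fraction.
    return d ** (-c * abs(gold_pct - pred_pct) / 100.0)


example = ("Out of 50 reviews, 20 are positive and 30 are negative, "
           "so 40% of the reviews are positive 60% are negative.")
exact = exponential_similarity(40, example)    # 1.0
off_by_10 = exponential_similarity(50, "40%")  # 0.5
off_by_20 = exponential_similarity(60, "40%")  # 0.25
```

Under these assumptions, the always-50% baseline scores 0.5 on a hotel whose reviews are 40% or 60% positive.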

(BookSumSort) Assuming that the output is a permutation of the given chapter summary IDs, we measure the number of chapter summary pairs that are in the right order, divided by the total number of pairs $\binom{n}{2}$. If the output is not a permutation, we score 0%. We discard all characters but digits, commas, and white-spaces from the output string to eliminate any prefixes such as “Order:”. The average random permutation scores 50% on this metric.
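The pairwise-order score can be sketched as follows. The function names are hypothetical, and the sketch assumes that the summary IDs are the integers $1,\dots,n$ and that the gold reference is the list of IDs in their original order.

```python
import re
from itertools import combinations


def parse_permutation(output, n):
    """Keep digits, commas, and whitespace; return the ID list, or None if it is not a permutation of 1..n."""
    cleaned = re.sub(r"[^\d,\s]", "", output)
    ids = [int(tok) for tok in re.split(r"[,\s]+", cleaned) if tok]
    return ids if sorted(ids) == list(range(1, n + 1)) else None


def pairwise_order_score(output, gold_order):
    """Fraction of the n-choose-2 summary pairs that the output places in the right order."""
    n = len(gold_order)
    predicted = parse_permutation(output, n)
    if predicted is None:
        return 0.0  # outputs that are not permutations score 0
    position = {sid: i for i, sid in enumerate(predicted)}
    # combinations() yields (a, b) with a before b in the gold order.
    pairs = list(combinations(gold_order, 2))
    concordant = sum(position[a] < position[b] for a, b in pairs)
    return concordant / len(pairs)


gold = [2, 1, 3]
perfect = pairwise_order_score("Order: 2, 1, 3", gold)  # all 3 pairs correct
partial = pairwise_order_score("1, 2, 3", gold)         # 2 of 3 pairs correct
reverse = pairwise_order_score("3, 1, 2", gold)         # no pair correct
```

A fully reversed order scores 0 and a perfect order scores 1, which is why a random permutation averages 50%.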

Evaluating State-of-the-Art LLMs

Using ZeroScrolls we conduct, to the best of our knowledge, the first systematic comparison of zero-shot LLM performance over tasks that require long text understanding.

We evaluate both open-source models and closed products available via APIs. We apply greedy decoding to all models, and leave further research into other decoding strategies to future work. Table 2 shows the selection of models we evaluate.

We experiment with Flan-T5-xxl (Wei et al., 2022a) and Flan-UL2, the instruction-tuned versions of T5 (Raffel et al., 2020) and UL2 (Tay et al., 2023), as well as T0pp (Sanh et al., 2022), an LM-adapted (Lester et al., 2021) version of T5 that was finetuned on various NLP tasks for zero-shot generalization. For all open-source models we use a maximum input length of 8,192 tokens (larger contexts were unstable). We also experiment with shorter context lengths and smaller variants of Flan-T5.

Using product APIs, we evaluate Claude v1.3 from Anthropic (https://www.anthropic.com/index/introducing-claude), and DaVinci003 (https://platform.openai.com/docs/model-index-for-researchers), ChatGPT v0301 (https://chat.openai.com/), and GPT-4 v0314 (OpenAI, 2023) from OpenAI. The maximal context length of these models includes both input and output.

To compare general-purpose LLMs (zero-shot) to task-specific models (fine-tuned), we use predictions by CoLT5-xl (Ainslie et al., 2023), a transformer that allocates more resources to important tokens. It has a maximum input length of 16,384 tokens and is the current state of the art on Scrolls.

We implement simple baselines for all tasks. For GovReport, SummScreenFD, QMSum, SQuALITY and NarrativeQA, we select random spans from the input document of 500, 200, 50, 120 and 4 words respectively. For Qasper, we randomly decide whether to use one of its fixed choices (“Yes”, “No”, “Unanswerable”) or choose a random span of 15 words. For MuSiQue, we use “Unanswerable” for every instance. For QuALITY, we randomly select an option from A, B, C, or D. For SpaceDigest we always use 50%, and for BookSumSort we use the trivial permutation “ $1,2,3,...,n$ .”
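Two of these baselines are simple enough to sketch directly. This is an illustration, not the benchmark's code: the function names are hypothetical, and it assumes that a “random span” is a contiguous run of whitespace-separated words chosen uniformly at random.

```python
import random


def random_span(document, span_words, rng):
    """Return a random contiguous span of span_words words from the document."""
    words = document.split()
    if len(words) <= span_words:
        return " ".join(words)
    start = rng.randrange(len(words) - span_words + 1)
    return " ".join(words[start:start + span_words])


def trivial_permutation(n):
    """BookSumSort baseline: the identity ordering 1, 2, ..., n."""
    return ", ".join(str(i) for i in range(1, n + 1))


rng = random.Random(0)  # seeded so the baseline is reproducible
document = " ".join(f"w{i}" for i in range(100))
narrativeqa_guess = random_span(document, 4, rng)  # 4-word span, as for NarrativeQA
booksum_guess = trivial_permutation(3)
```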

We provide human performance figures for 6 of the 10 tasks. For SQuALITY, Wang et al. (2022) estimate human performance by comparing one reference against the other three. Similarly, for Qasper and NarrativeQA, we calculate inter-annotator F1 on the ZeroScrolls subsets. We use the human scores reported by Pang et al. (2022) on the full QuALITY test set, while for MuSiQue, we combine statistics on answerable and non-answerable sets from Trivedi et al. (2022). For SpaceDigest, we use our own human annotations (Section 3.1.3) to estimate exponential similarity over 50 reviews.

4.2 Main Results

Table 3 shows the results for every model on every ZeroScrolls task, along with the average. The overall best model is GPT-4 with an average score of 41.7, and its closest competitor is Claude with 39.1, both significantly higher than the other models. We discuss the results per task category.

In summarization, there is a clear trend: open-source models lag behind product-grade LLMs, and GPT-4 reaches the highest ROUGE scores on all four datasets. However, zero-shot LLMs struggle to compete with models fine-tuned per dataset (CoLT5) on those tasks, with some gap on SummScreenFD and QMSum, and a dramatic difference on GovReport (41.0 compared to 26.3). In SQuALITY, GPT-4 is only one point away from the lower bound on human performance.

We see a different trend in question answering. GPT-4 achieves the best result on only one dataset, QuALITY, where it scores 89.2, close to human performance of 93.5. Flan-UL2 sets the high scores for Qasper and MuSiQue, while Claude has the best F1 on NarrativeQA, 5 points more than GPT-4. Our analysis in Section 5 reveals that GPT-4 does not conform to the required answer format, resulting in a lower score.

Our new SpaceDigest and BookSumSort datasets enrich ZeroScrolls with challenges that explicitly require aggregating information across the sequence. Results indicate that both tasks are difficult for current LLMs. Performance figures for SpaceDigest show that even though sentiment analysis, counting, and divisions are all “easy” tasks for contemporary models, their combination can be quite challenging; only Claude and GPT-4 significantly outperform the naive baseline. The situation is even more dire in BookSumSort, where only GPT-4 outperforms the naive baseline.

4.3 Impact of Model Size and Input Length

We now discuss the effects of increasing model size (parameters) and context length (tokens). As one may expect, both dimensions improve performance on ZeroScrolls, suggesting that the benchmark does indeed necessitate complex reasoning over long sequences.

The upper section of Table 4 shows results of Flan-T5 of various sizes, ranging from S (60M parameters) to XXL (11B parameters). As expected, increasing model size drives performance upwards across almost all tasks.

The middle and lower sections of Table 4 show the effect of increasing the maximum number of input tokens for Flan-T5 and Claude. In general, increasing the number of tokens helps the models perform the tasks better. Claude is able to utilize the extra tokens more consistently, which results in an almost 3 point increment to its average score when going from 4k to 8k tokens. Interestingly, Flan-T5 also achieves higher scores on longer inputs in many cases, despite being trained on much shorter sequences.

Analysis

While GPT-4 has the highest score on the ZeroScrolls leaderboard, we find it surprising that other models score higher on a number of question answering tasks. We analyze model generations and observe that GPT-4 responses do not match the desired output format (despite explicit instructions in the prompt), which results in penalization by the automatic metrics. Further analysis reveals that format discrepancy is a phenomenon that occurs across different LLMs and tasks, and is not unique to GPT-4 and question answering.

We analyze the responses of GPT-4 and Claude for NarrativeQA (where Claude scores 5 points higher), and the responses of GPT-4 and Flan-UL2 for Qasper and MuSiQue (where Flan-UL2 scores 6.2 and 10.2 points higher, respectively). Specifically, we sample 100 instances from each dataset, and annotate whether the answer is correct, ignoring formatting, fluency, or other factors. Figure 4 shows that, in contrast to the F1 scores, GPT-4 performs better than Claude and Flan-UL2 on NarrativeQA and Qasper, respectively, and that the gap between GPT-4 and Flan-UL2 on MuSiQue is smaller in practice.

From examining the generated texts, we learn that GPT-4 consistently generates complete answers even though the prompt instructs otherwise (see Section 3.2 and Appendix A). We further analyze 200 random instances from NarrativeQA and check whether GPT-4 and Claude respond in the specified format, i.e. “using a single phrase if possible,” regardless of whether the content is correct or not. While Claude answers 191 questions in the correct format, GPT-4 does so for only 71 out of the 200 analyzed examples – explaining why GPT-4 is penalized harder by the F1 metric, despite being “correct” more often than Claude. Another interesting observation from analyzing NarrativeQA is that GPT-4 sometimes responds that it is unable to answer the question because the (trimmed) context does not contain the answer. It does so for 30 out of 200 cases, while Claude generates a similar response for only 5, despite both models having similar context lengths (8k).

Figure 5 surveys the distribution of output lengths across multiple tasks and models. In most cases, models generate outputs that fall within the distribution of reference lengths, indicating that the format criteria provided in the prompts are sufficient. However, certain task-model combinations fall outside of the reference distribution. While the NarrativeQA plot confirms our previous observation that GPT-4 generates longer answers for this task, we find that format discrepancy is not unique to this dataset or GPT-4, as different models struggle to generate texts in the correct format on different tasks; Claude generates long answers for QMSum, Flan-UL2 generates long summaries in SummScreenFD, and all models generate short summaries for GovReport, which negatively impacts their scores.

Conclusion

We introduce ZeroScrolls, a benchmark for zero-shot natural language understanding over long texts. ZeroScrolls enables systematic comparison of LLMs on tasks with naturally long input texts, and ones that require contextualizing and aggregating information from multiple documents. We evaluate open-source and production-grade LLMs to find that GPT-4 and Claude are currently the best performing models, while open-source models such as Flan-UL2 also prove powerful at long-context question answering tasks. ZeroScrolls remains an open challenge for LLM research, with our two new aggregation tasks proving to be particularly difficult for contemporary LLMs.

Limitations

As language models improve, evaluating them presents a growing challenge given their ability to consistently generate coherent and reasonable text, which is harder to score, even with gold references at hand. Specifically in the zero-shot setting, where models must infer the output format from the prompt, ROUGE and F1 (ngram metrics) can assign low scores to semantically equivalent generations that differ in word choice or answer length. Additionally, to conduct fair evaluation, we use common prompt templates across models for every task, while model-specific prompts, as well as chain-of-thought prompting, may improve model performance on this benchmark. Finally, the state of the art is a moving target, and as we write these lines new long-range models, alignment methods, decoding algorithms, and prompting techniques become available; we invite researchers to evaluate their ideas on the ZeroScrolls leaderboard.

Acknowledgements

This research is supported by the Yandex Initiative in Machine Learning and by the Len Blavatnik and the Blavatnik Family foundation. The benchmark is released by Tel Aviv University. All experiments were conducted by Tel Aviv University.

References

Appendix A Prompts

Table 5 shows ZeroScrolls prompts for summarization tasks, and Table 6 shows our prompts for question answering and aggregation tasks. The prompts are designed to be simple, natural, and explicit. In braces are placeholders for the text of every example.