$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens

Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, Maosong Sun

cs.CL

Introduction

In recent years, large language models (LLMs) Brown et al. (2020); OpenAI (2023a); Touvron et al. (2023) have exhibited exceptional performance across a range of natural language processing (NLP) tasks Qiu et al. (2020); Han et al. (2021). LLMs are showing a promising direction toward generalist task assistance, being capable of aiding users in practical tasks through conversational interactions. These tasks include web navigation Nakano et al. (2021), analysis of code repositories Chen et al. (2021), and extraction of useful information from documents Kočiskỳ et al. (2018), indicating a step towards artificial general intelligence. For these LLM-based scenarios, the ability to process long contexts is increasingly critical, in addition to understanding fine-grained semantics and possessing extensive knowledge Dong et al. (2023); Huang et al. (2023). Textual documents, historical dialogues, complex instructions, and cumbersome workflows, which constitute the data most directly processed in daily tasks, must be input to LLMs as long contexts for effective processing.

Despite this growing importance, LLMs consistently face challenges in processing long contexts, primarily due to the substantial computational resources required for long sequence training Dao et al. (2022); Dao (2023) as well as the apparent inability to generalize to sequences longer than those encountered during training Chen et al. (2023a); Peng et al. (2023b). LLMs are typically trained on sequences containing no more than 8K tokens Touvron et al. (2023); Penedo et al. (2023); Biderman et al. (2023), and thus struggle to handle contexts exceeding 8K tokens. These limitations have largely restricted most LLMs from being applied to more complex tasks.

Recent advancements in training infrastructure Shoeybi et al. (2019); Narayanan et al. (2021); Dao et al. (2022); Dao (2023), and efforts to improve length generalization Anil et al. (2022); Chen et al. (2023b); Peng et al. (2023b) (https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/) have led to rapid developments in long-context LLMs. Based on these improved training infrastructures and length generalization methods, several LLMs have purportedly managed to process data exceeding 100K tokens (Peng et al., 2023b; OpenAI, 2023b; 01.AI, 2023b, a), with Claude 2 Anthropic (2023) and Kimi-Chat AI (2023) even claiming to be able to process up to 200K tokens. However, the rapid emergence of long-context LLMs has outpaced the development of adequate evaluation benchmarks. Present long-context benchmarks predominantly feature contexts averaging around 10K tokens (Bai et al., 2023; Tay et al., 2020), invariably falling below 100K tokens. This lag in the advancement of long-context evaluation methodologies impedes both the comparative analysis of diverse long-context LLMs and the pinpointing of potential enhancements in long-context processing.

In this work, we present $\infty$ Bench, the first comprehensive benchmark featuring an average data length surpassing 100K tokens. $\infty$ Bench includes tasks in different domains (novels, code, math, etc.) and languages (English and Chinese). To fully evaluate the performance of long-context LLMs, $\infty$ Bench integrates synthetic tasks that can be auto-generated for even longer contexts (e.g., finding the top-$k$ number in an array) in addition to a set of realistic tasks.

To construct tasks annotated by humans, we develop 5 annotation pipelines for detailed example annotation. These pipelines undergo iterative refinement until the examples meet quality standards. Auto-generated tasks, conversely, can be easily scaled to various lengths. Upon completing $\infty$ Bench, we assess the performance of several state-of-the-art (SOTA) long-context LLMs on this benchmark to gauge its difficulty and evaluate the effectiveness of these models. The results show that current SOTA LLMs are not fully equipped to handle all tasks within $\infty$ Bench, highlighting the ongoing challenge of enabling LLMs to process long contexts effectively. We also analyze the behavior of LLMs on such long contexts, including a task length ablation, the absence of the “lost in the middle” phenomenon (Liu et al., 2023), and the context recalling prompting technique.

Our contributions can be summarized as follows:

We construct and release $\infty$ Bench, the first multi-domain bilingual benchmark for evaluating the ability to understand and reason over contexts surpassing 100K tokens.

We evaluate SOTA long-context LLMs on $\infty$ Bench, which reveals severe performance degradation of these LLMs when scaling context lengths. These experimental results and analysis also indicate promising directions to improve long-context LLMs.

Related Work

Transformers, typically trained on text sequences under 8K tokens due to self-attention’s quadratic complexity, face challenges in longer downstream tasks. To address this, two main strategies have emerged: firstly, the development of positional encodings capable of handling longer text sequences Sun et al. (2022); Press et al. (2021), and secondly, the refinement of inference stage techniques to extend current LLMs post-training. The primary approach involves modifying rotary positional encoding Su et al. (2023) and implementing post-training adjustments to better manage the increased relative positional distances in longer sequences Zhu et al. (2023); Peng et al. (2023b); Chen et al. (2023a).

Recently, many LLMs have shown the ability to handle over 100K tokens. Some popular proprietary 100K+ LLMs include GPT-4, Claude 2 (Anthropic, 2023), and Kimi-Chat (AI, 2023). On the other hand, there are far fewer open-source 100K+ models. Some notable models include YaRN (Peng et al., 2023b) and Yi-200K (01.AI, 2023a, b). In this paper, we benchmark GPT-4, Claude 2, Kimi-Chat, and YaRN-Mistral-7B-128K (https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k; we denote this model by YaRN-Mistral) on $\infty$ Bench. These are some of the latest and strongest LLMs that claim to be able to handle over 100K tokens.

Numerous studies aim to accelerate self-attention computation. Research primarily concentrates on refining attention mechanisms through improved IO management (Dao et al., 2022; Dao, 2023), memory optimization (Kwon et al., 2023; Shazeer, 2019; Ainslie et al., 2023), and enhanced parallelization in decoding (Dao et al., 2023; Hong et al., 2023). Approaches like Sliding Window Attention (Beltagy et al., 2020), LM-Infinite (Han et al., 2023), and StreamingLLM (Xiao et al., 2023) introduce attention variants for handling infinitely long sequences without overwhelming computation or memory overhead. However, these techniques often face challenges in maintaining historical information.

Several benchmarks exist for evaluating long-context AI models, notably featuring context lengths of around 10K tokens. L-Eval (An et al., 2023) and LongBench (Bai et al., 2023) are prominent examples, aggregating pre-existing tasks Kociský et al. (2017); Dasigi et al. (2021); Yang et al. (2018); Huang et al. (2021); Joshi et al. (2017) into comprehensive benchmarks. LongBench encompasses four categories (QA, summarization, synthetic retrieval, and code) spanning 21 tasks, with four being novel. Conversely, L-Eval incorporates 18 tasks across QA, summarization, math, retrieval, and multiple-choice (MC) domains, introducing three new tasks. Another notable benchmark, LooGLE (Li et al., 2023), differentiates between short and long dependency examples, focusing on summary and QA tasks; its summary corpus contrasts with ours, utilizing academic papers over novels. The Long-Range Arena (LRA) (Tay et al., 2020) further diversifies with six tasks in text, image, and math, designed for scalability. In comparison, $\infty$ Bench stands out for its substantially longer contexts and a broader range of task domains. Table 1 offers a detailed comparison of these long-context benchmarks.

$\infty$Bench

$\infty$ Bench encompasses 12 tasks spanning 5 domains: retrieval, code, math, novels, and dialogue. Two of these tasks are derived from existing literature Mohtashami and Jaggi (2023); Liu et al. (2023). Among the newly introduced tasks, half are generated automatically, while the remainder are annotated by humans.

In total, $\infty$ Bench includes 3946 examples, featuring a length beyond 100K tokens (average approximately 200K). Figure 2 illustrates the distribution of these tasks. Table 2 details their respective input and output lengths as well as the number of examples per task.

Next, we illustrate each task in detail. The tasks can be grouped into two broad categories. The first involves realistic contexts collected from real-world scenarios, which reflect potential practical uses of long-context LLMs. The second depends on synthetic contexts, which are created or collected to test certain capabilities of long-context LLMs.

We develop novel-based tasks as outlined in Figure 3, utilizing novels that are sourced from websites (https://www.sparknotes.com/, https://www.cliffsnotes.com/) and manually filtered. More annotation information is provided in Appendix C.

In these tasks, models are tasked with reasoning over entire novels presented during inference. Recognizing that many novels, along with their movie adaptations and related discussions, are accessible online and may have been encountered by LLMs during training, we adopt key entity replacement as a countermeasure. This involves substituting prominent entities determined by annotators, such as main character names, with unrelated ones, creating “fake novels”.

Using these altered novels, we design tasks in three formats: summarization, open-form question answering (QA), and multiple-choice (MC) questions, applying key entity replacement to the annotations as well. All English tasks share the same set of modified novels.
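Key entity replacement can be illustrated with a minimal sketch. The function name `replace_entities` and the name mapping below are hypothetical; in the benchmark, annotators decide which entities to replace and what to replace them with. The sketch only shows how one mapping could be applied consistently to both a novel and its annotations.

```python
import re

def replace_entities(text: str, mapping: dict) -> str:
    """Replace whole-word occurrences of each key entity with its substitute."""
    if not mapping:
        return text
    # Try longer names first so a full name is matched before a first name.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)

# Hypothetical mapping, for illustration only.
mapping = {
    "Elizabeth Bennet": "Marta Quill",
    "Elizabeth": "Marta",
    "Darcy": "Holloway",
}
novel = "Elizabeth Bennet met Darcy. Elizabeth smiled."
question = "Where did Elizabeth first meet Darcy?"

fake_novel = replace_entities(novel, mapping)
fake_question = replace_entities(question, mapping)
```

Applying the same mapping to the questions and answers keeps the annotations consistent with the altered novel.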

The En.Sum task requires models to generate a concise summary of the novel. Gold standard labels are sourced from the web and undergo manual filtering to remove non-summarization content, like comments. Model performance is evaluated using the ROUGE-L-Sum metric (Lin, 2004).

We employ the same annotation pipeline for both En.QA and Zh.QA tasks, ensuring that the questions necessitate long-range dependency and reasoning, beyond simple short passage retrieval. The tasks are primarily categorized into two types of reasoning:

Aggregation: This involves compiling various pieces of information scattered throughout the novel. An example question in $\infty$ Bench is “How much money in total did A spend on lunch?”

Filtering: This requires identifying specific information from a larger set. An example question in $\infty$ Bench is “What color dress did A wear when A met B for the second time?”

These tasks test the ability of LLMs to locate and process information within the novel, performing reasoning through aggregation or filtering to derive answers.

The En.MC task is annotated similarly to En.QA, but differs in that the model is presented with four answer choices. Annotators are instructed to craft these options to be challenging.

1.2 Dialogue

The construction process for the En.Dia task is depicted in Figure 3. We gather movie and drama scripts from a designated online database (https://imsdb.com/), focusing on a corpus of long, multi-role dialogues. Only the English scripts are retained and necessary cleaning is applied.

In the En.Dia task, random instances of character names within a script are replaced with $MASK$ . The objective is to correctly identify these masked names. For scripts falling short of 100K tokens, we augment them by padding with additional scripts.
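The masking step can be sketched as follows. This is an illustrative assumption rather than the actual construction script: the helper `mask_character` is hypothetical, it masks a single occurrence of one given name, and it ignores script cleaning and padding.

```python
import random
import re

def mask_character(script: str, name: str, rng: random.Random):
    """Replace one randomly chosen occurrence of `name` with $MASK$.

    Returns (masked_script, answer), where answer is the masked name.
    """
    hits = [m.span() for m in re.finditer(r"\b" + re.escape(name) + r"\b", script)]
    if not hits:
        raise ValueError(f"{name!r} does not occur in the script")
    start, end = rng.choice(hits)
    return script[:start] + "$MASK$" + script[end:], name

# Toy script for illustration; real scripts are far longer.
script = "ALICE: Hi.\nBOB: Hello, ALICE.\nALICE: Bye."
masked, answer = mask_character(script, "ALICE", random.Random(0))
```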

1.3 Code

We develop the Code.Debug task as per the process illustrated in Figure 3. Code repositories, sourced from PyPI (https://pypi.org/), undergo a filtering process, and those outside the 64K to 256K token range are excluded (tokenization via the tiktoken tokenizer, OpenAI (2023c)). Each repository is transformed into a single file, aggregating the content from all files within, each prefaced by its relative path to the root directory. Three of the authors then insert a deliberate and obvious error into one function per repository. The options are presented in the Class.Function format. Six methods are employed for bug insertion: (1) deleting a necessary variable declaration; (2) using an incorrect number of arguments in function calls; (3) creating infinite loops; (4) causing indentation errors; (5) substituting references with undefined variable/function names; (6) introducing blatant syntax errors (e.g., non-closed brackets).
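The repository flattening step might look like the sketch below. The function name `flatten_repo`, the `# File:` header format, and the restriction to `.py` files are assumptions made for illustration; the text only states that each file is prefaced by its path relative to the root. Token-range filtering is omitted here.

```python
from pathlib import Path

def flatten_repo(root: str, suffixes=(".py",)) -> str:
    """Concatenate all matching files under `root` into one string.

    Each file is prefaced by its path relative to the repository root.
    The header format is a hypothetical choice, not the benchmark's.
    """
    root_path = Path(root)
    parts = []
    for path in sorted(root_path.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            rel = path.relative_to(root_path).as_posix()
            body = path.read_text(encoding="utf-8", errors="replace")
            parts.append(f"# File: {rel}\n{body}")
    return "\n\n".join(parts)
```

The resulting string can then be tokenized and kept only if its length falls inside the desired token range.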

Initial results indicate that this task is too challenging for current LLMs: none of the baseline models can identify even the most obvious errors, such as non-closed brackets. To mitigate this, we offer four answer choices, one containing the injected bug and the other three bug-free. Note that this makes many examples easily solvable with an external retrieval preprocessing step. However, we encourage users not to use external retrieval preprocessing, in order to keep the evaluation fair. We look forward to the stage where LLMs can solve the problem directly without answer options.

2 Synthetic Context

The second category of tasks is characterized by a synthetic context. These tasks, devoid of direct real-world application or use case, are engineered to evaluate the capability for processing lengthy contexts. We delineate four essential abilities for effective long-context processing:

Location and retrieval. This encompasses all retrieval tasks.

Elevated information resolution. This involves the Retrieve.Number task.

State preservation. This incorporates the Code.Run and Math.Find tasks.

Sequential processing. This utilizes the Math.Calc task.

In retrieval tasks, models retrieve specific character sequences from lengthy contexts with predominantly irrelevant content. Such tests, adaptable for any context length, can assess the impact of information placement on model performance, like the lost-in-the-middle phenomenon (Liu et al., 2023). The three retrieval tasks in $\infty$ Bench vary in complexity.

The Retrieve.PassKey task was first proposed by Mohtashami and Jaggi (2023). Models are prompted to find a specific piece of information called the pass key, which is a random 5-digit sequence. The pass key is inserted into a lengthy and noisy context. In $\infty$ Bench, we generate examples with 59 different pass key locations that are evenly distributed in the context. At each location, we construct 10 examples with different pass keys. This results in 590 examples.
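The construction above (59 evenly spaced locations, 10 pass keys each) can be sketched as follows. The filler sentence, the wording of the inserted sentence, and the helper names are placeholders chosen for illustration, not the benchmark's actual templates; `n_noise` controls the context length and would be much larger in practice.

```python
import random

# Placeholder filler text; the real noise text may differ.
NOISE = "The grass is green. The sky is blue. The sun is yellow. "

def make_passkey_example(n_noise: int, position: float, rng: random.Random):
    """Build one (context, pass_key) pair with the key at a relative position in [0, 1]."""
    key = "".join(rng.choice("0123456789") for _ in range(5))
    needle = f"The pass key is {key}. Remember it. "
    n_before = round(n_noise * position)
    context = NOISE * n_before + needle + NOISE * (n_noise - n_before)
    return context, key

def build_dataset(n_noise=1000, n_locations=59, per_location=10, seed=0):
    """Generate per_location examples at each of n_locations evenly spaced positions."""
    rng = random.Random(seed)
    examples = []
    for i in range(n_locations):
        position = i / (n_locations - 1)
        for _ in range(per_location):
            examples.append(make_passkey_example(n_noise, position, rng))
    return examples
```

With the default arguments, `build_dataset` yields 59 × 10 = 590 examples, matching the count stated above.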

To examine the local attention of LLMs, we increase the complexity of Retrieve.PassKey in the Retrieve.Number task by extending the answer length to 10 digits and incorporating successive repeated digits. For example, a pass key in Retrieve.PassKey might be 98762, while an answer in Retrieve.Number might be 9998877762. This modification aims to assess the local resolution capabilities of long-context models, as our preliminary experiments indicate that LLMs struggle with discerning repeated numbers.
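One way to produce such answers is to repeat the digits of a shorter key until the target length is reached. This is only a guess at a plausible procedure; the helper names are hypothetical and the actual generation procedure may differ.

```python
import random

def repeat_digits(key: str, counts) -> str:
    """Repeat the i-th digit of `key` counts[i] times."""
    return "".join(d * c for d, c in zip(key, counts))

def expand_with_repeats(key: str, total_len: int, rng: random.Random) -> str:
    """Randomly distribute extra repetitions over the digits of `key`."""
    counts = [1] * len(key)
    for _ in range(total_len - len(key)):
        counts[rng.randrange(len(key))] += 1
    return repeat_digits(key, counts)

# Reproduces the example from the text: 98762 -> 9998877762.
example = repeat_digits("98762", [3, 2, 3, 1, 1])
random_example = expand_with_repeats("98762", 10, random.Random(0))
```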

Liu et al. (2023) introduce a key-value retrieval task (Retrieve.KV) within a large JSON object containing many key-value pairs (e.g., 30eea139-b6dd-43fc-bc5d-0d3d17980229 $\rightarrow$ bfd36c2b-c57e-41ef-9cc1-b21b4e60e664). This task requires the model to accurately identify and retrieve the value corresponding to a specified key. The complexity of this task is heightened due to the indistinguishable format of relevant and irrelevant information.
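A key-value example of this kind can be generated with a few lines of standard-library code. The sketch below is an assumption about the general shape of the data (random UUID keys and values in one JSON object); the function name and the number of pairs are illustrative.

```python
import json
import random
import uuid

def make_kv_example(n_pairs: int, rng: random.Random):
    """Return (json_context, target_key, target_value) with random UUID pairs."""
    def rand_uuid() -> str:
        # Seeded UUID so that examples are reproducible.
        return str(uuid.UUID(int=rng.getrandbits(128), version=4))

    pairs = {}
    while len(pairs) < n_pairs:
        pairs[rand_uuid()] = rand_uuid()
    target_key = rng.choice(list(pairs))
    return json.dumps(pairs), target_key, pairs[target_key]

context, key, value = make_kv_example(50, random.Random(0))
```

Because every key and value shares the same format, nothing in the surface form distinguishes the target pair from the distractors.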

2.2 Code

In the Code.Run task, we evaluate the ability of LLMs to simulate multi-step function executions that involve basic arithmetic operations. While this task is readily solvable using a Python interpreter, the focus here is on the long-term state tracking required in such tasks. The capability of state tracking has been demonstrated in GPT-4 (Bubeck et al., 2023). Specifically, the task involves creating Python code consisting of multiple simple functions, incorporating operations such as addition, subtraction, and nested function calls. The structural design of these tasks is as follows:

Some functions’ return values are dependent on other functions (e.g., func_0 invokes func_1). We define depth as the number of cascading function calls initiated by a single call. Thus, the depth for func_1’s call within func_0 is 1. In Code.Run, we employ depths ranging from 2 to 10, ensuring each function calls at most one other function. To keep each step of computation simple, these functions are restricted to performing only addition and subtraction.
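The structure described above can be sketched as a generator that emits a chain of functions and computes the reference answer by executing the generated code. This is a simplified illustration under stated assumptions: the function `make_code_run_example`, the constant ranges, and the single-chain layout are hypothetical, and the actual task may include additional unrelated functions to lengthen the context.

```python
import random

def make_code_run_example(depth: int, rng: random.Random):
    """Generate a chain func_0 -> func_1 -> ... -> func_depth.

    Each function performs one addition or subtraction and calls at most
    one other function. Returns (source, call, answer).
    """
    blocks = []
    for i in range(depth + 1):
        c = rng.randint(1, 20)
        op = rng.choice("+-")
        if i < depth:
            body = f"func_{i + 1}(x {op} {c})"
        else:
            body = f"x {op} {c}"
        blocks.append(f"def func_{i}(x):\n    return {body}\n")
    source = "\n".join(blocks)
    call = f"func_0({rng.randint(-10, 10)})"
    # Obtain the reference answer by running the generated code.
    namespace = {}
    exec(source, namespace)
    answer = eval(call, namespace)
    return source, call, answer

source, call, answer = make_code_run_example(3, random.Random(0))
```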

2.3 Math

Math.Find assesses the model’s capability to identify specific elements within a large array, requiring comprehensive observation for accuracy. This task also tests the ability to preserve states while encoding the context. Concretely, the model receives an extensive list of numbers and is tasked with locating one of seven key numbers: the three largest (1st, 2nd, and 3rd), the three smallest (1st, 2nd, and 3rd), and the median.
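For reference, the seven targets can be computed directly from the sorted list. The sketch below assumes that ranks are taken by position in sorted order (duplicates count separately) and that the median follows Python's `statistics.median`; the benchmark's handling of ties and even-length lists is not specified here.

```python
import statistics

def math_find_targets(nums):
    """Return the seven key numbers for a list with at least three elements."""
    s = sorted(nums)
    return {
        "largest": s[-1],
        "second largest": s[-2],
        "third largest": s[-3],
        "smallest": s[0],
        "second smallest": s[1],
        "third smallest": s[2],
        "median": statistics.median(s),
    }

targets = math_find_targets([5, 1, 9, 3, 7])
```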

To assess sequential processing skills, Math.Calc prompts the model to compute the result of a lengthy arithmetic expression featuring addition and subtraction. Initial experiments indicate that current LLMs struggle to directly produce the final answer. Hence, we instead query the LLMs to provide the intermediate result following each operator. Model performance is evaluated based on the number of correct values preceding the first error.
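The scoring rule can be made concrete with a short sketch. The function names and the space-separated expression format are assumptions for illustration; the metric itself, the number of correct intermediate values before the first error, follows the description above.

```python
def intermediate_results(expression: str):
    """Running totals after each operator of a space-separated +/- expression."""
    tokens = expression.split()
    total = int(tokens[0])
    results = []
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        if op == "+":
            total += int(operand)
        elif op == "-":
            total -= int(operand)
        else:
            raise ValueError(f"unsupported operator: {op!r}")
        results.append(total)
    return results

def correct_prefix_length(predicted, gold) -> int:
    """Count the predicted values that match gold before the first mismatch."""
    n = 0
    for p, g in zip(predicted, gold):
        if p != g:
            break
        n += 1
    return n

gold = intermediate_results("1 + 2 - 3 + 4")
score = correct_prefix_length([3, 0, 5], gold)
```

Here the gold sequence is 3, 0, 4, so a prediction of 3, 0, 5 scores 2: values after the first error do not count, even if they happen to be correct.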

Experiments

We conduct a thorough set of experiments on $\infty$ Bench. We will introduce the baselines, experimental setup, and main results in this section.

$\infty$ Bench generally requires the ability to handle input contexts longer than 100K tokens. Only a handful of LLMs claim to be capable of handling contexts over 100K tokens. We include four baselines. The first three are proprietary LLMs whose weights we cannot access, while the last baseline is open-source. Details on evaluation are in Appendix D.

GPT by OpenAI is one of the most widely used and capable LLM series on the market, and a recent version of GPT-4 (OpenAI, 2023b) can support 128K contexts.

Claude 2 (Anthropic, 2023) is a proprietary chat-based LLM released by Anthropic AI and has shown impressive capabilities. The second version of the Claude series supports 200K contexts. We manually enter each example through the webpage because we have no access to their API.

Kimi-Chat, a proprietary chat-oriented LLM developed by Moonshot AI (AI, 2023), is designed to process contexts up to 200K. Due to the lack of API access, we manually input the test data using their web interface.

YaRN-Mistral is a derivative of Mistral-7B (Jiang et al., 2023) introduced by Peng et al. (2023b). The original Mistral-7B was trained on input lengths up to 8K and shows a reduced performance in longer contexts. Peng et al. (2023b) adapted it to 128K contexts by modifying the position encoding and continued training.

2 Experimental Setup

For each model-task combination, we craft prompts to optimize model performance on short dummy examples. Detailed prompt templates for each model and task can be found in Appendix B.

All API-based baselines are subject to a maximum input length limit and will reject inputs exceeding this threshold. While YaRN-Mistral is theoretically capable of handling longer contexts, the authors only claim abilities up to 128K. Therefore, inputs are truncated by removing the center and joining both ends. This approach is predicated on the assumption that key information, such as instructions and book titles, is typically located at either the start or the end of a prompt.
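Middle truncation can be sketched as follows. The even split between the head and the tail is an assumption; the text only states that the center is removed and both ends are joined.

```python
def truncate_middle(tokens, max_len: int):
    """Keep the first and last tokens, dropping the center, so that at most
    max_len tokens remain."""
    tokens = list(tokens)
    if len(tokens) <= max_len:
        return tokens
    head = max_len // 2
    tail = max_len - head
    return tokens[:head] + tokens[len(tokens) - tail:]

shortened = truncate_middle(range(10), 4)
```

In this toy case the result keeps tokens 0, 1, 8, and 9, so content at the start and the end of the prompt survives truncation.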

3 Main Result

Table 3 and Figure 1 display the performances of various baselines on $\infty$ Bench. Notably, GPT-4 outperforms other baselines in the retrieval, code, and math domains, with a considerably higher average score. However, in the novel-based tasks, no distinct winner emerges among the proprietary LLMs. On the other hand, the open-source YaRN-Mistral lags behind the proprietary models in most tasks, exhibiting almost random performance in multiple areas. This aligns with its relatively inferior performance in shorter contexts compared to these models. Additionally, it is observed that the baselines generally excel more in retrieval tasks than in other areas, echoing the relative simplicity of these tasks for human participants.

Analysis

We subsequently perform a detailed analysis of the results, identifying and emphasizing several notable and interesting phenomena.

In line with our benchmark’s goal to assess proficiency in managing lengthy contexts, we verify the baselines’ capability with shortened context versions. A subset of the auto-generated tasks is modified accordingly, and the performance outcomes are illustrated in Figure 4. It is observed that model performance generally declines with longer input lengths compared to shorter ones. This suggests that while these baselines are technically equipped to handle extended inputs, their effectiveness diminishes significantly under such conditions.

2 Lost in the middle

Prior research indicates a performance decline in some LLMs when answers are positioned around the center of the context (Liu et al., 2023). However, our findings do not strongly corroborate this. As depicted in Figure 5, we analyze model performance based on answer location in three location-dependent tasks. We observe no consistent trend between performance and answer position across different models. For instance, GPT-4 shows a preference for early answers in Retrieve.KV but favors later ones in En.Dia. In contrast, Claude 2’s performance remains relatively unaffected by answer position on all three tasks, whereas YaRN-Mistral and Kimi-Chat excel with end-positioned answers (except that YaRN-Mistral gets zero performance at all positions on Retrieve.KV).

One plausible reason why our observations differ from those of Liu et al. (2023) is that they experiment with contexts of at most 16K tokens, which is about 8 times shorter than our setting. The models in their study are also different from ours. Finally, the tasks are different: their experiments involve document question answering (and their result on key-value retrieval arguably does not show a very pronounced performance drop either). We hypothesize that the phenomenon of “Lost in the middle” is only exhibited on specific tasks and models. A more thorough investigation of these differences is beyond the scope of this paper.

3 Context Recalling

We identify an intriguing prompting technique for tasks involving extended context, termed context recalling. This technique posits that, although the information is present in the context and accessible via direct attention, it may be more effective to first prompt the model to recall the relevant information in its generation before engaging in further reasoning. In our experiments using Code.Debug, when we merely instructed GPT-4 to process information step-by-step, the accuracy was 15.74%. However, by explicitly directing GPT-4 to repeat the relevant code before analysis, its accuracy on Code.Debug markedly improved to 39.59%. This approach of context recalling warrants additional investigation.

Conclusions

We introduce $\infty$ Bench, the first benchmark tailored for long contexts exceeding 100K tokens in average length. Empirical evidence indicates that despite claims of proficiency with such extensive contexts, current LLMs demonstrate significant performance degradation when dealing with them. This finding highlights the need for advanced methodologies to improve LLMs’ efficiency in processing long contexts. Additionally, our analysis offers insights into LLM behavior in long-context tasks, guiding future research.

Limitations

While our benchmark offers valuable insights into LLM performance, it may not be sufficiently diverse or extensive to provide a comprehensive assessment of model capabilities, a constraint common to most benchmarks. Additionally, the reliance on exact match for scoring, dependent on prompt templates and answer parsing methods, may necessitate tailored redesigns for new model evaluations.

Furthermore, supporting contexts up to 100K tokens may fall short for applications requiring analysis of extensive datasets, such as multiple books or entire databases. Exploring LLMs’ capacity to handle up to a million tokens or more presents a promising research avenue. In practical applications, finetuning models to memorize context, rather than processing it during inference, could offer a more efficient alternative, albeit with significant computational demands.

Ethics Statement

Our human annotators are directed to exclude data that may raise sensitive ethical issues, such as offensive language or social biases. Nonetheless, the potential for encountering sensitive content persists, particularly if the sourced books or code contain such material. This concern is somewhat mitigated since the benchmark’s primary focus is on evaluating the long-context capabilities of LLMs, rather than influencing their social bias.

The goal of this research is to advance the development of LLMs’ proficiency in handling extensive contexts. This could aid in implementing more effective “guardrails” against misuse by incorporating detailed specifications prior to user interactions. However, this approach also potentially increases the risk of novel prompt injection attacks.

References

Appendix A RWKV

RWKV (Peng et al., 2023a) is an architecture that combines the strengths of the transformer architecture (Vaswani et al., 2017) and recurrent neural networks (Hochreiter and Schmidhuber, 1997). Its training process can be parallelized while the inference procedure is recurrent, enabling $O(1)$ complexity during inference. Hence, the memory usage does not scale with context length, allowing it to support arbitrary-length inputs. We use the RWKV-4-World-7B version of this model series. However, we should keep in mind that this model was not trained on inputs of this length.

Table 4 shows the performance of RWKV-4-World-7B in comparison to our baselines. We find that RWKV-4-World-7B outputs unintelligible text on our benchmark, which causes it to achieve zero performance on Retrieve.PassKey, the easiest task for the other baselines. This is likely because this model was not trained on inputs of this length and suffers from train-test domain shift. Therefore, we do not test it on other tasks in our benchmark. We emphasize that this result is not evidence that the architecture of RWKV is incapable of handling lengthy inputs.

Appendix B Prompt Templates

In the following templates, many tasks have an input part that is provided in each example. Generally, it is a short question-like text that tells the model what it is supposed to do. One example is “What is the pass key?”.