Long-context LLMs Struggle with Long In-context Learning

Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen

Introduction

Large language models have already entered the long-context era. A myriad of LLMs have been released that support long context windows from 32K to 2M tokens. These methods (Hao et al., 2022; Chen et al., 2023a; Peng et al., 2023b; Ratner et al., 2023; Xiao et al., 2024) can unlock many complex real-world applications, such as long-document question answering, multi-document summarization, long-horizon agent tasks, and repo-level code understanding.

One line of research is based on ALiBi (Press et al., 2022) and RoPE (Su et al., 2024) embeddings, which allow us to train Transformers on short sequences and subsequently apply them to longer sequences during inference. Recently, different approaches (Xiong et al., 2023; Fu et al., 2024; Liu et al., 2024) have helped models extrapolate to a 128K window size with continued pre-training. Later on, LongRoPE (Ding et al., 2024) was proposed to further extend the context window to 2M tokens. Another line of research uses methodologies like context window sliding and segmentation to overcome the limited context window of the original Transformer (Hao et al., 2022; Ratner et al., 2023). Furthermore, architectural innovations, transitioning from traditional Transformer-based designs to recurrent models or state space models, have shown promise in facilitating long-range computations naturally (Orvieto et al., 2023; Gu & Dao, 2023; Peng et al., 2023a). These techniques have been incorporated into several current open-source LLMs to enhance long-sequence understanding capability (Chen et al., 2023b; Tworkowski et al., 2023).

These long-context models are primarily evaluated in three ways: (1) language model perplexity over long documents, which is used by most papers; (2) passkey retrieval (Mohtashami & Jaggi, 2023; Chen et al., 2023a; Li et al., 2023a) or needle-in-a-haystack (Team et al., 2023; Fu et al., 2024), which requires reciting a randomly inserted piece of information from a long sequence, a synthetic task on which several LLMs achieve 99%+; and (3) long-document question answering or summarization over Qasper (Dasigi et al., 2021).

Evaluations (1) and (2) only provide a minimum bar for LLMs to pass, and their results cannot reflect LLMs’ true ability to deal with realistic long-sequence tasks. Evaluation (3) provides a more realistic metric; however, these tasks focus more on retrieving the correct information from the long input. In question answering, LLMs can take a shortcut and read a short snippet to predict the answer without reading the entire document, as demonstrated in Figure 2 case (b). Similarly, summarization suffers from strong position bias, where LLMs can exploit the few leading sentences (Nallapati et al., 2017) to achieve high performance. Therefore, these metrics are insufficient to measure LLMs’ ability to comprehend and reason over the entire input sequence.

In this paper, we propose to adopt in-context learning (ICL) on extreme-label classification tasks (Anil et al., 2022; Milios et al., 2023) to evaluate long-context LLMs. Unlike the prior tasks, in-context learning requires LLMs to recognize the task by scanning over the entire input to understand the label space. This task necessitates LLMs’ ability to comprehend the entire input to make predictions. Due to the massive label space, the task demonstration could easily become a long sequence. For example, Discovery (Sileo et al., 2019) encompasses 174 classes with each example taking an average of 61 tokens. Therefore, the minimum demonstration for 1 shot/class already exceeds 10K tokens. Normally, LLMs demand more than 1 shot/class to understand the nuances of different fine-grained labels. Thus, this task becomes a natural testbed for long-context understanding.

To systematically assess how these extended input capabilities affect model performance in the realm of fine-grained text classification with in-context learning, we have compiled a benchmark, i.e. LongICLBench, consisting of six carefully-selected tasks with different difficulty levels in terms of context length and label space.

We evaluate 13 long-context LLMs and find that their performance uniformly dips as the task becomes more complex (e.g., requiring longer demonstrations), as shown in Figure 3. Some models, like Qwen and Mistral, even degrade linearly with respect to the input length. At the same time, most of the models benefit from extensive demonstrations if the length is within a certain range. As the input grows longer, additional demonstrations either hurt performance or make it fluctuate, as shown in Figure 1. Moreover, we further analyze the distribution of label positions to investigate the factors that affect the long in-context learning capability of these models. We show that the position distribution of instances in the prompt can dramatically influence the performance of some of the evaluated models, including GPT4-turbo.

In a nutshell, our contributions can be summarized as follows:

- We have developed LongICLBench, a benchmark dedicated to assessing long in-context learning for large language models. It complements earlier benchmarks that concentrated on tasks like long-document summarization, question answering (QA), or retrieval by focusing instead on long in-context learning.
- We evaluate a range of recent long-context LLMs on LongICLBench and report their performance across gradually increasing difficulty levels. We also find that some long-context LLMs are sensitive to the position of instances in the prompt. We hope the evaluation results can provide more insights for improving the design of long-context large language models.

Related Work

Long In-context Learning on LLMs As pre-trained language models continue to grow in size, in-context learning (ICL) has emerged as a favored approach for addressing a wide array of tasks without the need for extensive fine-tuning (Dong et al., 2023). A body of research has established that increasing the number of example demonstrations can enhance ICL performance (Liu et al., 2022; Wu et al., 2023). Nonetheless, there are studies indicating that longer input prompts can actually diminish performance (Liu et al., 2023), with the effectiveness of prior large language models (LLMs) being constrained by the maximum sequence length encountered during their training. It is also claimed in previous works that LLM+ICL falls short on specification-heavy tasks due to inadequate long-text understanding ability (Peng et al., 2023c). To counter this issue, various works have introduced memory augmentation and extrapolation techniques to support ICL with an extensive set of demonstrations (Li et al., 2023c; Wang et al., 2023).

Long Context Techniques over LLMs The effectiveness of Transformer-based models is hindered by the quadratic increase in computational cost relative to sequence length, particularly when handling long context inputs. Recent efforts have explored various strategies to address this challenge. Some studies have pursued continued fine-tuning of the LLM with longer context inputs, aiming to adapt the model to extended sequences (Rozière et al., 2024; Tworkowski et al., 2023). Others have leveraged techniques such as position extrapolation and interpolation, building upon relative rotary positional embedding (Su et al., 2021), to extend the input length beyond that seen in the training phase (Press et al., 2022; Chen et al., 2023a). Additionally, a range of approaches has been proposed to mitigate computational issues, including sliding memory window and chunk segmentation methods (Hao et al., 2022; Ratner et al., 2023; Zhu et al., 2024). Furthermore, alternative architectures beyond the Transformer have been explored to handle long inputs more naturally, such as selective state space models, which represent a variation of recurrent neural networks (Peng et al., 2023a; Gu & Dao, 2023). These diverse approaches claim to enhance the capabilities of LLMs in processing long context inputs more efficiently.

Long Context Evaluation Due to the pressing demand for LLMs that support long-range inputs, a series of benchmarks focus on long-context evaluation. Long-Range Arena (Tay et al., 2021) includes tasks consisting of sequences ranging from 1K to 16K tokens to evaluate variations of fast Transformers. LongBench (Bai et al., 2023b) comprises 21 bilingual datasets within 6 types of tasks with an average length of around 6k words, which have been processed into a unified format to enable effortless evaluation. L-Eval Benchmark (An et al., 2023) supports 20 sub-tasks with input lengths of 3K to 200K tokens. LooGLE (Li et al., 2023b) focuses on summarization and four types of long-dependency QA tasks with test instances exceeding 100k words. Most recently, $\infty$Bench (Zhang et al., 2024) encompasses 12 tasks, collected from realistic, auto-generated, and human-annotated datasets with an average length of 200K tokens. Another recent work explores the impact of extending input lengths on the capabilities of large language models, especially on reasoning tasks (Levy et al., 2024). Versatile as these benchmarks are, none of them focuses on the capability of LLMs confronted with long in-context learning over an extreme label space, which is quite different from long-document understanding or synthetic needle-in-a-haystack tasks. Thus, we propose LongICLBench to fill this niche and provide a more comprehensive long-context evaluation for LLMs.

Extreme-label Classification Extreme-label Classification involves categorizing data into one of an extremely large number of labels, and finds application across a variety of real-world domains such as emotion classification from text, named entity recognition, and biological function prediction, each requiring precise differentiation among vast label spaces (Zhang et al., 2017; Sileo et al., 2019; Demszky et al., 2020; Ding et al., 2021). Existing methods to tackle Extreme-label Classification tasks range from embedding-based approaches to fine-tuned retrievals (Bhatia et al., 2015; Vulić et al., 2021), focusing on efficiently managing and leveraging the large label space. However, integrating this task with long-context large language models presents unique challenges. The sheer scale of the label space in extreme-label classification complicates the in-context learning process, where LLMs are expected to discern fine-grained differences among labels based on extensive context (Milios et al., 2023). These challenges make the proposed LongICLBench with a range of difficulty levels a good testing scenario to evaluate the capability of long-context large language models.

Long In-context Evaluation

To support the evaluation of long in-context learning on extreme-label classification tasks in different domains and at various difficulty levels, we collect six datasets covering context lengths from short to long. To balance the token length of the sequences within each dataset against the goal of evaluating long in-context learning, we keep a subset of all the classes and form evaluation sets of around 1, 2, 3, 4, and 5 rounds, where each round represents a complete set of examples containing all the unique chosen labels. We sample the same number of instances from each of the classes to reduce the bias resulting from the label distribution. The statistics of the datasets are described in detail in Table 1 and Appendix A.1.
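The round construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the function name `build_rounds`, the `(text, label)` pair format, and the choice to shuffle the order inside each round are all assumptions made for the example.

```python
import random
from collections import defaultdict


def build_rounds(examples, num_rounds, seed=0):
    """Return a demonstration list holding `num_rounds` examples per label.

    `examples` is a list of (text, label) pairs. Every round contains exactly
    one example of each label; the order inside a round is shuffled
    (an assumption of this sketch).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)

    # Draw `num_rounds` distinct examples per label so every label is
    # represented equally often.
    picked = {}
    for label, texts in by_label.items():
        if len(texts) < num_rounds:
            raise ValueError(f"label {label!r} has fewer than {num_rounds} examples")
        picked[label] = rng.sample(texts, num_rounds)

    demonstration = []
    for r in range(num_rounds):
        one_round = [(picked[label][r], label) for label in sorted(picked)]
        rng.shuffle(one_round)
        demonstration.extend(one_round)
    return demonstration
```

With 41 chosen labels, for example, `build_rounds(pool, 3)` would return 123 demonstrations in which every label appears exactly three times.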

BANKING77 (Casanueva et al., 2020) is a banking-domain intent detection dataset comprising 13,083 annotated examples over 77 intents. We keep all of the types of intents, and each of the instances contains around 28 tokens.

TacRED (Zhang et al., 2017) is a large-scale relation extraction dataset with 106,264 examples built over news and web text from the corpus used in the yearly TAC Knowledge Base Population challenges. Only one relation is labeled for each of the sentences in the dataset. It covers 41 relation types in total, with an average length of 80 tokens for each example.

DialogRE (Yu et al., 2020) is a human-annotated dialogue-based relation extraction dataset composed of 1,788 dialogues from the well-known American television comedy Friends, with 36 possible relation types between an argument pair in a dialogue. Each example contains around 226 tokens on average.

Discovery (Sileo et al., 2019) is a large dataset built by automatically discovering sentence pairs with relevant discourse markers; it contains 174 discourse markers with at least 10K examples each. Each example contains around 61 tokens. With its fine-grained labels, this dataset is the most difficult task.
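As a rough sanity check on how quickly the demonstrations grow, the class counts and average example lengths quoted above can be multiplied out per round. The sketch below is only a lower-bound estimate: it ignores the instruction text, label strings, and separators added by the prompt template, so actual prompt lengths will be larger. The names `DATASETS` and `estimated_tokens` are introduced here purely for illustration.

```python
# Figures quoted in the dataset descriptions:
# (number of classes, average tokens per example).
DATASETS = {
    "BANKING77": (77, 28),
    "TacRED": (41, 80),
    "DialogRE": (36, 226),
    "Discovery": (174, 61),
}


def estimated_tokens(num_classes, tokens_per_example, rounds):
    """Lower-bound token estimate; template overhead is not counted."""
    return num_classes * tokens_per_example * rounds


for name, (num_classes, tokens_per_example) in DATASETS.items():
    budget = [estimated_tokens(num_classes, tokens_per_example, r) for r in range(1, 6)]
    print(name, budget)
```

Under these assumptions a single Discovery round already comes to 174 × 61 = 10,614 tokens, consistent with the "exceeds 10K tokens" figure given in the introduction.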

Model and Experimental Setup

In the exploration of in-context learning for extreme-label classification, we conduct a comprehensive evaluation of a series of recent open-source long-context language models with around 7B parameters. We also include SoTA models like Gemini and GPT-4-turbo. Table 2 provides an overview of the models investigated, highlighting the architectural innovations they use to deal with long context. Multiple strategies are adopted to extend the context window: some models are trained directly with a long context window, while others support length extrapolation. RWKV (Peng et al., 2023a) and Mamba (Gu & Dao, 2023) are two new RNN-like architectures that reduce attention complexity, which allows the model to extrapolate easily to much longer inputs with linear time/memory complexity.

We construct a prompt for each of the datasets following the template shown in A.2. To fairly evaluate the open-source and API-based models across a series of input lengths, we sample the same example set for all the models, with labels distributed evenly to ensure an unbiased distribution in the in-context demonstration. For instance, an input of one round includes one set of examples traversing all the label types, and an input of 5 rounds contains instances from each of the labels 5 times. For testing, we sample 500 examples from the test set of each dataset, again ensuring an even distribution over label types. All the open-source models are loaded from the weights on Hugging Face (https://huggingface.co), while the API-based models are called with the scripts in the official documentation (https://platform.openai.com/docs/guides/text-generation/chat-completions-api, https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/overview).
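Drawing a fixed-size test set with an even label distribution can be sketched as follows. The text does not say how the remainder is handled when 500 is not divisible by the number of labels, so this sketch assumes label counts may differ by at most one; `sample_balanced` and the `(text, label)` pair format are hypothetical names for illustration.

```python
import random
from collections import defaultdict


def sample_balanced(examples, total, seed=0):
    """Sample `total` (text, label) pairs whose label counts differ by at most one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    labels = sorted(by_label)
    rng.shuffle(labels)  # randomly choose which labels receive one extra example
    base, extra = divmod(total, len(labels))

    sampled = []
    for i, label in enumerate(labels):
        quota = base + (1 if i < extra else 0)
        if quota > len(by_label[label]):
            raise ValueError(f"not enough test examples for label {label!r}")
        sampled.extend(rng.sample(by_label[label], quota))
    rng.shuffle(sampled)
    return sampled
```

Fixing `seed` keeps the sampled set identical across models, in line with the requirement that every model sees the same examples.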

Experiment Result

The main evaluation results are shown in Table 4 and Table 6. For the entity recognition and relation extraction datasets, we use the F1 score as the evaluation metric, and accuracy is used for the other datasets. In general, Transformer-based models perform consistently better than the RNN-based ones on all the evaluated datasets. However, both still fall behind the powerful API-based models, especially GPT4-turbo. For a relatively simple task like BANKING77, whose context length from 1 round to 5 rounds is 2K to 14K, most of the models benefit from the extensive context with more demonstrations. As shown in Figure 1 and Table 4, from 2K to 4K, most of the open-source models either show a huge increase that nearly doubles the accuracy or fail completely. After 3 rounds, adding more examples yields only limited performance gain. For more complicated tasks like TacRED and DialogRE in Table 4 and Table 6, which rely more heavily on long-context comprehension, the overall performance of all the few-shot models drops compared to BANKING77. As shown in the middle plot of Figure 1, only GPT4-turbo consistently benefits from more demonstrations; all of the other models reach their peak in the middle, at a context length of around 20K.
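For readers who want to reproduce the two metrics, a minimal sketch is given below. The text does not state which F1 averaging is used, so the micro-averaged, set-based variant here is an assumption (it suits examples with one or several gold answers); `accuracy` and `micro_f1` are illustrative names, not functions from the benchmark code.

```python
def accuracy(predictions, references):
    """Fraction of examples whose predicted label equals the gold label."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def micro_f1(predicted_sets, gold_sets):
    """Micro-averaged F1 where each example has a set of predicted and gold items."""
    true_pos = sum(len(p & g) for p, g in zip(predicted_sets, gold_sets))
    num_pred = sum(len(p) for p in predicted_sets)
    num_gold = sum(len(g) for g in gold_sets)
    if true_pos == 0:
        # Covers the empty-prediction case and avoids division by zero.
        return 0.0
    precision = true_pos / num_pred
    recall = true_pos / num_gold
    return 2 * precision * recall / (precision + recall)
```

Note that when every example has exactly one predicted and one gold label, this micro-averaged F1 reduces to plain accuracy; it only differs when predictions may be empty or contain several items.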

For the most challenging Discovery dataset, which has an extremely large label space of 174 classes, a single round traversing all the labels already makes up a context length of 10K. In this extreme case, all of the models, including GPT4-turbo, fail to tell the difference among the fine-grained types, leading to a score of 0. The results across different datasets reveal the models’ capability to understand different types of tasks. Our initial hypothesis is that the strongest LLMs, like GPT-4-turbo, are capped at a certain complexity level between DialogRE and Discovery.

Another interesting observation is that some LLMs’ performance on extreme-label ICL seems highly predictable. According to Figure 3, the performance of Qwen and Mistral is almost linear with respect to the demonstration length. This suggests that there might be an underlying mathematical relation between performance and task complexity for ICL.
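One simple way to quantify "almost linear" is to fit a least-squares line to (demonstration length, score) pairs and inspect the coefficient of determination. The sketch below is a generic ordinary least-squares fit, not an analysis performed in the paper; `fit_line` is a hypothetical helper, and any inputs passed to it would have to come from the reported tables.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - residual sum of squares / total sum of squares.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return slope, intercept, r_squared
```

An R² close to 1 together with a negative slope would support the linear-degradation reading; a low R² would indicate that the trend is not well described by a line.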

Exploratory Experiment

Inspired by the Lost in the Middle phenomenon (Liu et al., 2023), we conduct analysis experiments to explore whether the position distribution of the instances makes a difference in performance for long in-context learning on extreme-label classification tasks.

In our investigation, we conduct pilot experiments on TacRED, a medium-complexity dataset, with each label type demonstrated three times, resulting in a total of 123 distinct instances ($41\times 3$). In these experiments, instances bearing the same label are distributed randomly to form a scattered configuration. For each instance, we track its relative position within the prompt alongside its label, and then compute the accuracy for each label class. The first row of Figure 4 visualizes the accuracy of each label aligned with its position within the prompt, where different colors denote different label types. When class instances are scattered, certain models, such as InternLM2-7B-base, achieve acceptable performance (approximately 60% accuracy) only on specific labels, as highlighted by a red circle in Figure 4, regardless of the instance placements. Conversely, other models, like ChatGLM3-6B-32K, exhibit robust performance across a broad spectrum of labels. Remarkably, GPT4-turbo consistently surpasses 80% accuracy for the majority of label types, with only a few exceptions.
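The bookkeeping behind this analysis (relative position of each demonstration, plus accuracy per label class) can be sketched as below. Positions are measured here by demonstration index rather than by token offset, which is an assumption of this sketch; `label_positions` and `per_label_accuracy` are illustrative names.

```python
from collections import defaultdict


def label_positions(demo_labels):
    """Map each label to the relative positions (0 = start, 1 = end) of its demonstrations."""
    last = max(len(demo_labels) - 1, 1)
    positions = defaultdict(list)
    for index, label in enumerate(demo_labels):
        positions[label].append(index / last)
    return dict(positions)


def per_label_accuracy(predictions, references):
    """Accuracy computed separately over the test examples of each gold label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, gold in zip(predictions, references):
        total[gold] += 1
        correct[gold] += int(pred == gold)
    return {label: correct[label] / total[label] for label in total}
```

Plotting each label's accuracy against its positions from `label_positions` gives the kind of scatter view described for Figure 4.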

Grouped Distribution

To facilitate a clear comparison between scattered and grouped distributions, we arrange instances of the same class to be adjacent within the demonstration prompts. The impact of this reorganization on model performance, before and after grouping, is presented in Table 7. A general decline in performance emerges across most models after grouping instances by class. Notably, models such as Mistral-7B-v0.2-base and InternLM2-7B-base exhibit significant performance drops, indicating a pronounced sensitivity to instance grouping. To delve deeper into this phenomenon, we visualize the accuracy of grouped labels in relation to their positions within the prompt, as illustrated in Figure 4. In this visualization, instances of the same class, denoted by dots of the same color, are positioned near one another. Some models, like InternLM2-7B-base, are highly sensitive to the distribution of instances and only handle instances whose labels are positioned at the end of the prompt. Conversely, other open-source models such as ChatGLM3-6B-32K, with a modest 3.3% drop in accuracy, prove more resilient to changes in instance positioning, maintaining high performance across varied positions. Surprisingly, even GPT4-turbo is not immune to the challenges posed by grouped distributions, with a notable performance decline of 20.3%. This observed decrease in performance is consistent across models, unaffected by the specific positions of the labels within the prompt.
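The two demonstration orderings compared here can be produced from the same set of examples, as in the sketch below. It is illustrative only: `scattered` and `grouped` are hypothetical helper names, and randomizing the order of the label groups is an assumption, since the text does not specify how groups are ordered.

```python
import random


def scattered(demos, seed=0):
    """Shuffle all (text, label) demonstrations so same-label instances are spread out."""
    rng = random.Random(seed)
    shuffled = list(demos)
    rng.shuffle(shuffled)
    return shuffled


def grouped(demos, seed=0):
    """Place demonstrations of the same label next to each other; group order is random."""
    rng = random.Random(seed)
    labels = sorted({label for _, label in demos})
    rng.shuffle(labels)
    rank = {label: i for i, label in enumerate(labels)}
    # sorted() is stable, so the original order is kept within each label group.
    return sorted(demos, key=lambda item: rank[item[1]])
```

Because both functions return a permutation of the same demonstrations, any performance gap between the two prompts can be attributed to ordering rather than to content.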

Conclusion

In summary, our research explores the capability of large language models on long in-context learning tasks, particularly in extreme-label classification scenarios. We curate a dataset LongICLBench consisting of long in-context learning tasks with different difficulty levels with respect to the context length. Through our study, we have discovered that while LLMs show promising performance on inputs up to 20K tokens, their ability to process and understand longer sequences significantly decreases. Our exploratory experiments further highlight the impact of the distribution of examples within prompts on model performance. We hope LongICLBench and our findings contribute to the ongoing efforts to enhance LLMs’ understanding of long contexts.

References

Appendix A

We list a few additional datasets as follows:

GoEmotions (Demszky et al., 2020) is the largest manually annotated dataset of 58k English comments from Reddit, labeled with 27 emotion categories or Neutral. There are 27 emotion types, from which we drop the rare ones with few examples. Each selected example contains 28 tokens on average.

Few-NERD (Ding et al., 2021) is a large-scale human-annotated named entity recognition dataset with a hierarchy of 8 coarse-grained and 66 fine-grained entity types. Each instance is a paragraph of approximately 61 tokens on average and contains one or more entity names as the ground-truth answer. There are 66 types of entities in the collection.

The performance on these two tasks is shown in Table 8 and Table 9.

A.2 Prompting Template

The prompting template for each of the datasets is presented in Table 10.