Lessons from the Trenches on Reproducible Evaluation of Language Models

Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan, Xiangru Tang, Kevin A. Wang, Genta Indra Winata, François Yvon, Andy Zou

Introduction

Evaluation on shared benchmark tasks is a crucial tool used to track and communicate progress in the machine learning and language modeling communities (Ruder, 2021). Benchmarks are used to track progress toward shared community goals and to demonstrate the improvements of newly proposed methods over prior baselines. Evaluation practices thus play a crucial role in the direction of the field: inconsistencies or biases in evaluation practices can lead to skewed performance comparisons, which may influence the direction of future research and the adoption of new methods by the community (Dehghani et al., 2021) or lead to adverse effects from deploying suboptimal or harmful models (Bender & Friedman, 2018) on tasks for which they are ill-suited (Raji et al., 2022).

Unfortunately, transparent and reproducible evaluation of large language models is very challenging. In our research we have frequently struggled to reproduce the results reported in various papers as well as carry out new evaluations ourselves. To address this problem we built the Language Model Evaluation Harness (Gao et al., 2021), a flexible evaluation library that serves as research infrastructure for evaluation. Our goal with lm-eval is to enable researchers to run any benchmark on any model as easily as possible, while also making it easy for creators of new model inference libraries or evaluation benchmarks to connect their work to the broader ecosystem.

Over the past three years, the design of lm-eval has evolved as the needs of the open source community and our understanding of best practices for language model evaluation have evolved. In this paper we detail lessons learned that have been especially beneficial to obtaining useful and rigorous findings. We highlight several commonly-faced challenges in evaluating language models, including the difficulty of assessing the correctness of natural language responses, challenges in benchmark design, and the dependence upon implementation details that are often obscured or unreported (Section 2). We then discuss best practices we’ve identified to improve how to communicate results and improve evaluation rigor in the language modeling community, despite these challenges (Section 3). Finally, we detail how we have used our learnings to inform the design of lm-eval (Section 4).

Challenges in Evaluating Language Models

The biggest challenge in language model evaluation is a concept we term the Key Problem: When evaluating language models, there can be many semantically equivalent but syntactically different ways of expressing the same idea. In an ideal world, we would have a way to automatically detect when two sentences express the same content but in different words. Unfortunately, our best tools for determining whether two sentences are semantically equivalent are the very models we are seeking to evaluate. This problem drives many of the approaches to LM benchmarking, and many problems in LM evaluation stem from there not being any silver bullets for solving the Key Problem.

In principle, this would be solvable by simply having expert human annotators score model responses for correctness. The main reason this is not ubiquitous is cost: performing accurate human studies is not only difficult and time-consuming but also very expensive when annotators are fairly compensated, pricing smaller actors or organizations out of performing such evaluations. Additionally, there are other reasons that relying solely on human assessments must be done with caution: they can be flawed and biased, especially for complex judgments such as factuality (Hosking et al., 2024; Xu et al., 2023; Wu & Aji, 2023). Expert, trained human judgment can alleviate these issues but is inherently non-scalable.

To address the high costs of manual human evaluation, automated metrics are often used. These offer notable advantages in that they are (theoretically) fully reproducible, far easier and cheaper to compute, and can avoid some of the issues faced by human studies (Wei & Jia, 2021; Freitag et al., 2021; Amidei et al., 2020). Automated metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) seek to directly solve the Key Problem by measuring the distance from a generated response to a gold-standard one, for example by counting the n-gram overlap between the two texts. Heuristic-based metrics such as BLEU (and its derivatives) have flaws (Callison-Burch et al., 2006) and present reproducibility challenges (Marie et al., 2021), but can be useful. More recently, model-based metrics have gained momentum through evaluation methods that leverage large language models as graders (Kim et al., 2024; Wang et al., 2024; Liu et al., 2023b), especially as proxies for human preference evaluation (Zheng et al., 2023), but these are known to be flawed (Wang et al., 2023; Shen et al., 2023a; Zeng et al., 2024; Hu et al., 2024; Liu et al., 2023c; Chen et al., 2024) and suffer from reproducibility issues similar to those of BLEU, ROUGE, and their variants.
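To make the n-gram overlap idea concrete, the following is a minimal sketch of a clipped n-gram precision. It is an illustration only, not BLEU or ROUGE as published (it omits, for instance, BLEU's brevity penalty and its averaging over several n-gram orders), and the function names and example sentences are our own assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Counts are clipped: a repeated candidate n-gram is credited at most
    as many times as it appears in the reference.
    Tokenization here is naive lowercased whitespace splitting (an assumption).
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


reference = "the cat sat on the mat"
verbatim = "the cat sat on the mat"
paraphrase = "a feline was sitting on the rug"

# The verbatim answer scores perfectly; the paraphrase shares only one bigram.
print(clipped_ngram_precision(verbatim, reference, n=2))
print(clipped_ngram_precision(paraphrase, reference, n=2))
```

The paraphrase conveys roughly the same content as the reference yet receives a low score, which is the Key Problem in miniature: surface overlap is only a proxy for semantic equivalence.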

The Key Problem can alternatively be sidestepped by artificially restricting the answer space. The most prevalent way to achieve this is to reframe questions as multiple choice problems, with a single gold target answer and a finite, static set of possible responses (Hendrycks et al., 2020; Srivastava et al., 2022; Liévin et al., 2022; Lin et al., 2022; Robinson et al., 2023; Holtzman et al., 2022). Alternatively, when a reference answer is known, one can use heuristic string-matching approaches to determine whether the model’s answer matches the ground truth (Dua et al., 2019; Joshi et al., 2017; Hendrycks et al., 2021).
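As an illustration of heuristic string matching, the sketch below implements a normalized exact-match check. The normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) are a common convention that we assume here for illustration; they are not the official scorer of any benchmark cited above, and the function names are hypothetical.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)


# Superficial differences are forgiven by the normalization...
print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))
# ...but a correct answer phrased as a sentence is still marked wrong.
print(exact_match("It is the Eiffel Tower", ["Eiffel Tower"]))
```

The second call shows why such heuristics only partially sidestep the Key Problem: any answer format the normalization does not anticipate is scored as incorrect.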

This challenge does not necessarily impact other applications of language models and related technologies, such as playing games where it is easy to check that the game has ended (Romstad et al., 2008; Silver et al., 2018; FAIR et al., 2022), more constrained scientific applications (Jumper et al., 2021; Ahdritz et al., 2022), or domains where we have practically usable verifiers even when the solutions are not checkable in all contexts (Biderman, 2020; Biderman & Raff, 2022; Lewkowycz et al., 2022). In the case of LLMs, the most notable cases where this ground-truth verifier is known are coding and mathematics problems, although the verifiers used, such as unit tests, may still break down in edge cases (Liu et al., 2023a).

2 Benchmark Design and Validity

Typically, we do not care about the actual numeric score of a model on a benchmark. Instead, we desire the benchmark to be a useful proxy for some real-world phenomenon. The validity of an evaluation is the extent to which these correlate (Messick, 1994). For a recent overview of validity concerns in NLP benchmarking, see Subramonian et al. (2023). Also see Raji et al. (2021); Saphra et al. (2023); Davis (2023) for extended discussion of construct validity in LLM evaluation.

While validity is an ongoing problem in language model evaluation, we focus on mitigating other concerns first: as we will describe, lm-eval is designed to ensure measurements are consistent across runs and models, regardless of (construct) validity. This is due to our goal of building research infrastructure for evaluations. While we as researchers prefer some evaluation benchmarks to others, our goal is to enable researchers to run any evaluation benchmarks on any models.

3 Implementation Difficulties and (Ir)Reproducibility

Once a benchmark has been designed, it must be implemented by machine learning researchers around the world before it can see use in driving progress in the field. Researchers must adapt the benchmark to their own workflows and libraries in order to actually adopt it in their research. This adaptation process can introduce inconsistencies and make it difficult to draw conclusions across different implementations, and it raises a host of new challenges that must be addressed to ensure that everyone evaluates models on a benchmark in the same fashion when comparing results.

The importance of interoperability and full reproducibility stems from the fact that language models are incredibly sensitive to precise details that may not be obvious to practitioners. Even minor variations in prompts, formatting, or other implementation details can significantly impact the performance and validity of evaluations (Weber et al., 2023; Sclar et al., 2023; Mizrahi et al., 2024; Alzahrani et al., 2024; Lu et al., 2022; Webson & Pavlick, 2022; Min et al., 2022). When the original evaluation code is unavailable and evaluation procedures must be re-implemented from scratch, it is nearly impossible to account for all the subtle details that can affect outcomes. As a result, these implementations are likely to diverge in ways that make it extremely difficult to ensure fair comparisons across works, even when evaluating on the same benchmark. Even having the prompts reported in a paper is no substitute for having access to the actual evaluation code: prompts in papers are often incorrect or difficult to map to the exact code implementation because they’ve been stylized to be human-readable.

3.2 Lack of Agreement About “Apples to Apples”

Even assuming that benchmarks are implemented consistently across works, the question of how to draw fair comparisons across models and methods is still difficult for LMs.

For instance, different instruction-tuned models may be trained to expect certain formats (Taori et al., 2023; Sanh et al., 2022; Wei et al., 2022) – using these models’ intended prompt formats can make the evaluation tasks inherently different or change their difficulty, but not using these can also bias against models trained with formats not matching tasks’ “standard” prompting styles. Likewise, if an original benchmark implementation (including prompting and postprocessing) is tailored for a specific model, other models trained differently will suffer, artificially skewing perceptions of what techniques are effective.

Likewise, some questions of how to set up controlled experiments are still open: is it ideal to normalize performance and comparisons by the number of parameters? Training FLOPs? Inference cost? Must training data be held equal? How should models which can leverage external resources such as retrieved documents or external tools be compared? These questions are all context-dependent but can impact findings significantly. For example, Wang et al. (2022) explore comparisons across architectures and training objectives, and choose to normalize for FLOPs, thus comparing encoder-decoder models with double the parameters to decoder-only models. Comparing results of models with equivalent training FLOPs, regardless of the allocation of those FLOPs, is commonplace (Hoffmann et al., 2022; Peng et al., 2023; Touvron et al., 2023, inter alia). However, in a more memory-constrained setting, comparing models at equal parameter counts may be more logical. While this is not inherently problematic, as different application contexts motivate different evaluation criteria, it is common for headline claims to be read as referring to the general case without significant attention being paid to such caveats.

3.3 Comparisons with Prior Work are Expensive (and Sometimes Impossible)

Setting aside the question of establishing fair comparisons between methods or models, an additional key challenge in language modeling research is that many barriers prevent thorough comparison with related work.

Many LMs developed by industrial labs, often used as reference points for benchmarks, have never been released externally (Chowdhery et al., 2023; Hoffmann et al., 2022), preventing comparisons except by pulling unverified evaluation numbers from technical reports. Models that have been made available via APIs may differ from the published versions, or be otherwise modified for deployment, without this being disclosed. Additionally, these API models are quickly deprecated and become inaccessible, rendering slews of work no longer reproducible. (Notably, OpenAI’s code-davinci-002 model was deprecated in January 2024, making at minimum hundreds of research studies irreproducible.) API access, especially for large volumes of evaluation, is also quite expensive.

Further, a growing number of companies no longer make base language models available, but enforce interaction via a chat interface or API, which may include product features such as personalization (https://blog.google/technology/ai/bard-google-ai-search-updates/), safety controls (https://openai.com/blog/our-approach-to-ai-safety), or domain-specific tooling (see “ChatGPT vs. Microsoft Copilot: What’s the difference?”). Attempts to compare these closed systems, which integrate a language model along with proprietary features, introduce a whole new set of complications.

4 Fast-changing Progress and Conventions

Due to the time-consuming nature of developing good benchmarks and the rapid pace of change in NLP research in the past decade, many widely used language model evaluation benchmarks do not represent the current paradigm of how language models are trained. This has two major impacts:

Benchmarks are being used for purposes for which they were not originally designed or validated: for example, a large number of benchmarks have been built around fine-tuning on a known training set and closed space of labels (Wang et al., 2019b; a).

There is no “ground-truth” implementation from the original benchmark authors for many of these popular benchmarks “retrofitted” to be used with autoregressive LMs. In the absence of a clear standard, the community’s methodology for evaluating on these benchmarks may be fragmented or undocumented (Clark et al., 2018; Paperno et al., 2016).

To illustrate the effects of this development timeline, Figure 1 shows how many prominent LM benchmarks were designed prior to shifts such as in-context learning and chat interaction, and therefore were not designed to take these formats and approaches into account. This can affect validity or difficulty in unforeseen ways.

Best Practices for Language Model Evaluation

While LM evaluation is difficult and suffers from a number of challenges as we have described, there are measures that can be taken to significantly improve current practices. We provide our high-level recommendations regarding such measures, and detail our motivations briefly for each.

If possible, full evaluation code, including the full prompts used, should be provided for reproducible evaluation runs, along with further identifiers such as links to the specific commits used. Failing this, sharing the prompts alone is often not done but can drastically improve reproducibility.

For fair comparison against other models, evaluation should be done with the same set of prompts unless there’s a good reason not to. Prompts should not be optimized for performance on a given model but not others, and the amount of prompt engineering done should be disclosed.

Comparing results across papers can be misleading due to a wide range of experimental differences, including prompts, sample size, metric calculation, and more (Marie et al., 2021).

Whenever possible, results should not be copied or reported from other papers (Marie, 2022) unless one can verify that the exact same code has been used to run the experiments in those papers. If such copying is unavoidable, it should be clearly marked as such and treated carefully.

Providing model outputs alongside evaluation code can allow others to recalculate scores based on these artifacts, which can be useful for performing statistical significance testing and for assessing the impact of different evaluation metrics or scoring approaches.

Evaluation of large models or APIs can be quite costly–sharing such artifacts allows researchers without access to significant compute to participate in evaluation research.

Finally, sharing outputs can allow results on API models to be reproduced to some extent, even if the models are subsequently deprecated.

Qualitatively review a small batch of results and outputs before testing at scale: it is very easy to have bugs in your generation code, especially when working with multiple sets of benchmarks, prompts, and models of different architectures. Catching issues early can save a lot of time and compute, and increase confidence in results.

Quantitative scores only provide so much information. To understand why a model is scoring so well or so poorly, it is important to do some sort of qualitative error analysis. This can sometimes reveal superficial errors that are easier to correct with post-processing (Bawden & Yvon, 2023), or more fundamental errors.

Most works on language modeling do not perform statistical significance testing, which can significantly decrease confidence in results (Marie et al., 2021).

Although costly, reporting results run over more than one random seed can dramatically boost the validity and utility of results. Examples include averaging across model runs (Sellam et al., 2022) or averaging over multiple selections of few-shot examples.

Even when not retraining models, statistical analysis can bound the expected variation across model training runs (Jordan, 2023).

The Language Model Evaluation Harness

Informed by these practices we have built lm-eval. Unlike subsequent work on unified benchmarking libraries (Liang et al., 2023; Srivastava et al., 2022; Barton, 2024), the Evaluation Harness does not seek solely to prescribe what the correct benchmark or evaluation protocols are, and allows users to select their desired tasks and use cases. (lm-eval was originally built in 2021 and has been in continuous use at EleutherAI and elsewhere since then, despite not being formally introduced in any papers.)

The role of lm-eval is to solve the orchestration problem: previously, performing thorough LM evaluations would require painstaking re-implementation of previous tasks (likely to introduce subtle methodological divergences) or individually installing, debugging, and using dozens of small libraries. Our goal is to make it easy for researchers or library users to install one codebase and run their method plus selected baselines on their desired tasks in a controlled fashion. At the same time, we strive to design best practices into the functionality of the library itself, so that the default and easy-to-use functionality guides users to follow best practices.

We provide an overview of lm-eval’s major components and design philosophy. At its core, lm-eval allows for the contribution of two types of implementations: evaluation Tasks and integrations with novel LM implementations.

lm-eval is built around modular implementations of evaluation tasks, implemented as a Task class using a common API. This allows tasks to be collected in a common library, for new tasks to be extended or implemented easily, and for novel tasks to be easily shared reproducibly among practitioners or other library users. Users can implement tasks either via YAML-based configuration files or via subclassing the provided Task class and providing custom code for specific methods. In Figure 2, we show an example of the evaluation logic packaged within a Task class.

We provide a number of implementations for common tasks, and accept new tasks sourced from the community. We strive to match the paper originally introducing a benchmark dataset in its methodology, including using the same prompts if applicable. For tasks such as those introduced prior to prompted evaluation becoming the standard, we source evaluation methodology from the paper first posing the evaluation dataset as a prompted task. For example, we implement many tasks as adapted for in-context learning by Brown et al. (2020).

The next core piece of lm-eval is the LM API. Because effective orchestration is our core goal, we allow arbitrary software libraries or (autoregressive) language model architectures to extend a provided interface for LM objects.

For ease of use, and compartmentalization of the model definition and external library integrations for custom models away from core evaluation logic, we assume that LMs operate upon dispatched Requests which consist of mapping string inputs to some string or probability as output. We thus abstract tokenizers away within the LM class, and treat a neural language model combined with its tokenizer as a single system being evaluated.

LMs implement a simple interface, consisting of several types of Requests in order to be used within the library for all supported tasks.

We allow for 3 core types of Requests that may be sent to a language model, which consist of distinct types of measurements that can be performed to observe a model’s response or latent capabilities in a prompted format. These are:

(Conditional) Loglikelihoods (loglikelihood, multiple_choice) - computing the probability of given output string(s), conditioned on some provided input.

Perplexities (loglikelihood_rolling) - measuring the average loglikelihood or probability of producing the tokens in a given dataset.

Generation (generate_until) - generating text until a given stopping condition is reached, from a model conditioned on some provided input.

Provided with these three primitive operations, we are able to implement the major ways in the literature that have been used to evaluate LMs (Gao et al. (2020), Brown et al. (2020), inter alia). While these high-level approaches are standard, they all contain a number of subtle implementation decisions which are often not disclosed in papers. Therefore, we include a full formal description of common implementation details involved in ours and others’ approaches within Appendix A for completeness, which we hope will be a useful contribution to the literature.
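To illustrate how the three request types fit together, here is a minimal, self-contained sketch. It is not lm-eval's actual API: the class names, signatures, and the toy character-level unigram model are all assumptions made for illustration, and only the method names mirror the request types listed above.

```python
import math
from abc import ABC, abstractmethod
from collections import Counter
from typing import Sequence


class SketchLM(ABC):
    """Hypothetical interface: each request maps strings to a string or a log-probability."""

    @abstractmethod
    def loglikelihood(self, context: str, continuation: str) -> float:
        """Return log P(continuation | context)."""

    @abstractmethod
    def loglikelihood_rolling(self, text: str) -> float:
        """Return the total log-probability of a whole text (no separate context)."""

    @abstractmethod
    def generate_until(self, context: str, stop: Sequence[str], max_chars: int = 20) -> str:
        """Generate text after `context` until a stop string or the length cap."""


class CharUnigramLM(SketchLM):
    """Toy character-level unigram model with add-one smoothing (illustration only)."""

    def __init__(self, corpus: str):
        counts = Counter(corpus)
        denom = sum(counts.values()) + len(counts) + 1  # +1 slot for unseen characters
        self._logp = {ch: math.log((c + 1) / denom) for ch, c in counts.items()}
        self._unseen = math.log(1 / denom)
        self._most_likely = counts.most_common(1)[0][0]

    def _score(self, text: str) -> float:
        return sum(self._logp.get(ch, self._unseen) for ch in text)

    def loglikelihood(self, context, continuation):
        # A unigram model ignores the context; a real LM would condition on it.
        return self._score(continuation)

    def loglikelihood_rolling(self, text):
        return self._score(text)

    def generate_until(self, context, stop, max_chars=20):
        out = ""
        while len(out) < max_chars:
            out += self._most_likely  # greedy decoding, one character at a time
            for s in stop:
                if s and s in out:
                    return out[: out.index(s)]  # cut the output at the stop string
        return out


def pick_choice(lm, context, choices):
    """Multiple choice via loglikelihood: index of the highest-scoring continuation."""
    scores = [lm.loglikelihood(context, choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)


def perplexity(lm, text):
    """Per-character perplexity derived from the rolling loglikelihood."""
    return math.exp(-lm.loglikelihood_rolling(text) / len(text))


lm = CharUnigramLM("aaab")
print(pick_choice(lm, "Q: which letter? A:", ["a", "b"]))
print(lm.generate_until("x", stop=["b"], max_chars=5))
print(perplexity(lm, "aaab"))
```

Note that the tokenizer never appears in the interface: the toy model works on characters internally, but callers only pass and receive strings, mirroring the design choice of treating the model and its tokenizer as a single system.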

2 Addressing Challenges and Incorporating Best Practices

Here we detail how we position lm-eval to address the issues mentioned in Section 2 and incorporate the recommendations in Section 3, in order to encourage a more robust evaluation ecosystem.

lm-eval encourages and enables reproducible evaluation in several ways. First, by providing a standardized implementation of many common tasks, practitioners can report on these tasks and ensure they are evaluating on the same prompt and implementation as other users of the library.

Alongside task results we report a version field, incremented each time a task must be modified in a way that affects its scoring. Therefore, in the case where task implementations have bugs or must otherwise be updated, one can still reference the version of the task used, to ensure future research can reproduce reported results.

While this is not a panacea for the costs of comparing to prior work, and rerunning baselines oneself is advised, when prior work uses our library one can be confident that the results from prior work match what one would have gotten had one rerun it oneself using that version of the library (Beeching et al., 2023).

lm-eval provides support for performing qualitative analysis of evaluation scores. In keeping with our recommended best practices, we implement the following, which allow for qualitative checks to be a core part of the evaluation workflow when using lm-eval:

We allow for artificially limiting the number of samples used for a given evaluation run, to enable code to be tested and outputs to be reviewed in small batches prior to full evaluation runs.

Per-sample logging is supported, for post-hoc reproduction of scores or error analysis of model mistakes or evaluation implementation.

lm-eval reports the standard error (SE) of most supported metrics, calculated by either bootstrapping or dividing the sample standard deviation by the square root of the sample size. (The standard error is always the standard error of the evaluation metric. In most, but not all, cases this is the standard error of the mean, as most evaluation benchmarks report the mean score as their final metric.)

By reporting these SE calculations prominently in every evaluation run, we make it easy for practitioners to add simple statistical measures such as confidence intervals to their results. While we believe more rigorous and widespread statistical testing in LM evaluation is still needed, we hope that this will spur the community to report and be more aware of statistical significance concerns and lower the difficulty of reporting such measures.

Note that the standard errors described here refer to an estimate of the variance in the observed scores should the evaluation data be recollected from the same distribution (Recht et al., 2019; Zhang et al., 2024). Other forms of variance, such as temporal drift (Longpre et al., 2023), retraining models with different seeds (a paradigm we are currently experimenting with), or variance across prompts (Sanh et al., 2022; Muennighoff et al., 2022) must be calculated differently.
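The two standard-error estimates described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not lm-eval's implementation; the function names, the number of bootstrap resamples, the synthetic scores, and the normal-approximation factor of 1.96 for a 95% confidence interval are all assumptions.

```python
import math
import random
import statistics


def analytic_se(scores):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(scores) / math.sqrt(len(scores))


def bootstrap_se(scores, metric=statistics.fmean, n_resamples=1000, seed=0):
    """Standard error of an arbitrary metric, estimated by resampling with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    estimates = [metric(rng.choices(scores, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(estimates)


# Synthetic per-sample accuracy: 70 correct and 30 incorrect answers.
scores = [1.0] * 70 + [0.0] * 30
accuracy = statistics.fmean(scores)
se = analytic_se(scores)

# Approximate 95% confidence interval under a normal approximation (an assumption).
ci = (accuracy - 1.96 * se, accuracy + 1.96 * se)

print(accuracy, se, bootstrap_se(scores), ci)
```

For a mean-based metric the two estimates should roughly agree; bootstrapping is useful mainly when the final metric is not a simple mean and no closed-form standard error is available. As noted above, both capture only resampling variance of the evaluation data, not variance across prompts, seeds, or time.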

Case Studies

Finally, we demonstrate lm-eval’s utility for improving evaluation rigor and understanding via case studies of its successful usage. Additional case studies can be found in Appendix B.

Even when scoring criteria are held identical, the specific prompt of an evaluation task can heavily impact results. In collaboration with the BigScience Workshop, the PromptSource (Bach et al., 2022) library was added to a fork of lm-eval (https://github.com/bigscience-workshop/lm-evaluation-harness) to enable, for the first time, easy evaluation across many different prompt templates. Most papers that came out of the BigScience Workshop used this functionality to report distributions of scores across different prompting set-ups (Sanh et al., 2022; Muennighoff et al., 2022; Yong et al., 2023; Workshop et al., 2023). PromptSource, along with other innovations introduced in the BigScience fork, is now supported natively in lm-eval. We have also further extended our functionality to enable people to use the Jinja templating language directly in their configuration files to make it easy to define custom evaluation templates by hand and algorithmically, regardless of whether they use prompts from the PromptSource library.

While the approach used by BigScience in reporting distributions across prompts has not been widely adopted, the idea that prompts should be considered as part of the evaluation set-up has become widely accepted. We hope that future research will especially continue to focus on collecting realistic prompts (Shen et al., 2023b; Xie et al., 2023; Hofmann et al., 2024) or measuring the extent to which results with particular common set-ups generalize to more realistic use-cases (Lyu et al., 2024), or otherwise investigate prompting as a human-computer interaction problem.

2 Difficulties in Comparing Scores Across Evaluation Setups

As mentioned in Section 2.3, scores on evaluation tasks can be substantially affected by the specifics of the evaluation implementation and setup. Here, we provide an example of studying such sensitivity to divergences in evaluation methodology, and how lm-eval can be used to improve confidence in the comparison of scores across models by preventing such divergences. We focus our attention on two popular language modeling benchmarks: the ARC question answering benchmark (Clark et al., 2018) and MMLU (Hendrycks et al., 2021).

ARC was first adapted to the in-context learning setting by Brown et al. (2020), who implement the dataset as a “cloze” task: the model is prompted with Question: {question}\nAnswer: and the likelihood of each potential completion string is compared. By contrast, MMLU (Hendrycks et al., 2020) provides the model with the question text, each of the 4 possible answers preceded by an answer letter A, B, C, or D, and scores the model based on the generation of the letter corresponding to the correct answer. Additionally, Hendrycks et al. (2020) aggregate scores via the micro average over all samples instead of the macro average over per-subject scores. However, not all papers evaluate on these tasks in the same way as the original formats.
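The difference between the two aggregation schemes mentioned above can be made concrete with a small sketch. The subject names and per-sample results below are synthetic and purely illustrative.

```python
def micro_average(per_subject):
    """Accuracy over all samples pooled together, so large subjects dominate."""
    correct = sum(sum(results) for results in per_subject.values())
    total = sum(len(results) for results in per_subject.values())
    return correct / total


def macro_average(per_subject):
    """Unweighted mean of per-subject accuracies, so every subject counts equally."""
    accuracies = [sum(results) / len(results) for results in per_subject.values()]
    return sum(accuracies) / len(accuracies)


# Hypothetical results: 1 = correct, 0 = incorrect.
results = {
    "subject_a": [1] * 90 + [0] * 10,  # 100 samples, 90% accuracy
    "subject_b": [1] * 2 + [0] * 8,    # 10 samples, 20% accuracy
}

print(micro_average(results))  # 92 correct out of 110 samples
print(macro_average(results))  # mean of 0.9 and 0.2
```

When subjects differ in size, the same per-sample outcomes yield noticeably different headline numbers, so a reported score is only comparable if the aggregation scheme is disclosed.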

When papers do not adopt these approaches or disclose their exact settings, it is impossible to reliably compare stated model performance. In Table 1, we compare evaluation on the Challenge subset of ARC using the prompt from Brown et al. (2020) (“Cloze”) and using an MMLU-style answer letter with explicit multiple choice options (“MMLU-style”). (While they do not report their evaluation set-up, this latter style appears to produce scores consistent with those reported in Jiang et al. (2024) for 25-shot ARC Challenge. We were unable to find a way to reproduce their scores using the standard cloze-style evaluation.) We additionally compare MMLU scores between the original MMLU prompting style (“MMLU-style”) and an approach we term “Hybrid”, consisting of an MMLU-style prompt but using the answer strings instead of answer letters as the set of continuations over which we can score. As described further in Appendix B.2, this can be done by modifying two lines in lm-eval’s ARC and MMLU config files.
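The three set-ups compared here differ only in the context shown to the model and the continuations that are scored. The sketch below makes that explicit. It is an illustration under assumptions: the exact whitespace, the "A." option formatting, and the leading space on each continuation are our own choices, not necessarily the strings used in lm-eval's config files, and the example question is invented.

```python
LETTERS = "ABCD"


def cloze(question, choices):
    """Cloze style: no options in the prompt; score each answer string."""
    context = f"Question: {question}\nAnswer:"
    return context, [f" {choice}" for choice in choices]


def mmlu_style(question, choices):
    """MMLU style: lettered options in the prompt; score each answer letter."""
    options = "\n".join(f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices))
    context = f"{question}\n{options}\nAnswer:"
    return context, [f" {letter}" for letter, _ in zip(LETTERS, choices)]


def hybrid(question, choices):
    """Hybrid: MMLU-style prompt, but score the answer strings instead of letters."""
    context, _ = mmlu_style(question, choices)
    return context, [f" {choice}" for choice in choices]


question = "Which gas do plants absorb from the air?"
choices = ["oxygen", "carbon dioxide", "nitrogen", "helium"]

for build in (cloze, mmlu_style, hybrid):
    context, continuations = build(question, choices)
    print(build.__name__, repr(context), continuations)
```

Each builder returns a (context, continuations) pair that could be scored with conditional loglikelihoods; the underlying question and gold answer are identical in all three cases, and only the measurement procedure changes.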

Comparing these prompting styles, both the degree to which models differ in performance and which prompt performs better vary widely. If some model creators chose one prompting and scoring style and others chose the other, and each used the cited numbers from the respective technical reports to compare their model to other baselines, the comparisons would be nonsensical and would provide no information on which model was “truly” performant. Additionally, better statistical reporting, such as the confidence intervals we report, does not resolve these issues: while it gives a sense of how reliable a given measurement is (for example, MMLU has a smaller confidence interval due to its larger number of samples), it cannot tell us how much model performance will vary across different measurement settings and cannot indicate that a comparison should not be made.

This demonstrates the vital importance of not copying reported evaluation scores from other papers, and of sharing full details on one’s own evaluation setup. We thus hope the use of lm-eval will boost rigor and confidence in novel evaluation results and encourage better communication of evaluation setups.

3 Empowering Benchmark Creators and LM Evaluation Research

Providing a library for evaluation orchestration that is configurable as we have described in Section 4 has many other uses, and we have observed the community leveraging lm-eval effectively for these purposes.

Experimentation on the complexity or difficulties in LM evaluation has been made easier via our configurable task design. Alzahrani et al. (2024) and Lyu et al. (2024) explore the effects of prompting and other distractors on model robustness and performance using lm-eval, as well as investigate the role of evaluation methodology such as the tradeoffs of loglikelihood versus generative evaluation, as we also detail in Appendix A.

lm-eval has been adopted by the community to make the design of novel benchmarks easier: our extensible Task configurations, and corresponding codebase have been used by the community to prototype the evaluation of their new benchmark datasets in lm-eval. By providing this location for community members to design and contribute novel evaluation code, we sidestep the challenging problem of tracking down and using extant evaluation code from various papers entirely: the reference implementations for these new tasks are directly in lm-eval in the first place. As described in Section 4.1, we strive to reduce barriers to task development and contribution, such as providing low-friction modes of development (modular configuration files) or examples implementing tasks “in the style of MMLU”.

lm-eval has recently received contributions for a variety of datasets relying on lm-eval for their evaluation and benchmark prototyping and design (Faysse et al., 2024; Son et al., 2024a; b; Kweon et al., 2024; Li et al., 2024). By directly contributing their new evaluation tasks to lm-eval, benchmark authors also get to have full control over the dissemination (Hendrycks & Woodside, 2024) and implementation of their evaluation, making it far easier for the language modeling community to discover and recognize new benchmarking contributions that might otherwise go unrecognized or unadopted (Dehghani et al., 2021). This is the power of orchestration–the goal is to put new evaluation benchmarks in the hands of the community, and put tools for creating benchmarks given an evaluation dataset in the hands of evaluation developers, while smoothing over potential roadblocks we discuss in Section 2. As an additional concrete example, tasks in lm-eval have been used to back not only the popular Open LLM Leaderboard (Beeching et al., 2023), but also the construction of arbitrary novel leaderboards, which have been used to make custom comparisons between models on more specific use cases, such as non-English languages.

Thus, the orchestration viewpoint we take allows downstream users and developers to create the approaches that best fit their goals and applications, fostering more evaluation development and a more interoperable ecosystem rather than setting a few chosen metrics in stone.

Conclusion

We have presented a number of common challenges in LM evaluation and our recommendations to mitigate the worst of these pitfalls. We introduce lm-eval, a library for evaluation orchestration built to enable easier and more reproducible benchmarking across common evaluation tasks and model implementations.

We hope that lm-eval will continue to be used by the community to improve rigor and our collective understanding of LM evaluations.

References

Appendix A Formalizing Measurements

Here we provide a formal description of the most common approaches to obtaining outputs or measurements from LMs for evaluation, as we implement in lm-eval. We include this for a number of reasons: as a reference for future work, notes on the history of certain LM eval practices, and as an illustrative example of just how many implementation details or methodological choices do not typically make it into evaluation papers and yet can vitally impact results or findings.

Throughout, we consider an auto-regressive language model (LM), with vocabulary $V$ . Given an input consisting of tokens $x_{0},x_{1},...,x_{n-1}$ , the model outputs a probability distribution over the vocabulary, $P(x_{n}|x_{0},x_{1},...,x_{n-1})$ . Internally, this is represented as returning “logits” of shape $(1,|V|)$ , which when taking a log-softmax over the vocabulary dimension, yields log probabilities (“logprobs” or “loglikelihoods”) of each token in $V$ . Logits are the raw, unnormalized predictions of the model before applying the softmax function. Crucially, due to the parallel training and causal masking of autoregressive LMs, it is possible to obtain from a single LM call with $x_{0},x_{1},...,x_{n-1}$ as input, logits of shape $(n,|V|)$ with the $i$ -th element of these logits representing $P(x_{i}|x_{0},x_{1},...,x_{i-1})$ for all $1\leq i\leq n$ . (That is, for every token position of the input, we obtain concurrently the model’s prediction for the subsequent token, starting from its prediction for the value of $x_{1}$ and ending with the model’s predictions for the (not provided) “ $x_{n}$ ” token.)

A.2 Ranking-Based Multiple Choice QA

Given our language model, we aim to compute the conditional (log) probability (or “loglikelihood”) of a target string $y$ conditioned on input $x$ , denoted as $\log P(y|x)$ . This can be performed in a single LM call.

Let $x=x_{0},x_{1},...,x_{n-1}$ be an input sequence of $n$ tokens and $y=y_{0},y_{1},...,y_{m-1}$ be the target sequence of $m$ tokens, where $x_{i}$ and $y_{i}$ represent individual tokens. To compute $\log P(y|x)$ , we follow these steps:

Concatenate $x$ and $y$ to form a new sequence, but discard the final token $y_{m-1}$ . The resulting sequence is $x_{0},x_{1},...,x_{n-1},y_{0},y_{1},...,y_{m-2}$ .

Pass this concatenated sequence through the language model to obtain logits $l$ of shape $(n+m-1,|V|)$ , where $|V|$ is the size of the vocabulary. The last $m$ positions in these logits correspond to the predicted probability distributions for the target tokens $y_{0}$ to $y_{m-1}$ , conditioned on the input $x$ and the preceding target tokens.

Apply a log-softmax function to the last $m$ logits to obtain log probabilities for the completion tokens only.

Calculate the conditional loglikelihood of the target string $y$ given the input $x$ by summing the log probabilities of each target token:

$$\log P(y|x)=\sum_{i=0}^{m-1}\log p(y_{i}|x,y_{0},...,y_{i-1})=\sum_{i=0}^{m-1}l(n+i,y_{i}),\qquad(1)$$

where $\log p(y_{i}|x,y_{0},...,y_{i-1})$ is the log probability of the $i$ -th target token conditioned on the full input $x$ and the preceding target tokens (and where, for $i=0$ , $x,y_{0},...,y_{i-1}$ denotes conditioning on only $x$ ).
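The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than lm-eval's implementation: `model` is a hypothetical callable that returns one row of logits per input token, and `toy_model` is an invented stand-in used only so the example runs. Python lists are 0-indexed, so the row written as $l(n+i,\cdot)$ in the text is row `n - 1 + i` here.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_z for v in row]


def conditional_loglikelihood(model, x, y):
    """Return log P(y | x) from a single call to `model` (hypothetical API).

    `model(tokens)` must return one row of logits per input token, where the
    row at 0-indexed position j scores the token that follows tokens[j].
    """
    if not x or not y:
        raise ValueError("x and y must each contain at least one token")
    n = len(x)
    inputs = list(x) + list(y[:-1])  # concatenate, dropping the final target token
    logits = model(inputs)           # n + m - 1 rows, one per input token
    total = 0.0
    for i, target in enumerate(y):
        # 0-indexed row n - 1 + i is what the text writes as l(n + i, .).
        total += log_softmax(logits[n - 1 + i])[target]
    return total


def toy_model(tokens, vocab_size=4):
    """Invented toy LM: gives logit 2.0 to (previous token + 1) mod vocab_size."""
    rows = []
    for t in tokens:
        row = [0.0] * vocab_size
        row[(t + 1) % vocab_size] = 2.0
        rows.append(row)
    return rows


score = conditional_loglikelihood(toy_model, x=[0, 1], y=[2, 3])
print(score)
```

Requiring a non-empty `x` mirrors the common practice of conditioning on at least a beginning-of-text token when no other context is available.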

With this primitive for computing $\log P(y|x)$ , several options for evaluation (and decisions regarding hyperparameters) become available.

Equation 1 determines how to compute $\log P(y|x)$ . We now describe how to perform loglikelihood-based multiple choice as described by Brown et al. (2020): given $k$ possible answer strings $a_{1},a_{2},...,a_{k}$ , we compute the model’s answer to be $\texttt{argmax}(\log P(a_{1}|x),\log P(a_{2}|x),...,\log P(a_{k}|x))$ . In other words, the model selects the answer string with the highest conditional log probability given the input $x$ .

This can be performed with worst-case $k$ LM calls using the approach to calculate $\log P(y|x)$ for each $a_{i}=y$ described above. However, the number of LM calls can be reduced if one or more answer strings are only a single token in length. Assume some $a_{i}$ is encoded by a single token $z$ . Then, when calculating the loglikelihood of another answer string $a_{j}$ ( $j\neq i$ ) of $m$ tokens, we obtain the (log-softmaxed) logits $l$ of shape $(n+m-1,|V|)$ as an intermediate output. These logits contain the predicted log probabilities for each token in the vocabulary at each position, conditioned on the input $x$ and the preceding tokens. To extract the loglikelihood of predicting the single-token answer $a_{i}$ conditioned on $x$ , we can simply select the element in $l$ corresponding to token $z$ at position $n$ . This entry is the log probability of predicting token $z$ immediately after the input sequence $x_{0},x_{1},...,x_{n-1}$ .

Thus, we can calculate the loglikelihood of a single-token continuation “for free” and remove an additional LM call for each such single-token $a_{i}$ .
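As a small sketch of this shortcut, the function below reads a single-token answer's loglikelihood out of logits that were already computed while scoring a longer answer. The function name and the toy logits are invented for illustration, and rows are 0-indexed, so the prediction made immediately after the $n$ input tokens is row `n - 1`.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_z for v in row]


def single_token_loglikelihood(logits, n, z):
    """Read log P(z | x) from logits already computed for another answer.

    `logits` holds one row per token of x followed by all but the last token
    of some longer answer; 0-indexed row n - 1 is the prediction made right
    after the n input tokens, so no further LM call is needed.
    """
    return log_softmax(logits[n - 1])[z]


# Invented logits from scoring a two-token answer after a two-token input (n = 2).
cached_logits = [
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],  # prediction made right after x
    [0.0, 0.0, 0.0, 2.0],
]
print(single_token_loglikelihood(cached_logits, n=2, z=2))
```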

While the above approach uses the raw loglikelihoods of each given answer choice to select a model answer, other options are available. For instance, if each answer string $a_{i}$ is different in length, this process may frequently default to selecting the shortest $a_{i}$ simply because loglikelihoods are the sum over individual tokens’ log probabilities. Several options for normalizing these loglikelihoods are possible, as also described in Gao (2021):

Token-length normalization: each $a_{i}$ ’s loglikelihood is divided by $m_{i}$ , its length in tokens, to gain the per-token loglikelihood of each answer. This approach requires no additional LM calls, and is used alternately with raw loglikelihoods for most tasks by Brown et al. (2020).

Byte-length normalization: each $a_{i}$ ’s loglikelihood is divided by its length in bytes, removing the dependence on the model’s tokenizer but still normalizing by answer string length. lm-eval provides this metric where applicable as acc_norm.

Mutual Information: each $a_{i}$ ’s loglikelihood is defined as $\log P(a_{i}|x)-\log P(a_{i}|null)$ , where $null$ is either the empty string, a BOS token, or a placeholder such as "Answer:". This can be thought of as a notion of the pointwise mutual information (Shannon, 1948; Askell et al., 2021), $\log\left(\frac{P(a_{i}|x)}{P(a_{i})}\right)$ , which measures the increase in the likelihood of outputting $a_{i}$ when conditioned on the input $x$ , compared to the likelihood of outputting $a_{i}$ unconditionally. Intuitively, this measure of mutual information captures the extent to which introducing $x$ makes $a_{i}$ more likely. Although this approach is nonstandard, it is provided in lm-eval under the option acc_mutual_info, and used selectively by Brown et al. (2020) and Askell et al. (2021) for certain tasks.
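To make the effect of these choices concrete, the sketch below scores the same two answer strings under raw, token-length, byte-length, and mutual-information scoring. The loglikelihood values, the token counts, and the `mode` names are all invented for illustration and are not lm-eval's API; the point is only that the selected answer can change with the normalization.

```python
def pick_answer(choices, cond_ll, token_lengths, uncond_ll=None, mode="raw"):
    """Return the index of the highest-scoring answer under one scoring mode.

    cond_ll[i] is log P(a_i | x), uncond_ll[i] is log P(a_i | null), and
    token_lengths[i] is the length of a_i in tokens (all supplied by the caller).
    """
    scores = []
    for i, choice in enumerate(choices):
        if mode == "raw":
            score = cond_ll[i]
        elif mode == "token":
            score = cond_ll[i] / token_lengths[i]
        elif mode == "byte":
            score = cond_ll[i] / len(choice.encode("utf-8"))
        elif mode == "mutual_info":
            if uncond_ll is None:
                raise ValueError("mutual_info scoring needs uncond_ll")
            score = cond_ll[i] - uncond_ll[i]
        else:
            raise ValueError(f"unknown mode: {mode}")
        scores.append(score)
    return max(range(len(scores)), key=scores.__getitem__)


# Invented numbers: the longer answer has a lower summed loglikelihood.
choices = ["no", "absolutely not"]
cond_ll = [-3.0, -4.2]
uncond_ll = [-2.0, -9.0]
token_lengths = [1, 3]

for mode in ("raw", "token", "byte", "mutual_info"):
    index = pick_answer(choices, cond_ll, token_lengths, uncond_ll, mode)
    print(mode, "->", choices[index])
```

With these numbers, raw scoring prefers the shorter string, while each of the three alternatives prefers the longer one.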

In addition to computing loglikelihoods and normalized loglikelihoods, we may also want to determine whether a given target string $y$ would be produced by greedily decoding from the input $x$ . Let $z$ be the concatenation of $x$ and $y$ as defined in the previous sections, and let $l$ be the logits of shape $(n+m-1,|V|)$ obtained by passing $z$ through the language model. To compute the exact match, we compute $\sum_{i=0}^{m-1}\mathbb{1}[y_{i}=\texttt{argmax}(l(n+i,\cdot))]$ , where $\mathbb{1}[\cdot]$ is the indicator function that returns 1 if the condition is true and 0 otherwise, and $l(n+i,\cdot)$ represents the logits vector corresponding to the model’s output logits predicting the $n+1+i$ -th token and therefore the $i$ -th token position in $y$ (0-indexed). Intuitively, this sum checks whether each token $y_{i}$ in the target string $y$ matches the most probable (argmax) token predicted by the model at each step of greedy decoding. If the sum equals $m$ (the length of $y$ ), it means that all tokens in $y$ would be produced by greedily generating $m$ tokens starting from $x$ . In this case, we return True to indicate an exact match. Otherwise, if the sum is less than $m$ , we return False, indicating that $y$ would not be produced by greedy decoding. Computing the exact match can be useful in scenarios where we want to assess whether the model can generate a specific target string verbatim.
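The exact-match check can be sketched as follows, assuming the logits for the concatenated sequence are already available as a list of rows. The function name and the toy logits are invented; rows are 0-indexed, so $l(n+i,\cdot)$ in the text is row `n - 1 + i` here. Taking the argmax over raw logits gives the same token as taking it over log probabilities, because log-softmax preserves ordering within a row.

```python
def greedy_exact_match(logits, n, y):
    """Return True if greedy decoding from x would reproduce y token for token.

    `logits` has one row per token of x followed by all but the last token of
    y; the row at 0-indexed position n - 1 + i predicts y[i].
    """
    matches = 0
    for i, target in enumerate(y):
        row = logits[n - 1 + i]
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax token id
        matches += int(predicted == target)
    return matches == len(y)


# Invented logits for a two-token input followed by a two-token target.
toy_logits = [
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],  # predicts y[0]
    [0.0, 0.0, 0.0, 2.0],  # predicts y[1]
]
print(greedy_exact_match(toy_logits, n=2, y=[2, 3]))
print(greedy_exact_match(toy_logits, n=2, y=[2, 0]))
```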

In the above derivations, we assume that one can safely tokenize $x$ and $y$ separately and concatenate their tokenizations. This assumption is not always valid, and most widely-used language model tokenizers provide no such guarantees.

While this factor does not impact the validity of the above calculations, it implies that one should be very careful about how a prompt will be tokenized, especially in cases where tokenizing an input and output pair separately may not align with the tokenization the LM most often saw during training. There are some recently proposed mitigations for this issue, such as “token healing” (https://github.com/guidance-ai/guidance/blob/main/notebooks/tutorials/token_healing.ipynb).

In the case of lm-eval, we achieve a majority of the benefits by shifting trailing prompt whitespace into each target string $y$ (see https://github.com/meta-llama/llama/issues/217#issuecomment-1774147331). We also do not include trailing input whitespace in any of the tasks we implement, so this operation should be a no-op in the vast majority of cases.

We hope that future work can examine the most practical way to remove such tokenization-based concerns and difficulties from evaluation, such as BPE dropout (Provilkov et al., 2020), other regularization techniques, or other novel tokenization innovations.

A.3 Perplexity evaluation

A common approach to measure language modeling performance on some data distribution $D$ is to measure perplexity, which is defined as the exponential of the average negative loglikelihood per token (Jelinek et al., 2005; Brown et al., 1992), that is:

$$\text{PPL}=\exp\left(-\frac{1}{\sum_{j=1}^{|D|}N_{j}}\sum_{j=1}^{|D|}\sum_{i=1}^{N_{j}}\log P(y_{j_{i}}|y_{j_{1}},...,y_{j_{i-1}})\right),\qquad(2)$$

where $|D|$ is the number of documents in the dataset, $y_{j}$ is the $j$ -th document in $D$ , $N_{j}$ is the total number of tokens in $y_{j}$ , and $y_{j_{i}}$ represents the $i$ -th token of $y_{j}$ .

To calculate perplexity on a selected dataset $D$ , each dataset document $y$ is tokenized and fed into a language model following the procedure to calculate $\log P(y)$ described in Appendix A.2, via computing $\log P(y|x)$ , where $x$ is set to either the empty string or a beginning-of-text token. Thus, given $\log P(y)$ , for each document $y\in D$ we can sum up the per-document loglikelihoods and divide by the number of total dataset tokens. However, comparing perplexity across models that use different tokenizers can be challenging, as the number of tokens per document and the average next-token prediction difficulty will vary.

To avoid introducing a dependence on tokenizers while reporting perplexity scores, several options are available:

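Bits per Byte: This metric measures the average number of bits required to encode each byte of the input text, providing a tokenization-agnostic measure of language modeling performance (Gao et al., 2020). Formally:

$$\text{BPB}=-\frac{1}{\sum_{j=1}^{|D|}B_{j}}\sum_{j=1}^{|D|}\sum_{i=1}^{N_{j}}\frac{\log P(y_{j_{i}}|y_{j_{1}},...,y_{j_{i-1}})}{\log 2},$$

where $\log$ is in base $e$ and $B_{j}$ is the length in bytes of document $y_{j}$ . Alternately, bits per byte can be written as

$$\text{BPB}=\frac{\sum_{j=1}^{|D|}N_{j}}{\sum_{j=1}^{|D|}B_{j}}\log_{2}(\text{PPL}).$$

That is, taking the base-2 log of perplexity and renormalizing by the number of bytes rather than tokens.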

Word-Level Perplexity: By tokenizing the input text into words, such as via splitting on whitespace, we can calculate perplexity based on the average loglikelihood per word rather than per-token, making the metric comparable across models with different subword tokenizers.

Byte-level Perplexity: Similarly, calculating perplexity averaged over the number of bytes instead allows for a different tokenization-independent perplexity calculation, as the number of bytes in each document’s string remains constant regardless of the tokenizer used.

Both byte- and word-level perplexities can be calculated via replacing $N_{j}$ in Equation 2 instead with the number of bytes or “words” in document $j$ .
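The four quantities discussed in this section differ only in the normalizer applied to the summed loglikelihood, which the following sketch makes explicit. It assumes per-document loglikelihoods in nats and token counts have already been computed; the function name, the whitespace-based word count, and the example numbers are illustrative choices and not lm-eval's code.

```python
import math


def perplexity_metrics(docs):
    """Compute token-, word-, and byte-level perplexity and bits per byte.

    `docs` is a list of (loglikelihood_in_nats, num_tokens, text) triples,
    one per document.
    """
    total_ll = sum(ll for ll, _, _ in docs)
    n_tokens = sum(n for _, n, _ in docs)
    n_bytes = sum(len(text.encode("utf-8")) for _, _, text in docs)
    n_words = sum(len(text.split()) for _, _, text in docs)  # whitespace split
    return {
        "token_perplexity": math.exp(-total_ll / n_tokens),
        "word_perplexity": math.exp(-total_ll / n_words),
        "byte_perplexity": math.exp(-total_ll / n_bytes),
        "bits_per_byte": -total_ll / (n_bytes * math.log(2)),
    }


# Invented loglikelihoods and token counts for two short documents.
metrics = perplexity_metrics([
    (-12.0, 6, "the cat sat"),
    (-8.0, 4, "hello you"),
])
for name, value in metrics.items():
    print(name, value)
```

Only the token-level value depends on the tokenizer through its normalizer; the other three divide by quantities fixed by the text itself.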

In lm-eval we implement all three of the above metrics and report them as tokenization-agnostic measures of perplexity. This approach aligns with the work of Gao et al. (2020), who popularized the use of bits per byte for measuring perplexity, and has been adopted in subsequent studies such as Magnusson et al. (2023) and Hoffmann et al. (2022).

Another challenge is the approach taken to measure perplexity on documents longer than the context length of a given LM. A natural approach, as used by Gao et al. (2020), is to chunk documents longer than a model’s training context size $L$ into non-overlapping chunks. For example, a document of length 4500 tokens evaluated using a model with context length 2048 would be processed as follows: tokens 0:2047 (with token 0 being a prepended BOS token) are fed to predict tokens 1:2048, then tokens 2048:4095 are fed to predict tokens 2049:4096, and finally tokens 4096:4499 are fed to predict tokens 4097:4500. The loglikelihoods of each chunk are then summed to obtain the entire document’s loglikelihood.

However, Press et al. (2020) observe a phenomenon they call the “Early Token Curse”, referring to the fact that tokens with a greater amount of context preceding them are fundamentally easier to predict, whereas the first several tokens a model must predict “from scratch” are difficult or impossible to predict without information to condition on. To mitigate this issue, they propose a strided or sliding window perplexity evaluation method.

Instead of creating non-overlapping windows of tokens of size $L$ , the strided approach introduces a stride $s$ such that overlapping windows of size $L$ , shifting at each time by $s$ positions, are used to score $s$ new tokens’ loglikelihoods. This is equivalent to Gao et al. (2020)’s approach when $s=L$ . This approach reduces the skew of perplexity favoring models with larger $L$ (and thus fewer tokens appearing at the beginning of a context window) that is introduced by the early token curse via reducing the number of tokens appearing with little context preceding them. However, it is worth noting that the prevalence of such affected tokens decreases with larger context window sizes $L$ for a model.
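One way to enumerate such windows is sketched below. It is an illustration under stated assumptions rather than the procedure of Press et al. (2020) or lm-eval: the first window scores up to $L$ tokens, each later window scores up to $s$ new tokens preceded by $L-s$ tokens of context, and the handling of the single leading input token (BOS or the previous token) is omitted. The function name and return format are invented.

```python
def strided_windows(num_tokens, L, s):
    """Return (context_start, score_start, end) triples covering a document.

    Tokens in [score_start, end) are scored in that window; tokens in
    [context_start, score_start) are re-fed purely as context. Indices are
    0-indexed document positions; BOS handling is left out of this sketch.
    """
    if not 1 <= s <= L:
        raise ValueError("stride s must satisfy 1 <= s <= L")
    windows = []
    score_start = 0
    while score_start < num_tokens:
        # The first window can score up to L tokens; later ones add s new tokens.
        step = L if score_start == 0 else s
        end = min(score_start + step, num_tokens)
        context_start = max(0, score_start - (L - s))
        windows.append((context_start, score_start, end))
        score_start = end
    return windows


print(strided_windows(4500, 2048, 2048))  # s = L: non-overlapping chunks
print(strided_windows(10, 4, 2))          # s < L: overlapping context
```

Setting `s = L` recovers non-overlapping chunks, while smaller strides give every scored token after the first window at least `L - s` tokens of context at the cost of more windows.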

A downside of the above method is that a naive implementation requires, in the worst case, $\frac{L}{s}$ times as many LM calls and $\frac{L}{s}$ times the compute of the non-overlapping window approach (although some architectures can leverage KV cache reuse to avoid the cost of repeatedly re-encoding tokens). lm-eval follows Gao et al. (2020) in using non-overlapping windows of size $L$ . We believe this choice balances the computational cost and the mitigation of the early token curse, while still providing a standardized and comparable measure of language modeling performance.

A.4 Generative Evaluation

While loglikelihood-based tasks, such as multiple-choice question answering, provide a valuable measure of a language model’s understanding and ability to rank given options, they do not directly assess the model’s capacity to generate coherent and relevant text. Generative tasks, on the other hand, require the model to produce original text based on the given context.

Generative tasks have gained significant importance in recent times, particularly due to the fact that many popular language model APIs either do not provide (https://docs.anthropic.com/claude/reference/complete_post) or greatly limit access (https://platform.openai.com/docs/api-reference/chat) to log probabilities or other intrinsic measures of the model’s confidence in its outputs. This shift has made it especially necessary to rely on the generated text itself to evaluate the model’s performance and capabilities.

Generative tasks often involve various techniques for controlling the diversity and quality of the generated text, such as sampling with temperature, top-k or top-p (nucleus) sampling (Holtzman et al., 2020), and beam search (Li et al., 2016). The choice of these hyperparameters can significantly impact the model’s output and, consequently, its performance on the task. It is essential to consider and report these hyperparameters when evaluating generative models, as they can greatly influence the generated text’s characteristics and the model’s overall performance.

Due to the challenges in measuring language model output (Section 2.1), particularly in verifying the semantics of natural language, and because free-form generation sacrifices the benefit of the artificially restricted input space of multiple-choice tasks, the challenge of scoring answers for quality or correctness must be tackled differently.

The open-ended nature of generative tasks means that there may be multiple valid and appropriate responses to a given prompt. A common evaluation strategy is to use few-shot prompts, where the model is provided with a number of examples demonstrating the desired input-output format. The model is then prompted with a new input, and its generated response is extracted using regular expressions (regex) or other heuristic approaches to obtain the normalized answer strings which can be evaluated using exact-match or other metrics. This approach allows for a more structured evaluation of the model’s ability to generate accurate and relevant responses based on the given examples.

However, this is often a highly imperfect solution, as different models may generate responses in varying formats, making it challenging to create a universal regex pattern that works for all models. Moreover, the effectiveness of the regex-based extraction is highly dependent on the specific format used in the original task creation, which could introduce bias towards models that generate responses in a similar format. To address these limitations, lm-eval provides a highly customized answer extraction mechanism through a Filter component tied to Task implementations, allowing the model output to be put through an arbitrary number of filters and post-processing steps.
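The fragility of regex-based extraction is easy to demonstrate. The sketch below uses an invented pattern and invented model outputs, and it does not use lm-eval's Filter interface: it extracts a multiple-choice letter and scores it by exact match, so a correct answer given in an unexpected format is marked wrong.

```python
import re

# Hypothetical pattern matching phrasings such as "the answer is (B)".
ANSWER_PATTERN = re.compile(r"answer is\s*\(?([A-D])\)?", flags=re.IGNORECASE)


def extract_answer(generation):
    """Return the extracted answer letter, or None if the format is not matched."""
    match = ANSWER_PATTERN.search(generation)
    return match.group(1).upper() if match else None


def exact_match(generation, gold):
    """Score 1 only when an answer is extracted and equals the gold label."""
    return int(extract_answer(generation) == gold)


followed_format = "2 + 2 = 4, so the answer is (B)."
ignored_format = "B. Adding two and two gives four."

print(extract_answer(followed_format), exact_match(followed_format, "B"))
print(extract_answer(ignored_format), exact_match(ignored_format, "B"))
```

Both invented outputs contain the correct letter, yet only the first is credited, which is the kind of format sensitivity that makes sharing extraction code and raw outputs important.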

These custom heuristic approaches make the release of evaluation code, and our recommendations in general, even more crucial. Without knowing what extraction code is used and how it may be tailored to a model, or without access to model outputs, it is difficult to separate models’ compliance with the evaluation format from their answer correctness. Therefore, it is essential to provide detailed information about the answer extraction process and make the code and model outputs available to ensure transparency and reproducibility in generative model evaluation.

A.5 Comparing Generative and Loglikelihood-based Evaluation

A notable advantage of generative evaluation is that it might serve as a better proxy for assessing a language model’s performance in real-world applications. In most practical use cases, such as the increasingly popular conversational chatbot format, language models are expected to generate coherent and contextually relevant text based on a given prompt or context. By focusing on the quality and appropriateness of the generated text, generative evaluation provides a more direct assessment of the model’s performance in these real-world use cases. This is in contrast to loglikelihood-based tasks, which, while informative, may not fully capture the model’s ability to generate text that is both fluent and contextually appropriate.

On the other hand, loglikelihood-based evaluations have their own advantages, particularly when it comes to evaluating smaller or weaker models, or “base” models not trained to follow instructions (Sanh et al., 2022). These evaluations can provide a useful ranking or measurement of a model’s performance, even if the model is not capable of generating high-quality text on its own. By assessing how likely the model is to assign a high probability to the correct answer, loglikelihood-based evaluations can offer insights into the model’s understanding of the task. Moreover, techniques like Brier Score can be used to obtain smoother measurements of a model’s performance (Schaeffer et al., 2023), providing a more nuanced assessment of its capabilities. This can be particularly valuable when comparing and ranking models of different sizes and capacities.
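As an illustration of such a smoother measurement, the sketch below computes a Brier score for one multiple-choice question from per-answer loglikelihoods. Renormalizing the loglikelihoods over the answer choices with a softmax is an assumption made here for the example, and the exact definition used in any particular library or paper may differ; the function name and numbers are invented.

```python
import math


def brier_score(choice_lls, gold_index):
    """Brier score for one question, given per-answer loglikelihoods.

    Loglikelihoods are renormalized over the answer choices with a softmax
    (an assumption of this sketch), then compared against a one-hot target.
    Lower is better, and 0.0 means all mass is on the gold answer.
    """
    peak = max(choice_lls)
    weights = [math.exp(ll - peak) for ll in choice_lls]
    total = sum(weights)
    probs = [w / total for w in weights]
    return sum(
        (p - (1.0 if i == gold_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


# Invented loglikelihoods that renormalize to probabilities 0.5, 0.25, 0.25.
lls = [math.log(0.5), math.log(0.25), math.log(0.25)]
print(brier_score(lls, gold_index=0))
print(brier_score(lls, gold_index=1))
```

Unlike accuracy, this score changes continuously as probability mass moves toward or away from the gold answer, even when the top-ranked answer stays the same.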

Appendix B Case Studies

Recent work has explored the potential of various novel architectural designs to enable fully-subquadratic complexity in input sequence length while still achieving transformer-level quality or better (See Table 2 for a number of references). However, tracking progress towards this goal requires a reliable set of evaluations that can 1) be used to compare fairly against baselines and 2) provide useful signal even at small scales of experimentation.

As shown in Section 5.2 and elsewhere in the literature, evaluating models on different prompts or differently-framed evaluation setups for the same evaluation “task” can render comparisons not meaningful. This is especially important in the case of small language models trained on novel architectures, as “weaker” models may be hypothetically less robust to evaluation noise or differences in evaluation setup.

lm-eval has been used as a tool by many recent architecture releases to evaluate the performance of their proposed architecture against common baselines. We survey a number of recent releases, and note, for a number of commonly used benchmarks, whether researchers report their architecture’s performance on that benchmark, and if they specifically state the usage of lm-eval to evaluate these tasks where applicable.

The selection of tasks we check are the following:

Wikitext-103 (Merity et al., 2016): Wikitext-103 is a 103 million word language modeling dataset sourced from Wikipedia by Merity et al. (2016) to serve as a language modeling benchmark. It contains a training, validation, and test split, with a typical setup being to train a (small) model from scratch on the dataset and evaluate its test set perplexity (PPL).

LRA (Tay et al., 2021): LRA is a sequence modeling dataset consisting of a suite of various tasks meant to test the long-range modeling abilities of models. While solving the more challenging longer-context tasks in LRA drove earlier work on long-context sequence models (Gu et al., 2022), subsequent work has shown that LRA may not correlate with desired downstream tasks for pretrained models (Alam et al., 2024).

PIQA (Bisk et al., 2020): PIQA is a question-answering dataset meant to evaluate physical common-sense reasoning. It is typically evaluated using loglikelihood-based multiple choice, and normalized (acc_norm) or unnormalized (acc) accuracy is reported.

OpenBookQA (Mihaylov et al., 2018): OpenBookQA is a question-answering dataset meant to evaluate the combination of common knowledge with open-book exam questions. It is also typically evaluated using loglikelihood-based multiple choice, and normalized (acc_norm) or unnormalized (acc) accuracy is reported.

WinoGrande (Sakaguchi et al., 2019): WinoGrande is a dataset consisting of Winograd Schema Challenge-like minimal sentence pairs with one word flipped. Language models are typically evaluated on this dataset by comparing the probability of correctly completing the end of the sentence given the correct or incorrect context (the sentence up to and including the flipped word), and reporting accuracy (acc) (Radford et al., 2019; Brown et al., 2020).

HellaSwag (Zellers et al., 2019): HellaSwag is an adversarially created dataset meant to test “commonsense natural language inference” mined from WikiHow. Models are typically evaluated by choosing the completion text most likely to be generated from among a correct option and a set of (nonsensical) incorrect answer options (acc, acc_norm).

ARC (Clark et al., 2018): ARC is a challenging question answering dataset consisting of an Easy and Challenge subset. Questions are sourced from standardized tests on natural sciences. lm-eval follows Brown et al. (2020) in using a “cloze” style loglikelihood-based evaluation and reports acc, acc_norm over the set of answer strings.

LAMBADA (Paperno et al., 2016): The LAMBADA dataset is a word prediction benchmark consisting of short passages from Book Corpus (Zhu et al., 2015) books, with a language model required to predict the final word. Radford et al. (2019) introduce a cleaned and detokenized variant of LAMBADA, often denoted as “Lambada (OpenAI).”

This corresponds to the lambada_openai task in lm-eval, and the dataset can be found at https://huggingface.co/datasets/EleutherAI/lambada_openai. Two metrics are reported: average perplexity over the continuation string, and exact match accuracy calculated directly as described in Appendix A.

SuperGLUE (Wang et al., 2019a): SuperGLUE is a benchmark containing a collection of NLU tasks (BoolQ (Clark et al., 2019), CB (De Marneffe et al., 2019), COPA (Roemmele et al., 2011), MultiRC (Khashabi et al., 2018), ReCoRD (Zhang et al., 2018), RTE (Wang et al., 2019b; Dagan et al., 2005; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), WiC (Pilehvar & Camacho-Collados, 2018), WSC (Levesque et al., 2012)). lm-eval implements SuperGLUE as a loglikelihood-based multiple choice classification task based on Brown et al. (2020).

We report our assessments in Table 2. Overall, we find that lm-eval has been used frequently to measure new architectures’ zero-shot performance, which increases our confidence that most new works are evaluating on the same key benchmarks with the same methodologies. We note that not using lm-eval does not imply a lack of evaluation rigor, and that encouragingly, many works which do not use lm-eval nevertheless report implementation details, such as holding the tokenizer used for perplexity calculations on Wikitext-103 constant (Fu et al., 2023; Poli et al., 2023), and report hyperparameters. Additionally, public and reproducible evaluation code is not sufficient for full reproducibility and confident comparison: reporting dataset contents, or training one’s own controlled baselines, is also important and is frequently but not always done. Nevertheless, we wish to emphasize how easily lm-eval provides researchers with tools for performing research on advancing language modeling.

B.2 Comparisons Across Evaluation Settings

Here we provide extra materials and information for Section 5.2.

Here, to illustrate an example of the configurability and reproducibility of our library, we share the configurations used to compare the effect of prompting on MMLU and ARC.

A YAML configuration file for the ARC-easy task, implemented in the “cloze” style as done by (Brown et al., 2020).

A YAML configuration file for the ARC-easy task, as implemented following the prompting style of MMLU in Hendrycks et al. (2020).

We can observe that these configuration files define several components:

The source dataset from the Datasets (Lhoest et al., 2021) library (local datasets are also supported), and the splits to use for testing and few-shot examples. Few-shot examples are drawn from a special fewshot split if specified, else drawn from the training set, validation set, or (in the worst case) test set examples that do not overlap with the test set example currently being evaluated, in decreasing order of prioritization.

The doc_to_* attributes define mappings to the input prompt, the gold target label, and the list of answer choice strings, respectively.

We provide a list of metrics to use: here, acc denotes accuracy when unnormalized loglikelihoods are used to score answers, and acc_norm denotes accuracy when byte-length-normalized loglikelihoods are used.

Finally, the metadata.version field stores the task’s version attribute to report.

For prototyping, and for the quick modification of interrelated task variants during experimentation, configurations can also inherit from one another: the following is a config file for ARC-challenge, in the cloze style:

a config file for a single MMLU subset in its original style is the following: