SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan

Introduction

Language models (LMs) are rapidly being deployed in commercial products such as chatbots and coding assistants. At the same time, existing benchmarks have become saturated (Kiela et al., 2021; Ott et al., 2022) and fail to capture the frontier of what state-of-the-art LMs can and cannot do. There is a need for challenging benchmarks that more accurately reflect real-world applications of LMs to help shape their future development and usage (Srivastava et al., 2023).

Building a good benchmark is difficult since tasks must be challenging enough to stump existing models, but model predictions must also be easy to verify (Martínez-Plumed et al., 2021). Coding tasks are appealing as they pose challenging problems to LMs and generated solutions can be easily verified by running unit tests. However, existing coding benchmarks, such as HumanEval (Chen et al., 2021), mostly involve self-contained problems that can be solved in a few lines of code.

In the real world, software engineering is not as simple. Fixing a bug might involve navigating a large repository, understanding the interplay between functions in different files, or spotting a small error in convoluted code. Inspired by this, we introduce SWE-bench, a benchmark that evaluates LMs in a realistic software engineering setting. As shown in Figure 1, models are tasked to resolve issues (typically a bug report or a feature request) submitted to popular GitHub repositories. Each task requires generating a patch describing changes to apply to the existing codebase. The revised codebase is then evaluated using the repository’s testing framework.

SWE-bench offers several advantages over existing LM programming benchmarks. These include a realistic setting that utilizes user-submitted issues and solutions, diverse inputs featuring unique code problems from $12$ repositories, a robust framework for execution-based evaluation, and the ability to continuously update the benchmark with new instances with minimal human intervention.

We evaluate multiple state-of-the-art LMs on SWE-bench and find that they fail to solve all but the simplest issues. For instance, Claude 2 and GPT-4 resolve only $4.8\%$ and $1.7\%$ of tasks, respectively, even when using an oracle that retrieves the files to edit from a reference solution. Using a BM25 retriever, performance drops further to $1.96\%$ for Claude 2.

To aid open model development in this direction, we release a training dataset, SWE-bench-train, consisting of $19000$ non-testing task instances from $37$ other repositories. Using this dataset, we finetune two models, SWE-Llama 7b and 13b, based on CodeLlama (Rozière et al., 2023), that are competitive with Claude 2 and can solve issues using over $100000$ tokens as context. We hope SWE-bench serves as a challenging software engineering benchmark that aids in better understanding of the abilities and limitations of LMs.

SWE-bench

SWE-bench is a benchmark featuring GitHub issues from popular repositories that report bugs or request new features, and pull requests that make changes to the repository to resolve these issues. The task is to generate a pull request that addresses a given issue and passes tests related to the issue.

GitHub is a rich data source for software development, but repositories, issues, and pull requests can be noisy, ad-hoc, or poorly documented or maintained. To find high-quality task instances at scale, we use a $3$-stage pipeline as follows.

Stage I: Repo selection and data scraping. We start by collecting pull requests (PRs) from $12$ popular open-source Python repositories on GitHub, producing about $90000$ PRs in total. We focus on popular repositories as they tend to be better maintained, have clear contributor guidelines, and have better test coverage. Each PR has an associated codebase, which is the state of the repository before the PR was merged.

Stage II: Attribute-based filtering. We create candidate tasks by selecting the merged PRs that (1) resolve a GitHub issue and (2) make changes to the test files of the repository, which indicates that the user likely contributed tests to check whether the issue has been resolved.

Stage III: Execution-based filtering. For each candidate task, we apply the PR’s test content, and log the associated test results before and after the PR’s other content is applied. We filter out task instances that lack at least one test whose status changes from fail to pass (henceforth referred to as a fail-to-pass test). We also filter out instances that result in installation or runtime errors.
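The fail-to-pass criterion can be sketched as follows. This is a minimal illustration, assuming the test results from the two runs are already available as dictionaries mapping test names to status strings; the function names and the "PASSED"/"FAILED" labels are hypothetical and are not taken from the SWE-bench codebase.

```python
def fail_to_pass_tests(before, after):
    """Return tests that fail before the PR's non-test changes and pass after.

    `before` and `after` map test names to status strings (assumed labels).
    """
    return sorted(
        name
        for name, status in after.items()
        if status == "PASSED" and before.get(name) == "FAILED"
    )


def keep_candidate(before, after):
    """A candidate survives the filter only if it has a fail-to-pass test."""
    return len(fail_to_pass_tests(before, after)) > 0


# Hypothetical logs for one candidate task instance.
before_logs = {"test_fix": "FAILED", "test_existing": "PASSED"}
after_logs = {"test_fix": "PASSED", "test_existing": "PASSED"}
print(fail_to_pass_tests(before_logs, after_logs))
print(keep_candidate(before_logs, after_logs))
```

A candidate whose tests all pass both before and after the PR's other content is applied would be dropped under this rule, since its tests do not demonstrate that the PR changed any behavior.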

Through these stages of filtering, the original $90000$ PRs are filtered down to the $2294$ task instances which comprise SWE-bench. A final breakdown of these task instances across repositories is presented in Figure 3, and Table 1 highlights the key features of SWE-bench task instances. We highlight that the codebases are large with thousands of files, and the reference pull requests often make changes to multiple files at once. Technical details about SWE-bench’s construction pipeline are discussed in Appendix A. More statistics are in Appendix A.5.

Task Formulation

Model input. A model is given an issue text description and a complete codebase. The model is then tasked to make an edit to the codebase to resolve the issue. In practice, we represent edits as patch files, which specify which lines in the codebase to modify in order to resolve the issue.

Evaluation metrics. To evaluate a proposed solution, we apply the generated patch, using Unix’s patch program, to the codebase and then execute the unit and system tests associated with the task instance. If the patch applies successfully and all of these tests pass, we consider the proposed solution to have successfully resolved the issue. The metric for our benchmark is the percentage of task instances that are resolved. Additional technical details are in Appendix A.4.
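The metric can be illustrated with a short sketch. It assumes each evaluated instance has been reduced to a record saying whether the patch applied and what status each associated test ended in; the record layout, the function names, and the "PASSED" label are hypothetical.

```python
def is_resolved(patch_applied, test_statuses):
    """An instance is resolved if the patch applied and every test passed."""
    return bool(
        patch_applied
        and test_statuses
        and all(status == "PASSED" for status in test_statuses.values())
    )


def percent_resolved(results):
    """Percentage of task instances resolved, over all evaluated instances."""
    if not results:
        return 0.0
    resolved = sum(is_resolved(r["patch_applied"], r["tests"]) for r in results)
    return 100.0 * resolved / len(results)


# Hypothetical evaluation records for four task instances.
example_results = [
    {"patch_applied": True, "tests": {"t1": "PASSED", "t2": "PASSED"}},
    {"patch_applied": True, "tests": {"t1": "PASSED", "t2": "FAILED"}},
    {"patch_applied": False, "tests": {}},
    {"patch_applied": True, "tests": {"t1": "FAILED"}},
]
print(percent_resolved(example_results))
```

Note that a patch that fails to apply counts against the model in the same way as a patch that applies but breaks a test.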

Features of SWE-bench

Traditional benchmarks in NLP typically involve only short input and output sequences and consider somewhat “contrived” problems created specifically for the benchmark. In contrast, SWE-bench’s realistic construction setting imbues the dataset with unique properties, which we discuss below.

Real-world software engineering tasks. Since each task instance in SWE-bench consists of a large and complex codebase and a description of a relevant issue, solving SWE-bench requires demonstrating sophisticated skills and knowledge that experienced software engineers possess but that are not commonly evaluated in traditional code generation benchmarks.

Continually updatable. Our collection process can be easily applied to any Python repository on GitHub and requires almost no human intervention. Therefore, we can extend SWE-bench with a continual supply of new task instances and evaluate LMs on issues created after their training date, which ensures that the solution was not included in their training corpus.

Diverse long inputs. Issue descriptions are typically long and detailed ($195$ words on average), and codebases regularly contain many thousands of files. Solving SWE-bench requires identifying the relatively small number of lines that need to be edited to solve the issue amongst a sea of context.

Robust evaluation. For each task instance, there is at least one fail-to-pass test which was used to test the reference solution, and $40\%$ of instances have at least two fail-to-pass tests. These tests evaluate whether the model addressed the problem in the issue. In addition, a median of $51$ additional tests run to check whether prior functionality is properly maintained.

Cross-context code editing. Unlike prior settings that may constrain scope to a function or class (e.g., Chen et al., 2021; Cassano et al., 2022) or provide cloze-style fill-in blanks (e.g., Lu et al., 2021; Fried et al., 2023), SWE-bench does not provide such explicit guidance. Rather than merely having to produce a short code snippet, our benchmark challenges models to generate revisions in multiple locations of a large codebase. SWE-bench’s reference solutions average editing $1.7$ files, $3.0$ functions, and $32.8$ lines (added or removed).

Wide scope for possible solutions. The task of repository-scale code editing can serve as a level playing field to compare approaches ranging from retrieval and long-context models to decision-making agents, which could reason and act in code. SWE-bench also allows creative freedom, as models can generate novel solutions that may deviate from the reference PR.

SWE-Llama: Fine-tuning CodeLlama for SWE-bench

It is important to benchmark the performance of open models on SWE-bench alongside proprietary models. At the time of writing, only the CodeLlama models (Rozière et al., 2023) are able to handle the very long contexts necessary. However, we observe that the off-the-shelf CodeLlama variants are not capable of following the detailed instructions to generate repository-wide code edits, and typically output placeholder responses or unrelated code. To evaluate the capabilities of these models, we perform supervised fine-tuning on the $7$ billion- and $13$ billion-parameter CodeLlama-Python models. The resulting models are specialized repository editors that can run on consumer hardware and resolve GitHub issues.

Training data. We follow our data collection procedure and collect $19000$ issue-PR pairs from an additional $37$ popular Python package repositories. In contrast to Section 2.1, we do not require that pull requests contribute test changes. This allows us to create a much larger training set to use for supervised fine-tuning. To minimize the risk of any data contamination, the set of repositories in the training data is disjoint from the packages included in the evaluation benchmark.

Training details. Given the instructions, an issue text from GitHub, and the relevant code files as the prompt, we finetune SWE-Llama to generate the patch that solved the given issue (the “gold patch”). For memory efficiency, we fine-tune only the weights of the attention sublayer using LoRA (Hu et al., 2022), and exclude training sequences with more than $30000$ tokens, reducing the effective size of the training corpus to $10000$ instances. More details are provided in Appendix B.

Experimental Setup

In this section we explain how inputs are constructed to run SWE-bench evaluation. In addition, we review the models that we evaluate in this work.

SWE-bench instances provide an issue description and a codebase as input to the model. While issue descriptions are usually short ($195$ words on average as shown in Table 1), codebases consist of many more tokens ($438$K lines on average) than can typically fit into an LM’s context window. The question then remains of exactly how to choose the relevant context to provide to the model during generation.

To address this issue for our baselines, we simply use a generic retrieval system to select the files to insert as context. In particular, we evaluate models under two relevant context settings: 1) sparse retrieval and 2) an oracle retrieval.

Sparse retrieval. Dense retrieval methods are ill-suited to our setting due to very long key and query lengths, and especially the unusual setting of retrieving code documents with natural language queries. Therefore, we choose to use BM25 retrieval (Robertson et al., 2009) to retrieve relevant files to provide as context for each task instance. We experiment with three different maximum context limits, and simply retrieve as many files as fit within the specified limit. We evaluate each model on all limits that fit within its context window and report the best performance.
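To make the sparse retrieval setting concrete, the following is a self-contained sketch of Okapi BM25 ranking followed by greedy selection under a token budget. It is an illustration under stated assumptions rather than the retrieval code used in the experiments: the identifier-based tokenizer, the parameter values k1 = 1.5 and b = 0.75, the whitespace token count, and all function names are hypothetical choices.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase identifier-like terms (an assumed tokenizer)."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())


def bm25_rank(query, files, k1=1.5, b=0.75):
    """Rank file paths by Okapi BM25 score against the query, best first."""
    docs = {path: tokenize(text) for path, text in files.items()}
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / max(n_docs, 1)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    query_terms = set(tokenize(query))
    scores = {}
    for path, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores[path] = score
    # Ties are broken by path so that the ranking is deterministic.
    return sorted(scores, key=lambda p: (-scores[p], p))


def select_within_budget(ranked_paths, files, max_tokens):
    """Take files in rank order until the next one would exceed the budget."""
    chosen, used = [], 0
    for path in ranked_paths:
        cost = len(files[path].split())  # crude whitespace token count (assumption)
        if used + cost > max_tokens:
            break
        chosen.append(path)
        used += cost
    return chosen


# Hypothetical three-file repository and issue text.
repo_files = {
    "pkg/parser.py": "def parse_config(path): return load(path)",
    "pkg/utils.py": "def helper(x): return x + 1",
    "docs/readme.md": "installation notes for users",
}
issue = "parse_config crashes when path is missing"
ranking = bm25_rank(issue, repo_files)
print(select_within_budget(ranking, repo_files, max_tokens=5))
```

Stopping at the first file that does not fit is one reading of "as many files as fit"; skipping that file and continuing down the ranking would be an equally plausible alternative.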

“Oracle” retrieval. We additionally consider a setting where the context consists of exactly the files edited by the reference patch that solved the issue on GitHub. This “oracle” retrieval setting is less realistic, since a software engineer working on addressing an issue does not know a priori which files may need to be modified. However, this setting is also not necessarily comprehensive, since edited files alone may not include all the required context to understand exactly how software will behave when interacting with unseen parts of the code.

We compare the BM25 retrieval results against the “oracle” retrieval setting in Table 4, where we see that BM25 retrieves a superset of the oracle files in about $40\%$ of instances with the $27000$ token context limit, but also excludes all of the oracle files in over half of instances.
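The comparison between retrieved and oracle files amounts to a set computation per instance. The sketch below assumes each instance is given as a pair of file-path collections; the category labels "all", "partial", and "none" and the function names are hypothetical.

```python
from collections import Counter


def oracle_coverage(retrieved, oracle):
    """Classify how many of the oracle files appear among the retrieved files."""
    retrieved, oracle = set(retrieved), set(oracle)
    hits = retrieved & oracle
    if hits == oracle:
        return "all"      # retrieved files are a superset of the oracle files
    if not hits:
        return "none"     # every oracle file was excluded
    return "partial"


def coverage_rates(pairs):
    """Fraction of instances in each coverage category."""
    counts = Counter(oracle_coverage(r, o) for r, o in pairs)
    total = len(pairs)
    if total == 0:
        return {"all": 0.0, "partial": 0.0, "none": 0.0}
    return {key: counts[key] / total for key in ("all", "partial", "none")}


# Hypothetical (retrieved, oracle) pairs for four instances.
example_pairs = [
    (["a.py", "b.py", "c.py"], ["a.py"]),
    (["x.py"], ["a.py", "b.py"]),
    (["x.py", "y.py"], ["z.py"]),
    (["a.py", "x.py"], ["a.py", "b.py"]),
]
print(coverage_rates(example_pairs))
```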

Input Format

Once the retrieved files are selected using one of the two methods above, we construct the input to the model consisting of task instructions, the issue text, retrieved files and documentation, and finally an example patch file and prompt for generating the patch file. Examples of instances and further details on this formulation are provided in Appendix D.
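The ordering of prompt components described above can be sketched as a simple concatenation. The delimiter strings, the closing request, and the function name here are hypothetical placeholders chosen for illustration; the actual formatting is the one documented in Appendix D.

```python
def build_prompt(instructions, issue_text, files, example_patch):
    """Concatenate instructions, issue, retrieved files, and an example patch.

    `files` maps file paths to their contents. All delimiters are
    illustrative assumptions, not the benchmark's actual markers.
    """
    parts = [instructions, "<issue>", issue_text, "</issue>", "<code>"]
    for path, content in files.items():
        parts.append(f"[start of {path}]")
        parts.append(content)
        parts.append(f"[end of {path}]")
    parts += [
        "</code>",
        "<patch>",
        example_patch,
        "</patch>",
        "Respond with a single patch file in the format shown above.",
    ]
    return "\n".join(parts)


# Hypothetical inputs for a single task instance.
prompt = build_prompt(
    "You will be given an issue and part of a codebase.",
    "Crash on load when the file is empty.",
    {"pkg/io.py": "def load(path):\n    return open(path).read()"},
    "--- a/file.py\n+++ b/file.py\n@@ -1 +1 @@\n-old\n+new",
)
print(prompt)
```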

Models

Due to the need to process long sequence lengths, there are only a few models that are currently suitable for SWE-bench. Thus we evaluate ChatGPT-3.5 (gpt-3.5-turbo-16k-0613), GPT-4 (gpt-4-32k-0613), Claude 2, and SWE-Llama with their context limits shown in Table 2.

Results

In this section, we report results for models in a multitude of settings with different retrieval mechanisms and prompting styles, then provide some analysis and insight into model performance and difficulty. We summarize models’ performance on both the BM25 and “oracle” retrieval settings in Table 5. Across the board, models struggle significantly to resolve issues. The best performing model, Claude 2, achieves only a $4.8\%$ pass rate using the “oracle” retrieval context. When evaluated in the BM25 retrieval setting, Claude 2’s performance drops to $1.96\%$. Performance in the BM25 retrieval setting highlights the importance of choosing appropriate context, which becomes a theme in our analysis that we discuss further below.

Difficulty differs across repositories. When breaking performance down by repository, we observe that all models show similar trends across different repositories. Despite this, the issues resolved by each model do not necessarily overlap extensively. For example, in the “oracle” setting Claude 2 and SWE-Llama 13b perform comparably, resolving $110$ and $91$ instances, respectively. Yet Claude 2 solves only $42\%$ of the instances solved by SWE-Llama.

This may also be related to the presence of images in issues, which can be encoded into the issue markdown with embedded image links (i.e. ![image](https://...)). Some repositories naturally feature more instances with images; for example, $32\%$ of matplotlib and $10\%$ of seaborn instances contain embedded images in their issue text compared to just $2\%$ of all instances. Solving these instances may require multi-modal LMs or some kind of external tool use to process images.
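Counting instances with embedded images can be approximated by scanning issue text for Markdown inline image links. The sketch below is an assumption about how such a check might be done, not the procedure used for the reported percentages; it ignores HTML img tags and reference-style images, and the names are hypothetical.

```python
import re

# Markdown inline image: ![alt text](http(s)://url)
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(\s*(https?://[^)\s]+)")


def embedded_images(issue_text):
    """Return the URLs of Markdown inline images found in the issue text."""
    return IMAGE_LINK.findall(issue_text)


def fraction_with_images(issue_texts):
    """Fraction of issues that contain at least one embedded image link."""
    if not issue_texts:
        return 0.0
    with_images = sum(1 for text in issue_texts if embedded_images(text))
    return with_images / len(issue_texts)


# Hypothetical issue texts.
sample_issues = [
    "The legend overlaps the plot ![image](https://example.com/plot.png)",
    "See the docs [here](https://example.com/docs) for the expected output.",
]
print(fraction_with_images(sample_issues))
```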

Difficulty correlates with context length. Models may be pre-trained on long sequences of code but are typically asked to generate single functions at a time with limited context provided to frame the question. As shown in Figure 5, as total context length increases, Claude 2’s performance drops considerably, a behavior that is also observed in other models. In our evaluation settings, models see a lot of code that may not be directly related to solving the issue at hand, and they seem to frequently struggle with localizing problematic code needing to be updated. This result corroborates other studies showing that models can become distracted by additional context or as the target sequence moves earlier or later within the context window (Liu et al., 2023b). Even when increasing the maximum context size for BM25 would increase recall with respect to the oracle files, performance can still drop, as shown in Table 4, as models are simply ineffective at localizing problematic code in a sea of tokens.

Further investigating this, we provide an input ablation on the “oracle” retrieval context, where retrieved files are collapsed entirely, except for the lines actually edited by the true pull request (with $\pm 15$ lines of buffer), as shown in Figure 5. In this setting, we see increases in performance, with GPT-4 jumping from $1.3\%$ to $3.4\%$ and Claude 2 from $4.8\%$ to $5.9\%$.
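The collapsing step in this ablation can be sketched as follows, assuming the edited line numbers of each file are known from the reference pull request. The "..." elision marker and the function name are hypothetical; only the buffer of 15 lines comes from the description above.

```python
def collapse_file(lines, edited_lines, buffer=15):
    """Keep only lines within `buffer` lines of an edited line.

    `lines` is the file content as a list of strings and `edited_lines`
    holds 1-indexed line numbers. Omitted regions are replaced by a
    single "..." marker (an illustrative choice).
    """
    keep = set()
    for number in edited_lines:
        low = max(1, number - buffer)
        high = min(len(lines), number + buffer)
        keep.update(range(low, high + 1))

    collapsed, previous = [], 0
    for number in sorted(keep):
        if number != previous + 1:
            collapsed.append("...")  # a gap before this kept region
        collapsed.append(lines[number - 1])
        previous = number
    if keep and previous < len(lines):
        collapsed.append("...")      # trailing lines were dropped
    return collapsed


# Hypothetical 100-line file with a single edit at line 50.
source_lines = [f"line {i}" for i in range(1, 101)]
print(collapse_file(source_lines, [50], buffer=2))
```

Overlapping windows around nearby edits merge naturally here, since kept line numbers are collected in a set before the output is assembled.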

Difficulty does not correlate with issue resolution date. In Table 7 we show model results in the “oracle” retrieval setting, partitioned by date, for PRs created before or after 2023. We find that for most models there is little difference in performance before or after this date, with the exception of GPT-4. We consider this result to be largely promising as it suggests that despite models having been exposed to some version of a repository’s codebase, they are unlikely to “cheat” to address issues simply by generating a more recent version of the repository.

Finetuned models are sensitive to context distribution shifts. The finetuned models SWE-Llama 7b and 13b perform surprisingly poorly with BM25 retrieved context. As these models were finetuned using the “oracle” retrieval as context, we suspect this shift in context makes it difficult for the model to perform reliably. For instance, SWE-Llama was trained to edit every file included as context whereas in the BM25 setting many files provided in context are not expected to be changed.

Generating patches is easier than generating whole files. Models are often trained using standard code files and likely rarely see patch files. We generally formulate our task to have models generate patch files as opposed to recreating the entire file with their proposed change, since patch files will usually be a much more efficient representation of a file change. As shown in Table 5, we observe that models still struggle with generating well-formatted patch files. We therefore experiment with asking models to instead regenerate entire files with their proposed changes to resolve the issue. In this setting, we find that models generally perform worse than when generating patch files; for instance, Claude 2 scores $2.2\%$ compared to $4.8\%$ in the main table for “oracle” retrieval. Even when controlling for instance length by evaluating only on the shorter half of the task instances by input tokens, Claude 2 scores $3.9\%$ when regenerating whole files compared to $7.8\%$ when generating patches.

Language models tend to generate shorter, simpler edits. Model generated patch files tend to add and remove fewer lines than their respective gold patch. As shown in Table 8, compared to an average gold patch, model generated patch files that apply correctly are less than half the total length ($30.1$ versus $74.5$ lines) of gold edit patch files, and rarely edit more than a single file.
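Patch length statistics of this kind can be computed directly from a unified diff. The sketch below is a simplified illustration, assuming standard "---"/"+++" file headers; it does not count deleted files (whose new path is /dev/null) and would miscount a removed line whose content itself begins with "-- ". The function name and the returned keys are hypothetical.

```python
def patch_stats(patch_text):
    """Count edited files and added/removed lines in a unified diff."""
    files, added, removed = set(), 0, 0
    for line in patch_text.splitlines():
        if line.startswith("+++ "):
            path = line[4:].strip()
            if path != "/dev/null":
                # Drop the conventional "b/" prefix when present.
                files.add(path[2:] if path.startswith("b/") else path)
        elif line.startswith("--- "):
            continue  # old-file header, not a removed line
        elif line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return {
        "files": len(files),
        "added": added,
        "removed": removed,
        "total": added + removed,
    }


# Hypothetical single-file patch.
example_patch = """--- a/pkg/mod.py
+++ b/pkg/mod.py
@@ -1,3 +1,4 @@
 import os
-x = 1
+x = 2
+y = 3
 print(x)
"""
print(patch_stats(example_patch))
```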

We select $11$ generations from SWE-Llama and Claude 2 collectively to better understand the quality of the task and generated patches under the “oracle” retrieval setting. Here we discuss one example from SWE-Llama and summarize our overall findings, with in-depth analyses for the remaining examples shown in Appendix F.

We’ll consider the task instance sphinx-doc__sphinx-8713 from the Sphinx documentation generator, shown in Figure 6. The issue states that the napoleon extension of Sphinx is not properly formatting the documentation keyword “Other Parameters” when the config setting napoleon_use_param is set to True. The issue text further provides a detailed code snippet of where the problematic source code is suspected to be, as well as some code examples for reproducing the error and additional information related to package versions. For this particular instance, the model did not resolve the task, failing to pass some of the tests resolved by the gold solution.

In the “oracle” retrieval setting, the model input provides this issue text along with some instructions, the full contents of files edited by the gold patch, and an example of the diff format we expect the answer to be in. The total model input consists of $1558$ lines of context or $20882$ tokens. When comparing the gold patch and the model’s patch, we find an obvious mistake. While the model edits the correct function, _parse_other_parameters_section at line $684$ in sphinx/ext/napoleon/docstring.py, it changes the function to behave as if napoleon_use_param were always True instead of checking the config setting first and copying what _parse_parameters_section does, as the gold patch does. In the tests, test_parameters_with_class_reference directly compares the documentation produced using a config where napoleon_use_param is set to False, which catches the model’s error immediately.

Comparing results across all the examples we consider, we notice a few prominent trends in behavior. Models tend to write primitive Python code and do not leverage existing third-party libraries or the rest of the codebase for their solutions. Models’ generations also reflect a “greedy” approach of solving the problem exactly, with little regard for code style or logical constraints that might be reflected by the codebase (e.g. using relative instead of absolute imports). In contrast, we observe that many gold patches will make structural improvements that cover a much larger scope of the codebase; these edits not only resolve the issue, but also anticipate and solve obvious potential future issues. We present additional case studies and identify more nuanced discrepancies in Appendix F.

Related Work

Evaluation of LMs. Several recent works for evaluating LMs have either proposed a collection of mutually distinct tasks spanning across multiple domains (Hendrycks et al., 2021; Liang et al., 2022; Srivastava et al., 2023) or turned to the web as an interactive setting featuring tasks that require multiple steps to solve (Yao et al., 2022; Zhou et al., 2023; Deng et al., 2023; Liu et al., 2023d). There are several drawbacks with such a “potpourri” style setup. First, each task tends to narrowly focus on one or a few skills, resulting in challenges that are typically too simple, pigeonhole the model into a reduced role, and do not provide models with the bandwidth to exercise their versatility or potentially demonstrate new abilities (Srivastava et al., 2023). Consequently, a model’s performance on such task conglomerations may not yield actionable, deep insights regarding its capabilities and how to improve them (Schlangen, 2019; Martínez-Plumed et al., 2021; Bowman & Dahl, 2021). SWE-bench addresses these shortcomings, as our work demonstrates that it is significantly challenging, presents a wide range of possibilities for improving LMs to solve this task, and is easy to refresh over time with new task instances, each of which introduces novel, nuanced, and practical challenges.

Code Generation Benchmarks. HumanEval (Chen et al., 2021) is the current standard in a long-standing pursuit of synthesizing code from natural language descriptions (Yu et al., 2018; Austin et al., 2021; Hendrycks et al., 2021; Li et al., 2022a; Zan et al., 2023). In the past year, subsequent benchmarks have sought to augment HumanEval with extensions to different languages (Cassano et al., 2022; Athiwaratkun et al., 2023; Orlanski et al., 2023), variations in edit scope (Yu et al., 2023; Du et al., 2023), similar but novel code completion tasks (Muennighoff et al., 2023), and more testing (Liu et al., 2023a). Simultaneously, separate works have sought to introduce new coding paradigms (Yin et al., 2022; Yang et al., 2023) or design library-specific problems (Lai et al., 2022; Zan et al., 2022). Instead of partitioning problems into siloed datasets and curtailing them for simplicity’s sake, SWE-bench’s collection procedure transforms the source code with minimal post-processing, preserving a much broader set of challenges grounded in real-world software engineering beyond closed form completion, such as patch generation, reasoning over long contexts, navigating a codebase directory, and capturing dependency-based relationships across modules.

ML for Software Engineering. To overcome traditional program analysis techniques that may not scale or incorporate natural language, one direction of current software engineering research is to use neural networks, including LMs, to automate real-world software development processes (Maniatis et al., 2023; Zheng et al., 2023; Hou et al., 2023). Use cases include automating commit generation (Jung, 2021; Liu et al., 2023c), PR review (Yang et al., 2016; Li et al., 2022b; Tufano et al., 2021), bug localization (Kim et al., 2019; Chakraborty et al., 2018), testing (Kang et al., 2023; Xia et al., 2023; Wang et al., 2023), and program repair (Monperrus, 2018; Gupta et al., 2017; Allamanis et al., 2017; Gazzola et al., 2019; Goues et al., 2019; Gao et al., 2022; Dinh et al., 2023; Motwani & Brun, 2023). Most relevant to SWE-bench are works that have sought to apply LMs towards automated program repair (Xia & Zhang, 2022; 2023; Fan et al., 2023) and guiding code editing with commits (Chakraborty & Ray, 2021; Zhang et al., 2022; Fakhoury et al., 2023). However, none of the existing datasets (Just et al., 2014; Karampatsis & Sutton, 2019) present code context at the scale of SWE-bench. Moreover, SWE-bench isolates the changes at the function level, and can be easily extended to new programming languages and other software modalities. SWE-bench is compatible with such works, but provides a significantly more realistic and challenging arena to carry out future experiments towards augmenting LMs with software engineering tools and practices.

Discussion

Limitations and future directions. First, SWE-bench task instances are all in Python; we hope to apply SWE-bench’s task instance collection procedure to expand its coverage to more programming languages and domains. Second, our experiments aim to establish a baseline of the simplest and most straightforward approaches for this task; we do not intend to constrain future methodologies to the same type of approach and encourage future work to investigate different methods. To this end, we are particularly excited about agent-based approaches for identifying relevant context from a codebase, larger scale models fine-tuned for patch generation, and augmenting LMs with program analysis and software engineering tools. Lastly, while this work evaluates models using execution-based code testing, relying solely on this method is insufficient to guarantee reliable performance of model generations, as we find automated code generations from LMs can frequently be less comprehensive, efficient, or readable compared to human-written solutions.

Conclusion. The complexity of real-world software development processes extends far beyond just code completion. By drawing on the open-source collaborative pipeline, SWE-bench creates a faithful mirror of real world coding environments. This more realistic environment encourages creative solutions that can have immediate applicability in open-source software development. We hope that this benchmark and our other contributions can serve as valuable assets in the future development of LMs that are more practical, intelligent, and autonomous.

Ethics Statement

SWE-bench is collected entirely from public repositories whose licenses permit software usage, and our contributions are in accordance with these licenses. Details of the licenses are included in Table 12. During the collection or evaluation processes, we do not collect information about GitHub users, and the SWE-bench task instances do not use GitHub data beyond what is offered via the public API and website. Our contributions do not involve any human subject participation; we do not perform crowdsourcing or recruit human task workers for any part of SWE-bench, including its collection and evaluation procedures along with the experiments. SWE-bench’s filtering criteria for GitHub repositories based on popularity do not implicitly or explicitly rely on any discriminative or biased heuristics for repository selection. For the dataset release, we plan to open source the SWE-bench task instances, the collection and evaluation infrastructure, the experimental results, the training data used for fine-tuning SWE-Llama models, and the SWE-Llama model weights. Following best practice precedents, we will also put forth ample documentation to describe each component and its use, and we will also put in place convenient communication channels for soliciting feedback to improve SWE-bench. SWE-bench does not put forth any immediately harmful insights. We briefly discuss the potential impact of SWE-bench’s usage in Section E.

Reproducibility Statement

For our submission, we have uploaded the entirety of the source code as a zipped file that has been properly anonymized. We have organized the codebase such that separate directories correspond to different contributions within the main paper (i.e. dataset collection, evaluation, open source model inference, SWE-Llama training, etc.). The source code contains inline documentation that details the purpose and usage of different parts of the codebase. In addition, we also include the full set of $2294$ SWE-bench task instances, which contain all the components discussed in the main paper. Beyond the documentation in the source code, we include thorough technical details for the collection pipeline and evaluation procedures in Section A.2 and Section A.4 that complement the original details in Section 2 of the main paper. These sections fully cover the logic presented in the code and can be helpful for understanding it. Moving forward, as discussed in the ethics statement, we plan to more formally release SWE-bench to the public as an open source repository with thorough details that describe the benchmark, outline the code, and detail its usage. A major component of SWE-bench is the collection framework, which will be part of the open sourced code. Because of its easily maintainable design, as discussed in the main paper, our hope and belief is that SWE-bench should be highly reproducible.

Acknowledgements

We thank Danqi Chen, Tri Dao, Zexuan Zhong, Tianyu Gao, Will Merrill, Mengzhou Xia, Dan Friedman, Adithya Bhaskar, Austin Watkins, Aatmik Gupta, and Richard Zhu for their valuable feedback and advice.

References

Appendix

In the appendix, we provide more thorough details regarding the dataset construction process, evaluation pipeline, and characterization of the SWE-bench benchmark.

Appendix A Benchmark Details

This section complements Section 2 with a more technical and fine-grained summary of the data collection, execution-based validation, and evaluation procedures, along with a fuller characterization of the task instances.

Pull request scraping. From a list of the top $5000$ most downloaded PyPI libraries during August 2023, we select the top $100$ packages, identify each library’s corresponding open-source GitHub repository, verify which packages have licenses allowing for free software use, and collect all PRs for these repositories via the GitHub developer API. We elect to source problems from well-trafficked repositories because widespread use usually suggests that the repository has extensive documentation, structured open-source development guidelines, and working, well-formatted code.

Task instance construction. We construct candidate task instances from PRs that satisfy three conditions. First, the PR’s status must be Merged. A Merged status indicates that the PR’s associated code changes were accepted and incorporated into its parent repository. Second, the PR resolves one or more issues in its repository. An issue is defined according to its canonical usage in GitHub as a digital ticket for tracking bugs, enhancements, or any general development goals for a software project. We scan a PR’s title, body, and commit messages for linked issues (e.g. “fixes #24”). Third, the PR must introduce one or more new tests. A new test is counted when a PR’s code changes edit a file path containing a testing-related keyword (e.g. “test”, “testing”).
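The second and third conditions can be sketched with a regular expression and a substring check. This is a hedged illustration: the pattern covers GitHub's documented closing keywords (close, fix, resolve, and their inflections), the keyword tuple contains only the two examples named above, and the function names are hypothetical. A plain substring match is deliberately loose and would also flag a path such as "latest.py".

```python
import re

# Closing keyword followed by an issue reference, e.g. "fixes #24".
ISSUE_LINK = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s*:?\s*#(\d+)",
    re.IGNORECASE,
)

TEST_KEYWORDS = ("test", "testing")  # only the examples given in the text


def linked_issues(*texts):
    """Collect issue numbers linked in a PR's title, body, or commit messages."""
    found = []
    for text in texts:
        for match in ISSUE_LINK.findall(text):
            number = int(match)
            if number not in found:
                found.append(number)
    return found


def touches_tests(changed_paths):
    """True if any changed file path contains a testing-related keyword."""
    return any(
        keyword in path.lower()
        for path in changed_paths
        for keyword in TEST_KEYWORDS
    )


# Hypothetical PR metadata.
pr_title = "Fixes #24 and closes #7"
pr_commits = ["resolved #24 by guarding the empty case"]
print(linked_issues(pr_title, *pr_commits))
print(touches_tests(["pkg/io.py", "pkg/tests/test_io.py"]))
```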

A PR that satisfies these criteria is then converted into a candidate task instance such as the example in Figure 7. The codebase $C$ is identified by the repository’s owner/name moniker and the pull request’s base commit. Recovering the actual codebase from this information is straightforward. We create mirrors of the original GitHub repositories, where each mirror is uniquely identified as owner__name. Cloning a repository’s corresponding mirror and checking out the base commit yields $C$ in its pre-PR state. The problem statement $P$ is an aggregate of all related issues’ titles and descriptions, along with any subsequent comments written before the timestamp of the PR’s initial commit; later comments are excluded to avoid leakage of solution details. A PR’s code changes are separated into a test patch and a gold patch $\delta$. $T$ consists of all tests from files edited in the test patch. As shown in Figure 7, both $T$ and $\delta$ are stored as patch files. Further details about parsing PR and semantic data are in Appendix A.2.

Execution-based validation. We verify the usability of a task instance via execution. For each candidate, we first define a virtual environment to serve as an execution context, then install $C$ before applying any patches, and finally run $T$ once before and once after the solution $\delta$ is applied. A candidate is removed from consideration for the final dataset if any step in the verification process fails. In addition, to ensure that a solution $\delta$ is non-trivial, we compare the pre-solution and post-solution validation logs to check whether there are one or more tests in $T$ whose status changes from fail to pass. Lastly, we exclude task instances with tests that invoke newly created functions or classes first introduced in the solution $\delta$. Since naming such constructs is typically an arbitrary process and usually not explicitly specified in the problem statement, resolving tests such as these may be an impossible task even for human developers. Information about execution contexts, codebase installation, determining test statuses from logs, and more is in Appendix A.3.

Continuous Updates. SWE-bench’s collection process is easily extensible to any open source code repositories, allowing for easy and low-maintenance extension to new programming languages and code domains. This design also provides SWE-bench with temporal robustness; as new language models trained on more recent source code are released over time, SWE-bench can simply be updated to produce new task instances based on PRs created after any LM’s training date.

A.2 Construction Process

We discuss additional details regarding the conversion of a pull request object into a candidate task instance. At a high level, the main goal of this conversion is to acquire relevant information for putting together the codebase $C$, problem statement $P$, unit tests $T$, and solution $\delta$ components introduced in Section 2. To this end, a SWE-bench task instance consists of the fields presented in Table 9. Collectively, the fields correspond to the four task instance modules.

Problem Statement. The problem statement $P$ for each task instance is readily available as the problem_statement field. The problem statement is an aggregate of all issues’ first comments along with any comments attached to those issues that were created before the creation date of the PR’s initial commit. We crawl for issues from the PR’s title, body, and commit messages. After concatenating these components’ text data, we first remove any Markdown-style comments, then look through the remaining text for references to issue numbers (a pound # sign followed by a number) and check whether the word preceding the issue number reference is included in a set of keywords suggesting that the issue was resolved by the PR (e.g. “closes”, “fixes”, “resolves”). The found issues are recorded in the issue_numbers field, then separate web requests are made to retrieve each issue’s data. To form the problem_statement, each issue’s title and body are added together and then concatenated with the next issue’s if there are multiple. It is also during this step that the hints_text field is created; it is collected from the PR’s comment section, using the text of comments created before the PR’s initial commit. The intuition for this collection methodology is that such PR comments would likely contain natural language and pseudo-code suggestions to the original human task worker regarding how to complete the problem at hand. The experiments presented in this work do not make use of hints_text, but we believe this information may be interesting for future investigations.
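The keyword-based issue lookup described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name, the exact keyword set, and the reading of "Markdown-style comments" as `<!-- ... -->` blocks are all assumptions.

```python
import re

# Hypothetical keyword set; the text only lists "closes", "fixes", "resolves" as examples.
RESOLVE_KEYWORDS = {"closes", "fixes", "resolves"}

# Assumption: "Markdown-style comments" are HTML comments of the form <!-- ... -->.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# A word, whitespace, then a pound sign followed by a number.
ISSUE_REF = re.compile(r"(\w+)\s+#(\d+)")


def extract_resolved_issues(text: str) -> list[int]:
    """Return issue numbers whose '#N' reference directly follows a resolving keyword."""
    text = COMMENT.sub("", text)  # strip comments before searching
    issues: list[int] = []
    for word, number in ISSUE_REF.findall(text):
        if word.lower() in RESOLVE_KEYWORDS and int(number) not in issues:
            issues.append(int(number))
    return issues
```

Under these assumptions, a reference such as "see #30" is ignored because "see" is not a resolving keyword, and references inside a stripped comment are never seen by the matcher.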

Codebase. The content of the codebase $C$ is not stored in plaintext for every task instance. Rather, the task instance contains a reference to the relevant codebase via the repo and base_commit fields. Both fields are available in the original PR’s data. To make retrieval of the codebase $C$ from these two elements reproducible and reliable, we create mirrors of the original repository. Mirrors for the repositories constituting both the evaluation and fine-tuning data are collected and open-sourced under the SWE-bench GitHub organization. Because an original repository’s code may be subject to changes in its commit and edit history outside of the authors’ control, we choose to create a mirror repository to ensure that later modifications to the codebase do not render a task instance unusable due to corruption or removal of the associated base_commit. Additionally, we create a mirror instead of cloning and storing the latest version of a repository. This is because a mirror retains the original commit hashes, history, branches, and tags, serving as a faithful and complete history of the technical details of the original repository. A mirror does not retain stars, watchers, issues, or pull requests from the original repository.

We create a mirror of a repository on the same day that its task instances are collected, after collection has finished. The mirror retains the original repository’s “owner/name” moniker, except that the “/” character is converted to “__” to conform to GitHub naming conventions. Given this infrastructure, retrieving a task instance’s codebase is straightforward. First, the correct mirror can be cloned from the SWE-bench organization using repo. Next, within the local copy of the mirror, checking out the base_commit will reset the repository to codebase $C$. To proceed to another task instance from the same repository, git version control is used to automatically remove any modifications associated with the current task instance before checking out the next task instance’s base commit.
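As an illustration of this retrieval flow, the sketch below builds the git commands without running them. The helper names and the `org_url` parameter are hypothetical; the text does not specify which git commands are used to discard modifications, so `reset --hard` followed by `clean -fd` is an assumed choice.

```python
def mirror_name(repo: str) -> str:
    """Convert an 'owner/name' moniker into the 'owner__name' mirror name."""
    owner, name = repo.split("/")
    return f"{owner}__{name}"


def clone_command(repo: str, org_url: str) -> list[str]:
    """Command that clones the mirror of `repo`; `org_url` is a caller-supplied base URL."""
    mirror = mirror_name(repo)
    return ["git", "clone", f"{org_url}/{mirror}.git", mirror]


def checkout_commands(repo_dir: str, base_commit: str) -> list[list[str]]:
    """Commands that discard local modifications, then check out the base commit."""
    git = ["git", "-C", repo_dir]
    return [
        git + ["reset", "--hard"],        # drop changes to tracked files
        git + ["clean", "-fd"],           # drop untracked files and directories
        git + ["checkout", base_commit],  # restore codebase C
    ]
```

Each returned command could be passed to `subprocess.run(cmd, check=True)`; keeping command construction separate from execution makes the sequence easy to inspect.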

Solution, Test Patches. The solution $\delta$ and tests $T$ are derived from the file changes data, or diff, of a PR. As mentioned in Section 2.1, the original diff, the solution $\delta$, and the tests $T$ are each represented as a .patch file, a format for efficiently specifying transformations to line-based text files. Generally speaking, a .patch is structured as a list of blocks, where each block consists of a header and one or more hunks that collectively correspond to changes to a single file. The header contains metadata specifying a file path and line numbers, while the actual modifications to the target file are encoded as multiple lines prefixed by “+” and “-” to indicate additions and removals. To create the tests $T$, we first identify every unique block within the patch, then pick out and combine the blocks whose file paths contain testing-related keywords (e.g. “tests”, “testing”). The remaining blocks are merged to form the solution $\delta$. We validate the robustness of the script written to correctly parse $T$ and $\delta$ by applying both patches to the corresponding codebase $C$ and running the tests; we then check that the results reproduce the behavior of the base PR’s diff data. The solution $\delta$ is saved as the patch field while the tests $T$ are saved as the test_patch field.
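The block-splitting step can be sketched as below. This is a simplified illustration, not the authors' parser: it assumes git-style diffs in which every per-file block begins with a `diff --git a/<path> b/<path>` line, and the function name and keyword tuple are hypothetical.

```python
import re

# Example keywords from the text; "test" also matches "tests" and "testing".
TEST_KEYWORDS = ("test",)


def split_patch(diff_text: str) -> tuple[str, str]:
    """Split a git-style diff into (solution_patch, test_patch) by file path."""
    # Split at the start of every line that begins a new per-file block.
    blocks = re.split(r"(?m)^(?=diff --git )", diff_text)
    solution, tests = [], []
    for block in blocks:
        if not block.startswith("diff --git "):
            continue  # skip any preamble before the first block
        header = block.splitlines()[0]
        path = header.split(" b/", 1)[-1]  # destination path of the block
        is_test = any(keyword in path.lower() for keyword in TEST_KEYWORDS)
        (tests if is_test else solution).append(block)
    return "".join(solution), "".join(tests)
```

Because whole blocks are moved rather than edited, concatenating the two outputs reproduces every block of the input diff, which mirrors the consistency check against the base PR described above.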

Remaining Fields. The created_at field is a timestamp that specifies when the base PR was created. We retain the created_at field from the original data and use this field to perform temporal analysis of model performance. The version field is a string that corresponds to the release version, with respect to the repo, during which the PR was released. Depending on availability and the amount of effort required for each method, we create the version field by retrieving the information directly from the source code, building the repository locally and invoking code to display the version to standard output, or comparing the created_at field with a timeline of release versions from a repository’s webpage. We create executable contexts for every version of a repository, as discussed in greater detail in § A.3.

A.3 Execution-Based Validation

After filtering through all the PRs from a repository and converting those that satisfy the aforementioned criteria into candidate task instances, the next step is to validate the usability of each task instance via execution. This procedure is broken down into three steps. First, we create executable contexts for each release version of a repository. Next, we check whether the solution $\delta$ and tests $T$ can be applied, installed, and run successfully on top of codebase $C$ . Finally, we examine each task instance’s execution log to verify a specific set of behaviors to ensure that the task is usable and fair for model evaluation.

Executable Contexts. We choose to create executable contexts per release version after experimenting with several levels of granularity at which virtual environments could be defined. Defining task instance-specific contexts is most conducive to ensuring end-to-end installation success, but comes at the cost of laborious manual handcrafting. On the other hand, a repository-specific context based on the latest version of a repository is typically too coarse and is not compatible with older versions’ requirements. We find that release versions are a good proxy for capturing the dependency requirements across a subset of task instances, striking a manageable balance between installation success and manual effort. We manually create each executable context by examining the codebase of the latest task instance for each version. Based on the source code and the documentation typically found in the repository’s README and CONTRIBUTING guides, we determine the Python version, necessary dependencies, and installation command.

Validation Engine. The purpose of the validation engine is to verify candidate task instances. Specifically, this step checks first, that the solution $\delta$ and tests $T$ can be applied to codebase $C$ , and second, that the codebase can be properly installed and run within the corresponding virtual environment. To do this, we perform validation repository-by-repository, where for each repository’s set of task instances, we perform the following procedure:

1. Create executable contexts as conda envs. based on latest task instance per version.

2. Iterate across each group of task instances, where for each task instance, we perform the following within the corresponding conda env.

(a) Remove any file changes and checkout the task instance’s base_commit. This sets the repository to codebase $C$.

(b) Run the installation command to instantiate codebase $C$.

(c) Apply the test patch $T$ to codebase $C$.

(d) Run the testing script, determined from test patch $T$, to generate test result logs $log_{pre}$.

(e) Apply the solution $\delta$ patch to the codebase $C$ with tests $T$.

(f) Run the testing script from part (d) again to generate test result logs $log_{post}$.

The testing command consists of the testing framework used by the repository (e.g. pytest, tox) with paths specified in $T$ appended. The testing command would run any and all tests that are specified within the contents of each file path. If any of the steps $(a)$ through $(f)$ fails, the candidate task instance is discarded from consideration. With moderate variation across repositories, we observe that this step generally removes half of the candidate task instances.

Examining Validation Logs. Last but not least, we check the logs $log_{pre}$ and $log_{post}$ created by the validation engine for specific properties. First, to guard against arbitrary naming choices, we check $log_{pre}$ for ImportError and AttributeError occurrences, which are potentially indicative of dependency naming related errors that would be trivial and near-impossible to address correctly. To this end, we remove all task instances with such errors in their $log_{pre}$ from consideration. Next, we compare the test results to check that the task instance is non-trivial, indicated by at least one test having a fail status before the solution $\delta$ is applied and a pass status after. To check this, we first define several repository-specific parsers to convert $log_{pre}$ and $log_{post}$ into mappings of test $t_{i}\in T$ to a status $s\in\{\text{fail},\text{pass}\}$. Given these two data structures, we then check that there exists at least one $t_{i}$ where $s$ changes from fail to pass. If no such tests are found, the task instance is removed from consideration.
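The two log checks can be sketched as a single predicate. This is a minimal sketch under stated assumptions: the repository-specific parsers are taken to have already produced the test-to-status mappings, and the function and argument names are hypothetical.

```python
def passes_log_checks(log_pre: str, pre: dict[str, str], post: dict[str, str]) -> bool:
    """Keep a candidate only if log_pre is free of naming errors and some test flips fail -> pass.

    `pre` and `post` map test names to "fail" or "pass" (assumed parser output).
    """
    # Check 1: discard instances whose pre-solution log shows naming-related errors.
    if "ImportError" in log_pre or "AttributeError" in log_pre:
        return False
    # Check 2: require at least one test that fails before and passes after the solution.
    return any(pre.get(test) == "fail" and post.get(test) == "pass" for test in post)
```

A substring search over the raw log is the simplest possible reading of "check for occurrences"; a stricter implementation might inspect only the error summaries of individual tests.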

If a task instance fulfills these two criteria, then it is included in the evaluation dataset. Table 10 displays a summary of how many task instances were removed from consideration across the construction process and execution-based validation steps. We save all finalized task instances to a single .json file that is open-sourced and available for download.

Alongside the task instances, we also create a corresponding folder containing the ground truth test results. For each task instance, from their respective $log_{pre}$ and $log_{post}$ test-to-status mappings, we create a test results data structure where the keys are FAIL_TO_FAIL, FAIL_TO_PASS, PASS_TO_FAIL, and PASS_TO_PASS, and the values are lists of tests. By “caching” these results, we remove the need to re-run the solution $\delta$ at evaluation time (although re-running is an available option). We use this data structure to verify task completion, as discussed in Section A.4.
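A sketch of how such a test results structure could be assembled from the two mappings is shown below. The function name is hypothetical, and restricting the result to tests that appear in both mappings is an assumption the text does not address.

```python
def build_test_results(pre: dict[str, str], post: dict[str, str]) -> dict[str, list[str]]:
    """Group tests by their (pre-solution, post-solution) status transition."""
    keys = {
        ("fail", "fail"): "FAIL_TO_FAIL",
        ("fail", "pass"): "FAIL_TO_PASS",
        ("pass", "fail"): "PASS_TO_FAIL",
        ("pass", "pass"): "PASS_TO_PASS",
    }
    results: dict[str, list[str]] = {name: [] for name in keys.values()}
    # Assumption: only tests present in both mappings are recorded.
    for test in sorted(pre.keys() & post.keys()):
        results[keys[(pre[test], post[test])]].append(test)
    return results
```

Sorting the test names keeps the cached structure deterministic across runs, which simplifies comparing it against later evaluation results.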

A.4 Evaluation Procedure

We provide a visualization of the evaluation procedure in Figure 8. The evaluation procedure scores the model’s $\hat{\delta}$ .patch generation with respect to the behavior of the solution $\delta$ . At a finer-grained level, the evaluation procedure can be broken down into four separate steps, highlighted by the numbered steps in Figure 8. First, the codebase and problem statement are visible and given to the LM; the LM then generates a .patch prediction $\hat{\delta}$ . In the evaluation step, the following steps are performed per prediction on the target task instance:

1. Remove any file changes and checkout the task instance’s base commit. This sets the repository to codebase $C$.

2. Activate the executable context corresponding to the task instance’s version.

3. Run installation command to instantiate codebase $C$.

4. Apply test patch $T$ to codebase $C$.

5. Apply prediction patch $\hat{\delta}$ to codebase $C$ with tests $T$.

6. If the previous step fails, we attempt to fix prediction patch $\hat{\delta}$ automatically and reapply it.

7. Run the testing script, determined from test patch $T$, to generate test result logs $log_{\hat{\delta}}$.

Steps 1 through 4 reliably do not fail because they were verified during the task instance validation process. If applying the prediction patch (Step 5) fails, we attempt to repair the prediction patch file by removing unnecessary context lines and recalculating the header values (Step 6). If the repaired patch fails again or running the test command (Step 7) fails, then the prediction is automatically given a score of $0$. Assuming these steps succeed, the output log $log_{\hat{\delta}}$ can then be converted to a test-to-status mapping, identical in structure to the mappings produced during validation, via the appropriate repository-specific parser introduced in § A.3.

Evaluation Metrics Calculation. To determine task completion, we compare the test-to-status mapping parsed from $log_{\hat{\delta}}$ with the list of tests corresponding to the FAIL_TO_PASS and PASS_TO_PASS keys from the ground truth test results data structure. Determining task completion is straightforward; we check that all FAIL_TO_PASS and PASS_TO_PASS tests are found and have a pass status in the evaluation test-to-status mapping. If a test is missing or has a non-pass status, it is considered a fail status. As defined and used in the main paper, a task is considered solved if all tests across FAIL_TO_PASS and PASS_TO_PASS pass.
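The completion check described above reduces to a few lines. The sketch below assumes the evaluation log has already been parsed into a test-to-status mapping; the function name is hypothetical.

```python
def is_resolved(eval_status: dict[str, str], ground_truth: dict[str, list[str]]) -> bool:
    """A task is solved only if every FAIL_TO_PASS and PASS_TO_PASS test passes."""
    required = ground_truth["FAIL_TO_PASS"] + ground_truth["PASS_TO_PASS"]
    # A missing test yields None from .get(), which counts as a fail status.
    return all(eval_status.get(test) == "pass" for test in required)
```

Tests listed under FAIL_TO_FAIL and PASS_TO_FAIL are never consulted, so their status in the evaluation run has no effect on the outcome.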

A.5 Evaluation Characterization

We include an expanded form of Table 1 that includes repository specific statistics in Table 11. Table 12 presents a brief description of each repository extracted from the repository’s documentation along with the repository’s associated open source license. The associated licenses all permit non-commercial usage of the original library source code as long as the permissions in the original licenses are upheld and retained. In addition to the original statistics presented in Table 1, we introduce three new values. The $\delta$ # Lines Added and $\delta$ # Lines Removed together sum up to $\delta$ Lines Edited. “Added” refers to the number of new lines that are introduced, while “Removed” are pre-existing lines taken out by the solution. The $|T|$ (Pass to Pass) statistic refers to the number of tests that were passing before the solution $\delta$ was applied during the validation pipeline. Unlike fail to pass tests that are intended to characterize the problem statement $P$ and determine if a revision addresses the issue, pass to pass tests are included to ensure that the revision does not break or violate any existing expected behavior. These tests are extracted during the validation log examination phase as discussed in § A.3. We note that fail to fail tests and pass to fail tests are not considered during evaluation, and those statistics are not reflected in the above table.

Task Instance Issue Categories. To provide a better sense of the types of problems that SWE-bench task instances include, we perform simple analyses on the issues, identified by the issue_numbers field, for each task instance. Per issue, we inspect metadata, specifically tags, to characterize the type of contribution put forth by the PR. Table 13 groups and shows several examples of the $2289$ tags we found across all issues. While the absolute majority of issues are associated with bug fixes, SWE-bench’s task instances are associated with a diverse set of code changes with purposes beyond debugging and error correction.

Attribute Distributions. In Figure 9, we present plots of the cumulative distribution function for attributes introduced in Table 1. From these plots, we see that the median SWE-bench task instance has a problem description of $140$ words, and will take place within a codebase containing just shy of $1900$ files and $400$ K lines. The corresponding reference solution $\delta$ will usually edit a single function within a file, changing $\sim$ $15$ lines, and has a single fail to pass test to verify the correctness of the change along with $51$ pass to pass tests to check whether existing behavior is preserved.

Patch Fix Rate. Table 14 presents summary statistics of how many task instances each model generated patches for (out of 2294), how many of these patches applied successfully, and how many of the successfully applied patches required the patch fixing procedure introduced in Appendix A.4. We find that fixed patches tend to make up a smaller percentage of the SWE-Llama patches that successfully applied, suggesting that SWE-Llama’s fine-tuning procedure has a positive effect on generating well-formatted patches. For closed source models, fewer patches apply successfully, and of the ones that do, a greater percentage require the post-generation fix, suggesting that models still struggle with patch generation and structured outputs in general.

Appendix B Additional Details on Training SWE-Llama

Optimization. We finetune using LoRA (Hu et al., 2022) with $r=16$, $\alpha=16$, $\text{dropout}=0.05$, on the query, key, value, and output projection matrices of every attention sublayer. We train with a learning rate of $6\times 10^{-4}$ and a batch size of $32$ sequences per gradient step for a maximum of $4$ epochs. During training, we save checkpoints every $50$ steps, and after training, select the best checkpoint based on the validation loss on a held-out $100$ instances. SWE-Llama 7b was initialized with CodeLlama-Python 7b and trained in $20$ hours on $4$ NVIDIA A100s. SWE-Llama 13b was initialized with CodeLlama-Python 13b and trained in $47$ hours on $8$ NVIDIA A100s. We used DeepSpeed Ulysses (Jacobs et al., 2023) and Flash Attention (Dao et al., 2022) to enable long context training.
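For reference, the standard LoRA parameterization of Hu et al. (2022) is written out below for the stated hyperparameters. The symbols $W_0$, $A$, $B$, $d$, and $k$ are notation introduced here for illustration, and the $\alpha/r$ scaling is the convention of the original LoRA formulation; the text does not state which implementation was used.

```latex
% Frozen pretrained weight W_0, trainable low-rank factors B and A.
W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k}

% With r = 16 and \alpha = 16 the scaling factor is exactly one:
\frac{\alpha}{r} = \frac{16}{16} = 1
\quad\Longrightarrow\quad W = W_0 + B A
```

Under this convention, each adapted projection matrix adds $r(d+k) = 16(d+k)$ trainable parameters while $W_0$ stays frozen.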

Appendix C Additional Results

We include a repository-by-repository breakdown of model performance in Table 15 that corresponds to Figure 4 in the main paper. As discussed in the main paper, performance differs heavily across repositories.

Appendix D Additional Experimental Details

Sparse retrieval. During retrieval, we make a slight augmentation to the documents by prepending each file’s path to its contents, to better enable retrieval based on filenames that may be mentioned directly in the issue.
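A toy illustration of why this augmentation helps is given below. The token-overlap scorer is a deliberately simple stand-in for a sparse retriever, not the retrieval method used in the experiments, and all names are hypothetical.

```python
import re


def augment(path: str, contents: str) -> str:
    """Prepend the file path to the file contents to form the retrieval document."""
    return f"{path}\n{contents}"


def tokens(text: str) -> set[str]:
    """Lowercased tokens; '.', '/' and '_' are kept so that file paths stay intact."""
    return set(re.findall(r"[a-z0-9_./]+", text.lower()))


def overlap_score(query: str, document: str) -> int:
    """Number of distinct query tokens that also occur in the document (toy scorer)."""
    return len(tokens(query) & tokens(document))
```

With this scorer, an issue that names a file directly matches the augmented document for that file even when the file's contents share no vocabulary with the issue text.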

Oracle retrieval. Oracle retrieval file paths are simply extracted directly from the reference solution’s patch file excluding test files.
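As a sketch of this extraction, the function below reads file paths from the `diff --git` headers of a git-style patch and drops test files. The function name and the keyword-based test-file filter are assumptions made for illustration.

```python
import re

# Matches the destination path in a "diff --git a/<path> b/<path>" header line.
HEADER = re.compile(r"(?m)^diff --git a/\S+ b/(\S+)$")


def oracle_files(patch: str, test_keywords: tuple[str, ...] = ("test",)) -> list[str]:
    """Return the non-test file paths edited by a patch, in order of first appearance."""
    paths: list[str] = []
    for match in HEADER.finditer(patch):
        path = match.group(1)
        is_test = any(keyword in path.lower() for keyword in test_keywords)
        if not is_test and path not in paths:
            paths.append(path)
    return paths
```

If the reference solution is already stored separately from the test patch, the filter is redundant but harmless.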

D.2 Inference Settings

Since generations are relatively expensive, we only generate a single patch file per instance. Following precedent in code generation for evaluation in Pass@ $1$ (Chen et al., 2021; Rozière et al., 2023), we simply use greedy decoding for all models.

D.3 Prompt Template Example

Models are prompted with the following general template with slight variations depending on the model used.

Experiments using slightly more or fewer lines of instructions or examples seemed to not affect overall performance substantially, except for the findings of experiments stated in Section 5.

Appendix E Societal Impact

As reasoning on code has emerged as a foundational skill underlying many LMs’ capabilities, a potential future of machine-automated software engineering raises many important questions and has important potential ramifications with regard to AI Safety (Gros et al., 2023). It is important to address questions on how to ensure AI-generated code is faithful to human intents and what guardrails might be in place when human objectives are misinterpreted by code agents that then carry out the task. To observe such problems in a controlled setting and manifest their solutions, we hope SWE-bench might serve as a testbed for designing safe, robust measures towards aligned, verifiable, and safe AI-driven software engineering.

Appendix F In-depth Analysis of SWE-Llama Generations

In this section, we provide five additional qualitative analyses of generations from both Claude 2 and SWE-Llama (Oracle retrieval setting), following the style of Section 5.1.

Claude 2 qualitative studies can be found in Tables 16 and 17. Tables 18, 19, and 20 are task instances that Claude 2 did not address correctly. SWE-Llama qualitative studies are covered across Tables 21, 22, 23, 24, and 25. For Tables 21, 22, and 23, we present task instances solved correctly by SWE-Llama 13b. In Tables 24 and 25, we present two task instances where SWE-Llama 13b does not address the issue correctly, pointing out a subset of the reasoning and generation skills at which models may not yet be adept enough to accomplish the task at hand.

The observations we make across these sections corroborate the points stated in the main paper, namely that models tend to struggle with multi-line and multi-file changes, are more adept when the required fix is relatively short, and need help with understanding the codebase in an efficient manner.

Discussion. For this task instance that comes from the django/django repository, the model is asked to introduce a context variable that would allow a user to hide the “Save and Add Another” button, similar to how it is done for two other existing buttons. The task is a bit more difficult compared to the prior two settings because no explicit stack trace or programmatic demonstration of the issue is offered. In this relatively under-specified setting, which does not provide any suggestion for localizing the function correctly, the model successfully reasons that it should adjust the existing show_save_and_add_another key/value pair. When compared with the gold patch solution, it can be argued that the model generated patch produces a much more efficient solution in terms of lines edited, as it makes the smallest edit necessary to incorporate context as a flag for setting the show_save_and_add_another hidden status. However, similar to the discussion in Table 22, stylistically, the gold patch edits are much more consistent with the codebase, and additional changes that are not explicitly discussed in the issue are also made to adhere to what has been done in the codebase (i.e. the addition of a can_save_and_add_another field). This task is an example of a potentially exciting direction where, via human guidance or a better understanding of a codebase in general, models would adjust their generations to make not just the functionally correct changes, but also the stylistically right ones.