SantaCoder: don't reach for the stars!

Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra

cs.SE cs.AI cs.LG

Introduction

Over the last two years, we have witnessed tremendous progress in the development of code generating AI assistants (Chen et al., 2021; Chowdhery et al., 2022; Nijkamp et al., 2022; Fried et al., 2022; Li et al., 2022; Athiwaratkun et al., 2022). Machine learning models are now capable of assisting professional developers through the synthesis of novel code snippets, not only from surrounding code fragments, but also from natural language instructions. The models powering these code completion systems are usually referred to as Large Language Models for Code—or code LLMs—and are created by training large transformer neural networks (Vaswani et al., 2017) on big corpora of source code. However, with the exception of a few small-scale efforts (Xu et al., 2022b), there is generally a lack of transparency on the development of code LLMs, in part due to their commercial value and the legal uncertainty around distributing training data and models. Some groups have released model weights (Fried et al., 2022; Nijkamp et al., 2022) or provided access to the model through a paid API service (Chen et al., 2021; Athiwaratkun et al., 2022), but these works did not release the full training data or the preprocessing methods that were used.

BigCode (see https://www.bigcode-project.org) is an open scientific collaboration working on the responsible development of large language models for code, empowering the machine learning and open-source communities through open governance. BigCode was inspired by the BigScience project, an open-scientific collaboration which culminated in July 2022 with the release of a large multi-lingual language model (Scao et al., 2022). As in BigScience, various BigCode working groups focus on relevant subtopics such as collecting datasets, implementing methods for training code LLMs, developing an evaluation suite, and discussing ethical best practices for these powerful models. For example, the Legal, Ethics, and Governance working group has explored questions on data licensing, attribution of generated code to original code, the redaction of Personally Identifiable Information (PII), and the risks of outputting malicious code. In earlier work, the BigCode community released The Stack v1.1 (Kocetkov et al., 2022), a 6.4 TB dataset of permissively licensed source code in 384 programming languages. That work also introduced “Am I in The Stack” (https://huggingface.co/spaces/bigcode/in-the-stack), a governance tool for developers to check whether their source is part of the dataset, and an opt-out form for those who wish to have their code removed from the dataset (https://www.bigcode-project.org/docs/about/the-stack/).

In this tech report, we summarize the learnings of the BigCode community in developing the Santa models, a set of 1.1B-parameter models trained on the Java, JavaScript, and Python subsets of The Stack and evaluated on MultiPL-E (Cassano et al., 2022). We describe the first steps of the community towards developing larger code models and report experiments to de-risk the model architecture and the data processing pipeline. Specifically, the contributions of this report can be summarized as follows:

We describe the current state of the PII redaction pipeline. We detail how we create a PII benchmark of 400 code files, describe the filters for detecting emails, ip addresses, and secret keys, and analyze its performance on the annotation benchmark. All experiments in this work are conducted on the PII-redacted version of The Stack.

We run ablations for Multi Query Attention (MQA) (Shazeer, 2019; Chowdhery et al., 2022; Li et al., 2022) and Fill-in-the-Middle (FIM) (Fried et al., 2022; Bavarian et al., 2022). MQA can significantly speed up inference for larger batch sizes, while FIM enables code models to do infilling tasks. We find that both changes only slightly deteriorate downstream performance compared to baseline models.

We investigate the impact of 4 preprocessing methods on the training data: filtering files from repositories with 5+ GitHub stars, filtering files with a high comments-to-code ratio, more aggressive filtering of near-duplicates, and filtering files with a low character-to-token ratio. We observe modest impact of the new filters except for the stars filter, which deteriorates performance on text2code benchmarks significantly. This is an interesting result given that previous work has explicitly filtered for GitHub Stars as a proxy for data quality (Gao et al., 2020; Xu et al., 2022b).

Using the findings from these experiments, we train a final 1.1B parameter model, dubbed SantaCoder, on Python, JavaScript, and Java. This model obtains comparable or stronger performance than previous open-source multilingual models, InCoder-6.7B and CodeGen-Multi-2.7B, on code generation and infilling tasks on the MultiPL-E benchmark for these three languages, despite being substantially smaller.

Related Work

Recently, there has been an increasing amount of research on using large-scale transformer models to analyze or generate source code. Many studies have focused on using decoder-only models with a causal language modeling objective (Chen et al., 2021; Austin et al., 2021; Nijkamp et al., 2022; Christopoulou et al., 2022; Izadi et al., 2022; Xu et al., 2022b; Athiwaratkun et al., 2022), while other studies have investigated encoder (Feng et al., 2020a; Kanade et al., 2020) and encoder-decoder architectures (Li et al., 2022; Ahmad et al., 2021; Wang et al., 2021; Roziere et al., 2021). Bavarian et al. (2022) and Fried et al. (2022) propose to use decoder-only models for code-infilling tasks using a causal masking mechanism, and Bavarian et al. (2022) argue that training with such a fill-in-the-middle (FIM) objective does not harm the model’s ability to do left-to-right generation. Shazeer (2019) proposes Multi Query Attention (MQA), an architectural change to the transformer neural network in which key and value embeddings are shared across attention heads, resulting in lower memory requirements and faster inference for large batch settings. Multi Query Attention was implemented in AlphaCode (Li et al., 2022) and PaLM (Chowdhery et al., 2022).

Evaluating text-to-code

The correctness of generated code can be tested using unit tests, a method for approximating semantic equivalence. Textual similarity metrics have also been used to evaluate code (Feng et al., 2020b; Ren et al., 2020); however, they have been shown to correlate only weakly with code correctness (Austin et al., 2021; Chen et al., 2021).

Many single-language benchmarks for evaluating code completion exist (Kulal et al., 2019; Iyer et al., 2018; Zhong et al., 2017; Yu et al., 2018; Austin et al., 2021; Hendrycks et al., 2021; Chen et al., 2021; Athiwaratkun et al., 2022; Lai et al., 2022). Two of the most popular benchmarks for Python are HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), which consist of a natural language description of a function and a set of unit tests.

MultiPL-E (Cassano et al., 2022) extends two popular benchmarks for code completion, MBPP and HumanEval, to 18 additional languages. The doctests, function signatures, and unit tests for each benchmark suite are automatically compiled to new languages. Python-specific terminology in the prompt is automatically replaced with the equivalent terminology used for each programming language. MBXP (Athiwaratkun et al., 2022) is a concurrent benchmark that uses a similar approach, but differs in the details of type inference, prompt construction, and evaluation. In particular, MBXP uses the same set of assertions in the prompt that it uses to test the correctness of generated solutions. In contrast, MultiPL-E keeps the tests hidden from the model and only uses them to test correctness.

Evaluating other tasks

Code generation models have also been used to solve a variety of tasks (Tufano et al., 2020; Feng et al., 2020b; Ahmed & Devanbu, 2022; Hellendoorn et al., 2018; Pradel et al., 2020). CodeXGLUE (Lu et al., 2021) is a set of 14 datasets for evaluating code generation models. The tasks include code-to-code tasks like clone detection, code repair, and code translation; text-to-code tasks like code search and code generation; and code-to-text tasks like generating documentation. The programming languages included vary by task; the most common are Python and Java.

Opt-out process

Developers who do not wish their source code to be used for training code LLMs are given the opportunity to opt out of The Stack (Kocetkov et al., 2022). We received 9 opt-out requests before the cut-off date for removing data (31 October 2022). These individuals accounted for 299 repositories. Of these, 161 were already excluded from The Stack v1.0 (because they did not have a permissive license), and 138 were in The Stack v1.0. We honored the requests to opt out and removed these repositories from The Stack v1.1. After the cut-off date (31 October 2022), we have received more opt-out requests, and we will remove these repositories prior to releasing The Stack v1.2.

Redacting Personally Identifiable Information

We describe our first efforts to redact PII from The Stack.

We construct a PII benchmark by annotating the following entities on a small subset of The Stack: names, emails, usernames, passwords, IP addresses, API keys, and SSH keys. We first select 4000 code files from 11 programming languages: 800 samples each for Python and C++, 400 samples each for Java, JavaScript, TypeScript, and PHP, and 160 samples each for C, C#, Markdown, Go, and Ruby. From these 4000 files, we pre-filtered 400 samples that were likely to contain Personally Identifiable Information (PII). To detect keys in these samples, we used the detect-secrets tool (https://github.com/Yelp/detect-secrets) with all default plugins activated. In addition, we used regular expressions to detect emails and IPv4 and IPv6 addresses, see Appendix C.1. Twelve members of the BigCode community annotated the files on the LightTag platform (https://www.lighttag.io/), with one annotator assigned per file. After the annotation phase, one member reviewed all the annotation tags. To further increase annotation quality, we ran our initial PII detection tools on the annotated files and manually corrected any incorrect annotations identified as false positives or false negatives.

PII detection and redaction

For the first iteration of the PII redaction pipeline, we focus on emails, IP addresses, and keys, and leave the detection of names, usernames, and passwords for future work.

We use a regular expression to detect emails, see Appendix C.1. We replace detected emails with [random 5 character string]@example.com.
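The email replacement step can be sketched as follows. This is a minimal illustration, not the pipeline's code: the pattern `EMAIL_RE` is a deliberately simple stand-in for the regular expression in Appendix C.1, and the choice of lowercase letters for the random 5-character string is an assumption.

```python
import random
import re
import string

# Simplified stand-in pattern (assumption); the actual regex is in Appendix C.1.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def redact_emails(text, rng=random):
    """Replace every detected email with <5 random letters>@example.com."""

    def _replacement(_match):
        # Assumption: the random string is drawn from lowercase ASCII letters.
        local = "".join(rng.choice(string.ascii_lowercase) for _ in range(5))
        return f"{local}@example.com"

    return EMAIL_RE.sub(_replacement, text)


if __name__ == "__main__":
    sample = "Contact: jane.doe@corp.example.org  # maintainer"
    print(redact_emails(sample, random.Random(0)))
```

Passing a seeded `random.Random` instance makes the redaction reproducible, which is convenient when comparing pipeline runs.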

IP addresses

We use regular expressions for IPv4 and IPv6 IP addresses, see Appendix C.1. In addition, we check if the detected IP addresses have a valid format using the ipaddress Python package. We also do not select IP addresses of the format a.b.c.d where a, b, c and d are single-digit numbers, except if the words “dns” or “server” appear in the neighboring context (100 characters before or after). These detected addresses were mostly false positives, consisting of package and release versions. Lastly, we do not anonymize private IP addresses (non-internet-facing IP addresses used in internal networks) and popular DNS servers, as we do not consider them sensitive information. See Appendix C.2 for the full list.

We replace detected IP addresses with one of 5 randomly generated IP addresses.
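The selection rules above can be sketched as a small filter. This is an illustrative IPv4-only version under several assumptions: `IPV4_CANDIDATE` is a simplified pattern rather than the regex of Appendix C.1, `POPULAR_DNS` is a small illustrative subset rather than the list of Appendix C.2, and Python's `is_private` flag is used as an approximation of the private-address list. The replacement step is omitted.

```python
import ipaddress
import re

# Simplified IPv4 candidate pattern (assumption); see Appendix C.1 for the real one.
IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")
# Illustrative subset of well-known public DNS servers (assumption).
POPULAR_DNS = {"8.8.8.8", "8.8.4.4", "1.1.1.1"}
CONTEXT_WORDS = ("dns", "server")
CONTEXT_WINDOW = 100  # characters before and after the match


def should_redact(text, match):
    candidate = match.group(0)
    try:
        address = ipaddress.ip_address(candidate)  # format validation
    except ValueError:
        return False
    if candidate in POPULAR_DNS or address.is_private:
        return False  # not considered sensitive
    if all(len(part) == 1 for part in candidate.split(".")):
        # a.b.c.d with single digits is usually a version number,
        # unless "dns" or "server" appears nearby.
        start = max(0, match.start() - CONTEXT_WINDOW)
        context = text[start:match.end() + CONTEXT_WINDOW].lower()
        return any(word in context for word in CONTEXT_WORDS)
    return True


def find_ips_to_redact(text):
    return [m.group(0) for m in IPV4_CANDIDATE.finditer(text) if should_redact(text, m)]
```

Note that `is_private` in the standard library also covers loopback and documentation ranges, so it is broader than a strict list of internal-network ranges.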

Keys

We employed the detect-secrets tool to identify secret keys in the code files. To this end, we kept all the regex and entropy based plugins, including the AWS key detector, the GitHub Token detector, the Azure storage key detector, and the Base64 High Entropy String detector. You can find the full list of plugins in Appendix C.4. We deactivated keyword detectors because they were detecting commonly used words like “password” rather than actual secret keys. To remove false positives, we activated filters like UUIDs and string-like secret filtering, see the full list in Appendix C.3. We also observed that entropy detectors sometimes detected human-readable text like paths and URLs as secrets, even when adjusting the entropy threshold. To address this issue, we added a gibberish detector (https://github.com/domanchi/gibberish-detector) filter on top of detect-secrets to verify that the detected string was actually gibberish. Additionally, we noticed that hashes were sometimes falsely detected as secret keys. To mitigate this problem, we added a hash filter that verifies the size of the detected string and checks for the presence of keywords like “sha”, “md5”, “hash”, and “byte” in the neighboring context. Finally, to avoid corrupting any files, we prevent the removal of keys from files where words like “sha” or “hash” are mentioned in more than 2% of the number of lines.
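The two hash-related safeguards can be sketched as below. The keyword list and the 2% line threshold come from the text; the accepted string sizes (`HEX_DIGEST_LENGTHS`), the hexadecimal check, and the context window are assumptions made for illustration, and the function names are hypothetical.

```python
import re

HASH_KEYWORDS = ("sha", "md5", "hash", "byte")
# Assumption: sizes of MD5, SHA-1 and SHA-256 hex digests.
HEX_DIGEST_LENGTHS = {32, 40, 64}
CONTEXT_WINDOW = 100  # assumption: characters inspected on each side


def looks_like_hash(text, start, end):
    """True if text[start:end] has a digest-like size and hash keywords nearby."""
    candidate = text[start:end]
    if len(candidate) not in HEX_DIGEST_LENGTHS:
        return False
    if not re.fullmatch(r"[0-9a-fA-F]+", candidate):
        return False
    context = text[max(0, start - CONTEXT_WINDOW):end + CONTEXT_WINDOW].lower()
    return any(keyword in context for keyword in HASH_KEYWORDS)


def file_is_hash_heavy(text, max_fraction=0.02):
    """True if "sha" or "hash" appears in more than 2% of the file's lines."""
    lines = text.splitlines()
    if not lines:
        return False
    hits = sum(1 for line in lines if "sha" in line.lower() or "hash" in line.lower())
    return hits / len(lines) > max_fraction
```

In this sketch, a detected key would be left untouched if either function returns `True` for it or for its file.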

Performance analysis

We evaluated our PII detection pipeline on the benchmark we annotated. The 400 files contained 214 emails, 99 IP addresses and 34 secret keys. Figure 2 shows the precision and recall for each PII entity. Email and IP address detection perform well, with a precision and recall above 90% for emails and above 80% for IP addresses. While key detection also achieves almost 80% precision, its recall is much lower (slightly above 50%). We found that the key detection pipeline was especially sensitive to the precision-recall trade-off, as including more plugins or disabling some filters detected more keys but also increased the number of false positives.

PII detection on The Stack

We run the PII pipeline on the Python, Java and JavaScript subsets of The Stack v1.1 (Kocetkov et al., 2022). Table 1 shows some statistics on the number of files containing PII and the total number of secrets found. Some files containing PII are not modified if they contain only private IP addresses or popular DNS servers, as explained in the previous section. The number of files containing PII is significantly lower for JavaScript compared to Python and Java, but this could be due to the fact that JavaScript files were filtered based on line length and percentage of alphanumeric characters before running PII detection. We also observe that Python and JavaScript have a higher number of secrets per file compared to Java.

To better understand these results, we computed the relevant percentiles in Table 2. We can see that Java indeed has fewer secrets per file, and that almost 0.1% of the files contain a large number of secrets (about 100). We found that some of these files contained multiple instances of PII, such as emails stored in some form of database, or are files containing long encodings and key-like strings that are split into multiple keys. Finally, we also plot the distributions of detected secrets by entity type in Figure 2. For this graph, we filtered out files with more than 100 secrets, but this did not change the distribution of PII across languages. We observe that IP addresses are most often found in Python, keys in JavaScript and emails in Java.

Experiments

The base training dataset for the experiments in this paper contains 268 GB of Python, Java and JavaScript files from The Stack v1.1 (Kocetkov et al., 2022) after removing data from opt-out requests, near-deduplication, PII-redaction (see Section 4), and filtering based on line-length and percentage of alphanumeric characters. This dataset was also decontaminated by removing files that contained test-samples from the following benchmarks: HumanEval (Chen et al., 2021), APPS (Hendrycks et al., 2021), MBPP (Austin et al., 2021) and MultiPL-E (Cassano et al., 2022).

Tokenizer

Since the Santa models were the first models our community would train, we took a conservative approach to the design of the tokenizer, partly based on insights gained during the development of InCoder (Fried et al., 2022). We train a Hugging Face Tokenizer (MOI et al., 2022) using the Byte-Pair Encoding (BPE) algorithm on raw bytes with a vocabulary size of 49,152 tokens. This tokenizer was trained on 600,000 rows (around 2.6 GB) of data, 200,000 for each language, which were pre-tokenized using a digit splitter and the default GPT-2 pre-tokenizer regex before being converted to bytes.

Training details

Our base model is a 1.1B-parameter decoder-only transformer with FIM and MQA trained in float16. It has 24 layers, 16 heads and a hidden-size of 2048. The model is trained for 300K iterations with a global batch-size of 192 using Adam (Kingma & Ba, 2015) with $\beta_{1}=0.9$, $\beta_{2}=0.95$, $\epsilon=10^{-8}$ and a weight-decay of $0.1$. A total of 118B tokens are seen in training. The learning-rate is set to $2\times 10^{-4}$ and follows a cosine decay after warming up for 2% of the training steps. Each training run takes 3.1 days to complete on 96 Tesla V100 GPUs for a total of $1.05\times 10^{21}$ FLOPs. The final model described in Section 6.2 uses twice the amount of compute.
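As a quick consistency check on these numbers, the sketch below derives the sequence length implied by the reported token count and writes out a warmup-plus-cosine schedule. The linear shape of the warmup and the decay floor `min_lr=0.0` are assumptions, since the text only states the peak learning rate, the warmup fraction, and that the decay is cosine.

```python
import math

ITERATIONS = 300_000
GLOBAL_BATCH = 192
TOKENS_SEEN = 118e9
PEAK_LR = 2e-4
WARMUP_FRACTION = 0.02

# 118B tokens over 300K steps of 192 sequences implies roughly 2048 tokens each.
tokens_per_sequence = TOKENS_SEEN / (ITERATIONS * GLOBAL_BATCH)


def learning_rate(step, total_steps=ITERATIONS, min_lr=0.0):
    """Linear warmup (assumption) followed by cosine decay to min_lr (assumption)."""
    warmup_steps = int(WARMUP_FRACTION * total_steps)
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(f"implied tokens per sequence: {tokens_per_sequence:.0f}")
    for step in (0, 3_000, 6_000, 150_000, 300_000):
        print(step, f"{learning_rate(step):.2e}")
```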

Architecture ablations

We perform ablation experiments to de-risk the model architecture and training objective. Specifically, we investigate Fill-in-the-Middle (Bavarian et al., 2022) and Multi Query Attention (MQA) (Shazeer, 2019).

Recent works (Fried et al., 2022; Bavarian et al., 2022) have shown that autoregressive language-models can learn to infill code snippets by random transformation of the training data. Bavarian et al. (2022) argue that such data transformations do not harm the left-to-right generative capabilities of the model. Following Bavarian et al. (2022), we implement FIM as a random transformation of the input sequence and split each training document into three parts uniformly at random: prefix, middle and suffix. Each part is prepended with a corresponding sentinel token, then documents are rearranged to put the middle part at the end of the sequence. The autoregressive training objective is unchanged. We use context-level FIM, apply transformations at the character level, use a FIM-rate of $0.5$ and SPM+PSM joint training. We compare our base model to a model that was trained with the standard left-to-right objective only (No-FIM).
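The transformation described above can be sketched in a few lines. This is an illustration under stated assumptions: the sentinel strings `PRE`, `MID`, and `SUF` are placeholders, the SPM layout shown is one simple ordering (suffix before prefix) rather than a confirmed reproduction of the training format, and the 50/50 split between PSM and SPM is assumed. The sketch operates on a single string, whereas context-level FIM applies the transformation after documents are chunked to the context length.

```python
import random

# Placeholder sentinel strings (assumption); real models use dedicated tokens.
PRE, MID, SUF = "<PRE>", "<MID>", "<SUF>"


def fim_transform(document, rng, fim_rate=0.5, spm_rate=0.5):
    """Randomly rewrite a document so that its middle part comes last."""
    if rng.random() >= fim_rate:
        return document  # keep as an ordinary left-to-right example
    # Two character-level cut points chosen uniformly at random.
    lo, hi = sorted(rng.randint(0, len(document)) for _ in range(2))
    prefix, middle, suffix = document[:lo], document[lo:hi], document[hi:]
    if rng.random() < spm_rate:
        # SPM: suffix, prefix, middle (one possible layout, assumption).
        return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}"
    # PSM: prefix, suffix, middle.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


if __name__ == "__main__":
    doc = "def add(a, b):\n    return a + b\n"
    print(fim_transform(doc, random.Random(0), fim_rate=1.0))
```

Because the training objective stays autoregressive, the model learns to produce the middle span after having seen both the prefix and the suffix.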

Multi Query Attention vs Multi Head Attention

Shazeer (2019) proposes Multi Query Attention (MQA), an architectural change to transformer that shares key and value embeddings across attention heads. Compared to Multi Head Attention (MHA), this lowers the memory bandwidth requirements at generation time and results in faster inference. We compare our base model to a similar model using MHA instead, with the same hyper-parameters otherwise. Note that the MHA model has more parameters (1.3B) than the base model in this setting.
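A back-of-the-envelope calculation makes both effects concrete: the smaller key/value cache at generation time and the parameter gap between the two models. The sketch assumes a head dimension of hidden-size divided by the number of heads and ignores biases; neither detail is stated in the text.

```python
LAYERS, HEADS, HIDDEN = 24, 16, 2048
HEAD_DIM = HIDDEN // HEADS  # assumption: 2048 / 16 = 128


def attention_weights_per_layer(kv_heads):
    """Weights in the Q, K, V and output projections (biases ignored)."""
    q_and_out = 2 * HIDDEN * HIDDEN
    k_and_v = 2 * HIDDEN * kv_heads * HEAD_DIM
    return q_and_out + k_and_v


def kv_cache_values_per_token(kv_heads):
    """Number of cached key and value entries per generated token."""
    return 2 * LAYERS * kv_heads * HEAD_DIM


mha_weights = attention_weights_per_layer(kv_heads=HEADS)  # multi-head
mqa_weights = attention_weights_per_layer(kv_heads=1)      # multi-query
extra_mha_parameters = LAYERS * (mha_weights - mqa_weights)

if __name__ == "__main__":
    print(f"extra MHA parameters: {extra_mha_parameters / 1e9:.2f}B")
    ratio = kv_cache_values_per_token(HEADS) // kv_cache_values_per_token(1)
    print(f"KV cache reduction with MQA: {ratio}x")
```

Under these assumptions, the difference of about 0.19B attention parameters is in line with the reported 1.3B versus 1.1B model sizes, and the per-token key/value cache shrinks by a factor equal to the number of heads.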

Data filtering ablations

We experiment with a number of preprocessing methods applied to the base dataset, described in Section 5.1. Note that the filters are applied on top of the other filters such as near-deduplication, line length filtering, etc.

Do popular repositories contain good quality code? We use GitHub stars as a proxy metric. We set the minimum threshold to 5 stars, as we believe that a lower number of stars would not be an indicator of popularity. This filter removes more than 60% of the data (in terms of volume), see Table 3. Note that more than 40% of the files do not have stars and that setting the threshold to 10 stars would remove an additional 5% of the data.

Comment-to-code ratio

Good code should be well documented. With this assumption, we filter files with a high comments-to-code ratio. We use the ast and tokenize modules to extract docstrings and comments from Python files, and Pygments to extract comments from Java and JavaScript files. We then analyze the comment-to-code character ratio. We find that about 20% of Python and Java files and 45% of JavaScript files have no comments. We use a minimum threshold of 1%, removing an additional 3% of files in each language. We also find that files with a ratio above 80% have poor quality, so we filter them out, eliminating 2% of data in all languages. Overall, this comment-to-code filter removes 20% of the data in terms of volume.
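For Python files, the ratio can be approximated with the standard-library `tokenize` and `ast` modules mentioned above. The sketch below is an illustration, not the released filter: it assumes the ratio is comment and docstring characters divided by all characters in the file, and it assumes the 1% and 80% thresholds are inclusive bounds.

```python
import ast
import io
import tokenize


def comment_to_code_ratio(source):
    """Fraction of characters in comments and docstrings (assumed definition)."""
    comment_chars = 0
    # '#' comments via the tokenizer.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comment_chars += len(tok.string)
    # Docstrings of the module, classes and functions via the AST.
    documented = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, documented):
            docstring = ast.get_docstring(node, clean=False)
            if docstring:
                comment_chars += len(docstring)
    return comment_chars / max(len(source), 1)


def keep_file(source, low=0.01, high=0.8):
    """Keep files whose ratio lies between the two thresholds."""
    return low <= comment_to_code_ratio(source) <= high
```

Files that fail to tokenize or parse raise an exception in this sketch; a production filter would need to decide how to treat them.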

More near-deduplication

While exact-match deduplication is the most common preprocessing step for code LLMs (see Table 4), Kocetkov et al. (2022) showed that near-deduplication leads to additional performance gains. Their near-deduplication pipeline largely inherited the settings from CodeParrot (Tunstall et al., 2022): MinHash (Broder, 2000) + Locality Sensitive Hashing (LSH) based on datasketch (https://github.com/ekzhu/datasketch) with unigrams (non-alphanumeric tokens) and a $0.85$ Jaccard similarity threshold. Additionally, it also recalculates the true unigram Jaccard similarity during the post-processing stage to weed out any false positives. In this paper, we investigate whether different deduplication settings can further improve performance.

To this end, we conduct ablation experiments on a 200K subset of the raw Python dataset from The Stack v1.1. We investigate the number of false positives and false negatives by comparing the clustered files with their real Jaccard similarity. We find that: 1) Using unigrams during MinHash calculation leads to many false positives, around 20% at $0.85$. Increasing the n-gram size reduces false positives, but also increases false negatives. This is an expected trade-off between precision and recall; 2) A lower threshold would cause more documents to be removed at the cost of processing time. In our experiments, we have observed that good duplicates occur with a similarity as low as $0.65$, even though the FP and FN rates don’t change much.

We find that combining 5-grams and a $0.7$ threshold strikes a good balance between false positives and false negatives while removing an additional 16%–20% of files. In particular, the increased false negatives occur mostly among documents with lower real Jaccard similarity bounds, whereas documents with higher similarities ($>0.85$) even have a decreased false negative rate (from 35% to 24%). Due to time constraints, we apply such deduplication on the already deduplicated datasets using the Stack v1 hyperparameters. We will refer to the final results as more near-deduplication or near-deduplication alt.
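To illustrate why the n-gram size matters, the sketch below computes the exact Jaccard similarity of two files over token n-grams. It is not the MinHash + LSH pipeline itself, which approximates this quantity at scale; the tokenization (splitting on runs of non-alphanumeric characters) is an assumption about how the unigrams are formed.

```python
import re


def shingles(text, n):
    """Set of token n-grams; tokens are split on non-alphanumeric characters."""
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n):
    """Exact Jaccard similarity of two documents over n-gram sets."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    a = "alpha beta gamma delta epsilon zeta"
    b = "zeta epsilon delta gamma beta alpha"
    print(jaccard(a, b, 1))  # same vocabulary, so unigram similarity is maximal
    print(jaccard(a, b, 5))  # different order, so no 5-gram is shared
```

Two files that share a vocabulary but not their structure look identical under unigrams, which is one way the unigram setting produces false positives; longer n-grams are stricter and can in turn miss true near-duplicates.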

Unlike other data preprocessing or filtering techniques that target one document at a time, near-deduplication requires a centralized index that can be prohibitive for large data processing. We have released the deduplication code used in this paper on GitHub (https://github.com/bigcode-project/bigcode-dataset) and will release a distributed version soon. For reference, it takes about 10 hours to deduplicate 42 million Java documents using plain multiprocessing while it takes less than 40 minutes in a distributed (but comparable) environment.

Tokenizer fertility

Can we use the tokenizer to remove low-quality files from the dataset? We experiment with filtering files with a low character-to-token ratio. (We slightly abuse the term tokenizer fertility in this work, as it usually refers to the average number of subwords per token, where a token is determined by the true tokenizer of the programming language; see, e.g., Rust et al., 2021.) For each language, we find that files with a ratio below the 5th percentile are usually of poor quality, but increasing the threshold may eliminate some good-quality files. We therefore set the cutoff value for this ratio to the following values: 2.5 for Python, 2.9 for Java, and 2.6 for JavaScript. This filters out roughly 4% to 5% of data. Note that these values depend highly on the tokenizer and the data. This filter may also be biased against files with non-English comments.
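The filter itself reduces to a single ratio per file. In the sketch below the cutoffs are the ones given in the text, while `toy_tokenize` is a hypothetical stand-in for the trained BPE tokenizer (it splits digits individually and keeps identifiers whole), so the ratios it produces are only indicative.

```python
import re

# Cutoffs from the text: minimum characters per token, per language.
MIN_CHARS_PER_TOKEN = {"python": 2.5, "java": 2.9, "javascript": 2.6}


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: single digits, words, other symbols."""
    return re.findall(r"\d|[A-Za-z_]+|\S", text)


def chars_per_token(text, tokenize=toy_tokenize):
    return len(text) / max(len(tokenize(text)), 1)


def keep_file(text, language, tokenize=toy_tokenize):
    """Keep files whose character-to-token ratio reaches the language cutoff."""
    return chars_per_token(text, tokenize) >= MIN_CHARS_PER_TOKEN[language]


if __name__ == "__main__":
    code = "def add(first, second):\n    return first + second\n"
    data = "data = [1, 2, 3, 4, 5, 6, 7, 8]\n"
    print(keep_file(code, "python"), keep_file(data, "python"))
```

Files dominated by numeric literals or encoded blobs break into many short tokens, which is why a low ratio tends to flag data-like files rather than code.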

Evaluation

The text2code task involves generating the body of a function from a prompt that includes a function description, the function signature (its name and arguments), and optionally a handful of example inputs and outputs. Every problem is accompanied by a set of hidden test cases, which are used to determine if the generated function is correct. We use the MultiPL-E text2code benchmark (Cassano et al., 2022), which is derived from HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) (the “sanitized” subset of MBPP). Whereas the latter two benchmarks target Python, MultiPL-E has a suite of compilers that translate HumanEval and MBPP to 18 other programming languages. Since our models are only trained on Java, JavaScript, and Python, we only evaluate them on these three languages.

We use the methodology of Chen et al. (2021) and we calculate pass@$k$ rates ($k=1,10,100$) for every problem. Intuitively, pass@1 estimates the likelihood a model will generate a correct solution in a single attempt, whereas pass@10 and pass@100 estimate the likelihood that the model will generate a correct solution given 10 and 100 attempts respectively. Following the literature, we sample 200 completions at temperatures 0.2 and 0.8 and use 0.2 to estimate pass@1 and 0.8 for pass@10 and pass@100.
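The per-problem estimator from Chen et al. (2021) is $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ is the number of sampled completions and $c$ the number that pass the tests. A minimal sketch follows; the function name is illustrative, and the benchmark-level score is the mean of this quantity over problems.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 200 completions for one problem, 20 of which pass the hidden tests.
    for k in (1, 10, 100):
        print(k, round(pass_at_k(200, 20, k), 4))
```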

Fill-in-the-middle evaluation

To evaluate fill-in-the-middle, we use the single-line exact match metric, which was introduced by Fried et al. (2022) and also employed by Bavarian et al. (2022). For every benchmark problem, we mask out a single line of text from the function body (i.e., not from the function description or signature), and prompt the model to fill in that line of code. We exclude blank lines and comments, and count the number of times the model produces exactly the masked out line. This benchmark requires working solutions for problems, which MultiPL-E does not have. (A text2code benchmark like MultiPL-E only needs hidden tests.) Instead of writing solutions by hand, we use solutions generated by a code generation model, which is the approach of Athiwaratkun et al. (2022). Specifically, we use working solutions produced by code-davinci-002 at temperature 0.8. Note that this approach does not produce solutions to every problem, since not all problems are solvable. Moreover, for uniformity, we use this approach for Python as well, even though hand-written Python solutions exist for our benchmarks. We only report fill-in-the-middle evaluations for the data filtering ablations.
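The construction of single-line infilling examples can be sketched as follows. This is an illustration with hypothetical names: `infill` stands for any callable that maps a prefix and suffix to a predicted line, the comparison after stripping surrounding whitespace is an assumption, and the `#` comment prefix only applies to Python. In the actual benchmark the function description and signature would precede the body in the prefix.

```python
def infilling_examples(function_body, comment_prefix="#"):
    """Yield (prefix, masked_line, suffix) for each non-blank, non-comment line."""
    lines = function_body.splitlines(keepends=True)
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped or stripped.startswith(comment_prefix):
            continue  # blank lines and comments are never masked
        yield "".join(lines[:i]), line, "".join(lines[i + 1:])


def exact_match_rate(function_body, infill):
    """Fraction of masked lines that `infill(prefix, suffix)` reproduces exactly."""
    examples = list(infilling_examples(function_body))
    if not examples:
        return 0.0
    hits = sum(
        1 for prefix, target, suffix in examples
        if infill(prefix, suffix).strip() == target.strip()  # assumption: strip
    )
    return hits / len(examples)
```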

Results

For the architecture ablations, we report the results on text2code benchmarks in Table 5. For the data filtering ablations, we show the text2code results in Figure 4 and report the fill-in-the-middle evaluations in Table 6. We show the HumanEval performance throughout all training runs in Figure 3. You can find the full results tables of the text2code experiments in Appendix A.

We see a small drop in performance for Multi Query Attention (MQA) compared to Multi Head Attention (MHA). As shown in Table 5, the MHA model improves pass@100 by 1-4% on HumanEval and by 1-3% on MBPP. We specifically observe noticeable improvements for the JavaScript versions of the text2code benchmarks. However, it should be noted that the MHA model has more parameters (1.3B) than the MQA model (1.1B), and a head-to-head comparison might, therefore, not be entirely fair. We think that the inference speed-ups of MQA might outweigh the small drop in performance.

FIM for cheap

We observe a minor drop in performance of the FIM model compared to the No-FIM model. Specifically, we see that the pass@100 performance of the FIM model is 2-4% lower on HumanEval and 1% lower on MBPP. While Bavarian et al. (2022) presented evidence for the existence of a FIM-for-free property (i.e., arguing that autoregressive models can be trained with FIM without harming left-to-right capabilities), we do find a small but consistent drop of FIM models on left-to-right text2code benchmarks.

Modest impact of near-deduplication, comments, and fertility filter

On text2code benchmarks, we observe small gains for the near-deduplication and comment-to-code filters and a neutral effect of the tokenizer filter. The near-deduplication filter improves HumanEval performance by 1-3% and MBPP by 1-4% across the three programming languages. The comment-to-code filter improves HumanEval performance by 0-2% but decreases MBPP performance in certain cases (Java). See Appendix A for the full results table. On fill-in-the-middle benchmarks, we see that the tokenizer fertility filter performs well, improving performance by 2-4% across the three languages. The near-deduplication and comments filters have a mixed effect, improving fill-in-the-middle performance for Python but deteriorating performance for JavaScript.

GitHub stars deteriorate performance

Surprisingly, we find that the GitHub stars filter performs poorly. On HumanEval and MBPP, the pass@100 performance consistently drops by 3-6% across the three languages. On the fill-in-the-middle benchmark, the performance drops by 5-11% (Table 6). Note that the stars filter removes the most data (over 60%) and, therefore, raises the question whether the performance difference is due to the smaller dataset. However, as can be seen in Figure 3, HumanEval pass@100 diverged early on in training, indicating that the drop in performance is not only due to data size but also data quality.

Final model

Based on the insights from the architecture and dataset ablations, we train a final model, which we call SantaCoder, with MQA and FIM and the two data filters that yielded the best results: more near-deduplication and comments-to-code filter. We train this model for 600K iterations (236B tokens) and keep all other hyper-parameters the same.

Doubling the training iterations leads to much stronger text2code performance on MultiPL-E, significantly boosting performance across all benchmarks and programming languages (see Figure 4). Looking at the performance throughout training (Figure 3), it is likely that longer training can further increase performance. Surprisingly, we find that the final training run did not improve the fill-in-the-middle evaluations (see Table 6), at least on these single line infilling tasks.

Comparison to InCoder, CodeGen, and Codex

Table 7 compares our SantaCoder model to comparably-sized code generation models from previous work on the MultiPL-E benchmark, using the methodology described in Section 5.4. We find that our model generally outperforms previous open-source multi-language code generation models despite being smaller, outperforming the InCoder 6.7B (Fried et al., 2022) model on both left-to-right generation and single line fill-in-the-middle infilling across languages, and obtaining comparable or stronger performance to CodeGen-multi 2.7B (Nijkamp et al., 2022).

Conclusion

We described the progress of the BigCode project until December 2022. The community took its first steps towards redacting PII and demonstrated that regular expressions are reasonably effective at detecting emails and IP addresses. Future work should focus on increasing the precision and recall of secret key detection, as well as detecting other sensitive information such as names, usernames, and passwords. Using the PII-redacted version of The Stack, we conducted a series of architectural and data filtering ablations. One of our main findings was that filtering for GitHub stars consistently decreased performance across all benchmarks and programming languages. Using the findings of these ablation studies, we trained a final 1.1B model, dubbed SantaCoder, for 236B tokens and showed it is able to outperform previous multi-lingual code models (InCoder-6.7B and CodeGen-Multi-2.7B) on both left-to-right generation and infilling tasks. We anticipate that larger architectures and more training data will be able to produce stronger multilingual, infilling-capable models, and plan to continue to scale the findings from our investigations here.

Contributions

Carlos Munoz Ferrandis, Christopher Akiki, Danish Contractor, Harm de Vries, Huu Nguyen, Leandro von Werra, Luis Villa, Sean Hughes, Yacine Jernite, David Lansky

PII redaction

Loubna Ben Allal, Jia Li, Paulo Villegas, Harm de Vries, Leandro von Werra, Christopher Akiki, Ian Yu, Michael Lappert, Urvashi Bhattacharyya, Shamik Bose, Bernardo García del Río, Francesco De Toni, Terry Yue Zhuo, Qian Liu, Manuel Romero

Dataset

Denis Kocetkov, Chenghao Mou, Loubna Ben Allal, Leandro von Werra, Dmitry Abulkhanov, Christopher Akiki, Raymond Li

Tokenizer

Christopher Akiki, Sergey Troshin, Dmitry Abulkhanov, Daniel Fried, Leandro von Werra, Harm de Vries

Training and architecture

Raymond Li, Daniel Fried, Hailey Schoelkopf, Joel Lamy Poirier, Qian Liu, Niklas Muennighoff, Loubna Ben Allal, Dzmitry Bahdanau, Harm de Vries, Leandro von Werra

Opt out

Sean Hughes, Carlos Munoz Ferrandis, Christopher Akiki, Denis Kocetkov, Harm de Vries, Huu Nguyen, Leandro von Werra, Luis Villa

Evaluation

Arjun Guha, Yangtian Zi, Carolyn Jane Anderson, Loubna Ben Allal, Raymond Li, Niklas Muennighoff, Manan Dey, Logesh Kumar Umapathi, Leandro von Werra, Harm de Vries, Marco Zocca

Inference

Mayank Mishra, Alex Gu, Joel Lamy Poirier, Leandro von Werra, Harm de Vries, Sourab Mangrulkar

Acknowledgement

We thank ServiceNow and HuggingFace for the provided compute resources.

References

Appendix A Full text2code results

We report the full results of all experiments. Tables 8 and 9 show the full results for the data filtering ablations on HumanEval and MBPP, respectively. Tables 10 and 11 report the full results for the architecture ablations on HumanEval and MBPP, respectively.

Appendix B Docstring generation

In addition to code completion benchmarks, we also report results on docstring generation. To this end, we evaluate our models on CodeXGLUE code-to-text (Lu et al., 2021), which is a benchmark constructed from CodeSearchNet (Husain et al., 2019). We use the bigcode-evaluation-harness library (Ben Allal et al., 2022), which is derived from lm-evaluation-harness (Gao et al., 2021). Models are prompted with a Python function signature and asked to output a corresponding docstring. Results are shown in Table 12.

We find that all BigCode Santa variants with 1.1B parameters outperform the 6.7B InCoder model (Fried et al., 2022), which we attribute to differences in the training datasets. Among BigCode models, variants trained on more Python perform better: the stars variant, with 32% Python in its training corpus, outperforms the tokenizer fertility variant, with only 28.5% Python (see proportions in Table 3). The bfloat16 variant is the same as the no-fim variant, except that the latter was trained in float16. There is no notable performance difference between the two, likely because at our small scale of 1.1B parameters we did not face any training instabilities.

Qualitative examples

Below is an example prompt from CodeXGLUE. Model generations and the correct solution are in Table 13.

Appendix C PII

We used the following regular expression to detect emails.

We replace detected emails with [random 5 character string]@example.com.
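The replacement step can be illustrated with a minimal sketch. The pattern below is a simplified stand-in chosen for readability and is an assumption, not the regular expression used in this work; the helper names `random_local_part` and `redact_emails` are likewise hypothetical.

```python
import random
import re
import string

# Simplified, illustrative pattern. ASSUMPTION: this is NOT the regular
# expression used in the paper, only a stand-in for demonstration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def random_local_part(rng, length=5):
    """Return a random lowercase string used as the replacement local part."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def redact_emails(text, rng=None):
    """Replace every detected email with <random 5 chars>@example.com."""
    rng = rng or random.Random()
    return EMAIL_RE.sub(
        lambda match: f"{random_local_part(rng)}@example.com", text
    )


if __name__ == "__main__":
    print(redact_emails("Contact: jane.doe@corp.org for help"))
```

Passing a seeded `random.Random` instance makes the redaction reproducible, which can be convenient when comparing dataset versions.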

IP addresses

We used the following regular expressions to detect IPv4 and IPv6 addresses.
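As a rough sketch of regex-based IP detection, one possible approach is to find loose candidates with a regular expression and then validate each candidate with Python's standard `ipaddress` module. The two candidate patterns and the function name `find_ip_addresses` are assumptions made for illustration; they are not the expressions used in this work.

```python
import ipaddress
import re

# Loose candidate patterns. ASSUMPTION: illustrative only, not the paper's regexes.
# IPv4: four dot-separated groups of 1-3 digits, not embedded in a longer
# dotted token such as a version string like 1.2.3.4.5.
IPV4_CANDIDATE = re.compile(r"(?<![\w.])\d{1,3}(?:\.\d{1,3}){3}(?!\.?\w)")
# IPv6: hex groups separated by colons, allowing empty groups for "::".
IPV6_CANDIDATE = re.compile(
    r"(?<![\w:.])(?:[0-9A-Fa-f]{0,4}:){2,7}[0-9A-Fa-f]{0,4}(?![\w:.])"
)


def find_ip_addresses(text):
    """Return (matched text, IP version, is_private) for each valid address."""
    found = []
    for pattern in (IPV4_CANDIDATE, IPV6_CANDIDATE):
        for match in pattern.finditer(text):
            try:
                address = ipaddress.ip_address(match.group())
            except ValueError:
                # Candidate looked like an address but is not one,
                # e.g. 999.1.1.1 or a timestamp such as 12:30:45.
                continue
            found.append((match.group(), address.version, address.is_private))
    return found


if __name__ == "__main__":
    for hit in find_ip_addresses("ping 8.8.8.8 and 192.168.0.1 at 12:30:45"):
        print(hit)
```

The validation step discards candidates that only resemble addresses, and the `is_private` flag gives one simple way to separate private addresses from public ones.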

Data pre-filtering

This is the regular expression we used to pre-filter the annotation dataset for data containing emails.

For IP addresses, we used the same regular expression as the one used for PII detection.

C.2 List of private IP addresses and popular DNS servers

C.3 Detect-secrets filters

detect_secrets.filters.heuristic.is_potential_uuid

detect_secrets.filters.heuristic.is_likely_id_string

detect_secrets.filters.heuristic.is_templated_secret

detect_secrets.filters.heuristic.is_sequential_string

Implementation available at https://github.com/bigcode-project/bigcode-dataset/blob/6b3f54751b6e38e1ed70f2307331d6943ba39eae/pii/utils/keys_detection.py#L11.
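For orientation, the four filters listed above could be collected into a settings dictionary along the following lines. The filter paths are copied from the list; the dictionary layout and the helper `build_filter_settings` are assumptions for illustration and may differ from the linked implementation.

```python
# Filter paths copied from the list above.
HEURISTIC_FILTERS = [
    "detect_secrets.filters.heuristic.is_potential_uuid",
    "detect_secrets.filters.heuristic.is_likely_id_string",
    "detect_secrets.filters.heuristic.is_templated_secret",
    "detect_secrets.filters.heuristic.is_sequential_string",
]


def build_filter_settings(filter_paths):
    """Build a settings dict with one {"path": ...} entry per filter.

    ASSUMPTION: this layout mirrors the "filters_used" section of a
    detect-secrets configuration; it is a sketch, not the project's code.
    """
    return {"filters_used": [{"path": path} for path in filter_paths]}


settings = build_filter_settings(HEURISTIC_FILTERS)

if __name__ == "__main__":
    for entry in settings["filters_used"]:
        print(entry["path"])
```

With detect-secrets installed, a dictionary of this shape, extended with a `plugins_used` entry, is the kind of configuration one would expect to pass to `detect_secrets.settings.transient_settings`; this usage is an assumption and should be checked against the linked implementation.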

C.4 Detect-secrets plugins

Implementation available at https://github.com/bigcode-project/bigcode-dataset/blob/6b3f54751b6e38e1ed70f2307331d6943ba39eae/pii/utils/keys_detection.py#L19.