Measuring The Impact Of Programming Language Distribution

Gabriel Orlanski, Kefan Xiao, Xavier Garcia, Jeffrey Hui, Joshua Howland, Jonathan Malmaud, Jacob Austin, Rishabh Singh, Michele Catasta

Introduction

In the 2022 StackOverflow Developer Survey, Rust was the 14th most popular programming language despite not ranking in the survey taken five years prior. However, the 13th most popular language, Go, has nearly doubled Rust’s number of StackOverflow questions in this time frame. Further, despite their similar popularity, Go has nearly 350% more source code available (Kocetkov et al., 2022). These disparities highlight the problem that many popular programming languages are starkly low-resource, especially compared to the most popular languages.

Despite their impressive generative capabilities, especially in code, Large Language Models (LLM) are adversely impacted by this language resource imbalance. Thus, developers will likely find minimal utility from LLMs if they are not using the extremely popular languages. It is therefore imperative to investigate how to mitigate the discrepancy between a language’s popularity and the amount of data available for it. Prior works focusing on code generation (Ahmad et al., 2021) and multilingual natural language processing (Arivazhagan et al., 2019; Conneau et al., 2019) use temperature-based strategies to balance the training languages. Such a strategy duplicates extremely low-resource languages thousands of times, which has been shown to significantly reduce performance (Allamanis, 2019).

Beyond the the language balancing strategy, evaluating code LLMs in a multi-lingual setting presents significant challenges. Existing datasets are either mono-lingual (Chen et al., 2021; Austin et al., 2021; Lai et al., 2022) or limited to only a subset of popular programming languages (Roziere et al., 2020). Each problem in these datasets, which we henceforth refer to as a benchmark, contains an input, and a canonical solution along with the test-cases for checking correctness. Creating a new benchmark for each language of interest would require insurmountable engineering and monetary costs. To address both of these problems, we present the BabelCode framework for execution-based evaluation of any benchmark in any language and use it to investigate the impact of programming language distribution on code generation and translation.

BabelCode is open-sourced, has an extensive test suite, and supports evaluating four benchmarks in 14 languages. It is designed specifically to enable future research directions such as the evaluation of custom data-structures. BabelCode allows investigation of novel research directions through the measurement of memory and runtime usage for a given prediction, as well as the outcomes of individual test cases. Furthermore, we can use BabelCode to build multi-lingual execution based benchmarks from existing mono-lingual datasets. We demonstrate this functionality by creating a new dataset called Translating Python Programming Puzzles (TP3) from the Python Programming Puzzles (Schuster et al., 2021) benchmark, where the objective is to translate expert-level python programs to other languages. The source programs for TP3 are the hand-crafted verification functions for each problem in P3. As the authors hand-wrote each function, they are significantly more complex than the current state-of-the-art code translation benchmarks, such as Transcoder (Roziere et al., 2020), for which code LLMs are already achieving highly impressive results.

Our presented framework is closely related to the concurrent work of MBXP (Athiwaratkun et al., 2023) and Multi-PLE(Cassano et al., 2022). While MBXP is quite similar to BabelCode, it is not open-sourced and requires that the input benchmarks be in Python. Multi-PLE is open-sourced, but only supports generation tasks and contains significant errors in multiple languages. BabelCode addresses these issues through an extensive test suite that ensures that the code generated is correct, and that crucial functionality, such as data structure equivalence, works when executed.

With the BabelCode framework, we investigate remedies to the problems of programming language imbalance. We utilize the Unimax algorithm (Chung et al., 2023) to limit the maximum number of times to duplicate a language’s data to a constant NN. We then train 1B, 2B, and 4B parameter decoder-only models on both the natural and Unimax NN distributions. We utilize the UL2 (Tay et al., 2022) and causal language modeling training objective. We find that models trained on the balanced dataset significantly outperform the baseline models on low-resource languages across all tasks. Further, we find that the resulting performance drop on high-resource languages is mitigated by increasing the model size.

This paper makes the following key contributions:

We propose and release BabelCode, a new execution-based evaluation framework that allows for multilingual evaluation of code generation and translation capabilities of code language models. It also supports the easy addition of new benchmark tasks and execution-based metrics.

We show that the code language models trained on the natural distributions of GitHub source code have poor performance on low-resource languages in both generation and translation tasks.

We propose a new data balancing strategy for programming languages to improve performance on low-resource languages. We demonstrate that the resulting models outperform the baseline models across all tasks by an average of 12.34% pass@kpass@k for all languages, with a further improvement of 39.70% pass@kpass@k to low-resource languages.

We find that the average improvements on low-resource languages from training on balanced data do not scale with model size. But scaling model sizes significantly helps the average pass@kpass@k loss compared to the baselines on high-resource languages going from a loss of 39.70% with the 1B model to a loss of 2.47% with the 4B model.

The BabelCode Framework

BabelCode enables the evaluation of a collection of problems, each consisting of a prompt and a set of test cases, in any language through four stages: 1) represent each test case in our domain specific language (DSL) defined in Figure 2, 2) use this generic form to generate the test cases in the target language from the input and output values, 3) use a Jinjahttps://jinja.palletsprojects.com/en/3.1.x/ template to generate a testing script in the target language, and 4) execute the target script through the command line. This is done autonomously, requiring minimal human intervention. We provide an overview of how an example problem is translated in Figure 8. Overall the key novel elements of BabelCode are: I) the use of a DSL to translate programming questions, II) type-specific equivalence, III) the ability to measure the performance of a given program at a low level (i.e., memory used, runtime, which tests passed), and IV) a large scale test-suite for ensuring correctness of generated code.

BabelCode shares many design similarities to the concurrent work from Athiwaratkun et al. (2023). Specifically, we follow the same approach to inferring argument and return types. We follow the respective documentation and tutorials for each language to determine which native types to use. We also use these docs to determine the docstring formatting and naming convention. These mappings are used to generate unit and integration tests for each language automatically. They ensure that each language’s implementation is syntactically correct and that, when executed, the equality comparison is correct. We describe the framework design and similarities to Athiwaratkun et al. (2023) in Appendix A.

DSL Representations: Using a DSL in the first phase, we do not force the inputs to be Python, thus enabling more flexibility to represent more generic tasks. For example, given the inputs from two test cases: {"a":[,[],]} and {"a":[]}, we only represent the types in our generic DSL. Thus, the resulting type string for this input is map>. We do not represent the actual values in the generic form as we can easily translate literals across languages. This allows users to create a dataset from any language by requiring that they only represent the types of the inputs and outputs in this generic form. The language agnostic nature of the DSL enables future extensions of BabelCode to incorporate complex inputs and outputs such as custom data-structures. For example, the representation of a node class in a BST could be BSTNode.

Test Statement Execution: We opt to print the result of each test case (i.e. TEST-0...PASSED) to the standard output in a parseable format across all languages. Along with try-catch blocks, this allows the evaluation of every test case for a given problem. This allows finer analysis of individual programs when compared to using assert statements as it identifies if specific corner cases fail.

Prompt Translation: As Wang et al. (2022a) showed, LLMs are sensitive to the input prompts for code generation. Therefore BabelCode supports prompt translation and construction for multiple different problem formulations. We replace the names of languages, such as Python, with the target language. We use the language-specific naming convention to properly format the signature in the best practice style. If an argument uses a reserved keyword, we append arg to its name so that it retains the same meaning but will no longer conflict. We replace Python-specific terms with their equivalent names in the target language. For tasks formulated as code-completion, we support formatting the problem description as a native docstring. We do not translate the import statements in the header. Instead, we exclude the headers from all languages to provide a language-agnostic format. Translating prompts to a target language is not novel by itself, as both Athiwaratkun et al. (2023) and Cassano et al. (2022) proposed methods to accomplish this. BabelCode’s builds on those works by translating reserved characters. For example, in Julia, the ”$” in docstrings will raise errors if not properly escaped. Thus, we implement methods to automatically handle such cases and ensure correctness.

2 Differences To Prior Works

We summarize the high-level differences between BabelCode and prior works in Table 1. The MBXP framework from Athiwaratkun et al. (2023) is the most similar to our work as discussed in subsection 2.1. Similar to BabelCode, MBXP does have individual test-case results; however, it uses assert statements and thus can only determine the first test-case that fails. MBXP does use language experts to review the generated code’s quality and discuss the validation it supports to ensure that generated code parses and/or compiles for its respective language. BabelCode also has this functionality but, additionally, it ensures correctness through a test suite that covers the execution of generated code. We provide scripts to allow validating that source solutions to a dataset pass the generated code. For languages that do not have a solution in the respective dataset, we generate “mock” predictions that return the expected output type. This allows us to ensure that generated code is correct in all supported languages even if no solution exists.

The MultiPL-E framework from Cassano et al. (2022) supports 18 languages compared to BabelCode’s 16. However, we support four datasets, while MultiPL-E only currently has support for two datasets. In addition, BabelCode also supports fine-grained evaluation metrics for memory, running time, and individual test cases. Our extensive test suite and validation scripts have also exposed many language-specific idiosyncrasies that naive methods of translation fail to handle. For example, in Julia, any “$” will be treated as string interpolation, even if it is in a docstring. Thus, in the majority of cases, these must be escaped. We automatically rename variables that use reserved keywords. In languages such as C#, the == operator checks equivalence by reference instead of value. Besides corner cases, our DSL and templates allow us to effectively implement proper floating point equivalence for problems that return a float. Finally, in many languages, MultiPL-E uses types that are not considered best practice, such as in Scala, where it relies on the Java types ArrayList instead of the native List.

Low-Resource Code Language Models

Because the data availability can vary greatly by programming language, we can consider the goal of building a multilingual code model as a data-imbalanced multi-task learning problem. Previous work in the multilingual natural language community (Conneau et al., 2019; Arivazhagan et al., 2019) and in the program synthesis space (Ahmad et al., 2021) have used sampling strategies relying on temperature-scaling. In this work, we use the Unimax (Chung et al., 2023) strategy to address this imbalance. The Unimax algorithm assumes that we are given a budget of how many examples we plan to consume during training and a maximum number of times, NN, any single example can be duplicated in the training corpus. Then, we separate the data into buckets by programming language and add NN epochs of each of the lowest-resource languages until we can safely distribute the remaining budget across all the remaining languages without exceeding NN epochs over any one of these remaining languages. This will allow us to control the number of epochs NN we perform over the low-resource languages to minimize overfitting while allowing fair distribution of the compute budget to the remaining high-resource languages. We will ablate the choice of NN in our experiments.

Experimental Setup

To understand the impact of training decoder-only models on the different programming language distributions, we train models in 3 sizes: 1B, 2B, and 4B. For each of these sizes, we train 5 different models on each distribution: Natural and Unimax NN, where N{1,2,3,4}N\in\{1,2,3,4\}. The parameters and training differences are listed in Table 2. We follow Chowdhery et al. (2022) for all other architecture choices. Every model has a context window of 2048 and is trained identically with the same vocabulary described in subsection 4.3. We use a base learning rate of 0.01 and a constant warmup with a step inverse decay. The number of warmup steps is kept to 10% of the total training steps per model. The total number of training steps is 38000, 77000, 190000 for the 1B, 2B, and 4B models, respectively. We use the Adafactor optimizer (Shazeer & Stern, 2018) and a batch size of 256. We prepend [code] to the beginning and add the tag [eod] to the end of each file from our training data. Finally, we use the T5X and SeqIO (Roberts et al., 2022) frameworks. We use the UL2 (Tay et al., 2022) objective with an additional causal language modeling objective as described in Appendix D.

2 Training Data

Our curated source code corpus was obtained by collecting publicly available code data on the web using a custom code data collection system. We apply a similar license filter as Kocetkov et al. (2022) to remove any files with non-permissible licenses, use simple heuristics to filter out low-quality code and apply near-deduplication to obtain our corpus of high quality, permissive source code. After preprocessing, we select 14 programming languages by their file extensions according to the mapping used by GitHub’s Linguist libraryhttps://github.com/github/linguist/ to segment the dataset by language. To calculate the number of examples per language, we use SeqIO’s caching feature and take the number of examples after post-processing (Roberts et al., 2022). We list the percentages of all examples and file extensions used per language in Appendix C. With these numbers, we consider the top 7 languages to be high-resource(HR): Java, Python, C++, PHP, TypeScript, JavaScript, and Go. We further consider the bottom 7 languages to be low-resource(LR): Dart, Lua, Rust, C#, R, Julia, and Haskell.

3 Vocabulary

The original PaLM (Chowdhery et al., 2022) vocabulary focuses on multilingual natural language. In contrast, we trained our SentencePiece (Kudo & Richardson, 2018) vocabulary with 64k tokens from the training data directly. Each programming language is uniformly sampled to build the vocabulary. In previous works, such as Chen et al. (2021), a list of tokens that consists of a different number of whitespace is manually added to represent code more efficiently. In our work, we rely on the SentencePiece model to learn the whitespace tokens by allowing extra whitespace tokens and whitespace-only tokens. In the end, the model can represent up to 12 whitespaces into one token. In addition, numbers are split into individual tokens.

4 Benchmarks

BabelCode currently supports 4 datasets. To allow the translation of any dataset to any language, we modify each benchmark as well as remove problems that were incompatible. These changes are described in Appendix B. For HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and Transcoder (Roziere et al., 2020), we add the prefix BabelCode- (BC) to indicate that we are using the BabelCode specific version. Further, for Transcoder, we use the same version as in Chowdhery et al. (2022). BC-HumanEval (BC-HE) has 161 out of the original 164 HumanEval questions. BC-MBPP has 855 of the original 999 questions. BC-Transcoder (BC-TC) has 524 of the original 560 questions.

We additionally introduce a new dataset called Translating Python Programming Puzzles (TP3). We take the verification functions from the questions in the original Python Programming Puzzles dataset (Schuster et al., 2021) to create this dataset. These functions are hand-crafted by the authors and are used to check if an answer satisfies the constraints of the puzzle. These puzzles range in difficulty from basic character checking to competitive programming problems. Thus, each verification function is written by an expert python programmer and requires a significant understanding of programming to translate. In total, there are 370 python functions to translate. Examples from TP3 can be found in subsection B.4.

5 Evaluation

For BC-HumanEval, we follow Chen et al. (2021) and generate 200 programs per problem. Further, we use a zero-shot prompt described in subsection E.1. We use the built-in docstring translation of BabelCode. We generate 50 programs per problem on our three translation tasks and use the prompts described in subsection E.2. We consider these prompts zero-shot as we do not provide any additional examples. However, we provide the translated signature without the docstring in the prompt. We do not consider this to be data leakage as it is trivial to translate signatures with libraries such as Treesitterhttps://tree-sitter.githcub.io/tree-sitter/.

For every dataset, we use T=0.8T=0.8, topp=0.95top_{p}=0.95, and do not use topktop_{k}. We use the pass@kpass@k estimator (Chen et al., 2021) to measure the performance. We use k=100k=100 and k=25k=25 for generation and translation, respectively.

Results

We report the baseline results for our trained models and PaLM-Coder in Figure 4. On BC-HumanEval, we find that the 2B model has a better pass@100pass@100 than that of PaLM-Coder 8B on all but C# and Python. On average, the BC-2B model trained on the natural distribution of GitHub data has average improvements of 48.17% compared to PaLM-Coder 8B despite having a quarter of the number of parameters and training on 6.4B fewer code tokens. Further, we find that the 4B model outperforms PaLM-Coder 62B on 6 of the 14 languages evaluated. This likely results from the 4B model seeing over 53B tokens more than what PaLM-Coder 62B did. Another likely factor in this discrepancy is that the data PaLM-Coder was fine-tuned on included all languages on GitHub in contrast to our filtered training dataset.

We also observe that performance on languages do not scale with respect to their resource level nor the model’s size. C#, Dart, Julia, and Haskell have significantly higher gains when scaling to 4B model size when compared to the other languages. While this may be due to the increased number of training tokens, it is not consistent across all LR languages as the increase in performance for R and Lua when scaling from 1B to 2B is similar to that when scaling from 2B to 4B. Instead, this result is likely due to better transfer from languages such as Java, Python, and C++.

The importance of scale for multi-lingual code models is further demonstrated by the results of the translation tasks. We find that in BC-TP3, the 1B and 2B models’ performance is similar. However, the most significant gains are from scaling up to 4B where it beats PaLM-Coder 8B on all but three languages in this zero-shot translation. We do make note, though, that while we do not provide any examples for in-context learning, we do provide the signature in the target language during generation. This finding is less pronounced in BC-Transcoder as the scaling observed in all languages is more akin to that seen in BC-HumanEval.

2 Impact of Balancing Programming Languages

Figure 5 shows the mean pass@kpass@k scores of different models trained on each of the 5 distributions for each of the 4 datasets. As expected, the natural distribution is optimal if the focus is solely HR languages as the performance losses when training on Unimax balanced data are 15.47%, 14.00%, and 9.35% for the 1B, 2B, and 4B models, respectively. However, for any LR language, Unimax is clearly better given that there is an average pass@100pass@100 improvement on these languages of 111.85%, 68.38%, and 19.22% for the 1B, 2B, and 4B size models, respectively. For generation tasks, we find that N=3N=3 is optimal with respect to the difference between performance gained on LR and performance lost on HR languages. On the 1B, 2B, and 4B models, the ones trained on the Unimax 3 dataset had differences of 130.17%, 87.80%, and 36.00%, respectively.

We observe similar scaling trends on TP3, as training on a Unimax distribution yielded average pass@25pass@25 improvements to LR languages of 124.45% for the 1B model, 64.51% for the 2B model, and 51.29% for the 4B model when compared to the same sized models trained on the natural distribution. Unlike BC-HumanEval, training the 4B on Unimax Distributions yielded better average HR performance with an increase of 6.80%. As shown in Figure 6, training a 4B model on the Unimax 2 distribution had a mean pass@25pass@25 improvement of 71.59% in LR languages and an improvement of 20.31% on HR languages when compared to the natural distribution. Training on other Unimax distributions does not see as large of improvements. For the 4B model, we find mean LR improvements of 42.39%, 52.91%, and 38.26% when trained on the Unimax 1, 3, and 4 distributions, respectively. This indicates that for TP3, at least, balancing the training data for each language improves translation capabilities. However, less Python data adversely affects understanding the source code necessary to translate it properly.

When evaluated on BC-Transcoder, we find that LR performance increased with size. When the source language is C++, training on the Unimax distributions yielded an average pass@25pass@25 improvements of 7.57%, 6.76%, and 11.80% for the 1B, 2B, and 4B models, respectively. Translating Python to other languages followed this trend with an average change of -26.04%, 15.1%, and 22.47% for the 1B, 2B, and 4B models, respectively. On BC-Transcoder, we find similar benefits when translating from Python to other languages, although the performance on higher resource languages is significantly worse. When translating from C++ to other languages, we find that training both a 1B and 2B model on the UM 4 distribution improves performance on 5 of the 7 LR languages. For 4B sized models, the UM 2 distribution is optimal as LR performance increased by an average of 20.47% when compared to training on the natural distribution. As the source code of BC-Transcoder focuses on language-agnostic algorithm implementations, this scaling trend is most likely due to the importance of a surface-level understanding of the target language. Further, the fact that this trend does not appear for BC-HumanEval or TP3 indicates that neither model size nor duplication of language data enables the model to have a deep understanding of these low-resource languages.

3 Effects Of Language Balance on Predictions

We find that, as is expected, decreasing the number of tokens for a language negatively impacts its performance on that language. To compare the overall effects of language balancing at each size, we focus on the Unimax 1 and Unimax 2 distributions as they represent the largest change in proportions of HR languages when compared to the Natural distribution. Figure 7 shows that on BC-HumanEval, training on either UM 1 or UM 2 will cause the model to generate fewer correct solutions than when the model is trained on the Natural distribution with respect to HR languages. However, this is not due to those models generating more programs with either compilation or run-time errors as the raw average increase is only 0.40 and 1.15 for the models trained on the Unimax 1 and Unimax 2 respectively. Rather, we find that the largest decrease is in the mean % test cases passed per problem. Training on the Unimax 1 and Unimax 2 distributions results in 5.50% and 9.09% fewer test cases respectively when compared to the model trained on the natural distribution.

On LR languages, the Unimax 1 distribution yielded the best improvements compared to the other distributions. Specifically, the programs generated by the model trained on the Natural distribution passed, on average, 5.13% of the test cases per problem. In comparison, 9.53% and 10.48% of average test cases per problem were solved by the models trained on the Unimax 1 and Unimax 2 distributions. The less than 1% improvement when going from Unimax 1 to Unimax 2 suggests that, for generation tasks, multi-lingual models of code benefit the most from seeing unique data.

In our translation task of TP3, we observe consistent improvements in the mean number of test cases passed for both HR and LR languages. For the former, we observe an average improvement of 2.58% and 3.06% compared to the Natural distribution for the UM 1 and 2 distributions respectively. On LR languages, we find average improvements of 3.40% and 4.99% over the Natural distribution for the UM 1 and UM 2 distributions respectively. These results, along with the performance improvements discussed in subsection 5.2, indicate that translation tasks benefit highly from uniformly balanced languages. This is, likely, due to the task formulation where natural language understanding is not necessary. Higher resource languages are more likely to contain diverse natural language and code pairs due to the language’s popularity.

Thus, performance on NL2Code tasks, such as BC-HumanEval, depends on the unique samples of code and doc-strings in the training corpus. Translation, on the other hand, does not have this constraint. Rather, it appears that uniformly balancing languages is the optimal strategy for this task.

Related Works

Code Evaluation Existing code benchmarks have primarily focused on surface matching evaluation (Lu et al., 2021; Yin et al., 2018; Wang et al., 2022b; Husain et al., 2019). Recent works have introduced new execution-based benchmarks for both generation (Austin et al., 2021; Hendrycks et al., 2021; Chen et al., 2021; Lai et al., 2022) and repair (Yasunaga & Liang, 2021) tasks, however, these have been limited to only Python. Additional works have introduced generation (Li et al., 2022) and translation (Roziere et al., 2020) tasks in multiple-languages, but are limited to only C++, Java, and Python. We acknowledge concurrent works by Cassano et al. (2022) and Athiwaratkun et al. (2023) on translating HumanEval and MBPP into multiple programming languages. As we note in subsection 2.2, BabelCode supports deeper analysis on a wider range of tasks while including significant methods for ensuring correctness.

Code LLMs Recent years has seen significant interest in LLMs for code. CodeBERT (Feng et al., 2020) is the first work to train an encoder only model on code. CodeT5 (Wang et al., 2021), PLBART (Ahmad et al., 2021), and additional works (Clement et al., 2020; Orlanski & Gittens, 2021; Chakraborty et al., 2022) examine training encoder-decoder models on code. Similar to this work, Ahmad et al. (2021) investigate difference data balancing strategies for pre-training. Our work differs in that we focus on balancing many programming languages in pre-training data. AlphaCode (Li et al., 2022), Codex (Chen et al., 2021), PaLM (Chowdhery et al., 2022), and other works (Nijkamp et al., 2022; Fried et al., 2022; Allal et al., 2023; Christopoulou et al., 2022) have shown that decoder-only code language models achieve exceptional performance on a wide range of tasks. Additional works have investigated different training strategies (Roziere et al., 2020; Bavarian et al., 2022) and different pre-training data (Rozière et al., 2021; Orlanski et al., 2022; Austin et al., 2021).

Language Balancing Choosing a proper sampling distribution from a mixture of datasets of various size is a difficult problems. Initial attempts at studying this in the multilingual natural language processing literature relied on temperature-based approaches (Conneau et al., 2019; Arivazhagan et al., 2019). These approaches oversample the low-resource tasks and downsample the high-resource ones. Other works have adopted more dynamic approaches, adapting the sampling rates in an online fashion during training (Wang et al., 2020).

Conclusion

We proposed the BabelCode framework for multi-lingual execution-based evaluation and a new strategy for balancing programming language distributions. We highlight the ease of creating new benchmarks with BabelCode by proposing the Translating Python Programming Puzzles. Our experiments demonstrate that adjusting how much we oversample low-resource languages and downsample high-resource languages greatly improves low-resource performance with minimal impact to to the performance of high-resource languages in tasks involving either a single or multiple programming language. By open-sourcing BabelCode, future work can investigate improved balancing strategies along with new multi-lingual programming language questions.

Acknowledgements

We thank Michael Janner, Owen Lewis, Alex Polozov, Uros Popovic, Devjeet Roy, Tal Schuster, and Charles Sutton for their helpful discussions and feedback on the paper.

References

Appendix A BabelCode Design

BabelCode’s design shares many similarities to Athiwaratkun et al. (2023) and Cassano et al. (2022). For translation, we too implement a recursive visitor pattern to translate input and output values to the corresponding code in the target language. When converting a coding dataset, we follow prior works by parsing assert statements using AST parsing libraries to determine the inputs and outputs for a given question. To find the function name for a problem, we once again use AST parsers to find the function definition located in the ground truth solution. The found tree is additionally used for parsing the argument names and types. If the types for either the arguments or returns do not exist, we infer them based on the types found from the literal values of the inputs and outputs. While our implementation differs, the overall process is similar to Athiwaratkun et al. (2023) and Cassano et al. (2022). Following Cassano et al. (2022), we execute the generated code through the command line using each language’s recommended commands to compile and run a given script. As Athiwaratkun et al. (2023) is not open sourced, we cannot compare the similarities of this portion.

Appendix B Dataset Changes

B.2 Changes To HumanEval

B.3 Changes To Transcoder

B.4 TP3 Examples

Appendix C Training Languages

Appendix D Training Objective

This paper uses a variant of the UL2 objective (Tay et al., 2022) for training the code language models. The UL2 objective consists of a mixture of span corruption and prefix language modeling objectives, as defined in Raffel et al. (2020). In this work, we select two span corruption instances using the implementation provided in the T5 library. See https://github.com/google-research/text-to-text-transfer-transformer/blob/main/t5/data/preprocessors.py#L1923 The only differences between these two instances consist of different values for the noise_density and mean_noise_span_length arguments. In particular, we use (3.0, 0.15) and (32, 0.5) for the (noise_density, mean_noise_span_length) arguments for each span corruption instance respectively.

The prefix language modeling objective randomly breaks text into two pieces, and the model is tasked to reconstruct the latter, given the former. Finally, we add an additional objective which consists of causal language modeling, which can be considered a special case of prefix language modeling; the first piece consists of the empty string. We assign the probabilities 10%, 10%, 20%, and 60% for each objective, respectively.

Appendix E Prompts Used

Each {{\ldots}} represents a field that is filled in.

Example from HumanEval for generating C# code:

E.2 Translation Tasks

Each {{\ldots}} represents a field that is filled in. The {{fields}} correspond to the source language we are translating from, while {{fields}} correspond to the target language to translate too.

Example For TP3 translation from Python to Haskell:

Appendix F Full Results