Coder Reviewer Reranking for Code Generation

Tianyi Zhang, Tao Yu, Tatsunori B. Hashimoto, Mike Lewis, Wen-tau Yih, Daniel Fried, Sida I. Wang

Introduction

Recent pretrained language models (PLMs) have demonstrated an impressive ability to generate code given natural language instructions (Chen et al., 2021; Fried et al., 2022; Chowdhery et al., 2022; Nijkamp et al., 2022). One popular technique is to use a generative language model trained on code, which we call the Coder model, to sample multiple code solutions for a single instruction and rerank the solutions based on the likelihood the Coder model assigns to each (Chen et al., 2021). Despite its wide-spread use, reranking with the Coder model often mistakenly prefers degenerate solutions, e.g., extremely short code or repetitive solutions. As a result, reranking performance often decreases when the number of candidate program increases (Figure 2(c)). These biases are known to arise in language models when using mode-seeking inference methods such as greedy decoding (Holtzman et al., 2020) or beam search (Li et al., 2016; Stahlberg & Byrne, 2019).

In this work, we take inspiration from collaborative software development. For example, in the standard practice of code review, programmers submit implementations given specifications and have the submitted code cross validated by other code reviewers. We instantiate this idea by using prompting to obtain a Reviewer model, which checks the generated programs against the language instruction. Formally, we first sample programs yy given the instruction xx via the Coder model p(yx)p(y|x) and cross check via the Reviewer model p(xy)p(x|y). The Reviewer model reinforces the language instruction by evaluating the likelihood of every word in the instruction.

To obtain a consensus between the Coder and the Reviewer, we propose Coder-Reviewer reranking which selects the solutions by the product of the reviewer model and the coder model, p(xy)p(yx)p(x|y)p(y|x). We show that Coder-Reviewer is a specific instantiation of the Maximum Mutual Information (MMI) objective (Li et al., 2016), which favors solutions that have high mutual information with the instruction and down weights generic solutions (where p(y)p(y) is high). MMI has also been shown to be effective against degenerate solutions in many other natural language processing tasks (Yin & Neubig, 2019; Lewis & Fan, 2019; Fried et al., 2018).

To implement the Reviewer model p(xy)p(x|y), we propose a simple prompting method. After a program yy is generated by the Coder model p(yx)p(y|x), we invert the order in which the instruction xx and the solution yy appear in the prompt, and query the pretrained language model again to estimate p(xy)p(x|y). Our prompting approach avoids any additional training and is easy to generalize to different programming languages. Figure 1 shows an example prompt: the Coder model generates programs given the function header and the docstring; we then extract the generated program and place it before the docstring when prompting the Reviewer model.

We carry out an extensive empirical study on six datasets with three different programming languages and experiment with seven models from three model families. Compared to past works’ approach of ranking with the Coder model p(yx)p(y|x) alone, Coder-Reviewer reranking demonstrates consistent and effective performance gains (up to 17%17\% absolute accuracy gain). When combined with executability filtering, Coder-Reviewer reranking can often outperform the minimum Bayes risk decoding method (Shi et al., 2022), which involves more complex aggregation of the executed outputs. The code is available on GitHub.https://github.com/facebookresearch/coder_reviewer_reranking

Related Work

Many prior works explored code generation with neural networks (Allamanis et al., 2015; Ling et al., 2016; Iyer et al., 2018; Yin & Neubig, 2017; Yasunaga & Liang, 2020) and many benchmarks have been proposed to evaluate code model performance (Hendrycks et al., 2021; Yu et al., 2018; Lin et al., 2018). Recently, large language models pretrained on code have shown unprecedented zero-/few-shot ability, and can even perform well in code competitions that are challenging to programmers (Chowdhery et al., 2022; Chen et al., 2021; Austin et al., 2021; Li et al., 2022). Our work builds on the impressive ability of pretrained code models and achieves additional gains by leveraging a Reviewer model that evaluates the probability of the language instruction given generated programs.

Maximum Mutual Information

(MMI) and its variants have been shown to be effective in many natural language processing tasks, including text classification (Min et al., 2021), speech processing (Bahl et al., 1986), dialogue (Li et al., 2016), instruction following (Fried et al., 2018), question answering (Lewis & Fan, 2019), and semantic parsing (Yin & Neubig, 2019). In contrast to Maximum Likelihood which optimizes logp(xy)\log p(x|y), MMI optimizes the pointwise mutual information logp(x,y)p(x)p(y)\log\frac{p(x,y)}{p(x)p(y)}. In practice, it is popular to optimize a weighted version of the MMI objective (Li et al., 2016). We show in Section 4 that Coder-Reviewer reranking is a specific instantiation of the weighted MMI objective. However, Coder-Reviewer reranking differs from this work by leveraging prompting to obtain the Reviewer model p(xy)p(x|y), rather than training a separate model, and by showing that the objective produces substantial benefits for the task of code generation. Concurrently, Ye et al. (2022) explore a MMI-like prompting approach for reasoning tasks.

Reranking Methods for Code Generation.

Chen et al. (2021) point out that the diverse samples from large language models often contain correct programs and they propose to rank program samples by the Coder model p(yx)p(y|x). Since then, many methods have been proposed to leverage sample consistency (Shi et al., 2022) or training supervised rerankers (Inala et al., 2022) In particular, Shi et al. (2022) and Li et al. (2022) propose to cluster program surface forms using the executed outputs of the generated programs. Chen et al. (2022) propose to generate unit tests for Python function completion problems and design selection programs that validate generated programs and unit tests against each other. On the one hand, Coder-Reviewer reranking does not require execution, which enables it to be applied in more diverse scenarios, and Coder-Reviewer reranking is not specific to any programming languages or packages. On the other hand, Coder-Reviewer reranking is orthogonal and complementary to methods that incorporate execution semantics: in Section 7.1, we show empirical results on the benefits of combining Coder-Reviewer reranking with execution-based ranking methods.

Background

We are interested in using pretrained code language models to generate code yy conditioned on natural language instructions xx. In addition, we assume access to a context cc that provides useful code context such as package imports or data dependencies. In the example shown in Figure 1, xx is the instruction “return decimal part of the input number”, yy is the generated function body, and cc includes importing the math package. For few-shot code generation, we also have nn demonstration examples (x^1,y^1,c^1)(\hat{x}_{1},\hat{y}_{1},\hat{c}_{1}) to (x^n,y^n,c^n)(\hat{x}_{n},\hat{y}_{n},\hat{c}_{n}).

To solve this problem, our key leverage is a Coder model — a conditional language model pθ(yc,x)p_{\theta}(y|c,x), which can generate programs given the instruction and context. In this case of few-shot generation, the Coder model also conditions on the demonstration examples, i.e., we consider pθ(yc^1,x^1,y^1,,c^n,x^n,y^n,c,x)p_{\theta}(y|\hat{c}_{1},\hat{x}_{1},\hat{y}_{1},\ldots,\hat{c}_{n},\hat{x}_{n},\hat{y}_{n},c,x). Throughout this work, we implement these conditional models using prompting: placing the relevant objects such as cc and xx into the input context of the language model.

Collecting Program Samples.

To obtain programs from the Coder model pθp_{\theta}, we sample autoregressively using a temperature-scaled pθp_{\theta}. If the temperature is low, temperature scaling increases the probability of sampling well-formed programs.

Coder-Only Reranking.

Once we have collected 25-100 samples, we rerank and select a single program as our output program. A popular technique in prior work (Chen et al., 2021) is to pick the program yy that has the highest Coder model likelihood logp(yx,c)\log p(y|x,c). However, the Coder model likelihood logp(yx,c)\log p(y|x,c) is biased toward shorter programs (Stahlberg & Byrne, 2019). To counter this shortcoming, Chen et al. (2021) propose to apply length normalization and instead rerank using the average token-level log likelihood, 1ylogp(yc,x)\frac{1}{\lvert y\rvert}\log p(y|c,x).

Failures of Coder-Only Reranking.

Figure 2(a) and Figure 2(b) plot the Coder logp(yx)\log p(y|x) and Normalized Coder 1ylogp(yc,x)\frac{1}{\lvert y\rvert}\log p(y|c,x) values against the lengths of generated programs y\lvert y\rvert. From these figures we can observe that Coder-only reranking prefers short solutions; applying length normalization overcorrects and prefers long solutions. This is a known issue of reranking/searching using the likelihood of neural language models (Holtzman et al., 2020; Stahlberg & Byrne, 2019). In practice, many of these short programs are trivial (e.g. containing only a return or pass statement) and many of these long programs contain repetitive code. Figure 2(c) plots the accuracy of ranking with Coder model or the length normalized Coder model (N. Coder) versus the number of program samples. We observe that accuracy degrades as the number of program samples increases, since degenerate but high-scoring solutions have an increasing chance of appearing as more programs are sampled. This demonstrates that both types of Coder likelihood are insufficient as reranking objectives for code generation. We further analyze this behavior in Section 5.

Coder-Reviewer Reranking

In real-world professional software development, it is common to have programmers review each other’s work. Motivated by this, and the previously-demonstrated shortcomings of Coder-only reranking, we introduce a Reviewer model. Note that we use the same underlying model pθp_{\theta} for both the Coder and the Reviewer model but prompt differently. In the case of few-shot code generation, we will have pϕ(xc,y,c^1,y^1,x^1,,c^n,y^n,x^n)p_{\phi}(x|c,y,\hat{c}_{1},\hat{y}_{1},\hat{x}_{1},\ldots,\hat{c}_{n},\hat{y}_{n},\hat{x}_{n}). Compared to the Coder model, the Reviewer model is tasked with evaluating the likelihood of the instruction xx. Degenerate programs cannot account for the instruction well and therefore will often have a low Reviewer model likelihood.

Prompting for the Reviewer Model.

We adopt a prompting-based approach to implement the Reviewer model. In short, once the program samples yy are generated, we invert the order in which instruction xx and program yy appear in the input context and query the language model pθp_{\theta} again to obtain their likelihood. Recall that in Figure 1, we demonstrate a task-specific prompt we designed for Python function completion datasets. In this prompt, we first duplicate the function header and place the generated program yy before the instruction docstring xx. Additionally, we insert a natural language task specification “write the docstring for the above function” into the prompt to further specify the task to the pretrained language model. In Figure 3, we give another example of a 3-shot task-agnostic prompt on the NL2Bash dataset. Shi et al. (2022) propose this task-agnostic Coder model prompt that marks the location of instruction and generated programs with html-like tags. For prompting the Reviewer model, we swap the language instruction and the generated program, along with their tags. Importantly, we apply this inversion to both the demonstration examples and the last test example.

Combining Coder and Reviewer.

To leverage the information provided by both the Coder and the Reviewer model, we propose Coder-Reviewer reranking, which reranks programs using the product of the Coder and the Reviewer,

Similarly, we can apply Coder-Reviewer reranking when using the length normalized Coder score,

Here, we normalize the reviewer model score p(xy)p(x|y) to match the scale of the normalized Coder model score.

Because Coder-Reviewer Reranking combines two models by taking their product, Coder-Reviewer Reranking is sensitive to low probability under either model. As a result, Coder-Reviewer reranking seeks out program samples that obtain a consensus from both the Coder and the Reviewer. In the next section, we illustrate that by seeking a consensus, Coder-Reviewer can successfully alleviate the biases from each individual component.

Understanding the Relation between Coder-Reviewer Reranking and Maximum Mutual Information.

We now show that Coder-Reviewer reranking is a special instantiation of the Maximum Mutual Information objective (Li et al., 2016). To avoid notation clutter, we abstract away the problem structure specific to code generation and denote task input xx and output yy. The usual Maximum Likelihood objective optimizes p(yx)p(y|x), which is the same as reranking with the Coder model. However, Maximum Likelihood has the risk of preferring generic or degenerate answers (Holtzman et al., 2020) and Li et al. (2016) propose to optimize Mutual Information p(x,y)p(x)p(y)\frac{p(x,y)}{p(x)p(y)}. Moreover, Li et al. (2016) propose to optimize for a weighted version of the Mutual Information objective, using a weighting parameter α\alpha:

From (2), we see that Coder-Reviewer reranking is a special case of this objective where α\alpha is set to 0.50.5 (we include a derivation in Appendix A). From (3), we see that both can be viewed as interpolating likelihood with a mutual information criterion, with strength α\alpha. In other words, Coder-Reviwer reranking favors program samples that have high mutual information with the language instruction and therefore can filter out low quality solutions that cannot explain the instruction well. This perspective also motivates exploring using hyperparameter α\alpha to control the mixing ratio between the Coder and the Reviewer. In section Section C.3, we show that the off-the-shelf hyperparameter α=0.5\alpha=0.5 usually works well already, and tuning α\alpha can give a small additional gain. Next, we conduct an quantitative analysis to show that Coder-Reviewer model can filter low quality programs.

Analysis of Degenerate Cases

In this section, we experiment with ranking artificially created degenerate programs to analyze the preference of different ranking methods. We experiment with the Codex002 model (Chen et al., 2021) on the 0-shot sanitized MBPP dataset (MBPP-S Austin et al., 2021), which is a Python function completion dataset with example prompt structure shown in Figure 1. We defer experimental details and dataset details to Figure 5.

For each MBPP problem, we sampled 25 programs via the Coder model p(yx)p(y|x). Then, we construct three cases that reflect common degeneracies and inject them among the 125 samples. We construct these cases: A. ReturnOnly where the function body only contains a return statement B. Repetitive where the function body contains print statements printing from 1 to 50, with each line containing exactly one print statement. C. CopyPrompt where the function body only contains a comment that repeats the language instruction.

With these programs, we now compare the reranking behavior of Coder p(yx)p(y|x), normalized Coder (N.Coder) 1ylogp(yx)\frac{1}{\lvert y\rvert}\log p(y|x), and Reviewer p(xy)p(x|y). We also included Coder-Reviewer reranking and the normalized version N.Coder-Reviewer into the comparison. Figure 4 plots the mean reciprocal rank (MRR) of the three degenerate cases ranked by different methods, where high MRR suggests a bias toward the degenerate case. Figure 4 shows that both Coder and Reviewer methods fail on different degenerate programs: Coder favors programs that are effectively empty, even though those clearly do not follow the language instructions; Normalized Coder favors programs full of repetitions; and Reviewer can be biased toward spurious surface form overlap between generated programs and the language instruction. In contrast, Coder-Reviewer reranking can alleviate the biases in its individual components.

Based on this analysis, we prescribe three easy-to-implement procedures to reject degenerate solutions before reranking takes place. By strengthening the Coder and Reviewer baselines, we know that any performance gain brought by Coder-Reviewer cannot be easily attributed to simple fixes.

We filter empty programs. For Python function completion datasets, we also filter trivial solutions that only contain return or pass.

We filter repetitive programs whose zlib compressed representation is more than four times shorter than the original program.This threshold corresponds to Repetitive, which prints from 11 to 5050.

For all Python code generation problems, we use an off-the-shelf canonicalization software pyminifier to remove comments, docstrings, and replace all print and assertion messages to empty strings. This procedure helps reduce the spurious surface form overlap with the language instruction. For function completion problems, we also standardize the function names.

We analyze the effects of the above procedure in Section 7.3. Overall, we find that these procedures help improve both the baseline Coder methods and our proposed Coder-Reviewer methods. Next, we present our extensive empirical study to demonstrate that Coder-Reviewer is an effective and consistent method.

Experimental Setup

We present an extensive empirical study that spans six datasets and three model families. Our evaluation data consists of both zero-shot and few-shot settings and investigates both task-specific and task-agnostic prompt design, and we study eight models whose parameter counts range across two orders of magnitude.

We provide a brief description of our experimental datasets and present a detailed description in Section B.1. We first consider three zero-shot datasets, which all involve generating Python code. HumanEval and MBPP-Sanitized are two popular Python function completion datasets. On these datasets, we generally follow the prompt design in Chen et al. (2022) and use the Reviewer prompt presented in Figure 1.

is a subset of DS-1000 (Lai et al., 2022) and contains 155155 realistic questions adapted from StackOverflow about matplotlib. The rest of DS-1000 cannot be easily handled by left-to-right autoregressive because it often introduces additional contexts that are better addressed by infilling models. We leave applying Coder-Reviewer reranking to infilling models for future work.

Next, we consider three three-shot datasets, MBPP, Spider, and NL2Bash, which include programming languages other than Python. Our task-agnostic prompts and other experimental design on these three datasets are all taken from Shi et al. (2022) and are similar to the one presented in Figure 3.

2 Models

models are descendants of GPT-3 (Brown et al., 2020). We consider three variants of the codex model: Codex002 (codex-davinci-002), Codex001 (codex-davinci-001), and Codex-Cushman (codex-cushman). While the exact modeling details are not made public, we assume based on their codenames that Codex002 and Codex001 are 175175B models and Codex-Cushman is a 1212B model.

InCoder (Fried et al., 2022)

models are trained with the causal mask language modeling objective and are capable of both infilling and left-to-right generation. In this work, we only leverage their autoregressive ability to keep the comparison consistent with other model families and leave the investigation of its infilling ability for future work. We experiment with 66B and 11B models from this family.

CodeGen (Nijkamp et al., 2022)

is a family of autoregressive language models pretrained on code data. We experiment with the 1616B, 66B, and 22B CodeGen models.

3 Implementation Details

Due to the lack of validation split on the benchmarks we experimented with, we restrain ourselves from hyperparameter search and rely on a single set of decoding hyperparameters. On all datasets, we sample with temperature 0.40.4 and set the max tokens to be 300300. For our main results in Table 1, we sample 125125 different programs for each problem and then bootstrap 5050 times to report the mean accuracy of reranking 2525 samples. Given the small size of these datasets, we find bootstrapping to be helpful in reducing variance. In Figure 6, we analyze the impact of decoding hyperparameters on the validation accuracy. We apply the proposed degenerate solution rejection to all baseline and proposed methods and study its effect in Section 7.3. Additionally, we apply executability filtering to all baseline and proposed methods to better compare with state-of-the-art methods that rely on execution more heavily. Executability filtering (Shi et al., 2022) removes programs that produce runtime errors before applying any other ranking methods. We present the ranking results without executability filtering in Section C.1.

4 Reranking Methods

We first compare to baseline selection methods that are popular in existing work. Random reports the percentage of correct programs after degenerate solution rejection. Coder selects the program that has the highest p(yx)p(y|x) and N. Coder applies length normalization to the Coder likelihood.

Proposed Methods.

We then compare the methods proposed in this work. Reviewer selects programs that has the highest p(xy)p(x|y). Coder-Reviewer ranks via the product of the Coder and the Reviewer model scores and N. Coder-Reviewer applies length normalization to the component model scores before taking the product.

State-of-the-art Methods.

We compare to the competitive Minimum Bayes Risk method (MBR-Exec; Shi et al., 2022), which compares the executed outputs of the generated programs on test inputs and selects programs whose outputs are most typical. While for Python function completion problems it is straightforward to compare execution outputs, this is less obvious when execution results involve more complex objects. On the plotting dataset, due to the complexity and the multimodal nature of matplotlibplots with different styles such as colors, fonts, etc. might all be correct but it is difficult to aggregate them., we compare the output figures directly pixel-by-pixel. On HumanEval, we extract the first test case appearing in the docstring and use the executed output for MBR-Exec. For other datasets, we follow the practices recommended in Shi et al. (2022). We compare to MBR-Exec using the same setting with baseline and proposed methods in Table 1.

We also compare to CodeT (Chen et al., 2022), which generates assertions for Python function completion problems and proposes a novel graph-based aggregation process for selecting programs based on program-test agreement. While CodeT is a competitive method, it relies on the ability to generate good assertions and cannot be easily used on other programming languages/tasks. For example, models typically cannot generate unit tests used in the Plotting dataset because good tests involve complicated assertions on the figure object and do not naturally appear in public data. Another example is the spider dataset, where generating additional unit tests involves creating different input database (Zhong et al., 2022), a challenging problem on its own. Therefore, we separate CodeT from the rest of the comparison and compare to its reported numbers in Table 2.

Experimental Results

In Section 7.1, we show that Coder-Reviewer reranking is either the best performing method or competitive with the best performing method across all settings. In Figure 6 and Section 7.3, we perform ablation studies for the different decisions made in developing Coder-Reviewer reranking and show that Coder Reviewer can work well with default hyperparameters and is more stable than Coder reranking.

First, we observe that Coder-Reviewer variants consistently improves over Coder variants. Figure 5 plots the absolute accuracy difference between the best Coder-Reviewer variant (with and without length normalization) and the best Coder variant. Overall, we observe that Coder Reviewer leads to a consistent and significant improvement with very few exceptions. In certain settings, e.g. for InCoder models on the plotting dataset, Coder-Reviewer leads to more than 10%10\% improvement. The only 5 exceptions out of 48 model cross dataset pair where Coder-Reviewer is worse than Coder all involve CodeGen models. We suspect that difference in pretraining can cause a difference in the capability to estimate the Reviewer model p(xy)p(x|y) using the prompt we employ in this work.

Second, in Table 1, we present the ranking results on the largest models from each model family along with comparison to other methods. From Table 1 we find a Coder-Reviewer variant to be the best method most of the time and even when it is not, it is often the second best method. Analyzing across all data, we find that Coder-Reviewer variants are the best method 62.5%62.5\% of the time, Reviewer 12.5%12.5\% of the time, and MBR-Exec 20.83%20.83\% of the time. However, Reviewer and MBR-Exec both have scenarios where they trail behind Coder-Reviewer significantly. For example when applied to the Codex002 model, MBR-Exec is more than 10%10\% worse than normalized Coder-Reviewer on HumanEval and 8%8\% worse on Plotting. In practice, we find that MBR-Exec works better when the quality of the test inputs that it executes on is high and when it is easy to aggregate the execution outputs. Compare HumanEval and MBPP-S, which have very simlar format, MBR-Exec performs much better on MBPP-S than on HumanEval. We ascribe this failure to the lower quality of the test inputs in HumanEval.

Third, Table 2 shows the performance of CodeT and Coder-Reviewer in the same setting of reranking 100100 samples on HumanEval and MBPP-S. Here the results of CodeT are not strictly comparable to those of Coder-Reviewer. Recall that CodeT generates additional test cases for these Python function completion problems so CodeT does not execute any test cases provided in the input docstring. In contrast, we follow the setup in Shi et al. (2022) and execute the first test case to perform executbaility filtering. Still, Table 2 shows that Coder-Reviewer can achieve very competitive performance while being a more general method that can be applied to datasets where generating test cases is difficult.

In summary, our experimental results show that Coder-Reviewer reranking consistently improves performance over Coder-only reranking from past work across a diverse set of model families and datasets. In the following sections, we carry out ablation studies to understand the impact of design decisions and sensitivity to hyperparameters.

2 Understanding Hyperparameters

We explore the effect of different hyperparameters, including the number of samples and the mixing ratio α\alpha between Coder and Reviewer. We start by plotting the ranking accuracy of Coder, Reviewer, and Coder-Reviewer on the 0-shot MBPP-S dataset with Codex002 in Figure 6 (Figure 7 in Section C.2 includes plots on more methods and datasets). We observe that in contrast to its components Coder and Reviewer, Coder-Reviewer is more stable in ranking more sample programs, suggesting that it is more robust against potential degenerate solutions when compared to ranking with Coder or Reviewer model only.

In addition, recall that we can introduce a hyperparameter to control the mixing ratio α\alpha between Coder and Reviewer and have the objective (1α)logp(xy)+αlogp(yx)(1-\alpha)\log p(x|y)+\alpha\log p(y|x). In Coder-Reviewer reranking, we fix α\alpha at 0.50.5 for simplicity although the weighted objective is also popular in prior work (Li et al., 2016). From Figure 8 in Section C.3, we observe that the improvement from tuning α\alpha is typically less than 1%1\%, with a few exceptions with the CodeGen models where tuning can lead to more significant improvements. Finally, we explore a alternate formulation of the Maximum Mutual Information objective and show that it performs generally worse than Coder-reviewer reranking in Section C.4.

3 Understanding Degenerate Solution Rejection

We start our analysis by removing the proposed degenerate solution rejection on Codex002’s generated programs on HumanEval and MBPP-S. Table 3 shows the ranking results along with the performance degradation compared to applying the proposed rejection. We observe that degenerate solution rejection improves all of the baselines and Coder-Reviewer variants, with the most significant effect on Coder. Importantly, we see that the improvement from rejection is larger on Coder, Reviewer, and Coder-Reviewer than on Random, suggesting that these ranking methods have biases toward the degenerate solutions that we reject. On these two datasets, the effect of rejection is less obvious on normalized Coder and normalized Coder Reviewer but we find the rejection to be important for these methods on NL2Bash where it is more likely to have egregious repetitions in the program samples. Finally, these ablation results suggest that Coder-Reviewer still provides a benefit on top of degenerate solution rejection. Overall, we recommend the usage of our rejection methods because they are easy to implement and can effectively remove the most egregious degeneration programs targeting each ranking method.

Conclusion

We propose Coder-Reviewer reranking for code generation, which leads to a consistent and significant improvement over the Coder only reranking proposed in prior work. When combined with executability filtering, Coder-Reviewer reranking can often outperform the MBR-Exec method. Coder-Reviewer is easy to implement with prompting, can generalize to different programming languages, and works well with off-the-shelf hyperparameter.

References

Appendix A Understanding the Relation between Coder-Reviewer Reranking and Maximum Mutual Information.

We include the derivation from Li et al. (2016) to show that Coder-Reviewer reranking is a special instantiation of the Maximum Mutual Information objective. We can show

by using the fact that p(x)p(x) is a constant and removing/adding it does not change the optimziation objective.

Alternatively, Li et al. (2016) propose another way to instantiate this objective,

We show in Section C.4 that this alternate formulation leads to worse performance, which is a similar conclusion to the original finding in Li et al. (2016).

Appendix B Additional Experimental details

contains 164164 hand-written Python programming questions Chen et al. (2021). We use the prompts released by Chen et al. (2022), which removes the input-output cases present in the original prompts. This dataset evaluates the generated function by multiple assertions on the input-output relations.

MBPP-Sanitized (Austin et al., 2021)

contains 427427 crowd-sourced Python programming questions. Chen et al. (2022) adapts each problem into having a function header and relocate the natural language instructions into function docstrings, similar to the format of HumanEval. We use this prompt format for our experiment.

MBPP (Austin et al., 2021)

original version consists of 974974 python questions, with 500500 of them used for testing and the rest for few-shot prompting. The prompt taken from Shi et al. (2022) uses one input-output assertion as additional context to help make the language instruction more specific. Unlike the 0-shot sanitized version of MBPP, this 3-shot setting requires the models to define the desired function name on its own.

Spider (Yu et al., 2018)

is a benchmark of natural language to SQL query generation. There are 70007000 examples for training/demonstration and 10341034 questions for testing. The prompt taken from Shi et al. (2022) prepends the database schema as the program context. The generated SQL commands are evaluated with execution accuracy, comparing the database return values to the ones queried by ground-truth SQL commands.

NL2Bash (Lin et al., 2018)

is a benchmark of translating natural language to bash commands. Because it is difficult to obtain executable enviroments for bash commands, this dataset evaluates character-level BLEU-4 score.

Appendix C Additional Results

Table 5 to Table 10 plot ranking results on all eight models we experimented with and it also shows the results of not applying executability filtering. Analyzing across all data, we find that a Coder-Reviewer variant is the best method 62.5%62.5\% of the time, Reviewer 12.5%12.5\% of the time, and MBR-Exec 20.83%20.83\% of the time. Notably, a Coder-Reviewer variant is almost always the best method for Codex models with only two exceptions (Codex002 on Spider and MBPP). Even when none of Coder-Reviewer variants is the best method, a Coder-Reviewer variant is mostly the second best method. Overall, executability filtering improves most methods and does not change the comparison between different ranking methods.

C.2 Analyzing the Effect of Number of Program Samples

Figure 7 plots the performance of Coder-Reviewer variants, Coder variants, and Reviewer on all three 0-shot datasets we experimented with. Similar to Figure 6, Figure 7 shows that compared to Coder variants, Coder-Reviewer variants perform more consistently across the number of program samples being reranked. Here we plot the performance after applying degeneration solutions rejection. Still, on all three datasets, the performance of Coder starts to degrade after five or ten samples. Normalized Coder is more stable on HumanEval and MBPP-S but also does worse with more program samples on Plotting.

C.3 Analyzing the Effect of Tuning Mixing Ratio α𝛼\alpha

In addition, recall that we can introduce a hyperparameter to control the mixing ratio, α\alpha between Coder and Reviewer and have the objective (1α)logp(xy)+αlogp(yx)(1-\alpha)\log p(x|y)+\alpha\log p(y|x). Coder-Reviewer reranking is equivalent to fixing α\alpha at 0.50.5 although the weighted objective is also popular in prior work (Li et al., 2016). To explore the benefits of tuning this additional hyperparameter of α\alpha, we grid search between {0.1,0.2,,0.9}\{0.1,0.2,\ldots,0.9\} across all experimental settings. From Figure 8 in Section C.3, we observe that the improvement from tuning α\alpha is typically less than 1%1\%, with a few exceptions (Incoder-66B on Plotting and CodeGen-1616B on NL2Bash) where tuning can lead to more significant improvements. Given the lack of validation split in most benchmarks we experimented with, we do not include the result of tuning α\alpha as our main results. However, for practitioners who have access to validation data, we recommend that they can grid search over α\alpha in case tuning leads to additional gain.

C.4 Alternate Formulation

In Section 4, we show that Coder-Reviewer is related to pointwise mutual information regularization, when the regularization strength α\alpha is set to 0.50.5. There is another formulation of this objective that is also common in literature. With derivation shown in Appendix A, we can observe that

Here we compare to this alternate formulation (Alternate) on the 0-shot Python function completion datasets. We use prompting to estimate p(y)p(y) by removing the docstring beneath the function header and record the probability of the function body. To apply this formulation with length normalization, we modify the regularization term as 1ylogp(y)\frac{1}{\lvert y\rvert}\log p(y). From Table 4 we observe that the alternate formulation generally underperforms Coder-Reviewer.