Language Models of Code are Few-Shot Commonsense Learners

Aman Madaan, Shuyan Zhou, Uri Alon, Yiming Yang, Graham Neubig

Introduction

The growing capabilities of large pre-trained language models (LLMs) for generating text have enabled their successful application in a variety of tasks, including summarization, translation, and question-answering (Wang et al., 2019; Raffel et al., 2019; Brown et al., 2020; Chowdhery et al., 2022).

Nevertheless, while employing LLMs for natural language (NL) tasks is straightforward, a major remaining challenge is how to leverage LLMs for structured commonsense reasoning, including tasks such as generating event graphs (Tandon et al., 2019), reasoning graphs (Madaan et al., 2021a), scripts (Sakaguchi et al., 2021), and argument explanation graphs (Saha et al., 2021). Unlike traditional commonsense reasoning tasks such as reading comprehension or question answering, structured commonsense aims to generate structured output given a natural language input. This family of tasks relies on the natural language knowledge learned by the LLM, but it also requires complex structured prediction and generation.

To leverage LLMs, existing structured commonsense generation models modify the output format of a problem. Specifically, the structure to be generated (e.g., a graph or a table) is converted, or “serialized”, into text. Such conversions include “flattening” the graph into a list of node pairs (Figure 1(d)), or into a specification language such as dot (Figure 1(c); Gansner et al., 2006).

While converting the structured output into text has shown promising results (Rajagopal et al., 2021; Madaan and Yang, 2021), LLMs struggle to generate these “unnatural” outputs: LLMs are primarily pre-trained on free-form text, and these serialized structured outputs strongly diverge from the majority of the pre-training data. Further, for natural language, semantically relevant words are typically found within a small span, whereas neighboring nodes in a graph might be pushed farther apart when representing a graph as a flat string.

Thus, a language model which was trained on natural language text is likely to fail to capture the topology of the graph. Consequently, using LLMs for graph generation typically requires a large amount of task-specific training data, and their generated outputs show structural errors and semantic inconsistencies, which need to be further fixed either manually or by using a secondary downstream model (Madaan et al., 2021b).

Despite these struggles, the recent success of large language models of code (Code-LLMs; Chen et al., 2021b; Xu et al., 2022) for tasks such as code generation from natural language (Austin et al., 2021; Nijkamp et al., 2022), code completion (Fried et al., 2022), and code translation (Wang et al., 2021) shows that Code-LLMs are able to perform complex reasoning on structured data such as programs. Thus, instead of forcing LLMs of natural language (NL-LLMs) to be fine-tuned on structured commonsense data, an easier way to close the discrepancy between the pre-training data (free-form text) and the task-specific data (commonsense reasoning graphs) is to adapt LLMs that were pre-trained on code to structured commonsense reasoning in natural language.

Thus, our main insight is that large language models of code are good structured commonsense reasoners. Further, we show that Code-LLMs can be even better structured reasoners than NL-LLMs, when converting the desired output graph into a format similar to that observed in the code pre-training data. We call our method CoCoGen: models of Code for Commonsense Generation, and it is demonstrated in Figure 1.

We highlight the insight that Code-LLMs are better structured commonsense reasoners than NL-LLMs, when representing the desired graph prediction as code.

We propose CoCoGen: a method for leveraging LLMs of code for structured commonsense generation.

We perform an extensive evaluation across three structured commonsense generation tasks and demonstrate that CoCoGen vastly outperforms NL-LLMs, either fine-tuned or few-shot tested, while controlling for the number of downstream task examples.

We perform a thorough ablation study, which shows the role of data formatting, model size, and the number of few-shot examples.

CoCoGen: Representing Commonsense structures with code

We focus on tasks of structured commonsense generation. Each training example for such tasks is in the form $(\mathcal{T},{\mathcal{G}})$ , where $\mathcal{T}$ is a text input, and ${\mathcal{G}}$ is the structure to be generated (typically a graph). The key idea of CoCoGen is transforming an output graph ${\mathcal{G}}$ into a semantically equivalent program ${\mathcal{G}}_{c}$ written in a general-purpose programming language. In this work, we chose Python due to its popularity in the training data of modern Code-LLMs (Xu et al., 2022), but our approach is agnostic to the programming language. The code-transformed graphs are similar in their format to the pre-training data of Code-LLMs, and thus serve as training or few-shot examples that are easier to generalize from than the original raw graph. CoCoGen uses Code-LLMs to generate ${\mathcal{G}}_{c}$ given $\mathcal{T}$ , which we eventually convert back into the graph ${\mathcal{G}}$ .

We use the task of script generation (proscript, Figure 1) as a running example to motivate our method: script generation aims to create a script ( ${\mathcal{G}}$ ) to achieve a given high-level goal ( $\mathcal{T}$ ).

We convert a $(\mathcal{T},{\mathcal{G}})$ pair into a Python class or function. The general procedure involves adding the input text $\mathcal{T}$ at the beginning of the code as a class attribute or descriptive comment, and encoding the structure ${\mathcal{G}}$ using standard constructs for representing structure in code (e.g., hashmaps, object attributes) or function calls. The goal here is to compose Python code that represents a $(\mathcal{T},{\mathcal{G}})$ pair, but retains the syntax and code conventions of typical Python code.

For example, for the script generation task, we convert the $(\mathcal{T},{\mathcal{G}})$ pair into a Tree class (Figure 1(b)). The goal $\mathcal{T}$ is added as a class attribute (goal), and the script ${\mathcal{G}}$ is added by listing the nodes and edges separately. We first instantiate the list of nodes as objects of class Node. Then, the edges are added as an attribute children for each node (Figure 1(b)). For example, we instantiate the node “Take out several plates” as take_out_several_plates = Node(), and add it as a child of the node take_pies_out_to_cool.
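As a rough illustration of this format, the sketch below mirrors the layout just described. It is not the exact prompt from Figure 1(b): the body of the Node class, the goal string, the third node, and the `root` attribute are assumptions added so that the snippet runs on its own.

```python
class Node:
    """Minimal node type (assumed; the format only relies on the name Node)."""

    def __init__(self):
        self.children = []


class Tree:
    # The goal T is stored as a class attribute (placeholder text, assumed).
    goal = "serve the pies on a plate"

    def __init__(self):
        # nodes
        take_pies_out_to_cool = Node()
        take_out_several_plates = Node()
        put_pies_on_plates = Node()  # hypothetical extra step

        # edges
        take_pies_out_to_cool.children = [take_out_several_plates]
        take_out_several_plates.children = [put_pies_on_plates]

        # Kept only so the structure can be inspected after construction.
        self.root = take_pies_out_to_cool
```

Variable names carry the node text, so the script reads like ordinary Python while still encoding every node and edge of the graph.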

While there are multiple ways of representing a training example as a Python class, we found empirically that this relatively simple format is the most effective, especially with larger models. We analyze the choice of format and its connection with the model size in Section 4.

Few-shot prompting for generating ${\mathcal{G}}$

We focus on large-language models of the scale of Codex (Chen et al., 2021a). Due to their prohibitively expensive cost to fine-tune, these large models are typically used in a few-shot prompting mode. Few-shot prompting uses $k$ input-output examples $\{(x_{i},y_{i})\}_{i=1}^{k}$ to create an in-context prompt: $p=x_{1}\oplus y_{1}\;\cdot\;x_{2}\oplus y_{2}\;\cdot\;\ldots\cdot\;x_{k}\oplus y_{k}$ , where $\oplus$ is a symbol that separates an input from its output, and $\cdot$ separates different examples.

A new (test) input $x$ is appended to the prompt $p$ (that is: $p\;\cdot\;x$ ), and $p\;\cdot\;x\;\oplus$ is fed to the model for completion. As found by Brown et al. (2020), large language models show impressive few-shot capabilities in generating a completion $\hat{y}$ given the input $p\;\cdot\;x\;\oplus$ . The main question is how to construct the prompt.

In all experiments in this work, the prompt $p$ consists of $k$ Python classes, each representing a $(\mathcal{T},{\mathcal{G}}_{c})$ pair. For example, for script generation, each Python class represents a goal $\mathcal{T}$ and a script ${\mathcal{G}}_{c}$ from the training set. Given a new goal $\mathcal{T}$ for inference, a partial Python class (i.e., only specifying the goal) is created and appended to the prompt. Figure 2 shows such a partial class. Here, the code generation model is expected to complete the class by generating the definition for Node objects and their dependencies for the goal make hot green tea.
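The prompt assembly described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the blank-line separator, the helper names `partial_class` and `build_prompt`, and the exact text of the partial class are assumptions.

```python
from typing import List

# Stands in for the "·" separator between examples (assumed choice).
EXAMPLE_SEPARATOR = "\n\n"


def partial_class(goal: str) -> str:
    """Return a class that specifies only the goal, left open for completion."""
    return (
        "class Tree:\n"
        f'    goal = "{goal}"\n'
        "\n"
        "    def __init__(self):\n"
        "        # nodes\n"
    )


def build_prompt(train_examples: List[str], test_goal: str) -> str:
    """Concatenate k complete classes, then append the partial test class."""
    return EXAMPLE_SEPARATOR.join(train_examples + [partial_class(test_goal)])
```

Calling `build_prompt(examples, "make hot green tea")` yields a string that ends inside the unfinished class, so the most natural continuation for a code model is the list of Node definitions and their dependencies.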

In our experiments, we used Codex (Chen et al., 2021a) and found that it nearly always generates syntactically valid Python. Thus, the generated code can be easily converted back into a graph and evaluated using the dataset’s standard, original, metrics. Appendix F lists sample prompts for each of the tasks we experimented with.

Evaluation

We experiment with three diverse structured commonsense generation tasks: (i) script generation (proscript, Section 3.2), (ii) entity state tracking (propara, Section 3.3), and (iii) explanation graph generation (explagraphs, Section 3.4). Dataset details are included in Appendix D. Despite sharing the general goal of structured commonsense generation, the three tasks are quite diverse in terms of the generated output and the kind of required reasoning.

As our main Code-LLM for CoCoGen, we experiment with the latest version of Codex, code-davinci-002 from OpenAI (as of June 2022), in few-shot prompting mode.

Baselines

We experimented with the following types of baselines:

Text few-shot: Our hypothesis is that code-generation models can be repurposed to generate structured output better. Thus, natural baselines for our approach are NL-LLMs: language models trained on natural language corpora. We experiment with the latest versions of curie (text-curie-001) and davinci (text-davinci-002), the two GPT-3 based models by OpenAI (Brown et al., 2020). For both these models, the prompt consists of $(\mathcal{T},{\mathcal{G}})$ examples, where ${\mathcal{G}}$ is simply flattened into a string (as in Figure 1(c)). davinci is estimated to be much larger in size than curie, as our experiments also reveal (Appendix A). davinci, popularly known as GPT-3, is the strongest text-generation model available through OpenAI APIs (https://beta.openai.com/docs/models/gpt-3).

Fine-tuning: we fine-tune a t5-large model for explagraphs, and use the results from Sakaguchi et al. (2021) on t5-xxl for proscript tasks. In contrast to the few-shot setup where the model only has access to a few examples, fine-tuned models observe the entire training data of the downstream task.

Choice of prompt

We created the prompt $p$ by randomly sampling $k$ examples from the training set. As all models have a bounded input size (e.g., 4096 tokens for Codex code-davinci-002 and 4000 for GPT-3 text-davinci-002), the exact value of $k$ is task dependent: more examples can fit in a prompt in tasks where $(\mathcal{T},{\mathcal{G}})$ is short. In our experiments, $k$ varies between $5$ and $30$ , and the GPT-3 baseline is always fairly given the same prompts as Codex. To control for the variance caused by the specific examples selected into $p$ , we repeat each experiment with at least 3 different prompts, and report the average. We report the mean and standard deviations in Appendix I.

CoCoGen: We use CoCoGen to refer to setups where Codex is used with a Python prompt. In Section 4, we also experiment with dynamically creating a prompt for each input example, using NL-LLMs with code prompts, and using Code-LLMs with textual prompts.

Script generation: proscript

Given a high-level goal (e.g., bake a cake), the goal of script generation is to generate a graph where each node is an action, and edges capture dependency between the actions (Figure 1(a)). We use the proscript (Sakaguchi et al., 2021) dataset, where the scripts are directed acyclic graphs, which were collected from a diverse range of sources including ROCStories (Mostafazadeh et al., 2016), Descript (Wanzare et al., 2016), and Virtual home (Puig et al., 2018).

Let ${\mathcal{G}}({\mathcal{V}},{\mathcal{E}})$ be a script for a high-level goal $\mathcal{T}$ with node and edge sets ${\mathcal{V}}$ and ${\mathcal{E}}$ , respectively. Following Sakaguchi et al. (2021), we experiment with two sub-tasks: (i) script generation: generating the entire script ${\mathcal{G}}({\mathcal{V}},{\mathcal{E}})$ given a goal $\mathcal{T}$ , and (ii) edge prediction: predicting the edge set ${\mathcal{E}}$ given the nodes ${\mathcal{V}}$ and the goal $\mathcal{T}$ .

Figure 1 shows an input-output example from proscript, and our conversion of the graph into Python code: we convert each node $v\in{\mathcal{V}}$ into an instance of a Node class, and we create the edges by adding a children attribute to each of the nodes. Additional examples are presented in Figure 6.

To represent a sample for edge prediction, we list the nodes in a random order (specified after the comment # nodes in Figure 1(b)). The model then completes the class by generating the code below the comment # edges.

We denote the script that was generated by the model as $\hat{{\mathcal{G}}}$ , and evaluate $\hat{{\mathcal{G}}}$ vs. ${\mathcal{G}}$ for both semantic and structural similarity. To evaluate semantic similarity, we use bleu, rouge-l, and the learned metric bleurt to determine the content overlap. Following Sakaguchi et al. (2021), we use the following metrics for structural evaluation of generated scripts:

Graph edit distance (ged): the number of required edits (node/edge removal/additions) to transform $\hat{{\mathcal{G}}}$ to ${\mathcal{G}}$ (Abu-Aisheh et al., 2015);

Graph isomorphism (iso; Cordella et al., 2001): determines whether $\hat{{\mathcal{G}}}$ and ${\mathcal{G}}$ are isomorphic based on their structure, disregarding the textual content of nodes;

Graph size: average number of nodes and edges, $(|{\mathcal{G}}(V)|,|{\mathcal{G}}(E)|,|\hat{{\mathcal{G}}}(V)|,|\hat{{\mathcal{G}}}(E)|)$ and the average degree ( $\text{d}({\mathcal{G}}(V))$ ), where the goal is for $\hat{{\mathcal{G}}}$ to have measures as close to those of ${\mathcal{G}}$ as possible.

Edge Prediction metrics

For the edge prediction task, the set of nodes is given, and the goal is to predict the edges between them. Following Sakaguchi et al. (2021), we measure precision, recall, and $F_{1}$ comparing the true and predicted edges. Specifically, $p=\frac{|{\bm{E}}\cap\hat{{\bm{E}}}|}{|\hat{{\bm{E}}}|}$ , $r=\frac{|{\bm{E}}\cap\hat{{\bm{E}}}|}{|{\bm{E}}|}$ , and $F_{1}=\frac{2pr}{p+r}$ .
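These edge-level metrics can be computed directly from the two edge sets. The sketch below uses the standard definitions, in which both precision and recall count the edges shared by the true and predicted sets; the function name `edge_prf` and the convention of returning 0.0 for empty sets are assumptions.

```python
def edge_prf(true_edges, pred_edges):
    """Precision, recall, and F1 between true and predicted edge sets."""
    true_set, pred_set = set(true_edges), set(pred_edges)
    overlap = len(true_set & pred_set)  # correctly predicted edges
    p = overlap / len(pred_set) if pred_set else 0.0
    r = overlap / len(true_set) if true_set else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Edges are directed (source, destination) pairs, so ("c", "b") does not match ("b", "c").
true_edges = [("a", "b"), ("b", "c"), ("a", "c")]
pred_edges = [("a", "b"), ("c", "b")]
p, r, f1 = edge_prf(true_edges, pred_edges)  # p = 1/2, r = 1/3, f1 = 0.4
```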

Results

Table 1 shows the results for script generation. The main result is that CoCoGen (based on Codex), with just 15 prompt examples, outperforms the t5 model that was fine-tuned on all 3500 samples. Further, CoCoGen outperforms the few-shot NL-LM curie across all semantic metrics and structural metrics. CoCoGen outperforms davinci across all semantic metrics, while davinci performs slightly better in two structural metrics.

Table 2 shows the results for edge prediction: CoCoGen significantly outperforms the NL-LLMs curie and davinci. With only 15 examples, CoCoGen also outperforms a t5 model that was fine-tuned on 100 examples. The impressive performance in the edge-prediction task allows us to highlight the better ability of CoCoGen in capturing structure, while factoring out all models’ ability to generate the NL content.

Entity state tracking: propara

The text inputs $\mathcal{T}$ of entity state tracking are a sequence of actions in natural language about a particular topic (e.g., photosynthesis) and a collection of entities (e.g., water). The goal is to predict the state of each entity after the execution of each action. We use the propara dataset (Dalvi et al., 2018) as the test-bed for this task.

We construct the Python code $\mathcal{G}_{c}$ as follows, and an example is shown in Figure 3. First, we define the main function and list all $n$ actions as comments inside the main function. Second, we create $k$ variables named state_k, where $k$ is the number of participants of the topic. The semantics of each variable is described in the comments as well. Finally, to represent the state change after each step, we define $n$ functions where each function corresponds to an action. We additionally define an init function to represent the initialization of entity states. Inside each function, the value of each variable gives the state of the corresponding entity after the execution of that action. Given a new test example where only the actions and the entities are given, we construct the input string up to the init function, and we append it to the few-shot prompts for prediction.
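The construction above can be sketched as a small string builder. The exact comment wording, the nesting of the action functions inside main, and the helper names `to_identifier` and `propara_code` are assumptions made for illustration; Figure 3 shows the format actually used.

```python
import re
from typing import List, Optional


def to_identifier(action: str) -> str:
    """Turn an action sentence into a snake_case function name (assumed scheme)."""
    return re.sub(r"\W+", "_", action.strip().lower()).strip("_")


def propara_code(actions: List[str], entities: List[str],
                 states: Optional[List[List[str]]] = None) -> str:
    """Render an example as code.

    With `states` omitted, the output stops at the init function (test input).
    Otherwise states[0] holds the initial states and states[j] the states
    after action j, one value per entity.
    """
    header = ["def main():"]
    header += [f"    # {action}" for action in actions]
    header += [f"    # state_{i} tracks the state of {entity}"
               for i, entity in enumerate(entities)]
    if states is None:
        return "\n".join(header + ["    def init():"]) + "\n"

    body = []
    names = ["init"] + [to_identifier(action) for action in actions]
    for name, row in zip(names, states):
        body.append(f"    def {name}():")
        body += [f"        state_{i} = {value!r}" for i, value in enumerate(row)]
    return "\n".join(header + body) + "\n"
```

For a training example the function emits one nested function per action, while for a test example the returned prefix ends at `def init():`, leaving the state assignments to the model.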

We follow Dalvi et al. (2018) and measure precision, recall and $F_{1}$ score of the predicted entity states. We randomly sampled three examples from the training set as the few-shot prompt.

Results

As shown in Table 3, CoCoGen achieves a significantly better $F_{1}$ score than davinci. Across the five prompts, CoCoGen achieves 5.0 higher $F_{1}$ than davinci on average. In addition, CoCoGen yields stronger performance than curie, achieving $F_{1}$ of 63.0, which is 74% higher than curie (36.1). curie often failed to produce output in the desired format, which explains its high precision and low recall.

In propara, CoCoGen would be ranked $6^{th}$ on the leaderboard (as of 10/11/2022; https://leaderboard.allenai.org/propara/submissions/public). However, all the methods above CoCoGen require fine-tuning on the entire training corpus. In contrast, CoCoGen uses only 3 examples in the prompt and has a gap of less than 10 $F_{1}$ points vs. the current state-of-the-art (Ma et al., 2022). In the few-shot settings, CoCoGen is state-of-the-art in propara.

Argument graph generation: explagraphs

Given a belief (e.g., factory farming should not be banned) and an argument (e.g., factory farming feeds millions), the goal of this task is to generate a graph that uses the argument to either support or counter the belief (Saha et al., 2021). The text input to the task is thus a tuple of (belief, argument, “supports”/“counters”), and the structured output is an explanation graph (Figure 4).

We use the explagraphs dataset for this task (Saha et al., 2021). Since we focus on generating the argument graph, we take the stance as given and use the stance that was predicted by a stance prediction model released by Saha et al. (2021).

To convert an explagraphs example to Python, the belief, argument, and stance are instantiated as string variables. Next, we define the graph structure by specifying the edges. Unlike proscript, the edges in explagraphs are typed. Thus, each edge is added as an add_edge(source, edge_type, destination) function call. We also list the starting nodes in a list instantiated with a begin variable (Figure 4). Given a test example, we construct the input up to the # Edges line and let the model complete the rest.
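A possible rendering of this format, together with a parser that reads the typed edges back, is sketched below. The flat (top-level) layout, the placement of `begin` after the `# Edges` comment, and the helper names `explagraph_code` and `parse_edges` are assumptions; the prompt in Figure 4 may wrap these statements differently.

```python
import ast
from typing import List, Optional, Tuple

Edge = Tuple[str, str, str]  # (source, edge_type, destination)


def explagraph_code(belief: str, argument: str, stance: str,
                    begin: Optional[List[str]] = None,
                    edges: Optional[List[Edge]] = None) -> str:
    """Render an example; with `edges` omitted, stop at the "# Edges" line."""
    lines = [
        f"belief = {belief!r}",
        f"argument = {argument!r}",
        f"stance = {stance!r}",
        "# Edges",
    ]
    if edges is None:  # test input: the model completes everything below
        return "\n".join(lines) + "\n"
    lines.append(f"begin = {list(begin or [])!r}")
    lines += [f"add_edge({s!r}, {t!r}, {d!r})" for s, t, d in edges]
    return "\n".join(lines) + "\n"


def parse_edges(code: str) -> List[Edge]:
    """Collect the arguments of every add_edge(...) call without executing code."""
    edges = []
    for node in ast.walk(ast.parse(code)):
        if (isinstance(node, ast.Call)
                and getattr(node.func, "id", None) == "add_edge"
                and len(node.args) == 3):
            edges.append(tuple(ast.literal_eval(arg) for arg in node.args))
    return edges
```

Because every edge is a function call with three string literals, the edge type survives the round trip from graph to code and back, which a plain `children` list could not express.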

We use the metrics defined by Saha et al. (2021) (see Section 6 of Saha et al. (2021) for a detailed description of the mechanisms used to calculate these metrics):

Structural accuracy (StCA): fraction of graphs that are connected DAGs with two concepts each from belief and the argument.

Semantic correctness (SeCA): a learned metric that evaluates if the correct stance is inferred from a (belief, graph) pair.

G-BERTScore (G-BS): measures BERTScore-based (Zhang et al., 2020) overlap between generated and reference edges.

GED (ged): avg. edits required to transform the generated graph to the reference graph.

Edge importance accuracy (EA): measures the importance of each edge in predicting the target stance. A high EA implies that each edge in the generated output contains unique semantic information, and removing any edge will hurt.
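To make the structural part of these metrics concrete, the sketch below checks only the "connected DAG" condition of StCA. It deliberately omits the requirement on concepts taken from the belief and the argument, so it is a partial illustration and not the official metric of Saha et al. (2021); the function names are hypothetical.

```python
from collections import defaultdict, deque


def is_connected_dag(edges) -> bool:
    """True if the typed edges form a weakly connected, acyclic directed graph."""
    succ, undirected, indegree = defaultdict(set), defaultdict(set), defaultdict(int)
    nodes = set()
    for src, _edge_type, dst in edges:
        nodes.update((src, dst))
        if dst not in succ[src]:
            succ[src].add(dst)
            indegree[dst] += 1
        undirected[src].add(dst)
        undirected[dst].add(src)
    if not nodes:
        return False

    # Connectivity: breadth-first search that ignores edge direction.
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in undirected[queue.popleft()] - seen:
            seen.add(nxt)
            queue.append(nxt)
    if seen != nodes:
        return False

    # Acyclicity: Kahn's algorithm removes every node only if no cycle exists.
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for nxt in succ[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return removed == len(nodes)


def connected_dag_fraction(graphs) -> float:
    """Fraction of graphs passing the check (StCA without the concept test)."""
    return sum(is_connected_dag(g) for g in graphs) / len(graphs)
```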

Results

Table 4 shows that CoCoGen with only 30 examples outperforms the t5 model that was fine-tuned using 1500 examples, across all metrics. Further, CoCoGen outperforms the NL-LLMs davinci and curie with a text-prompt across all metrics by about 50%-100%.

Analysis

In this section, we analyze the effect of three important components of CoCoGen: (i) the contributions of Code-LLMs and structured prompt ${\mathcal{G}}_{c}$ ; (ii) the selection of examples in the in-context prompt; and (iii) the design of the Python class.

Which component is more important, using a Code-LLM or the structured formatting of the input as code? To answer this, we experimented with a text prompt with a Code-LLM (Codex), and a code prompt with an NL-LLM (davinci). Table 5 shows that both contributions are indeed important: performance improves for the NL-LLM davinci both when we use a code prompt, and when we use a Code-LLM. However, when using both a Code-LLM and a code prompt, the improvement is greater than the sum of the two individual improvements.

Dynamic prompt selection

The prompts for all experiments in Section 3 were created by random sampling of examples from the training set. Specifically, a set of $k$ $(\mathcal{T},{\mathcal{G}})$ pairs is sampled and concatenated into a prompt $p$ , which we used for inference over all examples $x_{test}$ in the test set. As an alternative to such fixed prompts, there is now a growing interest in customizing the in-context examples for each example $x_{test}$ . Popular techniques typically train a retriever, which is used to fetch the closest examples (Liu et al., 2021; Rubin et al., 2021; Poesia et al., 2021). We also experimented with such dynamic creation of the prompt, which depends on the particular test example. Specifically, following Poesia et al. (2021), we performed knowledge similarity tuning (kst): we trained a retriever model to retrieve the $k$ closest examples for a given input.
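The retrieval step can be illustrated with a nearest-neighbour lookup over embeddings. In the sketch below the trained retriever is replaced by `toy_embed`, a hypothetical bag-of-words embedding over a tiny fixed vocabulary, purely so that the example runs; the ranking by cosine similarity is also an assumption about how "closest" is measured.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a trained retriever: counts over a toy vocabulary.
VOCAB = ["a", "bake", "bread", "cake", "green", "make", "pie", "tea"]


def toy_embed(text: str) -> List[float]:
    words = text.lower().split()
    return [float(words.count(word)) for word in VOCAB]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_examples(embed: Callable[[str], Sequence[float]],
                      train: List[Tuple[str, str]],
                      query: str, k: int) -> List[Tuple[str, str]]:
    """Return the k training (text, graph) pairs whose text is closest to query."""
    q = embed(query)
    ranked = sorted(train, key=lambda ex: cosine(embed(ex[0]), q), reverse=True)
    return ranked[:k]
```

The selected pairs then replace the randomly sampled examples in the prompt, so each test input sees the training examples most similar to it.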

The results indicate that the efficacy of dynamic prompts depends on both the training data and task. In the edge-prediction sub-task of proscript, edges between events in similar scripts are helpful, and Table 6 shows that the model was able to effectively leverage this information. In the script generation sub-task of proscript, Table 8 shows that kst provides gains as well (Appendix B).

In explagraphs, we observed that the training data had multiple examples which were nearly identical, and thus dynamically created prompts often included such duplicate examples, effectively reducing diversity and prompt size (Table 9).

Python Formatting

We performed an extensive study of the effect of the Python format on the downstream task performance in Appendix G. We find that: (i) there are no clear task-agnostic Python class designs that work uniformly well; and that (ii) larger models are less sensitive to prompt (Python class) design. In general, our approach benefits the most from code formats that are as similar as possible to the conventions of typical code.

Human evaluation

We conduct human evaluation of the graphs generated by CoCoGen and davinci to supplement automated metrics. The results (Appendix C) indicate that human evaluation is closely correlated with the automated metrics: for explagraphs, graphs generated by CoCoGen are found to be more relevant and correct. For proscript generation, both davinci and CoCoGen have complementary strengths, but CoCoGen is generally better in terms of relevance.

Related work

Existing methods for structured commonsense generation typically flatten the output graphs as strings (Madaan and Yang, 2021; Madaan et al., 2021a; Sakaguchi et al., 2021). Consequently, these methods struggle with generation of well-formed outputs (Sakaguchi et al., 2021; Madaan et al., 2021b). In contrast, we address the problem of structured generation by (1) translating the task into Python code, and (2) generating code using large-code generation models.

Code representation for procedural knowledge reasoning

Programs inherently encode rich structures, and they can efficiently represent task procedures. Existing works leverage the control-flows, nested functions and API calls of a programming language such as Python to control the situated agents in the embodied environment (Sun et al., 2019; Zhou et al., 2022; Singh et al., 2022). In this work, we go beyond these procedural tasks and show the effectiveness of using Code-LLMs on broader structured commonsense tasks.

Adapting Code-LLMs for reasoning

As code-generation models (Code-LLMs) are getting increasingly popular, there is a growing interest in adapting them for a wide range of reasoning tasks. Wu et al. (2022) use Codex and PaLM (Chowdhery et al., 2022) for converting mathematical statements written in natural language into a formal structure that can be used for theorem provers, with moderate success. The task is challenging, as it involves understanding the concepts used in the theorem (e.g., set of real numbers) and the complex relationship between them. Our work is similar in spirit to Wu et al. (2022), and seeks to leverage the dual abilities of Code-LLMs for text and symbolic reasoning. However, differently from their work, we close the gap between the pre-training data and our tasks by translating our output into Python code. As our experiments show, this step is crucial in outperforming text-only and fine-tuned models. To the best of our knowledge, our work is the first to transform a natural-language reasoning problem into code to successfully leverage code generation methods.

Symbolic reasoning using LLMs

The use of programming languages like LISP (Tanimoto, 1987) and Prolog (Colmerauer and Roussel, 1996) to process natural language has a long history in AI. However, the recent progress in large language models has obviated the need for specialized methods for symbolic processing. Cobbe et al. (2021) and Chowdhery et al. (2022) address middle-school level algebra problem solving using large-language models in a few-shot setup. These problems require a model to understand the order in which a set of operations should be performed over symbols (typically small integers). In contrast, structured commonsense reasoning requires broader information than supplied in the prompt, while utilizing the models’ structural generation capabilities for generating output effectively. Thus, the tasks in our work push a model to use both its reasoning and symbolic manipulation capabilities.

Conclusion

We present the first work to employ large language models of code for structured commonsense generation. By converting the output commonsense structures to Python code, CoCoGen provides a simple and effective method for leveraging the code-generation abilities of Code-LLMs for structured generation. These results open a promising direction for structural commonsense reasoning. We believe that the principles and the methods presented in this paper are applicable to additional NLP tasks that require “language understanding” and structured prediction.

Acknowledgments

We thank Kaixin Ma, Keisuke Sakaguchi and Niket Tandon for thoughtful discussion and helping with proscript datasets and the anonymous reviewers for valuable feedback. This material is partly based on research sponsored in part by the Air Force Research Laboratory under agreement number FA8750-19-2-0200. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government. This project was also partially supported by a gift from AWS AI.

Limitations

Some experiments in this work are performed with language models that are not open-sourced, namely davinci, curie, and Codex. Existing documentation (Brown et al., 2020; Chen et al., 2021b) does not fully describe the details of these models, such as the pretraining corpus, model size, and model biases. Therefore, we can only provide educated guesses on these details (analysis in Appendix A). In addition, even though Codex is free to use for research as of June 2022, we are unsure whether the research community will continue to have free access in the future. Nonetheless, we release our code and model outputs to ensure the reproducibility of our work. Furthermore, in cases where the models we experiment with reveal any issue, the publicly available code will allow future investigations.

Another limitation of our work is that we exclusively experiment with datasets in English. Exploring the efficacy of structured generation methods in cross-lingual settings is an interesting and important future work.

References

Appendix A Few-shot models size estimates

As OpenAI has not released any details of the size of their few-shot models, we estimate the relative strengths and weaknesses on code and text generation by calculating the average loss per token. To calculate the average loss of each of these models on code, we use the implementation provided by Xu et al. (2022) (https://github.com/VHellendoorn/Code-LMs#evaluation). The perplexity on the text corpus was evaluated on 30 random Wikipedia pages from WikiPlots (https://github.com/markriedl/WikiPlots) following a similar procedure. The structure and text generation capabilities of the models are apparent from the results in Table 7: davinci outperforms Codex on text generation but is worse on code generation, and vice versa. curie underperforms both davinci and Codex significantly. Importantly, these results show that Codex and davinci are of comparable capacities, making their comparison fair.
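For reference, average per-token loss and perplexity are related as in the sketch below. It assumes that natural-log token probabilities are already available for a text; how they are obtained from each model is outside the scope of this snippet, and the function names are hypothetical.

```python
import math
from typing import Sequence


def average_loss(token_log_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood per token (natural log)."""
    return -sum(token_log_probs) / len(token_log_probs)


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity is the exponential of the average per-token loss."""
    return math.exp(average_loss(token_log_probs))


# A model that assigns probability 0.5 to each of four tokens:
log_probs = [math.log(0.5)] * 4
loss = average_loss(log_probs)  # equals ln 2
ppl = perplexity(log_probs)     # equals 2, up to floating-point error
```

A lower value on a corpus indicates that the model finds that kind of data (code or text) more predictable, which is the basis of the comparison above.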

Appendix B Dynamic prompt Creation

As an alternative to creating a single fixed prompt, there is now a growing interest in customizing the in-context examples for each example $\mathcal{T}_{test}$ . Popular techniques typically train a retriever, which is used to fetch the examples in the training set that are closest to $\mathcal{T}_{test}$ (Liu et al., 2021; Rubin et al., 2021; Poesia et al., 2021).

Specifically, Poesia et al. (2021) train a retriever with a target-similarity tuning (TST) objective over a corpus $\mathcal{D}$ of $(x,y)$ examples. TST learns an embedding function $f$ such that, for a pair of examples $(x_{i},y_{i})$ and $(x_{j},y_{j})$ , $y_{i}\sim y_{j}\implies f(x_{i})\sim f(x_{j})$ . For a new $x$ , $f(x)$ is used to retrieve the closest examples from $\mathcal{D}$ .

We follow Poesia et al. (2021), and train a knowledge-similarity tuner (kst). We use mpnet-base (https://huggingface.co/microsoft/mpnet-base) with SentenceTransformers (Reimers and Gurevych, 2019) to fine-tune a retrieval function $f$ by minimizing the following loss:

where $f_{\theta}$ is parameterized using a transformer.

Results of using kst with proscript and explagraphs are shown in Table 8 and Table 9, respectively. While kst is highly effective for edge prediction (Table 6), the results are mixed for explagraphs and proscript script generation. For proscript, kst yields marginal gains. However, for explagraphs, a number of training examples have overlapping themes (Table 10), and thus creating a prompt dynamically reduces the effective information in the prompt.

Appendix C Human Evaluation

Out of the four tasks used in this work, proscript edge prediction and propara have only one possible correct value. Thus, following prior work, we report the automated, standard metrics for these tasks. For explagraphs, we use model-based metrics proposed by Saha et al. (2021), which were found to have a high correlation with human judgments. For proscript graph generation, we conducted an exhaustive automated evaluation that separately scores the correctness of the nodes and the correctness of the edges.

However, automated metrics are limited in their ability to evaluate model-generated output. Thus, to further investigate the quality of outputs, we conduct a human evaluation to compare the outputs generated by CoCoGen and davinci. We sampled 20 examples, and three of the authors performed the evaluation. Annotators were shown two graphs (generated by CoCoGen and davinci) and were asked to select the one they thought was better regarding relevance and correctness. The selection for each criterion was made independently: the same graph could have more relevant nodes (higher relevance) but may not be correct. The identity of the model that generated each graph (CoCoGen or davinci) was shuffled and unknown to the evaluators.

The results in Table 11 indicate that human evaluation is closely correlated with the automated metrics: for explagraphs, annotators found the graphs generated by CoCoGen to be more relevant and correct. We find that davinci often fails to recover semantic relations between nodes in the argument graphs. For example, consider a belief (B) urbanization harms natural habitats for the animals in the world. We want to generate a graph that can counter this belief with the argument (A) urbanization causes increase in jobs.

For the same prompt, CoCoGen generated (urbanization; causes; increase in jobs); (increase in jobs; has context; good); (good; not capable of; harms) whereas davinci generated (jobs; not harms; natural habitats) $\rightarrow$ (natural habitats; not part of; animals). Note that davinci successfully recovered relevant events (“natural habitat”, “animals”) but arranged them in incorrect relations. For proscript, the human evaluation shows that CoCoGen and davinci have complementary strengths, although CoCoGen generally produces more relevant and correct outputs.

Appendix D Dataset statistics

Dataset statistics are shown in Table 12. The test split for explagraphs is not available, so we evaluate on the validation split. For proscript, we obtained the test splits from the authors.

Appendix E Sample outputs

Sample outputs from CoCoGen for all the tasks are located at https://github.com/madaan/CoCoGen/tree/main/outputs. Representative examples from each task are presented in Figure 5. Surprisingly, CoCoGen (Codex with a Python prompt) generates syntactically valid Python graphs that are similar to the task graphs/tables in nearly 100% of the cases.
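Syntactic validity of a generated program can be checked mechanically. The sketch below uses Python's standard `ast` module to compute the fraction of outputs that parse. It illustrates one way to run such a check and is not necessarily the procedure used in this work; the helper names and the two sample strings are hypothetical.

```python
import ast
from typing import Iterable


def is_valid_python(source: str) -> bool:
    """Return True if the source parses as Python, False on a syntax error."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


def valid_fraction(outputs: Iterable[str]) -> float:
    """Fraction of generated outputs that are syntactically valid Python."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    return sum(is_valid_python(o) for o in outputs) / len(outputs)


# Two hypothetical generations: one well-formed, one with an unclosed parenthesis.
samples = [
    "class Tree:\n    def __init__(self):\n        self.goal = 'bake a cake'\n",
    "class Tree:\n    def __init__(self:\n        pass\n",
]
fraction = valid_fraction(samples)
```

Parsing only establishes that an output is well-formed code. Whether the resulting graph resembles the task graph still has to be judged with the task metrics.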

Appendix F Prompts

The prompts for each task are available at the following URLs:

proscript script-generation: https://github.com/madaan/CoCoGen/tree/main/data/proscript_script_generation/prompt.txt

proscript edge-prediction: https://github.com/madaan/CoCoGen/tree/main/data/proscript_edge_prediction/prompt.txt

propara: https://github.com/madaan/CoCoGen/tree/main/data/explagraphs/prompt.txt

explagraphs: https://github.com/madaan/CoCoGen/tree/main/data/explagraphs/prompt.txt

These prompts are also present in the attached supplementary material, and can be found in the data folder under respective task sub-directories.

Appendix G Designing Python class for a structured task

Figure 7 shows three different designs for explagraphs. For proscript, the various formats include representing a script as a NetworkX (https://networkx.org/) class (Figure 8), a DOT-like class (Figure 9), and a Tree (Figure 10).
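As a rough illustration of what a tree-style class design can look like, the sketch below encodes a small script as a Python class whose attributes are steps and whose `children` lists are edges. This is a hypothetical example written for this appendix; it does not reproduce the prompts in the figures, and the names `Node`, `BakeACake`, and `edges` are assumptions.

```python
from typing import List, Tuple


class Node:
    """One step in a script (hypothetical helper)."""

    def __init__(self, description: str) -> None:
        self.description = description
        self.children: List["Node"] = []


class BakeACake:
    """Tree-style design: the goal is a class attribute, steps are Node attributes."""

    goal = "bake a cake"

    def __init__(self) -> None:
        # Nodes: the steps of the script.
        self.gather = Node("gather ingredients")
        self.preheat = Node("preheat the oven")
        self.mix = Node("mix the batter")
        self.bake = Node("put the batter in the oven")
        # Edges: each child must happen after its parent.
        self.gather.children = [self.mix]
        self.preheat.children = [self.bake]
        self.mix.children = [self.bake]

    def edges(self) -> List[Tuple[str, str]]:
        """Flatten the structure into (parent, child) description pairs."""
        nodes = [v for v in vars(self).values() if isinstance(v, Node)]
        return [(n.description, c.description) for n in nodes for c in n.children]


script = BakeACake()
edge_list = script.edges()
```

The same nodes and edges could equally be written as calls on a graph object or as DOT-like statements; the designs differ in how they express the structure, not in the structure itself.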

Appendix H Impact of Model size

The Codex model released by OpenAI is available in two versions as of June 2022: code-davinci-001 and code-davinci-002. While the exact sizes of the models are unknown because of their proprietary nature, the OpenAI API states that code-davinci-002 is the “most capable Codex model”. Tables 16 and 17 compare CoCoGen + code-davinci-001 with CoCoGen + code-davinci-002. Note that both code-davinci-001 and code-davinci-002 can fit 4000 tokens, so the number of in-context examples was identical for the two settings. The results show that for identical prompts, CoCoGen + code-davinci-002 vastly outperforms CoCoGen + code-davinci-001, showing the importance of having a better underlying code generation model.

Table 14 shows the performance of Codex-001 (smaller) and Codex-002 (larger, also see Appendix A) on identical prompts. Our experiments show that as model size increases, the sensitivity of the model to the prompt decreases. This indicates that for very large models, prompt design might get progressively easier.

Appendix I Variation in prompts

We run each experiment with 3 different random seeds, where the random seed determines the order of examples in the prompt. We find minimal variance across the 3 runs, each of which uses a different fixed prompt. Further, as shown in Tables 18, 19, 20, and 21, all improvements of CoCoGen over davinci are statistically significant (p-value $<$ 0.001).
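The seed-controlled ordering described above can be sketched as follows: each seed produces one fixed permutation of the same in-context examples, and per-seed scores are then summarized by their mean and standard deviation. The function names and placeholder examples are hypothetical, and the exact shuffling procedure used in the experiments may differ.

```python
import random
import statistics
from typing import Dict, List, Sequence, Tuple


def order_examples(examples: Sequence[str], seed: int) -> List[str]:
    """Return a seed-determined permutation of the in-context examples."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global state is untouched
    return shuffled


def build_prompts(examples: Sequence[str], seeds: Sequence[int]) -> Dict[int, str]:
    """One fixed prompt per seed: same examples, different order."""
    return {seed: "\n\n".join(order_examples(examples, seed)) for seed in seeds}


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of a metric across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


examples = ["example A", "example B", "example C", "example D"]  # placeholders
prompts = build_prompts(examples, seeds=(0, 1, 2))
```

Using a local `random.Random(seed)` instance keeps each ordering reproducible regardless of other random calls in the same process.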