Decomposed Prompting: A Modular Approach for Solving Complex Tasks

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, Ashish Sabharwal

Introduction

To address these limitations, we propose Decomposed Prompting (DecomP), a new approach to solve complex tasks by instead decomposing them into simpler sub-tasks and delegating these to sub-task specific LLMs, with both the decomposer and the sub-task LLMs (henceforth, sub-task handlers) having their own few-shot prompts. Fig 1 illustrates our approach. The decomposer prompt only describes a sequence of sub-tasks (A, B, and C) needed to solve the complex tasks, indicated with the dashed lines. Each sub-task is then delegated to the corresponding sub-task handler shown on the right.

This approach has several advantages over prior work (as also shown in the figure). The sub-task handlers can be shown a broader and richer set of examples (of the simpler task) than the specific ones needed for the complex task prompt (task A). If a sub-task is too complex, it can be further decomposed into simpler sub-tasks (task B). Similar to software libraries, these sub-task handlers can be shared across multiple tasks; e.g., here tasks A and C are reused in the model for task B. As noted above, a sub-task handler can be easily swapped with an improved implementation without any change to the rest of the system. Few-shot prompt based LLMs can be even replaced with a symbolic system for tasks more suited for non-neural methods; e.g., task C uses a symbolic retrieval system such as Elasticsearch that can handle very large-scale corpora. Lastly, we can even improve upon prior work by simply adding an error-correcting sub-task handler as a post-processing step.

Related Work

Large-scale Language models (LLMs) have been shown to learn various NLP tasks given just few examples as prompts (Brown et al., 2020). Recently, they have also been successfully applied to various multi-step reasoning tasks by providing the intermediate reasoning steps, i.e. Chain-of-Thought (Wei et al., 2022; Chowdhery et al., 2022), needed to arrive at the answer. An alternate approach has been to compose multiple LLMs or LLMs with symbolic functions to perform multi-step reasoning (Jung et al., 2022; Creswell et al., 2023; Press et al., 2022; Parisi et al., 2022; Gao et al., 2022; Schick et al., 2023, inter alia). We view these prior works as specialized systems with a pre-defined decomposition structure.

The closest works to our approach are the ideas of least-to-most prompting (Zhou et al., 2023) and successive prompting (Dua et al., 2022) where one prompt/model is used to generate the sub-questions needed to answer a complex question and a second prompt/model sequentially answers these sub-questions. In contrast, our approach allows for diverse decomposition structures including recursion and other non-linear decomposition structures. E.g., by definition, least-to-most asks questions from easiest to the hardest and requires an LLM to eventually answer the complete question (“most” in least-to-most) whereas we have no such restriction. Additionally, we iteratively generate new questions based on previous answers (similar to successive prompting) and can explicitly assign different prompts or symbolic systems to answer each sub-question.

Our work follows a long literature in NLP on neural modular modeling architectures (Andreas et al., 2016; Talmor & Berant, 2018; Min et al., 2019; Jiang & Bansal, 2019; Gupta et al., 2020; Perez et al., 2020; Khot et al., 2021; Levine et al., 2022) for question-answering and other tasks. We take particular inspiration from the Text Modular Networks approach of Khot et al. (2021), whereby problem decomposition consists of a learned next question generator trained to generate questions in the language of a collection of textual and symbolic agents. Best-first search strategy was used to explore the space of possible decompositions during inference. In contrast to this work, which largely centered around supervised training of the next-question generator given existing agents, we leverage the power and recent successes of few-shot LLMs to build both the decomposer and the sub-task agents that best fit the ideal decomposition. This has the advantage of obviating the need for specialized supervised training data that may not always be available for all sub-tasks – a key bottleneck of this prior work.

Decomposed Prompting

As with conventional few-shot prompting, the goal is to teach an LLM to find an answer AA to a query QQ using a small set of in-context examples D={E1,...,ED}D=\{E_{1},...,E_{|D|}\}. The answer AA is obtained from the underlying distribution p(AQ,D,θ)p(A\mid Q,D,\theta) (Dohan et al., 2022). In the most basic few-shot setup, examples take the form Ej=(Qj,Aj)E_{j}=(Q_{j},A_{j}). In the case of CoT-style prompting, the goal is to obtain answers by first generating a sequence or chain of intermediate reasoning steps or “thoughts” TT, and then deriving the final answer based on TT. To teach this ability, one uses more sophisticated in-context examples that take the form Ej=(Qj,(Tj,1,,Tj,k),Aj)E_{j}=(Q_{j},(T_{j,1},\ldots,T_{j,k}),A_{j}).

In DecomP, the core is a decomposer LLM that tries to solve a complex task by generating a prompting program PP for it. Each step of PP directs a simpler sub-query to a function in an auxiliary set of sub-task functions F\mathcal{F} available to the system. Given a query QQ whose answer is AA, the program PP is a sequence of the form \big{(}(f_{1},Q_{1},A_{1}),...,(f_{k},Q_{k},A_{k})\big{)} where AkA_{k} is the final answer predicted by PP and QiQ_{i} is a sub-query directed to the sub-task function fiFf_{i}\in\mathcal{F}. PP is executed by a high-level imperative controller, which passes the inputs and outputs between the decomposer and sub-task handler until a stopping condition in PP is met and the final output obtained.

To teach the decomposer LLM in a few-shot prompting manner, we use in-context examples that take the form E_{j}=\big{(}(Q_{j},\big{(}f_{j,1},Q_{j,1},A_{j,1}),...,(f_{j,k_{j}},Q_{j,k_{j}},A_{j,k_{j}})\big{)}\big{)} where Aj,kj=AjA_{j,k_{j}}=A_{j} is the final answer for QjQ_{j} and (Qj,1,,Qj,kj)(Q_{j,1},\ldots,Q_{j,k_{j}}) is a decomposition of QjQ_{j}. Each sub-task function ff, in turn, is operationalized via a sub-task handler as an in-context prompting LLM (e.g., a separate CoT-style prompt or a additional prompting program dedicated to that sub-task), or any other symbolic or learned function (e.g., a calculator or specialized supervised trained model).

To illustrate this with an example, consider a multi-step task such as “Concatenate the first letter of every word in strstr using a space”. We can solve this task by decomposing it into a sequence of three simple sub-tasks: 1) Collect the list of words in the strstr; 2) For each word, extract the third letter; 3) Concatenate the extracted letters using space as the separator. Fig. 2 shows an example decomposition prompt for this task. Much like a conventional structured program, the top-level decomp prompt provides an example program EjE_{j} using three sub-task functions: f1:f_{1}:split that splits words in an input string, f2:f_{2}:str_pos that finds character positions in strings and f3:f_{3}:merge that concatenates characters. In this case, we operationalize each sub-task function as a separate in-context prompt (e.g., using a standard prompting approach for split and merge on the right side), each containing a set of in-context examples that are independent of the original complex task.

In addition to the three functions described above, additional control structure is included, such as the symbolic function foreach, which iterates over arrays and references to previous answers such as #1. We note that such a helper function is not strictly necessary (e.g., we could directly generate “Q2’: What is the first letter of Jack?” and “Q3’: What is the first letter of Ryan?” instead of Q2 in the figure) and is added to reduce the manual effort needed to specify the decomposition and also reduce potential errors during decomposition. In our experiments we use two of the compositional operators defined by Khot et al. (2022) (see appendix for details), although it is capable of using all their operators (which also capture the QDMR operators from Wolfson et al. (2020)).

2 Prompt Execution and Inference

Given a new question and a set of background in-context examples DD, the inference (i.e., the program construction and execution) process is illustrated in Fig. 3. The new complex question is fed to the decomposer prompt to get the first sub-question to be asked to the split prompt. With the help of our symbolic controller, the answer generated from this prompt is then appended to the decomposer prompt to get the second sub-question, Q2Q2. Due to the foreach operator in the generated question, Q2Q2 results in two questions (one for each word in #1\#1) to be fed to the str_pos prompt. The answers are combined into an array to get the answer #2\#2. The entire decomposition history is used to generate Q3Q3 and passed to the merge prompt to get the final answer. Since the task has been solved, the decomposition prompt produces the special end-of-sequence marker([EOQ]) and the last answer is returned as the final answer. Formally, performing inference involves finding the best answer AA to a new query QQ, which in the simplest form involves computing the MAP answer using the LLMs predictive distribution for AA, i.e., A^=arg maxAp(AD,Q,θ)\hat{A}=\operatorname*{arg\,max}_{A}p(A\mid D,Q,\theta) (Dohan et al., 2022). For practicality, such computations are approximated using greedy search in our experiments.

3 DecomP Capabilities

Recursive Decomposition Some problems can be naturally broken down into one or more smaller problems of the same form. Recursive algorithms such as merge sort use this idea to solve large problems efficiently, using a succinctly described method. We apply this same principle in DecomP by allowing the decomposer prompt to recursively call itself, as shown in Fig. 5 for the task of list reversal. By using recursion, we are able to generalize any base prompting approach (CoT in this figure) to much longer lists by breaking the input into smaller and smaller lists till we reach a list length where the model is highly accurate. Such recursive approaches can not be described by current methods such as CoT and standard prompting. Least-to-most prompting (Zhou et al., 2023) also proposes a similar solution but differs in two key aspects (a) it has to identify all the sub-problems in one-shot instead of our iterative top-down decomposition (b) it has to learn to identify the relevant answers from the previous solutions which we get for free from our decomposition.

In certain cases, the sub-tasks may not be feasible to solve using only a LLM. E.g., retrieving knowledge from a KB or large corpus. Such sub-tasks, however, can be easily solved using existing systems such as retrieving documents using an Elasticsearch index or webpages using Google search (Lazaridou et al., 2022). Fig. 6 shows how DecomP can easily use such a system to retrieve the relevant documents and answer a single-hop open-domain question.

Case Studies

We showcase DecomP’s strengths through four tasks; two symbolic manipulation tasks similar to those investigated by Wei et al. (2022) and two existing textual multi-hop reasoning tasks. Unless specified, we use text-davinci-002 InstructGPT3 model (Ouyang et al., 2022) as the LLM and report the Exact Match (EM) numbers, following prior work. For order-independent list answers, we evaluate set equality as EM. We compare our approach to CoT rather than each specific decomposition structure used in prior work. See App. G for the complete prompts for all our tasks.

Chain-Of-Thought The letter at position 4 of ”Herbert” is ”b”. The letter at position 4 of ”Alexander” is ”x”. The letter at position 4 of ”Simon” is ”o”. Concatenating ”b”, ”x”, ”o” using a space leads to ”b x o”. So, ”Herbert Alexander Simon” outputs ”b x o”. …

Chain-Of-Thought (rolled out) The words in ”Herbert Alexander Simon” are ”Herbert”, ”Alexander”, and ”Simon”. The letters and their positions in ”Herbert” are ”[(H, 1), (e, 2), (r, 3), (b, 4), (e, 5), (r, 6), (t, 7)]”. The letter at position 4 in this sequence is ”b”. \cdots outputs ”b x o”. …

We similarly adapt the least-to-most prompt (Zhou et al., 2023) to include rollout. (see App. G). We compare these four prompting techniques on 4 datasets to evaluate generalization along 3 axes: (1) new letter position k=3k=3;Note that none of the sub-task prompts contain examples for this position. (2) longer inputs, #words=4 and 5; (3) new delimiter ”;”. The words in the test examples come from a list of most popular first and last names.forebears.io/earth/forenames and forebears.io/earth/surnames All evaluation datasets have 100 examples. We present results on space as a delimiter averaged across three prompts in Fig. 8.We obtain similar results with semi-colon shown in Fig. 22 in the appendix.

DecomP outperforms chain-of-thought and least-to-most prompting, even when the prompt uses the same reasoning procedure as the rolled out decomposition. This shows that the separate prompts are more effective at teaching hard sub-tasks than a single CoT prompt.

DecomP generalizes perfectly to longer sequences. As the length of the input sequence increases, our approach continues to achieve close to 100% accuracy on this task.Note that we report aggregate metrics for DecomP too but the std. dev is zero here The CoT-based approaches drop noticeably in their scores with longer input lengths, widening the performance gap.

2 List Reversal (Recursive Decomposition)

We use the task of reversing lists of wordsWe use the vocabulary from Wei et al. (2022): https://www.vocabulary.com/lists/189583to show how recursive DecomP enables length generalization. We adapt the relevant CoT prompt from Wei et al. (2022), and integrate it in a decomposed prompt. As a control, we also compare to a CoT version w/ rollout of our decomposed prompt. All prompts contain the same 3 examples of reversing word sequences with 3-5 items. We evaluate all prompts for generalization to 4, 6, 8, and 10-item sequences. Here we use davinci-001 to show that DecomP enables a weaker model approach davinci-002’s performance (which does solve this task). We use the strategy from Fig. 5 and provide our prompts in App. G. Fig. 8 shows the results of the prompting strategies on different input lengths.

DecomP improves the length generalization of few-shot prompting. While our base CoT prompt does not generalize at all to longer sequences, our approach can recursively decompose the problem and achieve better length generalization. Moreover, the CoT version of our decomposition strategy fails because the unrolled prompt becomes too long and convoluted without the ability to abstract away sub-modules.

3 Long-Context Question Answering

We next evaluate on the CommaQA-E dataset (Khot et al., 2022) under the reading comprehension setting. The dataset consists of synthetically generated entities (e.g. Erowid award), facts (“Wetherality was an actor in the movie Dewbar.”) and multi-hop questions (e.g., “What awards have the actors of the Erowid winning movies received?”). Due to the presence of many distractors and, as a result, longer context, this dataset has been shown to be hard for standard LMs even when fine-tuned.

To fit these questions within GPT3’s context limit (2049 tokens), we generate a smaller version of the CommaQA-E dataset and of the compositional generalization split such that we can fit at least four examples in the context for CoT prompts. The CoT prompts describe the sequence of facts needed to arrive at the answer (see App. G for all the prompts).

For DecomP, we can separate the task of decomposition (independent of the context) from the sub-tasks of single-hop question answering. As shown in Fig. 9, we provide examples of the context-independent decomposition in the decomposer prompt and use the separate sub-task prompts to teach the QA skill over the given context. Additionally, we can choose the granularity of decomposition to trade off human effort for increased accuracy. For example, we could have single QA prompt to handle all the questions or create QA prompts for different classes of questions. In our experiments, each sub-task prompt contains 8 QA examples (2 questions/para). We evaluate three different prompts and report the average results in Fig. 10.

We make three observations on CommaQA. DecomP is more accurate than CoT irrespective of the granularity of decomposition or the evaluation split. Finer grained decomposition can help improve task performance by providing more examples for each class of questions, which in turn increases single-hop QA accuracy. DecomP generalizes to new compositions such as the compositional generalization split of CommaQA, which tests models on unseen compositions of relations observed in the training set. While CoT has a drop in score, both decomposition-based approaches actually get a small bump (the subset of relations used in this split are easier for our QA models).

4 Open-Domain Question Answering

Next, we demonstrate the ability of our approach to integrate external API calls on the task of open-domain multihop question answering. We evaluate our approach on three datasets: (1) 2WikiMultihopQA (Ho et al., 2020) (2) MuSiQue (Trivedi et al., 2022) (3) HotpotQA (Yang et al., 2018). We describe the open-domain versions of these datasets in more detail in App. A We use the Codex (code-davinci-002) model here since it can fit the much longer contexts needed. We also evaluate the impact of model scale on DecomP by using models from the Flan-T5 family: Flan-T5-Large (0.7B), Flan-T5-XL (3B), and Flan-T5-XXL (11B).We still use GPT3-sized models for decomposition since only these models are reliably able to produce the required structured outputs.

Fig. 11 shows the decomposition prompt we use. The decomposer generates (singlehop) sub-questions and delegates them to retrieve_odqa (described in Fig. 6). As we showed earlier, this module retrieves relevant documents then uses an RC model to answer. retrieve_odqa returns both the answer and the documents, allowing subsequent sub-questions to use the answers (e.g. “Mack Rides”) and the multihop_rcqa model to use the documents. The final multihop_rcqa model is prompted to produce the answer directly or using CoT given K paragraphs.

We compare our approach against two baselines: A. No Context (No-Ctxt), A closed-book setting baseline where the model must rely only on its parametric knowledge. B. NoDecomp Context (NoDecomp-Ctxt), A simple retrieval baseline where we retrieve K paragraphs using the multi-hop question as the input and use that as context. For both NoDecomp-Ctxt and Decomp-Ctxt, K is selected by hyperparameter tuning (App. A). We manually annotate CoTs and decompositions for 20 training set questions, and sample 3 prompts of 15 questions each for all approaches. The detailed prompts are given in the Appendix G. We evaluate on 300 held-out dev questions in each dataset.

We present results on all three datasets with direct QA prompts in Fig. 10 with other results in App. A. The Decomp-Ctxt models performs significantly better than No-Ctxt models in all the settings showing that external knowledge can be leveraged to improve few-shot models on open-domain mulithop QA. Furthermore, we show that our Decomp-Ctxt models outperform the strong retrieval baseline (NoDecomp-Ctxt) in all settings except one (Codex with HotpotQA). Finally, we show that even with the much smaller Flan-T5-XXL model, Decomp-Ctxt outperforms all the baselines and can even achieve scores comparable to the Codex-only systems.

5 Additional Results

Post-processing CoT for error correction DecomP also allows us to create a targeted sub-task handler to focus on the source of error in any system. For example, CoT for arithmetic reasoning often rely on patterns (answer is .*) to extract answers but the CoT does not always fit this pattern. Instead, we can assign the answer extraction to a better sub-task handler (GPT3) and reduce these types of errors. This results in a 17 pt improvement on MultiArith (78 \to 95) and 14 pt improvement on GSM8K (36 \to 50.6) compared to CoT prompting (details in App. B).

While DecomP outperforms the baselines in aggregate, we also see the gains of DecomP are consistent across prompt choices (see App. D) and decomposition schemes (see App. E).

Conclusion

We proposed a new approach, Decomposed Prompting, to solve complex tasks using few-shot prompts, by decomposing them into a prompting program built out of simpler sub-tasks. Drawing inspiration from software libraries, our decomposer and shared sub-tasks are designed in a modular fashion: they use their own few-shot prompts, allowing one to independently optimize each prompt, decompose a sub-task further if necessary, or even seamlessly replace it with a symbolic system. We show that Decomposed Prompting outperforms prior work on four different tasks and generalization settings, establishing it as an effective few-shot paradigm for solving complex tasks.

Acknowledgements

We thank members of the Aristo team at the Allen Institute for AI (AI2) for their constructive feedback and the reviewers for their invaluable suggestions. This work was supported in part by the National Science Foundation under grants IIS2007290.

References

Appendix A Open Domain QA Details

We use HotpotQA in the fullwiki setting where it comes with the associated Wikipedia corpus for open-domain QA. 2WikiMultihopQA and MuSiQue, however, are originally reading comprehension datasets. Questions in 2WikiMultihopQA and MuSiQue are associated with 10 and 20 paragraphs respectively. To turn these datasets into open-domain QA datasets, we create a corpora for each dataset by combining all the paragraphs in the train, dev and test questions. As a result we get a corpus size of 430,225 paragraphs for 2WikiMultihopQA and 139,416 for MuSiQue.

A.2 Hyperparameter Tuning for Open Domain QA

We treat the number of paragraphs to retrieve (KK) in NoDecomp-Ctxt and Decomp-Ctxt models as a hyperparameter. We select it based on a grid search on a set of values to maximize performance on a held out set of 100 questions for each dataset. For NoDecomp-Ctxt, we search K{6,8,10}K\in\{6,8,10\} for GPT3 models and K2,4,6,8K\in{2,4,6,8} for Flan-T5-* models. For Decomp-Ctxt, we search K{2,4,6}K\in\{2,4,6\} for GPT3 and Flan-T5-* models. Note that the ranges are different between GPT3 and Flan-T5-* as GPT3 can fit in more number of tokens. The ranges are different for NoDecomp-Ctxt and Decomp-Ctxt as KK refers to number of paragraphs retrieved in each round of retrieval, and NoDecomp-Ctxt has only one step of retrieval whereas Decomp-Ctxt usually has multiple retrieval steps.

A.3 Additional Results

We present all the results on the MuSiQue dataset in Fig. 13. Across all settings, we can see that retrieval helps substantially (large gains over No-Ctxt QA) with further improvements achieved by our DecomP-based Decomp-Ctxt QA model.

A.3.2 HotpotQA

We present all the results on the HotpotQA dataset in Fig. 14. On this dataset too, we can see large gains by incorporating retrieval but the gains from using DecomP are mostly seen in the smaller models.

A.3.3 2WikiMultihopQA

We present all the results on the 2WikiMultihopQA dataset in Fig. 15. On this dataset, we can see large gains by incorporating retrieval and also observe substantial gains by incorporating DecomP (as compared to NoDecomp-Ctxt).

Appendix B Math QA

We apply Decomposed Prompting to two math QA datasets: GSM8K Cobbe et al. (2021) and MultiArith Roy & Roth (2015). For Chain-of-thought, we used the original prompts for math reasoning Wei et al. (2022). For example:

Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today? A: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.

Most CoT systems Wei et al. (2022); Wang et al. (2023) rely on extracting the answer by finding the number following “answer is”. However, this may not always be accurate. For example, the following CoT would be unanswerable by relying on simple patterns.

Parker chews 4 pieces of gum a day. There are 15 pieces of gum in a pack. So he will need 4 * 30 / 15 = 8 packs of gum to last him 30 days.

Rather than relying on patterns with limited generalization, we can use a language model to extract the answer more reliably. Specifically, we use Decomposed Prompting to decompose the task into first identifying the chain-of-thought reasoning and then using a second GPT3-based sub-module to extract the answer from the CoT. We show examples of our prompts here (full prompt in App. G):

QC: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today? QS: [cot] There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today? A: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6 trees planted. QS: [gpt_ans] There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6 trees planted. A: 6 QS: [EOQ]

Q: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6 trees planted. A: 6

We present our results in Fig. 17. On the GSM8K data setWe randomly sample 300 examples from the test set due to costs with API usage, we outperform CoT by 14 points. On the MultiArith datasetWe randomly sample 200 examples from the test set due to costs with API usage, we achieve a 17 pt improvement compare to CoT. While this is a simple change, it illustrates the possibility of using DecomP for other complex answer types, e.g. non-extractive answer generation from chain-of-thoughts.

Appendix C Effect of Scale on CommAQA

We evaluate text-curie-001, text-davinci-001 and text-davinci-002 on the CommAQA dataset. Since the curie-001 and davinci-001 have a smaller context window size, we further reduced our prompts to fit within their context windows (2048 tokens). As shown in Fig. 17, both CoT and DecomP are effected by the model size.

Appendix D Results on all prompts

We present the results of the letter concatenation task (with space delimiter) for different values of N in Fig. 18. Our results are stable across the different prompts (P1, P2 and P3) and always outperform CoT and Least-to-Most prompting.

D.2 Per-Prompt Results on CommaQA

We also present the results of all the prompts on the CommAQA dataset in Fig. 19. Here too, we can observe that DecomP outperforms CoT on each prompt set.

Appendix E Effect of Decomposition Scheme

To evaluate the effect of the decomposition scheme, we experiment with two other simple decomposition structures for the letter concatenation and reversal tasks.

For letter concatenation, we consider an alternate scheme where we use GPT3 to generate each question rather than loop over the answers, e.g.,

QC: Take the last letters of the words in ”Augusta Ada King” and concatenate them using a space. QS: [split] What are the words in ”Augusta Ada King”? A: [”Augusta”, ”Ada”, ”King”] QS: [str_position] What is the last letter in ”Augusta”? A: ”a” QS: [str_position] What is the last letter in ”Ada”? A: ”a” QS: [str_position] What is the last letter in ”King”? A: ”g” QS: [merge] Concatenate [”a”, ”a”, ”g”] using a space. A: ”a a g” QS: [EOQ]

By using the decomposer prompt model to generate the sub-questions, we can be more robust to formatting issues in the output answers, e.g., we can expect GPT3 to still generate the appropriate sub-questions even if the first answer is not a valid array. However, the generated sub-questions may not correctly use all the elements of the list (change in order, missed element, repeated elements, etc).

For list reversal, instead of splitting into halves, we take the tail of the list, reverse it and then concatenate it to the head. i.e. reverse(list) = reverse(list[1:]) + list. This requires more GPT3 calls (O(n)) compared to the original approach of splitting the list into halves (O(log(n))).

In both these cases, we noticed that the performance did not drop as shown in Fig. 21 and Fig. 21. On the letter concatenation task, the results were exactly the same. The new reversal decomposition schema was actually stronger on longer inputs at the cost of more calls to GPT3 (O(ln(n)) using binary splits vs O(n) one element at a time). Both these decomposition schemes are still better than CoT.

Appendix F Error Analysis

We analyzed the errors in DecomP on the letter concatenation task and only found errors in the sub-task execution.

Q: Take the letters at position 3 of the words in ”Nancy Samina Abbas Caudhari Bano” and concatenate them using a space. A: n m b u n Prediction: c m b u n Error: Incorrect letter extraction (Sub-task) What is at position 3 in ”[(N, 1), (a, 2), (n, 3), (c, 4), (y, 5)]”? \Rightarrow ”c” Q: Take the letters at position 3 of the words in ”Orlando Stephen Cho Teixeira Pierre” and concatenate them using a space. A: l e o i e Prediction: leoie Error: Incorrect concatenation (Sub-task) Concatenate [”l”, ”e”, ”o”, ”i”, ”e”] using a space. \Rightarrow ”leoie”

F.1.2 CoT w/ rollout

We analyzed the errors in CoT on the letter concatenation task and found similar errors during the generation of CoT. But the frequency of these errors was higher than DecomP, as it is not possible to effectively teach each sub-task with CoT.

Q: Take the letters at position 3 of the words in ”Sheila Nicolas Verma Sha Sousa” and concatenate them using a space. A: e c r a u Pred: i c r a u Error: Incorrect letter extraction …The letters and their positions in ”Sheila” are ”[(S, 1), (h, 2), (e, 3), (i, 4), (l, 5), (a, 6)]”. The letter at position 3 in this sequence is ”i”… Q:Take the letters at position 3 of the words in ”Shobha Kailash Nakamura Peter Benitez” and concatenate them using a space. A: o i k t n Pred: o l k t i Error: Incorrect letter extraction …”Benitez” are ”[(B, 1), (e, 2), (n, 3), (i, 4), (t, 5), (e, 6), (z, 7)]”. The letter at position 3 in this sequence is ”i”…

F.2 CommaQA

Similarly in CommaQA, the errors are mostly due to sub-task errors, which in this dataset correspond to answering single-hop questions. CoT also makes the same types of errors but they are more frequent since this QA sub-task can not be delegated to a specialized prompt in CoT. Since all errors are of this type, we show only one example here.

Q: What awards have movies written by people born in 1933 won? A: [”Hydrallium”, ”Pompasole”] Pred: [”Pompasole”] Error: Incorrect sub-question answer Sub-Q:What movies has Haldron written? Sub-A: [”Polytetrafluoromethane”, ”Skia”, ”Warpstone”] Pred: [”Skia”, ”Warpstone”]

Appendix G Task Prompts

We have provided the task prompts for all the datasets for COT and our Decomposed Prompting approach.

Since CoT methods also perform 2-step reasoning: first generate the chain-of-thought and second extract the answer from the CoT, we use the same decomposition-based framework for COT baselines too. For example, consider the following example in our COT prompt:

QC: Take the letters at position 1 of the words in ”Alan Mathison Turing” and concatenate them using a space. QS: [extract] The letter at position 1 of ”Alan” is ”A”. The letter at position 1 of ”Mathison” is ”M”. The letter at position 1 of ”Turing” is ”T”. Concatenating ”A”, ”M”, ”T” using a space leads to ”A M T”. So, ”Alan Mathison Turing” outputs ”A M T”. A: ”A M T” QS: [EOQ]

GPT3 generates the chain-of-thought during the ”decomposition” step and a regex-based answer extractor extract (’.* outputs "(.*)"\.’) then takes this CoT and generates the answer. In some cases, the module name is skipped in the prompt (the CoT is sent to the extractor by default).

In this work, we use the same operators as defined by Khot et al.. Their select operator is just the basic operator that replaces references to an answer index with its answer. When not specified, select is assumed to be the default operator. In addition, we consider two operators in our experiments: project_values and project_values_flat_unique.

project_values: This operator takes a list answer #i=X\#i=X and iterates over it to generate new questions by replacing mentions of #i\#i i.e. Q = [q.replace(#i, x) for x \in X]. The answer to each question is simply concatenated to get the final answer i.e. A = [model(q) for q \in Q]. We refer to this as foreach for simplicity in the main text.

project_values_flat_unique: This operator performs the same steps as project_values but then additionally flattens the list and only returns the unique entities in the flattened list. We refer to this as foreach_merge in the main text for simplicity.

G.1 Letter Concatentation

We show one of the prompts used for experiments here. The entire set of prompts is provided as supplemetary material.

G.1.2 COT with rollout

G.1.3 COT

G.1.4 Least-to-most w/ rollout

G.1.5 Alt DecomP schema (Generate Each Sub-Question)

G.2 Sequence Reversal

The prompts in this section implement Algorithm 1.

G.3 Long-Document QA

We show one of the prompts used for CommaQA experiments here. The entire set of prompts is provided as supplemetary material.

G.3.2 Decomposed Prompting: (fine)

G.3.3 COT

G.4 Shorter Prompts for Smaller Context Windows

G.5 Math QA

The decomposer here deterministically calls cot to generate the CoT and then calls gpt_ans to extract the answer.

G.6 Open Domain QA

The prompts in this section implement Decomposed Prompting approach to open-domain multihop QA. For brevity we’ve included prompts for 5 of 20 randomly sampled questions. The full prompts are attached with the submission and will also be released with the code. Note that we selected a set of 100 questions from the development set to tune the hyperparameter (number of paragraphs to retrieve for all of the retrieval-based approaches).