Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Jianfeng Gao

Introduction

Remarkable progress has been observed in recent large language models (LLMs) for various natural language processing tasks, with prominent examples such as GPT-3, PaLM, LLaMA, ChatGPT, and the recently developed GPT-4. LLMs have demonstrated emergent abilities, including in-context learning and chain-of-thought (CoT) reasoning. These models are capable of solving diverse tasks in a zero-shot fashion or with the aid of a few examples, and they show great potential in planning and decision-making akin to human beings. Despite these capabilities, LLMs face inherent limitations, such as an inability to access up-to-date information, perform precise mathematical reasoning, or utilize specialized models. Therefore, enhancing current LLMs with the capability to automatically compose external tools for real-world task solving is critical to address these drawbacks.

Consider example ② in Figure 1: “Which is the main persuasive appeal used in this ad?” To answer this question, one needs to: 1) infer that there is an ad image containing text context and call a text decoder to understand the semantics; 2) retrieve background knowledge about persuasive appeals and the differences among three persuasive appeals; 3) generate a solution based on the input query and intermediate results from previous steps; and 4) finally produce the answer in a task-specific format. On the other hand, when answering “Which animal’s skin is adapted for survival in cold places?” (③), one might need to call modules such as an image captioner to decipher image information and a web search engine to retrieve domain knowledge to understand scientific terminologies. However, current tool-augmented LLMs still face challenges when addressing these real-world queries across various scenarios. Most existing approaches either are limited to a small number of tools or rely on domain-specific tools, and thus are not easy to generalize to queries in new domains (see sections 2 and A.1 for further discussion). In this work, we study how to enable LLMs to synthesize programs to capture the logic of composing heterogeneous tools.

To address the challenges of existing work, we introduce Chameleon, a plug-and-play compositional reasoning framework that leverages LLMs to synthesize programs and compose various tools for a wide range of tasks. Unlike existing tool-augmented LLMs, Chameleon uses a richer set of tools, including LLMs, off-the-shelf vision models, web search engines, Python functions, and heuristics-based modules. Moreover, Chameleon leverages the in-context learning capabilities of LLMs and builds on an LLM as a natural language planner, without requiring any training or carefully curated rules. Prompted by tool descriptions and usage examples, the planner infers a program composed of a sequence of tools to execute in order to generate the final response for a user query. Instead of generating programs in domain-specific languages, Chameleon generates natural-language-like (NL) programs (e.g., [Text_Detector, Knowledge_Retrieval, Solution_Generator, Answer_Generator] for the second query in Figure 1). The NL-like programs are easy for users with limited programming experience to understand and debug, and are easily extendable to new modules. During each module’s execution, the module processes the query and cached context, returns a result determined by the module itself, and updates the query and context for subsequent execution. Composing modules as a sequential program allows subsequent modules to leverage prior cached context and updated queries.

We showcase the adaptability and effectiveness of Chameleon on two tasks: ScienceQA and TabMWP. ScienceQA is a multi-modal question answering benchmark spanning multiple context formats and various scientific topics, while TabMWP is a mathematical benchmark involving diverse tabular contexts. These two benchmarks serve as a good testbed to evaluate Chameleon’s ability to coordinate diverse tools across different types and domains. Notably, Chameleon with GPT-4 achieves 86.54% accuracy on ScienceQA, significantly improving upon the best published few-shot model by 11.37%. On TabMWP, using GPT-4 as the underlying LLM, Chameleon achieves an improvement of 7.97% over chain-of-thought (CoT) prompted GPT-4 and a 17.0% increase over the best-published model, lifting the state of the art to 98.78%. Further studies suggest that using GPT-4 as a planner exhibits more consistent and rational tool selection and is able to infer potential constraints given the instructions, compared to other LLMs like ChatGPT.

Our contributions are as follows: (1) We develop a plug-and-play compositional reasoning framework, Chameleon, that effectively composes external tools to address inherent limitations of LLMs and tackle a broad range of reasoning tasks. (2) Relying on an LLM as a natural language planner to generate programs, Chameleon successfully integrates various tools, including LLMs, off-the-shelf vision models, web search engines, Python functions, and rule-based modules, to build a versatile and adaptable AI system capable of answering real-world queries. (3) We demonstrate Chameleon’s effectiveness on two challenging benchmarks, significantly surpassing the state of the art.

Related Work

Neural modular and compositional approaches have been explored to automatically perform desired sub-task decomposition, enhancing interpretability and adaptability across various reasoning tasks. Early work posits that complex reasoning tasks are fundamentally compositional and proposes neural module networks (NMN) to decompose them into subtasks. However, these methods rely on brittle off-the-shelf parsers and are limited by module configurations. Some later work takes a step further by predicting instance-specific network layouts in an end-to-end manner, without relying on parsers, using reinforcement learning and weakly supervised learning. In visual reasoning, models comprising a program generator and an execution engine have been proposed to combine deep representation learning and symbolic program execution. In the domain of mathematical reasoning, an interpretable solver has been developed to incorporate theorem knowledge as conditional rules and perform symbolic reasoning step by step. Our work takes inspiration from neural module networks, yet it offers several distinct advantages. First, Chameleon does not require expensive supervision of task-specific programs for model training. Instead, it generates sequential programs, consisting of modules, that are easy to generalize to various domains and tasks, allowing the extension to new modules in a plug-and-play manner. Second, Chameleon does not require any training, but uses the in-context learning capabilities of LLMs to generate programs prompted by natural language instructions and demonstrations.

Tool-Augmented Language Models

In recent years, the development of large language models (LLMs) has made tremendous progress and has stimulated research in prompt learning and instruction learning. Despite the impressive performance of LLMs, they suffer from inherent limitations, such as the inability to access up-to-date information, utilize external tools, or perform precise mathematical reasoning. Recent benchmarks, such as ScienceQA and TabMWP, have emerged to evaluate the capability of LLMs to tackle intricate reasoning challenges, especially those emphasizing the use of external tools. Concurrently, there has been a growing interest in harnessing external tools and modular approaches to augment LLMs. These augmented LLMs can access real-time information aided by web search engines and leverage domain-specific knowledge from external resources. Some work leverages the Python interpreter to generate complex programs to employ powerful computational resources, and execute logical reasoning tasks more effectively. For example, Toolformer constructs tool-use augmented data to train language models to select five tools. In the realm of visual tools, various approaches have been proposed to enhance the capabilities of large language models in handling visual tasks, augmented with Hugging Face models, Azure models, and visual foundation models.

We compare Chameleon with other tool-augmented language models in Table 1. Many of these approaches are either constrained to a small set of tools or limited to task-specific tools, which reduces their capabilities across various skill dimensions and hampers their generalizability to new tasks. A recent line of work relies on large amounts of supervision and focuses on generating commands and programs to infer the choice of tools. However, this approach requires carefully tailoring prompts to specific tasks and particular tools, and is neither flexible nor adaptive. In contrast, Chameleon instructs LLMs with natural language instructions that simply describe the role of each module and provide a few calling examples, eliminating the need for additional training or tool-specific prompts when learning to compose different tools. More importantly, Chameleon offers users flexibility in terms of tool types and sources, updating the underlying LLMs, adding new tools, and adapting to new tasks. Our work shares the spirit of AutoGPT, an autonomous GPT-4 agent with the artificial general intelligence (AGI) ambition to incorporate numerous tools to achieve user-defined goals. While AutoGPT is still under development, our work is the first to instantiate the idea and verify its effectiveness on well-studied benchmarks.

General Framework: Chameleon

To address the limitations of current LLMs in utilizing diverse tools, we propose Chameleon, a novel plug-and-play compositional reasoning framework that synthesizes the composition of various tools to accommodate a wide range of problems. Chameleon comprises a module inventory that defines different types of tools and an LLM-based planner, whose purpose is to decompose the original problem into sub-tasks that can be effectively solved by task-specific tools. Unlike existing tool-augmented LLM approaches, our module inventory features multiple tool types as illustrated in Table 2, enabling Chameleon to exhibit various reasoning abilities, including image understanding, knowledge retrieval, web search, complex mathematical reasoning, and table understanding. Instead of generating domain-specific programs, Chameleon employs an LLM-based planner to create natural-language-like programs that follow natural language instructions, an approach that is less error-prone, easily expandable to new modules, and user-friendly.

We formalize our planner as follows: given the input query $x_{0}$, the module inventory $\mathcal{M}$, and constraints $\mathcal{G}$, the natural language planner $\mathcal{P}$ selects a set of modules that can be executed sequentially to answer the query via generating a program in a natural-language-like format. The module inventory $\mathcal{M}$ consists of a set of pre-built modules $\{M_{i}\}$, each corresponding to a tool of various types (Table 2). $\mathcal{G}$ are the constraints for the plan generation, for example, the concurrent relations and sequence orders of modules. In our work, the planner $\mathcal{P}$ is an LLM prompted to generate a sequence of module names in a few-shot setup. The planner is prompted in natural language with a planning task instruction $\mathcal{I}$, the descriptions of modules in $\mathcal{M}$ with corresponding constraints $\mathcal{G}$, as well as a few demonstration examples $\mathcal{D}$. A $T$-length plan sampled from $\mathcal{P}$ can be denoted as $p=\{M^{1},\dots,M^{T}\}$, where $M^{t}$ represents the $t$-th element in the generated plan and $M^{t}\in\mathcal{M}$. Formally, given an input query (problem statement) $x_{0}$, a plan $p$ is generated as follows:

$$p \leftarrow \mathcal{P}(x_{0};\, \mathcal{I}, \mathcal{M}, \mathcal{G}, \mathcal{D}).$$

Given the generated plan, the corresponding modules for each step are then executed sequentially. The plan is a natural-language program where each module is bound simply via string matching. When evaluating the module $M^{t}$ at time step $t$, the output of the execution $y^{t}$ is calculated by:

$$y^{t} \leftarrow M^{t}(x^{t-1};\, c^{t-1}),$$

where $x^{t-1}$ is the input for the current module $M^{t}$, and $c^{t-1}$ is the cached information (e.g., image semantics, retrieved knowledge, generated programs) resulting from the execution history of modules. Both the problem input $x^{t}$ and cache $c^{t}$ for the next module $M^{t+1}$ are updated, respectively, by:

$$x^{t} \leftarrow \texttt{update\_input}(x^{t-1}, y^{t}), \qquad c^{t} \leftarrow \texttt{update\_cache}(c^{t-1}, y^{t}).$$

The update_input and update_cache functions are hand-designed for each $M_{i}$. Specifically, update_input is applied to elements in the input query, including the question, table context, and image. These elements are updated after module execution. update_cache corresponds to the generation of new information, such as a description for the input image or retrieved knowledge from external resources. Finally, the response $r$ to the query is generated by the last module $M^{T}$:

$$r = y^{T} \leftarrow M^{T}(x^{T-1};\, c^{T-1}).$$
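The sequential execution described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Module` dataclass, `parse_plan`, `run_plan`, and the three toy modules are hypothetical names introduced here, and the toy modules are plain functions standing in for real tools.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]


def keep(state: State, y: Any) -> State:
    """Default update rule: leave the input (or cache) unchanged."""
    return state


@dataclass
class Module:
    run: Callable[[State, State], Any]                  # y^t from (x^{t-1}, c^{t-1})
    update_input: Callable[[State, Any], State] = keep  # x^t from (x^{t-1}, y^t)
    update_cache: Callable[[State, Any], State] = keep  # c^t from (c^{t-1}, y^t)


def parse_plan(plan_text: str) -> List[str]:
    """Turn a planner string such as "[A, B, C]" into a list of module names."""
    inner = plan_text.strip().strip("[]")
    return [name.strip() for name in inner.split(",") if name.strip()]


def run_plan(plan_text: str, inventory: Dict[str, Module], query: State) -> Tuple[Any, State, State]:
    """Execute the planned modules in order, threading input x and cache c."""
    x, c, y = dict(query), {}, None
    for name in parse_plan(plan_text):
        module = inventory[name]  # binding by string matching; unknown names raise KeyError
        y = module.run(x, c)
        x = module.update_input(x, y)
        c = module.update_cache(c, y)
    return y, x, c  # y is the output of the last module, i.e. the response r


# Toy inventory (hypothetical stand-ins for real tools).
inventory = {
    "Row_Lookup": Module(
        run=lambda x, c: [row for row in x["table"] if row[0] in x["question"]],
        update_input=lambda x, y: {**x, "table": y or x["table"]},
    ),
    "Knowledge_Retrieval": Module(
        run=lambda x, c: "Total cost = sum of item prices.",
        update_cache=lambda c, y: {**c, "knowledge": y},
    ),
    "Answer_Generator": Module(
        run=lambda x, c: sum(price for _, price in x["table"]),
    ),
}

query = {
    "question": "How much do pen and ink cost?",
    "table": [("pen", 2), ("ink", 5), ("cap", 9)],
}
r, x_final, c_final = run_plan(
    "[Row_Lookup, Knowledge_Retrieval, Answer_Generator]", inventory, query
)
print(r)  # 7
```

In this sketch, a module that transforms the query (here the toy `Row_Lookup`) supplies its own `update_input`, while a module that produces new information (here the toy `Knowledge_Retrieval`) supplies its own `update_cache`; every other module falls back to the identity update.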

Applications of Chameleon

We demonstrate the applications of Chameleon on two challenging tasks: ScienceQA (section 4.2) and TabMWP (section 4.3), using the module inventory introduced in section 4.1. Further experimental details can be found in appendix A.2.

To accommodate various reasoning capabilities over a diverse range of queries, our system utilizes a rich module inventory of various external tools. We provide a high-level overview of this inventory here, with detailed implementations in specific experiments. The complete module inventory, $\mathcal{M}$ , is presented in Table 2. Each tool within the inventory is defined as follows:

Knowledge Retrieval: This module retrieves additional background knowledge crucial for tackling complex problems. It is especially beneficial for specialized domains like science and mathematics, providing context for the task. For example, if a query is about a tax form table, this module could generate knowledge about tax procedures, offering valuable context.

Bing Search: Like “Knowledge Retrieval”, the “Bing Search” module aims to provide wide-ranging task-relevant knowledge. In contrast, it excels when broader or up-to-date information from multiple sources is required. Using the search engine API, this module returns relevant search results based on the input query, which are then parsed and used by subsequent modules to gather richer context information from diverse sources, enhancing problem-solving effectiveness.

Query Generator: Since the original problem typically lacks a tailored query for retrieving task-relevant information, this module creates search engine queries based on the problem, which are then used by the “Bing Search” module. Mostly, it is a good strategy to use the “Query Generator” module before the “Bing Search”. Coupled with the search engine tool, generating more targeted queries generally facilitates both the recall and precision of retrieved information.

Image Captioner: Designed to generate captions for images, this module provides crucial supplementary context for queries. It is particularly valuable when understanding an image semantically, like identifying objects and interactions in a scene. Using pre-trained models, it translates visual data into language, facilitating effective comprehension and reasoning about image content.

Text Detector: This module is designed to identify text within a given image. Typically, the “Text Detector” is employed when a question requires the extraction of textual information from images containing diagrams, charts, tables, maps, or other visual elements. By effectively detecting text in various formats, this module aids in the analysis and understanding of image-based content.

Row Lookup: This module is crucial when queries involve tabular context, as locating relevant cells is often required. Large tables can distract the system, so “Row Lookup” simplifies the table by retaining only the rows relevant to the query. If all rows are pertinent, it returns the original table.

Column Lookup: Like the “Row Lookup” module, “Column Lookup” addresses questions involving tabular context by focusing on relevant columns. It simplifies the table by retaining only pertinent columns, or returns the original table if all columns are relevant.

Table Verbalizer: Converting structured tables into text is likely to enhance the comprehension of tabular information by various downstream modules, as shown in prior work on open-domain question answering, making this module a vital part of our system. It translates tables into easily understandable descriptions for modules like “Program Generator” and “Solution Generator”, and is particularly useful for small, domain-specific tables like stem-and-leaf plots or function tables.

Program Generator: Program-aided approaches are shown to enhance the logical and mathematical reasoning abilities of LLMs. The “Program Generator” generates Python programs to solve queries effectively, which is particularly beneficial for queries requiring complex computations or intricate logical operations, such as “if-else” statements.

Program Verifier: Recent studies highlight the importance of verification to reduce hallucination. Hence, “Program Verifier” ensures the validity and error-free nature of programs generated by “Program Generator”. It checks for syntax and logical errors, and potential execution issues, enhancing the reliability and accuracy of the solutions.

Program Executor: This module executes the program generated by “Program Generator” and produces the result, bridging the gap between program generation and final solution derivation.

Solution Generator: This module generates a detailed solution to the input query using all the cached information. Employing a chain-of-thought prompting approach, it ensures coherent and well-structured responses. The planner can directly employ this module instead of other functional modules if it can solve the query independently, especially for simpler ones.

Answer Generator: This task-specific module uses a rule-based approach to extract and normalize answers from the results of the “Program Executor” or “Solution Generator”. Unlike the “Solution Generator”, which provides detailed multi-step solutions, “Answer Generator” serves as the final module in the pipeline, providing concise and task-specific answers.
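As one concrete illustration of the inventory above, a “Program Executor” style module could be sketched as follows. This is a hedged sketch, not the paper's code: the function name `execute_program` and the convention that the generated program stores its result in a variable named `ans` are assumptions made here for illustration.

```python
import contextlib
import io
from typing import Any, Optional, Tuple


def execute_program(program: str, result_var: str = "ans") -> Tuple[Any, Optional[str]]:
    """Run generated Python source and return (result, error_message).

    Assumption: the program leaves its answer in `result_var`; if that
    variable is missing, whatever the program printed is returned instead.
    """
    namespace: dict = {}
    stdout = io.StringIO()
    try:
        with contextlib.redirect_stdout(stdout):
            # WARNING: exec runs arbitrary code. Generated programs should be
            # executed in a sandbox in any real deployment.
            exec(program, namespace)
    except Exception as err:
        return None, f"{type(err).__name__}: {err}"
    if result_var in namespace:
        return namespace[result_var], None
    printed = stdout.getvalue().strip()
    return (printed or None), None


result, error = execute_program("price = 3\nquantity = 4\nans = price * quantity")
print(result, error)  # 12 None
```

Returning the error message rather than raising lets a downstream module (or a fallback plan) decide what to do when the generated program fails.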

4.2 Science Question Answering

Science Question Answering (ScienceQA) is a diverse benchmark for multi-modal question answering over a range of scientific topics and contexts. As the examples in Figure 1 illustrate, answering these questions requires various tools and skills like image captioning, text detection, knowledge retrieval, online resource search, and multi-clue visual reasoning. When generating programs for using tools, we limit the search space to the relevant inventory subset (Table 6 in the appendix). Programs are deemed invalid and default to a “Solution Generator” and “Answer Generator” sequence if these are not the final two elements, following the chain-of-thought prompting baseline. See Table 8 in the appendix for the constructed natural language planner prompt. The prompts for LLM-based modules like “Knowledge Retrieval”, “Query Generator”, and “Solution Generator” are shown in Tables 10, 11, and 12, respectively, in the appendix.
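The validity rule for ScienceQA programs can be sketched as a small check. The function name `validate_scienceqa_plan` and the underscore-style module names are illustrative assumptions; only the rule itself (the last two elements must be the solution and answer generators, otherwise fall back to that two-module sequence) comes from the text.

```python
from typing import List

DEFAULT_SCIENCEQA_PLAN = ["Solution_Generator", "Answer_Generator"]


def validate_scienceqa_plan(plan: List[str]) -> List[str]:
    """Return the plan if it ends with the two required modules, else the default."""
    if plan[-2:] == DEFAULT_SCIENCEQA_PLAN:
        return plan
    return list(DEFAULT_SCIENCEQA_PLAN)


print(validate_scienceqa_plan(["Text_Detector", "Knowledge_Retrieval", "Solution_Generator", "Answer_Generator"]))
print(validate_scienceqa_plan(["Image_Captioner", "Answer_Generator"]))  # falls back to the default
```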

4.3 Tabular Mathematical Reasoning

TabMWP is a mathematical reasoning task involving diverse tabular contexts like schedules, prices, tax forms, plots, and function relations (Figure 2). It requires AI systems to understand various table formats and perform precise numerical or symbolic computations. Like ScienceQA, we constrain the program search space to focus on two tool types: 1) those helping LLMs better digest tabular information (e.g., “Row Lookup”, “Column Lookup”, and “Table Verbalizer”) and 2) those performing faithful symbolic computations (e.g., “Program Generator”, “Program Verifier”, and “Program Executor”) as listed in Table 6. The generated programs must meet certain constraints, such as including “Answer Generator” and placing “Program Generator” prior to both “Program Verifier” and “Program Executor”. Non-compliant programs default to a sequence of “Program Generator”, “Program Verifier”, “Program Executor”, and “Answer Generator”, aligning with the program-of-thought prompting baseline with added verification.
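The TabMWP constraints above lend themselves to a similar check. In this sketch the function names and underscore-style module names are hypothetical, and one detail is an interpretation rather than something the text states: a plan that uses “Program Verifier” or “Program Executor” without any “Program Generator” is treated as non-compliant.

```python
from typing import List

DEFAULT_TABMWP_PLAN = [
    "Program_Generator", "Program_Verifier", "Program_Executor", "Answer_Generator",
]


def is_valid_tabmwp_plan(plan: List[str]) -> bool:
    """Check the two stated constraints on a generated TabMWP program."""
    if "Answer_Generator" not in plan:
        return False
    for dependent in ("Program_Verifier", "Program_Executor"):
        if dependent in plan:
            # Assumption: the generator must exist and come earlier.
            if "Program_Generator" not in plan:
                return False
            if plan.index("Program_Generator") > plan.index(dependent):
                return False
    return True


def validate_tabmwp_plan(plan: List[str]) -> List[str]:
    """Return the plan if compliant, otherwise the default program-based sequence."""
    return plan if is_valid_tabmwp_plan(plan) else list(DEFAULT_TABMWP_PLAN)


print(validate_tabmwp_plan(["Row_Lookup", "Solution_Generator", "Answer_Generator"]))
print(validate_tabmwp_plan(["Program_Executor", "Program_Generator", "Answer_Generator"]))
```

The first call returns its input unchanged; the second violates the ordering constraint and is replaced by the default sequence.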

Experiments

We assess Chameleon’s effectiveness and adaptability on two complex reasoning tasks, ScienceQA and TabMWP. See experimental details in appendix A.2.

ScienceQA. Table 3 presents the results of existing baselines and our approach Chameleon, with key results highlighted in Figure 3 (a). Employing ChatGPT as the base LLM, Chameleon achieves 79.93% accuracy, a 1.62% improvement over Chain-of-Thought (CoT) prompted ChatGPT. Notably, Chameleon is a generalized form of CoT: CoT corresponds to the special case in which the generated program is the sequence of “Solution Generator” and “Answer Generator”. Chameleon benefits from additional tool usage, such as “Knowledge Retrieval”, “Bing Search”, “Image Captioner”, and “Text Detector”. When built upon GPT-4, our model attains an accuracy of 86.54%, outperforming GPT-4 CoT by 2.55% and GPT-3 CoT by 11.37%, creating the new state of the art in few-shot settings.

TabMWP. Table 4 presents results with key models in Figure 3 (b). Similarly, significant improvements are observed for Chameleon over both fine-tuned and few-shot models. It is worth noting that both CoT and Program-of-Thought (PoT) can be viewed as special cases of Chameleon. Apart from “Solution Generator” and “Answer Generator”, CoT does not utilize any tool, while PoT only relies on symbolic programming tools like “Program Generator” and “Program Executor”. Chameleon (ChatGPT) outperforms ChatGPT CoT and ChatGPT PoT by 11.25% and 3.79%, respectively, emphasizing the advantage of our enriched tool set. With GPT-4, Chameleon gains an additional 5.50%, reaching 98.78% accuracy. Notably, Chameleon (GPT-4) surpasses Codex PoT-SC, the best-published model, by 17.0% and human performance by 8.56%.

5.2 Qualitative Analysis

The proportions of key tools called in the programs from Chameleon on ScienceQA and TabMWP are visualized in Figure 4 and Figure 5, respectively. Interestingly, ChatGPT and GPT-4 exhibit different planning behaviors. Generally, ChatGPT has a strong bias toward using or not using certain tools, highly influenced by in-context examples. For instance, ChatGPT calls “Knowledge Retrieval” in 72% of queries but only calls “Bing Search” in 3% of cases on ScienceQA; on TabMWP, ChatGPT heavily relies on “Row Lookup” (47%) but calls “Column Lookup” less frequently (4%). However, GPT-4 acts more objectively and rationally in tool selection. For example, GPT-4 calls “Knowledge Retrieval” more frequently (81% vs. 72%) and calls “Bing Search” more than ChatGPT (11% vs. 3%) when answering scientific questions on ScienceQA. Impressively, GPT-4 consistently calls “Query Generator” and “Bing Search” simultaneously by observing the tool usage descriptions, while ChatGPT lacks such reasoning capability.
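The tool-call proportions discussed above are the kind of statistic that can be computed directly from the generated programs. The sketch below uses a hypothetical helper, `tool_call_rates`, and made-up toy programs (not the paper's data); it counts, for each tool, the fraction of programs that call it at least once.

```python
from collections import Counter
from typing import Dict, List


def tool_call_rates(programs: List[List[str]]) -> Dict[str, float]:
    """Fraction of programs in which each tool appears at least once."""
    counts: Counter = Counter()
    for program in programs:
        counts.update(set(program))  # count each tool once per program
    return {tool: n / len(programs) for tool, n in counts.items()}


# Toy programs for illustration only.
toy_programs = [
    ["Knowledge_Retrieval", "Solution_Generator", "Answer_Generator"],
    ["Query_Generator", "Bing_Search", "Solution_Generator", "Answer_Generator"],
    ["Knowledge_Retrieval", "Solution_Generator", "Answer_Generator"],
    ["Solution_Generator", "Answer_Generator"],
]
rates = tool_call_rates(toy_programs)
print(rates["Knowledge_Retrieval"], rates["Bing_Search"])  # 0.5 0.25
```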

Ablation study with disabled modules.

We study the accuracy decline of Chameleon when key modules in the generated programs are disabled (Table 5), using ChatGPT as the underlying LLM and 500 test examples. The results reveal that “Knowledge Retrieval” plays a vital role in both tasks. Domain-specific tools, such as the search engine and vision models for ScienceQA, and program tools for TabMWP, also prove to be important.

Module transitions.

We visualize the transition graphs of modules for generated programs by Chameleon (GPT-4) on ScienceQA and TabMWP in Figures 7 and 8, respectively. The transition probabilities in these graphs are computed from the tool transitions observed on the test sets. These graphs show that the GPT-4 planner is able to make good decisions on how to sequence tools in a few-shot setup. For example, on ScienceQA, Chameleon often decides to rely on either “Knowledge Retrieval” or “Bing Search”, but rarely both. On TabMWP, we observe two main modes: either going through the solution generator module or via the program generator, verifier, and executor.
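Transition probabilities of this kind can be estimated by counting consecutive module pairs across programs and normalizing per source module. The following is a minimal sketch under that assumption; the function name, the artificial `START` node, and the toy programs are all illustrative rather than taken from the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def transition_probabilities(programs: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next module | current module) from observed programs."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for program in programs:
        path = ["START", *program]  # START marks the beginning of every program
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }


# Toy programs for illustration only.
toy_programs = [
    ["Knowledge_Retrieval", "Solution_Generator", "Answer_Generator"],
    ["Query_Generator", "Bing_Search", "Solution_Generator", "Answer_Generator"],
    ["Knowledge_Retrieval", "Solution_Generator", "Answer_Generator"],
    ["Solution_Generator", "Answer_Generator"],
]
probs = transition_probabilities(toy_programs)
print(probs["START"])
```

Each row of the resulting table sums to one, so it can be drawn directly as edge weights of a transition graph.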

5.3 Case Study

Examples from Chameleon (GPT-4) on ScienceQA are visualized in Figure 1. Chameleon (GPT-4) is able to adapt to different input queries by generating programs that compose various tools and executing them sequentially to obtain accurate responses. For instance, to answer the first question (①), What is the direction of this push?, the system calls the image captioner model to extract semantic information from the image and employs the knowledge retrieval model to gather background knowledge for multi-modal reasoning. In the second example (②), the natural language planner infers that a text detector tool is needed to understand the context of the ad. The third query (③; more details provided in Figure 9 in the appendix), Which animal’s skin is adapted for survival in cold places?, involves scientific terminology related to animal survival. The planner decides to call the Bing search engine to access domain-specific knowledge, benefiting from the numerous online resources.

Visualization examples of TabMWP.

The adaptability and versatility of Chameleon for various queries are also observed on TabMWP, as illustrated in the examples in Figure 2. The first example (①) involves mathematical reasoning on a tax form. Chameleon (1) calls the knowledge retrieval model to recall basic knowledge that assists in understanding this domain-specific table, (2) describes the table in a more readable natural language format, and (3) finally relies on program-aided tools to perform precise computations. In the second example (②), the system generates Python code that closely aligns with the background knowledge provided by the knowledge retrieval model. The third example (③) requires the system to locate the cell in a large tabular context given the input query. Chameleon calls the row lookup model to help accurately locate the relevant rows and generate the language solution via an LLM model, instead of relying on program-based tools.

Failure cases and limitations. Failure examples from Chameleon (GPT-4) are illustrated in Tables 19 to 24 in the appendix. Inaccurate responses may arise from the limitations of the current modules or from suboptimal programs generated by the planner. Additionally, the module inventory may lack tools capable of addressing specific abilities. Future directions could involve upgrading the modules and the planner, or expanding the module inventory to support a broader range of capabilities. Further limitations and broader impacts are respectively discussed in sections B and C of the appendix.

5.4 Error Analysis

To examine the error sources of the base large language models and understand how our model reduces mistakes from different aspects, we conduct an error analysis, as shown in Figure 6. We select 50 mistake examples from the ChatGPT baseline on ScienceQA as the evaluation set. We count the number of mistake examples and analyze their corresponding mistake type categories for ChatGPT, our Chameleon (ChatGPT) approach, and Chameleon (GPT-4).

The results show that our Chameleon approach can substantially reduce the number of mistakes compared to ChatGPT. Our model features tools for image captioning and knowledge retrieval; as a result, the mistakes made by ChatGPT in the category of image understanding are reduced from 32 to 10 and 19 by Chameleon (ChatGPT) and Chameleon (GPT-4), respectively, while the mistakes in the category of knowledge understanding are reduced from 37 to 6 and 3, respectively. Benefiting from the sequential execution of tools, the mistakes caused by solution generation are significantly reduced as well. Additionally, we find that the task planning of GPT-4 outperforms ChatGPT by a large margin.

Conclusion

In conclusion, we introduce a novel plug-and-play compositional reasoning framework, Chameleon, that addresses the limitations of current large language models by augmenting them with external tools in a plug-and-play manner. Our approach employs a diverse set of tools and demonstrates impressive adaptability and effectiveness on two challenging benchmarks, ScienceQA and TabMWP. By achieving significant improvements in accuracy over existing state-of-the-art models, Chameleon showcases its potential for addressing real-world queries across various domains.

Acknowledgment

We would like to thank Chunyuan Li, Qiuyuan Huang, and other members of the Deep Learning group at Microsoft Research for their valuable discussions. We also thank Fan Yin from University of California, Los Angeles, and Mingyang Sun from University of Electronic Science and Technology of China for their thorough review of our paper and constructive feedback. Pan Lu’s research for this work was financially supported by Microsoft during his visit at Microsoft Research, and was also partially supported by the Amazon PhD Fellowship, Bloomberg PhD Fellowship, Qualcomm Innovation Fellowship, and UCLA Dissertation Year Fellowship. Kai-Wei Chang was supported by ONR grant N00014-23-1-2780 and as a Sloan Fellow.


Appendix A

To address the limitations of LLMs, an active research direction involves augmenting language models with access to external tools and resources, as well as exploring the integration of external tools and plug-and-play modular approaches. For example, aided by web search engines and external knowledge resources, LLMs are able to access real-time information and leverage domain-specific knowledge. To enhance mathematical reasoning abilities, recent work uses LLMs to generate complex programs to exploit powerful computational resources, and execute logical reasoning tasks more effectively. Another line of recent work, such as ViperGPT, Visual ChatGPT, VisProg, and HuggingGPT, incorporates a collection of foundation computer vision models to equip LLMs with the abilities to perform visual reasoning tasks.

A.2 Experimental Details

The inventory subsets for ScienceQA and TabMWP are shown in Table 6.

Planner implementations.

We choose the gpt-3.5-turbo engine for ChatGPT and the gpt-4 engine for GPT-4 when constructing the LLM-based planner. The maximum length for generated programs is set to 128, and the temperature is set to 0 for the most deterministic generation. The planner prompts for ScienceQA and TabMWP are illustrated in Table 8 and Table 9, respectively.
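To make the planner setup more concrete, the sketch below assembles a few-shot planner prompt from the ingredients named in the main text (task instruction, module descriptions, demonstrations, and the new query). The layout, the labels `Question:` and `Program:`, the function name, and the configuration keys are assumptions made for illustration; the actual prompts are the ones in Tables 8 and 9, and reading the 128 limit as a completion-token cap is likewise an assumption.

```python
from typing import Dict, List, Tuple

# Settings taken from the text; the key names are hypothetical.
PLANNER_CONFIG = {"engine": "gpt-4", "temperature": 0, "max_tokens": 128}


def build_planner_prompt(
    instruction: str,
    module_descriptions: Dict[str, str],
    demonstrations: List[Tuple[str, List[str]]],
    question: str,
) -> str:
    """Concatenate instruction, tool descriptions, demos, and the new query."""
    lines = [instruction, "", "Available modules:"]
    lines += [f"- {name}: {desc}" for name, desc in module_descriptions.items()]
    for demo_question, demo_plan in demonstrations:
        lines += ["", f"Question: {demo_question}", f"Program: [{', '.join(demo_plan)}]"]
    lines += ["", f"Question: {question}", "Program:"]
    return "\n".join(lines)


prompt = build_planner_prompt(
    instruction="Select modules from the inventory and list them in execution order.",
    module_descriptions={
        "Text_Detector": "Detects text in the image.",
        "Solution_Generator": "Writes a step-by-step solution.",
        "Answer_Generator": "Extracts the final answer.",
    },
    demonstrations=[
        ("Which is the main persuasive appeal used in this ad?",
         ["Text_Detector", "Solution_Generator", "Answer_Generator"]),
    ],
    question="What is the direction of this push?",
)
print(prompt)
```

The prompt ends with an open `Program:` line so that the model's completion is the module sequence itself, which can then be parsed and bound to modules by name.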

Module implementations for ScienceQA.

By default, the LLM-based models use four in-context examples as demonstrations, have a temperature setting of 0, and allow a maximum of 512 tokens for completion. Additional specific implementation details are provided as follows:

Knowledge Retrieval: The prompt consists of 3 demonstration examples and the template is shown in Table 10.

Query Generator: The prompt template is shown in Table 11. The maximum number of tokens for completion is set as 64.

Solution Generator: The prompt consists of 2 demonstration examples and the template is shown in Table 12.

Image Captioner: We use the captioning model (https://huggingface.co/nlpconnect/vit-gpt2-image-captioning) to generate textual descriptions for input images. The maximum length of generated captions is set to 16, the number of beams is 4, and the maximum number of output tokens is 512.

Text Detector: This module is based on the EasyOCR model from GitHub (https://github.com/JaidedAI/EasyOCR) to extract the text contents with coordinates in the image.

Bing Search: This module calls the Bing Search API (https://www.microsoft.com/bing) and returns the top three responses for the text query.

Answer Generator: This module extracts the answer snippet from the result provided by the “Solution Generator” and selects the most similar option from the given choices.
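The option-matching step of the “Answer Generator” can be illustrated with a simple string-similarity rule. The text does not specify the similarity measure, so this sketch assumes the character-level ratio from Python's standard `difflib`; the function name is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def most_similar_option(snippet: str, options: List[str]) -> int:
    """Return the index of the option most similar to the extracted answer snippet."""
    def score(option: str) -> float:
        return SequenceMatcher(None, snippet.lower(), option.lower()).ratio()

    return max(range(len(options)), key=lambda i: score(options[i]))


options = ["logos (reason)", "ethos (character)", "pathos (emotion)"]
print(most_similar_option("ethos (character)", options))  # 1
```

Matching against the given choices guarantees that the final output is always one of the valid options, even when the generated solution phrases the answer slightly differently.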

Module implementations for TabMWP.

Similar to ScienceQA, the LLM-based modules by default use four in-context examples as demonstrations, have a temperature setting of 0, and allow a maximum of 512 tokens for completion. Additional implementation details are provided as follows:

Knowledge Retrieval: The prompt consists of 5 demonstration examples and the template is shown in Table 13.

Row Lookup: It is enabled only when there are more than three rows and 18 table cells, in order to accelerate inference. The prompt consists of 7 demonstration examples and the template is shown in Table 14. The maximum number of tokens for completion is set as 256.

Column Lookup: Similarly, this module is enabled only when the table has two or more columns and 18 or more table cells. The prompt consists of 6 demonstration examples and the template is shown in Table 15. The maximum number of tokens for completion is set as 256.

Table Verbalizer: The prompt consists of 7 demonstration examples and the template is shown in Table 16.

Program Generator: The prompt template is shown in Table 17. The maximum number of tokens for completion is set as 256.

Solution Generator: The prompt consists of 16 demonstration examples and the template is shown in Table 18.

Answer Generator: It is used to normalize answers with two-place precision for questions with numerical answers and select the most similar option for multiple-choice questions.
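The rule-based parts of the TabMWP modules, namely the size thresholds that enable the lookup modules and the two-place normalization of numerical answers, can be sketched as below. All function names are hypothetical. The row threshold is read here as "more than three rows and more than 18 cells", which is one possible interpretation of the description, and the stripping of commas and dollar signs is an added assumption about answer formats.

```python
def should_run_row_lookup(num_rows: int, num_cells: int) -> bool:
    """Enable Row Lookup only for larger tables (assumed strict thresholds)."""
    return num_rows > 3 and num_cells > 18


def should_run_column_lookup(num_cols: int, num_cells: int) -> bool:
    """Enable Column Lookup with two or more columns and 18 or more cells."""
    return num_cols >= 2 and num_cells >= 18


def normalize_numeric_answer(raw: str) -> str:
    """Format a numerical answer with two decimal places.

    Non-numeric answers are returned unchanged apart from trimming.
    """
    # Assumption: answers may carry thousands separators or a currency sign.
    cleaned = raw.replace(",", "").replace("$", "").strip()
    try:
        return f"{float(cleaned):.2f}"
    except ValueError:
        return raw.strip()
```

Skipping the lookup modules on small tables avoids extra LLM calls where the full table already fits comfortably in the prompt.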

Implementations of update_input and update_cache.

update_input is triggered by the execution of specific tools, such as ‘Row_Lookup’, which alter or replace elements in the input to reflect the updated state. Other tools, such as ‘Image_Captioner’, ‘Text_Detector’, ‘Knowledge_Retrieval’, ‘Web_Search’, and ‘Program_Generation’, generate new elements. update_cache stores these new elements in the cache, making them accessible to tools executed later.
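The interplay between update_input and update_cache during program execution can be sketched as follows. This is an illustrative reconstruction, not the released implementation: the function signatures, the run_program driver, the "table" input key, and the grouping of module names into two sets are assumptions based only on the description above.

```python
from typing import Any, Callable

Module = Callable[[dict, dict], Any]

# Module names come from the text; the text lists 'Row_Lookup' only as an
# example of an input-updating tool, so this set is likely incomplete.
INPUT_UPDATING = {"Row_Lookup"}
CACHE_UPDATING = {
    "Image_Captioner", "Text_Detector", "Knowledge_Retrieval",
    "Web_Search", "Program_Generation",
}


def update_input(inputs: dict, key: str, value: Any) -> dict:
    """Return a copy of the input with one element replaced."""
    return {**inputs, key: value}


def update_cache(cache: dict, module_name: str, output: Any) -> dict:
    """Return a copy of the cache with a newly generated element added."""
    return {**cache, module_name: output}


def run_program(program: list[str], modules: dict[str, Module], inputs: dict):
    """Execute modules in order, threading the input and cache through."""
    cache: dict = {}
    output = None
    for name in program:
        output = modules[name](inputs, cache)
        if name in INPUT_UPDATING:
            # Assumption: Row_Lookup replaces the table element of the input.
            inputs = update_input(inputs, "table", output)
        elif name in CACHE_UPDATING:
            cache = update_cache(cache, name, output)
    return inputs, cache, output
```

Returning fresh dictionaries instead of mutating in place keeps each module's view of the state easy to inspect when debugging a generated program.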

A.3 Experimental Results

Chameleon utilizes the LLM-based natural language planner to generate programs, i.e., sequences of modules (tools). We report the number of unique generated programs and the average length of the corresponding tool sequences in Table 7. On both ScienceQA and TabMWP, using GPT-4 as the base LLM generates fewer distinct programs, i.e., more consistent programs, than using ChatGPT, even when given the exact same prompt in the planning model. Our results are consistent with prior findings that GPT-4 has a superior capability of understanding long contexts, aligning with human instructions, and performing high-level reasoning compared to other LLMs such as ChatGPT.
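The two statistics discussed above can be computed from a list of generated programs as in the sketch below. The function name program_statistics is hypothetical, and the average is assumed to be taken over all generated programs rather than over the distinct ones, since the text does not say which.

```python
from collections import Counter


def program_statistics(programs: list[list[str]]) -> tuple[int, float]:
    """Return (number of unique programs, average program length)."""
    # Programs are ordered module sequences, so tuples serve as hashable keys.
    counts = Counter(tuple(program) for program in programs)
    # Assumption: the mean is over every generated program, duplicates included.
    average_length = sum(len(program) for program in programs) / len(programs)
    return len(counts), average_length
```

For example, three generated programs of which two are identical two-module sequences and one has three modules yield 2 unique programs and an average length of 7/3.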

Appendix B Limitations

While Chameleon represents a significant stride in exploiting large language models (LLMs) for compositional reasoning in a plug-and-play manner, there are a few areas that could benefit from further refinement. One such area is the expansion of its adaptability to a wider variety of tasks and domains, beyond the benchmarks presented. The LLM-based planner, responsible for synthesizing programs and determining the sequence of tools, introduces an innovative approach, yet it also raises research questions about how to optimize tool selection and sequencing. In the current system design, the quality of the LLM-based planner could plausibly impact overall performance. Moreover, Chameleon generates the program in one step, without incorporating a re-planning mechanism as the modules in the program are processed. Furthermore, we assume that the list of modules and their descriptions will fit within the context window of LLMs, which may not always be the case. As the task complexity increases and the module inventory expands, there might be a corresponding surge in computational demands or limitations due to the context limit, indicating potential areas for future optimization. These potential areas for enhancement do not detract from the paper’s central achievements; instead, they provide valuable directions for future work and research.

Appendix C Broader Impacts

The work presented in this paper, Chameleon, has significant potential for positive societal impact. By augmenting large language models (LLMs) with plug-and-play modules for compositional reasoning, Chameleon can provide more accurate responses to complex, multi-modal tasks, making it a potentially valuable framework for various applications, including but not limited to education, finance, and decision support systems. Additionally, the system’s ability to synthesize programs without requiring any training could democratize access to AI technology, enabling non-experts to leverage the power of AI in diverse fields. As research continues to advance in large language models and tool integration, we anticipate that our framework will serve as a foundation for further innovations in pursuing more generalizable and efficient solutions to complex reasoning tasks.

While there might be negative societal impacts associated with Chameleon, such as misinformation and privacy concerns if the data sources and external tools it utilizes are not curated meticulously, we believe these risks can be carefully managed and minimized. There is also a risk that excessive reliance on Chameleon’s increased autonomy may undermine critical thinking skills or job functions. To effectively mitigate these issues, careful curation of data sources and external tools, along with a strong commitment to user data protection, is essential. Additionally, Chameleon’s autonomy should be viewed as a means to augment, not replace, human capabilities. Therefore, the development of robust ethical guidelines, transparency mechanisms, and safeguards is critical, underscoring our commitment to the socially responsible deployment of AI.