Large Language Models as Tool Makers

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, Denny Zhou

Introduction

Large language models (LLMs) have demonstrated outstanding capabilities across a broad array of NLP tasks (Brown et al., 2020; Chowdhery et al., 2022; Zhang et al., 2022; Hoffmann et al., 2022; OpenAI, 2023; Google, 2023) and have even shown promising signs of achieving certain aspects of artificial general intelligence (Bubeck et al., 2023; Kosinski, 2023). Moreover, analogous to the evolution of human intelligence, recent research has unveiled the potential of augmenting LLMs with external tools, thereby significantly enhancing their problem-solving capacities and efficiencies (Yao et al., 2023; Liu et al., 2023; Parisi et al., 2022; Schick et al., 2023).

However, the applicability of these tool-using methods is largely contingent on the availability of suitable tools. In human evolution, a crucial turning point was acquiring the ability to fabricate tools to address emerging challenges. Inspired by the importance of tool-making for humans, in this work, we embark on an initial exploration to apply this evolutionary concept to the realm of LLMs. We propose a closed-loop framework, which we term LLMs As Tool Makers (LATM), that enables LLMs to generate their own reusable tools to tackle new tasks. Our approach comprises two key stages: 1) tool making: an LLM, known as the tool maker, designs tools (implemented as Python functions) specifically for a given task. 2) tool using: another LLM, referred to as the tool user, which can be the same as the tool maker, applies the tools to handle new requests. The two-stage design allows LATM to allocate jobs in each stage to the most suitable LLM. Specifically, the tool-making process, which requires a high degree of capability, can be assigned to a powerful albeit resource-intensive model (e.g., GPT-4). On the other hand, the tool-using process, which is comparatively simpler, can be assigned to a lightweight and cost-effective model (e.g., GPT-3.5 Turbo). This approach not only enhances the problem-solving capabilities of LLMs, but also significantly reduces the average computational cost of addressing a series of tasks.

As the tool-making process needs to be executed only once for a given functionality, the resulting tools can be reused across different task instances. This approach paves the way for a scalable and cost-efficient solution for handling complex tasks. For instance, consider a task where a user asks the LLM to schedule a meeting that works for everyone (e.g., in email conversations). Lightweight models like GPT-3.5 Turbo often struggle with such tasks that involve complex arithmetic reasoning. In contrast, more powerful models (e.g., GPT-4) can find the correct solutions, albeit at a much higher inference cost. LATM overcomes these hurdles by employing a powerful yet expensive model as the tool maker and passing the resulting tool to a cost-effective model, the tool user, for subsequent usage. After the tool has been forged, the lightweight tool user can use it to solve the task efficiently with high performance. This paradigm can similarly be applied to recurring tasks in various workflows, such as parsing and analyzing web documents into specific data formats, formulating routing plans that satisfy several custom requirements, or solving popular games like the 24-game or Sudoku. Furthermore, we introduce another lightweight LLM, the dispatcher, which determines whether an incoming problem can be solved using existing tools or if a new tool needs to be created. This adds an additional layer of dynamism to our framework, enabling real-time, on-the-fly tool-making and usage.

Our experiments validate the effectiveness of this approach on a range of complex reasoning tasks, including several challenging Big-Bench tasks (Srivastava et al., 2022). The results show that LATM can achieve performance on par with more resource-intensive models while being more cost-effective. This novel approach to LLMs, which mimics the evolutionary leap of humans in creating and using tools, opens up exciting possibilities for a growing community with LLM-generated tools.

Related Work

Recently, significant progress has been made in enhancing the problem-solving abilities of large language models (LLMs) for complex tasks. For instance, chain-of-thought (CoT) prompting (Wei et al., 2022; Wang et al., 2022) has been proposed to bolster LLM reasoning capabilities, demonstrating improved performance across various reasoning and natural language processing tasks. CoT is typically articulated in natural language (Ling et al., 2017; Cobbe et al., 2021; Suzgun et al., 2022; Shi et al., 2022; Zhou et al., 2022), yet it can also be effectively represented using programming languages (Amini et al., 2019; Austin et al., 2021; Nye et al., 2021; Chowdhery et al., 2022; Gao et al., 2023; Chen et al., 2022). More recently, Arora et al. (2023) proposed using LLMs to generate structured views over documents, balancing quality and cost by ensembling extractions from multiple synthesized functions. Our method shares a similar spirit with Arora et al. (2023) in managing cost and quality trade-offs but focuses on more general use cases.

Augmenting language models with tools.

Recent works have explored the potential of using external tools to supplement LLMs’ capabilities for complex tasks. Yao et al. (2023) and Yang et al. (2023) proposed augmenting reasoning traces with task-specific actions in LLMs, enabling models to reason and act synergistically. Various studies (Liu et al., 2023; Parisi et al., 2022; Schick et al., 2023; Shen et al., 2023; Lu et al., 2023; Paranjape et al., 2023; Liang et al., 2023) have demonstrated that supplementing LLMs with tools, such as calculators, search engines, translation systems, calendars, or even API calls on other models, can help solve tasks that are not easily addressed by LLMs alone.

Similar to LATM, methods like Chameleon (Lu et al., 2023) also incorporate Python executors in the pipeline. However, their primary focus is on using Python executors to accurately solve sub-steps involving arithmetic reasoning, similar to Gao et al. (2023) and Chen et al. (2022). In contrast, we use Python executors to create reusable tools for addressing other task instances. Furthermore, the separation of the tool maker and tool user enables the use of a lightweight model for most inferences, thus enhancing efficiency and cost-effectiveness in LATM.

Adaptive generation in language models.

In addition, recent research has proposed methods to adaptively control decoding in LLMs to improve text generation efficiency (Leviathan et al., 2022; Chen et al., 2023a; Xia et al., 2023). Speculative decoding is based on the notion that generating text tokens (a more expensive process) can be expedited with a faster yet less powerful model while approximating the performance of larger, costlier models by using them to score generated tokens (a much faster process). Our approach of passing tools from a more expensive model to a smaller, faster model also shares a similar spirit of adaptive computing. Instead of altering the decoding procedure, we transfer newly generated tools between models to boost both the performance and efficiency of an LLM in solving tasks.

Language model cascades.

There is recent evidence that LLMs can enable repeated interactions and that multiple LLMs can be combined to extend their capabilities further (Wu et al., 2022; Zhou et al., 2022; Dohan et al., 2022; Chen et al., 2023c). Also, Chen et al. (2023b) demonstrated that identifying optimal LLM combinations can help reduce costs while improving accuracy. Our motivation aligns with these findings; however, rather than merely cascading LLMs, we identify task categories that can be better addressed using new tools generated by a larger model and assign each individual inference within that task category to a smaller model.

LLM as Tool Maker (LATM)

In the LATM paradigm, the main process can be split into two stages: Tool Making and Tool Using. Each stage utilizes different types of Large Language Models (LLMs) to balance performance and cost-effectiveness. All the prompts used in our experiments are shown in Appendix B.

Tool Making.

This stage employs a powerful yet more expensive model, such as GPT-4, to serve as the tool maker. The tool maker’s role is to create a generic and reusable tool (implemented as a Python function) from a few demonstrations of a task. This stage can be further divided into three sub-stages:

Tool Proposing: In this stage, the tool maker attempts to generate a Python function to solve the demonstrations from the given task. This process follows the “programming by example” (PbE) paradigm (Halbert, 1984), where several concrete demonstrations are provided, and the model is required to write programs that produce the demonstrated behaviors. In our experiments, we use $3$ demonstrations for this stage. If the proposed tool is unexecutable or encounters errors, the tool maker appends the error messages to the history and makes another attempt.

Tool Verification: In this stage, the tool maker generates unit tests using validation samples and subsequently executes these tests on the proposed tool. We utilize $3$ validation samples in our experiments. If the tool fails any of these tests, the tool maker records the error in its history and attempts to rectify the issues within the unit tests (this procedure only corrects the function calls in the unit tests and does not correct the function itself). The ability of LLMs to self-debug has been demonstrated effectively in recent research (Madaan et al., 2023; Chen et al., 2023c; Lu et al., 2023). However, within the LATM pipeline, the verification stage serves a slightly different purpose. This stage fulfills two key roles: 1) it provides examples that demonstrate how to convert natural language questions into function calls, and 2) it verifies the tool’s reliability, enabling the entire process to be fully automated.

Tool Wrapping: If execution or verification fails more times than a preset threshold, the Tool Making stage is considered to have failed. Otherwise, the tool maker prepares the wrapped tool for the tool user. This step involves wrapping up the function code and providing demonstrations of how to convert a task into a function call. These demonstrations are extracted from the Tool Verification step, which converts questions into unit tests. This final product is then ready for use by the tool user. Please see Appendix C for examples of the wrapped tools.
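The three sub-stages above can be sketched as a retry loop. This is a minimal illustration under stated assumptions, not the prompts or code used in our experiments: `make_tool`, the `propose` and `write_tests` callables (stand-ins for LLM calls), and the dictionary layout of the wrapped tool are all hypothetical names, and the proposing step here only checks that the code executes.

```python
from typing import Callable, Dict, List, Optional


def make_tool(
    propose: Callable[[List[str]], str],
    write_tests: Callable[[str, List[str]], str],
    max_retries: int = 3,
) -> Optional[Dict[str, str]]:
    """Propose a tool, verify it with unit tests, and wrap it.

    Returns None when either stage exhausts its retries.
    """
    history: List[str] = []  # error messages fed back to the (hypothetical) LLM

    # Tool proposing: retry until the proposed code executes without error.
    tool_code = None
    for _ in range(max_retries):
        candidate = propose(history)
        try:
            exec(candidate, {})
            tool_code = candidate
            break
        except Exception as err:
            history.append(f"Proposal error: {err!r}")
    if tool_code is None:
        return None

    # Tool verification: only the unit tests are regenerated on failure;
    # the function itself is left unchanged.
    for _ in range(max_retries):
        tests = write_tests(tool_code, history)
        namespace: dict = {}
        try:
            exec(tool_code, namespace)
            exec(tests, namespace)
        except Exception as err:
            history.append(f"Test error: {err!r}")
            continue
        # Tool wrapping: function code plus the passing tests as demonstrations.
        return {"function": tool_code, "demonstrations": tests}
    return None


# Stub "LLMs" for illustration: the first proposal has a syntax error.
def fake_propose(history: List[str]) -> str:
    if not history:
        return "def add(a, b) return a + b"
    return "def add(a, b):\n    return a + b"


def fake_tests(tool_code: str, history: List[str]) -> str:
    return "assert add(1, 2) == 3"


wrapped = make_tool(fake_propose, fake_tests)
```

In this sketch the stub proposer fails once and succeeds on the second attempt, so `wrapped` holds the corrected function together with the unit test that serves as a demonstration.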

Tool Using.

This second stage involves a lightweight and cost-effective model, such as GPT-3.5 Turbo, to serve as the tool user. The tool user’s role is to utilize the verified tool to solve various instances of the task. The prompt for this stage is the wrapped tool, which contains the function for solving the task and demonstrations of how to convert a task query into a function call. With the demonstrations, the tool user can then generate the required function call in an in-context learning fashion. The function calls are then executed to solve the task. Optionally, postprocessing can be applied to convert the output to match the required format of the task, such as options for multiple-choice questions.
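As a rough sketch of this stage, the snippet below builds a prompt from a wrapped tool, obtains a function call from a stand-in model, and executes it. The names `use_tool`, `generate_call`, and the dictionary layout of `wrapped_tool` are assumptions made for illustration; they are not the format shown in Appendix C.

```python
from typing import Any, Callable, Dict


def use_tool(
    wrapped_tool: Dict[str, str],
    question: str,
    generate_call: Callable[[str], str],
) -> Any:
    """Ask the tool user for a function call, then execute it."""
    prompt = (
        wrapped_tool["function"]
        + "\n\n"
        + wrapped_tool["demonstrations"]
        + "\n\nQuestion: "
        + question
        + "\nFunction call:"
    )
    call = generate_call(prompt)  # a single call to the lightweight model
    namespace: dict = {}
    exec(wrapped_tool["function"], namespace)  # define the tool
    return eval(call, namespace)  # run the generated function call


# Hypothetical wrapped tool for a word-sorting task.
wrapped_tool = {
    "function": "def sort_words(words):\n    return ' '.join(sorted(words))",
    "demonstrations": "# Q: Sort: kiwi date\n# sort_words(['kiwi', 'date'])",
}


def fake_tool_user(prompt: str) -> str:
    # Stand-in for the lightweight LLM; returns a fixed function call.
    return "sort_words(['pear', 'apple', 'fig'])"


answer = use_tool(wrapped_tool, "Sort: pear apple fig", fake_tool_user)
```

Because the generated call is executed directly, a deployment would likely run it in a sandboxed interpreter rather than with bare `exec` and `eval`.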

The tool-making stage, including tool proposing, verification, and wrapping, only needs to be performed once for each type of task. The resulting tools can then be reused for all instances of that task. This makes LATM significantly more efficient and cost-effective than using a powerful model alone. Furthermore, the Python function tools are a more generic form of Chain-of-Thought, enhancing the overall utility and flexibility of the LLMs, as they can be used to solve questions that involve algorithmic reasoning ability (Veličković and Blundell, 2021).
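The amortization argument can be made concrete with a small calculation. The sketch below assumes a one-time tool-making cost, a per-instance tool-using cost, and a per-instance cost for the powerful model; the function names and the numbers in the example are hypothetical and are not measurements from our experiments.

```python
import math


def average_cost_per_instance(making_cost: float, using_cost: float, n_instances: int) -> float:
    """Average LATM cost per instance when one tool serves n_instances."""
    return making_cost / n_instances + using_cost


def break_even_instances(making_cost: float, using_cost: float, strong_cost: float) -> int:
    """Smallest n for which making_cost + n * using_cost < n * strong_cost."""
    if strong_cost <= using_cost:
        raise ValueError("the tool user must be cheaper per instance than the strong model")
    return math.floor(making_cost / (strong_cost - using_cost)) + 1


# Hypothetical cost units: tool making 100, tool using 1, strong model 5.
avg = average_cost_per_instance(100, 1, 50)
n_star = break_even_instances(100, 1, 5)
```

With these illustrative numbers, reusing the tool on 50 instances gives an average cost of 3 units per instance, and LATM becomes cheaper than the strong model alone from the 26th instance onward.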

To illustrate our methodology, Figure 3 provides a concrete example of how the tool maker solves the logical deduction task from BigBench (Srivastava et al., 2022) by producing a tool (a Python function), and how the tool user utilizes the tool. This task requires inferring the ordering of five objects and then answering a question. The conditions include both relative positions of certain object pairs and the absolute positions of some objects, as demonstrated in the “Tool maker input” block in Figure 3. To solve this task, the tool maker, e.g., GPT-4, generates a generic program that solves the task by extracting constraints from the question and then searching over all permutations for the result. The tool user, e.g., GPT-3.5 Turbo, can then utilize this program to solve the task, using a function call that merely extracts relevant information from the natural language instances of the task. We show more examples of the generated new tools for solving other tasks in Appendix C.
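The permutation-search strategy described above can be sketched as follows. This is an illustrative reconstruction, not the tool actually generated by GPT-4: the function name `solve_ordering` and the encoding of constraints as predicates over a name-to-position mapping are assumptions.

```python
from itertools import permutations
from typing import Callable, Dict, List, Optional

Constraint = Callable[[Dict[str, int]], bool]


def solve_ordering(objects: List[str], constraints: List[Constraint]) -> Optional[List[str]]:
    """Return the first left-to-right ordering that satisfies every constraint."""
    for order in permutations(objects):
        position = {name: index for index, name in enumerate(order)}
        if all(check(position) for check in constraints):
            return list(order)
    return None  # no ordering is consistent with the constraints


# Example: "The robin is to the left of the owl. The crow is the leftmost."
objects = ["robin", "owl", "crow"]
constraints = [
    lambda pos: pos["robin"] < pos["owl"],  # relative position
    lambda pos: pos["crow"] == 0,           # absolute position
]
order = solve_ordering(objects, constraints)
```

Under this encoding, the tool user only needs to translate each natural language condition into one predicate; the search itself is shared across all instances of the task.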

3.2 Handling Streaming Data with Dispatcher

In real-world scenarios, task instances typically arrive in sequence. To accommodate this stream of data, we introduce a third LLM, the dispatcher, which determines whether to engage the tool user or tool maker for each incoming task. This module bears similarities to the tool selection feature present in existing works (Lu et al., 2023; Shen et al., 2023; Schick et al., 2023; Paranjape et al., 2023). However, our dispatcher is distinct in its ability to identify new tasks that cannot be addressed by existing tools and to engage the tool maker to generate new tools for these tasks.

Specifically, the dispatcher maintains a record of existing tools produced by the tool maker. When a new task instance is received, the dispatcher initially determines if there is a suitable tool for the task at hand. If a suitable tool exists, the dispatcher passes the instance and its corresponding tool to the tool user for task resolution. If no appropriate tool is found, the dispatcher identifies the instance as a new task and solves the instance with a powerful model or even invokes a human labeler. The instances from a new task are then cached until sufficient cached instances are available for the tool maker to make a new tool. The dispatcher’s workflow is illustrated in Figure 4. Given the simplicity of the dispatching task, the dispatcher can be a lightweight model equipped with proper prompts (See Appendix B), which adds only a marginal cost to the overall pipeline.
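A minimal sketch of this workflow is given below. The class name `Dispatcher`, the `classify`, `make_tool`, and `fallback_solve` callables (stand-ins for the dispatcher LLM, the tool maker, and the powerful model or human labeler), and the `min_cached` threshold are all hypothetical; in particular, we assume here that the classifier returns a task label even for unseen tasks.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple


class Dispatcher:
    """Route each instance to an existing tool, or cache it for tool making."""

    def __init__(
        self,
        classify: Callable[[str, List[str]], str],
        make_tool: Callable[[List[Tuple[str, Any]]], Callable[[str], Any]],
        fallback_solve: Callable[[str], Any],
        min_cached: int = 6,  # assumed threshold of cached instances
    ) -> None:
        self.classify = classify
        self.make_tool = make_tool
        self.fallback_solve = fallback_solve
        self.min_cached = min_cached
        self.tools: Dict[str, Callable[[str], Any]] = {}
        self.cache: Dict[str, List[Tuple[str, Any]]] = defaultdict(list)

    def handle(self, instance: str) -> Any:
        task = self.classify(instance, list(self.tools))
        if task in self.tools:
            # A suitable tool exists: pass the instance to the tool user.
            return self.tools[task](instance)
        # New task: solve with the powerful model and cache the instance.
        answer = self.fallback_solve(instance)
        self.cache[task].append((instance, answer))
        if len(self.cache[task]) >= self.min_cached:
            self.tools[task] = self.make_tool(self.cache.pop(task))
        return answer


# Stubs for illustration: a single word-sorting task with a "sort:" prefix.
fallback_calls: List[str] = []


def classify(instance: str, known_tasks: List[str]) -> str:
    return instance.split(":", 1)[0]


def strong_model(instance: str) -> str:
    fallback_calls.append(instance)
    return " ".join(sorted(instance.split(":", 1)[1].split()))


def stub_tool_maker(examples: List[Tuple[str, Any]]) -> Callable[[str], Any]:
    return lambda instance: " ".join(sorted(instance.split(":", 1)[1].split()))


dispatcher = Dispatcher(classify, stub_tool_maker, strong_model, min_cached=2)
answers = [dispatcher.handle(x) for x in ["sort:b a", "sort:d c", "sort:f e"]]
```

In this run the first two instances are solved by the fallback model and cached; once the cache reaches the threshold a tool is made, and the third instance is handled by the tool without calling the fallback.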

Experiments

We evaluate our approach on six datasets from diverse domains, including Logical Deduction, Tracking Shuffled Objects, Dyck Language, Word Sorting, Chinese Remainder Theorem, and Scheduling Meeting. The first five datasets are sourced from BigBench (Srivastava et al., 2022). We take the $5$-object versions of the Logical Deduction and Tracking Shuffled Objects tasks, referred to as Logical Deduction (5) and Tracking Shuffled Objects (5) in the paper. We also constructed the Scheduling Meeting task to demonstrate the effectiveness of LATM in real-world scenarios. Detailed information on dataset generation can be found in Appendix D. We divide each dataset into training, validation, and test sets, containing 3, 3, and 240 instances, respectively.

Model settings.

During the tool-making stage, we set the temperature to 0.3 to introduce randomness to the generation process, allowing for retries if necessary. For this stage, we conduct experiments using GPT-4 and GPT-3.5 Turbo models with the ChatCompletion API, always appending the response to the chat history to create an interactive experience. In the tool-using stage, the LLM API call is made only once, and we also perform ablation studies on GPT-3-type models with the standard Completion API. When using the tools, we consistently set the temperature to 0.0. We set the maximum number of retries to 3 for the tool-proposing and tool-verification stages.

4.2 Effectiveness of the Tool-Making Stage

In the tool-making stage, we use a powerful yet slower model to generate generic Python functions tailored to a specific task. This step is performed only once for each task, and the overhead is amortized across all instances of that task. In our experiments, we use GPT-4 (OpenAI, 2023) as a representative tool maker, while we explore other models’ tool-making capabilities in Section 4.5.

We provide several few-shot exemplars for the language model, guiding it to generate generic Python programs, as illustrated in Figure 3.

Our observations indicate that when GPT-4 is employed as the tool maker, the model frequently devises suitable algorithms for solving tasks. For instance, as shown in Table 1, the tool maker creates code to solve the logical deduction task by searching through all permutations and selecting the one that satisfies the given constraints. In our experiments, the tool-verification stage is mainly used to provide examples that demonstrate how to convert natural language questions into function calls; in only 2 of the 60 trials did the tool maker correct its mistakes with the guidance of error messages. See Section 4.5 for more discussion of the tool maker.

4.3 LATM Improves the Performance of Lightweight LLMs

In Table 2, we compare the performance of Chain-of-Thought prompting (Wei et al., 2022) with our method, LATM. We employ GPT-4 as the tool maker to generate tools for the six tasks, and evaluate the performance of both GPT-3.5 Turbo and GPT-4 as tool user. The results demonstrate that with the help of the tool, a lightweight model like GPT-3.5 Turbo can achieve performance on par with GPT-4, significantly outperforming CoT prompting. Additionally, the average cost of using GPT-3.5 Turbo with the tool is much lower compared to using GPT-4. This highlights the effectiveness of LATM in enhancing the performance of lightweight models and therefore reducing the cost compared to employing expensive models. Intriguingly, for the Dyck Language task, GPT-3.5 Turbo as the tool user even surpasses GPT-4 in its role as the tool user. Upon investigating the failure cases, we find that when converting the question into a function call, GPT-4 occasionally solves part of the problem unnecessarily, which leads to incorrect function output.

4.4 Extending LATM to a Streaming Setting with a Mixture of Tasks

As mentioned in Section 3.2, we can extend LATM to a streaming setting where instances from (potentially) different tasks arrive on-the-fly. In this case, we require another model, the dispatcher, to determine the task to which the instance belongs. We use GPT-3.5 Turbo as the dispatcher and evaluate its ability to: 1) identify existing tools to solve an incoming instance; 2) request tool-making for instances from an unseen task.

We first assess the ability of the dispatcher to identify existing tools for a given instance. We randomly mix the six tasks from Section 4.1 and generate a test set with 100 samples. For each instance in the test set, we use the dispatcher to identify the appropriate existing tool with the prompt that contains task examples associated with existing tools, as shown in Appendix B. If the tool is identified correctly, we consider it a success. The accuracy of determining the correct tool is $94\%\pm 2\%$ over five random constructions of the test set.

Requesting tool-making.

Next, we evaluate the dispatcher’s ability to request tool-making for instances from an unseen task. We randomly select four tasks as existing tasks with tools ready. We then pick four tasks for testing: two are unseen, and two are within the existing tasks. We generate a test set with 100 samples. For each instance in the test set, we use the dispatcher to determine whether it needs to request tool-making or if the instance can be solved by an existing tool. The accuracy of making the correct request is $95\%\pm 4\%$.

The results demonstrate that the dispatcher can effectively identify existing tools and request tool-making for unseen tasks without a significant performance drop. This suggests that LATM can be smoothly extended to a streaming setting with a mixture of tasks.

4.5 Ablation Study

We investigate the capacity requirements for the language model used in the tool-making stage. Generally, we found that a more powerful and expensive model better serves the purpose, as this stage is performed only once for each task, and high accuracy is crucial for effectively passing tools to a smaller model. Specifically, on hard tasks like Logical Deduction and Tracking Shuffled Objects, GPT-3.5 Turbo fails in all 5 trials. The major reason for failure is that the tool is not general enough and may only work on the training samples. On the other hand, we also discovered that for easy tasks, the tool maker can be a lightweight language model. For simple tasks like Word Sorting, GPT-3.5 Turbo can effortlessly generate a program that solves the task. Another limitation that may contribute to the tool maker’s failure is the context length constraint. Since we use the entire history in each step of tool-making to enhance the reliability of the tool-making stage, this also introduces a longer context. In this case, GPT-4 with a context length of 8192 is preferable.

Capacity required for the tool-using language model.

In this section, we investigate the capacity requirements for the tool-using model. The results are presented in Table 4. We observed that GPT-3.5 Turbo offers the best balance between performance and cost among all the models tested. Regarding the older GPT-3 series of models (ada, babbage, curie, davinci), we found that models before instruction tuning often perform better than their instruction-tuned counterparts. We hypothesize that the instruction tuning phase in these models may adversely impact the in-context learning ability, which is crucial for the tool-using stage.

CoT as a tool does not help.

In addition to LATM, we investigate whether we can improve task performance by passing Chain-of-Thought (CoT) from a larger model to a smaller model, similar to the LATM pipeline. Specifically, we use the same larger model (GPT-4) in the “CoT-making” stage, using the zero-shot prompt “Let’s think step by step.” to elicit the intermediate thought steps, and then supply the generated CoT to the same smaller tool-using model (GPT-3.5 Turbo). We test this on two tasks and report the results in Table 5. We observe that using CoT from a large model yields similar or even worse performance than human-written CoT, which is much worse than LATM.

Conclusion and Future Work

We introduced LATM, a closed-loop framework empowering large language models (LLMs) to create and utilize their own tools for diverse tasks. Our approach, inspired by humanity’s evolutionary strides in tool creation, employs two key stages: Tool Making and Tool Using. This division of labor allows us to harness the capabilities of advanced LLMs while significantly reducing computational costs. Our experiments confirmed the efficacy of LATM across various complex tasks, demonstrating that our framework performs comparably to resource-intensive models while being more cost-effective. In addition, we show that adding a dispatcher LLM provides further flexibility to our framework, enabling on-the-fly tool creation and usage.

In our evaluation process, we identified a significant lack of high-quality datasets that authentically represent daily human-computer interactions, including recurring tasks such as scheduling meetings or booking flights over email or phone calls, in their raw natural language format. We anticipate that our work will stimulate the research community to create such datasets, which could prove instrumental in cultivating the next generation of AI systems. These systems, capable of generating and applying their own tools, will be equipped to tackle complex tasks more effectively. An exciting avenue for future research is enabling the tool maker to refine and upgrade existing tools to manage new problem instances, much like in software development. This adaptability could further catalyze the evolution of the AI ecosystem, unlocking a wealth of opportunities.

References

Appendix A Broader Impact and Limitations

This paper explores the potential of enabling Large Language Models (LLMs) to create their own tools, thus allowing them greater autonomy in developing their ecosystem. While this avenue of research is promising, it also raises important ethical, safety, and control considerations that need to be carefully addressed.

One of the most significant impacts of our work lies in the potential for LLMs to grow and achieve unprecedented capabilities automatically. This could significantly enhance the range and complexity of tasks these models can handle, potentially revolutionizing fields such as customer service, technical support, and even areas of research and development. It could lead to more efficient use of computational resources and a reduction in human intervention, especially for routine or repetitive tasks.

However, this newfound autonomy of LLMs is a double-edged sword. As we endow LLMs with the ability to generate their own tools, we also create a scenario where the quality of the tools they develop may not always meet the standards or expectations set by human developers. Without proper safeguards, there’s a risk that these models could generate solutions that are suboptimal, incorrect, or even potentially harmful. Furthermore, as LLMs become more autonomous, the potential for loss of control increases. If these models are widely used without appropriate regulation, there could be unforeseen consequences, potentially even leading to scenarios where humans lose control over the AI systems.

In this study, we have not addressed these control and safety issues in depth, and our work has some limitations. Our proposed framework, LLM As Tool Maker, while effective in the tested scenarios, is still in its early stages of development. It is crucial to note that the real-world performance and safety of the system may vary based on the complexity and nature of the tasks it is applied to. Additionally, the evaluation and validation of the tools created by the tool maker in a real-world setting is a challenge that needs to be addressed.

Appendix B LATM Prompts

Appendix C Wrapped Tools

Appendix D Dataset Construction

For the “schedule meeting” task, we use the following template to generate the dataset:

where the interval is randomly sampled from $\{0.5,1,1.5\}$, and the availabilities of A and B are randomly sampled from 8:00-18:00 with 30 minutes as the granularity. The answer is computed by taking the intersection of the two availability sets and then finding the earliest time slot that is at least as long as the meeting duration. If there is no such time slot, we return “No time slot works.”
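The answer computation described above can be sketched as an interval intersection followed by a scan for the earliest sufficiently long slot. The representation is an assumption made for illustration: each availability is a list of non-overlapping `(start, end)` pairs in hours, the function name `earliest_slot` is hypothetical, and the returned tuple format is not necessarily the answer format used in the dataset.

```python
from typing import List, Tuple, Union

Interval = Tuple[float, float]


def earliest_slot(
    avail_a: List[Interval], avail_b: List[Interval], duration: float
) -> Union[Interval, str]:
    """Intersect two availability lists and return the earliest slot of the given length."""
    a, b = sorted(avail_a), sorted(avail_b)
    i = j = 0
    common: List[Interval] = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            if common and common[-1][1] == start:
                # Merge slots that touch end-to-start into one longer slot.
                common[-1] = (common[-1][0], end)
            else:
                common.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    for start, end in common:
        if end - start >= duration:
            return (start, start + duration)
    return "No time slot works."


# Example with times in hours (13.5 means 13:30).
avail_a = [(9.0, 11.0), (13.0, 15.5)]
avail_b = [(10.5, 12.0), (13.5, 17.0)]
slot = earliest_slot(avail_a, avail_b, 1.0)
```

In this example the first common window (10:30-11:00) is too short for a one-hour meeting, so the earliest valid slot starts at 13:30.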