Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom

cs.CL

Introduction

Large language models achieve impressive zero- and few-shot results on a variety of natural language processing tasks (Brown et al., 2020; Chowdhery et al., 2022, i.a.) and show several emergent capabilities (Wei et al., 2022). However, all of these models have several inherent limitations that can at best be partially addressed by further scaling. These limitations include an inability to access up-to-date information on recent events (Komeili et al., 2022) and the related tendency to hallucinate facts (Maynez et al., 2020; Ji et al., 2022), difficulties in understanding low-resource languages (Lin et al., 2021), a lack of mathematical skills to perform precise calculations (Patel et al., 2021) and an unawareness of the progression of time (Dhingra et al., 2022).

A simple way to overcome these limitations of today’s language models is to give them the ability to use external tools such as search engines, calculators, or calendars. However, existing approaches either rely on large amounts of human annotations (Komeili et al., 2022; Thoppilan et al., 2022) or limit tool use to task-specific settings only (e.g., Gao et al., 2022; Parisi et al., 2022), hindering a more widespread adoption of tool use in LMs. Therefore, we propose Toolformer, a model that learns to use tools in a novel way, which fulfills the following desiderata:

The use of tools should be learned in a self-supervised way without requiring large amounts of human annotations. This is important not only because of the costs associated with such annotations, but also because what humans find useful may be different from what a model finds useful.

The LM should not lose any of its generality and should be able to decide for itself when and how to use which tool. In contrast to existing approaches, this enables a much more comprehensive use of tools that is not tied to specific tasks.

Our approach for achieving these goals is based on the recent idea of using large LMs with in-context learning (Brown et al., 2020) to generate entire datasets from scratch (Schick and Schütze, 2021b; Honovich et al., 2022; Wang et al., 2022): Given just a handful of human-written examples of how an API can be used, we let a LM annotate a huge language modeling dataset with potential API calls. We then use a self-supervised loss to determine which of these API calls actually help the model in predicting future tokens. Finally, we finetune the LM itself on the API calls that it considers useful. As illustrated in Figure 1, through this simple approach, LMs can learn to control a variety of tools, and to choose for themselves which tool to use when and how.

As our approach is agnostic of the dataset being used, we can apply it to the exact same dataset that was used to pretrain a model in the first place. This ensures that the model does not lose any of its generality and language modeling abilities. We conduct experiments on a variety of different downstream tasks, demonstrating that after learning to use tools, Toolformer, which is based on a pretrained GPT-J model (Wang and Komatsuzaki, 2021) with 6.7B parameters, achieves much stronger zero-shot results, clearly outperforming a much larger GPT-3 model (Brown et al., 2020) and several other baselines on various tasks.

Approach

Our aim is to equip a language model $M$ with the ability to use different tools by means of API calls. We require that inputs and outputs for each API can be represented as text sequences. This allows seamless insertion of API calls into any given text, using special tokens to mark the start and end of each such call.

We represent each API call as a tuple $c=(a_{c},i_{c})$ where $a_{c}$ is the name of the API and $i_{c}$ is the corresponding input. Given an API call $c$ with a corresponding result $r$, we denote the linearized sequences of the API call not including and including its result, respectively, as:

$$\text{e}(c)=\texttt{<API>}\;a_{c}(i_{c})\;\texttt{</API>}$$

$$\text{e}(c,r)=\texttt{<API>}\;a_{c}(i_{c})\rightarrow r\;\texttt{</API>}$$

where “$\texttt{<API>}$”, “$\texttt{</API>}$” and “$\rightarrow$” are special tokens. In practice, we use the token sequences “ [”, “]” and “->” to represent “$\texttt{<API>}$”, “$\texttt{</API>}$” and “$\rightarrow$”, respectively. This enables our approach to work without modifying the existing LM’s vocabulary. For reasons of readability, we still refer to them as “$\texttt{<API>}$”, “$\texttt{</API>}$” and “$\rightarrow$” throughout this section. Some examples of linearized API calls inserted into text sequences are shown in Figure 1.
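The linearization of an API call can be illustrated with a few lines of Python. This is a minimal sketch, not the authors' code: the function name `linearize` and the exact whitespace around the arrow are assumptions, while the bracket and "->" strings follow the token sequences described above.

```python
from typing import Optional

# Plain-text stand-ins for the special tokens (start, end, arrow).
API_START, API_END, ARROW = "[", "]", "->"


def linearize(api_name: str, api_input: str, result: Optional[str] = None) -> str:
    """Return the linearized call without a result, or with it if `result` is given."""
    call = f"{api_name}({api_input})"
    if result is None:
        return f"{API_START}{call}{API_END}"
    # Whitespace around the arrow is an illustrative choice.
    return f"{API_START}{call} {ARROW} {result}{API_END}"


print(linearize("Calculator", "400 / 1400"))
print(linearize("Calculator", "400 / 1400", "0.29"))
```

Because both forms are ordinary strings, they can be spliced into any text without changing the vocabulary of the underlying model.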

Given a dataset $\mathcal{C}=\{\mathbf{x}^{1},\ldots,\mathbf{x}^{|\mathcal{C}|}\}$ of plain texts, we first convert this dataset into a dataset $\mathcal{C}^{*}$ augmented with API calls. This is done in three steps, illustrated in Figure 2: First, we exploit the in-context learning ability of $M$ to sample a large number of potential API calls. We then execute these API calls and finally check whether the obtained responses are helpful for predicting future tokens; this is used as a filtering criterion. After filtering, we merge API calls for different tools, resulting in the augmented dataset $\mathcal{C}^{*}$ , and finetune $M$ itself on this dataset. Each of these steps is described in more detail below.

For each API, we write a prompt $P(\mathbf{x})$ that encourages the LM to annotate an example $\mathbf{x}=x_{1},\ldots,x_{n}$ with API calls. An example of such a prompt for a question answering tool is shown in Figure 3; all prompts used are shown in Appendix A.2. Let $p_{M}(z_{n+1}\mid z_{1},\ldots,z_{n})$ be the probability that $M$ assigns to token $z_{n+1}$ as a continuation for the sequence $z_{1},\ldots,z_{n}$. We first sample up to $k$ candidate positions for doing API calls by computing, for each $i\in\{1,\ldots,n\}$, the probability

$$p_{i}=p_{M}(\texttt{<API>}\mid P(\mathbf{x}),x_{1:i-1})$$

that $M$ assigns to starting an API call at position $i$. Given a sampling threshold $\tau_{s}$, we keep all positions $I=\{i\mid p_{i}>\tau_{s}\}$; if there are more than $k$ such positions, we only keep the top $k$.
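The position selection step can be sketched as follows, assuming the per-position probabilities of starting an API call have already been computed with the model. The function name `select_positions` is hypothetical, and positions are 0-based here, whereas the text counts from 1.

```python
from typing import List, Sequence


def select_positions(api_start_probs: Sequence[float], tau_s: float, k: int) -> List[int]:
    """Keep positions whose probability exceeds tau_s, capped at the k most probable."""
    candidates = [i for i, p in enumerate(api_start_probs) if p > tau_s]
    # If more than k positions pass the threshold, keep only the top k by probability.
    candidates.sort(key=lambda i: api_start_probs[i], reverse=True)
    return sorted(candidates[:k])


probs = [0.01, 0.20, 0.06, 0.50, 0.04]
print(select_positions(probs, tau_s=0.05, k=2))
```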

For each position $i\in I$, we then obtain up to $m$ API calls $c_{i}^{1},\ldots,c_{i}^{m}$ by sampling from $M$ given the sequence $[P(\mathbf{x}),x_{1},\ldots,x_{i-1},\texttt{<API>}]$ as a prefix and $\texttt{</API>}$ as an end-of-sequence token. We discard all examples where $M$ does not generate the $\texttt{</API>}$ token.

As a next step, we execute all API calls generated by $M$ to obtain the corresponding results. How this is done depends entirely on the API itself – for example, it can involve calling another neural network, executing a Python script or using a retrieval system to perform search over a large corpus. The response for each API call $c_{i}$ needs to be a single text sequence $r_{i}$ .

Let $i$ be the position of the API call $c_{i}$ in the sequence $\mathbf{x}=x_{1},\ldots,x_{n}$, and let $r_{i}$ be the response from the API. Further, given a sequence $(w_{i}\mid i\in\mathbb{N})$ of weights, let

$$L_{i}(\mathbf{z})=-\sum_{j=i}^{n}w_{j-i}\cdot\log p_{M}(x_{j}\mid\mathbf{z},x_{1:j-1})$$

be the weighted cross entropy loss for $M$ over the tokens $x_{i},\ldots,x_{n}$ if the model is prefixed with $\mathbf{z}$. We compare two different instantiations of this loss:

$$L_{i}^{+}=L_{i}(\text{e}(c_{i},r_{i}))$$

$$L_{i}^{-}=\min\left(L_{i}(\varepsilon),L_{i}(\text{e}(c_{i},\varepsilon))\right)$$

where $\varepsilon$ denotes an empty sequence. The former is the weighted loss over all tokens $x_{i},\ldots,x_{n}$ if the API call and its result are given to $M$ as a prefix; the latter is the minimum of the losses obtained from (i) doing no API call at all and (ii) doing an API call, but not providing the response. (We provide $\text{e}(c_{i},r_{i})$ as a prefix instead of inserting it at position $i$ because $M$ is not yet finetuned on any examples containing API calls, so inserting it in the middle of $\mathbf{x}$ would interrupt the flow and not align with patterns in the pretraining corpus, thus hurting perplexity.) Intuitively, an API call is helpful to $M$ if providing it with both the input and the output of this call makes it easier for the model to predict future tokens, compared to not receiving the API call at all, or receiving only its input. Given a filtering threshold $\tau_{f}$, we thus only keep API calls for which

$$L_{i}^{-}-L_{i}^{+}\geq\tau_{f}$$

holds, i.e., adding the API call and its result reduces the loss by at least $\tau_{f}$, compared to not doing any API call or obtaining no result from it.
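The filtering criterion can be made concrete with a small sketch. It assumes that the token log-probabilities under the three prefixes (call with result, no call, call without result) have already been obtained from the model; the function names and the toy numbers are illustrative only.

```python
from typing import Sequence


def weighted_loss(log_probs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted cross entropy over x_i, ..., x_n; log_probs[0] belongs to x_i.

    Tokens beyond the end of `weights` receive weight 0.
    """
    total = 0.0
    for offset, log_p in enumerate(log_probs):
        w = weights[offset] if offset < len(weights) else 0.0
        total -= w * log_p
    return total


def keep_api_call(lp_with_result: Sequence[float], lp_no_call: Sequence[float],
                  lp_call_only: Sequence[float], weights: Sequence[float],
                  tau_f: float) -> bool:
    """Keep the call if it lowers the loss by at least tau_f against both alternatives."""
    loss_plus = weighted_loss(lp_with_result, weights)
    loss_minus = min(weighted_loss(lp_no_call, weights),
                     weighted_loss(lp_call_only, weights))
    return loss_minus - loss_plus >= tau_f


# Toy log-probabilities for two future tokens under each prefix.
print(keep_api_call([-0.1, -0.1], [-2.0, -2.0], [-1.5, -1.5], [0.5, 0.5], tau_f=1.0))
```

Taking the minimum over the two alternatives means a call is only kept when the result itself, and not merely the presence of the call text, is what helps the prediction.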

After sampling and filtering calls for all APIs, we finally merge the remaining API calls and interleave them with the original inputs. That is, for an input text $\mathbf{x}=x_{1},\ldots,x_{n}$ with a corresponding API call and result $(c_{i},r_{i})$ at position $i$ , we construct the new sequence $\mathbf{x}^{*}=x_{1:{i-1}},\text{e}(c_{i},r_{i}),x_{i:n}$ ; we proceed analogously for texts with multiple API calls. Doing this for all $\mathbf{x}\in\mathcal{C}$ results in the new dataset $\mathcal{C}^{*}$ augmented with API calls. We use this new dataset to finetune $M$ , using a standard language modeling objective. Crucially, apart from inserted API calls the augmented dataset $\mathcal{C}^{*}$ contains the exact same texts as $\mathcal{C}$ , the original dataset. As a consequence, finetuning $M$ on $\mathcal{C}^{*}$ exposes it to the same content as finetuning on $\mathcal{C}$ . Moreover, as API calls are inserted in exactly those positions and with exactly those inputs that help $M$ predict future tokens, finetuning on $\mathcal{C}^{*}$ enables the language model to decide when and how to use which tool, based purely on its own feedback.
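Interleaving the retained calls with the original text amounts to inserting each linearized call immediately before the token at its position. The sketch below assumes a word-level token list and 0-based positions; `interleave` is a hypothetical helper name.

```python
from typing import Dict, List


def interleave(tokens: List[str], calls: Dict[int, str]) -> List[str]:
    """Insert each linearized call (with result) before the token at its 0-based position."""
    augmented = []
    for i, token in enumerate(tokens):
        if i in calls:
            augmented.append(calls[i])
        augmented.append(token)
    return augmented


text = ["400", "of", "1400", "passed,", "i.e.", "29%."]
calls = {5: "[Calculator(400 / 1400) -> 0.29]"}
print(" ".join(interleave(text, calls)))
```

Removing the inserted strings recovers the original token list exactly, which mirrors the property that the augmented dataset contains the same texts as the original one.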

When generating text with $M$ after finetuning with our approach, we perform regular decoding until $M$ produces the “$\rightarrow$” token, indicating that it next expects the response for an API call. At this point, we interrupt the decoding process, call the appropriate API to get a response, and continue the decoding process after inserting both the response and the $\texttt{</API>}$ token.
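The inference procedure can be sketched as a decoding loop that pauses at the arrow token. Everything here is a simplified assumption: `next_token` stands in for the finetuned model, tokens are plain strings, "[", "->" and "]" are the textual special tokens, and the generated call is assumed to be well formed.

```python
from typing import Callable, Dict, List


def decode_with_tools(next_token: Callable[[List[str]], str],
                      tools: Dict[str, Callable[[str], str]],
                      prompt: List[str], max_steps: int = 50) -> List[str]:
    """Decode token by token; when "->" appears, run the tool and insert its response."""
    seq = list(prompt)
    for _ in range(max_steps):
        token = next_token(seq)
        if token == "<eos>":
            break
        seq.append(token)
        if token == "->":
            # Locate the most recent call start and rebuild "Name(input)".
            start = len(seq) - 1 - seq[::-1].index("[")
            name, _, arg = "".join(seq[start + 1:-1]).partition("(")
            if arg.endswith(")"):
                arg = arg[:-1]
            # Insert the response and the closing token, then resume decoding.
            seq.extend([tools[name](arg), "]"])
    return seq


# A scripted stand-in for the model, used only to exercise the loop.
script = iter(["[", "Calculator(2+3)", "->", ".", "<eos>"])
print(decode_with_tools(lambda seq: next(script), {"Calculator": lambda arg: "5"}, ["Sum:"]))
```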

Tools

We explore a variety of tools to address different shortcomings of regular LMs. The only constraints we impose on these tools are that (i) both their inputs and outputs can be represented as text sequences, and (ii) we can obtain a few demonstrations of their intended use. Concretely, we explore the following five tools: a question answering system, a Wikipedia search engine, a calculator, a calendar, and a machine translation system. Some examples of potential calls and return strings for the APIs associated with each of these tools are shown in Table 1. We briefly discuss all tools below; further details can be found in Appendix A.

Our first tool is a question answering system based on another LM that can answer simple factoid questions. Specifically, we use Atlas (Izacard et al., 2022), a retrieval-augmented LM finetuned on Natural Questions (Kwiatkowski et al., 2019).

As a second tool, we use a calculator that can perform simple numeric calculations; we only support the four basic arithmetic operations. Results are always rounded to two decimal places.
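A calculator with these properties could look roughly like the following sketch. It is not the script used in the paper: parsing with Python's `ast` module, the handling of a leading minus sign, and returning `None` for unsupported input are all assumptions made for illustration.

```python
import ast
import operator
from typing import Optional

# Only the four basic arithmetic operations are supported.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _evaluate(node: ast.AST) -> float:
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    raise ValueError("unsupported expression")


def calculate(expression: str) -> Optional[str]:
    """Evaluate the expression and round to two decimals; None if it is invalid."""
    try:
        tree = ast.parse(expression.strip(), mode="eval")
        value = _evaluate(tree.body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return None
    return f"{value:.2f}"


print(calculate("400 / 1400"))
```

Walking the syntax tree instead of calling `eval` keeps the tool restricted to arithmetic, so model-generated input cannot execute arbitrary code.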

Our third tool is a search engine that, given a search term, returns short text snippets from Wikipedia. Compared to our question answering tool, this search enables a model to get more comprehensive information on a subject, but requires it to extract the relevant parts by itself. As our search engine, we use a BM25 retriever (Robertson et al., 1995; Baeza-Yates et al., 1999) that indexes the Wikipedia dump from KILT (Petroni et al., 2021).

Our fourth tool is a machine translation system based on a LM that can translate a phrase from any language into English. More concretely, we use the 600M parameter NLLB (Costa-jussà et al., 2022) as our multilingual machine translation model that works for 200 languages (including low-resource ones). The source language is automatically detected using the fastText classifier (Joulin et al., 2016), while the target language is always set to English.

Our final tool is a calendar API that, when queried, returns the current date without taking any input. This provides temporal context for predictions that require some awareness of time.

Experiments

We investigate whether our approach enables a model to use tools without any further supervision and to decide for itself when and how to call which of the available tools. To test this, we select a variety of downstream tasks where we assume at least one of the considered tools to be useful, and evaluate performance in zero-shot settings (Section 4.2). Beyond that, we also ensure that our approach does not hurt the model’s core language modeling abilities; we verify this by looking at perplexity on two language modeling datasets (Section 4.3). Finally, we investigate how the ability to learn using tools is affected by model size (Section 4.4).

Throughout all of our experiments, we use a subset of CCNet (Wenzek et al., 2020) as our language modeling dataset $\mathcal{C}$ and GPT-J (Wang and Komatsuzaki, 2021) as our language model $M$. To reduce the computational cost of annotating $\mathcal{C}$ with API calls, we define heuristics for some APIs to get a subset of $\mathcal{C}$ for which API calls are more likely to be helpful than for an average text. For example, we only consider texts for the calculator tool if they contain at least three numbers. Details of the heuristics used are given in Appendix A. For obtaining $\mathcal{C}^{*}$ from $\mathcal{C}$, we perform all steps described in Section 2 and additionally filter out all examples for which all API calls were eliminated in the filtering step. While this filtering alters the distribution of training examples, we assume that the remaining examples are close enough to the original distribution so that $M$’s language modeling abilities remain unaffected. This assumption is empirically validated in Section 4.3. For the weighting function, we use

$$w_{t}=\frac{\tilde{w}_{t}}{\sum_{s\in\mathbb{N}}\tilde{w}_{s}}\quad\text{with}\quad\tilde{w}_{t}=\max(0,1-0.2\cdot t)$$

to make sure that API calls happen close to where the information provided by the API is actually helpful for the model. The thresholds $\tau_{s}$ and $\tau_{f}$ are chosen individually for each tool to ensure a sufficiently large number of examples; see Appendix A for details. Table 2 shows relevant statistics of our final dataset augmented with API calls.
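For illustration, the weights can be computed explicitly. The sketch assumes the linearly decaying form $\tilde{w}_{t}=\max(0,1-0.2\cdot t)$, normalized so that all weights sum to one, as in the published version of the paper; the function name `position_weights` is hypothetical.

```python
from typing import List


def position_weights(decay: float = 0.2) -> List[float]:
    """Return the non-zero normalized weights w_0, w_1, ...; all later weights are 0."""
    if decay <= 0:
        raise ValueError("decay must be positive so that the weights reach zero")
    raw = []
    t = 0
    # Collect unnormalized weights until they hit zero.
    while 1.0 - decay * t > 1e-12:
        raw.append(1.0 - decay * t)
        t += 1
    total = sum(raw)
    return [w / total for w in raw]


print([round(w, 3) for w in position_weights()])
```

With a decay of 0.2 only the five tokens directly after the call position carry weight, and the first of them carries one third of the total.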

We finetune $M$ on $\mathcal{C}^{*}$ using a batch size of 128 and a learning rate of $1\cdot 10^{-5}$ with linear warmup for the first 10% of training. Details of our finetuning procedure are given in Appendix B.
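The learning rate schedule can be sketched as below. Only the peak value and the 10% linear warmup come from the text; keeping the rate constant after warmup is an assumption made here for simplicity, and the function name is hypothetical.

```python
def learning_rate(step: int, total_steps: int, peak_lr: float = 1e-5,
                  warmup_frac: float = 0.1) -> float:
    """Linear warmup over the first warmup_frac of training, then constant (assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


for step in (0, 49, 99, 500):
    print(step, learning_rate(step, total_steps=1000))
```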

Throughout the remainder of this section, we mainly compare the following models:

GPT-J: A regular GPT-J model without any finetuning.

GPT-J + CC: GPT-J finetuned on $\mathcal{C}$ , our subset of CCNet without any API calls.

Toolformer: GPT-J finetuned on $\mathcal{C}^{*}$ , our subset of CCNet augmented with API calls.

Toolformer (disabled): The same model as Toolformer, but API calls are disabled during decoding. This is achieved by manually setting the probability of the $\texttt{<API>}$ token to 0.

For most tasks, we additionally compare to OPT (66B) (Zhang et al., 2022) and GPT-3 (175B) (Brown et al., 2020), two models that are about 10 and 25 times larger than our other baseline models, respectively. For GPT-3, we use the original davinci variant that is not finetuned on any instructions.

4.2 Downstream Tasks

We evaluate all models on a variety of downstream tasks. In all cases, we consider a prompted zero-shot setup – i.e., models are instructed to solve each task in natural language, but we do not provide any in-context examples. This is in contrast to prior work on tool use (e.g., Gao et al., 2022; Parisi et al., 2022), where models are provided with dataset-specific examples of how a tool can be used to solve a concrete task. We choose the more challenging zero-shot setup as we are interested in seeing whether Toolformer works in precisely those cases where a user does not specify in advance which tools should be used in which way for solving a specific problem.

We use standard greedy decoding, but with one modification for Toolformer: We let the model start an API call not just when $\texttt{<API>}$ is the most likely token, but whenever it is one of the $k$ most likely tokens. For $k=1$, this corresponds to regular greedy decoding; we instead use $k=10$ to increase the disposition of our model to make use of the APIs that it has access to. At the same time, we allow at most one API call per input to make sure the model does not get stuck in a loop where it constantly calls APIs without producing any actual output. The effect of these modifications is explored in Section 5.
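The modified decoding rule can be sketched as a single token choice. The sketch assumes a dictionary of next-token probabilities and uses "[" as the textual start-of-call token; excluding that token once the call budget is used up is an assumption about how the limit of one call per input is enforced.

```python
from typing import Dict


def choose_next_token(next_token_probs: Dict[str, float], k: int, calls_made: int,
                      max_calls: int = 1, api_token: str = "[") -> str:
    """Start an API call if its token is among the k most likely and the budget allows."""
    top_k = sorted(next_token_probs, key=next_token_probs.get, reverse=True)[:k]
    if calls_made < max_calls and api_token in top_k:
        return api_token
    # Otherwise fall back to greedy decoding over the remaining tokens.
    others = {t: p for t, p in next_token_probs.items() if t != api_token}
    return max(others, key=others.get)


probs = {"the": 0.5, "a": 0.3, "[": 0.15, "cat": 0.05}
print(choose_next_token(probs, k=1, calls_made=0))
print(choose_next_token(probs, k=3, calls_made=0))
```

With $k=1$ the rule reduces to plain greedy decoding, while larger values of $k$ make the model more willing to call a tool.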

We evaluate our models on the SQuAD, Google-RE and T-REx subsets of the LAMA benchmark (Petroni et al., 2019). For each of these subsets, the task is to complete a short statement with a missing fact (e.g., a date or a place). As LAMA was originally designed to evaluate masked language models (e.g., Devlin et al., 2019), we filter out examples where the mask token is not the final token, so that the remaining examples can be processed in a left-to-right fashion. To account for different tokenizations and added complexity from not informing the model that a single word is required, we use a slightly more lenient evaluation criterion than exact match and simply check whether the correct word is within the first five words predicted by the model. As LAMA is based on statements obtained directly from Wikipedia, we prevent Toolformer from using the Wikipedia Search API to avoid giving it an unfair advantage.
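The lenient evaluation criterion can be sketched as a check over the first few predicted words. Whitespace tokenization, lowercasing, and stripping of surrounding punctuation are assumptions of this sketch rather than details given in the text.

```python
def contains_answer(prediction: str, answer: str, max_words: int = 5) -> bool:
    """Check whether the answer is among the first max_words words of the prediction."""
    words = [w.strip(".,;:!?\"'()").lower() for w in prediction.split()[:max_words]]
    return answer.lower() in words


print(contains_answer("The capital is Paris, of course", "Paris"))
print(contains_answer("It is located in the city of Paris", "Paris"))
```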

Results for all models can be seen in Table 3. All GPT-J models without tool use achieve similar performance. Crucially, Toolformer clearly outperforms these baseline models, improving upon the best baseline by 11.7, 5.2 and 18.6 points, respectively. It also clearly outperforms OPT (66B) and GPT-3 (175B), despite both models being much larger. This is achieved because the model independently decides to ask the question answering tool for the required information in almost all cases (98.1%); for only very few examples, it uses a different tool (0.7%) or no tool at all (1.2%).

4.2.2 Math Datasets

We test mathematical reasoning abilities on ASDiv (Miao et al., 2020), SVAMP (Patel et al., 2021) and the MAWPS benchmark (Koncel-Kedziorski et al., 2016). We again account for the fact that we test all models in a zero-shot setup by using a more lenient evaluation criterion: As the required output is always a number, we simply check for the first number predicted by the model. An exception to this is if the model’s prediction contains an equation (e.g., “The correct answer is 5+3=8”), in which case we consider the first number after the “=” sign to be its prediction.
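Extracting the predicted number under this criterion can be sketched with a regular expression. The number pattern (optional sign, optional decimal part, no thousands separators) is an assumption of this sketch.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_prediction(text: str) -> Optional[float]:
    """Return the first number, or the first number after "=" if the text has an equation."""
    if "=" in text:
        text = text.split("=", 1)[1]
    match = _NUMBER.search(text)
    return float(match.group()) if match else None


print(extract_prediction("The correct answer is 5+3=8"))
print(extract_prediction("She has 12 apples left."))
```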

Table 4 shows results for all benchmarks. While GPT-J and GPT-J + CC perform about the same, Toolformer achieves stronger results even when API calls are disabled. We surmise that this is because the model is finetuned on many examples of API calls and their results, improving its own mathematical capabilities. Nonetheless, allowing the model to make API calls more than doubles performance for all tasks, and also clearly outperforms the much larger OPT and GPT-3 models. This is because across all benchmarks, for 97.9% of all examples the model decides to ask the calculator tool for help.

4.2.3 Question Answering

We look at Web Questions (Berant et al., 2013), Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017), the three question answering datasets considered by Brown et al. (2020). For evaluation, we check whether the first 20 words predicted by a model contain the correct answer instead of requiring an exact match. For Toolformer, we disable the question answering tool as this would make solving the tasks trivial, especially given that the underlying QA system was finetuned on Natural Questions.

Results are shown in Table 5. Once again, Toolformer clearly outperforms all other models based on GPT-J, this time mostly relying on the Wikipedia search API (99.3%) to find relevant information. However, Toolformer still lags behind the much larger GPT-3 (175B) model. This is likely due to both the simplicity of our search engine (in many cases, it returns results that are clearly not a good match for a given query) and the inability of Toolformer to interact with it, e.g., by reformulating its query if results are not helpful or by browsing through multiple of the top results. We believe that adding this functionality is an exciting direction for future work.

4.2.4 Multilingual Question Answering

We evaluate Toolformer and all baseline models on MLQA (Lewis et al., 2019), a multilingual question-answering benchmark. A context paragraph for each question is provided in English, while the question can be in Arabic, German, Spanish, Hindi, Vietnamese, or Simplified Chinese. In order to solve the task, the model needs to be able to understand both the paragraph and the question, so it may benefit from translating the question into English. Our evaluation metric is the percentage of times the model’s generation, capped at 10 words, contains the correct answer.

Results are shown in Table 6. Using API calls consistently improves Toolformer’s performance for all languages, suggesting that it has learned to make use of the machine translation tool. Depending on the language, this tool is used for 63.8% to 94.9% of all examples; the only exception to this is Hindi, for which the machine translation tool is used in only 7.3% of cases. However, Toolformer does not consistently outperform vanilla GPT-J. This is mainly because for some languages, finetuning on CCNet deteriorates performance; this might be due to a distribution shift compared to GPT-J’s original pretraining data.

OPT and GPT-3 perform surprisingly poorly across all languages, mostly because they fail to provide an answer in English despite being instructed to do so. A potential reason for GPT-J not suffering from this problem is that it was trained on more multilingual data than both OPT and GPT-3, including the EuroParl corpus (Koehn, 2005; Gao et al., 2020). As an upper bound, we also evaluate GPT-J and GPT-3 on a variant of MLQA where both the context and the question are provided in English. In this setup, GPT-3 performs better than all other models, supporting our hypothesis that its subpar performance on MLQA is due to the multilingual aspect of the task.

4.2.5 Temporal Datasets

To investigate the calendar API’s utility, we evaluate all models on TempLAMA (Dhingra et al., 2022) and a new dataset that we call Dateset. TempLAMA is a dataset built from Wikidata that contains cloze queries about facts that change with time (e.g., “Cristiano Ronaldo plays for ___”) as well as the correct answer for the years between 2010 and 2020. Dateset, described in Appendix D, is also generated through a series of templates, but populated using a combination of random dates/durations (e.g., “What day of the week was it 30 days ago?”). Critically, knowing the current date is required to answer these questions. For both tasks, we use the same evaluation as for the original LAMA dataset.

Results shown in Table 7 illustrate that Toolformer outperforms all baselines for both TempLAMA and Dateset. However, closer inspection shows that improvements on TempLAMA cannot be attributed to the calendar tool, which is only used for 0.2% of all examples, but mostly to the Wikipedia search and question answering tools, which Toolformer calls the most. This makes sense given that named entities in TempLAMA are often so specific and rare that even knowing the exact date alone would be of little help. The best course of action for this dataset – first querying the calendar API to get the current date, and then querying the question answering system with this date – is not only prohibited by our restriction of using at most one API call per example, but also hard to learn for Toolformer given that all API calls in its training data are sampled independently.

For Dateset, on the other hand, the considerable improvement of Toolformer compared to other models can be fully accredited to the calendar tool, which it makes use of for 54.8% of all examples.

4.3 Language Modeling

In addition to verifying improved performance on various downstream tasks, we also want to ensure that language modeling performance of Toolformer does not degrade through our finetuning with API calls. To this end, we evaluate our models on two language modeling datasets: WikiText (Merity et al., 2017) and a subset of 10,000 randomly selected documents from CCNet (Wenzek et al., 2020) that were not used during training. Perplexities of various models are shown in Table 8. As one would expect, finetuning on CCNet leads to slightly improved performance on a different CCNet subset, but it slightly deteriorates performance on WikiText, presumably because the original pretraining data for GPT-J is more similar to WikiText than our randomly selected subset of CCNet. Most importantly, however, training on $\mathcal{C}^{*}$ (our dataset annotated with API calls) does not lead to an increase in perplexity compared to training on $\mathcal{C}$ when API calls are disabled at inference time. We do not evaluate the perplexity of Toolformer with API calls enabled as computing the probability $p_{M}(x_{t}\mid x_{1},\ldots,x_{t-1})$ of token $x_{t}$ given $x_{1},\ldots,x_{t-1}$ would require marginalizing over all potential API calls that the model could make at position $t$, which is intractable.

4.4 Scaling Laws

We investigate how the ability to ask external tools for help affects performance as we vary the size of our LM. To this end, we apply our approach not just to GPT-J, but also to four smaller models from the GPT-2 family (Radford et al., 2019), with 124M, 355M, 775M and 1.6B parameters, respectively. We do so using only a subset of three tools: the question answering system, the calculator, and the Wikipedia search engine. Apart from this, we follow the experimental setup described in Section 4.1.

Figure 4 shows that the ability to leverage the provided tools only emerges at around 775M parameters: smaller models achieve similar performance both with and without tools. An exception to this is the Wikipedia search engine used mostly for QA benchmarks; we hypothesize that this is because the API is comparably easy to use. While models become better at solving tasks without API calls as they grow in size, their ability to make good use of the provided API improves at the same time. As a consequence, there remains a large gap between predictions with and without API calls even for our biggest model.

Analysis

We investigate the effect of our modified decoding strategy introduced in Section 4.2, where instead of always generating the most likely token, we generate the $\texttt{<API>}$ token if it is one of the $k$ most likely tokens. Table 9 shows performance on the T-REx subset of LAMA and on WebQS for different values of $k$. As expected, increasing $k$ leads to the model doing API calls for more examples – from 40.3% and 8.5% with $k=1$ (i.e., regular greedy decoding) to 98.1% and 100% for $k=10$. While for T-REx, there is already a clear improvement in performance with greedy decoding, on WebQS our model only starts to make a substantial number of API calls as we slightly increase $k$. Interestingly, for $k=1$ the model is calibrated to some extent: It decides to call APIs for examples that it would perform particularly badly on without making API calls. This can be seen from the fact that performance on examples where it decides not to make an API call (44.3 and 19.9) is higher than average performance if no API calls are made at all (34.9 and 18.9). However, this calibration is lost for higher values of $k$.

We qualitatively analyze some API calls generated with our approach for different APIs. Table 10 shows some examples of texts from CCNet augmented with API calls, as well as the corresponding score $L_{i}^{-}-L_{i}^{+}$ that is used as a filtering criterion, and whether the API calls made by the model are intuitively useful in the given context. As can be seen, high values of $L_{i}^{-}-L_{i}^{+}$ typically correspond to useful API calls, whereas low values correspond to API calls that do not provide any information that is useful for predicting future tokens. There are some exceptions, e.g., an API call for “Fast train success” in the fourth example that does not give any relevant information but still reduces perplexity. However, some amount of noise in the API calls that are not filtered can actually be useful as it forces the model finetuned on $\mathcal{C}^{*}$ to not always blindly follow the results of each call it makes.

Related Work

There are various approaches that augment language models with some form of additional textual information during pretraining, including various forms of metadata (Keskar et al., 2019), HTML tags (Aghajanyan et al., 2021), Wikipedia markup (Schick et al., 2022), or related texts obtained from an information retrieval system (Guu et al., 2020; Borgeaud et al., 2021; Izacard et al., 2022). For all of these approaches, additional information is always provided, regardless of whether it is helpful or not. In contrast, Toolformer learns for itself to explicitly ask for the right information.

Several approaches aim to equip LMs with the ability to use external tools such as search engines (Komeili et al., 2022; Thoppilan et al., 2022; Lazaridou et al., 2022; Shuster et al., 2022; Yao et al., 2022), web browsers (Nakano et al., 2021), calculators (Cobbe et al., 2021; Thoppilan et al., 2022), translation systems (Thoppilan et al., 2022) and Python interpreters (Gao et al., 2022). The way these models learn to use tools can roughly be divided into two approaches: Either they rely on large amounts of human supervision (Komeili et al., 2022; Nakano et al., 2021; Thoppilan et al., 2022) or they work by prompting the language model in a few-shot setup tailored towards a specific task where it is known a priori which tools need to be used (Gao et al., 2022; Lazaridou et al., 2022; Yao et al., 2022). In contrast, the self-supervised nature of Toolformer enables it to learn how and when to use tools without requiring a specific prompt that shows task-specific examples of how a tool could be used. Perhaps most closely related to our work is TALM (Parisi et al., 2022), an approach that uses a similar self-supervised objective for teaching a model to use a calculator and a search engine, but explores this only in settings where a model is finetuned for downstream tasks.

The idea of using self-training and bootstrapping techniques to improve models has been investigated in various contexts, ranging from word sense disambiguation (Yarowsky, 1995), relation extraction (Brin, 1999; Agichtein and Gravano, 2000), parsing (McClosky et al., 2006; Reichart and Rappoport, 2007), sequence generation (He et al., 2020), few-shot text classification (Schick and Schütze, 2021a) and retrieval (Izacard and Grave, 2021) to reasoning (Zelikman et al., 2022). In a similar spirit to these approaches, Toolformer is trained on its own predictions after applying a perplexity-based filtering step.

Limitations

While our approach enables LMs to learn how to use a variety of tools in a self-supervised way, there are some clear limitations to what can be achieved with our method in its current form. One such limitation is the inability of Toolformer to use tools in a chain (i.e., using the output of one tool as an input for another tool). This is due to the fact that API calls for each tool are generated independently; as a consequence, there are no examples of chained tool use in the finetuning dataset. Our current approach also does not allow the LM to use a tool in an interactive way. Especially for tools such as search engines, which could potentially return hundreds of different results, enabling a LM to browse through these results or to refine its search query in a similar spirit to Nakano et al. (2021) can be crucial for certain applications. Beyond this, we found models trained with Toolformer to often be sensitive to the exact wording of their input when deciding whether or not to call an API; this is perhaps unsurprising given that LMs are known to be very sensitive to the prompt they are provided with in both zero- and few-shot settings (Jiang et al., 2020; Schick and Schütze, 2021a). Depending on the tool, our method is also very sample-inefficient; for example, processing more than a million documents results in only a few thousand examples of useful calls to the calculator API. A potential solution to this problem might be to iteratively apply our approach, similar to how this is done in related bootstrapping approaches (Schick and Schütze, 2021a; Izacard and Grave, 2021; Parisi et al., 2022). Finally, when deciding whether or not to make an API call, Toolformer currently does not take into account the tool-dependent, computational cost incurred from making an API call.

Conclusion

We have introduced Toolformer, a language model that learns in a self-supervised way how to use different tools such as search engines, calculators, and translation systems via simple API calls. This is done by finetuning on a large number of sampled API calls that are filtered based on whether they reduce perplexity on future tokens. Toolformer considerably improves zero-shot performance of a 6.7B parameter GPT-J model, enabling it to even outperform a much larger GPT-3 model on a range of different downstream tasks.

References

Appendix A API Details

When sampling and filtering API calls, by default we use values of $\tau_{s}=0.05$ and $\tau_{f}=1.0$ – i.e., we only make API calls at positions where the probability of the <API> token is at least 5%, and we keep API calls if they reduce the loss by at least 1.0. We only keep the top $k=5$ such positions and sample up to $m=5$ API calls for each position identified in a piece of text. Due to the heuristic filtering described below, we generate API calls for the calculator and machine translation system on only a small subset of $\mathcal{C}$; to compensate for this, we set $\tau_{s}=0.0$, $k=20$ and $m=10$ for these tools. As the resulting sets of API calls are still comparatively small, we additionally set $\tau_{f}=0.5$.
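The thresholds above can be read as a two-stage selection. The following is a minimal sketch, not the implementation used for the paper: the names `SamplingConfig`, `select_positions` and `keep_call` are hypothetical, the per-position probabilities and losses are assumed to be computed elsewhere by the LM, and `loss_without` stands for whatever baseline loss the filtering criterion compares against.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    tau_s: float = 0.05  # minimum probability of starting an API call at a position
    tau_f: float = 1.0   # minimum loss reduction for a call to be kept
    k: int = 5           # maximum number of positions kept per text
    m: int = 5           # maximum number of calls sampled per position


# Relaxed settings stated above for the calculator and machine translation tools.
SPARSE_TOOL_CONFIG = SamplingConfig(tau_s=0.0, k=20, m=10, tau_f=0.5)


def select_positions(api_start_probs, cfg):
    """Return up to cfg.k positions whose probability is at least cfg.tau_s."""
    candidates = [i for i, p in enumerate(api_start_probs) if p >= cfg.tau_s]
    # Keep the k most probable candidates, then restore left-to-right order.
    top = sorted(candidates, key=lambda i: api_start_probs[i], reverse=True)[: cfg.k]
    return sorted(top)


def keep_call(loss_without, loss_with, cfg):
    """Keep an API call if it lowers the loss by at least cfg.tau_f."""
    return loss_without - loss_with >= cfg.tau_f
```

With the default configuration, a call that lowers the loss from 3.0 to 2.4 is discarded, whereas the relaxed configuration with $\tau_{f}=0.5$ keeps it.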

We use the Atlas model of Izacard et al. (2022) finetuned on Natural Questions (Kwiatkowski et al., 2019) as our question answering system. For creating $\mathcal{C}^{*}$ we use Atlas-large, enabling us to efficiently process millions of API calls; during inference, we use the larger Atlas-xxl model.

Our calculator is based on a simple Python script and only supports the operators “ $+$ ”, “ $-$ ”, “ $*$ ”, and “ $/$ ”. It does not return any result for syntactically invalid equations. For sampling API calls, we apply heuristic filters to our subset of CCNet and only process documents that either (i) contain at least three numbers within a window of 100 tokens, where one of these numbers is the result of applying a mathematical operation to the other two, (ii) contain one of the sequences “=”, “equals”, “equal to”, “total of”, “average of” followed by a number, or (iii) contain at least three numbers; for texts that only match the last criterion, we only keep a random subset of 1%.
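A calculator with this behaviour can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the script used for the paper: the function names are hypothetical, parentheses are accepted because the parser handles them, division by zero is treated like an invalid equation, and the regular expression for criterion (ii) assumes that optional whitespace may separate the cue from the number.

```python
import ast
import operator
import re

# The four operators the calculator is described as supporting.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node):
    """Recursively evaluate a parsed expression made of numbers and _OPS."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")


def calculator(expression):
    """Return the value of the expression, or None if it is not valid."""
    try:
        tree = ast.parse(expression.strip(), mode="eval")
        return _eval(tree.body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return None


# Criterion (ii): one of the listed cue sequences followed by a number.
_CUE = re.compile(r"(?:=|equals|equal to|total of|average of)\s*\d")


def has_arithmetic_cue(text):
    """Return True if the text contains a cue sequence followed by a digit."""
    return _CUE.search(text) is not None
```

Parsing the expression into a syntax tree, rather than passing it to `eval`, restricts evaluation to the listed operators, so an input such as `18 + 12 x 3` yields no result.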

For the calendar tool, when creating our dataset $\mathcal{C}^{*}$, we operate under the assumption that the calendar date should be the date on which the document was created. We approximate this by extracting the date from the URL, if it is present. We filter out texts for which a date cannot be extracted, leaving around 18% of the documents.
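Extracting a date from a URL can be sketched as follows. The URL layout matched here (a `/YYYY/MM/DD` path segment) is an assumption made for illustration, as the text does not specify which patterns are recognized, and `date_from_url` is a hypothetical name.

```python
import re
from datetime import date

# Assumed URL shape: .../YYYY/MM/DD/... ; other date layouts are not handled.
_URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_url(url):
    """Return the date embedded in the URL path, or None if there is none."""
    match = _URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched the pattern but do not form a real calendar date.
        return None
```

Documents for which this function returns `None` would be the ones filtered out.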

For both training and inference, we use the 600M parameter NLLB (Costa-jussà et al., 2022) as our machine translation (MT) model. The source language is automatically detected using the fastText classifier (Joulin et al., 2016), while the target language is always set to English. Since most of the CCNet dataset is in English, we filter out the parts that contain only English text before generating API calls. More specifically, we only keep those paragraphs which contain text chunks in a language other than English that are preceded and followed by English text. We use text chunks of 10 tokens. To determine whether the middle text chunk is in a language other than English, we again use the fastText classifier and require a confidence greater than 0.8. We also filter out any text chunks that contain only numbers or special symbols. This filtering mechanism allows us to generate data more efficiently by focusing our API call generation on places where the MT tool is likely to be helpful. After generating the MT API calls, we additionally remove from our training set those where the input to the MT tool appears after the API call but not before it. While the model can look ahead to generate API calls during data generation, this is not possible at inference time, so we want to dissuade the model from calling the API in such cases.
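The final look-ahead filter can be sketched as a simple string check. This is a minimal illustration, assuming that the position of an API call is available as a character offset into the text and that exact substring matching is an adequate test; `keep_mt_call` is a hypothetical name.

```python
def keep_mt_call(text, call_position, mt_input):
    """Return False if mt_input occurs only after the position of the API call.

    `call_position` is assumed to be the character offset in `text` at which
    the API call is inserted.
    """
    appears_before = mt_input in text[:call_position]
    appears_after = mt_input in text[call_position:]
    # Drop calls that could only have been produced by looking ahead.
    return not (appears_after and not appears_before)
```

A call placed directly after the foreign-language chunk is kept, while the same call placed at the very start of the text is removed.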

A.2 Prompts

Below, we list the prompts used to sample API calls for each tool considered.

We use the following prompt for the question answering tool:

Your task is to add calls to a Question Answering API to a piece of text. The questions should help you get information required to complete the text. You can call the API by writing "[QA(question)]" where "question" is the question you want to ask. Here are some examples of API calls:

Input: Joe Biden was born in Scranton, Pennsylvania. Output: Joe Biden was born in [QA("Where was Joe Biden born?")] Scranton, [QA("In which state is Scranton?")] Pennsylvania.

Input: Coca-Cola, or Coke, is a carbonated soft drink manufactured by the Coca-Cola Company. Output: Coca-Cola, or [QA("What other name is Coca-Cola known by?")] Coke, is a carbonated soft drink manufactured by [QA("Who manufactures Coca-Cola?")] the Coca-Cola Company.

We use the following prompt for the calculator:

Your task is to add calls to a Calculator API to a piece of text. The calls should help you get information required to complete the text. You can call the API by writing "[Calculator(expression)]" where "expression" is the expression to be computed. Here are some examples of API calls:

Input: The number in the next term is 18 + 12 x 3 = 54. Output: The number in the next term is 18 + 12 x 3 = [Calculator(18 + 12 * 3)] 54.

Input: The population is 658,893 people. This is 11.4 Output: The population is 658,893 people. This is 11.4

Input: A total of 252 qualifying matches were played, and 723 goals were scored (an average of 2.87 per match). This is three times less than the 2169 goals last year. Output: A total of 252 qualifying matches were played, and 723 goals were scored (an average of [Calculator(723 / 252)] 2.87 per match). This is twenty goals more than the [Calculator(723 - 20)] 703 goals last year.

Input: I went to Paris in 1994 and stayed there until 2011, so in total, it was 17 years. Output: I went to Paris in 1994 and stayed there until 2011, so in total, it was [Calculator(2011 - 1994)] 17 years.

Input: From this, we have 4 * 30 minutes = 120 minutes. Output: From this, we have 4 * 30 minutes = [Calculator(4 * 30)] 120 minutes.

We use the following prompt for the Wikipedia search tool:

Your task is to complete a given piece of text. You can use a Wikipedia Search API to look up information. You can do so by writing "[WikiSearch(term)]" where "term" is the search term you want to look up. Here are some examples of API calls:

Input: The colors on the flag of Ghana have the following meanings: red is for the blood of martyrs, green for forests, and gold for mineral wealth. Output: The colors on the flag of Ghana have the following meanings: red is for [WikiSearch("Ghana flag red meaning")] the blood of martyrs, green for forests, and gold for mineral wealth.

Input: But what are the risks during production of nanomaterials? Some nanomaterials may give rise to various kinds of lung damage. Output: But what are the risks during production of nanomaterials? [WikiSearch("nanomaterial production risks")] Some nanomaterials may give rise to various kinds of lung damage.

Input: Metformin is the first-line drug for patients with type 2 diabetes and obesity. Output: Metformin is the first-line drug for [WikiSearch("Metformin first-line drug")] patients with type 2 diabetes and obesity.

We use the following prompt for the machine translation tool:

Your task is to complete a given piece of text by using a Machine Translation API. You can do so by writing "[MT(text)]" where text is the text to be translated into English. Here are some examples:

Input: He has published one book: O homem suprimido (“The Supressed Man”) Output: He has published one book: O homem suprimido [MT(O homem suprimido)] (“The Supressed Man”)

Input: In Morris de Jonge’s Jeschuah, der klassische jüdische Mann, there is a description of a Jewish writer Output: In Morris de Jonge’s Jeschuah, der klassische jüdische Mann [MT(der klassische jüdische Mann)], there is a description of a Jewish writer

Input: 南京高淳县住房和城乡建设局城市新区设计 a plane of reference Gaochun is one of seven districts of the provincial capital Nanjing Output: [MT(南京高淳县住房和城乡建设局城市新区设计)] a plane of reference Gaochun is one of seven districts of the provincial capital Nanjing

We use the following prompt for the calendar tool:

Your task is to add calls to a Calendar API to a piece of text. The API calls should help you get information required to complete the text. You can call the API by writing "[Calendar()]" Here are some examples of API calls:

Input: Today is the first Friday of the year. Output: Today is the first [Calendar()] Friday of the year.

Input: The president of the United States is Joe Biden. Output: The president of the United States is [Calendar()] Joe Biden.

Input: The current day of the week is Wednesday. Output: The current day of the week is [Calendar()] Wednesday.

Input: The number of days from now until Christmas is 30. Output: The number of days from now until Christmas is [Calendar()] 30.

Input: The store is never open on the weekend, so today it is closed. Output: The store is never open on the weekend, so today [Calendar()] it is closed.

Appendix B Toolformer Training

We use up to 25k examples per API, a maximum sequence length of 1,024, and an effective batch size of 128. All models are trained using DeepSpeed’s ZeRO-3 (Rasley et al., 2020) on 8 NVIDIA A100 40GB GPUs with BF16. We train for up to 2k steps, evaluating perplexity (PPL) every 500 steps on a small development set of 1,000 examples from CCNet, and pick the checkpoint that performs best.
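The checkpoint selection described here amounts to taking the evaluation step with the lowest development-set perplexity. The following is a minimal sketch with hypothetical helper names; it assumes that perplexity is computed as the exponential of the mean per-token negative log-likelihood and that an evaluation also happens at the final step.

```python
import math

MAX_STEPS = 2_000   # training runs for up to 2k steps
EVAL_EVERY = 500    # perplexity is evaluated every 500 steps


def perplexity(token_nlls):
    """Perplexity as the exponential of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def eval_steps(max_steps=MAX_STEPS, every=EVAL_EVERY):
    """Steps at which the development set is evaluated."""
    return list(range(every, max_steps + 1, every))


def best_checkpoint(ppl_by_step):
    """Return the step whose development-set perplexity is lowest."""
    return min(ppl_by_step, key=ppl_by_step.get)
```

Under these settings, at most four checkpoints (steps 500, 1000, 1500 and 2000) are compared.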

Appendix C Zero-Shot Prompts

For both LAMA and TempLAMA, given an input text $\mathbf{x}$ , we use the following prompt: Please complete the following text so that it is factually correct: $\mathbf{x}$ .

C.2 Math Benchmarks

For all math benchmarks, given a context $\mathbf{x}$ and a question $\mathbf{q}$ , our prompt is: $\mathbf{x}\ \mathbf{q}$ The answer is.

C.3 Question Answering

For all question answering datasets, including Dateset, we simply prefix the question with “Answer the following question:”. We append a question mark if the question does not already end with one.

C.4 Multilingual Question Answering

For MLQA, given a context $\mathbf{x}$ and a question $\mathbf{q}$ , our prompt is: Your task is to answer a question based on the following paragraph: $\mathbf{x}$ Now answer the following question in English: $\mathbf{q}$ .
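The zero-shot prompts listed in this appendix can be assembled with simple string templates. The sketch below uses hypothetical function names and assumes that the individual parts are joined by single spaces; the exact whitespace and line breaks are not specified in the text.

```python
def lama_prompt(x):
    """Prompt for LAMA and TempLAMA, given an input text x."""
    return f"Please complete the following text so that it is factually correct: {x}"


def math_prompt(x, q):
    """Prompt for the math benchmarks, given a context x and a question q."""
    return f"{x} {q} The answer is"


def qa_prompt(question):
    """Prompt for the question answering datasets, including Dateset."""
    question = question.strip()
    # Append a question mark if the question does not already end with one.
    if not question.endswith("?"):
        question += "?"
    return f"Answer the following question: {question}"


def mlqa_prompt(x, q):
    """Prompt for MLQA, given a paragraph x and a question q."""
    return (
        "Your task is to answer a question based on the following paragraph: "
        f"{x} Now answer the following question in English: {q}"
    )
```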

Appendix D Dateset

Dateset is created by first randomly selecting 500 “current dates”. For each current date, a second date in the past or future is randomly selected within a four-year range, and the two dates are used to fill the query templates in Table 11. An example of one such query using the first template would be, “How many days ago was August 14, 2020?” If called, the Calendar tool would return the presumed current date (e.g., “Today is Friday, November 20, 2020”).
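The construction of a single Dateset example can be sketched as follows, using only the template quoted above. The function names, the range from which current dates are drawn, and the reading of "four-year range" as an offset of at most 4 × 365 days into the past are all assumptions made for illustration.

```python
import random
from datetime import date, timedelta


def format_date(d):
    """Format a date as in the example query, e.g. 'August 14, 2020'."""
    return f"{d:%B} {d.day}, {d.year}"


def calendar_response(current):
    """Text returned by the Calendar tool for a presumed current date."""
    return f"Today is {current:%A}, {format_date(current)}"


def make_example(rng, earliest=date(2000, 1, 1), latest=date(2022, 12, 31)):
    """Build one 'How many days ago' example; the date range is an assumption."""
    span = (latest - earliest).days
    current = earliest + timedelta(days=rng.randint(0, span))
    # Assumed reading of the four-year range: 1 to 4 * 365 days in the past.
    offset = rng.randint(1, 4 * 365)
    past = current - timedelta(days=offset)
    return {
        "current": current,
        "query": f"How many days ago was {format_date(past)}?",
        "answer": offset,
    }


# 500 "current dates", one example each, with a fixed seed for reproducibility.
rng = random.Random(0)
examples = [make_example(rng) for _ in range(500)]
```

Each example keeps the presumed current date, so that `calendar_response` can supply it whenever the model calls the Calendar tool.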