Solving Quantitative Reasoning Problems with Language Models

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, Vedant Misra

Introduction

Artificial neural networks have seen remarkable success in a variety of domains including computer vision, speech recognition, audio and image generation, translation, game playing, and robotics. In particular, large language models have achieved excellent performance across a variety of natural language tasks including common-sense reasoning, question answering, and summarization (Raffel et al., 2019; Brown et al., 2020; Rae et al., 2021; Smith et al., 2022; Chowdhery et al., 2022). However, these models have struggled with tasks that require quantitative reasoning, such as solving mathematics, science, and engineering problems (Hendrycks et al., 2021; Cobbe et al., 2021).

Quantitative reasoning problems are an interesting domain of application for language models because they test the capability of models on several fronts. They require the solver to correctly parse a natural language input, potentially recall world knowledge that pertains to the problem, and apply an algorithm or series of computations to the information provided in order to arrive at a correct solution. They also require that the solver is able to correctly parse and generate precise sequences of mathematical tokens, as well as apply a computational procedure to tokens via symbolic and numerical manipulation. Finally, such problems are a proving ground for research toward robust quantitative reasoning solvers that are useful in supporting the work of humans in scientific and technical fields.

Previous research has shown that large language models achieve impressive performance on math and programming questions after training on domain-specific datasets (Chen et al., 2021; Austin et al., 2021; Drori et al., 2021). In this work, we apply this approach to quantitative reasoning problems, in which the model is required to provide a complete and self-contained solution, without relying on external tools. Such tasks include mathematics word problems, competition mathematics evaluations, and many problems in science and engineering.

Question: A line parallel to $y=4x+6$ passes through $(5,10)$. What is the $y$-coordinate of the point where this line crosses the $y$-axis?

Target: Since the line is parallel to $y=4x+6$, we know that its slope is 4. Thus the equation of the line is $y=4x+b$ where $b$ is the $y$-intercept. Plugging $(5,10)$ into this gives $10=4\cdot 5+b\Rightarrow b=\boxed{-10}$ which is what we wanted.

Model output: The line is parallel to $y=4x+6$, which means that it has the same slope as $y=4x+6$. Since the line passes through $(5,10)$, we can use the point-slope form of a line to find the equation of the line:

\begin{align*}
y-10 &= 4(x-5)\\
y-10 &= 4x-20\\
y &= 4x-10
\end{align*}

Now that we have the equation of the line, we can find the $y$-coordinate of the point where the line crosses the $y$-axis by substituting $x=0$ into the equation: $y=4\cdot 0-10=\boxed{-10}.$

Figure 1: Question from the MATH dataset (Algebra 4) with a solution generated by Minerva 62B. The solution was selected by majority vote over 256 samples. This and other MATH questions were prompted with the same 4-shot prompt. The model derives the solution by writing down a line equation, simplifying it, substituting $x$, and solving for $y$. The model outputs valid LaTeX along with natural language explanations. The approach taken by the model is different from the one used in the ground truth target.

Question: Assume that the variance of the first $n$ natural numbers is 10, and the variance of the first $m$ even natural numbers is 16. Compute $m+n$.

Model output: Let $\sigma^{2}$ be the variance of the first $m$ even natural numbers, and $\tau^{2}$ be the variance of the first $n$ natural numbers. Then, we have $\tau^{2}=\frac{n^{2}-1}{12}$ and $\sigma^{2}=\frac{m^{2}-1}{3}$. We solve the system of equations:

\begin{align*}
\frac{n^{2}-1}{12} &= 10\\
\frac{m^{2}-1}{3} &= 16.
\end{align*}

This translates into the system

\begin{align*}
n^{2} &= 121\\
m^{2} &= 48+1=49.
\end{align*}

Therefore, $n=\boxed{11}$ and $m=\boxed{7}$, so $n+m=\boxed{18}$.

Question: For every $a,b$, $b\neq a$ prove that $\frac{a^{2}+b^{2}}{2}>\left(\frac{a+b}{2}\right)^{2}.$

Model output:

\[
\begin{array}{rll}
\frac{a^{2}+b^{2}}{2}>\left(\frac{a+b}{2}\right)^{2}\\
\Longleftrightarrow\frac{a^{2}+b^{2}}{2}>\frac{a^{2}+b^{2}+2ab}{4}\\
\Longleftrightarrow a^{2}+b^{2}>\frac{a^{2}+b^{2}+2ab}{2}\\
\Longleftrightarrow 2a^{2}+2b^{2}>a^{2}+b^{2}+2ab\\
\Longleftrightarrow a^{2}+b^{2}>2ab\\
\Longleftrightarrow a^{2}+b^{2}-2ab>0\\
\Longleftrightarrow(a-b)^{2}>0
\end{array}
\]

which is true, because the square of a real number is positive.

Figure 2: Example questions with solutions from Minerva 540B. Left: Question from the Joint Entrance Examination Main Math 2020 exam taken each year by almost 2M Indian high-school students intending to study engineering and similar fields. Right: Question from the National Math Exam in Poland (May 2022). The exam is taken by approximately 270K high-school students every year.

1.1 Our Contribution

We present Minerva, a language model that achieves strong performance on many quantitative reasoning tasks. Our model is able to process scientific and mathematical questions formulated in natural language, and generate step-by-step solutions using correct LaTeX notation. Figures 1, 2 and 3 show example responses from Minerva to mathematics and physics questions.

Minerva is based on the PaLM general language models (Chowdhery et al., 2022) that are further trained on a high-quality dataset containing scientific and mathematical data. In particular, we start with 8B, 62B, and 540B parameter pretrained models, and continue training them on our technical content dataset. We achieve state-of-the-art performance on MATH (Hendrycks et al., 2021), GSM8k (Cobbe et al., 2021), and a STEM subset of the MMLU (Hendrycks et al., 2020) dataset, all of which include math and science questions formulated in natural language. Notably, our models achieve robust performance in a few-shot setting without being explicitly trained on these evaluation datasets.

The main novelty of this paper is a large training dataset that juxtaposes natural language with the correct use of formal mathematical language, such as equations and diagrams. The data is collected from the arXiv preprint server and from web pages that we carefully process to minimise the loss of mathematical content. In addition to establishing new state-of-the-art results, this work establishes a new baseline for the performance achievable on quantitative reasoning benchmarks by increasing data quality and model size.

To further increase the coverage of quantitative reasoning benchmarks, we build a dataset of over 200 undergraduate-level questions in science and mathematics from MIT’s OpenCourseWare (OCW). This provides a measure of our model’s quantitative reasoning abilities in a chain-of-thought context beyond a pure mathematical setting.

1.2 Related Works

Solving quantitative reasoning problems expressed in natural language has been an active area of study (Koncel-Kedziorski et al., 2015; Hosseini et al., 2014). Prompting language models using scratchpad (Nye et al., 2021) or chain-of-thought (Wei et al., 2022) solutions can lead them to output step-by-step solutions to unseen problems. The GSM8k work (Cobbe et al., 2021) showed that training verifiers to rerank model outputs can lead to improved performance. The original version of GSM8k included special syntax for algebraic calculations, which were processed by a calculator. In this work we focus on self-contained models without access to external tools.

The standard method for evaluating language models on generative tasks is to greedily sample one solution per problem. Recent works (Chen et al., 2021; Li et al., 2022; Thoppilan et al., 2022; Wang et al., 2022) have shown that it is advantageous to sample multiple solutions per problem, and then filter those down to a final answer. We find that majority voting (Wang et al., 2022) significantly improves performance over greedy decoding.

Drori et al. (2021) include an evaluation of davinci-002, OpenAI’s latest publicly available language model, on a subset of 90 problems from the MATH dataset. Because that evaluation covers only a subset of questions and changes the way questions are formatted, it is difficult to directly compare our results with those of Drori et al. (2021). In Section 3, we compare OpenAI davinci-002 with our models under the same experimental conditions.

Applying code-generating models to mathematical problems has been an active area of exploration. PaLM (Chowdhery et al., 2022) showed that a large language model with code in its training dataset can achieve good performance on a code version of GSM8k. Furthermore, the Codex model (Chen et al., 2021) can generate code solutions to MATH problems (Drori et al., 2021). These solutions often rely on external libraries to perform mathematical operations such as solving equations or taking limits. This is a complementary approach to ours, in which we directly probe the model’s ability to arrive at an answer by relying only on its own reasoning capability.

Mathematics developed as a discipline based in natural language, but its axiomatic fundamentals make it possible to simulate mathematical thinking. This can be achieved using specialized programming languages that facilitate the simulation of logical and mathematical thinking using a computer, such as Coq (development team, 2022), Isabelle (Wenzel et al., 2008), HOL4 (Harrison, 1996), Lean (de Moura et al., 2015), Metamath (Megill and Wheeler, 2019) and Mizar (Grabowski et al., 2010). Work on automation of proof assistants and automated theorem provers such as E (Schulz, 2013), leanCoP (Otten, 2008), and Vampire (Kovács and Voronkov, 2013) has substantially benefited from integration with machine learning methods (Alemi et al., 2016; Goertzel et al., 2021; Li et al., 2021; Polu and Sutskever, 2020; Kaliszyk et al., 2018).

Previous work trained language models to predict mathematical expressions (Rabe et al., 2021; Li et al., 2021; Polu and Sutskever, 2020; Wu et al., 2021; Han et al., 2022; Polu et al., 2022; Jiang et al., 2022; Wu et al., 2022). In turn, such a predictive model can be used to guide a proof search, as done by Polu and Sutskever (2020). Large language models excel in modelling natural language, though in the case of formal languages, models that facilitate retaining information about the graph structure of a given mathematical formula, such as GNNs, are still very competitive.

New benchmark datasets (Hendrycks et al., 2021; Welleck et al., 2021) cover more advanced mathematical topics. In this domain, language models face limited competition from other classes of models.

Training and Evaluation

Our models were trained on a dataset of 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server. In addition, the dataset includes general natural language data, which is the same dataset that was used for pretraining PaLM. Our mathematical webpage dataset was constructed by collecting pages that contain mathematical expressions in MathJax format. The pages underwent a cleaning process that removes most HTML tags but preserves mathematical notation, including LaTeX symbols and formatting. The result is that mathematical formulae like $e^{\pi i}+1=0$ or $E=mc^{2}$ are presented in full to the model during training. This procedure makes it possible for the model to perform well on tasks that require calculation and symbolic manipulation. Table 1 provides a breakdown of the training dataset. See Appendix B for more details.

2.2 Models and Training Procedure

Our approach is to start with the PaLM pretrained decoder-only transformer language models (Chowdhery et al., 2022), and further train (finetune) them on our mathematical dataset using an autoregressive objective. Table 2 contains the main model and training hyperparameters. The largest model, with 540B parameters, was finetuned on 26B tokens. While this model is highly undertrained compared to the 8B and 62B models, it still achieves superior performance. Additional details can be found in Appendix C.

2.3 Evaluation Datasets

We mainly focus on few-shot evaluation, though see Appendix E.3 for a discussion of finetuned evaluation. For evaluation, we truncate the inputs from the left to 1024 tokens and we use the model to generate up to 512 tokens. When sampling once per problem, we sample greedily. When sampling multiple times per problem we use nucleus sampling (Holtzman et al., 2019) with temperature $T=0.6$, $p=0.95$. For generative tasks, the model produces a chain-of-thought answer and demarcates a final answer. We evaluate a solution as correct if the final answer matches the ground truth solution, independent of the quality of the chain-of-thought preceding it. To evaluate correctness, we parse the final answers and compare them using the SymPy library (Meurer et al., 2017). This is done in order to correctly identify answers that are mathematically equivalent such as $1/\sqrt{3}$ and $\sqrt{3}/3$. See Appendix D.1 for further details.
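As a rough illustration of the equivalence check described above, the following sketch compares two final answers symbolically. It assumes the answers have already been converted from LaTeX into plain SymPy-parsable strings; the function name `answers_equivalent` and the string-comparison fallback are illustrative assumptions, not the normalization procedure of Appendix D.1.

```python
import sympy
from sympy.parsing.sympy_parser import parse_expr


def answers_equivalent(predicted: str, truth: str) -> bool:
    """Return True if two final answers are mathematically equal.

    Assumption: both strings are already in SymPy syntax (not raw LaTeX).
    """
    try:
        # Two expressions are treated as equal if their difference simplifies to zero.
        difference = sympy.simplify(parse_expr(predicted) - parse_expr(truth))
    except Exception:
        # Hypothetical fallback for unparsable answers: exact string match.
        return predicted.strip() == truth.strip()
    return difference == 0


print(answers_equivalent("1/sqrt(3)", "sqrt(3)/3"))
print(answers_equivalent("1/3", "3"))
```

A symbolic comparison of this kind avoids penalizing a solution merely because it rationalizes a denominator or orders terms differently from the reference answer.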

The existing datasets on which we focus are:

MATH: a dataset of 12K middle school and high school mathematics problems (Hendrycks et al., 2021). Problem statements are written in LaTeX. We prompt the model with a fixed 4-shot prompt (listed in Appendix D.2). This prompt includes four random examples from the training dataset whose ground truth targets are not too long.

GSM8k: middle school math word problems (Cobbe et al., 2021). Models are evaluated using the chain-of-thought prompt from Wei et al. (2022). Previous models evaluated on GSM8k made use of an external calculator. In this work, our model does not have access to any external tools.

MMLU-STEM: subset of the MMLU dataset (Hendrycks et al., 2020) focused on science, technology, engineering, and mathematics (STEM). For the original version, we use the 5-shot prompt from the development set for each task. We also consider chain-of-thought prompting for this task, where we prompt the model with examples that include step-by-step solutions. We use a multiple-choice version of the MATH prompt for topics that involve mathematical reasoning, and add step-by-step solutions to the standard 5-shot prompts for the rest of the topics. See Appendix G for more details.

2.4 Undergraduate-Level STEM Problems

To evaluate the scientific reasoning capabilities of Minerva, we harvested a set of undergraduate-level STEM problems, most of which involve multi-step reasoning. We refer to this set as OCWCourses. Using publicly-available course materials offered by MIT (OpenCourseWare), we collected problems with automatically-verifiable solutions (either numeric or symbolically verifiable via SymPy) from courses including “solid-state chemistry”, “information and entropy”, “differential equations”, and “special relativity.” These problems were processed by contractors to be self-contained and to have a clearly-delineated final answer. Problems asking for a proof or open-ended short answer were not included. In total we curated 272 problems, 191 of which have numeric solutions and 81 of which have symbolic solutions. In Appendix F, we detail the contributions from each course, and the process of converting these course materials into a format suitable for processing by language models. We also provide the text of all problems. We plan to release these as part of an open-source dataset which will be detailed in an upcoming manuscript.

2.5 Inference-Time Techniques

We find that we can considerably outperform greedy decoding by sampling $k>1$ solutions (with a non-zero temperature) and selecting one using majority voting (Wang et al., 2022). This consists of grouping predictions with respect to their final answer and selecting the most common answer. We denote this as maj1@k, following Li et al. (2022). A variation of this algorithm, denoted majn@k, involves selecting the $n$ most common answers. Intuitively, the reason majority voting improves performance is that while there are many ways to answer a question incorrectly, there are typically very few ways to answer correctly.
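The grouping-and-counting step can be sketched in a few lines. This is a minimal illustration, assuming each sample has already been reduced to a final-answer string (or `None` when no final answer could be parsed); the function names are hypothetical, and ties are broken by first occurrence, which is an arbitrary choice made here.

```python
from collections import Counter


def majority_vote(final_answers, n=1):
    """Return the n most common final answers among k samples (majn@k candidates)."""
    # Samples without a parsable final answer (None) do not get a vote.
    counts = Counter(answer for answer in final_answers if answer is not None)
    return [answer for answer, _ in counts.most_common(n)]


def maj_n_at_k_correct(final_answers, truth, n=1):
    """A problem counts as solved if the truth is among the n most common answers."""
    return truth in majority_vote(final_answers, n)


samples = ["-10", "-10", "-10", "10", "10", "4", None]
print(majority_vote(samples))               # ['-10']
print(maj_n_at_k_correct(samples, "10"))    # False under maj1@k
print(maj_n_at_k_correct(samples, "10", 2)) # True under maj2@k
```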

Contrast majority voting with pass@k, where a task is considered solved if any single sample solves it out of $k$ samples. See Section 4.2 for more details on pass@k performance. In Appendix E.1, we report on how performance depends on $k$ for different metrics. We find that while pass@k continues to improve as $k$ is increased, majority voting performance saturates faster: 97% of the large $k$ accuracy is achieved at $k=64$ for MATH and $k=16$ for GSM8k. This is likely because majority voting selects the most common answer in the modeled distribution, and the error of this estimate decreases with increasing $k$. This is in contrast to pass@k where the performance improvement comes from the tail of the distribution, which can keep improving as $k$ is increased.
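The difference between the two metrics can be illustrated with a toy simulation. The sketch below assumes samples are drawn independently from a fixed distribution over final answers; the answer distribution, function names, and trial count are invented for illustration and do not come from the experiments in Appendix E.1.

```python
import random
from collections import Counter


def pass_at_k_probability(p_correct, k):
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** k


def maj1_at_k_probability(answer_probs, correct, k, trials=2000, seed=0):
    """Monte Carlo estimate of how often the most common of k samples is correct."""
    rng = random.Random(seed)
    answers = list(answer_probs)
    weights = [answer_probs[a] for a in answers]
    wins = 0
    for _ in range(trials):
        draws = rng.choices(answers, weights=weights, k=k)
        top_answer = Counter(draws).most_common(1)[0][0]
        wins += top_answer == correct
    return wins / trials


# Hypothetical problem where the correct answer sits in the tail of the distribution.
hard = {"right": 0.05, "wrong_a": 0.40, "wrong_b": 0.30, "wrong_c": 0.25}
for k in (1, 16, 64, 256):
    print(k, pass_at_k_probability(hard["right"], k), maj1_at_k_probability(hard, "right", k))
```

In this toy setting pass@k keeps rising with $k$ because a rare correct sample eventually appears, while maj1@k converges to whether the correct answer is the mode of the distribution.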

Log-likelihood is another metric that can be used to rerank samples. We found that majority voting performs significantly better than log-likelihood reranking (see Appendix E.2).

Results

Table 3 summarizes the results for Minerva models and other models, on the evaluation datasets described in Section 2.3. Figure 4 presents a breakdown of the MATH dataset results by subtopic. For MMLU evaluations, unless otherwise noted, performance is measured by using the standard 5-shot prompt per topic and picking the answer with the highest score. When evaluating MMLU with majority voting, we sample $k=16$ model answers using a chain-of-thought prompt.

We present model output samples in Figures 1, 2 and 3, and additional output samples are listed in the Appendix. In addition, we evaluated Minerva 62B on the National Math Exam in Poland and found that it achieves a score of 57%, which happened to be the national average in 2021 (CKE, 2021, p. 23). The 540B model achieves 65%.

We include results on the latest publicly available language model from OpenAI, davinci-002, evaluated using the OpenAI API with temperature set to the official recommendation ($T=0.2$). The combination of training data, scale and inference techniques yields state-of-the-art results on all the technical tasks that we considered. For all tasks (with the exception of GSM8k), the improvement with respect to previous results is considerable.

While our main focus is on few shot evaluation, we also tried to finetune Minerva on MATH. While we did not observe any improvement, we found that finetuning PaLM on MATH did give a significant improvement, which suggests that the marginal utility of standard finetuning decreases as the quality and diversity of the unsupervised training dataset improves. Further details can be found in Appendix E.3.

In Appendix H, we study the performance of Minerva 540B on simple arithmetic tasks. The model achieves over 80% accuracy on 10-digit addition and over 20% accuracy on 18-digit addition.

Performance Analysis

To better understand the types of mistakes our models make, we compare the performance of Minerva 8B and Minerva 62B on 216 problems with high confidence majority decisions of both models. Specifically, we selected examples where the top answer received at least 15% of votes, and that either Minerva 8B was correct and Minerva 62B was incorrect (15 samples), or vice versa (201 samples). The categories and examples for each category are described in Appendix I.2.
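The selection rule can be written down as a small filter. This is a sketch under the assumption that the 15% vote threshold is applied to the top answer of each model separately; the function names and the toy vote lists are hypothetical.

```python
from collections import Counter


def top_answer_and_share(final_answers):
    """Most common final answer and the fraction of votes it received."""
    answer, votes = Counter(final_answers).most_common(1)[0]
    return answer, votes / len(final_answers)


def is_comparison_candidate(answers_8b, answers_62b, truth, min_share=0.15):
    """Keep problems where both majority decisions are confident and exactly one is right."""
    top_8b, share_8b = top_answer_and_share(answers_8b)
    top_62b, share_62b = top_answer_and_share(answers_62b)
    # Assumption: the confidence threshold must hold for both models.
    confident = share_8b >= min_share and share_62b >= min_share
    exactly_one_correct = (top_8b == truth) != (top_62b == truth)
    return confident and exactly_one_correct


votes_8b = ["3"] * 10 + ["4"] * 6
votes_62b = ["4"] * 12 + ["3"] * 4
print(is_comparison_candidate(votes_8b, votes_62b, truth="4"))  # True
```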

As shown in Table 4, the prevailing errors of the 8B model were related to incorrect reasoning or calculations. Many of the calculation errors were relatively benign arithmetic mistakes. Solutions that were too short were relatively rare (in these cases, the model immediately produces an incorrect answer without any intermediate reasoning steps). Finally, in a few cases, the model hallucinates an equation or mathematical fact that is not real.

In the samples where the 62B model was incorrect, the dominating failure modes were again incorrect reasoning and incorrect calculations. In summary, we find that the 62B Minerva model retains most of the skills of the 8B model and improves upon both reasoning and calculation robustness.

4.2 False Positives

In our approach to solving quantitative reasoning problems, we are able to automatically verify whether the final answer to a problem is correct, but we do not have an automatic way to verify the model’s chain of reasoning. This leaves open the possibility of false positives: samples which have the correct final answer, but for which the reasoning is incomplete or incorrect.

We selected 100 random questions from MATH (20 per difficulty level), along with answers sampled at zero temperature from the 62B model. We then manually inspected the answers to determine the false positive rate, which is the ratio between number of false positive examples and number of examples for which the final answer is correct; see Table 5. We found that the overall false positive rate is low, though it does increase with difficulty level.
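The false positive rate defined above is a simple ratio, sketched below. The manual labels are represented as hypothetical pairs of booleans; the counts in the usage example are invented and are not the values of Table 5.

```python
def false_positive_rate(graded):
    """Fraction of correct-final-answer samples whose reasoning is flawed.

    graded: list of (final_answer_correct, reasoning_correct) boolean pairs,
    as might be produced by manual inspection.
    """
    reasoning_flags = [reasoning_ok for final_ok, reasoning_ok in graded if final_ok]
    if not reasoning_flags:
        return 0.0
    false_positives = sum(1 for reasoning_ok in reasoning_flags if not reasoning_ok)
    return false_positives / len(reasoning_flags)


# Invented example: 10 samples with a correct final answer, 1 of them with flawed reasoning.
graded = [(True, True)] * 9 + [(True, False)] + [(False, False)] * 10
print(false_positive_rate(graded))  # 0.1
```

Note that samples with an incorrect final answer do not enter the denominator, so the rate is conditional on the automatic check having passed.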

Our focus on pass@1 and majority voting as the primary evaluation metrics is due in part to the fact that they are less susceptible to false positives than pass@k (Li et al., 2022). While the pass@256 accuracy is 84.5% for the 62B model, false positives account for part of it. We inspected the samples that failed in majority voting but passed on pass@k due to a single correct answer, and estimate the false positive rate for pass@256 to be 30% among samples selected in this way. After removing false positives, we estimate the pass@256 accuracy to be greater than 68%; see Appendix I.3 for details.

Memorization

A central question in interpreting Minerva’s solutions is whether performance reflects genuine analytic capability or instead rote memorization. This is especially relevant as there has been much prior work indicating that language models often memorize some fraction of their training data (Trinh and Le, 2018; Radford et al., 2019; Carlini et al., 2022). When examining model solutions, we find that memorization of intermediate facts, such as numerical values of square roots or trigonometric identities, is a crucial element of model solutions. Truly strong performance would combine recall of intermediate facts with genuine solution synthesis. We would like to investigate a strong form of memorization, where model performance is a result of memorizing the explicit problems and solutions in our evaluation set, but also a weaker form, where the model has memorized alternate answers to the same questions.

In order to evaluate the degree to which our models solve problems by recalling information memorized from training data, we conduct three analyses on the MATH dataset. First we directly search for problems and solutions in our training corpus. Next, we generate modified versions of problems and evaluate our models’ robustness to these changes. Finally, we measure the degree of overlap between the ground truth solutions and solutions generated by our model and measure the effect of this similarity on model performance. Overall, we find little evidence that the model’s performance can be attributed to memorization.

We selected the problems for which our 62B parameter model produced a correct answer, and filtered them to the 100 problems with the highest majority vote score, expecting that problems with a high majority vote score are more likely to have been memorized. For each of these question-answer pairs, we compute the BLEU score across chunks of 500 characters in our Math Web Pages dataset (a histogram of the BLEU scores is shown in Appendix Figure 10). We then manually inspect the 250 documents with the highest BLEU scores. While many of the top matches were from homework help sites with math questions and solutions, none of the questions matched the questions in the subset of MATH under consideration. We have included these 250 segments in Appendix J.1. We note that some problems from MATH can be found on the web. Nevertheless, this analysis concludes that these problems did not make it through our data collection process.

5.2 Performance on Modified MATH Problems

To further investigate memorization, we randomly selected twenty problems which the 62B model answered correctly under majority voting. We manually modified each problem either by introducing minor changes to problem wording (framing) or by changing the numbers which appeared in the problem and modifying the solution accordingly. We then compared the accuracy over sampled solutions before and after the modification. Results are shown in Figure 5. In both cases the accuracies before and after modification are correlated, with no clear bias in favor of the original formulation. This is suggestive of minimal memorization. The modified problems are listed in Appendix J.2.

5.3 BLEU Score Between Ground Truth and Generated Solutions

We seek to detect memorization of solutions by computing BLEU score between ground truth answers and model generated answers. We use the 62B model and analyze 256 samples per problem in the MATH dataset. First, we compute overlap statistics for all correct samples. We find that 160 out of 5,000 test questions have a sample with a BLEU score greater than or equal to 80 (see Appendix J.3). We note that they tend to be short solutions. To understand the effect of answer similarity on performance, we remove model samples above a certain BLEU score threshold, and recompute the majority vote accuracy. We find that majority vote performance is robust even down to relatively low similarities (see Figure 5), indicating that performance cannot be attributed to model outputs that are very similar to ground truth answers.
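The threshold experiment can be sketched as follows. The sketch assumes a BLEU score against the ground truth solution has already been computed for every sample (the BLEU implementation itself is out of scope here); the function names and the toy data are hypothetical.

```python
from collections import Counter


def majority_answer(samples, max_bleu):
    """Majority-vote answer after dropping samples whose BLEU exceeds max_bleu.

    samples: list of (final_answer, bleu_vs_ground_truth) pairs.
    """
    kept = [answer for answer, bleu in samples if bleu <= max_bleu]
    if not kept:
        return None  # every sample was too similar to the reference solution
    return Counter(kept).most_common(1)[0][0]


def majority_accuracy(problems, max_bleu):
    """problems: list of (samples, ground_truth_answer) pairs."""
    solved = sum(1 for samples, truth in problems
                 if majority_answer(samples, max_bleu) == truth)
    return solved / len(problems)


# Invented data: the second problem is only solved by near-verbatim samples.
problems = [
    ([("7", 95.0), ("7", 20.0), ("7", 15.0), ("3", 10.0)], "7"),
    ([("5", 90.0), ("5", 85.0), ("2", 30.0)], "5"),
]
print(majority_accuracy(problems, max_bleu=100.0))  # 1.0
print(majority_accuracy(problems, max_bleu=80.0))   # 0.5
```

If accuracy stays nearly flat as `max_bleu` is lowered, the correct answers are not coming predominantly from samples that resemble the reference solutions.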

Conclusions and Discussion

In this work, we take an approach to quantitative reasoning that relies on solving problems using mathematical reasoning expressed in natural language. We show that by training a large language model on a high quality mathematical dataset, we are able to achieve strong performance on tasks that require logical reasoning, numerical calculation, and symbolic manipulation. Our model does not make use of external tools, and at inference time relies exclusively on autoregressive sampling to achieve this performance. Complementary approaches to quantitative reasoning include code-generating models and formal methods. These are all different routes toward a common goal: an agent that can reason about and solve quantitative problems. We believe that such an agent should combine useful elements from all of these approaches.

Our approach to quantitative reasoning has several limitations. First, we have no automatic way of verifying the correctness of the model’s answers. This is in contrast to formal approaches, for which automatic verification is intrinsic. Second, our model has no access to external tools such as a calculator or a Python interpreter. It is therefore limited in its ability to perform quantitative reasoning tasks that require complicated numerical calculations. Third, because our model was trained on a large amount of data, we have little direct control over the specific capabilities that the model acquired.

6.2 Societal Impact

Artificial neural networks capable of solving quantitative reasoning problems in a general setting have the potential of substantial societal impact. Minerva, while a step in this direction, is still far from achieving this goal, and its potential societal impact is therefore limited. The model’s performance is still well below human performance, and furthermore, we do not have an automatic way of verifying the correctness of its outputs. If these issues could be solved, we expect the impacts of this model to be broadly positive. A direct application could be an accessible and affordable math tutor which could help reduce educational inequalities.

Acknowledgments

We thank David Andre, Jacob Austin, Maarten Bosma, Aakanksha Chowdhery, Sergey Ioffe, Colin Raffel, Charles Sutton, and Christian Szegedy for helpful discussions.

References

Appendix A Detailed Contributions

Aitor prepared the Mathematical web pages dataset and Aitor and David prepared the arXiv dataset used to train Minerva.

Aitor trained the Minerva models presented in the paper, and he, along with David and Vedant, conducted ablation studies.

Aitor, Ambrose, and David built the experimental infrastructure for training and evaluating Minerva. They, along with Anders, Ethan, Henryk, Vinay, and Vedant collected the evaluation datasets and conducted model evaluations.

Aitor, Anders, Behnam, Ethan, Guy, and Vedant conducted experiments and ablation studies on inference-time techniques.

Vedant and Vinay collected the OCWCourses dataset and supervised the contractors’ work.

Aitor, Ambrose, Anders, David, Ethan, Guy, Henryk, Theo, Vedant, Vinay, and Yuhuai analyzed the models’ results, including sample explorations to categorize model mistakes and identify false positives.

Aitor, Anders, and Cem conducted fine-tuning evaluation experiments.

Ethan, Vedant, and Vinay designed and conducted the memorization experiments.

Aitor, Anders, Ethan, Guy, Henryk, Imanol, Vedant, and Yuhuai wrote the paper.

Aitor, Behnam, Guy, and Vedant advised and led the project throughout its life cycle.

Appendix B Training Dataset Details

The two main data sources for our training dataset are arXiv papers and web pages that contain mathematics. Here we present additional details on how the data from each source was collected and processed.

The arXiv dataset contains 2M arXiv papers up to February 2021, in LaTeX format. If multiple LaTeX files were present, they were concatenated. Comments were removed, and anything before the first section header or after an appendix/bibliography header was removed. The title and abstract of each paper were added to the document from the arXiv metadata. In order to retain high quality documents and maximize the information per token, papers were filtered out if they were longer than 75k tokens, had on average more than 0.6 tokens per character, had no \section headers, or ended up being empty after processing. The final arXiv dataset after processing includes 1.2M papers totalling 58GB of data.
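The four filtering criteria can be expressed as a single predicate. This is a minimal sketch, assuming the document has already been cleaned and that the token count is supplied by the caller (the tokenizer is not specified here); the function name, constants, and the regular expression used to detect `\section` headers are illustrative choices.

```python
import re

MAX_TOKENS = 75_000          # papers longer than 75k tokens are dropped
MAX_TOKENS_PER_CHAR = 0.6    # more tokens per character suggests poorly compressible text


def keep_arxiv_paper(text: str, num_tokens: int) -> bool:
    """Return True if a processed paper passes all four filters."""
    if not text.strip():
        return False  # empty after processing
    if num_tokens > MAX_TOKENS:
        return False
    if num_tokens / len(text) > MAX_TOKENS_PER_CHAR:
        return False
    # Matches \section{...} and \section*{...}, but not e.g. \sectionmark.
    if re.search(r"\\section\b", text) is None:
        return False
    return True


print(keep_arxiv_paper("\\section{Intro} Some text here.", num_tokens=8))  # True
print(keep_arxiv_paper("No headers in this document.", num_tokens=6))      # False
```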

B.2 Mathematical web pages

We started with a collection of web pages that included the string "

A majority of these HTML documents contain math in TeX or AsciiMath format inside tags of the form