Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning

Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, Ashwin Kalyan

Introduction

Developing machines equipped with mathematical reasoning capabilities is one of the long-standing goals of artificial intelligence. Solving math word problems (MWPs) is a well-defined task to diagnose the ability of intelligent systems to perform numerical reasoning and problem-solving as humans do. A surge of datasets has been proposed to facilitate the research in this domain (Upadhyay & Chang, 2017; Amini et al., 2019; Miao et al., 2020; Cobbe et al., 2021). However, most existing MWP datasets focus on textual math word problems only. Tables, widely distributed in different documents such as invoices, health records, and financial reports, contain rich structured information different from unstructured text. Solving math word problems in such a tabular context is much more challenging than solving those in existing MWP benchmarks since the system needs to make cell selections and align heterogeneous information before performing further numerical reasoning.

To fill this gap, we propose Tabular Math Word Problems (TabMWP), a new large-scale dataset that contains 38,431 math word problems with tabular context, taken from grade-level math curricula. There are two question types: free-text questions in which the answer is an integer or decimal number, and multi-choice questions where the answer is a text span chosen from option candidates. Different from existing MWP datasets, each problem in TabMWP is accompanied by a tabular context, which is represented in three formats: an image, a semi-structured text, and a structured table. Each problem is also annotated with a detailed solution that reveals the multi-step reasoning process to ensure full explainability. To solve problems in TabMWP, a system requires multi-hop mathematical reasoning over heterogeneous information by looking up table cells given textual clues and conducting multi-step operations to predict the final answer. Take the problem in Figure 1 as an example. To answer the question “how much will she spend (if Tracy buys three kinds of beads)?”, we first need to look up the corresponding three rows in the given table, calculate the individual cost for each kind of bead, and finally sum the three costs to get the answer of 31.44.

Inspired by the success of the large pre-trained language model GPT-3 (Brown et al., 2020) in solving math word problems (Wei et al., 2022; Wang et al., 2022), we first build a strong baseline using few-shot GPT-3 on TabMWP. A few in-context examples are randomly selected from the training set and, along with the test example, are assembled into a prompt for GPT-3 to predict the answer. However, recent studies have shown that this type of few-shot learning can be highly unstable across different selections of in-context examples (Zhao et al., 2021; Liu et al., 2022a; Lu et al., 2022c). It could be worse on TabMWP since its problems are distributed across multiple question types and diverse table layouts. Liu et al. (2022a) try to address this issue by retrieving semantically similar examples. However, this method may not always work well on TabMWP because it is not capable of measuring the similarity of structured information, such as the number of cells in tables.

To alleviate this challenge, we further propose a novel approach that can learn to select in-context examples from a small amount of training data via policy gradient for prompt learning, termed PromptPG. As illustrated in Figure 2, an agent learns to find optimal in-context examples from a candidate pool, with the goal of maximizing the prediction rewards on given training examples when interacting with the GPT-3 environment. A policy network defines the strategy of how to select the in-context examples given the current training example. The policy network is built on top of the language model BERT (Devlin et al., 2018) with fixed parameters, followed by a one-layer linear neural network with learnable parameters. The learnable parameters are updated following the policy gradient strategy (Sutton et al., 1998). Unlike random selection (Wei et al., 2022; Wang et al., 2022), brute-force search, or retrieval-based selection (Liu et al., 2022a), PromptPG learns to construct the prompt dynamically given the candidate pool when interacting with the GPT-3 API.

We implement two state-of-the-art methods as baselines, i.e., UnifiedQA (Khashabi et al., 2020) on general question answering and TAPEX (Liu et al., 2022b) on tabular question answering. Both are implemented in pre-trained and fine-tuned settings. Experimental results show that our model PromptPG can achieve an overall accuracy of 68.23% on TabMWP, which surpasses previous methods by a margin of up to 5.31%. Further analysis demonstrates that PromptPG can select better in-context examples compared with a wide range of existing selection strategies and reduce the prediction variance significantly compared to random selection.

The main contributions of our work are as follows: (a) We present a new large-scale dataset, TabMWP, the first dataset for math word problems with tabular context; (b) We propose a novel approach, PromptPG, which learns the prompt dynamically via policy gradient to select in-context examples for few-shot GPT-3. To the best of our knowledge, it is the first work that applies reinforcement learning to select in-context examples for the few-shot GPT-3 model; (c) Experimental results show that PromptPG achieves an improvement of up to 5.31% on TabMWP over existing methods, with reduced selection instability compared to random selection.

The TabMWP Dataset

A tabular math word problem $p$ is represented as a pair $(t, q)$, where $t$ is a table context and $q$ is a question. The table $t$ could be represented in a visual format as an image, semi-structured text, or a structured database. In this work, we focus on the semi-structured format as the table context for simplicity. The table $t$ features complicated layouts and formats: it contains multiple rows and columns, and each cell can be a string of text, a string of a number, or a mix of them. Depending on the question and answer types, the question $q$ may be accompanied by multiple-choice options $c=\{c_{1},c_{2},\dots,c_{n}\}$ or a unit $u$. Given a semi-structured tabular context $t$ and an unstructured question text $q$, the task is to generate the answer $a$, which is either numerical only text for a free-text question, or a text span from given options for a multiple-choice question.

Dataset Construction

Data collection. We construct TabMWP based on openly available content and more details are provided in Appendix A.1. Only math word problems that are accompanied by a tabular context and a detailed solution are collected. We develop a script to extract the tabular context, the question, options that apply, the correct answer, and the solution for each problem. These elements can be precisely identified using HTML tags. For each table, we take a screenshot and store its raw text.

Data preprocessing. To make TabMWP compatible with various baselines, we represent the tabular context as three formats: an image, semi-structured text, and a structured spreadsheet. The semi-structured format is created by converting the raw table text into a flattened token sequence, with each row separated by a newline character ‘\n’ and each column separated by ‘|’. The semi-structured text is further transformed to the structured format, which can be easily retrieved and executed by SQL-based methods (Liu et al., 2022b) using packages like pandas. For clarity, the table title is separated from the raw table. Examples of three formats are shown in Appendix A.1.
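The conversion from the semi-structured text to a structured form can be sketched as follows. This is a minimal illustration that assumes rows are separated by newlines and columns by ‘|’, as described above; the function names and the sample table are hypothetical and not taken from the dataset or the authors' scripts.

```python
def parse_table(table_text):
    """Split a flattened table into rows of stripped cell strings."""
    rows = []
    for line in table_text.strip().split("\n"):
        rows.append([cell.strip() for cell in line.split("|")])
    return rows


def to_records(rows):
    """Treat the first row as the header and map each later row to a dict."""
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


# Hypothetical table, used only for illustration.
table_text = "Bead | Price per kilogram\nglass beads | $2\nmetal beads | $5"
rows = parse_table(table_text)
records = to_records(rows)
```

If pandas is available, `pandas.DataFrame(rows[1:], columns=rows[0])` would turn the same rows into a DataFrame for query-style access.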

For better quantitative evaluation, we formalize the TabMWP problems as two question types: (a) free-text questions, where the answer is numerical text only and the unit text is separately extracted; and (b) multi-choice questions, the answer of which is the text span from choice options. The order of choice options is shuffled to alleviate distribution bias. Redundant information in solutions is removed, and some solutions are manually rewritten to be more human-readable. Finally, problems with the same table, question, and answer text are regarded as redundant and thus removed. We further conduct quality control to ensure data quality, which is discussed in Appendix A.1.

Dataset Statistics

Key statistics. The TabMWP dataset contains 38,431 tabular math word problems, which are partitioned with 6:2:2 into the training, development, and test splits, corresponding to 23,059, 7,686, and 7,686 problems. Their main statistics are shown in Table 1. 74.7% of the questions in TabMWP belong to free-text questions, while 25.3% are multi-choice questions. There are 28,876 different questions, 6,153 different answers, and 35,442 different solutions, indicating that TabMWP has a rich diversity in the problem distribution. The questions have an average of 22.1 words in length and solutions of 49.5, showing that they have lexical richness.

One distinct characteristic of TabMWP is that each problem is accompanied by a tabular context, without which the problem would be unsolvable. There are 37,644 different tables in total, and 60.5% of the tables have a title. The table has an average of 5.9 rows and 2.2 columns, which results in an average of 12.9 cells and a maximum of 54 cells. These statistics suggest that tables in TabMWP distribute diversely across semantics and layouts.

Comparison to existing datasets. As shown in Table 2, TabMWP differs from related datasets in various aspects: (1) TabMWP is the first dataset to study math word problems over tabular context on open domains and is the largest in terms of data size; (2) Problems in TabMWP are annotated with the tabular context, unlike previous MWP datasets in the first segment; (3) Different from Table QA datasets like FinQA, TAT-QA, and MultiHiertt, a lack of either mathematical reasoning or the tabular context renders the problems in TabMWP unanswerable; (4) There are two question types in TabMWP, and the answer could be a text span, an integer number, or a decimal number; (5) Each problem is annotated with natural language solutions to reveal multi-hop reasoning steps.

Methods

Provided with a few in-context examples of math word problems as the context, GPT-3 can generate the answer for a test problem, and shows impressive performance across different MWP datasets (Wei et al., 2022; Wang et al., 2022). Inspired by its success, we first build a strong baseline using few-shot GPT-3 on our TabMWP dataset. Specifically, a few training examples, along with the test example $p_{i}$, are provided to GPT-3 for the answer prediction. Each training example consists of a table context $t$, a question $q$, options $c$ that apply, and an answer $a$. To make the few-shot GPT-3 model workable on TabMWP, we utilize the semi-structured format as the tabular context. Following Wei et al. (2022), a solution $s$ can be augmented in front of the answer $a$ to reveal the multi-step reasoning process, which is able to boost the prediction performance.
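The prompt assembly described above can be sketched as follows. The field labels ("Table:", "Question:", "Options:", "Answer:"), the "The answer is" phrasing, and the function names are assumptions made for illustration; the text does not specify the exact template.

```python
def format_example(table, question, options=None, solution=None, answer=None):
    """Render one problem; the answer slot is left open when answer is None."""
    parts = ["Table:\n" + table, "Question: " + question]
    if options:
        labelled = ["(%s) %s" % (chr(ord("A") + i), opt) for i, opt in enumerate(options)]
        parts.append("Options: " + " ".join(labelled))
    tail = "Answer:"
    if answer is not None:
        if solution:
            # Chain-of-thought variant: the solution precedes the answer.
            tail += " " + solution + " The answer is " + answer + "."
        else:
            tail += " " + answer
    parts.append(tail)
    return "\n".join(parts)


def build_prompt(shots, test_problem):
    """Concatenate the in-context examples, then the unanswered test problem."""
    blocks = [format_example(**shot) for shot in shots]
    blocks.append(format_example(
        table=test_problem["table"],
        question=test_problem["question"],
        options=test_problem.get("options"),
    ))
    return "\n\n".join(blocks)


# Hypothetical problems, used only for illustration.
shots = [{
    "table": "Day | Cups sold\nMonday | 2\nTuesday | 3",
    "question": "How many cups were sold in total?",
    "solution": "Add the cups: 2 + 3 = 5.",
    "answer": "5",
}]
test_problem = {
    "table": "Toy | Price\ncar | $4\nball | $1",
    "question": "Which toy costs more?",
    "options": ["car", "ball"],
}
prompt = build_prompt(shots, test_problem)
```

Leaving the final "Answer:" slot empty is what lets the language model continue the text with its own solution and answer.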

Dynamic Prompting via Policy Gradient

The in-context examples can be selected from the training set either randomly (Wei et al., 2022; Wang et al., 2022) or by retrieval (Liu et al., 2022a). Recent research, however, has shown that few-shot GPT-3 can be highly unstable across different selections of in-context examples and permutations of those examples (Zhao et al., 2021; Liu et al., 2022a; Lu et al., 2022c). This instability may be more severe on TabMWP, where examples are more distinct because they include both unstructured questions of various types and semi-structured tables in various layouts. To alleviate this issue, we propose a novel approach that learns to select well-performing in-context examples using a policy gradient strategy, without brute-force searching or manually designed heuristics.

Formally, given a TabMWP problem $p_{i}$, we want the agent to find $K$ in-context examples $e_{i}=\{e_{i}^{1},e_{i}^{2},...,e_{i}^{K}\}$ from a candidate pool $E_{\text{cand}}$, and generate the answer $\hat{a}_{i}$, maximizing a reward $r_{i}=R(\hat{a}_{i}|p_{i})$. The in-context examples are selected according to a policy

where $\theta$ are the policy’s parameters. The answer is generated through: $\hat{a}_{i}=\text{GPT-3}(e_{i},p_{i})$ using the selected examples and the given problem as the input prompt. The reward is then computed by evaluating the generated answer $\hat{a}_{i}$ with respect to the ground truth answer $a_{i}$:

where $N$ is the size of each batch yielded from our training problem set $P_{\text{train}}$. In this work, we experiment using the REINFORCE policy gradient algorithm (Williams, 1992):

Intuitively, if the predicted answer is correct, we update the policy so that the probability of selecting the same prompts gets higher. Otherwise, we update the policy to reduce the probability of selecting such poorly matched examples. The learning process is summarized in Algorithm 1 in the appendix.
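The display equations for this formulation are not reproduced in this copy of the text. One reconstruction that is consistent with the surrounding definitions is sketched below. The symbol $\pi_{\theta}$, the function name EVAL, and the reward values $\pm 1$ are assumptions (the text only implies that a correct answer raises the selection probability and an incorrect one lowers it); the last line is the standard REINFORCE estimator.

```latex
% Policy: each of the K in-context examples is drawn from the candidate pool.
e_{i}^{k} \sim \pi_{\theta}(e_{i}|p_{i}), \qquad e_{i}^{k} \in E_{\text{cand}}, \quad k = 1, \dots, K

% Reward: compare the generated answer with the ground truth (values assumed).
r_{i} = R(\hat{a}_{i}|p_{i}) = \text{EVAL}(\hat{a}_{i}, a_{i}), \qquad r_{i} \in \{-1, 1\}

% Monte Carlo estimate of the expected reward over a batch of N problems.
\mathbb{E}_{e_{i} \sim \pi_{\theta}(e_{i}|p_{i})}\big[ R(\hat{a}_{i}|p_{i}) \big]
  \approx \frac{1}{N} \sum_{i=1}^{N} R(\hat{a}_{i}|p_{i}),
  \qquad \hat{a}_{i} = \text{GPT-3}(e_{i}, p_{i})

% REINFORCE gradient estimate used to update \theta.
\nabla_{\theta}\, \mathbb{E}_{e_{i} \sim \pi_{\theta}(e_{i}|p_{i})}\big[ R(\hat{a}_{i}|p_{i}) \big]
  \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log \pi_{\theta}(e_{i}|p_{i}) \, R(\hat{a}_{i}|p_{i})
```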

To get the contextualized representation of the given problem and candidate examples, we use the BERT (Devlin et al., 2018) [CLS] token representation as the problem encoding. We add a small linear layer on top of the BERT final pooling layer. That allows our model to learn both the semantic similarity that the pre-trained BERT model provides and the hidden logical similarity shared among the math problems. During training, the parameters of BERT are fixed and only the appended linear layer is updated, i.e., $\theta$ is composed of the learnable parameters $\mathbf{W}$ and $\mathbf{b}$:
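As a rough illustration of how such a policy could be trained, the sketch below assumes that each frozen encoding $x$ is projected as $h(x) = \mathbf{W}x + \mathbf{b}$, that the selection probability is a softmax over the dot products $h(e) \cdot h(p)$ across the candidate pool, and that examples are drawn independently. These forms, the function names, and the toy 2-d vectors (stand-ins for frozen BERT [CLS] encodings) are assumptions, not the authors' implementation. The gradient of the log-probability is written out by hand so that the block needs only the standard library.

```python
import math
import random


def project(W, b, x):
    """h(x) = W x + b applied to a frozen encoding x."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def policy(W, b, cand_enc, prob_enc):
    """Softmax over dot products of projected candidates with the projected problem."""
    h_p = project(W, b, prob_enc)
    h_c = [project(W, b, e) for e in cand_enc]
    scores = [sum(u * v for u, v in zip(h, h_p)) for h in h_c]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [v / total for v in exps], h_c, h_p


def sample_examples(probs, k, rng):
    """Draw k candidate indices independently (with replacement)."""
    return rng.choices(range(len(probs)), weights=probs, k=k)


def reinforce_step(W, b, cand_enc, prob_enc, chosen, reward, lr=0.1):
    """Ascend reward * sum_k log pi(chosen_k | problem); W and b change in place."""
    probs, h_c, h_p = policy(W, b, cand_enc, prob_enc)
    for k in chosen:
        for j, p_j in enumerate(probs):
            # d log pi_k / d score_j = 1[j == k] - pi_j
            coeff = lr * reward * ((1.0 if j == k else 0.0) - p_j)
            for r in range(len(W)):
                for c in range(len(W[r])):
                    # d score_j / d W[r][c] = h_p[r] * x_j[c] + h_j[r] * x_p[c]
                    W[r][c] += coeff * (h_p[r] * cand_enc[j][c] + h_c[j][r] * prob_enc[c])
                # d score_j / d b[r] = h_p[r] + h_j[r]
                b[r] += coeff * (h_p[r] + h_c[j][r])


# Toy 2-d encodings standing in for frozen BERT [CLS] vectors.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
problem = [1.0, 0.0]

before, _, _ = policy(W, b, candidates, problem)
reinforce_step(W, b, candidates, problem, chosen=[1], reward=1.0)
after, _, _ = policy(W, b, candidates, problem)
picks = sample_examples(after, 2, random.Random(0))
```

With a positive reward, the step raises the probability of the chosen candidate; a negative reward would lower it, which matches the intuition given for the update rule.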

Experiments

Baselines. We first develop two large language models, UnifiedQA (Khashabi et al., 2020) and TAPEX (Liu et al., 2022b), in both pre-trained and fine-tuned settings, as strong baselines on TabMWP. Different model sizes are included to examine the performance across different model capacities. We further implement the zero-shot GPT-3 model, the few-shot GPT-3 model, and their chain-of-thought (CoT) reasoning variants (Wei et al., 2022). We also study the heuristic guess baseline and human performance to analyze the lower and upper bounds on TabMWP, respectively.

Evaluation metric. The answer part is extracted from the GPT-3 generation using manually designed regular expressions. To evaluate the baselines and our method, we utilize the accuracy metric to determine if the generated answer is correct given the ground truth answer. For free-text problems where the answer is set as a number, we normalize the prediction and the label to decimal numbers with two-digit precision and check if their values are equivalent. For multi-choice problems, we choose the most similar one from options to the generated answer following Khashabi et al. (2020).
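The evaluation procedure can be sketched as follows. The regular expression, the handling of commas and dollar signs, and the use of `difflib.SequenceMatcher` as the similarity measure for multi-choice answers are assumptions chosen for illustration; the text states only that regular expressions are used, that numbers are compared at two-digit precision, and that the most similar option is chosen.

```python
import difflib
import re


def extract_answer(generation):
    """Return the text after 'The answer is' if present, else the whole generation."""
    text = generation.strip()
    match = re.search(r"The answer is (.+?)\.?\s*$", text)
    return match.group(1) if match else text


def normalize_number(text):
    """Parse a numeric string and round it to two decimals; None if unparsable."""
    cleaned = text.replace(",", "").replace("$", "").strip()
    try:
        return round(float(cleaned), 2)
    except ValueError:
        return None


def is_correct(prediction, label, options=None):
    """Accuracy check for one problem."""
    if options:
        # Multi-choice: snap the prediction to the most similar option.
        best = max(
            options,
            key=lambda o: difflib.SequenceMatcher(None, o.lower(), prediction.lower()).ratio(),
        )
        return best == label
    pred_value = normalize_number(prediction)
    return pred_value is not None and pred_value == normalize_number(label)


extracted = extract_answer("Add 2 + 3 = 5. The answer is 5.")
```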

Implementation details. Fine-tuned UnifiedQA and TAPEX baselines are trained on the train split and evaluated on the test split. Few-shot GPT-3 and few-shot-CoT GPT-3 randomly select two in-context examples from the training data to build the prompt. Our PromptPG is built on top of few-shot GPT-3 with a different selection strategy: (a) in the training stage, the agent learns to select two examples from 20 candidates and is evaluated on 160 training examples to calculate the reward; (b) in the test stage, the agent with an optimal policy chooses two examples from 20 candidates for each test example. The candidates are randomly selected from the training set. Experiments for two few-shot GPT-3 baselines and our PromptPG are repeated three times, and the average accuracy is reported in Table 3. More implementation details can be found in Appendix A.4.

Experimental Results

Table 3 demonstrates the results of different baselines and our method on the TabMWP dataset. Benefiting from pre-training on the tabular corpus, the TAPEX baseline performs better on average than UnifiedQA with a similar model size, which is only pre-trained on unstructured textual data. Increasing the model size can improve the prediction accuracy for both UnifiedQA and TAPEX. Fine-tuned on TabMWP, the baseline models can significantly improve the prediction performance on the average and all aggregated accuracy metrics.

Without any examples provided to GPT-3, zero-shot GPT-3 achieves a comparable accuracy to the best fine-tuned baselines, UnifiedQA$_{\textsc{Large}}$ and TAPEX$_{\textsc{Large}}$, showing its surprisingly good generalization ability on TabMWP. Provided with two randomly sampled in-context examples as the prompt, few-shot GPT-3 gets an improvement of 0.17%. Generating the multi-step solution before the answer, the few-shot-CoT GPT-3 model reports the best performance among all of these baseline models, with an accuracy of 62.92%. Unlike few-shot-CoT GPT-3, which randomly selects the in-context examples, our proposed PromptPG learns to select well-performing examples with the help of policy gradient. PromptPG establishes a state-of-the-art performance on the TabMWP dataset: it surpasses the best baseline few-shot-CoT GPT-3 by 5.31% on average. PromptPG shows its consistent advantages on two question types, two grade groups, and most of the answer types.

Heuristic guess and human performance. The accuracy of multi-choice questions by heuristic guess is 39.81%, which aligns with the fact that there are 2.88 options on average. The accuracy for free-text questions is considerably low since the inputs of TabMWP problems do not have direct clues for the answers. Humans outperform all models consistently across question types, answer types, and grade groups, with a 21.99% average accuracy advantage over our best performing PromptPG. This gap is to be filled by future research on semi-structured mathematical reasoning.

Problem types and difficulty. Among all the baselines, we find it is easier for models to answer multi-choice questions than free-text questions. Questions with the boolean (BOOL) and other (OTH) answer types tend to have lower accuracy scores than the extractive (EXTR) answer type, because the former ones need the abilities of fact verification and language understanding on diverse options, respectively. It is also not surprising for us to find that all the models perform worse on problems in grades 7-8 than in a lower-level group of 1-6.

Ablation Study

Here, we study how different factors affect the performance of the baselines and our method on TabMWP. Experiments are conducted on 1,000 development examples.

Blind study of the dataset. We evaluate the information gain of each component of the TabMWP problems by removing it from model inputs. To eliminate the impact and variance caused by example selection, the study is conducted using the zero-shot GPT-3 model. As shown in Table 4, there is a dramatic decline when either the tabular context (T) or the question text (Q) is missing from the inputs. For example, T$\rightarrow$A and Q$\rightarrow$A only attain an average accuracy of 6.10% and 7.00%, respectively, and their accuracies are near zero on the multi-choice questions. Taking both tabular and textual data as inputs (TQ$\rightarrow$A), the model significantly beats the heuristic guess. With the complete input information (TQ(C)$\rightarrow$A), the full model achieves the best performance. The blind study shows that our TabMWP is robust and reliable in distribution, and all input components are indispensable parts that provide necessary information for answering the questions.

Number of training examples. We study the effect of different numbers of training examples on our dynamic prompt learning in Figure 3 (a). With more training examples, the prediction accuracy first gradually increases and peaks at around 160 training examples. After that, the accuracy goes down with a growing variance. We reckon it is because the policy gradient algorithm can benefit from scaling up the training data but fails to exploit more examples efficiently.

Number of candidate examples. In Figure 3 (b), we investigate how different numbers of candidate examples can affect policy learning performance. As the number of candidates increases, the prediction accuracy first goes up and then goes down after a threshold, given 80 or 160 training examples. This is probably because, when the candidate pool is too small, the policy gradient algorithm has a limited action space to explore enough problem types. In contrast, too many candidates could make it hard for the algorithm to learn an optimal policy in a large search space.

Different selection strategies. In Table 5, we compare the proposed PromptPG with random selection and other heuristic-based example selection strategies for the few-shot-CoT GPT-3 model. Compared to random selection, selecting examples of the same question or answer type helps the model to take the task-relevant examples as the prompt, thus improving the accuracy and reducing the variance. Choosing the most complex examples does not boost the prediction performance consistently. Manual selection selects the two examples from 20 with the highest evaluation accuracy on one-shot-CoT GPT-3 as the fixed set of in-context examples. Although it achieves the lowest prediction variance of 0, it only improves by 1.7% over random selection. The most semantically similar examples, as a kind of nearest neighbor search of the test example, help construct a well-performing and stable prompt for GPT-3. PromptPG shows its effectiveness in selecting optimal in-context examples over other strategies and largely reduces the instability.
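For comparison, the nearest neighbor strategy mentioned above can be sketched as a top-k search by cosine similarity. The choice of cosine similarity, the function names, and the toy vectors are assumptions; in practice the vectors would come from some sentence encoder, which the text does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_examples(test_vec, cand_vecs, k=2):
    """Return the indices of the k candidates most similar to the test vector."""
    ranked = sorted(
        range(len(cand_vecs)),
        key=lambda i: cosine(test_vec, cand_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings, used only for illustration.
cand_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
selected = nearest_examples([1.0, 0.1], cand_vecs, k=2)
```

Unlike a learned policy, this selection is fixed by the embedding space and receives no feedback from the downstream prediction.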

Case Study

We conduct the case study in Appendix A.7. We visualize the two in-context examples selected by our PromptPG, nearest neighbor search, and random selection in Figures 5, 6, and 7, respectively. The nearest neighbor search strategy selects the “superficially” similar examples to the test example. Instead, PromptPG tends to select examples that have multiple reasoning steps in the solution and similar abilities in mathematical reasoning, which results in higher prediction accuracy. Successful examples in Figures 8-12 show that PromptPG is able to generate reasonable reasoning steps to predict correct answers for a wide range of TabMWP problems. Failure examples in Figures 13-18 suggest that PromptPG has limitations when solving problems provided with complex tabular contexts or requiring a high-level ability of mathematical reasoning.

Related Work

The task of solving Math Word Problems (MWPs) is to predict the answer given a natural language description of a math problem. There have been great efforts in developing datasets for MWPs, including Math23K (Wang et al., 2017), MathQA (Amini et al., 2019), ASDiv (Miao et al., 2020), SVAMP (Patel et al., 2021), and Lila (Mishra et al., 2022). However, these datasets only involve the textual modality, and most are limited to a small data scale. Some recent datasets like DVQA (Kafle et al., 2018), IconQA (Lu et al., 2021b), Geometry3K (Lu et al., 2021a), and UniGeo (Chen et al., 2022) introduce math problems with diagrams as the visual context, where the system needs to perform mathematical reasoning over multi-modal information. To the best of our knowledge, our dataset TabMWP is the first dataset that requires mathematical reasoning over heterogeneous information from both the textual question and the tabular context. To solve MWPs, one popular line of previous methods is to generate the intermediate expressions and execute them to get the final answers (Huang et al., 2017; Roy & Roth, 2017; Amini et al., 2019). Inspired by the recent progress achieved by GPT-3 in solving MWPs (Wei et al., 2022; Wang et al., 2022; Kojima et al., 2022), we evaluate TabMWP using GPT-3 models in zero-shot and few-shot learning manners.

Table QA Datasets

Table Question Answering (Table QA) refers to the task of answering questions about tabular data. Numerous datasets have been developed for Table QA. For example, TabMCQ (Jauhar et al., 2016) is an early dataset collected from grade exams. Datasets like WTQ (Pasupat & Liang, 2015), WikiSQL (Zhong et al., 2017), and SQA (Iyyer et al., 2017) contain semi-structured tables from Wikipedia, while Spider (Yu et al., 2018) collects structured tables sourced from databases. Recent work aims at introducing datasets that require multi-hop reasoning between the textual and tabular data: HybridQA (Chen et al., 2020b), OTTQA (Chen et al., 2020a), MultiModalQA (Talmor et al., 2020), AIT-QA (Katsis et al., 2021), and FeTaQA (Nan et al., 2022). Datasets most related to our TabMWP dataset are FinQA (Chen et al., 2021), TAT-QA (Zhu et al., 2021), and MultiHiertt (Zhao et al., 2022) because they need numerical reasoning on financial reports with tabular data. Note that 77.6% of questions in TAT-QA can be solved without mathematical reasoning and 50.0% of questions in FinQA can be answered without the table. In contrast, our proposed TabMWP collects questions where both mathematical reasoning and tabular context are necessary.

Prompt Learning for Language Models

Large pre-trained language models, such as GPT-3 (Brown et al., 2020), have shown their remarkable ability of few-shot learning on a wide range of downstream tasks (Houlsby et al., 2019; Brown et al., 2020; Ma et al., 2022; Lu et al., 2022a). Given a few in-context examples as demonstrations, GPT-3 can generalize to unseen test examples without parameter updating. For example, Wei et al. (2022) randomly select different in-context examples from the training set and formulate their corresponding prompt with a test sample. However, recent studies show that few-shot GPT-3 highly depends on the selection of in-context examples and could be unstable, varying from near chance to near state-of-the-art performance (Zhao et al., 2021; Liu et al., 2022a; Lu et al., 2022b). To mitigate the volatility of selecting in-context examples, Lu et al. (2022c) propose retrieving relevant examples that are semantically similar to the test sample. Other possible strategies could be using brute-force permutation search or relying on manually designed heuristics like choosing the most complex examples. Inspired by reinforcement learning’s ability to search for an optimal action policy, we propose applying the policy gradient strategy (Sutton et al., 1998) to learn to select in-context examples more efficiently and stably without human-designed heuristics.

Conclusion

In this paper, we propose TabMWP, the first large-scale dataset for math word problems in tabular contexts. TabMWP contains 38,431 open-domain problems with two question types and three answer types, and each problem is annotated with a multi-step solution. We evaluate TabMWP using state-of-the-art QA and TableQA methods in both pre-trained and fine-tuned settings, as well as the large pre-trained language model GPT-3. We further propose a novel approach, PromptPG, for few-shot GPT-3, which utilizes policy gradient to learn to select in-context examples from the training data and construct a well-performing prompt for the test example. Experimental results show that PromptPG outperforms existing strong baselines by a large margin of 5.31% and reduces the accuracy volatility compared to random selection. To the best of our knowledge, it is the first work that applies reinforcement learning to select in-context examples for the few-shot GPT-3 model.

Acknowledgment

We would like to thank Zhou Yu and Jiuxiang Gu for insightful discussions on dataset collection. We thank Muhao Chen and Yao Fu for constructive suggestions in developing baselines and experiments. The work does not relate to Liang Qiu’s position at Amazon Alexa.

References

Appendix A

The raw problems are collected from an online learning website, IXL (https://www.ixl.com/math), which hosts a large number of high-quality math problems curated by educational experts.

Quality control. The goal of constructing TabMWP is to collect math word problems that necessitate multi-hop mathematical reasoning between the question and the tabular context. Therefore, we ask human experts to filter out problems that can be solved either without the context of the table or by looking up table cells without numerical reasoning. To further ensure data quality, we ask human experts to perform a final review to re-check the dataset and manually revise incorrect annotations.

A.2 Human study

To examine how humans perform on our TabMWP dataset, we released the human evaluation task on Amazon Mechanical Turk (AMT) for the test split. We designed two sub-tasks for the human study: answering the free-text questions and answering the multi-choice questions. The user interfaces for the two sub-tasks are shown in Figure 4. Each human intelligence task (HIT) contains 5 exam questions and 15 test questions. A worker should have a HIT Approval Rate of 98% or higher and have 5,000 or more approved HITs. The worker is provided with detailed instructions at the beginning and needs to pass at least 3 free-text exam questions or 4 multi-choice exam questions to be qualified for the human study. Each HIT is assigned to two different workers. We assign a reward of \$0.80 and \$0.60 for one HIT of free-text and multi-choice sub-tasks, respectively.

A.3 The PromptPG Algorithm

The pipeline of PromptPG to learn to select in-context examples is summarized in Algorithm 1.

A.4 Implementation Details

Heuristic guess. To investigate the lower bound of the accuracy on TabMWP, we design simple heuristics to guess answers for each question type. For multi-choice questions, we randomly select one from the given options with even probabilities. For free-text questions on TabMWP, the answers could only be integer or decimal numbers. Intuitively, we take advantage of regular expressions to extract all the numbers from the tabular context and the question text as candidates, and then randomly choose one number as the prediction.
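A minimal sketch of this heuristic baseline is given below. The number pattern (unsigned integers and decimals, with optional thousands separators), the function name, and the sample inputs are assumptions; the text does not give the exact regular expressions.

```python
import random
import re

# Unsigned integers or decimals, optionally with thousands separators.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def heuristic_guess(table, question, options=None, rng=random):
    """Guess uniformly among options, or among numbers found in the inputs."""
    if options:
        return rng.choice(options)
    candidates = [m.replace(",", "") for m in NUMBER.findall(table + " " + question)]
    return rng.choice(candidates) if candidates else None


# Hypothetical inputs, used only for illustration.
rng = random.Random(0)
table = "Item | Price\npen | $1.50\nbook | $12"
free_text_guess = heuristic_guess(table, "How much do 3 pens cost?", rng=rng)
choice_guess = heuristic_guess(table, "Which costs more?", options=["pen", "book"], rng=rng)
```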

UnifiedQA baselines. UnifiedQA (Khashabi et al., 2020) is a T5-based (Raffel et al., 2020) QA system that was pre-trained on 8 seed QA datasets of multiple formats but with a unified text-to-text paradigm. We load the pre-trained checkpoint as the pre-trained baseline and train it on TabMWP as the fine-tuned baseline. Three different parameter sizes are compared: small (60M), base (220M), and large (770M).

TAPEX baselines. TAPEX (Liu et al., 2022b) is a BART-based (Lewis et al., 2020) language model pre-trained on structured tabular data to mimic the behavior of a SQL executor that can answer table-based questions. TAPEX shows state-of-the-art performance on four table-related datasets. We establish the pre-trained and fine-tuned baselines on top of TAPEX with two model sizes: base (140M) and large (400M).

Zero-shot GPT-3 and zero-shot-CoT GPT-3. We establish the zero-shot baseline based on GPT-3 (Brown et al., 2020). The zero-shot setup follows the format of TQ(C)$\rightarrow$A where the input is the concatenation of tokens of the tabular context (T), the question text (Q), and choice options (C) that apply while the output is to predict the answer (A). Following Kojima et al. (2022), we further build zero-shot-CoT GPT-3, which refers to the GPT-3 model with a chain-of-thought (CoT) prompt. Specifically, we add the prompt “Let’s think step by step” at the end of the input to ask the model to generate the multi-step solution (S) to mimic the reasoning process as humans. Then the model takes the raw input and the newly generated solution to predict the final answer.
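The two-stage zero-shot-CoT procedure can be sketched as follows. The `complete` argument stands for any text-completion call and is replaced here by a canned stub; the "Answer:" label and the second-stage trigger "Therefore, the answer is" are assumptions, since the text only specifies the “Let’s think step by step” prompt.

```python
def zero_shot_cot(problem_text, complete):
    """Stage 1 elicits a solution; stage 2 feeds it back to ask for the answer."""
    stage_one = problem_text + "\nAnswer: Let's think step by step."
    solution = complete(stage_one)
    stage_two = stage_one + solution + "\nTherefore, the answer is"
    answer = complete(stage_two)
    return solution.strip(), answer.strip()


def fake_complete(prompt):
    """Stand-in for a language-model call; returns canned text."""
    if prompt.endswith("the answer is"):
        return " 5."
    return " 2 + 3 = 5."


# Hypothetical problem, used only for illustration.
problem_text = "Table:\nDay | Cups\nMon | 2\nTue | 3\nQuestion: How many cups in total?"
solution, answer = zero_shot_cot(problem_text, fake_complete)
```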

Few-shot GPT-3 and few-shot-CoT GPT-3. In the few-shot setting, we follow the standard prompting (Wei et al., 2022) where in-context examples are randomly selected from the training data as demonstrations for the test example. Similarly, the few-shot-CoT GPT-3 baseline takes the prompt template of TQ(C)$\rightarrow$SA to generate the solution before the final answer.

For the GPT-3 engine, we use text-davinci-002, the most capable engine recommended by the official documentation. The temperature is set as 0 and the top probability is set as 1.0 to get the most deterministic prediction. The maximum number of tokens allowed for generating text is 512. Both the frequency penalty and the presence penalty are set as the default value, i.e., 0.
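For reference, these decoding settings can be collected in a single configuration. The key names below follow the parameter names of the legacy OpenAI completion endpoint, and mapping "top probability" to `top_p` is an assumption; the dictionary is only a sketch and does not call any API.

```python
# Decoding settings as stated in the text; key names are assumed to follow
# the legacy OpenAI completion parameters.
GPT3_CONFIG = {
    "engine": "text-davinci-002",
    "temperature": 0,        # most deterministic prediction
    "top_p": 1.0,            # "top probability" in the text
    "max_tokens": 512,       # maximum number of generated tokens
    "frequency_penalty": 0,  # default value
    "presence_penalty": 0,   # default value
}
```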

A.5 More Experimental Results

Number of few-shot examples. We study the few-shot-CoT GPT-3 model with random selection in terms of the different numbers of in-context shots. For each number of in-context shots, the experiment was conducted on 1,000 development examples and repeated three times. The results are shown in Table 9. When increasing the number of in-context shots from the current 2 to 4, the few-shot-CoT GPT-3 model reduces the prediction variance from the random selection of in-context shots and achieves an accuracy improvement of 2.5%. When the number of in-context shots is increased to 5, the model with random selection does not gain further benefits. Our PromptPG displays impressive advantages over random selection in terms of data efficiency and prediction accuracy. With only two in-context shots, PromptPG achieves the highest accuracy of 70.9% and a comparably low deviation compared to random selection with more shots.

A.6 Related work of Policy Gradient

Policy gradient is an approach to solving reinforcement learning problems that targets modeling and optimizing the policy directly. Many policy gradient algorithms have been proposed in the past decade (Silver et al., 2014; Lillicrap et al., 2015; Mnih et al., 2016; Schulman et al., 2017; Barth-Maron et al., 2018). They have been proven effective in areas like robotics (Peters & Schaal, 2006) and chatbots (Kandasamy et al., 2017). In recent work that focuses on aligning language models with human values (Ouyang et al., 2022; Qiu et al., 2022; Glaese et al., 2022), policy gradient has been used to optimize language models with rewards learned from human feedback and preference. To the best of our knowledge, our PromptPG is the first work that proposes to select prompts dynamically for large pre-trained language models in the mathematical reasoning field.

A.7 Case study examples