PINTO: Faithful Language Reasoning Using Prompt-Generated Rationales

Peifeng Wang, Aaron Chan, Filip Ilievski, Muhao Chen, Xiang Ren

Introduction

Many language-based reasoning tasks require retrieving and reasoning over knowledge beyond the task input—e.g., commonsense reasoning and closed-book QA (Fig. 1, left) (Talmor et al., 2018; Mihaylov et al., 2018). Neural language models (LMs) have achieved impressive results on such tasks by utilizing latent knowledge encoded in their pretrained parameters (Raffel et al., 2020b; Brown et al., 2020). Still, given LMs’ black-box nature, it is unclear whether this knowledge is being used properly (Doshi-Velez & Kim, 2017; Lipton, 2018). Previous studies have shown that LMs often learn spurious correlations from artifacts in downstream training data, thus limiting their generalizability (Branco et al., 2021; Geirhos et al., 2020; D’Amour et al., 2020).

With this in mind, a number of prior works aim to make LMs’ reasoning processes more explicit by generating free-text rationales, which use LMs’ internal knowledge to describe a reasoning process in natural language (Narang et al., 2020; Wei et al., 2022b; Marasović et al., 2022; Zelikman et al., 2022). In the fine-tuned self-rationalizing paradigm, a single LM is fine-tuned to jointly generate the task output and rationale (Narang et al., 2020; Marasović et al., 2022; Zelikman et al., 2022). In the prompted self-rationalizing paradigm, a single LM is instead frozen and prompted to jointly generate the task output and rationale, with the prompt consisting of a few input-output-rationale demonstrations (Wei et al., 2022b). In the pipeline-rationalizing paradigm, a fine-tuned rationalizing LM first generates the rationale, which is then used as input for a separate fine-tuned reasoning LM to generate the output (Kumar & Talukdar, 2020; Rajani et al., 2019).

However, when considering generalization performance, reliability, and deployment costs, these existing paradigms all have key limitations. Fine-tuned self-rationalizing LMs often perform worse than non-rationalizing LMs, since their parameters are learned using two relatively dissimilar objectives, while also requiring expensive rationale annotations (Wiegreffe et al., 2020; Narang et al., 2020). Prompted self-rationalizing LMs yield strong task performance and only need a few rationale demonstrations for the prompt, but are computationally prohibitive since they generally require very large-scale (i.e., over 100B parameters) LMs to work effectively (Wei et al., 2022a; b). Besides requiring expensive rationale annotations, pipeline-rationalizing LMs’ generated rationale forms a non-differentiable bottleneck between the two modules, which complicates end-to-end training and can hurt task performance (Wiegreffe et al., 2020; Hase et al., 2020). Moreover, none of these paradigms has a mechanism for regularizing the rationale generation to faithfully reflect the reasoning process of the LM, without hurting task performance.

In this paper, we propose Prompted RatIonalizing with CouNTerfactual ReasOning (PINTO), an LM pipeline that rationalizes via prompt-based learning, then reasons over the task input and rationale via counterfactual regularization. PINTO’s rationalizing module is a medium-scale (i.e., 20B parameters) LM that contains vast latent knowledge obtained via pretraining (Black et al., 2022). Though prohibitive to fine-tune, it is affordable for prompt-based learning. Given the task input and a minimal input-output-rationale demonstration prompt, the rationalizing module uses its internal knowledge to map out a suitable reasoning process for the task input by generating a free-text rationale. The rationalizing module is frozen during fine-tuning, which drastically reduces training costs and prevents it from exploiting spurious shortcuts in the downstream training data. PINTO’s reasoning module is a small-scale (i.e., under 1B parameters) LM to which knowledge is transferred from the rationalizing module. The reasoning module is fine-tuned to solve the downstream reasoning task by using the generated rationale as context for the task input. Crucially, to help ensure that the reasoning module’s behavior is dictated by the rationale (instead of by spurious shortcuts), the reasoning module is regularized to output less confident predictions when the rationale is noisily perturbed. To simulate shortcut reasoning, we consider two rationale perturbation strategies: token masking (i.e., rationale is ignored) and token replacement (i.e., rationale is misused).

Across four question answering datasets (CSQA, StrategyQA, OpenBookQA, QASC), we show that PINTO significantly improves the reasoning LM’s generalization, yielding higher performance on both in-distribution (ID) and out-of-distribution (OOD) test sets. Also, we find that rationales are utilized more faithfully by PINTO than by other methods, leading to better performance in low-resource settings. Furthermore, we show that PINTO’s counterfactual regularization allows us to further improve task performance with refined rationales.

Rationale-Based Language Reasoning

In this work, we study LMs’ ability to reason about language using implicit knowledge. We consider a specific type of multi-choice question answering (QA) tasks where the required knowledge for answering the question is not explicitly provided in the input and needs to be inferred from the LM’s parameters (Talmor et al., 2019; Khot et al., 2020): Given a question $q$ and a set of answer choices $A=\{a_{i}\}$ , the model’s goal is to predict a plausibility score $\rho(q,a_{i})$ for each $(q,a_{i})$ pair, so that the predicted answer $\hat{a}=\operatorname*{arg\,max}_{a_{i}\in A}\hskip 1.42262pt\rho(q,a_{i})$ matches the correct answer choice $a^{*}\in A$ .

Motivated by LMs’ common tendency to exploit reasoning shortcuts when solving tasks (Branco et al., 2021), we focus on methods that explicitly generate free-text rationales to explain their predictions. Whereas extractive rationales are limited to input token scoring (Denil et al., 2014; Sundararajan et al., 2017; Chan et al., 2022), free-text rationales use natural language to describe a reasoning process (e.g., with knowledge beyond the task input) (Narang et al., 2020; Wei et al., 2022b). Below, we discuss several paradigms (see also Fig. 1) for rationale-based language reasoning.

Fine-Tuned Self-Rationalization In this paradigm, an LM is fine-tuned to autoregressively generate the task output and rationale as a single sequence (Narang et al., 2020; Liu et al., 2018). If the rationale is generated after the task output, then the rationale is conditioned on the task output, and vice versa. Since the LM parameters are shared across two relatively dissimilar objectives, such LMs often perform worse than non-rationalizing LMs (Wiegreffe et al., 2020; Narang et al., 2020). Notably, this paradigm requires expensive rationale annotations for all training instances.

Prompted Self-Rationalization In this paradigm, a pretrained LM is frozen and prompted to autoregressively generate the task output and rationale as a single sequence, with the prompt consisting of a few input-output-rationale demonstrations (Lampinen et al., 2022; Wei et al., 2022b). If the rationale is generated after the task output, then the rationale is conditioned on the task output, and vice versa. This paradigm performs well and only needs a few rationale annotations for the prompt, but it is computationally prohibitive since it generally requires very large-scale (i.e., over 100B parameters) LMs to work effectively (Lampinen et al., 2022; Wei et al., 2022b).

Pipeline Rationalization In this paradigm, a fine-tuned rationalizing LM first generates the rationale, which is then used as input for a separate fine-tuned reasoning LM to predict the task output (Kumar & Talukdar, 2020; Rajani et al., 2019). Here, the generated rationale forms a discrete (i.e., non-differentiable) bottleneck between the two modules, which complicates end-to-end training and can hurt task performance (Wiegreffe et al., 2020; Hase et al., 2020). Additionally, the dedicated rationalizing LM requires extra rationale annotation/computation costs.

PINTO: Faithful Language Reasoning

PINTO is a two-stage, rationalize-then-reason pipeline, designed to address the limitations of existing paradigms for rationale-based language reasoning (§2). Like the pipeline rationalization paradigm, PINTO has separate modules for rationalizing and reasoning (Fig. 2). However, PINTO’s rationalizing module is prompted instead of fine-tuned. Thus, PINTO does not suffer from the non-differentiable bottleneck issue and has lower rationale annotation/computation costs.

Following prior works, PINTO is based on choice-specific rationales (Kumar & Talukdar, 2020; Hase et al., 2020). First, given $q$ and $A$ , the rationalizing module generates a set of choice-specific rationales $R=\{r_{i}\}$ , where each $r_{i}$ explains a reasoning process that supports answer choice $a_{i}\in A$ (§3.1), as opposed to generating one rationale per question. We opt for this design choice because rationales are often answer-leaking (Sun et al., 2022), i.e., the rationale itself is already sufficiently predictive of one of the answer choices. If the rationalizing module only generates one rationale per question, then it is forced to make an “early decision” on the predicted answer, such that the reasoning module would only be left to recover the answer from the rationale (Kumar & Talukdar, 2020). While prior works require expensive rationale annotations to train/prompt the rationalizing module (Kumar & Talukdar, 2020; Hase et al., 2020), PINTO’s rationalizing module is a frozen pretrained LM that uses only a few question-answer-rationale demonstrations as a prompt (§3.1). Second, given $q$ , $a_{i}\in A$ , and $r_{i}\in R$ , the reasoning module outputs plausibility score $\rho(q,a_{i},r_{i})$ (§3.2). We also design a regularization objective that encourages the reasoning module to properly use the rationales to predict the answer (§3.3). We describe each module in more detail below.

3.1 Rationalizing Module

Prior works mainly rely on human-annotated rationales for teaching a model to rationalize (Kumar & Talukdar, 2020; Hase et al., 2020; Sun et al., 2022). However, such rationale annotations are expensive and frequently of low quality (Aggarwal et al., 2021; Sun et al., 2022; Rajani et al., 2019), e.g., not providing sufficient knowledge to support a given answer. Meanwhile, a recent study shows that rationales automatically generated by pretrained LMs are often preferable over human-annotated rationales (Wiegreffe et al., 2021). Therefore, for PINTO’s rationalizing module, we propose using a pretrained LM to generate rationales via in-context learning, which prompts the frozen LM to retrieve knowledge from its parameters (Wei et al., 2022b).

The prompt consists of a fixed set of question-answer-rationale demonstrations that are randomly selected from the training set. Each demonstration consists of a question $q$, answer choices $A$, gold answer $a^{*}\in A$, and a human-annotated free-text rationale $r^{*}\in R$ for $a^{*}$ (Table 1). We include the answer choices $A$ in the prompt so that the LM is aware of all the available choices and can thus generate a more distinctive rationale. As opposed to full human annotation, we only need a few (usually $<8$) examples per dataset. With this prompt $p$, we use the LM to generate rationales for every instance from the dataset. Specifically, for each $a_{i}\in A$ of some instance ($q$, $A$), the rationalizing LM’s input is constructed as [$p$, $q$, $A$, $a_{i}$]. Then, we use greedy decoding of the LM output to obtain rationale $r_{i}$ for $a_{i}$. Note that the LM input does not have any information about the gold answer $a^{*}$. Our rationalizing module’s design assumes that $r_{i}$ will be aligned with accurate knowledge if and only if $a_{i}=a^{*}$, since it should intuitively be difficult to retrieve correct knowledge that supports an incorrect answer choice (see Table 11 in the appendix for examples of the generation). The reasoning module then predicts the correct answer by reasoning over the rationales for each answer choice.
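The input construction [$p$, $q$, $A$, $a_{i}$] can be sketched as follows. This is a minimal illustration only: the text template ("Q:", "Answer Choices:", "A: The answer is ...") and the function names are assumptions made for this example, not the prompt format actually used (the complete prompts are given in the appendix tables).

```python
from typing import List, Sequence, Tuple

# Hypothetical demonstration record: (question, choices, gold answer, rationale).
Demo = Tuple[str, Sequence[str], str, str]


def format_choices(choices: Sequence[str]) -> str:
    """Render the answer choices A as a single comma-separated string."""
    return ", ".join(choices)


def build_prompt(demos: Sequence[Demo]) -> str:
    """Concatenate the fixed demonstrations into the prompt p (assumed template)."""
    blocks = []
    for question, choices, gold, rationale in demos:
        blocks.append(
            f"Q: {question}\n"
            f"Answer Choices: {format_choices(choices)}\n"
            f"A: The answer is {gold}. {rationale}"
        )
    return "\n\n".join(blocks)


def rationalizer_inputs(
    prompt: str, question: str, choices: Sequence[str]
) -> List[str]:
    """Build one input [p, q, A, a_i] per answer choice.

    The gold answer is never passed in, so every choice is treated alike.
    """
    return [
        f"{prompt}\n\n"
        f"Q: {question}\n"
        f"Answer Choices: {format_choices(choices)}\n"
        f"A: The answer is {choice}."
        for choice in choices
    ]
```

Each returned string would then be passed to the frozen LM and completed with greedy decoding to obtain $r_{i}$; the decoding call itself is omitted here because it depends on the serving setup.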

3.2 Reasoning Module

Given a question $q$ , the answer choices $A$ , answer candidate $a_{i}\in A$ , and rationale $r_{i}$ , the reasoning module learns to output plausibility score $\rho_{i}=\rho(q,A,a_{i},r_{i})$ . Following prior works, we use a text-to-text Transformer LM as the backbone of our reasoning module (Wiegreffe et al., 2020; Hase et al., 2020). For each $a_{i}$ , the reasoning module’s input is defined as the token sequence $s=[q\oplus a_{1}\oplus...\oplus a_{|A|}\oplus r_{i}]$ , where $\oplus$ denotes concatenation. Meanwhile, the reasoning module’s output is obtained by sequentially teacher-forcing $a_{i}$ ’s tokens $t_{i}=[t_{i}^{1},t_{i}^{2},...,t_{i}^{|a_{i}|}]$ into the decoder, rather than via greedy decoding. This way, we can compute the reasoning module’s output token probabilities for arbitrary answer choices $a_{i}$ . Following Shwartz et al. (2020), we compute $a_{i}$ ’s plausibility score $\rho_{i}$ by aggregating the probabilities $P$ of tokens $t_{i}^{j}$ as:

$$\rho_{i}=\frac{1}{|a_{i}|}\sum_{j=1}^{|a_{i}|}\log P(t_{i}^{j}\hskip 1.42262pt|\hskip 1.42262ptt_{i}^{j-1},\ldots,t_{i}^{1},q,A,r_{i})$$

Next, we use the softmax function to normalize $\rho_{i}$ as probability $P(a_{i}\hskip 1.42262pt|\hskip 1.42262ptq,A,R)=e^{\rho_{i}}/\sum_{j=1}^{|A|}\hskip 1.42262pte^{\rho_{j}}$ . During inference, given question $q$ and answer choices $A$ , the rationalizing module first generates rationales $R=\{r_{i}\}$ , then the reasoning module computes the predicted answer choice as $\hat{a}=\operatorname*{arg\,max}_{a_{i}\in A}P(a_{i}\hskip 1.42262pt|\hskip 1.42262ptq,A,R)$ .
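The scoring and normalization steps can be sketched in a few lines. The example assumes that $\rho_{i}$ is the mean log-probability of the teacher-forced answer tokens (one common aggregation); the per-token log-probabilities are taken as given, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def plausibility(token_log_probs: Sequence[float]) -> float:
    """Assumed aggregation: mean log-probability of the answer's tokens."""
    return sum(token_log_probs) / len(token_log_probs)


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize plausibility scores into a distribution over choices."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(
    per_choice_log_probs: Sequence[Sequence[float]],
) -> Tuple[int, List[float]]:
    """Return the index of the most plausible choice and all probabilities."""
    scores = [plausibility(lp) for lp in per_choice_log_probs]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

Length normalization keeps answer choices with many tokens from being penalized merely for their length, which matters when choices vary from one word to a short phrase.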

3.3 Training

For multi-choice QA, the standard training objective is to maximize the likelihood of the correct answer choice using cross-entropy loss, computed as:

$$\mathcal{L}_{\text{std}}=-\sum_{a_{i}\in A}Q(a_{i}\hskip 1.42262pt|\hskip 1.42262ptq,A)\log P(a_{i}\hskip 1.42262pt|\hskip 1.42262ptq,A,R)\tag{1}$$

where $Q(a_{i}\hskip 1.42262pt|\hskip 1.42262ptq,A)$ is 1 if $a_{i}=a^{*}$ and 0 otherwise. Let $Q(A\hskip 1.42262pt|\hskip 1.42262ptq,A)$ be the one-hot target distribution over all $a_{i}\in A$ . There can be spurious correlations between $q$ and $A$ (Branco et al., 2021), so the reasoning module may take undesirable shortcuts instead of properly using the rationale to predict the answer (Gururangan et al., 2018; McCoy et al., 2019). In this case, the rationales would be unfaithful in explaining the model’s behavior and useless for model debugging.

To address this, we introduce a counterfactual regularization objective in which the reasoning module is regularized to output less confident predictions when the rationale is not utilized properly (i.e., shortcuts are used). This is implemented using label smoothing (Szegedy et al., 2016), which softens the target distribution $Q(A\hskip 1.42262pt|\hskip 1.42262ptq,A)$ by linearly combining it with a noisy distribution $U(A\hskip 1.42262pt|\hskip 1.42262ptq,A)$ , often set as the uniform distribution. Therefore, given tunable label smoothing factor $0<\epsilon<1$ , we compute the label-smoothed target distribution as: $Q^{\prime}(A\hskip 1.42262pt\hskip 1.42262pt|\hskip 1.42262pt\hskip 1.42262ptq,A)=(1-\epsilon)\hskip 1.42262ptQ(A\hskip 1.42262pt|\hskip 1.42262ptq,A)+\epsilon\hskip 1.42262ptU(A\hskip 1.42262pt|\hskip 1.42262ptq,A)$ .
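As a small worked example of the label-smoothed target, the sketch below assumes the noisy distribution $U$ is uniform over the answer choices; the function name is hypothetical.

```python
from typing import List


def smooth_targets(num_choices: int, gold_index: int, epsilon: float) -> List[float]:
    """Compute Q' = (1 - epsilon) * Q + epsilon * U with U uniform (assumed)."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    uniform = 1.0 / num_choices
    return [
        (1.0 - epsilon) * (1.0 if i == gold_index else 0.0) + epsilon * uniform
        for i in range(num_choices)
    ]
```

For a 5-choice question with $\epsilon=0.1$, the gold choice receives $0.9+0.1/5=0.92$ and every other choice receives $0.02$, so the target still favors the gold answer but no longer demands full confidence.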

In order to simulate shortcut reasoning, we consider two strategies for perturbing the generated rationales $r_{i}$. Token Masking addresses the case where the reasoning module ignores the rationale and instead exploits spurious cues in the rest of the input. To simulate this, we mask out the rationales in the input. Recall that the backbone of the reasoning module is a Transformer LM, which uses a self-attention mechanism to aggregate information across tokens. Hence, we implement rationale masking by zeroing the attention mask for rationale tokens. (We choose not to replace the rationale tokens with special mask tokens, since the LM is already pretrained to recover mask tokens and we want to ensure that it is completely deprived of this ability.) Token Replacement addresses the case where the reasoning module misunderstands the rationales’ meaning and thus uses them improperly. To simulate this, we randomly replace $k\%$ of the rationale tokens with other tokens uniformly sampled from the entire language vocabulary.
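The two perturbations can be illustrated on plain lists of token ids. This is a sketch under stated assumptions: the rationale is assumed to occupy a contiguous span `[start, end)` of the input, and the function names and arguments are hypothetical rather than taken from any released code.

```python
import random
from typing import List, Sequence


def mask_rationale(
    attention_mask: Sequence[int], start: int, end: int
) -> List[int]:
    """Token masking: zero the attention mask over rationale positions."""
    masked = list(attention_mask)
    for pos in range(start, end):
        masked[pos] = 0
    return masked


def replace_rationale_tokens(
    token_ids: Sequence[int],
    start: int,
    end: int,
    vocab_size: int,
    rate: float,
    rng: random.Random,
) -> List[int]:
    """Token replacement: overwrite a fraction `rate` of rationale tokens
    with ids drawn uniformly from the vocabulary.

    A draw may coincide with the original id; this sketch does not resample.
    """
    replaced = list(token_ids)
    positions = list(range(start, end))
    num_to_replace = round(rate * len(positions))
    for pos in rng.sample(positions, num_to_replace):
        replaced[pos] = rng.randrange(vocab_size)
    return replaced
```

Masking leaves the token ids untouched and only hides them from attention, whereas replacement keeps the positions visible but corrupts their content, which mirrors the "ignored" versus "misused" distinction.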

At each fine-tuning step, we randomly select one of the strategies for obtaining perturbed rationales $R^{\prime}=\{r^{\prime}_{i}\}$ , which helps keep the LM from overfitting to any particular strategy. Then, the counterfactual regularization loss is computed as:

$$\mathcal{L}_{\text{c-reg}}=-\sum_{a_{i}\in A}Q^{\prime}(a_{i}\hskip 1.42262pt|\hskip 1.42262ptq,A)\log P(a_{i}\hskip 1.42262pt|\hskip 1.42262ptq,A,R^{\prime})\tag{2}$$

This counterfactual regularization teaches the reasoning module to be less confident when the rationales are either absent or problematic, so that it can learn to make sounder use of the rationales.
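To make the interaction between the two objectives concrete, the sketch below computes a cross-entropy against the one-hot target on clean-rationale scores and against a label-smoothed target on perturbed-rationale scores. It assumes a uniform smoothing distribution and an unweighted sum of the two terms; both choices, and all names, are assumptions for illustration.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Log of the softmax-normalized plausibility scores."""
    peak = max(scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_total for s in scores]


def cross_entropy(target: Sequence[float], scores: Sequence[float]) -> float:
    """Cross-entropy between a target distribution and softmax(scores)."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(scores)))


def total_loss(
    scores_clean: Sequence[float],
    scores_perturbed: Sequence[float],
    gold_index: int,
    epsilon: float,
) -> float:
    """Standard loss on clean rationales plus regularizer on perturbed ones."""
    n = len(scores_clean)
    one_hot = [1.0 if i == gold_index else 0.0 for i in range(n)]
    smoothed = [(1.0 - epsilon) * q + epsilon / n for q in one_hot]
    loss_std = cross_entropy(one_hot, scores_clean)
    loss_creg = cross_entropy(smoothed, scores_perturbed)
    return loss_std + loss_creg  # unweighted sum: an assumption
```

Under the smoothed target, a model that stays highly confident on perturbed rationales pays a larger penalty than one whose predictions flatten out, which is the intended pressure toward relying on the rationale.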

Experimental Setup

Questions and hypotheses We design experiments to answer the following questions: (1) What is the impact of our PINTO pipeline on faithfulness and end-task performance? We expect our pipeline with counterfactual training technique to obtain improvements in both aspects. (2) How does the quality of rationales affect the end-task performance of PINTO? We hypothesize that improving the quality of the rationales of PINTO improves its accuracy. (3) Does faithful reasoning based on rationales lead to better generalization? We expect that a method like PINTO that learns to rely on rationales can better generalize to a low resource setting and out-of-distribution (OOD) datasets.

Datasets We experiment with several commonsense reasoning (CSR) benchmarks. (1) CommonsenseQA (Talmor et al., 2018) is a 5-choice QA dataset testing general commonsense reasoning about the concepts from ConceptNet (Speer et al., 2017). (2) StrategyQA (Geva et al., 2021) is a binary (yes/no) QA dataset that requires models to infer the reasoning strategy. (3) OpenBookQA (Mihaylov et al., 2018) is a 4-choice QA dataset that requires reasoning based on open-book as well as broad commonsense knowledge. (4) QASC (Khot et al., 2020) is an 8-choice QA dataset that requires a system to answer a question with a valid composition of basic facts using common sense. Since the gold labels for the test sets of these datasets are not publicly available, we treat the official development set as our test set, and split the training data into our own training set and development set.

Evaluation Metrics To evaluate the reasoning model’s task performance, we use the accuracy metric and consider both ID and OOD test sets in our experiments. ID/OOD test sets are taken from the same/different dataset as the training set. To evaluate the faithfulness of the generated rationale to the reasoning model’s predicted label, we adopt the LAS metric (Hase et al., 2020). LAS measures rationale-label consistency as how well the rationale helps a simulator model predict the reasoning model’s predicted label. Following Hase et al. (2020), we implement the simulator as a fine-tuned T5-Base LM (Raffel et al., 2020a). To aggregate accuracy and LAS as a single metric, we use the Normalized Relative Gain (NRG) metric (Chan et al., 2022). Across all compared methods, NRG first normalizes each of the two constituent metrics’ scores as values in $[0,1]$, then obtains the aggregate score by taking the mean of the two normalized scores.
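The aggregation can be sketched as follows, assuming the normalization is a min-max rescaling over the compared methods (the text above only states that scores are normalized before averaging). The function names and the handling of the degenerate all-equal case are assumptions.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one metric across methods to [0, 1] (assumed min-max scheme)."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # Degenerate case: all methods tie, so no method gains anything.
        return {name: 0.0 for name in scores}
    return {name: (value - low) / (high - low) for name, value in scores.items()}


def nrg(accuracy: Dict[str, float], las: Dict[str, float]) -> Dict[str, float]:
    """Mean of the normalized accuracy and normalized LAS for each method."""
    acc_norm = min_max_normalize(accuracy)
    las_norm = min_max_normalize(las)
    return {name: (acc_norm[name] + las_norm[name]) / 2.0 for name in accuracy}
```

Because the normalization is relative to the set of compared methods, an NRG value is only meaningful within one comparison table and changes if methods are added or removed.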

Implementation Details For the rationalizing module, we use GPT-NeoX (Black et al., 2022), a pretrained, autoregressive LM with 20B parameters. We manually annotate 7 examples to set up the prompt for each task dataset. For the reasoning module, we adopt T5-Base (Raffel et al., 2020a) with only 220 million parameters, which is around two orders of magnitude smaller than the rationalizing module. During fine-tuning, the standard training loss (Eq. 1) and our counterfactual training loss (Eq. 2) are directly combined as the overall training loss. For perturbing rationales, we randomly choose the token masking or token replacement strategy with equal probability in each training batch. The replacement rate for token replacement is empirically set to $30\%$. We run all experiments on the compared methods 4 times using a fixed set of random seeds and report the average results.

Baselines (1) Without Rationales is a T5-based model fine-tuned on the task dataset without using any rationales as additional input. (2) Prompted Self-Rationalization is a GPT-NeoX LM that learns from a few examples in the prompt to firstly generate a few short sentences as the rationale and then predict the answer. Here, we use the chain-of-thought prompting configuration from Wei et al. (2022b). (3) Distilled Self-Rationalization is a small LM (T5-base) trained on the rationales generated by the Prompted Self-Rationalization model. We implement two variants of the distillation model: a) Rationalize-First, which firstly generates the rationale and then predicts the answer, and b) Predict-First, which firstly predicts the answer and then generates the rationale. (4) NILE (Kumar & Talukdar, 2020) trains a rationalization module by fine-tuning a T5-3B model (Raffel et al., 2020a) with the rationales annotated by humans, then trains a reasoning module by fine-tuning a T5-Base model with the task dataset as in our method. We only apply NILE on the CSQA and StrategyQA datasets, since they provide human-annotated gold rationales. (5) Standard Training uses the same rationalize-then-reason pipeline as our method, except the reasoning module is not fine-tuned with the counterfactual training loss. (6) Dropout Context is the same as the Standard Training baseline, except the question is randomly dropped out from the input while fine-tuning the reasoning module. This is a strategy used in prior work to encourage the reasoning module to make good use of the input rationales (Hase et al., 2020).

Further, we also consider two variants of PINTO, namely Token Masking Only and Token Replacement Only as baselines. These baselines only adopt token masking or token replacement for perturbing rationale tokens, respectively.

Experiments

In-Distribution (ID) Performance We first evaluate all methods on ID test sets. Table 2 shows the task performance of these methods, with fine-tuning methods using T5-Base as the reasoning module. We make two observations. First, the Prompted Self-Rationalization baseline (using the 20B-parameter GPT-NeoX) generally does not outperform the fine-tuning methods, whereas the GPT-3 version is reported to achieve $73.50$ and $66.53$ accuracy on CSQA and StrategyQA, respectively (Wei et al., 2022b). This is consistent with the finding that Prompted Self-Rationalization requires very large LMs to work effectively (Wei et al., 2022a). Second, simply augmenting the reasoning module with rationales (as in Standard Training) does not always lead to better results compared with the Without Rationales baseline, since the rationales may not be properly utilized. The Dropout Context baseline helps to address this issue in some, but not all, cases, while PINTO yields the best accuracy in most cases. We have similar observations from results using RoBERTa-Large as the reasoning module (Table 5 of §A.1). This demonstrates the effectiveness of our counterfactual regularization method in improving ID generalization.

Out-of-Distribution (OOD) Performance To further demonstrate the generalizability brought by faithful reasoning over rationales, we also investigate the performance of our method on OOD test sets. The intuition is that by utilizing rationales faithfully rather than fitting only the ID training data, our model achieves better OOD generalization without any fine-tuning. Table 3 shows the OOD performance of all the fine-tuning methods using T5-Base. We conclude that rationales are helpful in improving the generalizability of the model to a dataset unseen during fine-tuning. Among all the methods utilizing rationales, our method yields the best OOD performance, which confirms the benefit of faithful reasoning. A consistent conclusion can be made from the results based on RoBERTa-Large (Table 6 of §A.1).

Rationale-Label Association Table 2 also reports the faithfulness of all the methods involving rationalization measured by LAS. We observe that PINTO achieves a much higher score compared with the baselines except on OpenBookQA. This demonstrates that counterfactual regularization helps the reasoning module make predictions more faithfully with respect to the rationales.

5.2 Performance Analysis

How do different perturbation strategies contribute to the overall performance? Table 2 shows the results of the ablation study where we only conduct Token Masking or Token Replacement when perturbing the rationale tokens. In most cases, Token Replacement leads to both better accuracy and faithfulness compared with Token Masking. This is because Token Replacement perturbs the semantics of the rationales more severely, thus further forcing the reasoning module to properly make use of the rationales. Our method yields the best results when both types of perturbation are conducted, which suggests that the two strategies together cover the different ways in which a reasoning module could use the rationales improperly.

Can faithful rationales lead to better low-resource performance? We also investigate whether, with counterfactual training, the reasoning module can be fine-tuned with less training data. Figure 5 shows the accuracy of all the fine-tuning methods. We observe that our method consistently outperforms the baselines at different percentages of training data. The performance gap is larger when less training data is used, demonstrating the data efficiency of our method.

Can we refine the reasoning behavior via rationales? One important application of faithful reasoning is that rationales provide a way to refine the behavior of a model, i.e., we can correct reasoning mistakes by providing a better rationale. To verify this, we make use of ECQA (Aggarwal et al., 2021), which augments CSQA with human-annotated rationales. We directly provide the human-annotated rationales to the fine-tuned reasoning modules to obtain their oracle results, shown in Figure 5. We see that human-annotated rationales generally lead to a performance gain for all fine-tuning methods, with our method gaining the most. This again showcases the merit of ensuring faithful reasoning over rationales when refining a system.

Is our method more sensitive to perturbed rationales? Intuitively, higher rationale faithfulness (i.e., a stronger connection between the rationale and the reasoning module’s behavior) should yield greater sensitivity to noisily perturbed rationales. In other words, a larger performance drop (sensitivity) signals higher faithfulness. To verify this, we conduct a stress test. We choose CSQA and OpenBookQA and replace each question in the test set with a randomly sampled question while keeping the original answer choices. We then prompt our rationalizing module with the replaced question and the original choices to obtain a set of perturbed rationales. Finally, we provide the perturbed rationales to the reasoning module. Our results in Table 4 show that PINTO exhibits a significantly larger performance drop than the other two methods (esp. on OBQA), indicating that counterfactual regularization is effective in improving rationale faithfulness.
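The question-swapping step of this stress test can be sketched as below. It assumes each replacement question is drawn from a different instance of the same test set, which the description above does not specify; the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Instance = Tuple[str, Sequence[str]]  # (question, answer choices)


def swap_questions(
    instances: Sequence[Instance], rng: random.Random
) -> List[Instance]:
    """Pair each instance's original choices with another instance's question.

    Requires at least two instances so that a different question exists.
    """
    count = len(instances)
    perturbed = []
    for i, (_, choices) in enumerate(instances):
        j = rng.choice([k for k in range(count) if k != i])
        perturbed.append((instances[j][0], choices))
    return perturbed


def performance_drop(acc_original: float, acc_perturbed: float) -> float:
    """Sensitivity: accuracy lost when rationales come from swapped questions."""
    return acc_original - acc_perturbed
```

The swapped pairs would then be sent through the rationalizing module to produce the perturbed rationales, and the drop is computed from accuracy with original versus perturbed rationales.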

Related Work

Extensive work has been done on solving implicit reasoning tasks by augmenting reasoning LMs with external knowledge beyond the task input. Prior works have explored retrieving implicit knowledge from: (1) knowledge graphs (Lin et al., 2019; Feng et al., 2020; Wang et al., 2020; Yan et al., 2021; Chan et al., 2021; Raman et al., 2021), (2) web corpora (Lv et al., 2020; Chen et al., 2017; Yang et al., 2015; Ryu et al., 2014), or (3) pretrained LMs (Shwartz et al., 2020; Liu et al., 2021; Bosselut et al., 2019; Shin et al., 2020). Although knowledge retrieval has shown to be helpful in boosting reasoning LMs’ task performance, it may not necessarily explain the decisions made by the LM. Given the lack of transparency in neural LMs’ complex behavior (Rudin, 2019; Caruana, 2019), model explainability is important for promoting human trust in NLP systems for high-stakes decision-making (Doshi-Velez & Kim, 2017; Lipton, 2018; Bender et al., 2021). We focus on rationale generation in this work as a way to both improve an LM’s task performance and provide justification for its predictions.

Prior works on free-text rationale generation can be grouped into three paradigms. In the fine-tuned self-rationalizing paradigm, a single LM is fine-tuned to jointly generate the task output and rationale (Narang et al., 2020; Marasović et al., 2022; Zelikman et al., 2022; Li et al., 2022). Since the LM parameters are shared across two relatively dissimilar objectives, they often perform worse than non-rationalizing LMs (Wiegreffe et al., 2020; Narang et al., 2020). Notably, this paradigm requires expensive rationale annotations for all training instances. In the prompted self-rationalizing paradigm, a single LM is instead frozen and prompted to jointly generate the task output and rationale, with the prompt consisting of a few input-output-rationale demonstrations (Wei et al., 2022b). This paradigm performs well and only needs a few rationale annotations for the prompt, but it is computationally prohibitive since it generally requires very large-scale LMs to work effectively (Lampinen et al., 2022; Wei et al., 2022b). In the pipeline-rationalizing paradigm, a fine-tuned rationalizing LM first generates the rationale, which is then used as input for a separate fine-tuned reasoning LM to generate the output (Kumar & Talukdar, 2020; Rajani et al., 2019). Here, the generated rationale forms a discrete (i.e., non-differentiable) bottleneck between the two modules, which complicates end-to-end training and can hurt task performance (Wiegreffe et al., 2020; Hase et al., 2020). Additionally, the dedicated rationalizing LM requires extra rationale annotation/computation costs. Moreover, none of these paradigms has a mechanism for regularizing the rationale generation to faithfully reflect the reasoning process of the LM, without hurting task performance. 
PINTO avoids these limitations by rationalizing via prompt-based learning (using a frozen medium-scale LM), then reasoning over the task input and rationale via counterfactual regularization (using a fine-tuned small-scale LM).

Conclusion

This paper presents PINTO, an LM pipeline that rationalizes with prompt-based learning and reasons via counterfactual regularization. Through prompting, we remove the need for expensive human annotation and leverage the massive knowledge encoded in a medium-sized LM to perform rationalization. With counterfactual regularization in addition to the standard training objective, our reasoning module learns to reason over the generated rationales more faithfully. Experiments show that our method outperforms baselines on both in-distribution and out-of-distribution datasets in accuracy, while providing higher faithfulness. Our analysis also shows that we can further improve task performance with a more faithful reasoning module and refined rationales.

Acknowledgement

We thank the anonymous reviewers and all the collaborators in the USC INK research lab for their valuable feedback. This material is based upon work sponsored by the DARPA MCS program under Contract No. N660011924033 with the United States Office of Naval Research.

References

Appendix A

Tables 5 and 6 show the ID and OOD results, respectively, based on RoBERTa-Large as the reasoning module. The observations are consistent with Tables 2 and 3, where we fine-tune T5-Base as the reasoning module.

A.2 Ablation on the LM size for the Rationalizing Module

Table 7 shows the results of the Pipeline approaches using LMs with different model sizes as the rationalizing module.

A.3 Variance statistics of all the fine-tuned models

Table 8 and Table 9 show the variance statistics (standard deviation) along with ID and OOD task performance (accuracy) of the fine-tuned methods from Table 2 and Table 3.

A.4 Human evaluation on the rationales

We conducted a human evaluation of 100 generated rationales from the CSQA dataset. The evaluation is a head-to-head comparison between the human-annotated rationales and the machine-generated rationales. Annotators were asked to judge 5 dimensions on a 3-point Likert scale, following prior work (Wiegreffe et al., 2021): 1) Factuality (How factual is this Explanation?) 2) Grammaticality (Is this Explanation grammatical?) 3) New Info (Does the Explanation provide new facts, information, or reasoning not stated in the Question and Answer?) 4) Supports Answer (Is the Explanation relevant to justifying the Answer?) 5) Completeness (Does the Explanation provide enough information to justify the answer?)

We obtain a fair level of agreement as measured by Fleiss’ kappa ($\kappa=0.34$) for the evaluation. The results in Table 10 show that machine-generated rationales are competitive with human annotations on most of the evaluated dimensions. Generated rationales are even judged to be more grammatical than human annotations. As for completeness, generated rationales are slightly worse than human annotations. We think this is because the human annotators were explicitly encouraged to provide more comprehensive rationales when annotating the CSQA dataset (Aggarwal et al., 2021).

A.5 Case Study

We provide concrete examples in Table 11 to showcase how our prompted LM rationalizes for correct and incorrect choices and how PINTO reasons more faithfully compared with the Standard baseline. In the question (second row) from CSQA, we can see that the rationales generated for the incorrect choices do not support them as the answer, while the one for the correct choice refrigerator does. In the question (third row) from StrategyQA, the rationale for the correct choice yes is sound and reasonable, while the rationale for the incorrect choice no is factually correct but does not answer the question directly (died in a plane crash vs. died in the space journey). For both questions, PINTO properly leverages the rationales and makes the correct predictions, while the Standard baseline fails.

A.6 Prompts for rationalization

Tables 12-15 show the complete prompts we use to obtain rationales from the LM for the CSQA, StrategyQA, OpenBookQA, and QASC datasets.