ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning

Olga Golovneva, Moya Chen, Spencer Poff, Martin Corredor, Luke Zettlemoyer, Maryam Fazel-Zarandi, Asli Celikyilmaz

Introduction

Scaling language models has improved state-of-the-art performance on nearly every NLP benchmark (Brown et al., 2020), with large language models (LLMs) performing impressively as few-shot learners (Brown et al., 2020). Despite these achievements, even the largest of these models still struggle with tasks including math word problems (Hendrycks et al., 2021), symbolic manipulation (Rytting & Wingate, 2021), and commonsense reasoning (West et al., 2022). Recent work has shown that prompting (Wei et al., 2022; Wang et al., 2022) or fine-tuning (Lampinen et al., 2022) LLMs to generate step-by-step rationales can lead to improvements on reasoning tasks. Some of these include small-scale analysis of specific error types within step-by-step rationales (Lewkowycz et al., 2022; Chowdhery et al., 2022), as shown in Table 1. However, existing works primarily focus on end-task performance. Although text generation evaluation metrics sometimes offer fine-grained quality evaluations (e.g., adequacy, fluency) against human scores (Opitz & Frank, 2021; Leiter et al., 2022), these metrics generally treat the output as a whole, and many of these generative metrics operate on tasks such as summarization or machine-translation rather than reasoning.

In this paper, we present ROSCOE, a suite of interpretable and fine-grained step-by-step generation evaluation metrics to address the above gaps. Rather than providing one score that only evaluates the generated text as a whole, ROSCOE encapsulates fine-grained metrics under four perspectives: (1) semantic alignment defines to what extent the generated reasoning is coherent and grounded in the source context; (2) logical inference evaluates whether the generated reasoning steps are consistent with one another and checks for logical fallacies; (3) semantic similarity quantifies the degree of similarity between the generated reasoning and the context or between intermediate steps to capture hallucinations or repetitions; and (4) language coherence evaluates whether the whole chain flows naturally.

To evaluate ROSCOE against existing metrics, we devise a taxonomy of reasoning errors for multi-step generations and use it to create synthetic data and collect human evaluations on commonly used reasoning datasets. Our taxonomy and annotated datasets help us gain deeper insights into the causes of reasoning inconsistencies and weaknesses of LLMs. We evaluate ROSCOE with $18$ fine-grained metrics under the above four perspectives. ROSCOE demonstrates performance gains against baseline evaluation metrics on all tasks that require reasoning over context. Additional sensitivity analysis shows that ROSCOE is more robust when dealing with tasks that require logical and arithmetic reasoning.

Contributions. (1) We propose a new taxonomy for reasoning errors, and use it for collecting human annotations and creating synthetic datasets. (2) Using our taxonomy, we propose a new suite of metrics that focus on sequence and step level analysis of step-by-step reasoning. (3) We present extensive comparative analysis on 11 datasets of varied complex reasoning problems demonstrating the strengths of each metric, especially in terms of interpretability relative to baselines, and considerations for use.

Related Work

Evaluating Explanations. Free-form natural language (NL) explanations of model decisions should enable accurate representation of the reasoning process and degree of plausibility (Danilevsky et al., 2020; Jacovi & Goldberg, 2021; Jacovi et al., 2021). A qualitative assessment of NL explanations with correctness labels collected from human judges was presented in (Camburu et al., 2018). Recent work has also investigated automatic metrics for natural language generation (NLG) evaluation, including word overlap or embedding based similarity with human written explanations (Clinciu et al., 2021). Though fast and cost-effective, automatic metrics for NLG are not equipped to measure the logical inconsistencies or information gain with thinking steps (Reiter, 2019; Celikyilmaz et al., 2020). Explanations have also been evaluated by collecting datasets and running correlation analysis to investigate the degree to which an automatic metric correlates with human judgements of clarity, relevance and informativeness (Leiter et al., 2022; Welleck et al., 2022). Although reliable, human evaluation is an expensive, domain specific, and time-consuming process. In comparison, ROSCOE provides generic automatic evaluation procedures that are domain and task specific.

Automatic Metrics. Many NLG evaluation metrics exist in the literature, including ones based on n-gram match (Lin, 2004), regression (Sellam et al., 2020), embedding proximity (Zhang et al., 2020), paraphrasing (Thompson & Post, 2020), generation as an evaluator (Yuan et al., 2021), and information alignment (Deng et al., 2021), among others. Although these metrics are easy to use, they evaluate the alignment of two texts as a whole and are not designed to assess individual reasoning steps. The closest metrics to ours are CTC (Deng et al., 2021) and BARTScore (Yuan et al., 2021), as both introduce a set of interpretable metrics to evaluate the similarity between two texts. However, ROSCOE is unique in providing fine-grained interpretations of reasoning steps, determining contradictions, and identifying ordering issues in the reasoning narrative.

Self-Consistency with LLMs. Recent work on improving LLM performance on complex reasoning tasks uses an ensemble strategy called self-consistency (Wang et al., 2022). This method samples a diverse set of reasoning paths from a language model via reasoning traces prompting and returns the most consistent final answer in the set. Other work evaluates the diversity of a reasoning path (Li et al., 2022) or the consistency of an inference step (Creswell et al., 2022), or finetunes LLMs (Zelikman et al., 2022) to improve on difficult NLP tasks. In contrast to these works, we present a suite of metrics that focus on determining the type of the error (e.g., commonsense or logical inconsistency) in a reasoning path, if one exists.

Reasoning Error Taxonomy and Datasets Construction

Problem Formulation. Our goal is to score step-by-step rationales generated by a language model. We assume that the model is given a source context ${\bm{s}}=\{s_{1},\cdots,s_{T}\}$ of T-sentences indicating a problem statement followed by a question and is prompted to generate step-by-step reasoning (Nye et al., 2021). We refer to this as a hypothesis ${\bm{h}}=\{h_{1},\cdots,h_{N}\}$ of N-steps, including a final answer as the last step. We do not assume availability of gold step-by-step reasoning references ${\bm{r}}=\{r_{1},\cdots,r_{K}\}$ of K-steps.

Taxonomy. We propose a new taxonomy of generic reasoning errors for language problem solving. We first conduct a manual preliminary analysis of different types of LLM reasoning errors using the five human judged datasets described below. Based on our analysis, we identified nine error types centered on the overall reasoning chain (i.e., the quality of the step-by-step thinking, including consistency with the context and commonsense reasoning). Our taxonomy also includes fine-grained errors marking inconsistency of a reasoning step with the previous steps, whether each step contributes to the final decision, and overall logical inference or fluency issues. The definition of error types is in Table 2, and Table 10 provides examples.

Datasets and Annotations. To evaluate ROSCOE, we select datasets covering a diverse set of tasks that require reasoning skills (e.g., logical, arithmetic, and commonsense reasoning tasks). We separate these datasets into two categories: (1) Diagnostics datasets that contain gold standard step-wise reasoning chains, where we synthetically perturb some of the reasoning steps to introduce different generation errors (e.g., missing step, mathematical error, etc.); (2) Human judged datasets with model generated step-by-step reasoning outputs where the reasoning error evaluations are solicited from expert judges. We investigate these in $\S$ 5.

Reasoning Scorer: ROSCOE

We present our fine-grained metrics under four perspectives: semantic alignment, semantic similarity, logical inference and language coherence. Each metric is bounded within $[0,1]$, where $1$ indicates the perfect score and $0$ corresponds to failure. A metric is reference-free or unsupervised when it uses the source and hypothesis (${\bm{h}}\rightarrow{\bm{s}}$), and reference-based or supervised when it is evaluated between hypothesis and reference (${\bm{h}}\rightarrow{\bm{r}}$).

At the core of the ROSCOE semantic alignment metrics (semantic alignment refers to the determination of relations between concepts with the same or a similar intended meaning; Agirre et al., 2013) is the reasoning alignment vector from the $N$-step hypothesis ${\bm{h}}$ to the source ${\bm{s}}$ of length $T$: $r\textnormal{-align}({\bm{h}}\rightarrow{\bm{s}})=\{\alpha_{1},\alpha_{2},\cdots,\alpha_{N}\}$, where each alignment value $\alpha_{i}=r\textnormal{-align}(h_{i}\rightarrow{\bm{s}})=\left[1+\max_{j=1}^{T}\cos(h_{i},s_{j})\right]/2\in[0,1]$ is the normalized cosine similarity between a hypothesis step and the most similar sentence in the context, and explicitly measures the grounding of the step-wise reasoning with respect to the source text (illustrated in App. D, Fig. 3). We estimate the alignment vector $r\textnormal{-align}({\bm{h}}\rightarrow{\bm{s}})$ by matching source text and the reasoning chains on the embeddings of tokens and individual reasoning steps. A similar information alignment score is introduced in CTC (Deng et al., 2021) to measure the confidence that the information of the $j$-th source document token $s_{j}$ is grounded by a hypothesis token $h_{i}$. Our reasoning alignment is different in that we measure if a hypothesized reasoning step $h_{i}$ supports the source context ${\bm{s}}$. Our proposed metrics are summarized in Table 3.
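The alignment value defined above can be sketched in a few lines. This is only an illustration of the formula $\alpha_{i}=[1+\max_{j}\cos(h_{i},s_{j})]/2$: the function names `cosine` and `r_align` and the 2-dimensional toy vectors are assumptions made for the example, whereas the actual metrics operate on embeddings produced by the sentence embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def r_align(hyp_embs, src_embs):
    """Return [alpha_1, ..., alpha_N] with alpha_i = (1 + max_j cos(h_i, s_j)) / 2."""
    return [
        (1.0 + max(cosine(h, s) for s in src_embs)) / 2.0
        for h in hyp_embs
    ]

# Toy 2-d vectors standing in for sentence embeddings (invented for illustration).
source = [[1.0, 0.0], [0.0, 1.0]]                     # T = 2 source sentences
hypothesis = [[1.0, 0.0], [-1.0, 0.0], [1.0, 1.0]]    # N = 3 reasoning steps

alphas = r_align(hypothesis, source)
print(alphas)
```

A step that coincides with a source sentence receives the maximum value of 1, and the normalization keeps every value in $[0,1]$ even when the best cosine similarity is negative.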

2 Semantic Similarity Metrics (ROSCOE-SS)

Semantic similarity metrics quantify the degree of semantic equivalence between pieces of text. As opposed to the ROSCOE-SA metrics, ROSCOE-SS considers text as a whole, rather than relying on comparisons between text units. We propose the following metrics, summarized in Table 4.

3 Logical Inference Metrics (ROSCOE-LI)

4 Language Coherence Metrics (ROSCOE-LC)

Experimental Setup

Diagnostics Datasets. We construct our first category of labeled datasets by generating perturbations (i.e., deterministic modifications) on half of the reference reasoning steps and assign binary labels based on whether or not a chain has been perturbed. We select seven language understanding and entailment datasets that require complex problem solving skills and have reference step-by-step explanations: EntailmentBank (deductive reasoning) (Dalvi et al., 2021), ProofWriter (logical reasoning) (Tafjord et al., 2021); three arithmetic reasoning datasets MATH (Hendrycks et al., 2021), ASDIV (Miao et al., 2020) and AQUA (Liang et al., 2018); EQASC (explanations for commonsense question answering) (Aggarwal et al., 2021), and StrategyQA (question answering with implicit reasoning strategies) (Geva et al., 2021) (see dataset details in App. E.1). Using our taxonomy, we introduce 12 error perturbation rules and apply them to these datasets to construct our diagnostics datasets (see details in App. E.3).
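To make the construction concrete, the following is a minimal sketch of how deterministic perturbations and binary labels could be produced. It is not the procedure of App. E.3: the two rules shown (a duplicated step and a dropped step) only mirror error types named in the text, the helper names are hypothetical, and perturbing every second chain is an assumption about how the perturbed half is chosen.

```python
def repeat_step(steps, index):
    """Duplicate the step at `index` immediately after itself."""
    return steps[: index + 1] + [steps[index]] + steps[index + 1 :]

def drop_step(steps, index):
    """Remove the step at `index` (a 'missing step' error)."""
    return steps[:index] + steps[index + 1 :]

def build_diagnostic(chains, rule, index=0):
    """Perturb every second chain with `rule`; label 1 = perturbed, 0 = untouched."""
    examples = []
    for i, steps in enumerate(chains):
        if i % 2 == 0 and len(steps) > index:
            examples.append((rule(steps, index), 1))
        else:
            examples.append((list(steps), 0))
    return examples

# Toy reference chains, invented for illustration.
chains = [
    ["A is B.", "B is C.", "So A is C."],
    ["2 + 3 = 5.", "5 * 2 = 10."],
]
data = build_diagnostic(chains, repeat_step, index=1)
for steps, label in data:
    print(label, steps)
```

Because each rule is a pure function of the chain and a fixed index, rerunning the construction yields the same perturbed dataset.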

Human Judged Datasets. We select our second category of datasets from commonly used complex reasoning tasks: GSM8K (arithmetic reasoning) (Cobbe et al., 2021), DROP (discrete reasoning) (Dua et al., 2019), ESNLI (deductive and commonsense reasoning) (Camburu et al., 2018), COSMOS-QA (commonsense reasoning) (Huang et al., 2019) and SemEVAL (Ostermann et al., 2018) (commonsense reasoning). Wei et al. (2022) provide model generated chain of thought reasoning steps for GSM8K. We used chains produced by the 175b_verification model to annotate for reasoning errors. For other datasets, we prompt GPT-3 LLM (Brown et al., 2020) with few-shot in-context examples to obtain step-by-step reasoning sequences (see examples in App. E.2). We use the error types in our taxonomy in Table 2 as human evaluation perspectives of reasoning errors where we solicit five expert annotators. (We chose expert annotators over crowd-sourcing because our annotation task is cognitively challenging and requires fine-grained annotation.) The data collection interface provided judges with the source text (e.g., source and a question, or hypothesis, premise, and a question if they entail) and associated reasoning text clearly separated into individual steps. Judges were asked to rate the chain as a whole (e.g., on overall quality) as well as each individual step (e.g., commonsense errors, contradicts with the previous steps). App. Table 16 summarizes the distribution of error types annotated by the judges. See App. F for details.

ROSCOE Training. To obtain reasoning step embeddings, we finetune SimCSE (Gao et al., 2021), a supervised sentence similarity model extending the RoBERTa word embedding model (Liu et al., 2019), on the multi-step reasoning datasets we listed in $\S$ 5 (see details in Table 11; the fine-tuned model is available at https://huggingface.co/facebook/roscoe-512-roberta-base). SimCSE is a contrastive learning model that is trained on triplets of reference reasoning steps, positive and hard-negative hypothesis reasoning steps to minimize the cross-entropy objective with in-batch negatives. For contrastive learning, we use the context and reference reasoning steps as a positive sample $({\bm{s}},{\bm{r}})$, and context and perturbed reference steps $({\bm{s}},{\bm{h}})$ as hard-negative pairs. For finetuning, we embed source context and hypothesis chain as a whole, without splitting it into steps. With the finetuned model we embed each individual step, as well as a reasoning chain as a whole. We use the pretrained checkpoint of the supervised SimCSE model sup-simcse-roberta-base to initialize our model, and further train it for five epochs on our synthetic train data (details in App. G). We also compare ROSCOE scores calculated against the sup-simcse-roberta-base SimCSE model and the all-mpnet-base-v2 sentence embedding model (Reimers & Gurevych, 2019) to understand the metrics' sensitivity to the embedding method.
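The cross-entropy objective with in-batch negatives and hard negatives can be illustrated as follows. This is a sketch under assumptions rather than the training code: the embeddings are toy 2-d vectors, the temperature of 0.05 is the default reported for SimCSE and is assumed here, and `contrastive_loss` is a hypothetical name. Anchors stand for context embeddings, positives for reference chains $({\bm{s}},{\bm{r}})$, and hard negatives for perturbed chains $({\bm{s}},{\bm{h}})$.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(anchors, positives, hard_negatives, temperature=0.05):
    """Mean cross-entropy over the batch.

    For anchor i the correct class is positive i; the competing candidates are
    every other positive in the batch (in-batch negatives) and every hard negative.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        logits += [cosine(anchor, n) / temperature for n in hard_negatives]
        top = max(logits)  # subtract the maximum for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two triplets, invented for illustration.
contexts = [[1.0, 0.0], [0.0, 1.0]]
references = [[1.0, 0.1], [0.1, 1.0]]   # close to their contexts
perturbed = [[0.5, 1.0], [1.0, 0.5]]    # hard negatives

good = contrastive_loss(contexts, references, perturbed)
swapped = contrastive_loss(contexts, perturbed, references)
print(good, swapped)
```

The loss is small when each context is closest to its own reference chain and grows when the perturbed chains are placed in the positive role, which is the behavior the finetuning relies on.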

Baseline Metrics. We use text generation evaluation metrics as baseline metrics and comprehensively examine the ones outlined in §2, which are: n-gram match based metrics including ROUGE-1, ROUGE-2, and ROUGE-L (Lin, 2004); pre-trained scores including BLEURT (Sellam et al., 2020), PRISM (Thompson & Post, 2020), BERTScore (Zhang et al., 2020), BARTScore using the Faithfulness (${\bm{s}}\rightarrow{\bm{h}}$) direction for factuality and relevance, and its finetuned variant BARTScore+CNN+Para, denoted BARTScore+ (Yuan et al., 2021); and information alignment metrics of CTC, CTC-Relevancy and CTC-Consistency. We also include BARTScore-P, which we obtain by finetuning BART (Lewis et al., 2020) on the same reasoning datasets we use for finetuning our SimCSE embedding models. Most of our ROSCOE metrics are constructed reference-free. We also have metrics that use reference reasoning steps, which we examine against human judgements. We use the official code for each metric.

Meta Evaluation. We use Somers’ $D$ (Somers, 1962), which measures the ordinal association between two measured quantities, to meta-evaluate each scorer against synthetic and human scores. We use SciPy (Virtanen et al., 2020) to calculate correlations and obtain p-values from a hypothesis test where the null hypothesis is an absence of association. We prefer Somers’ $D$ over the more commonly used Kendall’s $\tau$ or Kendall’s $\tau\textit{-}b$, because it is better at handling the ties of a biased random variable (Agresti, 2010, Section 7.1.5), which imposes an upper bound on the possible values Kendall’s $\tau(\textit{-}b)$ can take. For each score $Y$ considered, our correlations are built against the biased random variable $X\in[0,1]$, represented by the perturbation or error presence indicator and evaluated using $D(Y|X)=\tau(X,Y)/\tau(X,X)$.
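The ratio $D(Y|X)=\tau(X,Y)/\tau(X,X)$ reduces to counting concordant and discordant pairs among the pairs that are not tied on $X$. The sketch below spells this out in plain Python for illustration; it is not the evaluation code, the function name `somers_d` and the toy scores are assumptions, and in practice a library routine such as `scipy.stats.somersd` also provides the p-value.

```python
def somers_d(x, y):
    """Somers' D(Y|X): (concordant - discordant) / (pairs not tied on x).

    Assumes x takes at least two distinct values; otherwise the
    denominator is zero and the statistic is undefined.
    """
    concordant = discordant = untied_x = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            if x[i] == x[j]:
                continue  # pairs tied on x count in neither numerator nor denominator
            untied_x += 1
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / untied_x

# Toy data, invented for illustration: x marks perturbed chains, y is a metric score.
perturbed = [0, 0, 1, 1]
scores = [0.9, 0.8, 0.3, 0.85]
d = somers_d(perturbed, scores)
print(d)
```

With a binary indicator most pairs are tied on $X$; dividing by the number of untied pairs is what lets the statistic still reach $\pm 1$ when a score separates the two groups perfectly.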

Experimental Results

Controlled Experiments with Diagnostics Datasets. Table 7 shows Somers’ $D$ correlation for metrics measured reference-free on six different datasets and compares baselines to ROSCOE-* aggregated categories calculated with finetuned embeddings: ROSCOE-SA, ROSCOE-SS, ROSCOE-LI, ROSCOE-LC. Results also include ROSCOE metrics with all-mpnet-base-v2 (ROSCOE-SA1, ROSCOE-SS1) and sup-simcse-roberta-base (ROSCOE-SA2, ROSCOE-SS2) sentence embedding models. Correlations for ProofWriter are taken on its depth-5 subset. We report highest correlation scores across perturbations within each dataset. The breakdown of all ROSCOE metrics is in App. Table 18.

We observe that: (1) ROSCOE can outperform all other reference-free methods on all six diagnostic datasets, and (2) the gains for ROSCOE-SS are more pronounced in four out of six diagnostics datasets, which suggests that ROSCOE can capture hallucinations and repetitions in step-wise reasoning. On ProofWriter, our scorers show lower correlations, because as shown in Table E.1, the context is a list of facts and rules and the reasoning steps can include unordered fact and rule combinations, but still a correct answer can be deduced. This makes it challenging for ROSCOE to evaluate the steps in sequence. Overall, the correlations of the baseline metrics are much lower than ROSCOE, because the baseline metrics are designed to capture the semantic or lexical overlap between a reference and hypothesis and it is harder to detect logical consistency without a golden reference text. ROSCOE is specifically focused on reference-free settings, and can gauge each individual step against the source and other generated steps. In fact, our metrics also work well against the baselines in the reference-based setting (comparing against reference reasoning steps). In App. Table 19 we present correlations when metrics are measured as reference-based. We also observe that finetuning SimCSE gives the highest improvements on the ASDIV dataset. ASDIV is a 1-step reasoning dataset (see App. Table 12), where the step is represented by an equation with one of the arithmetic perturbations added. We hypothesize that including these patterns in finetuning helped the model to better learn relationships between context and equations, and resulted in higher scores. On the EQASC dataset, Repetition-* scores are able to catch all duplicated steps in a chain, i.e., we can separate perturbed and non-perturbed chains based on the given threshold value for the Repetition-* scores, and achieve perfect correlation scores (App. Table 20).

To understand if finetuning actually helps to improve scoring, we compare non-aggregated metrics (see details in App. Table 18). We observe that finetuning indeed helps to improve ROSCOE: on average across datasets, all correlations except Repetition-* scores improve (up to $0.556$ on Informativeness-Chain), with mean Repetition-Token not changing, and mean Repetition-Step degrading by $0.005$. We speculate that since we finetune the model using reasoning chains and context as a whole, it helps to better capture step-by-step rationales, while possibly degrading on word and sentence-level semantics.

Meta-Evaluations on Human Judgement Datasets. Table 8 reports a summary of meta-evaluation of ROSCOE metrics comparing against baselines on human judged datasets. The correlations are measured based on the presence of a particular error from Table 2 and we report the highest correlation across all error types within each dataset. We observe that: (1) on all tasks, ROSCOE metrics outperform all other baselines when evaluated as reference-free; (2) overall, ROSCOE yields considerably better correlations, which indicates that step-by-step reasoning generations can be more effectively evaluated with ROSCOE. In general, most correlations with human judgements are moderate when compared to the synthetic correlation scores, indicating that step-by-step reasoning evaluation is among the cognitively hard tasks for neural models (Deutsch et al., 2022). Interpretable metrics such as ROSCOE can provide better information about a model’s reasoning skills, thus future work should improve such metrics on aligning with human judgments. In App. H.2, we show fine-grained experimental analysis per each human labeled dataset. Specific examples showcasing ROSCOE scoring abilities are summarized in Table 40.

Analysis

To evaluate how well metric values match human assessment of reasoning, we measure sensitivity to the level of errors. We perturb sentences in the MATH (arithmetic) and EntailmentBank (deductive reasoning) diagnostic datasets (similar to $\S$ 5) and inject different levels of errors into the reasoning text. Using randomly selected perturbation types, we construct up to a maximum of 3 perturbations per instance. We measure the correlation (Somers’ $D$ ) between the reasoning inconsistency level 1, 2, 3 of the reasoning steps (i.e., the number of injected errors) and the metric score. Fig. 1 illustrates the results averaged over different perturbations.

We expect the metrics to correlate better with humans when the level of errors is high. Both the semantic alignment metrics ROSCOE-SA and the semantic similarity metrics ROSCOE-SS show consistent behavior on both datasets, while baseline metrics fluctuate with low correlations. Baseline metrics perform better on EntailmentBank. On MATH, ROSCOE-LC and the baseline metrics show minimal impact, possibly because some of the perturbations applied on the MATH dataset (e.g., RandomOperation or ShuffleNumbers) are harder to detect with language model based (BARTScore) and NLI model based (ROSCOE-LC) metrics.

What does ROSCOE illuminate about scores across errors and tasks? For an ideal scorer based on ease of use, it would be possible to pick a set of fixed thresholds that had error discrimination power across datasets. However, we show that this dataset-agnostic ideal is currently not possible and that the issue is endemic across scores, including baselines. We study which metrics correlate strongly with which perturbations, with a focus on consistency across datasets. From this, we plot the interquartile ranges for strongly correlated metric and perturbation pairs. We show a sample of these in Fig. 2, though we find that the trends generally hold across metrics and perturbations (see Fig. 6). We note that within a given dataset, scores are well separated: the perturbed version of a dataset for a given score and perturbation type shows little interquartile overlap with the original version. However, this does not hold across datasets. For example, in (Score: Info-Chain, Perturbation: Repetition), if one were to set a detection threshold for the Repetition perturbation based on EntBank (around $0.95$), it would mark almost all values of EQASC as perturbed, even non-perturbed samples. This shows the challenge of using metrics for classification without calibration for drifts in both mean and variance across datasets, even if a metric generally correlates well with detecting a given error.
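The calibration issue can be reproduced with a toy example. All numbers below are invented for illustration and are not taken from the experiments; the direction of the rule (a score below the threshold flags a perturbed chain) and the names `interquartile_range` and `flag_perturbed` are likewise assumptions of this sketch.

```python
import statistics

def interquartile_range(values):
    """Return (first quartile, third quartile) of a list of scores."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q1, q3

def flag_perturbed(scores, threshold):
    """Flag a chain as perturbed when its score falls below the threshold."""
    return [s < threshold for s in scores]

# Hypothetical scores for two datasets whose score distributions are shifted.
a_clean, a_perturbed = [0.96, 0.97, 0.98, 0.99], [0.90, 0.92, 0.93, 0.94]
b_clean, b_perturbed = [0.88, 0.90, 0.91, 0.93], [0.80, 0.82, 0.84, 0.86]

threshold = 0.95  # separates clean from perturbed chains on dataset A

for name, values in [("A clean", a_clean), ("A perturbed", a_perturbed),
                     ("B clean", b_clean), ("B perturbed", b_perturbed)]:
    print(name, interquartile_range(values), flag_perturbed(values, threshold))
```

Within each toy dataset the clean and perturbed ranges do not overlap, yet the threshold chosen on dataset A flags every clean chain of dataset B, which is the kind of drift in mean that makes a fixed cross-dataset threshold unreliable.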

Conclusion

In this paper, we introduce ROSCOE, a new suite of interpretable, unsupervised metrics that enables evaluation of step-by-step reasoning generations of LMs when no golden reference generation exists. We present a taxonomy of reasoning errors used to generate and evaluate our metrics. Experimental results, from evaluating on both synthetic and human-labeled datasets exhibiting multiple types of reasoning (commonsense, arithmetic, logical inference, etc.), demonstrate superior performance compared to prior semantic and lexical similarity based baseline metrics for text generation. Our analysis shows improved capability in evaluation of reasoning exhibiting nuances, such as factual and logical errors in step-wise decisions.

Ethics Statement

Explainability builds transparency and trust for users, eases bug-fixing and shortens improvement cycles for metric designers, and will be required by law/regulations for AI systems to be applied to large-scale, high-stakes domains. In this context, we hope our work will catalyze efforts on the topic of explainable evaluation metrics for language model rationale generations. We should mention that our evaluation metrics do not monitor the explanations from integrity or bias perspectives. Our work also uses five human expert annotators, and in the annotation process, annotators need to rate the model generated candidate rationales. While the model-generated explanations can produce potentially unsafe content, the datasets for annotations include domains related to logical and arithmetic concepts and general commonsense knowledge. The anecdotal consensus was that the generations were safe and didn’t include biased statements.

Reproducibility Statement

To ensure the reproducibility of our empirical results, we will open source our code on GitHub, which will contain: instructions for installing the virtual environment, data preprocessing, all score generation and correlation scripts (both for ROSCOE and baselines), and trained embedding models. Detailed explanations of all the finetuned models and metrics are given in the main paper as well as in the Appendices. We will also release all the diagnostic and human judgment datasets used in our experiments.

References

Appendix

Appendix A Limitations

Our study is an initial step in investigating the evaluation of the step-by-step reasoning produced by large language models. Our taxonomy (in Table 2) covers several reasoning errors and we designed our metrics to evaluate a spectrum of criteria including the ones in the taxonomy. Even though we cannot say we cover all possible reasoning errors, our metrics are generic enough, work on natural language rationales, and consider the alignment with the input context and the generated explanation. Nevertheless, we believe our study can spur others to investigate different reasoning errors and use our code and datasets as templates to extend further.

Due to the extensive analysis needed to thoroughly test and communicate the ability of our proposed metrics to capture reasoning errors, we decided to leave some follow-up questions, such as the application of these metrics for improving downstream task performance, for future exploration.

Appendix B Few-shot Prompting Examples (Cont. from § 1)

Below is the 2-shot example we used to generate the explanations from GPT-3 as we show in the Fig. 1.

Appendix C Taxonomy of Reasoning Errors (Cont. from § 3)

To gain deeper insights into the types of reasoning errors introduced by LLMs while explaining their decisions, we propose a new taxonomy of generic reasoning errors for language problem solving. Specifically, we sampled from the training portions of the logical inference and commonsense reasoning datasets, and prompted GPT-3 with reasoning explanations using prompts similar to App. B. We used task specific in-domain examples for prompting. We also analyzed model generated explanations shared in Wei et al. (2022). We then manually looked into each explanation and identified potential errors that are inconsistent with the source, question or the prompt and within the reasoning chain. Some tasks require a model to classify the logical relationship between a premise and a hypothesis; others are question answering tasks. We adjusted our context and prompts according to the type of the task.

Our reasoning error taxonomy is summarized in Table 10. It contains types of errors concerning an overall chain or an individual step. Specifically, the chain-level coarse-grained evaluations of the overall reasoning chain deal with overall quality of the step-by-step thinking, coherence, consistency of the explanation within itself, and consistency with the context, etc. On the other hand, the step-level fine-grained evaluations focus on the consistency of a reasoning step with the previous steps, whether a step conveys new and supporting information over the previous steps, and factuality or logical inference issues. We use these error categories to construct diagnostics datasets with perturbed errors as well as human judged datasets of reasoning errors. In the taxonomy, we indicate *-step level errors to differentiate from the chain level error types.

Appendix D ROSCOE Metrics Details (Cont. from § 4)

ROSCOE metrics are constructed under four categories: semantic alignment, semantic similarity, logical inference, and language coherence. The details of each metric are explained in $\S$ 4. At the core of ROSCOE semantic alignment metrics is the reasoning alignment score, which we designed to measure the grounding of step-by-step reasoning with respect to the source text. Fig. 3 illustrates the reasoning alignment.

Some of the baseline scorers share similarities with ROSCOE, so we explain them here:

BARTScore (Yuan et al., 2021) claims that more high level text can be generated using a sequence-to-sequence model. It can support different evaluation perspectives such as factuality (by evaluating from source to hypothesis) or informativeness (by evaluating from both directions between reference and hypothesis). BARTScore is used to measure the probability of generated text from a source text $x$ to a target set $y$:

$\textnormal{BARTScore}=\sum_{t=1}^{m}\omega_{t}\log p(y_{t}\mid y_{<t},x,\theta)$

where $y_{t}$ is the $t$-th of the $m$ target tokens, $\omega_{t}$ is its weight, and $\theta$ denotes the parameters of the BART model, following Yuan et al. (2021).

BARTScore introduces two variations: (1) finetuning, in which the BART model is finetuned on the task specific dataset to make the pre-training domain closer to the evaluation domain; (2) prompting, in which a task specific textual prompt is appended to the source $x$ to get the $y$. In our experiments we compare the BARTScore baseline and one with the prompting variant BARTScore+.

CTC (Compression, Transduction, and Creation) (Deng et al., 2021) is a suite of metrics that unifies different perspectives of different tasks (e.g., summarization, style transfer, or text rewriting) into information alignment, which measures whether the information in one generation component is grounded in another. The information alignment is defined as follows: let $x$ (e.g., dialog context) be the source input, $c$ (e.g., external world knowledge) be some additional context, and $y$ be the generated output text (e.g., generated response). The alignment from a text $a$ of $N$ tokens to a text $b$ is measured on the token level as the vector of scores:

$\textnormal{align}(a\rightarrow b)=\langle\alpha_{1},\alpha_{2},\cdots,\alpha_{N}\rangle$

where each score $\alpha_{i}$ indicates confidence that the $i$-th token in $a$ aligns with the whole sentence $b$. Using the information alignment they define a list of metrics to evaluate text for different tasks. In our experiments we use two of these metrics that are closer to ROSCOE: the Relevance (CTC Relevance), which measures the consistency of the generated text with the source and its balance with the reference, and the Consistency (CTC Consistency), which deals with the faithfulness of the generated text to the input context through the alignment between the two.

Appendix E Experimental Setup Details (Cont. from § 5)

In the following, we present details of each diagnostics dataset used in our work. Table 11 illustrates how each dataset is used in our experiments. The StrategyQA dataset is only used to finetune the SimCSE embeddings model, because it contains reference reasoning chains in the train and validation partitions, but not in the test partition. The remaining six diagnostic datasets are used for finetuning the sentence embedding model and for evaluating our models as presented in the experimental results. All datasets with examples are summarised in Table 12.

EntailmentBank (EntBank) (Dalvi et al., 2021) is a complex question answering dataset which contains multi-step entailment trees, namely a tree of multi-premise entailment steps from facts that are known, through intermediate conclusions, to the hypothesis of interest (which in this case is the question and answer).

ProofWriter (Tafjord et al., 2021) is a question answering dataset for logical reasoning. It contains 500k questions, answers, and proofs over natural-language rulebases. This dataset is mostly used to emulate reasoning over rules expressed in language, including proof generation. The dataset's proofs include intermediate conclusions. In our experiments, we used the depth-0, depth-1, depth-2, depth-3, and depth-5 OWA sets.

MATH (Hendrycks et al., 2021) is a dataset of 12,500 problems from high school math competitions. Given a math problem such as the one in Table 12, models generate a sequence, such as $\frac{2}{3}$, that encodes the final answer.

ASDIV (Miao et al., 2020) (Academia Sinica Diverse MWP Dataset) is a dataset of 2,305 questions on diverse math word problem solving. It includes diverse operations such as basic arithmetic or aggregative operations (e.g., comparisons, set-operations).

AQUA (Liang et al., 2018) is a dataset of 100,000 algebraic word problems with step-wise solutions as shown below. In the original dataset, each question is decomposed into four parts, two inputs and two outputs: the description of the problem and a question, and the possible (multiple choice) answer options, one being the correct one. In this work we only used the context and question, the step-wise solution, and the correct answer to construct our diagnostic dataset.

EQASC (Aggarwal et al., 2021) is a multi-hop question answering dataset with 98K explanation annotations for multi-step factual reasoning. Each instance in the dataset comes with a question, multiple answer choices, an explanation of each answer choice, and a free-flow explanation of the whole context. In our experiments we used the correct answer’s explanation to construct our diagnostic datasets.

StrategyQA (Geva et al., 2021) is another multi-step question answering (QA) dataset that covers a diverse set of reasoning skills. StrategyQA consists of 2,780 questions, annotated with their decomposition and per-step evidence.

E.2 Human Judged Dataset Construction

In the following we present details of each human judged dataset used in our work. Table 11 lists each dataset and illustrates how it is used in our experiments. Specifically, all six datasets are used for evaluations in the experimental results and for model finetuning, and one dataset was used for finetuning only. The dataset details are explained below.

To construct these datasets, we first sample instances from each dataset (see the number of instances sampled in Table 11). We use GPT-3 with few-shot in-context examples and a prompt (e.g., "explain step-by-step") to generate step-by-step reasoning for each sampled instance (see in-context examples and prompts in App. B). Then, using our taxonomy, we constructed a list of evaluation perspectives to label the model generated step-by-step reasoning for each of these datasets. We explain the details of the perspectives used to label human judged datasets in $\S$ 5 and App. F. All datasets with examples are summarised in Table 13. In the following we present details of each human judged dataset.

DROP (Dua et al., 2019), Discrete Reasoning Over the content of Paragraphs, is a dataset of 96K instances with context and a question. To solve the tasks, a system must resolve references in the context that match the question, and perform discrete operations over them (such as addition, counting, or sorting). These operations require comprehensive understanding of the content of the input context.

GSM8K (Cobbe et al., 2021) is a dataset of 8.5K linguistically diverse grade school math word problems. On this dataset, even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution.

CosmosQA (Huang et al., 2019) is a dataset of 35K problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. The questions focus on reading between the lines over a diverse collection of people’s everyday narratives, asking such questions as "what might be the possible reason of …?", or "what would have happened if …?". The dataset does not introduce step-by-step reasoning output, and contains multiple choice answers.

ESNLI (Camburu et al., 2018) is the extended version of the Stanford Natural Language Inference corpus (Bowman et al., 2015) of 570K labeled sentence pairs with entailment or contradiction labels. ESNLI includes human labeled explanations of the entailment decision.

SemEVAL (Ostermann et al., 2018) is a dataset on machine comprehension using commonsense knowledge. It contains questions that require commonsense knowledge for finding the correct answer.

E.3 Synthetic Diagnostics Dataset Generation with Perturbation Rules

To construct the diagnostics datasets we apply synthetic perturbations on half of the chains from six datasets (for details see App. E.1 and the summary in Table 11). In Table 14 we illustrate these synthetic perturbations applied on reasoning steps $\{r_{i}\}$ of gold reference chains of all the datasets. There, $\bm{g^{*}}$ indicates a grammar error, which includes changing the verb tense, dropping a verb, or a random word swap. $\bm{s^{*}}$ represents a change in the semantics of one step in the chain, introduced by replacing named entities. To simulate extrinsic hallucinations, we use random steps from other chains within the same dataset.
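The hallucination perturbation and the half-and-half split described above can be sketched as follows. This is only an illustrative sketch, not the code used in the paper: the function names `add_hallucination` and `build_diagnostic_split` are hypothetical, chains are assumed to be lists of step strings, and the insertion position is assumed to be chosen uniformly at random.

```python
import random


def add_hallucination(chain, other_chains, rng):
    """Insert a random step taken from another chain at a random position (assumed)."""
    donor = rng.choice(other_chains)
    foreign_step = rng.choice(donor)
    position = rng.randrange(len(chain) + 1)
    return chain[:position] + [foreign_step] + chain[position:]


def build_diagnostic_split(chains, rng):
    """Perturb a random half of the chains.

    Returns (reference, hypothesis, is_perturbed) triples, one per chain.
    """
    perturbed_ids = set(rng.sample(range(len(chains)), len(chains) // 2))
    examples = []
    for i, chain in enumerate(chains):
        if i in perturbed_ids:
            others = chains[:i] + chains[i + 1:]
            hypothesis = add_hallucination(chain, others, rng)
        else:
            # Unperturbed hypotheses are copies of the reference chain.
            hypothesis = list(chain)
        examples.append((chain, hypothesis, i in perturbed_ids))
    return examples


# Toy chains standing in for gold reference chains of one dataset.
toy_chains = [["a1", "a2"], ["b1", "b2", "b3"], ["c1"], ["d1", "d2"]]
toy_examples = build_diagnostic_split(toy_chains, random.Random(0))
```

Each perturbed hypothesis is one step longer than its reference, while the other half of the hypotheses is identical to the references.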

To construct diagnostic data from math datasets, we introduce four additional perturbations to simulate step-wise explanation errors that might arise in arithmetic reasoning tasks (Arithmetic error), in general knowledge about relationships and equation construction (Common sense error), and in information about object/subject characteristics (Factuality or Hallucination):

Shuffle numbers: randomly shuffles all numbers in the chain,

Shuffle operations: randomly shuffles all math operations in the chain,

Random number: randomly replaces one number in the chain,

Random operation: randomly replaces one math operation in the chain.
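The four math perturbations listed above can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: a chain is a single string, numbers are unsigned integers or decimals, operations are the symbols `+ - * /` written with whitespace on both sides, and replacement numbers are drawn from 0 to 100. All function names are hypothetical.

```python
import random
import re

# Assumption: numbers are unsigned integers or decimals.
NUMBER = re.compile(r"\d+(?:\.\d+)?")
# Assumption: operations are + - * / surrounded by whitespace.
OPERATION = re.compile(r"(?<=\s)[+\-*/](?=\s)")
OPERATION_SYMBOLS = ["+", "-", "*", "/"]


def _shuffle_matches(chain, pattern, rng):
    """Shuffle all matches of `pattern` among their positions in the chain."""
    found = pattern.findall(chain)
    rng.shuffle(found)
    replacements = iter(found)
    return pattern.sub(lambda match: next(replacements), chain)


def _replace_one_match(chain, pattern, make_new, rng):
    """Replace one randomly chosen match of `pattern` with a different value."""
    matches = list(pattern.finditer(chain))
    if not matches:
        return chain
    target = rng.choice(matches)
    new_text = make_new(target.group(), rng)
    return chain[:target.start()] + new_text + chain[target.end():]


def _different_number(old, rng):
    new = old
    while new == old:
        new = str(rng.randint(0, 100))
    return new


def _different_operation(old, rng):
    return rng.choice([op for op in OPERATION_SYMBOLS if op != old])


def shuffle_numbers(chain, rng):
    return _shuffle_matches(chain, NUMBER, rng)


def shuffle_operations(chain, rng):
    return _shuffle_matches(chain, OPERATION, rng)


def random_number(chain, rng):
    return _replace_one_match(chain, NUMBER, _different_number, rng)


def random_operation(chain, rng):
    return _replace_one_match(chain, OPERATION, _different_operation, rng)


toy_chain = "3 + 4 = 7 , 7 * 2 = 14"
rng = random.Random(0)
print(shuffle_numbers(toy_chain, rng))
print(random_operation(toy_chain, rng))
```

Note that a shuffle can occasionally return the original order, so a pipeline built on this sketch may want to re-draw until the chain actually changes.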

Appendix F Human Annotations (Cont. from § 5)

To construct Human Judged Datasets, we perform human annotations on five datasets which we summarize in Table 11 (Type=’Human judged’). These datasets do not include explanations (except GSM8K and ESNLI), so we construct model generated reasoning steps and label them with reasoning errors. We explain our generation process in $\S$ 5 and App. E.2. We used five expert human annotators to collect reasoning error labels on five datasets. We asked human evaluators to directly rate the generated reasoning at the overall chain level using a Likert scale from 1 to 5. In step-level evaluations, we also asked them to mark whether each error type proposed in our error taxonomy ( $\S$ 3) appeared in each step. In Fig. 4 and Fig. 5 we illustrate the UI used to collect the data. Table 15 summarizes the questions that experts were asked. Table 16 reports the distribution of errors for each dataset. In general, we found that it was hard to get anonymous crowd workers to annotate our data accurately even when we paid an average of upwards of $30 an hour, hence we relied on expert annotators. For the annotation sessions reported in the text of the paper, we find that it takes an average of 754 seconds for expert annotators to complete a session of at most 5 examples, or slightly over two and a half minutes per example. This highlights the difficulty of obtaining high-quality annotations on these cognitively challenging tasks.

Appendix G Sentence Embedding Model Training (Cont. from § 6)

Model training. We use the train portions of the perturbed diagnostics datasets to finetune the SimCSE embeddings model (explained in § 5) and the validation portions to select the best embedding model. The test portions are used to evaluate our metrics against baseline metrics. We randomly select 500,000 samples with replacement from each dataset to create a uniform representation and reduce bias.
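Sampling with replacement lets every dataset contribute the same number of training samples regardless of its original size. A minimal sketch of the idea, using Python's standard `random.choices`; the function names and the toy data are hypothetical, and the paper's actual sampling code may differ:

```python
import random


def resample(dataset, sample_size, rng):
    """Draw `sample_size` items uniformly at random, with replacement."""
    return rng.choices(dataset, k=sample_size)


def balanced_training_pool(datasets, sample_size, rng):
    """Give every dataset the same number of samples, whatever its size."""
    return {
        name: resample(examples, sample_size, rng)
        for name, examples in datasets.items()
    }


# Toy stand-ins for a small and a large dataset; the paper draws 500,000 per dataset.
toy_datasets = {
    "small": ["s%d" % i for i in range(3)],
    "large": ["l%d" % i for i in range(1000)],
}
pool = balanced_training_pool(toy_datasets, sample_size=10, rng=random.Random(0))
```

Because the draw is with replacement, datasets smaller than the sample size are oversampled and their examples repeat.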

The hyperparameters used to finetune the SimCSE model are described in Table 17. We use NVIDIA Tesla V100 Volta GPU instances with 32GB graphics cards. We perform a hyperparameter search, varying batch size in $\{32,64,256,512,1024,2048\}$, learning rate in $\{5e\textnormal{-}06,1e\textnormal{-}05,5e\textnormal{-}05,1e\textnormal{-}04\}$, and max sequence length in $\{64,128,512\}$. Not all combinations of batch size and max sequence length were explored due to memory limitations.
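For reference, the full search space implied by these three ranges can be enumerated as below. The sketch only lists the candidate grid; which combinations were skipped for memory reasons is not stated in the text, so no filter is applied here. The dictionary keys are illustrative names, not SimCSE arguments.

```python
from itertools import product

batch_sizes = [32, 64, 256, 512, 1024, 2048]
learning_rates = [5e-06, 1e-05, 5e-05, 1e-04]
max_seq_lengths = [64, 128, 512]

# Cartesian product of the three ranges: 6 * 4 * 3 candidate configurations.
grid = [
    {"batch_size": b, "learning_rate": lr, "max_seq_length": m}
    for b, lr, m in product(batch_sizes, learning_rates, max_seq_lengths)
]
print(len(grid))  # 72
```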

Validation. We replace the original validation procedure on semantic textual similarity tasks with similarity-based validation on perturbed reasoning chains. In particular, during training, we select the best checkpoint as the one that maximizes cosine similarity between positive pairs and minimizes cosine similarity between hard-negative pairs within the batch of size $B$, as follows:

The model is evaluated every 100 steps on the development dataset and the best checkpoint is used at inference. Other parameters not described in this section are kept as in the original SimCSE model used for initialization.
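As an illustration of similarity-based checkpoint selection, the sketch below scores a validation batch by the mean cosine similarity of positive pairs minus the mean cosine similarity of hard-negative pairs. This particular form is an assumption made for illustration and is not the exact criterion used in the paper; the function names and the toy scores are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def validation_score(batch):
    """Assumed criterion: mean cos(anchor, positive) - mean cos(anchor, hard negative).

    `batch` is a list of (anchor, positive, hard_negative) embedding triples.
    """
    positive = sum(cosine(a, p) for a, p, _ in batch) / len(batch)
    negative = sum(cosine(a, n) for a, _, n in batch) / len(batch)
    return positive - negative


def select_best_checkpoint(score_by_step):
    """Return the training step whose validation score is highest."""
    return max(score_by_step, key=score_by_step.get)


# One toy triple: the positive matches the anchor, the hard negative is orthogonal.
toy_batch = [([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])]
print(validation_score(toy_batch))  # 1.0

# Hypothetical scores recorded every 100 steps.
print(select_best_checkpoint({100: 0.2, 200: 0.5, 300: 0.4}))  # 200
```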

Inference. We compare ROSCOE scores calculated with three embedding models: the finetuned SimCSE model, the sup-simcse-roberta-base SimCSE model, and the all-mpnet-base-v2 sentence embedding model (Reimers & Gurevych, 2019). During inference, we set the random seed to 42. Without this, the embedding-based scores naturally varied by about 0.01.

Appendix H Additional Experimental Results (Cont. from § 6)

In this section, we present Somers’ $D$ correlation of all metrics on all Diagnostics datasets. Table 18 summarizes the evaluations when investigated reference-free. One characteristic of our ROSCOE metrics is that they can provide judgement of the model generated reasoning steps with and without the human reference reasoning chains. In the experiments section in § 6, we discussed the results of our unsupervised scores in comparison to baseline scores when measured reference-free. In Table 19, we summarize the correlation analysis on ROSCOE metrics in comparison to baselines on diagnostic datasets when a reference is present for evaluation. Specifically, each score is measured between the human provided reasoning steps (reference) and the model generated reasoning steps (hypothesis). We also display fine-grained meta-evaluations of all metrics on each diagnostics dataset in separate tables: Tables 20 and 26 for EQASC, Tables 21 and 27 for EntailmentBank, Tables 22 and 28 for MATH, Tables 23 and 29 for ProofWriter, Tables 24 and 30 for ASDIV, and Tables 25 and 31 for AQUA.

To understand whether the designed reference-free scores capture the targeted error types, we analyze perturbation-level correlations summarized in Fig. 6. Out of all the considered scores, Info-Chain is able to cover 10 out of 12 errors, the exceptions being the Remove Step and Semantic error perturbations. In general, we note that ROSCOE fails to consistently identify the missing step error type, represented by the Remove Step perturbation, across different datasets, while other synthesized error types are covered by at least one score type.

Reference-based scores cover all synthetic errors, with Semantic Coverage-Chain showing strong correlations with all types of perturbations (Table 19). We also note that, along with ROSCOE scores, the highest correlations among all reference-based scores belong to ROUGE and BERT scores (Tables 26-31). ROUGE scores consistently outperform on Repetition, Hallucination, Remove Step, Shuffle Steps, Swap Steps, Negate Step, and Semantic perturbations, while underperforming on Random operation and Shuffle operations. We attribute this to the fact that ROUGE is an n-gram based score, so it is better at catching errors where the wording has significantly changed, while failing to catch small changes within steps.

It is worth noting that some scores, especially those among reference-based evaluations, get the highest possible Somers’ D correlation score of $1.0$. This means that in some scenarios, there is a perfect correlation between the metric and the error type. In other words, for this metric we can find a threshold such that generated chains with scores greater than the threshold do not have errors of the given type, and all generated chains with scores less than the threshold have that error. It is especially evident on reference-based metrics that directly compare the reference solution and the hypothesis. In this scenario, we build the correlation for two groups: 1) non-perturbed hypothesis: the score is calculated by comparing embedding similarities of the reference with itself, and we expect to get high scores; 2) perturbed hypothesis: the score compares the reference with its perturbed version, where the scores should be lower. In some cases, we are able to perfectly separate perturbed and non-perturbed chains based on the corresponding metric values by selecting a threshold; in other cases we cannot, due to a number of false-negatives (i.e., a chain gets a high score, although the error is present).

As an example, consider the Semantic Coverage-Chain metric calculated on the EQASC dataset using all-mpnet-base-v2 sentence embeddings, and the Hallucination perturbation (Table 26). Here the Somers’ D correlation score is $1.0$. Semantic Coverage-Chain is calculated as a normalized cosine similarity between the chain embedding of the reference solution ${\bm{r}}$ and the chain embedding of the hypothesis ${\bm{h}}$: $[1+\cos({\bm{r}},{\bm{h}})]/2$. Recall that in our setup, half of the hypothesis chains are perturbed reference chains, and the other half is the same as the reference. Since the Hallucination perturbation is an insertion of a random step from the dataset, it is hard to predict how it will affect the embedding of the chain as a whole, but on the unperturbed chains, where ${\bm{h}}={\bm{r}}$, the Semantic Coverage-Chain should be: $[1+\cos({\bm{r}},{\bm{r}})]/2=1.0$. Further review confirmed that in this dataset there are no false-negative instances, i.e., all chains with perturbations had a Semantic Coverage-Chain score less than $1.0$. That means we can always identify whether the chain contains a Hallucination error by comparing the Semantic Coverage-Chain value with $1.0$ (the threshold value), which is reflected in the perfect Somers’ D score.
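The threshold argument above can be checked on toy data. The sketch below computes $[1+\cos({\bm{r}},{\bm{h}})]/2$ on hand-picked two-dimensional vectors (stand-ins for chain embeddings, not real model outputs) and a simple pairwise Somers' D of the score with respect to a binary label. Treating the binary label as the independent variable is an assumption about the direction of the statistic, and the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_coverage_chain(r, h):
    """Normalized cosine similarity [1 + cos(r, h)] / 2 between chain embeddings."""
    return (1.0 + cosine(r, h)) / 2.0


def somers_d(labels, scores):
    """(concordant - discordant) / number of pairs that differ in label."""
    concordant = discordant = untied = 0
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if labels[i] == labels[j]:
                continue
            untied += 1
            direction = (labels[i] - labels[j]) * (scores[i] - scores[j])
            if direction > 0:
                concordant += 1
            elif direction < 0:
                discordant += 1
    return (concordant - discordant) / untied


references = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
# Toy "perturbed" embeddings that drift away from each reference.
perturbed = [[0.8, 0.6], [0.0, 1.0], [1.0, 0.0]]

# Label 1: hypothesis equals the reference; label 0: hypothesis is perturbed.
labels = [1, 1, 1, 0, 0, 0]
scores = [semantic_coverage_chain(r, r) for r in references]
scores += [semantic_coverage_chain(r, h) for r, h in zip(references, perturbed)]

print(somers_d(labels, scores))  # 1.0
```

Every unperturbed chain scores (up to floating-point error) $1.0$ and every perturbed chain scores strictly lower, so all cross-label pairs are concordant. A single perturbed chain whose embedding stayed parallel to its reference would tie with the unperturbed group and pull the statistic below $1.0$.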

The highest correlations among reference-free scores belong to the Repetition-* scores, which exhibit perfect correlation on the EQASC dataset (Tables 20-25). For other datasets, non-perfect correlations can be attributed to the small number of false-negatives, i.e., they give low Repetition-* scores for chains with non-duplicated but similar steps, while all chains with duplicates got scores of almost $0$ (Fig. 7). In EQASC, explanations are created from a set of facts that are not directly related to each other, but are intended to give an answer when combined together. Among all datasets considered, these steps are the most dissimilar, and thus can be separated with similarity-based scores.

H.2 Experiments with Human Judgement Datasets

In this section, we present Somers’ $D$ correlation of all metrics on all Human Judged datasets in separate tables. Specifically, Table 32 summarizes meta-evaluations for ROSCOE metrics in comparison to baselines on all human judged datasets. Fine-grained evaluations are presented in Table 33 for DROP, Tables 34 and 38 for GSM8K, Tables 35 and 39 for ESNLI, Table 36 for CosmosQA, and Table 37 for SemEVAL. Human evaluation perspectives used in evaluations are described in App. Table 15.

Looking at how errors are captured by ROSCOE reference-free scores (Fig. 8), we observe the strongest correlations between the Redundancy error and the Repetition-* and Self-Consistency scores. The Repetition error is not present in this analysis as it has at most 3 occurrences per dataset. Out of all the considered scores, Self-Consistency is able to cover 6 out of 7 evaluation perspectives, the exception being Missing Step.

We further look at specific human annotated examples where ROSCOE gives the highest and lowest scores to understand the strengths and weaknesses of the proposed approach. Results are summarized in Table 40. A similar analysis for diagnostic datasets is summarized in Table 41.