"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models

Xinpeng Wang, Bolei Ma, Chengzhi Hu, Leon Weber-Genzel, Paul Röttger, Frauke Kreuter, Dirk Hovy, Barbara Plank

Introduction

Multiple Choice Questions (MCQ) are one of the most popular evaluation formats for understanding the capabilities of Large Language Models (LLMs), such as commonsense reasoning Bisk et al. (2020); Sap et al. (2019); Sakaguchi et al. (2021); Zellers et al. (2019); Clark et al. (2018); Talmor et al. (2019) and truthfulness Lin et al. (2022). They are also an important part of aggregated evaluation benchmarks such as MMLU Hendrycks et al. (2021), BIG-bench BIG-bench authors (2023) and HELM Liang et al. (2022), where MCQ is the most common setting. Recently, this format was also adopted to evaluate moral beliefs Scherrer et al. (2023) or opinions on public issues Santurkar et al. (2023); Durmus et al. (2023) encoded in LLMs.

The most common way to evaluate MCQ accuracy is to look at the model’s first token prediction Santurkar et al. (2023); Hendrycks et al. (2021); Durmus et al. (2023); Dominguez-Olmedo et al. (2023); Tjuatja et al. (2023); Liang et al. (2022). However, many state-of-the-art LLMs have been tuned to follow instructions to better align with the user’s intent Ouyang et al. (2022), which leads to diverse and more natural response styles from the models. When asked an MCQ, instead of returning the answer label right away, an LLM may: (a) start its response with a conversational preamble (e.g., “Sure”) or (b) refuse to answer if the question touches on a sensitive topic. Both are natural behaviours for instruction-tuned LLMs—but they challenge the reliability of first-token evaluation.

In this work, we study how reliable first-token probabilities are for evaluating MCQ accuracy, by comparing them to the answers in generated text output. We show that the first-token evaluation is not faithful to text output: it often does not match the text output’s answer (e.g., over $60\%$ mismatch for Llama2-7b-Chat). We also measure the refusal rate, sensitivity to the prompt formulation and the impact of decoding temperature across six instruction-tuned models to better understand the characteristics of the two evaluation methods. Our findings suggest that it is imperative to go beyond the first-token evaluation setting and inspect the text output, too, to better evaluate LLMs in realistic scenarios.

Related Work

Fourrier et al. (2023) reviewed the token-probability-based MCQ evaluation methods implemented by multi-task LLM evaluation benchmarks Hendrycks et al. (2021); Liang et al. (2022); Gao et al. (2023), showing that model performance varies depending on implementation details. Nonetheless, little is known about how reliable this design is compared to the text output. Scherrer et al. (2023) directly looked at the text output by applying a rule-based mapping from the text to the options. However, no comparison to token-probability-based methods was shown. Hu and Levy (2023) suggested not replacing probability measurement with prompting when the task is not “challenging to translate into direct probability measurement”. For tasks that are challenging in this sense (survey questions), our work shows the issue of combining probability measurement (first-token evaluation) with prompting (MCQ format).

Selection Bias

Several works Dominguez-Olmedo et al. (2023); Zheng et al. (2023); Tjuatja et al. (2023) have shown that LLMs are biased when asked MCQs, such as preferring the option ‘A’ (A-bias) and being influenced by the option order. However, they only focused on the first token of the model’s response. It remains unclear how much bias is present under text output evaluation. Our work aims to address this gap.

Experiments

We use OpinionQA Santurkar et al. (2023), which was curated by formatting the survey questions from Pew Research Center (https://www.pewresearch.org/) into a prompt format. Given that numerous questions in the OpinionQA dataset do not pertain to public opinion but rather to personal information, we have curated a subset of 414 questions specifically focused on soliciting views about public issues. We chose the survey setting since it contains sensitive and controversial questions which the models may opt not to answer. The refusal answer turns out to be one of the major reasons for the mismatch between the first token and text output evaluation (see Section 4.2).

Prompt Format

Each question consists of a General Instruction, a Question, and a set of Answer Options, as shown in Figure 1. To investigate the impact of the general instruction on the instruction following ability of the model, we design general instructions of different constraint levels, as shown in Table 1. The Low Constraint and Example Template instructions directly inherit from the two instruction templates used in Santurkar et al. (2023). To evaluate the model’s response consistency and mitigate selection bias, each question is presented ten times with the answer options shuffled in a different order for each iteration.
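The repeated presentation with shuffled options can be sketched as follows. This is a minimal illustration, not the prompt code used in the paper: the function name `build_shuffled_prompts`, the exact prompt layout, and the fixed random seed are assumptions, and the sketch does not enforce that the ten orders are pairwise distinct.

```python
import random
import string


def build_shuffled_prompts(instruction, question, options, n_runs=10, seed=0):
    """Build n_runs prompts for one question, each with a freshly shuffled option order.

    Returns a list of (prompt, label_to_option) pairs. The mapping is needed
    later to translate a chosen letter back to the option text, because the
    same letter refers to different options in different runs.
    """
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    prompts = []
    for _ in range(n_runs):
        order = list(options)
        rng.shuffle(order)
        labels = string.ascii_uppercase[: len(order)]
        option_lines = [f"{label}. {text}" for label, text in zip(labels, order)]
        # Hypothetical layout: instruction, then question, then lettered options.
        prompt = (
            f"{instruction}\n\nQuestion: {question}\n"
            + "\n".join(option_lines)
            + "\nAnswer:"
        )
        prompts.append((prompt, dict(zip(labels, order))))
    return prompts


runs = build_shuffled_prompts(
    "Please answer the following question.",
    "How much do you trust the news media?",
    ["A lot", "Some", "Not much", "Not at all", "Refused"],
)
print(runs[0][0])
```

Keeping the letter-to-option mapping per run makes it possible to compare answers across runs by option content rather than by letter.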

Models

We evaluated six instruction-tuned LLMs: Llama2-Chat-7b, 13b, 70b Touvron et al. (2023), Mistral-Instruct-v0.1, 0.2 Jiang et al. (2023) and Mixtral-8x7b-Instruct-v0.1 Jiang et al. (2024). The suffix "instruct/chat" is omitted in the results for simplicity. We use greedy decoding.

First-Token Evaluation

Evaluating the first-token log probability is commonly used in the MCQ setting. Following previous studies Hendrycks et al. (2021); Santurkar et al. (2023), this method involves calculating the log probabilities for specific answer options (e.g. ‘A’, ‘B’, ‘C’). The option assigned the highest log probability is then selected as the model’s answer. Contrary to the approach taken by Santurkar et al. (2023), which excludes ‘Refused’ as a potential answer, our method also considers the log probability assigned to the refusal option. This inclusion provides a more holistic view of the model’s response spectrum.
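The selection step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `first_token_answer` and the input dictionary of first-token log probabilities are hypothetical, and how those log probabilities are obtained depends on the model interface.

```python
import math


def first_token_answer(token_logprobs, option_labels):
    """Return the option label with the highest first-token log probability.

    token_logprobs: dict mapping candidate first tokens to log probabilities
                    (hypothetical input, e.g. read off the model's next-token
                    distribution at the start of the response).
    option_labels:  labels to compare, including the label assigned to the
                    refusal option.
    """
    # Labels that are absent from the dict are treated as probability zero.
    scores = {label: token_logprobs.get(label, -math.inf) for label in option_labels}
    return max(scores, key=scores.get)


# Toy numbers: the most likely first token overall is "Sure", but first-token
# evaluation only compares the option labels against each other.
logprobs = {"Sure": -0.2, "A": -3.1, "B": -2.4, "C": -2.9, "D": -4.0}
print(first_token_answer(logprobs, ["A", "B", "C", "D"]))
```

The toy input illustrates why this method can diverge from the text output: a label is always selected, even when the model would in fact begin its response with a different token.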

Text Output Evaluation

We use a classifier to categorize the text output into one of the answer options. It is constructed by fine-tuning Mistral-7b-Instruct-v0.2 on annotated responses from the models we evaluated in Section 3. We manually annotated 2070 response samples generated by five of the evaluated models (414 samples per model). Responses from Mistral-7b-Instruct-v0.1 were not annotated since the answers follow the format well and can be easily mapped to the options. Table 6 shows examples of responses from different models with their annotated labels. We split the data from each model into training and test sets by an 80/20 ratio. We do not use a dev set since we evaluate our classifiers in one go and apply them directly for classification. We compared our trained classifier to other methods via classification accuracy, macro-F1 and weighted-F1 score averaged over the five test sets, as shown in Table 2. Our parameter-efficient fine-tuned (PEFT) Mangrulkar et al. (2022) classifier achieved 99% accuracy. The annotation details, the annotated dataset statistics (label distribution), and the classifier training are described in Appendix A.2, A.3, and A.4.
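For readers unfamiliar with the three reported metrics, the following sketch computes accuracy, macro-F1 and weighted-F1 from gold and predicted labels. It uses the standard definitions and is not the evaluation code from the paper; the function name `classification_scores` and the toy labels are assumptions.

```python
from collections import Counter


def classification_scores(gold, pred):
    """Return (accuracy, macro_f1, weighted_f1) for two equal-length label lists."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    labels = sorted(set(gold) | set(pred))
    support = Counter(gold)  # number of gold instances per label
    f1 = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive because
        # every label occurs at least once in gold or pred.
        f1[label] = 2 * tp / (2 * tp + fp + fn)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    macro_f1 = sum(f1.values()) / len(labels)  # unweighted mean over labels
    weighted_f1 = sum(f1[l] * support[l] for l in labels) / len(gold)
    return accuracy, macro_f1, weighted_f1


gold = ["A", "A", "B", "Refused"]
pred = ["A", "B", "B", "Refused"]
print(classification_scores(gold, pred))
```

Macro-F1 weights every label equally, so it is sensitive to rare labels such as refusals, whereas weighted-F1 follows the label distribution of the test set.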

Results

To assess the alignment between the first token and text output evaluation, we measure the proportion of cases where the answer chosen by the first-token evaluation differs from the choice made in the text output, as shown in Figure 2(a).

In general, Llama2 models show a higher mismatch rate than Mistral models. As model size increases from 7B to 70B, the mismatch rate of the Llama2 model decreases from $66.2\%$ to $13.3\%$. The mismatch rate decreases as we increase the constraint level from Low to High for all models except Mistral-7b-Instruct-v0.2. To identify the source of the mismatch, we also plot the portion of mismatch due to refusal, shown in light color (and further described in Section 4.2). Refusal is an important factor for mismatch. However, there is still a considerable amount of mismatch due to non-safety reasons.
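The mismatch rate and its refusal-related portion can be computed as in the sketch below. This is an illustration rather than the paper's code: the function name `mismatch_stats`, the `"Refused"` label string, and the choice to count a mismatch as refusal-related when the text output is classified as a refusal are assumptions.

```python
def mismatch_stats(first_token_answers, text_answers, refusal_label="Refused"):
    """Compare per-item answers from the two evaluation methods.

    Both lists hold one answer per prompt, expressed in the same label space
    (e.g. option text, or letters for the same option order).
    Returns the overall mismatch rate and the part of it where the text
    output was classified as a refusal.
    """
    if len(first_token_answers) != len(text_answers) or not text_answers:
        raise ValueError("inputs must be non-empty and of equal length")
    total = len(text_answers)
    mismatches = 0
    refusal_mismatches = 0
    for first_token, text in zip(first_token_answers, text_answers):
        if first_token != text:
            mismatches += 1
            if text == refusal_label:  # assumption: refusal seen in the text output
                refusal_mismatches += 1
    return {
        "mismatch_rate": mismatches / total,
        "refusal_mismatch_rate": refusal_mismatches / total,
    }


stats = mismatch_stats(["A", "B", "C", "A"], ["A", "Refused", "B", "A"])
print(stats)
```

In the toy input, two of four items disagree, and one of those two disagreements comes from a refusal in the text output.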

Surprisingly, the Example Template leads to a higher mismatch rate than the High Constraint instruction in five models out of six, especially for Mistral-7b-Instruct-v0.1 and Llama2-70b-Chat, which show good instruction-following ability and a low mismatch rate under other general instructions. This is probably because the model follows the literal pattern in the example, where the answer is given as ‘C’. To test this hypothesis, we count the choice distribution from the Llama2-70b-Chat model under the Example Template instruction. In Figure 3(a), the first-token evaluation selects ‘C’ about $85\%$ of the time (compared to $32.1\%$ with High Constraint, see Figure 7), whereas the classified text output is more evenly distributed. This shows that the first-token log probability gets shifted to the token ‘C’ substantially, influenced by the given example. This also explains why refusal only contributes a little to the high mismatch rate for Llama2-70b.

To test the impact of the answer choice given in the example, we replace the ‘C’ in the answer with “A/B/C” and show the choice distribution in Figure 3(b). Compared to Figure 3(a), the distribution shifted from ‘C’ to ‘A’ and ‘B’ for both first-token evaluation and the classified text output. This shows the substantial impact the example template has on the model’s response. It also suggests that the few-shot templates used in objective tasks are not suitable for subjective tasks since there are no “correct” examples. It is generally not a good instruction format for evaluating the model on public opinion questions.

4.2 Refusal Rate

There are two refusal behaviours we observed from the model. The first occurs when the model explicitly selects the “Refused” option from among the available answer choices. The second type of refusal occurs when the model opts not to provide an answer to a question deemed sensitive. We combine both cases into a single refusal category. Contrary to the observation from Santurkar et al. (2023), who reported a low rate of refusal across various models, we find a pronounced tendency for models to refuse responses due to safety concerns. The trend is most evident in open-source models that have been trained not to express opinions on sensitive issues.

Figure 2(b) shows the refusal rate of the models evaluated under instructions of different constraint levels. In general, Llama2 models show a higher refusal rate than Mistral models. Llama2-7b-Chat has the highest refusal rate with $51.4\%$. Therefore, it is crucial to consider the model’s refusal behaviour when evaluating its response to questions related to sensitive topics, as this plays an important part in the model’s response. As model size increases from 7B to 70B, the refusal rate of the Llama2 model decreases, starting at over $50\%$ and decreasing to less than $10\%$. For the Mistral-7b-Instruct model, v0.1 exhibits a lower rate of refusal responses compared to v0.2. This is likely attributable to stronger safety guardrails in the newer version.

In addition to model size, the instruction prompt also has an impact on the refusal rate. Generally, instructions with a higher constraint level lead to fewer refusal responses. All models except Llama2-7b-Chat display the highest refusal rate with the Low Constraint instruction.

4.3 Answer Consistency

We further evaluated the answer consistency by calculating the entropy of the answers across the 10 runs with shuffled option order, as shown in Table 3. The text output achieves better consistency than the first-token evaluation for all the models except Mixtral 8x7b. This shows that the text output is more robust to the prompt perturbation and has less selection bias. Another trend is that models with higher capability have better consistency, where Mixtral 8x7b and the Llama2 70b-Chat achieve the best consistency.
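The consistency measure can be illustrated with a short entropy computation over the answers to one question. This is a sketch under assumptions: the function name `answer_entropy` is hypothetical, the logarithm base (here base 2) is not stated in the text, and answers are assumed to be mapped back to option content so that they are comparable across shuffled orders.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical answer distribution.

    answers: the answers to one question over repeated runs, expressed as
             option content rather than letters, since the letter assigned
             to each option changes when the options are shuffled.
    Lower entropy means the model picks the same option more consistently.
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    total = len(answers)
    counts = Counter(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


consistent = ["Some"] * 10
split = ["Some"] * 5 + ["Not much"] * 5
print(answer_entropy(consistent), answer_entropy(split))
```

A model that always selects the same option across the 10 runs has entropy zero, while a model that splits evenly between two options has an entropy of one bit.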

Conclusion

We compared first-token evaluation methods with the text output for survey questions and showed that the first-token evaluation heavily misrepresents the text output for five out of six instruction-tuned models we evaluated. The results question the reliability of first-token evaluation for instruction-tuned language models, especially in settings where refusal is likely due to the sensitive nature of topics asked in the question. We also showed that the first-token evaluation is more sensitive to the prompt format and has more selection bias than text output. We call for a more direct and realistic evaluation framework to help better understand the LLM’s behaviour in real-life settings.

Limitations

In this work, we only focus on the log probability assigned to the first token of the response. Other probability-based evaluation methods include calculating the probability of every candidate answer sequence. Based on our findings in the generative setting, we question the reliability of the traditional approach that relies on the model’s probability assignment to answer candidates, which is often used in the discriminative setting. Therefore, we call for more studies on the reliability of other probability-based evaluation methods by comparing them directly to the text output.

We are only interested in LLMs’ behaviour when asked survey questions, not in how well their answers align with human responses. It would be interesting to see how the alignment changes when we switch to text output evaluation. We leave this to future work.

Ethics Statement

In this work, we use a publicly available survey dataset OpinionQA Santurkar et al. (2023), which was curated based on the survey questions from the Pew Research Center. It’s worth noting that some questions may contain content that is directly or indirectly sensitive to certain social groups. However, the risk of privacy breaches or abuse of the data or models presented here is highly unlikely. We solely present the responses generated by the LLMs in an objective manner. We do not intend to express our personal opinions on the questions.

Acknowledgements

We thank the members of MaiNLP, MilaNLP, and SODA-LMU for their constructive feedback. XW, CH and BP are supported by ERC Consolidator Grant DIALECT 101043235 and in parts by Independent Research Fund Denmark (DFF) Sapere Aude grant 9063-00077B. BM and FK are supported by BERD@NFDI (German Research Foundation grant 460037581), and MCML. PR and DH are members of the Data and Marketing Insights research unit of the Bocconi Institute for Data Science and Analysis, and are supported by a MUR FARE 2020 initiative under grant agreement Prot. R20YSMBZ8S (INDOMITA) and the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (No. 949944, INTEGRATOR).

References

Appendix A

Figure 4 shows the impact of the decoding strategy. As the temperature increases, the model prioritizes the answer diversity, which leads to a worse consistency level, but a lower mismatch and refusal rate.

A.2 Model Output Annotation

To train the classifier for text output classification, we collected response samples from the five models under the medium constraint condition of the prompt. The annotation process was carried out by a single in-house annotator, who was provided with the original survey questions along with their multiple-choice options and an additional “Refused” option to indicate refusal. The order of the options was randomly shuffled for each question. Additionally, the annotator received the model outputs, i.e., the responses to the survey questions. The task was to assign an appropriate option to each response. Figure 5 showcases a data sample that the annotator received. In cases of nonsensical responses, the annotator was instructed to mark them as “nan”. Afterward, a second in-house annotator was invited to review and refine the annotations made by the first annotator. Disagreements occurred in a small number of cases and were resolved through discussion.

A.3 Dataset Statistics

Table 5 shows the label distribution of the annotated dataset we curated for the five models we evaluated.

A.4 Classifier

Figure 4 shows the performance on the output of the five models we evaluated. We exclude Mistral-Instruct-v0.1 here since it shows a low mismatch rate and most of the responses can be easily mapped to one of the response options using rule-based methods. For simplicity, we do not consider multi-label cases here since they are only found in Mistral models and make up a small part of the total responses. The model is considered correct when it predicts one of the labels.

We use RegEx to search for the option letter pattern “[A-Z].” in the answer.
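A rule-based extractor of this kind might look like the sketch below. It is an assumption-laden illustration, not the paper's implementation: the period is escaped so that it matches a literal “.” (in an unescaped pattern, “.” would match any character), a word boundary is added before the letter, and the function name `extract_option` is hypothetical.

```python
import re

# One uppercase letter at a word boundary, followed by a literal period.
# Assumption: the "." in the pattern described above is meant literally.
OPTION_PATTERN = re.compile(r"\b([A-Z])\.")


def extract_option(response):
    """Return the first option letter found in the response, or None."""
    match = OPTION_PATTERN.search(response)
    return match.group(1) if match else None


print(extract_option("My answer is C. Somewhat agree"))
print(extract_option("I cannot provide an opinion on this topic"))
```

Such a pattern is cheap to apply but brittle: it returns nothing for refusals or free-form answers, and it can be misled by any capital letter that happens to precede a period, such as an abbreviation.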

Few shot learning

For the few-shot learning setup, we add four model outputs and the corresponding labels as examples into the instruction before asking for the prediction, as shown in Figure 6. We then use the first token from the classifier’s output as the prediction.

Finetuning

To improve the classification performance and reduce computational overhead, we annotated 414 responses from each of the five models we evaluated (all except Mistral-7b-Instruct-v0.1), resulting in 2070 samples in total. Annotation details are in A.2. We use parameter-efficient fine-tuning (PEFT) to train our classifier on the annotated model responses, and use the first token of the classifier’s response as the prediction.

A.5 Option Count Distribution

Figure 7 shows the option count distribution of Llama2-70b-chat under the instruction of (a) Example Template with Single Answer "C", (b) Example Template with Multiple Answers "A/B/C" and (c) High Constraint Instruction. Example Template leads to option count distribution mismatch compared to High Constraint Instruction.

A.6 Output Cases

Given the subjective nature of the survey questions and their diverse topics, the model outputs exhibit various response types. Additionally, instances may arise where the models decline to respond to specific sensitive or objective questions, owing to safety mechanisms and inherent model features. Table 6 showcases a selection of output cases from the Mixtral model under the medium constraint condition of the prompt. The output cases range from single-choice responses (with or without explanation) to multiple-choice responses, encompassing various types of refusals and occasionally yielding nonsensical outputs.