Calibration of Pre-trained Transformers

Shrey Desai, Greg Durrett

Introduction

Neural networks have seen wide adoption but are frequently criticized for being black boxes, offering little insight as to why predictions are made Benítez et al. (1997); Dayhoff and DeLeo (2001); Castelvecchi (2016) and making it difficult to diagnose errors at test-time. These properties are particularly exhibited by pre-trained Transformer models Devlin et al. (2019); Liu et al. (2019); Yang et al. (2019), which dominate benchmark tasks like SuperGLUE Wang et al. (2019), but use a large number of self-attention heads across many layers in a way that is difficult to unpack Clark et al. (2019); Kovaleva et al. (2019). One step towards understanding whether these models can be trusted is to analyze whether they are calibrated Raftery et al. (2005); Jiang et al. (2012); Kendall and Gal (2017): how aligned their posterior probabilities are with empirical likelihoods Brier (1950); Guo et al. (2017). If a model assigns 70% probability to an event, the event should occur 70% of the time if the model is calibrated. Although the model’s mechanism itself may be uninterpretable, a calibrated model at least gives us a signal that it “knows what it doesn’t know,” which can make these models easier to deploy in practice Jiang et al. (2012).

In this work, we evaluate the calibration of two pre-trained models, BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019), on three tasks: natural language inference Bowman et al. (2015), paraphrase detection Iyer et al. (2017), and commonsense reasoning Zellers et al. (2018). These tasks represent standard evaluation settings for pre-trained models, and critically, challenging out-of-domain test datasets are available for each. Such test data allows us to measure calibration in more realistic settings where samples stem from a dissimilar input distribution, which is exactly the scenario where we hope a well-calibrated model would avoid making confident yet incorrect predictions.

Our experiments yield several key results. First, even when used out-of-the-box, pre-trained models are calibrated in-domain. In out-of-domain settings, where non-pre-trained models like ESIM Chen et al. (2017) are overconfident, we find that pre-trained models are significantly better calibrated. Second, we show that temperature scaling Guo et al. (2017), which divides non-normalized logits by a single scalar hyperparameter, is widely effective at improving in-domain calibration. Finally, we show that regularizing the model to be less certain during training can beneficially smooth probabilities, improving out-of-domain calibration.

Related Work

Calibration has been well-studied in statistical machine learning, including applications in forecasting Brier (1950); Raftery et al. (2005); Gneiting et al. (2007); Palmer et al. (2008), medicine Yang and Thompson (2010); Jiang et al. (2012), and computer vision Kendall and Gal (2017); Guo et al. (2017); Lee et al. (2018). Past work in natural language processing has studied calibration in the non-neural Nguyen and O’Connor (2015) and neural Kumar and Sarawagi (2019) settings across several tasks. However, past work has not analyzed large-scale pre-trained models, and we additionally analyze out-of-domain settings, whereas past work largely focuses on in-domain calibration Nguyen and O’Connor (2015); Guo et al. (2017).

Another way of hardening models against out-of-domain data is to be able to explicitly detect these examples, which has been studied previously Hendrycks and Gimpel (2016); Liang et al. (2018); Lee et al. (2018). However, this assumes a discrete notion of domain; calibration is a more general paradigm and gracefully handles settings where domains are less quantized.

Posterior Calibration

Experiments

We perform evaluations on three language understanding tasks: natural language inference, paraphrase detection, and commonsense reasoning. Significant past work has studied cross-domain robustness using sentiment analysis Chen et al. (2018); Peng et al. (2018); Miller (2019); Desai et al. (2019). However, we explicitly elect to use tasks where out-of-domain performance is substantially lower and challenging domain shifts are exhibited. Below, we describe our in-domain and out-of-domain datasets. Dataset splits are detailed in Appendix A. Furthermore, out-of-domain datasets are strictly used for evaluating the generalization of in-domain models, so the training split is unused. For all datasets, we split the development set in half to obtain a held-out, non-blind test set.

Natural Language Inference.

The Stanford Natural Language Inference (SNLI) corpus is a large-scale entailment dataset where the task is to determine whether a hypothesis is entailed, contradicted by, or neutral with respect to a premise Bowman et al. (2015). Multi-Genre Natural Language Inference (MNLI) Williams et al. (2018) contains similar entailment data across several domains, which we can use as unseen test domains.

Paraphrase Detection.

Quora Question Pairs (QQP) contains sentence pairs from Quora that are semantically equivalent Iyer et al. (2017). Our out-of-domain setting is TwitterPPDB (TPPDB), which contains sentence pairs from Twitter where tweets are considered paraphrases if they have shared URLs Lan et al. (2017).

Commonsense Reasoning.

Situations With Adversarial Generations (SWAG) is a grounded commonsense reasoning task where models must select the most plausible continuation of a sentence among four candidates Zellers et al. (2018). HellaSWAG (HSWAG), an adversarial out-of-domain dataset, serves as a more challenging benchmark for pre-trained models Zellers et al. (2019); it is distributionally different in that its examples exploit statistical biases in pre-trained models.

Systems for Comparison

Table 1 shows a breakdown of the models used in our experiments. We use the same set of hyperparameters across all tasks. For pre-trained models, we omit hyperparameters that induce brittleness during fine-tuning, e.g., employing a decaying learning rate schedule with linear warmup Sun et al. (2019); Lan et al. (2020). Detailed information on optimization is available in Appendix B.

Out-of-the-box Calibration

First, we analyze “out-of-the-box” calibration; that is, the calibration error derived from evaluating a model on a dataset without using post-processing steps like temperature scaling Guo et al. (2017). For each task, we train the model on the in-domain training set, and then evaluate its performance on the in-domain and out-of-domain test sets. Quantitative results are shown in Table 2. In addition, we plot reliability diagrams Nguyen and O’Connor (2015); Guo et al. (2017) in Figure 1, which visualize the alignment between posterior probabilities (confidence) and empirical outcomes (accuracy), where a perfectly calibrated model has $\textrm{conf}(k)=\textrm{acc}(k)$ for each bucket of real-valued predictions $k$ . We remark on a few observed phenomena below:

Simpler models, such as DA, achieve competitive in-domain ECE on SNLI (1.02) and QQP (3.37), and are notably better than pre-trained models on SNLI in this regard. However, the more complex ESIM, both in number of parameters and architecture, sees increased in-domain ECE despite having higher accuracy on all tasks.

However, pre-trained models are generally more accurate and calibrated.

Rather surprisingly, pre-trained models do not show characteristics of the aforementioned inverse relationship, despite having significantly more parameters. On SNLI, RoBERTa achieves an ECE in the ballpark of DA and ESIM, but on QQP and SWAG, both BERT and RoBERTa consistently achieve higher accuracies and lower ECEs. Pre-trained models are especially strong out-of-domain, where on HellaSWAG in particular, RoBERTa reduces ECE by a factor of 3.4 compared to DA.

Using RoBERTa always improves in-domain calibration over BERT.

In addition to obtaining better task performance than BERT, RoBERTa consistently achieves lower in-domain ECE. Even out-of-domain, RoBERTa outperforms BERT in all but one setting (TwitterPPDB). Nonetheless, our results show that representations induced by robust pre-training (e.g., using a larger corpus, more training steps, dynamic masking) Liu et al. (2019) lead to more calibrated posteriors. Whether other changes to pre-training Yang et al. (2019); Lan et al. (2020); Clark et al. (2020) lead to further improvements is an open question.
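The calibration error reported in this section compares confidence and accuracy per bucket. The following is a minimal sketch of that computation, not the authors' implementation: the function name, the choice of 10 equal-width buckets, and the weighting of each bucket by its share of examples are assumptions, since this section does not specify them.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |acc(k) - conf(k)| over confidence buckets.

    confidences: predicted probability of the chosen label, one per example.
    correct:     booleans, True where the chosen label matches the gold label.
    num_bins:    number of equal-width buckets over [0, 1] (assumed value).
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bucket; conf == 1.0 goes in the last one.
        k = min(int(conf * num_bins), num_bins - 1)
        bins[k].append((conf, hit))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty buckets contribute nothing
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        acc = sum(1 for _, h in bucket if h) / len(bucket)
        ece += (len(bucket) / n) * abs(acc - avg_conf)
    return ece
```

A model that predicts with 95% confidence but is right only three times out of four would score roughly 0.2 under this sketch, while a perfectly calibrated model, with $\textrm{conf}(k)=\textrm{acc}(k)$ in every bucket, would score 0.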

Post-hoc Calibration

There are a number of techniques that can be applied to correct a model’s calibration post-hoc. Using our in-domain development set, we can, for example, post-process model probabilities via temperature scaling Guo et al. (2017), where a scalar temperature hyperparameter $T$ divides non-normalized logits before the softmax operation. As $T\rightarrow 0$ , the distribution’s mode receives all the probability mass, while as $T\rightarrow\infty$ , the probabilities become uniform.
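As a rough illustration of the limiting behavior described above, the sketch below divides logits by $T$ before a softmax. The function name and the example logits are hypothetical and chosen only for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a scalar temperature T > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical non-normalized scores
sharp = softmax_with_temperature(logits, temperature=0.01)   # mass moves to the mode
flat = softmax_with_temperature(logits, temperature=1000.0)  # approaches uniform
```

Because dividing by a positive scalar preserves the ordering of the logits, the predicted label is unchanged for any $T>0$; only the confidence attached to it changes.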

Furthermore, we experiment with models trained in-domain with label smoothing (LS) Miller et al. (1996); Pereyra et al. (2017) as opposed to conventional maximum likelihood estimation (MLE). By nature, MLE encourages models to sharpen the posterior distribution around the gold label, leading to confidence which is typically unwarranted in out-of-domain settings. Label smoothing presents one solution to overconfidence by maintaining uncertainty over the label space during training: we minimize the KL divergence with the distribution placing a $1-\alpha$ fraction of probability mass on the gold label and $\frac{\alpha}{|\mathcal{Y}|-1}$ fraction of mass on each other label, where $\alpha\in(0,1)$ is a hyperparameter. For example, the one-hot target is transformed into [0.9, 0.05, 0.05] when $\alpha=0.1$ . This re-formulated learning objective does not require changing the model architecture.
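The smoothed target distribution and the KL objective can be sketched as follows. This is an illustrative sketch under the definitions above, not the training code used in the experiments; the function names are hypothetical.

```python
import math


def smoothed_target(gold_index, num_labels, alpha=0.1):
    """Place 1 - alpha on the gold label and alpha / (|Y| - 1) on each other label."""
    off_mass = alpha / (num_labels - 1)
    return [1.0 - alpha if i == gold_index else off_mass for i in range(num_labels)]


def kl_divergence(target, predicted):
    """KL(target || predicted) for two distributions over the same label set."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


target = smoothed_target(gold_index=0, num_labels=3, alpha=0.1)
loss = kl_divergence(target, [0.98, 0.01, 0.01])  # an overly sharp prediction
```

The divergence is zero only when the predicted distribution equals the smoothed target, so a prediction that puts nearly all of its mass on the gold label is penalized rather than rewarded.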

For each task, we train the model with either MLE or LS ( $\alpha=0.1$ ) using the in-domain training set, use the in-domain development set to learn an optimal temperature $T$ , and then evaluate the model (scaled with $T$ ) on the in-domain and out-of-domain test sets. From Table 3 and Figure 2, we draw the following conclusions:

MLE models are always better than LS models in-domain, which suggests incorporating uncertainty when in-domain samples are available is not an effective regularization scheme. Even when using a small smoothing value (0.1), LS models do not achieve nearly as good out-of-the-box results as MLE models, and temperature scaling hurts LS in many cases. By contrast, RoBERTa with temperature-scaled MLE achieves ECE values from 0.7-0.8, implying that MLE training yields scores that are fundamentally good but just need some minor rescaling.

However, out-of-domain, label smoothing is generally more effective.

In most cases, MLE models do not perform well on out-of-domain datasets, with ECEs ranging from 8-12. However, LS models are forced to distribute probability mass across classes, and as a result, achieve significantly lower ECEs on average. We note that LS is particularly effective when the distribution shift is strong. On the adversarial HellaSWAG, for example, RoBERTa-LS reduces ECE by a factor of 5.8 relative to RoBERTa-MLE. This phenomenon is visually depicted in Figure 3 where we see RoBERTa-LS is significantly closer to the identity function despite being used out-of-the-box.

Optimal temperature scaling values are bounded within a small interval.

Table 4 reports the learned temperature values for BERT-MLE and RoBERTa-MLE. For in-domain tasks, the optimal temperature values are generally in the range 1-1.4. Interestingly, out-of-domain, TwitterPPDB and HellaSWAG require larger temperature values than MNLI, which suggests the degree of distribution shift and magnitude of $T$ may be closely related.

Conclusion

Posterior calibration is one lens to understand the trustworthiness of model confidence scores. In this work, we examine the calibration of pre-trained Transformers in both in-domain and out-of-domain settings. Results show BERT and RoBERTa coupled with temperature scaling achieve low ECEs in-domain, and when trained with label smoothing, are also competitive out-of-domain.

Acknowledgments

This work was partially supported by NSF Grant IIS-1814522 and a gift from Arm. The authors acknowledge a DURIP equipment grant to UT Austin that provided computational resources to conduct this research. Additionally, we thank R. Thomas McCoy for answering questions about DA and ESIM.

References

Appendix A Dataset Splits

Appendix B Training and Optimization

For non-pre-trained model baselines, we use the open-source implementations of DA Parikh et al. (2016) and ESIM Chen et al. (2017) in AllenNLP Gardner et al. (2018), except in the case of SWAG/HellaSWAG, where we run the baselines available in the authors’ code (https://github.com/rowanz/swagaf). For BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019), we use bert-base-uncased and roberta-base, respectively, from HuggingFace Transformers Wolf et al. (2019). BERT is fine-tuned with a maximum of 3 epochs, batch size of 16, learning rate of 2e-5, gradient clip of 1.0, and no weight decay. Similarly, RoBERTa is fine-tuned with a maximum of 3 epochs, batch size of 32, learning rate of 1e-5, gradient clip of 1.0, and weight decay of 0.1. Both models are optimized with AdamW Loshchilov and Hutter (2019). Other than early stopping on the development set, we do not perform additional hyperparameter searches. Finally, all experiments are conducted on NVIDIA V100 32GB GPUs, with the total time for fine-tuning all models being under 24 hours.

Furthermore, temperature scaling line searches are performed in the range [0.01, 5.0] with a granularity of 0.01. These searches are quite fast and can be performed on a CPU; we simply evaluate calibration error by rescaling cached logits. On an Intel Xeon E3-1270 v3 CPU, all searches can be completed in under 15 minutes.
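A line search of this kind might look like the sketch below, which rescales cached logits at each candidate temperature and keeps the value with the lowest calibration error. The helper names, the 10-bucket ECE, and the toy logits are assumptions made for illustration; only the search range and granularity come from the text.

```python
import math


def softmax(logits, temperature):
    """Softmax over logits divided by a scalar temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ece(confidences, correct, num_bins=10):
    """Expected calibration error with equal-width buckets (assumed definition)."""
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        bins[min(int(conf * num_bins), num_bins - 1)].append((conf, hit))
    total = 0.0
    for bucket in bins:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            acc = sum(1 for _, h in bucket if h) / len(bucket)
            total += (len(bucket) / len(confidences)) * abs(acc - avg_conf)
    return total


def search_temperature(cached_logits, labels, low=0.01, high=5.0, step=0.01):
    """Grid search over [low, high]; returns (best temperature, its ECE)."""
    best_t, best_err = None, float("inf")
    num_steps = int(round((high - low) / step))
    for i in range(num_steps + 1):
        t = low + i * step
        confs, hits = [], []
        for logits, gold in zip(cached_logits, labels):
            probs = softmax(logits, t)
            pred = max(range(len(probs)), key=probs.__getitem__)
            confs.append(probs[pred])
            hits.append(pred == gold)
        err = ece(confs, hits)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err


# Toy development set: always predicts label 0 with high confidence, right 3 times in 4.
dev_logits = [[4.0, 0.0]] * 4
dev_labels = [0, 0, 0, 1]
best_t, best_err = search_temperature(dev_logits, dev_labels)
```

Since the logits are computed once and only rescaled, each candidate temperature costs a single pass over the cached development-set outputs, which is consistent with the search being cheap enough to run on a CPU.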

Appendix C Reproducibility

Table 6 shows the accuracy and expected calibration error (ECE) of BERT and RoBERTa on the development sets of the datasets we consider. We do not report post-hoc calibration results using the development set since these require tuning on the development set itself.