Deep Anomaly Detection with Outlier Exposure
Dan Hendrycks, Mantas Mazeika, Thomas Dietterich
Introduction
Machine Learning systems in deployment often encounter data that is unlike the model’s training data. This can occur in discovering novel astronomical phenomena, finding unknown diseases, or detecting sensor failure. In these situations, models that can detect anomalies (Liu et al., 2018; Emmott et al., 2013) are capable of correctly flagging unusual examples for human intervention, or carefully proceeding with a more conservative fallback policy.
Behind many machine learning systems are deep learning models (Krizhevsky et al., 2012) which can provide high performance in a variety of applications, so long as the data seen at test time is similar to the training data. However, when there is a distribution mismatch, deep neural network classifiers tend to give high confidence predictions on anomalous test examples (Nguyen et al., 2015). This can invalidate the use of prediction probabilities as calibrated confidence estimates (Guo et al., 2017), and makes detecting anomalous examples doubly important.
Several previous works seek to address these problems by giving deep neural network classifiers a means of assigning anomaly scores to inputs. These scores can then be used for detecting out-of-distribution (OOD) examples (Hendrycks & Gimpel, 2017; Lee et al., 2018; Liu et al., 2018). These approaches have been demonstrated to work surprisingly well for complex input spaces, such as images, text, and speech. Moreover, they do not require modeling the full data distribution, but instead can use heuristics for detecting unmodeled phenomena. Several of these methods detect unmodeled phenomena by using representations from only in-distribution data.
In this paper, we investigate a complementary method where we train models to detect unmodeled data by learning cues for whether an input is unmodeled. While it is difficult to model the full data distribution, we can learn effective heuristics for detecting out-of-distribution inputs by exposing the model to OOD examples, thus learning a more conservative concept of the inliers and enabling the detection of novel forms of anomalies. We propose leveraging diverse, realistic datasets for this purpose, with a method we call Outlier Exposure (OE). OE provides a simple and effective way to consistently improve existing methods for OOD detection.
Through numerous experiments, we extensively evaluate the broad applicability of Outlier Exposure. For multiclass neural networks, we provide thorough results on Computer Vision and Natural Language Processing tasks which show that Outlier Exposure can help anomaly detectors generalize to and perform well on unseen distributions of outliers, even on large-scale images. We also demonstrate that Outlier Exposure provides gains over several existing approaches to out-of-distribution detection. Our results also show the flexibility of Outlier Exposure, as we can train various models with different sources of outlier distributions. Additionally, we establish that Outlier Exposure can make density estimates of OOD samples significantly more useful for OOD detection. Finally, we demonstrate that Outlier Exposure improves the calibration of neural network classifiers in the realistic setting where a fraction of the data is OOD. Our code is made publicly available at https://github.com/hendrycks/outlier-exposure.
Related Work
Out-of-Distribution Detection with Deep Networks. Hendrycks & Gimpel (2017) demonstrate that a deep, pre-trained classifier has a lower maximum softmax probability on anomalous examples than in-distribution examples, so a classifier can conveniently double as a consistently useful out-of-distribution detector. Building on this work, DeVries & Taylor (2018) attach an auxiliary branch onto a pre-trained classifier and derive a new OOD score from this branch. Liang et al. (2018) present a method which can improve performance of OOD detectors that use a softmax distribution. In particular, they make the maximum softmax probability more discriminative between anomalies and in-distribution examples by pre-processing input data with adversarial perturbations (Goodfellow et al., 2015). Unlike in our work, their parameters are tailored to each source of anomalies.
Lee et al. (2018) train a classifier concurrently with a GAN (Radford et al., 2016; Goodfellow et al., 2014), and the classifier is trained to have lower confidence on GAN samples. For each testing distribution of anomalies, they tune the classifier and GAN using samples from that out-distribution, as discussed in Appendix B of their work. Unlike Liang et al. (2018); Lee et al. (2018), in this work we train our method without tuning parameters to fit specific types of anomaly test distributions, so our results are not directly comparable with their results. Many other works (de Vries et al., 2016; Subramanya et al., 2017; Malinin & Gales, 2018; Bevandic et al., 2018) also encourage the model to have lower confidence on anomalous examples. Recently, Liu et al. (2018) provide theoretical guarantees for detecting out-of-distribution examples under the assumption that a suitably powerful anomaly detector is available.
Utilizing Auxiliary Datasets. Outlier Exposure uses an auxiliary dataset entirely disjoint from test-time data in order to teach the network better representations for anomaly detection. Goodfellow et al. (2015) train on adversarial examples to increase robustness. Salakhutdinov et al. (2011) pre-train unsupervised deep models on a database of web images for stronger features. Radford et al. (2017) train an unsupervised network on a corpus of Amazon reviews for a month in order to obtain quality sentiment representations. Zeiler & Fergus (2014) find that pre-training a network on the large ImageNet database (Russakovsky et al., 2015) endows the network with general representations that are useful in many fine-tuning applications. Chen & Gupta (2015); Mahajan et al. (2018) show that representations learned from images scraped from the nigh unlimited source of search engines and photo-sharing websites improve object detection performance.
Outlier Exposure
We consider the task of deciding whether or not a sample is from a learned distribution called $\mathcal{D}_{\text{in}}$. Samples from $\mathcal{D}_{\text{in}}$ are called “in-distribution,” and otherwise are said to be “out-of-distribution” (OOD) or samples from $\mathcal{D}_{\text{out}}$. In real applications, it may be difficult to know the distribution of outliers one will encounter in advance. Thus, we consider the realistic setting where $\mathcal{D}_{\text{out}}$ is unknown. Given a parametrized OOD detector and an Outlier Exposure (OE) dataset $\mathcal{D}_{\text{out}}^{\text{OE}}$, disjoint from $\mathcal{D}_{\text{out}}^{\text{test}}$, we train the model to discover signals and learn heuristics to detect whether a query is sampled from $\mathcal{D}_{\text{in}}$ or $\mathcal{D}_{\text{out}}^{\text{OE}}$. We find that these heuristics generalize to unseen distributions $\mathcal{D}_{\text{out}}$.
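Deep parametrized anomaly detectors typically leverage learned representations from an auxiliary task, such as classification or density estimation. Given a model $f$ and the original learning objective $\mathcal{L}$, we can thus formalize Outlier Exposure as minimizing the objective
$$\mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{in}}}\left[\mathcal{L}(f(x),y)+\lambda\,\mathbb{E}_{x'\sim\mathcal{D}_{\text{out}}^{\text{OE}}}\left[\mathcal{L}_{\text{OE}}(f(x'),f(x),y)\right]\right]$$
over the parameters of $f$. In cases where labeled data is not available, $y$ can be ignored.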
Outlier Exposure can be applied with many types of data and original tasks. Hence, the specific formulation of $\mathcal{L}_{\text{OE}}$ is a design choice, and depends on the task at hand and the OOD detector used. For example, when using the maximum softmax probability baseline detector (Hendrycks & Gimpel, 2017), we set $\mathcal{L}_{\text{OE}}$ to the cross-entropy from $f(x')$ to the uniform distribution (Lee et al., 2018). When the original objective is density estimation and labels are not available, we set $\mathcal{L}_{\text{OE}}$ to a margin ranking loss on the log probabilities $f(x')$ and $f(x)$.
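To make the classification case concrete, the following is a minimal sketch of an OE objective that adds a cross-entropy-to-uniform term on outlier logits to the usual cross-entropy on labeled in-distribution examples. It is plain Python rather than the released training code; the function names, the batch layout, and the default weight `lam=0.5` are assumptions made for illustration only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def cross_entropy(logits, label):
    """Standard cross-entropy for one labeled in-distribution example."""
    return -log_softmax(logits)[label]

def uniform_cross_entropy(logits):
    """Cross-entropy between the uniform distribution and the softmax output."""
    log_probs = log_softmax(logits)
    return -sum(log_probs) / len(log_probs)

def oe_objective(in_batch, out_batch, lam=0.5):
    """Task loss plus lam times the outlier term.

    in_batch:  list of (logits, label) pairs from the in-distribution data.
    out_batch: list of logits computed on outlier-exposure examples.
    lam:       hypothetical illustrative weight on the outlier term.
    """
    task_loss = sum(cross_entropy(l, y) for l, y in in_batch) / len(in_batch)
    oe_loss = sum(uniform_cross_entropy(l) for l in out_batch) / len(out_batch)
    return task_loss + lam * oe_loss
```

The outlier term is smallest (equal to the log of the number of classes) when the softmax output on an outlier is exactly uniform, so minimizing it pushes the classifier toward low confidence on such inputs.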
Experiments
We evaluate OOD detectors with and without OE on a wide range of datasets. Each evaluation consists of an in-distribution dataset $\mathcal{D}_{\text{in}}$ used to train an initial model, a dataset of anomalous examples $\mathcal{D}_{\text{out}}^{\text{OE}}$, and a baseline detector to which we apply OE. We describe the datasets in Section 4.2. The OOD detectors and $\mathcal{L}_{\text{OE}}$ losses are described on a case-by-case basis.
In the first experiment, we show that OE can help detectors generalize to new text and image anomalies. This is all accomplished without assuming access to the test distribution during training or tuning, unlike much previous work. In the confidence branch experiment, we show that OE is flexible and complements a binary anomaly detector. Then we demonstrate that using synthetic outliers does not work as well as using real and diverse data; previously it was assumed that we need synthetic data or carefully selected close-to-distribution data, but real and diverse data is enough. We conclude with experiments in density estimation. In these experiments we find that a cutting-edge density estimator unexpectedly assigns higher density to out-of-distribution samples than in-distribution samples, and we ameliorate this surprising behavior with Outlier Exposure.
We evaluate out-of-distribution detection methods on their ability to detect OOD points. For this purpose, we treat the OOD examples as the positive class, and we evaluate three metrics: area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPR), and the false positive rate at $N\%$ true positive rate (FPR$N$). The AUROC and AUPR are holistic metrics that summarize the performance of a detection method across multiple thresholds. The AUROC can be thought of as the probability that an anomalous example is given a higher OOD score than an in-distribution example (Davis & Goadrich, 2006). Thus, a higher AUROC is better, and an uninformative detector has an AUROC of 50%. The AUPR is useful when anomalous examples are infrequent (Manning & Schütze, 1999), as it takes the base rate of anomalies into account. During evaluation with these metrics, the base rate of $\mathcal{D}_{\text{out}}^{\text{test}}$ to $\mathcal{D}_{\text{in}}^{\text{test}}$ test examples in all of our experiments is 1:5.
Whereas the previous two metrics represent the detection performance across various thresholds, the FPR$N$ metric represents performance at one strict threshold. By observing performance at a strict threshold, we can make clear comparisons among strong detectors. The FPR$N$ metric (Liu et al., 2018; Kumar et al., 2016; Balntas et al., 2016) is the probability that an in-distribution example (negative) raises a false alarm when $N\%$ of anomalous examples (positive) are detected, so a lower FPR$N$ is better. Capturing nearly all anomalies with few false alarms can be of high practical value.
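The three metrics can be sketched directly from their descriptions. The code below is an illustrative plain-Python version, not the evaluation code used for the reported tables: the AUROC is computed from its pairwise-ranking interpretation, the AUPR is approximated by step-wise average precision, and the threshold rule in `fpr_at_tpr` is an assumption about how ties and rounding are handled.

```python
import math

def auroc(pos_scores, neg_scores):
    """Probability that a positive (OOD) example outscores a negative; ties count 1/2."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def average_precision(pos_scores, neg_scores):
    """Step-wise area under the precision-recall curve.

    The sort is stable and positives are listed first, so tied scores
    are resolved in favor of positives.
    """
    labeled = [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores]
    labeled.sort(key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, is_pos) in enumerate(labeled, start=1):
        if is_pos:
            hits += 1
            total += hits / rank
    return total / len(pos_scores)

def fpr_at_tpr(pos_scores, neg_scores, tpr=0.95):
    """False positive rate at the highest threshold that detects at least `tpr` of positives."""
    ranked = sorted(pos_scores, reverse=True)
    k = max(1, math.ceil(tpr * len(ranked)))  # positives that must be flagged
    threshold = ranked[k - 1]
    false_alarms = sum(1 for n in neg_scores if n >= threshold)
    return false_alarms / len(neg_scores)
```

For large evaluation sets, a sorting-based AUROC or a library routine would be preferable to the quadratic pairwise loop used here for clarity.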
4.2 Datasets
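SVHN. The SVHN dataset (Netzer et al., 2011) contains $32\times 32$ color images of house numbers. There are ten classes comprised of the digits 0-9. The training set has 604,388 images, and the test set has 26,032 images. For preprocessing, we rescale the pixels to be in the interval $[0,1]$. CIFAR. The two CIFAR (Krizhevsky & Hinton, 2009) datasets contain $32\times 32$ natural color images. CIFAR-10 has ten classes while CIFAR-100 has 100. Both have 50,000 training images and 10,000 test images. Tiny ImageNet. The Tiny ImageNet dataset is a 200-class subset of the ImageNet (Russakovsky et al., 2015) dataset where images are resized and cropped to $64\times 64$ resolution. The training set has 100,000 images and the test set has 10,000 images. Places365. The Places365 dataset (Zhou et al., 2017) consists of 1,803,460 training images of scenes. 20 Newsgroups. 20 Newsgroups is a text classification dataset of newsgroup documents with approximately 20,000 examples. We use a 60/40 train/test split. TREC. TREC is a question classification dataset with 50 fine-grained classes and 5,952 individual questions. We reserve 500 examples for the test set, and use the rest for training. SST. The Stanford Sentiment Treebank is a dataset of movie reviews expressing positive or negative sentiment. SST has 8,544 reviews for training and 2,210 for testing.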
4.2.2 Outlier Exposure Datasets
80 Million Tiny Images. 80 Million Tiny Images (Torralba et al., 2008) is a large-scale, diverse dataset of natural images scraped from the web. We use this dataset as $\mathcal{D}_{\text{out}}^{\text{OE}}$ for experiments with SVHN, CIFAR-10, and CIFAR-100 as $\mathcal{D}_{\text{in}}$. We remove all examples of 80 Million Tiny Images which appear in the CIFAR datasets, so that $\mathcal{D}_{\text{out}}^{\text{OE}}$ and $\mathcal{D}_{\text{out}}^{\text{test}}$ are disjoint. In Section 5 we note that only a small fraction of this dataset is necessary for successful OE. ImageNet-22K. We use the ImageNet dataset with images from approximately 22 thousand classes as $\mathcal{D}_{\text{out}}^{\text{OE}}$ for Tiny ImageNet and Places365 since images from 80 Million Tiny Images are too low-resolution. To make $\mathcal{D}_{\text{out}}^{\text{OE}}$ and $\mathcal{D}_{\text{out}}^{\text{test}}$ disjoint, images in ImageNet-1K are removed. WikiText-2. WikiText-2 is a corpus of Wikipedia articles typically used for language modeling. We use WikiText-2 as $\mathcal{D}_{\text{out}}^{\text{OE}}$ for language modeling experiments with Penn Treebank as $\mathcal{D}_{\text{in}}$. For classification tasks on 20 Newsgroups, TREC, and SST, we treat each sentence of WikiText-2 as an individual example, and use simple filters to remove low-quality sentences.
4.3 Multiclass Classification
Unlike Liang et al. (2018); Lee et al. (2018) and like Hendrycks & Gimpel (2017); DeVries & Taylor (2018), we do not tune our hyperparameters for each $\mathcal{D}_{\text{out}}^{\text{test}}$ distribution, so that $\mathcal{D}_{\text{out}}^{\text{test}}$ is kept unknown like with real-world anomalies. Instead, the $\lambda$ coefficients were determined early in experimentation with validation distributions described in Appendix A. In particular, we use $\lambda=0.5$ for vision experiments and $\lambda=1.0$ for NLP experiments. Like previous OOD detection methods involving network fine-tuning, we chose $\lambda$ so that impact on classification accuracy is negligible.
For nearly all of the vision experiments, we train Wide Residual Networks (Zagoruyko & Komodakis, 2016) and then fine-tune network copies with OE for 10 epochs. However, we use a pre-trained ResNet-18 for Places365. For NLP experiments, we train 2-layer GRUs (Cho et al., 2014) for 5 epochs, then fine-tune network copies with OE for 2 epochs. Networks trained on CIFAR-10 or CIFAR-100 are exposed to images from 80 Million Tiny Images, and the Tiny ImageNet and Places365 classifiers are exposed to ImageNet-22K. NLP classifiers are exposed to WikiText-2. Further architectural and training details are in Appendix B. For all tasks, OE improves average performance by a large margin. Averaged results are shown in Tables 1 and 2. Sample ROC curves are shown in Figures 1 and 4. Detailed results on individual datasets are in Table 7 and Table 8 in Appendix A. Notice that the SVHN classifier with OE can be used to detect new anomalies such as emojis and street view alphabet letters, even though $\mathcal{D}_{\text{out}}^{\text{OE}}$ is a dataset of natural images. Thus, Outlier Exposure helps models to generalize to unseen $\mathcal{D}_{\text{out}}^{\text{test}}$ distributions far better than the baseline.
Synthetic Outliers. Outlier Exposure leverages the simplicity of downloading real datasets, but it is possible to generate synthetic outliers. Note that we made an attempt to distort images with noise and use these as outliers for OE, but the classifier quickly memorized this statistical pattern and did not detect new OOD examples any better than before (Hafner et al., 2018). A method with better success is from Lee et al. (2018). They carefully train a GAN to generate synthetic examples near the classifier’s decision boundary. The classifier is encouraged to have a low maximum softmax probability on these synthetic examples. For CIFAR classifiers, they mention that a GAN can be a better source of anomalies than datasets such as SVHN. In contrast, we find that the simpler approach of drawing anomalies from a diverse dataset is sufficient for marked improvements in OOD detection.
We train a 40-4 Wide Residual Network using Lee et al. (2018)’s publicly available code, and use the network’s maximum softmax probabilities as our baseline. Another classifier trains concurrently with a GAN so that the classifier assigns GAN-generated examples a high OOD score. We want each $\mathcal{D}_{\text{out}}^{\text{test}}$ to be novel. Consequently, we use their code’s default hyperparameters, and exactly one model encounters all tested distributions. This is unlike their work since, for each distribution, they train and tune a new network. We do not evaluate on Tiny ImageNet, Places365, or text, since DCGANs cannot stably generate such images and text. Lastly, we take the network trained in tandem with a GAN and fine-tune it with OE. Table 4 shows the large gains from using OE with a real and diverse dataset over using synthetic samples from a GAN.
4.4 Density Estimation
Density estimators learn a probability density function over the data distribution $\mathcal{D}_{\text{in}}$. Anomalous examples should have low probability density, as they are scarce in $\mathcal{D}_{\text{in}}$ by definition (Nalisnick et al., 2019). Consequently, density estimates are another means by which to score anomalies (Zong et al., 2018). We show the ability of OE to improve density estimates on low-probability, outlying data.
PixelCNN++. Autoregressive neural density estimators provide a way to parametrize the probability density of image data. Although sampling from these architectures is slow, they allow for evaluating the probability density with a single forward pass through a CNN, making them promising candidates for OOD detection. We use PixelCNN++ (Salimans et al., 2017) as a baseline OOD detector, and we train it on CIFAR-10. The OOD score of example $x$ is the bits per pixel (BPP), defined as $\text{nll}(x)/\text{num\_pixels}$, where nll is the negative log-likelihood. With this loss we fine-tune for 2 epochs using OE, which we find is sufficient for the training loss to converge. Here OE is implemented with a margin loss over the log-likelihood difference between in-distribution and anomalous examples, so that the loss for a sample $x_{\text{in}}$ from $\mathcal{D}_{\text{in}}$ and point $x_{\text{out}}$ from $\mathcal{D}_{\text{out}}^{\text{OE}}$ is
$$\max\{0,\ \text{num\_pixels}+\text{nll}(x_{\text{in}})-\text{nll}(x_{\text{out}})\}.$$
Results are shown in Table 5. Notice that PixelCNN++ without OE unexpectedly assigns lower BPP to SVHN images than to CIFAR-10 images. For all $\mathcal{D}_{\text{out}}^{\text{test}}$ datasets, OE significantly improves results.
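The density-estimation variant of OE described for PixelCNN++ can be sketched in a few lines. This is an illustrative stand-alone version: the function names are hypothetical, the negative log-likelihoods are assumed to be supplied already in bits, and the margin is left as a parameter (the usage below sets it to the number of pixels in a $32\times 32\times 3$ image as an example).

```python
def bits_per_pixel(nll_bits, num_pixels):
    """OOD score: negative log-likelihood (assumed in bits) divided by pixel count."""
    return nll_bits / num_pixels

def oe_margin_loss(nll_in, nll_out, margin):
    """Hinge loss on the likelihood gap.

    The loss is zero once the outlier's negative log-likelihood exceeds
    the in-distribution example's by at least `margin`.
    """
    return max(0.0, margin + nll_in - nll_out)

# Example: an outlier that is already much less likely incurs no loss,
# while an outlier scored as equally likely incurs the full margin.
num_pixels = 32 * 32 * 3
loss_separated = oe_margin_loss(nll_in=9000.0, nll_out=20000.0, margin=num_pixels)
loss_confused = oe_margin_loss(nll_in=9000.0, nll_out=9000.0, margin=num_pixels)
```

Because the loss only compares the two likelihoods, it needs no labels, which matches the unlabeled density-estimation setting.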
Language Modeling. We next explore using OE on language models. We use QRNN (Merity et al., 2018a; b) language models as baseline OOD detectors. For the OOD score, we use bits per character (BPC) or bits per word (BPW), defined as $\text{nll}(x)/\text{sequence\_length}$, where $\text{nll}(x)$ is the negative log-likelihood of the sequence $x$. Outlier Exposure is implemented by adding the cross entropy to the uniform distribution on tokens from sequences in $\mathcal{D}_{\text{out}}^{\text{OE}}$ as an additional loss term.
For $\mathcal{D}_{\text{in}}$, we use the language-modeling version of Penn Treebank, split into sequences of length 70 for backpropagation for word-level models, and 150 for character-level models. We do not train or evaluate with preserved hidden states as in BPTT. This is because retaining hidden states would greatly simplify the task of OOD detection. Accordingly, the OOD detection task is to provide a score for 70- or 150-token sequences in the unseen $\mathcal{D}_{\text{out}}^{\text{test}}$ datasets.
We train word-level models for 300 epochs, and character-level models for 50 epochs. We then fine-tune using OE on WikiText-2 for 5 epochs. For the character-level language model, we create a character-level version of WikiText-2 by converting words to lowercase and leaving out characters which do not appear in PTB. OOD detection results for the word-level and character-level language models are shown in Table 6; expanded results and descriptions are in Appendix F. In all cases, OE improves over the baseline, and the improvement is especially large for the word-level model.
Discussion
Extensions to Multilabel Classifiers and the Reject Option. Outlier Exposure can work in more classification regimes than just those considered above. For example, a multilabel classifier trained on CIFAR-10 obtains an 88.8% mean AUROC when using the maximum prediction probability as the OOD score. By training with OE to decrease the classifier’s output probabilities on OOD samples, the mean AUROC increases to 97.1%. This is slightly less than the AUROC for a multiclass model tuned with OE. An alternative OOD detection formulation is to give classifiers a “reject class” (Bartlett & Wegkamp, 2008). Outlier Exposure is also flexible enough to improve performance in this setting, but we find that even with OE, classifiers with the reject option or multilabel outputs are not as competitive as OOD detectors with multiclass outputs.
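Flexibility in Choosing $\mathcal{D}_{\text{out}}^{\text{OE}}$. Early in experimentation, we found that the choice of $\mathcal{D}_{\text{out}}^{\text{OE}}$ is important for generalization to unseen $\mathcal{D}_{\text{out}}^{\text{test}}$ distributions. For example, adding Gaussian noise to samples from $\mathcal{D}_{\text{in}}$ to create $\mathcal{D}_{\text{out}}^{\text{OE}}$ does not teach the network to generalize to unseen anomaly distributions for complex $\mathcal{D}_{\text{in}}$. Similarly, we found in Section 4.3 that synthetic anomalies do not work as well as real data for $\mathcal{D}_{\text{out}}^{\text{OE}}$. In contrast, our experiments demonstrate that the large datasets of realistic anomalies described in Section 4.2.2 do generalize to unseen $\mathcal{D}_{\text{out}}^{\text{test}}$ distributions.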
In addition to size and realism, we found diversity of $\mathcal{D}_{\text{out}}^{\text{OE}}$ to be an important factor. Concretely, a CIFAR-100 classifier with CIFAR-10 as $\mathcal{D}_{\text{out}}^{\text{OE}}$ hardly improves over the baseline. A CIFAR-10 classifier exposed to ten CIFAR-100 outlier classes corresponds to an average AUPR of 78.5%. Exposed to 30 such classes, the classifier’s average AUPR becomes 85.1%. Next, 50 classes correspond to 85.3%, and from there on additional CIFAR-100 classes barely improve performance. This suggests that dataset diversity is important, not just size. In fact, experiments in this paper often used around 1% of the images in the 80 Million Tiny Images dataset since we only briefly fine-tuned the models. We also found that using only 50,000 examples from this dataset led to a negligible degradation in detection performance. Additionally, $\mathcal{D}_{\text{out}}^{\text{OE}}$ datasets with significantly different statistics can perform similarly. For instance, using the Project Gutenberg dataset in lieu of WikiText-2 for $\mathcal{D}_{\text{out}}^{\text{OE}}$ in the SST experiments gives an average AUROC of 90.1% instead of 89.3%.
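Closeness of $\mathcal{D}_{\text{out}}^{\text{test}}$, $\mathcal{D}_{\text{out}}^{\text{OE}}$, and $\mathcal{D}_{\text{in}}^{\text{test}}$. Our experiments show several interesting effects of the closeness of the datasets involved. Firstly, we find that $\mathcal{D}_{\text{out}}^{\text{test}}$ and $\mathcal{D}_{\text{out}}^{\text{OE}}$ need not be close for training with OE to improve performance on $\mathcal{D}_{\text{out}}^{\text{test}}$. In Appendix A, we observe that an OOD detector for SVHN has its performance improve with Outlier Exposure even though (1) $\mathcal{D}_{\text{out}}^{\text{OE}}$ samples are images of natural scenes rather than digits, and (2) $\mathcal{D}_{\text{out}}^{\text{test}}$ includes unnatural examples such as emojis. We observed the same in our preliminary experiments with MNIST; using 80 Million Tiny Images as $\mathcal{D}_{\text{out}}^{\text{OE}}$, OE increased the AUPR from 94.2% to 97.0%.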
Secondly, we find that the closeness of $\mathcal{D}_{\text{out}}^{\text{OE}}$ to $\mathcal{D}_{\text{in}}^{\text{test}}$ can be an important factor in the success of OE. In the NLP experiments, preprocessing $\mathcal{D}_{\text{out}}^{\text{OE}}$ to be closer to $\mathcal{D}_{\text{in}}$ improves OOD detection performance significantly. Without preprocessing, the network may discover easy-to-learn cues which reveal whether the input is in- or out-of-distribution, so the OE training objective can be optimized in unintended ways. That results in weaker detectors. In a separate experiment, we use Online Hard Example Mining so that difficult outliers have more weight in Outlier Exposure. Although this improves performance on the hardest anomalies, anomalies without plausible local statistics like noise are detected slightly less effectively than before. Thus hard or close-to-distribution examples do not necessarily teach the detector all valuable heuristics for detecting various forms of anomalies. Real-world applications of OE could use the method of Sun et al. (2018) to refine a scraped $\mathcal{D}_{\text{out}}^{\text{OE}}$ auxiliary dataset to be appropriately close to $\mathcal{D}_{\text{in}}^{\text{test}}$.
OE Improves Calibration. When using classifiers for prediction, it is important that confidence estimates given for the predictions do not misrepresent empirical performance. A calibrated classifier gives confidence probabilities that match the empirical frequency of correctness. That is, if a calibrated model predicts an event with 30% probability, then 30% of the time the event transpires.
Existing confidence calibration approaches consider the standard setting where data at test-time is always drawn from $\mathcal{D}_{\text{in}}$. We extend this setting to include examples from $\mathcal{D}_{\text{out}}^{\text{test}}$ at test-time since systems should provide calibrated probabilities on both in- and out-of-distribution samples. The classifier should have low-confidence predictions on these OOD examples, since they do not have a class. Building on the temperature tuning method of Guo et al. (2017), we demonstrate that OE can improve calibration performance in this realistic setting. Summary results are shown in Figure 3. Detailed results and a description of the metrics are in Appendix G.
Conclusion
In this paper, we proposed Outlier Exposure, a simple technique that enhances many current OOD detectors across various settings. It uses out-of-distribution samples to teach a network heuristics to detect new, unmodeled, out-of-distribution examples. We showed that this method is broadly applicable in vision and natural language settings, even for large-scale image tasks. OE can improve model calibration and several previous anomaly detection techniques. Further, OE can teach density estimation models to assign more plausible densities to out-of-distribution samples. Finally, Outlier Exposure is computationally inexpensive, and it can be applied with low overhead to existing systems. In summary, Outlier Exposure is an effective and complementary approach for enhancing out-of-distribution detection systems.
We thank NVIDIA for donating GPUs used in this research. This research was supported by a grant from the Future of Life Institute.
References
Appendix A Expanded Multiclass Results
Expanded multiclass out-of-distribution detection results are in Table 7 and Table 8.
Anomalous Data. For each in-distribution dataset $\mathcal{D}_{\text{in}}$, we comprehensively evaluate OOD detectors on artificial and real anomalous distributions $\mathcal{D}_{\text{out}}^{\text{test}}$ following Hendrycks & Gimpel (2017). For each learned distribution $\mathcal{D}_{\text{in}}$, the number of test distributions that we compare against is approximately double that of most previous works.
Gaussian anomalies have each dimension i.i.d. sampled from an isotropic Gaussian distribution. Rademacher anomalies are images where each dimension is $-1$ or $1$ with equal probability, so each dimension is sampled from a symmetric Rademacher distribution. Bernoulli images have each pixel sampled from a Bernoulli distribution if the input range is $[0,1]$. Blobs data consist of algorithmically generated amorphous shapes with definite edges. Icons-50 is a dataset of icons and emojis (Hendrycks & Dietterich, 2018); icons from the “Number” class are removed. Textures is a dataset of describable textural images (Cimpoi et al., 2014). Places365 consists of images for scene recognition rather than object recognition (Zhou et al., 2017). LSUN is another scene understanding dataset with fewer classes than Places365 (Yu et al., 2015). ImageNet anomalous examples are taken from the 800 ImageNet-1K classes disjoint from Tiny ImageNet’s 200 classes, and when possible each image is cropped with bounding box information as in Tiny ImageNet. For the Places365 experiment, ImageNet is ImageNet-1K with all 1000 classes. With CIFAR-10 as $\mathcal{D}_{\text{in}}$, we use CIFAR-100 as $\mathcal{D}_{\text{out}}^{\text{test}}$ and vice versa; recall that the CIFAR-10 and CIFAR-100 classes do not overlap. Chars74K is a dataset of photographed characters in various styles; digits and letters such as “O” and “l” were removed since they can look like numbers. Places69 has images from 69 scene categories not found in the Places365 dataset.
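The three synthetic noise distributions at the start of this list are simple to generate. The sketch below flattens each anomaly to a list of floats; the Gaussian mean and standard deviation, the Bernoulli parameter `p=0.5`, and the use of values 0 and 1 for Bernoulli pixels are assumptions, since the text does not fix them.

```python
import random

def gaussian_anomaly(dim, rng, mean=0.0, std=1.0):
    """Each dimension drawn i.i.d. from a Gaussian (mean and std are assumed values)."""
    return [rng.gauss(mean, std) for _ in range(dim)]

def rademacher_anomaly(dim, rng):
    """Each dimension is -1 or +1 with equal probability."""
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]

def bernoulli_anomaly(dim, rng, p=0.5):
    """Each pixel is 1 with probability p and 0 otherwise (p is an assumed value)."""
    return [1.0 if rng.random() < p else 0.0 for _ in range(dim)]

# Example: one flattened 32x32x3 anomaly of each kind, with a fixed seed.
rng = random.Random(0)
dim = 32 * 32 * 3
noise_sets = {
    "gaussian": gaussian_anomaly(dim, rng),
    "rademacher": rademacher_anomaly(dim, rng),
    "bernoulli": bernoulli_anomaly(dim, rng),
}
```

In practice the samples would be reshaped and scaled to match the classifier's expected input range before scoring.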
SNLI is a dataset of premises and hypotheses for natural language inference. We use the hypotheses for $\mathcal{D}_{\text{out}}^{\text{test}}$. IMDB is a sentiment classification dataset of movie reviews, with similar statistics to those of SST. Multi30K is a dataset of English-German image descriptions, of which we use the English descriptions. WMT16 is the English portion of the test set from WMT16. Yelp is a dataset of restaurant reviews. English Web Treebank (EWT) consists of five individual datasets: Answers (A), Email (E), Newsgroups (N), Reviews (R), and Weblog (W). Each contains examples from the indicated domain.
Validation Data. For each experiment, we create a set of validation distributions $\mathcal{D}_{\text{out}}^{\text{val}}$. The first anomalies are uniform noise anomalies where each pixel is sampled from $\mathcal{U}[0,1]$ or $\mathcal{U}[-1,1]$ depending on the input space of the classifier. The remaining $\mathcal{D}_{\text{out}}^{\text{val}}$ validation sources are generated by corrupting in-distribution data, so that the data becomes out-of-distribution. One such source of anomalies is created by taking the pixelwise arithmetic mean of a random pair of in-distribution images. Other anomalies are created by taking the geometric mean of a random pair of in-distribution images. Jigsaw anomalies are created by taking an in-distribution example, partitioning the image into 16 equally sized patches, and permuting those patches. Speckle Noised anomalies are created by applying speckle noise to in-distribution images. RGB Ghosted anomalies involve shifting and reordering the color channels of in-distribution images. Inverted images are anomalies which have some or all of their color channels inverted.
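Several of these corruptions can be sketched on small single-channel images stored as nested lists. This is an illustration rather than the preprocessing used in the experiments: the geometric mean assumes non-negative pixel values, and `jigsaw` assumes the image height and width are divisible by the grid size (a 4 by 4 grid gives the 16 patches mentioned above).

```python
import math
import random

def arithmetic_mean(img_a, img_b):
    """Pixelwise arithmetic mean of two equally sized images."""
    return [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(img_a, img_b)]

def geometric_mean(img_a, img_b):
    """Pixelwise geometric mean; assumes non-negative pixel values."""
    return [[math.sqrt(a * b) for a, b in zip(ra, rb)] for ra, rb in zip(img_a, img_b)]

def jigsaw(img, rng, grid=4):
    """Split the image into grid*grid equal patches and permute them."""
    h, w = len(img), len(img[0])
    ph, pw = h // grid, w // grid
    patches = [
        [row[c * pw:(c + 1) * pw] for row in img[r * ph:(r + 1) * ph]]
        for r in range(grid)
        for c in range(grid)
    ]
    rng.shuffle(patches)
    out = []
    for r in range(grid):
        for i in range(ph):          # i-th pixel row inside this band of patches
            row = []
            for c in range(grid):
                row.extend(patches[r * grid + c][i])
            out.append(row)
    return out
```

The jigsaw corruption keeps every pixel value and only destroys the global layout, which is what makes it a useful near-distribution validation anomaly.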
Appendix B Architectures and Training Details
Appendix C Training from Scratch with Outlier Exposure Usually Improves Detection Performance
Elsewhere we show results for pre-trained networks that are fine-tuned with OE. However, a network trained from scratch which simultaneously trains with OE tends to give superior results. For example, a CIFAR-10 Wide ResNet trained normally obtains a classification error rate of 5.16% and an FPR95 of 34.94%. Fine-tuned, this network has an error rate of 5.27% and an FPR95 of 9.50%. Yet if we instead train the network from scratch and expose it to outliers as it trains, then the error rate is 4.26% and the FPR95 is 6.15%. This architecture corresponds to a 9.50% RMS calibration error with OE fine-tuning, but by training with OE from scratch the RMS calibration error is 6.15%. Compared to fine-tuning, training a network in tandem with OE tends to produce a network with a better error rate, calibration, and OOD detection performance. We use OE for fine-tuning because training from scratch requires more time and sometimes more GPU memory than fine-tuning.
Appendix D OE Works on Other Vision Architectures
Outlier Exposure improves vision OOD detection performance for more than just Wide ResNets. Table 9 shows that it also improves performance for “All Convolutional Networks” (Salimans & Kingma, 2016).
Appendix E Outlier Exposure with $H(\mathcal{U};p)$ Scores Does Better Than with MSP Scores
While $-\max_c f_c(x)$ tends to be a discriminative OOD score for example $x$, models with OE can do better by using $-H(\mathcal{U};f(x))$ instead. This alternative accounts for classes with small probability mass rather than just the class with most mass. Additionally, the model with OE is trained to give anomalous examples a uniform posterior, not just a lower MSP. This simple change roundly aids performance as shown in Table 10. This general performance improvement is most pronounced on datasets with many classes. For instance, when $\mathcal{D}_{\text{in}}$ = CIFAR-100 and $\mathcal{D}_{\text{out}}^{\text{test}}$ = Gaussian, swapping the MSP score with the $H(\mathcal{U};p)$ score increases the AUROC from 76.5% to 97.1%.
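The difference between the two scores can be seen on a small example. The sketch below assumes the MSP score is the negated maximum softmax probability and the alternative is the negated cross-entropy from the uniform distribution to the softmax output, so that a higher value means "more anomalous" in both cases; the clamp constant `eps` is an added safeguard against taking the log of zero.

```python
import math

def msp_score(probs):
    """Negated maximum softmax probability; higher means more anomalous."""
    return -max(probs)

def uniform_ce_score(probs, eps=1e-12):
    """Negated cross-entropy from uniform to probs, i.e. the mean log-probability.

    It is largest when probs is exactly uniform, so higher means more anomalous.
    """
    return sum(math.log(max(p, eps)) for p in probs) / len(probs)

# Two posteriors with the same maximum probability but different tails.
near_uniform = [0.4, 0.2, 0.2, 0.2]
two_peaked = [0.4, 0.4, 0.19, 0.01]

same_msp = msp_score(near_uniform) == msp_score(two_peaked)
more_anomalous = uniform_ce_score(near_uniform) > uniform_ce_score(two_peaked)
```

Here the MSP score cannot tell the two posteriors apart, while the uniform cross-entropy score ranks the near-uniform one as more anomalous because it accounts for the small-probability classes.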
Appendix F Expanded Language Modeling Results
Detailed OOD detection results with language modeling datasets are shown in Table 11.
The datasets come from the English Web Treebank (Bies et al., 2012), which contains text from five different domains: Yahoo! Answers, emails, newsgroups, product reviews, and weblogs. Other NLP datasets we consider do not satisfy the language modeling assumption of continuity in the examples, so we do not evaluate on them.
Appendix G Confidence Calibration
Models integrated into a decision making process should indicate when they are trustworthy, and such models should not have inordinate confidence in their predictions. In an effort to combat a false sense of certainty from overconfident models, we aim to calibrate model confidence. A model is calibrated if its predicted probabilities match empirical frequencies. Thus if a calibrated model predicts an event with 30% probability, then 30% of the time the event transpires. Prior research (Guo et al., 2017; Nguyen & O’Connor, 2015; Kuleshov & Liang, 2015) considers calibrating systems where test-time queries are samples from $\mathcal{D}_{\text{in}}$, but systems also encounter samples from $\mathcal{D}_{\text{out}}^{\text{test}}$ and should also ascribe low confidence to these samples. Hence, we use OE to control the confidence on these samples.
In order to evaluate a multiclass classifier’s calibration, we present three metrics. First we establish context. For input example $x$, let $y$ be the ground truth class. Let $\hat{y}$ be the model’s class prediction, and let $c$ be the corresponding model confidence or prediction probability. Denote the set of prediction-label pairs made by the model with $S=\{(\hat{y}_1,c_1),(\hat{y}_2,c_2),\ldots,(\hat{y}_n,c_n)\}$.
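Along similar lines, the MAD Calibration Error, which is an improper scoring rule due to its use of absolute differences rather than squared differences, is estimated with
$$\sum_{i=1}^{b}\frac{|B_i|}{n}\left|\frac{1}{|B_i|}\sum_{k\in B_i}\mathbb{1}(y_k=\hat{y}_k)-\frac{1}{|B_i|}\sum_{k\in B_i}c_k\right|,$$
where the $n$ evaluation examples are partitioned by confidence into $b$ buckets $B_1,\ldots,B_b$.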
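A bucketed estimate of both calibration errors can be sketched as follows. This is an assumption-laden illustration: examples are sorted by confidence and split into equally sized buckets, each bucket contributes the gap between its accuracy and its mean confidence weighted by its share of the data, and the RMS variant is taken to be the squared-difference analogue of the MAD variant. The bucket count is a free parameter here.

```python
import math

def calibration_errors(confidences, correct, num_buckets=10):
    """Return (rms_error, mad_error) from bucketed accuracy/confidence gaps.

    confidences: list of prediction probabilities in [0, 1].
    correct:     list of booleans, True where the prediction was right.
    """
    pairs = sorted(zip(confidences, correct))
    n = len(pairs)
    size = math.ceil(n / num_buckets)
    sq_sum, abs_sum = 0.0, 0.0
    for start in range(0, n, size):
        bucket = pairs[start:start + size]
        acc = sum(1.0 for _, ok in bucket if ok) / len(bucket)
        conf = sum(c for c, _ in bucket) / len(bucket)
        weight = len(bucket) / n
        sq_sum += weight * (acc - conf) ** 2     # squared gap for the RMS error
        abs_sum += weight * abs(acc - conf)      # absolute gap for the MAD error
    return math.sqrt(sq_sum), abs_sum
```

A model whose confidence equals its accuracy within every bucket scores zero on both metrics, while a model that is fully confident but only 25% accurate scores 0.75 on both.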
Soft F1 Score. If a classifier makes only a few mistakes, then most examples should have high confidence. But if the classifier gives all predictions high confidence, including its mistakes, then the previous metrics will indicate that the model is calibrated on the vast majority of instances, despite having systematic miscalibration. The Soft F1 score (Pastor-Pellicer et al., 2013; Hendrycks & Gimpel, 2017) is suited for measuring the calibration of a system where there is an acute imbalance between mistaken and correct decisions. Since we treat mistakes as positive examples, we can write the model’s confidence that the examples are anomalous with $c_a=(1-c_1,1-c_2,\ldots,1-c_n)$. To indicate that an example is positive (mistaken), we use the vector $m\in\{0,1\}^n$ such that $m_i=\mathbb{1}(y_i\neq\hat{y}_i)$ for $1\le i\le n$. Then the Soft F1 score is
$$\frac{c_a^{\mathsf{T}}m}{\mathbf{1}^{\mathsf{T}}(c_a+m)/2}.$$
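A small sketch of a soft F1 computation is given below. It assumes the score takes the usual F1 form with hard counts replaced by soft ones: the "anomaly confidence" of each example is one minus its prediction confidence, the soft true-positive mass is its overlap with the mistake indicator, and the denominator is the average of the total anomaly confidence and the number of mistakes. The fallback value of 0.0 for an empty denominator is an added convention.

```python
def soft_f1(confidences, correct):
    """Soft F1 with mistakes treated as the positive class.

    confidences: list of prediction probabilities in [0, 1].
    correct:     list of booleans, True where the prediction was right.
    """
    c_a = [1.0 - c for c in confidences]              # confidence that each example is a mistake
    m = [0.0 if ok else 1.0 for ok in correct]        # 1 marks an actual mistake
    overlap = sum(a * b for a, b in zip(c_a, m))      # soft true positives
    denom = (sum(c_a) + sum(m)) / 2
    return overlap / denom if denom > 0 else 0.0

# A model that is unsure exactly where it errs scores 1.0;
# a model that is fully confident on its single mistake scores 0.0.
well_flagged = soft_f1([1.0, 1.0, 0.0], [True, True, False])
overconfident = soft_f1([1.0, 1.0, 1.0], [True, True, False])
```

Because correct predictions never enter the numerator, the score is not inflated by a large majority of easy, confidently correct examples.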
G.2 Setup and Results
Softmax Temperature Tuning. Guo et al. (2017) show that good calibration can be obtained by including a tuned temperature parameter $T$ into the softmax: $p(y=i\mid x)=\exp(l_i/T)/\sum_{j}\exp(l_j/T)$, where $l$ denotes the logits. We tune $T$ to maximize log likelihood on a validation set after the network has been trained on the training set.
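Temperature tuning can be sketched as a one-dimensional search over candidate temperatures. The version below substitutes a plain grid search for whatever optimizer is used in practice, and the candidate list and toy validation set are hypothetical; it divides the logits by the temperature before the softmax and picks the temperature with the highest mean validation log-likelihood.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(v - m) for v in scaled))
    return [v - log_z for v in scaled]

def mean_log_likelihood(logit_rows, labels, temperature):
    """Average log-probability assigned to the true labels at this temperature."""
    total = sum(log_softmax(row, temperature)[y] for row, y in zip(logit_rows, labels))
    return total / len(labels)

def tune_temperature(logit_rows, labels, candidates):
    """Grid search: return the candidate temperature with the best validation likelihood."""
    return max(candidates, key=lambda t: mean_log_likelihood(logit_rows, labels, t))

# Toy validation set: the model is very confident but only 75% accurate,
# so a temperature well above 1 should be selected.
val_logits = [[10.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]
best_t = tune_temperature(val_logits, val_labels, [0.5, 1.0, 2.0, 5.0, 9.0, 20.0])
```

Dividing by a temperature greater than one flattens the softmax without changing the predicted class, so accuracy is unaffected while confidence is lowered.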
Results. In this calibration experiment, the baseline is confidence estimation with softmax temperature tuning. Therefore, we train SVHN, CIFAR-10, CIFAR-100, and Tiny ImageNet classifiers with 5000, 5000, 5000, and 10000 training examples held out, respectively. A copy of this classifier is fine-tuned with Outlier Exposure. Then we determine the optimal temperatures of the original and OE-fine-tuned classifiers on the held-out examples. To measure calibration, we take equally many examples from a given in-distribution dataset $\mathcal{D}_{\text{in}}^{\text{test}}$ and OOD dataset $\mathcal{D}_{\text{out}}^{\text{test}}$. Out-of-distribution points are understood to be incorrectly classified since their label is not in the model’s output space, so calibrated models should assign these out-of-distribution points low confidence. Results are in Table 12. Outlier Exposure noticeably improves model calibration.
G.3 Posterior Rescaling
While temperature tuning improves calibration, the confidence estimate $\hat{p}(y\mid x)$ cannot be less than $1/k$, where $k$ is the number of classes. For an out-of-distribution example like Gaussian Noise, a good model should have no confidence in its prediction over $k$ classes. One possibility is to add a reject option, or a $(k+1)$st class, which we cover in Section 5. A simpler option we found is to perform an affine transformation of $\hat{p}(y\mid x)\in[1/k,1]$ with the formula $(\hat{p}(y\mid x)-1/k)/(1-1/k)\in[0,1]$. This simple transformation makes it possible for a network to express no confidence on an out-of-distribution input and improves calibration performance. As Table 13 shows, this simple $0$-$1$ posterior rescaling technique consistently improves calibration, and the model fine-tuned with OE using temperature tuning and posterior rescaling achieved large calibration improvements.
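The rescaling is a single affine map from the achievable confidence range onto the unit interval. The sketch below assumes the input is the maximum softmax probability of a classifier with `num_classes` outputs, so its smallest possible value is one over the number of classes; the function name is hypothetical and the final clamp only guards against floating-point drift.

```python
def rescale_confidence(p_max, num_classes):
    """Map a confidence in [1/k, 1] onto [0, 1] with an affine transformation."""
    floor = 1.0 / num_classes
    rescaled = (p_max - floor) / (1.0 - floor)
    return min(1.0, max(0.0, rescaled))

# With ten classes, a uniform posterior (confidence 0.1) becomes zero confidence,
# full confidence stays at one, and 0.55 lands halfway between.
no_confidence = rescale_confidence(0.1, 10)
full_confidence = rescale_confidence(1.0, 10)
halfway = rescale_confidence(0.55, 10)
```

The map is monotone, so it leaves the ranking of examples by confidence unchanged and only affects calibration metrics.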
Appendix H Additional ROC and PR Curves
In Figure 4, we show additional PR and ROC Curves using the Tiny ImageNet dataset and various anomalous distributions.