WAIC, but Why? Generative Ensembles for Robust Anomaly Detection

Hyunsun Choi, Eric Jang, Alexander A. Alemi

Introduction

Knowing when a machine learning (ML) model is qualified to make predictions on an input is critical to safe deployment of ML technology in the real world. When training and test distributions differ, neural networks may provide – with high confidence – arbitrary predictions on inputs that they are unaccustomed to seeing. This is known as the Out-of-Distribution (OoD) problem. In addition to ML safety, identifying OoD inputs (also referred to as anomaly detection) is a crucial feature of many data-driven applications, such as credit card fraud detection and monitoring patient health in medical settings.

A typical OoD scenario is as follows: a machine learning model infers a predictive distribution $p({\textnormal{y}}|{\textnormal{x}})$ from input x, which at training time is sampled from a distribution $p({\textnormal{x}})$ . At test-time, $p({\textnormal{y}}|{\textnormal{x}})$ is evaluated on a single input sampled from $q(x)$ , and the objective of OoD detection is to infer whether $p({\textnormal{x}})\equiv q({\textnormal{x}})$ . This problem may seem ill-posed at first glance, because we wish to compare $p({\textnormal{x}})$ and $q({\textnormal{x}})$ using only a single sample from $q(x)$ . Nevertheless, this setup is common when serving ML model predictions over the Internet to anonymous, untrusted users, and is also assumed in adversarial ML literature. Each user generates data from a unique test set $q(x)$ and may only request one prediction.

One approach to OoD detection is to combine a dataset of anomalies with in-distribution data and train a binary classifier to tell them apart, or alternatively, to append a “None of the above” category to a classification model. The classifier then learns a decision boundary, or likelihood ratio, between $p({\textnormal{x}})$ and the anomaly distribution $q({\textnormal{x}})$ . However, the discriminative approach to anomaly detection requires $q({\textnormal{x}})$ to be specified at training time; this is a severe flaw when anomalous data is rare (e.g., medical seizures) or not known ahead of time (e.g., generated by an adversary).

On the other hand, density estimation techniques do not assume an anomaly distribution at training time, and can be used to assign lower likelihoods to OoD inputs (Bishop, 1994). However, we present a couple of concerns about the appropriateness of likelihood models for OoD detection.

When $p(x)$ is unknown, a generative model $p_{\theta}({\textnormal{x}})$ , parameterized by $\theta$ , can be trained to approximate $p({\textnormal{x}})$ from its samples. Generative modeling algorithms have improved dramatically in recent years, and are capable of learning probabilistic models over massive, high-dimensional datasets such as images, video, and natural language (Kingma & Dhariwal, 2018; Vaswani et al., 2017; Wang et al., 2018). Autoregressive Models and Normalizing Flows (NF) are fully-observed likelihood models that construct a tractable log-likelihood approximation to the data-generating density $p({\textnormal{x}})$ (Uria et al., 2016; Dinh et al., 2014; Rezende & Mohamed, 2015). Variational Autoencoders (VAE) are latent variable models that maximize a variational lower bound on log density (Kingma & Welling, 2014; Rezende et al., 2014). Finally, Generative Adversarial Networks (GAN) are implicit density models that minimize a divergence metric between $p({\textnormal{x}})$ and generative distribution $p_{\theta}({\textnormal{x}})$ (Goodfellow et al., 2014).

Likelihood models implemented using neural networks are susceptible to malformed inputs that exploit idiosyncratic computation within the model (Szegedy et al., 2013). When judging natural images, we assume an OoD input $x\sim q({\textnormal{x}})$ should remain OoD within some $L^{p}$ -norm, and yet a Fast Gradient Sign Method (FGSM) attack (Goodfellow et al., 2015) on the predictive distribution can realize extremely high likelihood predictions (Nguyen et al., 2015). Conversely, an FGSM attack in the reverse direction on an in-distribution sample $x\sim p({\textnormal{x}})$ creates a perceptually identical input with low likelihood (Kos et al., 2018).
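
The signed-gradient step behind such attacks can be sketched on a toy density. The example below is only an illustration under stated assumptions: a standard Gaussian with an analytic gradient stands in for a neural likelihood model, and the names `fgsm`, `x_ood`, and `x_in`, the shifted-mean OoD input, and the step size `eps=0.1` are all hypothetical choices rather than anything taken from the experiments.

```python
import numpy as np

def log_density(x):
    """Log-density of N(0, I); a stand-in for a neural likelihood model."""
    return -0.5 * np.sum(x**2) - 0.5 * x.size * np.log(2.0 * np.pi)

def grad_log_density(x):
    """Analytic gradient of log_density with respect to x."""
    return -x

def fgsm(x, eps, direction=+1):
    """One signed-gradient step: direction=+1 raises the likelihood, -1 lowers it."""
    return x + direction * eps * np.sign(grad_log_density(x))

rng = np.random.default_rng(0)
x_ood = rng.standard_normal(784) + 3.0   # hypothetical OoD input (shifted mean)
x_in = rng.standard_normal(784)          # in-distribution input

x_ood_adv = fgsm(x_ood, eps=0.1, direction=+1)   # push OoD input toward high likelihood
x_in_adv = fgsm(x_in, eps=0.1, direction=-1)     # push in-distribution input toward low likelihood

print("OoD:", log_density(x_ood), "->", log_density(x_ood_adv))
print("in-distribution:", log_density(x_in), "->", log_density(x_in_adv))
```

Each coordinate moves by at most `eps`, so the perturbed input stays within a small $L^{\infty}$ ball of the original while its log-likelihood shifts in the chosen direction.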

An earlier version of this paper, concurrently with work by Nalisnick et al. (2018) and Hendrycks et al. (2018), showed that likelihood models can be fooled by OoD datasets that are not even adversarial by construction. As shown in Figure 1, a likelihood model trained on the CIFAR-10 dataset yields higher likelihood predictions on SVHN images than on the CIFAR-10 training data itself! For a flow-based model with an isotropic Gaussian prior, this implies that SVHN images are systematically projected closer to the origin than the training data (Nalisnick et al., 2018).

2 Likelihood and Typicality

Even if we could compute likelihoods exactly, likelihood may not be a sufficient measure for scoring OoD inputs. It is tempting to suggest a simple one-tailed test in which lower likelihoods are OoD, but the intuition that in-distribution inputs should have the highest likelihoods does not hold in higher dimensions. For instance, consider an isotropic 784-dimensional Gaussian. A data point at the origin has maximum likelihood, and yet it is highly atypical because the vast majority of probability mass lies in an annulus of radius $\sqrt{784}=28$ . Likelihoods can determine whether a point lies in the support of a distribution, but do not reveal where the probability mass is concentrated.
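
The gap between likelihood and typicality is easy to check numerically. The following is a minimal sketch, assuming a standard isotropic Gaussian in 784 dimensions; the sample count, the seed, and the helper name `gaussian_log_density` are illustrative choices.

```python
import numpy as np

def gaussian_log_density(x):
    """Log-density of a standard isotropic Gaussian at each row of x."""
    d = x.shape[-1]
    return -0.5 * np.sum(x**2, axis=-1) - 0.5 * d * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
d = 784
samples = rng.standard_normal((10_000, d))   # draws from N(0, I)
radii = np.linalg.norm(samples, axis=1)      # distance of each draw from the origin

origin = np.zeros((1, d))
log_p_origin = gaussian_log_density(origin)[0]
log_p_samples = gaussian_log_density(samples)

print("sqrt(d):", np.sqrt(d))
print("mean radius:", radii.mean(), "min radius:", radii.min())
print("log p(origin) - max log p(sample):", log_p_origin - log_p_samples.max())
```

The origin attains a higher log-density than every sampled point, yet no sample lands anywhere near it: the radii cluster tightly around $\sqrt{784}$.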

The main contributions of this work are as follows. First, we observe that likelihood models assign higher densities to OoD datasets than the ones they are trained on (SVHN for CIFAR-10 models, and MNIST for Fashion MNIST models). Second, we propose Generative Ensembles, an anomaly detection algorithm that combines density estimation with uncertainty estimation. Generative Ensembles are trained independently of the task model, and can be implemented using exact or approximate likelihood models. We also demonstrate how predictive uncertainty can be applied to robustify implicit GANs and leverage them for anomaly detection. Finally, we present yet another surprising property of deep generative models that warrants further explanation: density estimation should not be able to account for probability mass, and yet Generative Ensembles outperform OoD baselines on the majority of common OoD detection problems, and demonstrate competitive results with discriminative classification approaches on the Kaggle Credit Fraud dataset.

Related Work

Anomaly detection methods are closely intertwined with techniques used in uncertainty estimation, adversarial defense literature, and novelty detection.

OoD detection is closely related to the problem of uncertainty estimation, whose goal is to yield calibrated confidence measures for a model’s predictive distribution $p({\textnormal{y}}|{\textnormal{x}})$ . Well-calibrated uncertainty estimation integrates several forms of uncertainty into $p({\textnormal{y}}|{\textnormal{x}})$ : model misspecification uncertainty (OoD detection of invalid inputs), aleatoric uncertainty (irreducible input noise for valid inputs), and epistemic uncertainty (unknown model parameters for valid inputs). In this paper, we study OoD detection in isolation; instead of considering whether $p({\textnormal{y}}|{\textnormal{x}})$ should be trusted for a given $x$ , we are trying to determine whether $x$ should be fed into $p({\textnormal{y}}|{\textnormal{x}})$ at all.

Predictive uncertainty estimation is a model-dependent OoD technique because it depends on task-specific information (such as labels and task model architecture) in order to yield an integrated estimate of uncertainty. ODIN (Liang et al., 2018), MC Dropout (Gal & Ghahramani, 2016) and DeepEnsemble (Lakshminarayanan et al., 2017) model a calibrated predictive distribution for a classification task. Variational information bottleneck (VIB) (Alemi et al., 2018b) performs divergence estimation in latent space to detect OoD, but is technically a model-dependent technique because the latent code is trained jointly with the downstream classification task.

One limitation of model-dependent OoD techniques is that they may discard information about $p({\textnormal{x}})$ in learning the task-specific model $p(y|x)$ . Consider a contrived binary classification model on images that learns to solve the task perfectly by discarding all information except the contents of the first pixel (no other information is preserved in the features). Subsequently, the model yields confident predictions on any distribution that happens to preserve identical first-pixel statistics. In contrast, density estimation in data space $x$ considers the structure of the entire input manifold, without bias towards a particular downstream task or task-specific compression.

In our work we estimate predictive uncertainty of the scoring model itself. Unlike predictive uncertainty methods applied to the task model’s predictions, Generative Ensembles do not require task-specific labels to train. Furthermore, model-independent OoD detection aids interpretation of predictive uncertainty by isolating the uncertainty component arising from OoD inputs.

2 Adversarial Defense

Although adversarial attack and defense literature usually considers small $L^{p}$ -norm modifications to input (demonstrating the alarming sensitivity of neural networks), there is no such restriction in practice to the degree with which an input can be perturbed in a test setting. We consider adversarial inputs in the broader context of the OoD problem, where inputs can be swapped with other datasets, transformed and corrupted (Hendrycks & Dietterich, 2019), or are explicitly designed to fool the model.

Song et al. (2018) observe that adversarial examples designed to fool a downstream task tend to have low likelihoods under an independent generative model. They propose a “data purification” pipeline, where inputs are modified via gradient ascent on model likelihood before being passed to the classifier. Their evaluations are restricted to $L^{p}$ -norm attacks on in-distribution inputs on the classifier, and do not take into account that the generative model itself may be susceptible to OoD errors. In fact, gradient ascent on model likelihood has the exact opposite of the desired effect when the input is OoD to begin with. In our experiments we measure the degree to which we can identify adversarially perturbed “rubbish inputs” as anomalies, and also note that adversarial examples for IWAE predictions have high rates under the latent code, making them suitable for anomaly detection.

3 Novelty Detection

When learning signals are scarce, such as in reinforcement learning (RL) with sparse rewards, the anomaly detection problem is re-framed as novelty detection, whereby an agent attempts to visit states that are OoD with respect to previous experience (Fu et al., 2017; Marsland, 2003). Anomaly detection algorithms, including our proposed Generative Ensembles, are directly applicable as novelty bonuses for exploration. However, we point out in Section 1.2 that likelihoods are deceiving in high dimensions: the point of highest density under a high-dimensional state distribution may be exceedingly rare.

Generative Ensembles

We introduce Generative Ensembles, a novel anomaly detection algorithm that combines likelihood models with predictive uncertainty estimation via ensemble variance. Concretely, an ensemble of generative models that compute exact or approximate likelihoods (autoregressive models, flow-based models, VAEs) is used to estimate the Watanabe-Akaike Information Criterion (WAIC).

The correction term subtracts the variance in likelihoods across independent samples from the posterior. This acts to robustify our estimate, ensuring that points that are sensitive to the particular choice of posterior parameters are penalized.

In this work we do not have exact posterior samples, so we instead utilize independently trained model instances as a proxy for posterior samples, following Lakshminarayanan et al. (2017). Being trained with Stochastic Gradient Descent (SGD), the independent models in the ensemble act as approximate posterior samples (Mandt et al., 2017).
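
A minimal sketch of the resulting score, assuming it takes the form $\mathbb{E}_{\theta}[\log p_{\theta}({\textnormal{x}})]-\mathrm{Var}_{\theta}[\log p_{\theta}({\textnormal{x}})]$ with both moments estimated over the ensemble members, which is consistent with the variance correction described above. The function name `waic` and the example log-likelihood values are hypothetical.

```python
import numpy as np

def waic(log_likelihoods):
    """Ensemble WAIC estimate per input.

    log_likelihoods: array of shape (K, N), where entry [k, n] is
    log p_{theta_k}(x_n) under the k-th independently trained model.
    Returns an array of shape (N,): mean over models minus variance over models.
    """
    log_likelihoods = np.asarray(log_likelihoods, dtype=float)
    mean = log_likelihoods.mean(axis=0)
    var = log_likelihoods.var(axis=0)   # population variance across the K models
    return mean - var

# Hypothetical log-likelihoods from K=3 models on N=2 inputs.
# Input 0: the models agree. Input 1: same mean, but the models disagree.
ll = np.array([[-100.0, -90.0],
               [-100.0, -100.0],
               [-100.0, -110.0]])
scores = waic(ll)
print(scores)
```

Both inputs have the same average log-likelihood, but the second is penalized because the ensemble members disagree about it. Lower WAIC is treated as more anomalous.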

2 Does WAIC Address Typicality?

Although WAIC can protect models against likelihood estimation errors, we show via a toy model that it should not be able to distinguish whether a point is in the typical set. We illustrate this phenomenon in Figure 2 on ensembles of Gaussians fitted to samples from an isotropic Gaussian with leave-one-out cross validation. Like likelihoods, the WAIC measure also decreases monotonically from the origin.
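
A toy experiment in this spirit can be sketched as follows. This is not the setup behind Figure 2: the dimension, sample size, number of ensemble members, and the leave-one-fold-out split (used here in place of leave-one-out) are all assumptions made to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 2, 1000, 5                      # assumed toy sizes, not the paper's
data = rng.standard_normal((n, d))        # samples from an isotropic Gaussian
folds = np.array_split(rng.permutation(n), k)

def fit_gaussian(x):
    """Sample mean and (unbiased) sample covariance of the rows of x."""
    return x.mean(axis=0), np.cov(x, rowvar=False)

def log_density(x, mean, cov):
    """Log-density of N(mean, cov) at each row of x."""
    diff = x - mean
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + x.shape[1] * np.log(2.0 * np.pi))

# One ensemble member per held-out fold.
members = [fit_gaussian(np.delete(data, fold, axis=0)) for fold in folds]

# Query points at increasing distance from the origin along the first axis.
radii = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
queries = np.zeros((len(radii), d))
queries[:, 0] = radii

ll = np.stack([log_density(queries, m, c) for m, c in members])  # shape (k, 5)
waic = ll.mean(axis=0) - ll.var(axis=0)
print(waic)
```

In this sketch the score is highest at the origin and falls as the query point moves outward, so a threshold on WAIC alone cannot flag the origin as atypical.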

Given that SVHN latent codes lie interior to the CIFAR-10 annulus (Figure 1), WAIC should fail to distinguish SVHN as OoD. To our surprise, we show in Figure 3 and Table 1 that not only does WAIC reject SVHN as OoD, but it outperforms a baseline that does test for typicality by measuring the Euclidean distance to the origin!

Glow models ought to map the training distribution (CIFAR-10) to a distribution whose probability mass is concentrated in an annulus of radius $\sqrt{3072}\approx 55.4$ , but as we show in Figure 4, the Glow model actually maps CIFAR, SVHN, and Celeb-A to annuli whose mean radii span the range 42 to 54. In our experiments, the variance of radii across models is larger for SVHN and Celeb-A than for CIFAR-10, allowing the WAIC metric to correctly identify these as anomalies despite prior hypotheses that WAIC is insufficient.

3 Generative Ensembles of GANs

We describe how to improve GAN-based anomaly detection with ensembles. Although we cannot readily estimate WAIC with GANs, we can still leverage the principle of epistemic uncertainty to improve a GAN discriminator’s ability to detect OoD inputs.

Figure 5b illustrates a simple 2D density modeling task where individual GAN discriminators – when trained to convergence – learn a discriminative boundary that does not adequately capture $p({\textnormal{x}})$ . Unsurprisingly, a discriminative model tasked with classifying between $p({\textnormal{x}})$ and $q({\textnormal{x}})$ performs poorly when presented with inputs that belong to neither distribution. Despite this apparent shortcoming, GAN discriminators are still applied successfully to anomaly detection problems (Schlegl et al., 2017; Deecke et al., 2018; Kliger & Fleishman, 2018).

Unlike discriminative anomaly classifiers, which model $p({\textnormal{x}})/q({\textnormal{x}})$ for a static $q({\textnormal{x}})$ , the generative distribution $p_{\theta}({\textnormal{x}})$ of a GAN is trained jointly with the discriminator. The likelihood ratio $p({\textnormal{x}})/p_{\theta}({\textnormal{x}})$ learned by a GAN discriminator is uniquely randomized by GAN training dynamics on $\theta$ (Figure 5b). By training an ensemble of GANs, we can recover an (unnormalized) approximation of $p({\textnormal{x}})$ via decision boundaries between $p({\textnormal{x}})$ and randomly sampled $p_{\theta}({\textnormal{x}})$ (Figure 5c). We implement an anomaly detector using the variance of the discriminator logit, and show that although ensembling GANs leads to far more effective OoD detection than a single GAN (Supplemental Material), it does not outperform our WAIC-based Generative Ensembles.
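
The logit-variance detector can be sketched in a few lines. The example assumes each discriminator is available as a callable returning one logit per input; the linear stand-in discriminators and the function name `logit_variance_score` are hypothetical and only serve to make the sketch runnable.

```python
import numpy as np

def logit_variance_score(discriminators, x):
    """Anomaly score: variance of discriminator logits across the ensemble.

    discriminators: list of callables mapping an (N, D) batch to (N,) logits.
    Returns an (N,) array; larger values mean the ensemble disagrees more.
    """
    logits = np.stack([d(x) for d in discriminators])  # shape (K, N)
    return logits.var(axis=0)

# Hypothetical stand-ins for trained discriminators: linear logits whose
# weights differ per ensemble member, so they agree only at the origin.
rng = np.random.default_rng(0)
weights = rng.standard_normal((5, 2))
discriminators = [lambda x, w=w: x @ w for w in weights]

x = np.array([[0.0, 0.0],     # point where all members agree
              [5.0, 5.0]])    # point where members disagree
scores = logit_variance_score(discriminators, x)
print(scores)
```

Inputs on which the ensemble members disagree receive a larger score and would be flagged first when thresholding.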

Experimental Results

Following the experiments proposed by Liang et al. (2018) and Alemi et al. (2018b), we train OoD models on the MNIST, Fashion MNIST, and CIFAR-10 datasets, and evaluate anomaly detection on test samples from other datasets. Source code for VAE and GAN experiments is located at https://github.com/hschoi1/rich_latent, and code for reproducing ODIN baselines is located at https://github.com/ericjang/odin.
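
Detection quality in this kind of evaluation is summarized with AUROC. A minimal sketch of that metric, assuming each input has a scalar anomaly score where larger means more anomalous; the function name `auroc` and the example scores are illustrative.

```python
import numpy as np

def auroc(scores_in, scores_ood):
    """Probability that a random OoD input gets a higher anomaly score than a
    random in-distribution input, counting ties as one half."""
    s_in = np.asarray(scores_in, dtype=float)[:, None]    # shape (N_in, 1)
    s_ood = np.asarray(scores_ood, dtype=float)[None, :]  # shape (1, N_ood)
    greater = (s_ood > s_in).mean()   # fraction of pairs ranked correctly
    ties = (s_ood == s_in).mean()     # fraction of tied pairs
    return greater + 0.5 * ties

print(auroc([0.1, 0.4, 0.35], [0.8, 0.3]))
```

For a score such as WAIC or log-likelihood, where larger values indicate in-distribution inputs, the negated score would be passed instead.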

The baseline methods, ODIN and VIB, are dependent on a classification task and learn from the joint distribution of images and labels, while our methods use only images. For the VIB baseline, we use the rate term as the threshold variable. The experiments in Alemi et al. (2018b) make use of (28, 28, 5) “location-aware” features concatenated to the model inputs, to assist in distinguishing spatial inhomogeneities in the data. In this work we train vanilla generative models with no special modifications, so we also train VIB without location-aware features. For CIFAR-10 experiments, we train VIB for 26 epochs and converge at 75.7% classification accuracy on the test set. All other experimental parameters for VIB are identical to those in Alemi et al. (2018b).

Despite being trained without labels, our methods – in particular Generative Ensembles – outperform ODIN and VIB on most OoD tasks. The VAE rate term is robust to adversarial inputs on the IWAE, because the FGSM perturbation primarily minimizes the (larger) distortion component of the variational lower bound. The performance of VAE rate versus VIB also suggests that latent codes learned from generative objectives are more useful for OoD detection than latent codes learned via a classification-specific objective. Surprisingly, despite not being a general estimate of typicality, WAIC dramatically outperforms the $\left\|d\right\|$ baseline.

We present in Figure 6 images from OoD datasets with the smallest and largest WAICs, respectively. Models trained on FashionMNIST tend to assign higher likelihoods to straight, vertical objects like “1’s” and pants. It is not altogether surprising that AUROC scores on the HFlip transformation were low, given that many pants and dresses are symmetric.

We consider the problem of detecting fraudulent credit card transactions from the Kaggle Credit Fraud Challenge (Dal Pozzolo et al., 2015). A conventional approach to fraud detection is to include a small fraction of fraudulent transactions in the training set, and then learn a discriminative classifier. Instead, we treat fraud detection as an anomaly detection problem where only normal credit card transactions are available at training time. This is motivated by realistic test scenarios, where an adversary is hardly restricted to generating data identically distributed to the training set.

Unsurprisingly, the classifier baseline performs best because fraudulent test samples are distributed identically to fraudulent training samples. Even so, the single-model density estimation and Generative Ensemble achieve reasonable results.

2 Failure Analysis

In this section we discuss the experiments in which Generative Ensembles performed poorly, and suggest simple fixes to address these issues.

In our early experiments, we found that a VAE trained on Fashion MNIST performed poorly on all OoD datasets when using $p_{\theta}({\textnormal{x}})$ and WAIC metrics. This was surprising, since the same metrics performed well when the same VAE architecture was trained on MNIST. To explain this phenomenon, we show in Figure 7 inputs and VAE-decoded outputs from Fashion MNIST and MNIST test sets. Fashion MNIST images are reconstructed properly, while MNIST images are barely recognizable after decoding.

A VAE’s training objective can be interpreted as the sum of a pixel-wise autoencoding loss (distortion) and a “semantic” loss (rate). Even though Fashion MNIST appears to be better reconstructed in a semantic sense, the distortion values between the FashionMNIST and MNIST test datasets are numerically quite similar, as shown in Figure 7. Distortion terms make up the bulk of the IWAE predictions in our models, thus explaining why $p_{\theta}({\textnormal{x}})$ was not very discriminative when classifying OoD MNIST examples.
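
The rate and distortion decomposition can be made concrete with a small sketch. It assumes a diagonal Gaussian encoder, a standard normal prior (the experiments here use a learned prior instead), a Bernoulli decoder over binary pixels, and a single latent sample for the distortion term; all function names are hypothetical.

```python
import numpy as np

def rate(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) per example, in nats."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def distortion(x, x_logits):
    """Bernoulli negative log-likelihood of pixels x under decoder logits, per example."""
    # softplus(logit) - x * logit, computed stably with logaddexp
    return np.sum(np.logaddexp(0.0, x_logits) - x * x_logits, axis=-1)

def negative_elbo(x, x_logits, mu, log_var):
    """Single-sample negative variational lower bound: distortion plus rate."""
    return distortion(x, x_logits) + rate(mu, log_var)

# Hypothetical encoder and decoder outputs for one 2-pixel input with a 2-d latent.
x = np.array([[1.0, 0.0]])
x_logits = np.array([[0.0, 0.0]])
mu = np.array([[1.0, 1.0]])
log_var = np.array([[0.0, 0.0]])
print(distortion(x, x_logits), rate(mu, log_var), negative_elbo(x, x_logits, mu, log_var))
```

Because the distortion sums over every pixel while the rate sums over a much smaller latent code, the distortion can dominate the total even when the rate differs sharply between datasets.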

Discussion and Future Work

Out-of-Distribution (OoD) detection is a critical piece of infrastructure for ML applications where the test data distribution is not known at training time. Our paper has two main contributions. The more straightforward one is a novel technique for OoD detection in which ensembles of generative models can be used to estimate WAIC. We perform a comparison to prior techniques published in OoD literature to show that it is competitive with past models and argue for why OoD detection should be done “model-free”.

The other contribution is the intriguing observation that WAIC shouldn’t work! It has been established in the literature (Nalisnick et al., 2018; Hendrycks et al., 2018) that likelihood alone is not sufficient for determining whether data is out of distribution. Simply put, regions with high likelihood are not necessarily regions with high probability mass. It is premature, however, to invoke this explanation as the sole reason generative models trained on CIFAR-10 fail to identify SVHN as OoD, as we show, e pur si muove, that robust likelihood measures like WAIC can distinguish the samples correctly. As we show in Figure 4, the maximum likelihood objective typically used in flow-based models forces the training distribution (CIFAR-10) towards the region of maximum likelihood in latent space. We hypothesize that these datasets are quite different, different enough that the likelihoods of SVHN images can be very sensitive to the initial conditions and architectural hyperparameters of the trained generative models. The surprising effectiveness of WAIC motivates further investigation into robust measures of typicality that incorporate both likelihood and some notion of local volume, as it is clear that there is an incomplete theoretical understanding within the research community of the limits of likelihood-based OoD detection.

Furthermore, the observation that CIFAR-10 is mapped to an annulus of radius $<\sqrt{3072}$ is in itself quite disturbing, as it suggests that better flow-based generative models (for sampling) can be obtained by encouraging the training distribution to overlap better with the typical set in latent space.

Acknowledgements

We thank Manoj Kumar, Peter Liu, Jie Ren, Justin Gilmer, Ben Poole, Augustus Odena, and Balaji Lakshminarayanan for code and valuable discussion. We would also like to thank all the organizers and participants of DL Jeju Camp 2018.

References

Appendix A Terminology and Abbreviations

Appendix B OoD Detection with GAN Discriminators

Appendix C VAE Architectural Details

We use a flexible learned prior $p_{\theta}({\textnormal{z}})$ in our VAE experiments, but did not observe a significant performance difference compared to the default mixture prior in the base VAE code sample. We use an alternating chain of 6 MAF bijectors and 6 random permutation bijectors. Each MAF bijector uses TensorFlow Probability’s default implementation with the following parameter: