Noise Contrastive Priors for Functional Uncertainty

Danijar Hafner, Dustin Tran, Timothy Lillicrap, Alex Irpan, James Davidson

1 Introduction

Many successful applications of neural networks [Krizhevsky et al., 2012, Sutskever et al., 2014, van den Oord et al., 2016] are in restricted settings where predictions are only made for inputs similar to the training distribution. In real-world scenarios, neural networks can face truly novel data points during inference, and in these settings it can be valuable to have good estimates of the model’s uncertainty. For example, in healthcare, reliable uncertainty estimates can prevent overconfident decisions for rare or novel patient conditions [Schulam and Saria, 2015]. Another application is autonomous agents that should actively explore their environment, which requires uncertainty estimates to decide which data points will be most informative.

Epistemic uncertainty describes the amount of missing knowledge about the data generating function. Uncertainty can in principle be completely reduced by observing more data points at the right locations and training on them. In contrast, the data generating function may also have inherent randomness, which we call aleatoric noise. This noise can be captured by models outputting a distribution rather than a point prediction. Obtaining more data points allows the noise estimate to move closer to the true value, which is usually different from zero. For active learning, it is crucial to separate the two types of randomness: we want to acquire labels in regions of high uncertainty but low noise [Lindley et al., 1956].

Bayesian analysis provides a principled approach to modeling uncertainty in neural networks [Denker et al., 1987, MacKay, 1992b]. Namely, one places a prior over the network’s weights and biases. This induces a distribution over the functions that the network represents, capturing uncertainty about which function best fits the data. Specifying this prior remains an open challenge. Common practice is to use an independent normal prior in weight space, which is neither informative about the induced function class nor the data (e.g., it is sensitive to parameterization). This can cause the induced function posterior to generalize in unforeseen ways on out-of-distribution (OOD) inputs, which are inputs outside of the distribution that generated the training data.

Motivated by these challenges, we introduce noise contrastive priors (NCPs), which encourage uncertainty outside of the training distribution through a loss in data space. NCPs are compatible with any model that represents functional uncertainty as a random variable, are easy to scale, and yield reliable uncertainty estimates that show significantly improved active learning performance.

2 Noise Contrastive Priors

Specifying priors is intuitive for small probabilistic models, where each variable often has a clear interpretation [Blei, 2014]. It is less intuitive for neural networks, where the parameters serve more as adaptive basis coefficients in a nonparametric function. For example, neural network models are non-identifiable due to weight symmetries that yield the same function [Müller and Insua, 1998]. This makes it difficult to express informative priors on the weights, such as expressing high uncertainty on unfamiliar examples.

To prevent overconfident predictions, a good input prior $p_{\rm prior}(x)$ should include OOD examples so that it acts beyond the training distribution. A good output prior $p_{\rm prior}(y\mid x)$ should be a high-entropy distribution, representing high uncertainty about the model output given OOD inputs.

Generating OOD inputs

Exactly generating OOD data is difficult. A priori, we must uniformly represent the input domain. A posteriori, we must represent the complement of the training distribution. Both distributions are typically uniform over infinite support, making them ill-defined. To estimate OOD inputs, we develop an algorithm inspired by noise contrastive estimation [Gutmann and Hyvärinen, 2010, Mnih and Kavukcuoglu, 2013], where a complement distribution is approximated using random noise.

A hypothesis of our work is that in practice it is enough to encourage high uncertainty output near the boundary of the training distribution, and that this effect will propagate to the entire OOD space. This hypothesis is backed up by previous work [Lee et al., 2017] as well as our experiments (see Figure 1). This means we no longer need to sample arbitrary OOD inputs. It is enough to sample OOD points that lie close to the boundary of the training distribution, and to apply our desired prior at those points.

Parameter Estimation

The variances $\sigma_{x}^{2}$ and $\sigma_{y}^{2}$ are hyperparameters that tune how far from the boundary we sample and how large we want the output uncertainty to be. We choose $\mu_{x}=0$ to apply the prior equally in all directions from the data points. The output mean $\mu_{y}$ determines the default prediction of the model outside of the training distribution, for example $\mu_{y}=0$. We instead set $\mu_{y}=y$, which corresponds to data augmentation [Matsuoka, 1992, An, 1996], where a model is trained to recover the true labels from perturbed inputs. This way, NCP makes the model uncertain but still encourages its prediction to generalize to OOD inputs.
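The sampling scheme described above can be sketched in a few lines. This is a minimal illustration rather than the released implementation: the function names, the toy data, and the values of `sigma_x` and `sigma_y` are assumptions chosen for the example.

```python
import numpy as np


def sample_ncp_inputs(x_train, sigma_x, rng):
    """Perturb training inputs with zero-mean Gaussian noise (mu_x = 0)."""
    noise = rng.normal(loc=0.0, scale=sigma_x, size=x_train.shape)
    return x_train + noise


def output_prior(y_train, sigma_y):
    """Output prior centered on the true labels (mu_y = y) with std sigma_y."""
    prior_mean = np.asarray(y_train, dtype=float)
    prior_std = np.full(prior_mean.shape, sigma_y, dtype=float)
    return prior_mean, prior_std


# Hypothetical toy data, only used to exercise the two functions.
rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=(100, 1))
y_train = np.sin(3.0 * x_train)

x_tilde = sample_ncp_inputs(x_train, sigma_x=0.5, rng=rng)
prior_mean, prior_std = output_prior(y_train, sigma_y=1.0)
```

Each perturbed input `x_tilde[i]` is paired with the label of the training point it was generated from, which is what makes the choice $\mu_{y}=y$ resemble data augmentation.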

For training, we minimize the loss function

The first term represents typical maximum likelihood. The second term is added by our method: it represents the analogous term on a data prior, where maximum likelihood can be derived as minimizing a KL divergence to the empirical training distribution $p_{\rm train}(y\mid x)$ over training inputs. The hyperparameter $\gamma$ sets the relative influence of the prior, allowing a trade-off between the two terms.
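As a rough sketch of this two-term objective, assume that both the model output and the output prior are univariate normal distributions, so that the KL divergence has a closed form. The function names and argument layout below are illustrative assumptions; the direction of the KL (prior on the left, model on the right) follows the forward KL described in Appendix B.

```python
import numpy as np


def gaussian_nll(y, mean, std):
    """Negative log density of y under Normal(mean, std^2), elementwise."""
    return 0.5 * np.log(2.0 * np.pi * std**2) + 0.5 * ((y - mean) / std) ** 2


def gaussian_kl(mean_p, std_p, mean_q, std_q):
    """KL[Normal(mean_p, std_p^2) || Normal(mean_q, std_q^2)], elementwise."""
    return (np.log(std_q / std_p)
            + (std_p**2 + (mean_p - mean_q) ** 2) / (2.0 * std_q**2)
            - 0.5)


def ncp_loss(y, pred_mean, pred_std,
             prior_mean, prior_std, ood_mean, ood_std, gamma):
    """Maximum likelihood on training data plus gamma times the prior KL.

    pred_* are model outputs at the training inputs; ood_* are model
    outputs at the perturbed inputs sampled from the input prior.
    """
    nll = np.mean(gaussian_nll(y, pred_mean, pred_std))
    kl = np.mean(gaussian_kl(prior_mean, prior_std, ood_mean, ood_std))
    return nll + gamma * kl
```

With `gamma=0` the sketch reduces to plain maximum likelihood. Larger values pull the model output at perturbed inputs toward the high-entropy output prior.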

Interpretation as function prior

The noise contrastive prior can be interpreted as inducing a function prior. This is formalized through the prior predictive distribution,

Because network weights are trained to fit the data prior, the prior acts as “pseudo-data.” This is similar to classical work on conjugate priors: a $\operatorname{Beta}(\alpha,\beta)$ prior on the probability of a Bernoulli likelihood implies a Beta posterior, and if the posterior mode is chosen as an optimal parameter setting, then the prior translates to $\alpha-1$ successes and $\beta-1$ failures. It is also similar to pseudo-data in sparse Gaussian processes [Quiñonero-Candela and Rasmussen, 2005].
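The pseudo-data reading of the Beta prior can be made explicit with a short worked calculation. It uses only standard conjugacy; the symbols $s$ and $f$ for the observed numbers of successes and failures are introduced here for illustration, and the mode formula assumes $\alpha+s>1$ and $\beta+f>1$.

```latex
\begin{align*}
p(\pi \mid s, f) &= \operatorname{Beta}(\pi \mid \alpha + s,\ \beta + f), \\
\pi_{\text{mode}} &= \frac{\alpha + s - 1}{\alpha + \beta + s + f - 2}
  = \frac{s + (\alpha - 1)}{\bigl(s + (\alpha - 1)\bigr) + \bigl(f + (\beta - 1)\bigr)}.
\end{align*}
```

The right-hand side is the maximum likelihood estimate for a Bernoulli parameter after observing $s+(\alpha-1)$ successes and $f+(\beta-1)$ failures, so the prior acts exactly like $\alpha-1$ extra successes and $\beta-1$ extra failures.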

Data priors encourage learning parameters that not only capture the training data well but also the prior data. In practice, we can combine NCP with other priors, for example the typical normal prior in weight space for Bayesian neural networks, although we did not find this necessary in our experiments.

3 Variational Inference with NCP

In this section, we apply a Bayesian treatment of NCP where we perform posterior inference instead of point estimation. Consider a regression task that we model as $p(y\mid x,\theta)=\operatorname{Normal}(\mu(x),\sigma^{2}(x))$ with mean and variance predicted by a neural network from the inputs. This model is heteroskedastic, meaning that it can predict a different aleatoric noise amount for every point in the input space. We apply NCP to posit epistemic uncertainty on the output of the mean $\mu$, and we infer the induced weight posterior for only the output layer [Lázaro-Gredilla and Figueiras-Vidal, 2010, Calandra et al., 2014] that predicts the mean. This results in the model

where $q_{\phi}(\theta)$ forms an approximate posterior over weights. We do not model uncertainty about the noise estimate, as this is not required for the approximation of the Gaussian expected information gain [MacKay, 1992a] that we will use to acquire labels. The distribution of the mean induced by the weight posterior, $q(\mu(x))=\int\mu(x,\theta)q_{\phi}(\theta)\,\mathrm{d}\theta$, represents epistemic uncertainty. Note that this is different from the predictive distribution, which combines both epistemic uncertainty and aleatoric noise. The loss function is
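To illustrate the distinction between the distribution of the mean and the predictive distribution, the sketch below assumes that the output layer is linear in the last hidden features and that its weights have an independent normal approximate posterior. Under these assumptions, $q(\mu(x))$ is itself normal and its moments are available in closed form. The names `features`, `w_mean`, and `w_std` are hypothetical.

```python
import numpy as np


def mean_distribution(features, w_mean, w_std, b_mean=0.0, b_std=0.0):
    """Moments of mu(x) = features @ w + b for independent normal w and b.

    Returns the mean and standard deviation of mu(x) for every row of
    `features`; the standard deviation measures epistemic uncertainty.
    """
    mu_mean = features @ w_mean + b_mean
    mu_var = (features**2) @ (w_std**2) + b_std**2
    return mu_mean, np.sqrt(mu_var)


def predictive_std(mu_std, noise_std):
    """Std of the predictive distribution: epistemic plus aleatoric variance."""
    return np.sqrt(mu_std**2 + noise_std**2)


features = np.array([[1.0, 2.0]])
mu_mean, mu_std = mean_distribution(
    features, w_mean=np.array([0.5, -1.0]), w_std=np.array([0.1, 0.2]))
total_std = predictive_std(mu_std, noise_std=np.array([0.3]))
```

Only `mu_std` shrinks as more data is observed near an input. The noise term `noise_std` is predicted separately by the network and remains in the predictive distribution.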

Note Equation 5’s relationship to the variational lower bound for typical Bayesian neural networks [Blundell et al., 2015]. The expected log likelihood term is the same. Only the KL divergence differs in that it now penalizes the approximate posterior in output space rather than in weight space. This change avoids a common pathology in variational Bayesian neural network training where the variational distribution collapses to the prior; this pathology happens on a per-dimension basis as the KL decomposes into a sum of KLs for each weight dimension, making it easy for many dimensions to collapse [Bowman et al., 2015]. By penalizing deviations in output space, the approximate posterior can only collapse if the entire predictive distribution is set to the output prior; this is hard to achieve as the model would pay a large cost in the data misfit (log-likelihood) term because little capacity remains to additionally fit the data.

The loss function is an (approximate) lower bound to the log-marginal likelihood. See Appendix B for its derivation via reparameterizing the original KL in weight space, resulting in the reverse KL divergence known from variational inference.

4 Related Work

Classic work has investigated entropic priors [Buntine and Weigend, 1991] and hierarchical priors [MacKay, 1992b, Neal, 2012, Lampinen and Vehtari, 2001]. More recently, Depeweg et al. introduce networks with latent variables in order to disentangle forms of uncertainty, and Flam-Shepherd et al. propose general-purpose weight priors based on approximating Gaussian processes. Other works have analyzed priors for compression and model selection [Ghosh and Doshi-Velez, 2017, Louizos et al., 2017]. Instead of a prior in weight space (or latent inputs as in Depeweg et al.), NCPs take the functional view by imposing explicit regularities in terms of the network’s inputs and outputs. This is similar in nature to Sun et al., who define a GP prior for BNNs resulting in an interesting but more complex algorithm. Malinin and Gales propose prior networks to avoid an explicit belief over parameters for classification tasks.

Input and output regularization

There is classic work on adding noise to inputs for improved generalization [Matsuoka, 1992, An, 1996, Bishop, 1995]. For example, denoising autoencoders [Vincent et al., 2008] encourage reconstructions given noisy encodings. Output regularization is also a classic idea from the maximum entropy principle [Jaynes, 1957], where it has motivated label smoothing [Szegedy et al., 2016] and entropy penalties [Pereyra et al., 2017]. Also related is virtual adversarial training [Miyato et al., 2015], which includes examples that are close to the current input but cause a maximal change in the model output, and mixup [Zhang et al., 2018], which includes examples under the vicinity of training data. These methods are orthogonal to NCPs: they aim to improve generalization from finite data within the training distribution (interpolation), while we aim to improve uncertainty estimates outside of the training distribution (extrapolation).

Classifying out-of-distribution inputs

A simple approach for neural network uncertainty is to classify whether data points belong to the data distribution, or are OOD [Hendrycks and Gimpel, 2017]. Recently, Lee et al. introduce a GAN to generate OOD samples, and Liang et al. add perturbations to the input, applying an “OOD detector” to improve softmax scores on OOD samples by scaling the temperature. Extending these directions of research, we connect to Bayesian principles and focus on uncertainty estimates that are useful for active data acquisition.

5 Experiments

To demonstrate their usefulness, we evaluate NCPs on various tasks where uncertainty estimates are desired. Our focus is on active learning for regression tasks, where only a few targets are visible in the beginning, and additional targets are selected regularly based on an acquisition function. We use two data sets: a toy example and a large flights data set. We also evaluate how sensitive our method is to the choice of input noise. Finally, we show that NCP scales to large data sets by training on the full flights data set in a passive learning setting. Our implementation uses TensorFlow Probability [Dillon et al., 2017, Tran et al., 2016] and is open-sourced at https://github.com/brain-research/ncp. An implementation of NCP is also available in Aboleth [Aboleth Developers, 2017] and Bayesian Layers [Tran et al., 2018].

We compare four neural network models, all using leaky ReLU activations [Maas et al., 2013] and trained using Adam [Kingma and Ba, 2014]. The four models are:

Deterministic neural network (Det) A neural network that predicts the mean and variance of a normal distribution. The name stands for deterministic, as there is no weight uncertainty.

Bayes by Backprop (BBB) A Bayesian neural network trained via gradient-based variational inference with an independent normal prior in weight space [Blundell et al., 2015, Kucukelbir et al., 2017]. We use the same model as in Section 3 but with a KL in weight space.

Bayes by Backprop with noise contrastive prior (BBB+NCP) Bayes by Backprop with NCP on the predicted mean distribution as described in Section 3.

Out-of-distribution classifier with noise contrastive prior (ODC+NCP) An uncertainty classifier model described in Appendix A. It is a deterministic neural network combined with NCP, which we use as a baseline alternative to Bayes by Backprop with NCP.

For active learning, we select new data points $\{x,y\}$ for which $x$ maximizes the expected information gain under the model $q(y\mid x)=\int p(y\mid x,\theta)q(\theta)\,\mathrm{d}\theta$,

The expected information gain is the mutual information between weights and output given input. It measures how many bits of information the new data point is expected to reveal about the optimal weights. Intuitively, the expected information gain is the largest where the model has high epistemic uncertainty but expects low aleatoric noise.

We use the form of the expected information gain for Gaussian posterior predictive distributions discussed in MacKay [1992a]. Moreover, to select batches of data points, we place a softmax distribution on the information gain for all available data points and acquire labels by sampling with a temperature of $\tau=0.5$ to get diversity. This results in the acquisition rule
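A minimal sketch of this acquisition step follows, under two stated assumptions. First, the information gain for a Gaussian model with epistemic standard deviation `mu_std` and noise standard deviation `noise_std` is taken to be $\tfrac{1}{2}\ln(1+\sigma_{\mu}^{2}/\sigma^{2})$, the standard mutual information for a Gaussian signal in Gaussian noise (in nats); the constant factor may differ from the form used in our experiments. Second, the batch is drawn without replacement, which is one possible choice. The function names are illustrative.

```python
import numpy as np


def expected_information_gain(mu_std, noise_std):
    """Gaussian information gain in nats: high epistemic, low aleatoric."""
    return 0.5 * np.log1p(mu_std**2 / noise_std**2)


def acquire(mu_std, noise_std, batch_size, temperature, rng):
    """Sample a batch of candidate indices from a softmax over the gain."""
    gain = expected_information_gain(mu_std, noise_std)
    logits = gain / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum()
    return rng.choice(len(probs), size=batch_size, replace=False, p=probs)


rng = np.random.default_rng(0)
mu_std = np.array([0.1, 1.0, 2.0, 0.5])
noise_std = np.ones(4)
batch = acquire(mu_std, noise_std, batch_size=2, temperature=0.5, rng=rng)
```

Lower temperatures concentrate the samples on the highest-gain candidates, while higher temperatures spread the batch over more of the candidate pool.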

For visualization purposes, we start with experiments on a 1-dimensional regression task that consists of a sine function with a small slope and increasing variance for higher inputs. Training data can be acquired only within two bands, and the model is evaluated on all data points that are not visible to the model. This structured split between training and testing data causes a distributional shift at test time, requiring successful models to have reliable uncertainty estimates to avoid mispredictions for OOD inputs.

For this experiment, we use two layers of 200 hidden units, a batch size of $10$ , and a learning rate of $3\times 10^{-4}$ for all models. NCP models use noise $\epsilon\sim\operatorname{Normal}(0,0.5)$ . We start with 10 randomly selected initial targets, and select 1 additional target every 1000 epochs. Figure 3 shows the root mean squared error (RMSE) and negative log predictive density (NLPD) throughout learning. The two baseline models severely overfit to the training distribution early on when only few data points are visible. Models with NCP outperform BBB, which in turn outperforms Det. Figure 1 visualizes the models’ predictive distributions at the end of training, showing that NCP prevents overconfident generalization.
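For reference, the two evaluation metrics can be computed as in the following sketch. It assumes that the predictive distribution at each test point is summarized by a single normal distribution with mean `pred_mean` and standard deviation `pred_std`; a Bayesian model evaluated with several weight samples would average densities instead.

```python
import numpy as np


def rmse(y, pred_mean):
    """Root mean squared error of the predicted means."""
    return float(np.sqrt(np.mean((y - pred_mean) ** 2)))


def nlpd(y, pred_mean, pred_std):
    """Negative log predictive density under a normal predictive."""
    log_density = (-0.5 * np.log(2.0 * np.pi * pred_std**2)
                   - 0.5 * ((y - pred_mean) / pred_std) ** 2)
    return float(-np.mean(log_density))
```

RMSE depends only on the predicted mean, whereas NLPD also penalizes predictive standard deviations that are too small or too large. This is why an overconfident model can have a reasonable RMSE and still a poor NLPD.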

5.2 Active learning on flight delays

We consider the flight delay data set [Hensman et al., 2013, Deisenroth and Ng, 2015, Lakshminarayanan et al., 2016], a large scale regression benchmark with several published results. The data set has 8 input variables describing a flight, and the target is the delay of the flight in minutes. There are 700K training examples and 100K test examples. The test set has a subtle distributional shift, since the 100K data points temporally follow after the training data.

We use two layers with 50 units each, a batch size of $10$, and a learning rate of $10^{-4}$. For NCP models, ${\epsilon\sim\operatorname{Normal}(0,0.1)}$. Starting from 10 labels, the models select a batch of 10 additional labels every 50 epochs. The 700K data points of the training data set are available for acquisition, and we evaluate performance on the typical test split. Figure 4 shows the performance for the visible data points and the test set respectively. We note that BBB and BBB+NCP show similar NLPD on the visible data points, but the NCP models generalize better to unseen data. Moreover, the Bayesian neural network with NCP achieves lower RMSE than the one without, and the classifier-based model achieves lower RMSE than the deterministic neural network. All uncertainty-based models outperform the deterministic neural network.

5.3 Robustness to noise patterns

The choice of input noise might seem like a critical hyperparameter for NCP. In this experiment, we find that our method is robust to the choice of input noise. The experimental setup is the same as for the active learning experiment described in Section 5.2, but with uniform or normal input noise with different variance ($\sigma_{x}^{2}\in\{0.1,0.2,\ldots,1.0\}$). For uniform input noise, this means noise is drawn from the interval $[-2\sigma_{x},2\sigma_{x}]$.
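The two noise patterns can be sketched as follows. The function name and the `kind` argument are illustrative assumptions; `sigma_x` is the standard deviation, so a variance setting of $\sigma_{x}^{2}=0.1$ corresponds to `sigma_x = np.sqrt(0.1)`.

```python
import numpy as np


def sample_input_noise(shape, sigma_x, kind, rng):
    """Draw input noise as Normal(0, sigma_x^2) or Uniform[-2 sigma_x, 2 sigma_x]."""
    if kind == "normal":
        return rng.normal(0.0, sigma_x, size=shape)
    if kind == "uniform":
        return rng.uniform(-2.0 * sigma_x, 2.0 * sigma_x, size=shape)
    raise ValueError(f"unknown noise kind: {kind!r}")


rng = np.random.default_rng(0)
sigma_x = np.sqrt(0.1)
normal_noise = sample_input_noise((10000, 1), sigma_x, "normal", rng)
uniform_noise = sample_input_noise((10000, 1), sigma_x, "uniform", rng)
```

Note that a uniform distribution on $[-2\sigma_{x},2\sigma_{x}]$ has standard deviation $2\sigma_{x}/\sqrt{3}\approx 1.15\,\sigma_{x}$, so the two patterns have a similar but not identical spread for the same $\sigma_{x}$.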

We observe that BBB+NCP is robust to the size of the input noise. NCP consistently improves RMSE for the tested noise sizes and yields the best NLPD for all noise sizes below 0.6. For our ODC baseline, we observe an intuitive trade-off: smaller input noise increases the regularization strength, leading to better NLPD but worse RMSE. Robustness to the choice of input noise is further supported by the analogous experiment on the toy data set, where above a small threshold (BBB+NCP $\sigma_{x}^{2}\geq 0.3$ and ODC+NCP $\sigma_{x}^{2}\geq 0.1$), NCP consistently performs well (Figure 6).

5.4 Large scale regression of flight delays

In addition to the active learning experiments, we perform a passive learning run on all 700K data points of the flights data set to explore the scalability of NCP. We use networks of 3 layers with 1000 units and a learning rate of $10^{-4}$. Table 1 compares the performance of our models to previously published results. We significantly improve on the state-of-the-art performance on this data set.

6 Discussion

We develop noise contrastive priors (NCPs), a prior for neural networks in data space. NCPs encourage network weights that not only explain the training data but also capture high uncertainty on OOD inputs. We show that NCPs offer strong improvements over baselines and scale to large regression tasks.

We focused on active learning for regression tasks, where uncertainty is crucial for determining which data points to select next. NCPs are only one form of a data prior, designed to encourage uncertainty on OOD inputs. In future work, it would be interesting to apply NCPs to alternative settings where uncertainty is important, such as image classification (using correlated noise, such as mixup [Zhang et al., 2018]) and learning with sparse or missing data. Priors in data space can easily capture properties such as periodicity or spatial invariance, and they may provide a scalable alternative to Gaussian process priors.

We thank Balaji Lakshminarayanan, Jascha Sohl-Dickstein, Matthew D. Hoffman, and Rif Saurous for their comments.

References

Appendix A OOD Classifier Model with NCP

We showed how to apply NCP to a Bayesian neural network model that captures function uncertainty in a belief over parameters. An alternative approach to capture uncertainty is to make explicit predictions about whether an input is OOD. There is no belief over weights in this model. Figure 2(b) shows such a mixture model via a binary variable $o$ ,

where $p(o=1\mid x)$ is the OOD probability of $x$ . If $o=0$ (“in distribution”), the model outputs the neural network prediction. Otherwise, if $o=1$ (“out of distribution”), the model uses a fixed output prior. The neural network weights $\theta$ are estimated using a point estimate, so we do not maintain a belief distribution over them.
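The mixture described above can be sketched as a density function. This is an illustration under the assumption that both components are univariate normal distributions; `ood_prob` stands for $p(o=1\mid x)$, and the function names are hypothetical.

```python
import numpy as np


def normal_pdf(y, mean, std):
    """Density of Normal(mean, std^2) evaluated at y."""
    return np.exp(-0.5 * ((y - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def odc_density(y, ood_prob, pred_mean, pred_std, prior_mean, prior_std):
    """Mixture of the network prediction (o = 0) and the output prior (o = 1)."""
    in_dist = normal_pdf(y, pred_mean, pred_std)
    out_dist = normal_pdf(y, prior_mean, prior_std)
    return (1.0 - ood_prob) * in_dist + ood_prob * out_dist
```

When `ood_prob` is close to zero, the model trusts the network prediction. When it is close to one, the density falls back to the fixed, high-entropy output prior.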

Analogously to the Bayesian neural network model in Section 3, we can either set $\mu_{y},\sigma_{y}^{2}$ manually or use the neural network prediction for potentially improved generalization. In our experiments, we implement the OOD classifier model using a single neural network with two output layers that parameterize the Gaussian distribution and the binary distribution.

Appendix B Deriving Variational Inference with NCP

In Section 3, we described a variational inference objective with NCP which takes the log-likelihood term and adds a forward KL-divergence from the mean prior to the model mean. To derive this:

Above, we reparameterized the KL in weight space as a KL in output space; by the change of variables, this is equivalent if the mapping $\mu(\cdot,\theta)$ is continuous and 1-1 with respect to $\theta$. This assumption does not hold for neural nets, as multiple parameter vectors can lead to the same predictive distribution, hence the approximation above. A compact reparameterization of the neural network (an equivalence class of parameters) would make this an equality.

Note that the derivation uses the opposite direction of the KL divergence from the one we use in the main text. The forward KL divergence we use was originally motivated by maximum likelihood with data augmentation, in which the data prior appears on the left-hand side of the KL divergence when interpreting maximum likelihood as minimizing the KL divergence from the data distribution to the model. In preliminary experiments, we have not found that the direction makes a significant difference, but this requires future investigation.

Appendix C Robustness Experiment on Toy Dataset

Appendix D Related Active Learning Work

Active learning is often employed in domains where data is cheap but labeling is expensive, and is motivated by the idea that not all data points are equally valuable when it comes to learning [Settles, 2009, Dasgupta, 2004]. Active learning techniques can be coarsely grouped into three categories. Ensemble methods [Seung et al., 1992, McCallum and Nigam, 1998, Freund et al., 1997] generate queries that have the greatest disagreement between a set of classifiers. Error reduction approaches select data based on the predicted reduction in classifier error, estimated from information gain [MacKay, 1992a], Monte Carlo estimation [Roy and McCallum, 2001], or hard-negative example mining [Sung, 1994, Rowley et al., 1998].

Uncertainty-based techniques select samples for which the classifier is most uncertain. Approaches include maximum entropy [Joshi et al., 2009], distance from the decision boundary [Tong and Koller, 2001], pseudo-labelling high-confidence examples [Wang et al., 2017], and mixtures of information density and uncertainty measures [Li and Guo, 2013]. Within this category, the area most related to our work is Bayesian methods. Kapoor et al. estimate expected improvement using a Gaussian process. Other approaches use classifier confidence [Lewis and Gale, 1994], predicted expected error [Roy and McCallum, 2001], or model disagreement [Houlsby et al., 2011]. Recently, Gal et al. applied a convolutional neural network with dropout uncertainty to images.