BRUNO: A Deep Recurrent Model for Exchangeable Data

Iryna Korshunova, Jonas Degrave, Ferenc Huszár, Yarin Gal, Arthur Gretton, Joni Dambre

Introduction

We address the problem of modelling unordered sets of objects that have some characteristic in common. Set modelling has been a recent focus in machine learning, both due to relevant application domains and to efficiency gains when dealing with groups of objects . The relevant concept in statistics is the notion of an exchangeable sequence of random variables – a sequence where any re-ordering of the elements is equally likely. To fulfil this definition, subsequent observations must behave like previous ones, which implies that we can make predictions about the future. This property allows the formulation of some machine learning problems in terms of modelling exchangeable data. For instance, one can think of few-shot concept learning as learning to complete short exchangeable sequences . A related example comes from the generative image modelling field, where we might want to generate images that are in some ways similar to the ones from a given set. At present, however, there are few flexible and provably exchangeable deep generative models to solve this problem.

Formally, a finite or infinite sequence of random variables x1,x2,x3,x_{1},x_{2},x_{3},\dots is said to be exchangeable if for all nn and all permutations π\pi

i. e. the joint probability remains the same under any permutation of the sequence. If random variables in the sequence are independent and identically distributed (i. i. d.), then it is easy to see that the sequence is exchangeable. The converse is false: exchangeable random variables can be correlated. One example of an exchangeable but non-i. i. d. sequence is a sequence of variables x1,,xnx_{1},\dots,x_{n}, which jointly have a multivariate normal distribution Nn(0,Σ)\mathcal{N}_{n}(\bm{0},\bm{\Sigma}) with the same variance and covariance for all the dimensions : Σii=1 and Σij,ij=ρ, with 0ρ<1\Sigma_{ii}=1\textnormal{ and }\Sigma_{ij,i\neq j}=\rho,\textnormal{ with }0\leq\rho<1.

The concept of exchangeability is intimately related to Bayesian statistics. De Finetti’s theorem states that every exchangeable process (infinite sequence of random variables) is a mixture of i. i. d. processes:

where θ\theta is some parameter (finite or infinite dimensional) conditioned on which, the random variables are i. i. d. . In our previous Gaussian example, one can prove that x1,,xnx_{1},\dots,x_{n} are i. i. d. with xiN(θ,1ρ)x_{i}\sim\mathcal{N}(\theta,1-\rho) conditioned on θN(0,ρ)\theta\sim\mathcal{N}(0,\rho).

In terms of predictive distributions p(xnx1:n1)p(x_{n}|x_{1:n-1}), the stochastic process in Eq. 2 can be written as

by conditioning both sides on x1:n1x_{1:n-1}. Eq. 3 is exactly the posterior predictive distribution, where we marginalise the likelihood of xnx_{n} given θ\theta with respect to the posterior distribution of θ\theta. From this follows one possible interpretation of the de Finetti’s theorem: learning to fit an exchangeable model to sequences of data is implicitly the same as learning to reason about the hidden variables behind the data.

One strategy for defining models of exchangeable sequences is through explicit Bayesian modelling: one defines a prior p(θ)p(\theta), a likelihood p(xiθ)p(x_{i}|\theta) and calculates the posterior in Eq. 2 directly. Here, the key difficulty is the intractability of the posterior and the predictive distribution p(xnx1:n1)p(x_{n}|x_{1:n-1}). Both of these expressions require integrating over the parameter θ\theta, so we might end up having to use approximations. This could violate the exchangeability property and make explicit Bayesian modelling difficult.

On the other hand, we do not have to explicitly represent the posterior to ensure exchangeability. One could define a predictive distribution p(xnx1:n1)p(x_{n}|x_{1:n-1}) directly, and as long as the process is exchangeable, it is consistent with Bayesian reasoning. The key difficulty here is defining an easy-to-calculate p(xnx1:n1)p(x_{n}|x_{1:n-1}) which satisfies exchangeability. For example, it is not clear how to train or modify an ordinary recurrent neural network (RNN) to model exchangeable data. In our opinion, the main challenge is to ensure that a hidden state contains information about all previous inputs x1:nx_{1:n} regardless of sequence length.

In this paper, we propose a novel architecture which combines features of the approaches above, which we will refer to as BRUNO: Bayesian RecUrrent Neural mOdel. Our model is provably exchangeable, and makes use of deep features learned from observations so as to model complex data types such as images. To achieve this, we construct a bijective mapping between random variables xiXx_{i}\in\mathcal{X} in the observation space and features ziZz_{i}\in\mathcal{Z}, and explicitly define an exchangeable model for the sequences z1,z2,z3,z_{1},z_{2},z_{3},\dots, where we know an analytic form of p(znz1:n1)p(z_{n}|z_{1:n-1}) without explicitly computing the integral in Eq. 3.

Using BRUNO, we are able to generate samples conditioned on the input sequence by sampling directly from p(xnx1:n1)p(x_{n}|x_{1:n-1}). The latter is also tractable to evaluate, i. e. has linear complexity in the number of data points. In respect of model training, evaluating the predictive distribution requires a single pass through the neural network that implements XZ\mathcal{X}\mapsto\mathcal{Z} mapping. The model can be learned straightforwardly, since p(xnx1:n1)p(x_{n}|x_{1:n-1}) is differentiable with respect to the model parameters.

The paper is structured as follows. In Section 2 we will look at two methods selected to highlight the relation of our work with previous approaches to modelling exchangeable data. Section 3 will describe BRUNO, along with necessary background information. In Section 4, we will use our model for conditional image generation, few-shot learning, set expansion and set anomaly detection. Our code is available at github.com/IraKorshunova/bruno.

Related work

Bayesian sets aim to model exchangeable sequences of binary random variables by analytically computing the integrals in Eq. 2, 3. This is made possible by using a Bernoulli distribution for the likelihood and a beta distribution for the prior. To apply this method to other types of data, e.g. images, one needs to engineer a set of binary features . In that case, there is usually no one-to-one mapping between the input space X\mathcal{X} and the features space Z\mathcal{Z}: in consequence, it is not possible to draw samples from p(xnx1:n1)p(x_{n}|x_{1:n-1}). Unlike Bayesian sets, our approach does have a bijective transformation, which guarantees that inference in Z\mathcal{Z} is equivalent to inference in space X\mathcal{X}.

The neural statistician is an extension of a variational autoencoder model applied to datasets. In addition to learning an approximate inference network over the latent variable zi\bm{z}_{i} for every xi\bm{x}_{i} in the set, approximate inference is also implemented over a latent variable c\bm{c} – a context that is global to the dataset. The architecture for the inference network q(cx1,,xn)q(\bm{c}|\bm{x}_{1},\dots,\bm{x}_{n}) maps every xi\bm{x}_{i} into a feature vector and applies a mean pooling operation across these representations. The resulting vector is then used to produce parameters of a Gaussian distribution over c\bm{c}. Mean pooling makes q(cx1,,xn)q(\bm{c}|\bm{x}_{1},\dots,\bm{x}_{n}) invariant under permutations of the inputs. In addition to the inference networks, the neural statistician also has a generative component p(x1,,xnc)p(\bm{x}_{1},\dots,\bm{x}_{n}|\bm{c}) which assumes that xi\bm{x}_{i}’s are independent given c\bm{c}. Here, it is easy to see that c\bm{c} plays the role of θ\theta from Eq. 2. In the neural statistician, it is intractable to compute p(x1,,xn)p(\bm{x}_{1},\dots,\bm{x}_{n}), so its variational lower bound is used instead. In our model, we perform an implicit inference over θ\theta and can exactly compute predictive distributions and the marginal likelihood. Despite these differences, both neural statistician and BRUNO can be applied in similar settings, namely few-shot learning and conditional image generation, albeit with some restrictions, as we will see in Section 4.

Method

We begin this section with an overview of the mathematical tools needed to construct our model: first the Student-t process ; and then the Real NVP – a deep, stably invertible and learnable neural network architecture for density estimation . We next propose BRUNO, wherein we combine an exchangeable Student-t process with the Real NVP, and derive recurrent equations for the predictive distribution such that our model can be trained as an RNN. Our model is illustrated in Figure 1.

The Student-t process (TP\mathcal{TP}) is the most general elliptically symmetric process with an analytically representable density . The more commonly used Gaussian processes (GP\mathcal{G}Ps) can be seen as limiting case of TP\mathcal{TP}s. In what follows, we provide the background and definition of TP\mathcal{TP}s.

Then conditional distribution p(zbza)p(\bm{z}_{b}|\bm{z}_{a}) is given by

In the general case, when one needs to invert the covariance matrix, the complexity of computing p(zbza)p(\bm{z}_{b}|\bm{z}_{a}) is O(na3)\mathcal{O}(n_{a}^{3}). These computations become infeasible for large datasets, which is a known bottleneck for GP\mathcal{G}Ps and TP\mathcal{T}Ps . In Section 3.3, we will show that exchangeable processes do not have this issue.

The parameter ν\nu, representing the degrees of freedom, has a large impact on the behaviour of TP\mathcal{T}Ps. It controls how heavy-tailed the t-distribution is: as ν\nu increases, the tails get lighter and the t-distribution gets closer to the Gaussian. From Eq. 6, we can see that as ν\nu or nan_{a} tends to infinity, the predictive distribution tends to the one from a GP\mathcal{G}P. Thus, for small ν\nu and nan_{a}, a TP\mathcal{T}P would give less certain predictions than its corresponding GP\mathcal{G}P.

A second feature of the TP\mathcal{T}P is the scaling of the predictive variance with a βa\beta_{a} coefficient, which explicitly depends on the values of the conditioning observations. From Eq. 6, the value of βa\beta_{a} is precisely the Hotelling statistic for the vector za\bm{z}_{a}, and has a χna2\chi^{2}_{n_{a}} distribution with mean nan_{a} in the event that zaNna(μa,Kaa)\bm{z}_{a}\sim\mathcal{N}_{n_{a}}(\bm{\mu}_{a},\bm{K}_{aa}). Looking at the weight \nicefrac(ν+βa2)(ν+na2)\nicefrac{{(\nu+\beta_{a}-2)}}{{(\nu+n_{a}-2)}}, we see that the variance of p(zbza)p(\bm{z}_{b}|\bm{z}_{a}) is increased over the Gaussian default when βa>na\beta_{a}>n_{a}, and is reduced otherwise. In other words, when the samples are dispersed more than they would be under the Gaussian distribution, the predictive uncertainty is increased compared with the Gaussian case. It is helpful in understanding these two properties to recall that the multivariate Student-t distribution can be thought of as a Gaussian distribution with an inverse Wishart prior on the covariance .

2 Real NVP

The main building block of Real NVP is a coupling layer. It implements a mapping XY\mathcal{X}\mapsto\mathcal{Y} that transforms half of its inputs while copying the other half directly to the output:

where \odot is an elementwise product, ss (scale) and tt (translation) are arbitrarily complex functions, e.g. convolutional neural networks.

One can show that the coupling layer is a bijective, easily invertible mapping with a triangular Jacobian and composition of such layers preserves these properties. To obtain a highly nonlinear mapping f(x)f(\bm{x}), one needs to stack coupling layers XY1Y2Z\mathcal{X}\mapsto\mathcal{Y}_{1}\mapsto\mathcal{Y}_{2}\dots\mapsto\mathcal{Z} while alternating the dimensions that are being copied to the output.

To make good use of modelling densities, the Real NVP has to treat its inputs as instances of a continuous random variable . To do so, integer pixel values in x\bm{x} are dequantised by adding uniform noise u[0,1)D{\bm{u}\in[0,1)^{D}}. The values x+u[0,256)D{\bm{x}+\bm{u}\in[0,256)^{D}} are then rescaled to a [0,1)[0,1) interval and transformed with an elementwise function: f(x)=logit(α+(12α)x){f(x)=\textnormal{logit}(\alpha+(1-2\alpha)x)} with some small α\alpha.

3 BRUNO: the exchangeable sequence model

We now combine Bayesian and deep learning tools from the previous sections and present our model for exchangeable sequences whose schematic is given in Figure 1.

We make the following assumptions about the latents:

A1: dimensions {zd}d=1,,D\{z^{d}\}_{d=1,\dots,D} are independent, so p(z)=d=1Dp(zd){p(\bm{z})=\prod_{d=1}^{D}p(z^{d})}

A2: for every dimension dd, we assume the following: (z1d,znd)MVTn(νd,μd1,Kd)(z^{d}_{1},\dots z^{d}_{n})\sim MVT_{n}(\nu^{d},\mu^{d}\bm{1},\bm{K}^{d}), with parameters:

n×nn\times n covariance matrix Kd\bm{K}^{d} with Kiid=vd\bm{K}^{d}_{ii}=v^{d} and Kij,ijd=ρd\bm{K}^{d}_{ij,i\neq j}=\rho^{d} where 0ρd<vd0\leq\rho^{d}<v^{d} to make sure that Kd\bm{K}^{d} is a positive-definite matrix that complies with covariance properties of exchangeable sequences .

The exchangeable structure of the covariance matrix and having the same mean for every nn, guarantees that the sequence z1d,z2dzndz^{d}_{1},z^{d}_{2}\dots z^{d}_{n} is exchangeable. Because the covariance matrix is simple, we can derive recurrent updates for the parameters of p(zn+1dz1:nd)p(z_{n+1}^{d}|z_{1:n}^{d}). Using the recurrence is a lot more efficient compared to the closed-form expressions in Eq. 6 since we want to compute the predictive distribution for every step nn.

We start from a prior Student-t distribution for p(z1)p(z_{1}) with parameters μ1=μ\mu_{1}=\mu , v1=vv_{1}=v, ν1=ν\nu_{1}=\nu, β1=0\beta_{1}=0. Here, we will drop the dimension index dd to simplify the notation. A detailed derivation of the following results is given in Appendix B. To compute the degrees of freedom, mean and variance of p(zn+1z1:n)p(z_{n+1}|z_{1:n}) for every nn, we begin with the recurrent relations

where dn=ρv+ρ(n1)d_{n}=\frac{\rho}{v+\rho(n-1)}. Note that the GP\mathcal{G}P recursions simply use the latter two equations, i.e. if we were to assume that (z1d,znd)Nn(μd1,Kd){(z^{d}_{1},\dots z^{d}_{n})\sim\mathcal{N}_{n}(\mu^{d}\bm{1},\bm{K}^{d})}. For TP\mathcal{T}Ps, however, we also need to compute β\beta – a data-dependent term that scales the covariance matrix as in Eq. 6. To update β\beta, we introduce recurrent expressions for the auxiliary variables:

From these equations, we see that computational complexity of making predictions in exchangeable GP\mathcal{G}Ps or TP\mathcal{T}Ps scales linearly with the number of observations, i.e. O(n)\mathcal{O}(n) instead of a general O(n3)\mathcal{O}(n^{3}) case where one needs to compute an inverse covariance matrix.

So far, we have constructed an exchangeable Student-t process in the latent space Z\mathcal{Z}. By coupling it with a bijective Real NVP mapping, we get an exchangeable process in space X\mathcal{X}. Although we do not have an explicit analytic form of the transitions in X\mathcal{X}, we still can sample from this process and evaluate the predictive distribution via the change of variables formula in Eq. 7.

4 Training

Having an easy-to-evaluate autoregressive distribution p(xn+1x1:n)p(\bm{x}_{n+1}|\bm{x}_{1:n}) allows us to use a training scheme that is common for RNNs, i.e. maximise the likelihood of the next element in the sequence at every step. Thus, our objective function for a single sequence of fixed length NN can be written as L=n=0N1logp(xn+1x1:n)\mathcal{L}=\sum_{n=0}^{N-1}\log p(\bm{x}_{n+1}|\bm{x}_{1:n}), which is equivalent to maximising the joint log-likelihood logp(x1,,xN)\log p(\bm{x}_{1},\dots,\bm{x}_{N}). While we do have a closed-form expression for the latter, we chose not to use it during training in order to minimize the difference between the implementation of training and testing phases. Note that at test time, dealing with the joint log-likelihood would be inconvenient or even impossible due to high memory costs when NN gets large, which again motivates the use of a recurrent formulation.

During training, we update the weights of the Real NVP model and also learn the parameters of the prior Student-t distribution. For the latter, we have three trainable parameters per dimension: degrees of freedom νd\nu^{d}, variance vdv^{d} and covariance ρd\rho^{d}. The mean μd\mu^{d} is fixed to 0 for every dd and is not updated during training.

Experiments

In this section, we will consider a few problems that fit naturally into the framework of modeling exchangeable data. We chose to work with sequences of images, so the results are easy to analyse; yet BRUNO does not make any image-specific assumptions, and our conclusions can generalise to other types of data. Specifically, for non-image data, one can use a general-purpose Real NVP coupling layer as proposed by Papamakarios et al., . In contrast to the original Real NVP model, which uses convolutional architecture for scaling and translation functions in Eq. 8, a general implementation has ss and tt composed from fully connected layers. We experimented with both convolutional and non-convolutional architectures, the details of which are given in Appendix C.

In our experiments, the models are trained on image sequences of length 20. We form each sequence by uniformly sampling a class and then selecting 20 random images from that class. This scheme implies that a model is trained to implicitly infer a class label that is global to a sequence. In what follows, we will see how this property can be used in a few tasks.

We first consider a problem of generating samples conditionally on a set of images, which reduces to sampling from a predictive distribution. This is different from a general Bayesian approach, where one needs to infer the posterior over some meaningful latent variable and then ‘decode’ it.

To draw samples from p(xn+1x1:n)p(\bm{x}_{n+1}|\bm{x}_{1:n}), we first sample zp(zn+1z1:n)\bm{z}\sim p(\bm{z}_{n+1}|\bm{z}_{1:n}) and then compute the inverse Real NVP mapping: x=f1(z)\bm{x}=f^{-1}(\bm{z}). Since we assumed that dimensions of z\bm{z} are independent, we can sample each zdz^{d} from a univariate Student-t distribution. To do so, we modified Bailey’s polar t-distribution generation method to be computationally efficient for GPU. Its algorithm is given in Appendix D.

In Figure 3, we show samples from the prior distribution p(x1)p(\bm{x}_{1}) and conditional samples from a predictive distribution p(xn+1x1:n)p(\bm{x}_{n+1}|\bm{x}_{1:n}) at steps n=1,,20n=1,\dots,20. Here, we used a convolutional Real NVP model as a part of BRUNO. The model was trained on Omniglot same-class image sequences of length 20 and we used the train-test split and preprocessing as defined by Vinyals et al., 2016b . Namely, we resized the images to 28×2828\times 28 pixels and augmented the dataset with rotations by multiples of 90 degrees yielding 4,800 and 1,692 classes for training and testing respectively.

To better understand how BRUNO behaves, we test it on special types of input sequences that were not seen during training. In Appendix E, we give an example where the same image is used throughout the sequence. In that case, the variability of the samples reduces as the models gets more of the same input. This property does not hold for the neural statistician model , discussed in Section 2. As mentioned earlier, the neural statistician computes the approximate posterior q(cx1,,xn)q(\bm{c}|\bm{x}_{1},\dots,\bm{x}_{n}) and then uses its mean to sample x\bm{x} from a conditional model p(xcmean)p(\bm{x}|\bm{c}_{mean}). This scheme does not account for the variability in the inputs as a consequence of applying mean pooling over the features of x1,,xn\bm{x}_{1},\dots,\bm{x}_{n} when computing q(cx1,,xn)q(\bm{c}|\bm{x}_{1},\dots,\bm{x}_{n}). Thus, when all xix_{i}’s are the same, it would still sample different instances from the class specified by xix_{i}. Given the code provided by the authors of the neural statistician and following an email exchange, we could not reproduce the results from their paper, so we refrained from making any direct comparisons.

More generated samples from convolutional and non-convolutional architectures trained on MNIST , Fashion-MNIST and CIFAR-10 are given in the appendix. For a couple of these models, we analyse the parameters of the learnt latent distributions (see Appendix F).

2 Few-shot learning

Previously, we saw that BRUNO can generate images of the unseen classes even after being conditioned on a couple of examples. In this section, we will see how one can use its conditional probabilities not only for generation, but also for a few-shot classification.

We evaluate the few-shot learning accuracy of the model from Section 4.1 on the unseen Omniglot characters from the 1,692 testing classes following the nn-shot and kk-way classification setup proposed by Vinyals et al., 2016b . For every test case, we randomly draw a test image xn+1\bm{x}_{n+1} and a sequence of nn images from the target class. At the same time, we draw nn images for every of the k1k-1 random decoy classes. To classify an image xn+1\bm{x}_{n+1}, we compute p(xn+1x1:nC=i){p(\bm{x}_{n+1}|\bm{x}^{C=i}_{1:n})} for each class i=1k{i=1\dots k} in the batch. An image is classified correctly when the conditional probability is highest for the target class compared to the decoy classes. This evaluation is performed 20 times for each of the test classes and the average classification accuracy is reported in Table 1.

For comparison, we considered three models from Vinyals et al., 2016b : (a) k-nearest neighbours (k-NN), where matching is done on raw pixels (Pixels), (b) k-NN with matching on discriminative features from a state-of-the-art classifier (Baseline Classifier), and (c) Matching networks.

We observe that BRUNO model from Section 4.1 outperforms the baseline classifier, despite having been trained on relatively long sequences with a generative objective, i.e. maximising the likelihood of the input images. Yet, it cannot compete with matching networks – a model tailored for a few-shot learning and trained in a discriminative way on short sequences such that its test-time protocol exactly matches the training time protocol. One can argue, however, that a comparison between models trained generatively and discriminatively is not fair. Generative modelling is a more general, harder problem to solve than discrimination, so a generatively trained model may waste a lot of statistical power on modelling aspects of the data which are irrelevant for the classification task. To verify our intuition, we fine-tuned BRUNO with a discriminative objective, i.e. maximising the likelihood of correct labels in nn-shot, kk-way classification episodes formed from the training examples of Omniglot. While we could sample a different nn and kk for every training episode like in matching networks, we found it sufficient to fix nn and kk during training. Namely, we chose the setting with n=1n=1 and k=20k=20. From Table 1, we see that this additional discriminative training makes BRUNO competitive with state-of-the-art models across all nn-shot and kk-way tasks.

As an extension to the few-shot learning task, we showed that BRUNO could also be used for online set anomaly detection. These experiments can be found in Appendix H.

3 𝒢​𝒫𝒢𝒫\mathcal{GP}-based models

In practice, we noticed that training TP\mathcal{TP}-based models can be easier compared to GP\mathcal{GP}-based models as they are more robust to anomalous training inputs and are less sensitive to the choise of hyperparameters. Under certain conditions, we were not able to obtain convergent training with GP\mathcal{GP}-based models which was not the case when using TP\mathcal{TP}s; an example is given in Appendix G. However, we found a few heuristics that make for a successful training such that TP\mathcal{TP} and GP\mathcal{GP}-based models perform equally well in terms of test likelihoods, sample quality and few-shot classification results. For instance, it was crucial to use weight normalisation with a data-dependent initialisation of parameters of the Real NVP . As a result, one can opt for using GP\mathcal{GP}s due to their simpler implementation. Nevertheless, a Student-t process remains a strictly richer model class for the latent space with negligible additional computational costs.

Discussion and conclusion

In this paper, we introduced BRUNO, a new technique combining deep learning and Student-t or Gaussian processes for modelling exchangeable data. With this architecture, we may carry out implicit Bayesian inference, avoiding the need to compute posteriors and eliminating the high computational cost or approximation errors often associated with explicit Bayesian inference.

Based on our experiments, BRUNO shows promise for applications such as conditional image generation, few-shot concept learning, few-shot classification and online anomaly detection. The probabilistic construction makes the BRUNO approach particularly useful and versatile in transfer learning and multi-task situations. To demonstrate this, we showed that BRUNO trained in a generative way achieves good performance in a downstream few-shot classification task without any task-specific retraining. Though, the performance can be significantly improved with discriminative fine-tuning.

Training BRUNO is a form of meta-learning or learning-to-learn: it learns to perform Bayesian inference on various sets of data. Just as encoding translational invariance in convolutional neural networks seems to be the key to success in vision applications, we believe that the notion of exchangeability is equally central to data-efficient meta-learning. In this sense, architectures like BRUNO and Deep Sets can be seen as the most natural starting point for these applications.

As a consequence of exchangeability-by-design, BRUNO is endowed with a hidden state which integrates information about all inputs regardless of sequence length. This desired property for meta-learning is usually difficult to ensure in general RNNs as they do not automatically generalise to longer sequences than they were trained on and are sensitive to the ordering of inputs. Based on this observation, the most promising applications for BRUNO may fall in the many-shot meta-learning regime, where larger sets of data are available in each episode. Such problems naturally arise in privacy-preserving on-device machine learning, or federated meta-learning , which is a potential future application area for BRUNO.

Acknowledgements

We would like to thank Lucas Theis for his conceptual contributions to BRUNO, Conrado Miranda and Frederic Godin for their helpful comments on the paper, Wittawat Jitkrittum for useful discussions, and Lionel Pigou for setting up the hardware.

References

Appendix A Proofs

Given an exchangeable sequence (x1,x2,,xn)(x_{1},x_{2},\dots,x_{n}) of random variables xiXx_{i}\in\mathcal{X} and a bijective mapping f:XZf:\mathcal{X}\mapsto\mathcal{Z}, the sequence (f(x1),f(x2),,f(xn))(f(x_{1}),f(x_{2}),\dots,f(x_{n})) is exchangeable.

where detJ=i=1nf(xi)xi\det\bm{J}=\prod_{i=1}^{n}\frac{\partial f(x_{i})}{\partial x_{i}} is the determinant of the Jacobian of g\bm{g}. Since both the joint probability of (x1,x2,,xn)(x_{1},x_{2},\dots,x_{n}) and the detJ\left|\det\bm{J}\right| are invariant to the permutation of sequence entries, so must be p(z1,z2,,zn)p(z_{1},z_{2},\dots,z_{n}). This proves that (z1,z2,,zn)(z_{1},z_{2},\dots,z_{n}) is exchangeable. \square

Lemma 2

Given two exchangeable sequence x=(x1,x2,,xn)\bm{x}=(x_{1},x_{2},\dots,x_{n}) and y=(y1,y2,,yn)\bm{y}=(y_{1},y_{2},\dots,y_{n}) of random variables, where xix_{i} is independent from yjy_{j} for i,j\forall i,j, the concatenated sequence xy=((x1,y1),(x2,y2),,(xn,yn))\bm{x}^{\frown}\bm{y}=((x_{1},y_{1}),(x_{2},y_{2}),\dots,(x_{n},y_{n})) is exchangeable as well.

Proof. For any permutation π\pi, as both sequences x\bm{x} and y\bm{y} are exchangeable we have:

Independence between elements in x\bm{x} and y\bm{y} allows to write it as a joint distribution:

and thus the sequence xy\bm{x}^{\frown}\bm{y} is exchangeable. \square

This Lemma justifies our construction with DD independent exchangeable processes in the latent space as given in A1 from Section 3.3.

Appendix B Derivation of recurrent Bayesian updates for exchangeable Student-t and Gaussian processes

Note that this parameterization of the multivariate t-distribution as defined by Shah et al., is slightly different from the commonly used one. We used this parametrization as it makes the formulas simpler.

the conditional distribution p(xbxa)p(\bm{x}_{b}|\bm{x}_{a}) is given by:

Derivation of this result is given in the appendix of . Let us now simplify these equations for the case of exchangeable sequences with the following covariance structure:

Computing the parameters of the predictive distribution requires the inverse of Kaa\bm{K}_{aa}, which we can find using the Sherman-Morrison formula:

After a few steps, the inverse of Kaa\bm{K}_{aa} is:

Note that entries of Kaa1\bm{K}_{aa}^{-1} explicitly depend on nn.

Equations for the mean and variance of the predictive distribution require the following term:

With this in mind, it is easy to derive the following recurrence:

Finally, let us derive recurrent equations for βn+1=(xaμa)TKaa1(xaμa){\beta_{n+1}=(\bm{x}_{a}-\bm{\mu}_{a})^{T}K_{aa}^{-1}(\bm{x}_{a}-\bm{\mu}_{a})}.

Similarly, βn\beta_{n} from p(xnx1:n1)p(x_{n}|x_{1:n-1}) is:

It is easy to show that anbnan1bn1=1\frac{a_{n}-b_{n}}{a_{n-1}-b_{n-1}}=1, so βn+1\beta_{n+1} can be written recursively as:

Appendix C Implementation details

For simple datasets, such as MNIST, we found it tolerable to use models that rely upon a general implementation of the Real NVP coupling layer similarly to Papamakarios et al., . Namely, when scaling and translation functions ss and tt are fully-connected neural networks. In our model, networks ss and tt share the parameters in the first two dense layers with 1024 hidden units and ELU nonlinearity . Their output layers are different: ss ends with a dense layer with tanh\tanh and tt ends with a dense layer without a nonlinearity. We stacked 6 coupling layers with alternating the indices of the transformed dimensions between odd and even as described by Dinh et al., . For the first layer, which implements a logit transformation of the inputs, namely f(x)=logit(α+(12α)x){f(x)=\textnormal{logit}(\alpha+(1-2\alpha)x)}, we used α=106\alpha=10^{-6}. The logit transformation ensures that when taking the inverse mapping during sample generation, the outputs always lie within (α12α,1α12α){(\frac{-\alpha}{1-2\alpha},\frac{1-\alpha}{1-2\alpha})}.

In Omniglot, Fashion MNIST and CIFAR-10 experiments, we built upon a Real NVP model originally designed for CIFAR-10 by Dinh et al., : a multi-scale architecture with deep convolutional residual networks in the coupling layers. Our main difference was the use of coupling layers with fully-connected ss and tt networks (as described above) placed on top of the original convolutional Real NVP model. We found that adding these layers allowed for a faster convergence and improved results. This is likely due to a better mixing of the information before the output of the Real NVP gets into the Student-t layer. We also found that using weight normalisation within every ss and tt function was crucial for successful training of large models.

The model parameters were optimized using RMSProp with a decaying learning rate starting from 10310^{-3}. Trainable parameters of a TP\mathcal{TP} or GP\mathcal{GP} were updated with a 10x smaller learning rate and were initialized as following: νd=1000\nu^{d}=1000, vd=1.v^{d}=1., ρd=0.1\rho^{d}=0.1 for every dimension dd. The mean μd\mu^{d} was fixed at 0. For the Omniglot model, we used a batch size of 32, sequence length of 20 and trained for 200K iterations. The other models were trained for a smaller number of iterations, i.e. ranging from 50K to 100K updates.

Appendix D Sampling from a Student-t distribution

Appendix E Sample analysis

In Figure 3, which includes Figure 2 from the main text, we want to illustrate how sample variability depends on the variance of the inputs. From these examples, we see that in the case of a repeated input image, samples get more coherent as the number of conditioning inputs grows. It also shows that BRUNO does not merely generate samples according to the inferred class label.

While Omngilot is limited to 20 images per class, we can experiment with longer sequences using CIFAR-10 or MNIST. In Figure 4 and Figure 5, we show samples from the models trained on those datasets. In Figure 6, we also show more samples from the prior distribution p(x)p(\bm{x}).

Appendix F Parameter analysis

After training a model, we observed that a majority of the processes in the latent space have low correlations \nicefracρdvd\nicefrac{{\rho^{d}}}{{v^{d}}}, and thus their predictive distributions remain close to the prior. Figure 7 plots the number of dimensions where correlations exceed a certain value on the x-axis. For instance, MNIST model has 8 dimensions where the correlation is higher than 0.1. While we have not verified it experimentally, it is reasonable to expect those dimensions to capture information about visual features of the digits.

For TP\mathcal{TP}-based models, degrees of freedom νd\nu^{d} for every process in the latent space were intialized to 1000, which makes a TP\mathcal{TP} close to a GP\mathcal{GP}. After training, most of the dimensions retain fairly high degrees of freedom, but some can have small ν\nu’s. One can notice from Figure 8 that dimensions with high correlation tend to have smaller degrees of freedom.

We noticed that exchangeable TP\mathcal{TP}s and GP\mathcal{GP}s can behave differently for certain settings of hyperparameters even when TP\mathcal{TP}s have high degrees of freedom. Figure 9 gives one example when this is the case.

Appendix G Training of 𝒢​𝒫𝒢𝒫\mathcal{GP} and 𝒯​𝒫𝒯𝒫\mathcal{TP}-based models

When jointly optimizing Real NVP with a TP\mathcal{TP} or a GP\mathcal{GP} on top, we found that these two versions of BRUNO occasionally behave differently during training. Namely, with GP\mathcal{GP}s the convergence was harder to achive. We could pinpoint a few determining factors: (a) the use of weightnorm in the Real NVP layers, (b) an intialisation of the covariance parameters, and (c) presence of outliers in the training data. In Figure 10, we give examples of learning curves when BRUNO with GP\mathcal{GP}s tends not to work well. Here, we use a convolutional architecture and train on Fashion MNIST. To simulate outliers, every 100 iterations we feed a training batch where the last image of every sequence in the batch is completely white.

We would like to note that there are many settings where both versions of BRUNO diverge or they both work well, and that the results of this partial ablation study are not sufficient to draws general conclusions. However, we can speculate that when extending BRUNO to new problems, it is reasonable to start from a GP\mathcal{GP}-based model with weightnorm, small initial covariances, and small learning rates. However, when finding a good set of hyperparameters is difficult, it might be worth trying the TP\mathcal{TP}-based BRUNO.

Appendix H Set anomaly detection

Online anomaly detection for exchangeable data is one of the application where we can use BRUNO. This problem is closely related to the task of content-based image retrieval, where we need to rank an image x\bm{x} on how well it fits with the sequence x1:n\bm{x}_{1:n} . For the ranking, we use the probabilistic score proposed in Bayesian sets :

When we care exclusively about comparing ratios of conditional densities of xn+1\bm{x}_{n+1} under different sequences x1:n\bm{x}_{1:n}, we can compare densities in the latent space Z\mathcal{Z} instead. This is because the Jacobian from the change of variable formula does not depend on the sequence we condition on.

For the following experiment, we trained a small convolutional version of BRUNO only on even MNIST digits (30,508 training images). In Figure 11, we give typical examples of how the score evolves as the model gets more data points and how it behaves in the presence of inputs that do not conform with the majority of the sequence. This preliminary experiment shows that our model can detect anomalies in a stream of incoming data.

Appendix I Model samples