Neural Processes

Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, Yee Whye Teh

Introduction

Function approximation lies at the core of numerous problems in machine learning, and one approach that has been exceptionally popular for this purpose over the past decade is the use of deep neural networks. At a high level, neural networks constitute black-box function approximators that learn to parameterise a single function from a large number of training data points. As such, the majority of the workload of a network falls on the training phase, while the evaluation and testing phases are reduced to quick forward passes. Although high test-time performance is valuable for many real-world applications, the fact that network outputs cannot be updated after training may be undesirable. Meta-learning, for example, is an increasingly popular field of research that addresses exactly this limitation (Sutskever et al., 2014; Wang et al., 2016; Vinyals et al., 2016; Finn et al., 2017).

As an alternative to using neural networks, one can also perform inference on a stochastic process in order to carry out function regression. The most common instantiation of this approach is a Gaussian process (GP), a model with complementary properties to those of neural networks: GPs do not require a costly training phase and can carry out inference about the underlying ground truth function conditioned on some observations, which renders them very flexible at test time. In addition, GPs represent infinitely many different functions at locations that have not been observed, thereby capturing the uncertainty over their predictions given some observations. However, GPs are computationally expensive: in their original formulation they scale cubically with respect to the number of data points, and current state-of-the-art approximations still scale quadratically (Quiñonero-Candela & Rasmussen, 2005). Furthermore, the available kernels are usually restricted in their functional form, and an additional optimisation procedure is required to identify the most suitable kernel, as well as its hyperparameters, for any given task.

As a result, there is growing interest in combining aspects of neural networks and inference on stochastic processes as a potential solution to some of the downsides of both (Huang et al., 2015; Wilson et al., 2016). In this work we introduce a neural network-based formulation that learns an approximation of a stochastic process, which we term Neural Processes (NPs). NPs display some of the fundamental properties of GPs, namely they learn to model distributions over functions, are able to estimate the uncertainty over their predictions conditioned on context observations, and shift some of the workload from training to test time, which allows for model flexibility. Crucially, NPs generate predictions in a computationally efficient way. Given $n$ context points and $m$ target points, inference with a trained NP corresponds to a forward pass in a deep NN, which scales with $\mathcal{O}(n+m)$ as opposed to the $\mathcal{O}((n+m)^{3})$ runtime of classic GPs. Furthermore, the model overcomes many functional design restrictions by learning an implicit ‘kernel’ from the data directly.

We introduce Neural Processes, a class of models that combine benefits of neural networks and stochastic processes.

We compare NPs to related work in meta-learning, deep latent variable models and Gaussian processes. Given that NPs are linked to many of these areas, they form a bridge for comparison between many related topics.

We showcase the benefits and abilities of NPs by applying them to a range of tasks including 1-D regression, real-world image completion, Bayesian optimization and contextual bandits.

Model

The standard approach to defining a stochastic process is via its finite-dimensional marginal distributions. Specifically, we consider the process as a random function $F:\mathcal{X}\rightarrow\mathcal{Y}$ and, for each finite sequence $x_{1:n}=(x_{1},\ldots,x_{n})$ with $x_{i}\in\mathcal{X}$, we define the marginal joint distribution over the function values $Y_{1:n}:=(F(x_{1}),\ldots,F(x_{n}))$. For example, in the case of GPs, these joint distributions are multivariate Gaussians parameterised by a mean and a covariance function.

Given a collection of joint distributions $\rho_{x_{1:n}}$ we can derive two necessary conditions to be able to define a stochastic process $F$ such that $\rho_{x_{1:n}}$ is the marginal distribution of $(F(x_{1}),\ldots,F(x_{n}))$, for each finite sequence $x_{1:n}$. These conditions are (finite) exchangeability and consistency. As stated by the Kolmogorov Extension Theorem (Øksendal, 2003), these conditions are sufficient to define a stochastic process.

Exchangeability. This condition requires the joint distributions to be invariant to permutations of the elements in $x_{1:n}$. More precisely, for each finite $n$, if $\pi$ is a permutation of $\{1,\ldots,n\}$, then:

$$\rho_{x_{1:n}}(y_{1:n})=\rho_{\pi(x_{1:n})}(\pi(y_{1:n})) \tag{1}$$

where $\pi(x_{1:n}):=(x_{\pi(1)},\ldots,x_{\pi(n)})$ and $\pi(y_{1:n}):=(y_{\pi(1)},\ldots,y_{\pi(n)})$.

Consistency. If we marginalise out a part of the sequence, the resulting marginal distribution is the same as that defined on the original sequence. More precisely, if $1\leq m\leq n$, then:

$$\rho_{x_{1:m}}(y_{1:m})=\int\rho_{x_{1:n}}(y_{1:n})\,dy_{m+1:n} \tag{2}$$

Take, for example, three different sequences $x_{1:n}$, $\pi(x_{1:n})$ and $x_{1:m}$ as well as their corresponding joint distributions $\rho_{x_{1:n}}$, $\rho_{\pi(x_{1:n})}$ and $\rho_{x_{1:m}}$. In order for these joint distributions to all be marginals of some higher-dimensional distribution given by the stochastic process $F$, they have to satisfy equations 1 and 2 above.
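As a brief worked example of these two conditions (added for illustration, relying only on standard properties of multivariate Gaussians), consider a GP with mean function $m$ and covariance function $k$, and write $K(x_{1:n},x_{1:n})$ for the matrix with entries $k(x_{i},x_{j})$. Exchangeability holds because permuting the inputs permutes the entries of the mean vector and the rows and columns of the covariance matrix in the same way as the outputs. Consistency holds because marginalising a Gaussian over $y_{m+1:n}$ simply drops the corresponding entries of the mean and covariance:

$$\rho_{x_{1:n}}(y_{1:n})=\mathcal{N}\big(y_{1:n}\,\big|\,m(x_{1:n}),K(x_{1:n},x_{1:n})\big),\qquad\int\rho_{x_{1:n}}(y_{1:n})\,dy_{m+1:n}=\mathcal{N}\big(y_{1:m}\,\big|\,m(x_{1:m}),K(x_{1:m},x_{1:m})\big)=\rho_{x_{1:m}}(y_{1:m})$$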

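Given a particular instantiation $f$ of the stochastic process, the joint distribution is defined as:

$$\rho_{x_{1:n}}(y_{1:n})=\int p(f)\,p(y_{1:n}|f,x_{1:n})\,df \tag{3}$$

Here $p$ denotes the abstract probability distribution over all random quantities. Instead of $Y_{i}=F(x_{i})$, we add some observation noise $Y_{i}\sim\mathcal{N}(F(x_{i}),\sigma^{2})$ and define $p$ as:

$$p(y_{1:n}|f,x_{1:n})=\prod_{i=1}^{n}\mathcal{N}(y_{i}|f(x_{i}),\sigma^{2}) \tag{4}$$

Inserting this into equation 3, the stochastic process is specified by:

$$\rho_{x_{1:n}}(y_{1:n})=\int p(f)\prod_{i=1}^{n}\mathcal{N}(y_{i}|f(x_{i}),\sigma^{2})\,df \tag{5}$$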

In other words, exchangeability and consistency of the collection of joint distributions $\{\rho_{x_{1:n}}\}$ imply the existence of a stochastic process $F$ such that the observations $Y_{1:n}$ become iid conditional upon $F$. This essentially corresponds to a conditional version of de Finetti’s Theorem that anchors much of Bayesian nonparametrics (De Finetti, 1937). In order to represent a stochastic process using a NP, we will approximate it with a neural network, and assume that $F$ can be parameterised by a high-dimensional random vector $z$, and write $F(x)=g(x,z)$ for some fixed and learnable function $g$ (i.e. the randomness in $F$ is due to that of $z$). The generative model (Figure 1(a)) then follows from equation 5:

$$p(z,y_{1:n}|x_{1:n})=p(z)\prod_{i=1}^{n}\mathcal{N}(y_{i}|g(x_{i},z),\sigma^{2}) \tag{6}$$

where, following ideas of variational auto-encoders, we assume $p(z)$ is a multivariate standard normal, and $g(x_{i},z)$ is a neural network which captures the complexities of the model.

To learn such a distribution over random functions, rather than a single function, it is essential to train the system using multiple datasets concurrently, with each dataset being a sequence of inputs $x_{1:n}$ and outputs $y_{1:n}$, so that we can learn the variability of the random function from the variability of the datasets (see section 2.2).

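Since the decoder $g$ is non-linear, we can use amortised variational inference to learn it. Let $q(z|x_{1:n},y_{1:n})$ be a variational posterior of the latent variables $z$, parameterised by another neural network that is invariant to permutations of the sequences $x_{1:n},y_{1:n}$. Then the evidence lower-bound (ELBO) is given by:

$$\log p(y_{1:n}|x_{1:n})\geq\mathbb{E}_{q(z|x_{1:n},y_{1:n})}\left[\sum_{i=1}^{n}\log p(y_{i}|z,x_{i})+\log\frac{p(z)}{q(z|x_{1:n},y_{1:n})}\right] \tag{7}$$

In an alternative objective that better reflects the desired model behaviour at test time, we split the dataset into a context set, $x_{1:m},y_{1:m}$, and a target set, $x_{m+1:n},y_{m+1:n}$, and model the conditional of the target given the context. This gives:

$$\log p(y_{m+1:n}|x_{1:n},y_{1:m})\geq\mathbb{E}_{q(z|x_{1:n},y_{1:n})}\left[\sum_{i=m+1}^{n}\log p(y_{i}|z,x_{i})+\log\frac{p(z|x_{1:m},y_{1:m})}{q(z|x_{1:n},y_{1:n})}\right] \tag{8}$$

Note that in the above the conditional prior $p(z|x_{1:m},y_{1:m})$ is intractable. We can approximate it using the variational posterior $q(z|x_{1:m},y_{1:m})$, which gives:

$$\log p(y_{m+1:n}|x_{1:n},y_{1:m})\geq\mathbb{E}_{q(z|x_{1:n},y_{1:n})}\left[\sum_{i=m+1}^{n}\log p(y_{i}|z,x_{i})+\log\frac{q(z|x_{1:m},y_{1:m})}{q(z|x_{1:n},y_{1:n})}\right] \tag{9}$$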
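To make the structure of the objective in equation 9 more concrete, the following is a minimal sketch of a single-sample estimate, assuming both variational distributions are diagonal Gaussians and the likelihood has a fixed observation noise. The function names (`gaussian_log_pdf`, `kl_diag_gaussians`, `np_objective`), the `noise_std` default and the way the decoder is passed in as a plain callable are illustrative assumptions, not the implementation used for the experiments.

```python
import math
import random


def gaussian_log_pdf(y, mean, std):
    """Log density of a univariate Gaussian N(y | mean, std^2)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (y - mean) ** 2 / (2 * std ** 2)


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for diagonal Gaussians given as lists of means and std devs."""
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p)
    )


def np_objective(targets, decoder, q_all, q_context, noise_std=0.1, rng=random):
    """Single-sample estimate of the right-hand side of equation 9.

    targets:   list of (x_i, y_i) pairs for i = m+1..n
    decoder:   callable g(x, z) returning the predicted mean (hypothetical stand-in)
    q_all:     (mu, sigma) of q(z | x_{1:n}, y_{1:n})
    q_context: (mu, sigma) of q(z | x_{1:m}, y_{1:m})
    """
    mu, sigma = q_all
    # Reparameterised sample z = mu + sigma * eps with eps ~ N(0, I).
    z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
    log_lik = sum(gaussian_log_pdf(y, decoder(x, z), noise_std) for x, y in targets)
    # E_q[log q_context / q_all] equals -KL(q_all || q_context).
    return log_lik - kl_diag_gaussians(mu, sigma, *q_context)
```

In this form the second term is available in closed form, so only the reconstruction term needs to be estimated by sampling $z$.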

2.2 Distributions over functions

A key motivation for NPs is the ability to represent a distribution over functions rather than a single function. In order to train such a model we need a training procedure that reflects this task.

More formally, to train a NP we form a dataset that consists of functions $f:X\to Y$ that are sampled from some underlying distribution $\mathcal{D}$. As an illustrative example, consider a dataset consisting of functions $f_{d}(x)\sim\mathcal{GP}$ that have been generated using a Gaussian process with a fixed kernel. For each of the functions $f_{d}(x)$ our dataset contains a number of $(x,y)_{i}$ tuples where $y_{i}=f_{d}(x_{i})$. For training purposes we divide these points into a set of $n$ context points $C=\{(x,y)_{i}\}_{i=1}^{n}$ and a set of $n+m$ target points $T=\{(x,y)_{i}\}_{i=1}^{n+m}$, which consists of all points in $C$ as well as $m$ additional unobserved points. During testing the model is presented with some context $C$ and has to predict the target values $y_{T}=f(x_{T})$ at target positions $x_{T}$.

In order to be able to predict accurately across the entire dataset a model needs to learn a distribution that covers all of the functions observed in training and be able to take into account the context data at test time.
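The context/target split described above can be sketched as follows. This is only an illustration under simple assumptions: the helper name `split_context_target` is hypothetical, points are plain `(x, y)` tuples, and the context is chosen uniformly at random.

```python
import random


def split_context_target(points, n_context, rng=random):
    """Split the (x, y) tuples of one function into a context set C and a target set T.

    T contains every point in C plus the remaining, unobserved points,
    so len(T) == n + m where n == n_context.
    """
    if not 0 <= n_context <= len(points):
        raise ValueError("n_context must lie between 0 and the number of points")
    shuffled = list(points)
    rng.shuffle(shuffled)
    context = shuffled[:n_context]  # the n observed points
    target = shuffled               # all n + m points, context first
    return context, target
```

Drawing `n_context` afresh for every sampled function is one way to expose the model to contexts of different sizes during training.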

2.3 Global latent variable

As mentioned above, neural processes include a latent variable $z$ that captures $F$. This latent variable is of particular interest because it captures the global uncertainty, which allows us to sample at a global level (one function $f_{d}$ at a time) rather than at a local output level (one $y_{i}$ value for each $x_{i}$ at a time, independently of the remaining $y_{T}$).

In addition, since we are passing all of the context’s information through this single variable, we can formulate the model in a Bayesian framework. In the absence of context points $C$ the latent distribution $p(z)$ would correspond to a data-specific prior the model has learned during training. As we add observations, the latent distribution encoded by the model amounts to the posterior $p(z|C)$ over the function given the context. On top of this, as shown in equation 9, instead of using a zero-information prior $p(z)$, we condition the prior on the context. As such, this prior is equivalent to a less informed posterior of the underlying function. This formulation makes it clear that the posterior given a subset of the context points will serve as the prior when additional context points are included. By using this setup, and training with different sizes of context, we encourage the learned model to be flexible with regard to the number and position of the context points.

2.4 The Neural process model

In our implementation of NPs we accommodate two additional desiderata: invariance to the order of context points and computational efficiency. The resulting model can be boiled down to three core components (see Figure 1(b)):

An encoder $h$ from input space into representation space that takes in pairs of $(x,y)_{i}$ context values and produces a representation $r_{i}=h((x,y)_{i})$ for each of the pairs. We parameterise $h$ as a neural network.

An aggregator $a$ that summarises the encoded inputs. We are interested in obtaining a single order-invariant global representation $r$ that parameterises the latent distribution $z\sim\mathcal{N}(\mu(r),I\sigma(r))$. The simplest operation that ensures order-invariance and works well in practice is the mean function $r=a(r_{i})=\frac{1}{n}\sum_{i=1}^{n}r_{i}$. Crucially, the aggregator reduces the runtime to $\mathcal{O}(n+m)$, where $n$ and $m$ are the number of context and target points respectively.

A conditional decoder $g$ that takes as input the sampled global latent variable $z$ as well as the new target locations $x_{T}$ and outputs the predictions $\hat{y}_{T}$ for the corresponding values of $f(x_{T})=y_{T}$.
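The three components above can be illustrated with a small, self-contained sketch. The hand-written functions below are toy stand-ins for the neural networks $h$, $\mu$, $\sigma$ and $g$ (their functional forms and the two-dimensional latent are assumptions made purely for illustration); the point is the data flow: encode each context pair, average, sample one global $z$, decode every target location.

```python
import math
import random


def encoder_h(x, y):
    """Toy stand-in for the encoder h: maps one (x, y) pair to a representation r_i."""
    return [math.tanh(x + y), math.tanh(x - y), math.tanh(x * y)]


def aggregator_a(reps):
    """Mean over the r_i: order-invariant, and a single pass over the n context points."""
    n = len(reps)
    return [sum(r[k] for r in reps) / n for k in range(len(reps[0]))]


def latent_params(r):
    """Toy stand-in for mu(r) and sigma(r); softplus keeps sigma positive."""
    mu = [r[0] + r[1], r[2]]
    sigma = [0.1 + math.log1p(math.exp(r[0])), 0.1 + math.log1p(math.exp(r[1]))]
    return mu, sigma


def decoder_g(x_target, z):
    """Toy stand-in for the decoder g(x, z)."""
    return z[0] * math.sin(x_target) + z[1]


def predict(context, x_targets, rng=random):
    """One function sample evaluated at x_targets, conditioned on the context."""
    r = aggregator_a([encoder_h(x, y) for x, y in context])       # O(n)
    mu, sigma = latent_params(r)
    z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]  # one global sample
    return [decoder_g(x, z) for x in x_targets]                   # O(m)
```

Because the context only enters through the mean of the $r_{i}$, reordering the context pairs leaves the prediction unchanged (up to floating-point rounding), and calling `predict` repeatedly yields different function samples for the same context.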

Related work

Neural Processes (NPs) are a generalisation of Conditional Neural Processes (CNPs, Garnelo et al. (2018)). CNPs share a large part of the motivation behind neural processes, but lack a latent variable that allows for global sampling (see Figure 2(c) for a diagram of the model). As a result, CNPs are unable to produce different function samples for the same context data, which can be important if modelling this uncertainty is desirable. It is worth mentioning that the original CNP formulation did include experiments with a latent variable in addition to the deterministic connection. However, given the deterministic connections to the predicted variables, the role of the global latent variable is not clear. In contrast, NPs constitute a more clear-cut generalisation of the original deterministic CNP with stronger parallels to other latent variable models and approximate Bayesian methods. These parallels allow us to compare our model to a wide range of related research areas in the following sections.

Finally, NPs and CNPs themselves can be seen as generalizations of recently published generative query networks (GQN) which apply a similar training procedure to predict new viewpoints in 3D scenes given some context observations (Eslami et al., 2018). Consistent GQN (CGQN) is an extension of GQN that focuses on generating consistent samples and is thus also closely related to NPs (Kumar et al., 2018).

3.2 Gaussian processes

We start by considering models that, like NPs, lie on the spectrum between neural networks (NNs) and Gaussian processes (GPs). Algorithms on the NN end of the spectrum fit a single function that they learn from a very large amount of data directly. GPs on the other hand can represent a distribution over a family of functions, which is constrained by an assumption on the functional form of the covariance between two points.

Scattered across this spectrum, we can place recent research that has combined ideas from Bayesian non-parametrics with neural networks. Methods such as those of Calandra et al. (2016) and Huang et al. (2015) remain fairly close to GPs, but incorporate NNs to pre-process the input data. Deep GPs have some conceptual similarity to NNs as they stack GPs to obtain deep models (Damianou & Lawrence, 2013). Approaches that are more similar to NNs include, for example, neural networks whose weights are sampled using a GP (Wilson et al., 2011) or networks where each unit represents a different kernel (Sun et al., 2018).

There are two models on this spectrum that are closely related to NPs: matching networks (MN, Vinyals et al. (2016)) and deep kernel learning (DKL, Wilson et al. (2016)). As with NPs, both use NNs to extract representations from the data, but while NPs learn the ‘kernel’ to compare data points implicitly, these other two models pass the representation to an explicit distance kernel. MNs use this kernel to measure the similarity between contexts and targets for few-shot classification, while the kernel in DKL is used to parametrise a GP. Because of this explicit kernel, the computational complexity of MNs and DKL would be quadratic and cubic, respectively, instead of $\mathcal{O}(n+m)$ as it is for NPs. To overcome this computational complexity, DKL replaces a standard GP with a kernel approximation given by a KISS GP (Wilson & Nickisch, 2015), while prototypical networks (Snell et al., 2017) are introduced as a more light-weight version of MNs that also scales with $\mathcal{O}(n+m)$.

Finally, concurrent work by Ma et al. introduces variational implicit processes, which share a large part of the motivation of NPs but are implemented as GPs (Ma et al., 2018). In this context NPs can be interpreted as a neural implementation of an implicit stochastic process.

On this spectrum from NNs to GPs, neural processes remain closer to the neural end than most of the models mentioned above. By giving up on the explicit definition of a kernel NPs lose some of the mathematical guarantees of GPs, but trade this off for data-driven ‘priors’ and computational efficiency.

3.3 Meta-learning

In contemporary meta-learning vocabulary, NPs and GPs can be seen to be methods for ‘few-shot function estimation’. In this section we compare with related models that can be used to the same end. A prominent example is matching networks (Vinyals et al., 2016), but there is a large literature of similar models for classification (Koch et al., 2015; Santoro et al., 2016), reinforcement learning (Wang et al., 2016), parameter update (Finn et al., 2017, 2018), natural language processing (Bowman et al., 2015) and program induction (Devlin et al., 2017). Related are generative meta-learning approaches that carry out few-shot estimation of the data densities (van den Oord et al., 2016; Reed et al., 2017; Bornschein et al., 2017; Rezende et al., 2016).

Meta-learning models share the fundamental motivations of NPs as they shift workload from training time to test time. NPs can therefore be described as meta-learning algorithms for few-shot function regression, although as shown in Garnelo et al. (2018) they can also be applied to few-shot learning tasks beyond regression.

3.4 Bayesian methods

The link between meta-learning methods and other research areas, like GPs and Bayesian methods, is not always evident. Interestingly, recent work by Grant et al. (2018) points out the relation between model-agnostic meta-learning (MAML, Finn et al. (2017)) and hierarchical Bayesian inference. In this work the meta-learning properties of MAML are formulated as a result of task-specific variables that are conditionally independent given a higher-level variable. This hierarchical description can be rewritten as a probabilistic inference problem, and the resulting marginal likelihood $p(y|C)$ matches the original MAML objective. The parallels between NPs and hierarchical Bayesian methods are similarly straightforward. Given the graphical model in Figure 2(d) we can write out the conditional marginal likelihood as a hierarchical inference problem:

$$\log p(y_{T}|C,x_{T})=\log\int p(y_{T}|z,x_{T})\,p(z|C)\,dz \tag{10}$$

Another interesting area that connects Bayesian methods and NNs is Bayesian neural networks (Gal & Ghahramani, 2016; Blundell et al., 2015; Louizos et al., 2017; Louizos & Welling, 2017). These models learn distributions over the network weights and use the posterior of these weights to estimate the values of $y_{T}$ given $y_{C}$. In this context NPs can be thought of as an amortised version of Bayesian deep learning.

3.5 Conditional latent variable models

We have covered algorithms that are conceptually similar to NPs and algorithms that carry out similar tasks to NPs. In this section we look at models of the same family as NPs: conditional latent variable models. Such models (Figure 2(a)) learn the conditional distribution $p(y_{T}|y_{C},z)$, where $z$ is a latent variable that can be sampled to generate different predictions. Training this type of directed graphical model is intractable and, as with variational autoencoders (VAEs, Rezende et al. (2014); Kingma & Welling (2013)), conditional variational autoencoders (CVAEs, Sohn et al. (2015)) approximate the objective function using the variational lower bound on the log likelihood:

$$\log p(y_{T}|y_{C})\geq\mathbb{E}_{q(z_{T}|y_{C},y_{T})}\left[\log p(y_{T}|z_{T},y_{C})+\log\frac{p(z_{T}|y_{C})}{q(z_{T}|y_{C},y_{T})}\right] \tag{11}$$

We refer to the latent variable of CVAEs, $z_{T}$, as a local latent variable in order to distinguish it from the global latent variables that are present in the models later on. We call this latent variable local as it is sampled anew for each of the output predictions $y_{T,i}$. This is in contrast to a global latent variable that is only sampled once and used to predict multiple values of $y_{T}$. In the CVAE, conditioning on the context is done by adding the dependence both in the prior $p(z_{T}|y_{C})$ and the decoder $p(y|z_{T},y_{C})$, so they can be considered as deterministic functions of the context.

CVAEs have been extended in a number of ways, for example by adding attention (Rezende et al., 2016). Another related extension is generative matching networks (GMNs, Bartunov & Vetrov (2016)), where the conditioning input is pre-processed in a way that is similar to the matching networks model.

A more complex version of the CVAE that is very relevant in this context is the neural statistician (NS, Edwards & Storkey (2016)). Similar to the neural process, the neural statistician contains a global latent variable $z$ that captures global uncertainty (see Figure 2(b)). A crucial difference is that while NPs represent the distribution over functions, NS represents the distribution over sets. Since NS does not contain a corresponding $x$ value for each $y$ value, it does not capture a pair-wise relation like GPs and NPs, but rather a general distribution of the $y$ values. Rather than generating different $y$ values by querying the model with different $x$ values, NS generates different $y$ values by sampling an additional local hidden variable $z_{T}$.

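The ELBO of the NS reflects the hierarchical nature of the model with a double expectation over the local and the global variable. If we leave out the local latent variable for a more direct comparison to NPs, the ELBO becomes:

$$\log p(y_{T},y_{C})\geq\mathbb{E}_{q(z|y_{T},y_{C})}\left[\sum_{i\in T\cup C}\log p(y_{i}|z)+\log\frac{p(z)}{q(z|y_{T},y_{C})}\right] \tag{12}$$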

Notably, the prior $p(z)$ of NS is not conditioned on the context. The prior of NPs, on the other hand, is conditional (equation 9), which brings the training objective closer to the way the model is used at test time. Another variant of the ELBO is presented in the variational homoencoder (Hewitt et al., 2018), a model that is very similar to the neural statistician but uses a separate subset of data points for the context and predictions.

As reflected in Figure 2, the main difference between NPs and the conditional latent variable models is that the latter lack an $x$ variable that allows for targeted sampling of the latent distribution. This change, despite seeming small, drastically alters the range of applications. Targeted sampling, for example, allows for generation and completion tasks (e.g. the image completion tasks) or the addition of some downstream task (like using an NP for reinforcement learning). It is worth mentioning that all of the conditional latent variable models have also been applied to few-shot classification problems, where the data space generally consists of input tuples $(x,y)$, rather than just single outputs $y$. The models are able to carry this out by framing classification either as a comparison between the log likelihoods of the different classes or by looking at the KL between the different posteriors, thereby overcoming the need to work with data tuples.

Results

In order to test whether neural processes indeed learn to model distributions over functions, we first apply them to a 1-D function regression task. The functions for this experiment are generated using a GP with varying kernel parameters for each function. At every training step we sample a set of values for the Gaussian kernel of a GP and use those to sample a function $f_{D}(x)$. A random number of the $(x,y)_{C}$ pairs are passed into the encoder of the NP as context points. We pick additional unobserved pairs $(x,y)_{U}$, which we combine with the observed context points $(x,y)_{C}$ as targets, and feed $x_{T}$ to the decoder, which returns its estimate $\hat{y}_{T}$ of the underlying value of $y_{T}$.
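The data-generating step for this experiment might look roughly like the sketch below, which samples one function from a zero-mean GP with a Gaussian (RBF) kernel whose parameters are redrawn for every function. The parameter ranges, the jitter term and the helper names are assumptions chosen for illustration; the text does not specify them.

```python
import math
import random


def rbf_kernel(xs, length_scale, variance, jitter=1e-6):
    """Gaussian kernel matrix with a small diagonal jitter for numerical stability."""
    n = len(xs)
    return [
        [
            variance * math.exp(-0.5 * ((xs[i] - xs[j]) / length_scale) ** 2)
            + (jitter if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_gp_function(xs, rng):
    """Draw kernel parameters (assumed ranges), then one function sample at xs."""
    length_scale = rng.uniform(0.1, 0.6)  # assumed range
    variance = rng.uniform(0.1, 1.0)      # assumed range
    low = cholesky(rbf_kernel(xs, length_scale, variance))
    eps = [rng.gauss(0.0, 1.0) for _ in xs]
    ys = [sum(low[i][k] * eps[k] for k in range(i + 1)) for i in range(len(xs))]
    return list(zip(xs, ys))
```

Each call returns the $(x,y)$ pairs of one function; a random subset of them would then serve as context and the full set as targets.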

Some sample curves are shown in Figure 3. For the same underlying ground truth curve (black line) we run the neural process using varying numbers of context points and generate several samples for each run (light-blue lines). As evidenced by the results, the model has learned some key properties of the 1-D curves from the data, such as continuity and the general shape of functions sampled from a GP with a Gaussian kernel. When provided with only one context point the model generates curves that fluctuate around 0, the prior of the data-generating GP. Crucially, these curves go through or near the observed context point and display a higher variance in regions where no observations are present. As the number of context points increases this uncertainty is reduced and the model’s predictions better match the underlying ground truth. Given that this is a neural approximation, the curves will sometimes only approach the observation points rather than pass through them, as is the case for GPs. On the other hand, once the model is trained it can regress more than just one data set, i.e. it will produce sensible results for curves generated using any kernel parameters observed during training.

4.2 2-D function regression

One of the benefits of neural processes is their functional flexibility, as they can learn non-trivial ‘kernels’ from the data directly. In order to test this we apply NPs to a more complex regression problem. We carry out image completion as a regression task, where we provide some of the pixels as context and do pixel-wise prediction over the entire image. In this formulation the $x_{i}$ values correspond to the Cartesian coordinates of each pixel and the $y_{i}$ values to the pixel intensity (see Figure 4 for an explanation of this). We choose images as our dataset because they constitute a complex 2-D function and are easy to evaluate visually. It is important to point out that NPs, as such, have not been designed for image generation like other specialised generative models.
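Framing an image as a regression dataset can be sketched in a few lines. The helper below is hypothetical, and the rescaling of pixel coordinates to $[0,1]$ is an assumption made for illustration rather than a detail given in the text.

```python
def image_to_pairs(image):
    """Flatten a 2-D grid of intensities into (x_i, y_i) regression pairs.

    x_i is the (row, column) coordinate rescaled to [0, 1] (an assumed
    normalisation) and y_i is the pixel intensity at that coordinate.
    """
    height, width = len(image), len(image[0])
    pairs = []
    for i, row in enumerate(image):
        for j, value in enumerate(row):
            x = (i / max(height - 1, 1), j / max(width - 1, 1))
            pairs.append((x, value))
    return pairs
```

A random subset of the returned pairs plays the role of the context pixels, while the full list provides the targets for pixel-wise prediction.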

We train separate models on the MNIST (LeCun et al., 1998) and the CelebA (Liu et al., 2015) datasets. As shown in Figure 4 the model performs well on both tasks. In the case of the MNIST digits the uncertainty is reflected in the variability of the generated digit. Given only a few context points more than just one digit can fit the observations and as a result the model produces different digits when sampled several times. As the number of context points increases the set of possible digits is reduced and the model produces the same digit, albeit with structural modifications that become smaller as the number of context points increases.

The same holds for the CelebA dataset. In this case, when provided with limited context the model samples from a wider range of possible faces, and as it observes more context points it converges towards very similar-looking faces. We do not expect the model to reconstruct the target image perfectly even when all the pixels are provided as context, since the latent variable $z$ constitutes a strong bottleneck. This can be seen in the final column of the figure, where the predicted images are not only not identical to the ground truth but also vary between themselves. The latter is likely a consequence of the latent variance, which has been clipped to a small value to avoid collapsing, so even when no uncertainty is present we can generate different samples from $p(z|C)$.

4.3 Black-box optimisation with Thompson sampling

To showcase the utility of sampling entire consistent trajectories we apply neural processes to Bayesian optimisation of 1-D functions using Thompson sampling (Thompson, 1933). Thompson sampling (also known as randomised probability matching) is an approach to tackle the exploration-exploitation dilemma by maintaining a posterior distribution over model parameters. A decision is taken by drawing a sample of model parameters and acting greedily under the resulting policy. The posterior distribution is then updated and the process is repeated. Despite its simplicity, Thompson sampling has been shown to be highly effective both empirically and in theory. It is commonly applied to black-box optimisation and multi-armed bandit problems (e.g. Agrawal & Goyal, 2012; Shahriari et al., 2016).

Neural processes lend themselves naturally to Thompson sampling by instead drawing a function over the space of interest, finding its minimum and adding the observed outcome to the context set for the next iteration. As shown in Figure 3, function draws show high variance when only a few observations are available, modelling uncertainty in a similar way to draws from a posterior over parameters given a small data set. An example of this procedure for neural processes on a 1-D objective function is shown in Figure 5.
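The loop described above can be sketched as follows. Note that `toy_sampler` is a deliberately crude, hypothetical stand-in for drawing a function from a trained NP given the current context; only the structure of `thompson_minimise` (draw a function, minimise it over a candidate grid, evaluate the true objective, extend the context) is meant to mirror the procedure in the text.

```python
import math
import random


def toy_sampler(context, rng):
    """Hypothetical stand-in for an NP function draw conditioned on the context."""
    offset = rng.gauss(0.0, 1.0)  # one global random draw per sampled function

    def f(x):
        if not context:
            return offset * math.sin(3.0 * x + offset)
        # Follow the nearest observation, deviating more the further away we are.
        xc, yc = min(context, key=lambda p: abs(p[0] - x))
        return yc + offset * abs(x - xc)

    return f


def thompson_minimise(objective, sample_function, candidates, n_steps, rng):
    """Thompson sampling for minimisation over a finite candidate grid."""
    context = []
    for _ in range(n_steps):
        f = sample_function(context, rng)            # draw one function
        x_next = min(candidates, key=f)              # minimiser of the draw
        context.append((x_next, objective(x_next)))  # query the true objective
    return min(context, key=lambda p: p[1])          # best point observed so far
```

With a trained NP in place of `toy_sampler`, each function draw costs a single forward pass, which is where the approach gains its speed.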

We report the average number of steps required by an NP to reach the global minimum of a function generated from a GP prior in Table 1. For easier comparison the values are normalised by the number of steps required when doing optimisation using random search. On average, NPs take four times fewer iterations than random search on this task. An upper bound on performance is given by a Gaussian process with the same kernel as the GP that generated the function to be optimised. NPs do not reach this optimal performance, as their samples are noisier than those of a GP, but they are faster to evaluate since merely a forward pass through the network is needed. This difference in computational speed is bound to become more notable as the dimensionality of the problem and the number of necessary function evaluations increase.

4.4 Contextual bandits

Finally, we apply neural processes to the wheel bandit problem introduced in Riquelme et al. (2018), which constitutes a contextual bandit task on the unit circle with varying needs for exploration that can be smoothly parameterised. The problem can be summarised as follows (see Figure 6 for clarity): a unit circle is divided into a low-reward region (blue area) and four high-reward regions (the other four coloured areas). The size of the low-reward region is defined by a scalar $\delta$. At every episode a different value for $\delta$ is selected. The agent is then provided with some coordinates $X=(X_{1},X_{2})$ within the circle and has to choose among $k=5$ arms depending on the area the coordinates fall into. If $\|X\|\leq\delta$, the sample falls within the low-reward region (blue). In this case arm $k=1$ is the optimal action, as it provides a reward drawn from $r\sim\mathcal{N}(1.2,0.01^{2})$, while all other actions only return $r\sim\mathcal{N}(1.0,0.01^{2})$. If the sample falls within any of the four high-reward regions ($\|X\|>\delta$), the optimal arm will be one of the remaining four, $k=2,\ldots,5$, depending on the specific area. Pulling the optimal arm here results in a high reward $r\sim\mathcal{N}(50.0,0.01^{2})$, and as before all other arms receive $\mathcal{N}(1.0,0.01^{2})$ except for arm $k=1$, which again returns $\mathcal{N}(1.2,0.01^{2})$.
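The reward structure just described can be written down as a small sketch. The reward means and standard deviation are taken from the text; the specific assignment of the four high-reward areas to arms 2 to 5 by quadrant is an assumption made for illustration, as are the function names.

```python
import math
import random


def optimal_arm(x1, x2, delta):
    """Arm index in 1..5; the quadrant-to-arm mapping is an illustrative assumption."""
    if math.hypot(x1, x2) <= delta:
        return 1  # low-reward region
    if x1 >= 0 and x2 >= 0:
        return 2
    if x1 >= 0 and x2 < 0:
        return 3
    if x1 < 0 and x2 >= 0:
        return 4
    return 5


def mean_reward(arm, x1, x2, delta):
    """Mean reward of pulling `arm` at context X = (x1, x2)."""
    if arm == 1:
        return 1.2  # arm 1 always returns N(1.2, 0.01^2)
    if math.hypot(x1, x2) > delta and arm == optimal_arm(x1, x2, delta):
        return 50.0  # optimal arm in a high-reward region
    return 1.0


def sample_reward(arm, x1, x2, delta, rng=random, std=0.01):
    """Noisy reward r ~ N(mean_reward, std^2)."""
    return rng.gauss(mean_reward(arm, x1, x2, delta), std)
```

Small values of $\delta$ make the high-reward regions large and easy to find, whereas values close to 1 leave only a thin outer ring where exploration pays off.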

We compare our model to a large range of methods that can be used for Thompson sampling, taking results from Riquelme et al. (2018), who kindly agreed to share the experiment and evaluation code with us. Neural Processes can be applied to this problem by training on a distribution of tasks before applying the method. Since the methods described in Riquelme et al. (2018) do not require such a pre-training phase, we also include model-agnostic meta-learning (MAML, Finn et al. (2017)), a method relying on a similar pre-training phase, using code made available by the authors. For both NPs and MAML, we create a batch for pre-training by first sampling $M$ different wheel problems $\{\delta_{i}\}_{i=1}^{M}$, $\delta_{i}\sim\mathcal{U}(0,1)$, followed by sampling tuples $\{(X,a,r)_{j}\}_{j=1}^{N}$ for context $X$, arm $a$ and associated reward $r$ for each $\delta_{i}$. We set $M=64$ and $N=562$, using 512 context and 50 target points for Neural Processes, and an equal number of data points for the meta- and inner-updates in MAML. Note that since gradient steps are necessary for MAML to adapt to data from each test problem, we reset the parameters after each evaluation run. This additional step is not necessary for neural processes.

Table 2 shows the quantitative evaluation on this task. We observe that Neural Processes are a highly competitive method, performing similarly to MAML and the NeuralLinear baseline in Riquelme et al. (2018), which is consistently among the best out of 20 algorithms compared.

Discussion

We introduce Neural processes, a family of models that combines the benefits of stochastic processes and neural networks. NPs learn to represent distributions over functions and make flexible predictions at test time conditioned on some context input. Instead of requiring a handcrafted kernel, NPs learn an implicit measure from the data directly.

We apply NPs to a range of regression tasks to showcase their flexibility. The goal of this paper is to introduce NPs and compare them to currently ongoing research. As such, the tasks presented here are diverse but relatively low-dimensional. We leave it to future work to scale NPs up to higher-dimensional problems that are likely to highlight the benefit of lower computational complexity and data-driven representations.

Acknowledgements

We would like to thank Tiago Ramalho, Oriol Vinyals, Adam Kosiorek, Irene Garnelo, Daniel Burgess, Kevin McKee and Claire McCoy for insightful discussions and being awesome people. We would also like to thank Carlos Riquelme, George Tucker and Jasper Snoek for providing the code to reproduce the results of their contextual bandits experiments (and, of course, also being awesome people).

References