Tighter Variational Bounds are Not Necessarily Better

Tom Rainforth, Adam R. Kosiorek, Tuan Anh Le, Chris J. Maddison, Maximilian Igl, Frank Wood, Yee Whye Teh

Introduction

Variational bounds provide tractable and state-of-the-art objectives for training deep generative models (Kingma & Welling, 2014; Rezende et al., 2014). Typically taking the form of a lower bound on the intractable model evidence, they provide surrogate targets that are more amenable to optimization. In general, this optimization requires the generation of approximate posterior samples during the model training and so a number of methods simultaneously learn an inference network alongside the target generative network.

As well as assisting the training process, this inference network is often also of direct interest itself. For example, variational bounds are often used to train auto-encoders (Bourlard & Kamp, 1988; Hinton & Zemel, 1994; Gregor et al., 2016; Chen et al., 2017), for which the inference network forms the encoder. Variational bounds are also used in amortized and traditional Bayesian inference contexts (Hoffman et al., 2013; Ranganath et al., 2014; Paige & Wood, 2016; Le et al., 2017), for which the generative model is fixed and the inference network is the primary target for the training.

The performance of variational approaches depends upon the choice of evidence lower bound (elbo) and the formulation of the inference network, with the two often intricately linked to one another; if the inference network formulation is not sufficiently expressive, this can have a knock-on effect on the generative network (Burda et al., 2016). In choosing the elbo, it is often implicitly assumed that using tighter elbos is universally beneficial, at least whenever this does not in turn lead to higher variance gradient estimates.

In this work we question this implicit assumption by demonstrating that, although using a tighter elbo is typically beneficial to gradient updates of the generative network, it can be detrimental to updates of the inference network. Remarkably, we find that it is possible to simultaneously tighten the bound, reduce the variance of the gradient updates, and arbitrarily deteriorate the training of the inference network.

Specifically, we present theoretical and empirical evidence that increasing the number of importance sampling particles, $K$ , to tighten the bound in the importance-weighted auto-encoder (iwae) (Burda et al., 2016), degrades the signal-to-noise ratio (snr) of the gradient estimates for the inference network, inevitably deteriorating the overall learning process. In short, this behavior manifests because even though increasing $K$ decreases the standard deviation of the gradient estimates, it decreases the magnitude of the true gradient faster, such that the relative variance increases.

Our results suggest that it may be best to use distinct objectives for learning the generative and inference networks, or that when using the same target, it should take into account the needs of both networks. Namely, while tighter bounds are typically better for training the generative network, looser bounds are often preferable for training the inference network. Based on these insights, we introduce three new algorithms: the partially importance-weighted auto-encoder (piwae), the multiply importance-weighted auto-encoder (miwae), and the combination importance-weighted auto-encoder (ciwae). Each of these includes iwae as a special case and is based on the same set of importance weights, but uses these weights in a different way to ensure a higher snr for the inference network.

We demonstrate that our new algorithms can produce inference networks more closely representing the true posterior than iwae, while matching the training of the generative network, or potentially even improving it in the case of piwae. Even when treating the iwae objective itself as the measure of performance, all our algorithms are able to demonstrate clear improvements over iwae.

Background and Notation

Let $x$ be an $\mathcal{X}$ -valued random variable defined via a process involving an unobserved $\mathcal{Z}$ -valued random variable $z$ with joint density $p_{\theta}(x,z)$ . Direct maximum likelihood estimation of $\theta$ is generally intractable if $p_{\theta}(x,z)$ is a deep generative model due to the marginalization of $z$ . A common strategy is to instead optimize a variational lower bound on $\log p_{\theta}(x)$ , defined via an auxiliary inference model $q_{\phi}(z\lvert x)$ :

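$$\textsc{elbo}_{\text{VAE}}(\theta,\phi,x):=\int q_{\phi}(z\lvert x)\log\frac{p_{\theta}(x,z)}{q_{\phi}(z\lvert x)}\,dz=\log p_{\theta}(x)-\textsc{kl}\left(q_{\phi}(z\lvert x)\,\|\,p_{\theta}(z\lvert x)\right).\quad(1)$$

When $q_{\phi}$ is parameterized by a neural network, this approach is known as the variational auto-encoder (vae) (Kingma & Welling, 2014). The importance-weighted auto-encoder (iwae) objectives (Burda et al., 2016) are a family of bounds that instead use $K$ importance samples:

$$Q_{\phi}(z_{1:K}\lvert x):=\prod_{k=1}^{K}q_{\phi}(z_{k}\lvert x),\qquad\hat{Z}(z_{1:K},x):=\frac{1}{K}\sum_{k=1}^{K}\frac{p_{\theta}(x,z_{k})}{q_{\phi}(z_{k}\lvert x)},\quad(2)$$

$$\textsc{elbo}_{\text{IS}}(\theta,\phi,x):=\int Q_{\phi}(z_{1:K}\lvert x)\log\hat{Z}(z_{1:K},x)\,dz_{1:K}\leq\log p_{\theta}(x).$$

The iwae objectives generalize the vae objective ( $K=1$ corresponds to the vae) and the bounds become strictly tighter as $K$ increases (Burda et al., 2016). When the family of $q_{\phi}$ contains the true posteriors, the global optimum parameters $\{\theta^{*},\phi^{*}\}$ are independent of $K$ , see e.g. (Le et al., 2018). Nonetheless, except for the most trivial models, it is not usually the case that $q_{\phi}$ contains the true posteriors, and Burda et al. (2016) provide strong empirical evidence that setting $K>1$ leads to significant gains over the vae in terms of learning the generative model.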

Optimizing tighter bounds is usually empirically associated with better models $p_{\theta}$ in terms of marginal likelihood on held out data. Other related approaches extend this to sequential Monte Carlo (smc) (Maddison et al., 2017; Le et al., 2018; Naesseth et al., 2018) or change the lower bound that is optimized to reduce the bias (Li & Turner, 2016; Bamler et al., 2017). A second, unrelated, approach is to tighten the bound by improving the expressiveness of $q_{\phi}$ (Salimans et al., 2015; Tran et al., 2015; Rezende & Mohamed, 2015; Kingma et al., 2016; Maaløe et al., 2016; Ranganath et al., 2016). In this work, we focus on the former, algorithmic, approaches to tightening bounds.

Assessing the Signal-to-Noise Ratio of the Gradient Estimators

Because it is not feasible to analytically optimize any elbo in complex models, the effectiveness of any particular choice of elbo is linked to our ability to numerically solve the resulting optimization problem. This motivates us to examine the effect $K$ has on the variance and magnitude of the gradient estimates of iwae for the two networks. More generally, we study iwae gradient estimators constructed as the average of $M$ estimates, each built from $K$ independent particles. We present a result characterizing the asymptotic signal-to-noise ratio in $M$ and $K$ . For the standard case of $M=1$ , our result shows that the signal-to-noise ratio of the reparameterization gradients of the inference network for the iwae decreases with rate $O(1/\sqrt{K})$ .

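Presuming that reparameterization is possible, we can express the gradient estimate in the general form

$$\Delta_{M,K}:=\frac{1}{M}\sum_{m=1}^{M}\nabla_{\theta,\phi}\log\frac{1}{K}\sum_{k=1}^{K}w_{m,k},\qquad z_{m,k}\overset{\text{i.i.d.}}{\sim}q_{\phi}(z\lvert x),\quad w_{m,k}=\frac{p_{\theta}(x,z_{m,k})}{q_{\phi}(z_{m,k}\lvert x)}.\quad(3)$$

Thus, for a fixed budget of $T=MK$ samples, we have a family of estimators with the cases $K=1$ and $M=1$ corresponding respectively to the vae and iwae objectives. We will use ${\Delta}_{M,K}\left(\theta\right)$ to refer to gradient estimates with respect to $\theta$ and ${\Delta}_{M,K}\left(\phi\right)$ for those with respect to $\phi$ .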
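To make the roles of $M$ and $K$ concrete, the following is a minimal NumPy sketch of the value whose gradient the estimator takes, computed from a fixed set of log-weights. The function names `log_mean_exp` and `bound_estimate` and the randomly generated log-weights are hypothetical, introduced purely for illustration; in practice the log-weights would come from the model and the gradients from automatic differentiation.

```python
import numpy as np

def log_mean_exp(log_w, axis):
    """Numerically stable log of the mean of exp(log_w) along `axis`."""
    m = np.max(log_w, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(log_w - m), axis=axis))

def bound_estimate(log_w):
    """For log_w of shape (M, K), return (1/M) sum_m log((1/K) sum_k w_{m,k})."""
    return float(np.mean(log_mean_exp(log_w, axis=1)))

rng = np.random.default_rng(0)
log_w = rng.normal(size=64)  # hypothetical log-weights, budget T = M * K = 64

vae_like = bound_estimate(log_w.reshape(64, 1))   # K = 1: average of log-weights
mixed = bound_estimate(log_w.reshape(8, 8))       # M = K = 8
iwae_like = bound_estimate(log_w.reshape(1, 64))  # M = 1: log of average weight
print(vae_like, mixed, iwae_like)
```

For a fixed set of weights, Jensen's inequality orders the three values from loosest ( $K=1$ ) to tightest ( $M=1$ ), which is the sense in which increasing $K$ at a fixed budget tightens the bound.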

Variance is not always a good barometer for the effectiveness of a gradient estimation scheme; estimators with small expected values need proportionally smaller variances to be estimated accurately. In the case of iwae, when changes in $K$ simultaneously affect both the variance and expected value, the quality of the estimator for learning can actually worsen as the variance decreases. To see why, consider the marginal likelihood estimates $\hat{Z}_{m,K}=\frac{1}{K}\sum_{k=1}^{K}w_{m,k}$ . Because these become exact (and thus independent of the proposal) as $K\to\infty$ , it must be the case that $\lim_{K\rightarrow\infty}{\Delta}_{M,K}(\phi)=0$ . Thus as $K$ becomes large, the expected value of the gradient must decrease along with its variance, such that the variance relative to the problem scaling need not actually improve.

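To investigate this formally, we introduce the signal-to-noise ratio (snr), defining it to be the absolute value of the expected estimate scaled by its standard deviation:

$$\textsc{snr}_{M,K}(\theta)=\left|\frac{\mathbb{E}\left[\Delta_{M,K}(\theta)\right]}{\sigma\left[\Delta_{M,K}(\theta)\right]}\right|,$$

where $\sigma[\cdot]$ denotes the standard deviation of a random variable, and analogously for $\phi$ .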

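Theorem 1. Assume that when $M=K=1$ , the expected gradients, the variances of the gradients, and the first four moments of $w_{1,1}$ , $\nabla_{\theta}w_{1,1}$ , and $\nabla_{\phi}w_{1,1}$ are all finite, and that the variances are also non-zero. Then the signal-to-noise ratios of the gradient estimates converge at the following rates

$$\textsc{snr}_{M,K}(\theta)=\sqrt{M}\left|\frac{\sqrt{K}\,\nabla_{\theta}Z+O\left(\frac{1}{\sqrt{K}}\right)}{\sqrt{\mathbb{E}\left[w_{1,1}^{2}\left(\nabla_{\theta}\log w_{1,1}-\nabla_{\theta}\log Z\right)^{2}\right]}+O\left(\frac{1}{K}\right)}\right|=O\left(\sqrt{MK}\right),$$

$$\textsc{snr}_{M,K}(\phi)=\sqrt{M}\left|\frac{\nabla_{\phi}\text{Var}\left[w_{1,1}\right]+O\left(\frac{1}{K}\right)}{2Z\sqrt{K}\,\sigma\left[\nabla_{\phi}w_{1,1}\right]+O\left(\frac{1}{\sqrt{K}}\right)}\right|=O\left(\sqrt{\frac{M}{K}}\right),$$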

where $Z:=p_{\theta}(x)$ is the true marginal likelihood.

We give an intuitive demonstration of the result here and provide a formal proof in Appendix A. The effect of $M$ on the snr follows from using the law of large numbers on the random variable $\nabla_{\theta,\phi}\log\hat{Z}_{m,K}$ . Namely, the overall expectation is independent of $M$ and the variance reduces at a rate $O(1/M)$ . The effect of $K$ is more complicated but is perhaps most easily seen by noting that (Burda et al., 2016)

$$\nabla_{\theta,\phi}\log\hat{Z}_{m,K}=\sum_{k=1}^{K}\frac{w_{m,k}}{\sum_{\ell=1}^{K}w_{m,\ell}}\nabla_{\theta,\phi}\log w_{m,k},$$

such that $\nabla_{\theta,\phi}\log\hat{Z}_{m,K}$ can be interpreted as a self-normalized importance sampling estimate. We can, therefore, invoke the known result (see e.g. Hesterberg (1988)) that the bias of a self-normalized importance sampler converges at a rate $O(1/K)$ and the standard deviation at a rate $O(1/\sqrt{K})$ . We thus see that the snr converges at a rate $O((1/K)/(1/\sqrt{K}))=O(1/\sqrt{K})$ if the asymptotic gradient is $0$ and $O((1)/(1/\sqrt{K}))=O(\sqrt{K})$ otherwise, giving the convergence rates in the $\phi$ and $\theta$ cases respectively. ∎

The implication of these rates is that increasing $M$ is monotonically beneficial to the snr for both $\theta$ and $\phi$ , but that increasing $K$ is beneficial to the former and detrimental to the latter. We emphasize that this means the snr for the iwae inference network gets worse as we increase $K$ . This is not just an opportunity cost from the fact that we could have increased $M$ instead: increasing the total number of samples used in the estimator actually worsens the snr!

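An important point of note is that the direction of the true inference network gradients becomes independent of $K$ as $K$ becomes large. Namely, because we have as an intermediary result from deriving the snrs that

$$\mathbb{E}\left[\Delta_{M,K}(\phi)\right]=-\frac{1}{2KZ^{2}}\nabla_{\phi}\text{Var}\left[w_{1,1}\right]+O\left(\frac{1}{K^{2}}\right),$$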

we see that the expected gradient points in the direction of $-\nabla_{\phi}\text{Var}\left[w_{1,1}\right]$ as $K\to\infty$ . This direction is rather interesting: it implies that as $K\to\infty$ , the optimal $\phi$ is that which minimizes the variance of the weights. This is well known to be the optimal importance sampling distribution in terms of estimating the marginal likelihood (Owen, 2013). Given that the role of the inference network during training is to estimate the marginal likelihood, this is thus arguably exactly what we want to optimize for. As such, this result, which complements those of Cremer et al. (2017), suggests that increasing $K$ provides a preferable target in terms of the direction of the true inference network gradients. We thus see that there is a trade-off with the fact that increasing $K$ also diminishes the snr, reducing the estimates to pure noise if $K$ is set too high. In the absence of other factors, there may thus be a “sweet-spot” for setting $K$ .

3.2 Multiple Data Points

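Typically when training deep generative models, one does not optimize a single elbo but instead its average over multiple data points, i.e.

$$\mathcal{J}(\theta,\phi):=\frac{1}{N}\sum_{n=1}^{N}\textsc{elbo}_{\text{IS}}(\theta,\phi,x^{(n)}),$$

where $\textsc{elbo}_{\text{IS}}(\theta,\phi,x^{(n)})$ denotes the iwae bound for data point $x^{(n)}$ .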

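Our results extend to this setting because the $z$ are drawn independently for each $x^{(n)}$ , so, writing $\Delta_{M,K}^{(n)}$ for the gradient estimate associated with $x^{(n)}$ ,

$$\mathbb{E}\left[\frac{1}{N}\sum_{n=1}^{N}\Delta_{M,K}^{(n)}\right]=\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\left[\Delta_{1,K}^{(n)}\right],\qquad\text{Var}\left[\frac{1}{N}\sum_{n=1}^{N}\Delta_{M,K}^{(n)}\right]=\frac{1}{N^{2}M}\sum_{n=1}^{N}\text{Var}\left[\Delta_{1,K}^{(n)}\right].$$

When $N$ is a chosen mini-batch size and the $x^{(n)}$ are drawn from the empirical data distribution, the snr thus scales as $\sqrt{N}$ , so that increasing $N$ has the same effect on the snr as increasing $M$ .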

Empirical Confirmation

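Our convergence results hold exactly in relation to $M$ (and $N$ ) but are only asymptotic in $K$ due to the higher order terms. Therefore their applicability should be viewed with a healthy degree of skepticism in the small $K$ regime. With this in mind, we now present empirical support for our theoretical results and test how well they hold in the small $K$ regime using a simple Gaussian model, for which we can analytically calculate the ground truth. In this model, consistent with Appendix B, the latent and observed variables are $\mathbb{R}^{D}$ -valued with $z\sim\mathcal{N}(z;\mu,I)$ and $x\lvert z\sim\mathcal{N}(x;z,I)$ , parameterized by $\theta:=\mu$ , and the inference network $q_{\phi}(z\lvert x)$ is a Gaussian with mean $Ax+b$ and fixed covariance, parameterized by $\phi:=(A,b)$ .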

To conduct our investigation, we randomly generated a synthetic dataset from the model with $D=20$ dimensions, $N=1024$ data points, and a true model parameter value $\mu_{\text{true}}$ that was itself randomly generated from a unit Gaussian, i.e. $\mu_{\text{true}}\sim\mathcal{N}(\mu_{\text{true}};0,I)$ . We then considered the gradient at a random point in the parameter space close to optimum (we also consider a point far from the optimum in Appendix C.3). Namely each dimension of each parameter was randomly offset from its optimum value using a zero-mean Gaussian with standard deviation $0.01$ . We then calculated empirical estimates of the elbo gradients for iwae, where $M=1$ is held fixed and we increase $K$ , and for vae, where $K=1$ is held fixed and we increase $M$ . In all cases we calculated $10^{4}$ such estimates and used these samples to provide empirical estimates for, amongst other things, the mean and standard deviation of the estimator, and thereby an empirical estimate for the snr.
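As a rough illustration of this kind of experiment, the sketch below estimates the snr empirically in a one-dimensional version of the Gaussian model with a single data point. Several choices here are assumptions made for illustration and are not taken from the text: the proposal variance `s2 = 2/3`, the single observation `x = 1`, and parameter values that are deliberately far from the optimum so that the effect is visible with a modest number of repetitions. The gradients use the self-normalized form of the iwae reparameterization gradient, with $\nabla_{\mu}\log w=z-\mu$ and $\nabla_{b}\log w=\mu+x-2z$ for this model.

```python
import numpy as np

def iwae_gradient_samples(K, n_rep, mu=0.0, x=1.0, A=0.5, b=0.5, s2=2.0 / 3.0, seed=0):
    """Draw n_rep iwae gradient estimates (M = 1) for a 1-D Gaussian model.

    Model: z ~ N(mu, 1), x | z ~ N(z, 1); proposal: q(z | x) = N(A x + b, s2).
    Returns estimates of the gradient w.r.t. mu (generative) and b (inference).
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_rep, K))
    z = A * x + b + np.sqrt(s2) * eps  # reparameterized samples
    # log w up to an additive constant; -log q contributes +eps^2 / 2
    log_w = -0.5 * (z - mu) ** 2 - 0.5 * (x - z) ** 2 + 0.5 * eps ** 2
    log_w -= log_w.max(axis=1, keepdims=True)
    w = np.exp(log_w)
    w_norm = w / w.sum(axis=1, keepdims=True)  # self-normalized weights
    grad_mu = (w_norm * (z - mu)).sum(axis=1)
    grad_b = (w_norm * (mu + x - 2.0 * z)).sum(axis=1)
    return grad_mu, grad_b

def snr(samples):
    """Empirical signal-to-noise ratio: |mean| / standard deviation."""
    return abs(samples.mean()) / samples.std()

for K in (1, 4, 16, 64):
    g_mu, g_b = iwae_gradient_samples(K, n_rep=20000)
    print(K, snr(g_mu), snr(g_b))
```

Under these assumptions, the printed snr for `mu` should grow with `K` while the snr for `b` should shrink, mirroring the qualitative behavior predicted for the generative and inference networks respectively.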

We start by examining the qualitative behavior of the different gradient estimators as $K$ increases as shown in Figure 1. This shows histograms of the iwae gradient estimators for a single parameter of the inference network (left) and generative network (right). We first see in Figure 1(a) that as $K$ increases, both the magnitude and the standard deviation of the estimator decrease for the inference network, with the former decreasing faster. This matches the qualitative behavior of our theoretical result, with the snr diminishing as $K$ increases. In particular, the probability that the gradient is positive or negative becomes roughly equal for larger values of $K$ , meaning the optimizer is as likely to increase as to decrease the inference network parameters at the next iteration. By contrast, for the generative network, iwae converges towards a non-zero gradient, such that, even though the snr initially decreases with $K$ , it then rises again, with a very clear gradient signal for $K=1000$ .

Note that this does not mean that the estimates are getting worse for the generative network. As we increase $K$ our bound is getting tighter and our estimates closer to the true gradient for the target that we actually want to optimize $\nabla_{\mu}\log Z$ . See Appendix C.2 for more details. As we previously discussed, it is also the case that increasing $K$ could be beneficial for the inference network even if it reduces the snr by improving the direction of the expected gradient. However, as we will now show, the snr is, for this problem, the dominant effect for the inference network.

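As a check that our chosen definition of the snr is appropriate, and to examine the effect of multiple dimensions explicitly, we also consider a “directional” snr (dsnr). We split each gradient estimate into a component parallel to the true gradient and a component perpendicular to it, and take the expectation of the ratio of their magnitudes. Namely, with $u=\mathbb{E}\left[\Delta_{M,K}\right]/\lVert\mathbb{E}\left[\Delta_{M,K}\right]\rVert_{2}$ the true normalized gradient direction, we define

$$\textsc{dsnr}_{M,K}=\mathbb{E}\left[\frac{\lVert\Delta_{\parallel}\rVert_{2}}{\lVert\Delta_{\perp}\rVert_{2}}\right],\qquad\Delta_{\parallel}=\left(\Delta_{M,K}^{T}u\right)u,\quad\Delta_{\perp}=\Delta_{M,K}-\Delta_{\parallel}.$$

The dsnr thus provides a measure of the expected proportion of the gradient that will point in the true direction. For perfect estimates of the gradients, $\textsc{dsnr}\to\infty$ , but unlike the snr, arbitrarily bad estimates do not have $\textsc{dsnr}=0$ because even random vectors will have a component of their gradient in the true direction.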

The convergence of the dsnr is shown in Figure 3, for which the true normalized gradient $u$ has been estimated empirically, noting that this varies with $K$ . We see a similar qualitative behavior to the snr, with the gradients of iwae for the inference network degrading to having the same directional accuracy as drawing a random vector. Interestingly, the dsnr seems to be following the same asymptotic convergence behavior as the snr for both networks in $M$ (as shown by the dashed lines), even though we have no theoretical result to suggest this should occur.
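An empirical dsnr estimate of this kind can be sketched as follows. This assumes the dsnr is the expected ratio of the magnitude of the component of each estimate parallel to the true normalized gradient $u$ to the magnitude of its perpendicular component, with $u$ itself estimated from the sample mean as described above. The function name `dsnr` and the synthetic gradient samples are hypothetical.

```python
import numpy as np

def dsnr(grads):
    """Empirical directional snr for gradient estimates of shape (n_estimates, n_params)."""
    grads = np.asarray(grads, dtype=float)
    mean = grads.mean(axis=0)
    u = mean / np.linalg.norm(mean)      # estimated true gradient direction
    parallel = grads @ u                 # signed length of the component along u
    perpendicular = grads - np.outer(parallel, u)
    ratios = np.abs(parallel) / np.linalg.norm(perpendicular, axis=1)
    return float(ratios.mean())

rng = np.random.default_rng(0)
true_grad = np.ones(5)  # hypothetical true gradient
low_noise = true_grad + 0.01 * rng.standard_normal((2000, 5))
high_noise = true_grad + 10.0 * rng.standard_normal((2000, 5))
print(dsnr(low_noise), dsnr(high_noise))
```

Accurate estimates give a large value, while estimates dominated by noise give a small but non-zero value, since random vectors still have some component along $u$ .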

New Estimators

Based on our theoretical results, we now introduce three new algorithms that address the issue of diminishing snr for the inference network. Our first, miwae, is exactly equivalent to the general formulation given in (3), the distinction from previous approaches coming from the fact that it takes both $M>1$ and $K>1$ . The motivation for this is that because our inference network snr increases as $O(\sqrt{M/K})$ , we should be able to mitigate the issues increasing $K$ has on the snr by also increasing $M$ . For fairness, we will keep our overall budget $T=MK$ fixed, but we will show that given this budget, the optimal value for $M$ is often not $1$ . In practice, we expect that it will often be beneficial to increase the mini-batch size $N$ rather than $M$ for miwae; as we showed in Section 3.2 this has the same effect on the snr. Nonetheless, miwae forms an interesting reference method for testing our theoretical results and, as we will show, it can offer improvements over iwae for a given $N$ .

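Our second algorithm, ciwae, uses a convex combination of the iwae and vae bounds, namely

$$\textsc{elbo}_{\text{CIWAE}}:=\beta\,\textsc{elbo}_{\text{VAE}}+(1-\beta)\,\textsc{elbo}_{\text{IWAE}},\qquad\beta\in[0,1],$$

with corresponding gradient estimate

$$\Delta_{K,\beta}^{\text{C}}:=\nabla_{\theta,\phi}\left(\beta\frac{1}{K}\sum_{k=1}^{K}\log w_{k}+(1-\beta)\log\left(\frac{1}{K}\sum_{k=1}^{K}w_{k}\right)\right),$$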

where we use the same $w_{k}$ for both terms. The motivation for ciwae is that, if we set $\beta$ to a relatively small value, the objective will behave mostly like iwae, except when the expected iwae gradient becomes very small. When this happens, the vae component should “take over” and alleviate snr issues: the asymptotic snr of $\Delta_{K,\beta}^{\text{C}}$ for $\phi$ is $O(\sqrt{MK})$ because the vae component has non-zero expectation in the limit $K\rightarrow\infty$ .
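The convex combination can be sketched in NumPy as follows, computing the value of the ciwae target from one shared set of log-weights. This assumes $\beta$ multiplies the vae term and $(1-\beta)$ the iwae term, consistent with small $\beta$ behaving mostly like iwae. The function name `ciwae_objective` and the example log-weights are hypothetical, and gradients would in practice be obtained by automatic differentiation.

```python
import numpy as np

def ciwae_objective(log_w, beta):
    """Convex combination of the vae and iwae targets from shared log-weights."""
    log_w = np.asarray(log_w, dtype=float)
    vae_term = log_w.mean()                                # (1/K) sum_k log w_k
    m = log_w.max()
    iwae_term = m + np.log(np.mean(np.exp(log_w - m)))     # log((1/K) sum_k w_k)
    return float(beta * vae_term + (1.0 - beta) * iwae_term)

log_w = np.array([0.0, 1.0, 2.0])  # hypothetical log-weights
for beta in (0.0, 0.05, 0.5, 1.0):
    print(beta, ciwae_objective(log_w, beta))
```

Since the vae term never exceeds the iwae term for the same weights, the value decreases monotonically as $\beta$ moves from $0$ (iwae) to $1$ (vae).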

Our results suggest that what is good for the generative network, in terms of setting $K$ , is often detrimental for the inference network. It is therefore natural to question whether it is sensible to always use the same target for both the inference and generative networks. Motivated by this, our third method, piwae, uses the iwae target when training the generative network, but the miwae target for training the inference network. We thus have

$$\Delta_{K}^{\text{P}}(\theta)=\nabla_{\theta}\log\frac{1}{K}\sum_{k=1}^{K}w_{k},\qquad\Delta_{M,L}^{\text{P}}(\phi)=\frac{1}{M}\sum_{m=1}^{M}\nabla_{\phi}\log\frac{1}{L}\sum_{\ell=1}^{L}w_{m,\ell},$$

where we will generally set $K=ML$ so that the same weights can be used for both gradients.
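The idea of computing two targets from one set of weights can be sketched as below, assuming $K=ML$ weights that are used once as a single group (for $\theta$ ) and once as $M$ groups of $L$ (for $\phi$ ). The function names `log_mean_exp` and `piwae_targets` and the random log-weights are hypothetical; in an actual implementation one would differentiate the first value with respect to $\theta$ only and the second with respect to $\phi$ only.

```python
import numpy as np

def log_mean_exp(log_w, axis=-1):
    """Numerically stable log of the mean of exp(log_w) along `axis`."""
    m = np.max(log_w, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(log_w - m), axis=axis))

def piwae_targets(log_w, M, L):
    """Return (target for theta, target for phi) from K = M * L shared log-weights."""
    log_w = np.asarray(log_w, dtype=float).reshape(-1)
    if log_w.size != M * L:
        raise ValueError("expected exactly M * L log-weights")
    theta_target = float(log_mean_exp(log_w))                              # iwae, K = M * L
    phi_target = float(np.mean(log_mean_exp(log_w.reshape(M, L), axis=1)))  # miwae, M groups of L
    return theta_target, phi_target

rng = np.random.default_rng(0)
log_w = rng.normal(size=64)  # hypothetical log-weights, budget T = 64
theta_target, phi_target = piwae_targets(log_w, M=8, L=8)
print(theta_target, phi_target)
```

The target used for $\phi$ is the looser of the two, which is the intended trade: a less tight bound in exchange for a higher snr in the inference network gradients.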

We now use our new estimators to train deep generative models for the MNIST digits dataset (LeCun et al., 1998). For this, we duplicated the architecture and training schedule outlined in Burda et al. (2016). In particular, all networks were trained and evaluated using their stochastic binarization. For all methods we set a budget of $T=64$ weights in the target estimate for each datapoint in the minibatch.

Figure 5 shows the convergence of the iwae-64, $\log\hat{p}(x)$ , and KL metrics for each algorithm. Here we have considered the middle value for each of the parameters, namely $K=M=8$ for piwae and miwae, and $\beta=0.5$ for ciwae. We see that piwae and miwae both comfortably outperformed, and ciwae slightly outperformed, iwae in terms of the iwae-64 metric, despite iwae being directly trained on this target. In terms of $\log\hat{p}(x)$ , piwae gave the best performance, followed by iwae. For the KL, we see that the vae performed best followed by miwae, with iwae performing the worst. We note here that the KL is not an exact measure of the inference network performance as it also depends on the generative model. As such, the apparent superior performance of the vae may be because it produces a simpler model, as per the observations of Burda et al. (2016), for which it is in turn easier to learn an inference network. Critically though, piwae improves this metric whilst also improving generative network performance, such that this reasoning no longer applies. Similar behavior is observed for miwae and ciwae for different parameter settings (see Appendix D).

We next considered tuning the parameters for each of our algorithms as shown in Figure 6, for which we look at the final metric values after training. Table 1 further summarizes the performance for certain selected parameter settings. For miwae we see that as we increase $M$ , the $\log\hat{p}(x)$ metric gets worse, while the KL gets better. The iwae-64 metric initially increases with $M$ , before reducing again from $M=16$ to $M=64$ , suggesting that intermediate values for $M$ (i.e. $M\neq 1$ , $K\neq 1$ ) give a better trade-off. For piwae, similar behavior to miwae is seen for the iwae-64 and KL metrics. However, unlike for miwae, we see that $\log\hat{p}(x)$ initially increases with $M$ , such that piwae provides uniform improvement over iwae for the $M=2,4,8,$ and $16$ cases. ciwae exhibits similar behavior in increasing $\beta$ as increasing $M$ for miwae, but there appears to be a larger degree of noise in the evaluations, while the optimal value of $\beta$ , though non-zero, seems to be closer to iwae than for the other algorithms.

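As an additional measure of the performance of the inference network that is distinct from any of the training targets, we also considered the effective sample size (ESS) (Owen, 2013) for the fully trained networks, defined as

$$\text{ESS}=\frac{\left(\sum_{k=1}^{K}w_{k}\right)^{2}}{\sum_{k=1}^{K}w_{k}^{2}},\qquad\text{where}\quad w_{k}=\frac{p_{\theta}(x,z_{k})}{q_{\phi}(z_{k}\lvert x)},\quad z_{k}\sim q_{\phi}(z\lvert x).$$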

The ESS is a measure of how many unweighted samples would be equivalent to the weighted sample set. A low ESS indicates that the inference network is struggling to perform effective inference for the generative network. The results, given in Figure 7, show that the ESSs for ciwae, miwae, and the vae were all significantly larger than for iwae and piwae, with iwae giving a particularly poor ESS.
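The ESS can be computed stably from log-weights, as in the following sketch. It assumes the standard definition of the ESS as the squared sum of the weights divided by the sum of the squared weights; the function name `effective_sample_size` and the example inputs are hypothetical.

```python
import numpy as np

def effective_sample_size(log_w):
    """ESS = (sum_k w_k)^2 / sum_k w_k^2, computed from log-weights."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - log_w.max())  # rescaling leaves the ESS unchanged
    return float(w.sum() ** 2 / (w ** 2).sum())

equal_weights = np.zeros(10)                      # all weights identical
one_dominant = np.array([0.0, -50.0, -50.0])      # a single weight dominates
print(effective_sample_size(equal_weights), effective_sample_size(one_dominant))
```

Identical weights give an ESS equal to the number of samples, while a single dominant weight gives an ESS close to one, which is the regime that signals a poorly matched inference network.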

Our final experiment looks at the snr values for the inference networks during training. Here we took the gradients of a number of different neural network weights at different layers of the network and calculated empirical estimates for their snrs at various points during the training. We then averaged these estimates over the different network weights, the results of which are given in Figure 8. This clearly shows the low snr exhibited by the iwae inference network, suggesting that our results from the simple Gaussian experiments carry over to the more complex neural network domain.

Conclusions

We have provided theoretical and empirical evidence that algorithmic approaches to increasing the tightness of the elbo independently of the expressiveness of the inference network can be detrimental to learning by reducing the signal-to-noise ratio of the inference network gradients. Experiments on a simple latent variable model confirmed our theoretical findings. We then exploited these insights to introduce three estimators, piwae, miwae, and ciwae, and showed that each can deliver improvements over iwae, even when the metric used for this assessment is the iwae target itself. In particular, each was able to deliver improvement in the training of the inference network, without any reduction in the quality of the learned generative network.

Whereas miwae and ciwae mostly allow for balancing the requirements of the inference and generative networks, piwae appears to be able to offer simultaneous improvements to both, with the improved training of the inference network having a knock-on effect on the generative network. Key to achieving this is its use of separate targets for the two networks, opening up interesting avenues for future work.

Appendices for Tighter Variational Bounds are Not Necessarily Better


Appendix A Proof of SNR Convergence Rates

Before proving Theorem 1, we first introduce the following lemma that will be helpful for demonstrating the result. We note that the lemma can be interpreted as a generalization of the well-known results for the third and fourth moments of Monte Carlo estimators.

Let $a_{1},\dots,a_{K}$ , $b_{1},\dots,b_{K}$ , and $c_{1},\dots,c_{K}$ be sets of random variables such that

The $a_{k}$ are independent and identically distributed (i.i.d.). Similarly, the $b_{k}$ are i.i.d. and the $c_{k}$ are i.i.d.

$a_{i}$ , $b_{j}$ , $c_{k}$ are mutually independent if $i\neq j\neq k\neq i$ , but potentially dependent otherwise

where $Z:=p_{\theta}(x)$ is the true marginal likelihood.

We start by considering the variance of the estimators. We will first exploit the fact that each $\hat{Z}_{m,K}$ is independent and identically distributed and then apply Taylor’s theorem to $\log\hat{Z}_{m,K}$ about $Z$ , using $R_{1}(\cdot)$ to indicate the remainder term, as follows. (This approach follows similar lines to the derivation of nested Monte Carlo convergence bounds in (Rainforth, 2017; Rainforth et al., 2018; Fort et al., 2017) and the derivation of the mean squared error for self-normalized importance sampling, see e.g. (Hesterberg, 1988).)

where $o(\epsilon)$ indicates (asymptotically dominated) higher order terms originating from the $o(1)$ terms. If we now define

This is in the form required by Lemma 1 and satisfies the required assumptions and so we immediately have

by the second result in Lemma 1. If we further define

then we can also use the first result in Lemma 1 to give

Substituting these results back into (24) now gives

Considering now the expected gradient estimate and again using Taylor’s theorem, this time to a higher number of terms,

and thus by applying Lemma 1 with $a_{k}=b_{k}=c_{k}=w_{1,k}-Z$ we have

Substituting this back into (30) now yields

Finally, by combining (29) and (31), and noting that $\sqrt{\frac{A}{K}+\frac{B}{K^{2}}}=\frac{\sqrt{A}}{\sqrt{K}}+\frac{B}{2\sqrt{A}K^{3/2}}+O\left(\frac{1}{K^{5/2}}\right)$ , we have for $\theta$

For $\phi$ , then because $\nabla_{\phi}Z=0$ , we instead have

Appendix B Derivation of Optimal Parameters for Gaussian Experiment

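To derive the optimal parameters for the Gaussian experiment we first note that

$$\mathcal{J}(\theta,\phi)=\frac{1}{N}\sum_{n=1}^{N}\log p_{\theta}(x^{(n)})-\frac{1}{N}\sum_{n=1}^{N}\textsc{kl}\left(Q_{\phi}(z_{1:K}|x^{(n)})\,\|\,P_{\theta}(z_{1:K}|x^{(n)})\right),$$

where $P_{\theta}(z_{1:K}|x^{(n)})$ is the corresponding extended-space target distribution defined in (Le et al., 2018),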

$Q_{\phi}(z_{1:K}|x^{(n)})$ is as per (2) and the form of the Kullback-Leibler (kl) is taken from (Le et al., 2018). Next, we note that $\phi$ only controls the mean of the proposal so, while it is not possible to drive the KL to zero, it will be minimized for any particular $\theta$ when the means of $q_{\phi}(z|x^{(n)})$ and $p_{\theta}(z|x^{(n)})$ are the same. Furthermore, the corresponding minimum possible value of the KL is independent of $\theta$ and so we can calculate the optimum pair $(\theta^{*},\phi^{*})$ by first optimizing for $\theta$ and then choosing the matching $\phi$ . The optimal $\theta$ maximizes $\log\prod_{n=1}^{N}p_{\theta}(x^{(n)})$ , giving $\theta^{*}:=\mu^{*}=\frac{1}{N}\sum_{n=1}^{N}x^{(n)}$ . As we straightforwardly have $p_{\theta}(z|x^{(n)})=\mathcal{N}(z;\left(x^{(n)}+\mu\right)/2,I/2)$ , the KL is then minimized when $A=I/2$ and $b=\mu/2$ , giving $\phi^{*}:=(A^{*},b^{*})$ , where $A^{*}=I/2$ and $b^{*}=\mu^{*}/2$ .

Appendix C Additional Empirical Analysis of SNR

To complete the picture for the effect of $M$ and $K$ on the distribution of the gradients, we generated histograms for the $K=1$ (i.e. vae) gradients as $M$ is varied. As shown in Figure 9(a), we see the expected effect from the law of large numbers that the variance of the estimates decreases with $M$ , but not the expected value.

C.2 Convergence of RMSE for Generative Network

C.3 Experimental Results for High Variance Regime

We now present empirical results for a case where our weights are higher variance. Instead of choosing a point close to the optimum by offsetting parameters with a standard deviation of $0.01$ , we instead offset using a standard deviation of $0.5$ . We further increased the proposal covariance to $I$ to make it more diffuse. This is now a scenario where the model is far from its optimum and the proposal is a very poor match for the model, giving very high variance weights.

We see that the behavior is the same for variation in $M$ , but somewhat distinct for variation in $K$ . In particular, the snr and dsnr only decrease slowly with $K$ for the inference network, while increasing $K$ no longer has much benefit for the snr of the generative network. It is clear that, for this setup, the problem is very far from the asymptotic regime in $K$ such that our theoretical results no longer directly apply. Nonetheless, the high-level effect observed is still that the snr of the inference network deteriorates, albeit slowly, as $K$ increases.

Appendix D Convergence of Deep Generative Model for Alternative Parameter Settings

Figure 15 shows the convergence of the introduced algorithms under different settings to those shown in Figure 5. Namely we consider $M=4,K=16$ for piwae and miwae and $\beta=0.05$ for ciwae. These settings all represent tighter bounds than those of the main paper. Similar behavior is seen in terms of the iwae-64 metric for all algorithms. piwae produced similar mean behavior for all metrics, though the variance was noticeably increased for $\log\hat{p}(x)$ . For ciwae and miwae, we see that the parameter settings represent an explicit trade-off between the generative network and the inference network: $\log\hat{p}(x)$ was noticeably increased for both, matching that of iwae, while $-\textsc{KL}(Q_{\phi}(z\lvert x)||P_{\theta}(z\lvert x))$ was reduced. Critically, we see here that, as observed for piwae in the main paper, miwae and ciwae are able to match the generative model performance of iwae whilst improving the KL metric, indicating that they have learned better inference networks.

Appendix E Convergence of Toy Gaussian Problem

We finish by assessing the effect of the outlined changes in the quality of the gradient estimates on the final optimization for our toy Gaussian problem. Figure 16 shows the convergence of running Adam (Kingma & Ba, 2014) to optimize $\mu$ , $A$ , and $b$ . This suggests that the effects observed predominantly transfer to the overall optimization problem. Interestingly, setting $K=1$ and $M=1000$ gave the best performance on learning not only the inference network parameters, but also the generative network parameters.

Acknowledgments

TR and YWT are supported in part by the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement no. 617071. TAL is supported by a Google studentship, project code DF6700. MI is supported by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems. CJM is funded by a DeepMind Scholarship. FW is supported under DARPA PPAML through the U.S. AFRL under Cooperative Agreement FA8750-14-2-0006, Sub Award number 61160290-111668.