The Generalized Reparameterization Gradient
Francisco J. R. Ruiz, Michalis K. Titsias, David M. Blei
Introduction
Variational inference (vi) is a technique for approximating the posterior distribution in probabilistic models (Jordan et al., 1999; Wainwright and Jordan, 2008). Given a probabilistic model of observed variables and hidden variables , the goal of vi is to approximate the posterior , which is intractable to compute exactly for many models. The idea of vi is to posit a family of distributions over the latent variables with free variational parameters . vi then fits those parameters to find the member of the family that is closest in Kullback-Leibler (kl) divergence to the exact posterior, . This turns inference into optimization, and different ways of doing vi amount to different optimization algorithms for solving this problem.
For a certain class of probabilistic models, those where each conditional distribution is in an exponential family, we can easily use coordinate ascent optimization to minimize the kl divergence (Ghahramani and Beal, 2001). However, many important models do not fall into this class (e.g., probabilistic neural networks or Bayesian generalized linear models). This is the scenario that we focus on in this paper. Much recent research in vi has focused on these difficult settings, seeking effective optimization algorithms that can be used with any model. This has enabled the application of vi on nonconjugate probabilistic models (Carbonetto et al., 2009; Paisley et al., 2012; Ranganath et al., 2014; Titsias and Lázaro-Gredilla, 2014), deep neural networks (Neal, 1992; Hinton et al., 1995; Mnih and Gregor, 2014; Kingma and Welling, 2014), and probabilistic programming (Wingate and Weber, 2013; Kucukelbir et al., 2015; van de Meent et al., 2016).
One strategy for vi in nonconjugate models is to obtain Monte Carlo estimates of the gradient of the variational objective and to use stochastic optimization to fit the variational parameters. Within this strategy, there have been two main lines of research: black-box variational inference (bbvi) (Ranganath et al., 2014) and reparameterization gradients (Salimans and Knowles, 2013; Kingma and Welling, 2014). Each enjoys different advantages and limitations.
bbvi expresses the gradient of the variational objective as an expectation with respect to the variational distribution using the log-derivative trick, also called reinforce or score function method (Glynn, 1990; Williams, 1992). It then takes samples from the variational distribution to calculate noisy gradients. bbvi is generic—it can be used with any type of latent variables and any model. However, the gradient estimates typically suffer from high variance, which can lead to slow convergence. Ranganath et al. (2014) reduce the variance of these estimates using Rao-Blackwellization (Casella and Robert, 1996) and control variates (Ross, 2002; Paisley et al., 2012; Gu et al., 2016). Other researchers have proposed further reductions, e.g., through local expectations (Titsias and Lázaro-Gredilla, 2015) and importance sampling (Ruiz et al., 2016).
The second approach to Monte Carlo gradients of the variational objective is through reparameterization (Price, 1958; Bonnet, 1964; Salimans and Knowles, 2013; Kingma and Welling, 2014; Rezende et al., 2014). This approach reparameterizes the latent variable in terms of a set of auxiliary random variables whose distributions do not depend on the variational parameters (typically, a standard normal). This facilitates taking gradients of the variational objective because the gradient operator can be pushed inside the expectation, and because the resulting procedure only requires drawing samples from simple distributions, such as standard normals. We describe this in detail in Section 2.
Reparameterization gradients exhibit lower variance than bbvi gradients. They typically need only one Monte Carlo sample to estimate a noisy gradient, which leads to fast algorithms. Further, for some models, their variance can be bounded (Fan et al., 2015). However, reparameterization is not as generic as bbvi. It is typically used with Gaussian variational distributions and does not easily generalize to other common distributions, such as the gamma or beta, without using further approximations. (See Knowles (2015) for an alternative approach to deal with the gamma distribution.)
We develop the generalized reparameterization (g-rep) gradient, a new method to extend reparameterization to other variational distributions. The main idea is to define an invertible transformation of the latent variables such that the distribution of the transformed variables is only weakly governed by the variational parameters. (We make this precise in Section 3.) Our technique naturally combines both bbvi and reparameterization; it applies to a wide class of nonconjugate models; it maintains the black-box criteria of reusing variational families; and it avoids approximations. We empirically show in two probabilistic models—a nonconjugate factorization model and a deep exponential family (Ranganath et al., 2015)—that a single Monte Carlo sample is enough to build an effective low-variance estimate of the gradient. In terms of speed, g-rep outperforms bbvi. In terms of accuracy, it outperforms automatic differentiation variational inference (advi) (Kucukelbir et al., 2016), which considers Gaussian variational distributions on a transformed space.
Background
Consider a probabilistic model , where denotes the latent variables and the observations. We assume that the posterior distribution is analytically intractable and we wish to apply vi. We introduce a tractable distribution to approximate and minimize the kl divergence with respect to the variational parameters . This minimization is equivalently expressed as the maximization of the so-called evidence lower bound (elbo) (Jordan et al., 1999),
Score function method. A general way to obtain unbiased stochastic gradients is to use the score function method, also called log-derivative trick or reinforce (Williams, 1992; Glynn, 1990), which has been recently applied to vi (Paisley et al., 2012; Ranganath et al., 2014; Mnih and Gregor, 2014). It is based on writing the gradient of the elbo with respect to as
and then building Monte Carlo estimates by approximating the expectation with samples from . The resulting estimator suffers from high variance, making it necessary to apply variance reduction methods such as control variates (Ross, 2002) or Rao-Blackwellization (Casella and Robert, 1996). Such variance reduction techniques have been used in bbvi (Ranganath et al., 2014).
The assumption here is that the log-joint is differentiable with respect to the latent variables. The gradient depends on the model, but it can be computed using automatic differentiation tools (Baydin et al., 2015). Monte Carlo estimates of the reparameterization gradient typically present much lower variance than those based on Eq. 3. In practice, a single sample from is enough to estimate a low-variance gradient.In the literature, there is no formal proof that reparameterization has lower variance than the score function estimator, except for some simple models (Fan et al., 2015). Titsias and Lázaro-Gredilla (2014) provide some intuitions, and Rezende et al. (2014) show some benefits of reparameterization in the Gaussian case.
The reparameterization trick is thus a powerful technique to reduce the variance of the estimator, but it requires a transformation such that does not depend on the variational parameters . For instance, if the variational distribution is Gaussian with mean and covariance , a straightforward transformation consists of standardizing the random variable , i.e.,
This transformation ensures that the (Gaussian) distribution does not depend on or . For a general variational distribution , Kingma and Welling (2014) discuss three families of transformations: inverse cumulative density function (cdf), location-scale, and composition. However, these transformations may not apply in certain cases.The inverse cdf approach sets to the cdf. This leads to a uniform distribution over on the unit interval, but it is not practical because the inverse cdf, , does not have analytical solution in general. We develop an approach that does not require computation of cdfs or their derivatives. Notably, none of them apply to the gammaComposition is only available when it is possible to express the gamma as a sum of exponentials, i.e., its shape parameter is an integer, which is not generally the case in vi. and the beta distributions, although these distributions are often used in vi.
Next, we show how to relax the constraint that the transformed density must not depend on the variational parameters . We follow a standardization procedure similar to the Gaussian case in Eq. 5, but we allow the distribution of the standardized variable to depend (at least weakly) on .
The Generalized Reparameterization Gradient
We now generalize the reparameterization idea to distributions that, like the gamma or the beta, do not admit the standard reparameterization trick. We assume that we can efficiently sample from the variational distribution , and that is differentiable with respect to and . We introduce a random variable defined by an invertible transformation
where we can think of as a standardization procedure that attempts to make the distribution of weakly dependent on the variational parameters . “Weakly” means that at least its first moment does not depend on . For instance, if is defined to have zero mean, then its first moment has become independent of . However, we do not assume that the resulting distribution of is completely independent of the variational parameters , and therefore we write it as . We use the distribution in the derivation of g-rep, but we write the final gradient as an expectation with respect to the original variational distribution , from which we can sample.
More in detail, by the standard change-of-variable technique, the transformed density is
We now express the gradient as the sum of two terms, which we name and for reasons that we will explain below. We apply the log-derivative trick and the product rule for derivatives, yielding
We rewrite Eq. 9 as an expression that involves expectations with respect to the original variational distribution only. For that, we define the following two auxiliary functions that depend on the transformation :
After some algebra (see the Supplement for details), we obtain
Thus, we can finally write the full gradient of the elbo as
Interpretation of the generalized reparameterization gradient. The term is easily recognizable as the standard reparameterization gradient, and hence the label “rep.” Indeed, if the distribution does not depend on the variational parameters , then the term in Eq. 9 vanishes, making . Thus, we may interpret as a “correction” term that appears when the transformed density depends on the variational parameters.
Alternatively, we can interpret the g-rep gradient as a control variate of the score function gradient. For that, we rearrange Eqs. 9 and 11 to express the gradient as
where the second line is the control variate, which involves the reparameterization gradient.
Transformations. Eqs. 9 and 11 are valid for any transformation . However, we may expect some transformations to perform better than others, in terms of the variance of the resulting estimator. It seems sensible to search for transformations that make small, as the reparameterization gradient is known to present low variance in practice under standard smoothness conditions of the log-joint (Fan et al., 2015).Techniques such as Rao-Blackwellization could additionally be applied to reduce the variance of . We do not apply any such technique in this paper. Transformations that make small are such that becomes weakly dependent on the variational parameters . In the standard reparameterization of Gaussian random variables, the transformation takes the form in (5), and thus is a standardized version of . We mimic this standardization idea for other distributions as well. In particular, for exponential family distributions, we use transformations of the form (sufficient statistic expected sufficient statistic)(scale factor). We present several examples in the next section.
For concreteness, we show here some examples of the equations above for well-known probability distributions. In particular, we choose the gamma, log-normal, and beta distributions.
Gamma distribution. Let be a gamma distribution with shape and rate . We use a transformation based on standardization of the sufficient statistic , i.e.,
where denotes the digamma function, and is its -th derivative. This ensures that has zero mean and unit variance, and thus its two first moments do not depend on the variational parameters and . We now compute the auxiliary functions in Eq. 10 for the components of the gradient with respect to and , which take the form
The terms and are obtained after substituting these results in Eq. 11. We provide the final expressions in the Supplement. We remark here that the component of corresponding to the derivative with respect to the rate equals zero, i.e., , meaning that the distribution of does not depend on the parameter . Indeed, we can compute this distribution following Eq. 7 as
where we can verify that it does not depend on .
Log-normal distribution. For a log-normal distribution with location and scale , we can standardize the sufficient statistic as
This leads to a standard normal distribution on , which does not depend on the variational parameters, and thus . The auxiliary function , which is needed for , takes the form
Thus, the reparameterization gradient is given in this case by
This corresponds to advi (Kucukelbir et al., 2016) with a logarithmic transformation over a positive random variable, since the variational distribution over the transformed variable is Gaussian. For a general variational distribution, we recover advi if the transformation makes Gaussian.
Beta distribution. For a random variable , we could rewrite for and , and apply the gamma reparameterization for and . Instead, in the spirit of applying standardization directly over , we define a transformation to standardize the logit function, (sum of sufficient statistics of the beta),
This ensures that has zero mean. We can set the denominator to the standard deviation of . However, for larger-scaled models we found better performance with a denominator that makes for the currently drawn sample (see the Supplement for details), even though the variance of the transformed variable is not one in such case.Note that this introduces some bias since we are ignoring the dependence of on . The reason is that suffers from high variance in the same way as the score function estimator does.
2 Algorithm
We now present our full algorithm for g-rep. It requires the specification of the variational family and the transformation . Given these, the full procedure is summarized in Algorithm 1. We use the adaptive step-size sequence proposed by Kucukelbir et al. (2016), which combines rmsprop (Tieleman and Hinton, 2012) and Adagrad (Duchi et al., 2011). Let be the -th component of the gradient at the -th iteration, and the step-size for that component. We set
where we set , , , and we explore several values of . Thus, we update the variational parameters as , where ‘’ is the element-wise product.
3 Related work
A closely related vi method is advi, which also relies on reparameterization and has been incorporated into Stan (Kucukelbir et al., 2015, 2016). advi applies a transformation to the random variables such that their support is on the reals and then uses a Gaussian variational posterior on the transformed space. For instance, random variables that are constrained to be positive are first transformed through a logarithmic function and then a Gaussian variational approximating distribution is placed on the unconstrained space. Thus, advi struggles to approximate probability densities with singularities, which are useful in models where sparsity is appropriate. In contrast, the g-rep method allows to estimate the gradient for a wider class of variational distributions, including gamma and beta distributions, which are more appropriate to encode sparsity constraints.
Schulman et al. (2015) also write the gradient in the form given in Eq. 12 to automatically estimate the gradient through a backpropagation algorithm in the context of stochastic computation graphs. However, they do not provide additional insight into this equation, do not apply it to general vi, do not discuss transformations for any distributions, and do not report experiments. Thus, our paper complements Schulman et al. (2015) and provides an off-the-shelf tool for general vi.
Experiments
The second model is a beta-gamma mf model with weights and latent locations . We use this model to describe binary observations , which are modeled as
where and is the inverse logit function. We place a gamma prior with shape and rate over the weights , a uniform prior over the variables , and we use latent components.
Datasets. We apply the sparse gamma def on two different databases: (i) the Olivetti database at AT&T,http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html which consists of 400 (320 for training and 80 for test) images of human faces in a 8 bit scale (); and (ii) the collection of papers at the Neural Information Processing Systems (nips) 2011 conference, which consists of documents and a vocabulary of effective words in a bag-of-words format ( of words from all documents are set aside to form the test set).
We apply the beta-gamma mf on: (i) the binarized mnist data,http://yann.lecun.com/exdb/mnist which consists of images of hand-written digits (we use training and test images); and (ii) the Omniglot dataset (Lake et al., 2015), which consists of images of hand-written characters from different alphabets (we select 10 alphabets, with training images, test images, and characters).
Evaluation. We apply mean-field vi and we compare g-rep with bbvi (Ranganath et al., 2014) and advi (Kucukelbir et al., 2016). We do not apply bbvi on the Omniglot dataset due to its computational complexity. At each iteration, we evaluate the elbo using one sample from the variational distribution, except for advi, for which we use samples (for the Omniglot dataset, we only use one sample). We run each algorithm with a fixed computational budget of CPU time. After that time, we also evaluate the predictive log-likelihood on the test set, averaging over 100 posterior samples. For the nips data, we also compute the test perplexity (with one posterior sample) every 10 iterations, given by
Experimental setup. To estimate the gradient, we use Monte Carlo samples for bbvi, and only for advi and g-rep. For bbvi, we use Rao-Blackwellization and control variates (we use a separate set of samples to estimate the control variates). For bbvi and g-rep, we use beta and gamma variational distributions, whereas advi uses Gaussian distributions on the transformed space, which correspond to log-normal or logit-normal distributions on the original space. Thus, only g-rep and bbvi optimize the same variational family. We parameterize the gamma distribution in terms of its shape and mean, and the beta in terms of its shape parameters and . To avoid constrained optimization, we apply the transformation to the variational parameters that are constrained to be positive and take stochastic gradient steps with respect to . We use the analytic gradient of the entropy terms. We implement advi as described by Kucukelbir et al. (2016).
We use the step-size schedule in Eq. 13, and we explore the parameter . For each algorithm and each dataset, we report the results based on the value of for which the best elbo was achieved. We report the values of in Table 1 (left).
Results. We show in Figure 1 the evolution of the elbo as a function of the running time for three of the considered datasets. bbvi converges slower than the rest of the methods, since each iteration involves drawing multiple samples and evaluating the log-joint for each of them. advi and g-rep achieve similar bounds, except for the mnist dataset, for which g-rep provides a variational approximation that is closer to the posterior, since the elbo is higher. This is because a variational family with sparse gamma and beta distributions provides a better fit to the data than the variational family to which advi is limited (log-normal and logit-normal). advi seems to converge slower; however, we do not claim that advi converges slower than g-rep in general. Instead, the difference may be due to the different step-sizes schedules that we found to be optimal (see Table 1). We also report in Table 1 (right) the average time per iterationOn the full mnist with training images, g-rep (advi) took () seconds per iteration. for each method: bbvi is the slowest method, and advi is the fastest because it involves simulation of Gaussian random variables only.
However, g-rep provides higher likelihood values than advi. We show in Figure 2a the evolution of the perplexity (lower is better) for the nips dataset, and in Figure 2b the resulting test log-likelihood (larger is better) for the rest of the considered datasets. In Figure 2b, we report the mean and standard deviation over posterior samples. advi cannot fit the data as well as g-rep or bbvi because it is constrained to log-normal and logit-normal variational distributions. These cannot capture sparsity, which is an important feature for the considered models. We can also conclude this by a simple visual inspection of the fitted models. In the Supplement, we compare images sampled from the g-rep and the advi posteriors, where we can observe that the latter are more blurry or lack some details.
Conclusion
We have introduced the generalized reparameterization gradient (g-rep), a technique to extend the standard reparameterization gradient to a wider class of variational distributions. As the standard reparameterization method, our method is applicable to any probabilistic model that is differentiable with respect to the latent variables. We have demonstrated the generalized reparameterization gradient on two nonconjugate probabilistic models to fit a variational approximation involving gamma and beta distributions. We have also empirically shown that a single Monte Carlo sample is enough to obtain a noisy estimate of the gradient, therefore leading to a fast inference procedure.
This project has received funding from the EU H2020 programme (Marie Skłodowska-Curie grant agreement 706760), NFS IIS-1247664, ONR N00014-11-1-0651, DARPA FA8750-14-2-0009, DARPA N66001-15-C-4032, Adobe, the John Templeton Foundation, and the Sloan Foundation. The authors would also like to thank Kriste Krstovski, Alp Kuckukelbir, and Christian Naesseth for helpful comments and discussions.
References
Appendix A Derivation of the Generalized Reparameterization Gradient
Here we show the mathematical derivation of the generalized reparameterization gradient. Firstly, recall the definition of the functions
We start from the following expression of the gradient, also derived in the main text:
We can write the former term, , as
where we have first replaced the variational distribution on the transformed space with its form as a function of , i.e., . We have then applied the chain rule, and finally we have made a new change of variables back to the original space (thus multiplying by the inverse Jacobian).
For the latter, , we have that
The derivative can be obtained by the chain rule. If , then . We substitute this result in the above equation and revert the change of variables back to the original space (also multiplying by the inverse Jacobian), yielding
where we have used the definition of the functions and .
Appendix B Particularization for the Gamma Distribution
For the gamma distribution we choose the transformation
The derivatives of with respect to its arguments are given by
Therefore, the auxiliary functions and for the components of the gradient with respect to and can be written as
Thus, we finally obtain that the components of corresponding to the derivatives with respect to and are given by
while the components of can be similarly obtained by substituting the expressions above into Eq. 25. Remarkably, we obtain that
Appendix C Particularization for the Beta Distribution
For a random variable , we could rewrite for and , and apply the above method for the gamma-distributed variables and . Instead, in the spirit of applying standardization directly over , we define a transformation to standardize the logit function. This leads to
This transformation ensures that has mean zero. However, in this case we do not specify the form of , and we let it be a function of and . This allows us to choose in such a way that for the sampled value of , which we found to work well (even though this introduces some bias). For simplicity, we write .
The derivatives of with respect to its arguments are given by
Therefore, the auxiliary functions and for the components of the gradient with respect to and can be written as
Note that the term above can be computed from without knowledge of the value of as .
Thus, we finally obtain that the components of corresponding to the derivatives with respect to and are given by
where we are still free to choose and . We have found that the choice of these values such that works well in practice. Thus, we set the derivatives of such that the relationships
hold for the sampled value of . This involves solving a simple linear equation for and .
Appendix D Particularization for the Dirichlet Distribution
For a distribution, with , we can apply the standardization
where the mean is a -length vector and the covariance is a matrix,Instead, we could define a transformation that ignores the off-diagonal terms of the covariance matrix. This would lead to faster computations but higher variance of the resulting estimator.,We could also apply the full-covariance transformation for the beta distribution. which are respectively given by
Here, we have defined . The covariance matrix can be rewritten as a diagonal matrix plus a rank one update, which can be exploited for faster computations:
Note that, since is positive semidefinite, can be readily obtained after diagonalization. In other words, if we express , where is an orthonormal matrix and is a diagonal matrix, then .
Given the transformation above, we can write
The derivatives of with respect to its arguments are given by
Therefore, the auxiliary functions and can be written as
The intermediate derivatives that are necessary for the computation of the functions and are:
and is the solution to the Lyapunov equation
Putting all this together, we finally have the expressions for the generalized reparameterization gradient:
Appendix E Experimental Results
We now study the sensitivity of the generalized reparameterization gradient with respect to the number of samples of the Monte Carlo estimator. For that, we choose the Olivetti dataset, and we apply the generalized reparameterization approach using 2, 5, 10, and 20 Monte Carlo samples. At each iteration, we compute the elbo and the average sample variance of the gradient estimator. We report these results in Figure 3 for the first 200 iterations of the inference procedure. As expected, increasing the number of samples is beneficial because it reduces the resulting variance. The gap between the curves with and samples is negligible, specially after iterations. A larger number of samples seems to be particularly helpful in the very early iterations of inference.
E.2 Reconstructed images
Here, we show some reconstructed observations for the three datasets involving images, namely, the binarized mnist, the Olivetti dataset, and Omniglot. We plot the reconstructed images as follows: we first draw one sample from the variational posterior, and then we compute the mean of the observations for that particular sample of latent variables.
Figure 4 shows the results for the Olivetti dataset. The true observations are shown in the left panel, whereas the corresponding reconstructed images are shown in the center panel (for g-rep) and the right panel (for advi). We can observe that the images obtained from g-rep are more detailed (e.g., we can distinguish the glasses, mustache, or facial expressions) than the images obtained from advi. We argue that this effect is due to the variational family used by advi, which cannot capture well sparse posterior distributions, for which samples close to are common.
This behavior is similar in the case of the digits from mnist or the characters from Omniglot. We show these images in Figures 5 and 6, respectively. Once again, images sampled from the g-rep posterior are visually closer to the ground truth that images sampled from the advi posterior, which tend to be more blurry, or even unrecognizable in a few cases.