Implicit Reparameterization Gradients
Michael Figurnov, Shakir Mohamed, Andriy Mnih
Introduction
Pathwise gradient estimators are a core tool for stochastic estimation in machine learning and statistics [fu2006gradient, glasserman2013monte, titsias2014doubly, kingma2014auto, rezende2014stochastic]. In machine learning, we now commonly introduce these estimators using the “reparameterization trick”, in which we replace a probability distribution with an equivalent parameterization of it, using a deterministic and differentiable transformation of some fixed base distribution. This reparameterization is a powerful tool for learning because it makes backpropagation possible in computation graphs with certain types of continuous random variables, e.g. with Normal, Logistic, or Concrete distributions [jang2017categorical, maddison2017concrete]. Many of the recent advances in machine learning were made possible by this ability to backpropagate through stochastic nodes. They include variational autoencoders (VAEs), automatic variational inference [kingma2014auto, rezende2014stochastic, kucukelbir2017automatic], Bayesian learning in neural networks [blundell2015weight, gal2016dropout], and principled regularization in deep networks [gal2016theoretically, molchanov2017variational].
The reparameterization trick is easily used with distributions that have location-scale parameterizations or tractable inverse cumulative distribution functions (CDFs), or are expressible as deterministic transformations of such distributions. These seemingly modest requirements are still fairly restrictive as they preclude a number of standard distributions, such as truncated, mixture, Gamma, Beta, Dirichlet, or von Mises, from being used with reparameterization gradients. This paper provides a general tool for reparameterization in these important cases.
The limited applicability of reparameterization has often been addressed by using a different class of gradient estimators, the score-function estimators [fu2006gradient, glynn1990likelihood, williams1992simple]. While being more general, they typically result in high-variance gradients which require problem-specific variance reduction techniques to be practical. Generalized reparameterizations involve combining the reparameterization and score-function estimators [ruiz2016generalized, naesseth2017reparameterization]. Another approach is to approximate the intractable derivative of the inverse CDF [knowles2015stochastic].
Following [graves2016stochastic], we use implicit differentiation to differentiate the CDF rather than its inverse. While the method of [graves2016stochastic] is only practical for distributions with analytically tractable CDFs and has been used solely with mixture distributions, we leverage automatic differentiation to handle distributions with numerically tractable CDFs, such as Gamma and von Mises. We review the standard reparameterization trick in Section 2 and then make the following contributions:
We develop implicit reparameterization gradients that provide unbiased estimators for continuous distributions with numerically tractable CDFs. This allows many other important distributions to be used as easily as the Normal distribution in stochastic computation graphs.
We show that the proposed gradients are both faster and more accurate than alternative approaches.
We demonstrate that our method can outperform existing stochastic variational methods at training the Latent Dirichlet Allocation topic model in a black-box fashion using amortized inference.
We use implicit reparameterization gradients to train VAEs with Gamma, Beta, and von Mises latent variables instead of the usual Normal variables, leading to latent spaces with interesting alternative topologies.
Background
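We consider optimizing an expectation $\mathbb{E}_{q_{\bm{\phi}}(\bm{z})}[f(\bm{z})]$ w.r.t. the distribution parameters $\bm{\phi}$, and assume a standardization function $\mathcal{S}_{\bm{\phi}}(\bm{z})$ that is invertible, continuously differentiable w.r.t. $\bm{z}$ and $\bm{\phi}$, and maps samples from $q_{\bm{\phi}}(\bm{z})$ to samples $\bm{\varepsilon}$ from a base distribution $q(\bm{\varepsilon})$ that does not depend on $\bm{\phi}$. For example, for a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ we can use $\mathcal{S}_{\mu,\sigma}(z) = (z - \mu)/\sigma \sim \mathcal{N}(0, 1)$. We can then express the objective as an expectation w.r.t. $\bm{\varepsilon}$, transferring the dependence on $\bm{\phi}$ into $f$:
\begin{equation}
\mathbb{E}_{q_{\bm{\phi}}(\bm{z})}[f(\bm{z})] = \mathbb{E}_{q(\bm{\varepsilon})}\left[f(\mathcal{S}^{-1}_{\bm{\phi}}(\bm{\varepsilon}))\right].
\end{equation}
This allows us to compute the gradient of the expectation as the expectation of the gradients:
\begin{equation}
\nabla_{\bm{\phi}} \mathbb{E}_{q_{\bm{\phi}}(\bm{z})}[f(\bm{z})] = \mathbb{E}_{q(\bm{\varepsilon})}\left[\nabla_{\bm{\phi}} f(\mathcal{S}^{-1}_{\bm{\phi}}(\bm{\varepsilon}))\right] = \mathbb{E}_{q(\bm{\varepsilon})}\left[\nabla_{\bm{z}} f(\mathcal{S}^{-1}_{\bm{\phi}}(\bm{\varepsilon})) \nabla_{\bm{\phi}} \mathcal{S}^{-1}_{\bm{\phi}}(\bm{\varepsilon})\right]. \tag{3}
\end{equation}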
A standardization function satisfying the requirements exists for a wide range of continuous distributions, but it is not always practical to take advantage of this. For instance, the CDF of a univariate distribution provides such a function, mapping samples from it to samples from the uniform distribution over $[0, 1]$. However, inverting the CDF is often complicated and expensive, and computing its derivative is even harder.
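As a minimal illustration of the explicit reparameterization gradient in the Gaussian case, the following Python sketch estimates the gradients of $\mathbb{E}[z^2]$ for $z \sim \mathcal{N}(\mu, \sigma^2)$ using $z = \mu + \sigma\varepsilon$. The integrand $f(z) = z^2$, the function name `reparam_grad_estimate`, and the parameter values are illustrative assumptions, not part of the method. The closed form $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ gives the exact gradients $(2\mu, 2\sigma)$ for comparison.

```python
import random


def reparam_grad_estimate(mu, sigma, num_samples=100_000, seed=0):
    """Monte Carlo estimate of the gradients of E[z^2] w.r.t. (mu, sigma), z ~ N(mu, sigma^2)."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # base noise, independent of (mu, sigma)
        z = mu + sigma * eps        # inverse standardization function
        df_dz = 2.0 * z             # derivative of the assumed integrand f(z) = z^2
        grad_mu += df_dz * 1.0      # chain rule with dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule with dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples


mu, sigma = 1.5, 0.7
g_mu, g_sigma = reparam_grad_estimate(mu, sigma)
# Exact gradients for comparison: (2 * mu, 2 * sigma).
print(g_mu, 2 * mu)
print(g_sigma, 2 * sigma)
```

The sketch relies on the inverse standardization function being available in closed form, which is exactly the requirement that fails for distributions such as Gamma or von Mises.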
Stochastic variational inference
Training models with modern stochastic variational inference [paisley2012variational, kingma2014auto] involves gradient-based optimization of the evidence lower bound (ELBO) w.r.t. the model parameters $\bm{\theta}$ and the variational posterior parameters $\bm{\phi}$. While the KL-divergence term and its gradients can often be computed analytically, the remaining expected log-likelihood term and its gradients are typically intractable and are approximated using samples from the variational posterior. The most general form of this approach involves score-function gradient estimators [paisley2012variational, ranganath2014black, mnih2014neural] that handle both discrete and continuous latent variables but have relatively high variance. The reparameterization trick usually provides a lower variance gradient estimator and is easier to use, but due to the limitations discussed above, is not applicable to many important continuous distributions.
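The variance gap between the two estimator families can be seen on a toy problem. The sketch below compares per-sample score-function and reparameterization estimates of $\partial \mathbb{E}[z^2] / \partial\mu$ for $z \sim \mathcal{N}(\mu, \sigma^2)$. The integrand, the helper name `gradient_samples`, and the parameter values are assumptions chosen for illustration, and the outcome on this one example is not a general guarantee.

```python
import random
import statistics


def gradient_samples(mu, sigma, num_samples=20_000, seed=0):
    """Per-sample estimates of d/dmu E[z^2] from both estimators, using shared noise."""
    rng = random.Random(seed)
    score_fn, reparam = [], []
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        f = z * z
        # Score-function estimator: f(z) * d/dmu log N(z | mu, sigma^2).
        score_fn.append(f * (z - mu) / sigma**2)
        # Reparameterization estimator: f'(z) * dz/dmu, with dz/dmu = 1.
        reparam.append(2.0 * z)
    return score_fn, reparam


score_fn, reparam = gradient_samples(mu=1.0, sigma=1.0)
# Both estimators are unbiased for the exact gradient 2 * mu = 2.
print("means:", statistics.fmean(score_fn), statistics.fmean(reparam))
print("variances:", statistics.variance(score_fn), statistics.variance(reparam))
```

For this setting the per-sample variances can be computed by hand as 30 for the score-function estimator and 4 for the reparameterization estimator, so the empirical values should land close to those.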
Implicit reparameterization gradients
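We propose an alternative way of computing the reparameterization gradient that avoids the inversion of the standardization function. We start from Eqn. (3) and perform a change of variable $\bm{z} = \mathcal{S}^{-1}_{\bm{\phi}}(\bm{\varepsilon})$:
\begin{equation}
\nabla_{\bm{\phi}} \mathbb{E}_{q_{\bm{\phi}}(\bm{z})}[f(\bm{z})] = \mathbb{E}_{q_{\bm{\phi}}(\bm{z})}\left[\nabla_{\bm{z}} f(\bm{z}) \nabla_{\bm{\phi}} \bm{z}\right], \quad \nabla_{\bm{\phi}} \bm{z} = \nabla_{\bm{\phi}} \mathcal{S}^{-1}_{\bm{\phi}}(\bm{\varepsilon})\big|_{\bm{\varepsilon} = \mathcal{S}_{\bm{\phi}}(\bm{z})}.
\end{equation}
The gradient $\nabla_{\bm{\phi}} \bm{z}$ can be computed by implicit differentiation. Differentiating both sides of $\mathcal{S}_{\bm{\phi}}(\bm{z}) = \bm{\varepsilon}$ w.r.t. $\bm{\phi}$, and noting that $\bm{\varepsilon}$ does not depend on $\bm{\phi}$, gives $\nabla_{\bm{z}} \mathcal{S}_{\bm{\phi}}(\bm{z}) \nabla_{\bm{\phi}} \bm{z} + \nabla_{\bm{\phi}} \mathcal{S}_{\bm{\phi}}(\bm{z}) = 0$. Solving for $\nabla_{\bm{\phi}} \bm{z}$:
\begin{equation}
\nabla_{\bm{\phi}} \bm{z} = -(\nabla_{\bm{z}} \mathcal{S}_{\bm{\phi}}(\bm{z}))^{-1} \nabla_{\bm{\phi}} \mathcal{S}_{\bm{\phi}}(\bm{z}). \tag{6}
\end{equation}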
This expression for the gradient only requires differentiating the standardization function and not inverting it. Note that its value does not change under any invertible transformation of the standardization function, since the corresponding Jacobian cancels out with the inverse.
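Example: univariate Normal distribution $\mathcal{N}(z|\mu, \sigma^2)$. We illustrate that explicit and implicit reparameterizations give identical results. A standardization function is given by $\mathcal{S}_{\mu,\sigma}(z) = (z - \mu)/\sigma \sim \mathcal{N}(0, 1)$. Explicit reparameterization inverts this function: $z = \mathcal{S}^{-1}_{\mu,\sigma}(\varepsilon) = \mu + \sigma\varepsilon$, so that $\partial z/\partial\mu = 1$ and $\partial z/\partial\sigma = \varepsilon$. The implicit reparameterization, Eqn. (6), gives:
\begin{equation}
\frac{\partial z}{\partial \mu} = -\frac{\partial \mathcal{S}_{\mu,\sigma}(z)/\partial \mu}{\partial \mathcal{S}_{\mu,\sigma}(z)/\partial z} = -\frac{-1/\sigma}{1/\sigma} = 1, \quad \frac{\partial z}{\partial \sigma} = -\frac{\partial \mathcal{S}_{\mu,\sigma}(z)/\partial \sigma}{\partial \mathcal{S}_{\mu,\sigma}(z)/\partial z} = -\frac{-(z - \mu)/\sigma^2}{1/\sigma} = \frac{z - \mu}{\sigma}.
\end{equation}
The expressions are equivalent since $(z - \mu)/\sigma = \varepsilon$, but the implicit version avoids inverting $\mathcal{S}_{\mu,\sigma}(z)$.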
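The Normal example can also be checked numerically. The sketch below evaluates the implicit gradient $\nabla_{\bm{\phi}} z = -(\nabla_z \mathcal{S}_{\bm{\phi}}(z))^{-1} \nabla_{\bm{\phi}} \mathcal{S}_{\bm{\phi}}(z)$ for a scalar $z$ using central finite differences, once with the location-scale standardization function and once with the Normal CDF. The helper names (`implicit_grad`, `location_scale`, `normal_cdf`), the step size, and the use of finite differences in place of automatic differentiation are assumptions made to keep the example dependency-free.

```python
import math


def implicit_grad(standardize, z, params, h=1e-5):
    """Implicit gradient dz/dparams = -(dS/dz)^{-1} dS/dparams for scalar z, by central differences."""
    dS_dz = (standardize(z + h, params) - standardize(z - h, params)) / (2 * h)
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += h
        down[i] -= h
        dS_dp = (standardize(z, up) - standardize(z, down)) / (2 * h)
        grads.append(-dS_dp / dS_dz)
    return grads


def location_scale(z, params):
    """Standardization function S(z) = (z - mu) / sigma."""
    mu, sigma = params
    return (z - mu) / sigma


def normal_cdf(z, params):
    """Normal CDF, an invertible transformation of the location-scale standardization."""
    mu, sigma = params
    return 0.5 * (1.0 + math.erf((z - mu) / (sigma * math.sqrt(2.0))))


mu, sigma, z = 0.3, 2.0, 1.7
explicit = [1.0, (z - mu) / sigma]  # dz/dmu = 1, dz/dsigma = eps
from_location_scale = implicit_grad(location_scale, z, [mu, sigma])
from_cdf = implicit_grad(normal_cdf, z, [mu, sigma])
print(explicit, from_location_scale, from_cdf)
```

Both standardization functions give the same gradients up to finite-difference error, which is consistent with the invariance of the implicit gradient under invertible transformations of the standardization function. Neither call inverts the function it is given.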
Universal standardization function. For univariate distributions, a standardization function is given by the CDF: $\mathcal{S}_{\bm{\phi}}(z) = F(z|\bm{\phi}) \sim \mathrm{Uniform}(0, 1)$. Assuming that the CDF is strictly monotonic and continuously differentiable w.r.t. $z$ and $\bm{\phi}$, it satisfies the requirements for a standardization function. Plugging this function into Eqn. (6), we have
\begin{equation}
\nabla_{\bm{\phi}} z = -\frac{\nabla_{\bm{\phi}} F(z|\bm{\phi})}{q_{\bm{\phi}}(z)}.
\end{equation}
Therefore, computing the implicit gradient requires only differentiating the CDF. In the multivariate case, we can perform the multivariate distributional transform [ruschendorf2013copulas]:
\begin{equation}
\mathcal{S}_{\bm{\phi}}(\bm{z}) = (F(z_{1}|\bm{\phi}), F(z_{2}|z_{1}, \bm{\phi}), \dots, F(z_{D}|z_{1}, \dots, z_{D-1}, \bm{\phi})) = \bm{\varepsilon},
\end{equation}
where $q(\bm{\varepsilon}) = \prod_{d=1}^{D} \mathrm{Uniform}(\varepsilon_d|0, 1)$. Eqn. (6) requires computing the gradient of the (conditional) CDFs and solving a linear system with matrix $\nabla_{\bm{z}} \mathcal{S}_{\bm{\phi}}(\bm{z})$. If the distribution is factorized, the matrix is diagonal and the system can be solved in $O(D)$. Otherwise, the matrix is triangular because each CDF depends only on the preceding elements, and the system is solvable in $O(D^2)$.
Algorithm. We present the comparison between the standard explicit and the proposed implicit reparameterization in Table~\ref{fig:reparameterization-algos}. Samples of $z