DiCE: The Infinitely Differentiable Monte-Carlo Estimator

Jakob Foerster, Gregory Farquhar, Maruan Al-Shedivat, Tim Rocktäschel, Eric P. Xing, Shimon Whiteson

Introduction

The score function trick is used to produce Monte Carlo estimates of gradients in settings with non-differentiable objectives, e.g., in meta-learning and reinforcement learning. Estimating the first order gradients is computationally and conceptually simple. While the gradient estimators can be directly defined, it is often more convenient to define an objective whose derivative is the gradient estimator. Then the automatic-differentiation (auto-diff) toolbox, as implemented in deep learning libraries, can easily compute the gradient estimates with respect to all upstream parameters.

This is the method used by the surrogate loss (SL) approach (Schulman et al., 2015), which provides a recipe for building a surrogate objective from a stochastic computation graph (SCG). When differentiated, the SL yields an estimator for the first order gradient of the original objective.

However, estimating higher order derivatives is more challenging. Such estimators are useful for a number of optimization techniques, accelerating convergence in supervised settings (Dennis & Moré, 1977) and reinforcement learning (Furmston et al., 2016). Furthermore, they are vital for gradient-based meta-learning (Finn et al., 2017; Al-Shedivat et al., 2017; Li et al., 2017), which differentiates an objective after some number of first order learning steps. Estimators of higher order derivatives have also proven useful in multi-agent learning (Foerster et al., 2018), when one agent differentiates through the learning process of another agent.

Unfortunately, the first order gradient estimators mentioned above are fundamentally ill suited to calculating higher order derivatives via auto-diff. Due to the dependency on the sampling distribution, estimators of higher order derivatives require repeated application of the score function trick. Simply differentiating the first order estimator again, as was for example done by Finn et al. (2017), leads to missing terms.

To obtain higher order score function estimators, there are currently two unsatisfactory options. The first is to analytically derive and implement the estimators. However, this is laborious, error prone, and does not comply with the auto-diff paradigm. The second is to repeatedly apply the SL approach to construct new objectives for each further derivative estimate. However, each of these new objectives involves increasingly complex graph manipulations, defeating the purpose of a differentiable surrogate loss.

Moreover, to match the first order gradient after a single differentiation, the SL treats part of the cost as a fixed sample, severing the dependency on the parameters. We show that this yields missing and incorrect terms in estimators of higher order derivatives. We believe that these difficulties have limited the usage and exploration of higher order methods in reinforcement learning tasks and other application areas that may be formulated as SCGs.

Therefore, we propose a novel technique, the Infinitely Differentiable Monte-Carlo Estimator (DiCE), to address all these shortcomings. DiCE constructs a single objective that evaluates to an estimate of the original objective, but can also be differentiated repeatedly to obtain correct estimators of derivatives of any order. Unlike the SL approach, DiCE relies on auto-diff as implemented for instance in TensorFlow (Abadi et al., 2016) or PyTorch (Paszke et al., 2017) to automatically perform the complex graph manipulations required for these estimators of higher order derivatives.

DiCE uses a novel operator, MagicBox (), that acts on the set of those stochastic nodes Wc\mathcal{W}_{c} that influence each of the original losses in an SCG. Upon differentiation, this operator generates the correct derivatives associated with the sampling distribution:

where \bot is an operator that sets the gradient of the operand to zero, so \nabla_{x}\text{\bot}(x)=0. In addition, we show how to use a baseline for variance reduction in our formulation.

We verify the correctness of DiCE both through a proof and through numerical evaluation of the DiCE gradient estimates. To demonstrate the utility of DiCE, we also propose a novel approach for learning with opponent learning awareness (Foerster et al., 2018). We also open-source our code in TensorFlow. We hope this powerful and convenient novel objective will unlock further exploration and adoption of higher order learning methods in meta-learning, reinforcement learning, and other applications of SCGs. Already, DiCE is used to implement repeatedly differentiable gradient estimators with pyro.infer.util.Dice and tensorflow_probability.python.monte_carlo.expectation.

Background

Let vwv\prec w denote that node vv influences node ww, i.e., there exists a path in the graph from vv to ww. If every node along the path is deterministic, vv influences ww deterministically which is denoted by vDwv\prec^{D}w. See Figure 1 (top) for a simple SCG with an input node θ\theta, a stochastic node xx and a cost function ff. Note that θ\theta influences ff deterministically (θDf\theta\prec^{D}f) as well as stochastically via xx (θf\theta\prec f).

2 Surrogate Losses

In order to estimate gradients of a sum of cost nodes, cCc\sum_{c\in\mathcal{C}}c, in an arbitrary SCG, Schulman et al. (2015) introduce the notion of a surrogate loss (SL):

Here \textscdepsw\textsc{deps}_{w} are the ‘dependencies’ of ww: the set of stochastic or input nodes that deterministically influence the node ww. Furthermore, Q^w\hat{Q}_{w} is the sum of sampled costs c^\hat{c} corresponding to the cost nodes influenced by ww.

The hat notation on Q^w\hat{Q}_{w} indicates that inside the SL, these costs are treated as fixed samples. This severs the functional dependency on θ\theta that was present in the original stochastic computation graph.

The SL produces a gradient estimator when differentiated once (Schulman et al., 2015, Corollary 1):

Note that the use of sampled costs Q^w\hat{Q}_{w} in the definition of the SL ensures that its first order gradients match the score function estimator, which does not contain a term of the form log(p)θQ\log(p)\nabla_{\theta}Q.

Although Schulman et al. (2015) focus on first order gradients, they argue that the SL gradient estimates themselves can be treated as costs in an SCG and that the SL approach can be applied repeatedly to construct higher order gradient estimators. However, the use of sampled costs in the SL leads to missing dependencies and wrong estimates when calculating such higher order gradients, as we discuss in Section 1.

Higher Order Derivatives

In this section, we illustrate how to estimate higher order derivatives via repeated application of the score function (SF) trick and show that repeated application of the surrogate loss (SL) approach in stochastic computation graphs (SCGs) fails to capture all of the relevant terms for higher order gradient estimates.

We begin by revisiting the derivation of the score function estimator for the gradient of the expectation L\mathcal{L} of f(x;θ)f(x;\theta) over xp(x;θ)x\sim p(x;\theta):

2 Higher Order Surrogate Losses

While Schulman et al. (2015) focus on first order gradients, they state that a recursive application of SL can generate higher order gradient estimators. However, as we demonstrate in this section, because the SL approach treats part of the objective as a sampled cost, the corresponding terms lose a functional dependency on the sampling distribution. This leads to missing terms in the estimators of higher order gradients.

Consider the following example, where a single parameter θ\theta defines a sampling distribution p(x;θ)p(x;\theta) and the objective is f(x,θ)f(x,\theta).

The corresponding SCG is depicted at the top of Figure 1. Comparing (3.1) and (3.2), note that the first term, f^(x)\hat{f}(x) has lost its functional dependency on θ\theta, as indicated by the hat notation and the lack of a θ\theta argument. While these terms evaluate to the same estimate of the first order gradient, the lack of the dependency yields a discrepancy between the exact derivation of the second order gradient and a second application of SL:

By contrast, the exact derivation of θ2L\nabla_{\theta}^{2}\mathcal{L} results in the following expression:

Since gSL(x;θ)g_{\text{SL}}(x;\theta) differs from g(x;θ)g(x;\theta) only in its dependencies on θ\theta, gSLg_{\text{SL}} and gg are identical when evaluated. However, due to the missing dependencies in gSLg_{\text{SL}}, the gradients w.r.t. θ\theta, which appear in the higher order gradient estimates in (3.3) and (3.4), differ:

We lose the term θf(x;θ)θlog(p(x;θ))\nabla_{\theta}f(x;\theta)\nabla_{\theta}\log(p(x;\theta)) in the second order SL gradient because θf^(x)=0\nabla_{\theta}\hat{f}(x)=0 (see left part of Figure 1). This issue occurs immediately in the second order gradients when ff depends directly on θ\theta. However, as g(x;θ)g(x;\theta) always depends on θ\theta, the SL approach always fails to produce correct third or higher order gradient estimates even if ff depends only indirectly on θ\theta.

3 Example

Evaluating the expectations for the SL gradient estimators analytically results in the following terms, with an incorrect second-order estimate:

If, for example, the Newton-Raphson method was used to optimise L\mathcal{L}, the solution could be found in a single iteration with the correct Hessian. In contrast, the wrong estimates from the SL approach would require damping to approach the optimum at all, and many more iterations would be needed.

The failure mode seen in this toy example appears whenever the objective includes a regularisation term that depends on θ\theta, and is also impacted by the stochastic samples. One example in a practical algorithm is soft QQ-learning for RL (Schulman et al., 2017), which regularises the policy by adding an entropy penalty to the rewards. This penalty encourages the agent to maintain an exploratory policy, reducing the probability of getting stuck in local optima. Clearly the penalty depends on the policy parameters θ\theta. However, the policy entropy also depends on the states visited, which in turn depend on the stochastically sampled actions. As a result, the entropy regularised RL objective in this algorithm has the exact property leading to the failure of the SL approach shown above. Unlike our toy analytic example, the consequent errors do not just appear as a rescaling of the proper higher order gradients, but depend in a complex way on the parameters θ\theta. Any second order methods with such a regularised objective therefore requires an alternate strategy for generating gradient estimators, even setting aside the awkwardness of repeatedly generating new surrogate objectives.

Correct Gradient Estimators with DiCE

In this section, we propose the Infinitely Differentiable Monte-Carlo Estimator (DiCE), a practical algorithm for programatically generating correct gradients of any order in arbitrary SCGs. The naive option is to recursively apply the update rules in (3.1) that map from f(x;θ)f(x;\theta) to the estimator of its derivative g(x;θ)g(x;\theta). However, this approach has two deficiencies. First, by defining gradients directly, it fails to provide an objective that can be used in standard deep learning libraries. Second, these naive gradient estimators violate the auto-diff paradigm for generating further estimators by repeated differentiation since in general θf(x;θ)g(x;θ)\nabla_{\theta}f(x;\theta)\neq g(x;\theta). Our approach addresses these issues, as well as fixing the missing terms from the SL approach.

where Wc={w  wS,wc,θw}\mathcal{W}_{c}=\{w\ |\ w\in\mathcal{S},w\prec c,\theta\prec w\}, i.e., the set of stochastic nodes that depend on θ\theta and influence the cost cc. For brevity, from here on we suppress the deps notation, assuming all probabilities and costs are conditioned on their relevant ancestors in the SCG.

Note that (4.1) is the generalisation of (3.1) to arbitrary SCGs. The proof is given by Schulman et al. (2015, Lines 1-10, Appendix A). Crucially, in Line 11 the authors then replace cc by c^\hat{c}, severing the dependencies required for correct higher order gradient estimators. As described in Section 2.2, this was done so that the SL approach reproduces the score function estimator after a single differentiation and can thus be used as an objective for backpropagation in a deep learning library.

To support correct higher order gradient estimators, we propose DiCE, which relies heavily on a novel operator, MagicBox (). MagicBox takes a set of stochastic nodes W\mathcal{W} as input and has the following two properties by design:

Below we prove that the DiCE objective indeed produces correct arbitrary order gradient estimators under differentiation.

For each cost node cCc\in\mathcal{C}, we define a sequence of nodes, cn,n{0,1,}c^{n},n\in\{0,1,\dots\} as follows:

The term inside the brackets in (4.4) is identical to cn+1c^{n+1}. Secondly, note that (4.6) shows that cn+1c^{n+1} depends only on cnc^{n} and Wcn\mathcal{W}_{c^{n}}. Therefore, the stochastic nodes which influence cn+1c^{n+1} are the same as those which influence cnc^{n}. So Wcn=Wcn+1\mathcal{W}_{c^{n}}=\mathcal{W}_{c^{n+1}}, and we arrive at (4.5).

Implementation of DiCE. DiCE is easy to implement in standard deep learning libraries A previous version of tf.contrib.bayesflow authored by Josh Dillon also used this implementation trick. :

where \bot is an operator that sets the gradient of the operand to zero, so \nabla_{x}\text{\bot}(x)=0.This operator exists in PyTorch as detach and in TensorFlow as stop_gradient.

Causality. The SL approach handles causality by summing over stochastic nodes, ww, and multiplying log(p(w))\nabla\log(p(w)) for each stochastic node with a sum of the downstream costs, Q^w\hat{Q}_{w}. In contrast, the DiCE objective sums over costs, cc, and multiplies each cost with a sum over the gradients of log-probabilities from upstream stochastic nodes, wWclog(p(w))\sum_{w\in\mathcal{W}_{c}}\nabla\log(p(w)).

In both cases, integrating causality into the gradient estimator leads to reduction of variance compared to the naive approach of multiplying the full sum over costs with the full sum over grad-log-probabilities.

However, the second formulation leads to greatly reduced conceptual complexity when calculating higher order terms, which we exploit in the definition of the DiCE objective. This is because each further gradient estimator maintains the same backward looking dependencies for each term in the original sum over costs, i.e., Wcn=Wcn+1\mathcal{W}_{c^{n}}=\mathcal{W}_{c^{n+1}}.

In contrast, the SL approach is centred around the stochastic nodes, which each become associated with a growing number of downstream costs after each differentiation. Consequently, we believe that our DiCE objective is more intuitive, as it is conceptually centred around the original objective and remains so under repeated differentiation.

Variance Reduction. We can include a baseline term in the definition of the DiCE objective:

The baseline bwb_{w} is a design choice and can be any function of nodes not influenced by ww. As long as this condition is met, the baseline does not change the expectation of the gradient estimates, but can considerably reduce the variance. A common choice is the average cost.

Hessian-Vector Product. The Hessian-vector, vHv^{\top}H, is useful for a number of algorithms, such as estimation of eigenvectors and eigenvalues of HH (Pearlmutter, 1994). Using DiCE, vHv^{\top}H can be implemented efficiently without having to compute the full Hessian. Assuming vv does not depend on θ\theta and using ⊤ to indicate the transpose:

Case Studies

While the main contribution of this paper is to provide a novel general approach for any order gradient estimation in SCGs, we also provide a proof-of-concept empirical evaluation for a set of case studies, carried out on the iterated prisoner’s dilemma (IPD). In IPD, two agents iteratively play matrix games with two possible actions: (C)ooperate and (D)efect. The possible outcomes of each game are DD, DC, CD, CC with the corresponding first agent payoffs, -2, 0, -3, -1, respectively. This setting is useful because (1) it has a nontrivial but analytically calculable value function, allowing for verification of gradient estimates, and (2) differentiating through the learning steps of other agents in multi-agent RL is a highly relevant application of higher order policy gradient estimators in RL (Foerster et al., 2018).

Empirical Verification. We first verify that DiCE recovers gradients and Hessians in stochastic computation graphs. To do so, we use DiCE to estimate gradients and Hessians of the expected return for fixed policies in IPD.

As shown in Figure 3, we find that indeed the DiCE estimator matches both the gradients (a) and the Hessians (b) for both agents accurately. Furthermore, Figure 4 shows how the estimate of the gradient improve as the value function becomes more accurate during training, in (a). Also shown is the quality of the gradient estimation as a function of sample size with and without a baseline, in (b). Both plots show that the baseline is a key component of DiCE for accurate estimation of gradients.

DiCE for multi-agent RL. In learning with opponent-learning awareness (LOLA), Foerster et al. (2018) show that agents that differentiate through the learning step of their opponent converge to Nash equilibria with higher social welfare in the IPD.

Since the standard policy gradient learning step for one agent has no dependency on the parameters of the other agent (which it treats as part of the environment), LOLA relies on a Taylor expansion of the expected return in combination with an analytical derivation of the second order gradients to be able to differentiate through the expected return after the opponent’s learning step.

Here, we take a more direct approach, made possible by DiCE. Let πθ1\pi_{\theta_{1}} be the policy of the LOLA agent and let πθ2\pi_{\theta_{2}} be the policy of its opponent and vice versa. Assuming that the opponent learns using policy gradients, LOLA-DiCE agents learn by directly optimising the following stochastic objective w.r.t. θ1\theta_{1}:

where α2\alpha_{2} is a scalar step size and Li=t=0Tγtrti\mathcal{L}^{i}=\sum_{t=0}^{T}\gamma^{t}r^{i}_{t} is the sum of discounted returns for agent ii.

To evaluate these terms directly, our variant of LOLA unrolls the learning process of the opponent, which is functionally similar to model-agnostic meta-learning (MAML, Finn et al., 2017). In the MAML formulation, the gradient update of the opponent, Δθ2\Delta\theta_{2}, corresponds to the inner loop (typically the training objective) and the gradient update of the agent itself to the outer loop (typically the test objective). Algorithm 1 describes the procedure we use to compute updates for the agent’s parameters.

Using the following DiCE-objective to estimate gradient steps for agent ii, we are able to preserve all dependencies:

where {aj{1,2}tt}\left\{a_{j\in\{1,2\}}^{t^{\prime}\leq t}\right\} is the set of all actions taken by both agents up to time tt. To save computation, we cache the Δθi\Delta\theta_{i} of the inner loop when unrolling the outer loop policies in order to avoid recalculating them at every time step.

Using DiCE, differentiating through Δθ2\Delta\theta_{2} produces the correct higher order gradients, which is critical for LOLA. By contrast, simply differentiating through the SL-based first order gradient estimator multiple times, as was done for MAML (Finn et al., 2017), results in omitted gradient terms and a biased gradient estimator, as pointed out by Al-Shedivat et al. (2017) and Stadie et al. (2018).

Figure 5 shows a comparison between the LOLA-DiCE agents and the original formulation of LOLA. In our experiments, we use a time horizon of 150 steps and a reduced batch size of 64; the lookahead gradient step, α\alpha, is set to 1 and the learning rate is 0.3. Importantly, given the approximation used, the LOLA method was restricted to a single step of opponent learning. In contrast, using DiCE we can unroll and differentiate through an arbitrary number of the opponent learning steps.

The original LOLA implemented via second order gradient corrections shows no stable learning, as it requires much larger batch sizes (\sim40004000). By contrast, LOLA-DiCE agents discover strategies of high social welfare, replicating the results of the original LOLA paper in a way that is both more direct, efficient and establishes a common formulation between MAML and LOLA.

Related Work

Gradient estimation is well studied, although many methods have been named and explored independently in different fields, and the primary focus has been on first order gradients. Fu (2006) provides an overview of methods from the point of view of simulation optimization.

The score function (SF) estimator, also referred to as the likelihood ratio estimator or REINFORCE, has received considerable attention in many fields. In reinforcement learning, policy gradient methods (Williams, 1992) have proven highly successful, especially when combined with variance reduction techniques (Weaver & Tao, 2001; Grondman et al., 2012). The SF estimator has also been used in the analysis of stochastic systems (Glynn, 1990), as well as for variational inference (Wingate & Weber, 2013; Ranganath et al., 2014). Kingma & Welling (2013) and Rezende et al. (2014) discuss Monte-Carlo gradient estimates in the case where the stochastic parts of a model can be reparameterised.

These approaches are formalised for arbitrary computation graphs by Schulman et al. (2015), but to our knowledge our paper is the first to present a practical and correct approach for generating higher order gradient estimators utilising auto-diff. To easily make use of these estimates for optimising neural network models, automatic differentiation for backpropagation has been widely used (Baydin et al., 2015).

One rapidly growing application area for such higher order gradient estimates is meta-learning for reinforcement learning. Finn et al. (2017) compute a loss after a number of policy gradient learning steps, differentiating through the learning step to find parameters that can be quickly fine-tuned for different tasks. Li et al. (2017) extend this work to also meta-learn the fine-tuning step direction and magnitude. Al-Shedivat et al. (2017) and Stadie et al. (2018) derive the proper higher order gradient estimators for their work by reapplying the score function trick. Foerster et al. (2018) use a multi-agent version of the same higher order gradient estimators in combination with a Taylor expansion of the expected return. None present a general strategy for constructing higher order gradient estimators for arbitrary stochastic computation graphs.

Conclusion

We presented DiCE, a general method for computing any order gradient estimators for stochastic computation graphs. DiCE resolves the deficiencies of current approaches for computing higher order gradient estimators: analytical calculation is error-prone and incompatible with auto-diff, while repeated application of the surrogate loss approach is cumbersome and, as we show, leads to incorrect estimators in many cases. We prove the correctness of DiCE estimators, introduce a simple practical implementation of DiCE for use in deep learning frameworks, and validate its correctness and utility in a multi-agent reinforcement learning problem. We believe DiCE will unlock further exploration and adoption of higher order learning methods in meta-learning, reinforcement learning, and other applications of stochastic computation graphs. As a next step we will extend and improve the variance reduction of DiCE in order to provide a simple end-to-end solution for higher order gradient estimation. In particular we hope to include solutions such as REBAR (Tucker et al., 2017) in the DiCE operator.

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713) and National Institute of Health (NIH R01GM114311). It was also supported by the Oxford Google DeepMind Graduate Scholarship and the UK EPSRC CDT in Autonomous Intelligent Machines and Systems. We would like to thank Misha Denil, Brendan Shillingford and Wendelin Boehmer for providing feedback on the manuscript.

References