Generative Adversarial Imitation Learning
Jonathan Ho, Stefano Ermon
Introduction
We are interested in a specific setting of imitation learning (the problem of learning to perform a task from expert demonstrations) in which the learner is given only samples of trajectories from the expert, is not allowed to query the expert for more data while training, and is not provided reinforcement signal of any kind. There are two main approaches suitable for this setting: behavioral cloning, which learns a policy as a supervised learning problem over state-action pairs from expert trajectories; and inverse reinforcement learning, which finds a cost function under which the expert is uniquely optimal.
Behavioral cloning, while appealingly simple, only tends to succeed with large amounts of data, due to compounding error caused by covariate shift. Inverse reinforcement learning (IRL), on the other hand, learns a cost function that prioritizes entire trajectories over others, so compounding error, a problem for methods that fit single-timestep decisions, is not an issue. Accordingly, IRL has succeeded in a wide range of problems, from predicting behaviors of taxi drivers to planning footsteps for quadruped robots.
Unfortunately, many IRL algorithms are extremely expensive to run, requiring reinforcement learning in an inner loop. Scaling IRL methods to large environments has thus been the focus of much recent work. Fundamentally, however, IRL learns a cost function, which explains expert behavior but does not directly tell the learner how to act. The learner's true goal is often to take actions imitating the expert; indeed, many IRL algorithms are evaluated on the quality of the optimal actions of the costs they learn. Why, then, must we learn a cost function, if doing so possibly incurs significant computational expense yet fails to directly yield actions?
We desire an algorithm that tells us explicitly how to act by directly learning a policy. To develop such an algorithm, we begin in Section 3, where we characterize the policy given by running reinforcement learning on a cost function learned by maximum causal entropy IRL. Our characterization introduces a framework for directly learning policies from data, bypassing any intermediate IRL step.
Then, we instantiate our framework in Sections 4 and 5 with a new model-free imitation learning algorithm. We show that our resulting algorithm is intimately connected to generative adversarial networks, a technique from the deep learning community that has led to recent successes in modeling distributions of natural images: our algorithm harnesses generative adversarial training to fit distributions of states and actions defining expert behavior. We test our algorithm in Section 6, where we find that it outperforms competing methods by a wide margin in training policies for complex, high-dimensional physics-based control tasks over various amounts of expert data.
Background
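Suppose we are given an expert policy $\pi_E$ that we wish to rationalize with IRL. For the remainder of this paper, we will adopt maximum causal entropy IRL, which fits a cost function from a family of functions $\mathcal{C}$ with the optimization problem
$$\underset{c \in \mathcal{C}}{\text{maximize}} \; \Big( \min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_{\pi}[c(s,a)] \Big) - \mathbb{E}_{\pi_E}[c(s,a)] \qquad (1)$$
where $H(\pi) \triangleq \mathbb{E}_{\pi}[-\log \pi(a|s)]$ is the $\gamma$-discounted causal entropy of the policy $\pi$. Maximum causal entropy IRL looks for a cost function $c \in \mathcal{C}$ that assigns low cost to the expert policy and high cost to other policies, thereby allowing the expert policy to be found via a certain reinforcement learning procedure:
$$\mathrm{RL}(c) = \operatorname*{argmin}_{\pi \in \Pi} -H(\pi) + \mathbb{E}_{\pi}[c(s,a)] \qquad (2)$$
which maps a cost function to high-entropy policies that minimize the expected cumulative cost.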
Characterizing the induced optimal policy
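Now, let us define an IRL primitive procedure, which finds a cost function such that the expert performs better than all other policies, with the cost regularized by a convex function $\psi$:
$$\mathrm{IRL}_\psi(\pi_E) = \operatorname*{argmax}_{c \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}} -\psi(c) + \Big( \min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_{\pi}[c(s,a)] \Big) - \mathbb{E}_{\pi_E}[c(s,a)] \qquad (3)$$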
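For a policy $\pi \in \Pi$, define its occupancy measure $\rho_\pi(s,a) = \pi(a|s) \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$, and let $\mathcal{D} \triangleq \{\rho_\pi : \pi \in \Pi\}$ be the set of valid occupancy measures.
Proposition 3.1. If $\rho \in \mathcal{D}$, then $\rho$ is the occupancy measure for $\pi_\rho(a|s) \triangleq \rho(s,a) / \sum_{a'} \rho(s,a')$, and $\pi_\rho$ is the only policy whose occupancy measure is $\rho$.
Proposition 3.2. With $\psi^*$ denoting the convex conjugate of $\psi$,
$$\mathrm{RL} \circ \mathrm{IRL}_\psi(\pi_E) = \operatorname*{argmin}_{\pi \in \Pi} -H(\pi) + \psi^*(\rho_\pi - \rho_{\pi_E}) \qquad (4)$$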
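The correspondence in Proposition 3.1 can be checked numerically on a small tabular problem. The sketch below is illustrative only: the two-state, two-action dynamics `P`, the policy `pi`, the start distribution `p0`, and the truncation `horizon` are all made-up assumptions, and the infinite discounted sum is approximated by a long finite one.

```python
def occupancy_measure(P, pi, p0, gamma, horizon=2000):
    """Approximate rho(s, a) = pi(a|s) * sum_t gamma^t Pr(s_t = s | pi).

    P[s][a][s2] is the transition probability, pi[s][a] the policy,
    p0[s] the initial state distribution (all hypothetical inputs).
    """
    n_s, n_a = len(pi), len(pi[0])
    state_dist = list(p0)          # Pr(s_t = s) at the current time step
    visits = [0.0] * n_s           # discounted state visitation so far
    discount = 1.0
    for _ in range(horizon):
        for s in range(n_s):
            visits[s] += discount * state_dist[s]
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                for s2 in range(n_s):
                    nxt[s2] += state_dist[s] * pi[s][a] * P[s][a][s2]
        state_dist = nxt
        discount *= gamma
    return [[visits[s] * pi[s][a] for a in range(n_a)] for s in range(n_s)]


def policy_from_occupancy(rho):
    """Recover pi_rho(a|s) = rho(s, a) / sum_a' rho(s, a')."""
    return [[x / sum(row) for x in row] for row in rho]


# Toy problem (assumed values, not from the paper).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
pi = [[0.7, 0.3], [0.4, 0.6]]
p0 = [1.0, 0.0]
gamma = 0.9

rho = occupancy_measure(P, pi, p0, gamma)
recovered = policy_from_occupancy(rho)
total_mass = sum(sum(row) for row in rho)   # close to 1 / (1 - gamma)
```

Normalizing each row of `rho` returns the original policy, which is the direction of Proposition 3.1 that is easy to verify by hand; uniqueness is the part that needs the proof.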
The proof of Proposition 3.2 is in Section A.1. The proof relies on the observation that the optimal cost function and policy form a saddle point of a certain function. IRL finds one coordinate of this saddle point, and running reinforcement learning on the output of IRL reveals the other coordinate.
Proposition 3.2 tells us that $\psi$-regularized inverse reinforcement learning, implicitly, seeks a policy whose occupancy measure is close to the expert's, as measured by the convex function $\psi^*$. Enticingly, this suggests that various settings of $\psi$ lead to various imitation learning algorithms that directly solve the optimization problem given by Proposition 3.2. We explore such algorithms in Sections 4 and 5, where we show that certain settings of $\psi$ lead to both existing algorithms and a novel one.
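The special case when $\psi$ is a constant function is particularly illuminating, so we state and show it directly using concepts from convex optimization.
Corollary 3.2.1. If $\psi$ is a constant function, $\tilde{c} \in \mathrm{IRL}_\psi(\pi_E)$, and $\tilde{\pi} \in \mathrm{RL}(\tilde{c})$, then $\rho_{\tilde{\pi}} = \rho_{\pi_E}$.
In other words, if there were no cost regularization at all, then the recovered policy would exactly match the expert's occupancy measure. To show this, we will need a lemma that lets us speak about causal entropies of occupancy measures:
Lemma 3.1. Let $\bar{H}(\rho) = -\sum_{s,a} \rho(s,a) \log \big( \rho(s,a) / \sum_{a'} \rho(s,a') \big)$. Then, $\bar{H}$ is strictly concave, and for all $\pi \in \Pi$ and $\rho \in \mathcal{D}$, we have $H(\pi) = \bar{H}(\rho_\pi)$ and $\bar{H}(\rho) = H(\pi_\rho)$.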
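As a small numerical illustration of the lemma, the sketch below evaluates the entropy of an occupancy table, i.e. minus the sum over state-action pairs of rho(s,a) log(rho(s,a) / sum over a' of rho(s,a')), and checks the concavity inequality on one mixture. The two tables and the mixing weight are arbitrary assumed values; the inequality itself follows from the log-sum inequality for any nonnegative tables, so they are not required to be occupancy measures of a particular MDP.

```python
import math


def causal_entropy_bar(rho):
    """H_bar(rho) = -sum_{s,a} rho(s,a) * log(rho(s,a) / sum_a' rho(s,a'))."""
    total = 0.0
    for row in rho:                 # one row per state
        state_mass = sum(row)
        for x in row:               # one entry per action
            if x > 0.0:
                total -= x * math.log(x / state_mass)
    return total


# Two hypothetical tables whose per-state action distributions differ.
rho_a = [[0.6, 0.4], [2.0, 1.0]]
rho_b = [[0.9, 0.1], [1.0, 2.0]]
lam = 0.3

mixture = [[lam * x + (1.0 - lam) * y for x, y in zip(row_a, row_b)]
           for row_a, row_b in zip(rho_a, rho_b)]

entropy_of_mixture = causal_entropy_bar(mixture)
mixture_of_entropies = (lam * causal_entropy_bar(rho_a)
                        + (1.0 - lam) * causal_entropy_bar(rho_b))
```

Here `entropy_of_mixture` exceeds `mixture_of_entropies`, as strict concavity requires whenever the two induced policies differ.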
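The proof of this lemma is in Section A.1. Proposition 3.1 and Lemma 3.1 together allow us to freely switch between policies and occupancy measures when considering functions involving causal entropy and expected costs, as in the following lemma:
Lemma 3.2. Let $L(\pi, c) = -H(\pi) + \mathbb{E}_{\pi}[c(s,a)]$ and $\bar{L}(\rho, c) = -\bar{H}(\rho) + \sum_{s,a} \rho(s,a) c(s,a)$. Then, for all cost functions $c$, $L(\pi, c) = \bar{L}(\rho_\pi, c)$ for all policies $\pi \in \Pi$, and $\bar{L}(\rho, c) = L(\pi_\rho, c)$ for all occupancy measures $\rho \in \mathcal{D}$.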
Now, we are ready to give a direct proof of Corollary 3.2.1.
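Define $\bar{L}(\rho, c) = -\bar{H}(\rho) + \sum_{s,a} c(s,a) \big( \rho(s,a) - \rho_E(s,a) \big)$, where $\rho_E$ abbreviates $\rho_{\pi_E}$. Given that $\psi$ is a constant function, we have the following, due to Lemma 3.2:
$$\tilde{c} \in \mathrm{IRL}_\psi(\pi_E) = \operatorname*{argmax}_{c} \min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_{\pi}[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)] + \text{const.} = \operatorname*{argmax}_{c} \min_{\rho \in \mathcal{D}} \bar{L}(\rho, c)$$
This is the dual of the optimization problem
$$\underset{\rho \in \mathcal{D}}{\text{minimize}} \; -\bar{H}(\rho) \quad \text{subject to} \quad \rho(s,a) = \rho_E(s,a) \;\; \forall s \in \mathcal{S}, a \in \mathcal{A} \qquad (7)$$
with Lagrangian $\bar{L}$, for which the costs $c(s,a)$ serve as dual variables for the equality constraints. Thus, $\tilde{c}$ is a dual optimum for (7). Because $\mathcal{D}$ is convex and $-\bar{H}$ is strictly convex (Lemma 3.1), strong duality holds and the primal optimum $\rho_E$ is recovered uniquely as $\operatorname*{argmin}_{\rho \in \mathcal{D}} \bar{L}(\rho, \tilde{c})$. By Lemma 3.2, any $\tilde{\pi} \in \mathrm{RL}(\tilde{c})$ therefore satisfies $\rho_{\tilde{\pi}} = \rho_E$.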
From this argument, we can deduce the following:
IRL is a dual of an occupancy measure matching problem, and the recovered cost function is the dual optimum. Classic IRL algorithms that solve reinforcement learning repeatedly in an inner loop, such as the algorithm of Ziebart et al. that runs a variant of value iteration in an inner loop, can be interpreted as a form of dual ascent, in which one repeatedly solves the primal problem (reinforcement learning) with fixed dual values (costs). Dual ascent is effective if solving the unconstrained primal is efficient, but in the case of IRL, it amounts to reinforcement learning!
The induced optimal policy is the primal optimum. The induced optimal policy is obtained by running RL after IRL, which is exactly the act of recovering the primal optimum from the dual optimum; that is, optimizing the Lagrangian with the dual variables fixed at the dual optimum values. Strong duality implies that this induced optimal policy is indeed the primal optimum, and therefore matches occupancy measures with the expert. IRL is traditionally defined as the act of finding a cost function such that the expert policy is uniquely optimal, but now, we can alternatively view IRL as a procedure that tries to induce a policy that matches the expert’s occupancy measure.
Practical occupancy measure matching
We saw in Corollary 3.2.1 that if $\psi$ is constant, the resulting primal problem (7) simply matches occupancy measures with the expert at all states and actions. Such an algorithm, however, is not practically useful. In reality, the expert trajectory distribution will be provided only as a finite set of samples, so in large environments, most of the expert's occupancy measure values will be exactly zero, and exact occupancy measure matching will force the learned policy to never visit these unseen state-action pairs simply due to lack of data. Furthermore, with large environments, we would like to use function approximation to learn a parameterized policy $\pi_\theta$. The resulting optimization problem of finding the appropriate $\theta$ would have as many constraints as points in $\mathcal{S} \times \mathcal{A}$, leading to an intractably large problem and defeating the very purpose of function approximation.
Keeping in mind that we wish to eventually develop an imitation learning algorithm suitable for large environments, we would like to relax Eq. 7 into the following form, motivated by Proposition 3.2:
$$\underset{\pi}{\text{minimize}} \; d_\psi(\rho_\pi, \rho_E) - H(\pi) \qquad (8)$$
by modifying the IRL regularizer $\psi$ so that $d_\psi(\rho_\pi, \rho_E) \triangleq \psi^*(\rho_\pi - \rho_E)$ smoothly penalizes violations in difference between the occupancy measures.
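For a class of cost functions $\mathcal{C} \subset \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$, an apprenticeship learning algorithm finds a policy that performs better than the expert across $\mathcal{C}$, by optimizing the objective
$$\underset{\pi}{\text{minimize}} \; \max_{c \in \mathcal{C}} \mathbb{E}_{\pi}[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)] \qquad (9)$$
Classic apprenticeship learning algorithms restrict $\mathcal{C}$ to convex sets given by linear combinations of basis functions $f_1, \dots, f_d$, which give rise to a feature vector $f(s,a) = [f_1(s,a), \dots, f_d(s,a)]$ for each state-action pair. Abbeel and Ng and Syed et al. use, respectively,
$$\mathcal{C}_{\text{linear}} = \Big\{ \textstyle\sum_i w_i f_i : \|w\|_2 \le 1 \Big\} \quad \text{and} \quad \mathcal{C}_{\text{convex}} = \Big\{ \textstyle\sum_i w_i f_i : \sum_i w_i = 1, \; w_i \ge 0 \; \forall i \Big\} \qquad (10)$$
With the indicator function $\delta_{\mathcal{C}}$, defined by $\delta_{\mathcal{C}}(c) = 0$ if $c \in \mathcal{C}$ and $+\infty$ otherwise, the inner maximum in (9) equals $\max_{c} -\delta_{\mathcal{C}}(c) + \sum_{s,a} (\rho_\pi(s,a) - \rho_{\pi_E}(s,a)) c(s,a) = \delta^*_{\mathcal{C}}(\rho_\pi - \rho_{\pi_E})$. Therefore, we see that entropy-regularized apprenticeship learning
$$\underset{\pi}{\text{minimize}} \; -H(\pi) + \max_{c \in \mathcal{C}} \mathbb{E}_{\pi}[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)] \qquad (11)$$
is equivalent to performing RL following IRL with cost regularizer $\psi = \delta_{\mathcal{C}}$, which forces the implicit IRL procedure to recover a cost function lying in $\mathcal{C}$. Note that we can scale the policy's entropy regularization strength in Eq. 11 by scaling $\mathcal{C}$ by a constant $\alpha$ as $\{\alpha c : c \in \mathcal{C}\}$, recovering the original apprenticeship objective (9) by taking $\alpha \to \infty$.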
It is known that apprenticeship learning algorithms generally do not recover expert-like policies if $\mathcal{C}$ is too restrictive [29, Section 1], which is often the case for the linear subspaces used by feature expectation matching, MWAL, and LPAL, unless the basis functions $f_1, \dots, f_d$ are very carefully designed. Intuitively, unless the true expert cost function (assuming it exists) lies in $\mathcal{C}$, there is no guarantee that if $\pi$ performs better than $\pi_E$ on all of $\mathcal{C}$, then $\pi$ equals $\pi_E$. With the aforementioned insight based on Proposition 3.2 that apprenticeship learning is equivalent to RL following IRL, we can understand exactly why apprenticeship learning may fail to imitate: it forces $\pi_E$ to be encoded as an element of $\mathcal{C}$. If $\mathcal{C}$ does not include a cost function that explains expert behavior well, then attempting to recover a policy from such an encoding will not succeed.
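While restrictive cost classes $\mathcal{C}$ may not lead to exact imitation, apprenticeship learning with such $\mathcal{C}$ can scale to large state and action spaces with policy function approximation. Ho et al. rely on the following policy gradient formula for the apprenticeship objective (9) for a parameterized policy $\pi_\theta$:
$$\nabla_\theta \max_{c \in \mathcal{C}} \mathbb{E}_{\pi_\theta}[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)] = \nabla_\theta \mathbb{E}_{\pi_\theta}[c^*(s,a)] = \mathbb{E}_{\pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a|s) \, Q_{c^*}(s,a) \big] \qquad (12)$$
where $c^* = \operatorname*{argmax}_{c \in \mathcal{C}} \mathbb{E}_{\pi_\theta}[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)]$ and $Q_{c^*}(\bar{s}, \bar{a}) = \mathbb{E}_{\pi_\theta}[c^*(s,a) \mid s_0 = \bar{s}, a_0 = \bar{a}]$.
Observing that Eq. 12 is the policy gradient for a reinforcement learning objective with cost $c^*$, Ho et al. propose an algorithm that alternates between two steps: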
Sample trajectories of the current policy $\pi_{\theta_i}$ by simulating in the environment, and fit a cost function $c^*_i$, as defined in Eq. 12. For the cost classes $\mathcal{C}_{\text{linear}}$ and $\mathcal{C}_{\text{convex}}$ (Eq. 10), this cost fitting amounts to evaluating simple analytical expressions.
Form a gradient estimate with Eq. 12 with $c^*_i$ and the sampled trajectories, and take a trust region policy optimization (TRPO) step to produce $\pi_{\theta_{i+1}}$.
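The cost-fitting step above can be sketched for the two linear classes. For the class with weights in the unit L2 ball, the maximizer of the difference in expected cost between learner and expert is the normalized difference of feature expectations; for the class with weights on the simplex, it is a one-hot vector on the coordinate with the largest difference. The function names below are hypothetical, and the feature expectation vectors are assumed to have been estimated already from sampled trajectories.

```python
import math


def fit_cost_linear(mu_policy, mu_expert):
    """Weights w with ||w||_2 <= 1 maximizing w . (mu_policy - mu_expert)."""
    diff = [p - e for p, e in zip(mu_policy, mu_expert)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        return [0.0] * len(diff)    # any feasible w is optimal; pick zero
    return [d / norm for d in diff]


def fit_cost_convex(mu_policy, mu_expert):
    """Simplex weights maximizing w . (mu_policy - mu_expert): a one-hot vector."""
    diff = [p - e for p, e in zip(mu_policy, mu_expert)]
    best = max(range(len(diff)), key=diff.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(diff))]


def objective(weights, mu_policy, mu_expert):
    """Value of E_pi[c] - E_piE[c] for the cost c = sum_i w_i f_i."""
    return sum(w * (p - e) for w, p, e in zip(weights, mu_policy, mu_expert))


# Assumed feature expectations for illustration only.
mu_policy = [3.0, 0.0]
mu_expert = [0.0, 4.0]
w_linear = fit_cost_linear(mu_policy, mu_expert)
w_convex = fit_cost_convex(mu_policy, mu_expert)
```

With these inputs the L2-ball objective equals the Euclidean distance between the feature expectations (5.0), and the simplex objective equals the largest coordinate gap (3.0).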
This algorithm relies crucially on the TRPO policy step, which is a natural gradient step constrained to ensure that $\pi_{\theta_{i+1}}$ does not stray too far from $\pi_{\theta_i}$, as measured by KL divergence between the two policies averaged over the states in the sampled trajectories. This carefully constructed step scheme ensures that divergence does not occur due to high noise in estimating the gradient (Eq. 12). We refer the reader to Schulman et al. for more details on TRPO.
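The quantity that the trust region constrains can be illustrated for discrete action distributions. This is only a sketch of the averaged KL measurement, not of the TRPO update itself; the per-state action probabilities and the threshold `max_kl` are assumed values chosen for the example.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mean_kl(old_policy_probs, new_policy_probs):
    """Average KL(old(.|s) || new(.|s)) over the sampled states."""
    kls = [kl_divergence(p, q)
           for p, q in zip(old_policy_probs, new_policy_probs)]
    return sum(kls) / len(kls)


# Action probabilities of the old and candidate policies at three sampled
# states (hypothetical numbers).
old_probs = [[0.5, 0.5], [0.8, 0.2], [0.3, 0.7]]
new_probs = [[0.55, 0.45], [0.75, 0.25], [0.3, 0.7]]

max_kl = 0.01                      # assumed trust region size
step_kl = mean_kl(old_probs, new_probs)
step_accepted = step_kl <= max_kl
```

A candidate step whose averaged KL exceeds the threshold would be shrunk or rejected, which is what keeps noisy gradient estimates from moving the policy too far in one update.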
With the TRPO step scheme, Ho et al. were able to train large neural network policies for apprenticeship learning with the linear cost function classes of Eq. 10 in environments with hundreds of observation dimensions. Their use of these linear cost function classes, however, limits their approach to settings in which expert behavior is well-described by such classes. We will draw upon their algorithm to develop an imitation learning method that both scales to large environments and imitates arbitrarily complex expert behavior. To do so, we first turn to proposing a new regularizer $\psi$ that wields more expressive power than the regularizers corresponding to $\mathcal{C}_{\text{linear}}$ and $\mathcal{C}_{\text{convex}}$ (Eq. 10).
Generative adversarial imitation learning
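As discussed in Section 4, the constant regularizer leads to an imitation learning algorithm that exactly matches occupancy measures, but is intractable in large environments. The indicator regularizers for the linear cost function classes (Eq. 10), on the other hand, lead to algorithms incapable of exactly matching occupancy measures without careful tuning, but are tractable in large environments. We propose the following new cost regularizer that combines the best of both worlds, as we will show in the coming sections:
$$\psi_{GA}(c) \triangleq \begin{cases} \mathbb{E}_{\pi_E}[g(c(s,a))] & \text{if } c < 0 \\ +\infty & \text{otherwise} \end{cases} \quad \text{where} \quad g(x) = \begin{cases} -x - \log(1 - e^{x}) & \text{if } x < 0 \\ +\infty & \text{otherwise} \end{cases} \qquad (13)$$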
This regularizer places low penalty on cost functions $c$ that assign an amount of negative cost to expert state-action pairs; if $c$, however, assigns large costs (close to zero, which is the upper bound for costs feasible for $\psi_{GA}$) to the expert, then $\psi_{GA}$ will heavily penalize $c$. An interesting property of $\psi_{GA}$ is that it is an average over expert data, and therefore can adjust to arbitrary expert datasets. The indicator regularizers $\delta_{\mathcal{C}}$, used by the linear apprenticeship learning algorithms described in Section 4, are always fixed, and cannot adapt to data as $\psi_{GA}$ can. Perhaps the most important difference between $\psi_{GA}$ and $\delta_{\mathcal{C}}$, however, is that $\delta_{\mathcal{C}}$ forces costs to lie in a small subspace spanned by finitely many basis functions, whereas $\psi_{GA}$ allows for any cost function, as long as it is negative everywhere.
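To make the shape of the penalty concrete, the sketch below evaluates g(x) = -x - log(1 - e^x) for negative x (and +infinity otherwise), and averages it over a list of costs assigned to expert samples. The plain sample mean stands in for the expert expectation, which is an assumption of this example, as are the function names and the sample costs.

```python
import math


def g(x):
    """g(x) = -x - log(1 - exp(x)) for x < 0, and +infinity otherwise."""
    if x >= 0.0:
        return math.inf
    return -x - math.log(-math.expm1(x))   # -expm1(x) == 1 - exp(x)


def psi_ga(expert_costs):
    """Sample-average stand-in for E_piE[g(c(s, a))] over expert pairs."""
    return sum(g(c) for c in expert_costs) / len(expert_costs)


# Costs assigned to expert state-action pairs by two hypothetical cost functions.
moderately_negative = [-0.7, -0.6, -0.8]
close_to_zero = [-1e-6, -1e-5, -1e-6]

penalty_low = psi_ga(moderately_negative)
penalty_high = psi_ga(close_to_zero)
```

Setting the derivative of g to zero gives e^x = 1/2, so the penalty is smallest at x = -log 2, where g equals log 4; it grows without bound as the cost approaches zero from below.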
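Our choice of $\psi_{GA}$ is motivated by the following fact, shown in the appendix (Corollary A.1.1):
$$\psi^*_{GA}(\rho_\pi - \rho_{\pi_E}) = \max_{D \in (0,1)^{\mathcal{S} \times \mathcal{A}}} \mathbb{E}_{\pi}[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] \qquad (14)$$
where the maximum ranges over discriminative classifiers $D : \mathcal{S} \times \mathcal{A} \to (0,1)$. Equation 14 is the optimal negative log loss of the binary classification problem of distinguishing between state-action pairs of $\pi$ and $\pi_E$. It turns out that this optimal loss is (up to a constant shift) the Jensen-Shannon divergence $D_{JS}(\rho_\pi, \rho_{\pi_E})$, which is a squared metric between distributions. Treating the causal entropy $H$ as a policy regularizer, controlled by $\lambda \ge 0$, we obtain a new imitation learning algorithm:
$$\underset{\pi}{\text{minimize}} \; \psi^*_{GA}(\rho_\pi - \rho_{\pi_E}) - \lambda H(\pi) = D_{JS}(\rho_\pi, \rho_{\pi_E}) - \lambda H(\pi) \qquad (15)$$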
which finds a policy whose occupancy measure minimizes Jensen-Shannon divergence to the expert’s. Equation 15 minimizes a true metric between occupancy measures, so, unlike linear apprenticeship learning algorithms, it can imitate expert policies exactly.
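The link between the optimal classification loss and the Jensen-Shannon divergence can be checked numerically. The sketch below makes two simplifying assumptions: the two occupancy measures are replaced by ordinary normalized distributions `p` and `q` over a handful of state-action pairs, and the Jensen-Shannon divergence uses the common convention with factors of one half. Under those assumptions the maximum over discriminators is attained at D = p / (p + q) and equals 2 JS(p, q) - log 4.

```python
import math


def optimal_discriminator_value(p, q):
    """max_D sum_x p(x) log D(x) + q(x) log(1 - D(x)), attained at D = p/(p+q)."""
    total = 0.0
    for px, qx in zip(p, q):
        if px + qx == 0.0:
            continue                         # pair never visited by either side
        d = px / (px + qx)
        if px > 0.0:
            total += px * math.log(d)
        if qx > 0.0:
            total += qx * math.log(1.0 - d)
    return total


def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0.0)


def js_divergence(p, q):
    """JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""
    m = [(px + qx) / 2.0 for px, qx in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical learner and expert distributions over four state-action pairs.
p = [0.1, 0.2, 0.3, 0.4]
q = [0.4, 0.3, 0.2, 0.1]

value = optimal_discriminator_value(p, q)
shifted_js = 2.0 * js_divergence(p, q) - math.log(4.0)
```

When the two distributions coincide the discriminator can do no better than D = 1/2 everywhere, and the value drops to its minimum of -log 4.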
Equation 15 draws a connection between imitation learning and generative adversarial networks, which train a generative model $G$ by having it confuse a discriminative classifier $D$. The job of $D$ is to distinguish between the distribution of data generated by $G$ and the true data distribution. When $D$ cannot distinguish data generated by $G$ from the true data, then $G$ has successfully matched the true data. In our setting, the learner's occupancy measure $\rho_\pi$ is analogous to the data distribution generated by $G$, and the expert's occupancy measure $\rho_{\pi_E}$ is analogous to the true data distribution.
Now, we present a practical algorithm, which we call generative adversarial imitation learning (Algorithm 1), for solving Eq. 15 for model-free imitation in large environments. Explicitly, we wish to find a saddle point $(\pi, D)$ of the expression
$$\mathbb{E}_{\pi}[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi) \qquad (16)$$
To do so, we first introduce function approximation for $\pi$ and $D$: we will fit a parameterized policy $\pi_\theta$, with weights $\theta$, and a discriminator network $D_w : \mathcal{S} \times \mathcal{A} \to (0,1)$, with weights $w$. Then, we alternate between an Adam gradient step on $w$ to increase Eq. 16 with respect to $D$, and a TRPO step on $\theta$ to decrease Eq. 16 with respect to $\pi$. The TRPO step serves the same purpose as it does with the apprenticeship learning algorithm of Ho et al.: it prevents the policy from changing too much due to noise in the policy gradient. The discriminator network can be interpreted as a local cost function providing learning signal to the policy. Specifically, taking a policy step that decreases expected cost with respect to the cost function $c(s,a) = \log D(s,a)$ will move toward expert-like regions of state-action space, as classified by the discriminator. (We derive an estimator for the causal entropy gradient in Section A.2.)
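The discriminator half of this alternation can be sketched on toy data. The example below is a deliberately reduced stand-in, not the algorithm as run in the experiments: the discriminator is a one-dimensional logistic model rather than a neural network, plain gradient ascent replaces Adam, the TRPO policy step is omitted entirely, and the scalar "state-action features" are invented. It shows only the direction of the update (push D toward 1 on learner samples and toward 0 on expert samples) and the resulting cost log D.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def discriminator_step(w, b, policy_x, expert_x, lr):
    """One ascent step on mean log D(policy) + mean log(1 - D(expert))."""
    grad_w = grad_b = 0.0
    for x in policy_x:                  # d/dz log sigmoid(z) = 1 - sigmoid(z)
        d = sigmoid(w * x + b)
        grad_w += (1.0 - d) * x / len(policy_x)
        grad_b += (1.0 - d) / len(policy_x)
    for x in expert_x:                  # d/dz log(1 - sigmoid(z)) = -sigmoid(z)
        d = sigmoid(w * x + b)
        grad_w -= d * x / len(expert_x)
        grad_b -= d / len(expert_x)
    return w + lr * grad_w, b + lr * grad_b


def cost(w, b, x):
    """Cost signal handed to the policy step: c(x) = log D(x)."""
    return math.log(sigmoid(w * x + b))


# Invented one-dimensional features for learner and expert samples.
policy_x = [2.0, 2.5, 3.0]
expert_x = [-1.0, -0.5, 0.0]

w, b = 0.0, 0.0
for _ in range(200):
    w, b = discriminator_step(w, b, policy_x, expert_x, lr=0.5)

d_on_policy = sigmoid(w * 2.5 + b)
d_on_expert = sigmoid(w * -0.5 + b)
```

After training, the cost is lower on expert-like inputs than on learner-like inputs, so a policy step that reduces expected cost moves the learner toward the region the discriminator associates with the expert.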
Experiments
We evaluated Algorithm 1 against baselines on 9 physics-based control tasks, ranging from low-dimensional control tasks from the classic RL literature (the cartpole, acrobot, and mountain car) to difficult high-dimensional tasks such as 3D humanoid locomotion, solved only recently by model-free reinforcement learning. All environments, other than the classic control tasks, were simulated with MuJoCo. See Appendix B for a complete description of all the tasks.
Each task comes with a true cost function, defined in the OpenAI Gym. We first generated expert behavior for these tasks by running TRPO on these true cost functions to create expert policies. Then, to evaluate imitation performance with respect to sample complexity of expert data, we sampled datasets of varying trajectory counts from the expert policies. The trajectories constituting each dataset each consisted of about 50 state-action pairs. We tested Algorithm 1 against three baselines:
Behavioral cloning: a given dataset of state-action pairs is split into 70% training data and 30% validation data. The policy is trained with supervised learning, using Adam with minibatches of 128 examples, until validation error stops decreasing.
Feature expectation matching (FEM): the algorithm of Ho et al. using the cost function class $\mathcal{C}_{\text{linear}}$ (Eq. 10) of Abbeel and Ng.
Game-theoretic apprenticeship learning (GTAL): the algorithm of Ho et al. using the cost function class $\mathcal{C}_{\text{convex}}$ (Eq. 10) of Syed and Schapire.
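The data handling of the behavioral cloning baseline (70/30 split, minibatches of 128, stopping when validation error no longer decreases) can be sketched independently of any particular model. In the sketch below, `train_step` and `validation_error` are hypothetical callables standing in for an Adam update and a held-out loss, and stopping at the first non-improving epoch is one assumed reading of "until validation error stops decreasing".

```python
import math
import random


def split_dataset(pairs, train_fraction=0.7, seed=0):
    """Shuffle state-action pairs and split them into training and validation sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def minibatches(data, batch_size=128):
    """Yield consecutive minibatches; the last one may be smaller."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def fit_with_early_stopping(train_step, validation_error, train, val,
                            max_epochs=1000):
    """Train epoch by epoch until validation error fails to decrease."""
    best = math.inf
    for _ in range(max_epochs):
        for batch in minibatches(train):
            train_step(batch)            # hypothetical supervised update
        err = validation_error(val)      # hypothetical held-out loss
        if err >= best:
            break
        best = err
    return best


# Usage with stand-in callables: the fake validation error falls, then rises.
fake_errors = iter([0.9, 0.5, 0.4, 0.45, 0.3])
pairs = [(i, i % 3) for i in range(300)]
train, val = split_dataset(pairs)
best_error = fit_with_early_stopping(lambda batch: None,
                                     lambda data: next(fake_errors),
                                     train, val)
batch_sizes = [len(b) for b in minibatches(train)]
```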
We used all algorithms to train policies of the same neural network architecture for all tasks: two hidden layers of 100 units each, with tanh nonlinearities in between. The discriminator networks for Algorithm 1 also used the same architecture. All networks were always initialized randomly at the start of each trial. For each task, we gave FEM, GTAL, and Algorithm 1 exactly the same amount of environment interaction for training.
Figure 1 depicts the results, and the tables in Appendix B provide exact performance numbers. We found that on the classic control tasks (cartpole, acrobot, and mountain car), behavioral cloning suffered in expert data efficiency compared to FEM and GTAL, which for the most part were able to produce policies with near-expert performance with a wide range of dataset sizes. On these tasks, our generative adversarial algorithm always produced policies performing better than behavioral cloning, FEM, and GTAL. However, behavioral cloning performed excellently on the Reacher task, on which it was more sample efficient than our algorithm. We were able to slightly improve our algorithm's performance on Reacher using causal entropy regularization: in the 4-trajectory setting, the improvement from $\lambda = 0$ to $\lambda = 10^{-3}$ was statistically significant over training reruns, according to a one-sided Wilcoxon rank-sum test with $p = .05$. We used no causal entropy regularization for all other tasks.
On the other MuJoCo environments, we saw a large performance boost for our algorithm over the baselines. Our algorithm almost always achieved at least 70% of expert performance for all dataset sizes we tested, nearly always dominating all the baselines. FEM and GTAL performed poorly for Ant, producing policies consistently worse than a policy that chooses actions uniformly at random. Behavioral cloning was able to reach satisfactory performance with enough data on HalfCheetah, Hopper, Walker, and Ant; but was unable to achieve more than 60% for Humanoid, on which our algorithm achieved exact expert performance for all tested dataset sizes.
Discussion and outlook
As we demonstrated, our method is generally quite sample efficient in terms of expert data. However, it is not particularly sample efficient in terms of environment interaction during training. The number of such samples required to estimate the imitation objective gradient (Eq. 18) was comparable to the number needed for TRPO to train the expert policies from reinforcement signals. We believe that we could significantly improve learning speed for our algorithm by initializing policy parameters with behavioral cloning, which requires no environment interaction at all.
Fundamentally, our method is model free, so it will generally need more environment interaction than model-based methods. Guided cost learning, for instance, builds upon guided policy search and inherits its sample efficiency, but also inherits its requirement that the model is well-approximated by iteratively fitted time-varying linear dynamics. Interestingly, both our Algorithm 1 and guided cost learning alternate between policy optimization steps and cost fitting (which we called discriminator fitting), even though the two algorithms are derived completely differently.
Our approach builds upon a vast line of work on IRL, and hence, just like IRL, our approach does not interact with the expert during training. Our method explores randomly to determine which actions bring a policy's occupancy measure closer to the expert's, whereas methods that do interact with the expert, like DAgger, can simply ask the expert for such actions. Ultimately, we believe that a method that combines well-chosen environment models with expert interaction will win in terms of sample complexity of both expert data and environment interaction.
We thank Jayesh K. Gupta and John Schulman for assistance and advice. This work was supported by the SAIL-Toyota Center for AI Research, and by an NSF Graduate Research Fellowship (grant no. DGE-114747).
References
Appendix A Proofs
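Proof of Lemma 3.1. First, we show strict concavity of $\bar{H}$. Let $\rho$ and $\rho'$ be occupancy measures, and suppose $\lambda \in [0,1]$. For all $s$ and $a$, the log-sum inequality implies:
$$-\big(\lambda \rho(s,a) + (1-\lambda) \rho'(s,a)\big) \log \frac{\lambda \rho(s,a) + (1-\lambda) \rho'(s,a)}{\sum_{a'} \big(\lambda \rho(s,a') + (1-\lambda) \rho'(s,a')\big)} \ge \lambda \Big( -\rho(s,a) \log \frac{\rho(s,a)}{\sum_{a'} \rho(s,a')} \Big) + (1-\lambda) \Big( -\rho'(s,a) \log \frac{\rho'(s,a)}{\sum_{a'} \rho'(s,a')} \Big)$$
with equality if and only if $\pi_\rho(a|s) = \pi_{\rho'}(a|s)$. Summing both sides over all $s$ and $a$ shows that $\bar{H}(\lambda \rho + (1-\lambda) \rho') \ge \lambda \bar{H}(\rho) + (1-\lambda) \bar{H}(\rho')$ with equality if and only if $\pi_\rho = \pi_{\rho'}$. Applying Proposition 3.1 shows that equality in fact holds if and only if $\rho = \rho'$, so $\bar{H}$ is strictly concave.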
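Now, we turn to verifying the last two statements, which also follow from Proposition 3.1 and the definition of occupancy measures. First,
$$H(\pi) = \mathbb{E}_{\pi}[-\log \pi(a|s)] = -\sum_{s,a} \rho_\pi(s,a) \log \pi(a|s) = -\sum_{s,a} \rho_\pi(s,a) \log \frac{\rho_\pi(s,a)}{\sum_{a'} \rho_\pi(s,a')} = \bar{H}(\rho_\pi),$$
and second,
$$\bar{H}(\rho) = -\sum_{s,a} \rho(s,a) \log \frac{\rho(s,a)}{\sum_{a'} \rho(s,a')} = -\sum_{s,a} \rho_{\pi_\rho}(s,a) \log \pi_\rho(a|s) = \mathbb{E}_{\pi_\rho}[-\log \pi_\rho(a|s)] = H(\pi_\rho).$$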
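Proof of Proposition 3.2. This proof relies on properties of saddle points. For a reference, we refer the reader to Hiriart-Urruty and Lemaréchal [10, section VII.4].
Let $\tilde{c} \in \mathrm{IRL}_\psi(\pi_E)$, $\tilde{\pi} \in \mathrm{RL}(\tilde{c})$, and $\pi_A \in \operatorname*{argmin}_{\pi} -H(\pi) + \psi^*(\rho_\pi - \rho_{\pi_E})$, with occupancy measures $\tilde{\rho}$ and $\rho_A$ respectively. Define $\bar{L}(\rho, c) = -\bar{H}(\rho) - \psi(c) + \sum_{s,a} \rho(s,a) c(s,a) - \sum_{s,a} \rho_{\pi_E}(s,a) c(s,a)$. The following relationships then hold, due to Proposition 3.1:
$$\rho_A \in \operatorname*{argmin}_{\rho \in \mathcal{D}} \max_{c} \bar{L}(\rho, c), \qquad \tilde{c} \in \operatorname*{argmax}_{c} \min_{\rho \in \mathcal{D}} \bar{L}(\rho, c), \qquad \tilde{\rho} \in \operatorname*{argmin}_{\rho \in \mathcal{D}} \bar{L}(\rho, \tilde{c}).$$
Minimax duality then shows that $(\rho_A, \tilde{c})$ is a saddle point of $\bar{L}$, and strict convexity of $-\bar{H}$ yields $\tilde{\rho} = \rho_A$, hence $\tilde{\pi} = \pi_A$.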
A.2 Proofs for Section 5
In Eq. 13 of Section 5, we described a cost regularizer $\psi_{GA}$, which leads to an imitation learning algorithm (Eq. 15) that minimizes Jensen-Shannon divergence between occupancy measures. To justify our choice of $\psi_{GA}$, we show how to convert certain surrogate loss functions $\phi$, for binary classification of state-action pairs drawn from the occupancy measures $\rho_\pi$ and $\rho_{\pi_E}$, into cost function regularizers $\psi$, for which $\psi^*(\rho_\pi - \rho_{\pi_E})$ is the negative of the minimum expected risk $R_\phi(\rho_\pi, \rho_{\pi_E})$ for $\phi$:
$$R_\phi(\rho_\pi, \rho_{\pi_E}) = \sum_{s,a} \min_{\gamma \in \mathbb{R}} \rho_\pi(s,a) \phi(\gamma) + \rho_{\pi_E}(s,a) \phi(-\gamma)$$
Specifically, we will restrict ourselves to strictly decreasing convex loss functions. Nguyen et al. show a correspondence between minimum expected risks $R_\phi$ and $f$-divergences, of which Jensen-Shannon divergence is a special case. Our following construction, therefore, can generate any imitation learning algorithm that minimizes an $f$-divergence between occupancy measures, as long as that $f$-divergence is induced by a strictly decreasing convex surrogate $\phi$.
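Proposition A.1. Suppose $\phi : \mathbb{R} \to \mathbb{R}$ is a strictly decreasing convex function. Let $T$ be the range of $-\phi$, and define $g_\phi$ and $\psi_\phi$ by:
$$g_\phi(x) = \begin{cases} -x + \phi(-\phi^{-1}(-x)) & \text{if } x \in T \\ +\infty & \text{otherwise} \end{cases} \qquad \psi_\phi(c) = \begin{cases} \sum_{s,a} \rho_{\pi_E}(s,a) \, g_\phi(c(s,a)) & \text{if } c(s,a) \in T \text{ for all } s, a \\ +\infty & \text{otherwise} \end{cases} \qquad (40)$$
Then, $\psi_\phi$ is closed, proper, and convex, and $\mathrm{RL} \circ \mathrm{IRL}_{\psi_\phi}(\pi_E) = \operatorname*{argmin}_{\pi} -H(\pi) - R_\phi(\rho_\pi, \rho_{\pi_E})$.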
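Now, we verify the second claim. By Proposition 3.2, all we need to check is that $-R_\phi(\rho_\pi, \rho_{\pi_E}) = \psi^*_\phi(\rho_\pi - \rho_{\pi_E})$:
$$\psi^*_\phi(\rho_\pi - \rho_{\pi_E}) = \sum_{s,a} \max_{c \in T} \big(\rho_\pi(s,a) - \rho_{\pi_E}(s,a)\big) c - \rho_{\pi_E}(s,a) \big[ -c + \phi(-\phi^{-1}(-c)) \big] = \sum_{s,a} \max_{c \in T} \rho_\pi(s,a) \, c - \rho_{\pi_E}(s,a) \, \phi(-\phi^{-1}(-c))$$
$$= \sum_{s,a} \max_{\gamma \in \mathbb{R}} -\rho_\pi(s,a) \phi(\gamma) - \rho_{\pi_E}(s,a) \phi(-\gamma) = -R_\phi(\rho_\pi, \rho_{\pi_E}),$$
where we made the change of variables $c \to -\phi(\gamma)$, justified because $T$ is the range of $-\phi$. ∎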
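Having shown how to construct a cost function regularizer $\psi_\phi$ from $\phi$, we obtain, as a corollary, a cost function regularizer for the logistic loss, whose optimal expected risk is, up to a constant, the Jensen-Shannon divergence.
Using the logistic loss $\phi(x) = \log(1 + e^{-x})$, we see that Eq. 40 reduces to the claimed $\psi_{GA}$ of Eq. 13. Applying Proposition A.1, we get
$$\psi^*_{GA}(\rho_\pi - \rho_{\pi_E}) = -R_\phi(\rho_\pi, \rho_{\pi_E}) = \sum_{s,a} \max_{\gamma \in \mathbb{R}} \rho_\pi(s,a) \log \sigma(\gamma) + \rho_{\pi_E}(s,a) \log(1 - \sigma(\gamma)),$$
where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function. Because the range of $\sigma$ is $(0,1)$, this equals the maximum over discriminators $D \in (0,1)^{\mathcal{S} \times \mathcal{A}}$ in Eq. 14.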
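We conclude with a policy gradient formula for causal entropy: $\nabla_\theta \mathbb{E}_{\pi_\theta}[-\log \pi_\theta(a|s)] = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \, Q_{\log}(s,a)]$, where $Q_{\log}(\bar{s}, \bar{a}) = \mathbb{E}_{\pi_\theta}[-\log \pi_\theta(a|s) \mid s_0 = \bar{s}, a_0 = \bar{a}]$.
For an occupancy measure $\rho(s,a)$, define $\rho(s) = \sum_a \rho(s,a)$. Next,
$$\nabla_\theta \mathbb{E}_{\pi_\theta}[-\log \pi_\theta(a|s)] = -\nabla_\theta \sum_{s,a} \rho_{\pi_\theta}(s,a) \log \pi_\theta(a|s) = -\sum_{s,a} \big(\nabla_\theta \rho_{\pi_\theta}(s,a)\big) \log \pi_\theta(a|s) - \sum_{s} \rho_{\pi_\theta}(s) \sum_{a} \nabla_\theta \pi_\theta(a|s)$$
The second term vanishes, because $\sum_a \nabla_\theta \pi_\theta(a|s) = \nabla_\theta \sum_a \pi_\theta(a|s) = \nabla_\theta 1 = 0$. We are left with
$$\nabla_\theta \mathbb{E}_{\pi_\theta}[-\log \pi_\theta(a|s)] = \sum_{s,a} \big(\nabla_\theta \rho_{\pi_\theta}(s,a)\big) \big(-\log \pi_\theta(a|s)\big),$$
which is the policy gradient for reinforcement learning with the fixed cost function $c_{\log}(s,a) \triangleq -\log \pi_\theta(a|s)$, so the standard policy gradient formula for $c_{\log}$ gives the result.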
Appendix B Environments and detailed results
The environments we used for our experiments are from the OpenAI Gym. The names and version numbers of these environments are listed in Appendix B, which also lists dimension or cardinality of their observation and action spaces (numbers marked “continuous” indicate dimension for a continuous space, and numbers marked “discrete” indicate cardinality for a finite space).