Gaussian Processes for Data-Efficient Learning in Robotics and Control

Marc Peter Deisenroth, Dieter Fox, Carl Edward Rasmussen

Introduction

One of the main limitations of many current reinforcement learning (RL) algorithms is that learning is prohibitively slow, i.e., the required number of interactions with the environment is impractically high. For example, many RL approaches in problems with low-dimensional state spaces and fairly benign dynamics require thousands of trials to learn. This data inefficiency makes learning in real control/robotic systems impractical and prohibits RL approaches in more challenging scenarios.

Increasing the data efficiency in RL requires either task-specific prior knowledge or extraction of more information from available data. In this article, we assume that expert knowledge (e.g., in terms of expert demonstrations, realistic simulators, or explicit differential equations for the dynamics) is unavailable. Instead, we carefully model the observed dynamics using a general flexible nonparametric approach.

Generally, model-based methods, i.e., methods which learn an explicit dynamics model of the environment, are more promising for efficiently extracting valuable information from available data than model-free methods, such as Q-learning or TD-learning. The main reason why model-based methods are not widely used in RL is that they can suffer severely from model errors, i.e., they inherently assume that the learned model resembles the real environment sufficiently accurately. Model errors are especially an issue when only a few samples and no informative prior knowledge about the task are available. Fig. 1 illustrates how model errors can affect learning.

When learning models, considerable model uncertainty is present, especially early on in learning. Thus, we require probabilistic models to express this uncertainty. Moreover, model uncertainty needs to be incorporated into planning and policy evaluation. Based on these ideas, we propose pilco (Probabilistic Inference for Learning Control), a model-based policy search method. As a probabilistic model we use nonparametric Gaussian processes (GPs). Pilco uses computationally efficient deterministic approximate inference for long-term predictions and policy evaluation. Policy improvement is based on analytic policy gradients. Due to probabilistic modeling and inference, pilco achieves unprecedented learning efficiency in continuous state-action domains and, hence, is directly applicable to complex mechanical systems, such as robots.

In this article, we provide a detailed overview of the key ingredients of the pilco learning framework. In particular, we assess the quality of two different approximate inference methods in the context of policy search. Moreover, we give a concrete example of the importance of Bayesian modeling and inference for fast learning from scratch. We demonstrate that Pilco’s unprecedented learning speed makes it directly applicable to realistic control and robotic hardware platforms.

This article is organized as follows: After discussing related work in Sec. 2, we describe the key ideas of the pilco learning framework in Sec. 3, i.e., the dynamics model, policy evaluation, and gradient-based policy improvement. In Sec. 4, we detail two approaches for long-term predictions for policy evaluation. In Sec. 5, we describe how the policy is represented and practically implemented. A particular cost function and its natural exploration/exploitation trade-off are discussed in Sec. 6. Experimental results are provided in Sec. 7. In Sec. 8, we discuss key properties, limitations, and extensions of the pilco framework before concluding in Sec. 9.

Related Work

Controlling systems under parameter uncertainty has been investigated for decades in robust and adaptive control. Typically, a certainty equivalence principle is applied, which treats estimates of the model parameters as if they were the true values. Approaches to designing adaptive controllers that explicitly take uncertainty about the model parameters into account are stochastic adaptive control and dual control. Dual control aims to reduce parameter uncertainty by explicit probing, which is closely related to the exploration problem in RL. Robust, adaptive, and dual control are most often applied to linear systems; nonlinear extensions exist in special cases.

The specification of parametric models for a particular control problem is often challenging and requires intricate knowledge about the system. Sometimes, a rough model estimate with uncertain parameters is sufficient to solve challenging control problems. For instance, this approach has been applied together with locally optimal controllers and temporal bias terms for handling model errors. The key idea was to ground policy evaluations using real-life trials rather than the approximate model.

All above-mentioned approaches to finding controllers require more or less accurate parametric models. These models are problem specific and have to be manually specified, i.e., they are not suited for learning models for a broad range of tasks. Nonparametric regression methods, however, are promising for automatically extracting the important features of the latent dynamics from data. Locally weighted Bayesian regression has been used as a nonparametric method for learning these models. To deal with model uncertainty, model parameters have been sampled from the parameter posterior, which accounts for temporal correlation. Alternatively, model uncertainty has been treated as noise. In that case, the approach to controller learning was based on stochastic dynamic programming in discretized spaces, where the model errors at each time step were assumed independent.

Pilco builds upon the idea of treating model uncertainty as noise. However, unlike that earlier approach, pilco is a policy search method and does not require state space discretization. Instead, closed-form Bayesian averaging over infinitely many plausible dynamics models is possible by using nonparametric GPs.

Nonparametric GP dynamics models in RL have been proposed previously, with the GP training data obtained from “motor babbling”. Unlike pilco, these approaches model global value functions to derive policies, requiring accurate value function models. To reduce the effect of model errors in the value functions, many data points are necessary as value functions are often discontinuous, which often renders value-function based methods statistically and computationally impractical in high-dimensional state spaces. Therefore, learning GP value function models has been proposed to address the issue of model errors in the value function. However, these methods can usually only be applied to low-dimensional RL problems. As a policy search method, pilco does not require an explicit global value function model but rather searches directly in policy space. However, unlike value-function based methods, pilco is currently limited to episodic set-ups.

Model-based Policy Search

In this article, we consider dynamical systems

$$\boldsymbol{x}_{t+1}=f(\boldsymbol{x}_{t},\boldsymbol{u}_{t})+\boldsymbol{w}\qquad(1)$$

with continuous-valued states $\boldsymbol{x}\in\mathds{R}^{D}$ and controls $\boldsymbol{u}\in\mathds{R}^{F}$, i.i.d. Gaussian system noise $\boldsymbol{w}$, and unknown transition dynamics $f$. The policy search objective is to find a policy/controller $\pi:\boldsymbol{x}\mapsto\pi(\boldsymbol{x},\boldsymbol{\theta})=\boldsymbol{u}$, which minimizes the expected long-term cost

$$J^{\pi}(\boldsymbol{\theta})=\sum_{t=0}^{T}\mathds{E}_{\boldsymbol{x}_{t}}[c(\boldsymbol{x}_{t})]\qquad(2)$$

of following $\pi$ for $T$ steps, where $c(\boldsymbol{x}_{t})$ is the cost of being in state $\boldsymbol{x}$ at time $t$. We assume that $\pi$ is a function parametrized by $\boldsymbol{\theta}$. In our experiments in Sec. 7, we use a) nonlinear parametrizations by means of RBF networks, where the parameters $\boldsymbol{\theta}$ are the weights and the features, or b) linear-affine parametrizations, where the parameters $\boldsymbol{\theta}$ are the weight matrix and a bias term.

To find a policy $\pi^{*}$ , which minimizes (2), pilco builds upon three components: 1) a probabilistic GP dynamics model (Sec. 3.1), 2) deterministic approximate inference for long-term predictions and policy evaluation (Sec. 3.2), 3) analytic computation of the policy gradients $\operatorname{d}\!J^{\pi}(\boldsymbol{\theta})/\operatorname{d}\!\boldsymbol{\theta}$ for policy improvement (Sec. 3.3). The GP model internally represents the dynamics in (1) and is subsequently employed for long-term predictions $p(\boldsymbol{x}_{1}|\pi),\dotsc,p(\boldsymbol{x}_{T}|\pi)$ , given a policy $\pi$ . These predictions are obtained through approximate inference and used to evaluate the expected long-term cost $J^{\pi}(\boldsymbol{\theta})$ in (2). The policy $\pi$ is improved based on gradient information $\operatorname{d}\!J^{\pi}(\boldsymbol{\theta})/\operatorname{d}\!\boldsymbol{\theta}$ . Alg. 1 summarizes the pilco learning framework.

Pilco’s probabilistic dynamics model is implemented as a GP, where we use tuples $(\boldsymbol{x}_{t},\boldsymbol{u}_{t})\in\mathds{R}^{D+F}$ as training inputs and differences $\boldsymbol{\Delta}_{t}=\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t}\in\mathds{R}^{D}$ as training targets. (Using differences as training targets encodes an implicit prior mean function $m(\boldsymbol{x})=\boldsymbol{x}$. This means that when leaving the training data, the GP predictions do not fall back to 0 but remain constant.) A GP is completely specified by a mean function $m(\,\cdot\,)$ and a positive semidefinite covariance function/kernel $k(\,\cdot\,,\,\cdot\,)$. In this paper, we consider a prior mean function $m\equiv 0$ and the covariance function

The posterior GP is a one-step prediction model, and the predicted successor state $\boldsymbol{x}_{t+1}$ is Gaussian distributed

where the mean and variance of the GP prediction are

For multivariate targets, we train conditionally independent GPs for each target dimension, i.e., the GPs are independent for given test inputs. For uncertain inputs, the target dimensions covary, see also Sec. 4.
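The one-step GP model can be illustrated with a minimal sketch. It assumes a squared-exponential kernel with one length-scale per input dimension (the kernel equation is not reproduced in this section), fixed rather than learned hyper-parameters, and a hypothetical toy system; the function names are illustrative only.

```python
import numpy as np

def se_kernel(A, B, lengthscales, signal_var):
    """Squared-exponential kernel (assumed form) with one length-scale per input dimension."""
    diff = (A[:, None, :] - B[None, :, :]) / lengthscales
    return signal_var * np.exp(-0.5 * np.sum(diff**2, axis=-1))

def gp_predict_delta(X, y, x_star, lengthscales, signal_var, noise_var):
    """Posterior mean and variance of one target dimension Delta at the test input x_star."""
    K = se_kernel(X, X, lengthscales, signal_var) + noise_var * np.eye(len(X))
    k_star = se_kernel(X, x_star[None, :], lengthscales, signal_var)[:, 0]
    mean = k_star @ np.linalg.solve(K, y)
    var = signal_var - k_star @ np.linalg.solve(K, k_star)
    return mean, var

# Hypothetical 1-D toy system: x_{t+1} = x_t + 0.1 * u_t, so Delta_t = 0.1 * u_t.
rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=30)
controls = rng.uniform(-1.0, 1.0, size=30)
X = np.column_stack([states, controls])   # training inputs (x_t, u_t)
y = 0.1 * controls                        # training targets Delta_t

ell = np.array([1.0, 1.0])
near_mean, near_var = gp_predict_delta(X, y, np.array([0.2, 0.5]), ell, 1.0, 1e-4)
far_mean, far_var = gp_predict_delta(X, y, np.array([10.0, 10.0]), ell, 1.0, 1e-4)
```

Close to the training data the predicted difference is accurate and confident. Far away, the predicted difference reverts to the zero prior mean (so the predicted state stays constant) and the variance returns to the prior signal variance.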

3.2 Policy Evaluation

To evaluate and minimize $J^{\pi}$ in (2) pilco uses long-term predictions of the state evolution. In particular, we determine the marginal $t$ -step-ahead predictive distributions ${p}(\boldsymbol{x}_{1}|\pi),\dotsc,{p}(\boldsymbol{x}_{T}|\pi)$ from the initial state distribution ${p}(\boldsymbol{x}_{0})$ , $t=1,\dotsc,T$ . To obtain these long-term predictions, we cascade one-step predictions, see (4)–(5), which requires mapping uncertain test inputs through the GP dynamics model. In the following, we assume that these test inputs are Gaussian distributed. For notational convenience, we omit the explicit conditioning on the policy $\pi$ in the following and assume that episodes start from $\boldsymbol{x}_{0}\sim{p}(\boldsymbol{x}_{0})=\mathcal{N}\big{(}\boldsymbol{x}_{0}\,|\,\boldsymbol{\mu}_{0},\boldsymbol{\Sigma}_{0}\big{)}$ .

Assume the mean $\boldsymbol{\mu}_{\boldsymbol{\Delta}}$ and the covariance $\boldsymbol{\Sigma}_{\boldsymbol{\Delta}}$ of the predictive distribution ${p}(\boldsymbol{\Delta}_{t})$ are known (we detail their computations in Secs. 4.1–4.2). Then, a Gaussian approximation to the desired predictive distribution ${p}(\boldsymbol{x}_{t+1})$ is given as $\mathcal{N}\big{(}\boldsymbol{x}_{t+1}\,|\,\boldsymbol{\mu}_{t+1},\boldsymbol{\Sigma}_{t+1}\big{)}$ with

Note that both $\boldsymbol{\mu}_{\boldsymbol{\Delta}}$ and $\boldsymbol{\Sigma}_{\boldsymbol{\Delta}}$ are functions of the mean $\boldsymbol{\mu}_{u}$ and the covariance $\boldsymbol{\Sigma}_{u}$ of the control signal.
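As a hedged illustration of how the successor-state moments follow from the moments of the difference, the sketch below uses the standard identities for the sum of two correlated random variables, $\boldsymbol{\mu}_{t+1}=\boldsymbol{\mu}_{t}+\boldsymbol{\mu}_{\boldsymbol{\Delta}}$ and $\boldsymbol{\Sigma}_{t+1}=\boldsymbol{\Sigma}_{t}+\boldsymbol{\Sigma}_{\boldsymbol{\Delta}}+\mathrm{cov}[\boldsymbol{x}_{t},\boldsymbol{\Delta}_{t}]+\mathrm{cov}[\boldsymbol{\Delta}_{t},\boldsymbol{x}_{t}]$. The linear toy model used for the check is an assumption made only so that the exact answer is known.

```python
import numpy as np

def successor_moments(mu_t, Sigma_t, mu_delta, Sigma_delta, cov_x_delta):
    """Mean and covariance of x_{t+1} = x_t + Delta_t for correlated x_t and Delta_t."""
    mu_next = mu_t + mu_delta
    Sigma_next = Sigma_t + Sigma_delta + cov_x_delta + cov_x_delta.T
    return mu_next, Sigma_next

# Hypothetical linear toy model: Delta_t = A x_t + noise.
mu_t = np.array([1.0, -0.5])
Sigma_t = np.array([[0.3, 0.1], [0.1, 0.2]])
A = np.array([[-0.1, 0.05], [0.0, -0.2]])
noise_cov = 0.01 * np.eye(2)

mu_delta = A @ mu_t
Sigma_delta = A @ Sigma_t @ A.T + noise_cov
cov_x_delta = Sigma_t @ A.T               # cov[x_t, Delta_t]

mu_next, Sigma_next = successor_moments(mu_t, Sigma_t, mu_delta, Sigma_delta, cov_x_delta)

# Exact moments of x_{t+1} = (I + A) x_t + noise, for comparison.
M = np.eye(2) + A
exact_mu = M @ mu_t
exact_Sigma = M @ Sigma_t @ M.T + noise_cov
```

Dropping the cross-covariance terms would misestimate the state uncertainty whenever the predicted difference depends on the current state, which is the typical case.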

To evaluate the expected long-term cost $J^{\pi}$ in (2), it remains to compute the expected values

$t=1,\dotsc,T$, of the cost $c$ with respect to the predictive state distributions. We choose the cost $c$ such that the integral in (11) and, thus, $J^{\pi}$ in (2) can be computed analytically. Examples of such cost functions include polynomials and mixtures of Gaussians.
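A quadratic cost is the simplest polynomial example for which the expectation under a Gaussian state distribution is available in closed form. The sketch below uses the standard identity $\mathds{E}[(\boldsymbol{x}-\boldsymbol{x}_{\text{target}})^{\top}\boldsymbol{W}(\boldsymbol{x}-\boldsymbol{x}_{\text{target}})]=(\boldsymbol{\mu}-\boldsymbol{x}_{\text{target}})^{\top}\boldsymbol{W}(\boldsymbol{\mu}-\boldsymbol{x}_{\text{target}})+\mathrm{tr}(\boldsymbol{W}\boldsymbol{\Sigma})$; the weight matrix $\boldsymbol{W}$ and the numbers are illustrative assumptions.

```python
import numpy as np

def expected_quadratic_cost(mu, Sigma, target, W):
    """E[(x - target)^T W (x - target)] for x ~ N(mu, Sigma)."""
    err = mu - target
    return float(err @ W @ err + np.trace(W @ Sigma))

mu = np.array([1.0, 0.0])
Sigma = np.diag([0.5, 0.2])
target = np.zeros(2)
W = np.eye(2)

expected_cost = expected_quadratic_cost(mu, Sigma, target, W)
```

The trace term shows that, for a quadratic cost, state uncertainty always adds to the expected cost regardless of where the mean lies.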

3.3 Analytic Gradients for Policy Improvement

To find policy parameters $\boldsymbol{\theta}$ , which minimize $J^{\pi}(\boldsymbol{\theta})$ in (2), we use gradient information $\operatorname{d}\!J^{\pi}(\boldsymbol{\theta})/\operatorname{d}\!\boldsymbol{\theta}$ . We require that the expected cost in (11) is differentiable with respect to the moments of the state distribution. Moreover, we assume that the moments of the control distribution $\boldsymbol{\mu}_{u}$ and $\boldsymbol{\Sigma}_{u}$ can be computed analytically and are differentiable with respect to the policy parameters $\boldsymbol{\theta}$ .

In the following, we describe how to analytically compute these gradients for a gradient-based policy search. We obtain the gradient $\operatorname{d}\!J^{\pi}/\operatorname{d}\!\boldsymbol{\theta}$ by repeated application of the chain-rule: First, we move the gradient into the sum in (2), and with $\mathcal{E}_{t}\coloneqq\mathds{E}_{\boldsymbol{x}_{t}}[c(\boldsymbol{x}_{t})]$ we obtain

where we used the shorthand notation $\operatorname{d}\!\mathcal{E}_{t}/\operatorname{d}\!{p}(\boldsymbol{x}_{t})=\{\operatorname{d}\!\mathcal{E}_{t}/\operatorname{d}\!\boldsymbol{\mu}_{t},\operatorname{d}\!\mathcal{E}_{t}/\operatorname{d}\!\boldsymbol{\Sigma}_{t}\}$ for taking the derivative of $\mathcal{E}_{t}$ with respect to both the mean and covariance of ${p}(\boldsymbol{x}_{t})=\mathcal{N}\big{(}\boldsymbol{x}_{t}\,|\,\boldsymbol{\mu}_{t},\boldsymbol{\Sigma}_{t}\big{)}$ . Second, as we will show in Sec. 4, the predicted mean $\boldsymbol{\mu}_{t}$ and covariance $\boldsymbol{\Sigma}_{t}$ depend on the moments of ${p}(\boldsymbol{x}_{t-1})$ and the controller parameters $\boldsymbol{\theta}$ . By applying the chain-rule to (12), we obtain then

From here onward, we focus on $\operatorname{d}\!\boldsymbol{\mu}_{t}/\operatorname{d}\!\boldsymbol{\theta}$ , see (12), but computing $\operatorname{d}\!\boldsymbol{\Sigma}_{t}/\operatorname{d}\!\boldsymbol{\theta}$ in (12) is similar. For $\operatorname{d}\!\boldsymbol{\mu}_{t}/\operatorname{d}\!\boldsymbol{\theta}$ , we compute the derivative

Since $\operatorname{d}\!{p}(\boldsymbol{x}_{t-1})/\operatorname{d}\!\boldsymbol{\theta}$ in (13) is known from time step $t-1$ and $\partial\boldsymbol{\mu}_{t}/\partial{p}(\boldsymbol{x}_{t-1})$ is computed by applying the chain-rule to (17)–(20), we conclude with

The partial derivatives of $\boldsymbol{\mu}_{u}$ and $\boldsymbol{\Sigma}_{u}$ , i.e., the mean and covariance of ${p}(\boldsymbol{u}_{t})$ , used in (16) depend on the policy representation. The individual partial derivatives in (12)–(16) depend on the approximate inference method used for propagating state distributions through time. For example, with moment matching or linearization of the posterior GP (see Sec. 4 for details) the desired gradients can be computed analytically by repeated application of the chain-rule. The Appendix derives the gradients for the moment-matching approximation.

A gradient-based optimization method using estimates of the gradient of $J^{\pi}(\boldsymbol{\theta})$, such as finite differences or more efficient sampling-based methods, requires many function evaluations, which can be computationally expensive. However, since in our case policy evaluation can be performed analytically, we profit from analytic expressions for the gradients, which allows for standard gradient-based non-convex optimization methods, such as CG or BFGS, to determine optimized policy parameters $\boldsymbol{\theta}^{*}$.
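The recursive chain-rule structure can be made concrete on a problem where every moment is available in closed form. The sketch below is not pilco's GP-based computation: it assumes a hypothetical scalar linear system, a linear policy $u_{t}=\theta x_{t}$, and a quadratic cost, and it compares the analytic gradient with a central finite-difference estimate.

```python
def rollout_cost_and_grad(theta, a=0.9, b=0.5, noise_var=0.01,
                          mu0=1.0, var0=0.1, horizon=10):
    """Expected cost J(theta) = sum_t E[x_t^2] and dJ/dtheta for the toy system
    x_{t+1} = a*x_t + b*u_t + w with the linear policy u_t = theta*x_t."""
    gain = a + b * theta
    mu, var = mu0, var0
    dmu, dvar = 0.0, 0.0              # the initial state does not depend on theta
    J = mu**2 + var
    dJ = 0.0
    for _ in range(horizon):
        # Total derivative = (partial wrt previous moments) * (previous total derivative)
        #                    + explicit partial wrt theta.
        dmu_next = gain * dmu + b * mu
        dvar_next = gain**2 * dvar + 2.0 * gain * b * var
        mu, var = gain * mu, gain**2 * var + noise_var
        dmu, dvar = dmu_next, dvar_next
        J += mu**2 + var              # E[x_t^2] = mu_t^2 + var_t
        dJ += 2.0 * mu * dmu + dvar   # chain rule through the mean and the variance
    return J, dJ

theta = -0.4
J, dJ = rollout_cost_and_grad(theta)

# Central finite differences need two extra rollouts per policy parameter.
eps = 1e-6
fd = (rollout_cost_and_grad(theta + eps)[0]
      - rollout_cost_and_grad(theta - eps)[0]) / (2.0 * eps)
```

The analytic gradient is obtained in the same forward pass as the cost, whereas the finite-difference estimate scales with the number of policy parameters, which matters for policies with hundreds of parameters.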

Long-Term Predictions

Following the law of iterated expectations, for target dimensions $a=1,\dotsc,D,$ we obtain the predictive mean

with $\boldsymbol{q}_{a}=[q_{a_{1}},\ldots,q_{a_{n}}]^{\top}$ . The entries of $\boldsymbol{q}_{a}\in\mathds{R}^{n}$ are computed using standard results from multiplying and integrating over Gaussians and are given by

Computing the predictive covariance matrix $\boldsymbol{\Sigma}_{\boldsymbol{\Delta}}\in\mathds{R}^{D\times D}$ requires us to distinguish between diagonal elements $\sigma_{aa}^{2}$ and off-diagonal elements $\sigma_{ab}^{2}$ , $a\neq b$ : Using the law of total (co-)variance, we obtain for target dimensions $a,b=1,\dotsc,D$

Using standard results from Gaussian multiplications and integration, we obtain the entries $Q_{ij}$ of $\boldsymbol{Q}\in\mathds{R}^{n\times n}$

with $\boldsymbol{\nu}_{i}$ defined in (20). Hence, the off-diagonal entries of $\boldsymbol{\Sigma}_{\boldsymbol{\Delta}}$ are fully determined by (17)–(20), (22), and (24)–(26).

From (21), we see that the diagonal entries contain the additional term

A visualization of the approximation of the predictive distribution by means of exact moment matching is given in Fig. 2.
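For intuition, the predictive mean under moment matching can be sketched in one dimension. The sketch assumes a squared-exponential kernel with length-scale $\ell$ and signal variance $\sigma_{f}^{2}$ (the kernel equation is not reproduced in this section), for which the expected kernel value under $x\sim\mathcal{N}(\mu,\sigma^{2})$ is $\sigma_{f}^{2}(1+\sigma^{2}/\ell^{2})^{-1/2}\exp\big(-\tfrac{1}{2}(x_{i}-\mu)^{2}/(\sigma^{2}+\ell^{2})\big)$. The training data and hyper-parameters are hypothetical, and the result is compared with a Monte Carlo estimate.

```python
import numpy as np

def se_kernel_1d(a, b, ell, signal_var):
    """Squared-exponential kernel (assumed form) between two 1-D arrays."""
    return signal_var * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

def moment_matched_mean(X, beta, mu, var, ell, signal_var):
    """E[m(x)] for x ~ N(mu, var), where m(x) = sum_i beta_i k(x_i, x) is the GP posterior mean."""
    q = (signal_var / np.sqrt(1.0 + var / ell**2)
         * np.exp(-0.5 * (X - mu)**2 / (var + ell**2)))
    return beta @ q

# Hypothetical training data for a single target dimension.
X = np.linspace(-3.0, 3.0, 15)
y = np.sin(X)
ell, signal_var, noise_var = 1.0, 1.0, 1e-4
K = se_kernel_1d(X, X, ell, signal_var) + noise_var * np.eye(len(X))
beta = np.linalg.solve(K, y)

mu, var = 0.5, 0.3
analytic = moment_matched_mean(X, beta, mu, var, ell, signal_var)

# Monte Carlo reference: push samples of the uncertain input through the posterior mean.
rng = np.random.default_rng(1)
samples = rng.normal(mu, np.sqrt(var), size=200_000)
mc = np.mean(se_kernel_1d(samples, X, ell, signal_var) @ beta)

# Evaluating the posterior mean only at the input mean ignores the input uncertainty.
at_mean = (se_kernel_1d(np.array([mu]), X, ell, signal_var) @ beta)[0]
```

The sketch covers only the mean; the predictive covariance additionally requires the matrix $\boldsymbol{Q}$ described above and is omitted here.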

4.2 Linearization of the Posterior GP Mean Function

$a=1,\dotsc,E$ , where $\boldsymbol{\beta}_{a}$ is given in (18).

Policy

In the following, we describe the desired properties of the policy within the pilco learning framework. First, to compute the long-term predictions ${p}(\boldsymbol{x}_{1}),\dotsc,{p}(\boldsymbol{x}_{T})$ for policy evaluation, the policy must allow us to compute a distribution over controls ${p}(\boldsymbol{u})={p}(\pi(\boldsymbol{x}))$ for a given (Gaussian) state distribution ${p}(\boldsymbol{x})$ . Second, in a realistic real-world application, the amplitudes of the control signals are bounded. Ideally, the learning system takes these constraints explicitly into account. In the following, we detail how pilco implements these desiderata.

During the long-term predictions, the states are given by a probability distribution ${p}(\boldsymbol{x}_{t})$ , $t=0,\dotsc,T$ . The probability distribution of the state $\boldsymbol{x}_{t}$ induces a predictive distribution ${p}(\boldsymbol{u}_{t})={p}(\pi(\boldsymbol{x}_{t}))$ over controls, even when the policy is deterministic. We approximate the distribution over controls using moment matching, which is in many interesting cases analytically tractable.

5.2 Constrained Control Signals

which is the third-order Fourier series expansion of a trapezoidal wave, normalized to the interval $[-1,1]$. The squashing function in (36) is computationally convenient as we can analytically compute predictive moments for Gaussian distributed states. Subsequently, we multiply the squashed policy by $\boldsymbol{u}_{\max}$ and obtain the final policy

an illustration of which is shown in Fig. 3.
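The analytic tractability of the squashed policy can be checked numerically. The sketch below assumes that the squashing function has the form $\sigma(x)=(9\sin x+\sin 3x)/8$, which matches the description of a third-order Fourier expansion of a trapezoidal wave bounded in $[-1,1]$, and it uses the standard identity $\mathds{E}[\sin(kx)]=\exp(-k^{2}\sigma^{2}/2)\sin(k\mu)$ for $x\sim\mathcal{N}(\mu,\sigma^{2})$. The function names and numbers are illustrative.

```python
import math

def squash(x):
    """Assumed squashing function (9 sin x + sin 3x) / 8, bounded in [-1, 1]."""
    return (9.0 * math.sin(x) + math.sin(3.0 * x)) / 8.0

def expected_squashed_control(mu, var, u_max):
    """E[u_max * squash(x)] for x ~ N(mu, var), via E[sin(kx)] = exp(-k^2 var / 2) sin(k mu)."""
    e_sin1 = math.exp(-0.5 * var) * math.sin(mu)
    e_sin3 = math.exp(-4.5 * var) * math.sin(3.0 * mu)
    return u_max * (9.0 * e_sin1 + e_sin3) / 8.0

def quadrature_reference(mu, var, u_max, n=20001, width=10.0):
    """Brute-force trapezoidal integral of u_max * squash(x) * N(x | mu, var)."""
    std = math.sqrt(var)
    lo = mu - width * std
    h = 2.0 * width * std / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0
        pdf = math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
        total += weight * u_max * squash(x) * pdf
    return total * h

mean_u = expected_squashed_control(mu=0.8, var=0.25, u_max=10.0)
ref_u = quadrature_reference(mu=0.8, var=0.25, u_max=10.0)
```

Because the squashed policy is a sum of sines, its mean under a Gaussian input needs no sampling; the covariance can be obtained along the same lines from products of sines and is omitted here.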

To compute a distribution over constrained control signals, we execute the following steps:

5.3 Representations of the Preliminary Policy

The linear preliminary policy is given by

$$\tilde{\pi}(\boldsymbol{x}_{*})=\boldsymbol{A}\boldsymbol{x}_{*}+\boldsymbol{b}\,,\qquad(39)$$

where $\boldsymbol{A}$ is a parameter matrix of weights and $\boldsymbol{b}$ is an offset vector. In each control dimension $d$, the policy in (39) is a linear combination of the states (the weights are given by the $d$th row in $\boldsymbol{A}$) plus an offset $b_{d}$.

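For a Gaussian state distribution $\boldsymbol{x}\sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$, the predictive distribution of the linear preliminary policy is an exact Gaussian with mean $\boldsymbol{A}\boldsymbol{\mu}+\boldsymbol{b}$ and covariance $\boldsymbol{A}\boldsymbol{\Sigma}\boldsymbol{A}^{\top}$. A drawback of the linear policy is that it is not flexible. However, a linear controller can often be used for stabilization around an equilibrium.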
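For the linear case, the distribution over preliminary controls follows from the standard rule for affine transformations of a Gaussian. A minimal sketch with hypothetical numbers (one control dimension, two state dimensions):

```python
import numpy as np

def linear_policy_moments(A, b, mu, Sigma):
    """Mean and covariance of u = A x + b for x ~ N(mu, Sigma)."""
    mean_u = A @ mu + b
    cov_u = A @ Sigma @ A.T
    return mean_u, cov_u

A = np.array([[1.0, -2.0]])     # F x D weight matrix, here F = 1 and D = 2
b = np.array([0.5])
mu = np.array([1.0, 0.5])
Sigma = np.array([[0.2, 0.05], [0.05, 0.1]])

mean_u, cov_u = linear_policy_moments(A, b, mu, Sigma)
```

These are the moments of the preliminary (unsquashed) policy; bounding the control amplitude requires an additional squashing step.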

5.3.2 Nonlinear Policy: Deterministic Gaussian Process

where $\boldsymbol{x}_{*}$ is a test input, $\boldsymbol{\alpha}=(\boldsymbol{K}+0.01\boldsymbol{I})^{-1}\boldsymbol{t}$ , where $\boldsymbol{t}$ plays the role of a GP’s training targets. In (41), $\boldsymbol{M}=[\boldsymbol{m}_{1},\dotsc,\boldsymbol{m}_{N}]$ are the centers of the (axis-aligned) Gaussian basis functions

where for $i=1,\dotsc,N$ and all policy dimensions $a=1,\dotsc,F$

For $a,b=1,\dotsc,F$ , the entries of the predictive covariance matrix are computed according to

For $i,j=1,\dotsc,N$ , we compute the entries of $\boldsymbol{Q}$ as

Combining this result with (43) fully determines the predictive covariance matrix of the preliminary policy.

Unlike the predictive covariance of a probabilistic GP, see (21)–(22), the predictive covariance matrix of the deterministic GP does not comprise any model uncertainty in its diagonal entries.
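Evaluating the deterministic GP policy at a single test input can be sketched as follows. The regularizer $0.01$ in $\boldsymbol{\alpha}$ is taken from the text; unit signal variance, the centers, targets, and length-scales are illustrative assumptions.

```python
import numpy as np

def gaussian_basis(M, X, lengthscales):
    """Axis-aligned Gaussian basis functions between centers M (N x D) and inputs X (m x D)."""
    diff = (M[:, None, :] - X[None, :, :]) / lengthscales
    return np.exp(-0.5 * np.sum(diff**2, axis=-1))

def rbf_policy(x_star, M, targets, lengthscales):
    """Preliminary policy value sum_i alpha_i k(m_i, x_star) with alpha = (K + 0.01 I)^{-1} t."""
    K = gaussian_basis(M, M, lengthscales)
    alpha = np.linalg.solve(K + 0.01 * np.eye(len(M)), targets)
    return gaussian_basis(M, x_star[None, :], lengthscales)[:, 0] @ alpha

# Hypothetical policy with three basis functions in a 2-D state space.
M = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
targets = np.array([1.0, -1.0, 0.5])
lengthscales = np.array([0.5, 0.5])

u_near = rbf_policy(np.array([0.2, 0.1]), M, targets, lengthscales)
u_far = rbf_policy(np.array([50.0, 50.0]), M, targets, lengthscales)
```

The output is a fixed function of the state once the centers, targets, and length-scales are set, so any uncertainty in the control stems only from uncertainty in the state. Far from all centers the preliminary control decays to zero.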

5.4 Policy Parameters

The linear policy in (39) possesses $D+1$ parameters per control dimension: For control dimension $d$ there are $D$ weights in the $d$ th row of the matrix $\boldsymbol{A}$ . One additional parameter originates from the offset parameter $b_{d}$ .

5.4.2 Nonlinear Policy

The parameters of the deterministic GP in (41) are the locations $\boldsymbol{M}$ of the centers ( $DN$ parameters), the (shared) length-scales of the Gaussian basis functions ( $D$ length-scale parameters per target dimension), and the $N$ targets $\boldsymbol{t}$ per target dimension. In the case of multivariate controls, the basis function centers $\boldsymbol{M}$ are shared.
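The parameter counts can be tallied directly. The sketch below follows the counts stated above; the example values $D=6$ and $F=2$ are assumptions (they are not given in this section), chosen because they reproduce the 812 parameters reported in Sec. 7 for the double-pendulum policy with 100 basis functions.

```python
def linear_policy_params(D, F):
    """D weights plus one offset per control dimension."""
    return F * (D + 1)

def rbf_policy_params(D, F, N):
    """Parameter count of the deterministic GP policy with N basis functions."""
    centers = D * N            # basis-function centers, shared across control dimensions
    lengthscales = F * D       # D length-scales per target (control) dimension
    targets = F * N            # N targets per target (control) dimension
    return centers + lengthscales + targets

# Assumed dimensions: D = 6 state features, F = 2 controls, N = 100 basis functions.
n_nonlinear = rbf_policy_params(D=6, F=2, N=100)
n_linear = linear_policy_params(D=6, F=2)
```

The comparison makes the trade-off explicit: the nonlinear policy is far more flexible but has to be optimized over a parameter space that is larger by well over an order of magnitude.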

5.5 Computing the Successor State Distribution

Alg. 2 summarizes the computational steps required to compute the successor state distribution ${p}(\boldsymbol{x}_{t+1})$ from ${p}(\boldsymbol{x}_{t})$ .

Cost Function

In our learning set-up, we use a cost function that solely penalizes the Euclidean distance $d$ of the current state to the target state. Using only distance penalties is often sufficient to solve a task: Reaching a target $\boldsymbol{x}_{\text{target}}$ with high speed naturally leads to overshooting and, thus, to high long-term costs. In particular, we use the generalized binary saturating cost

$$c(\boldsymbol{x})=1-\exp\Big(-\frac{1}{2\sigma_{c}^{2}}\,d(\boldsymbol{x},\boldsymbol{x}_{\text{target}})^{2}\Big)\in[0,1]\,,\qquad(44)$$

which is locally quadratic but saturates at unity for large deviations $d$ from the desired target $\boldsymbol{x}_{\text{target}}$. In (44), the geometric distance from the state $\boldsymbol{x}$ to the target state is denoted by $d$, and the parameter $\sigma_{c}$ controls the width of the cost function. In the context of sensorimotor control, the saturating cost function in (44) resembles the cost function in human reasoning, as has been validated experimentally.

In classical control, typically a quadratic cost is assumed. However, a quadratic cost tends to focus attention on the worst deviation from the target state along a predicted trajectory. In the early stages of learning, the predictive uncertainty is large and, therefore, the policy gradients, which are described in Sec. 3.3, become less useful. Therefore, we use the saturating cost in (44) as a default within the pilco learning framework.

The immediate cost in (44) is an unnormalized Gaussian with mean $\boldsymbol{x}_{\text{target}}$ and variance $\sigma_{c}^{2}$ , subtracted from unity. Therefore, the expected immediate cost can be computed analytically according to

where $\boldsymbol{T}^{-1}$ is the precision matrix of the unnormalized Gaussian in (45). If the state $\boldsymbol{x}$ has the same representation as the target vector, $\boldsymbol{T}^{-1}$ is a diagonal matrix with entries either unity or zero, scaled by $1/\sigma_{c}^{2}$ . Hence, for $\boldsymbol{x}\sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$ we obtain the expected immediate cost

The partial derivatives $\tfrac{\partial}{\partial\boldsymbol{\mu}_{t}}\mathds{E}_{\boldsymbol{x}_{t}}[c(\boldsymbol{x}_{t})],\,\tfrac{\partial}{\partial\boldsymbol{\Sigma}_{t}}\mathds{E}_{\boldsymbol{x}_{t}}[c(\boldsymbol{x}_{t})]$ of the immediate cost with respect to the mean and the covariance of the state distribution ${p}(\boldsymbol{x}_{t})=\mathcal{N}(\boldsymbol{\mu}_{t},\boldsymbol{\Sigma}_{t})$ , which are required to compute the policy gradients analytically, are given by

The saturating cost function in (44) allows for a natural exploration when the policy aims to minimize the expected long-term cost in (2). This property is illustrated in Fig. 4 for a single time step where we assume a Gaussian state distribution ${p}(\boldsymbol{x}_{t})$ .

If the mean of ${p}(\boldsymbol{x}_{t})$ is far away from the target $\boldsymbol{x}_{\text{target}}$ , a wide state distribution is more likely to have substantial tails in some low-cost region than a more peaked distribution as shown in Fig. 4(a). In the early stages of learning, the predictive state uncertainty is largely due to propagating model uncertainties forward. If we predict a state distribution in a high-cost region, the saturating cost then leads to automatic exploration by favoring uncertain states, i.e., states in regions far from the target with a poor dynamics model. When visiting these regions during interaction with the physical system, subsequent model learning reduces the model uncertainty locally. In the subsequent policy evaluation, pilco will predict a tighter state distribution in the situations described in Fig. 4.

If the mean of the state distribution is close to the target as in Fig. 4(b), wide distributions are likely to have substantial tails in high-cost regions. By contrast, the mass of a peaked distribution is more concentrated in low-cost regions. In this case, the policy prefers peaked distributions close to the target, leading to exploitation.
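The two regimes can be reproduced numerically in one dimension. The sketch assumes that the saturating cost has the form $1-\exp(-d^{2}/(2\sigma_{c}^{2}))$, consistent with its description as an unnormalized Gaussian subtracted from unity; its expectation under $x\sim\mathcal{N}(\mu,\sigma^{2})$ is then $1-(1+\sigma^{2}/\sigma_{c}^{2})^{-1/2}\exp\big(-\tfrac{1}{2}(\mu-x_{\text{target}})^{2}/(\sigma_{c}^{2}+\sigma^{2})\big)$. The numbers are illustrative.

```python
import math

def expected_saturating_cost(mu, var, target=0.0, width=0.5):
    """E[1 - exp(-(x - target)^2 / (2 width^2))] for x ~ N(mu, var) in one dimension."""
    scale = 1.0 / math.sqrt(1.0 + var / width**2)
    return 1.0 - scale * math.exp(-0.5 * (mu - target) ** 2 / (width**2 + var))

# Mean far from the target: the wide distribution is cheaper (exploration).
far_narrow = expected_saturating_cost(mu=2.0, var=0.01)
far_wide = expected_saturating_cost(mu=2.0, var=1.0)

# Mean at the target: the peaked distribution is cheaper (exploitation).
near_narrow = expected_saturating_cost(mu=0.0, var=0.01)
near_wide = expected_saturating_cost(mu=0.0, var=1.0)
```

With a quadratic cost, by contrast, added variance increases the expected cost everywhere, so this preference for uncertain states far from the target would not arise.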

To summarize, combining a probabilistic dynamics model, Bayesian inference, and a saturating cost leads to automatic exploration as long as the predictions are far from the target, even for a policy that greedily minimizes the expected cost. Once close to the target, the policy does not substantially deviate from a confident trajectory that leads the system close to the target.

Code is available at http://mloss.org/software/view/508/.

Experimental Results

In this section, we assess pilco’s key properties and show that pilco scales to high-dimensional control problems. Moreover, we demonstrate the hardware applicability of our learning framework on two real systems. In all cases, pilco followed the steps outlined in Alg. 1. To reduce the computational burden, we used a sparse GP method once 300 data points had been collected.

In the following, we assess the quality of the approximate inference method used for long-term predictions in terms of computational demand and learning speed. Moreover, we shed some light on the quality of the Gaussian approximations of the predictive state distributions and the importance of Bayesian averaging. For these assessments, we applied pilco to two nonlinear control tasks, which are introduced in the following.

We considered two simulated tasks (double-pendulum swing-up, cart-pole swing-up) to evaluate important properties of the pilco policy search framework: learning speed, quality of approximate inference, importance of Bayesian averaging, and hardware applicability. In the following we briefly introduce the experimental set-ups.

The task is challenging since its solution requires the interplay of two correlated control signals. The challenge is to automatically learn this interplay from experience. To solve the double pendulum swing-up task, a nonlinear policy is required. Thus, we parametrized the preliminary policy as a deterministic GP, see (41), with 100 basis functions resulting in 812 policy parameters. We chose the saturating immediate cost in (44), where the Euclidean distance between the upright position and the tip of the outer link was penalized. We chose the cost width $\sigma_{c}=0.5$ , which means that the tip of the outer pendulum had to cross horizontal to achieve an immediate cost smaller than unity.

7.1.2 Approximate Inference Assessment

In the following, we evaluate the quality of the presented approximate inference methods for policy evaluation (moment matching as described in Sec. 4.1 and linearization of the posterior GP mean as described in Sec. 4.2) with respect to computational demand (Sec. 7.1.2) and learning speed (Sec. 7.1.2).

For a single time step, the computational complexity of moment matching is $\mathcal{O}(n^{2}E^{2}D)$ , where $n$ is the number of GP training points, $D$ is the input dimensionality, and $E$ the dimension of the prediction. The most expensive computations are the entries of $\boldsymbol{Q}\in\mathds{R}^{n\times n}$ , which are given in (26). Each entry $Q_{ij}$ requires evaluating a kernel, which is essentially a $D$ -dimensional scalar product. The values $\boldsymbol{z}_{ij}$ are cheap to compute and $\boldsymbol{R}$ needs to be computed only once. We end up with $\mathcal{O}(n^{2}E^{2}D)$ since $\boldsymbol{Q}$ needs to be computed for all entries of the $E\times E$ predictive covariance matrix.

For a single time step, the computational complexity of linearizing the posterior GP mean function is $\mathcal{O}(n^{2}DE)$ . The most expensive operation is the determination of $\boldsymbol{\Sigma}_{w}$ in (34), i.e., the model uncertainty at the mean of the input distribution, which scales in $\mathcal{O}(n^{2}D)$ . This computation is performed for all $E$ predictive dimensions, resulting in a computational complexity of $\mathcal{O}(n^{2}DE)$ .

Fig. 6 illustrates the empirical computational effort for both linearization of the posterior GP mean and exact moment matching. We randomly generated GP models in $D=1,2,3,4,5,6,7,8,9,10,15,20,50$ dimensions and GP training set sizes of $n=100,250,500,1000$ data points. We set the predictive dimension $E=D$ . The CPU time (single core) for computing a predictive state distribution and the required derivatives are shown as a function of the dimensionality of the state. Four graphs are shown for set-ups with 100, 250, 500, and 1000 GP training points, respectively. Fig. 6(a) shows the graphs for approximate inference based on linearization of the posterior GP mean, and Fig. 6(b) shows the corresponding graphs for exact moment matching on a logarithmic scale. Computations based on linearization were consistently faster by a factor of 5–10.

Fig. 7(b) relates pilco’s learning speed (blue bar) to other RL methods (black bars), which solved the cart-pole swing-up task from scratch, i.e., without human demonstrations or known dynamics models.

Dynamics models were learned in only two of these approaches, using RBF networks and multi-layered perceptrons, respectively. In all cases without state-space discretization, cost functions similar to ours (see (44)) were used. Fig. 7(b) stresses pilco’s data efficiency: Pilco outperforms any other currently existing RL algorithm by at least one order of magnitude.

Double-Pendulum Swing-Up with Two Actuators.

Summary. We have seen that both approximate inference methods have pros and cons: Moment matching requires more computational resources than linearization, but learns faster and more reliably. The reason why linearization did not reliably succeed in learning the tasks is that it gets relatively easily stuck in local minima, which is largely a result of underestimating predictive variances, an example of which is given in Fig. 2. Propagating too confident predictions over a longer horizon often worsens the problem. Hence, in the following, we focus solely on the moment matching approximation.

7.1.3 Quality of the Gaussian Approximation

Pilco strongly relies on the quality of approximate inference, which is used for long-term predictions and policy evaluation, see Sec. 4. We already saw differences between linearization and moment matching; however, both methods approximate predictive distributions by a Gaussian. Although we ultimately cannot answer whether this approximation is good under all circumstances, we will shed some light on this issue.

Fig. 9 shows a typical example of the angle of the inner pendulum of the double pendulum system where, in the early stages of learning, the Gaussian approximation to the multi-step ahead predictive distribution is not ideal. The trajectory distribution of a set of rollouts (red) is multimodal. Pilco deals with this inappropriate modeling by learning a controller that forces the actual trajectories into a unimodal distribution such that a Gaussian approximation is appropriate, Fig. 9(b).

We explain this behavior as follows: Assuming that pilco found different paths that lead to a target, a wide Gaussian distribution is required to capture the variability of the bimodal distribution. However, when computing the expected cost using a quadratic or saturating cost, for example, uncertainty in the predicted state leads to higher expected cost, assuming that the mean is close to the target. Therefore, pilco uses its ability to choose control policies to push the marginally multimodal trajectory distribution into a single mode—from the perspective of minimizing expected cost with limited expressive power, this approach is desirable. Effectively, learning good controllers and models goes hand in hand with good Gaussian approximations.
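The argument can be illustrated by moment matching a bimodal distribution. The sketch assumes a one-dimensional state, two equally weighted hypothetical trajectory modes on either side of the target, and a saturating cost of the form $1-\exp(-d^{2}/(2\sigma_{c}^{2}))$.

```python
import math

def expected_saturating_cost(mu, var, target=0.0, width=0.5):
    """E[1 - exp(-(x - target)^2 / (2 width^2))] for x ~ N(mu, var) in one dimension."""
    scale = 1.0 / math.sqrt(1.0 + var / width**2)
    return 1.0 - scale * math.exp(-0.5 * (mu - target) ** 2 / (width**2 + var))

def mixture_moments(weights, means, variances):
    """Mean and variance of a 1-D Gaussian mixture (the moment-matched Gaussian)."""
    mean = sum(w * m for w, m in zip(weights, means))
    second = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
    return mean, second - mean * mean

# Two hypothetical modes at -0.4 and +0.4, each with variance 0.01.
mean, var = mixture_moments([0.5, 0.5], [-0.4, 0.4], [0.01, 0.01])
cost_gaussian_fit = expected_saturating_cost(mean, var)

# A single mode with the same mean and the per-mode variance.
cost_single_mode = expected_saturating_cost(mean, 0.01)
```

The Gaussian fitted to both modes has its mean at the target but a much larger variance, and therefore a higher expected cost than a single tight mode with the same mean. Under the Gaussian approximation, a policy that minimizes expected cost is thus rewarded for collapsing the trajectories into one mode.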

1.4 Importance of Bayesian Averaging

Model-based RL greatly profits from the flexibility of nonparametric models as motivated in Sec. 2. In the following, we take a closer look at whether Bayesian models are strictly necessary as well. In particular, we evaluated whether Bayesian averaging is necessary for successfully learning from scratch. To do so, we considered the cart-pole swing-up task with two different dynamics models: first, the standard nonparametric Bayesian GP model; second, a nonparametric deterministic GP model, i.e., a GP where we considered only the posterior mean, but discarded the posterior model uncertainty when doing long-term predictions. We already described a similar kind of function representation to learn a deterministic policy, see Sec. 5.3.2. The difference from the policy is that in this section the deterministic GP is still nonparametric (new basis functions are added if we get more data), whereas the number of basis functions in the policy is fixed. However, the deterministic GP is no longer probabilistic because of the loss of model uncertainty, which also results in a degenerate model. Note that we still propagate uncertainties resulting from the initial state distribution $p(\boldsymbol{x}_{0})$ forward.

Tab. I shows the average learning success of swinging the pendulum up and balancing it in the inverted position in the middle of the track. We used moment matching for approximate inference, see Sec. 4. Tab. I shows that learning is only successful when model uncertainties are taken into account during long-term planning and control learning, which strongly suggests using Bayesian nonparametric models in model-based RL.

The reason why model uncertainties must be appropriately taken into account is the following: In the early stages of learning, the learned dynamics model is based on a relatively small data set. States close to the target are unlikely to be observed when applying random controls. Therefore, the model must extrapolate from the current set of observed states. This requires predicting function values in regions with large posterior model uncertainty. Depending on the choice of the deterministic function (we chose the MAP estimate), the predictions (point estimates) are very different. Iteratively predicting state distributions then yields trajectories that are essentially arbitrary and not close to the target state, resulting in vanishing policy gradients.
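To make the extrapolation issue concrete, the sketch below fits a tiny one-dimensional GP and compares its prediction inside the observed region with one far outside it. Everything here is an illustrative assumption: the squared-exponential kernel with unit hyperparameters, the toy data, and the helper names. A small pure-Python linear solver is used only to keep the example self-contained.

```python
import math

def sq_exp(a, b, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return signal_var * math.exp(-0.5 * (a - b) ** 2 / lengthscale ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gp_predict(x_star, X, y, noise_var=1e-4):
    """Posterior mean and variance of a zero-mean GP at the test input x_star."""
    K = [[sq_exp(a, b) + (noise_var if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    k_star = [sq_exp(x_star, a) for a in X]
    alpha = solve(K, y)        # K^{-1} y
    v = solve(K, k_star)       # K^{-1} k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = sq_exp(x_star, x_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, var

# Toy training set covering only a small region of the input space.
X = [-1.0, -0.5, 0.0, 0.5]
y = [math.sin(x) for x in X]

near_mean, near_var = gp_predict(0.25, X, y)  # inside the observed region
far_mean, far_var = gp_predict(5.0, X, y)     # far from all training data
print(near_mean, near_var)
print(far_mean, far_var)
```

Far from the data, the posterior mean falls back to the prior mean while the posterior variance returns to the prior variance. A deterministic model keeps only the first of these two numbers and therefore treats the extrapolated value as if it were as trustworthy as the interpolated one.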

2 Scaling to Higher Dimensions: Unicycling

We applied pilco to learning to ride a 5-DoF unicycle with $\boldsymbol{x}\in\mathds{R}^{12}$ and $\boldsymbol{u}\in\mathds{R}^{2}$ in a realistic simulation of the one shown in Fig. 10(a).

Pilco differs from conventional controllers in that it learns a single controller for all control dimensions jointly. Thus, pilco takes the correlation of all control and state dimensions into account during planning and control. Learning separate controllers for each control variable is often unsuccessful.

3 Hardware Tasks

In the following, we present results from , where we successfully applied the pilco policy search framework to challenging control and robotics tasks, respectively. It is important to mention that no task-specific modifications were necessary, besides choosing a controller representation and defining an immediate cost function. In particular, we used the same standard GP priors for learning the forward dynamics models.

3.2 Controlling a Low-Cost Robotic Manipulator

We split the task of building a tower into learning individual controllers for each target block B2–B6 (bottom to top), see Fig. 12, starting from a configuration in which the robot arm was upright. All independently trained controllers shared the same initial trial.

Fig. 13(b) gives some insights into the quality of the learned forward model after 10 controlled trials. It shows the marginal predictive distributions and the actual trajectories of the block in the gripper.

Discussion

We have shed some light on essential ingredients for successful and efficient policy learning: (1) a probabilistic forward model with a faithful representation of model uncertainty and (2) Bayesian inference. We focused on very basic representations: GPs for the probabilistic forward model and Gaussian distributions for the state and control distributions. More expressive representations and Bayesian inference methods are conceivable to account for multi-modality, for instance. However, even with our current set-up, pilco can already learn complex control and robotics tasks. In , our framework was used in an industrial application for throttle valve control in a combustion engine.

Pilco is a model-based policy search method, which uses the GP forward model to predict state sequences given the current policy. These predictions are based on deterministic approximate inference, e.g., moment matching. Unlike all model-free policy search methods, which are inherently based on sampling trajectories, pilco exploits the learned GP model to compute analytic gradients of an approximation to the expected long-term cost $J^{\pi}$ for policy search. Finite differences or more efficient sampling-based approximations of the gradients require many function evaluations, which limits the effective number of policy parameters. Instead, pilco computes the gradients analytically and, therefore, can learn thousands of policy parameters.
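The cost of finite differences can be made explicit by counting objective evaluations. The sketch below uses a hypothetical stand-in objective $J(\boldsymbol{\theta})=\sum_{i}\theta_{i}^{2}$, not pilco's expected long-term cost, and assumes a simple forward-difference scheme.

```python
def make_counted_objective():
    """Return a toy objective J(theta) = sum(theta_i^2) and a call counter."""
    calls = {"n": 0}

    def J(theta):
        calls["n"] += 1
        return sum(t ** 2 for t in theta)

    return J, calls

def analytic_gradient(theta):
    """Closed-form gradient of the toy objective, one pass over the parameters."""
    return [2.0 * t for t in theta]

def finite_difference_gradient(J, theta, eps=1e-6):
    """Forward differences: one baseline call plus one call per parameter."""
    base = J(theta)
    grad = []
    for i in range(len(theta)):
        shifted = list(theta)
        shifted[i] += eps
        grad.append((J(shifted) - base) / eps)
    return grad

theta = [k / 200.0 for k in range(200)]  # 200 toy policy parameters
J, calls = make_counted_objective()
fd = finite_difference_gradient(J, theta)
exact = analytic_gradient(theta)
max_err = max(abs(a - b) for a, b in zip(fd, exact))
print("objective evaluations:", calls["n"], "max abs error:", max_err)
```

The number of evaluations grows linearly with the number of parameters, and each evaluation would correspond to a full long-term prediction (or a set of sampled rollouts) in a policy search setting.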

It is possible to exploit the learned GP model for sampling trajectories using the PEGASUS algorithm, for instance. Sampling with GPs can be straightforwardly parallelized, and was exploited in for learning meta controllers. However, even with high parallelization, policy search methods based on trajectory sampling usually do not rely on gradients and are practically limited to a few tens of policy parameters. “Typically, PEGASUS policy search algorithms have been using […] maybe on the order of ten parameters or tens of parameters; so, 30, 40 parameters, but not thousands of parameters […]”, A. Ng.

In Sec. 6.1, we discussed pilco’s natural exploration property as a result of Bayesian averaging. It is, however, also possible to explicitly encourage additional exploration in a UCB (upper confidence bounds) sense: In addition to summing up expected immediate costs, see (2), we would add the sum of cost standard deviations, weighted by a factor $\kappa\in\mathds{R}$. Then, $J^{\pi}(\boldsymbol{\theta})=\sum_{t}\big{(}\mathds{E}[c(\boldsymbol{x}_{t})]+\kappa\sigma[c(\boldsymbol{x}_{t})]\big{)}$. This type of utility function is also often used in experimental design and Bayesian optimization to avoid getting stuck in local minima. Since pilco’s approximate state distributions ${p}(\boldsymbol{x}_{t})$ are Gaussian, the cost standard deviations $\sigma[c(\boldsymbol{x}_{t})]$ can often be computed analytically. For further details, we refer the reader to .
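As a minimal sketch of such an objective, the code below evaluates $\sum_{t}\big(\mathds{E}[c(x_{t})]+\kappa\sigma[c(x_{t})]\big)$ for a one-dimensional state with Gaussian marginals. The saturating cost $c(x)=1-\exp\big(-(x-x_{\text{target}})^{2}/(2w^{2})\big)$, its width $w$, the toy marginals, and the function names are assumptions made for illustration. For this cost, both moments are available in closed form.

```python
import math

def cost_moments(mu, var, target=0.0, width=0.5):
    """Mean and standard deviation of the saturating cost for x ~ N(mu, var)."""
    w2 = width ** 2
    d2 = (mu - target) ** 2
    # E[exp(-(x - target)^2 / (2 w^2))]
    e1 = width / math.sqrt(w2 + var) * math.exp(-d2 / (2.0 * (w2 + var)))
    # E[exp(-(x - target)^2 / w^2)], the second moment of the exponential term
    e2 = width / math.sqrt(w2 + 2.0 * var) * math.exp(-d2 / (w2 + 2.0 * var))
    return 1.0 - e1, math.sqrt(max(e2 - e1 ** 2, 0.0))

def ucb_objective(state_marginals, kappa):
    """Sum over time of E[c(x_t)] + kappa * sigma[c(x_t)]."""
    total = 0.0
    for mu, var in state_marginals:
        mean_c, std_c = cost_moments(mu, var)
        total += mean_c + kappa * std_c
    return total

# Hypothetical predicted state marginals (mean, variance) along a trajectory.
marginals = [(2.0, 0.01), (1.0, 0.2), (0.2, 0.5)]
plain = ucb_objective(marginals, kappa=0.0)       # expected cost only
exploring = ucb_objective(marginals, kappa=-0.5)  # rewards cost uncertainty
print(plain, exploring)
```

Because the objective is minimized, a negative $\kappa$ lowers the objective for trajectories with uncertain cost and thus favors exploration, whereas a positive $\kappa$ penalizes them. The sign convention is our reading of the formula and should be checked against the intended use.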

One of pilco’s key benefits is the reduction of model errors by explicitly incorporating model uncertainty into planning and control. Pilco, however, does not take temporal correlation into account. Instead, model uncertainty is treated as noise, which can result in an under-estimation of model uncertainty . On the other hand, the moment-matching approximation used for approximate inference is typically a conservative approximation.

In this article, we focused on learning controllers in MDPs with transition dynamics that suffer from system noise, see (1). The case of measurement noise is more challenging: Learning the GP models is difficult since we no longer have direct access to the state. However, approaches for training GPs with noise on both the training inputs and training targets yield promising initial results. For a more general POMDP set-up, Gaussian Process Dynamical Models (GPDMs) could be used for learning both a transition mapping and the observation mapping. However, GPDMs typically need a good initialization since the learning problem is very high dimensional.

In , the pilco framework was extended to allow for learning reference tracking controllers instead of solely controlling the system to a fixed target location. In , we used pilco for planning and control in constrained environments, i.e., environments with obstacles. This learning set-up is important for practical robot applications. By discouraging obstacle collisions in the cost function, pilco was able to find paths around obstacles without ever colliding with them, not even during training. Initially, when the model was uncertain, the policy was conservative to stay away from obstacles. The pilco framework has been applied in the context of model-based imitation learning to learn controllers that minimize the Kullback-Leibler divergence between a distribution of demonstrated trajectories and the predictive distribution of robot trajectories. Recently, pilco has also been extended to a multi-task set-up.

Conclusion

We have introduced pilco, a practical model-based policy search method using analytic gradients for policy learning. Pilco advances state-of-the-art RL methods for continuous state and control spaces in terms of learning speed by at least an order of magnitude. Key to pilco’s success is a principled way of reducing the effect of model errors in model learning, long-term planning, and policy learning. Pilco is one of the few RL methods that have been directly applied to robotics without human demonstrations or other kinds of informative initializations or prior knowledge.

The pilco learning framework has demonstrated that using Bayesian inference and nonparametric models for learning controllers is not only possible but also practicable. Hence, nonparametric Bayesian models can play a fundamental role in classical control set-ups, while avoiding the typically excessive reliance on explicit models.

Acknowledgments

The research leading to these results has received funding from the EC’s Seventh Framework Programme (FP7/2007–2013) under grant agreement #270327, ONR MURI grant N00014-09-1-1052, and Intel Labs.

References

Appendix A Trigonometric Integration

This section gives exact integral equations for trigonometric functions, which are required to implement the discussed algorithms. The following expressions can be found in the book by , where $x\sim\mathcal{N}(x|\mu,\sigma^{2})$ is Gaussian distributed with mean $\mu$ and variance $\sigma^{2}$ .
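The expressions themselves are not reproduced here. For orientation, the following standard identities hold for $x\sim\mathcal{N}(x|\mu,\sigma^{2})$; they follow from the characteristic function of the Gaussian, $\mathds{E}[\exp(\mathrm{i}kx)]=\exp(\mathrm{i}k\mu-k^{2}\sigma^{2}/2)$, and we assume they are the kind of results this appendix relies on.

```latex
\begin{align}
\mathds{E}[\sin(x)] &= \exp(-\sigma^{2}/2)\,\sin(\mu)\,, \\
\mathds{E}[\cos(x)] &= \exp(-\sigma^{2}/2)\,\cos(\mu)\,, \\
\mathds{E}[\sin^{2}(x)] &= \tfrac{1}{2}\big(1-\exp(-2\sigma^{2})\cos(2\mu)\big)\,, \\
\mathds{E}[\cos^{2}(x)] &= \tfrac{1}{2}\big(1+\exp(-2\sigma^{2})\cos(2\mu)\big)\,, \\
\mathds{E}[\sin(x)\cos(x)] &= \tfrac{1}{2}\exp(-2\sigma^{2})\sin(2\mu)\,.
\end{align}
```

As a sanity check, setting $\sigma^{2}=0$ recovers the deterministic values $\sin(\mu)$, $\cos(\mu)$, $\sin^{2}(\mu)$, $\cos^{2}(\mu)$, and $\sin(\mu)\cos(\mu)$.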

Appendix B Gradients

At the beginning of this section, we give a few derivative identities that will come in handy. After that, we detail derivative computations in the context of the moment-matching approximation.

Let us start with a set of basic derivative identities that will prove useful in the following:

In the last identity, $\boldsymbol{B}(:,i)$ denotes the $i$th column of $\boldsymbol{B}$ and $\boldsymbol{B}(i,:)$ is the $i$th row of $\boldsymbol{B}$.

B.2 Partial Derivatives of the Predictive Distribution with Respect to the Input Distribution

In the following, we compute the derivative of the predictive GP mean $\boldsymbol{\mu}_{\boldsymbol{\Delta}}\in\mathds{R}^{E}$ with respect to the mean and the covariance of the input distribution $\mathcal{N}\big{(}\boldsymbol{x}_{t-1}\,|\,\boldsymbol{\mu}_{t-1},\boldsymbol{\Sigma}_{t-1}\big{)}$ . The function value of the predictive mean is given as

Let us start with the derivative of the predictive mean with respect to the mean of the input distribution. From the function value in (51), we obtain the derivative

$\in\mathds{R}^{1\times(D+F)}$ for the $a$th target dimension, where we used

For the derivative of the predictive mean with respect to the input covariance matrix $\boldsymbol{\Sigma}_{t-1}$ , we obtain

for $i=1,\dotsc,n$ . Here, we compute the two partial derivatives

where we used a tensor contraction in the last expression inside the bracket when multiplying the difference vectors onto the matrix derivative.

B.2.2 Derivatives of the Predictive Covariance with Respect to the Input Distribution

For target dimensions $a,b=1,\dotsc,E$ , the entries of the predictive covariance matrix $\boldsymbol{\Sigma}_{\boldsymbol{\Delta}}\in\mathds{R}^{E\times E}$ are given as

where $\delta_{ab}=1$ if $a=b$ and 0 otherwise.

The entries of $\boldsymbol{Q}\in\mathds{R}^{n\times n}$ are given by

For the derivative of the entries of the predictive covariance matrix with respect to the mean of the input distribution, we obtain

where the derivative of $Q_{ij}$ with respect to the input mean is given as

The derivative of the entries of the predictive covariance matrix with respect to the covariance matrix of the input distribution is

where the partial derivative of $e_{2}$ with respect to the entries $\Sigma_{t-1}^{{(p,q)}}$ is given as

The missing partial derivative in (74) is given by

which concludes the computations for the partial derivative in (67).

B.2.3 Derivative of the Cross-Covariance with Respect to the Input Distribution

$\in\mathds{R}^{(D+F)\times(D+F)}$ for all target dimensions $a=1,\dotsc,E$ .