Learning to Drive from Simulation without Real World Labels

Alex Bewley, Jessica Rigley, Yuxuan Liu, Jeffrey Hawke, Richard Shen, Vinh-Dieu Lam, Alex Kendall

I Introduction

This paper demonstrates how to use machine translation techniques for unsupervised transfer of an end-to-end driving model from a simulated environment to a real vehicle. We trained a deep learning model to drive in a simulated environment (where complete knowledge of the environment is possible) and adapted it for the visual variation experienced in the real world (completely unsupervised and without real-world labels). This work goes beyond simple image-to-image translation by making the desired task of driving a differentiable component within a deep learning architecture. We show this shapes the learning process to ensure the driving policy is invariant to the visual variation across both simulated and real domains.

Learning to drive in simulation offers a number of benefits, including the ability to vary appearance, lighting and weather conditions along with more structural variations, such as curvature of the road and complexity of the surrounding obstructions. It is also possible to construct scenarios in simulation which are difficult (or dangerous) to replicate in the real world. Furthermore, simulation provides ground-truth representations like semantics and even privileged information such as the relative angle and position of the vehicle with respect to the road. However, applying this knowledge to real-world applications has been limited due to the reality gap, often expressed as the difference in appearance between what can be rendered in simulation and how the physical world is viewed from a camera.

In contrast to the many efforts to improve the photometric realism of rendered images, this work learns an invariant mapping between features observed in simulation and the real environment. This representation is jointly optimised for steering a vehicle along a road in addition to image translation. While enjoying the benefits of simulation, this method is also adept at inferring the common structure between the target environment and what has been seen in simulation. This is analogous to transferring skills learned through playing virtual racing games to a person’s first experience behind the wheel of a car.

In summary, the main contributions of this work are:

We present the first example of an end-to-end driving policy transferred from a simulation domain with control labels to an unlabelled real-world domain.

By exploiting the freedom to control the simulated environment we were able to learn manoeuvres in states beyond the common driving distribution in real-world imitation learning, removing the need for multiple camera data collection rigs or data augmentation.

We evaluated this method against a number of baselines using open-loop (offline) and closed-loop metrics, driving a real vehicle over 3km without intervention on a rural road, trained without any real-world labels.

Finally, we demonstrate this method successfully driving in closed-loop on public UK urban roads.

A supplementary video demonstrating the system behaviour is available here: https://wayve.ai/sim2real

II Related Work

There has been significant development in building simulators for the training and evaluation of robotic control policies. While recent simulated environments have benefited from continued progress in photo-realistic gaming engines, the reality gap remains challenging. One line of work addresses this problem by exploiting a simulator’s ability to render significant amounts of random colours and textures, with the assumption that the visual variation observed in the real world is covered.

An active area of research is image-to-image translation, which aims to learn a mapping function between two image distributions. Isola et al. proposed a conditional adversarial network where the discriminator compares a pair of images corresponding to two modalities of the same scene. Zhu et al. relaxed the need for corresponding images by proposing a cyclic consistency constraint for back translation of images. Similarly, Liu et al. combine cross domain discriminators and a cycle consistent reconstruction loss to produce high quality translations. We extend this framework to accommodate learning control as an additional supervised decoder output.

A problem closer to this work is that of unsupervised domain adaptation, which combines supervision from the source domain with pairwise losses or adversarial techniques to align intermediate features for a supervised task. However, feature-based alignment typically requires some additional pairwise constraints, or intermediate domains, to bridge significant shifts in domain appearance. Other related work aligns both feature and pixel level distributions for the perception-only task of semantic segmentation. The very recent work of Wenzel et al. learns to transfer control knowledge across simulated weather conditions via a modular pipeline with both image translation and feature-wise alignment.

While the large majority of work in autonomous driving focuses on rule-based and traditional robotic control approaches, end-to-end learning approaches have long held the promise of learning driving policies directly from data with minimal use of brittle assumptions and expensive hand tuning. Pomerleau first demonstrated the potential feasibility of this approach by successfully applying imitation learning (IL) to a simple real-world lane following task. Muller et al. further successfully learned to imitate functional driving policies in more unconstrained off-road environments with toy remote control cars.

More recently, Bojarski et al. demonstrated that simple behaviour cloning could be scaled to a larger and more complex set of lane following scenarios when using a multi-camera system and modern neural net architectures. Codevilla et al. took a similar multi-camera approach, and learned to navigate successfully in simulated urban driving scenarios by conditioning on higher level driving decisions. Building on this, Müller et al. transferred policies from simulation to reality by using a modular architecture predicting control decisions learned from semantic image segmentation.

Mehta et al. take a learning from demonstration approach, and augment the control prediction task to depend on intermediate predictions of both high-level visual affordances and low-level action primitives. Kendall et al. demonstrated that pure reinforcement learning approaches with a single camera system could be successfully applied for learning real-world driving policies, though single camera systems have typically not been implemented to date, attributed largely to the robustness gained from training data with multiple views and synthetic control labels. In this work, robustness is achieved through noisy experiences captured in simulation.

In this work, we use an intermediate representation where an implicit latent structure is formed, as opposed to more explicit representations, such as semantic segmentation. The work presented in this paper is closest to recent work that proposed exploiting image translation for learning to steer, while also sharing a similar philosophy in reducing the complexity of the simulated environment to minimise the sufficient statistics for the desired task. Our work differs in that we use a bi-directional image translation model where we are not dependent on having real-world driving commands. We also demonstrate the efficacy of a driving policy on a real system, showing closed-loop performance metrics rather than solely open-loop metrics against a validation set.

III Method

The problem of translating knowledge from expert demonstrations in simulation to improved real-world behaviour can be cast into the framework of unsupervised domain adaptation. Here, the training set consisted of image-label pairs from a simulated environment $(X_{sim},C_{sim})$. An independent set of unlabelled images $(X_{real})$ from the real target environment was used to capture the distribution of real-world observations. Crucially, there were no pairwise correspondences between images in the simulated training set and the unlabelled real-world image set.

Our framework consisted of various modules, which can broadly be broken down into an image-translator, a set of domain discriminator networks and the controller network for imitation learning. Fig. 2 provides a high-level overview of our model, showing the flow of information between the different modules.

The image-translation module followed the general framework of unsupervised image-to-image translation discussed in Section II. This consisted of two convolutional variational autoencoder-like networks where the latent embedding was swapped for translating between domains. More formally:

$$Z_{d}=E_{d}(X_{d})+\epsilon \qquad (1)$$

$$\hat{X}_{d}=G_{d}(Z_{d}) \qquad (2)$$

where $d\in\{sim,real\}$ represents the domain and $\epsilon\sim\mathcal{N}(0,1)$ is noise added during training, but set to zero for inference. The process of translation consisted of computing $Z_{d}$ using one domain encoder and then predicting $\hat{X}$ with the decoder of the other domain.
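The latent-swap idea can be illustrated with a toy sketch. The scalar affine "encoders" and "decoders" below are hypothetical stand-ins for the convolutional networks $E_{d}$ and $G_{d}$, and the names `make_encoder`, `make_decoder` and `translate` are assumptions introduced only for illustration.

```python
import random

def make_encoder(scale, shift):
    """Toy stand-in for E_d: maps each pixel of domain d into a shared latent space."""
    def encode(image, training=False, rng=None):
        z = [(p - shift) / scale for p in image]
        if training:
            # epsilon ~ N(0, 1) is added during training only; inference uses zero noise.
            rng = rng or random.Random()
            z = [v + rng.gauss(0.0, 1.0) for v in z]
        return z
    return encode

def make_decoder(scale, shift):
    """Toy stand-in for G_d: maps a latent back into the appearance of domain d."""
    def decode(z):
        return [v * scale + shift for v in z]
    return decode

# Hypothetical domains that differ only by an affine change in appearance.
E = {"sim": make_encoder(1.0, 0.0), "real": make_encoder(2.0, 0.5)}
G = {"sim": make_decoder(1.0, 0.0), "real": make_decoder(2.0, 0.5)}

def translate(image, src, dst, training=False, rng=None):
    """Encode with the source-domain encoder, then decode with the other domain's decoder."""
    z = E[src](image, training=training, rng=rng)
    return G[dst](z)

sim_image = [0.1, 0.4, 0.9]
real_like = translate(sim_image, "sim", "real")
back = translate(real_like, "real", "sim")
```

In this toy setting the two domains share an exact latent space, so translating to the other domain and back recovers the input; the learned networks only approximate this property.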

III-A2 Discriminators

For each translated image a two-scale discriminator was used to align the appearance distribution of the translator output with the images in the corresponding domain.

III-A3 Controllers

The control architecture is broadly based on a previously proposed end-to-end control architecture, where the image translator forms the primary convolutional encoder. The latent tensor $Z_{d}$ is passed through CoordConv layers to learn a spatially aware embedding suited for control from this latent translation representation. Finally, the spatially aware tensor is reduced with global average pooling and then followed by fully connected layers.
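The two non-learned ingredients of this controller head, coordinate channels and global average pooling, can be sketched in plain Python. This is a minimal illustration under the assumption that coordinates are scaled to $[-1,1]$; the convolution that would consume the extra channels and the fully connected layers are omitted, and the function names are hypothetical.

```python
def add_coord_channels(feature_map):
    """Append normalised row and column coordinate channels (CoordConv-style input).

    feature_map is a list of channels, each a list of rows (C x H x W).
    """
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    # Coordinates are scaled to [-1, 1]; a single row or column maps to 0.
    ys = [(2.0 * i / (height - 1) - 1.0) if height > 1 else 0.0 for i in range(height)]
    xs = [(2.0 * j / (width - 1) - 1.0) if width > 1 else 0.0 for j in range(width)]
    y_channel = [[ys[i]] * width for i in range(height)]
    x_channel = [list(xs) for _ in range(height)]
    return feature_map + [y_channel, x_channel]

def global_average_pool(feature_map):
    """Reduce each channel (H x W) to its mean, giving a vector of length C."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]

z = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]  # one latent channel of size 2 x 3
z_coord = add_coord_channels(z)           # now 3 channels: latent, y, x
pooled = global_average_pool(z_coord)     # one value per channel
```

The coordinate channels give later convolutions access to where in the image a feature occurs, which pooling alone would discard.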

III-B Losses

Figure 3 gives an overview of the main losses.

For a given domain $d$, $E_{d}$ and $G_{d}$ constitute a VAE. To improve image translation we used an L1 loss $\mathcal{L}_{recon}$ between an image $X_{d}$ and the reconstructed image after passing it through the corresponding VAE, $X_{d}^{recon}=G_{d}(E_{d}(X_{d}))$, as shown in Figure 3a.

III-B2 Cyclic Reconstruction Loss

Assuming a shared latent space implies the cycle-consistency constraint, which says that if an image is translated to the other domain and then translated back, the original image should be recovered. We applied a cyclic consistency loss $\mathcal{L}_{cyc}$ to the VAEs, given by an L1 loss between an image $X_{d}$ and the image after translating to the other domain, $d^{\prime}$, and back, $X_{d}^{cyc}=G_{d}(E_{d^{\prime}}(G_{d^{\prime}}(E_{d}(X_{d}))))$, see Figure 3b.
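The reconstruction and cyclic reconstruction losses differ only in the path an image takes through the encoders and decoders. The sketch below makes the two paths explicit; the lambda "networks" are hypothetical toys chosen so that the real-domain autoencoder and the cycle are deliberately imperfect, and the mean reduction in `l1` is an assumption.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized flat images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def recon_loss(x, d, E, G):
    # X_d^recon = G_d(E_d(X_d))
    return l1(x, G[d](E[d](x)))

def cyc_loss(x, d, d_other, E, G):
    # X_d^cyc = G_d(E_d'(G_d'(E_d(X_d))))
    translated = G[d_other](E[d](x))
    return l1(x, G[d](E[d_other](translated)))

# Hypothetical toy networks (not the paper's architecture).
E = {"sim": lambda x: [p * 2.0 for p in x], "real": lambda x: [p - 1.0 for p in x]}
G = {"sim": lambda z: [v / 2.0 for v in z], "real": lambda z: [v + 1.2 for v in z]}

x_sim = [0.0, 0.5, 1.0]
x_real = [1.0, 2.0]
loss_recon_sim = recon_loss(x_sim, "sim", E, G)       # perfect autoencoder: 0
loss_recon_real = recon_loss(x_real, "real", E, G)    # decoder is off by 0.2
loss_cyc_sim = cyc_loss(x_sim, "sim", "real", E, G)   # the round trip drifts by 0.1
```

Neither loss needs a label or a paired image from the other domain, which is what allows both to be applied to the unlabelled real-world set.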

III-B3 Control Loss

To guide our model to learn features that are useful for driving, we also used a control loss $\mathcal{L}_{control}$, which is an L1 loss between the controller’s predicted steering $\hat{c}=C(E_{d}(X_{d}))$ and the ground truth given by the autopilot, $c$, shown in Figure 3c.

Control should be based on the semantic content of an image and independent of which domain the original image came from. We therefore introduced a cyclic control loss $\mathcal{L}_{cyc\,control}$, as shown in Figure 3d, an L1 loss between the predicted steering $\hat{c}$ and the steering predicted by the controller when the image was translated to the other domain and then encoded, $\hat{c}^{cyc}=C(E_{d^{\prime}}(G_{d^{\prime}}(E_{d}(X_{d}))))$.
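A minimal sketch of the two control losses follows. The controller `C` (a scaled mean of the latent) and the toy encoders and decoder are hypothetical stand-ins used only to make the computation concrete.

```python
def control_loss(x, c, d, E, C):
    """L1 between the predicted steering C(E_d(X_d)) and the autopilot label c."""
    return abs(C(E[d](x)) - c)

def cyc_control_loss(x, d, d_other, E, G, C):
    """L1 between steering predicted before and after a trip through the other domain."""
    c_hat = C(E[d](x))
    c_hat_cyc = C(E[d_other](G[d_other](E[d](x))))
    return abs(c_hat - c_hat_cyc)

# Hypothetical toy networks (not the paper's architecture).
E = {"sim": lambda x: [p * 2.0 for p in x], "real": lambda x: [p - 1.0 for p in x]}
G = {"real": lambda z: [v + 1.2 for v in z]}
C = lambda z: 0.1 * sum(z) / len(z)

x_sim = [0.0, 0.5, 1.0]
loss_control = control_loss(x_sim, 0.15, "sim", E, C)               # needs a label
loss_cyc_control = cyc_control_loss(x_sim, "sim", "real", E, G, C)  # label-free
```

The control loss requires a steering label and so applies only to simulated images, whereas the cyclic control loss compares two predictions and can be evaluated on images from either domain.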

III-B4 Adversarial Loss

Both the image-translator and the discriminators were optimised with the Least-Squares Generative Adversarial Network (LSGAN) objective proposed by Mao et al. The discriminator (3) and generator (4) adversarial losses ensured that translated images resembled those from the chosen domain.
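For reference, the commonly used form of the least-squares GAN objective can be sketched as below. The target values (1 for genuine images of the domain, 0 for translated images) and the absence of any scaling factor are assumptions; they follow the standard LSGAN formulation rather than anything stated in this section.

```python
def lsgan_discriminator_loss(scores_genuine, scores_translated):
    """Push discriminator scores towards 1 on genuine images and 0 on translated ones."""
    genuine_term = sum((s - 1.0) ** 2 for s in scores_genuine) / len(scores_genuine)
    translated_term = sum(s ** 2 for s in scores_translated) / len(scores_translated)
    return genuine_term + translated_term

def lsgan_generator_loss(scores_translated):
    """Push the discriminator's scores on translated images towards 1."""
    return sum((s - 1.0) ** 2 for s in scores_translated) / len(scores_translated)

# Hypothetical discriminator outputs for a batch of two images each.
d_loss = lsgan_discriminator_loss([1.0, 0.8], [0.0, 0.2])
g_loss = lsgan_generator_loss([0.0, 0.2])
```

Compared with a log-likelihood GAN loss, the squared error keeps gradients informative even for samples the discriminator classifies confidently.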

III-B5 Perceptual Loss

To encourage consistent semantic content across the two domains, we used a pre-trained VGG16 model, which was applied to both the original and translated images. The perceptual loss $\mathcal{L}_{perceptual}$ was expressed as the difference between the features extracted from the last convolutional layer of the VGG16 model for a given input image and its translated counterpart. Extracted features were normalised via Instance Norm (IN), following the result from Huang et al. demonstrating that applying IN before computing the feature distance makes the perceptual loss more domain-invariant.
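The effect of instance normalisation on a feature distance can be seen in a small sketch. Feature extraction with VGG16 is omitted here: the inputs are hypothetical feature maps given as one flat list per channel, and the mean squared distance is an assumption, since the text does not state which distance was used.

```python
import math

def instance_norm(features, eps=1e-5):
    """Normalise each channel (a flat list of activations) to zero mean and unit variance."""
    out = []
    for ch in features:
        mean = sum(ch) / len(ch)
        var = sum((v - mean) ** 2 for v in ch) / len(ch)
        out.append([(v - mean) / math.sqrt(var + eps) for v in ch])
    return out

def perceptual_loss(feat_a, feat_b):
    """Mean squared distance between instance-normalised features (distance type assumed)."""
    na, nb = instance_norm(feat_a), instance_norm(feat_b)
    diffs = [(x - y) ** 2 for ca, cb in zip(na, nb) for x, y in zip(ca, cb)]
    return sum(diffs) / len(diffs)

feat = [[1.0, 2.0, 3.0]]
feat_rescaled = [[10.0, 20.0, 30.0]]  # same structure, different contrast
feat_flipped = [[3.0, 2.0, 1.0]]      # different spatial structure

loss_same_structure = perceptual_loss(feat, feat_rescaled)
loss_different_structure = perceptual_loss(feat, feat_flipped)
```

A per-channel change of scale or offset, the kind of shift expected between rendered and camera images, leaves the normalised distance near zero, while a change in spatial structure does not.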

III-B6 Latent Reconstruction Loss

Ideally we wanted to encode the semantic content of the images within the latent space such that $Z$ is independent of the domain from which an image came. We therefore applied a latent reconstruction loss $\mathcal{L}_{Zrecon}$, an L1 loss between the latent representation of an image $Z_{d}$ and the reconstruction of the latent representation after it was decoded to the other domain and then encoded once more, $Z_{d}^{recon}=E_{d^{\prime}}(G_{d^{\prime}}(Z_{d}))$.

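The total discriminator loss for a given domain was $\mathcal{L}_{LSGAN}(D_{d})$ (3). The VAEs and the controller were trained jointly using a weighted sum of losses with weights $\lambda_{i}$, given by

$$\mathcal{L}_{total}=\sum_{i}\lambda_{i}\mathcal{L}_{i},\qquad \mathcal{L}_{i}\in\{\mathcal{L}_{recon},\ \mathcal{L}_{cyc},\ \mathcal{L}_{control},\ \mathcal{L}_{cyc\,control},\ \mathcal{L}_{LSGAN}(G),\ \mathcal{L}_{perceptual},\ \mathcal{L}_{Zrecon}\} \qquad (5)$$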
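Combining the terms is a plain weighted sum, sketched below. The loss values and weights are hypothetical placeholders, not the values used in the experiments, and `total_loss` is an illustrative helper name.

```python
def total_loss(losses, weights):
    """Weighted sum of named loss terms; a weight of zero removes that term."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())

# Hypothetical per-batch loss values and weights.
losses = {"recon": 0.2, "cyc": 0.1, "control": 0.05}
weights = {"recon": 10.0, "cyc": 10.0, "control": 1.0}

full = total_loss(losses, weights)
without_cyc = total_loss(losses, {**weights, "cyc": 0.0})
```

Keeping the weights in a dictionary keyed by loss name makes it straightforward to switch off one term at a time, which is how an ablation over the loss terms can be run.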

The image translator was optimised using Adam, with a learning rate of $0.0001$ and momentum terms of $0.5$ and $0.999$. The controller was trained using stochastic gradient descent with a learning rate of $0.01$.

IV Evaluation

To evaluate our approach in Section III, we considered data gathered from a simulated domain (with a procedurally generated environment) and a real domain (gathered from driving on real roads). We trained models using the methods in IV-D first in a rural setting, to compare our method to a number of baseline approaches. We then demonstrate our method scaling to the complexity of public urban roads in the United Kingdom.

Table II presents in-domain and cross domain open-loop (offline) metrics for the learned controller, and Table III presents closed loop imitation control results evaluated on a physical vehicle in the rural environment. Qualitative results of the image-to-image translation are shown in Figure 7 for both the rural and urban setting. The following subsections detail the performance metrics used and test scenarios considered, followed by an analysis of the obtained results.

For the purpose of this work we created a simulated environment where we could build up a significant training set of image- and control-label pairs on procedurally generated virtual roads. Figure 4 illustrates the rural simulated world, while Figure 6 shows the urban simulated world.

For each data collection run in the rural world, a single road was created using a piece-wise Bézier curve generated by sampling 1-dimensional Simplex noise. The curvature of the road could be coarsely controlled using the frequency parameter of the noise. The road surface was then assigned a random texture from a finite set of example road textures. Once the rural road was constructed, trees and foliage were placed using Poisson Disk Sampling according to a foliage density parameter. In simulation we were able to vary environmental factors, such as cloud cover, rainfall, surface water accumulation, and time of day. However, we fixed these as we wished to learn an image-to-action control policy in a very narrow source domain: we did not require high variance in the source distribution, but rather needed to learn how the higher variance target distribution translates to the source.
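Poisson disk sampling places points at random while enforcing a minimum spacing. The sketch below uses naive rejection sampling ("dart throwing"), which is a simpler variant than grid-accelerated algorithms; which variant the simulator used is not stated, and the function name, the square region and the parameter values are assumptions.

```python
import math
import random

def poisson_disk_dart_throwing(width, height, min_dist, attempts=2000, seed=0):
    """Place points so that no two are closer than min_dist (naive rejection sampling)."""
    rng = random.Random(seed)
    points = []
    for _ in range(attempts):
        candidate = (rng.uniform(0.0, width), rng.uniform(0.0, height))
        # Keep the candidate only if it respects the spacing to every accepted point.
        if all(math.dist(candidate, p) >= min_dist for p in points):
            points.append(candidate)
    return points

# Hypothetical 100 m x 100 m patch with trees at least 10 m apart.
trees = poisson_disk_dart_throwing(width=100.0, height=100.0, min_dist=10.0)
```

A foliage density parameter could be mapped to `min_dist`, with a smaller spacing giving denser foliage. Each candidate is checked against every accepted point, so the cost grows quadratically and this version suits only small scenes.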

For the urban world, we use a similar approach to procedurally generate a road network. We procedurally add buildings, trees and parked cars. Care was taken to approximate the layout and topology of the real-world urban environment, but not the visual complexity or photo-realism. Instead, we rely on the image translation model to transfer the policy from the cartoon-like simulated world to the visually rich real world.

IV-A2 Simulated Expert Agent

The labelled training set was generated from an expert autopilot agent. The expert driver has a simple proportional controller empirically tuned to track the lane based on ground-truth distance from the centre, maintaining a constant vehicle speed. The curve from Section IV-A1 used to generate the road was used to generate a set of lane paths which were offset from the central curve. One end of the road was chosen at random and the simulated vehicle was placed at the end and told to follow the corresponding lane path.

To perturb the vehicle state, we added Ornstein-Uhlenbeck (OU) process noise to the expert driver’s actions. This had the effect of generating a more robust training dataset by moving the vehicle throughout the drivable lane space, observing the expert driver’s response from each perturbed pose. The OU noise parameters $\theta$ and $\sigma$ were selected to maximise the magnitude of perturbation while still allowing the expert driver to largely stay within the lane. This simulation environment ran asynchronously, with image-label pairs captured at 10Hz.
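The interaction between the proportional expert and the OU perturbation can be sketched as follows. The lateral dynamics, the controller gain and the noise parameters are all hypothetical, and recording the unperturbed expert action as the label is an assumption about how the dataset was stored. The time step of 0.1 s matches the 10Hz capture rate.

```python
import random

def ou_step(x, theta, sigma, dt, rng, mu=0.0):
    """One Euler-Maruyama step of an Ornstein-Uhlenbeck process."""
    return x + theta * (mu - x) * dt + sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)

def expert_steering(lateral_offset, gain=0.5):
    """Proportional controller: steer back towards the lane centre, clipped to [-1, 1]."""
    return max(-1.0, min(1.0, -gain * lateral_offset))

def collect_episode(steps, theta=0.5, sigma=0.3, dt=0.1, seed=0):
    """Return (lateral_offset, expert_label) pairs gathered under perturbed control."""
    rng = random.Random(seed)
    offset, noise, samples = 0.0, 0.0, []
    for _ in range(steps):
        label = expert_steering(offset)               # expert response at this pose
        samples.append((offset, label))
        noise = ou_step(noise, theta, sigma, dt, rng)
        applied = max(-1.0, min(1.0, label + noise))  # perturbed action moves the car
        offset += applied * dt                        # toy lateral dynamics (assumption)
    return samples

samples = collect_episode(200)
```

Because OU noise is temporally correlated, it pushes the vehicle away from the lane centre for several consecutive steps, so the dataset contains poses with a non-zero corrective label rather than only on-centre driving.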

Leveraging this expert agent to learn to drive with imitation learning is only possible in simulation. It requires privileged information, such as the distance to the centre of the lane, which is not available in the real world. Furthermore, we perturb the state with noise to generate richer training data; this would be dangerous in the real world as it would require swerving on the road.

IV-B Real-World Domain

We considered a single driving environment for initial real-world testing, consisting of a 250m private one-lane rural road, shown in Figure 4. For safety, the vehicle was driven at only $10\,\mathrm{km\,h^{-1}}$ and only in the absence of any other vehicles wishing to use the road, both during data collection and under test.

IV-B2 Urban road

In addition, we consider an urban road environment in Cambridge, UK. We select minor public roads in dense suburban areas and opportunistically test when they are void of other traffic. Figure 6 illustrates typical scenes. We test across sunny, overcast and rainy weather conditions.

IV-C Data

Table I outlines the training and test lane-following datasets used. For each domain, we collected approximately $60$k frames of training data and $20$k frames of test data.

The maximum steering range of each vehicle was parameterised to $\pm 1.0$. Due to the nature of driving, the labelled training set was heavily dominated by data with near-zero steering angles (i.e., driving straight), as shown in Figure 5. If the controller is trained with a naive mean absolute error (MAE) loss, it could feasibly predict zero steering for all outputs and fail to learn the correct image-steering mapping.

To address this problem, we split the data into eight bins according to the steering angle and sampled uniformly from each bin, upsampling the data such that every bin contained the same number of samples as the largest bin. The bin edges used in our experiments were $[-1,-0.075,-0.05,-0.025,0.0,0.025,0.05,0.075,1]$. The steering limits differ between vehicles, hence we applied a linear calibration to the real vehicle steering output.
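The binning and upsampling step can be sketched with the bin edges given above. Treating each interior edge as belonging to the bin on its right, sampling with replacement, and skipping empty bins are assumptions; the function names are illustrative.

```python
import bisect
import random

BIN_EDGES = [-1, -0.075, -0.05, -0.025, 0.0, 0.025, 0.05, 0.075, 1]

def bin_index(steering, edges=BIN_EDGES):
    """Index of the bin containing steering; interior edges belong to the bin on their right."""
    i = bisect.bisect_right(edges, steering) - 1
    return min(max(i, 0), len(edges) - 2)

def balance_by_upsampling(samples, seed=0):
    """samples: list of (frame_id, steering). Upsample each non-empty bin to the largest bin."""
    rng = random.Random(seed)
    bins = {}
    for sample in samples:
        bins.setdefault(bin_index(sample[1]), []).append(sample)
    target = max(len(members) for members in bins.values())
    balanced = []
    for members in bins.values():
        balanced.extend(members)
        # Draw the shortfall with replacement from the same bin.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

# Hypothetical imbalanced dataset: mostly straight driving, a few turns.
data = [(i, 0.01) for i in range(10)] + [(10, 0.5), (11, 0.5), (12, -0.06)]
balanced = balance_by_upsampling(data)
```

After balancing, a model that always predicts zero steering no longer achieves a low training error, because sharp turns carry the same total weight as straight driving.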

IV-D Transfer Learning Results

We compared our method to the following baselines:

Simple Transfer: this takes a model pre-trained in simulation and directly applies it to the real-world data. Note that, unlike the following baselines, this model only sees examples from simulation.

Real-to-Sim Translation: uses the unsupervised image-to-image translation to convert a real-world image to the simulation domain for direct application of the controller pre-trained in simulation.

Sim-to-Real Translation: uses the unsupervised image-to-image translation for training a controller on translated images (sim-to-real) with corresponding simulated actions. At test time no translation was performed and only the controller was used.

Latent Feature ADA: applies Adversarial Domain Adaptation (ADA) to the feature space to align encoders from simulation and real data.

For evaluation, we also compared a Drive-Straight policy as a proxy to assess road curvature, and to quantitatively assess the efficacy of offline metrics.

IV-D2 Performance Metrics

Open-loop evaluation of control policies is an open question: prior studies present a range of metrics, demonstrating weak correlation between offline metrics and online performance. We adopted two metrics to compare driving policies: mean absolute error (MAE) and Balanced MAE. These metrics for all approaches tested here are outlined in Table II. The MAE between the predicted and ground-truth steering is a useful loss function, but a poor reflection of actual performance due to the data imbalance. We propose a Balanced-MAE metric: discretising the validation data as per the bins in Section IV-C1, computing the MAE of the data per bin, then taking an equally weighted average across all bins (similar in principle to mean AP in object detection or recognition tasks). Qualitatively, we found that this metric correlates with closed-loop driving performance in simulation.
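The difference between the two metrics shows up clearly for a drive-straight policy on imbalanced data. The sketch below reuses the bin edges from Section IV-C; skipping empty bins in the average and the right-closed edge convention are assumptions, and the validation labels are hypothetical.

```python
import bisect

BIN_EDGES = [-1, -0.075, -0.05, -0.025, 0.0, 0.025, 0.05, 0.075, 1]

def mae(pred, true):
    """Mean absolute error over all samples."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def balanced_mae(pred, true, edges=BIN_EDGES):
    """Average of per-bin MAEs, with bins defined on the ground-truth steering."""
    per_bin = {}
    for p, t in zip(pred, true):
        i = min(max(bisect.bisect_right(edges, t) - 1, 0), len(edges) - 2)
        per_bin.setdefault(i, []).append(abs(p - t))
    # Empty bins are skipped rather than counted as zero error (assumption).
    return sum(sum(errs) / len(errs) for errs in per_bin.values()) / len(per_bin)

# Hypothetical validation labels: eight straight frames and two sharp turns.
true = [0.0] * 8 + [0.5, -0.5]
drive_straight = [0.0] * len(true)

plain = mae(drive_straight, true)
balanced = balanced_mae(drive_straight, true)
```

Here the drive-straight policy scores an MAE of 0.1 but a Balanced MAE of about 0.33, since its errors on the two turning bins are no longer diluted by the many straight frames.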

For closed-loop testing on our rural driving route, we measured the distance travelled per safety-driver intervention, in metres, averaged over 3km of driving. During test, we ran the models at close to the camera’s 30Hz frame rate. Table III outlines the performance of our system, as well as a number of baseline models. We found that the open-loop metrics in Table II hold only a weak correlation with real-world performance, demonstrated by the difference in closed- and open-loop performance between our model and the baseline approaches. Clearly, further work is needed to develop open-loop metrics which support real-world performance.

IV-D3 Loss Ablation

The various terms in Eqn. 5 are designed to work in concert to maintain a structured latent space to facilitate zero-shot transfer from simulated labels to the real domain. The following ablation study investigates the influence of each term by setting the corresponding weighting $\lambda$ to zero, effectively removing it from the training procedure. Table IV shows that all terms play a significant role in reducing the error with the exception of the cyclic control term. This indicates that the other cyclic losses are sufficient to maintain a shared latent structure. Not surprisingly, the L1 cyclic reconstruction and GAN losses inherited from the image-translation framework are most critical for control transfer. Finally, as removing the control term prevented the network from learning anything sensible, we instead evaluated the effect of the gradients of the controller on the image translator. Here we zeroed the gradients from the controller before they flowed back into the translator, effectively turning the controller into a passive neural stethoscope. These gradients would normally help inject structure for the task into the translator in an auxiliary fashion, leading to better real-world performance. This increase in error from not optimising the encoders for both tasks could also be a contributing factor to the poor performance of the translation baselines (real-to-sim and sim-to-real).

IV-D4 Urban Road Results

To further demonstrate the efficacy of our method, we performed closed-loop testing on public UK urban roads. Using our model, we were able to successfully demonstrate the vehicle lane-following on dense urban side streets in Cambridge, UK. Examples are shown in Figure 7b.

We drew one main conclusion: the model is typically able to drive if, and only if, the visual translation is successful. For example, when the topology of the road and all cars are translated correctly between domains, the car’s control policy gives the desired lane-following closed-loop behaviour. When a car is mis-translated into a footpath, the control policy requires intervention. Therefore, we conclude that this method may be able to scale to more complex domains, beyond simple lane following tasks, if the simulator is capable of simulating these scenarios and we can learn a successful domain translation model.

IV-D5 Visualisation

Understanding the performance of a control policy coupled with image domain transfer can be difficult. The bi-directional image translator provides the ability to inspect the model’s interpretation of the road surface as shown in Fig. 7. Here we can observe that the curvature and offset of the road are appropriately translated across domains, facilitating a consistent transfer of the steering signal. Interestingly, in the third row of Figure 7a (where the vehicle is nearly driving off the virtual road, which is far from the distribution of collected real images) we notice the model will generate an image closer to the distribution of real-world observations. In such scenarios, the controller is trained with a high magnitude steering command to correct its course in simulation, which results in robust behaviour in the real world as it steers the vehicle back into the distribution of real observations.

V Conclusions

Learning a driving policy from simulation has many advantages: training data is cheap, auxiliary ground-truth information can be provided with ease, and the vehicle can be put in situations that are difficult or dangerous to undertake in reality. Previously, with the substantial gap in complexity between the two domains, it was considered infeasible to transfer driving policies from simulation to the real world.

In this work, we present the first system that is capable of leveraging simulation to learn an end-to-end driving policy that transfers directly to real-world scenarios without any additional human demonstrations. This model jointly learns to translate images between the real-world operating domain and a procedurally generated simulation environment, while also learning to predict control decisions from the latent space shared between the two domains, given only the ground-truth labels from simulation. We empirically validated our proposed model in closed-loop against several baselines, successfully driving 3km between interventions on a real-world lane following task. We further evaluated the model using several standard open-loop metrics, observing that these metrics ultimately did not prove predictive of driving performance. Finally, we demonstrated this system driving in closed-loop on public urban roads in the United Kingdom.

This work provides evidence that end-to-end policy learning and simulation-to-reality transfer are highly promising directions for the development of autonomous driving systems. We note that standard open-loop metrics for this problem need to be improved, and leave this question to future work. Furthermore, the addition of orthogonal, but relevant, approaches on temporal motion consistency could further advance this field. We hope this work inspires further investigation of both learning driving policies directly from data, and exploiting simulation for removing the constraints of the real world.