Learning End-to-end Autonomous Driving using Guided Auxiliary Supervision

Ashish Mehta, Adithya Subramanian, Anbumani Subramanian

Introduction

A lot of recent work in autonomous driving has focused on designing end-to-end learning networks for the task of driving using input images (Bojarski et al., 2016; Xu et al., 2016; Bojarski et al., 2017; Codevilla et al., 2017; Chowdhuri et al., 2017a). Most works model this as an end-to-end regression problem of arriving at the control values using the input pixel values directly. Clearly, humans utilize a much richer hierarchical task decomposition pipeline while carrying out visuomotor tasks like driving. Rather than arriving at the exact throttle, brake and steering percentages directly from high-dimensional pixel values, our brains first decompose the image into recognizable objects and their intuitive states in the scene (Felleman and Van Essen, 1991; Kandel et al., 2000), then use visual abstractions along with past experience to come up with approximately optimal high-level decisions or action primitives for the given state, and lastly use these action primitives to arrive at the exact motor commands to be executed (Bertenthal, 1996; Mataric, 2000; Flash and Hochner, 2005; Hart and Giszter, 2010). Such a hierarchical structure allows us to tackle hard problems like driving by decomposing them into multiple sub-tasks, allowing for better generalization.

Studies in neuroscience suggest that humans use a combination of model-based and model-free techniques for sequential decision making (Gläscher et al., 2010; Evans, 2008; Fermin et al., 2010; Beierholm et al., 2011). Most of the current end-to-end driving networks only utilize a model-free architecture, without explicitly trying to infer the state or model of the environment which can be used for planning. A combination of model-based and model-free approaches could greatly improve decision-making in autonomous driving settings.

Another challenge with such end-to-end learning systems is their black-box nature (Marcus, 2018; Samek et al., 2017; Ribeiro et al., 2016). Deep Learning techniques, in general, are highly opaque function approximators having millions of self-learned parameters, each of which might not have any human-interpretable significance. This is a huge setback in life-critical decision-making tasks like autonomous driving, where it is essential to not only arrive at the correct decision but also know how the decision was made by the system.

In this work, we attempt to overcome these obstacles by proposing supervised auxiliary tasks that are learned by the network along with the main task of driving. We propose a set of ‘visual affordances’ and ‘action primitives’ that are annotated and used as auxiliary supervised tasks. The visual affordances are used to form an abstract description of the visual scene in front of the vehicle, and the action primitives provide an abstract description of the possible high-level actions required for driving.

The auxiliary tasks serve a three-fold purpose. Firstly, the auxiliary tasks allow us to infuse a rich prior in the form of human knowledge into the system that assists the final prediction task instead of expecting the network to learn all relevant knowledge from scratch. While trying to demonstrate the driving task, a human identifies the visual affordances and action primitives that are essential for driving, and provides the network with the auxiliary task of predicting these abstractions, thus assisting its decision-making with human knowledge. Secondly, the joint learning of the auxiliary task along with the main task of driving provides auxiliary guided supervision by forcing the network to predict intermediate representations like distance to vehicles and orientation with respect to the lane among others, that can be crucial to arrive at the final driving decision. This guided supervision allows the network to learn superior internal features thus allowing it to learn faster and generalize better. The auxiliary visual affordances can also be seen as an abstract description of the agent’s environment and the joint-learning technique as a method of efficiently combining model-based and model-free techniques. Lastly, predicting the visual affordances and action primitives allows for better transparency in the network’s learned internal representations, helping us better understand its decision-making process.

We demonstrate our hypothesis in the CARLA simulator (Dosovitskiy et al., 2017). All trials are started with random initialization of the player and non-player positions and goals, and temporal noise is injected into the non-player vehicles to increase the stochasticity of the multi-player dynamics. Such a stochastic setting makes it more difficult for the agent to overfit to a deterministic non-player policy and requires that the player have temporal information about the non-player agents to infer their present state, without which it cannot derive its own policy. The demonstrated expert policy is also extremely opportunistic, as the demonstrator tries to not only avoid all collisions but also reach the goal locations in the least amount of time, generously overtaking vehicles if required. These changes make the driving conditions more realistic and challenging for a machine learning algorithm.

Related Work

Learning from Demonstration or Imitation Learning (Schaal, 1997; Argall et al., 2009; Atkeson and Schaal, 1997) and Reinforcement Learning (Sutton and Barto, 1998; Kaelbling et al., 1996) have proven to be the key techniques for learning sequential decision-making tasks through demonstration and experience respectively. Reinforcement Learning requires a hand-crafted reward signal which the agent tries to maximize over time and in turn learns the desired policy through trial-and-error. The agent has to learn to assign credit to past actions faithfully in case of delayed rewards and also balance exploration and exploitation effectively to arrive at an ideal policy, both of which are extremely challenging (Sutton, 1984, 1992). In Learning from Demonstration, on the other hand, the agent learns by directly trying to mimic the decisions of an expert demonstrator. Though originally designed for low-dimensional state-space tasks, with the advent of superior supervised function approximators (Krizhevsky et al., 2012; He et al., 2015b; Szegedy et al., 2017), a lot of progress has been made in scaling these algorithms to very high dimensional state-space tasks enabling agents to learn policies directly from images (Tai and Liu, 2016; Mnih et al., 2015; Ratliff et al., 2009; Bojarski et al., 2016).

In this work, we use the Learning from Demonstration framework to train an agent to learn the task of driving using high-dimensional observations in the form of images. Learning from Demonstration along with deep function approximators has been used to tackle a lot of problems in robotics like indoor mobile robot navigation (Tai et al., 2016), quad-rotor control on forest trails (Giusti et al., 2016) and robot-arm manipulation (Duan et al., 2017; Finn et al., 2017; Yu et al., 2018), among others. The closest to our work are the works of Bojarski et al. (2016) who show autonomous lane following using a single trained network, Codevilla et al. (2017) who demonstrate autonomous driving in CARLA using an additional conditional input from a high-level planner, Hou et al. (2017) who compare various contemporary networks for autonomous driving tasks and Chowdhuri et al. (2017b) who demonstrate multi-task and multi-modal behavior for autonomous driving.

Multi-task learning (MTL) research shows the joint training of auxiliary related side-tasks along with the main task enhances the training performance (Caruana, 1998; Zhang and Yeung, 2012). MTL in neural networks (Ruder, 2017) has been successfully demonstrated in many tasks previously including text-to-speech conversion (Seltzer and Droppo, 2013), natural language processing (Collobert and Weston, 2008), speech processing (Deng et al., 2013) and computer vision (Girshick, 2015; Zhang et al., 2016). In the field of sequential decision making, Lample and Chaplot (2016) demonstrate MTL for 3D game playing, Mirowski et al. (2016) and Jaderberg et al. (2016) demonstrate MTL in 3D maze navigation task whereas Chowdhuri et al. (2017b) utilize the MTL framework for autonomous driving. Instead of employing future control outputs as auxiliary tasks as shown by Chowdhuri et al. (2017b), in this work we employ action and visual abstractions to guide the driving behavior.

Supervised learning of visual affordances for autonomous driving was introduced by Chen et al. (2015), though they use the predicted affordances to plan using a set of fixed rules whereas our network uses visual affordances as auxiliary tasks for the main task of driving. Action primitives can be inferred as sub-policies for the desired task. Learning hierarchical policies via demonstration is an active area of research (Byrne and Russon, 1998; Demiris and Dearden, 2005; Le et al., 2018; Shiarlis et al., 2018) and research in developmental psychology has also found evidence of hierarchical task decomposition during imitation in young children (Whiten et al., 2006). Our work decomposes the main task of driving into sub-policies which are used as auxiliary supervision to derive the final control commands.

Multi-task Learning from Demonstration (MT-LfD) Framework

We begin by detailing the framework used to train the agent through expert demonstrations for the task of driving autonomously along with auxiliary task guidance. Learning from Demonstration (LfD) involves training an agent to imitate an expert demonstrator. At each time step $i$, the expert demonstrator is provided with an observation $o_{i}$, and the demonstrator provides the ideal action $u_{i}$ for that particular observation. A dataset $D=\{o_{i},u_{i}\}_{i=1}^{N}$ comprising multiple episodic sequential roll-outs of the demonstrations is curated.

LfD works on the assumption that if the demonstrator could deduce the ideal actions $u_{i}$ from the provided observations $o_{i}$, and if the demonstrator uses a consistent policy to determine the ideal actions $u_{i}$, there must exist a constant mapping function $F$ which captures the relation between observations and actions, $u_{i}=F(o_{i})\;\forall i\in[1,N]$. In such a scenario, an agent parameterized by $\theta$ can be trained to obtain a policy $\pi(u_{i}/o_{i};\theta)$ which maps the observations to actions. If a sufficiently expressive function approximator is used to train the agent, the parameters $\theta$ can be tuned such that the learned policy $\pi$ is almost equivalent to the demonstrated mapping function $F$. This can be done by tuning the parameters using the following update rule:

$$\theta^{*}=\arg\min_{\theta}\sum_{i=1}^{N}\ell\big(\pi(o_{i};\theta),F(o_{i})\big)\tag{1}$$

where $\ell$ denotes a per-sample prediction loss, and since we are assuming that $u_{i}=F(o_{i})\;\forall i\in[1,N]$, the above equation is reducing the loss:

$$\sum_{i=1}^{N}\ell\big(\pi(o_{i};\theta),u_{i}\big)\tag{2}$$

If the collected dataset $D$ is diverse enough to cover a large support of the distribution, the learned policy $\pi(u/o;\theta)$ could generalize to new unseen observations $o_{j}$ which are similar to the demonstrated observations, and predict a faithful control output $u_{j}$ . Even so, distribution-mismatch is a common problem in LfD, where the demonstrated distribution does not match the test-time distribution, leading to unexpected compounding errors. A lot of methods have been suggested to overcome the distribution-mismatch problem (Ross et al., 2010; Levine and Koltun, 2013; Bojarski et al., 2016). In this work, we employ the noise injection method suggested by Laskey et al. (2017), by inducing noise in the agent during demonstration (details of which are in Section 4.2) to force it to visit novel states and improve the demonstration distribution.
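As a minimal sketch of the imitation objective described above, the following fits a one-dimensional linear policy to demonstrations by gradient descent on a mean squared error. The linear policy, the synthetic expert `F(o) = 0.5*o - 0.2`, the learning rate and the step count are all illustrative assumptions and are not taken from this work, which uses a deep network on images.

```python
import random


def mse_loss(theta, data):
    """Mean squared error between the policy output and the demonstrated action."""
    w, b = theta
    return sum((w * o + b - u) ** 2 for o, u in data) / len(data)


def fit_policy(data, lr=0.1, steps=500):
    """Fit a 1-D linear policy pi(o) = w*o + b by gradient descent on the MSE."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * o + b - u) * o for o, u in data) / n
        grad_b = sum(2 * (w * o + b - u) for o, u in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical consistent expert: u = F(o) = 0.5 * o - 0.2 (assumption for illustration).
rng = random.Random(0)
observations = [rng.uniform(-1.0, 1.0) for _ in range(200)]
demos = [(o, 0.5 * o - 0.2) for o in observations]
theta = fit_policy(demos)
```

Because the synthetic expert is itself linear and consistent, the fitted parameters approach the expert's mapping; with a real demonstrator the fit is only as good as the coverage of the dataset.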

3.2 Auxiliary Task Supervision

Input observations $o_{i}$ include high-dimensional images, and thus the underlying state $s_{i}$ of the system has to be inferred from these images for model-based decision making. Here we form an abstract description of the state $s_{i}$ using the visual affordances $v_{i}$. These local visual statistics allow the agent to learn an abstract local model of its environment. The control commands $u_{i}$ predicted by the agent can also be decomposed into sub-policies. We decompose the task by enabling the agent to predict action primitives $a_{i}$, which are abstract descriptions of the policy. The visual affordances and action primitives are used as auxiliary supervised sub-tasks that are predicted by the multi-task network.

The dataset $D$ is augmented with the visual affordance $v_{i}$ and action primitive $a_{i}$ annotations and thus becomes $D=\{o_{i},u_{i},v_{i},a_{i}\}_{i=1}^{N}$. Let $\pi(u_{i}/o_{i}^{\prime})$, $\phi(v_{i}/o_{i})$ and $\psi(a_{i}/o_{i})$, jointly parameterized by $\theta$, denote the learnable functions to predict control, visual affordances and action primitives respectively. Since the policy $\pi$ is now guided by the predicted visual affordances and action primitives, policy $\pi$ is conditioned on $\phi$ and $\psi$ along with observation $o_{i}$ as $\pi(u_{i}/o_{i},\phi(o_{i}),\psi(o_{i}))$, which we denote as $\pi(u_{i}/o_{i}^{\prime})$. We can augment the update rule defined in Eq. 1 as follows:

$$\theta^{*}=\arg\min_{\theta}\sum_{i=1}^{N}\Big[\ell\big(\pi(o_{i}^{\prime}),u_{i}\big)+\alpha\,\ell\big(\phi(o_{i}),v_{i}\big)+\beta\,\ell\big(\psi(o_{i}),a_{i}\big)\Big]+\gamma(\theta)^{2}\tag{3}$$

where $\ell$ is the per-sample loss of the corresponding task, $(\theta)^{2}$ is the L2 regularization loss and $\alpha,\beta,\gamma$ are hyperparameter coefficients for the auxiliary losses. As the learning progresses and the network is able to learn to faithfully predict visual affordances and action abstractions, $\phi(o_{i})\approx v_{i}$ and $\psi(o_{i})\approx a_{i}$. Thus eventually, the policy is conditioned on the visual affordances and action primitives as $\pi(u_{i}/o_{i},v_{i},a_{i})$ and is guided by them.
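The weighted combination of the main and auxiliary losses can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: it assumes $\alpha$ weights the visual affordance loss and $\beta$ the action primitive loss, it uses the coefficient values reported in the Results section as defaults, and it treats the non-exclusive action primitives with a per-primitive binary cross entropy, which is one plausible reading of the loss described there. The function names are hypothetical.

```python
import math

# Coefficients as reported in the Results section; their assignment to the
# individual loss terms is an assumption made for this sketch.
ALPHA, BETA, GAMMA = 0.1, 0.2, 1e-5


def mse(pred, target):
    """Mean squared error over the components of one prediction."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross entropy over independent (non-exclusive) labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(1.0 - eps, max(eps, p))  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def multitask_loss(u_pred, u, v_pred, v, a_prob, a, weights,
                   alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Control loss plus weighted auxiliary losses and L2 regularization."""
    control = mse(u_pred, u)                   # brake, throttle, steering
    affordance = mse(v_pred, v)                # visual affordances
    primitive = binary_cross_entropy(a_prob, a)  # action primitives
    l2 = sum(w * w for w in weights)
    return control + alpha * affordance + beta * primitive + gamma * l2
```

Calling `multitask_loss(..., alpha=0.0, beta=0.0)` removes the auxiliary terms and leaves only the control loss and the regularizer.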

3.3 Network Architecture

In this sub-section we outline the network architecture we use to perform MT-LfD. In our framework, the observations $o_{i}$ consist of stacked input images, the forward speed of the player agent and a planner input which provides the high-level directions at intersections. To encode temporal information, which is vital for the stochastic multi-agent urban scenarios that we experiment with, we provide a history sequence of five images stacked along with the current image at each time-step. The control predictions $u_{i}$ include the brake, throttle and steering percentage. Details about the visual affordances and action primitives and their ground truth annotation scheme are presented in Section 4.3.
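The image history described above can be maintained with a fixed-length buffer, as in the following sketch. The class name `FrameStacker` is hypothetical, and repeating the first frame to fill the buffer at the start of an episode is an assumption; the text does not say how the first time-steps are handled.

```python
from collections import deque

HISTORY = 5  # past frames kept alongside the current one


class FrameStacker:
    """Keeps the current frame together with the most recent past frames."""

    def __init__(self, history=HISTORY):
        self.frames = deque(maxlen=history + 1)

    def push(self, frame):
        """Add the newest frame and return the stack, oldest first."""
        if not self.frames:
            # Padding assumption: repeat the first frame of an episode.
            self.frames.extend([frame] * self.frames.maxlen)
        else:
            self.frames.append(frame)  # the oldest frame is dropped automatically
        return list(self.frames)
```

Each call returns six entries (five history frames plus the current one), which would then be stacked along the channel axis before being passed to the feature extractor.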

Figure 1 shows a high-level representation of our proposed network. We use ResNet-50 (He et al., 2015a) to extract useful features from the stack of input images. Instead of adding or concatenating the speed and planner inputs, we use a learnable soft-attention mechanism (Bahdanau et al., 2014) to attend to the extracted image features using the speed and planner inputs, as represented by the input attention block in Figure 1. A learnable soft-attention mechanism allows the low-dimensional speed and planner inputs to have a large impact on the relatively higher-dimensional image feature vector. This attention mechanism is represented by a neural network layer and is jointly learned with the main network. The speed and planner inputs separately attend to the extracted feature vector, and a union of the two attended features is taken to obtain a jointly attended vector. This attended observation feature is used to predict the visual affordances and action primitives. Lastly, in the Auxiliary Task Attention Block, a double-attention mechanism similar to the one described above is used to attend over the observation feature vector using the visual affordances and action primitives, and the result is used to produce the final control predictions. The dashed-line vectors in Figure 1 represent the ground-truth positions from where the gradients are jointly back-propagated.
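One way the double-attention idea could look is sketched below. The text does not specify the form of the attention layer or of the union, so every detail here is an assumption: each low-dimensional input is mapped by a single linear layer to one score per feature, the scores are normalized with a softmax, and the union of the two attended vectors is taken as an element-wise maximum. The function names and parameter layout are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, cond, weight, bias):
    """Scale each feature by a softmax weight computed from a low-dimensional input.

    weight has one row per feature and one column per entry of cond.
    """
    scores = [sum(w * c for w, c in zip(row, cond)) + b
              for row, b in zip(weight, bias)]
    return [a * f for a, f in zip(softmax(scores), features)]


def joint_attend(features, speed, planner, params):
    """Attend separately with speed and planner inputs, then combine."""
    by_speed = attend(features, speed, *params["speed"])
    by_planner = attend(features, planner, *params["planner"])
    # "Union" is interpreted here as an element-wise maximum (an assumption).
    return [max(s, p) for s, p in zip(by_speed, by_planner)]
```

In a trained network the weights and biases would be learned jointly with the rest of the model; with all-zero parameters the attention is uniform and every feature is scaled by the same factor.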

Experimental Setup

We use CARLA, an open-source 3D urban driving simulator, for our experiments. The use of CARLA allows for high-fidelity graphics and physics simulations of an urban driving environment including diverse models of vehicles, pedestrians, houses, static obstacles, side-walks and intersections. For training, we collect data from a town with 2.9 km of drivable urban roads. The map of the town is shown in Figure 2.

To enable rich multi-agent behaviour, we initialize the map with 120 non-player vehicles and 140 non-player pedestrians during demonstration. The player as well as non-player agents are initialized at random start positions at the beginning of each episode to increase the stochasticity of the system dynamics. The non-player pedestrians have a rich AI which enables diverse realistic behaviours. Each pedestrian is provided a goal location at random and assigned a random maximum walking speed. The pedestrians try to reach the random goal point without colliding with other static or dynamic obstacles, and use a weighted navigation mesh to decide when to walk on the footpath or when and where to cross the road at a stochastically sampled angle to the road. The non-player vehicles are initialized with a random model and a random colour at the beginning of each episode and have rich intelligent behaviour as well. Each non-player vehicle drives within its lane, stops at traffic lights, follows the speed limit of the particular road it is travelling on, randomly samples turns at intersections and actively avoids all other static and dynamic obstacles.

Even though the non-player vehicles have rich intelligent behaviour, their deterministic nature makes them highly unrealistic. A player agent could easily map such deterministic behaviour and overfit its own behaviour to it. In such cases a player agent could easily derive its own policy given a single current time-step image. To increase the stochasticity of the non-player vehicles, we induce temporal noise in them. The temporal noise is enabled with some random probability and causes the non-player vehicles to brake immediately and stop for a random duration sampled from a uniform distribution. This simple noise injection prevents the player agent from overfitting to the non-player vehicles' policy and gives rise to realistic high-level urban traffic behaviours like erratic stopping, slow-moving vehicular queues, busy multi-directional traffic at intersections and an increased probability of overtaking maneuvers, among others, as can be seen in Figure 3.
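The temporal noise applied to a non-player vehicle can be sketched as a small state machine. The class name, the per-step trigger probability and the duration bounds are hypothetical; the text only states that the noise is enabled with some random probability and that the stop duration is drawn from a uniform distribution.

```python
import random


class TemporalBrakeNoise:
    """Randomly forces a vehicle to brake and stay stopped for a sampled duration."""

    def __init__(self, trigger_prob, min_steps, max_steps, rng=None):
        self.trigger_prob = trigger_prob  # chance of starting a stop on a free step
        self.min_steps = min_steps        # shortest forced stop, in simulation steps
        self.max_steps = max_steps        # longest forced stop, in simulation steps
        self.rng = rng or random.Random()
        self.remaining = 0                # steps left in the current forced stop

    def step(self):
        """Return True if the vehicle must brake on this step."""
        if self.remaining > 0:
            self.remaining -= 1
            return True
        if self.rng.random() < self.trigger_prob:
            # Duration sampled uniformly; this step counts as the first one.
            self.remaining = self.rng.randint(self.min_steps, self.max_steps) - 1
            return True
        return False
```

A simulator loop would call `step()` once per tick for each non-player vehicle and override that vehicle's controls with a full brake whenever it returns `True`.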

4.2 Data Collection

In our framework, a human driver is the expert demonstrator and provides the ground truth driving commands using CARLA that the network is trained to imitate. A front-facing RGB mono-camera of $320\times 180$ pixel resolution and $100^{\circ}$ FOV is mounted on a simulated vehicle for collecting visual information while driving. The expert demonstrator looks at a similar first-person view of a higher resolution ($1280\times 720$) while driving the vehicle to collect data. The complications of record-mapping (Argall et al., 2009) are prevented by having the demonstrator use a similar visual view point for demonstration.

The demonstrator is tasked with demonstrating the ideal brake, throttle and steering commands for each observation frame. The demonstrator does this by using a gaming steering wheel and throttle and brake pedals, which allow for easier demonstrations and enable analog inputs into the system. The setup we use for demonstration is shown in Figure 4. At the beginning of each episode, a collection of random goal points is selected and sequentially provided to an A* planner which uses the town map to plan the shortest route to the goal locations. The A* planner provides the demonstrator with one of the four high-level commands (go left, go right, go straight, follow lane) which the demonstrator follows while providing demonstrations. The demonstrator avoids all static and dynamic obstacles and also tries to stay in lane as much as possible. The demonstrator also tries to reach the goal in the least amount of time, even overtaking long queues of vehicles if required, while trying to drive within the above-mentioned constraints. This results in an extremely rich demonstration policy, making it more realistic and difficult to imitate in novel scenarios.
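For readers unfamiliar with the planner mentioned above, a generic A* search over a waypoint graph can be sketched as follows. The planner's actual implementation and the town graph are not described in this work, so the graph layout, the node names and the use of straight-line distance as both edge cost and heuristic are assumptions for illustration.

```python
import heapq
import math


def a_star(nodes, edges, start, goal):
    """Shortest path on a directed graph of (x, y) waypoints.

    nodes maps a name to its coordinates; edges maps a name to its successors.
    Returns (path, cost), or (None, inf) if the goal cannot be reached.
    """
    def dist(a, b):
        return math.dist(nodes[a], nodes[b])

    # Queue entries: (estimated total cost, cost so far, node, path so far).
    frontier = [(dist(start, goal), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        if cost > best_cost.get(node, math.inf):
            continue  # stale queue entry, a cheaper route was found already
        for nxt in edges.get(node, ()):
            new_cost = cost + dist(node, nxt)
            if new_cost < best_cost.get(nxt, math.inf):
                best_cost[nxt] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + dist(nxt, goal), new_cost, nxt, path + [nxt]),
                )
    return None, math.inf


# Hypothetical four-waypoint map, not the CARLA town.
nodes = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (2.0, 0.0), "D": (1.0, 2.0)}
edges = {"A": ["B", "D"], "B": ["C"], "D": ["C"]}
route, length = a_star(nodes, edges, "A", "C")
```

The high-level commands given to the demonstrator would then be derived from the turns along the returned route; that step is not shown here.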

To overcome the distribution mismatch problem, we induce noise in the player agent and record the demonstrations of the agent recovering from the noise as provided by the expert demonstrator. Unlike Codevilla et al. (2017) who induce a fixed duration temporal noise in the steering, we induce both positive and negative temporal noises in the steering, brake and throttle; the probability and duration of which are sampled from a uniform distribution.
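The key point of this noise injection is that the perturbed command is executed in the simulator while the expert's own command is what gets recorded as the label. A minimal sketch follows; the control ranges, the maximum noise magnitude, the maximum duration and the function names are assumptions chosen for illustration.

```python
import random

# Assumed control ranges: steering in [-1, 1], throttle and brake in [0, 1].
LIMITS = {"steer": (-1.0, 1.0), "throttle": (0.0, 1.0), "brake": (0.0, 1.0)}


def sample_perturbation(rng, max_magnitude=0.3, max_steps=10):
    """Draw a signed offset for one control channel and how many steps it lasts."""
    channel = rng.choice(sorted(LIMITS))
    offset = rng.uniform(-max_magnitude, max_magnitude)  # positive or negative
    steps = rng.randint(1, max_steps)
    return channel, offset, steps


def perturb(expert_cmd, channel, offset):
    """Return the command actually executed; expert_cmd itself is left untouched."""
    lo, hi = LIMITS[channel]
    executed = dict(expert_cmd)
    executed[channel] = min(hi, max(lo, expert_cmd[channel] + offset))
    return executed
```

During a noisy interval the vehicle drifts away from the demonstrated trajectory, and the frames that follow capture the expert steering it back, which is the recovery behaviour the dataset is meant to contain.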

We collect $150{,}000$ frames of images at about 6 fps, resulting from approximately 7 hours of driving demonstration spread across 82 episodes. The player agent data collected include RGB images, the high-level planner command and the speed provided by CARLA, which form our input observation $o_{i}$, and the steering, brake and throttle demonstrations provided by the expert demonstrator, which form the action $u_{i}$. Other auxiliary player and non-player measurements are also collected, which help us in annotating the ground-truth visual affordances $v_{i}$ and action primitives $a_{i}$ as discussed below.

4.3 Auxiliary Tasks and Data Annotation

In this subsection, we describe the visual affordances and action primitives used for auxiliary supervision. We identify 13 different visual affordances that are critical to describe the local state $s_{i}$ of the system and 8 different action primitives that decompose the policy effectively. A list of all the visual affordances and action primitives is provided in Table 1. A few visual affordance measurements like ‘Percentage player in opposite lane’ and ‘Percentage player on sidewalk’ are directly provided by CARLA, but most others have to be derived using a fixed set of rules. Multiple player and non-player measurements are collected which are not directly used for training the network, but are used to derive the ground truth annotations for the visual affordances. These measurements include player global position, player yaw, non-player agent positions, non-player agent yaw and non-player agent velocity. Action primitives are derived using fixed rules to classify the demonstrated steering, brake and throttle control values.
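As an illustration of how a rule could turn the collected measurements into an affordance, the sketch below derives a distance to the nearest vehicle ahead from the player position, player yaw and non-player positions. The actual annotation rules are not given in the text, so the corridor half-width, the range cap, the coordinate convention (yaw in degrees, measured from the x axis) and the function names are all assumptions.

```python
import math


def to_ego_frame(player_xy, player_yaw_deg, other_xy):
    """Express a world-frame position as (forward, lateral) offsets from the player."""
    dx = other_xy[0] - player_xy[0]
    dy = other_xy[1] - player_xy[1]
    yaw = math.radians(player_yaw_deg)
    forward = dx * math.cos(yaw) + dy * math.sin(yaw)
    lateral = -dx * math.sin(yaw) + dy * math.cos(yaw)
    return forward, lateral


def distance_to_vehicle_ahead(player_xy, player_yaw_deg, others,
                              half_lane=1.75, max_range=50.0):
    """Distance to the nearest vehicle in front, within a lane-wide corridor.

    Returns max_range when no vehicle qualifies.
    """
    best = max_range
    for xy in others:
        forward, lateral = to_ego_frame(player_xy, player_yaw_deg, xy)
        if forward > 0 and abs(lateral) <= half_lane:
            best = min(best, forward)
    return best
```

Only the magnitude of the lateral offset is used, so the rule does not depend on whether the simulator's coordinate system is left- or right-handed.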

Results

We train the network proposed in Section 3.3 using the expert-demonstrated data for the control values and the automatic rule-based ground truth generated for the visual affordances and action primitives. Out of the 150,000 images collected, a sequence of 10,000 images was held out for validation and the rest was used for training. The control prediction and visual affordance prediction are regression problems and thus we use a Mean Squared Error loss for $\pi(o_{i}^{\prime})$ and $\phi(o_{i})$, whereas the action primitive prediction is a mutually non-exclusive multi-class classification problem and thus we use a multi-class cross entropy loss for $\psi(o_{i})$. Data augmentation including Gaussian blurring, Gaussian noise, pixel dropouts, gray-scaling, contrast change and intensity shifts was used to help with better generalization. We use the Adam optimizer with an initial learning rate of $3\times 10^{-4}$ and use learning rate decay. We set $\alpha$, $\beta$ and $\gamma$ to $0.1$, $0.2$ and $1\times 10^{-5}$ respectively. The network is trained for approximately 15 hours on a GPU. We refer to our proposed network as MT-LfD through the rest of this section.

Figure 5 shows the plots of our results. We compare MT-LfD with two baselines to verify its performance. The first baseline, which we call Baseline with no attention, is an ablated version of the MT-LfD network architecture in which we use simple concatenation instead of learnable soft-attention to combine the auxiliary information with the visual information. The second baseline, which we call Baseline with no auxiliary guidance, has the same architecture as MT-LfD with $\alpha$ and $\beta$ set to zero. This essentially prevents the loss from the visual affordances and action primitives from back-propagating through the network and thus removes the auxiliary guidance provided by their predictions. Figure 5(b) shows the action primitive prediction accuracy on the held-out validation set. Since Baseline with no auxiliary guidance does not have a loss for action primitives, it performs almost on par with random sampling, whereas the others perform much better at the classification task.

We keep the rest of the hyperparameters the same as MT-LfD for the baselines. Figure 5(a) shows the mean squared error loss for the final control prediction task on the held-out validation set. MT-LfD has a lower loss than Baseline with no attention, thus validating our use of the soft-attention mechanism. MT-LfD also has a lower loss and converges much faster than Baseline with no auxiliary guidance, thus demonstrating the benefit of the guided auxiliary supervision provided by the visual affordances and action primitives and validating our hypothesis.

Summary

We demonstrate Multi-task Learning from Demonstration for end-to-end learning of autonomous driving, jointly supervised and guided by visual affordances and action primitives. We present our network architecture, which uses ResNet-50 and learnable soft-attention mechanisms to combine the auxiliary task predictions with the observation information for the final driving task prediction. We show that our proposed MT-LfD framework outperforms vanilla LfD and MT-LfD without attention in the main task of vehicle control prediction. We thus validate our hypothesis that jointly learning the auxiliary tasks and employing their predictions to guide the final control prediction enhances both the speed and the performance of learning.

Nevertheless, our network does not outperform previous work in more deterministic setups with a controlled number of agents and maximum speed. Driving in a realistic scenario with a highly stochastic multi-agent setup and realistic driving demonstrations still remains a challenging open problem. It remains to be seen how memory-augmented networks perform in such a scenario. One direction for future work could be to use Inverse Reinforcement Learning to derive the intention of the demonstrator and use model-based statistics along with such inferred intention to derive an ideal driving policy. Another, orthogonal direction could be to train a network to plan on the model-based statistics and use the planning network along with the model-free prediction network to derive the driving policies. More work also needs to be done in making driving simulators more representative of human driving scenarios for autonomous driving experiments.