TrafficPredict: Trajectory Prediction for Heterogeneous Traffic-Agents

Yuexin Ma, Xinge Zhu, Sibo Zhang, Ruigang Yang, Wenping Wang, Dinesh Manocha

Introduction

Autonomous driving is a significant and difficult task that has the potential to impact people’s day-to-day lives. The goal is to make a vehicle perceive the environment and navigate any traffic situation safely and efficiently without human intervention. Some of the challenges arise in dense urban environments, where the traffic consists of different kinds of traffic-agents, including cars, bicycles, buses, and pedestrians. These traffic-agents have different shapes, dynamics, and behaviors, and together they can be regarded as an instance of a heterogeneous multi-agent system. To guarantee the safety of autonomous driving, the system should be able to analyze the motion patterns of other traffic-agents and predict their future trajectories so that the autonomous vehicle can make appropriate navigation decisions.

Driving in an urban environment is much more challenging than driving on a highway. Urban traffic involves more uncertainty, complex road conditions, and diverse traffic-agents, especially at intersections. Different traffic-agents have different motion patterns. At the same time, each traffic-agent’s behavior is strongly affected by other traffic-agents. It is therefore necessary to consider the interactions between agents to improve the accuracy of trajectory prediction.

The problem of predicting trajectories for moving agents has been studied extensively. Some traditional algorithms are based on motion models like kinematic and dynamic models (?), Bayesian filters (?), Gaussian Processes (?), etc. These methods do not take into account interactions between the traffic-agents and the environment, making it difficult to analyze complicated scenarios or perform long-term predictions. With the success of LSTM networks in modeling non-linear temporal dependencies (?) in sequence learning and generation tasks, more and more works have been using these networks to predict the trajectories of human crowds (?) and vehicles (?). The common limitation of these works is that they focus on predicting one type of agent (only pedestrians or only cars, for example). These methods may not work in heterogeneous traffic, where different vehicles and pedestrians coexist and interact with each other (?).

Main Results: For the task of trajectory prediction in heterogeneous traffic, we propose a novel LSTM-based algorithm, TrafficPredict. Given a sequence of trajectory data, we construct a 4D Graph, where two dimensions are for instances and their interactions, one dimension is for time series, and one dimension is for high-level categorization. In this graph, all the valid instances and categories of traffic-agents are denoted as nodes, and all the relationships in spatial and temporal space are represented as edges. Sequential movement information and interaction information are stored and transferred by these nodes and edges. Our LSTM network architecture is constructed on the 4D Graph and can be divided into two main layers: the instance layer and the category layer. The former is designed to capture dynamic properties of and interactions between the traffic-agents at a micro level. The latter aims to summarize the behavioral similarities of instances of the same category from a macroscopic view and, in turn, guide the prediction for instances. We also use a self-attention mechanism in the category layer to capture the historical movement patterns and highlight the differences between categories. Our method is the first to integrate trajectory prediction for different kinds of traffic-agents in one unified framework.

To expedite research progress on prediction and navigation in challenging scenarios for autonomous driving, we provide a new trajectory dataset for complex urban traffic with heterogeneous traffic-agents during rush hours. A scenario and data samples from our dataset are shown in Fig. 1. In practice, TrafficPredict takes a fraction of a second on a single CPU core and exhibits about a $20\%$ accuracy improvement over prior prediction schemes. The novel components of our work include:

We propose a new approach for trajectory prediction in heterogeneous traffic.

We collect a new trajectory dataset in urban traffic with frequent interactions between different categories of traffic-agents.

We show that our method has smaller prediction error than other state-of-the-art approaches.

The rest of the paper is organized as follows. We give a brief overview of related prior work in Section 2. In Section 3, we define the problem and give details of our prediction algorithm. We introduce our new traffic dataset and show the performance of our methods in Section 4.

Related Work

The problem of trajectory prediction or path prediction has been extensively studied. There are many classical approaches, including Bayesian networks (?), Monte Carlo Simulation (?), Hidden Markov Models (HMM) (?), Kalman Filters (?), linear and non-linear Gaussian Process regression models (?), etc. These methods focus on analyzing the inherent regularities of objects themselves based on their previous movements. They can be used in simple traffic scenarios in which there are few interactions among cars, but these methods may not work well when different kinds of vehicles and pedestrians appear at the same time.

Behavior modeling and interactions

There is considerable work on human behavior and interactions. The Social Force model (?) presents a pedestrian motion model with attractive and repulsive forces, which has been extended by (?). Some similar methods have also been proposed that use continuum dynamics (?), Gaussian processes (?), etc. Bera et al. (?; ?) combine an Ensemble Kalman Filter and human motion model to predict the trajectories for crowds. These methods are useful for analyzing motions of pedestrians in different scenarios, such as shopping malls, squares, and pedestrian streets. There are also some approaches to classify group emotions or identify driver behaviors (?). To extend these methods to general traffic scenarios, (?) predicts the trajectories of multiple traffic-agents by considering kinematic and dynamic constraints. However, this model assumes perfect sensing and shape and dynamics information for all of the traffic agents.

RNN networks for sequence prediction

In recent years, the deep neural network (DNN) has received a huge amount of attention due to its good performance in many areas (?). The recurrent neural network (RNN) is one of the DNN architectures and is widely used for sequence generation in many domains, including speech recognition (?), machine translation (?), and image captioning (?). Many methods based on long short-term memory (LSTM), one variant of RNN, have been proposed for maneuver classification (?) and trajectory prediction (?). Some methods (?; ?; ?) produce probabilistic information about the future locations of vehicles over an occupancy grid map or samples by making use of an encoder-decoder structure. However, these sampling-based methods suffer from inherent inaccuracies due to discretization limits. Another method (?) presents a model that outputs a multi-modal distribution and then generates trajectories. Nevertheless, most of these methods require clear road lanes and simple driving scenarios without other types of traffic-agents passing through. Based on images, (?) models the interactions between different traffic-agents by an LSTM-CNN hybrid network for trajectory prediction. Taking into account human-human interactions, some approaches (?; ?; ?) use LSTMs to predict the trajectories of pedestrians in a crowd, and they show good performance on public crowd datasets. However, these methods are also limited in terms of trajectory prediction in complex traffic scenarios, where the interactions are not only among pedestrians but also among heterogeneous traffic-agents.

Traffic datasets

There are several datasets related to traffic scenes. Cityscapes (?) contains 2D semantic, instance-wise, dense pixel annotations for 30 classes. ApolloScape (?) is a large-scale comprehensive dataset of street views that contains higher scene complexities, 2D/3D annotations and pose information, lane markings and video frames. However, these two datasets do not provide trajectory information. The Next Generation Simulation (NGSIM) dataset (?) has trajectory data for cars, but the scene is limited to highways with similar simple road conditions. KITTI (?) is a dataset for different computer vision tasks such as stereo, optical flow, 2D/3D object detection, and tracking. However, the total time of the dataset with tracklets is about 22 minutes. In addition, there are few interactions between vehicles, pedestrians and cyclists in KITTI, which makes it insufficient for exploring the motion patterns of traffic-agents in challenging traffic conditions. There are some pedestrian trajectory datasets like ETH (?), UCY (?), etc., but such datasets only focus on human crowds without any vehicles.

TrafficPredict

In this section, we present our novel algorithm to predict the trajectories of different traffic-agents.

We assume each scene is preprocessed to get the categories and spatial coordinates of traffic-agents. At any time $t$ , the feature of the $i$ th traffic-agent $A_{i}^{t}$ can be denoted as $f_{i}^{t}=(x_{i}^{t},y_{i}^{t},c_{i}^{t})$ , where the first two items are coordinates in the x-axis and y-axis respectively, and the last item is the category of the traffic-agent. In our dataset, we currently take into account three types of traffic-agents, $c_{i}\in\{1,2,3\}$ , where $1$ stands for pedestrians, $2$ represents bicycles and $3$ denotes cars. Our approach can be easily extended to take into account more agent types. Our task is to observe features of all the traffic-agents in the time interval $[1:T_{obs}]$ and then predict their discrete positions at $[T_{obs}+1:T_{pred}]$ .
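As a minimal sketch of this problem setup, the per-frame features and the split into observed and predicted intervals could be represented as follows. The container layout and the function name `split_sequence` are hypothetical choices made for illustration and are not taken from the original implementation.

```python
from typing import Dict, List, Tuple

# One feature per agent per frame: (x, y, category), as in f_i^t.
Feature = Tuple[float, float, int]
# Hypothetical frame layout: agent id -> feature.
Frame = Dict[int, Feature]

# Category codes as given in the text.
PEDESTRIAN, BICYCLE, CAR = 1, 2, 3


def split_sequence(frames: List[Frame], t_obs: int) -> Tuple[List[Frame], List[Frame]]:
    """Split a sequence into observed frames [1:T_obs] and frames to predict."""
    if not 0 < t_obs < len(frames):
        raise ValueError("t_obs must leave at least one frame on each side")
    return frames[:t_obs], frames[t_obs:]
```

Here the length of `frames` plays the role of $T_{pred}$ , so the second list covers $[T_{obs}+1:T_{pred}]$ .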

4D Graph Generation

In urban traffic scenarios where various traffic-agents interact with one another, each instance has its own state, which depends on its interactions with others at any given time, and its information is continuous over time. Considering traffic-agents as instance nodes and relationships as edges, we can construct a graph in the instance layer, shown in Fig. 2(b). The edge between two instance nodes in one frame is called a spatial edge (?; ?), which transfers the interaction information between two traffic-agents in spatial space. The edge between the same instance in adjacent frames is a temporal edge, which passes the historical information frame by frame in temporal space. The feature of the spatial edge $(A_{i}^{t},A_{j}^{t})$ for $A_{i}^{t}$ can be computed as $f_{ij}^{t}=(x_{ij}^{t},y_{ij}^{t},c_{ij}^{t})$ , where $x_{ij}^{t}=x_{j}^{t}-x_{i}^{t}$ and $y_{ij}^{t}=y_{j}^{t}-y_{i}^{t}$ give the position of $A_{j}^{t}$ relative to $A_{i}^{t}$ , and $c_{ij}^{t}$ is a unique encoding for $(A_{i}^{t},A_{j}^{t})$ . When traffic-agent $A_{j}$ considers the spatial edge, the edge is represented as $(A_{j}^{t},A_{i}^{t})$ . The feature of the temporal edge $(A_{i}^{t},A_{i}^{t+1})$ is computed in the same way.
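The edge features described above can be sketched in a few lines. This is an illustration under stated assumptions: the text does not specify how $c_{ij}^{t}$ is encoded, so the `pair_code` helper below (an index over ordered category pairs) is a hypothetical choice, as are the function names.

```python
from typing import Dict, Tuple

Feature = Tuple[float, float, int]  # (x, y, category) with categories 1..3


def pair_code(c_i: int, c_j: int, num_categories: int = 3) -> int:
    """Hypothetical unique code for an ordered category pair (assumption)."""
    return (c_i - 1) * num_categories + (c_j - 1)


def spatial_edge_features(frame: Dict[int, Feature]) -> Dict[Tuple[int, int], Tuple[float, float, int]]:
    """Compute f_ij = (x_j - x_i, y_j - y_i, c_ij) for every ordered pair i != j."""
    edges = {}
    for i, (xi, yi, ci) in frame.items():
        for j, (xj, yj, cj) in frame.items():
            if i != j:
                edges[(i, j)] = (xj - xi, yj - yi, pair_code(ci, cj))
    return edges


def temporal_edge_feature(now: Feature, nxt: Feature) -> Tuple[float, float, int]:
    """Same construction for the edge (A_i^t, A_i^{t+1}) of a single instance."""
    return (nxt[0] - now[0], nxt[1] - now[1], pair_code(now[2], nxt[2]))
```

Note that the edges are directed: `(i, j)` and `(j, i)` carry opposite relative positions, matching the two views $(A_{i}^{t},A_{j}^{t})$ and $(A_{j}^{t},A_{i}^{t})$ .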

It is commonly observed that traffic-agents of the same kind have similar behavior characteristics. For example, pedestrians have not only similar velocities but also similar reactions to other nearby traffic-agents. These similarities are directly reflected in their trajectories. We construct a super node $C_{u}^{t},u\in\{1,2,3\}$ for each kind of traffic-agent to learn the similarities of their trajectories and then use that super node to refine the prediction for instances. Fig. 2(c) shows the graph in the category layer. All instances of the same type are integrated into one group, and each group has an edge oriented toward the corresponding super node. After summarizing the motion similarities, the super node passes the guidance through an oriented edge back to the group of instances. There are also temporal edges between the same super node in sequential frames. This category layer is specially designed for heterogeneous traffic and can make full use of the data to extract valuable information to improve the prediction results. This layer is very flexible and can easily handle situations in which some categories are absent from some frames.

Finally, we get the 4D Graph for a traffic sequence with two dimensions for traffic-agents and their interactions, one dimension for time series, and one dimension for high-level categories. By this 4D Graph, we construct an information network for the entire traffic. All the information can be delivered and utilized through the nodes and edges of the graph.
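To make the structure concrete, the following sketch enumerates the connectivity of such a graph for a short sequence: spatial edges per frame, temporal edges between adjacent frames, and the per-category groups that feed the super nodes. The function name `build_graph` and the dictionary layout are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Frame = Dict[int, Tuple[float, float, int]]  # agent id -> (x, y, category)


def build_graph(frames: List[Frame]) -> dict:
    """List spatial edges, temporal edges, and category groups for each frame."""
    spatial, temporal, groups = [], [], []
    for t, frame in enumerate(frames):
        ids = sorted(frame)
        # Spatial edges: every ordered pair of distinct instances in the frame.
        spatial.append([(i, j) for i in ids for j in ids if i != j])
        # Category groups: instances that connect to the same super node.
        by_category = defaultdict(list)
        for i in ids:
            by_category[frame[i][2]].append(i)
        groups.append(dict(by_category))
        # Temporal edges: instances that also exist in the next frame.
        if t + 1 < len(frames):
            temporal.append([i for i in ids if i in frames[t + 1]])
    return {"spatial": spatial, "temporal": temporal, "groups": groups}
```

A category that is absent from a frame simply has no entry in that frame's groups, which mirrors the flexibility of the category layer noted earlier.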

Model Architecture

Our TrafficPredict algorithm is based on the 4D Graph, which consists of two main layers: the instance layer and the category layer. Details are given below.

The instance layer aims to capture the movement pattern of instances in traffic. For each instance node $A_{i}$ , we have an LSTM, represented as $L_{i}$ . Because different kinds of traffic-agents have different dynamic properties and motion rules, only instances of the same type share the same parameters. There are three types of traffic-agents in our dataset: vehicles, bicycles, and pedestrians. Therefore, we have three different LSTMs for instance nodes. We also assign an LSTM $L_{ij}$ to each edge $(A_{i},A_{j})$ of the graph. All the spatial edges share the same parameters, and the temporal edges are classified into three types according to the corresponding node type.

For the edge LSTM $L_{ij}$ at any time $t$ , we embed the feature $f_{ij}^{t}$ into a fixed-length vector $e_{ij}^{t}$ , which is used as the input to the LSTM:

$$e_{ij}^{t}=\phi(f_{ij}^{t};W_{spa}^{e}),\tag{1}$$

$$h_{ij}^{t}=\mathrm{LSTM}(h_{ij}^{t-1},e_{ij}^{t};W_{spa}^{r}),\tag{2}$$

where $\phi(\cdot)$ is an embedding function, $h_{ij}^{t}$ is the hidden state and also the output of LSTM $L_{ij}$ , $W_{spa}^{e}$ are the embedding weights, and $W_{spa}^{r}$ are the LSTM cell weights, which contain the movement pattern of the instance itself. LSTMs for temporal edges $L_{ii}$ are defined in a similar way with parameters $W_{tem}^{e}$ and $W_{tem}^{r}$ .

Each instance node may connect with several other instance nodes via spatial edges. However, each of the other instances has a different impact on the node’s behavior. We use a soft attention mechanism (?) to assign different weights to the spatial edges of one instance node:

$$w(h_{ij}^{t})=\mathrm{softmax}\left(\frac{m}{\sqrt{d_{e}}}\,Dot(W_{i}h_{ii}^{t},W_{ij}h_{ij}^{t})\right),\tag{3}$$

where $W_{i}$ and $W_{ij}$ are embedding weights, $Dot(\cdot)$ is the dot product, and $\frac{m}{\sqrt{d_{e}}}$ is a scaling factor (?). The final weights are ratios of $w(h_{ij}^{t})$ to the sum. The output vector $H_{i}^{t}$ is computed as a weighted sum of $h_{ij}^{t}$ . $H_{i}^{t}$ stands for the influence exerted on an instance’s trajectory by surrounding traffic-agents, and $h_{ii}^{t}$ denotes the information passed along temporal edges. We concatenate them and embed the result into a fixed-length vector $a_{i}^{t}$ . The node feature $f_{i}^{t}$ , after embedding, is finally concatenated with $a_{i}^{t}$ to feed the instance LSTM $L_{i}$ .

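$$e_{i}^{t}=\phi(f_{i}^{t};W_{ins}^{e}),\tag{4}$$

$$a_{i}^{t}=\phi(\mathrm{concat}(h_{ii}^{t},H_{i}^{t});W_{ins}^{a}),\tag{5}$$

$$h1_{i}^{t}=\mathrm{LSTM}(h2_{i}^{t-1},\mathrm{concat}(e_{i}^{t},a_{i}^{t});W_{ins}^{r}),\tag{6}$$

where $W_{ins}^{e}$ and $W_{ins}^{a}$ are the embedding weights, $W_{ins}^{r}$ is the LSTM cell weight for the instance node, and $h1_{i}^{t}$ is the first hidden state of the instance LSTM. $h2_{i}^{t-1}$ is the final hidden state of the instance LSTM in the previous frame, which will be described in the next section.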
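The soft attention over spatial edges can be illustrated with a small, dependency-free sketch. It assumes that the query is the projected temporal-edge state $h_{ii}^{t}$ , that $m$ is the number of spatial edges of the node, and that $d_{e}$ is the dimension of the projected vectors; these readings, along with the function names, are assumptions rather than details confirmed by the text.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def matvec(W: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in W]


def spatial_attention(h_ii: Vector, h_ijs: List[Vector],
                      W_i: Matrix, W_ij: Matrix) -> Tuple[List[float], Vector]:
    """Return the attention weights and the weighted sum H_i of spatial-edge states."""
    query = matvec(W_i, h_ii)
    scale = len(h_ijs) / math.sqrt(len(query))  # assumed reading of m / sqrt(d_e)
    scores = [scale * dot(query, matvec(W_ij, h)) for h in h_ijs]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    H_i = [sum(w * h[k] for w, h in zip(weights, h_ijs))
           for k in range(len(h_ijs[0]))]
    return weights, H_i
```

Edges whose projected state aligns with the node's own temporal state receive larger weights and therefore contribute more to $H_{i}^{t}$ .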

Category Layer

Usually, traffic-agents of the same category have similar dynamic properties, including speed, acceleration, and steering, and similar reactions to other kinds of traffic-agents or the whole environment. If we can learn the movement patterns from instances of the same category, we can better predict trajectories for all the instances. The category layer is based on the graph in Fig. 2(c). There are four important components: the super node for a specified category, the directed edge from a group of instances to the super node, the directed edge from the super node to instances, and the temporal edges for super nodes.

Taking one super node with three instances as an example, the architecture of the category layer is shown in Fig. 3. Assume there are $n$ instances belonging to the same category in the current frame. We have already obtained the hidden state $h1$ and the cell state $c$ from the instance LSTM, which are the inputs to the category layer. Because the cell state $c$ contains the historical trajectory information of the instance, a self-attention mechanism (?) is applied to $c$ through a softmax operation to explore the pattern of the internal sequence. At time $t$ , the movement feature $d$ for the $m$ th instance in the category is computed as follows:

$$d_{m}^{t}=h1_{m}^{t}\otimes\mathrm{softmax}(c_{m}^{t}).\tag{7}$$

Then, we obtain the feature $F_{u}^{t}$ for the corresponding super node $C_{u}^{t}$ by averaging the movement features of all the instances in the category:

$$F_{u}^{t}=\frac{1}{n}\sum_{m=1}^{n}d_{m}^{t}.\tag{8}$$

$F_{u}^{t}$ captures valid trajectory information from instances and learns the internal movement law of the category. Equations (7)-(8) show the process of transferring information on the directed edge from a group of instances to the super node.
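The transfer from a group of instances to its super node can be sketched as follows. The sketch reads the movement feature as the element-wise product of $h1$ with a softmax over the cell state $c$ , followed by an average over the $n$ instances; treating the product as element-wise is an assumption, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(v: Vector) -> Vector:
    peak = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def movement_feature(h1: Vector, c: Vector) -> Vector:
    """d_m: hidden state gated by softmax attention over the cell state."""
    return [h * a for h, a in zip(h1, softmax(c))]


def super_node_feature(h1s: List[Vector], cs: List[Vector]) -> Vector:
    """F_u: average of the movement features of all instances in one category."""
    feats = [movement_feature(h1, c) for h1, c in zip(h1s, cs)]
    n = len(feats)
    return [sum(f[k] for f in feats) / n for k in range(len(feats[0]))]
```

Because the average is taken over whichever instances are present, the same code applies to any group size $n\geq 1$ .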

The feature $F_{uu}^{t}$ of the temporal edge of the super node is computed as $F_{u}^{t}-F_{u}^{t-1}$ . Take $W_{st}^{e}$ as the embedding weights and $W_{st}^{r}$ as the LSTM cell weights. The LSTM of the temporal edge between the same super node in adjacent frames can be computed as follows:

$$e_{uu}^{t}=\phi(F_{uu}^{t};W_{st}^{e}),\tag{9}$$

$$h_{uu}^{t}=\mathrm{LSTM}(h_{uu}^{t-1},e_{uu}^{t};W_{st}^{r}).\tag{10}$$

Next, we integrate the information from the group of instances and the temporal edge as the input to the super node. We embed the feature $F_{u}^{t}$ into a fixed-length vector and then concatenate it with $h_{uu}^{t}$ . The hidden state $h_{u}^{t}$ of the super node is obtained as follows.
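One way to write this step, consistent with the notation used for the edge and instance LSTMs, is sketched below. The embedded vector $e_{u}^{t}$ and the weight names $W_{sup}^{e}$ and $W_{sup}^{r}$ are assumed here for illustration, since the surrounding text does not name them.

```latex
\begin{align}
e_{u}^{t} &= \phi(F_{u}^{t}; W_{sup}^{e}), \tag{11}\\
h_{u}^{t} &= \mathrm{LSTM}(h_{u}^{t-1}, \mathrm{concat}(e_{u}^{t}, h_{uu}^{t}); W_{sup}^{r}). \tag{12}
\end{align}
```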

Finally, we describe the process of transferring guidance on the directed edge from the super node to instances. For the $m$ th instance in the group, the hidden state of the super node is concatenated with the first hidden state $h1_{m}^{t}$ and then embedded into a vector with the same length as $h1_{m}^{t}$ . The second hidden state $h2_{m}^{t}$ is the final output of the instance node:

$$h2_{m}^{t}=\phi(\mathrm{concat}(h1_{m}^{t},h_{u}^{t});W_{s}^{r}),\tag{13}$$

where $W_{s}^{r}$ are the embedding weights. Through the category layer, we use the similarity among instances of the same type to refine the trajectory prediction for each instance.

Position estimation

We assume that the position of the traffic-agent in the next frame follows a bivariate Gaussian distribution, as in (?), with parameters including the mean $\mu_{i}^{t}=(\mu_{x},\mu_{y})_{i}^{t}$ , standard deviation $\sigma_{i}^{t}=(\sigma_{x},\sigma_{y})_{i}^{t}$ and correlation coefficient $\rho_{i}^{t}$ . The corresponding position can be represented as follows:

$$(x_{i}^{t},y_{i}^{t})\sim\mathcal{N}(\mu_{i}^{t},\sigma_{i}^{t},\rho_{i}^{t}).\tag{14}$$

The second hidden state of each traffic-agent at any time is used to predict these parameters by linear projection.
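Written out, the projection might take the following form. The weight name $W_{f}$ and the use of the hidden state from frame $t-1$ to parameterize the distribution at frame $t$ are assumptions that follow from the statement that the distribution describes the position in the next frame.

```latex
\begin{equation}
[\mu_{i}^{t}, \sigma_{i}^{t}, \rho_{i}^{t}] = \phi(h2_{i}^{t-1}; W_{f}). \tag{15}
\end{equation}
```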

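The loss function is defined by the negative log-likelihood $L_{i}$ :

$$L_{i}=-\sum_{t=T_{obs}+1}^{T_{pred}}\log P(x_{i}^{t},y_{i}^{t}\mid\mu_{i}^{t},\sigma_{i}^{t},\rho_{i}^{t}).\tag{16}$$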

We train the model by minimizing the loss for all the trajectories in the training dataset. We jointly back-propagate through instance nodes, super nodes, and spatial and temporal edges to update all the parameters and minimize the loss at each time-step.
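The negative log-likelihood of a bivariate Gaussian can be evaluated directly from the five predicted parameters. The following sketch uses the standard closed-form density; the function names and the parameter ordering are hypothetical, and a practical implementation would also constrain $\sigma>0$ and $|\rho|<1$ when mapping network outputs to parameters.

```python
import math
from typing import List, Tuple

Params = Tuple[float, float, float, float, float]  # (mu_x, mu_y, sigma_x, sigma_y, rho)


def bivariate_gaussian_nll(x: float, y: float, mu_x: float, mu_y: float,
                           sigma_x: float, sigma_y: float, rho: float) -> float:
    """Negative log-density of (x, y) under a bivariate Gaussian."""
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    one_minus_rho2 = 1.0 - rho * rho
    quad = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / (2.0 * one_minus_rho2)
    log_norm = math.log(2.0 * math.pi * sigma_x * sigma_y * math.sqrt(one_minus_rho2))
    return quad + log_norm


def trajectory_loss(targets: List[Tuple[float, float]], params: List[Params]) -> float:
    """Sum the per-frame negative log-likelihoods over the prediction interval."""
    return sum(bivariate_gaussian_nll(x, y, *p) for (x, y), p in zip(targets, params))
```

For a unit-variance, uncorrelated Gaussian evaluated at its mean, the per-frame loss reduces to $\log(2\pi)$ , which is a convenient sanity check.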

Experiments

We use an Apollo acquisition car (?) to collect traffic data, including camera-based images and LiDAR-based point clouds, and generate trajectories by detection and tracking.

Our new dataset is a large-scale dataset for urban streets, which focuses on trajectories of heterogeneous traffic-agents for planning, prediction and simulation tasks. Our acquisition car runs in urban areas during rush hours in the scenarios shown in Fig. 4. The data is generated from a variety of sensors, including LiDAR (Velodyne HDL-64E S3), radar (Continental ARS408-21), camera, high-definition maps and a localization system at 10 Hz. We provide camera images and trajectory files in the dataset. The perception output includes the timestamp and each traffic-agent’s ID, category, position, velocity, heading angle, and bounding polygon. The dataset includes RGB videos with 100K $1920\times 1080$ images and around 1000 km of trajectories for all kinds of moving traffic-agents. A comparison of NGSIM, KITTI (with tracklets), and our dataset is shown in Table 1. Because NGSIM has a very large, top-down view, it has a large number of vehicles per frame. In this paper, each continuous sequence of the dataset was isometrically normalized for experiments. Our new dataset has been released online (?).

Evaluation Metrics and Baselines

We use the following metrics (?; ?) to measure the performance of algorithms used for predicting the trajectories of traffic-agents.

Average displacement error: The mean Euclidean distance between all the predicted positions and the true positions during the prediction time.

Final displacement error: The mean Euclidean distance between the final predicted positions and the corresponding true locations.
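Both metrics are straightforward to compute from paired trajectories. The sketch below assumes each trajectory is a list of $(x,y)$ positions over the prediction interval and that predicted and true trajectories are aligned frame by frame; the function names are hypothetical.

```python
import math
from typing import List, Tuple

Trajectory = List[Tuple[float, float]]


def average_displacement_error(pred: List[Trajectory], truth: List[Trajectory]) -> float:
    """Mean Euclidean distance over every predicted position of every agent."""
    dists = [math.dist(p, q)
             for tp, tq in zip(pred, truth)
             for p, q in zip(tp, tq)]
    return sum(dists) / len(dists)


def final_displacement_error(pred: List[Trajectory], truth: List[Trajectory]) -> float:
    """Mean Euclidean distance between final predicted and final true positions."""
    dists = [math.dist(tp[-1], tq[-1]) for tp, tq in zip(pred, truth)]
    return sum(dists) / len(dists)
```

The final displacement error isolates how far a prediction has drifted by the end of the horizon, whereas the average displacement error reflects accuracy along the whole path.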

We compare our approach with the following methods:

RNN ED (ED): An RNN encoder-decoder model, which is widely used in motion and trajectory prediction for vehicles.

Social LSTM (SL): An LSTM-based network with social pooling of hidden states (?). The model performs better than traditional methods, including the linear model, the Social force model, and Interacting Gaussian Processes.

Social Attention (SA): An attention-based S-RNN architecture (?), which learns the relative influence of agents in the crowd and predicts pedestrian trajectories.

TrafficPredict-NoCL (TP-NoCL): The proposed method without the category layer.

TrafficPredict-NoSA (TP-NoSA): The proposed method without the self-attention mechanism of the category layer.

Implementation Details

In our evaluation benchmarks, the dimension of the hidden state of the spatial and temporal edge cells is set to 128 and that of the node cells is set to 64 (for both the instance layer and the category layer). We also use a fixed input embedding dimension of 64 and an attention layer dimension of 64. During training, Adam (?) optimization is applied with $\beta_{1}$ =0.9 and $\beta_{2}$ =0.999. The learning rate is set to 0.001 and a staircase weight decay is applied. The model is trained on a single Tesla K40 GPU with a batch size of 8. For training stability, we clip the gradients to the range -10 to 10. During the computation of predicted trajectories, we observe trajectories for 2 seconds and predict the future trajectories for the next 3 seconds.
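For reference, the settings listed above can be collected into a single configuration object. The class and field names below are hypothetical, and the clipping helper only illustrates element-wise clipping to the stated range; the staircase decay schedule is not specified in enough detail to sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as stated in the text; field names are illustrative."""
    edge_hidden_size: int = 128
    node_hidden_size: int = 64
    embedding_size: int = 64
    attention_size: int = 64
    adam_beta1: float = 0.9
    adam_beta2: float = 0.999
    learning_rate: float = 0.001
    batch_size: int = 8
    grad_clip: float = 10.0
    observe_seconds: float = 2.0
    predict_seconds: float = 3.0


def clip_gradients(grads: List[float], limit: float) -> List[float]:
    """Clip each gradient value to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]
```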

Analysis

The performance of all the prior methods and our algorithm on heterogeneous traffic datasets is shown in Table 2. We compute the average displacement error and the final displacement error for all the instances, and we also report the error for pedestrians, bicycles and vehicles separately. The Social Attention (SA) model considers the spatial relations of instances and has smaller error than RNN ED and Social LSTM. Our method without the category layer (TP-NoCL) not only considers the interactions between instances but also distinguishes between instances by using different LSTMs. Its error is similar to that of SA. By adding the category layer without self-attention, the prediction results of TP-NoSA are more accurate in terms of both metrics. The accuracy improvement becomes more evident after we use the self-attention mechanism in the design of the category layer. Our algorithm, TrafficPredict, performs better in terms of all the metrics, with about a 20% improvement in accuracy. This indicates that the category layer has learned the inherent movement patterns for traffic-agents of the same type and provides good guidance for prediction. The combination of the instance layer and the category layer makes our algorithm more applicable in heterogeneous traffic conditions.

We illustrate some prediction results on the corresponding 2D images in Fig. 5. The image captured by the front-facing camera does not show the entire scenario. However, it is more intuitive to project the trajectory results onto the image. In most heterogeneous traffic scenarios, our algorithm computes a reasonably accurate predicted trajectory that is close to the ground truth. If we have prior trajectories over a longer duration, the prediction accuracy increases.

When traffic-agents are moving on straight lanes, it is easy to predict their trajectories because almost all the traffic-agents are moving in a straight line. It is more challenging to provide accurate predictions at intersections, as the agents are turning. Fig. 5 shows 2D experimental results of two sequences at intersections. There are some overlaps between trajectories. In these scenarios, there are many curves with high curvature because of left turns. Our algorithm can compute accurate predicted trajectories in these cases.

Conclusion

In this paper, we have presented a novel LSTM-based algorithm, TrafficPredict, for predicting the trajectories of heterogeneous traffic-agents in urban environments. We use an instance layer to capture the trajectories and interactions of instances, and a category layer to summarize the similarities in movement patterns of instances belonging to the same type and to guide the prediction algorithm. All the information in spatial and temporal space can be leveraged and transferred in our 4D Graph. Our method outperforms previous state-of-the-art approaches in the accuracy of trajectory prediction on our newly collected dataset for heterogeneous traffic. We have evaluated our algorithm on traffic datasets corresponding to dense urban scenarios and observe good accuracy. Our algorithm runs in real time and makes no assumptions about the traffic conditions or the number of agents.

Our approach has some limitations. Its accuracy varies based on traffic conditions and the duration of past trajectories. In the future, we will consider more constraints, such as lane directions, traffic signals, and traffic rules, to further improve the accuracy of trajectory prediction. Furthermore, we would like to evaluate the performance in denser scenarios.

Acknowledgement

Dinesh Manocha is supported in part by ARO Contract W911NF16-1-0085, and Intel. We appreciate all the people who offered help for collecting the dataset.