Decentralized Non-communicating Multiagent Collision Avoidance with Deep Reinforcement Learning

Yu Fan Chen, Miao Liu, Michael Everett, Jonathan P. How

I Introduction

Collision avoidance is central to many robotics applications, such as multiagent coordination , autonomous navigation through human crowds , pedestrian motion prediction , and computer crowd simulation . Yet, finding collision-free, time efficient paths around other agents remains challenging, because it may require predicting other agents’ motion and anticipating interaction patterns, through a process that needs to be computationally tractable for real-time implementation.

If there is a reliable communication network for agents to broadcast their intents (e.g. goals, planned paths), then collision avoidance can be enforced through a centralized planner. For instance, collision avoidance requirements can be formulated as separation constraints in an optimization framework for finding a set of jointly feasible and collision-free paths . However, centralized path planning methods can be computationally prohibitive for large teams . To attain better scalability, researchers have also proposed distributed algorithms based on message-passing schemes , which resolve local (e.g. pairwise) conflicts without needing to form a joint optimization problem between all members of the team.

This work focuses on scenarios where communication cannot be reliably established, which naturally arises when considering human-robot interactions and can also be caused by hardware constraints or failures. This limitation poses additional challenges for collision avoidance, because mobile agents would need to cooperate without necessarily having knowledge of the other agent’s intents. Existing work on non-communicating collision avoidance can be broadly classified into two categories, reaction-based and trajectory-based. Reaction-based methods specify one-step interaction rules for the current geometric configuration. For example, reciprocal velocity obstacle (RVO) is a reaction-based method that adjusts each agent’s velocity vector to ensure collision-free navigation. However, since reaction-based methods do not consider evolution of the neighboring agents’ future states, they are short-sighted in time and have been found to create oscillatory and unnatural behaviors in certain situations .

In contrast, trajectory-based methods explicitly account for evolution of the joint (agent and neighbors) future states by anticipating other agents’ motion. A subclass of non-cooperative approaches propagates the other agents’ dynamics forward in time, and then plans a collision-free path with respect to the other agents’ predicted paths. However, in crowded environments, the set of predicted paths often marks a large portion of the space untraversable/unsafe, which leads to the freezing robot problem . A key to resolving this issue is to account for interactions, such that each agent’s motion can affect one another. Thereby, a subclass of cooperative approaches has been proposed, which first infers the other agents’ intents (e.g. goals), then plans a set of jointly feasible paths for all neighboring agents in the environment. Cooperative trajectory-based methods often produce paths with better quality (e.g. shorter time for all agents to reach their goal) than that of reaction-based methods . However, planning paths for all other agents is computationally expensive, and such cooperative approach typically requires more information than is readily available (e.g. other agent’s intended goal). Moreover, due to model and measurement uncertainty, the other agents’ actual paths might not conform to the planned/predicted paths, particularly beyond a few seconds into the future. Thus, trajectory-based methods also need to be run at a high (sensor update) rate, which exacerbates the computational problem.

The major difficulty in multiagent collision avoidance is that anticipating evolution of joint states (paths) is desirable but computationally prohibitive. This work seeks to address this issue through reinforcement learning – to offload the expensive online computation to an offline training procedure. Specifically, this work develops a computationally efficient (i.e., real-time implementable) interaction rule by learning a value function that implicitly encodes cooperative behaviors.

The main contributions of this work are (i) a two-agent collision avoidance algorithm based on a novel application of deep reinforcement learning, (ii) a principled way for generalizing to more ( $n>2$ ) agents, (iii) an extended formulation to capture kinematic constraints, and (iv) simulation results that show significant improvement in solution quality compared with existing reaction-based methods.

II Problem Formulation

where Equation 2 is the collision avoidance constraint, Equation 3 is the goal constraint, Equation 4 is the agents’ kinematics, and the expectation in Equation 1 is with respect to the other agent’s policy and hidden states (intents). Note that static obstacles can be modeled as stationary agents, which will be discussed in more details in Section IV-D.

II-B Reinforcement Learning

Reinforcement learning (RL) is a class of machine learning methods for solving sequential decision making problems with unknown state-transition dynamics. Typically, a sequential decision making problem can be formulated as a Markov decision process (MDP), which is defined by a tuple $M=\langle S,A,P,R,\gamma\rangle$ , where $S$ is the state space, $A$ is the action space, $P$ is the state-transition model, $R$ is the reward function, and $\gamma$ is a discount factor. By detailing each of these elements and relating to Equation 1-Equation 4, the following provides a RL formulation of the two-agent collision avoidance problem.

Action space

The action space is the set of permissible velocity vectors. Here, it is assumed that an agent can travel in any direction at any time, that is $\mathbf{a}(\mathbf{s})=\mathbf{v}$ for $||\mathbf{v}||_{2}<v_{pref}$ . It is also straightforward to impose kinematic constraints, which will be explored in Section III-D.

Reward function

A reward function is specified to award the agent for reaching its goal Equation 3, and penalize the agent for getting too close or colliding with the other agent Equation 2,

State transition model

Value function

The objective is to find the optimal value function

where $\gamma\in[0,1)$ is a discount factor. Recall $v_{pref}$ is an agent’s preferred speed and is typically time invariant. It is used here as a normalization factor for numerical reasons, because otherwise the value function of a slow moving agent could be very small. The optimal policy can be retrieved from the value function, that is

This work chooses to optimize $V(\mathbf{s}^{jn})$ rather than the more common choice $Q(\mathbf{s}^{jn},\mathbf{a})$ , because unlike previous works that focus on discrete, finite action spaces , the action space here is continuous and the set of permissible velocity vectors depends on the agent’s state (preferred speed).

III Approach

The following presents an algorithm for solving the two-agent RL problem formulated in Section II-B, and then generalizes its solution (policy) to multiagent collision avoidance. While applications of RL are typically limited to discrete, low-dimensional domains, recent advances in Deep RL have demonstrated human-level performance in complex, high-dimensional spaces. Since the joined state vector $\mathbf{s}^{jn}$ is in a continuous 14 dimensional space, and because a large amount of training data can be easily generated in a simulator, this work employs a fully connected deep neural network with ReLU nonlinearities to parametrize the value function, as shown in Fig. 3(a). This value network is denoted by $V(\cdot;\mathbf{w})$ , where $\mathbf{w}$ is the set of weights in the neural network.

From a geometric perspective, there is some redundancy in the parameterization of the system’s joint state $\mathbf{s}^{jn}$ , because the optimal policy should be invariant to any coordinate transformation (rotation and translation). To remove this ambiguity, an agent-centric frame is defined, with the origin at the agent’s position, and the x-axis pointing toward the agent’s goal, that is,

III-B Generating Paths Using a Value Network

III-C Training a Value Network

The training procedure, outlined in Algorithm 2, consists of two major steps. First, the value network is initialized by supervised training on a set of trajectories generated by a baseline policy (line 3).

Specifically, each training trajectory is processed to generate a set of state-value pairs, $\{(\mathbf{s}^{jn},\;y)_{k}\}_{k=1}^{N}$ , where $y=\gamma^{{t_{g}}\cdot{v_{pref}}}$ and $t_{g}$ is the time to reach goal. The value network is trained by back-propagation to minimize a quadratic regression error, $\operatorname*{\arg\!\min}_{\mathbf{w}}\sum_{k=1}^{N}\left(y_{k}-V(\mathbf{s}^{jn}_{k};\mathbf{w})\right)^{2}$ . This work uses optimal reciprocal collision avoidance (ORCA) to generate a training set of 500 trajectories, which contains approximately 20,000 state-value pairs.

We make a few remarks about this initialization step. First, the training trajectories do not have to be optimal. For instance, two of the training trajectories generated by ORCA are shown in Fig. 4(a). The red agent was pushed away by the blue agent and followed a large arc before reaching its goal. Second, the initialization training step is not simply emulating the ORCA policy. Rather, it learns a time to goal estimate (value function), which can then be used to generate new trajectories following Algorithm 1. Evidently, this learned value function sometimes generates better (i.e. shorter time to goal) trajectories than ORCA, as shown in Fig. 4(b). Third, this learned value function is likely to be suboptimal. For instance, the minimum separation $d_{min}$ between the two agents should be around 0.2m by Equation 5, but $d_{min}$ is greater than 0.4m (too much slack) in Fig. 4(b).

The second training step refines the policy through reinforcement learning. Specifically, a few random test cases are generated in each episode (line 8), and two agents are simulated to navigate around each other using an $\epsilon$ -greedy policy, which selects a random action with probability $\epsilon$ and follows the value network greedily otherwise (line 9). The simulated trajectories are then processed to generate a set of state-value pairs (line 10). For convergence reasons, as explained in , rather than being used to update the value network immediately, the newly generated state-value pairs are assimilated (replacing older entries) into a large experience set $E$ (line 11). Then, a set of training points is randomly sampled from the experience set, which contains state-value pairs from many different simulated trajectories (line 12). The value network is finally updated by stochastic gradient descent (back-propagation) on the sampled subset.

To monitor convergence of the training process, the value network is tested regularly on a few pre-defined evaluation test cases (line 15), two of which are shown in Fig. 4. Note that the generated paths become tighter as a function of the number of training episodes (i.e. $d_{min}$ reduces from 0.4m to 0.2m). A plot of the test cases’ values $V(\mathbf{s}^{jn}_{0})$ shows that the value network has converged in approximately 800 episodes (Fig. 3(b)). It is important to point out that training/learning is performed on randomly generated test cases (line 8), but not on the evaluation test cases.

III-D Incorporating Kinematic Constraints

Kinematics constraints need to be considered for operating physical robots. Yet, in many existing works, such constraints can be difficult to encode and might lead to a substantial increase in computational complexity . In contrast, it is straightforward to incorporate kinematic constraints in the RL framework. We impose rotational constraints,

where Equation 9 limits the direction that an agent can travel, and Equation 10 specifies a maximum turning rate that corresponds to a minimum turning radius of 1.0m. Figure 5(a) illustrates the application of these rotational constraints to the same test case in Fig. 2(a). Here, the red agent chooses to slow down given the set of more constrained actions. Notice the agent is allowed to spin on the spot, which leads to an interesting optimal control problem when an agent’s current heading angle is not perfectly aligned with its goal. In this case, an agent can either travel at its full speed while turning toward its goal, or first spin on the spot before traveling in a straight line. Figure 5(b) shows that CADRL has learned a policy that balances between these two options to minimize the time to goal. With the thin lines showing its heading angle, the red agent chooses to initially turn on the spot, and then start moving before its heading angle is perfectly aligned with its goal.

III-E Multiagent Collision Avoidance

Note that the agent’s projected next state $\hat{\mathbf{s}}_{t+1}$ also depends on the selected action $\mathbf{a}_{t}$ . Although using a two-agent value network, CADRL can produce multiagent trajectories that exhibit complex interaction patterns. Figure 6(a) shows six agents each moving to the opposite side of a circle. The agents veer more than the two-agent case (bottom row of Fig. 4), which makes more room in the center to allow every agent to pass smoothly. Figure 6(b) shows three pairs of agents swapping position horizontally. The pair in the center slows down near the origin so the outer agents can pass first. Both cases demonstrate that CADRL can produce smooth, natural looking trajectories for multiagent systems, which will be explored in further detail in Section IV. However, we acknowledge that Equation 11 is only an approximation to a true multiagent RL value function – an $n$ -agent value network by simulating $n$ agents navigating around each other – which will be studied for future work.

IV Results

This work uses a neural network with three hidden layers of width $(150,\,100,\,100)$ , which is the size chosen to achieve real-time performance.We also experimented with other network structures. For example, a network with three hidden layers of width $(300,\,200,\,200)$ produced similar results (paths) but was twice as slow. In particular, on a computer with an i7-5820K CPU, a Python implementation of CADRL (Algorithm 1), on average, takes 5.7ms per iteration on two-agent collision avoidance problems. By inspection of Equation 11, computational time scales linearly in the number of neighboring agents for a decentralized implementation where each agent runs CADRL individually; and scales quadratically in a centralized implementation where one computer controls all agents. For decentralized control on ten agents, each iteration of CADRL is observed to take 62ms. Moreover, CADRL is parallelizable because it consists of a large number of independent queries of the value network Equation 11.

Furthermore, offline training (Algorithm 2) took less than three hours and is found to be quite robust. In particular, using mini-batches of size 500, the initialization step (line 3) took 9.6 minutes to complete 10,000 iterations of back-propagation. The RL step used an $\epsilon$ -greedy policy, where $\epsilon$ decays linearly from 0.5 to 0.1 in the first 400 training episodes, and remains 0.1 thereafter. The RL step took approximately 2.5 hours to complete 1,000 training episodes. The entire training process was repeated on three sets of training trajectories generated by ORCA, and the value network converged in all three trials. The paths generated by the value networks from the three trials were very similar, as indicated by a less than 5% difference in time to reach goal on all test cases in the evaluation set.

IV-B Performance Comparison on a Crossing Scenario

To evaluate the performance of CADRL over a variety of test cases, we compute the average extra time spent to reach goals, that is

where $t_{i,g}$ is the $i$ th agent’s time to reach its goal, and the second term is a lower bound of $t_{i,g}$ (to go straight toward goal at the preferred speed). This metric removes the effects due to variability in the number of agents and the nominal distance to goals.

A crossing scenario is shown in Fig. 7(a), where two identical agents with goals along collision courses are run into each other at different angles. Thus, cooperation is required for avoiding collision at the origin. Figure 7(a) shows that over a wide range of angles ( $\alpha\in$ deg), agents following CADRL reached their goals much faster than that of ORCA. Recall a minimum separation of 0.2m is specified for CADRL in the reward function Equation 5. For this reason, similar to the bottom row of Fig. 4, CADRL finds paths that are slightly slower than ORCA around $\alpha=0$ . It is also interesting to note that CADRL with rotational constraints (CADRL w/ cstr) performs slightly better than the unconstrained. This is because CADRL w/ cstr is more conservative (yielding) early on, which is coincidentally good for the crossing scenario. More specifically, if the other agent has stopped (reached goal) or turned before it reached the origin, unconstrained CADRL would have performed better. In short, as will be shown later in Fig. 9, unconstrained CADRL is better on average (randomized test cases), but can be slightly worse than CADRL w/ cstr on particular test cases.

IV-C Performance Comparison on Random Test Cases

In addition to showing that CADRL can handle some difficult test cases that fared poorly for ORCA (Figs. 4 and 7), a more thorough evaluation is performed on randomly generated test cases. In particular, within square shaped domains specified in Table I, agents are generated with randomly sampled speed, radius, initial positions and goal positions. This work chooses $v_{pref}\in[0.5,1.5]$ m/s, $r\in[0.3,0.5]$ m, which are parameters similar to that of typical pedestrian motion. Also, the agents’ goals are projected to the boundary of the room to avoid accidentally creating a trap formed by multiple stationary agents. A sample four-agent test case is illustrated in Fig. 8, where agents following CADRL were able to reach their goal much faster than that of ORCA. For each configuration in Table I, one hundred test cases are generated as described above. ORCA, CADRL, and CADRL w/cstr are employed to solve for these test cases. The average extra time to reach goal, $\bar{t}_{e}$ , is computed for each set of generated trajectories and plotted in Fig. 9. Key statistics are computed and listed in Table I, and it can be seen that CADRL performs similarly (slightly better) than ORCA on the easier test cases (median), and more than 26% better on the hard test cases ( $>$ 75 percentile). Also, performance difference is more clear on test cases with more agents, which could be a result of more frequent interactions.

IV-D Navigating around Non-cooperative Agents

V Conclusion

This work developed a decentralized multiagent collision avoidance algorithm based on a novel application of deep reinforcement learning. In particular, a pair of agents were simulated to navigate around each other to learn a value network that encodes the expected time to goal. The solution (value network) to the two-agent collision avoidance RL problem was generalized in a principled way to handle multiagent ( $n>2$ ) scenarios. The proposed method was shown to be real-time implementable for a decentralized ten-agent system. Simulation results show more than 26% improvement in paths quality when compared with ORCA.

Acknowledgment

This work is supported by Ford Motor Company.