Particle Filter Networks with Application to Visual Localization

Peter Karkus, David Hsu, Wee Sun Lee

Introduction

Particle filtering, also known as the sequential Monte-Carlo method, is a powerful approach to sequential state estimation. Particle filters are used extensively in robotics, computer vision, physics, econometrics, etc., and are critical for robotic tasks such as localization, SLAM, and planning under partial observability. To apply particle filters in practice, a major challenge is to construct probabilistic system models or learn them from data. Consider, for example, robot localization with an onboard camera (Figure 1). The observation model is a probability distribution over all possible camera images, conditioned on a continuous robot state and an environment map. Learning such a model is challenging, because of the enormous observation space and the lack of sufficient labeled data. An emerging line of research circumvents the difficulty of traditional model learning: it embeds an algorithm into a deep neural network and then performs end-to-end learning to train a model optimized for the specific algorithm.

In this direction, we introduce the Particle Filter Network (PF-net), a recurrent neural network (RNN) with a differentiable algorithm prior for sequential state estimation. A PF-net encodes learnable probabilistic state-transition and observation models together with the particle filter algorithm in a single neural network (Figure 1). It is fully differentiable and trained end-to-end from data. PF-net tackles the key challenges of learning complex probabilistic system models. Neural networks are capable of representing complex models over large spaces, e.g., observation models over images. Further, the network representation unites the model and the algorithm and thus allows training end-to-end. As a result, PF-net learns system models optimized for a specific algorithm, in this case, particle filtering, instead of learning generic system models. The models may learn only the features relevant for state estimation, thus reducing the complexity of learning.


We apply PF-net to robot visual localization, which is of great interest to mobile robotics. A robot navigates in a previously unseen environment and does not know its own precise location. It must localize in a visually rich 3-D world, given only a schematic 2-D floor map and observations from onboard sensors (Figure 1). While particle filtering is the standard approach for LIDAR, we consider visual sensors, e.g., cameras. Now the probabilistic observation model must match rich 3-D visual features from camera images to crude 2-D geometric features from the map. Further, the camera images may contain various objects not in the map, e.g., furniture. This task exhibits key difficulties of state estimation from ambiguous, partial observations. A standard model-based approach would construct an observation model as a probability distribution of images conditioned on the floor map and robot pose. This is difficult, because of the enormous observation space, i.e., the space of all possible images showing various floor layouts, furniture configurations, etc. In contrast, PF-net trains a model end-to-end and learns only features relevant to the localization task.

This paper makes two contributions. First, we encode a particle filter algorithm in a neural network to learn models for sequential state estimation end-to-end. Second, we apply PF-net to visual localization and present a network architecture for matching rich visual features of a 3-D world with a schematic 2-D floor map. Simulation experiments on the House3D data set show that the learned PF-net is effective for visual localization in new, unseen environments populated with furniture. Through end-to-end training, it also outperforms a conventional model-based method; it fuses information from multiple sensors, in particular, RGB and depth cameras; and it naturally integrates semantic information for localization, such as map labels for doors and room types.

Background

The idea of differentiable algorithm priors, i.e., embedding algorithms into a deep neural network, has been gaining attention recently. It has led to promising results for graph search, path integral optimal control, quadratic optimization, and decision-making in fully observable environments and partially observable environments.

The general idea, when applied to probabilistic state estimation, has led to, e.g., the Kalman filter network and the histogram filter network. However, Kalman filtering assumes that the underlying state distribution is, or can be well approximated as, a unimodal Gaussian. Histogram filtering assumes discrete state spaces and has difficulty in scaling up to high-dimensional state spaces because of the “curse of dimensionality”. To tackle arbitrary distributions and very large discrete or continuous state spaces, one possibility is particle filtering. Concurrent to our work, Jonschkowski et al. have been independently working on the idea of differentiable particle filtering. The work is closely related, and we want to highlight several important differences. First, we propose a differentiable approximation of resampling, a crucial step for many particle filter algorithms. Next, we apply PF-net to visual localization in new, unseen environments, after learning. While the concurrent work also deals with localization, it does so in a fixed environment. Finally, our observation model for visual localization matches rich 3-D visual features with a schematic 2-D floor map, ignores objects not in the map, and fuses information from multiple sources. Neural networks have been used with particle filters in variational learning as well. Unlike the PF-net, such networks aim to parameterize a family of generative distributions over observations, thus making them unsuitable for large, complex observation spaces, such as the space of camera images and floor maps.

Particle filter methods, e.g., Monte-Carlo localization, are standard solutions to mobile robot localization. Many such methods assume a LIDAR sensor mounted on the robot and rely on simple handcrafted analytic observation models. While there have been attempts to incorporate monocular or depth cameras, constructing probabilistic observation models for them remains a challenge. PF-net learns effective system models through end-to-end training, without direct supervision on model components.

Particle filter algorithm

Particle filters periodically approximate the posterior distribution over states after an observation is received, i.e., they maintain a belief over states, $b(s)$. The belief is approximated by a set of particles, i.e., weighted samples from the probability distribution,

$$b_t(s) \approx \langle s_t^k, w_t^k \rangle_{k=1:K}, \tag{1}$$

where $\sum_{k}w_{k}=1$, $K$ is the number of particles, $s_{k}$ is the particle state, $w_{k}$ is the particle weight, and $t$ denotes time. Importantly, the particle set can approximate arbitrary distributions, e.g., continuous, multimodal, non-Gaussian distributions. The state estimate can be computed by the weighted mean, $\textstyle{\overline{s}_{t}=\sum_{k}{w_{t}^{k}s_{t}^{k}}}$. The particles are periodically updated in a Bayesian manner. First, the particle states are updated by sampling from a probabilistic transition model,

$$s_t^k \sim T(s_t \mid u_t, s_{t-1}^k), \tag{2}$$

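where the transition model, $T$, defines the probability of a state, $s_{t}$, given a previous state, $s_{t-1}^{k}$, and the last action, $u_{t}$. In the case of robot localization, $u_{t}$ is the odometry input. Second, the particle weights are updated. The likelihood, $f_{t}^{k}$, is computed for each particle,

$$f_t^k = Z(o_t \mid s_t^k), \tag{3}$$

where the observation model, $Z$, defines the probability of an observation, $o_{t}$, given a state, $s_{t}^{k}$. The particle weights are then updated according to the likelihoods,

$$w_t^k = \eta\, f_t^k\, w_{t-1}^k, \tag{4}$$
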
where $\eta^{-1}=\sum_{j=1:K}{f_{t}^{j}w_{t-1}^{j}}$ is a normalization factor.
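The transition and weight updates described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `pf_predict_update` and the two toy 1-D models are hypothetical, chosen only to make the example runnable.

```python
import numpy as np

def pf_predict_update(states, weights, action, obs,
                      transition_sample, obs_likelihood, rng):
    """One particle filter step without resampling.

    states:  (K, D) array of particle states from the previous step
    weights: (K,) array of normalized particle weights
    """
    # Transition update: sample a new state for every particle.
    new_states = transition_sample(states, action, rng)
    # Observation update: likelihood of the observation for each particle.
    f = obs_likelihood(obs, new_states)
    # Weight update: multiply by the likelihood, then normalize to sum to 1.
    unnorm = f * weights
    new_weights = unnorm / unnorm.sum()
    return new_states, new_weights

# Hypothetical 1-D models, used only to exercise the function.
def toy_transition(states, action, rng):
    return states + action + rng.normal(0.0, 0.1, size=states.shape)

def toy_likelihood(obs, states):
    # Gaussian likelihood of a noisy position reading.
    return np.exp(-0.5 * ((obs - states[:, 0]) / 0.5) ** 2)

rng = np.random.default_rng(0)
K = 100
states = rng.uniform(-5.0, 5.0, size=(K, 1))
weights = np.full(K, 1.0 / K)
states, weights = pf_predict_update(
    states, weights, 1.0, 2.0, toy_transition, toy_likelihood, rng)
# State estimate as the weighted mean of the particles.
estimate = (weights[:, None] * states).sum(axis=0)
```

In PF-net the two model callables would be neural networks; here they are fixed analytic functions so that the update logic can be read in isolation.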

One common issue is particle degeneracy, i.e., when most particles have near-zero weight. The issue can be addressed by resampling particles. New particles are sampled from the current set with repetition, where a particle is chosen with a probability proportionate to its weight,

$$p(k) = w_t^k. \tag{5}$$

The weights are updated according to a uniform distribution,

$$w_t^{\prime k} = 1/K. \tag{6}$$

The new particle set approximates the same distribution, but devotes its representation power to the important regions of the belief space. Note that the new set may contain repeated particles, but they diverge after stochastic transition updates.
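As a rough illustration of the resampling step, the sketch below draws particle indices with replacement in proportion to their weights and resets the weights to uniform. The effective sample size, $1/\sum_k w_k^2$, is included as a common degeneracy diagnostic; it is an addition for illustration and is not part of the description above.

```python
import numpy as np

def effective_sample_size(weights):
    # Close to 1 when one particle dominates, equal to K for uniform weights.
    return 1.0 / np.sum(weights ** 2)

def resample(states, weights, rng):
    K = len(weights)
    # Sample K indices with repetition, each with probability equal to its weight.
    idx = rng.choice(K, size=K, replace=True, p=weights)
    # Resampled particles all receive the same weight, 1/K.
    return states[idx], np.full(K, 1.0 / K)

rng = np.random.default_rng(0)
states = np.array([[0.0], [1.0], [2.0], [3.0]])
weights = np.array([0.97, 0.01, 0.01, 0.01])  # a degenerate particle set
ess_before = effective_sample_size(weights)
new_states, new_weights = resample(states, weights, rng)
ess_after = effective_sample_size(new_weights)
```

After resampling, most copies are expected to be of the heavy particle; the repeated copies separate again once the stochastic transition update is applied.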

Particle Filter Network

The Particle Filter Network (PF-net) encodes learnable transition and observation models, together with the particle filter algorithm, in a single neural network (Figure 1). PF-net is an RNN with a differentiable algorithm prior, that is, structure specific to sequential state estimation. The differentiable algorithm prior in PF-net is particle filtering: the particle representation of beliefs, and Bayesian updates for transitions and observations. Compared to generic architectures, such as LSTM, these priors allow much more efficient learning.

The key idea underlying our approach is the unified representation of a learned model and an inference algorithm. The model is a neural network, i.e., a computation graph with trainable parameters. The inference algorithm is a differentiable program, i.e., a computation graph with differentiable operations. Both the model and the algorithm can be encoded in the same computation graph. What is the benefit? Unlike conventional model learning methods, PF-net can learn a model end-to-end, backpropagating gradients through the inference algorithm. The model is now optimized for a specific inference algorithm and a specific task. As a result, the model may not need to capture complex conditional probability distributions; instead, it may learn only the features relevant to the task.

Specifically, PF-net encodes the particle filtering steps, (1)-(6), in a computation graph (Figure 3). The transition and observation models, (2) and (3), are trainable neural networks with appropriate structure. Learned network weights are shared across particles. The rest of the computation graph is not learned, but rather, it implements the operations (1)-(6). Importantly, these operations must be differentiable to allow backpropagation. This is an issue for sampling from a learned distribution in (2), and resampling particles in (5)-(6).

The sampling operation (2) is not differentiable, but it can be easily expressed in a differentiable manner using the “reparameterization trick”. The trick is to take a noise vector as input, and express the desired distribution as a deterministic, differentiable function of this input. The function may have learnable parameters, e.g., the mean and variance of a Gaussian. Particle resampling poses a different issue: new particle weights are set to a constant in (6), which produces zero gradients. We address the issue by introducing soft-resampling, a differentiable approximation based on importance sampling. Instead of sampling particles from the desired distribution $p(k)$, we sample from $q(k)$, a combination of $p(k)$ and a uniform distribution,

$$q(k) = \alpha\, w_t^k + (1-\alpha)\,\frac{1}{K}, \tag{7}$$

where $\alpha$ is a trade-off parameter. The new weights are computed by the importance sampling formula,

$$w_t^{\prime k} = \frac{p(k)}{q(k)} = \frac{w_t^k}{\alpha\, w_t^k + (1-\alpha)\,1/K}. \tag{8}$$

This operation has a non-zero gradient when $\alpha\neq 1$. Soft-resampling trades off the desired sampling distribution ($\alpha=1$) with the uniform sampling distribution ($\alpha=0$). It provides non-zero gradients by maintaining the dependency on previous particle weights. An alternative to soft-resampling is to simply carry over particles to the next step, without resampling them. We found this to be a good strategy when training in a low uncertainty setting, i.e., when most particles remain close to the underlying true states. Soft-resampling worked better under high uncertainty, where most particles would deviate far from the true states.
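The forward computation of soft-resampling can be sketched as follows. This is a NumPy illustration only: NumPy does not compute gradients, so the sketch shows the sampling and reweighting logic rather than backpropagation, and the final renormalization of the weights is an assumption added here so that the returned weights sum to one.

```python
import numpy as np

def soft_resample(states, weights, alpha, rng):
    K = len(weights)
    # Mixture of the weight distribution p(k) and a uniform distribution.
    q = alpha * weights + (1.0 - alpha) / K
    # Sample particle indices from q instead of p.
    idx = rng.choice(K, size=K, replace=True, p=q)
    # Importance weights p(k) / q(k) for the sampled particles.
    new_weights = weights[idx] / q[idx]
    # Renormalization (an assumption of this sketch).
    new_weights = new_weights / new_weights.sum()
    return states[idx], new_weights

rng = np.random.default_rng(0)
states = np.arange(5, dtype=float).reshape(5, 1)
weights = np.array([0.6, 0.1, 0.1, 0.1, 0.1])
hard_states, hard_weights = soft_resample(states, weights, 1.0, rng)
soft_states, soft_weights = soft_resample(states, weights, 0.5, rng)
```

With `alpha = 1.0` the new weights are constant and carry no dependence on the old weights, which is the zero-gradient case; with `alpha < 1` the ratio still depends on the previous weights, which is what an automatic differentiation framework would backpropagate through.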

We have now introduced the PF-net architecture in a general setting. When applying PF-net to a particular task, we must choose the representation of states, and the network architecture for $T$ and $Z$. Note that we may use different numbers of particles during training and during evaluation.
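To make the reparameterization trick concrete for one possible choice of $T$, the sketch below samples from a Gaussian transition model by passing externally drawn noise through a deterministic function. The additive-odometry mean and the `log_sigma` parameter are hypothetical choices for illustration, not the architecture used in the paper.

```python
import numpy as np

def sample_transition(prev_states, action, log_sigma, noise):
    """Reparameterized Gaussian sample: deterministic in all of its inputs.

    prev_states: (K, D) particle states
    action:      (D,) odometry input, added to every particle (an assumption)
    log_sigma:   (D,) log standard deviations, a stand-in for learnable parameters
    noise:       (K, D) standard normal noise drawn outside the function
    """
    mean = prev_states + action
    # The sample is a differentiable function of mean and log_sigma for fixed noise.
    return mean + np.exp(log_sigma) * noise

rng = np.random.default_rng(0)
K, D = 30, 3  # e.g. 30 particles with (x, y, orientation) states
prev = np.zeros((K, D))
action = np.array([0.5, 0.0, 0.1])
log_sigma = np.log(np.array([0.05, 0.05, 0.02]))
noise = rng.standard_normal((K, D))
new_states = sample_transition(prev, action, log_sigma, noise)
```

Because the randomness enters only through `noise`, the derivative of each sample with respect to `log_sigma` is well defined, which is what allows gradients to pass through the sampling step in a differentiable framework.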

Visual localization

We apply PF-net to visual localization (Figure 1). A robot navigates in an indoor environment it has not seen before. The robot is uncertain of its location. It has an onboard camera and odometry, and it receives a schematic 2-D floor map. The task is to periodically estimate the location from the history of sensor observations. Formally, we seek to minimize the mean squared error,

$$\mathcal{L} = \sum_{t} (\overline{x}_{t}-x^{*}_{t})^{2} + (\overline{y}_{t}-y^{*}_{t})^{2} + \beta\,(\overline{\phi}_{t}-\phi^{*}_{t})^{2}, \tag{9}$$

where $\overline{x}_{t},\overline{y}_{t},\overline{\phi}_{t}$ and $x^{*}_{t},y^{*}_{t},\phi^{*}_{t}$ are the estimated and true robot poses for time $t$, respectively; $\beta$ is a constant parameter.
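A minimal sketch of this pose loss is given below, assuming poses are stored as rows of `(x, y, phi)`. The function name `pose_loss` is hypothetical, and the loss is averaged over time steps here; a sum over steps would differ only by a constant factor. No angle wrapping is applied, since none is stated in the text.

```python
import numpy as np

def pose_loss(est, true, beta):
    """Squared position error plus beta-weighted squared orientation error.

    est, true: (T, 3) arrays with columns x, y, phi
    """
    dx = est[:, 0] - true[:, 0]
    dy = est[:, 1] - true[:, 1]
    dphi = est[:, 2] - true[:, 2]
    per_step = dx ** 2 + dy ** 2 + beta * dphi ** 2
    # Average over the time steps of the trajectory.
    return float(np.mean(per_step))

est = np.array([[1.0, 2.0, 0.1],
                [0.0, 0.0, 0.0]])
true = np.zeros((2, 3))
loss = pose_loss(est, true, beta=0.5)
```

The parameter `beta` sets how much an orientation error counts relative to a position error, so its useful range depends on the units chosen for position and angle.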

Challenges are threefold. First, we must periodically update a posterior over states given ambiguous observations, where the posterior is a multimodal, non-Gaussian, continuous distribution. PF-net tackles the challenge by encoding a suitable differentiable algorithm prior, i.e., particle filtering.

Simulation experiments

We implemented and evaluated PF-net in simulation for robot visual localization in indoor environments. Our TensorFlow implementation is available at https://github.com/AdaCompNUS/pfnet. We compared with several alternative methods to examine the benefits of differentiable algorithm priors and those of end-to-end training. We evaluated PF-net with various visual and depth sensors. Finally, we evaluated PF-net with increasing levels of uncertainty, as the robot’s initial belief changes from a distribution concentrated around its true pose to one spread uniformly over the entire space. The results are summarized in Table 1.

Simulation. We conducted experiments in the House3D simulator, which builds on a large collection of human-designed, realistic residential buildings from the SUNCG data set. On average, the building size is $206\,\textrm{m}^{2}$, and the room size is $37\,\textrm{m}^{2}$. See Figure 5 for examples.

Tasks. We consider localization with various levels of uncertainty. For tracking, the initial belief is concentrated around the true state. For global localization, the belief is uniform over all rooms in a building. In between, for semi-global localization, the belief is uniform over one or more rooms.

Sensors. We considered a monocular RGB camera, a depth camera, an RGB-D camera, and a simulated 2-D LIDAR. Following earlier work, our simulated LIDAR simply transforms a depth image to a 2-D laser scan. The simulated LIDAR has a limited resolution of 54 beams and a limited field of view. As a result, localization with the simulated LIDAR is harder than with a typical real-world LIDAR. We also considered a simplified environment, LIDAR-W, for the LIDAR sensor by removing all furniture from the environment and leaving only the walls. This way, the corresponding floor map contains all geometric objects in the environment.

Training. The training data consists of 45,000 trajectories from 200 buildings. Trajectories are generated at random: the robot moves forward ($p=0.8$) or turns ($p=0.2$). The distance is sampled uniformly from the range $[20\,\textrm{cm},80\,\textrm{cm}]$, and the turning angle is sampled uniformly from a fixed range of angles. Each trajectory is 24 steps long, and each step is labeled with the true robot pose. The robot’s initial belief $b_{0}$ is a multivariate Gaussian distribution. The center of $b_{0}$ is perturbed from the true pose according to a Gaussian with zero mean and a diagonal covariance matrix $\Sigma$, with entries of $30\,\textrm{cm}$ for the two position coordinates and a third entry for the orientation; the covariance of $b_{0}$ is the same $\Sigma$. This setting corresponds to a tracking task. We trained PF-net and alternative networks to minimize the end-to-end loss (9). We trained by backpropagation through time, limited to 4 time steps. For training PF-net, we used $K=30$ particles. We did not resample particles during training, as it is not required for short trajectories and concentrated initial beliefs.

Alternative methods. We compared PF-net with alternative network architectures, the histogram filter (HF) network and the LSTM network. The HF network represents the belief as a histogram over discretized states, in this case, a grid with $40\,\text{cm}\times 40\,\text{cm}$ cells and $16$ orientations. Finer discretization did not produce better results. The LSTM network relies on its hidden state vector to represent the belief. We used a network architecture based on local maps, similar to the PF-net observation model. Outputs are relative state estimates that are updated with the odometry. We also considered a conventional particle filtering (PF) method with a handcrafted analytic observation model. We used the beam model implementation from the AMCL package of ROS, a standard model for localization with LIDAR. The model parameters were tuned for our simulated LIDAR sensor. Finally, to calibrate the results, we also considered Odometry-NF, which updates the belief only with odometry, not with other sensor inputs.

Evaluation. We evaluated the methods on a fixed set of 820 trajectories in 47 previously unseen buildings for tracking, semi-global localization, and global localization tasks. We used the same setup and the same model and algorithm parameters for all methods whenever possible. We trained the networks once for the tracking task and did not retrain for the other tasks. It is important to observe that for PF-net, the number of particles, $K$, used in execution does not have to be the same as that for training. In particular, we used $K=300$ particles for tracking, $K=1{,}000$ for semi-global localization, and $K$ up to $3{,}000$ for global localization. We also activated resampling for semi-global and global localization. The same settings were applied to the PF method. For tracking, we report the average root mean squared error (RMSE), computed for the robot position (Table 1a). For semi-global and global localization, we report the success rate on 100-step long trajectories (Table 1b–c). Localization is successful if the estimation error is below $1\,\text{m}$ for the last 25 steps of a trajectory. Finally, we evaluated PF-net on semi-global localization with semantic maps (Table 1d).
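The two evaluation metrics can be sketched as below, assuming positions are given in meters as `(T, 2)` arrays. The function names are hypothetical, and the position error is taken to be the Euclidean distance, which is an assumption about how the estimation error is measured.

```python
import numpy as np

def position_errors(est_xy, true_xy):
    # Euclidean distance between estimated and true position at each step.
    return np.linalg.norm(est_xy - true_xy, axis=-1)

def rmse(est_xy, true_xy):
    # Root mean squared position error over a trajectory.
    return float(np.sqrt(np.mean(position_errors(est_xy, true_xy) ** 2)))

def is_success(est_xy, true_xy, threshold=1.0, last_steps=25):
    # Successful if the error stays below the threshold for the final steps.
    err = position_errors(est_xy, true_xy)
    return bool(np.all(err[-last_steps:] < threshold))

# A 100-step trajectory whose estimate starts 2 m off and converges to 0.5 m.
true_xy = np.zeros((100, 2))
est_xy = np.zeros((100, 2))
est_xy[:75, 0] = 2.0
est_xy[75:, 0] = 0.5
converged = is_success(est_xy, true_xy)
```

Under this criterion a trajectory counts as a success even if the early estimates are far off, as long as the filter has converged by the final 25 steps.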

Main results

PF-net successfully reduces state uncertainty in the tracking task (Table 1a). Without additional training, PF-net can also localize successfully when the initial belief is uniform over a room (Table 1b), and even when uniform over the entire floor map (Table 1c). See Figure 6 for an example.

Differentiable algorithm priors are useful. PF-net consistently outperformed alternative end-to-end learning architectures, the HF network and the LSTM network. Why? PF-net encodes a differentiable algorithm prior specific to sequential state estimation, i.e., the particle representation of beliefs and their Bayesian update. The HF network encodes a similar prior for updating beliefs; however, it is restricted to a discrete belief representation, which does not scale well to large and continuous state spaces. The LSTM network is not restricted to a discrete state space, but it has no structure specific to probabilistic state estimation, and it must rely on the hidden state vector to encode the belief.

End-to-end learning leads to increased robustness. We compared the learned PF-net to PF with a known LIDAR model (first and fourth rows of Table 1a–b). PF-net and PF performed similarly when only walls were present in the environment (LIDAR-W column). PF-net performed significantly better when some objects in the environment were not in the map (LIDAR column). Why? The beam model has no principled way to distinguish relevant walls from irrelevant objects, because it decouples the LIDAR scan into individual beams. Through end-to-end training, PF-net may have learned relationships between beams to distinguish walls from objects. PF-net may have also learned to deal with map imperfections, e.g., missing walls, glass doors, and various map artifacts, which we observed occasionally in the House3D data set.

PF-net is effective with various sensors. The columns of Table 1a–b compare different sensors for localization. PF-net with RGB images is almost as effective as with depth images, and it performs better than with simulated LIDAR. This indicates that PF-net successfully learned to extract relevant geometry from RGB images, and it learned to ignore objects that are not in the map. When RGB and depth image inputs are combined (RGB-D column), performance improves. This demonstrates that PF-net can learn simple sensor fusion end-to-end from data. Future work in this direction is promising.

Additional experiments

Global localization. We evaluated the learned PF-net for localization with increasing difficulty (Table 1c). We chose initial beliefs uniform over one room, two rooms, and the entire building. We compared PF-net with different numbers of particles, up to $K=3{,}000$. Results show that PF-net can solve global localization with high initial uncertainty when provided with sufficiently many particles.

Semantic maps. Humans often use floor maps with semantic information: there are labels for the office, toilet, lift and staircase. Utilizing semantic maps for robot localization is not trivial. PF-net may learn to use semantic maps naturally, through end-to-end training. To demonstrate this, we trained PF-nets with simplified semantic maps with labels for doors and room categories. See Figure 5 for examples. We encoded the semantic labels in separate channels of the input map: one channel for doors, 8 channels for 8 distinct room categories. Results show that simple semantic maps can indeed improve localization performance (Table 1d).
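One way to picture this channel encoding is the sketch below, which builds a 9-channel array from a door mask and a per-cell room label. The input conventions (a boolean door mask, integer room labels 0 to 7, and -1 for unlabeled cells) are assumptions made for illustration; the geometric floor map itself would occupy a further channel that is not shown.

```python
import numpy as np

NUM_ROOM_CATEGORIES = 8

def encode_semantic_map(door_mask, room_labels):
    """Stack semantic labels into separate channels.

    door_mask:   (H, W) boolean array, True where a door is marked
    room_labels: (H, W) integer array, 0..7 for room category, -1 if unlabeled
    """
    height, width = door_mask.shape
    channels = np.zeros((height, width, 1 + NUM_ROOM_CATEGORIES), dtype=np.float32)
    # Channel 0: doors.
    channels[..., 0] = door_mask
    # Channels 1..8: one binary mask per room category.
    for category in range(NUM_ROOM_CATEGORIES):
        channels[..., 1 + category] = (room_labels == category)
    return channels

door_mask = np.array([[True, False],
                      [False, False]])
room_labels = np.array([[0, 7],
                        [-1, 3]])
semantic = encode_semantic_map(door_mask, room_labels)
```

Keeping each label in its own binary channel avoids imposing an ordering on room categories, which a single integer-valued channel would implicitly do.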

Ablation study. In supplementary experiments, we altered certain settings of PF-net during training, and evaluated the learned PF-nets for a fixed semi-global localization task. First, we added soft-resampling during training. When trained for the tracking task as before, success rates decreased for soft-resampling: 79% to 75%. However, when trained with increased initial uncertainty and noisy odometry, success rates increased for soft-resampling: 39% to 42%. As expected, resampling can be beneficial when most particles would be far from the true state; but it hurts when particles near the true state are eliminated, which often happens in early phases of learning. Future work may incorporate various strategies for resampling only when required. Indeed, when resampling only every second step, success rates increased: 42% to 54%.

Next, we varied the number of backpropagation steps for BPTT. Backpropagating through multiple steps improved performance: 73%, 79%, 79% success rates for 1, 2, and 4 steps, respectively. This indicates that loss from future steps can provide a useful learning signal for the present step.

Finally, we replaced our loss function (9) with a probabilistic loss function proposed in prior work. The alternative loss function worked worse when training in the standard tracking setting, with 74% versus 79% success rates. However, the alternative loss function worked better when training with increased uncertainty, with 67% versus 39%. Our loss can be dominated by the distant particles, which may negatively affect learning in the latter case.

Conclusion & future work

We introduced the PF-net, a neural network architecture with a differentiable algorithm prior for sequential state estimation. PF-net encodes learned probabilistic models, together with a particle filter algorithm, in a differentiable network representation. We applied PF-net to robot localization on a map. Through end-to-end training, PF-net successfully learned to localize in challenging, previously unseen environments populated with objects not shown in the map.

Future work may apply PF-net to real-world localization, a problem of great interest for mobile robot applications. One concern is online execution. With RGB input, PF-net needs approximately $0.6\,\textrm{ms}$ per particle per step. Indoor localization with high uncertainty may require 1,000 to 10,000 particles. We can increase robustness, and use fewer particles, by incorporating standard techniques for particle filtering, e.g., injecting particles and adaptive resampling. We may also improve inference time, leveraging an abundance of work optimizing neural network models and hardware. Finally, learned PF-net models can be used for standard particle filtering, and thus visual sensors can be complementary to laser, potentially at a lower update frequency.
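To put the online-execution concern in numbers, the short calculation below multiplies the per-particle cost by the particle count. It assumes the cost scales linearly with the number of particles, which ignores any batching or parallelism; the function name is hypothetical.

```python
def step_time_seconds(num_particles, ms_per_particle=0.6):
    # Linear scaling assumption: total time = particles * per-particle time.
    return num_particles * ms_per_particle / 1000.0

low = step_time_seconds(1_000)    # roughly 0.6 seconds per step
high = step_time_seconds(10_000)  # roughly 6 seconds per step
```

Under this assumption, a filter with 1,000 to 10,000 particles would take on the order of 0.6 to 6 seconds per step, which illustrates why using fewer particles and faster inference are listed as directions for future work.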

PF-net could also be applied to other domains, e.g., visual object tracking and SLAM. An exciting line of future work may extend PF-net to learn latent state representations for filtering, potentially in an unsupervised setting. Finally, the particle representation of beliefs can be important for encoding more sophisticated algorithms in neural networks, e.g., for planning under partial observability.

This research is supported in part by Singapore Ministry of Education grant MOE2016-T2-2-068. Peter Karkus is supported by the NUS Graduate School for Integrative Sciences and Engineering Scholarship.