BundleTrack: 6D Pose Tracking for Novel Objects without Instance or Category-Level 3D Models

Bowen Wen, Kostas Bekris

I INTRODUCTION

Robot manipulation often requires information about the pose of the manipulated object. In some cases, this can be achieved through forward kinematics (FK), assuming the object’s motion is equivalent to the end-effector’s motion. Frequently, however, FK is insufficient to accurately estimate the object’s pose. This can be due to slippage during grasping or in-hand manipulation, during handoffs, or due to the compliance of a suction cup (Fig. 1). In these cases, dynamically estimating an object’s pose from visual data is desirable. Single-image 6D pose estimation methods have been studied extensively. Some of them are fast and can re-estimate poses from scratch for every new frame. Nevertheless, this is redundant and less efficient, leads to less coherent estimates over consecutive frames, and negatively impacts planning and control. On the other hand, given an initial pose estimate, tracking 6D object poses over image sequences can improve estimation speed while providing coherent and accurate poses by leveraging temporal consistency.

Most existing 6D object pose estimation or tracking approaches assume access to an object instance’s 3D model. Requiring such instance 3D models complicates generalization to novel, unseen instances. To overcome this limitation, recent efforts have relaxed this assumption and require only category-level 3D models for 6D pose estimation or tracking. They often achieve this by training over a large number of CAD models from the same category. While promising results have been demonstrated for previously seen object categories, there are still limitations. These methods are constrained by the variety of categories in the training database. Popular 3D model databases, such as ShapeNet and ModelNet40, contain 55 and 40 categories respectively. This is still far from sufficient to cover the diverse object categories present in the real world. Furthermore, 3D model databases often require nontrivial manual effort and expert domain knowledge to build, involving steps such as scanning, mesh refinement or CAD design.

Another line of work from the SLAM literature has moved to address dynamic, object-aware challenges, where dynamic objects are reconstructed on-the-fly while being tracked, without the need for object 3D models beforehand. However, tracking-via-reconstruction tends to accumulate errors when fusing observations with erroneous pose estimates into the global model. These errors adversely impact model tracking in subsequent frames.

Motivated by the above limitations, this work aims for accurate, robust 6D pose tracking that is generalizable to novel objects without instance or category-level 3D models. It exploits recent advances in video segmentation as well as learning-based keypoint detection and matching for a coarse pose estimate, followed by a memory-augmented pose-graph optimization step to achieve spatiotemporally consistent pose output. Instead of aggregating into a global model, representative historical observations are maintained as keyframes in a memory pool, providing candidate nodes for future graphs so as to enable multi-pair data association together with the latest observation. An efficient implementation of this framework in CUDA achieves competitive running times. Extensive experiments have been conducted on two large-scale public benchmarks, shown in Fig. 1. Both qualitative and quantitative results demonstrate a significant improvement over existing state-of-the-art approaches, including methods using instance or category-level 3D models and SLAM-like methods.

In summary, this work’s contributions are the following:

1) A novel integration of methods that result in a 6D pose tracking framework that generalizes to novel objects without access to instance or category-level 3D models.

2) A memory-augmented pose graph optimization for low-drift, accurate 6D object pose tracking. In particular, augmenting the memory pool with historical observations enables multi-hop data association and ameliorates the dearth of correspondences between a pair of consecutive frames. Additionally, maintaining keyframes as raw nodes instead of aggregating them into a global model significantly reduces tracking drift.

3) An efficient CUDA implementation, which allows the computationally heavy multi-pair feature matching and pose-graph optimization for 6D object pose tracking to be executed online (for the first time, to the best of the authors’ knowledge).

These contributions result in a new state-of-the-art performance by boosting the previous best accuracy from 33.3% to 87.4% under the “5°5cm” metric on the NOCS dataset, even when compared against approaches utilizing category-level 3D models for training. They also result in comparable performance on the YCBInEOAT dataset, even when compared against approaches utilizing instance-level 3D models.

II RELATED WORK

6D Object Pose Tracking - For setups where object CAD models are available, significant progress has been made in 6D pose tracking. This includes techniques based on hand-crafted probabilistic filtering, optimization, and machine learning. The requirement of such instance-level 3D models, however, either for offline training or for model-frame registration during tracking, complicates generalization to novel instances. More recently, a 6D pose tracking approach relaxed the assumption to category-level 3D models by using 3D object CAD model databases for training. During testing, the target object category needs to be identified and the corresponding network for that category is utilized for tracking. Instead of being limited to the number of categories such a database is able to include, this work employs deep features that in principle can be trained on arbitrary 2D images. This allows generalization to diverse novel objects, as shown in the accompanying experiments.

Dynamic Object-aware SLAM - In order to track dynamic objects’ poses and decouple them from the static background, frame-model Iterative Closest Point (ICP) combined with color, probabilistic data association, or 3D level-set likelihood maximization has been applied. Object models are simultaneously reconstructed on-the-fly by aggregating the observed RGB-D data with the newly tracked pose. Nevertheless, frame-model tracking can be challenging for object reconstruction, since errors in pose estimation transfer to the reconstructed model and adversely affect the subsequent tracking. This work does not fuse observed frames but instead maintains them as nodes in a pose graph, which allows previously erroneous estimates to be corrected and reduces drift in long-term tracking. The aforementioned SLAM-family approaches may also face challenges in robot manipulation setups that involve small, textureless, flat or shiny objects due to the dearth of sufficient correspondences between the pair of consecutive frames. To ameliorate this issue, BundleTrack searches for correspondences among the current and multiple historical frames, consisting of both feature and geometric terms, which form the edges in the pose graph. Its effectiveness has been shown in extensive experiments, including such challenging manipulation scenarios.

3D Hand-held Object Scanning - Promising results have been demonstrated in scanning dynamic hand-held objects, where the object’s motion needs to be taken into account, similar to the current setup. In particular, a framework for robot manipulation performs simultaneous object reconstruction and tracking, which leads to similar issues as the aforementioned dynamic SLAM methods. In addition, forward kinematics is required in its Kalman Filtering framework, preventing generalization to scenarios where objects are not held by the robotic manipulator. While estimating object poses is part of the scanning process, there are key differences from online 6D pose tracking. For the scanning application, external assistance including human interaction or deliberate motion is acceptable, but it is not assumed in the current work. Furthermore, time-consuming global-optimization steps are often adopted at the end of scanning to polish the models and their poses, while intermediate erroneous pose estimations and associated frames can be discarded and not fused into the global model. In contrast, this work aims to provide fast and accurate pose tracking output online.

III PROBLEM FORMULATION

Assume a rigid body for which neither a corresponding instance 3D model nor a category-level 3D model database for training is available. The objective is to continuously track its 6D pose change relative to the start of tracking, i.e., the relative transformation $T_{0\rightarrow\tau}\in SE(3),\tau\in\{1,2,...,t\}$ in the camera’s frame $C$. The input is the following:

$I_{\tau}$ : A sequence of RGB-D data $I_{\tau},\tau\in\{0,...,t\}$ .

$M_{0}$ : A binary mask on the first image $I_{0}$ , indicating the target object region to track in the image space.

$T_{0}^{C}$ (optional): The initial pose in the camera’s frame $C$ . Used if the objective is to recover the object’s absolute pose in $C$ , otherwise set to identity.

The initial mask $M_{0}$ can be obtained in multiple ways to initialize tracking, for instance via semantic segmentation or via non-semantic methods such as image segmentation, point cloud segmentation/clustering, or plane fitting and removal.

The object’s pose in the camera’s frame $C$ can be recovered at any timestamp by applying the relative transformation $T_{0\rightarrow\tau}$ in the camera’s frame $\mathbf{T}_{\tau}=T^{C}_{\tau}=T_{0}^{C}[(T_{0}^{C})^{-1}T_{0\rightarrow\tau}T_{0}^{C}]=T_{0\rightarrow\tau}T_{0}^{C}\in SE(3)$ . For simplicity, the rest of this document will refer to $\mathbf{T}_{\tau}$ as the output of the process but $T_{0\rightarrow\tau}$ is what is actually computed as tracking.
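As a concrete reading of the composition above, the following minimal sketch recovers the absolute pose from a tracked relative transform by left-multiplying the initial pose. The function names and the numeric poses are illustrative assumptions, not part of the original implementation.

```python
def matmul4(a, b):
    """Multiply two 4x4 homogeneous transforms given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def absolute_pose(t_rel, t_init):
    """Return T_tau = T_{0->tau} * T_0^C, both expressed in the camera frame."""
    return matmul4(t_rel, t_init)


# 4x4 identity, used when only the relative motion is of interest.
IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Hypothetical initial pose: object 0.5 m in front of the camera, no rotation.
t_init = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.5],
          [0.0, 0.0, 0.0, 1.0]]

# Hypothetical tracked motion: 90-degree rotation about the camera z-axis
# followed by a 0.1 m shift along the camera x-axis.
t_rel = [[0.0, -1.0, 0.0, 0.1],
         [1.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]

pose = absolute_pose(t_rel, t_init)
```

With $T_{0}^{C}$ set to identity, `absolute_pose` simply returns the relative transform, which matches the optional nature of the initial pose input.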

IV APPROACH

The first step is to segment the object’s image region from the background. Prior work used Mask-RCNN to compute the object mask in every frame of the video. It deals with each new frame independently, which is less efficient and results in temporal inconsistencies.

To avoid these limitations, this work adopts an off-the-shelf transductive-VOS network for video object segmentation, which is trained on the DAVIS 2017 and YouTube-VOS datasets. The network uses dense long-term similarity dependencies between current and past feature embeddings to propagate the previous object mask to the latest frame. The object mask needed by BundleTrack is simply binary, i.e., $M_{\tau}\in\{0,1\}^{H\times W},\tau\in\{0,1,...,t\}$, and distinguishes the object region from the background. The only requirement is an initial mask $M_{0}$ of interest. Neither the transductive-VOS network nor the following steps of BundleTrack require $M_{0}$ to come from semantic/instance segmentation. Therefore, it can also be obtained in alternative ways depending on the application, e.g., low-level image segmentation, point cloud segmentation/clustering, or plane fitting and removal.

While the current implementation uses transductive-VOS, the following techniques do not depend on this specific network. If the object mask can be computed via simpler means, such as computing a region of interest (ROI) from forward kinematics followed by point cloud filtering in robot manipulation scenarios, the segmentation module can be replaced.

IV-B Keypoint Detection, Matching and Local Registration

IV-C Keyframe Selection

where $R_{i}$ is the rotation matrix of the corresponding keyframe’s pose. The goal is to find the optimal binary vector $x\in R^{N}$ that indicates the selections. The weight of the edge between frame pair $(i,j)$ is the geodesic distance of their rotations. Mutual viewing overlap is maximized when the mutual rotation difference relative to the camera is minimized. Combinatorial optimization algorithms for solving this problem have a complexity of $O(\mathcal{N}^{\mathcal{K}}/\log\mathcal{N})$. In practice, a greedy selection is used instead: it starts with the keyframe set $\{I_{0}\}$ and iterates until the number of selected keyframes reaches $\mathcal{K}$. $I_{0}$ is chosen since the initial frame does not suffer from any tracking drift and serves as the reference frame. In each iteration, the keyframe with the smallest sum of geodesic distances against $I_{t}$ as well as all previously selected keyframes is added. This reduces complexity to $O(\mathcal{N}\mathcal{K}^{3}+\mathcal{N}\mathcal{K}^{2})$, making the selection practical (under a millisecond) without degrading performance.
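The greedy procedure described above can be sketched as follows. This is a minimal illustration in plain Python, assuming rotations are stored as 3x3 nested lists and that index 0 of the pool holds the initial frame $I_{0}$; the function names and the example angles are hypothetical and the sketch makes no attempt to match the efficiency of the CUDA implementation.

```python
import math


def geodesic_distance(ra, rb):
    """Rotation angle (radians) of ra^T * rb for 3x3 rotation matrices."""
    # trace(ra^T rb) equals the sum of element-wise products.
    trace = sum(ra[i][j] * rb[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


def select_keyframes(pool_rotations, current_rotation, k):
    """Greedily pick up to k keyframe indices from the memory pool.

    pool_rotations[0] is assumed to belong to the initial frame I_0.
    """
    selected = [0]
    remaining = set(range(1, len(pool_rotations)))

    def cost(idx):
        # Distance to the current frame plus distances to every chosen keyframe.
        total = geodesic_distance(pool_rotations[idx], current_rotation)
        for s in selected:
            total += geodesic_distance(pool_rotations[idx], pool_rotations[s])
        return total

    while len(selected) < k and remaining:
        best = min(sorted(remaining), key=cost)
        selected.append(best)
        remaining.remove(best)
    return selected


def rot_z(deg):
    """Rotation about the z-axis, used only to build the toy example."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]


# Toy pool: keyframes seen at 0, 90, 20 and 170 degrees; current frame at 10 degrees.
pool = [rot_z(0), rot_z(90), rot_z(20), rot_z(170)]
chosen = select_keyframes(pool, rot_z(10), k=3)
```

In this toy example the keyframe at 20 degrees is picked first because it is closest to both the current frame and $I_{0}$, followed by the one at 90 degrees.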

IV-D Online Pose Graph Optimization

In order to compute $\mathbf{E}_{f}$ , feature correspondences $C_{i,j}$ between each pair of nodes $(i,j)$ are determined. If $C_{i,j}$ has been built during a previous pose graph optimization, it is reused. Otherwise, the data association process of Sec. IV-B is performed to compute $C_{i,j}$ . These multi-pair feature correspondences are built in parallel on GPU. In Eq. (2) and (3), $p$ represents the unprojected 3D points in the camera’s frame, $\rho$ is the M-estimator, where Huber loss is used.

For $\mathbf{E}_{g}$ , dense pixel-wise correspondences are associated by point re-projection, while outliers are filtered based on the distance between the point pair and the angle formed by their normals; $\pi(\cdot)$ is the perspective projection operation; $\pi^{-1}_{D}(\cdot)$ denotes the unprojection mapping, which recovers a 3D point in the camera’s frame by looking up the depth value on the pixel location; $n_{i}(\cdot)$ returns the normal of the pixel on the frame $I_{i},i\in|V|$ .
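The operators used in the geometric term can be illustrated with a minimal pinhole-camera sketch. The intrinsics `fx, fy, cx, cy`, the thresholds `max_dist` and `max_angle_deg`, and all function names are assumptions introduced for illustration; the source does not state the threshold values used.

```python
import math


def project(point, fx, fy, cx, cy):
    """Perspective projection pi: 3D camera-frame point -> pixel (u, v)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def unproject(u, v, depth, fx, fy, cx, cy):
    """Unprojection pi_D^{-1}: pixel plus looked-up depth -> 3D camera-frame point."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def is_inlier(p_src, n_src, p_dst, n_dst, max_dist, max_angle_deg):
    """Keep a pair only if the points are close and their unit normals agree."""
    if math.dist(p_src, p_dst) > max_dist:
        return False
    cos_angle = sum(a * b for a, b in zip(n_src, n_dst))
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle)) <= max_angle_deg


# Hypothetical intrinsics and a point 0.5 m in front of the camera.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0
point = (0.1, -0.05, 0.5)
u, v = project(point, fx, fy, cx, cy)
recovered = unproject(u, v, point[2], fx, fy, cx, cy)

# A pair 1 cm apart with identical normals passes; a pair 5 cm apart does not.
normal = (0.0, 0.0, -1.0)
close_ok = is_inlier(point, normal, (0.11, -0.05, 0.5), normal, 0.02, 30.0)
far_rejected = is_inlier(point, normal, (0.15, -0.05, 0.5), normal, 0.02, 30.0)
```

Projecting a point and unprojecting it with its own depth returns the original point, which is the round trip that dense re-projection association relies on.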

In Eq. (1), $\lambda_{1}$ and $\lambda_{2}$ are the weights balancing $\mathbf{E}_{f}$ and $\mathbf{E}_{g}$ . To emphasize the lack of sensitivity to the choice of these values, $\lambda_{1}$ and $\lambda_{2}$ are set to $1$ in all experiments unless otherwise specified. Then, the goal is to find the optimal poses, such that:

where $\mathbf{J}$ is the Jacobian matrix with respect to $\xi$, and $\mathbf{W}$ is a diagonal weight matrix computed from the M-estimator $\rho$ and the residuals, which is updated in each iteration. To better take advantage of the sparsity of $\mathbf{J}$ and $\mathbf{W}$, inside each Gauss-Newton step, an iterative PCG (Preconditioned Conjugate Gradient) solver is leveraged, where the diagonal of $\mathbf{J}^{T}\mathbf{W}\mathbf{J}$ is used as the preconditioner. Incremental pose updates are accumulated in the tangent space after each iteration, $\xi\leftarrow\xi\boxplus\Delta\xi$. The entire pose graph optimization is implemented in CUDA for parallel computation.
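One robustly weighted Gauss-Newton step with a diagonally preconditioned conjugate gradient solve can be sketched as below. This is a dense, CPU-only toy in plain Python under several assumptions: the unknown is an ordinary vector rather than a tangent-space pose (so the $\boxplus$ update is not shown), the Huber threshold `delta` is a hypothetical parameter, and sparsity is not exploited as it would be in the CUDA implementation.

```python
def huber_weight(residual, delta=1.0):
    """Reweighting factor of the Huber M-estimator for one scalar residual."""
    a = abs(residual)
    return 1.0 if a <= delta else delta / a


def normal_equations(jac, res, delta=1.0):
    """Build A = J^T W J and b = -J^T W r with W = diag(Huber weights)."""
    m, n = len(jac), len(jac[0])
    w = [huber_weight(r, delta) for r in res]
    a = [[sum(w[k] * jac[k][i] * jac[k][j] for k in range(m))
          for j in range(n)] for i in range(n)]
    b = [-sum(w[k] * jac[k][i] * res[k] for k in range(m)) for i in range(n)]
    return a, b


def pcg(a, b, iters=50, tol=1e-12):
    """Solve A x = b by conjugate gradient with a Jacobi (diagonal) preconditioner."""
    n = len(b)
    inv_diag = [1.0 / a[i][i] for i in range(n)]
    x = [0.0] * n
    r = list(b)
    z = [inv_diag[i] * r[i] for i in range(n)]
    p = list(z)
    rz = sum(r[i] * z[i] for i in range(n))
    for _ in range(iters):
        if rz < tol:
            break
        ap = [sum(a[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(p[i] * ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * ap[i] for i in range(n)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x


# Toy linear problem r(xi) = J xi - y, evaluated at xi = 0, so r = -y.
jac = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
res = [-1.0, -2.0, -3.0]
# delta is set large here so that all Huber weights equal 1.
a_mat, b_vec = normal_equations(jac, res, delta=10.0)
step = pcg(a_mat, b_vec)
```

Because the toy residual is linear and all weights are one, a single step lands on the least-squares solution; in the nonlinear pose-graph setting the weights and Jacobian would be recomputed before every step.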

At the end of the optimization, the object pose corresponding to each graph node is obtained by $\mathbf{T}_{i}=\exp(\xi_{i})\in SE(3),i\in|V|$. The one corresponding to the current timestamp $t$ becomes the output tracked pose $\mathbf{T}_{t}$, while poses corresponding to the historical keyframes are updated in the memory pool. The entire process is causal, i.e., past frames’ corrected poses cannot be updated in the output. However, their corrected pose estimates provide better initialization in the following pose graph optimization steps, which benefits the solution for new observations. This significantly reduces long-term drift compared against tracking-via-reconstruction, where any intermediate erroneous pose estimation introduces noise when fused into the global model and adversely affects the subsequent tracking.

IV-E Augmenting the Keyframe Memory Pool

The initial frame $I_{0}$ is always selected as it does not suffer from any tracking drift. For later frames, once the current object pose $\mathbf{T}_{t}$ is determined, its rotation geodesic distance against each existing keyframe in the pool is compared. If all pair-wise distances are larger than $\alpha$ ( $\arccos(10^{\circ})$ in all experiments), $I_{t}$ is added into the keyframe memory pool. This encourages adding frames from novel views, such that multi-view diversity is enriched.
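The pool update rule can be sketched in a few lines. The sketch below assumes the pool stores 3x3 rotation matrices as nested lists; the function names are hypothetical, and the threshold of 10 degrees used in the example is an illustrative value chosen for readability rather than a confirmed setting.

```python
import math


def geodesic_distance(ra, rb):
    """Rotation angle (radians) of ra^T * rb for 3x3 rotation matrices."""
    trace = sum(ra[i][j] * rb[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


def maybe_add_keyframe(pool, rotation, alpha):
    """Append the rotation if it is farther than alpha from every stored keyframe."""
    if all(geodesic_distance(rotation, r) > alpha for r in pool):
        pool.append(rotation)
        return True
    return False


def rot_z(deg):
    """Rotation about the z-axis, used only to build the toy example."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]


alpha = math.radians(10.0)   # illustrative threshold (assumption)
pool = [rot_z(0)]            # the initial frame I_0 is always in the pool
decisions = [maybe_add_keyframe(pool, rot_z(d), alpha) for d in (5, 15, 20, 30)]
```

In the example, the views at 15 and 30 degrees are added while those at 5 and 20 degrees are rejected, since each of the latter lies within the threshold of an existing keyframe.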

V EXPERIMENTS

This section evaluates the proposed approach and compares against state-of-the-art 6D pose tracking and estimation methods on two public benchmarks, the NOCS dataset and the YCBInEOAT dataset. Experiments are performed over diverse types of objects and various tracking scenarios (e.g., moving camera or moving objects). Both quantitative and qualitative results demonstrate that BundleTrack achieves comparable or even superior performance relative to alternatives, although it does not require instance or category-level 3D models. Concretely, no CAD models or training data from a 3D object database are used by BundleTrack. All experiments are conducted on a standard desktop with an Intel Xeon(R) E5-1660 v3@3.00GHz processor and a single NVIDIA RTX 2080 Ti GPU.

NOCS dataset: Among existing datasets, this is the closest to the setup here, where instance 3D models are not provided during evaluation. The dataset contains 6 object categories: bottle, bowl, camera, can, laptop, and mug. The training set consists of: (1) 7 real videos containing 3 instances of each category in total, annotated with ground truth poses; and (2) 275K frames of synthetic data generated using 1085 instances from the above 6 categories of the 3D model database ShapeNetCore, with random poses and object combinations in each scene. The testing set has 6 real videos containing 3 different unseen instances within each category, resulting in 18 different object instances and 3,200 frames in total.

YCBInEOAT dataset: This dataset helps verify the effectiveness of 6D pose tracking during robot manipulation. It was originally developed to evaluate approaches relying on CAD models. The available CAD models, however, are not used by BundleTrack. In contrast to the NOCS dataset, where objects are statically placed on a tabletop and captured by a moving camera, YCBInEOAT contains 9 video sequences captured by a static RGB-D camera, while objects are dynamically manipulated. There are three types of manipulation: (1) single arm pick-and-place, (2) within-hand manipulation, and (3) pick to hand-off between arms to placement. These scenarios and the end-effectors used make directly computing poses from forward kinematics unreliable. The manipulation videos involve 5 YCB objects: mustard bottle, tomato soup can, sugar box, bleach cleanser and cracker box.

V-B Results on the NOCS Dataset

Table I and Fig. 3 present the quantitative and qualitative results of state-of-the-art methods on the NOCS dataset, respectively. The comparison points include learning-based methods relying on a category-level prior, such as NOCS, KeypointNet, and 6-PACK with or without temporal prediction. These methods are trained offline on both real and synthetic training sets, the latter rendered with 3D object models extracted from the same categories of ShapeNetCore. In contrast, ICP, MaskFusion, TEASER++* and the proposed BundleTrack have no access to any training data based on 3D models.

The evaluation protocol is the same as in prior work. A perturbed ground-truth object pose is used for initialization. The perturbation adds a uniformly sampled random translation within a 4cm range to evaluate robustness against a noisy initial pose. No re-initialization is allowed during tracking. To evaluate robustness against missing frames, the same uniformly sampled 450 frames out of 3200 in the testing videos are dropped. Four metrics are adopted: 1) 5°5cm: percentage of estimates with orientation error < 5° and translation error < 5cm - the higher the better; 2) IoU25 (Intersection over Union): percentage of cases where the overlapping volume of the predicted and ground-truth 3D bounding boxes is larger than 25% of their union - the higher the better; 3) Rerr: mean orientation error in degrees - the lower the better; and 4) Terr: mean translation error in centimeters - the lower the better. For Rerr and Terr, estimates with IoU $\leq$ 25% are not counted when computing averages (see https://github.com/j96w/6-PACK/blob/master/benchmark.py).
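The 5°5cm metric can be made concrete with a short sketch. It assumes rotations are 3x3 nested lists and translations are given in meters; the function names are hypothetical, and any category-specific handling (for example of symmetric objects) that the referenced benchmark script may apply is deliberately left out.

```python
import math


def rotation_error_deg(r_est, r_gt):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    trace = sum(r_est[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error_cm(t_est, t_gt):
    """Euclidean distance between translations given in meters, in centimeters."""
    return 100.0 * math.dist(t_est, t_gt)


def score_5deg_5cm(estimates, ground_truths):
    """Percentage of frames with rotation error < 5 deg and translation error < 5 cm."""
    hits = 0
    for (r_est, t_est), (r_gt, t_gt) in zip(estimates, ground_truths):
        if (rotation_error_deg(r_est, r_gt) < 5.0
                and translation_error_cm(t_est, t_gt) < 5.0):
            hits += 1
    return 100.0 * hits / len(estimates)


def rot_z(deg):
    """Rotation about the z-axis, used only to build the toy example."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]


gt = (rot_z(0), [0.0, 0.0, 0.5])
estimates = [
    (rot_z(3), [0.01, 0.0, 0.5]),   # 3 deg, 1 cm: counted
    (rot_z(8), [0.0, 0.0, 0.5]),    # 8 deg: rotation too large
    (rot_z(1), [0.06, 0.0, 0.5]),   # 6 cm: translation too large
    (rot_z(4), [0.0, 0.03, 0.5]),   # 4 deg, 3 cm: counted
]
score = score_5deg_5cm(estimates, [gt] * len(estimates))
```

Two of the four toy estimates satisfy both thresholds, so the score for this example is 50 percent.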

The results of comparison points other than MaskFusion and TEASER++* come from the literature. The open-source code of MaskFusion (https://github.com/martinruenz/maskfusion) is used for evaluation, where the global SLAM module is disabled to avoid inferring object poses from the camera’s estimated ego-motion. The dynamic object tracking module is kept to solely evaluate object pose tracking effectiveness. Its original segmentation module, Mask-RCNN, is fine-tuned on the real training data provided in the NOCS dataset for better performance, while the synthetic data rendered using category-level 3D models are not used, as this method is also agnostic to any 3D models. In addition to the ICP results reported in the literature, another state-of-the-art 3D registration approach is included for comparison and denoted as TEASER++*, which is robust to outlier correspondences and agnostic to 3D models. It takes as input the segmented point cloud and feature correspondences that are computed using the same modules proposed in BundleTrack. For BundleTrack, an initial mask $M_{0}$ is required as input to the framework and is provided via the aforementioned Mask-RCNN. During execution, BundleTrack requires neither external mask input nor any form of re-initialization. As exhibited in Table I, BundleTrack significantly outperforms the comparison points under all metrics and over all object categories, despite not accessing instance or category-level 3D models.

V-C Results on YCBInEOAT Dataset

Evaluation exclusively on static objects captured by a moving camera cannot completely reflect the properties of a 6D pose tracking method. For this reason, the YCBInEOAT dataset is chosen to evaluate tracking in scenarios where objects are moving in front of the camera. The same evaluation protocol is followed as in prior work. Results are computed from the accuracy-threshold AUC (Area Under Curve) measured by $\text{ADD}=\frac{1}{m}\sum_{x\in M}\|Rx+T-(\hat{R}x+\hat{T})\|$, which performs exact model matching, and $\text{ADD-S}=\frac{1}{m}\sum_{x_{1}\in M}\min_{x_{2}\in M}\|Rx_{1}+T-(\hat{R}x_{2}+\hat{T})\|$, designed for evaluating symmetric objects. Similar to prior work, the ground-truth object’s pose in the camera’s frame is provided as initialization. No re-initialization is allowed during the tracking process.
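The two distances can be computed directly from their definitions, as in the minimal sketch below. It assumes the model $M$ is a small list of 3D points and uses a brute-force nearest-neighbor search for ADD-S; the function names are hypothetical, and the accuracy-threshold AUC that aggregates these distances over a sequence is not shown.

```python
import math


def transform(rot, trans, point):
    """Apply the rigid transform R x + T to a 3D point."""
    return [sum(rot[i][k] * point[k] for k in range(3)) + trans[i]
            for i in range(3)]


def add_metric(model, r_gt, t_gt, r_est, t_est):
    """ADD: mean distance between corresponding transformed model points."""
    total = 0.0
    for x in model:
        total += math.dist(transform(r_gt, t_gt, x), transform(r_est, t_est, x))
    return total / len(model)


def add_s_metric(model, r_gt, t_gt, r_est, t_est):
    """ADD-S: mean distance from each ground-truth point to its closest estimated point."""
    est_points = [transform(r_est, t_est, x) for x in model]
    total = 0.0
    for x in model:
        gt_point = transform(r_gt, t_gt, x)
        total += min(math.dist(gt_point, e) for e in est_points)
    return total / len(model)


def rot_z(deg):
    """Rotation about the z-axis, used only to build the toy example."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]


# Toy model with four-fold symmetry about the z-axis.
model = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]
zero = [0.0, 0.0, 0.0]
add_value = add_metric(model, rot_z(0), zero, rot_z(90), zero)
add_s_value = add_s_metric(model, rot_z(0), zero, rot_z(90), zero)
```

For this symmetric toy model, a 90-degree rotation error yields a large ADD (each point moves by $\sqrt{2}$) but an ADD-S of essentially zero, which illustrates why ADD-S is the appropriate choice for symmetric objects.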

Quantitative and qualitative results are shown in Table II and Fig. 4, respectively. Comparison points include state-of-the-art 6D pose tracking methods that use object CAD models, such as RGF, dbot PF and $se(3)$-TrackNet. 6-PACK is a state-of-the-art 6D pose tracking approach relying on category-level 3D models. It is evaluated on the objects “021_bleach_cleanser”, “006_mustard_bottle” and “005_tomato_soup_can” using the officially released networks (https://github.com/j96w/6-PACK) trained on the “bottle” and “can” categories, respectively. For the remaining objects, “003_cracker_box” and “004_sugar_box”, no suitable corresponding category can be found in existing 3D model databases, and thus 6-PACK cannot be retrained and evaluated on them. For 6-PACK, the 3D bounding box of the object model, computed from forward kinematics, is provided in every frame to crop the ROI from the point cloud, since it is more reliable than its default module, which extrapolates the 3D bounding box from the estimated motion. For MaskFusion and BundleTrack, the initial object mask is obtained by table fitting and removal, followed by Euclidean Clustering as implemented in PCL. The original MaskFusion’s segmentation module, Mask-RCNN, cannot be retrained on this benchmark due to the lack of a training set. Therefore, during tracking, the target object mask is computed by segmenting out the region of the robot arm and end-effector using forward kinematics. For instances of irregular shapes or colors (“021_bleach_cleanser”, “006_mustard_bottle”) within the “bottle” category that 6-PACK has been trained on, 6-PACK struggles to produce satisfactory results. Nevertheless, BundleTrack consistently demonstrates high-quality tracking without any retraining or fine-tuning. This establishes the generalizability of BundleTrack to novel object instances regardless of their out-of-distribution properties within the category. BundleTrack also achieves comparable or superior performance even when compared against methods relying on object instance CAD models.

V-D Analysis

Ablation Study: An ablation study investigates the effectiveness of the online global pose graph optimization and of each energy term, presented in Fig. 5 (a).

Sensitivity to Initial Pose: As mentioned, random translation noise within 4cm range is added to the initial pose. This part further investigates robustness under different translation and rotation noise levels, shown in Fig. 5 (b).

Computation Time: The average running times of the modules are given in Fig. 5 (c). The entire framework runs at 10Hz on average, including video segmentation. The 6-PACK, TEASER++* and MaskFusion methods from related work run at 4Hz, 11Hz and 17Hz respectively on the same machine.

Tracking Drift Analysis: Fig. 5 (d) presents the rotation and translation errors w.r.t. timestamps, compared against representative related works. Results are averaged across all videos of the NOCS dataset.

Generalization: The neural networks’ weights and hyper-parameters in BundleTrack are fixed without any retraining or fine-tuning across all evaluations (Sec. V-B, V-C). When applied to novel instances, the framework does not require access to instance or category-level 3D models for training or registration.

Failure Cases: While BundleTrack is able to robustly keep tracking in all experiments without losing track or requiring re-initialization, intermediate imprecise estimates are observed, such as the cases illustrated in Fig. 6.

VI CONCLUSION

This work presents BundleTrack, a general framework for tracking the 6D pose of novel objects without any assumptions on instance or category-level 3D models. Extensive experiments demonstrate that it is able to perform long-term accurate tracking under various challenging scenarios. It even achieves comparable performance to state-of-the-art methods that depend on the target object’s CAD model. Future research includes exploring the combination of BundleTrack with model-free grasping methods to perform robust pick-and-place or in-hand dexterous manipulation for a wide variety of novel objects.