BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects

Bowen Wen, Jonathan Tremblay, Valts Blukis, Stephen Tyree, Thomas Muller, Alex Evans, Dieter Fox, Jan Kautz, Stan Birchfield

cs.CV cs.AI cs.GR cs.RO

Introduction

Two fundamental (and closely related) problems in computer vision are 6-DoF (“degree of freedom”) pose tracking and 3D reconstruction of an unknown object from a monocular RGBD video. Solving these problems will unlock a wide range of applications in areas such as augmented reality, robotic manipulation, learning-from-demonstration, and sim-to-real transfer.

Prior efforts often consider these two problems separately. For example, neural scene representations have achieved great success in creating high quality 3D object models from real data. These approaches, however, assume known camera poses and/or ground-truth object masks. Furthermore, capturing a static object by a dynamically moving camera prevents full 3D reconstruction (e.g., the bottom of the object is never seen if resting on a table). On the other hand, instance-level 6-DoF object pose estimation and tracking methods often require a textured 3D model of the test object beforehand for pre-training and/or online template matching. While category-level methods enable generalization to new object instances within the same category, they struggle with out-of-distribution object instances and unseen object categories.

To overcome these limitations, in this paper we propose to solve these two problems jointly. Our method assumes that the object is rigid, and it requires a 2D object mask in the first frame of the video. Apart from these two requirements, the object can be moved freely throughout the video, even undergoing severe occlusion. Our approach is similar in spirit to prior work in object-level SLAM, but we relax many common assumptions, allowing us to handle occlusion, specularity, lack of visual texture and geometric cues, and abrupt object motion. Key to our method is an online pose graph optimization process, a concurrent Neural Object Field to reconstruct the 3D shape and appearance, and a memory pool to facilitate communication between the two processes. The robustness of our method is highlighted in Fig. 1.

Our contributions can be summarized as follows:

A novel method for causal 6-DoF pose tracking and 3D reconstruction of a novel unknown dynamic object. This method leverages a novel co-design of concurrent tracking and neural reconstruction processes that run online in near real-time while largely reducing tracking drift.

We introduce a hybrid SDF representation to deal with uncertain free space caused by the unique challenges in a dynamic object-centric setting, such as noisy segmentation and external occlusions from interaction.

Experiments on three public benchmarks demonstrate state-of-the-art performance against leading methods.

Related Work

6-DoF Object Pose Estimation and Tracking. 6-DoF object pose estimation infers the 3D translation and 3D rotation of a target object in the camera’s frame. State-of-the-art methods often require instance- or category-level object CAD models for offline training or online template matching, which prevents their application to novel unknown objects. Although several recent works relax this assumption and aim to quickly generalize to novel unseen objects, they still require pre-capturing posed reference views of the test object, which is not assumed in our setting. Aside from single-frame pose estimation, 6-DoF object pose tracking leverages temporal information to estimate per-frame object poses throughout the video. Similar to their single-frame counterparts, these methods make various levels of assumptions, such as training and testing on the same objects or pretraining on the same category of objects. BundleTrack shares the closest setting to ours, generalizing pose tracking instantly to novel unknown objects. In contrast, our co-design of tracking and reconstruction with a novel neural representation not only results in more robust tracking, as validated in experiments (Sec. 4), but also enables an additional shape output, which is not possible with BundleTrack.

Simultaneous Localization and Mapping. SLAM solves a similar problem to the one addressed in this work, but focuses on tracking the camera pose w.r.t. a large static environment. Dynamic-SLAM methods usually track dynamic objects by frame-model Iterative Closest Point (ICP) combined with color, probabilistic data association, or 3D level-set likelihood maximization. Models are simultaneously reconstructed on-the-fly by aggregating the observed RGBD data with the newly tracked pose. In contrast, our method leverages a novel Neural Object Field representation that allows for automatic on-the-fly fusion, while dynamically rectifying historically tracked poses to maintain multi-view consistency. We focus on the object-centric setting including dynamic scenarios, in which there is often a lack of texture or geometric cues, and severe occlusions are frequently introduced by the interaction agent; these difficulties rarely arise in traditional SLAM. Compared to static scenes studied in object-level SLAM, dynamic interaction also allows observing different faces of the object for more complete 3D reconstruction.

Object Reconstruction. Retrieving a 3D mesh from images has been extensively studied using learning-based methods. With recent advances in neural scene representation, high quality 3D models can be reconstructed, though most of these methods assume known camera poses or ground-truth segmentation and often focus on static scenes with rich texture or geometric cues. In particular, one prior work presents a semi-automatic method with a similar goal but uses manual object pose annotations to retrieve a textured model of the object. In contrast, our method is fully automatic and operates over the video stream causally. Another line of research leverages human hand or body priors to resolve object scale ambiguity or refine object pose estimations via contact/collision constraints. In contrast, we do not assume specific knowledge of the interaction agent, which allows us to generalize to drastically different forms of interactions and scenarios, ranging from human hands and human bodies to robot arms, as shown in the experiments. This also eliminates another possible source of error from imperfect human hand/body pose estimation.

Approach

An overview of our method is depicted in Fig. 2. Given a monocular RGBD input video, along with a segmentation mask of the object of interest in the first frame only, our method tracks the 6-DoF pose of the object through subsequent frames and reconstructs a textured 3D model of the object. All processing is causal (no access to future frames). The object is assumed to be rigid, but no specific amount of texture is required; our method works well with untextured objects. In addition, no instance-level CAD model of the object, nor category-level prior (e.g., training on the same object category beforehand), is needed.

3.2 Memory Pool

To alleviate catastrophic forgetting, which can cause long-term tracking drift, it is important to retain information about past frames. A common approach exploited by prior work is to fuse each posed observation into an explicit global model. The fused global model is then used to compare against the subsequent new frames for their pose estimation (frame-to-model matching). However, such an approach is too brittle for the challenging scenarios considered in this work, for at least two reasons. First, any imperfections in the pose estimates will be accumulated when fusing into the global model, causing additional errors when estimating the pose of subsequent frames. Such errors frequently occur when there are insufficient texture or geometric cues on the object, or when this information is not visible in the frame. Such errors accumulate over time and are irreversible. Second, in the case of long-term complete occlusion, large motion changes make registration between the global model and the reappearing frame observation difficult and suboptimal.

More specifically, $\xi_{t}$ is compared with the poses of all existing memory frames in the pool. Since in-plane object rotation does not provide additional information, this comparison takes into account rotational geodesic distance while ignoring rotation around the camera’s optical axis. Ignoring this difference allows the system to allocate memory frames more sparsely in the space while maintaining a similar amount of multi-view consistency information. This trick enables jointly optimizing a wider range of poses, compared to previous work, when selecting the same number of memory frames to participate in the online pose graph optimization.
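
The pose comparison described above can be illustrated with a minimal sketch. It rests on an assumption not stated in the text: that ignoring rotation around the camera's optical axis is implemented by comparing viewing directions, i.e., the optical axis expressed in the object frame. The function names, the object-to-camera rotation convention, and the angular threshold `min_angle_rad` are hypothetical.

```python
import numpy as np

def view_direction(R_obj_to_cam: np.ndarray) -> np.ndarray:
    """Camera optical axis (+z) expressed in the object frame: R^T e_z."""
    return R_obj_to_cam.T @ np.array([0.0, 0.0, 1.0])

def inplane_invariant_angle(R_a: np.ndarray, R_b: np.ndarray) -> float:
    """Angle between two viewing directions (radians).

    Left-multiplying either rotation by a rotation about the camera z-axis
    leaves the viewing direction unchanged, so in-plane rotation is ignored.
    """
    cos_angle = np.clip(view_direction(R_a) @ view_direction(R_b), -1.0, 1.0)
    return float(np.arccos(cos_angle))

def is_new_viewpoint(R_new, pool_rotations, min_angle_rad: float) -> bool:
    """Hypothetical rule: keep a frame only if it is far from every pooled view."""
    return all(inplane_invariant_angle(R_new, R) > min_angle_rad
               for R in pool_rotations)

def rot_z(theta: float) -> np.ndarray:
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(theta: float) -> np.ndarray:
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
```

Under this convention, two poses that differ only by a rotation about the optical axis have zero distance, so they would not both be stored.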

3.3 Online Pose Graph Optimization

As described below (Sec. 3.4), the Neural Object Field is also used to assist in this optimization process. Every frame in the memory pool has associated with it a binary flag $b(\mathcal{F})$ indicating whether the pose of this particular frame has had the benefit of being updated by the Neural Object Field. When a frame is first added to the memory pool, $b(\mathcal{F})=$ False. This flag remains unchanged through subsequent online updates until the frame’s pose has been updated by the Neural Object Field, at which point it is forever set to True.

Concurrent with updating the pose of the new frame $\mathcal{F}_{t}$ , all the poses of the subset of frames selected for the online pose graph optimization are also updated to the memory pool, as long as their flag is set to False. Those frames whose flag is set to True continue to be updated by the more reliable Neural Object Field process, but they cease being modified by the online pose graph optimization.
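
The flag logic above can be sketched as a small bookkeeping structure. This is only an illustration of the update rules described in the text; the class and method names are hypothetical, and poses are assumed to be stored as 4×4 matrices.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MemoryFrame:
    pose: np.ndarray              # 4x4 object pose of this memory frame
    refined_by_nof: bool = False  # the binary flag b(F); starts as False

class MemoryPool:
    def __init__(self):
        self.frames = {}

    def add(self, frame_id, pose):
        """A newly added frame always starts with b(F) = False."""
        self.frames[frame_id] = MemoryFrame(pose.copy())

    def update_from_pose_graph(self, optimized_poses):
        """Online pose graph results only overwrite frames with b(F) = False."""
        for frame_id, pose in optimized_poses.items():
            frame = self.frames[frame_id]
            if not frame.refined_by_nof:
                frame.pose = pose.copy()

    def update_from_nof(self, refined_poses):
        """Neural Object Field results always apply and set b(F) = True for good."""
        for frame_id, pose in refined_poses.items():
            frame = self.frames[frame_id]
            frame.pose = pose.copy()
            frame.refined_by_nof = True
```

Once a frame has been refined by the Neural Object Field, later pose graph updates leave it untouched, while later Neural Object Field updates still apply.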

Optimization. In the pose graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, the nodes consist of $\mathcal{F}_{t}$ and the above selected subset of memory frames: $\mathcal{V}=\mathcal{F}_{t}\cup\mathcal{P}_{pg}$, so $|\mathcal{V}|=K+1$. The objective is to find the optimal poses that minimize the total loss of the pose graph:

$$\mathcal{L}_{pg}=w_{s}\mathcal{L}_{s}(t)+\sum_{i\in\mathcal{V},\,j\in\mathcal{V},\,i\neq j}\left[w_{f}\mathcal{L}_{f}(i,j)+w_{p}\mathcal{L}_{p}(i,j)\right],$$

where $\mathcal{L}_{f}$ and $\mathcal{L}_{p}$ are pairwise edge losses, and $\mathcal{L}_{s}$ is an additional unary loss. The scalar factors $w_{f},w_{p},w_{s}$ are all set to 1 empirically. The loss

$$\mathcal{L}_{p}(i,j)=\sum_{p\in I_{i}}\rho\left(\left|n_{i}(p)\cdot\left(T_{ij}^{-1}\pi_{D_{j}}^{-1}\left(\pi_{j}\left(T_{ij}\pi_{D_{i}}^{-1}(p)\right)\right)-\pi_{D_{i}}^{-1}(p)\right)\right|\right)$$

measures the pixel-wise point-to-plane distance via re-projective association, where $\rho$ is a robust (Huber) loss, ${T_{ij}}\equiv\xi_{j}\xi_{i}^{-1}$ transforms from $\mathcal{F}^{(i)}$ to $\mathcal{F}^{(j)}$, $\pi_{j}$ denotes the perspective projection mapping onto image $I_{j}$ associated with $\mathcal{F}^{(j)}$, ${\pi^{-1}_{D_{j}}}$ represents the inverse projection mapping obtained by looking up the depth image $D_{j}$ at the pixel location, and $n_{i}(p)$ denotes the normal obtained by looking up the normal map of $\mathcal{F}^{(i)}$ at pixel location $p\in I_{i}$. Lastly, the unary loss

$$\mathcal{L}_{s}(t)=\sum_{p\in I_{t}}\rho\left(\left|\Omega\left(\xi_{t}^{-1}\pi_{D_{t}}^{-1}(p)\right)\right|\right)$$

measures the point-wise distance to the neural implicit shape using the current frame, where $\Omega(\cdot)$ denotes the signed distance function from the Neural Object Field, as will be discussed in Sec. 3.4. The Neural Object Field weights are frozen in this step. This unary loss is taken into account only after the initial training of the Neural Object Field has converged.
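
To make the re-projective association behind the point-to-plane term concrete, the following NumPy sketch computes the per-pixel residuals for one pair of frames. It is an illustration under stated assumptions rather than the authors' implementation: both frames share one pinhole intrinsics matrix `K`, the association uses nearest-pixel lookup, pixels with zero depth are invalid, and the robust kernel and the rejection thresholds are left out. All function names are hypothetical.

```python
import numpy as np

def backproject(depth, K):
    """Inverse projection: lift every pixel to a 3D camera-frame point via its depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1)  # shape (h, w, 3)

def point_to_plane_residuals(depth_i, normals_i, depth_j, K, T_ij):
    """|n_i(p) . (T_ij^-1 q_j - q_i)| for every pixel p of frame i with a valid match."""
    h, w = depth_j.shape
    pts_i = backproject(depth_i, K).reshape(-1, 3)
    n_i = normals_i.reshape(-1, 3)
    valid = pts_i[:, 2] > 0

    # Move the frame-i points into frame j and project them onto image j.
    pts_in_j = pts_i @ T_ij[:3, :3].T + T_ij[:3, 3]
    valid &= pts_in_j[:, 2] > 1e-6
    z = np.where(valid, pts_in_j[:, 2], 1.0)
    u = np.round(K[0, 0] * pts_in_j[:, 0] / z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_in_j[:, 1] / z + K[1, 2]).astype(int)
    valid &= (u >= 0) & (u < w) & (v >= 0) & (v < h)

    # Look up the observed depth in frame j at the projected pixel and lift it to 3D.
    pts_j = backproject(depth_j, K)[np.clip(v, 0, h - 1), np.clip(u, 0, w - 1)]
    valid &= pts_j[:, 2] > 0

    # Bring the associated point back to frame i and measure along the normal.
    T_ji = np.linalg.inv(T_ij)
    pts_back = pts_j @ T_ji[:3, :3].T + T_ji[:3, 3]
    residuals = np.abs(np.sum(n_i * (pts_back - pts_i), axis=1))
    return residuals[valid]
```

For two identical fronto-parallel planes and an identity transform the residuals vanish; shifting the relative pose along the optical axis by 0.1 yields a residual of 0.1 at every valid pixel.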

The poses are represented as inversions of camera poses w.r.t. the object, parametrized using Lie Algebra, fixing the coordinate frame of the initial frame as the anchor point. We solve the entire pose graph optimization via the Gauss-Newton algorithm with iterative re-weighting. The optimized pose corresponding to $\mathcal{F}_{t}$ becomes its updated pose $\xi_{t}$ . For the rest of the selected memory frames, their optimized poses in the memory pool are also updated to rectify possible errors computed earlier in the video, unless $b(\mathcal{F})=$ True, as mentioned earlier.
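
As a hedged illustration of the two ingredients named above, the sketch below shows an se(3) exponential map for a Lie-algebra pose update and the weights that iteratively re-weighted least squares would use with a Huber kernel. The twist ordering (translation first, then rotation), the left-multiplicative update, and the choice of Huber weights are assumptions made for this example.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix such that hat(w) @ x == np.cross(w, x)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(xi):
    """Exponential map from a twist xi = (v, w) in se(3) to a 4x4 rigid transform."""
    xi = np.asarray(xi, dtype=float)
    v, w = xi[:3], xi[3:]
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-8:
        # First-order approximation near the identity.
        R = np.eye(3) + W
        V = np.eye(3) + 0.5 * W
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta ** 2
        C = (theta - np.sin(theta)) / theta ** 3
        R = np.eye(3) + A * W + B * (W @ W)   # Rodrigues' formula
        V = np.eye(3) + B * W + C * (W @ W)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T

def apply_update(pose, delta):
    """Left-multiplicative pose update with a small twist from one Gauss-Newton step."""
    return se3_exp(delta) @ pose

def huber_weights(residuals, delta):
    """IRLS weights for a Huber kernel: 1 inside the threshold, delta/|r| outside."""
    a = np.abs(np.asarray(residuals, dtype=float))
    return np.where(a <= delta, 1.0, delta / np.maximum(a, delta))
```

In each Gauss-Newton iteration, residuals would be re-weighted by `huber_weights` before solving the normal equations, and every participating pose would then be updated with `apply_update`.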

3.4 Neural Object Field

A key to our approach is learning an object-centric neural signed distance field that learns multi-view consistent 3D shape and appearance of the object while adjusting memory frames’ poses. It is learned per-video and does not require pre-training in order to generalize to novel unknown objects. This Neural Object Field trains in a separate thread parallel to the online pose tracking. At the start of each training period, the Neural Object Field consumes all the memory frames (along with their poses) from the pool and begins learning. When training converges, the optimized poses are updated to the memory pool to aid subsequent online pose graph optimization, which fetches these updated memory frame poses each time to alleviate tracking drift. The learned SDF is also updated to the subsequent online pose graph to compute the unary loss $\mathcal{L}_{s}$ described in Sec. 3.3. The Neural Object Field training process is then repeated by grabbing new memory frames from the pool.

Rendering. Given the object pose $\xi$ of a memory frame, an image is rendered by emitting rays through the pixels. 3D points are sampled at different locations along the ray:

$$x_{i}(r)=o(r)+t_{i}d(r),$$

where $o(r)$ and $d(r)$ are the ray origin and ray direction, both of which depend on $\xi$, and $t_{i}$ governs the position along the ray.

The color $c$ of a ray $r$ is integrated over near-surface regions:

$$c(r)=\int_{z(r)-\lambda}^{z(r)+0.5\lambda}w(x_{i})\,\Phi\left(f_{\Omega(x_{i})},n(x_{i}),d(r)\right)dx,$$

$$w(x_{i})=\frac{1}{1+e^{-\alpha\Omega(x_{i})}}\cdot\frac{1}{1+e^{\alpha\Omega(x_{i})}},$$

where $\Phi$ is the appearance network, which takes the intermediate geometric feature $f_{\Omega(x_{i})}$, the point normal $n(x_{i})$ and the view direction $d(r)$, and $w(x_{i})$ is the bell-shaped probability density function that depends on the distance from the point to the implicit object surface, i.e., the signed distance $\Omega(x_{i})$. $\alpha$ (set to a constant) adjusts the softness of the probability density distribution. The probability reaches a local maximum at the surface intersection. $z(r)$ is the depth value of the ray from the depth image. $\lambda$ is the truncation distance. In the color integral, we ignore the contribution from empty space that is more than $\lambda$ away from the surface to reduce over-fitting from the empty space in the neural field in order to improve pose updates. We then only integrate up to a $0.5\lambda$ penetrating distance to model self-occlusion. An alternative to directly using the depth reading $z(r)$ to guide the integration would be to infer the zero-crossing surface from $\Omega(x_{i})$. However, we found that this requires denser point sampling and leads to slower training convergence compared to using the depth.
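
The near-surface weighting can be sketched numerically. The snippet below assumes one common bell-shaped choice, the product of two opposing sigmoids of the scaled signed distance, and masks out samples outside the integration range $[z(r)-\lambda,\ z(r)+0.5\lambda]$. The function names are hypothetical, and the quadrature over samples is left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bell_weight(sdf, alpha):
    """Bell-shaped weight of the signed distance; peaks (value 0.25) at sdf == 0."""
    return sigmoid(alpha * sdf) * sigmoid(-alpha * sdf)

def near_surface_weights(sdf, t, z, lam, alpha):
    """Weights for samples at ray depths t, zeroed outside [z - lam, z + 0.5 * lam]."""
    near = (t >= z - lam) & (t <= z + 0.5 * lam)
    return bell_weight(sdf, alpha) * near
```

A larger `alpha` concentrates the weight more tightly around the zero level set, which corresponds to the softness control described above.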

Efficient Hierarchical Ray Sampling. For efficient rendering, we construct an Octree representation before training by naively merging the point clouds of the posed memory frames. We then perform hierarchical sampling along the rays. Specifically, we first uniformly sample $N$ points bounded by the occupancy voxels (gray boxes in Fig. 3), terminating at $z(r)+0.5\lambda$. A custom CUDA kernel was implemented to skip the sampling of intermediate unoccupied voxels. Additional samples are allocated around the surface for higher quality reconstruction: instead of importance sampling based on the SDF predictions, which requires multiple forward passes through the network, we draw $N^{\prime}$ point samples from a normal distribution centered around the depth reading, $\mathcal{N}(z(r),\lambda^{2})$. This results in $N+N^{\prime}$ total samples, without querying the more expensive multi-resolution hash encoding or the networks.
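
A simplified version of this two-stage sampling for a single ray is sketched below. It is not the CUDA implementation: Octree voxel skipping is omitted, the start of the occupied interval is passed in as a hypothetical `t_near`, the first stage uses evenly spaced rather than random samples, and the Gaussian samples are not clipped.

```python
import numpy as np

def sample_ray_depths(z, lam, t_near, n_uniform, n_surface, rng):
    """Return N + N' sorted sample depths along one ray.

    z         : depth reading z(r) of the ray
    lam       : truncation distance lambda
    t_near    : depth where the ray enters occupied space (stand-in for the Octree)
    n_uniform : N, samples spread up to z + 0.5 * lam
    n_surface : N', samples drawn from N(z, lam^2) around the depth reading
    """
    t_far = z + 0.5 * lam
    uniform = np.linspace(t_near, t_far, n_uniform)
    surface = rng.normal(loc=z, scale=lam, size=n_surface)
    return np.sort(np.concatenate([uniform, surface]))

rng = np.random.default_rng(0)
depths = sample_ray_depths(z=1.0, lam=0.01, t_near=0.8,
                           n_uniform=128, n_surface=64, rng=rng)
```

Because the second stage depends only on the depth reading, no network query is needed to place the samples.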

Hybrid SDF Modeling. Due to the imperfect segmentation and external occlusions, we propose a hybrid signed distance model. Specifically, we divide the space into three regions to learn the SDF (see Fig. 3):

Uncertain free space: These points (yellow in the figure) correspond to the background in the segmentation mask or to pixels with missing depth values, for which the observation is unreliable. For instance, at ray $r_{1}$’s pixel location in the binary mask, the finger’s occlusion results in background prediction, even though it actually corresponds to the pitcher handle. Naively ignoring the background when emitting rays would lose the contour information, causing bias. Therefore, instead of fully trusting or ignoring uncertain free space, we assign a small positive value $\epsilon$ to be potentially external to the object surface so that it can quickly adapt when a more reliable observation is available later:

$$\mathcal{L}_{u}=\frac{1}{|\mathcal{X}_{u}|}\sum_{x\in\mathcal{X}_{u}}\left(\Omega(x)-\epsilon\right)^{2},$$

where $\mathcal{X}_{u}$ denotes the set of points sampled in uncertain free space.

Empty space: These points (red in the figure) are in front of the depth reading up to a truncation distance, making them almost certainly external to the object surface. We apply $L_{1}$ loss to the truncated signed distance to encourage sparsity:

$$\mathcal{L}_{e}=\frac{1}{|\mathcal{X}_{e}|}\sum_{x\in\mathcal{X}_{e}}\left|\Omega(x)-\lambda\right|,$$

where $\mathcal{X}_{e}$ denotes the set of points sampled in empty space.

Near-surface space: These points (blue in the figure) are near the surface, extending no more than $0.5\lambda$ behind the depth reading (i.e., up to $z(r)+0.5\lambda$) to model self-occlusion. This space is critical for learning the sign flipping in SDF and the zero level set. We approximate the near-surface SDF by projective approximation for efficiency:

$$\mathcal{L}_{\textit{surf}}=\frac{1}{|\mathcal{X}_{\textit{surf}}|}\sum_{x\in\mathcal{X}_{\textit{surf}}}\left(\Omega(x)+d_{x}-d_{D}\right)^{2},$$

where $\mathcal{X}_{\textit{surf}}$ denotes the set of points sampled in near-surface space, and $d_{x}=\left\|x-o(r)\right\|_{2}$ and $d_{D}=\left\|\pi^{-1}(z(r))\right\|_{2}$ are the distances from the ray origin to the sample point and to the observed depth point, respectively.
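
The three-region supervision can be written down compactly. The sketch below is an illustration under assumptions: a squared error toward $\epsilon$ in uncertain free space, an $L_{1}$ error toward the truncation value $\lambda$ in empty space, and a squared error toward the projective target $d_{D}-d_{x}$ near the surface. The region boundaries follow the description above, while the function name and array layout are hypothetical.

```python
import numpy as np

def hybrid_sdf_losses(sdf_pred, d_x, d_D, uncertain, lam, eps):
    """Per-region SDF losses for a batch of ray samples.

    sdf_pred  : (M,) predicted signed distances at the samples
    d_x       : (M,) distance from the ray origin to each sample
    d_D       : (M,) distance from the ray origin to the observed depth point
                of each sample's ray (unused where `uncertain` is True)
    uncertain : (M,) bool, True for background-mask or missing-depth rays
    """
    certain = ~uncertain
    empty = certain & (d_x < d_D - lam)
    surf = certain & (d_x >= d_D - lam) & (d_x <= d_D + 0.5 * lam)

    def mean_or_zero(values):
        return float(values.mean()) if values.size else 0.0

    loss_u = mean_or_zero((sdf_pred[uncertain] - eps) ** 2)
    loss_e = mean_or_zero(np.abs(sdf_pred[empty] - lam))
    loss_surf = mean_or_zero((sdf_pred[surf] + d_x[surf] - d_D[surf]) ** 2)
    return loss_u, loss_e, loss_surf
```

Samples deeper than $0.5\lambda$ behind the depth reading fall into none of the regions, consistent with ray sampling terminating at $z(r)+0.5\lambda$.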

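The training loss is a weighted sum:

$$\mathcal{L}=w_{u}\mathcal{L}_{u}+w_{e}\mathcal{L}_{e}+w_{\textit{surf}}\mathcal{L}_{\textit{surf}}+w_{c}\mathcal{L}_{c}+w_{\textit{eik}}\mathcal{L}_{\textit{eik}},$$

where $\mathcal{L}_{u}$, $\mathcal{L}_{e}$ and $\mathcal{L}_{\textit{surf}}$ are the losses over the uncertain free space, empty space and near-surface space, respectively, and $\mathcal{L}_{c}$ denotes the $L_{2}$ loss over the foreground color for appearance network supervision:

$$\mathcal{L}_{c}=\frac{1}{|\mathcal{R}|}\sum_{r\in\mathcal{R}}\left\|c(r)-\bar{c}(r)\right\|_{2},$$

where $\mathcal{R}$ is the set of foreground rays and $\bar{c}(r)$ is the observed color at the ray’s pixel, and $\mathcal{L}_{\textit{eik}}$ is the Eikonal regularization over the SDF in near-surface space:

$$\mathcal{L}_{\textit{eik}}=\frac{1}{|\mathcal{X}_{\textit{surf}}|}\sum_{x\in\mathcal{X}_{\textit{surf}}}\left(\left\|\nabla\Omega(x)\right\|_{2}-1\right)^{2},$$

where $\mathcal{X}_{\textit{surf}}$ is the set of near-surface samples.

Unlike prior work that requires ground-truth masks as input, we do not perform mask supervision, since the mask predicted by the segmentation network is often noisy.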

Experiments

To evaluate our method, we consider three real-world datasets with drastically different forms of interactions and dynamic scenarios. For results on in-the-wild applications and static scenes, see the project page.

HO3D: This dataset contains RGBD videos of a human hand interacting with YCB objects, captured by an Intel RealSense camera at close range. Ground truth is automatically generated from multi-view registration. We adopt the most recent version, HO-3D_v3, and test on the official evaluation set. This results in 4 different objects, 13 video sequences, and 20428 frames in total.

YCBInEOAT: This dataset contains ego-centric RGBD videos of a dual-arm robot manipulating YCB objects, captured by an Azure Kinect camera at mid range. There are three types of manipulation: (1) single arm pick-and-place, (2) within-hand manipulation, and (3) pick-and-place with handoff between arms. Although this dataset was originally developed to evaluate pose estimation approaches relying on CAD models, we do not provide any object prior knowledge to the evaluated methods. There are 5 different objects, 9 videos, and 7449 frames in total.

BEHAVE: This dataset contains RGBD videos of a human body interacting with objects, captured at far range by a pre-calibrated multi-view system with Azure Kinect cameras. However, we constrain our evaluation to the single-view setting, where severe occlusions frequently occur. We evaluate on the official test split, excluding the deformable objects. This results in 16 different objects, 70 videos/scenes, and 107982 frames in total.

4.2 Metrics

We separately evaluate pose estimation and shape reconstruction. For 6-DoF object pose, we compute the area under the curve (AUC) percentage of ADD and ADD-S metrics using ground-truth object geometry. For 3D shape reconstruction, we compute the chamfer distance between the final reconstructed mesh and ground-truth mesh in the canonical coordinate frame defined by the first image of each video. More details can be found in the appendix.

4.3 Baselines

We compare against DROID-SLAM (RGBD), NICE-SLAM, KinectFusion, BundleTrack and SDF-2-SDF using their open-source implementations with the best tuned parameters. We additionally include the baseline results from their leaderboard. Note that some related methods focus on deformable objects, and their root 6-DoF tracking and fusion often build on prior methods, whereas we focus on rigid objects that are dynamically moving. We thus omit comparisons with them. The inputs to each evaluated method are the RGBD video and the first frame’s mask indicating the object of interest. We augment the comparison methods with the same video segmentation masks used in our framework for fair comparison, to focus on 6-DoF object pose tracking and 3D reconstruction performance. In the case of tracking failure, no re-initialization is performed, in order to test long-term tracking robustness.

DROID-SLAM, NICE-SLAM and KinectFusion were originally proposed for camera pose tracking and scene reconstruction. When given the segmented images, they run in an object-centric setting. Since DROID-SLAM and BundleTrack cannot reconstruct an object mesh, we augment these methods with TSDF Fusion for shape reconstruction evaluation. For NICE-SLAM and our method, we initialize the neural volume’s bound using only the first frame’s point cloud (to preserve causal processing, we cannot access future frames).

4.4 Comparison Results on HO3D

Quantitative results on HO3D are shown in Tab. 1 and Fig. 5. Our method outperforms the comparison methods by a large margin on both 6-DoF pose tracking and 3D reconstruction. For DROID-SLAM, NICE-SLAM and KinectFusion, when working in an object-centric setting, significantly fewer texture or geometric cues can be leveraged for tracking (e.g., on purely planar or cylindrical object surfaces), leading to poor performance. Fig. 5 presents the tracking error against time to study the long-term tracking drift. While BundleTrack achieves a translation error similarly low to that of our approach, it struggles with rotation estimation. In contrast, our method maintains a low tracking error throughout the video. We provide per-video quantitative results in the appendix.

Fig. 4 shows example qualitative results of the three most competitive methods. Despite multiple challenges such as severe hand occlusions, self-occlusions, few texture cues in intermediate observations and strong lighting reflections, our method keeps tracking accurately throughout the video and obtains dramatically higher quality 3D object reconstruction. Notably, our predicted pose is sometimes more accurate than the ground truth, which was annotated by multi-camera multi-view registration leveraging hand priors.

4.5 Comparison Results on YCBInEOAT

Quantitative results on YCBInEOAT are shown in Tab. 2. This dataset captures the interaction between the robot arms and the object from an ego-centric view, which leads to challenges due to the constrained camera view and severe occlusions by the robot arms. For completeness, in this table we also include additional baseline methods reported in prior work. For fair comparison, we only include those baselines that, like our method, do not require instance- or category-level object knowledge. The results from these methods, indicated by an asterisk (∗), are copied directly from that prior work. Note that, in the case of (non-asterisk) BundleTrack, we re-run the algorithm with the same segmentation masks as ours for fair comparison, and we augment it with TSDF Fusion for reconstruction evaluation (same as Tab. 1). We omit re-running MaskFusion* and TEASER++* due to their relatively poorer performance.

Our approach sets a new benchmark record on the ADD-S metric and on chamfer distance in 3D reconstruction, while obtaining performance comparable to the previous state-of-the-art method on the ADD metric. In particular, while BundleTrack achieves competitive object pose tracking, it does not obtain satisfactory 3D reconstruction results. This demonstrates the benefits of our co-design of tracking and reconstruction.

4.6 Comparison Results on BEHAVE

Quantitative results on BEHAVE are shown in Tab. 3. We refer to the supplemental material for more detailed results. In our setting of single-view and zero-shot transfer without leveraging human body priors, this dataset exhibits extreme challenges. For instance, (i) there are long-term complete occlusions when the human carries the object and faces away from the camera; (ii) severe motion blur and abrupt displacement frequently occur due to the human freely swinging the object; (iii) the objects are of diverse properties and vary greatly in size; (iv) the video is captured at a distance from the camera, making it difficult for depth sensing. Therefore, evaluation on this benchmark pushes the boundary to a more difficult setting. Despite these challenges, our method is still able to perform long-term robust tracking in most scenarios and performs significantly better than previous methods.

4.7 Ablation Study

We investigate the effectiveness of our design choices on the HO3D dataset, given its more accurate pose annotations. The results are shown in Tab. 4. Ours w/o memory achieves dramatically worse performance, as there is no mechanism to alleviate tracking drift. Ours-GPG, even with a similar amount of computation, struggles on objects or observations with few texture or geometric cues due to its hand-crafted losses. Aside from object pose tracking, Ours w/o memory, Ours w/o NOF and Ours-GPG lack the module for 3D object reconstruction. Ours w/o hybrid SDF ignores the contour information and can be biased by false positive segmentation when rectifying the memory frames’ poses. This leads to less stable pose tracking and a noisier final 3D reconstruction. Ours w/o compact mem pool, under the same computational budget, leads to insufficient pose coverage during pose graph optimization and Neural Object Field learning, as mentioned in Sec. 3.2.

Conclusion

We presented a novel method for 6-DoF object tracking and 3D reconstruction from a monocular RGBD video. Our method only requires segmentation of the object in the initial frame. Leveraging two parallel threads that perform online pose graph optimization and Neural Object Field learning, respectively, our method is able to handle challenging scenarios, such as fast motion, partial and complete occlusion, lack of texture, and specular highlights. On several datasets we have demonstrated state-of-the-art results compared with existing methods. Future work will be aimed at leveraging shape priors to reconstruct unseen parts.


Appendix A Implementation Details

During coarse pose initialization, if there is no immediate previous frame to compare with (e.g., missing detection by the segmentation, or object reappearing after complete occlusion), the current frame will instead be compared with the memory frames. The memory frame which has more than 10 feature correspondences with the current frame is selected as the new reference frame for the coarse pose initialization. The following steps remain the same.

For online pose graph optimization, we constrain the maximum number of participating memory frames $K=10$ for efficiency. When computing $\mathcal{L}_{p}$ we reject corresponding points whose distance is larger than 1 cm, or their normal angle is larger than 20°. The Gauss-Newton optimization iterates for 7 steps.
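
The correspondence rejection rule stated above can be sketched as a simple mask. The thresholds (1 cm and 20°) come from the text; the assumptions are that points are given in meters in a common frame, that normals are unit length, and that the function name and argument layout are hypothetical.

```python
import numpy as np

def keep_correspondences(p_i, p_j, n_i, n_j, max_dist=0.01, max_angle_deg=20.0):
    """Boolean mask of point pairs that pass the distance and normal-angle tests.

    p_i, p_j : (M, 3) corresponding 3D points, in meters, in a common frame
    n_i, n_j : (M, 3) unit normals at those points
    """
    dist = np.linalg.norm(p_i - p_j, axis=1)
    cos_angle = np.clip(np.sum(n_i * n_j, axis=1), -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(cos_angle))
    return (dist <= max_dist) & (angle_deg <= max_angle_deg)
```

Only pairs that pass both tests would contribute to $\mathcal{L}_{p}$.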

For Neural Object Field learning, we normalize the object into the neural volume bound, where the scale is computed as 1.5 times the initial frame’s point cloud dimension. The neural volume’s coordinate system is based on the first frame’s centered point cloud. The geometry network $\Omega$ consists of a two-layer MLP with hidden dimension 64 and ReLU activation except for the last layer. The intermediate geometric feature $f_{\Omega(\cdot)}$ has dimension 16. The bias of the last layer is initialized to 0.1 for a small positive SDF prediction at the start of training. The appearance network $\Phi$ consists of a three-layer MLP with hidden dimension 64 and ReLU activation except for the last layer, where we apply sigmoid activation to map the color prediction to $[0,1]$. For Octree ray-tracing, the finest voxel size is set to 2 cm. We simplify the multi-resolution hash encoder to 4 levels, with number of feature vectors from 16 to 128 for efficiency. Each level’s feature dimension is set to 2. The hash table size is set to $2^{22}$. In each iteration the ray batch size is 2048. For hierarchical point sampling, $N$ and $N^{\prime}$ are set to 128 and 64, respectively. The truncation distance $\lambda$ is set to 1 cm. For uncertain free space, $\epsilon$ is set to 0.001. In the training loss, $w_{u}=100$, $w_{e}=1$, $w_{\textit{surf}}=1000$, $w_{c}=100$, $w_{\textit{eik}}=0.1$. We implement in PyTorch with the Adam optimizer. The initial learning rate is 0.01 with linear decay rate 0.1.

The Neural Object Field training runs concurrently in a separate thread and exchanges data with the memory pool after each training convergence (300 steps), which leads to sufficient pose refinement. The first training period starts when there are 10 memory frames in the pool. Upon training convergence, it returns the data to the memory pool and grabs the memory frames newly added to the pool during its last training period, to repeat the training process. The next training period reuses the latest updated frames’ poses. For the other trainable parameters, however, reusing their weights tends to get stuck in local minima if there is any sub-optimum in the previous training period, particularly due to noisy poses. Therefore, we re-initialize the network weights for each new training period. This takes a similar number of steps to refine the newly added memory frames’ poses, compared to reusing the previous network weights.

Appendix B Computation Time

All experiments were conducted on a standard desktop with an Intel i9-10980XE CPU and a single NVIDIA RTX 3090 GPU. Our method consists of two threads running concurrently. The online tracking thread processes frames at around 10.2 Hz, where on average video segmentation takes 18 ms, coarse matching takes 24 ms, and pose graph optimization takes 56 ms. Concurrently, the Neural Object Field thread runs in the background and takes 6.7 s on average for each training round, at the end of which it exchanges data with the main thread. On the same hardware, the competitive methods DROID-SLAM and BundleTrack run at 6.1 Hz and 11.2 Hz, respectively.
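
As a quick consistency check of the reported numbers, and assuming the three stages of the tracking thread run sequentially for each frame, the per-frame latency and the resulting frame rate are

$$18\,\text{ms}+24\,\text{ms}+56\,\text{ms}=98\,\text{ms},\qquad\frac{1000\,\text{ms/s}}{98\,\text{ms}}\approx 10.2\,\text{Hz},$$

which agrees with the stated tracking rate of around 10.2 Hz.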

Appendix C Metrics

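For evaluation, we decouple pose estimation and shape reconstruction so that they can be assessed separately. For 6-DoF object pose evaluation, we compute the area under the curve (AUC) percentage of the ADD and ADD-S metrics:

$$\text{ADD}=\frac{1}{|\mathcal{M}|}\sum_{x\in\mathcal{M}}\left\|(Rx+t)-(\tilde{R}x+\tilde{t})\right\|_{2},$$

$$\text{ADD-S}=\frac{1}{|\mathcal{M}|}\sum_{x_{1}\in\mathcal{M}}\min_{x_{2}\in\mathcal{M}}\left\|(Rx_{1}+t)-(\tilde{R}x_{2}+\tilde{t})\right\|_{2},$$

where $\mathcal{M}$ is the object model, $(R,t)$ is the ground-truth pose and $(\tilde{R},\tilde{t})$ is the estimated pose. Since the novel unknown object’s CAD model is inaccessible to the methods to define the coordinate system, we use the ground-truth pose in the first image to define the canonical coordinate frame of each video to evaluate the pose.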

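For 3D shape reconstruction evaluation, we report the chamfer distance between the final reconstructed mesh and the ground-truth mesh, using the following symmetric formulation:

$$d_{CD}=\frac{1}{2|\mathcal{M}_{1}|}\sum_{x_{1}\in\mathcal{M}_{1}}\min_{x_{2}\in\mathcal{M}_{2}}\left\|x_{1}-x_{2}\right\|_{2}+\frac{1}{2|\mathcal{M}_{2}|}\sum_{x_{2}\in\mathcal{M}_{2}}\min_{x_{1}\in\mathcal{M}_{1}}\left\|x_{1}-x_{2}\right\|_{2},$$

where $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ denote the points sampled from the reconstructed mesh and the ground-truth mesh, respectively.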

In our method, the mesh can be extracted by applying Marching Cubes over the zero level set in the Neural Object Field. For all methods, we use the same resolution (5 mm) to sample points for evaluation. Since most videos do not cover the complete surrounding view of the object, we cull the ground-truth mesh faces that are never visible in the video by a rendering test, given by the ground-truth mesh and pose.
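
For reference, the metrics of this appendix can be computed with a few lines of NumPy. The sketch below uses the standard definitions of ADD and ADD-S and a symmetric chamfer distance that averages the two directed nearest-neighbor terms; the equal weighting of the two directions is an assumption. It uses brute-force nearest-neighbor search, which is only practical for small point sets, and omits the AUC computation. Function names are hypothetical.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform to an (N, 3) point set."""
    return points @ R.T + t

def nearest_distances(a, b):
    """For each point in a, the distance to its nearest neighbor in b (brute force)."""
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return pairwise.min(axis=1)

def add_metric(model, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between corresponding model points under both poses."""
    diff = transform(model, R_gt, t_gt) - transform(model, R_est, t_est)
    return float(np.linalg.norm(diff, axis=1).mean())

def add_s_metric(model, R_gt, t_gt, R_est, t_est):
    """ADD-S: mean distance from each ground-truth point to the closest estimated point."""
    gt = transform(model, R_gt, t_gt)
    est = transform(model, R_est, t_est)
    return float(nearest_distances(gt, est).mean())

def chamfer_distance(points_a, points_b):
    """Symmetric chamfer distance between two sampled point sets."""
    return float(0.5 * nearest_distances(points_a, points_b).mean()
                 + 0.5 * nearest_distances(points_b, points_a).mean())
```

For a symmetric object, a pose that is wrong by a symmetry transform is penalized by ADD but not by ADD-S, which is why both metrics are reported.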

Appendix D Detailed Results

Recall curves for ADD-S and ADD for all three datasets are presented in Fig. 6 (HO3D), Fig. 7 (YCBInEOAT), and Fig. 8 (BEHAVE). Each plot shows the results for all videos of the respective dataset. As can be seen, the area-under-the-curve (AUC) for our method exceeds that of other methods for almost all datasets.

Per-video quantitative results for all three datasets are presented in Tab. 5 (HO3D), Tab. 6 (YCBInEOAT), and Tabs. 7-10 (BEHAVE). As can be seen, our method performs best on almost all videos of HO3D, more than half the videos of YCBInEOAT, and a large majority of videos of BEHAVE. Note that the last row of each table (“Mean”) is included in the main paper.

Qualitative results are demonstrated in Figs. 9 and 10 (HO3D), Fig. 11 (YCBInEOAT), and Figs. 12 and 13 (BEHAVE). We encourage the reader to watch the supplemental video.

Details Regarding the Single-View Setup of BEHAVE. As mentioned in the paper, the BEHAVE Dataset was captured by a pre-calibrated multi-camera system with four cameras. Since our method only requires a monocular input, for fair evaluation, we run all methods on a single monocular input. That is, for each scene, we input only one of the cameras’ captured video to the methods.

Although in theory we could run each method four times, once per camera, this would be excessively time-consuming for the little insight that it might bring. Moreover, since only four cameras are placed at the corners of the scene, the object is often severely occluded by the human in several cameras’ views (including at the beginning of the video). Using such cameras would not lead to meaningful results for tracking evaluation, due to the very limited object visibility at initialization.

Instead, we automatically select one of the four cameras from each scene for evaluation. More specifically, we select the video with the least amount of occlusion in each scene over the entire sequence. To do so, we compute the average visibility ratio of the object in each camera’s video by comparing the ground-truth object mask against the object mask rendered using the ground-truth information. This is performed offline for all videos before evaluation. The selected single-view video is then used by all methods for evaluation. Even in the selected videos, severe occlusions still occur frequently and remain challenging, as shown in Figs. 12 and 13.
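
The camera selection described above can be sketched as follows. This is a minimal sketch under stated assumptions: masks are represented as nested lists of 0/1 values, the visibility ratio is taken to be the number of visible object pixels divided by the number of pixels in the unoccluded rendered mask, and the names `visibility_ratio` and `select_camera` and the toy data are hypothetical.

```python
def visibility_ratio(visible_mask, rendered_mask):
    """Fraction of the rendered (unoccluded) object pixels that are actually visible."""
    rendered = sum(1 for row in rendered_mask for r in row if r)
    if rendered == 0:
        return 0.0  # object is outside this camera's view
    visible = sum(
        1
        for vis_row, ren_row in zip(visible_mask, rendered_mask)
        for v, r in zip(vis_row, ren_row)
        if v and r
    )
    return visible / rendered

def select_camera(videos):
    """videos maps camera id -> list of (visible_mask, rendered_mask), one pair per frame."""
    def mean_visibility(frames):
        return sum(visibility_ratio(v, r) for v, r in frames) / len(frames)

    # Pick the camera whose video has the highest average visibility.
    return max(videos, key=lambda cam: mean_visibility(videos[cam]))

# Toy example (assumed data): 2x2 masks, two frames per camera.
full = [[1, 1], [1, 1]]
half = [[1, 1], [0, 0]]
quarter = [[1, 0], [0, 0]]
videos = {
    "cam0": [(quarter, full), (half, full)],  # mean visibility 0.375
    "cam1": [(full, full), (half, full)],     # mean visibility 0.75
}
selected = select_camera(videos)
```

Averaging over the whole sequence favors cameras with consistently good views, rather than cameras that see the object well only in a few frames.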

Appendix E Robustness Analysis

In the following, we discuss our approach’s robustness under various challenges. We encourage the reader to watch our supplemental video for a more complete appreciation of the system.

Dearth of Texture or Geometric Cues. In the dynamic object-centric setting, a dearth of texture or geometric cues frequently arises from the object itself. For instance, in Fig. 9, large areas on the blue pitcher lack texture, which challenges methods that rely heavily on optical flow (DROID-SLAM), keypoint matching (BundleTrack), or photometric loss (NICE-SLAM). Additionally, large areas of cylindrical surface offer few geometric cues and can cause rotational ambiguity for methods that rely on point-to-surface matching (SDF2SDF, BundleTrack, KinectFusion). In contrast, our method is robust to these challenges due to the synergy of pose graph optimization and the Neural Object Field. More examples of such challenges can be found in Figs. 10, 12, and 13.

Occlusions. In the dynamic object setting, occlusions include self-occlusions and external occlusions introduced by the interaction agent (e.g., human hand, human body, robotic arm). For instance, in Fig. 10, there are moments when the “meat can” only exhibits a single flat face (2nd column) after extreme rotations, causing severe self-occlusion. In other observations, external occlusion introduced by the human hand (4th column) also challenges the comparison methods. More examples of such challenges can be found in Figs. 9, 11, 12, and 13. As can be observed, our method is robust in both cases and keeps tracking accurately throughout the video thanks to the memory mechanism, whereas the comparison methods struggle.

Specularity. Due to the object’s surface smoothness, material, and complex environmental lighting, specularity can occur, posing challenges for methods that rely heavily on optical flow (DROID-SLAM), keypoint matching (BundleTrack), or photometric loss (NICE-SLAM). As shown in Figs. 9, 10, 11, and 12, despite the specularity on metallic or highly smooth surfaces, our method keeps tracking accurately throughout the video, whereas the comparison methods become brittle.

Abrupt Motion and Motion Blur. Fig. 14 illustrates an example of abrupt object motion due to the human freely swinging the box. Aside from making 6-DoF pose tracking harder because of the large displacement, abrupt motion causes motion blur in the RGB image, which poses an additional challenge for keypoint matching and Neural Object Field learning. Nevertheless, our method remains robust under these adverse conditions and even yields a more accurate pose than the ground truth.

Noisy Segmentation. Figs. 15 and 16 demonstrate examples of noisy masks (purple) from the video segmentation network, including both false positive and false negative predictions. False negative segmentation causes texture-rich areas to be ignored, intensifying the dearth of texture. False positive segmentation introduces deformable parts of the interaction agent or undesired scene background, causing multi-view inconsistency. However, our downstream modules are robust to the segmentation noise and maintain accurate tracking.

Noisy Depth. As shown in Fig. 17, in our setting, the noisy depth comes from two sources. First, the consumer-level RGBD camera has observable sensing noise. This is especially the case for the BEHAVE and YCBInEOAT datasets, where the object is captured at a distance from the camera, which challenges depth sensing. Second, due to the noisy segmentation, false positive predictions include undesired background areas in the depth point cloud. In Fig. 17 (left), when the per-frame depth point clouds are naively fused using the ground-truth pose, the result is highly cluttered, which reflects the noise in depth sensing and segmentation. Despite such noise, our simultaneous pose tracking and reconstruction produces a high-quality mesh, as shown on the right.

Appendix F Limitation and Failure Modes

While our method is robust to a variety of challenging conditions, it can fail when multiple types of challenges appear together. For instance, in Fig. 18, the co-occurrence of severe occlusion, segmentation error, and a dearth of texture and geometric cues leads to tracking failure. When the object re-appears, the recovered pose is affected by the symmetric geometry. In addition, our method requires the depth modality, which prevents its application to objects for which depth sensing fails, such as transparent objects. Finally, our method assumes the object to be rigid. In future work, generalizing to both rigid and non-rigid objects at the same time would be of interest.