RBP-Pose: Residual Bounding Box Projection for Category-Level Pose Estimation

Ruida Zhang, Yan Di, Zhiqiang Lou, Fabian Manhardt, Federico Tombari, Xiangyang Ji

Introduction

Category-level object pose estimation describes the task of estimating the full 9 degrees-of-freedom (DoF) object pose (consisting of the 3D rotation, 3D translation and 3D metric size) for objects from a given set of categories. The problem has gained wide interest in research due to its essential role in many applications, such as augmented reality, robotic manipulation and scene understanding. In comparison to conventional instance-level pose estimation, which assumes the availability of a 3D CAD model for each object of interest, the category-level task puts forward a higher requirement for adaptability to various shapes and textures within each category.

Notably, category-level pose estimation has recently experienced a large leap forward, thanks to novel deep learning architectures that can directly operate on point clouds. Most of these works try to establish 3D-3D correspondences between the input point cloud and either a predefined normalized object space or a deformed shape prior, the latter to better address intra-class shape variability. Eventually, the 9DoF pose is commonly recovered using the Umeyama algorithm. Nonetheless, despite achieving great performance, these methods typically still suffer from two shortcomings. First, their shape prior integration only boosts pose estimation indirectly, which leads to insufficient pose-sensitive feature extraction and slow inference speed. Second, due to the relatively small amount of available real-world data, these works tend to overfit as they are directly trained on these limited datasets.

To tackle the former limitation, the authors of GPV-Pose have proposed to leverage Displacement Vectors from the observed points to the corresponding Projections on the Bounding box (DVPB), in an effort to explicitly encapsulate the spatial cues of the 3D object and thus improve direct pose regression. While this performs well overall, the representation still exhibits weaknesses. In particular, DVPB is not necessarily a small vector with zero mean. In fact, the respective values can become very large (e.g. for large objects such as laptops), which makes it difficult for standard networks to predict them accurately. On these grounds, we propose to overcome this limitation by integrating shape priors into DVPB. We essentially describe the displacement field from the shape-prior-indicated projections towards the real projections onto the object bounding box. We dub the residual vectors in this displacement field SPRV, for Shape Prior Guided Residual Vectors. SPRV is inherently zero-centered and relatively small, allowing robust estimation with a deep neural network. In practice, we adopt a fully convolutional decoder to directly regress SPRV and then establish geometry-aware consistency with the predicted pose to enhance feature extraction. We experimentally show that our novel geometry-guided Residual Bounding Box Projection network RBP-Pose provides state-of-the-art results and clearly outperforms the DVPB representation.

As for the lack of real labeled data, we further propose an online non-linear shape augmentation scheme for training to avoid overfitting and enhance the generalization ability of RBP-Pose. In FS-Net, the authors propose to stretch or compress the object bounding box to generate new instances. However, the proportions between different parts of the object remain essentially unchanged, as shown in Fig. 4. Therefore, we propose a category-specific non-linear shape augmentation technique. In particular, we deform the object shape by adjusting its scale via a truncated parabolic function along the direction of a selected axis. To this end, we choose the symmetry axis for symmetric objects, whereas for non-symmetric objects we select the axis corresponding to the facing direction to avoid unrealistic distortions. In this way, we are able to increase the dataset size while preserving the representative shape characteristics of each category.

Overall, our main contributions are summarized as follows:

We propose a Residual Bounding Box Projection network (RBP-Pose) that jointly predicts 9DoF pose and shape prior guided residual vectors. We demonstrate that these nearly zero-mean residual vectors can be effectively predicted from our network and well encapsulate the spatial cues of the pose whilst enabling geometry-guided consistency terms.

To enhance the robustness of our method, we additionally propose a non-linear shape augmentation scheme to improve shape diversity during training whilst effectively preserving the commonality of geometric characteristics within categories.

RBP-Pose runs at an inference speed of 25 Hz and achieves state-of-the-art performance on both synthetic and real-world datasets.

Related Works

Instance-level 6D Pose Estimation. Instance-level pose estimation tries to estimate the 6DoF object pose, composed of the 3D rotation and 3D translation, for a known set of objects with associated 3D CAD models. The majority of monocular methods fall into three groups. The first group of methods regresses the pose directly, whereas the second group instead establishes 2D-3D correspondences via keypoint detection or dense pixel-wise prediction of 3D coordinates. The pose can then be obtained by adopting the PnP algorithm. Notably, a few methods adopt a neural network to learn the optimization step instead of relying on PnP. The last group of methods attempts to learn a pose-sensitive latent embedding for subsequent pose retrieval. As for RGB-D based methods, most works again regress the pose directly, while a few methods resort to a latent embedding, similar to the last group above. In spite of great advances in recent years, the practical use of instance-level methods is limited, as they can typically only deal with a handful of objects and additionally require CAD models.

Category-level Pose Estimation. In the category-level setting, the goal is to predict the 9DoF pose for previously seen or unseen objects from a known set of categories. The setting is considerably more challenging due to the large intra-class variations of shape and texture within categories. To tackle this issue, Wang et al. derive the Normalized Object Coordinate Space (NOCS) as a unified representation. They map the observed point cloud into NOCS and then apply the Umeyama algorithm for pose recovery. CASS introduces a learned canonical shape space instead. FS-Net proposes a decoupled representation for rotation and directly regresses the pose. DualPoseNet adopts two networks for explicit and implicit pose prediction and enforces consistency between them for pose refinement. While 6-PACK tracks the object's pose by means of semantic keypoints, CAPTRA instead combines coordinate prediction with direct regression. GPV-Pose harnesses geometric insights into bounding box projection to enhance the learning of category-level pose-sensitive features. To explicitly address intra-class shape variation, a line of work makes use of shape priors. SPD extracts the prior point cloud for each category as the mean of all shapes, adopting a PointNet autoencoder. SPD further deforms the shape prior to fit the observed instance and assigns the observed point cloud to the reconstructed shape model. SGPA dynamically adapts the shape prior to the observed instance in accordance with its structural similarity. DO-Net also utilizes shape priors, yet additionally harnesses the geometric object properties to enhance performance. ACR-Pose adopts a shape prior guided reconstruction network and a discriminator network to learn high-quality canonical representations. Notably, as shape prior integration only improves pose estimation indirectly, all these methods commonly suffer from insufficient pose-sensitive feature extraction and slow inference speed.

Methodology

As illustrated in Fig. 2, RBP-Pose consists of 5 modules, responsible for i) input preprocessing, ii) feature extraction from the input and prior point cloud, iii) 9DoF pose regression, iv) adaptation of the shape prior given the extracted features, and, finally, v) Shape Prior Guided Residual Vectors (SPRV) prediction.

Preprocessing. Given an RGB-D image, we first leverage an off-the-shelf object detector (e.g. Mask-RCNN) to segment objects of interest and then back-project their corresponding depth values to generate the associated object point clouds. We then uniformly sample $N=1024$ points from each detected object and feed them as the input $P_{o}$ to the following modules, as shown in Fig. 2 (a).
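The preprocessing step can be illustrated with a minimal sketch. It assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, a depth map in metric units and a binary instance mask; the function names `back_project` and `sample_points`, and the choice to sample with replacement when fewer than $N$ points are available, are assumptions made for illustration and are not taken from the original implementation.

```python
import random


def back_project(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels to 3D camera-frame points (pinhole model)."""
    points = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, inside) in enumerate(zip(depth_row, mask_row)):
            if inside and z > 0:  # skip background pixels and invalid depth
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def sample_points(points, n=1024, seed=None):
    """Draw exactly n points uniformly at random (with replacement if too few)."""
    if not points:
        raise ValueError("cannot sample from an empty point cloud")
    rng = random.Random(seed)
    if len(points) >= n:
        return rng.sample(points, n)
    # Assumption: pad by sampling with replacement when the mask is small.
    return [rng.choice(points) for _ in range(n)]
```

A detected object would then be processed as `sample_points(back_project(depth, mask, fx, fy, cx, cy))` to obtain a fixed-size input.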

Feature Extraction. Since 3DGC is insensitive to shift and scale of the given point cloud, we adopt it as our feature extractor to obtain pose-sensitive features $F_{obs}$ and $F_{prior}$ from $P_{o}$ and a pre-computed mean shape prior $P_{r}$ (with $M=1024$ points), respectively, as in SPD. We introduce a non-linear shape augmentation scheme to increase the diversity of shapes and promote robustness, which will be discussed in detail in Sec. 3.5. Finally, $F_{obs}$ is fed to the Pose Regression module for direct pose estimation and to the Shape Prior Adaptation module after concatenation with $F_{prior}$.

Shape Prior Guided Residual Vectors (SPRV) Prediction. The main contribution of our work resides in the use of Shape Prior Guided Residual Vectors (SPRV) to integrate shape priors into the direct pose regression network, enhancing the performance whilst keeping a fast inference speed. We introduce this module in detail in the following section.

3.2 Residual Bounding Box Projection

Preliminaries. In GPV-Pose, the authors propose a novel confidence-aware point-wise voting method to recover the bounding box. For each observed object point $P$, GPV-Pose predicts its Displacement Vector towards its Projections on each of the 6 Bounding Box faces (DVPB), as shown in Fig. 3 (a). For example, when considering the $x+$ plane, the DVPB of the observed point $P$ onto the $x+$ face of the bounding box is defined as,

$$D_{P,x+}=\left(\frac{s_{x}}{2}-\langle r_{x},P\rangle+\langle r_{x},t\rangle\right)r_{x},\tag{1}$$

where $\langle *,*\rangle$ denotes the inner product and, as before, $r_{x}$ denotes the first column of the rotation matrix $R$. Thus, each point $P$ provides 6 DVPBs with respect to all 6 bounding box faces $\mathcal{B}=\{x\pm,y\pm,z\pm\}$. Notice that, since symmetries lead to ambiguity in bounding box faces around the corresponding symmetry axis, GPV-Pose only computes the DVPB on the ambiguity-free faces.
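To make the definition concrete, the following sketch computes the DVPB of a single point with respect to all six faces. It assumes the displacement towards a face is the signed distance along the corresponding box axis, i.e. $(\pm s/2-\langle r, P-t\rangle)\,r$, with $t$ the box centre and $s$ the box size; the function name `dvpb` and the list-of-rows matrix layout are illustrative choices, and the handling of symmetric categories (ambiguity-free faces only) is left out.

```python
def dot(a, b):
    """Inner product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def dvpb(point, R, t, size):
    """Displacement vectors from `point` to its projections on the six box faces.

    R: 3x3 rotation given as a list of rows; its columns are the box axes.
    t: box centre in camera coordinates; size: (s_x, s_y, s_z).
    """
    rel = [p - c for p, c in zip(point, t)]
    out = {}
    for k, name in enumerate("xyz"):
        axis = [R[i][k] for i in range(3)]   # k-th column of R
        coord = dot(axis, rel)               # coordinate of the point along this axis
        for sign, label in ((1.0, "+"), (-1.0, "-")):
            # Signed distance to the face located at sign * s / 2 along the axis.
            dist = sign * size[k] / 2.0 - coord
            out[name + label] = [dist * a for a in axis]
    return out
```

For an axis-aligned box of size $(2,4,6)$ centred at the origin, the point $(0.5,0,0)$ yields a displacement of $(0.5,0,0)$ towards the $x+$ face and $(-1.5,0,0)$ towards the $x-$ face.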

Although GPV-Pose reports great results when leveraging DVPB, it still suffers from two important shortcomings. First, DVPB is not necessarily a small vector with zero-mean. In fact, the respective values can become very large (as for large objects like laptops), which can make it very difficult for standard networks to predict them accurately. Second, DVPB is not capable of conducting automatic outlier filtering, hence, noisy point cloud observations may significantly deteriorate the predictions of DVPB. On that account, we propose to incorporate shape prior into DVPB in the form of Shape Prior Guided Residual Vectors (SPRV) to properly address the aforementioned shortcomings.

Shape Prior Guided Residual Vectors (SPRV). As illustrated in Fig. 2 (e), we predict the deformation field $D_{r}$ that deforms the shape prior $P_{r}$ to the shape of the observed instance with $M_{r}=P_{r}+D_{r}$. During experimentation, we made two observations. First, as $P_{r}$ is outlier-free and $D_{r}$ is regularized to be small, similar to SPD, we can safely assume that $M_{r}$ contains no outliers, allowing us to accurately recover its bounding box in NOCS by selecting the outermost points along the $x$, $y$ and $z$ axes, respectively. Second, since $A_{r}$ is the row-normalized assignment matrix, we know that $M_{r}$ shares the same bounding box with $C_{o}=A_{r}M_{r}$, which is assumed to be inherently outlier-free and accurate. Based on the above two observations, we can utilize $M_{r}$ and $C_{o}$ to provide initial hypotheses for DVPB with respect to each point in $P_{o}$.

Specifically, for a point $P^{C}$ in $C_{o}$, as it is in NOCS, its DVPB $D_{P^{C},x+}$ (Fig. 3 (b)) can be represented as,

$$D_{P^{C},x+}=\left(\frac{s^{M}_{x}}{2}-\langle n_{x},P^{C}\rangle\right)n_{x},\tag{2}$$

where $s^{M}_{x}$ denotes the size of $M_{r}$ along the $x$ axis and $n_{x}=[1,0,0]^{T}$ is the normalized bounding box face normal. We then transform $D_{P^{C},x+}$ from NOCS to the camera coordinate frame to obtain the initial DVPB hypotheses for the corresponding point $P$ in $P_{o}$ (Fig. 3 (c)) as,

$$D_{P,x+}=L\,\|D_{P^{C},x+}\|\,r_{x},\tag{3}$$

where $L=\sqrt{s_{x}^{2}+s_{y}^{2}+s_{z}^{2}}$ is the diagonal length of the bounding box. Note that $L$ and $r_{x}$ are calculated from the category mean size $S_{M}$ and the rotation prediction of our Pose Regression Module respectively.
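The construction of the initial hypothesis can be sketched as follows. The sketch assumes that the NOCS bounding box of $M_{r}$ is centred at the origin, that the NOCS displacement towards the $x+$ face is $s^{M}_{x}/2$ minus the $x$ coordinate of the point, and that the transfer to the camera frame rescales its magnitude by $L$ and orients it along the predicted $r_{x}$. The function names are hypothetical.

```python
import math


def nocs_box_size(model_points):
    """Axis-aligned extent of the deformed prior M_r from its outermost points."""
    return [max(p[k] for p in model_points) - min(p[k] for p in model_points)
            for k in range(3)]


def initial_dvpb_x_plus(p_nocs, model_points, r_x, mean_size):
    """Initial x+ DVPB hypothesis for the observed point matched to `p_nocs`.

    p_nocs: the corresponding point of C_o in NOCS.
    r_x: first column of the predicted rotation; mean_size: category mean size.
    """
    s_m_x = nocs_box_size(model_points)[0]
    # With n_x = [1, 0, 0], only the x coordinate enters the NOCS displacement.
    d_nocs = s_m_x / 2.0 - p_nocs[0]
    # Rescale by the box diagonal L and orient along the predicted axis r_x.
    diag = math.sqrt(sum(s * s for s in mean_size))
    return [diag * abs(d_nocs) * a for a in r_x]
```

The other five faces follow by exchanging the axis and the sign of the face offset.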

Given the ground truth DVPB $D^{gt}_{P,x+}$ and initial DVPB hypotheses $D_{P,x+}$, the SPRV of $P$ to the $x+$ bounding box face (Fig. 3 (d)) is calculated as,

$$R_{P,x+}=D^{gt}_{P,x+}-D_{P,x+}.\tag{4}$$

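The calculation of SPRV with respect to the other bounding box faces in $\mathcal{B}$ follows the same principle. By this means, SPRV can be approximately modelled with a zero-mean Laplacian distribution, which enables effective prediction with a simple network. In the SPRV Prediction module, we feed the estimated initial DVPB hypotheses together with the feature map $F_{obs}$ into a fully convolutional decoder to directly regress SPRV. As this boils down to a multi-task prediction problem, we employ the Laplacian aleatoric uncertainty loss to weight the different contributions within SPRV according to

$$\begin{aligned}
\mathcal{L}_{SPRV}&=\mathcal{L}^{data}_{SPRV}+\lambda_{0}\mathcal{L}^{reg}_{SPRV},\\
\mathcal{L}^{data}_{SPRV}&=\sum_{P\in P_{o}}\sum_{j\in\mathcal{B}}\frac{\sqrt{2}}{\sigma_{s_{j}}}\left|R_{P_{j}}-R^{gt}_{P_{j}}\right|+\log(\sigma_{s_{j}}),\\
\mathcal{L}^{reg}_{SPRV}&=\sum_{P\in P_{o}}\sum_{j\in\mathcal{B}}\frac{\sqrt{2}}{\sigma^{\prime}_{s_{j}}}\left|R_{P_{j}}\right|+\log(\sigma^{\prime}_{s_{j}}).
\end{aligned}\tag{5}$$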

Thereby, $R^{gt}_{P_{j}}$ refers to the ground truth SPRV, calculated from the provided ground truth NOCS coordinates and the respective pose annotations. Further, $\sigma_{s_{j}}$ and $\sigma^{\prime}_{s_{j}}$ denote the standard deviations of the Laplacian distributions that are utilized to model the uncertainties. Note that the first term $\mathcal{L}^{data}_{SPRV}$ is fully supervised using the respective ground truth, while $\mathcal{L}^{reg}_{SPRV}$ is a regularization term that encourages the SPRV network to predict small displacements. In addition, $\lambda_{0}$ is a weighting parameter to balance the two terms. Note that we do not apply Gaussian-distribution-based losses. Since we follow GPV-Pose and supervise the other branches with the $\mathcal{L}_{1}$ loss for stability, the $\mathcal{L}_{1}$-based Laplacian form of Eq. 5 makes it convenient to adjust the weight of each term.
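As an illustration of the loss structure described above, the sketch below evaluates a Laplacian aleatoric uncertainty loss of the common form $\frac{\sqrt{2}}{\sigma}|x-x^{gt}|+\log\sigma$ on flat lists of scalar residual components. Treating the regularization term as the same loss with an all-zero target, summing rather than averaging, and the function names are assumptions of this sketch; the default `lambda0=0.01` is the value reported in the implementation details.

```python
import math


def laplacian_uncertainty_loss(pred, target, sigma):
    """Sum of sqrt(2) / sigma * |pred - target| + log(sigma) over all entries."""
    return sum(math.sqrt(2.0) / s * abs(p - g) + math.log(s)
               for p, g, s in zip(pred, target, sigma))


def sprv_loss(pred, target, sigma, sigma_reg, lambda0=0.01):
    """Data term against the ground truth plus a weighted small-residual prior."""
    data = laplacian_uncertainty_loss(pred, target, sigma)
    # Regularizer: same loss with a zero target, pulling residuals towards zero.
    reg = laplacian_uncertainty_loss(pred, [0.0] * len(pred), sigma_reg)
    return data + lambda0 * reg
```

A larger predicted $\sigma$ down-weights the absolute error of an entry but is penalized through the $\log\sigma$ term, which is what lets the network balance the individual contributions.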

3.3 SPRV for Pose Consistency

Since SPRV explicitly encapsulates pose-related cues, we utilize it to enforce geometric consistency between the SPRV prediction and the pose regression. To this end, we first employ the predicted pose to estimate DVPB $D^{Pose}$ according to Eq. 1. We then recover $D^{SPRV}$ by adding the predicted SPRV to the initial hypotheses. Finally, our consistency loss term is defined as follows,

$$\mathcal{L}_{con}=\sum_{P\in P_{o}}\sum_{j\in\mathcal{B}}\left|D^{Pose}_{P,j}-D^{SPRV}_{P,j}\right|,\tag{6}$$

where $|*|$ denotes the $\mathcal{L}_{1}$ distance.
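The consistency term can be sketched in a few lines. The sketch assumes that each argument is a list of 3-vectors with one entry per (point, face) pair and that the loss is summed over all entries; the function name and the data layout are illustrative.

```python
def consistency_loss(d_pose, d_init, sprv_pred):
    """L1 distance between pose-derived DVPB and hypothesis plus predicted residual.

    d_pose: DVPB computed from the predicted pose.
    d_init: initial DVPB hypotheses from the shape prior.
    sprv_pred: residual vectors predicted by the SPRV decoder.
    """
    total = 0.0
    for dp, di, r in zip(d_pose, d_init, sprv_pred):
        d_sprv = [a + b for a, b in zip(di, r)]   # hypothesis + residual
        total += sum(abs(a - b) for a, b in zip(dp, d_sprv))
    return total
```

The loss vanishes exactly when the pose branch and the SPRV branch agree on every displacement vector, so minimizing it couples the two decoders.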

3.4 Overall Loss Function

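The overall loss function is defined as follows,

$$\mathcal{L}=\lambda_{1}\mathcal{L}_{pose}+\lambda_{2}\mathcal{L}_{shape}+\lambda_{3}\mathcal{L}_{SPRV}+\lambda_{4}\mathcal{L}_{con}.\tag{7}$$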

For $\mathcal{L}_{pose}$, we utilize the loss terms from GPV-Pose to supervise $R,t,s$ with the ground truth. For $\mathcal{L}_{shape}$, we adopt the loss terms from SPD to supervise the prediction of the deformation field $D_{r}$ and the assignment matrix $A_{r}$. Further, $\mathcal{L}_{SPRV}$ and $\mathcal{L}_{con}$ are defined in Eq. 5 and Eq. 6. Finally, $\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{4}$ denote the weights that balance the individual loss contributions, and are chosen empirically.

3.5 Non-Linear Shape Augmentation

To tackle the intra-class shape variations and improve the robustness and generalizability of RBP-Pose, we propose a category-specific non-linear shape augmentation scheme (Fig. 4). In FS-Net, the authors augment the shape by stretching or compressing the object bounding box. Their augmentation is linear and unable to cover the large shape variations within a category, since the proportions between different parts of the object remain essentially unchanged (Fig. 4 (d)). In contrast, we propose a novel non-linear shape augmentation method which is designed to generate diverse unseen instances, whilst preserving the representative shape features of each category (Fig. 4 (c)).

In particular, we propose two types of augmentation strategies for the categories provided by the REAL275 dataset: an axis-based non-linear scaling transformation ($A1$) for camera, bottle, can, bowl and mug (Fig. 4 (I, II)), and a plane-based rotation transformation ($A2$) for laptop (Fig. 4 (III)).

As for $A1$, we deform the object shape by adjusting its scale along the direction of a selected axis. For each point $P$ in the canonical object space, its deformation scale $\mathcal{S}_{A1}(P)$ is obtained by $\mathcal{S}_{A1}(P)=\xi(P_{*})$, where $\xi(P_{*})$ is a random non-linear function and $P_{*}$ is the projection of $P$ on the selected axis. In this paper, we choose $\xi$ as the parabolic function, thus, we have

$$\mathcal{S}_{A1}(P)=\xi(P_{*})=\gamma_{min}+4(\gamma_{max}-\gamma_{min})(P_{*})^{2},\tag{8}$$

where $\gamma_{max},\gamma_{min}$ are uniformly sampled random variables that control the upper and lower bounds of $\mathcal{S}_{A1}(P)$. For example, when selecting $y$ as our augmentation axis, the respective transformation function is defined as,

$$\mathcal{T}_{A1}(P)=\{\gamma P_{x},\ \mathcal{S}_{A1}(P)P_{y},\ \gamma P_{z}\},\tag{9}$$

where $\gamma$ is the random variable that controls the scaling transformation along the $x$ and $z$ axes. In practice, we select the symmetry axis ($y$-axis) for bottle, can, bowl and mug as the transformation axis. Moreover, for camera, we select the axis that passes through the camera lens ($x$-axis), to keep its roundish shape after augmentation. The corresponding transformation function is then defined as in Eq. 8 and Eq. 9, yet $\mathcal{S}_{A1}(P)$ is only applied to $P_{x}$, and $\gamma$ is applied to $P_{y}$ and $P_{z}$.
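The $A1$ augmentation can be illustrated with the following sketch. It assumes canonical coordinates in $[-0.5, 0.5]$ along the selected axis and a parabola that equals $\gamma_{min}$ at the centre and $\gamma_{max}$ at the extremes; this particular parabola, the function name `augment_a1` and the `rng` parameter are assumptions of the sketch. The sampling ranges are those reported in the implementation details.

```python
import random


def augment_a1(points, axis=1, rng=random):
    """Non-linearly rescale canonical points along `axis` (0 = x, 1 = y, 2 = z).

    The coordinate on the selected axis is scaled by a parabolic factor that
    depends on the point's projection onto that axis; the two remaining
    coordinates are scaled by a single random factor gamma.
    """
    g_max = rng.uniform(1.0, 1.3)
    g_min = rng.uniform(0.7, 1.0)
    gamma = rng.uniform(0.8, 1.2)
    out = []
    for p in points:
        # Assumed parabola: g_min at the centre, g_max at |p[axis]| = 0.5.
        scale = g_min + 4.0 * (g_max - g_min) * p[axis] ** 2
        out.append(tuple(scale * c if k == axis else gamma * c
                         for k, c in enumerate(p)))
    return out
```

Calling `augment_a1(points, axis=1)` corresponds to the symmetric categories, while `axis=0` corresponds to the camera case described above.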

As for $A2$, since laptop is an articulated object consisting of two movable planes, we conduct shape augmentation by modifying the angle between the upper and lower plane (Fig. 4 (III)). Thereby, we rotate the upper plane by a certain angle about the fixed axis, while the lower plane remains static. Please refer to the Supplementary Material for details of the $A2$ transformation.

Experiments

Datasets. We employ the common REAL275 and CAMERA25 benchmark datasets for evaluation. REAL275 is a real-world dataset consisting of 7 scenes with 4.3K images for training and 6 scenes with 2.75K images for testing. It covers 6 categories, including bottle, bowl, camera, can, laptop and mug. Each category contains 3 unique instances in both the training and the test set. On the other hand, CAMERA25 is a synthetic dataset generated by rendering virtual objects on real backgrounds. CAMERA25 contains 275K images for training and 25K for testing. Note that CAMERA25 shares the same six categories with REAL275.

Implementation Details. Following prior work, we use Mask-RCNN to generate 2D segmentation masks for a fair comparison. As for our category-specific non-linear shape augmentation, we uniformly sample $\gamma_{max}\sim\mathcal{U}(1,1.3)$, $\gamma_{min}\sim\mathcal{U}(0.7,1)$ and $\gamma\sim\mathcal{U}(0.8,1.2)$. Besides our non-linear shape augmentation, we add random Gaussian noise to the input point cloud, and employ random rotational and translational perturbations as well as random scaling of the object. Unless specified otherwise, we set the balancing factors $\{\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{4}\}$ to $\{8.0,10.0,3.0,1.0\}$. Finally, the parameter $\lambda_{0}$ in Eq. 5 is set to $0.01$. We train RBP-Pose in a two-stage manner to stabilize the training process. In the first stage, we only train the pose decoder and the shape decoder, employing only $\mathcal{L}_{pose}$ and $\mathcal{L}_{shape}$. In the second stage, we train all modules except Preprocessing with the full loss in Eq. 7. This strategy ensures that our two assumptions in Sec. 3 are reasonable and enables smooth training. Notice that, similar to other works, we train a single model for all categories. Unlike methods that train with both synthetic and real data for evaluation on REAL275, we only use the real data for training. We train RBP-Pose for 150 epochs in each stage and employ a batch size of 32. We further employ the Ranger optimizer with a base learning rate of 1e-4, annealed at $72\%$ of the training phase using a cosine schedule. Our experiments are conducted on a single NVIDIA-A100 GPU.
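One plausible reading of the learning rate schedule is a constant rate followed by a cosine decay over the final part of training. The sketch below implements this reading; the constant phase, the decay to zero and the function name `learning_rate` are assumptions, since only the base rate, the $72\%$ mark and the cosine shape are stated above.

```python
import math


def learning_rate(step, total_steps, base_lr=1e-4, anneal_start=0.72):
    """Constant learning rate, then cosine decay to zero over the final phase."""
    start = anneal_start * total_steps
    if step < start:
        return base_lr
    # Fraction of the annealing phase that has elapsed, clipped to [0, 1].
    progress = min((step - start) / max(total_steps - start, 1e-12), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With 100 steps in total, the rate stays at 1e-4 up to step 71, reaches half of the base rate at step 86 and approaches zero at step 100.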

Evaluation metrics. Following the widely adopted evaluation scheme, we utilize the two standard metrics for quantitative evaluation of the performance. In particular, we report the mean precision of 3D IoU, which computes the intersection over union for two bounding boxes under the predicted and the ground truth pose. A prediction is considered correct if the IoU is larger than the employed threshold. On the other hand, to directly evaluate rotation and translation errors, we use the $5^{\circ}2cm$, $5^{\circ}5cm$, $10^{\circ}2cm$ and $10^{\circ}5cm$ metrics. A pose is considered correct if the translational and rotational errors are less than the respective thresholds.
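The rotation and translation criterion can be sketched as follows. The sketch assumes rotation matrices given as lists of rows and translations in metres, and it measures the rotation error as the geodesic angle between the two rotations. The special treatment of symmetric categories used by the benchmark evaluation is not included, and the function names are illustrative.

```python
import math


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    # trace(R_pred^T R_gt) equals the sum of element-wise products.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error_cm(t_pred, t_gt):
    """Euclidean distance in centimetres, assuming inputs in metres."""
    return 100.0 * math.dist(t_pred, t_gt)


def pose_correct(R_pred, t_pred, R_gt, t_gt, deg_thresh=5.0, cm_thresh=2.0):
    """True if both errors are below the thresholds (default: 5 degrees, 2 cm)."""
    return (rotation_error_deg(R_pred, R_gt) < deg_thresh
            and translation_error_cm(t_pred, t_gt) < cm_thresh)
```

The reported precision for a given metric is then the fraction of predictions for which `pose_correct` holds under the respective thresholds.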

Performance on NOCS-REAL275. In Tab. 1, we compare RBP-Pose with 9 state-of-the-art methods, among which 4 methods utilize shape priors. Our method outperforms all other competitors by a large margin. Specifically, under $IoU_{75}$, we achieve a mAP of 67.8%, which exceeds the second best method DO-Net by 4.1%. Regarding the rotation and translation accuracy, RBP-Pose outperforms SGPA by 2.3% in $5^{\circ}2cm$, 8.5% in $5^{\circ}5cm$, 1.8% in $10^{\circ}2cm$ and 8.5% in $10^{\circ}5cm$. Moreover, when comparing with GPV-Pose, we outperform it by 6.2% in $5^{\circ}2cm$, 5.2% in $5^{\circ}5cm$ and 5.9% in $10^{\circ}5cm$. Notably, despite achieving significant accuracy improvements, RBP-Pose still obtains a real-time frame rate of 25 Hz when using YOLOv3 and ATSA for object detection. Moreover, we present a detailed per-category comparison for 3D IoU, rotation and translation accuracy of RBP-Pose and SGPA in Fig. 6. Our method obtains superior results over SGPA in terms of mean precision for all metrics, especially in rotation. Moreover, our method is superior in dealing with complex categories with significant intra-class shape variations, e.g. camera (green line in Fig. 6).

Performance on NOCS-CAMERA25. The results for CAMERA25 are shown in Tab. 2. Our method outperforms all competitors for the stricter metrics $IoU_{75}$, $5^{\circ}2cm$, $5^{\circ}5cm$ and $10^{\circ}5cm$, and is on par with the best methods for $IoU_{50}$ and $10^{\circ}2cm$. Specifically, our method exceeds the second best methods for $IoU_{75}$, $5^{\circ}2cm$, $5^{\circ}5cm$ and $10^{\circ}5cm$ by 0.7%, 1.4%, 0.5% and 0.5%, respectively.

4.2 Ablation Study

Effect of Shape Prior Guided Residual Vectors. In Tab. 3, we evaluate the performance of our method under different configurations. From E1 to E3, we compare three variants of RBP-Pose w.r.t. the integration of DVPB: removing the DVPB related modules, predicting DVPB directly and predicting SPRV. By directly predicting DVPB as in GPV-Pose, the mAP improves by 0.8% under $5^{\circ}2cm$ and 2.1% under $5^{\circ}5cm$, which indicates that DVPB explicitly encapsulates pose information, helping the network to extract pose-sensitive features. By utilizing shape priors to generate initial hypotheses of DVPB and additionally predicting SPRV, the performance improves by 2.6% under $5^{\circ}2cm$ and 2.5% under $10^{\circ}2cm$, while the mAP of $5^{\circ}5cm$ and $10^{\circ}5cm$ decreases. In general, by solely adopting the auxiliary task of predicting SPRV, the translation accuracy rises while the rotation accuracy falls. This, however, can be solved using our consistency loss between SPRV and pose. E7 adopts the consistency term in Eq. 6 on top of E3, and boosts the performance by a large margin under all metrics. This shows that the consistency term is able to guide the network to align predictions from different decoders by jointly optimizing them. E4 enforces the consistency term on DVPB without residual reasoning. Here, performance deteriorates since the initial DVPB hypotheses are typically inaccurate. The SPRV decoder refines the hypotheses by predicting residuals, and thus enhances overall performance.

Effect of non-linear shape augmentation. In Tab. 3 E5, we remove the non-linear shape augmentation and preserve all other components. Comparing E5 and E7, it can be deduced that the performance degrades dramatically without non-linear shape augmentation, where the mAP of $5^{\circ}2cm$ and $5^{\circ}5cm$ drops by 15.6% and 18.4%, respectively. The main reason is that we only train the network on real-world data containing only 3 objects for each category with 4k images, leading to severe overfitting. The non-linear data augmentation mitigates this problem and enhances the diversity of shapes in the training data.

Non-linear vs linear shape augmentation. We compare our non-linear shape augmentation with the linear bounding-box-based augmentation from FS-Net in Tab. 3 E6 and E7. Our non-linear shape augmentation boosts the mAP w.r.t. all metrics. Specifically, the accuracy improves by 1.6% for $IoU_{75}$, 2.1% for $5^{\circ}2cm$, 1.1% for $5^{\circ}5cm$ and 0.9% for $10^{\circ}2cm$. The main reason is that our non-linear shape augmentation covers more kinds of shape variations than the linear counterpart, which improves the diversity of training data and mitigates the problem of overfitting.

4.3 Qualitative Results

We provide a qualitative comparison between RBP-Pose and SGPA in Fig. 5. The advantage of our method over SGPA is significant, especially in the accuracy of the rotation estimation. Moreover, our method consistently outperforms SGPA when estimating the pose for the camera category, which supports our claim that we can better handle categories with large intra-class variations. We discuss failure cases and limitations in the Supplementary Material.

Conclusion

In this paper, we propose RBP-Pose, a novel method that leverages Residual Bounding Box Projection for category-level object pose estimation. RBP-Pose jointly predicts 9DoF pose and shape prior guided residual vectors. We illustrate that these nearly zero-mean residual vectors encapsulate the spatial cues of the pose and enable geometry-guided consistency terms. We also propose a non-linear data augmentation scheme to improve shape diversity of the training data. Extensive experiments on the common public benchmark demonstrate the effectiveness of our design and the potential of our method for future real-time applications such as robotic manipulation and augmented reality.