Category-Level 6D Object Pose and Size Estimation using Self-Supervised Deep Prior Deformation Networks
Jiehong Lin, Zewei Wei, Changxing Ding, Kui Jia
Introduction
The task of category-level 6D object pose and size estimation, formally introduced by NOCS, is to estimate the rotations, translations, and sizes of unseen object instances of certain categories in cluttered RGB-D scenes. It plays a crucial role in many real-world applications, such as robotic grasping, augmented reality, and autonomous driving.
For this task, existing methods can be roughly categorized into two groups: those based on direct regression and those based on dense correspondence learning. Methods of the former group are conceptually simple, but struggle to learn pose-sensitive features that support direct predictions in the full pose space. Dense correspondence learning makes the task easier by first regressing point-wise coordinates in the canonical space to align with the observed points, and then obtaining object poses and sizes by solving for the alignment with the Umeyama algorithm. Recent works of the second group exploit strong categorical priors (e.g., mean shapes of object categories) to improve the quality of canonical point sets, and consistently achieve impressive results; however, their surrogate objectives for learning canonical coordinates are one step away from the true objectives of estimating object poses and sizes, making their learning suboptimal for the end task.
The considered learning task is further challenged by the lack of real-world RGB-D data with careful object pose and size annotations in 3D space. As such, synthetic data are usually simulated and rendered, with annotations that can be obtained freely on the fly. However, the easy annotations in synthetic domains come with the downside of a synthetic-to-real (Sim2Real) domain gap; learning with synthetic data without considering Sim2Real domain adaptation would inevitably result in poor generalization in the real-world domain. This setting naturally falls in the realm of Sim2Real unsupervised domain adaptation (UDA).
In this work, we consider the task setting of Sim2Real UDA for category-level 6D object pose and size estimation. We propose a new method, the self-supervised Deep Prior Deformation Network; Fig. 1 gives an illustration. Following dense correspondence learning, we first present a novel Deep Prior Deformation Network, shortened as DPDN, which implements a deep version of shape prior deformation in the feature space, and is thus able to establish deep correspondence for direct regression of poses and sizes with high precision. For a cluttered RGB-D scene, we employ a 2D instance segmentation network (e.g., Mask R-CNN) to segment out the objects of interest, and feed them into our proposed DPDN for pose and size estimation. As shown in Fig. 2, the architecture of DPDN consists of three main modules: a Triplet Feature Extractor, a Deep Prior Deformer, and a Pose and Size Estimator. For an object observation, the Triplet Feature Extractor learns point-wise features from its image crop, point set, and categorical shape prior, respectively; the Deep Prior Deformer then deforms the prior in the feature space by learning a feature deformation field and a correspondence matrix, and thus builds deep correspondence from the observation to its canonical version; finally, the Pose and Size Estimator makes reliable predictions directly from the established deep correspondence.
On top of DPDN, we formulate a self-supervised objective for UDA that combines an inter-consistency term with two intra-consistency terms. More specifically, as shown in Fig. 1, we apply two rigid transformations to an input point set of an object observation, and feed the two transformed point sets into our DPDN in parallel to make dual sets of predictions. Upon this parallel learning, the inter-consistency term enforces cross consistency between the dual predictions w.r.t. the two transformations, improving the sensitivity of DPDN to pose changes; within each branch, an individual intra-consistency term enforces self-adaptation between the correspondence and the predictions. We train DPDN on the training sets of both the synthetic CAMERA25 and real-world REAL275 datasets; our results outperform the existing methods on the REAL275 test set under both unsupervised and supervised settings. We also conduct ablation studies that confirm the advantages of our designs. Our contributions can be summarized as follows:
We propose a Deep Prior Deformation Network, termed as DPDN, for the task of category-level 6D object pose and size estimation. DPDN deforms categorical shape priors to pair with object observations in the feature space, and is thus able to establish deep correspondence for direct regression of object poses and sizes.
Given that the considered task largely uses synthetic training data, we formulate a novel self-supervised objective upon DPDN to reduce the synthetic-to-real domain gap. The objective is built upon enforcing consistencies between parallel learning w.r.t. two rigid transformations, and has the effects of both improving the sensitivity of DPDN to pose changes, and making predictions more reliable.
We conduct thorough ablation studies to confirm the efficacy of our designs. Notably, our method outperforms existing ones on the benchmark dataset of real-world REAL275 under both the unsupervised and supervised settings.
Related Work
Methods of fully-supervised category-level 6D pose and size estimation can be roughly divided into two groups: those based on direct regression and those based on dense correspondence learning.
Directly estimating object poses and sizes from object observations is difficult because predictions must be learned in the full pose space, which demands the extraction of pose-sensitive features. FS-Net builds an orientation-aware backbone with 3D graph convolutions to encode object shapes, and makes predictions with a decoupled rotation mechanism. DualPoseNet encodes pose-sensitive features from object observations based on rotation-equivariant spherical convolutions, while two parallel pose decoders with different working mechanisms are stacked to impose complementary supervision. A recent work, SS-ConvNet, designs Sparse Steerable Convolutions (SS-Conv) to further explore SE(3)-equivariant feature learning, and presents a two-stage pose estimation pipeline built upon SS-Convs for iterative pose refinement.
Another group of works first learns coordinates of object observations in the canonical space to establish dense correspondence, and then obtains object poses and sizes by applying the Umeyama algorithm to the correspondence in 3D space. NOCS, the first work for our focused task, is realized in this way by directly regressing canonical coordinates from RGB images. SPD then makes the learning of canonical points easier by deforming categorical shape priors, rather than directly regressing from object observations. Follow-up works also confirm the advantages of shape priors, and refine the prior deformation to further improve the quality of canonical points, e.g., via recurrent reconstruction for iterative refinement, or structure-guided adaptation based on a transformer.
Unsupervised Methods
Because annotating real-world data in 3D space is time-consuming and labor-intensive, UDA-COPE presents a new setting of unsupervised domain adaptation for the focused task, and adapts a teacher-student scheme with bidirectional point filtering to this setting; this scheme, however, relies heavily on the quality of pseudo labels. In this paper, we exploit inter-/intra-consistency in the self-supervised objective to explore the characteristics of real-world data and fit them, thereby reducing the domain gap.
Self-Supervised Deep Prior Deformation Network
with and . is a binary mask; if the observation of is fully annotated and otherwise. In Sec. 3.2, we will give a detailed illustration on the self-supervised objective , which learns inter-consistency and intra-consistency upon DPDN, while the illustration on the supervised objective is included in Sec. 3.3.
Finally, the Umeyama algorithm aligns the resulting canonical points with the observed points, which gives the target pose and size.
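To make the alignment step concrete, the following is a minimal NumPy sketch of Umeyama's closed-form similarity alignment between corresponding point sets. The function name `umeyama_similarity` and the scalar-scale convention (observed ≈ s · R · canonical + t) are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np


def umeyama_similarity(src, dst):
    """Least-squares similarity transform such that dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points (same ordering).
    Returns a scalar scale s, a 3x3 rotation R, and a translation t of shape (3,).
    """
    n = src.shape[0]
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # Cross-covariance between the centered target and source points.
    cov = dst_c.T @ src_c / n
    U, sing, Vt = np.linalg.svd(cov)

    # Guard against reflections so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(sing) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

In the correspondence-based pipeline described above, `src` would play the role of the predicted canonical points and `dst` the observed points; because the solution is a least-squares fit over all pairs, outliers in the predicted canonical points directly perturb the recovered scale and pose.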
However, surrogate objectives for the learning of and are a step away from the true ones for estimates of pose and size; for example, small deviations of or may lead to large changes in the pose space. Thereby, we present a Deep Prior Deformation Network (DPDN), which implements a deep version of (2) as follows:
where , , and denote point-wise features of , , and , respectively, and is a feature deformation field w.r.t. . The deep version (3) deforms in the feature space, such that features of and are paired to establish deep correspondence, from which object pose and size could be predicted via a subsequent network. Direct regression from deep correspondence thus alleviates the difficulties encountered by that from object observation. We note that upon the correspondence and the predictions, a self-supervised signal of intra-consistency could also be built for unlabeled data (see Sec. 3.2).
As depicted in Fig. 2, the architecture of DPDN consists of three main modules, including Triplet Feature Extractor, Deep Prior Deformer, and Pose and Size Estimator. We will give detailed illustrations shortly.
Deep Prior Deformer
where denotes concatenation along feature dimension, denotes a trainable subnetwork of MLP, denotes an average-pooling operation over surface points, and denotes copies of the feature vector.
is used to deform the deep prior to match in the feature space. Thereby, according to (3), we have , with a global feature generated by averaging over points. Then could be learned from the fusion of , , and copies of , via another MLP as follows:
Compared to the common practice in to learn by fusing and with copies of , our deformed version of the deep prior via adding could effectively improve the quality of .
We also learn in (2) from as follows:
such that according to (2) and (3), we have and , respectively. Supervisions on and could guide the learning of and .
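As a rough illustration of prior deformation in 3D space (the non-deep version that the text refers to as (2)), the sketch below adds a deformation field to a categorical prior and then uses a row-normalized correspondence matrix to pick, for each observed point, its canonical counterpart. The names `deform_prior`, `prior`, `deformation`, and `corr_logits`, as well as the softmax normalization of the correspondence matrix, are assumptions for illustration and are not taken from the source.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def deform_prior(prior, deformation, corr_logits):
    """Deform a shape prior and map it to the ordering of the observed points.

    prior:       (M, 3) categorical shape prior in the canonical space.
    deformation: (M, 3) per-point deformation field (assumed additive).
    corr_logits: (N, M) unnormalized correspondence scores, one row per
                 observed point (softmax normalization is an assumption).
    Returns the complete deformed shape (M, 3) and the partial canonical
    point set (N, 3) ordered like the observed points.
    """
    deformed = prior + deformation
    corr = softmax(corr_logits, axis=1)  # each row sums to 1
    canonical = corr @ deformed
    return deformed, canonical
```

Because every row of the correspondence matrix sums to one, each canonical point is a convex combination of deformed prior points, which keeps the partial canonical point set inside the hull of the complete shape.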
Pose and Size Estimator
Through the module of Deep Prior Deformer, we establish point-to-point correspondence for the observed with by learning in the canonical space with . As shown in Fig. 2, for estimating object pose and size, we firstly pair the correspondence via feature concatenation and apply an MLP to lift the features as follows:
We then inject global information into the point-wise correspondence in by concatenating its averaged feature , followed by an MLP to strengthen the correspondence; a pose-sensitive feature vector is learned from all the correspondence information via an average-pooling operation:
Finally, we apply three parallel MLPs to regress , , and , respectively:
where we choose a 6D representation of rotation as the regression target of the first MLP, for its continuous learning space in , and represents transformation from the 6D representation to the rotation matrix .
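For readers unfamiliar with the 6D rotation representation, the following sketch shows one common way to map a 6D vector to a rotation matrix via Gram-Schmidt orthogonalization. Treating the two 3D halves as the first two columns of the matrix is an assumed convention (implementations differ), and `rotation_from_6d` is a hypothetical name.

```python
import numpy as np


def rotation_from_6d(x):
    """Map a 6D vector to a 3x3 rotation matrix by Gram-Schmidt.

    x: array of shape (6,); x[:3] and x[3:] are treated as two (not
    necessarily orthonormal) column vectors. They must not be parallel.
    """
    a1, a2 = x[:3], x[3:]
    b1 = a1 / np.linalg.norm(a1)
    # Remove the component of a2 along b1, then normalize.
    a2_orth = a2 - np.dot(b1, a2) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth)
    # The third axis is fixed by the right-hand rule.
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=1)
```

The mapping is continuous in its input away from degenerate (zero or parallel) vectors, which is the property that makes this representation convenient as a regression target.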
For the whole DPDN, we could summarize it as follows:
For an observed point set , if we transform it with (, , ) and (, , ), we can obtain and , respectively. When inputting them into DPDN in parallel, we have
with , , , and .
1) ;
2) .
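The idea of recovering one pose from two transformed inputs can be sketched as follows. The convention used here (a transformed point set is ds · dR · P + dt, and a pose is a triple of rotation, translation, and scalar scale) is an assumption chosen for illustration and may differ from the paper's definition; `apply_similarity`, `compose`, `recover`, and `rot_z` are hypothetical helper names.

```python
import numpy as np


def rot_z(deg):
    """Rotation about the z-axis by an angle given in degrees."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def apply_similarity(points, R, t, s):
    """Apply p -> s * R @ p + t to every row of an (N, 3) array."""
    return s * points @ R.T + t


def compose(dR, dt, ds, R, t, s):
    """Pose of the transformed observation, given the pose of the original."""
    return dR @ R, ds * dR @ t + dt, ds * s


def recover(dR, dt, ds, R1, t1, s1):
    """Pose of the original observation from a pose predicted for the
    transformed one (the inverse of `compose`)."""
    return dR.T @ R1, dR.T @ (t1 - dt) / ds, s1 / ds
```

Under this convention, if a network predicted the pose of each transformed input perfectly, calling `recover` on the two branches would return the same pose for the original observation; the disagreement between the two recovered poses is what a cross-branch consistency penalty can measure.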
Upon the above parallel learning of (11) and (12), we design a novel self-supervised objective for the unlabeled real-world data to reduce the Sim2Real domain gap. Specifically, it combines an inter-consistency term with two intra-consistency ones as follows:
where and are hyperparameters that balance the loss terms. enforces consistency across the parallel learning from with different transformations, making the learning aware of pose changes to improve the precision of predictions, while and enforce self-adaptation between correspondence and predictions within each learning, respectively, in order to realize more reliable predictions inferred from the correspondence.
We construct the inter-consistency loss based on the following two facts: 1) two solutions to the pose and size of from those of and are required to be consistent; 2) as representations of the same object in the canonical space, and should be invariant to any pose transformations, and thus remain consistent with each other, as well as and . Therefore, with two input transformations, the inter-consistency loss can be formulated as follows:
where and are balanced parameters, and
Chamfer distance is used to constrain the distance between the two complete point sets and , while for the partial and , we use the stricter metric of L2 distance for point-to-point constraints, since their points should be ordered to correspond with those of .
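The difference between the two distances can be seen in a small sketch: Chamfer distance ignores point ordering, whereas a point-to-point L2 distance requires that points with the same index correspond. The variant of Chamfer distance below (unsquared nearest-neighbor distances, averaged in both directions and summed) is one common choice and an assumption here; the paper's exact definition may differ.

```python
import numpy as np


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise Euclidean distances, shape (N, M).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Nearest neighbor in b for every point of a, and vice versa.
    return d.min(axis=1).mean() + d.min(axis=0).mean()


def pointwise_l2(a, b):
    """Mean L2 distance between points sharing the same index.

    Both inputs must have shape (N, 3) and the same point ordering.
    """
    return np.linalg.norm(a - b, axis=-1).mean()
```

Reordering the points of one set leaves `chamfer_distance` unchanged but generally changes `pointwise_l2`, which is why the latter is the stricter constraint when the ordering is known to match.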
Intra-Consistency Terms
For an observation , DPDN learns deep correspondence between and to predict their relative pose and size ; ideally, for , . Accordingly, the predictions and in (11) and (12) should be restrained to be consistent with and , respectively, which gives the formulations of two intra-consistency terms based on Smooth-L1 distance as follows:
with and .
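A minimal sketch of an intra-consistency check based on Smooth-L1 distance is given below. It assumes that the canonical points should match the observed points mapped back by the predicted pose, i.e., Rᵀ(p − t) divided by a scalar scale; this back-transformation, the scalar scale, and the names `smooth_l1` and `intra_consistency` are assumptions for illustration rather than the paper's exact formulation.

```python
import numpy as np


def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 distance: quadratic below beta, linear above, averaged."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()


def intra_consistency(observed, canonical, R, t, scale):
    """Compare predicted canonical points with back-transformed observations.

    observed:  (N, 3) observed points.
    canonical: (N, 3) predicted canonical points, ordered like `observed`.
    R, t, scale: predicted rotation (3, 3), translation (3,), scalar scale.
    """
    # Row-wise form of R^T (p - t) / scale.
    back_transformed = (observed - t) @ R / scale
    return smooth_l1(canonical, back_transformed)
```

The term needs no pose annotation: it only asks that the two outputs of the same forward pass (correspondence and pose) agree with each other, which is what makes it usable on unlabeled real-world data.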
Given a triplet of inputs along with the annotated ground truths , we generate dual input triplets by applying two rigid transformations to , as done in Sec. 3.2, and use the following supervised objective on top of the parallel learning of (11) and (12):
We note that this supervision also implies inter-consistency between the parallel learning defined in (14), making DPDN more sensitive to pose changes.
Experiments
We train DPDN on the training sets of both the synthetic CAMERA25 and real-world REAL275 datasets, and conduct evaluation on the REAL275 test set. CAMERA25 is created by a context-aware mixed reality approach, which renders synthetic object CAD models of 6 categories to real-world backgrounds, yielding a total of RGB-D images, with ones of objects set aside for validation. REAL275 is a more challenging real-world dataset, which includes training images of scenes and testing ones of scenes. Both datasets share the same categories, yet exhibit a large domain gap.
Implementation Details
To obtain instance masks for both the training and test sets of REAL275, we train a Mask R-CNN with a ResNet101 backbone on CAMERA25; for settings of available training mask labels, we use the same segmentation results as to make fair comparisons. We employ the shape priors released by .
Evaluation Metrics
Following , we report mean Average Precision (mAP) of Intersection over Union (IoU) for object detection, and mAP of n°m cm for 6D pose estimation. IoUx denotes the precision of predictions with IoU over a threshold of x%, and n°m cm denotes the precision of those with rotation error less than n° and translation error less than m cm.
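The pose criterion described above thresholds a rotation error and a translation error per prediction. The sketch below computes both errors and the pass/fail decision for a single prediction; it assumes translations are given in meters, ignores the special handling usually applied to symmetric categories, and does not reproduce the full mAP aggregation. All function names are hypothetical.

```python
import numpy as np


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle (in degrees) between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))


def translation_error_cm(t_pred, t_gt):
    """Euclidean translation error in centimeters (inputs assumed in meters)."""
    return np.linalg.norm(t_pred - t_gt) * 100.0


def pose_within(R_pred, t_pred, R_gt, t_gt, max_deg, max_cm):
    """True if rotation error < max_deg and translation error < max_cm."""
    return bool(
        rotation_error_deg(R_pred, R_gt) < max_deg
        and translation_error_cm(t_pred, t_gt) < max_cm
    )
```

For example, `pose_within(R_pred, t_pred, R_gt, t_gt, max_deg=5, max_cm=2)` would test a single prediction against a 5°2 cm threshold; averaging such decisions over detections is the basis of the reported precision.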
Comparisons with Existing Methods
We compare our method with the existing ones for category-level 6D object pose and size estimation under both unsupervised and supervised settings. Quantitative results are given in Table 1, where results under the supervised setting benefit significantly from the annotations of real-world data, compared to those under the unsupervised one; for example, on the metric of cm, SPD improves the results from to , while DualPoseNet improves from to . Therefore, the exploration of UDA for the target task in this paper is of great practical significance, given the difficulty of precisely annotating real-world object instances in 3D space.
Firstly, a basic version of DPDN is trained on the synthetic data and transferred to the real-world domain for evaluation; under this setting, our basic DPDN outperforms the existing methods on all the evaluation metrics, as shown in Table 1. To reduce the Sim2Real domain gap, we further include the unlabeled REAL275 training set via our self-supervised DPDN for UDA; results in Table 1 verify the effectiveness of our self-supervised DPDN (dubbed Self-DPDN in the table), which significantly improves the precision of the basic version, e.g., a performance gain of on cm from to .
UDA-COPE is the first work to introduce the unsupervised setting; it trains a deep model with a teacher-student scheme to yield pseudo labels for real-world data. During training, pose annotations of real-world data are not employed, yet mask labels are used for learning instance segmentation. To compare fairly with UDA-COPE, we evaluate DPDN under the same setting; results in Table 1 also show the superiority of our self-supervised DPDN over UDA-COPE, especially on the high-precision metrics, e.g., an improvement of on cm. The reason for this large improvement is that UDA-COPE relies heavily on the quality of pseudo labels, while our self-supervised objective guides the optimization toward solutions that satisfy inter-/intra-consistency, so that the learning fits the characteristics of REAL275 and the downside effect of the synthetic domain is reduced.
Supervised Setting
We also compare our DPDN with the existing methods, including those based on direct regression and those based on dense correspondence learning, under the supervised setting. As shown in Table 1, DPDN outperforms the existing methods on all the evaluation metrics, e.g., reaching the precisions of on IoU75 and on cm. Compared with the representative SS-ConvNet, which directly regresses object poses and sizes from observations, our DPDN takes advantage of categorical shape priors, and achieves more precise results by regressing from deep correspondence; compared with SGPA, the recent state-of-the-art method based on correspondence learning (cf. Eq. (2)), our DPDN shares the same feature extractor, yet benefits from the direct objectives for pose and size estimation, rather than the surrogate ones, e.g., for regression of and in Eq. (2); DPDN thus achieves more reliable predictions.
Ablation Studies and Analyses
In this section, we conduct experiments to evaluate the efficacy of both the designs in our DPDN and the self-supervision upon DPDN.
We verify the efficacy of the designs in DPDN under the supervised setting, with the results of different variants of our DPDN shown in Table 2. Firstly, we confirm the effectiveness of our Deep Prior Deformer with categorical shape priors; when the Deep Prior Deformer is removed, the precision of DPDN with parallel learning on cm drops from to , indicating that learning from deep correspondence by deforming priors in the feature space indeed makes the task easier than learning directly from object observations.
Secondly, we show the advantages of using true objectives for direct estimates of object poses and sizes, over the surrogate ones for the learning of the canonical point set to pair with the observed . Specifically, we remove our Pose and Size Estimator, and make predictions by solving the Umeyama algorithm to align and ; precisions shown in the table decline sharply on all the evaluation metrics, especially on IoUx. We found that the results on IoUx are also much lower than those of methods based on dense correspondence learning , while results on are comparable; the reason is that we regress the absolute coordinates of , rather than the deviations in Eq. (2), which may introduce more outliers that affect the object size estimation.
Thirdly, we confirm the effectiveness of the parallel supervisions in (16), e.g., inputting and of the same instance with different poses. As shown in Table 2, results of DPDN with parallel learning are improved (with or without Deep Prior Deformer), since the inter-consistency between dual predictions is implied in the parallel supervisions, making the learning aware of pose changes.
Effects of the Self-Supervision upon DPDN
We have shown the superiority of our novel self-supervised DPDN under the unsupervised setting in Sec. 4.1; here we evaluate the effectiveness of each consistency term in the self-supervised objective, which is confirmed by the results shown in Table 3. Taking results on cm as examples, DPDN with the inter-consistency term improves the results of the baseline from to , and DPDN with the intra-consistency terms and improves to , while their combination further improves the results, revealing their strengths in reducing the domain gap. We also show the influence of the amount of unlabeled real-world images on the precision of predictions in Fig. 3, where precisions improve as the ratio of training data increases.
Visualization
We visualize in Fig. 4 the qualitative results of our proposed DPDN under different settings on the REAL275 test set. As shown in the figure, our self-supervised DPDN, without annotations of real-world data, in general achieves results comparable with those of the fully-supervised version, although there still exist some difficult examples, e.g., cameras in Fig. 4 (a) and (b), due to the inaccurate masks from Mask R-CNN trained on synthetic data. Under the unsupervised setting, our self-supervised DPDN also outperforms the basic version trained with only CAMERA25, by including unlabeled real-world data with self-supervision; for example, more precise poses of laptops are obtained in Fig. 4 (c) and (d).
Acknowledgements. This work is supported in part by the Guangdong R&D key project of China (No.: 2019B010155001), and the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (No.: 2017ZT07X183).