Category-Level 6D Object Pose and Size Estimation using Self-Supervised Deep Prior Deformation Networks

Jiehong Lin, Zewei Wei, Changxing Ding, Kui Jia

Introduction

The task of category-level 6D object pose and size estimation, formally introduced in , is to estimate the rotations, translations, and sizes of unseen object instances of certain categories in cluttered RGB-D scenes. It plays a crucial role in many real-world applications, such as robotic grasping, augmented reality, and autonomous driving .

For this task, existing methods can be roughly categorized into two groups: those based on direct regression and those based on dense correspondence learning. Methods of the former group are conceptually simple, but struggle to learn pose-sensitive features that support direct predictions in the full $SE(3)$ space. Dense correspondence learning makes the task easier by first regressing point-wise coordinates in the canonical space to align with the observed points, and then obtaining object poses and sizes by solving the Umeyama algorithm. Recent works of the second group exploit strong categorical priors (e.g., mean shapes of object categories) to improve the quality of the canonical point sets, and consistently achieve impressive results; however, their surrogate objectives for learning canonical coordinates are one step away from the true objectives of estimating object poses and sizes, making their learning suboptimal for the end task.

The considered learning task is further challenged by the lack of real-world RGB-D data with careful object pose and size annotations in 3D space. As such, synthetic data, whose annotations can be obtained freely on the fly, are usually simulated and rendered. However, the easy annotations in synthetic domains come at the cost of a synthetic-to-real (Sim2Real) domain gap; learning with synthetic data without considering Sim2Real domain adaptation would inevitably result in poor generalization in the real-world domain. This naturally falls in the realm of Sim2Real unsupervised domain adaptation (UDA).

In this work, we consider the task setting of Sim2Real UDA for category-level 6D object pose and size estimation. We propose a new method, the self-supervised Deep Prior Deformation Network; Fig. 1 gives an illustration. Following dense correspondence learning, we first present a novel Deep Prior Deformation Network, shortened as DPDN, which implements a deep version of shape prior deformation in the feature space, and is thus able to establish deep correspondence for direct regression of poses and sizes with high precision. For a cluttered RGB-D scene, we employ a 2D instance segmentation network (e.g., Mask RCNN) to segment out the objects of interest, and feed them into our proposed DPDN for pose and size estimation. As shown in Fig. 2, the architecture of DPDN consists of three main modules: a Triplet Feature Extractor, a Deep Prior Deformer, and a Pose and Size Estimator. For an object observation, the Triplet Feature Extractor learns point-wise features from its image crop, point set, and categorical shape prior, respectively; the Deep Prior Deformer then deforms the prior in the feature space by learning a feature deformation field and a correspondence matrix, and thus builds deep correspondence from the observation to its canonical version; finally, the Pose and Size Estimator makes reliable predictions directly from the built deep correspondence.

On top of DPDN, we formulate a self-supervised objective that combines an inter-consistency term with two intra-consistency ones for UDA. More specifically, as shown in Fig. 1, we apply two rigid transformations to an input point set of an object observation, and feed the two transformed point sets into our DPDN in parallel to make dual sets of predictions. Upon this parallel learning, the inter-consistency term enforces cross consistency between the dual predictions w.r.t. the two transformations to improve the sensitivity of DPDN to pose changes, and within each learning branch, an individual intra-consistency term enforces self-adaptation between the correspondence and the predictions. We train DPDN on the training sets of both the synthetic CAMERA25 and real-world REAL275 datasets; our results outperform the existing methods on the REAL275 test set under both unsupervised and supervised settings. We also conduct ablation studies that confirm the advantages of our designs. Our contributions can be summarized as follows:

We propose a Deep Prior Deformation Network, termed as DPDN, for the task of category-level 6D object pose and size estimation. DPDN deforms categorical shape priors to pair with object observations in the feature space, and is thus able to establish deep correspondence for direct regression of object poses and sizes.

Given that the considered task largely uses synthetic training data, we formulate a novel self-supervised objective upon DPDN to reduce the synthetic-to-real domain gap. The objective is built upon enforcing consistencies between parallel learning w.r.t. two rigid transformations, and has the effects of both improving the sensitivity of DPDN to pose changes, and making predictions more reliable.

We conduct thorough ablation studies to confirm the efficacy of our designs. Notably, our method outperforms existing ones on the benchmark dataset of real-world REAL275 under both the unsupervised and supervised settings.

Related Work

Methods of fully-supervised category-level 6D pose and size estimation can be roughly divided into two groups: those based on direct regression and those based on dense correspondence learning.

Direct estimates of object poses and sizes from object observations suffer from the difficulties in the learning of the full $SE(3)$ space, and thus make demands on extraction of pose-sensitive features. FS-Net builds an orientation-aware backbone with 3D graph convolutions to encode object shapes, and makes predictions with a decoupled rotation mechanism. DualPoseNet encodes pose-sensitive features from object observations based on rotation-equivariant spherical convolutions, while two parallel pose decoders with different working mechanisms are stacked to impose complementary supervision. A recent work of SS-ConvNet designs Sparse Steerable Convolutions (SS-Conv) to further explore $SE(3)$-equivariant feature learning, and presents a two-stage pose estimation pipeline upon SS-Convs for iterative pose refinement.

Another group of works first learns coordinates of object observations in the canonical space to establish dense correspondence, and then obtains object poses and sizes by solving the Umeyama algorithm on the correspondence in 3D space. NOCS, the first work for our focused task, is realized in this way by directly regressing canonical coordinates from RGB images. SPD then makes the learning of canonical points easier by deforming categorical shape priors, rather than directly regressing from object observations. The follow-up works also confirm the advantages of shape priors, and make efforts on the prior deformation to further improve the quality of canonical points, e.g., via recurrent reconstruction for iterative refinement, or structure-guided adaptation based on transformers.

Unsupervised Methods

Due to the time-consuming and labor-intensive annotating of real-world data in 3D space, UDA-COPE presents a new setting of unsupervised domain adaptation for the focused task, and adapts a teacher-student scheme with bidirectional point filtering to this setting, which, however, heavily relies on the qualities of pseudo labels. In this paper, we exploit inter-/intra-consistency in the self-supervised objective to explore the data characteristics of real-world data and fit the data for the reduction of domain gap.

Self-Supervised Deep Prior Deformation Network

with $B_{1}=\sum_{i=1}^{B}\alpha_{i}$ and $B_{2}=\sum_{i=1}^{B}(1-\alpha_{i})$. $\{\alpha_{i}\}_{i=1}^{B}$ is a binary mask; $\alpha_{i}=1$ if the observation of $\mathcal{V}_{i}$ is fully annotated and $\alpha_{i}=0$ otherwise. In Sec. 3.2, we will give a detailed illustration of the self-supervised objective $\mathcal{L}_{self-supervised}$, which learns inter-consistency and intra-consistency upon DPDN, while the illustration of the supervised objective $\mathcal{L}_{supervised}$ is included in Sec. 3.3.
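The role of the mask and the two normalizers can be sketched as follows. The objective that combines the two losses is not reproduced in this section, so the sketch assumes that each loss is averaged over its own subset of the batch and that the two averages are summed; `mixed_batch_loss` and its arguments are hypothetical names.

```python
import numpy as np

def mixed_batch_loss(alpha, loss_sup, loss_self):
    """Combine per-sample losses of a mixed batch (assumed form, see lead-in).

    alpha:     (B,) binary mask, 1 for fully annotated samples, 0 otherwise.
    loss_sup:  (B,) supervised loss per sample (ignored where alpha == 0).
    loss_self: (B,) self-supervised loss per sample (ignored where alpha == 1).
    """
    alpha = np.asarray(alpha, dtype=float)
    loss_sup = np.asarray(loss_sup, dtype=float)
    loss_self = np.asarray(loss_self, dtype=float)
    B1 = alpha.sum()            # number of annotated samples
    B2 = (1.0 - alpha).sum()    # number of unannotated samples
    # max(., 1) only guards against an empty subset in this sketch.
    sup = (alpha * loss_sup).sum() / max(B1, 1.0)
    unsup = ((1.0 - alpha) * loss_self).sum() / max(B2, 1.0)
    return sup + unsup

total = mixed_batch_loss([1, 0, 1, 0], [2.0, 9.0, 4.0, 9.0], [7.0, 1.0, 7.0, 3.0])
```

Normalizing each term by its own count keeps the two losses on comparable scales when the ratio of annotated to unannotated samples varies between batches.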

Finally, solving the Umeyama algorithm to align $\mathcal{Q}_{o}$ with $\mathcal{P}_{o}$ gives the target pose and size.
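For reference, the alignment step can be sketched with the standard closed-form Umeyama solution. This is a minimal NumPy sketch, not the implementation used in the paper; the function name `umeyama_similarity` and the toy data are assumptions, and the function returns a single isotropic scale rather than the three-dimensional size $\bm{s}$.

```python
import numpy as np

def umeyama_similarity(src, dst):
    """Least-squares similarity transform such that dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points.
    Returns the scale s (float), rotation R (3, 3) and translation t (3,).
    """
    n = src.shape[0]
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    # Cross-covariance between the centred point sets.
    cov = dst_c.T @ src_c / n
    U, sigma, Vt = np.linalg.svd(cov)
    # Reflection guard so that det(R) = +1.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(sigma) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Toy check: canonical points Q_o mapped to "observed" points P_o.
rng = np.random.default_rng(0)
Q_o = rng.uniform(-0.5, 0.5, size=(100, 3))
angle = 0.7
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
s_true, t_true = 0.3, np.array([0.1, -0.2, 0.8])
P_o = s_true * Q_o @ R_true.T + t_true
s_est, R_est, t_est = umeyama_similarity(Q_o, P_o)
```

With noise-free correspondences the transform is recovered exactly; with noisy or outlier-contaminated $\mathcal{Q}_{o}$, the estimate degrades, which is the sensitivity the following paragraph refers to.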

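However, surrogate objectives for the learning of $\bm{A}$ and $\bm{D}$ are a step away from the true ones for estimates of pose and size; for example, small deviations of $\bm{A}$ or $\bm{D}$ may lead to large changes in the pose space. Thereby, we present a Deep Prior Deformation Network (DPDN), which implements a deep version of (2) as follows:

$$\mathcal{F}_{\mathcal{Q}_{v}}=\mathcal{F}_{\mathcal{Q}_{c}}+\mathcal{F}_{\bm{D}},\qquad \mathcal{F}_{\mathcal{Q}_{o}}=\bm{A}\times\mathcal{F}_{\mathcal{Q}_{v}},\tag{3}$$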

where $\mathcal{F}_{\mathcal{Q}_{c}}$, $\mathcal{F}_{\mathcal{Q}_{v}}$, and $\mathcal{F}_{\mathcal{Q}_{o}}$ denote point-wise features of $\mathcal{Q}_{c}$, $\mathcal{Q}_{v}$, and $\mathcal{Q}_{o}$, respectively, and $\mathcal{F}_{\bm{D}}$ is a feature deformation field w.r.t. $\mathcal{F}_{\mathcal{Q}_{c}}$. The deep version (3) deforms $\mathcal{Q}_{c}$ in the feature space, such that features of $\mathcal{Q}_{o}$ and $\mathcal{P}_{o}$ are paired to establish deep correspondence, from which object pose and size $(\bm{R},\bm{t},\bm{s})$ could be predicted via a subsequent network. Direct regression from deep correspondence thus alleviates the difficulties encountered by that from object observation. We note that upon the correspondence and the predictions, a self-supervised signal of intra-consistency could also be built for unlabeled data (see Sec. 3.2).

As depicted in Fig. 2, the architecture of DPDN consists of three main modules, including Triplet Feature Extractor, Deep Prior Deformer, and Pose and Size Estimator. We will give detailed illustrations shortly.

Deep Prior Deformer

where $[\cdot,\cdot]$ denotes concatenation along the feature dimension, $\texttt{MLP}(\cdot)$ denotes a trainable subnetwork of MLP, $\texttt{AvgPool}(\cdot)$ denotes an average-pooling operation over surface points, and $\texttt{Tile}^{M}(\cdot)$ denotes $M$ copies of the feature vector.
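The three non-learned operators named above can be illustrated on toy arrays. This is only a shape-level sketch: the sizes `N`, `M`, `C` and the particular features being fused are hypothetical, and the MLP is omitted.

```python
import numpy as np

def avg_pool(features):
    """Average point-wise features (num_points, C) into one global vector (C,)."""
    return features.mean(axis=0)

def tile(vector, copies):
    """Repeat a global feature vector (C,) `copies` times, giving (copies, C)."""
    return np.tile(vector, (copies, 1))

def concat(*features):
    """Concatenate point-wise features along the feature dimension."""
    return np.concatenate(features, axis=1)

rng = np.random.default_rng(0)
N, M, C = 5, 8, 4                      # hypothetical sizes
F_Po = rng.normal(size=(N, C))         # features of N observed points
F_Qc = rng.normal(size=(M, C))         # features of M prior points
f_Po = avg_pool(F_Po)                  # global feature of the observation
fused = concat(F_Qc, tile(f_Po, M))    # (M, 2C): per-point plus global context
```

Tiling a pooled vector before concatenation is what lets a point-wise MLP see global context from another input at every point.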

$\mathcal{F}_{\bm{D}}$ is used to deform the deep prior $\mathcal{F}_{\mathcal{Q}_{c}}$ to match $\mathcal{V}$ in the feature space. Thereby, according to (3), we have $\mathcal{F}_{\mathcal{Q}_{v}}=\mathcal{F}_{\mathcal{Q}_{c}}+\mathcal{F}_{\bm{D}}$, with a global feature $\bm{f}_{\mathcal{Q}_{v}}$ generated by averaging $\mathcal{F}_{\mathcal{Q}_{v}}$ over $M$ points. Then $\bm{A}$ could be learned from the fusion of $\mathcal{F}_{\mathcal{I}_{o}}$, $\mathcal{F}_{\mathcal{P}_{o}}$, and $N$ copies of $\bm{f}_{\mathcal{Q}_{v}}$, via another MLP as follows:

Compared to the common practice of learning $\bm{A}$ by fusing $\mathcal{F}_{\mathcal{I}_{o}}$ and $\mathcal{F}_{\mathcal{P}_{o}}$ with $N$ copies of $\bm{f}_{\mathcal{Q}_{c}}=\texttt{AvgPool}(\mathcal{F}_{\mathcal{Q}_{c}})$, our deformed version of the deep prior $\mathcal{F}_{\mathcal{Q}_{c}}$, obtained by adding $\mathcal{F}_{\bm{D}}$, could effectively improve the quality of $\bm{A}$.

We also learn $\mathcal{Q}_{v}$ in (2) from $\mathcal{F}_{\mathcal{Q}_{v}}$ as follows:

such that according to (2) and (3), we have $\mathcal{Q}_{o}=\bm{A}\times\mathcal{Q}_{v}$ and $\mathcal{F}_{\mathcal{Q}_{o}}=\bm{A}\times\mathcal{F}_{\mathcal{Q}_{v}}$, respectively. Supervisions on $\mathcal{Q}_{v}$ and $\mathcal{Q}_{o}$ could guide the learning of $\mathcal{F}_{\bm{D}}$ and $\bm{A}$.
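The bookkeeping of the relations above can be sketched on random arrays. Everything numeric here is a placeholder: the deformation fields are random noise rather than network outputs, in DPDN $\mathcal{Q}_{v}$ is regressed from $\mathcal{F}_{\mathcal{Q}_{v}}$ rather than formed as `Q_c + D`, and the row-wise softmax on $\bm{A}$ is an assumption that makes each row a soft assignment over the $M$ prior points.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax, so that each row is a soft assignment over M points."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, M, C = 6, 10, 16                          # hypothetical sizes
Q_c = rng.uniform(-0.5, 0.5, size=(M, 3))    # categorical shape prior
F_Qc = rng.normal(size=(M, C))               # its point-wise features
D = 0.01 * rng.normal(size=(M, 3))           # placeholder deformation field
F_D = 0.01 * rng.normal(size=(M, C))         # placeholder feature deformation
A = softmax_rows(rng.normal(size=(N, M)))    # placeholder correspondence matrix

Q_v = Q_c + D          # deformed prior in point space, as in (2)
F_Qv = F_Qc + F_D      # deformed prior in feature space, as in (3)
Q_o = A @ Q_v          # (N, 3): one canonical point per observed point
F_Qo = A @ F_Qv        # (N, C): one canonical feature per observed point
```

Because the same $\bm{A}$ maps both points and features, row $j$ of `Q_o` and row $j$ of `F_Qo` describe the same canonical location, ordered to match the $j$-th observed point.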

Pose and Size Estimator

Through the module of Deep Prior Deformer, we establish point-to-point correspondence for the observed $\mathcal{P}_{o}$ with $\mathcal{F}_{\mathcal{P}_{o}}$ by learning $\mathcal{Q}_{o}$ in the canonical space with $\mathcal{F}_{\mathcal{Q}_{o}}$. As shown in Fig. 2, for estimating object pose and size, we first pair the correspondence via feature concatenation and apply an MLP to lift the features as follows:

We then inject global information into the point-wise correspondence in $\mathcal{F}_{corr}$ by concatenating its averaged feature $\bm{f}_{corr}$, followed by an MLP to strengthen the correspondence; a pose-sensitive feature vector $\bm{f}_{pose}$ is learned from all the correspondence information via an average-pooling operation:

Finally, we apply three parallel MLPs to regress $\bm{R}$, $\bm{t}$, and $\bm{s}$, respectively:

where we choose a 6D representation of rotation as the regression target of the first MLP, for its continuous learning space in $SO(3)$, and $\rho(\cdot)$ represents transformation from the 6D representation to the $3\times 3$ rotation matrix $\bm{R}$.
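One common realization of such a mapping $\rho(\cdot)$ is Gram-Schmidt orthogonalization of two 3D vectors. The sketch below assumes this construction and a column-wise layout of the result; the paper's exact convention is not stated in this section, and `rotation_6d_to_matrix` is a hypothetical name.

```python
import numpy as np

def rotation_6d_to_matrix(x6):
    """Map a 6D vector to a 3x3 rotation matrix via Gram-Schmidt.

    x6: (6,) array read as two 3D vectors (a1, a2); assumed not collinear.
    """
    a1, a2 = x6[:3], x6[3:]
    b1 = a1 / np.linalg.norm(a1)
    # Remove the component of a2 along b1, then normalise.
    a2_orth = a2 - np.dot(b1, a2) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth)
    # The third axis follows from the right-hand rule.
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=1)   # b1, b2, b3 as columns

R_id = rotation_6d_to_matrix(np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))
R_any = rotation_6d_to_matrix(np.array([0.3, -1.2, 0.5, 2.0, 0.1, -0.7]))
```

Any non-degenerate 6D output yields a valid rotation, so the regression head needs no orthogonality constraint of its own.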

The whole DPDN can be summarized as follows:

For an observed point set $\mathcal{P}_{o}=\{\bm{p}_{o}^{(j)}\}_{j=1}^{N}$, if we transform it with ($\Delta\bm{R}_{1}$, $\Delta\bm{t}_{1}$, $\Delta s_{1}$) and ($\Delta\bm{R}_{2}$, $\Delta\bm{t}_{2}$, $\Delta s_{2}$), we can obtain $\mathcal{P}_{o,1}=\{\bm{p}_{o,1}^{(j)}\}_{j=1}^{N}=\{\frac{1}{\Delta s_{1}}\Delta\bm{R}_{1}^{T}(\bm{p}_{o}^{(j)}-\Delta\bm{t}_{1})\}_{j=1}^{N}$ and $\mathcal{P}_{o,2}=\{\bm{p}_{o,2}^{(j)}\}_{j=1}^{N}=\{\frac{1}{\Delta s_{2}}\Delta\bm{R}_{2}^{T}(\bm{p}_{o}^{(j)}-\Delta\bm{t}_{2})\}_{j=1}^{N}$, respectively. When inputting them into DPDN in parallel, we have

with $\mathcal{Q}_{v,1}=\{\bm{q}_{v,1}^{(j)}\}_{j=1}^{M}$, $\mathcal{Q}_{o,1}=\{\bm{q}_{o,1}^{(j)}\}_{j=1}^{N}$, $\mathcal{Q}_{v,2}=\{\bm{q}_{v,2}^{(j)}\}_{j=1}^{M}$, and $\mathcal{Q}_{o,2}=\{\bm{q}_{o,2}^{(j)}\}_{j=1}^{N}$.

1) $\bm{R}_{1},\bm{t}_{1},\bm{s}_{1}=\Delta\bm{R}_{1}\bm{R}_{\mathcal{P}_{o,1}},\ \Delta\bm{t}_{1}+\Delta s_{1}\Delta\bm{R}_{1}\bm{t}_{\mathcal{P}_{o,1}},\ \Delta s_{1}\bm{s}_{\mathcal{P}_{o,1}}$;

2) $\bm{R}_{2},\bm{t}_{2},\bm{s}_{2}=\Delta\bm{R}_{2}\bm{R}_{\mathcal{P}_{o,2}},\ \Delta\bm{t}_{2}+\Delta s_{2}\Delta\bm{R}_{2}\bm{t}_{\mathcal{P}_{o,2}},\ \Delta s_{2}\bm{s}_{\mathcal{P}_{o,2}}$.
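The two solutions above follow from inverting the input transformation. A small numerical check, assuming the convention $\bm{p}=||\bm{s}||_{2}\bm{R}\bm{q}+\bm{t}$ between canonical and observed points (the inverse of the relation used for the intra-consistency terms) and using hypothetical helper names, is:

```python
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def transform_points(P, dR, dt, ds):
    """Apply p' = (1/ds) * dR^T (p - dt) to every row of P (N, 3)."""
    return (P - dt) @ dR / ds

def lift_pose(dR, dt, ds, R_in, t_in, s_in):
    """Map a pose predicted for the transformed set back to the original set."""
    return dR @ R_in, dt + ds * (dR @ t_in), ds * s_in

rng = np.random.default_rng(0)
Q = rng.uniform(-0.5, 0.5, size=(50, 3))            # canonical points
R, t, s = rot_z(0.4) @ rot_x(-0.9), np.array([0.1, 0.2, 0.9]), np.array([0.2, 0.3, 0.1])
P_o = np.linalg.norm(s) * Q @ R.T + t               # observed points

dR, dt, ds = rot_x(0.3) @ rot_z(1.1), np.array([0.02, -0.01, 0.03]), 1.2
P_o1 = transform_points(P_o, dR, dt, ds)

# Exact pose of the transformed set, i.e. what an ideal network would predict.
R_in, t_in, s_in = dR.T @ R, dR.T @ (t - dt) / ds, s / ds
R_rec, t_rec, s_rec = lift_pose(dR, dt, ds, R_in, t_in, s_in)
```

Since both branches must lift to the same pose of $\mathcal{P}_{o}$, any disagreement between them on unlabeled data is a usable training signal.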

Upon the above parallel learning of (11) and (12), we design a novel self-supervised objective for the unlabeled real-world data to reduce the Sim2Real domain gap. Specifically, it combines an inter-consistency term $\mathcal{L}_{inter}$ with two intra-consistency ones $(\mathcal{L}_{intra,1},\mathcal{L}_{intra,2})$ as follows:

where $\lambda_{1}$ and $\lambda_{2}$ are hyperparameters to balance the loss terms. $\mathcal{L}_{inter}$ enforces consistency across the parallel learning from $\mathcal{P}_{o}$ with different transformations, making the learning aware of pose changes to improve the precision of predictions, while $\mathcal{L}_{intra,1}$ and $\mathcal{L}_{intra,2}$ enforce self-adaptation between correspondence and predictions within each learning branch, respectively, in order to realize more reliable predictions inferred from the correspondence.

We construct the inter-consistency loss based on the following two facts: 1) two solutions to the pose and size of $\mathcal{P}_{o}$ from those of $\mathcal{P}_{o,1}$ and $\mathcal{P}_{o,2}$ are required to be consistent; 2) as representations of the same object $\mathcal{V}$ in the canonical space, $\mathcal{Q}_{v,1}$ and $\mathcal{Q}_{v,2}$ should be invariant to any pose transformations, and thus be consistent with each other, as should $\mathcal{Q}_{o,1}$ and $\mathcal{Q}_{o,2}$. Therefore, with two input transformations, the inter-consistency loss $\mathcal{L}_{inter}$ could be formulated as follows:

where $\lambda_{1}$ and $\lambda_{2}$ are balancing parameters, and

Chamfer distance $\mathcal{D}_{cham}$ is used to restrain the distance between the two complete point sets $\mathcal{Q}_{v,1}$ and $\mathcal{Q}_{v,2}$, while for the partial $\mathcal{Q}_{o,1}$ and $\mathcal{Q}_{o,2}$, we use a stricter metric of L2 distance $\mathcal{D}_{L2}$ for point-to-point constraints, since their points should be ordered to correspond with those of $\mathcal{P}_{o}$.
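The difference between the two metrics can be made concrete with a short sketch. The exact definitions (squared or unsquared distances, normalization) are not given in this section, so the forms below are assumptions: a symmetric Chamfer distance on squared nearest-neighbour distances and a mean point-to-point Euclidean distance.

```python
import numpy as np

def chamfer_distance(X, Y):
    """Symmetric Chamfer distance between point sets X (n, 3) and Y (m, 3).

    Order-invariant: each point is matched to its nearest neighbour in the
    other set, using squared Euclidean distances (assumed form).
    """
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)   # (n, m)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def l2_distance(X, Y):
    """Mean Euclidean distance between points with the same index (assumed form)."""
    return np.linalg.norm(X - Y, axis=1).mean()

X = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
Y = X[::-1]                      # same points, reversed order
d_cham = chamfer_distance(X, Y)  # ignores the ordering
d_l2 = l2_distance(X, Y)         # penalises the ordering mismatch
```

The reversed copy is identical as a set, so the Chamfer distance vanishes while the point-to-point distance does not; this is why the latter is the stricter constraint for ordered correspondences.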

Intra-Consistency Terms

For an observation $(\mathcal{I}_{o},\mathcal{P}_{o})$, DPDN learns deep correspondence between $\mathcal{P}_{o}=\{\bm{p}_{o}^{(j)}\}_{j=1}^{N}$ and $\mathcal{Q}_{o}=\{\bm{q}_{o}^{(j)}\}_{j=1}^{N}$ to predict their relative pose and size $(\bm{R},\bm{t},\bm{s})$; ideally, for $\forall j=1,\cdots,N$, $\bm{q}_{o}^{(j)}=\frac{1}{||\bm{s}||_{2}}\bm{R}^{T}(\bm{p}_{o}^{(j)}-\bm{t})$. Accordingly, the predictions $\mathcal{Q}_{o,1}$ and $\mathcal{Q}_{o,2}$ in (11) and (12) should be restrained to be consistent with $\mathcal{Q}_{o,1}^{\prime}=\{\frac{1}{||\bm{s}_{1}||_{2}}\bm{R}_{1}^{T}(\bm{p}_{o}^{(j)}-\bm{t}_{1})\}_{j=1}^{N}$ and $\mathcal{Q}_{o,2}^{\prime}=\{\frac{1}{||\bm{s}_{2}||_{2}}\bm{R}_{2}^{T}(\bm{p}_{o}^{(j)}-\bm{t}_{2})\}_{j=1}^{N}$, respectively, which gives the formulations of two intra-consistency terms based on Smooth-L1 distance as follows:
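A minimal sketch of one intra-consistency term is given below. The exact loss is not reproduced in this section, so the Smooth-L1 threshold `beta`, the sum over the three coordinates and the mean over points are assumptions, as are the function names; the pose passed in is the one expressed relative to $\mathcal{P}_{o}$.

```python
import numpy as np

def smooth_l1(x, y, beta=1.0):
    """Smooth-L1 distance, summed over coordinates and averaged over points."""
    diff = np.abs(x - y)
    per_elem = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return per_elem.sum(axis=-1).mean()

def canonicalize(P_o, R, t, s):
    """Row-wise q' = (1 / ||s||_2) * R^T (p - t) for observed points P_o (N, 3)."""
    return (P_o - t) @ R / np.linalg.norm(s)

def intra_consistency(Q_o_pred, P_o, R, t, s):
    """Gap between predicted canonical points and those implied by the pose."""
    return smooth_l1(Q_o_pred, canonicalize(P_o, R, t, s))

rng = np.random.default_rng(0)
Q_o = rng.uniform(-0.5, 0.5, size=(64, 3))
a = 0.5
R = np.array([[np.cos(a), -np.sin(a), 0.0],
              [np.sin(a),  np.cos(a), 0.0],
              [0.0, 0.0, 1.0]])
t, s = np.array([0.0, 0.1, 0.8]), np.array([0.3, 0.2, 0.1])
P_o = np.linalg.norm(s) * Q_o @ R.T + t

loss_ok = intra_consistency(Q_o, P_o, R, t, s)           # pose agrees with Q_o
loss_bad = intra_consistency(Q_o, P_o, R, t + 0.05, s)   # translation is off
```

The term needs no annotation: it only asks that the regressed pose and the regressed correspondence explain the same observation.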

with $\mathcal{Q}_{1}=\{(q_{1}^{(j1)},q_{1}^{(j2)},q_{1}^{(j3)})\}_{j=1}^{N}$ and $\mathcal{Q}_{2}=\{(q_{2}^{(j1)},q_{2}^{(j2)},q_{2}^{(j3)})\}_{j=1}^{N}$.

Given a triplet of inputs $(\mathcal{I}_{o},\mathcal{P}_{o},\mathcal{Q}_{c})$ along with the annotated ground truths $(\hat{\bm{R}},\hat{\bm{t}},\hat{\bm{s}},\hat{\mathcal{Q}_{v}},\hat{\mathcal{Q}_{o}})$, we generate dual input triplets by applying two rigid transformations to $\mathcal{P}_{o}$, as done in Sec. 3.2, and use the following supervised objective on top of the parallel learning of (11) and (12):

We note that this supervision also implies inter-consistency between the parallel learning defined in (14), making DPDN more sensitive to pose changes.

Experiments

We train DPDN on the training sets of both the synthetic CAMERA25 and real-world REAL275 datasets, and conduct evaluation on the REAL275 test set. CAMERA25 is created by a context-aware mixed reality approach, which renders $1,085$ synthetic object CAD models of 6 categories onto real-world backgrounds, yielding a total of $300,000$ RGB-D images, with $25,000$ of them, covering $184$ objects, set aside for validation. REAL275 is a more challenging real-world dataset, which includes $4,300$ training images of $7$ scenes and $2,754$ testing ones of $6$ scenes. Both datasets share the same categories, yet exhibit a large domain gap.

Implementation Details

To obtain instance masks for both training and test sets of REAL275, we train a MaskRCNN with a backbone of ResNet101 on CAMERA25; for settings of available training mask labels, we use the same segmentation results as to make fair comparisons. We employ the shape priors released by .

Evaluation Metrics

Following prior work, we report mean Average Precision (mAP) of Intersection over Union (IoU) for object detection, and mAP of $n\degree m$ cm for 6D pose estimation. IoUx denotes precision of predictions with IoU over a threshold of $x\%$, and $n\degree m$ cm denotes precision of those with rotation error less than $n\degree$ and translation error less than $m$ cm.
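The per-instance test behind the $n\degree m$ cm metric can be sketched as below. This is a simplified illustration: it assumes translations are given in metres, uses the geodesic angle between rotation matrices, and omits both the special handling that benchmarks usually apply to symmetric categories and the aggregation into mAP. All function names are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error_cm(t_pred, t_gt):
    """Euclidean translation error in centimetres (inputs assumed in metres)."""
    return 100.0 * np.linalg.norm(t_pred - t_gt)

def within_threshold(R_pred, t_pred, R_gt, t_gt, n_deg, m_cm):
    """True if the prediction satisfies the n-degree, m-cm criterion."""
    return bool(rotation_error_deg(R_pred, R_gt) < n_deg
                and translation_error_cm(t_pred, t_gt) < m_cm)

a = np.radians(3.0)
R_gt = np.eye(3)
R_pred = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
t_gt = np.array([0.0, 0.0, 1.0])
t_pred = t_gt + np.array([0.015, 0.0, 0.0])   # 1.5 cm off

hit_5deg_2cm = within_threshold(R_pred, t_pred, R_gt, t_gt, 5.0, 2.0)
hit_5deg_1cm = within_threshold(R_pred, t_pred, R_gt, t_gt, 5.0, 1.0)
```

Both thresholds must hold at once, so a prediction with an accurate rotation still fails the criterion if its translation error exceeds $m$ cm.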

Comparisons with Existing Methods

We compare our method with the existing ones for category-level 6D object pose and size estimation under both unsupervised and supervised settings. Quantitative results are given in Table 1, where results under the supervised setting benefit significantly from the annotations of real-world data, compared to those under the unsupervised one; for example, on the metric of $5\degree 2$ cm, SPD improves from $11.4\%$ to $19.3\%$, while DualPoseNet improves from $15.9\%$ to $29.3\%$. Therefore, the exploration of UDA for the target task in this paper is of great practical significance, due to the difficulties in precisely annotating real-world object instances in 3D space.

Firstly, a basic version of DPDN is trained on the synthetic data and transferred to the real-world domain for evaluation; under this setting, our basic DPDN outperforms the existing methods on all the evaluation metrics, as shown in Table 1. To reduce the Sim2Real domain gap, we further include the unlabeled REAL275 training set via our self-supervised DPDN for UDA; results in Table 1 verify the effectiveness of our self-supervised DPDN (dubbed Self-DPDN in the table), which significantly improves the precision of the basic version, e.g., a performance gain of $8.1\%$ on $5\degree 2$ cm, from $29.7\%$ to $37.8\%$.

UDA-COPE is the first work to introduce the unsupervised setting, which trains a deep model with a teacher-student scheme to yield pseudo labels for real-world data; in the process of training, pose annotations of real-world data are not employed, yet mask labels are used for learning instance segmentation. To fairly compare with UDA-COPE, we evaluate DPDN under the same setting; results in Table 1 also show the superiority of our self-supervised DPDN over UDA-COPE, especially on the high-precision metrics, e.g., an improvement of $10.2\%$ on $5\degree 5$ cm. The reason for the large improvement is that UDA-COPE heavily relies on the quality of pseudo labels, while our self-supervised objective guides the optimization toward satisfying inter-/intra-consistency, making the learning fit the characteristics of REAL275 and reducing the negative effect of the synthetic domain.

Supervised Setting

We also compare our DPDN with the existing methods, including those based on direct regression and those based on dense correspondence learning, under the supervised setting. As shown in Table 1, DPDN outperforms the existing methods on all the evaluation metrics, e.g., reaching precisions of $76.0\%$ on IoU75 and $78.4\%$ on $10\degree 5$ cm. Compared with the representative SS-ConvNet, which directly regresses object poses and sizes from observations, our DPDN takes advantage of categorical shape priors, and achieves more precise results by regressing from deep correspondence; compared with SGPA, the recent state-of-the-art method based on correspondence learning (cf. Eq. (2)), our DPDN shares the same feature extractor, yet benefits from the direct objectives for pose and size estimation, rather than the surrogate ones, e.g., for regression of $\bm{D}$ and $\bm{A}$ in Eq. (2); DPDN thus achieves more reliable predictions.

Ablation Studies and Analyses

In this section, we conduct experiments to evaluate the efficacy of both the designs in our DPDN and the self-supervision upon DPDN.

We verify the efficacy of the designs in DPDN under the supervised setting, with the results of different variants of our DPDN shown in Table 2. Firstly, we confirm the effectiveness of our Deep Prior Deformer with categorical shape priors; by removing the Deep Prior Deformer, the precision of DPDN with parallel learning on $5\degree 2$ cm drops from $46.0\%$ to $35.2\%$, indicating that learning from deep correspondence by deforming priors in the feature space indeed makes the task easier than learning directly from object observations.

Secondly, we show the advantages of using true objectives for direct estimates of object poses and sizes, over the surrogate ones for the learning of the canonical point set $\mathcal{Q}_{o}$ to pair with the observed $\mathcal{P}_{o}$. Specifically, we remove our Pose and Size Estimator, and make predictions by solving the Umeyama algorithm to align $\mathcal{P}_{o}$ and $\mathcal{Q}_{o}$; precisions shown in the table decline sharply on all the evaluation metrics, especially on IoUx. We found that the results on IoUx are also much lower than those of methods based on dense correspondence learning, while results on $n\degree m$ cm are comparable; the reason is that we regress the absolute coordinates of $\mathcal{Q}_{c}$, rather than the deviations $\bm{D}$ in Eq. (2), which may introduce more outliers that affect the object size estimation.

Thirdly, we confirm the effectiveness of the parallel supervisions in (16), e.g., inputting $\mathcal{P}_{o,1}$ and $\mathcal{P}_{o,2}$ of the same instance with different poses. As shown in Table 2, results of DPDN with parallel learning are improved (with or without Deep Prior Deformer), since the inter-consistency between dual predictions is implied in the parallel supervisions, making the learning aware of pose changes.

Effects of the Self-Supervision upon DPDN

We have shown the superiority of our novel self-supervised DPDN under the unsupervised setting in Sec. 4.1; here we evaluate the effectiveness of each consistency term in the self-supervised objective, which is confirmed by the results shown in Table 3. Taking results on $5\degree 2$ cm as examples, DPDN with the inter-consistency term $\mathcal{L}_{inter}$ improves the results of the baseline from $29.7\%$ to $36.9\%$, and DPDN with the intra-consistency ones $\mathcal{L}_{intra,1}$ and $\mathcal{L}_{intra,2}$ improves them to $35.8\%$, while their combination further improves the results, revealing their strengths in reducing the domain gap. We also show the influence of the amount of unlabeled real-world images on the precision of predictions in Fig. 3, where precisions improve as the ratio of training data increases.

Visualization

We visualize in Fig. 4 the qualitative results of our proposed DPDN under different settings on the REAL275 test set. As shown in the figure, our self-supervised DPDN, without annotations of real-world data, in general achieves results comparable with the fully-supervised version, although there still exist some difficult examples, e.g., cameras in Fig. 4 (a) and (b), due to the inaccurate masks from the MaskRCNN trained on synthetic data. Under the unsupervised setting, our self-supervised DPDN also outperforms the basic version trained with only CAMERA25, by including unlabeled real-world data with self-supervision; for example, more precise poses of laptops are obtained in Fig. 4 (c) and (d).

Acknowledgements This work is supported in part by Guangdong R&D key project of China (No.: 2019B010155001), and the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (No.: 2017ZT07X183).

References