FS-Net: Fast Shape-based Network for Category-Level 6D Object Pose Estimation with Decoupled Rotation Mechanism

Wei Chen, Xi Jia, Hyung Jin Chang, Jinming Duan, Linlin Shen, Ales Leonardis

Introduction

Estimating 6D object pose plays an essential role in many computer vision tasks such as augmented reality, virtual reality, and smart robotic arms. For instance-level 6D pose estimation, in which the training set and test set contain the same objects, huge progress has been made in recent years.

However, category-level 6D pose estimation remains challenging because object shape and color vary within the same category. Existing methods address this problem by mapping the different objects in the same category into a uniform model via RGB features or RGB-D fusion features. For example, Wang et al. trained a modified Mask R-CNN to predict the normalized object coordinate space (NOCS) map of different objects based on RGB features, and then computed the pose from the observed depth and the NOCS map with the Umeyama algorithm. Chen et al. proposed to learn a canonical shape space (CASS) to tackle intra-class shape variations with RGB-D fusion features. Tian et al. trained a network to predict the NOCS map of different objects, using a uniform shape prior learned from a shape collection together with RGB-D fusion features.

Although these methods achieved state-of-the-art performance, two issues remain. Firstly, the benefit of using RGB features or RGB-D fusion features for category-level pose estimation is still questionable. Vlach et al. showed that people focus more on shape than color when categorizing objects, as different objects in the same category have very different colors but stable shapes (shown in Figure 3). The use of RGB features for category-level pose estimation may therefore lead to low performance due to the large color variation in the test scene. To alleviate the effect of color variation, we use the RGB feature only for 2D detection, and use the shape feature learned from the point cloud extracted from the depth image for category-level pose estimation.

Secondly, learning a representative uniform shape requires a large amount of training data; therefore, the performance of these methods is not guaranteed with limited training examples. To overcome this issue, we propose a 3D graph convolution (3DGC) autoencoder that effectively learns the category-level pose feature by reconstructing the observed points of different objects instead of mapping them to a uniform shape. We further propose an online box-cage based 3D data augmentation mechanism to reduce the dependency on labeled data.

In this paper, the newly proposed FS-Net consists of three parts: 2D detection, 3D segmentation & rotation estimation, and translation & size estimation. In the 2D detection part, we use YOLOv3 to detect the object bounding box, from which coarse object points are obtained. Then, in the 3D segmentation & rotation estimation part, we design a 3DGC autoencoder to perform segmentation and observed points reconstruction jointly. The autoencoder encodes orientation information in the latent feature. We then propose a decoupled rotation mechanism that uses two decoders to decode the category-level rotation information. For translation and size estimation, since both are related to point coordinates, we design a coordinate residual estimation network based on PointNet to estimate the translation residual and size residuals. To further increase the generalization ability of FS-Net, we use the proposed online 3D deformation for data augmentation. To summarize, the main contributions of this paper are as follows:

We propose a fast shape-based network to estimate category-level 6D object size and pose. Due to the efficient category-level pose feature extraction, the framework runs at 20 FPS on a GTX 1080 Ti GPU.

We propose a 3DGC autoencoder to reconstruct the observed points for latent orientation feature learning. We then design a decoupled rotation mechanism to fully decode the orientation information. This decoupled mechanism allows us to naturally handle circularly symmetric objects (see Section 3.3).

Based on the shape similarity of intra-class objects, we propose a novel box-cage based 3D deformation mechanism to augment the training data. With this mechanism, the pose accuracy of FS-Net is improved by $7.7\%$.

Related Works

In instance-level pose estimation, a known 3D object model is usually available for training and testing. Based on the 3D model, instance-level pose estimation can be roughly divided into three types: template matching based, correspondences-based, and voting-based methods. Template matching methods align the template to the observed image or depth map via hand-crafted or deep learning feature descriptors. As they need the 3D object model to generate the template pool, their applications in category-level 6D pose estimation are limited. Correspondences-based methods train their model to establish 2D-3D or 3D-3D correspondences. They then solve the perspective-n-point and SVD problems with the 2D-3D and 3D-3D correspondences, respectively. Some methods also use these correspondences to generate voting candidates, and then use the RANSAC algorithm to select the best candidate. However, the generation of canonical 3D keypoints is based on the known 3D object model, which is not available when predicting the category-level pose.

2.2 Category-Level Pose Estimation

Compared to instance-level pose estimation, the major challenge of category-level pose estimation is the intra-class object variation, including shape and color variation. To handle the object variation problem, Wang et al. proposed to map the different objects in the same category to a NOCS map. They then used semantic segmentation to obtain the observed point cloud with known camera parameters. The 6D pose and size are calculated by the Umeyama algorithm from the NOCS map and the observed points. Shape-Prior adopted a similar method, but used both extra shape prior knowledge and the dense-fusion feature instead of the RGB feature. CASS estimated the 6D pose by learning a canonical shape space with the dense-fusion feature. Since the RGB feature is sensitive to color variation, the performance of these methods in category-level pose estimation is limited. In contrast, our method is based on shape features, which are robust for this task.

2.3 3D Data Augmentation

In 3D object detection tasks, online data augmentation techniques such as translation, random flipping, shifting, scaling, and rotation are applied to the original point clouds for training data augmentation. However, these operations cannot change the shape of the object, so simply applying them to point clouds cannot handle the shape variation problem in the 3D task. To address this, a part-aware augmentation was proposed that operates on the semantic parts of the 3D object with five manipulations: dropout, swap, mix, sparing, and noise injection. However, how to decide the semantic parts is ambiguous. In contrast, we propose a box-cage based 3D data augmentation mechanism which can generate various shape variants (shown in Figure 5) and avoids the semantic part decision procedure.

Proposed Method

In this section, we describe the detailed architecture of FS-Net shown in Figure 2. Firstly, we use YOLOv3 to detect the object location from the RGB input. Secondly, we use a 3DGC autoencoder to perform 3D segmentation and observed points reconstruction; through this process, the latent feature learns orientation information. We then propose a novel decoupled rotation mechanism for decoding the orientation information. Thirdly, we use PointNet to estimate the translation and object size. Finally, to increase the generalization ability of FS-Net, we propose the box-cage based 3D deformation mechanism.

Following previous work, we train YOLOv3 to quickly detect the object bounding box in RGB images and output class (category) labels. We then adopt a 3D sphere to quickly locate the point cloud of the target object. With these techniques, the 2D detection part provides a compact 3D learning space for the following tasks. Unlike other category-level 6D object pose estimation methods that need semantic segmentation masks, we only need object bounding boxes. Since object detection is faster than semantic segmentation, the detection stage of our method is faster than that of previous methods.

3.2 Shape-Based Network

The output points of object detection contain both object and background points. To access the points that belong to the target object and calculate the rotation of the object, we need a network that performs two tasks: 3D segmentation and rotation estimation.

Although there are many network architectures that directly process point clouds, most of them operate on point coordinates, which means that they are sensitive to point cloud shift and size variation. This decreases the pose estimation accuracy.

To tackle point cloud shift, Frustum-PointNet and G2L-Net employed the estimated translation to align the segmented point clouds to a local coordinate space. However, these methods cannot handle the intra-class size variation.

To solve the point clouds shift and size variation problem, in this paper, we propose a 3DGC autoencoder to extract the point cloud shape feature for segmentation and rotation estimation. 3DGC is designed for point cloud classification and object part segmentation; our work shows that 3DGC can also be used for category-level 6D pose estimation task.

A 3DGC kernel consists of $m$ unit vectors. The $m$ kernel vectors are applied to the $n$ vectors formed by the center point and its $n$-nearest neighbors. The convolution value is then the sum of the cosine similarities between the kernel vectors and the $n$-nearest vectors. In a 2D convolution network, the trained network learns a weighted kernel that has a higher response to a matched RGB value, while the 3DGC network learns the orientations of the $m$ vectors in the kernel. The weighted 3DGC kernel has a higher response to a matched 3D pattern, which is defined by the center point and its $n$-nearest neighbors. For more details, please refer to the original 3DGC paper.
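As a rough illustration of why such a kernel responds to local shape rather than to absolute coordinates, the sketch below computes a simplified direction-based response in NumPy. It is not the 3DGC implementation: the function names, the choice of keeping the best-matching neighbor for each kernel vector, and the omission of learned feature weights are all assumptions made for brevity.

```python
import numpy as np

def direction_vectors(points, center_idx, n):
    """Unit vectors from the center point to its n nearest neighbours."""
    center = points[center_idx]
    dists = np.linalg.norm(points - center, axis=1)
    order = np.argsort(dists)
    neighbours = points[order[1:n + 1]]  # index 0 is the center itself
    vecs = neighbours - center
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def kernel_response(points, center_idx, kernel, n=8):
    """Sum over kernel vectors of the best cosine similarity to any neighbour direction.

    `kernel` is an (m, 3) array of unit vectors (hypothetical layout).
    """
    dirs = direction_vectors(points, center_idx, n)  # (n, 3)
    cos = dirs @ kernel.T                            # (n, m) cosine similarities
    return float(cos.max(axis=0).sum())
```

Because only normalized direction vectors enter the computation, translating the point cloud or scaling it uniformly leaves this response unchanged, which is the property the shape-based network relies on.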

3.2.2 Rotation-Aware Autoencoder

Based on the 3DGC, we design an autoencoder for the estimation of category-level object rotation. To extract the latent rotation feature, we train the autoencoder to reconstruct the observed points transformed from the observed depth map of the object. There are several advantages to this strategy: 1) the reconstruction of observed points is view-based and symmetry invariant, 2) the reconstruction of observed points is easier than that of a complete object model (shown in Table 2), and 3) a more representative orientation feature can be learned (shown in Table 1).

Previous work also reconstructed the input images to observed views. However, the input and output of those models are 2D images, which differ from our 3D point cloud input and output. Furthermore, our network architecture is also different.

We use the Chamfer Distance to train the autoencoder. The reconstruction loss function $\mathcal{L}_{rec}$ is defined as

$$\mathcal{L}_{rec}=\sum_{x_{i}\in M_{c}}\min_{\hat{x}_{i}\in\hat{M}_{c}}\|x_{i}-\hat{x}_{i}\|_{2}^{2}+\sum_{\hat{x}_{i}\in\hat{M}_{c}}\min_{x_{i}\in M_{c}}\|x_{i}-\hat{x}_{i}\|_{2}^{2},$$

where $M_{c}$ and $\hat{M}_{c}$ denote the ground truth point cloud and the reconstructed point cloud, respectively, and $x_{i}$ and $\hat{x}_{i}$ are the points in $M_{c}$ and $\hat{M}_{c}$. With the help of the 3D segmentation mask, we only use the features extracted from the observed object points for reconstruction.
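The Chamfer Distance between the ground truth and reconstructed point sets can be sketched in a few lines of NumPy. This is a minimal brute-force version, assuming the common symmetric form with summed squared nearest-neighbor distances; the function name is hypothetical, and practical implementations often average instead of summing and run on the GPU.

```python
import numpy as np

def chamfer_distance(gt_points, rec_points):
    """Symmetric Chamfer Distance between point sets of shape (N, 3) and (M, 3)."""
    diff = gt_points[:, None, :] - rec_points[None, :, :]  # (N, M, 3)
    sq_dist = np.sum(diff ** 2, axis=-1)                   # (N, M) squared distances
    gt_to_rec = sq_dist.min(axis=1).sum()   # each ground truth point to its nearest reconstruction
    rec_to_gt = sq_dist.min(axis=0).sum()   # each reconstructed point to its nearest ground truth
    return float(gt_to_rec + rec_to_gt)
```

The pairwise distance matrix costs O(NM) memory, which is acceptable for the few thousand points of a single segmented object.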

After the network converges, the encoder has learned the rotation-aware latent feature. Since the 3DGC is scale and shift invariant, the observed points reconstruction forces the autoencoder to learn a scale and shift invariant orientation feature under the corresponding rotation. In the next subsection, we describe how we decode rotation information from this latent feature.

3.3 Decoupled Rotation Estimation

Given the latent feature which contains rotation information, our task is to decode the category-level rotation feature. To achieve this, we utilize two decoders to extract the rotation information in a decoupled fashion. The two decoders decode the rotation information into two perpendicular vectors under corresponding rotation. These two vectors can represent rotation information completely (shown in Figure 4).

Since the two vectors are orthogonal, the decoded rotation information related to them is independent, and we can use one of them to recover part of the rotation information of the object. For example, in Figure 8, we use the green vector to recover the pose. We can see that the green boxes and blue boxes are well aligned along the recovered axis.
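Because the two decoded vectors are perpendicular axes of the rotated object frame, a full rotation matrix can be rebuilt from them. The sketch below is one straightforward way to do this, assuming (hypothetically) that the first vector corresponds to the first canonical axis and the second vector to the second canonical axis; the predicted vectors are re-orthogonalized because network outputs are rarely exactly perpendicular.

```python
import numpy as np

def rotation_from_two_vectors(v1, v2):
    """Build a right-handed rotation matrix whose first two columns follow v1 and v2."""
    a = v1 / np.linalg.norm(v1)
    b = v2 - np.dot(v2, a) * a      # remove the component of v2 along a
    b = b / np.linalg.norm(b)
    c = np.cross(a, b)              # third axis completes the right-handed frame
    return np.stack([a, b, c], axis=1)
```

For a circularly symmetric object, only `v1` carries information, so any vector perpendicular to it could be substituted for `v2` without changing the rendered pose.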

Each decoder only needs to extract the orientation information along its corresponding vector, which is easier than estimating the complete rotation. The loss function, which is based on cosine similarity, is defined as

where $\hat{\textbf{v}}_{1}$ and $\hat{\textbf{v}}_{2}$ are the predicted vectors. $\textbf{v}_{1}$ and $\textbf{v}_{2}$ are the ground truth, and $\lambda_{r}$ is the balance parameter.

The balance parameter $\lambda_{r}$ makes it easy for our network to handle circularly symmetric objects such as bottles; for such objects, the red vector is not necessary (shown in Figure 4). Without loss of generality, we assume that the green vector is along the symmetry axis; we then set $\lambda_{r}$ to zero to handle circularly symmetric objects. For other types of symmetric objects, we can employ a rotation mapping function to map the relevant rotation matrices to a unique one.
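The role of $\lambda_{r}$ can be illustrated with a small NumPy sketch. The exact form of the loss is not reproduced here; the version below assumes each term is one minus the cosine similarity between the predicted and ground truth vector, and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two 3D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def rotation_loss(v1_pred, v1_gt, v2_pred, v2_gt, lambda_r=1.0):
    """Assumed form: (1 - cos) for the first vector plus lambda_r * (1 - cos) for the second."""
    loss_v1 = 1.0 - cosine_similarity(v1_pred, v1_gt)
    loss_v2 = 1.0 - cosine_similarity(v2_pred, v2_gt)
    return loss_v1 + lambda_r * loss_v2
```

With `lambda_r=0.0`, the second vector no longer influences the loss, so the network is not penalized for the arbitrary rotation of a bottle-like object about its symmetry axis.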

Please note that our decoupled rotation is different from the previously proposed rotation representation that takes the first two columns of a rotation matrix as the new representation, which has no geometric meaning. In contrast, our representation is defined based on the shape of the target object, and it can avoid the discontinuity issue discussed in that work.

3.4 Residual Prediction Network

As both translation and object size are related to point coordinates, we train a tiny PointNet that takes the segmented point cloud as input. More concretely, the PointNet performs two related tasks: 1) estimating the residual between the translation ground truth and the mean value of the segmented point cloud; 2) estimating the residual between the object size and the mean category size.

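For the size residual, we pre-calculate the mean size $[\overline{x},\overline{y},\overline{z}]^{T}$ of each category by

$$[\overline{x},\overline{y},\overline{z}]^{T}=\frac{1}{N}\sum_{i=1}^{N}[x_{i},y_{i},z_{i}]^{T},$$

where $N$ is the number of objects in that category and $[x_{i},y_{i},z_{i}]^{T}$ is the size of the $i$-th object. Then, for object $o$ in that category, the ground truth $[\delta_{x}^{o},\delta_{y}^{o},\delta_{z}^{o}]^{T}$ of the size residual estimation is calculated as

$$[\delta_{x}^{o},\delta_{y}^{o},\delta_{z}^{o}]^{T}=[x_{o},y_{o},z_{o}]^{T}-[\overline{x},\overline{y},\overline{z}]^{T}.$$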

We use the mean square error (MSE) loss to supervise both the translation and size residuals. The total loss function $\mathcal{L}_{res}$ is defined as:

$$\mathcal{L}_{res}=\mathcal{L}_{tra}+\mathcal{L}_{size},$$

where $\mathcal{L}_{tra}$ and $\mathcal{L}_{size}$ are sub-loss for translation residual and size residual, respectively.
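The regression targets described in this subsection can be made concrete with a short NumPy sketch. All function names are hypothetical, and the inputs are assumed to be plain arrays of per-object sizes and segmented points; the network that predicts the residuals is not shown.

```python
import numpy as np

def mean_category_size(sizes):
    """Mean [x, y, z] size over the N objects of one category; sizes has shape (N, 3)."""
    return np.asarray(sizes, dtype=float).mean(axis=0)

def size_residual(object_size, mean_size):
    """Ground truth size residual: object size minus the mean category size."""
    return np.asarray(object_size, dtype=float) - mean_size

def translation_residual(translation_gt, segmented_points):
    """Ground truth translation residual relative to the centroid of the segmented points."""
    centroid = np.asarray(segmented_points, dtype=float).mean(axis=0)
    return np.asarray(translation_gt, dtype=float) - centroid

def recover_translation(segmented_points, predicted_residual):
    """At inference time, the translation is the centroid plus the predicted residual."""
    return np.asarray(segmented_points, dtype=float).mean(axis=0) + predicted_residual
```

Predicting residuals keeps the regression targets small and centered, since the centroid and the mean category size already provide coarse estimates.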

3.5 3D Deformation Mechanism

One major problem in category-level 6D pose estimation is the intra-class shape variation. Existing methods employ two large synthetic datasets, i.e. CAMERA and a 3D model dataset, to learn this variation. However, this strategy not only needs extra hardware resources to store these large synthetic datasets but also increases the (pre-)training time.

To alleviate the shape variation issue, based on the fact that the shapes of most objects in the same category are similar (shown in Figure 3), we propose an online box-cage based 3D deformation mechanism for training data augmentation. We pre-define a box-cage for each rigid object (shown in Figure 5). Each point is assigned to its nearest surface of the cage; when we deform the surface, the corresponding points move as well.

Although the box-cage could be designed in a more refined way, in experiments we find that with a simple box-cage, i.e. the 3D bounding box of the object, the generalization ability of the proposed method is considerably improved (Table 1). Unlike previous cage-based deformation work, we do not need an extra training process to obtain the box-cage of the object, and we do not need a target shape to learn the deformation operation either. Our mechanism is entirely online, which saves training time and storage space.

To make the deformation operation easier, we first transform the points to the canonical coordinate system and then perform the 3D deformation. Finally, we transform them back to the global scene:
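The canonical-space round trip can be sketched as follows. This is a minimal illustration rather than the actual augmentation code: it assumes the pose convention $p_{scene}=Rp_{canonical}+T$, stores points as rows of an (N, 3) array, and uses a simple per-axis stretch of the box-cage as a stand-in for the deformation; the function name and the `scale_xyz` parameter are hypothetical.

```python
import numpy as np

def deform_in_canonical_space(points, R, T, scale_xyz):
    """Move points to the canonical frame, stretch the box-cage per axis, and move them back.

    points: (N, 3) observed points in the scene; R: (3, 3) rotation; T: (3,) translation.
    """
    canonical = (points - T) @ R                              # row-wise R^T (p - T)
    deformed = canonical * np.asarray(scale_xyz, dtype=float)  # axis-aligned stretch
    return deformed @ R.T + T                                  # row-wise R p' + T
```

Since the deformation is applied in the canonical frame, the ground truth rotation and translation remain valid for the deformed points; only the size label needs to be scaled accordingly.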

Experiments

NOCS-REAL is the first real-world dataset for category-level 6D object pose estimation. The training set has 4300 real images of 7 scenes with 6 categories. For each category, there are 3 unique instances. In the testing set, there are 2750 real images spread over 6 scenes of the same 6 categories as the training set. Each test scene contains about 5 objects, which makes the dataset cluttered and challenging.

LINEMOD is a widely used instance-level 6D object pose estimation dataset which consists of 13 different objects with significant shape variation.

We use automatic point-wise labeling techniques to obtain the label of each point in both training sets.

4.2 Implementation Details

We use PyTorch to implement our pipeline. All experiments are conducted on a PC with an i7-4930K 3.4GHz CPU and a GTX 1080Ti GPU.

First, to locate the object in RGB images, we fine-tune a YOLOv3 model pre-trained on the COCO dataset with the training dataset. We then jointly train the 3DGC autoencoder and the residual estimation network. The total loss function is defined as

where the $\lambda$s are the balance parameters. We empirically set them to 0.001, 1, 0.001, and 1 to keep the different loss values at the same magnitude. We use cross entropy for the 3D segmentation loss function $\mathcal{L}_{seg}$.

We adopt Adam to optimize the FS-Net. The initial learning rate is 0.001, and we halve it every 10 epochs. The maximum epoch is 50.
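The step schedule described above is simple enough to write out directly. The sketch below assumes zero-indexed epochs, so the rate is halved at epochs 10, 20, 30, and 40; the function name and parameter names are hypothetical.

```python
def learning_rate(epoch, initial_lr=0.001, step=10, factor=0.5):
    """Learning rate at a given zero-indexed epoch: halved every `step` epochs."""
    return initial_lr * factor ** (epoch // step)

# Learning rate for each of the 50 training epochs.
schedule = [learning_rate(epoch) for epoch in range(50)]
```

In PyTorch, the same behavior is typically obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)`; whether that scheduler was used here is not stated.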

4.3 Evaluation Metrics

For category-level pose estimation, we adopt the same metrics as previous work:

$IoU_{X}$ is the Intersection-over-Union (IoU) accuracy for 3D object detection under different overlap thresholds. A prediction is accepted if its overlap ratio is larger than the threshold $X$.

$n^{\circ}$ $m$ cm represents the pose estimation error of rotation and translation. A prediction is accepted if its rotation error is less than $n^{\circ}$ and its translation error is less than $m$ cm.
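The $n^{\circ}$ $m$ cm criterion can be illustrated with a small NumPy sketch. It assumes that the rotation error is the geodesic angle between the predicted and ground truth rotation matrices and that translations are given in meters; it ignores the special handling that symmetric categories need, and all function names are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two rotation matrices."""
    cos_theta = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

def translation_error_cm(t_pred, t_gt):
    """Euclidean translation error in centimeters, assuming inputs are in meters."""
    return float(100.0 * np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)))

def pose_is_accepted(R_pred, t_pred, R_gt, t_gt, n_deg, m_cm):
    """True if the rotation error is below n_deg and the translation error is below m_cm."""
    return (rotation_error_deg(R_pred, R_gt) < n_deg
            and translation_error_cm(t_pred, t_gt) < m_cm)
```

For example, a prediction that is off by 8 degrees and 3 cm would count as correct under $10^{\circ}10$ cm but not under $5^{\circ}5$ cm.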

For instance-level pose estimation, we compare the performance of FS-Net with other state-of-the-art instance-level methods using the ADD-(S) metric.

4.4 Ablation Studies

We use the G2L-Net as the baseline method which extracted the latent feature for rotation estimation via point-wise orientated vector regression, and the ground truth of rotation is the eight corners of 3D bounding box with corresponding rotation. The loss function for rotation estimation is the mean square error between predicted 3D coordinates and ground truth. Compared to baseline, our proposed work has three novelties: a) view-based 3DGC autoencoder for observed point cloud reconstruction; b) rotation decoupled mechanism; c) online 3D deformation mechanism.

In Table 1, we report the experimental results of the three novelties on the NOCS-REAL dataset. Comparing Med3 and Med5, we find that reconstruction of the observed point cloud learns a better pose feature. The performance of Med2 (Med1, G2L) and Med5 (Med3, G2L+DR) shows that the proposed decoupled rotation mechanism can effectively extract the rotation information. The results of Med4 and Med5 demonstrate the effectiveness of the 3D deformation mechanism, which increases the pose accuracy by $7.7\%$ in terms of the $10^{\circ}10$ cm metric. We also compare the different reconstruction choices: the reconstruction of the observed points and of the complete object model with the corresponding rotation. From the last row of Table 1, we can see that the observed points reconstruction learns a better rotation feature. Overall, Table 1 shows that the proposed novelties improve the accuracy significantly.

4.5 Generalization Performance

The NOCS-REAL dataset provides 4.3k real images that cover various poses of different objects in different categories for training. This means that the category-level pose information in the training set is rich. Thanks to its effective pose feature extraction, FS-Net achieves state-of-the-art performance even with only part of the real-world training data. We randomly choose different percentages of the training set to train FS-Net and test it on the whole testing set. Figure 6 shows that: 1) FS-Net is robust to the size of the training dataset and has good category-level feature extraction ability; even with $20\%$ of the training dataset, FS-Net can still achieve state-of-the-art performance; 2) the 3D deformation mechanism significantly improves the robustness and performance of FS-Net.

4.6 Evaluation of Reconstruction

Point cloud reconstruction has a close relationship with pose estimation performance. We compute the Chamfer Distance between the reconstructed point cloud and the ground truth point cloud and compare it with the reconstruction types used by other methods. From Table 2, we can see that the average reconstruction error of our method is 0.86, which is $72.9\%$ and $18.9\%$ lower than that of Shape-Prior and CASS, respectively. This shows that our method achieves better pose estimation results via a simpler reconstruction task, i.e. observed points reconstruction rather than complete object model reconstruction.

4.7 Comparison with State-of-the-Arts

We compare FS-Net with NOCS, CASS, Shape-Prior, and 6D-PACK on the NOCS-REAL dataset in Table 4. We can see that our proposed method outperforms the other state-of-the-art methods in both accuracy and speed. Specifically, on the 3D detection metric $IOU_{50}$, FS-Net outperforms the previous best method, NOCS, by $11.7\%$, and its running speed is 4 times faster. In terms of the 6D pose metrics $5^{\circ}5$ cm and $10^{\circ}10$ cm, FS-Net outperforms CASS by margins of $4.7\%$ and $6.3\%$, respectively. FS-Net even outperforms 6D-PACK, which is a 6D tracker and needs an initial 6D pose and object size to start, under the 3D detection metric $IOU_{50}$. See Figure 7 for more quantitative details. The qualitative results are shown in Figure 8. Please note that we only use real-world data (NOCS-REAL) to train our pose estimation part, whereas other methods use both the synthetic dataset (CAMERA) and real-world data for training. The number of training examples in CAMERA is 275K, which is more than 60 times that of NOCS-REAL (4.3K). This shows that FS-Net can efficiently extract the category-level pose feature with less data.

4.7.2 Instance-Level Pose Estimation

We compare the instance-level pose estimation results of FS-Net on the LINEMOD dataset with other state-of-the-art instance-level methods. From Table 3, we can see that FS-Net achieves comparable results in both accuracy and speed. This shows that our method can effectively extract both category-level and instance-level pose features.

4.8 Running Time

Given a 640 $\times$ 480 RGB-D image, our method runs at 20 FPS with an Intel i7-4930K CPU and a 1080Ti GPU, which is 2 times faster than the previous fastest method, 6D-PACK. Specifically, the 2D detection takes about 10 ms, and the pose and size estimation takes about 40 ms.

Conclusion

In this paper, we propose a fast category-level pose estimation method that runs at 20 FPS, which is fast enough for real-time applications. The proposed method first extracts the latent feature by reconstructing the observed points with a shape-based 3DGC autoencoder. The category-level orientation feature is then decoded by the decoupled rotation mechanism. Finally, we use a residual estimation network to estimate the translation and object size. In addition, to increase the generalization ability of FS-Net and save hardware resources, we design an online 3D deformation mechanism for training set augmentation. Extensive experimental results demonstrate that FS-Net is less data-dependent and can achieve state-of-the-art performance on category- and instance-level pose estimation in both accuracy and speed. Please note that our 3D deformation mechanism and decoupled rotation scheme are model-free, and can be applied to other pose estimation methods to boost their performance.

Although FS-Net achieves state-of-the-art performance, it relies on a robust 2D detector to detect the region of interest. In future work, we plan to adopt 3D object detection techniques to directly detect the objects from point clouds.

Appendix

This section provides more details about our FS-Net. Section 6.1 describes the details of the 3D deformation mechanism and deformed examples. Section 6.2 provides more quantitative results of the FS-Net on NOCS-REAL dataset and comparison with state-of-the-art method. Section 6.3 demonstrates that the proposed vectors-based rotation representation can be easily extended to handle other symmetric types.

As stated in Section 3.5 of the paper, the 3D deformation mechanism is box-cage based and the deformations are applied in a canonical space. In the canonical coordinate system, every box edge is parallel to an axis (shown in Figure 9). This property makes the 3D deformation calculation easier. For example, when we need to elongate/shrink the mug along the $Y$ axis by $n$ times, we enlarge the distance between surface $S_{1,2,3,4}$ and surface $S_{5,6,7,8}$ by $n$ times. Since these two surfaces are parallel to the $XZ$-plane, the $x$ and $z$ coordinates are unchanged, and the point coordinates change from $[\textbf{x},\textbf{y},\textbf{z}]$ to $[\textbf{x},n\textbf{y},\textbf{z}]$. The calculations are similar when we need to elongate/shrink the mug along the $X$ or $Z$ axis by $n$ times:

$$[\textbf{x},\textbf{y},\textbf{z}]\rightarrow[n\textbf{x},\textbf{y},\textbf{z}]\quad\text{or}\quad[\textbf{x},\textbf{y},\textbf{z}]\rightarrow[\textbf{x},\textbf{y},n\textbf{z}].$$

Further, if the object is a mug or bowl, we may need to change the top or bottom size to generate new shapes (shown in Figure 10). In this case, assuming we enlarge the bottom along the $X$ axis by $n$ times, then from bottom to top, the coordinates are changed as:

where $l$ is the distance from a point to the top surface, i.e. $S_{1,2,3,4}$ in Figure 9, and $L$ is the height of the object. Please note that all edges are kept straight during deformation.
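Both deformations can be sketched in NumPy. The taper below is an assumed form: because the cage edges stay straight, the stretch factor is taken to vary linearly from 1 at the top surface ($l=0$) to $n$ at the bottom ($l=L$). The sketch further assumes that $Y$ is the height axis and that the cage is centered on $x=0$ in canonical space; the function names are hypothetical.

```python
import numpy as np

def scale_along_axis(points, axis, n):
    """Elongate or shrink canonical points along one axis (0 = X, 1 = Y, 2 = Z) by n times."""
    out = np.array(points, dtype=float)
    out[:, axis] *= n
    return out

def taper_bottom_along_x(points, n, top_y, height):
    """Enlarge the bottom along X by n times while leaving the top surface unchanged.

    Assumes a linear factor 1 + (n - 1) * l / L, where l is the distance to the top surface.
    """
    out = np.array(points, dtype=float)
    l = np.abs(top_y - out[:, 1])               # distance of each point to the top surface
    out[:, 0] *= 1.0 + (n - 1.0) * l / height   # factor is 1 at the top and n at the bottom
    return out
```

A factor that is linear in $l$ maps the vertical cage edges to straight slanted edges, which matches the requirement that edges stay straight.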

6.2 Experimental Results

We report the specific category pose estimation results under different metrics in Table 5. We also provide the rotation recovered by one/two vectors in Figure 11. We can see that the bounding boxes are well aligned in the recovered vector direction.

6.2.2 Comparison with State-of-The-Art

We compare FS-Net with the state-of-the-art method Shape-Prior, which utilized point clouds for category-level 6D object pose estimation. Shape-Prior estimated the object size and 6D pose from the dense-fusion feature, while we estimate the pose from the point cloud feature. Figure 12 shows that our FS-Net is robust to color and shape variation, and can handle some failure cases of Shape-Prior. For Shape-Prior, we use the predicted results provided on their website: https://github.com/mentian/object-deformnet.

6.3 Rotation Representation for Symmetry Object

The vector-based rotation representation proposed in the paper can only handle symmetric objects like bottles; in the real world, however, the types of symmetry are various (see Figure 13). In this section, we show how to extend the vector-based rotation representation to different symmetry types. Our strategy is inspired by the rotation mapping operation proposed in previous work. In the following, we show how to find the rotation group (termed proper symmetries in that work) of a single rotation for common symmetric objects.

Our basic idea is to list all the ambiguous rotations of a single rotation and choose the rotation that has the closest distance to the identity matrix $I$:

$$R^{*}=\mathop{\arg\min}_{R\in\mathcal{G}(R_{i})}\mathcal{D}(R,I),$$

where $\mathcal{D}(\cdot,\cdot)$ is the distance between two rotation matrices, and $\mathcal{G}(R_{i})$ is the group of rotations that provide the same visual appearance of a given object as rotation $R_{i}$. Our goal is to find the rotation $R^{*}$ that minimizes the rotation distance.

For a symmetric object like a bottle, we can avoid the rotation ambiguity by using only the green vector to represent the rotation (see Figure 4). However, the case is non-trivial for other symmetry types. In the following, we describe how we find the symmetric rotation group for different symmetry types.

For objects that are symmetric under a 180∘ rotation about one axis, rotating the object around that axis by 180∘ in canonical space gives the same appearance. Assume that this axis is the $Z$ axis; then, for an arbitrary rotation $R$, the appearance $\mathcal{A}$ satisfies:

$$\mathcal{A}(R\mathcal{O})=\mathcal{A}(RR^{Z^{+}}_{180}\mathcal{O}),$$

where $R^{Z^{+}}_{180}$ denotes rotating the object around the $Z$ axis by 180∘ clockwise, and $\mathcal{O}$ denotes the object. This means that we can find the rotation group of each rotation by right-multiplying it with $R^{Z^{+}}_{180}$. We then use Equation 12 to find the representative rotation in the rotation group.

6.3.2 Symmetry with $N$ Axes

The idea can be easily extended to objects with $N$ symmetries around a single axis $Z$. For this kind of symmetric object, when we rotate the object around axis $Z$ by $K\frac{360}{N}^{\circ}(K=1,2,\cdots,N)$ in canonical space, the appearance $\mathcal{A}$ of the object is unchanged:

$$\mathcal{A}(R\mathcal{O})=\mathcal{A}(RR^{K\frac{360}{N}^{\circ}}\mathcal{O}).$$

Then, the symmetric rotation group $\mathcal{G}(R)$ of rotation $R$ is $\{RR^{K\frac{360}{N}^{\circ}}\mid K=0,1,\cdots,N-1\}$. We find the representative rotation in $\mathcal{G}(R)$ with Equation 12.
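The selection of a representative rotation for an object with $N$-fold symmetry about $Z$ can be sketched as follows. This is an illustration under stated assumptions: the distance $\mathcal{D}$ is taken to be the geodesic angle between rotation matrices, the group is generated by right-multiplying with rotations about $Z$, and all function names are hypothetical.

```python
import numpy as np

def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the Z axis."""
    t = np.radians(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rotation_distance_deg(Ra, Rb):
    """Geodesic angle in degrees between two rotation matrices (assumed choice of D)."""
    cos_theta = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

def symmetry_group(R, N):
    """All rotations that give the same appearance as R under N-fold symmetry about Z."""
    return [R @ rot_z(k * 360.0 / N) for k in range(N)]

def representative_rotation(R, N):
    """Member of the symmetry group of R that is closest to the identity."""
    identity = np.eye(3)
    return min(symmetry_group(R, N), key=lambda G: rotation_distance_deg(G, identity))
```

Every rotation in the same group maps to the same representative, so the network sees a single consistent target for all visually identical poses.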

6.3.3 General Case

Most symmetry types are covered by the descriptions in Sections 6.3.1 and 6.3.2. For any other symmetric object, the key idea is to find the rotation operations that produce the same appearance of the object, and then use Equation 12 to find the representative rotation.

6.3.4 Decoupled Rotation Representation

Given the representative rotation $R^{*}$ of ambiguous rotation, we generate its corresponding vector-based representation $\mathcal{V}$ by:

where $\textbf{v}_{1}$ is the vector along the axis $Z$ mentioned in Sections 6.3.1 and 6.3.2, and $\textbf{v}_{2}$ is the vector orthogonal to $\textbf{v}_{1}$.