IST-Net: Prior-free Category-level Pose Estimation with Implicit Space Transformation

Jianhui Liu, Yukang Chen, Xiaoqing Ye, Xiaojuan Qi

Introduction

Category-level pose estimation has drawn great attention and plays an important role in practical applications, including robotic manipulation, augmented reality, and scene understanding. Unlike instance-level pose estimation, which requires a 3D CAD model for each object instance, this task aims to exploit category-specific information and can thus generalize to unseen objects within the given categories.

Recently, many methods have been proposed for category-level pose estimation, which can be categorized into two groups: prior-free methods and prior-based methods. Prior-free methods mainly focus on designing network structures to fit the training data better. These methods are relatively simple but struggle to generalize to novel objects and suffer from poor performance.

To address this issue, prior-based methods leverage category-specific 3D priors (templates) to guide pose estimation. They adopt a prior-driven deformation module that deforms the prior to synthesize the target object in world-space. They then formulate pose estimation as a correspondence learning problem between camera-space and world-space, which explicitly aligns the coordinates. Although prior-based methods have made considerable progress, they require a large number of ground-truth 3D models of target objects, both to obtain the 3D prior and to supervise the training of the prior deformation module, which hinders their practical applicability.

This motivates us to investigate the mechanism that makes prior-based methods effective. We experiment with the shape deformation module, which deforms the given shape prior into the desired instance (Fig. 3), by replacing the shape priors with random noise or with a fixed shape prior from another category (see Fig. 2). We observe that the deformation module can adapt any input (noise or a fixed prior) into a target world-space object (see Fig. 2 (b)). Moreover, model performance remains high regardless of the 3D prior used (see Fig. 2 (a)). These observations suggest that the shape prior itself is not necessary for the high performance of prior-based methods. The key is instead the deformation module, which learns to synthesize world-space target objects and explicitly builds the correspondence between camera-space and world-space: performance degrades dramatically without prior deformation. This prompts us to investigate new ways to build camera-to-world correspondence without requiring 3D priors or models.

In this paper, we propose a simple yet effective prior-free model, named Implicit Space Transformation Network (IST-Net), which implicitly sets up feature correspondence between camera-space and world-space without requiring 3D priors or ground-truth 3D models of target objects. Specifically, given the camera-space features, the network transforms them into world-space features, which are then used together with the camera-space features for pose estimation. To learn the transformation, we propose a world-space enhancer that distills standard world-space features to supervise the transformed features. The standard world-space features are obtained by transforming the input target object into world-space with the ground-truth pose and feeding the result into a feature extractor. In addition, given camera-space inputs, the feature extraction capability of the backbone network is boosted by an auxiliary pose estimation loss, namely the camera-space enhancer. Notably, both enhancers are used only during training, which brings considerable performance improvements without introducing computational overhead at inference.

Our main contributions are summarized as follows:

We investigate prior-based methods and find that shape priors are not necessary for obtaining high performance while building the camera and world-space correspondence with prior deformation is a key factor.

We propose a simple yet effective Implicit Space Transformation Network (IST-Net) that implicitly builds the correspondence between camera-space and world-space at the feature level without requiring 3D models or priors.

We introduce two space enhancers that facilitate learning the transformation and strengthen the representation capability of the features for pose estimation.

We conduct a series of experiments on the REAL275 and Wild6D datasets to demonstrate the effectiveness of the proposed method. Notably, IST-Net is currently the only prior-free method that achieves state-of-the-art performance on the REAL275 benchmark, and it attains notable gains over prior-based methods in both efficiency and accuracy (see Fig. 1).

Related Works

Prior-free methods focus on designing architectures that predict the object pose in a concise manner. Sahin et al. propose an “Intrinsic Structure Adaptor” to adapt to the distribution shifts arising from shape discrepancies. Wang et al. introduce the first category-level benchmark by normalizing all object instances into a shared canonical representation named Normalized Object Coordinate Space (NOCS), and try to recover the angle of view in NOCS for pose estimation. Chen et al. introduce a learned canonical shape space to handle intra-class variation. Chen et al. synthesize image matches with neural rendering in order to verify the probability of each possible pose candidate. Wang et al. propose 6-PACK, which learns to compactly represent an object by a handful of 3D keypoints based on motion information and computes the pose by tracking. In pursuit of more efficient and direct pose estimation, a few methods design the network in an end-to-end manner. Chen et al. decouple the rotation into two mutually orthogonal vectors to fully decode the orientation information, which allows the network to naturally handle circularly symmetric objects. Di et al. embody geometric insights with bounding box projection to enhance the learning of category-level pose-sensitive features. Lin et al. introduce DualPoseNet, which is composed of two parallel pose decoders on top of a shared pose encoder. The two decoders work in an implicit and an explicit manner, respectively, and are constrained to predict consistent poses.

2.2 Prior-based Methods

Owing to severe intra-class variation, the generalization of prior-free models is greatly limited. To alleviate this issue, several works turn to prior-based methods. Tian et al. present a general solution: they build a shape prior for each category with an autoencoder and then use these priors as standard templates to reconstruct the canonical model of each instance. Chen et al. use a variational autoencoder (VAE) to reconstruct the standard object shape, followed by a fully sparse convolution network for pose regression. Wang et al. propose a cascaded relation network to capture the underlying relations of multi-source inputs. Kai et al. utilize a transformer network to model the global structure similarity between the prior and the target object, based on which the object semantic information is injected into the prior feature to dynamically adapt the category-level prior to each particular object. Fan et al. adopt a shape prior guided reconstruction network and a discriminator network to learn high-quality canonical representations. Zhang et al. use the shape priors as the indicator to predict pose and zero-mean residual vectors, which encapsulate the spatial cues of the pose and enable geometry-guided consistency terms. Zhang et al. learn dense correspondences between input images and the canonical shape prior via surface embedding. Lin et al. establish deep correspondence in the feature space between the shape prior and the canonical model, which yields a surprising performance boost.

Analysis of Shape Priors

To overcome intra-class variation, prior deformation has been widely adopted as a practical module by recent works. The vanilla version of prior deformation can be divided into two parts: 1) generating shape priors and 2) leveraging shape priors to develop prior deformation techniques.

For the former, the common solution is to train an autoencoder on various object models sampled from ShapeNet, then obtain the category-level shape embedding by averaging the latent vectors output by the encoder. These shape embeddings are fed into the decoder to produce the shape priors. Notably, this process relies on a large number of 3D models to obtain a general prior for a category.

For the latter, Fig. 3 illustrates the process. Given the shape prior, image patch, and observed points, the network first learns a deformation field that deforms the shape prior into the desired object instance, supervised by a ground-truth 3D model. The network also outputs a matching matrix that indicates the point-to-point correspondences between the observed target points and the reconstructed model. These correspondences map the reconstructed model onto the observed points, giving their coordinates in the world coordinate system. With information from both camera-space (depth images) and world-space (matched priors), the pose parameters can be solved via the Umeyama algorithm or regressed by neural networks.
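As a minimal illustration of the closed-form option mentioned above, the following NumPy sketch implements the Umeyama least-squares similarity alignment. It assumes the convention that a camera-space point equals `s * R @ q + t` for its world-space counterpart `q`; the function name `umeyama_similarity` and the argument layout are illustrative choices, not taken from any of the cited implementations.

```python
import numpy as np

def umeyama_similarity(src, dst):
    """Least-squares similarity transform (s, R, t) such that dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points, e.g. world-space
    coordinates (src) and observed camera-space points (dst).
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centred point sets.
    cov = dst_c.T @ src_c / src.shape[0]
    U, D, Vt = np.linalg.svd(cov)

    # Reflection guard: force det(R) = +1.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / src.shape[0]
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

In practice such a solver is usually wrapped in an outlier-rejection loop (e.g. RANSAC), since the predicted correspondences are noisy; the sketch omits that step.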

3.2 Is Shape Prior Necessary?

We conduct extensive experiments to verify whether the shape prior is necessary to address the intra-class variation problem. Specifically, we choose a competitive candidate from prior-based methods, DPDN (for other methods, please refer to the Appendix). We set up the following settings:

Case-2: All categories share the same prior (can) in place of the class-specific priors.

Case-3: Using random noise restricted to the unit cube instead of standard shape priors.

Case-4: Removing the prior deformation from the original framework.
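The prior-replacement settings above can be sketched as a small helper that selects the point set fed to the deformation module. This is only an illustration: the helper name `make_prior`, the dictionary of per-category priors, the 1024-point size, and the choice of a unit cube centred at the origin are all assumptions rather than details given in the text.

```python
import numpy as np

def make_prior(setting, category, class_priors, n_points=1024, seed=0):
    """Return the (n_points, 3) point set used as the 'prior' input.

    class_priors: dict mapping category name -> (n_points, 3) array
    (a hypothetical container for the learned category priors).
    """
    if setting == "class_specific":
        # Default behaviour of prior-based methods.
        return class_priors[category]
    if setting == "shared":
        # Case-2: every category uses the 'can' prior.
        return class_priors["can"]
    if setting == "noise":
        # Case-3: uniform noise inside a unit cube (assumed centred at 0).
        rng = np.random.default_rng(seed)
        return rng.uniform(-0.5, 0.5, size=(n_points, 3))
    raise ValueError(f"unknown setting: {setting}")
```

Case-4 is not covered by this helper, since it removes the deformation module itself rather than changing its input.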

From Fig. 2, we can conclude that category-specific priors are not necessary as the model can learn to deform a shape to match the target from even random noise. The reason stems from the explicit supervision of ground-truth 3D models during training, which enables the model to learn to deform any given prior (e.g., random noise) to reconstruct a target object. Nevertheless, the prior deformation module contributes to the prior-based model as the performance drops dramatically without this module (see Fig. 2 (a) w/o deformation). The above suggests that the key to the success of prior-based methods is the deformation module that aligns objects in the camera and world-space and facilitates building correspondences, but not the prior itself.

Hence, we explore new ways to transform camera-space inputs to world-space and implicitly build the correspondence between them without relying on priors or 3D ground-truth models in training. We present a prior-free implicit feature transformation network (IST-Net), with details given in Sec. 4.2. On the one hand, our method removes the dependence on the large number of 3D models required by prior-based methods. On the other hand, a simple design makes the model aware of world-space information, which allows us to develop an efficient prior-free pose estimator without sacrificing model performance.

Method

4.2 Implicit Space Transformation

As shown in Sec. 3.2, the shape prior itself is not necessary; the important factor is how to transform camera-space inputs into world-space, align them, and build their correspondences. To this end, we propose an implicit space transformation module, which transforms camera-space features to world-space in an implicit manner without resorting to ground-truth 3D models during training.

where $[\dots]$ refers to concatenation and $G$ denotes global average pooling. $F_{L}$ and $F_{G}$ refer to the local and global features, respectively.

where $L_{\text{SL1}}$ denotes the Smooth-L1 loss and $\Gamma$ indicates the 3D geometric transformation operation according to pose. This supervision will encourage the transformed features to be in the world-space.
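To make the roles of $\Gamma$ and $L_{\text{SL1}}$ concrete, the sketch below maps observed camera-space points into world-space with a ground-truth pose and scores predicted world-space coordinates with a Smooth-L1 loss. It assumes the convention `p_cam = scale * R @ q_world + t` with a scalar scale; the function names and the `beta` threshold of 1.0 are illustrative assumptions, not values stated in the text.

```python
import numpy as np

def to_world_space(points_cam, R, t, scale):
    """Gamma: map (N, 3) camera-space points to world-space.

    Inverts the assumed convention p_cam = scale * R @ q_world + t.
    """
    # Row-vector form of R^T (p - t) / scale.
    return (points_cam - t) @ R / scale

def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth-L1 loss: quadratic below beta, linear above it."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()

def coordinate_loss(pred_world, points_cam, R_gt, t_gt, scale_gt):
    """Supervise predicted world-space coordinates with transformed observations."""
    target = to_world_space(points_cam, R_gt, t_gt, scale_gt)
    return smooth_l1(pred_world, target)
```

Because the target is computed from the observed points and the ground-truth pose alone, this kind of supervision needs no ground-truth 3D model.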

4.3 Camera-space Enhancer

4.4 World-space Enhancer

4.5 Overall Loss Function

In summary, the overall loss function is as below:

where $\lambda_{f}$ and $\lambda_{r}$ are hyper-parameters used to balance the individual loss contributions. $L_{\text{main}}$, $L_{\text{aux1}}$, and $L_{\text{aux2}}$ refer to the supervision for the outputs of the main pose estimator and the two feature enhancers, which share the same loss format as:

Experiments

REAL275 & CAMERA25: Our method is trained on both the synthetic dataset, CAMERA25, and the real dataset, REAL275, and evaluated on the REAL275 test split. CAMERA25 contains 300k synthetic RGB-D images, which are generated by rendering 1,085 synthetic objects with real-world backgrounds. REAL275 includes 8k RGB-D images, where 4300 images are split for training, 950 images for validation, and 2750 images for testing. Both datasets contain 6 categories: bottle, bowl, camera, can, laptop, and mug.

Wild6D: Wild6D contains 5,166 videos with 1722 object instances and 5 categories (bottle, bowl, camera, laptop, and mug). Among this data, 486 videos are split for model evaluation.

5.2 Implementation Details

Following previous work, we use the off-the-shelf MaskRCNN with a ResNet101 backbone to generate high-quality instance masks. We adopt a PSP Network based on ResNet-18 for semantic feature extraction and PointNet++ for point-level feature extraction. The number of object points $N_{o}$ is set to 1024 and the RGB image is resized to $192\times 192$. We adopt several commonly used data augmentations, including random uniform noise, random rotational and translational perturbations, and the bounding box-based adjustment proposed by FS-Net. The loss weights $\lambda_{f}$ and $\lambda_{r}$ are set to 10 and 1, respectively. All the experiments are conducted on 2 RTX3090Ti GPUs with a batch size of 24, and the ratio of real data to synthetic data is 1:3. For a fair comparison, the total number of training epochs is fixed to 30, and all the modules are trained in an end-to-end manner. During inference, only the feature extractor and implicit space transformation module are preserved.

5.3 Evaluation Metrics

Following prior work, we adopt the widely used evaluation metrics, including the mean precision of 3D intersection over union (IoU), which jointly evaluates rotation, translation, and size. In addition, $5^{\circ}2cm$, $5^{\circ}5cm$, $10^{\circ}2cm$, $10^{\circ}5cm$ and $10^{\circ}10cm$ are used to evaluate the rotation and translation errors directly: a prediction is considered correct only if both its rotation error and its translation error fall below the corresponding thresholds.
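The $n^{\circ}m\,cm$ criterion can be sketched in a few lines of plain Python. The sketch assumes rotations are given as 3x3 matrices (nested lists), translations are in metres, and the rotation error is the geodesic angle between the two rotations. The function names are illustrative, and the special handling that standard evaluation code applies to symmetric categories is deliberately left out.

```python
import math

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_pred^T R_gt) equals the sum of element-wise products.
    tr = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (tr - 1.0) / 2.0))  # clamp for safety
    return math.degrees(math.acos(cos_angle))

def translation_error_cm(t_pred, t_gt):
    """Euclidean distance in centimetres, assuming inputs in metres."""
    return 100.0 * math.dist(t_pred, t_gt)

def is_correct(R_pred, t_pred, R_gt, t_gt, deg_thresh, cm_thresh):
    """True only if BOTH errors fall below their thresholds."""
    return (rotation_error_deg(R_pred, R_gt) < deg_thresh
            and translation_error_cm(t_pred, t_gt) < cm_thresh)
```

For example, a prediction that is off by 4 degrees and 3 cm counts as correct under $5^{\circ}5cm$ but not under $5^{\circ}2cm$.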

5.4 Comparison with State-of-the-Arts

We compare IST-Net with state-of-the-art methods on REAL275, as shown in Tab. 1. Among prior-free methods, we surpass the others by a large margin on all evaluation metrics; e.g., we reach 47.5 and 76.6 on $5^{\circ}2cm$ and $3D_{75}$, outperforming GPV-Pose by 19.4% and 41%. Compared with DPDN, the current strongest prior-based method, we still improve on most metrics, e.g., 76.6 vs 76 on $3D_{75}$, 47.5 vs 46.0 on $5^{\circ}2cm$, 53.4 vs 50.7 on $5^{\circ}5cm$, 80.5 vs 78.4 on $10^{\circ}5cm$ and 82.6 vs 80.4. Notably, this is the first time that a prior-free method achieves comparable or even higher performance than prior-based methods on REAL275. In addition, we present a per-class comparison between IST-Net and DPDN in Fig. 7. Our method predicts rotation more accurately: the error curve is steeper on geometrically complex objects such as camera, which supports the effectiveness of our contributions. Model efficiency is also worthy of attention, so we list the inference time in the last column of Tab. 1. IST-Net achieves the highest inference speed, more than 25% faster than the other methods.

We conduct experiments on a larger dataset, Wild6D, to further verify the effectiveness of the proposed method. We directly test our model, which is trained on the REAL275 and CAMERA25 datasets, without extra fine-tuning. The results are reported in Tab. 2. IST-Net clearly outperforms the other methods designed for the REAL275 dataset and shows an obvious improvement on various metrics. Compared with RePoNet and Self-Pose, both of which are trained on Wild6D, our method shows good generalization without fine-tuning on the target dataset. We achieve the highest performance of 93.4 and 79.6 on $3D_{25}$ and $3D_{50}$. On the other metrics, our method reaches a performance similar to Self-Pose. These analyses and results demonstrate the potential of our method.

5.5 Ablation Study

Effects of Proposed Modules. We ablate different combinations of the proposed modules; the results are shown in Tab. 3. We first examine the implicit space transformation module (IST). Adding this module lifts the baseline considerably, suggesting that transforming camera-space features to their world-space counterparts and implicitly building the correspondence between them indeed benefits pose estimation. After adding the camera-space enhancer (CE), the precision on $5^{\circ}5cm$ increases from 48.5 to 52.5. The reason is that this auxiliary module enriches the feature extractors with more pose-sensitive information, which improves the quality of the feature transformation and the accuracy of the final pose estimation. We further show the advantage of the world-space enhancer (WE) by combining it with IST. The results (E4) show that WE further improves the performance, especially on $10^{\circ}2cm$ and $10^{\circ}5cm$, which indicates that high-level supervision provides information complementary to the low-level constraint. Finally, combining all modules yields the best overall performance.

Effects of Position Encoding Term. We verify the effect of the positional encoding (PE) term; the results are shown in Tab. 4. Without the PE module, the performance drops significantly on $5^{\circ}2cm$ and $5^{\circ}5cm$. This indicates that the PE module compensates for the positional information lost by the feature extractor, which benefits pose regression.

Comparison with Explicit Space Transformation. To further verify the effectiveness of the proposed implicit space transformation, we compare it with an explicit counterpart. For a fair comparison, only WE and IST are included in the implicit candidate. The results in Tab. 5 show that the accuracy of the two methods is very close. However, our method is clearly superior in speed (34Hz vs 22Hz) and parameter count (21M vs 24M), which we attribute to the succinct feature-space transformation that avoids introducing repetitive modules for extracting features from coordinates. This further indicates the potential of the proposed modules.

Ablations on Predicted World Coordinate. We predict the world-space coordinates of the observed points to supervise the implicit space transformation from a low-level perspective, so the quality of the predicted coordinates also reflects the effectiveness of the proposed method. To verify this, we solve the pose parameters with the Umeyama algorithm from the predicted coordinates and the observed points in camera-space. As shown in Tab. 6, our method achieves results comparable with SGPA, and even a significant improvement on $3D_{75}$ (72.7 vs 61.9), indicating that the network can reconstruct the perspective in world-space without introducing a shape prior.

Qualitative Analysis

In Fig. 6, we visually compare our method and DPDN on the REAL275 dataset. As highlighted in the blue boxes, DPDN struggles with objects of complex structure, e.g., camera, which shows up as apparent deviations of the predicted boxes. This reflects that prior deformation has a poor capability for modeling challenging cases. By contrast, IST-Net predicts rotation and translation accurately. The reason is that with the implicit transformation, the geometric structures are transmitted to world-space together with the features, which keeps the network sensitive to complex structures.

Conclusion

In this paper, we analyze overlooked issues in prior-based pose estimation methods and empirically find that the shape prior does not contribute to performance boosts. The key point is actually the deformation process, which builds correspondence between camera and world coordinates by reconstructing the object shape in world-space. Inspired by this, we design an implicit space transformation network (IST-Net) to transform camera-space features to world-space in an implicit manner. It builds the space correspondence without requiring 3D priors or ground-truth 3D models of target objects. In addition, we design two independent feature enhancers to further enhance the features from both camera-space and world-space, which enriches them with more pose-sensitive information and geometrical constraints. Extensive experiments on the challenging benchmark show the effectiveness of our method in both efficiency and accuracy. We hope our investigation can provide new insights for future research in the community.

Appendix

Appendix A More Implementation Details

We train our IST-Net from scratch in an end-to-end manner for 30 epochs with a batch size of 24. We further employ the Adam optimizer with a base learning rate of 0.01. We adopt the StepLR scheduler with step size 1 and gamma as 5. Our experiments are conducted on two RTX3090Ti GPUs.

A.2 Network Configurations

As mentioned in the main paper, we provide the detailed architecture of the pose estimators in Fig. 8. IST-Net contains three pose estimators, in the camera-space enhancer, the world-space enhancer, and the final pose regression, which follow similar architectures. The pose estimators in the world-space enhancer and the final pose regression share the same architecture and adopt a standard design, namely the standard pose estimator, while the pose estimator in the camera-space enhancer adopts a lightweight design, namely the lite pose estimator. Specifically, in Fig. 8, the lite pose estimator only takes camera-space information as input, including semantic features $F_{P_{o}}$, geometrical features $F_{I_{o}}$ and a position encoding term generated by an MLP upon $P_{o}$. The inputs of the standard pose estimator contain extra information from world-space, including world-space geometrical features $F_{\hat{Q}_{o}}$ and a world-space position encoding term. The inputs are concatenated and sent into an MLP to yield the fused features, followed by a global average pooling layer. We further concatenate the global and local features and use a combination of an MLP and a pooling layer to acquire the compressed features. Finally, three independent MLPs are used to predict $R$, $t$, and $s$ respectively.
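The data flow described above can be traced with a small NumPy sketch. It is only a shape-level illustration under several assumptions: each MLP is replaced by a single randomly initialised linear layer with ReLU, the second pooling layer is taken to be average pooling, the hidden width is 64, and the rotation head outputs 6 numbers. None of these values comes from the text, and the names `mlp` and `pose_estimator` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, out_dim):
    """Stand-in for a trained MLP: one random linear layer + ReLU."""
    w = rng.standard_normal((x.shape[-1], out_dim)) * 0.1
    return np.maximum(x @ w, 0.0)

def pose_estimator(inputs, hidden=64, rot_dim=6):
    """inputs: list of (N, C_i) per-point feature arrays."""
    # 1) Concatenate all inputs and fuse them point-wise.
    fused = mlp(np.concatenate(inputs, axis=1), hidden)       # (N, hidden)
    # 2) Global average pooling over the points.
    global_feat = fused.mean(axis=0, keepdims=True)           # (1, hidden)
    # 3) Concatenate local features with the broadcast global feature.
    tiled = np.repeat(global_feat, fused.shape[0], axis=0)
    local_global = np.concatenate([fused, tiled], axis=1)     # (N, 2*hidden)
    # 4) MLP + pooling to obtain one compressed feature vector.
    compressed = mlp(local_global, hidden).mean(axis=0)       # (hidden,)
    # 5) Three independent heads for rotation, translation and size.
    heads = [rng.standard_normal((hidden, d)) * 0.1 for d in (rot_dim, 3, 3)]
    return tuple(compressed @ w for w in heads)
```

Under this sketch, the lite estimator would be called with three camera-space inputs and the standard estimator with two additional world-space inputs; everything after the first concatenation is identical.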

Appendix B More Experimental Results

We further report the results of our method on the CAMERA25 dataset in Tab. 7. Our method is competitive with other methods. In particular, on the $3D_{75}$ metric, IST-Net outperforms the previous state-of-the-art method by 2%. This indicates that our method has a strong ability to jointly estimate rotation, translation, and size.

Appendix C More Ablation Studies

We further ablate the effect of different choices of $\lambda_{f}$ on pose accuracy. We gradually enlarge the $\lambda_{f}$ from 1 to 100. The comparative results are shown in Tab. 8. When $\lambda_{f}$ is too small, the supervision is limited, and when it is too large, the supervision from ground truth will be weakened. Overall, when $\lambda_{f}$ is set as 10, we reach the best performance.

C.2 Ablate on Loss Type

In this experiment, we ablate the effect of different loss types for $L_{\text{feat}}$. We present the results of MSE loss and L1 loss in Tab. 9. Of the two, MSE loss has more advantages. It does not require the two features to be exactly the same, but imposes strong constraints where the differences are large, which makes the imitation between features easier to learn.
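The argument above can be illustrated by comparing the per-element gradients of the two losses with respect to the feature difference. The sketch uses the unnormalised element-wise forms `diff**2` and `|diff|`; the function names are illustrative.

```python
def mse_grad(diff):
    """Gradient of diff**2: proportional to the size of the difference."""
    return 2.0 * diff

def l1_grad(diff):
    """Gradient of |diff|: constant magnitude, whatever the difference."""
    return (diff > 0) - (diff < 0)

# A nearly matched element and a badly mismatched one.
for diff in (0.01, 2.0):
    print(f"diff={diff}: MSE grad={mse_grad(diff)}, L1 grad={l1_grad(diff)}")
```

With MSE, the nearly matched element is barely pushed while the mismatched one is pulled strongly; with L1, both receive the same push, so the loss keeps insisting on exact agreement even where the features are already close.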

C.3 Ablate on Shape Priors with Different Methods

In this part, we provide more experimental results to support the assumption “shape priors are not necessary”, which is detailed in the main paper. We choose two competitive candidates that use prior deformation, DPDN and SGPA, from matching-based and regression-based methods. We list the experimental results in Tab. 10. Regardless of whether the approach is matching-based or direct regression-based, replacing the default shape prior with a category-independent prior or with noise does not change the final performance significantly. This further indicates that the shape prior is redundant for the prior deformation process, supporting the major claims of the main paper.

Appendix D More Visualization

Fig. 9 shows more visualizations of IST-Net on the REAL275 test split. As highlighted by the red boxes, our method accurately predicts the object pose, which visually demonstrates its superiority.

Appendix E Limitation Analysis and Future Work

Our method yields strong performance on the NOCS and Wild6D datasets, but this might be insufficient for in-the-wild open-world evaluation, because existing datasets contain limited object categories and the object structures are relatively simple.

We will work on building a category-level dataset with diverse object types and shapes to further push the area forward. We hope our current investigation can inspire new insights in pose estimation.