OnePose: One-Shot Object Pose Estimation without CAD Models

Jiaming Sun, Zihao Wang, Siyu Zhang, Xingyi He, Hongcheng Zhao, Guofeng Zhang, Xiaowei Zhou

Introduction

Object pose estimation plays an important role in augmented reality (AR). The ultimate goal of object pose estimation in AR is to use arbitrary objects as “virtual anchors” of AR effects, which demands the ability to estimate poses of surrounding objects in our daily life. Most established works in object pose estimation assume that the CAD model of the object is known a priori. Since high-quality CAD models of everyday objects are often inaccessible, the research on object pose estimation for AR scenarios necessitates new problem settings.

To avoid relying on instance-level CAD models, many recent methods address category-level pose estimation. By training a network on different instances in the same category, the network can learn a category-level representation of object appearances and shapes and thus generalize to new instances in the same category. However, such approaches require a large number of training samples in the same category, which can be hard to obtain and annotate. Furthermore, the generalization capabilities of category-level methods are not guaranteed when a new instance has a significantly different appearance or shape. More importantly, training and deploying a network for each category is unaffordable in many real-world applications, e.g., mobile AR, when the number of object categories to be handled is huge.

To alleviate the demand for CAD models or category-specific training, we go back to an “old” problem setting for object pose estimation, but renovate the entire pipeline with a new learning-based approach. Similar to the task of visual localization, which estimates the unknown camera pose given an SfM map of a scene, object pose estimation has long been formulated in the localization-based setting. Unlike instance- or category-level methods, this setting assumes that a video sequence of the object is given, and that a sparse point cloud model can be reconstructed from the sequence. Estimating the object pose is then equivalent to localizing the camera pose with respect to the reconstructed point cloud model. At test time, 2D local features are extracted from the query image and matched with the points in the SfM model to obtain 2D-3D correspondences, from which the object pose can be solved by PnP. Instead of learning instance- or category-specific representations with neural networks, this traditional pipeline leverages an explicit 3D model of the object that can be built on the fly for a new instance, which brings better generalization to arbitrary objects while making the system more explainable.
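
The relation that PnP inverts can be sketched as follows. This is a minimal illustration, assuming a pinhole camera without skew or distortion, intrinsics `K`, and an object pose `(R, t)` that maps object coordinates to camera coordinates. The function name `project_points` and all numeric values are hypothetical and are not taken from the paper.

```python
def project_points(points_3d, R, t, K):
    """Project object-frame 3D points to pixels with a pinhole model.

    R (3x3 nested lists) and t (length 3) map object coordinates to camera
    coordinates; K is the 3x3 intrinsic matrix (skew is ignored).
    All names here are illustrative assumptions.
    """
    pixels = []
    for X in points_3d:
        # Rigid transform into the camera frame: Xc = R @ X + t
        Xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
        if Xc[2] <= 0:
            raise ValueError("point is behind the camera")
        # Perspective division followed by the intrinsics
        u = K[0][0] * Xc[0] / Xc[2] + K[0][2]
        v = K[1][1] * Xc[1] / Xc[2] + K[1][2]
        pixels.append((u, v))
    return pixels


# Hypothetical example: identity rotation, object 2 m in front of the camera.
K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 2.0]
pixels = project_points([[0.0, 0.0, 0.0], [0.1, -0.1, 0.0]], R, t, K)
```

PnP solves the inverse problem: given pixel locations and their matched 3D points, it recovers `R` and `t`.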

In this paper, we refer to this problem setting as one-shot object pose estimation, where the objective is to estimate the 6D pose of an object of arbitrary category, given only a few pose-annotated images of the object for training. While this problem is similar to visual localization, directly migrating existing visual localization methods does not solve it. The modern visual localization pipeline produces 2D-3D correspondences by first performing 2D-2D matching between the query image and the retrieved database images. To ensure a high localization success rate, the query must be matched against multiple retrieval candidates, which makes 2D-2D matching expensive, especially for learning-based matchers. As a result, the runtime of existing visual localization methods is often on the order of seconds, which cannot satisfy the requirement of tracking moving objects in real time.

For the reasons above, we propose to directly perform 2D-3D matching between the query image and the SfM point clouds. Our key idea is to use graph attention networks (GATs) to aggregate the 2D features that correspond to the same 3D SfM point (i.e., a feature track) to form a 3D feature. The aggregated 3D features are later matched with 2D features in the query images with self- and cross-attention layers. Together with the self- and cross-attention layers, the GATs can capture the globally-consented and context-dependent matching priors exhibited in ground-truth 2D-3D correspondences, making the matching more accurate and robust.

To evaluate the proposed method, we collected a large-scale dataset for the one-shot pose estimation setting, which contains 450 sequences of 150 objects. Compared with the previous instance-level method PVNet and the category-level method Objectron, OnePose achieves better precision without training on any object instances or categories in the validation set, while taking only 58 ms to process one frame on a GPU. To the best of our knowledge, when combined with a feature-based pose tracker, OnePose is the first learning-based method that can stably detect and track poses of everyday household objects in real time (refer to the project page).

Our contributions are summarized as follows:

Renovating the visual localization pipeline for object pose estimation so that it can handle novel objects without CAD models or additional network training.

A new architecture of graph attention networks for robust 2D-3D feature matching.

A large-scale object dataset for one-shot object pose estimation with pose annotations.

Related works

The state-of-the-art approaches for object 6DoF pose estimation can be broadly categorized into regression-based and keypoint-based techniques. Given an image, methods of the first type directly regress pose parameters from features within each Region of Interest (RoI). In contrast, methods of the latter type first find correspondences between image pixels and 3D object coordinates, either by regression or by voting, and then compute the pose with Perspective-n-Point (PnP). These methods require high-fidelity textured 3D models to generate auxiliary synthetic training data and to refine poses in order to achieve high accuracy on trained instances.

Unlike the abovementioned methods that train a single network for each instance, NOCS proposes to establish correspondences between image pixels and Normalized Object Coordinates (NOCS) shared within each category. With this learned category-level shape prior, NOCS eliminates the dependency on CAD models at test time. Some later works follow the trend of leveraging category-level priors to recover a more accurate object shape with the NOCS representation. A limitation of this line of work is that the shape and appearance of some instances can vary significantly even if they belong to the same category, so the generalization capabilities of trained networks over these instances are questionable. Moreover, accurate CAD models are still required for ground-truth NOCS map generation during training, and different networks need to be trained for different categories. Our proposed method requires CAD models at neither training nor test time and is category-agnostic.

CAD-Model-Free Object Pose Estimation

Recently, a few attempts have been made to achieve CAD-model-free object 6D pose estimation at both training and test time. Both Neural Object Fitting and LatentFusion tackle the problem via analysis-by-synthesis approaches, where differentiably synthesized images are compared with target images to generate gradients for object pose optimization. Neural Object Fitting proposes to encode a category-level appearance prior with a Variational Auto Encoder (VAE) trained on fully synthetic data, while LatentFusion builds a 3D latent-space-based object representation from posed RGB-D images for each unseen object. However, the efficiency and accuracy of such methods are highly limited by the image synthesis networks, making them unsuitable for AR applications. RLLG takes a different approach and learns correspondences from image pixels to object coordinates without CAD models. Although RLLG can achieve precision comparable to its counterparts, it works only at the instance level and requires highly accurate instance masks to segment foreground pixels. Most recently, Objectron proposes a data-driven approach that learns to regress the pixel coordinates of projected box corners for each category with a tremendous amount of annotated training data. Such an approach is costly and limited to a few categories, as the learned model is category-specific. Moreover, it can only obtain up-to-scale poses without metric scale since it uses a single-view image as input. In contrast, our method can leverage visual-inertial odometry to recover metric scale during the mapping stage, and is thus able to recover metric 6D poses at test time.

Feature-Matching-Based Pose Estimation

Visual localization pipelines based on feature matching have long been studied. Traditionally, the localization problem is solved by finding 2D-3D correspondences between input RGB images and a 3D model from SfM with hand-crafted local features like SIFT, FAST and ORB. Recently, learning-based local feature detection, description and matching have surpassed these hand-crafted methods and replaced their traditional counterparts in the localization pipeline. Notably, Hierarchical Localization (HLoc) provides a complete toolbox for running SfM with COLMAP and feature extraction and matching with SuperGlue. Our method is inspired by SuperGlue in its use of self- and cross-attention layers for feature matching. However, SuperGlue only focuses on 2D-2D matching between images and does not consider the graph structure of the SfM map. Our method uses graph attention networks to process and aggregate 2D features that correspond to a 3D SfM point (i.e., a feature track), which preserves the graph structure of the SfM map during 2D-3D matching.

Many traditional methods for object recognition and pose estimation also share a feature-based pipeline similar to visual localization. These methods first build object models by reconstructing sparse point clouds from keypoints matched across views, and then localize a query image against the sparse point cloud model. Some approaches propose to build a point cloud model online with a framework similar to Simultaneous Localization and Mapping (SLAM). Notably, BundleTrack proposes an online pose tracking pipeline without instance- or category-level models, which most closely resembles ours. However, it uses 2D-2D feature matching instead of the 2D-3D matching used in ours. To recover the 3D information, it also takes depth maps as input, which could limit its usage in AR.

Method

Structure from Motion

Pose Estimation through Visual Localization

In the localization phase, a sequence of query images $\{\mathbf{I}_{q}\}$ is captured in real time. Localizing the camera poses of the query images $\{\xi_{q}^{-1}\}$ with respect to $\{\mathbf{P}_{j}\}$ produces the object poses $\{\xi_{q}\}$ defined in the camera coordinate frame.
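
The relation between the localized camera pose and the object pose is a rigid-transform inversion. The sketch below assumes poses are represented as a rotation matrix `R` and a translation `t` acting as `x' = R x + t`; the helper name `invert_rigid_transform` and the example values are hypothetical.

```python
def invert_rigid_transform(R, t):
    """Invert x' = R x + t, returning (R^T, -R^T t)."""
    R_inv = [[R[c][r] for c in range(3)] for r in range(3)]  # transpose
    t_inv = [-sum(R_inv[r][c] * t[c] for c in range(3)) for r in range(3)]
    return R_inv, t_inv


# Hypothetical camera pose expressed in the object (SfM model) frame:
# a 90-degree rotation about the z-axis plus a translation.
R_cam_in_obj = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t_cam_in_obj = [1.0, 2.0, 3.0]

# Inverting it gives the object pose expressed in the camera frame.
R_obj_in_cam, t_obj_in_cam = invert_rigid_transform(R_cam_in_obj, t_cam_in_obj)
```

Applying the function twice returns the original pose, which is a quick sanity check for any pose convention.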

To remedy this problem, we propose to directly perform 2D-3D matching between the query image and the SfM point cloud. Direct 2D-3D matching avoids the need for an image retrieval module and can thus maintain localization accuracy while being fast. In the next section, we describe how to obtain the 2D-3D correspondences $\mathcal{M}_{3D}$.

OnePose

Inspired by SuperGlue, we further use self- and cross-attention layers following the aggregation-attention layers to process and transform the aggregated 3D descriptors and query 2D descriptors. A set of aggregation-, self- and cross-attention layers forms an attention group, specifically:

The proposed architecture of graph attention networks (GATs) is composed of $N$ stacked attention groups. Intuitively, the aggregation-attention layers adaptively attend to different $\mathbf{F}_{k}^{2D}$ in $\mathcal{G}_{j}$ according to their relevance to $\mathbf{F}_{q}^{2D}$, thus preserving more discriminative information for 2D-3D matching. By interleaving the aggregation-attention layers with self- and cross-attention layers, $\{\mathbf{F}_{k}^{2D}\}$, $\{\mathbf{F}_{j}^{3D}\}$ and $\{\mathbf{F}_{q}^{2D}\}$ can exchange information with each other, thus making the matching globally-consented and context-dependent.
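
The intuition behind attending over a feature track can be illustrated with plain dot-product attention. The sketch below is a deliberate simplification and not the layer used in OnePose: it has no learned projections, uses a single head, and treats one hypothetical `query` vector as the descriptor that drives the attention. The mean-pooling baseline is included for contrast.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_aggregate(query, track_descs):
    """Pool the 2D descriptors of one feature track with dot-product attention.

    Simplification: no learned projections and a single head; `query` stands
    in for whatever descriptor drives the attention.
    """
    dim = len(query)
    scores = [sum(q * d for q, d in zip(query, desc)) / math.sqrt(dim)
              for desc in track_descs]
    weights = softmax(scores)
    pooled = [sum(w * desc[i] for w, desc in zip(weights, track_descs))
              for i in range(dim)]
    return pooled, weights


def average_aggregate(track_descs):
    """Baseline: view-independent mean of the track descriptors."""
    n = len(track_descs)
    return [sum(desc[i] for desc in track_descs) / n
            for i in range(len(track_descs[0]))]


# Hypothetical 2-D descriptors observed for one 3D point from three views.
track = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
pooled, weights = attention_aggregate([4.0, 0.0], track)
mean = average_aggregate(track)
```

With this toy input, the view whose descriptor is most similar to the query receives the largest weight, so the pooled descriptor leans toward it, whereas the mean treats all views equally.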

Match Selection and Pose Calculation

Following prior work, we use the dual-softmax operator to differentiably extract the match confidence scores $\mathbf{C}_{3D}$. The score matrix $\mathbf{S}$ between the transformed features is first calculated as $\mathbf{S}\left(q,j\right)=\langle\mathbf{F^{\prime}}_{q}^{2D},\mathbf{F^{\prime}}_{j}^{3D}\rangle$. Formally, the matching confidence $\mathbf{C}_{3D}$ is obtained by:
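
For reference, the dual-softmax operator is commonly written as below. This is a sketch in this section's notation, assuming one softmax is taken over each row of $\mathbf{S}$ (all 3D points for a fixed query feature $q$) and one over each column (all query features for a fixed 3D point $j$); any temperature or normalization factor is omitted.

```latex
\mathbf{C}_{3D}(q,j) =
  \operatorname{softmax}\big(\mathbf{S}(q,\cdot)\big)_{j}
  \cdot
  \operatorname{softmax}\big(\mathbf{S}(\cdot,j)\big)_{q}
```

An entry is therefore close to 1 only when $q$ prefers $j$ among all 3D points and $j$ prefers $q$ among all query features.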

After applying a confidence threshold $\theta$, $\mathbf{C}_{3D}$ becomes a permutation matrix $\mathcal{M}_{3D}$, which represents the 2D-3D match predictions. With $\mathcal{M}_{3D}$, the object pose in the camera coordinate frame $\xi_{q}$ can be computed by the Perspective-n-Point (PnP) algorithm with RANSAC.
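
The match selection step can be sketched in a few lines. The code below assumes the score matrix is indexed as `S[q][j]` and applies a row-wise and a column-wise softmax. The additional mutual-best check is a common practice in dual-softmax matchers and is an assumption here, added so that each query feature and each 3D point appears in at most one match. Function names, scores and the threshold value are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dual_softmax(S):
    """C[q][j] = softmax over row q evaluated at j, times softmax over column j at q."""
    n_q, n_j = len(S), len(S[0])
    row_sm = [softmax(row) for row in S]
    col_sm = [softmax([S[q][j] for q in range(n_q)]) for j in range(n_j)]
    return [[row_sm[q][j] * col_sm[j][q] for j in range(n_j)] for q in range(n_q)]


def select_matches(C, theta):
    """Keep (q, j) pairs above theta that are also mutual best matches."""
    n_q, n_j = len(C), len(C[0])
    matches = []
    for q in range(n_q):
        j = max(range(n_j), key=lambda jj: C[q][jj])
        best_q = max(range(n_q), key=lambda qq: C[qq][j])
        if best_q == q and C[q][j] > theta:
            matches.append((q, j))
    return matches


# Hypothetical scores: two clear matches and one ambiguous query feature.
S = [[5.0, 0.0, 0.0],
     [0.0, 4.0, 0.0],
     [0.1, 0.0, 0.2]]
C = dual_softmax(S)
matches = select_matches(C, theta=0.4)
```

The ambiguous third query feature is rejected because its best confidence stays below the threshold. In practice, the selected pairs would be handed to a PnP solver with RANSAC, for example OpenCV's `cv2.solvePnPRansac`.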

Supervision

The supervision signal $\mathcal{M}_{3D}^{gt}$ can be directly obtained from filtered 2D-3D correspondences in the SfM maps in the training set. The loss function $L$ is the focal loss over the confidence scores $\mathbf{C}_{3D}$ returned by the dual-softmax operator:
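
A focal loss over the confidence matrix can be sketched as follows. This is an illustration under stated assumptions rather than the exact loss used for training: only ground-truth (positive) entries are penalized, the loss is averaged over them, and `alpha=0.25` and `gamma=2.0` are the common focal-loss defaults used as placeholders.

```python
import math


def focal_loss(C, gt_matches, alpha=0.25, gamma=2.0, eps=1e-6):
    """Mean focal loss over ground-truth (q, j) entries of the confidence matrix.

    alpha and gamma are placeholder values; only positive entries are
    penalized in this sketch.
    """
    if not gt_matches:
        return 0.0
    total = 0.0
    for q, j in gt_matches:
        p = min(max(C[q][j], eps), 1.0 - eps)  # clamp for a finite log
        total += -alpha * (1.0 - p) ** gamma * math.log(p)
    return total / len(gt_matches)


# Hypothetical confidences: entry (0, 0) is confident, entry (1, 1) is not.
C = [[0.9, 0.05], [0.1, 0.3]]
confident = focal_loss(C, [(0, 0)])
uncertain = focal_loss(C, [(1, 1)])
```

The factor `(1 - p) ** gamma` down-weights matches that are already confident, so the gradient concentrates on hard correspondences.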

Online Feature-based Pose Tracking

The above-mentioned pose estimation module takes only sparse key-frame images as input. To obtain stable object poses for AR applications, we further equip OnePose with a feature-based pose tracking module, which processes every frame in the test sequence. Similar to a SLAM system, the pose tracking module reconstructs a 3D map online and maintains its own key-frame pool. At each time-step, tracking adopts a tightly-coupled approach and relies on both the prebuilt SfM map and the online-built 3D map to find 2D-3D correspondences and solve for 6D poses. Since the pose tracking module preserves 2D and 3D information of the test sequence in the online-built map, it can be more stable than the single-frame-based pose estimation module. The pose estimation module helps to recover and re-initialize the tracking module when it fails. We provide more details about the pose tracking module in the supplementary material.

Remarks on the One-Shot Setting

Other than not using CAD models or additional network training, the one-shot setting of OnePose has many advantages compared with existing instance- or category-level pose estimation methods. During the mapping phase, OnePose takes as input a simple video scan of an object and builds an instance-specific 3D representation of the object geometry. Similar to the role of CAD models in instance-level methods, the 3D geometry of the object is crucial for recovering object poses with metric scales. In the localization phase, learned local feature matching in OnePose can handle large changes in viewpoint, lighting and scale, making the system more stable and robust compared to category-level methods. The local-feature-based pipeline also allows the pose estimation module to be naturally coupled with a feature-based tracking module to realize efficient and stable pose tracking.

OnePose Dataset

Since there is no existing large-scale dataset that fits the setting of one-shot pose estimation, we collected a dataset with multiple video scans of the same object placed in different locations. The OnePose dataset contains over 450 video sequences of 150 objects. For each object, multiple video recordings, accompanying camera poses and 3D bounding box annotations are provided. These sequences are collected under different background environments, and each has an average recording length of 30 seconds covering all views of the object. The dataset is randomly divided into training and validation sets. For each object in the validation set, we assign one mapping sequence for building the SfM map, and use a test sequence for the evaluation.

To reduce the manual labor of data annotation, we propose a semi-automatic approach to simultaneously collect and annotate the data in AR. To be specific, an adjustable 3D bounding box is rendered onto the image in AR, as shown in Fig. 4. The only manual work is to adjust the rotation and rough dimensions of the 3D bounding box. Visualizations of the data capture interface and the post-processing steps are shown in Fig. 4.

The objective of the post-processing is to reduce the pose drift error of ARKit for each sequence and ensure consistent pose annotations across sequences. To achieve this, we first align sequences with the annotated bounding boxes and perform bundle adjustment (BA) with COLMAP. Feature matches used in the BA are extracted with SuperGlue. As the backgrounds differ between sequences, we extract matches only in the foreground (i.e., within the 2D object bounding boxes) between all matchable pairs of images. For more details about our data collection and processing pipeline, please refer to our supplementary material.
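
Restricting features to the foreground amounts to a simple box test on keypoint coordinates. The sketch below assumes keypoints are `(x, y)` pixel tuples and the box is given as `(x_min, y_min, x_max, y_max)`; the helper name and the values are hypothetical.

```python
def keep_inside_box(keypoints, box):
    """Return indices of (x, y) keypoints inside box = (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return [i for i, (x, y) in enumerate(keypoints)
            if x_min <= x <= x_max and y_min <= y <= y_max]


# Hypothetical keypoints: two fall on the object, two on the background.
keypoints = [(10.0, 10.0), (120.0, 80.0), (300.0, 200.0), (150.0, 95.0)]
foreground = keep_inside_box(keypoints, box=(100.0, 50.0, 200.0, 100.0))
```

Returning indices rather than coordinates makes it easy to select the matching descriptors with the same index list.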

Experiments

In this section, we first introduce our selection of baseline methods and evaluation protocols, as well as evaluation metrics on our proposed OnePose dataset in Sec. 4.1, followed by implementation details of our method in Sec. 4.2. Experimental results and ablation studies are detailed in Sec. 4.3 and Sec. 4.4, respectively.

We compare our method with the following baseline methods in three categories: 1) Visual localization methods, which are most relevant to the proposed method in terms of estimating the pose based on local feature matching. To be specific, we compare our method with HLoc using different keypoint descriptors (SIFT and SuperPoint) as well as matchers (Nearest Neighbour, SuperGlue). 2) The instance-level method PVNet. 3) The category-level method Objectron. To the best of our knowledge, Objectron is the only method for category-level object pose estimation with RGB images as input.

Evaluation Protocols

For a fair comparison, we apply per-frame pose estimation with the proposed method without the pose tracking module in all the experiments. For our visual localization baselines and the proposed method, we use the same video scan to build the SfM map for localization. Note that the original image retrieval module used for large-scale scenes does not generalize well to objects, thus we sample a subset of five images at equal intervals from the database images as the retrieved images for feature matching. To train our instance-level baseline PVNet, we use 3D box corners instead of semantic points sampled from CAD models as the keypoints to vote for, and further supply auxiliary mask supervision, which is indispensable for training PVNet. Due to the data-demanding nature of the category-level baseline Objectron, we directly use the models provided by the authors, which are trained on the original Objectron dataset.
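
Sampling database images at equal intervals can be sketched as below. Whether the first and last frames are included is not stated, so this version assumes both endpoints are kept; the function name is hypothetical.

```python
def sample_equal_intervals(num_images, num_samples=5):
    """Pick `num_samples` indices spread evenly over range(num_images)."""
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    if num_images <= num_samples:
        return list(range(num_images))
    step = (num_images - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


# Hypothetical mapping sequence with 101 database images.
indices = sample_equal_intervals(101)
```

For a sequence shorter than the requested number of samples, the sketch simply returns every available index.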

Metrics

For evaluation metrics, we cannot directly adopt the commonly used ADD metric and 2D projection metric since CAD models are unavailable in our setting. Another commonly used metric for evaluating the quality of a predicted object pose is the 5cm-5deg metric, which deems a predicted pose correct if the error is below 5 cm and $5^{\circ}$. Following a similar definition, we further narrow the criteria to 1cm-1deg and 3cm-3deg to set up stricter metrics for pose estimation in augmented reality applications. We divide the objects into three splits by their diameters, with 40 cm and 25 cm as thresholds. When comparing with the instance-level and category-level baselines, we follow the metrics used in the original papers.
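
The cm-degree criteria can be computed from a predicted and a ground-truth pose as sketched below. The rotation error is taken as the geodesic angle between the two rotation matrices, and translations are assumed to be in meters; both choices, as well as all function names and example values, are assumptions made for illustration.

```python
import math


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    # trace(R_pred^T R_gt) equals the sum of element-wise products
    trace = sum(R_pred[r][c] * R_gt[r][c] for r in range(3) for c in range(3))
    cos_angle = min(1.0, max(-1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error_cm(t_pred, t_gt):
    """Euclidean distance between translations given in meters, in cm."""
    return 100.0 * math.dist(t_pred, t_gt)


def pose_is_correct(R_pred, t_pred, R_gt, t_gt, max_cm=5.0, max_deg=5.0):
    """True if both errors are below the given thresholds."""
    return (translation_error_cm(t_pred, t_gt) < max_cm
            and rotation_error_deg(R_pred, R_gt) < max_deg)


def rot_z(deg):
    """Rotation about the z-axis, used only to build the example."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]


# Hypothetical prediction that is off by 2 cm and 2 degrees.
R_gt, t_gt = rot_z(0.0), [0.0, 0.0, 0.50]
R_pred, t_pred = rot_z(2.0), [0.0, 0.02, 0.50]
ok_5 = pose_is_correct(R_pred, t_pred, R_gt, t_gt)
ok_3 = pose_is_correct(R_pred, t_pred, R_gt, t_gt, 3.0, 3.0)
ok_1 = pose_is_correct(R_pred, t_pred, R_gt, t_gt, 1.0, 1.0)
```

The example pose passes the 5cm-5deg and 3cm-3deg criteria but fails 1cm-1deg, which shows how the stricter thresholds separate otherwise similar predictions.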

Implementation Details

During the mapping phase, to maintain a fast mapping speed, we reuse $\{\xi_{i}\}$ and use triangulation to reconstruct the point cloud, without further optimizing the camera poses by bundle adjustment. During the localization phase, we assume the 2D bounding box of the object is known, which in practice can be easily obtained from an off-the-shelf 2D object detector (e.g., YOLOv5). To reduce possible mismatches in pose estimation, only the 3D points inside the annotated 3D bounding box are preserved during mapping, and only the 2D features inside the detected 2D bounding box are preserved during localization. For the network design, we use $N=4$ attention groups in GATs. Linear Attention is used in all the attention layers. As the input of GATs, we randomly sample or pad a set of eight features from $\{\mathbf{F}_{k}^{2D}\}$ associated with each $\mathbf{F}_{j}^{3D}$ for all experiments in the paper. The $\{\mathbf{F}_{j}^{3D}\}$ are initialized by averaging all of the associated features $\{\mathbf{F}_{k}^{2D}\}$.
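
Fixing the track size to eight and initializing the 3D descriptor by averaging can be sketched as follows. The padding scheme is not specified, so this version assumes that short tracks are padded by re-drawing descriptors already in the track; function names and the toy descriptors are hypothetical.

```python
import random


def sample_or_pad(track_descs, size=8, rng=None):
    """Return exactly `size` descriptors from one feature track.

    Longer tracks are subsampled without replacement; shorter tracks are
    padded by re-drawing existing descriptors (an assumed padding scheme).
    """
    rng = rng or random.Random(0)
    if len(track_descs) >= size:
        return rng.sample(track_descs, size)
    extra = [rng.choice(track_descs) for _ in range(size - len(track_descs))]
    return list(track_descs) + extra


def init_3d_descriptor(track_descs):
    """Initialize the 3D descriptor as the mean of all associated 2D descriptors."""
    n, dim = len(track_descs), len(track_descs[0])
    return [sum(d[i] for d in track_descs) / n for i in range(dim)]


# Hypothetical tracks: one observed in 3 views, one observed in 20 views.
short_track = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
long_track = [[float(i), 0.0] for i in range(20)]
padded = sample_or_pad(short_track)
subsampled = sample_or_pad(long_track)
f3d = init_3d_descriptor(short_track)
```

A fixed track size lets all 3D points be batched into one tensor, while the mean is computed over the full track so that no observation is ignored at initialization.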

Evaluation Results

We compare our approach with visual localization baselines that use different feature extractors and matchers, and present the results in Tab. 1. HLoc (SPP + SPG) is the baseline with a learning-based feature extractor (SuperPoint) and matcher (SuperGlue), and it most closely resembles our method among the three variants. Our method performs on par with or slightly better than HLoc (SPP + SPG), while HLoc (SPP + SPG) takes ten times the runtime of our method. We believe the improvement comes from the ability of our method to selectively aggregate context from multiple images, enabled by our GATs design, instead of only focusing on the two images being matched.

Comparison with the Instance-level Baseline PVNet

The proposed method is compared with PVNet using the 5cm-5deg metric on selected objects from our OnePose dataset, and the results are presented in Tab. 2. To obtain segmentation masks for training PVNet, we need to additionally apply dense 3D reconstruction and render the reconstructed meshes to obtain masks on the data sequences. This process is time-consuming and greatly limits our choice of objects, since it depends on the quality of the 3D reconstruction. Our method achieves much higher precision than PVNet. PVNet relies on memorizing the mapping from image patches to object-specific keypoints. Without pre-training on large-scale synthetic images (rendered with CAD models) that densely cover all possible views, the performance of PVNet drops drastically. In contrast, our method leverages learned local features that are relatively viewpoint-invariant and thus generalizes to unseen views while maintaining precision.

Comparison with the Category-level Baseline Objectron

We compare our method with Objectron on all objects in the Shoe and Cup categories with the metrics used in the original paper and present the results in Tab. 3.

For the mean pixel error of 2D projection, the results of Objectron on our dataset are far from the results reported for the two categories on the Objectron dataset. This is because of deviations in the ground-truth annotations between the Objectron dataset and our dataset. For a fair comparison, we further apply scaling and center alignment operations to the predictions of Objectron to alleviate this gap, and provide the results as Objectron (S) and Objectron (S+C), respectively, in Tab. 3. Although the performance of Objectron improves and becomes comparable with the results reported in the original paper, our method surpasses it by a large margin. Our method clearly outperforms Objectron in the average precision of azimuth error and elevation error, especially for objects of the Cup category, where the shape and appearance may vary significantly between instances. These experiments illustrate the limited generalization ability of category-level methods to new object instances.

Runtime Analysis

We report the runtimes of our visual localization baselines and our method in Tab. 1. The runtime consists of feature extraction for the query image with SuperPoint and the 2D-3D matching process, and excludes 2D detection and PnP. Our method runs ${\sim}10\times$ faster than HLoc (SPP + SPG). All the experiments are conducted on an NVIDIA TITAN RTX GPU.

Ablation Studies

In this section, we conduct several ablation experiments by substituting GATs with simpler counterparts of the feature aggregation and matching modules. All the results for our ablation studies are presented in Tab. 4.

We validate the effectiveness of the proposed aggregation-attention layers by replacing the corresponding aggregation layers in GATs with an averaging operation, and report the result in Tab. 4 as (i). Note that the 2D-3D matching is still based on a GNN with self- and cross-attention layers, which is similar to SuperGlue. Without the aggregation-attention layers, the results drop significantly for large and medium objects, which indicates the effectiveness of the aggregation-attention layers.

The simple averaging operation cannot adaptively select relevant information from different viewpoints according to different query features.

Other variants with 2D-3D NN Matching

To provide more comparisons with traditional pipelines that estimate object pose with local features and 2D-3D matching, we also experimented with variants of our method based on different local features, feature aggregation methods and matchers for 2D-3D matching. The results are reported in Tab. 4 as (ii - v). None of these variants produces results comparable with our approach. Our method consistently outperforms (ii) and (iv), which use averaging for feature aggregation, by a significant margin. As in the analysis for (i), simply averaging the features from different viewpoints loses view-dependent information. For (iii) and (v), substituting averaging with K-Means clustering could provide richer 3D features, but the results are still not comparable with ours.

Qualitative Comparisons

We provide some qualitative results to compare our method with baseline methods in Fig. 5. Please read the caption for details.

Conclusion

In this paper, we propose OnePose for one-shot object pose estimation. Unlike existing instance-level or category-level methods, OnePose does not rely on CAD models and can handle objects in arbitrary categories without instance- or category-specific network training. Compared with localization-based baseline methods, the instance-level baseline PVNet and the category-level baseline Objectron, OnePose achieves better pose estimation accuracy and faster inference speed. We also believe that the localization-based setting we revisit (i.e., one-shot object pose estimation) is more practical for AR and that revisiting it is valuable to the community.

The limitations of our method stem from its reliance on local feature matching for pose estimation. Our method may fail when applied to textureless objects. Although enhanced by attention mechanisms, our method still has difficulty handling extreme scale changes between the images in the video scan and the test sequences.

Acknowledgements

The authors would like to acknowledge the support from the National Key Research and Development Program of China (No. 2020AAA0108901), NSFC (No. 62172364), and the ZJU-SenseTime Joint Lab of 3D Vision.