CosyPose: Consistent multi-view multi-object 6D pose estimation

Yann Labbé, Justin Carpentier, Mathieu Aubry, Josef Sivic

Introduction

The goal of this work is to estimate accurate 6D poses of multiple known objects in a 3D scene captured by multiple cameras with unknown positions, as illustrated in Fig. 1. This is a challenging problem because of the texture-less nature of many objects, the presence of multiple similar objects, the unknown number and type of objects in the scene, and the unknown positions of cameras. Solving this problem would have, however, important applications in robotics where the knowledge of accurate position and orientation of objects within the scene would allow the robot to plan, navigate and interact with the environment.

Object pose estimation is one of the oldest computer vision problems, yet it remains an active area of research. The best performing methods that operate on RGB (no depth) images are based on trainable convolutional neural networks and are able to deal with symmetric or textureless objects, which were challenging for earlier methods relying on local or global gradient-based image features. However, most of these works consider objects independently and estimate their poses using a single input (RGB) image. Yet, in practice, scenes are composed of many objects and multiple images of the scene are often available, e.g. obtained by a single moving camera, or in a multi-camera set-up. In this work, we address these limitations and develop an approach that combines information from multiple views and jointly estimates the poses of multiple objects to obtain a single consistent scene interpretation.

While the idea of jointly estimating poses of multiple objects from multiple views may seem simple, the following challenges need to be addressed. First, object pose hypotheses made in individual images cannot easily be expressed in a common reference frame when the relative transformations between the cameras are unknown. This is often the case in practical scenarios where camera calibration cannot easily be recovered using local feature registration because the scene lacks texture or the baselines are large. Second, the single-view 6D object pose hypotheses have gross errors in the form of false positive and missed detections. Third, the candidate 6D object poses estimated from input images are noisy as they suffer from depth ambiguities inherent to single view methods.

In this work, we describe an approach that addresses these challenges. We start from 6D object pose hypotheses that we estimate from each view using a new render-and-compare approach inspired by DeepIM. First, we match individual object pose hypotheses across different views and use the resulting object-level correspondences to recover the relative positions between the cameras. Second, gross errors in object detection are addressed using a robust object-level matching procedure based on RANSAC, optimizing the overall scene consistency. Third, noisy single-view object poses are significantly improved using a global refinement procedure based on object-level bundle adjustment. The outcome of our approach that optimizes multi-view COnSistencY, hence dubbed CosyPose, is a single consistent reconstruction of the input scene. Our single-view single-object pose estimation method obtains state-of-the-art results on the YCB-Video and T-LESS datasets, achieving a significant 34.2% absolute improvement over the state-of-the-art on T-LESS. Our multi-view framework clearly outperforms prior multi-view work on YCB-Video while not requiring known camera poses and not being limited to a single object of each class per scene. On both datasets, we show that our multi-view solution significantly improves pose estimation and 6D detection accuracy over our single-view baseline.

Related work

Our work builds on results in single-view and multi-view object 6D pose estimation from RGB images and object-level SLAM.

The object pose estimation problem has been approached either by estimating the pose from 2D-3D correspondences using local invariant features, or directly by estimating the object pose using template matching. However, local features do not work well for texture-less objects and global templates often fail to detect partially occluded objects. Both of these approaches (feature-based and template matching) have been revisited using deep neural networks. A convolutional neural network (CNN) can be used to detect object features in 2D or to directly find 2D-to-3D correspondences. Deep approaches have also been used to match implicit pose features, which can be learned without requiring ground truth pose annotations. The estimated 6D pose of the objects can be further refined using an iterative procedure that effectively moves the camera around the object so that the rendered image of the object best matches the input image. Such a refinement step provides important performance improvements and is becoming common practice as a final stage of the estimation process. Our single-view single-object pose estimation described in Section 3.2 builds on DeepIM. The performance of 6D pose estimation can be further improved using depth sensors, but in this work we focus on the most challenging scenario where only RGB images are available.

Multiple views of an object can be used to resolve depth ambiguities and gain robustness with respect to occlusions. Prior work using local invariant features involves some form of feature matching to establish correspondences across views and aggregate information from multiple viewpoints. More recently, the multi-view single-object pose estimation problem has been revisited with a deep neural network that predicts an object pose candidate in each view and aggregates information from multiple views assuming known camera poses. In contrast, our work does not assume the camera poses to be known. We experimentally demonstrate that our approach outperforms this method despite requiring less information.

Other works consider all objects in a scene together in order to jointly estimate the state of the scene in the form of a compact representation of the object and camera poses in a common coordinate system. This problem is known as object-level SLAM, where a depth-based object pose estimation method is used to recognize objects from a database in individual images and estimate their poses. The individual objects are tracked across frames using depth measurements, assuming the motion of the sensor is continuous. Consecutive depth measurements also make it possible to generate camera pose hypotheses using ICP, and the poses of objects and cameras are finally refined in a joint optimization procedure. Another approach uses local RGBD patches to generate object hypotheses and find the best view of a scene. All of these methods, however, strongly rely on depth sensors to estimate the 3D structure of the scene, while our method only exploits RGB images. In addition, they assume temporal continuity between the views, which is also not required by our approach.

Other works have considered monocular RGB-only object-level SLAM. Also related is work in which semantic 2D keypoint correspondences across multiple views and local features are used to jointly estimate the pose of a single human and the positions of the observing cameras. All of these works rely on local image features to estimate camera poses. In contrast, our work exploits 6D pose hypotheses generated by a neural network, which makes it possible to recover camera poses in situations where feature-based registration fails, as is the case for example for the complex texture-less images of the T-LESS dataset. In addition, these works do not consider the full 6D pose of objects, and only consider scenes with a single instance of each object. In contrast, our method is able to handle scenes with multiple instances of the same object.

Multi-view multi-object 6D object pose estimation

In this section, we present our framework for multi-view multi-object pose estimation. We begin with an overview of the approach (Sec. 3.1 and Fig. 2), and then detail the three main steps of the approach in the remaining sections.

Our goal is to reconstruct a scene composed of multiple objects given a set of RGB images. We assume that we know the 3D models of objects of interest. However, there can be multiple objects of the same type in the scene and no information on the number or type of objects in the scene is available. Furthermore, objects may not be visible in some views, and the relative poses between the cameras are unknown. Our output is a scene model, which includes the number of objects of each type, their 6D poses and the relative poses of the cameras. Our approach is composed of three main stages, summarized in Fig. 2.

In the first stage, we build on the success of recent methods for single-view RGB object detection and 6D pose estimation. Given a set of objects with known 3D models and a single image of a scene, we output a set of candidate detections for each object and for each detection the 6D pose of the object with respect to the camera associated to the image. Note that some of these detections and poses are wrong, and some are missing. We thus consider the poses obtained in this stage as a set of initial object candidates, i.e. objects that may be seen in the given view together with an estimate of their pose with respect to this view. This object candidate generation process is described in Sec. 3.2.

In the second stage, called object candidate matching and described in detail in Sec. 3.3, we match objects visible in multiple views to obtain a single consistent scene. This is a difficult problem since object candidates from the first stage typically include many errors due to (i) heavily occluded objects that might be mis-identified or for which the pose estimate might be completely wrong; (ii) confusion between similar objects; and (iii) unusual poses that do not appear in the training set and are not detected correctly. To tackle these challenges, we take inspiration from robust patch matching strategies that have been used in the structure from motion (SfM) literature. In particular, we design a matching strategy similar in spirit to these strategies, but where we match entire 3D objects across views to obtain a single consistent 3D scene, rather than matching local 2D patches on a single 3D object.

The final stage of our approach, described in Section 3.4, is a global scene refinement. We draw inspiration from bundle adjustment, but the optimization is performed at the level of objects: the 6D poses of all objects and cameras are refined to minimize a global reprojection error.

3.2 Stage 1: object candidate generation

We introduce a method for single-view 6D object pose estimation building on the idea of DeepIM with some simplifications and technical improvements. First, we use a more recent neural-network architecture based on EfficientNet-B3 and do not include auxiliary signals while training. Second, we exploit a recently introduced rotation parametrization, which has been shown to lead to more stable CNN training than quaternions. Third, we disentangle depth and translation prediction in the loss, following prior work, and handle symmetries explicitly instead of using the point-matching loss. Fourth, instead of fixing focal lengths to 1 during training as in DeepIM, we use focal lengths of the camera equivalent to the cropped images. Fifth, in addition to the real training images supplied with both datasets, we also render a million images for each dataset using the provided CAD models for T-LESS and the reconstructed models for YCB-Video. The CNNs are first pretrained using synthetic data only, then fine-tuned on both real and synthetic images. Finally, we use data augmentation on the RGB images while training our models, which has been demonstrated to be crucial to obtain good performance on T-LESS. We also note that this approach can be used for coarse estimation simply by providing a canonical pose as the input pose estimate during both training and testing. We rendered objects at a distance of $1$ meter from the camera and used this approach to perform coarse estimation on T-LESS. Additional details are provided in the appendix.

Handling object symmetries is a major challenge for object pose estimation since the object pose can only be estimated up to a symmetry. This is in particular true for our object candidate pose estimates. We thus need to consider symmetries explicitly together with the pose estimates. Each 3D model $l$ is associated to a set of symmetries $S(l)$. Following an existing framework, we define the set of symmetries $S(l)$ as the set of transformations $S$ that leave the appearance of object $l$ unchanged:

$$S(l)=\left\{S\in SE(3)\ \text{s.t.}\ \forall X\in SE(3),\ \mathcal{R}(l,X)=\mathcal{R}(l,XS)\right\}, \tag{1}$$

where $\mathcal{R}(l,X)$ is the rendered image of object $l$ captured in pose $X$ and $S$ is the rigid motion associated to the symmetry. Note that $S(l)$ is infinite for objects that have axes of symmetry (e.g. bowls).

Given the set of symmetries $S(l)$ of the 3D object $l$, we define the symmetric distance $D_{l}$ between two 6D poses represented by transformations $T_{1}$ and $T_{2}$. Writing $\mathcal{X}_{l}$ for the set of 3D points associated to object $l$, we define:

$$D_{l}(T_{1},T_{2})=\min_{S\in S(l)}\frac{1}{|\mathcal{X}_{l}|}\sum_{x\in\mathcal{X}_{l}}\left\|T_{1}Sx-T_{2}x\right\|_{2}. \tag{2}$$

$D_{l}(T_{1},T_{2})$ measures the average error between the points transformed with $T_{1}$ and $T_{2}$ for the symmetry $S$ that best aligns the (transformed) points. In practice, to compute this distance for objects with axes of symmetry, we discretize $S(l)$ using $64$ rotation angles around each symmetry axis, similar to prior work.
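As an illustration, the symmetric distance with a discretized symmetry axis can be sketched as follows. This is a minimal sketch rather than the released implementation: it assumes that a pose is stored as a pair (rotation matrix, translation), that the symmetry axis is the object's $z$ axis, and that the distance takes the minimum over symmetries of the mean point-to-point error. All function names are hypothetical.

```python
import math

def rot_z(theta):
    """3x3 rotation matrix about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply(R, t, p):
    """Apply the rigid motion (R, t) to the 3D point p."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

def symmetric_distance(pose1, pose2, points, symmetries):
    """Minimum over symmetries S of the mean distance between T1 S x and T2 x."""
    (R1, t1), (R2, t2) = pose1, pose2
    best = math.inf
    for Rs, ts in symmetries:
        total = 0.0
        for x in points:
            a = apply(R1, t1, apply(Rs, ts, x))
            b = apply(R2, t2, x)
            total += math.dist(a, b)
        best = min(best, total / len(points))
    return best

# Assumption: a continuous symmetry about z, discretized with 64 angles.
zero = [0.0, 0.0, 0.0]
symmetries = [(rot_z(2 * math.pi * k / 64), zero) for k in range(64)]
points = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.5], [-1.0, 0.0, 1.0]]

# Two poses that differ only by a rotation about the symmetry axis.
pose_a = (rot_z(0.0), [0.0, 0.0, 1.0])
pose_b = (rot_z(2 * math.pi * 5 / 64), [0.0, 0.0, 1.0])

d_sym = symmetric_distance(pose_a, pose_b, points, symmetries)
d_plain = symmetric_distance(pose_a, pose_b, points, [(rot_z(0.0), zero)])
```

In this toy example the two poses are equivalent up to a symmetry, so `d_sym` is numerically zero, while `d_plain`, which ignores symmetries, is not.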

3.3 Stage 2: object candidate matching

As illustrated in Fig. 2, given the object candidates for all views $\{O_{a,\alpha}\}$ , our matching module aims at (i) removing the object candidates that are not consistent across views and (ii) matching object candidates that correspond to the same physical object. We solve this problem in two steps detailed below: (A) selection of candidate pairs of objects in all pairs of views, and (B) scene-level matching.

We first focus on a single pair of views $(I_{a},I_{b})$ of the scene and find all pairs of object candidates $(O_{a,\alpha},O_{b,\beta})$ , one in each view, which correspond to the same physical object in these two views. To do so, we use a RANSAC procedure where we hypothesize a relative pose between the two cameras and count the number of inliers, i.e. the number of consistent pairs of object candidates in the two views. We then select the solution with the most inliers which gives associations between the object candidates in the two views. In the rest of the section, we describe in more detail how we sample relative camera poses and how we define inlier candidate pairs.

Sampling meaningful camera poses is one of the main challenges for our approach. Indeed, directly sampling at random the space of possible camera poses would be inefficient. Instead, as usual in RANSAC, we sample pairs of object candidates (associated to the same object label) in the two views, hypothesize that they correspond to the same physical object and use them to infer a relative camera pose hypothesis. However, since objects can have symmetries, a single pair of candidates is not enough to obtain a relative pose hypothesis without ambiguities and we thus sample two pairs of object candidates, which in most cases is sufficient to disambiguate symmetries.

In detail, we sample two tentative object candidate pairs with pair-wise consistent labels $(O_{a,\alpha},O_{b,\beta})$ and $(O_{a,\gamma},O_{b,\delta})$ and use them to build a relative camera pose hypothesis, $T_{C_{a}C_{b}}$. We obtain the relative camera pose hypothesis by (i) assuming that $(O_{a,\alpha},O_{b,\beta})$ correspond to the same physical object and (ii) disambiguating symmetries by assuming that $(O_{a,\gamma},O_{b,\delta})$ also correspond to the same physical object, and thus selecting the symmetry that minimizes their symmetric distance:

$$T_{C_{a}C_{b}}=T_{C_{a}O_{a,\alpha}}\,S^{\star}\,T_{C_{b}O_{b,\beta}}^{-1}, \tag{3}$$

$$S^{\star}=\underset{S\in S(l)}{\arg\min}\ D_{l'}\left(T_{C_{a}O_{a,\gamma}},\ T_{C_{a}O_{a,\alpha}}\,S\,T_{C_{b}O_{b,\beta}}^{-1}\,T_{C_{b}O_{b,\delta}}\right), \tag{4}$$

where $l=l_{a,\alpha}=l_{b,\beta}$ is the object label associated to the first pair, $l'=l_{a,\gamma}=l_{b,\delta}$ is the label associated to the second pair, and $S^{\star}$ is the object symmetry which best aligns the point clouds associated to the second pair of objects ($O_{a,\gamma}$ and $O_{b,\delta}$). If the union of the two physical objects is symmetric, e.g. two spheres, the computed pose may be incorrect, but it would not be verified by a third pair of objects, and the hypothesis would be discarded.
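The construction of a relative camera pose hypothesis from two candidate pairs can be sketched as follows. This is a minimal sketch under stated assumptions: poses are (rotation, translation) pairs, the hypothesis is formed as $T_{C_aO_{a,\alpha}}\,S\,T_{C_bO_{b,\beta}}^{-1}$ for each symmetry $S$ of the first object, and the second object is assumed to have no symmetry of its own, so a plain mean point distance is used to score each $S$. The helper names are hypothetical.

```python
import math

def matvec(R, v):
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def compose(T1, T2):
    """Rigid motion that applies T2 first, then T1."""
    (R1, t1), (R2, t2) = T1, T2
    return (matmul(R1, R2), [a + b for a, b in zip(matvec(R1, t2), t1)])

def inverse(T):
    R, t = T
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    return (Rt, [-a for a in matvec(Rt, t)])

def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def mean_point_distance(T1, T2, points):
    (R1, t1), (R2, t2) = T1, T2
    total = 0.0
    for x in points:
        a = [u + v for u, v in zip(matvec(R1, x), t1)]
        b = [u + v for u, v in zip(matvec(R2, x), t2)]
        total += math.dist(a, b)
    return total / len(points)

def camera_pose_hypothesis(T_a_alpha, T_b_beta, T_a_gamma, T_b_delta,
                           symmetries, points_second):
    """Return (error, T_CaCb, S*) for the symmetry that best aligns the second pair."""
    best = None
    for S in symmetries:
        T_ab = compose(compose(T_a_alpha, S), inverse(T_b_beta))
        err = mean_point_distance(T_a_gamma, compose(T_ab, T_b_delta), points_second)
        if best is None or err < best[0]:
            best = (err, T_ab, S)
    return best

# Toy scene: the first object has a two-fold symmetry about its z axis.
zero = [0.0, 0.0, 0.0]
symmetries = [(rot_z(0.0), zero), (rot_z(math.pi), zero)]
T_ab_true = (rot_z(0.3), [0.5, 0.0, 0.0])      # camera b expressed in camera a
T_b_beta = (rot_z(0.1), [0.0, 0.0, 2.0])       # first object seen from camera b
T_b_delta = (rot_z(-0.4), [0.3, 0.1, 2.5])     # second object seen from camera b
# View a reports the first object in a symmetric-equivalent (flipped) pose.
T_a_alpha = compose(compose(T_ab_true, T_b_beta), symmetries[1])
T_a_gamma = compose(T_ab_true, T_b_delta)
points_second = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]

err, T_ab, S_star = camera_pose_hypothesis(
    T_a_alpha, T_b_beta, T_a_gamma, T_b_delta, symmetries, points_second)
```

In this example the first pair alone is ambiguous between two camera poses; the second pair selects the flip symmetry and recovers `T_ab_true` up to numerical precision.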

Assume we are given a relative pose hypothesis between the cameras $T_{C_{a}C_{b}}$. For each object candidate $O_{a,\alpha}$ in the first view, we find the object candidate in the second view $O_{b,\beta}$ with the same label $l=l_{a,\alpha}=l_{b,\beta}$ that minimizes the symmetric distance $D_{l}(T_{C_{a}O_{a,\alpha}},T_{C_{a}C_{b}}T_{C_{b}O_{b,\beta}})$. In other words, $O_{b,\beta}$ is the object candidate in the second view closest to $O_{a,\alpha}$ under the hypothesized relative pose between the cameras. This pair $(O_{a,\alpha},O_{b,\beta})$ is considered an inlier if the associated symmetric distance is smaller than a given threshold $C$. The total number of inliers is used to score the relative camera pose $T_{C_{a}C_{b}}$. Note that we discard hypotheses that have fewer than three inliers.
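The inlier test for a given relative camera pose hypothesis can be sketched as below. To keep the sketch short, it makes a strong simplifying assumption that is not part of the method: each pose is reduced to a 3D position, the hypothesis is a pure translation, and the distance is Euclidean. The function `count_inliers` and its arguments are hypothetical; in a full version, `to_frame_a` would compose rigid transformations and `distance` would be the symmetric distance.

```python
import math

def count_inliers(cands_a, cands_b, to_frame_a, distance, threshold):
    """Match each candidate of view a to its closest same-label candidate of view b.

    cands_a, cands_b: lists of (label, pose).
    to_frame_a: maps a pose expressed in camera b to camera a under the hypothesis.
    Returns the list of (index in a, index in b) pairs below the threshold.
    """
    inliers = []
    for i, (label_a, pose_a) in enumerate(cands_a):
        best = None
        for j, (label_b, pose_b) in enumerate(cands_b):
            if label_b != label_a:
                continue
            d = distance(label_a, pose_a, to_frame_a(pose_b))
            if best is None or d < best[0]:
                best = (d, j)
        if best is not None and best[0] < threshold:
            inliers.append((i, best[1]))
    return inliers

# Toy data (positions in meters); the hypothesis shifts camera b by 1 m along x.
offset = [1.0, 0.0, 0.0]
to_frame_a = lambda p: [p[k] + offset[k] for k in range(3)]
distance = lambda label, p, q: math.dist(p, q)

cands_a = [("mug", [0.0, 0.0, 1.0]), ("mug", [0.5, 0.0, 1.0]),
           ("box", [0.0, 0.3, 1.0]), ("can", [2.0, 2.0, 2.0])]
cands_b = [("mug", [-0.5, 0.0, 1.005]), ("mug", [-1.0, 0.0, 1.0]),
           ("box", [-1.0, 0.3, 1.01]), ("can", [0.0, 0.0, 0.0])]

inliers = count_inliers(cands_a, cands_b, to_frame_a, distance, threshold=0.02)
keep_hypothesis = len(inliers) >= 3  # hypotheses with fewer than three inliers are discarded
```

Here the two mugs and the box are matched consistently, while the "can" candidates disagree under the hypothesis and are not counted.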

We use the result of the 2-view candidate pair selection applied to each image pair to define a graph between all candidate objects. Each vertex corresponds to an object candidate in one view and edges correspond to pairs selected from 2-view candidate pair selection, i.e. pairs that had sufficient inlier support. We first remove isolated vertices, which correspond to object candidates that have not been validated by other views. Then, we associate to each connected component in the graph a unique physical object, which corresponds to a set of initial object candidates originating from different views. We call these physical objects $P_{1},\ldots,P_{N}$ with $N$ the total number of physical objects, i.e. the number of connected components in the graph. We write $(a,\alpha)\in P_{n}$ to denote the fact that an object candidate $O_{a,\alpha}$ is in the connected component of object $P_{n}$. Since all the objects in a connected component share the same object label (they could not have been connected otherwise), we can associate without ambiguity an object label $l_{n}$ to each physical object $P_{n}$.
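The scene-level matching step amounts to a connected-components computation. A minimal sketch using union-find is given below; the vertex encoding `(view index, candidate index)` and the function name are illustrative assumptions.

```python
def group_candidates(vertices, edges):
    """Group object candidates into physical objects (connected components).

    vertices: iterable of (view, candidate) identifiers.
    edges: pairs of vertices selected by the two-view matching.
    Isolated vertices (candidates not validated by another view) are dropped.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the root, with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)

    groups = {}
    for v in vertices:
        groups.setdefault(find(v), []).append(v)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)

# Three views with two candidates each; candidate (2, 1) is never matched.
vertices = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
edges = [((0, 0), (1, 0)), ((1, 0), (2, 0)), ((0, 1), (1, 1))]
physical_objects = group_candidates(vertices, edges)
```

In this example, two physical objects are recovered, and the unmatched candidate `(2, 1)` is removed.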

3.4 Stage 3: scene refinement

After the previous stage, the correspondences between object candidates in the individual images are known, and the non-coherent object candidates have been removed. The final stage aims at recovering a unique and consistent scene model by performing global joint refinement of objects and camera poses.

In detail, the goal of this stage is to estimate poses of physical objects $P_{n}$ , represented by transformations $T_{P_{1}},\ldots,T_{P_{N}}$ , and cameras $C_{v}$ , represented by transformations $T_{C_{1}},\ldots,T_{C_{V}}$ , in a common world coordinate frame. This is similar to the standard bundle adjustment problem where the goal is to recover the 3D points of a scene together with the camera poses. This is typically addressed by minimizing a reconstruction loss that measures the 2D discrepancies between the projection of the 3D points and their measurements in the cameras. In our case, instead of working at the level of points as done in the bundle adjustment setting, we introduce a reconstruction loss that operates at the level of objects.

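More formally, for each object present in the scene, we introduce an object-candidate reprojection loss accounting for symmetries. We define the loss for a candidate object $O_{a,\alpha}$ associated to a physical object $P_{n}$ (i.e. $(a,\alpha)\in P_{n}$) and the estimated candidate object pose $T_{C_{a}O_{a,\alpha}}$ with respect to $C_{a}$ as:

$$L\left(T_{P_{n}},T_{C_{a}}\mid T_{C_{a}O_{a,\alpha}}\right)=\min_{S\in S(l)}\frac{1}{|\mathcal{X}_{l}|}\sum_{x\in\mathcal{X}_{l}}\left\|\pi_{a}\left(T_{C_{a}O_{a,\alpha}}Sx\right)-\pi_{a}\left(T_{C_{a}}^{-1}T_{P_{n}}x\right)\right\|, \tag{5}$$

where $\|\cdot\|$ is a truncated $L_{2}$ loss, $l=l_{n}$ is the label of the physical object $P_{n}$, $\mathcal{X}_{l}$ is the set of 3D points associated to the 3D model of object $l$, $S(l)$ is the set of symmetries of object $l$, and $\pi_{a}$ projects 3D points expressed in the frame of camera $C_{a}$ to the image using the intrinsic calibration of $C_{a}$.

Recovering the state of the unique scene which best explains the measurements consists in solving the following consensus optimization problem:

$$\min_{T_{P_{1}},\ldots,T_{P_{N}},\,T_{C_{1}},\ldots,T_{C_{V}}}\ \sum_{n=1}^{N}\ \sum_{(a,\alpha)\in P_{n}}L\left(T_{P_{n}},T_{C_{a}}\mid T_{C_{a}O_{a,\alpha}}\right), \tag{6}$$

where the first sum is over all the physical objects $P_{n}$ and the second one over all object candidates $O_{a,\alpha}$ corresponding to the physical object $P_{n}$. In other words, we wish to find global estimates of object poses $T_{P_{n}}$ and camera poses $T_{C_{a}}$ to match the (inlier) object candidate poses $T_{C_{a}O_{a,\alpha}}$ obtained in the individual views. The optimization problem is solved using the Levenberg-Marquardt algorithm. We provide more details in the appendix.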
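To make the object-level reprojection term concrete, the following sketch evaluates it for one candidate. It is a minimal sketch under stated assumptions, not the authors' implementation: a pinhole camera with intrinsics `(fx, fy, cx, cy)`, poses stored as (rotation, translation) pairs, a per-point error truncated at a hypothetical threshold `trunc` in pixels, and a minimum over a finite list of symmetries. The optimizer itself is omitted.

```python
import math

def matvec(R, v):
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def compose(T1, T2):
    """Rigid motion that applies T2 first, then T1."""
    (R1, t1), (R2, t2) = T1, T2
    return (matmul(R1, R2), [a + b for a, b in zip(matvec(R1, t2), t1)])

def inverse(T):
    R, t = T
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    return (Rt, [-a for a in matvec(Rt, t)])

def project(T, x, fx, fy, cx, cy):
    """Pinhole projection of the object point x placed at pose T in the camera frame."""
    R, t = T
    X, Y, Z = (a + b for a, b in zip(matvec(R, x), t))
    return (fx * X / Z + cx, fy * Y / Z + cy)

def candidate_loss(T_obj, T_cam, T_cam_obj, points, symmetries, intrinsics, trunc):
    """Reprojection error between a single-view candidate and the global estimate.

    T_obj, T_cam: global object and camera poses in the world frame.
    T_cam_obj: single-view candidate pose of the object in the camera frame.
    """
    T_global = compose(inverse(T_cam), T_obj)  # object in the camera frame
    best = math.inf
    for S in symmetries:
        T_single = compose(T_cam_obj, S)
        err = sum(min(math.dist(project(T_single, x, *intrinsics),
                                project(T_global, x, *intrinsics)), trunc)
                  for x in points) / len(points)
        best = min(best, err)
    return best

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
identity = (I3, [0.0, 0.0, 0.0])
points = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [-0.1, -0.1, 0.0]]
intrinsics = (500.0, 500.0, 320.0, 240.0)

T_cam = identity                   # camera at the world origin
T_obj = (I3, [0.01, 0.0, 1.0])     # global estimate: 1 cm off along x
T_cam_obj = (I3, [0.0, 0.0, 1.0])  # single-view candidate

loss = candidate_loss(T_obj, T_cam, T_cam_obj, points, [identity], intrinsics, trunc=100.0)
loss_truncated = candidate_loss(T_obj, T_cam, T_cam_obj, points, [identity], intrinsics, trunc=2.0)
```

With a focal length of 500 pixels and an object at 1 m depth, the 1 cm offset corresponds to a 5 pixel reprojection error; truncation caps the contribution of each point. The full objective sums this term over all physical objects and their inlier candidates.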

Results

In this section, we experimentally evaluate our method on the YCB-Video and T-LESS datasets, which both provide multiple views and ground truth 6D object poses for cluttered scenes with multiple objects. In Sec. 4.1, we first validate and analyze our single-view single-object 6D pose estimator. We notably show that our single-view single-object 6D pose estimation method already improves state-of-the-art results on both datasets. In Sec. 4.2, we validate our multi-view multi-object framework by demonstrating consistent improvements over the single-view baseline.

Following prior work, we evaluate on a subset of 2949 keyframes from videos of the 12 testing scenes. We use the standard ADD-S and ADD(-S) metrics and their areas under the curve (please see the appendix for details on the metrics). We evaluate our refinement method using the same detections and coarse estimates as DeepIM, provided by PoseCNN. We ran two iterations of the pose refinement network. Results are shown in Table 1a. Our method improves over the current state of the art, DeepIM, by approximately 2 points on the AUC of the ADD-S and ADD(-S) metrics.

4.2 Multi-view experiments

As shown above, our single-view method achieves state-of-the-art results on both datasets. We now evaluate the performance of our multi-view approach to estimate 6D poses in scenes with multiple objects and multiple views.

On both datasets, we use the same hyper-parameters. In stage 1, we only consider object detections with a score greater than 0.3 to limit the number of detections. In stage 2, we use a RANSAC 3D inlier threshold of $C=2\,$cm. This low threshold ensures that no outliers are considered while associating object candidates. We use a maximum number of $2000$ RANSAC iterations for each pair of views, but this limit is only reached for the most complex scenes of the T-LESS dataset containing tens of detections. For instance, in the context of two views with six different 6D object candidates in each view, 15 RANSAC iterations are enough to explore all relative camera pose hypotheses. For the scene refinement (stage 3), we use 100 iterations of Levenberg-Marquardt (the optimization typically converges in less than 10 iterations).
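The count of 15 hypotheses can be checked with a short enumeration. This sketch assumes that "six different candidates" means six candidates with distinct labels in each view, so that there are six label-consistent pairs and each hypothesis is seeded by an unordered choice of two of them; the function name is hypothetical.

```python
from itertools import combinations

def hypothesis_seeds(labels_a, labels_b):
    """Enumerate unordered couples of label-consistent candidate pairs.

    Each seed is ((i1, j1), (i2, j2)), where i indexes candidates of view a
    and j indexes candidates of view b. Seeds reusing a candidate are skipped.
    """
    pairs = [(i, j) for i, la in enumerate(labels_a)
             for j, lb in enumerate(labels_b) if la == lb]
    seeds = []
    for (i1, j1), (i2, j2) in combinations(pairs, 2):
        if i1 != i2 and j1 != j2:
            seeds.append(((i1, j1), (i2, j2)))
    return seeds

labels = ["mug", "bowl", "drill", "can", "box", "clamp"]  # illustrative labels
seeds = hypothesis_seeds(labels, labels)
num_hypotheses = len(seeds)
```

With six label-consistent pairs, the number of seeds is $\binom{6}{2}=15$, which matches the figure quoted above. Scenes with repeated labels produce more label-consistent pairs and therefore more hypotheses.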

In the single-view evaluation, the poses of the objects are expressed with respect to the camera frame. To fairly compare with the single-view baseline, we also evaluate the object poses in the camera frames, which we compute using the absolute object poses and camera placements estimated by our global scene refinement method. Standard metrics for 6D pose estimation strongly penalize methods with low detection recall. To avoid being penalized for removing objects that cannot be verified across several views, we thus add the initial object candidates to the set of predictions but with confidence scores strictly lower than the predictions from our full scene reconstruction.

The problem that we consider, recovering the 6D object poses of multiple known objects in a scene captured by several RGB images taken from unknown viewpoints, has not, to the best of our knowledge, been addressed by prior work reporting results on the YCB-Video and T-LESS datasets. The closest prior work considers multi-view scenarios on YCB-Video and uses ground truth camera poses to align the viewpoints; it provides results for prediction using 5 views. We use our approach with the same number of input images but without using ground truth calibration and report results in Table 2a. Our method significantly outperforms this work in both single-view and multi-view scenarios.

To demonstrate the benefits of global scene refinement (stage 3), we report in Table 3 the average ADD-S errors of the inlier candidates before and after solving the optimization problem of Eq. (6). We note a clear relative improvement, around 20% on both datasets.

A key feature of our method is that it does not require the camera poses to be known and instead robustly estimates them from the 6D object candidates. We investigated alternatives to our joint camera pose estimation. First, we used COLMAP, a popular feature-based SfM software, to recover camera poses. On randomly sampled groups of 5 views from the YCB-Video dataset, COLMAP outputs camera poses in only $67\%$ of cases, compared to $95\%$ for our method. On groups of 8 views from the more difficult T-LESS dataset, COLMAP outputs camera poses in only 4% of cases, compared to 74% for our method. Our method therefore recovers camera poses far more often than feature-based registration with COLMAP, especially for complex textureless scenes such as those in the T-LESS dataset. Second, instead of estimating camera poses using our approach, we investigated using ground truth camera poses available for the two datasets. We found that the improvements using ground truth camera poses over the camera poses recovered automatically by our method were only minor: within $1\%$ for T-LESS (4 views) and YCB-Video (5 views), and within $3\%$ for T-LESS (8 views). This demonstrates that our approach recovers accurate camera poses even for scenes containing only symmetric objects as in the T-LESS dataset.

We provide examples of recovered 6D object poses in Fig. 3 where we show both object candidates and the final estimated scenes. Please see the appendix for additional results, including detailed discussion of failure modes. Results on the YCB-Video dataset are available on the project webpage (https://www.di.ens.fr/willow/research/cosypose/).

For a common case with 4 views and six 2D detections per view, our approach takes approximately 320 ms to predict the state of the scene. This timing includes: 190 ms for estimating the 6D poses of all candidates (stage 1, 1 iteration of the coarse and refinement networks), 40 ms for the object candidate association (stage 2) and 90 ms for the scene refinement (stage 3). Further speed-ups towards real-time performance could be achieved, for example, by exploiting temporal continuity in a video sequence.

Conclusion

We have developed an approach, dubbed CosyPose, for recovering the 6D pose of multiple known objects viewed by several non-calibrated cameras. Our main contribution is to combine learnable 6D pose estimation with robust multi-view matching and global refinement to reconstruct a single consistent scene. Our approach explicitly handles object symmetries, does not require depth measurements, is robust to missing and incorrect object hypotheses, and automatically recovers the camera poses and the number of objects in the scene. These results make a step towards the robustness and accuracy required for visually driven robotic manipulation in unconstrained scenarios with moving cameras, and open up the possibility of including object pose estimation in an active visual perception loop.

Acknowledgments

This work was partially supported by the HPC resources from GENCI-IDRIS (Grant 011011181), the European Regional Development Fund under the project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15_003/0000468), Louis Vuitton ENS Chair on Artificial Intelligence, and the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).

Appendix

The appendix is organized as follows. In Sec. A, we give more details of our single-view single-object 6D object pose estimator. In Sec. B, we illustrate the object candidate matching strategy on a simple 2D example. In Sec. C, we give additional details about our parametrization and initialization of the object-level bundle adjustment problem, introduced in Sec. 3.4 of the main paper. Sec. D presents the datasets used in the main paper and recalls the metrics that are used for each dataset. Finally, in Sec. E, we present additional qualitative results of our multi-view multi-object 6D pose estimation approach. We discuss in detail some examples to illustrate key benefits of our method as well as point out the main limitations. Examples randomly selected from the results on the T-LESS and YCB-Video datasets are available on the project webpage (https://www.di.ens.fr/willow/research/cosypose/).

Appendix A Our single-view single-object method

We now detail our single-view single-object pose estimation network introduced in Sec. 3.2 of the main paper. Our method builds on DeepIM but includes several extensions and improvements.

The network takes as input the concatenation of the synthetic and real cropped images. Both images are resized to the input resolution: $320\times 240$. The backbone is EfficientNet-B3 followed by spatial average pooling. The prediction layer is a simple fully connected layer which outputs 9 values corresponding to one vector $[v_{x},v_{y},v_{z}]$ for the translation and two vectors $e_{1},e_{2}$ to predict the rotation component of $T_{CO}$. A rotation matrix $R$ is recovered by simply orthogonalizing the basis defined by the two vectors $e_{1},e_{2}$. Please see “Rotation parametrization” for the equations to recover the rotation matrix $R$ from $e_{1},e_{2}$. Compared to DeepIM, the main difference is that we use a more recent network architecture (DeepIM is based on FlowNet) and we do not include auxiliary predictions of flow and mask. This makes the method simpler and easier to train. Our input resolution of $320\times 240$ is also smaller than the $640\times 480$ used by DeepIM, which reduces memory consumption and allows larger batches while training.

A.0.2 Transformation parametrization.

Similar to DeepIM, we use the object-independent rotation and translation parametrization which consists in predicting a rotation of the camera around the object, an $xy$ translation $[v_{x},v_{y}]$ in image space (in pixels) for the center of the rendered object and a relative displacement $v_{z}$ along the depth axis of the camera. Given the input pose $T_{CO}^{k}$ and the outputs of the network ($[v_{x},v_{y},v_{z}]$ and $R=f(e_{1},e_{2})$), the pose update is obtained from the following equations:

$$x^{k+1}=\left(\frac{v_{x}}{f_{x}^{C}}+\frac{x^{k}}{z^{k}}\right)z^{k+1}, \tag{7}$$

$$y^{k+1}=\left(\frac{v_{y}}{f_{y}^{C}}+\frac{y^{k}}{z^{k}}\right)z^{k+1}, \tag{8}$$

$$z^{k+1}=v_{z}\,z^{k}, \tag{9}$$

$$R^{k+1}=R\,R^{k}, \tag{10}$$

where $[x^{k},y^{k},z^{k}]$ is the 3D translation vector of $T_{CO}^{k}$, $R^{k}$ the rotation matrix of $T_{CO}^{k}$, and $f_{x}^{C}$ and $f_{y}^{C}$ are the focal lengths that correspond to the (fictive) camera associated with the cropped input image $I^{C}$. Finally, $[x^{k+1},y^{k+1},z^{k+1}]$ and $R^{k+1}$ are the parameters of the output pose estimate $T_{CO}^{k+1}$. The differences with DeepIM are twofold. First, we use a linear parametrization of the relative depth (eq. (9)), instead of $z^{k+1}=z^{k}e^{-v_{z}}$, which we found more stable to train. Second, we use the intrinsics $f_{x}^{C}$, $f_{y}^{C}$ of the cropped camera associated with the input (cropped) image. DeepIM uses the intrinsic parameters of the non-cropped camera $f_{x}$, $f_{y}$ and fixes them to $1$ during training because the intrinsic parameters of the input camera are fixed on their datasets. We use the cropped focal lengths instead because (a) cropping and resizing the crop of the input image changes the apparent focal length and (b) the focal lengths of the input images are not unique on T-LESS. Using the cropped focal lengths forces the network to only predict $xy$ translations in pixels and the network can therefore become invariant to the intrinsic parameters of the input (cropped) camera.
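A minimal sketch of this pose update is given below. It assumes the update takes the DeepIM-style form with a linear depth update, i.e. $z^{k+1}=v_{z}z^{k}$, $x^{k+1}=(v_{x}/f_{x}^{C}+x^{k}/z^{k})\,z^{k+1}$ (and similarly for $y$), and $R^{k+1}=RR^{k}$. The function name and argument layout are hypothetical.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def update_pose(R_k, t_k, v, R, fx_c, fy_c):
    """Apply the network outputs (v, R) to the current pose estimate (R_k, t_k).

    v = (v_x, v_y, v_z): pixel-space xy translation and relative depth.
    fx_c, fy_c: focal lengths of the camera associated with the cropped image.
    """
    x_k, y_k, z_k = t_k
    v_x, v_y, v_z = v
    z_new = v_z * z_k                           # linear depth update
    x_new = (v_x / fx_c + x_k / z_k) * z_new    # shift of the projected center along x
    y_new = (v_y / fy_c + y_k / z_k) * z_new    # shift of the projected center along y
    return matmul(R, R_k), [x_new, y_new, z_new]

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_k = [0.2, -0.1, 2.0]

# Neutral prediction: no pixel shift, unit depth ratio, identity rotation.
R_same, t_same = update_pose(I3, t_k, (0.0, 0.0, 1.0), I3, 500.0, 500.0)

# A 10 pixel shift along x with a 500 pixel focal length at 2 m depth.
R_new, t_new = update_pose(I3, t_k, (10.0, 0.0, 1.0), I3, 500.0, 500.0)
```

The neutral prediction leaves the pose unchanged, and the 10 pixel shift moves the object by $10/500\times 2=0.04$ m along $x$, which illustrates why predicting in pixels decouples the network output from the camera intrinsics.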

A.0.3 Rotation parametrization.

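Given two vectors $e_{1}$ and $e_{2}$ (6 values) predicted by the neural network, we recover a rotation matrix $R$ as follows:

$$e_{3}=e_{1}\wedge e_{2}, \tag{11}$$

$$e_{2}^{\prime}=e_{3}\wedge e_{1}, \tag{12}$$

$$R=\left[\frac{e_{1}}{\|e_{1}\|_{2}},\ \frac{e_{2}^{\prime}}{\|e_{2}^{\prime}\|_{2}},\ \frac{e_{3}}{\|e_{3}\|_{2}}\right], \tag{13}$$

where $\wedge$ is the cross product between two 3D vectors. This representation has been shown to be better than quaternions (used by DeepIM) to regress with a neural network.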
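The orthogonalization can be sketched in a few lines. This sketch assumes the construction $e_{3}=e_{1}\wedge e_{2}$, $e_{2}^{\prime}=e_{3}\wedge e_{1}$, with the three normalized vectors used as the columns of $R$; the function names are hypothetical, and the degenerate case of parallel inputs is not handled.

```python
import math

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def rotation_from_two_vectors(e1, e2):
    """Build a rotation matrix whose columns are an orthonormal basis from e1, e2."""
    e3 = cross(e1, e2)        # orthogonal to both inputs
    e2_prime = cross(e3, e1)  # completes the right-handed basis
    cols = [normalize(e1), normalize(e2_prime), normalize(e3)]
    return [[cols[j][i] for j in range(3)] for i in range(3)]

# Inputs need not be unit length or mutually orthogonal.
R_axis = rotation_from_two_vectors([2.0, 0.0, 0.0], [1.0, 3.0, 0.0])
R = rotation_from_two_vectors([1.0, 1.0, 0.0], [0.0, 1.0, 1.0])
```

Any pair of non-parallel vectors yields a valid rotation matrix, so the network output does not need to satisfy a unit-norm or orthogonality constraint. In the first example, the inputs span the $xy$ plane with $e_{1}$ along $x$, so the result is the identity.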

A.0.4 Cropping strategy.

DeepIM uses (a) the input 2D detections and (b) the bounding box defined by $T_{CO}^{k}$ and the vertices of the object $l$ to define the size and location of the crop in the real input image during training. Indeed, the ground truth bounding box is known during training. At test time, only (b) is used by DeepIM because ground truth bounding boxes are not available. In our case, we only use (b) while training and testing. The intrinsic parameters of the cropped camera are also used to directly render the cropped synthetic image at a resolution of $320\times 240$ instead of rendering at a larger resolution followed by cropping.

A.0.5 Symmetric disentangled loss.

A standard loss for 6D pose estimation is ADD-S , which makes it possible to predict the pose of symmetric objects. Our loss is inspired by the ADD-S loss, with two main differences. First, we enumerate all the possible symmetries to find the best matching between the vertices of the predicted model and the ground truth model instead of finding the nearest neighbors. This is similar in spirit to the approach of to handle object symmetries. Second, we disentangle depth $v_{z}$ and translation predictions $v_{x},v_{y}$ , following the recommendations from .

More formally, we define the update function $F$ which takes as input the initial estimate of the pose $T_{CO}^{k}$ , the outputs of the neural network $[v_{x},v_{y},v_{z}]$ and $R$ , and outputs the updated pose, i.e. the function such that

where the closed form of $F$ is expressed in equations (7), (8), (9) and (10) of the appendix. We also denote by $[\hat{v}_{x},\hat{v}_{y},\hat{v}_{z}]$ and $\hat{R}$ the target predictions, i.e. the predictions such that $\hat{T}_{CO}=F(T_{CO}^{k},[\hat{v}_{x},\hat{v}_{y},\hat{v}_{z}],\hat{R})$ , where $\hat{T}_{CO}$ is the ground truth pose of the object. Our loss function is then:

where $D_{l}$ is the symmetric distance defined in the Sec. 3.2 of the main paper, with the $L_{2}$ norm replaced by the $L_{1}$ norm. The different terms of this loss separate the influence of: $xy$ translation (15), relative depth (16) and rotation (17). We refer to for additional explanations of the loss disentanglement.
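The symmetry enumeration used by the loss can be sketched as follows. This is a minimal sketch that assumes the symmetric distance is the minimum, over a finite list of symmetry transformations of the object, of the mean $L_{1}$ distance between corresponding vertices; poses and symmetries are represented as hypothetical `(R, t)` pairs, and the disentangled terms of the loss are not reproduced here.

```python
def apply(R, t, p):
    """Apply the rigid transformation (R, t) to the 3D point p."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def symmetric_l1_distance(pose_a, pose_b, points, symmetries):
    """Minimum over symmetries S of mean_x || T_a S x - T_b x ||_1.

    pose_a, pose_b : (R, t) pairs, R a 3x3 list of rows, t a 3-vector
    points         : model vertices, list of 3-vectors
    symmetries     : list of (R, t) pairs; it should contain the identity
    """
    best = float("inf")
    for sym_R, sym_t in symmetries:
        total = 0.0
        for p in points:
            pa = apply(pose_a[0], pose_a[1], apply(sym_R, sym_t, p))
            pb = apply(pose_b[0], pose_b[1], p)
            total += sum(abs(u - v) for u, v in zip(pa, pb))
        best = min(best, total / len(points))
    return best
```

Unlike a nearest-neighbor search, the correspondence between vertices is fixed once a symmetry is chosen, so the distance is zero only when the two poses are equal up to one of the enumerated symmetries.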

A.0.6 Coarse estimation.

To perform coarse estimation on T-LESS, we use the same network architecture, parametrization and losses defined above. As input $T_{CO}^{0}$ we provide a canonical input pose that corresponds to the object being rendered at a distance of $1$ meter from the camera, in the center of the input 2D bounding box. The coarse and refinement networks use the same architecture, but the weights are distinct. Each network is trained independently.

A.0.7 Training data.

Due to the complexity of annotating real data with 6D pose at large scale, most recent methods generate additional synthetic training data. In our experiments, we use the real training images provided by YCB-Video and the images of the real objects displayed individually on black backgrounds provided by T-LESS. In addition, we generate one million synthetic training images on each dataset using a simple procedure described next.

We randomly sample 3 to 9 objects from the set of 3D models considered, place them randomly in a 3D box of size 50 cm and randomly sample the orientation of each object. Half of the images are generated with objects flying in the air; the other half is generated by taking the images after running a physics simulation for a few seconds, which produces physically feasible object configurations. This is similar to the approach described in , though none of our rendered images are photorealistic. The camera is pointed at the center of the 3D box, its position is sampled uniformly above the box center within the same range of distances as in the real training data, and its roll angle is sampled in the interval (-10, 10) degrees. On T-LESS, the distance to the object is fixed in the real training images and we use instead the range of distances of the testing set provided (which is explicitly allowed by the guidelines of the BOP challenge; see https://bop.felk.cvut.cz/challenges/, Sec. 2.2). We do not use any information from the testing set besides this distance interval.

On the T-LESS dataset, we generate data using the CAD models only. We add random textures on the CAD models following work on domain randomization . We also paste images from the Pascal VOC dataset in the background with probability 0.3, following . On both datasets, we add data augmentation to the input RGB images while training, following . Data augmentation includes Gaussian blur, contrast, brightness, color and sharpness filters from the Pillow library .

Examples of training images are shown in Fig. 4. Finally, when training the refinement network, we use the same distribution as DeepIM for the input poses.

A.0.8 Training procedure.

All of the networks (refinement network on YCB-Video, coarse network on T-LESS, refinement network on T-LESS) are trained using the same procedure. We use the Adam optimizer with a learning rate of $3\cdot 10^{-4}$ and default momentum parameters. Networks are trained using PyTorch and synchronous distributed training on $32$ GPUs, with $32$ images per GPU for a total batch size of $1024$ . The networks are randomly initialized and we use the following training procedure. First, the network is trained for $80$ k iterations on synthetic data only. Then, the network is trained for another $80$ k iterations on both real and synthetic training images. In this second phase, the real training images account for around $25\%$ of each batch. Following , we also use a warm-up phase where we progressively increase the learning rate up to $3\cdot 10^{-4}$ during the first $5$ k iterations.
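The batch-size arithmetic and the warm-up phase can be sketched as follows. This is a minimal sketch: the linear shape of the warm-up is an assumption (the text only states that the learning rate is increased progressively), and the starting learning rate is left as a required argument because its value is not given here.

```python
def total_batch_size(num_gpus, images_per_gpu):
    """Effective batch size of synchronous distributed training."""
    return num_gpus * images_per_gpu


def warmup_learning_rate(iteration, start_lr, base_lr=3e-4, warmup_iters=5000):
    """Learning rate at a given iteration with a linear warm-up (assumed shape).

    start_lr is hypothetical: the value used at iteration 0 must be
    supplied by the caller. After warmup_iters, base_lr is returned.
    """
    if iteration >= warmup_iters:
        return base_lr
    fraction = iteration / warmup_iters
    return start_lr + fraction * (base_lr - start_lr)
```

For example, `total_batch_size(32, 32)` gives the batch size of 1024 stated above, of which roughly a quarter (about 256 images) are real images during the second training phase.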

A.0.9 Experimental findings.

On YCB-Video, we found that pre-training the model on synthetic data yields an improvement of approximately $2$ points on the AUC of ADD(-S) metric. Without this pre-training phase, our model performed comparably to the results reported by DeepIM. Note that this is hard to directly compare because the synthetic training images are different from the ones used by DeepIM.

On T-LESS, we found that data augmentation is crucial, as also pointed out by . Without data augmentation, the performance of the coarse and refinement networks is poor, with an $e_{\text{vsd}}<0.3$ score of around 37% compared to 64% when training with data augmentation.

Appendix 0.B Object candidate matching: additional illustration

In Fig. 5, we illustrate our method for “Sampling of relative camera poses” described in Sec. 3.3 of the main paper with a simple 2D example.

Appendix 0.C Scene refinement

There are multiple ways to initialize the optimization problem defined in equation (6) of the main paper. We use the following procedure. We start by picking a random camera and setting its coordinate frame as the world coordinate frame. Then, we iterate over all cameras, trying to initialize each one. In order to initialize a camera $a$ , we randomly sample another camera $b$ which is already initialized (placed in the world coordinate frame) and use the relative pose between these two cameras $T_{C_{a}C_{b}}$ estimated while running RANSAC (relative camera pose sampling in Sec. 3.2) to place camera $a$ in the world coordinate frame. Once all the cameras have been initialized, we initialize objects by randomly picking an object $p$ and initializing it using a candidate associated with this physical object from a random view.
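The camera initialization can be sketched as follows. This is a minimal sketch under stated assumptions: poses are $4\times 4$ homogeneous matrices, $T_{C_{a}C_{b}}$ is taken to map points from the frame of camera $b$ to the frame of camera $a$ , so that $T_{WC_{a}}=T_{WC_{b}}\,T_{C_{a}C_{b}}^{-1}$ , and a camera with no relative pose to an initialized camera is retried in a later pass. The function and variable names are hypothetical.

```python
import random


def matmul4(A, B):
    """Product of two 4x4 matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def inverse_rigid(T):
    """Inverse of a rigid transformation [R t; 0 1], i.e. [R^T -R^T t; 0 1]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3)]
    return [Rt[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def initialize_cameras(camera_ids, relative_poses, rng):
    """Place cameras in a common world frame.

    relative_poses[(a, b)] holds T_{C_a C_b} (convention assumed above).
    Returns a dict mapping each initialized camera to T_{W C}.
    """
    first = rng.choice(camera_ids)
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    world_from_cam = {first: identity}  # this camera defines the world frame
    progress = True
    while progress and len(world_from_cam) < len(camera_ids):
        progress = False
        for a in camera_ids:
            if a in world_from_cam:
                continue
            candidates = [b for b in world_from_cam if (a, b) in relative_poses]
            if not candidates:
                continue  # retried in the next pass, if any progress is made
            b = rng.choice(candidates)
            world_from_cam[a] = matmul4(world_from_cam[b],
                                        inverse_rigid(relative_poses[(a, b)]))
            progress = True
    return world_from_cam
```

Calling `initialize_cameras(ids, relative_poses, random.Random(0))` returns one pose per camera that could be chained to the first one; these poses only serve as the starting point of the optimization.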

C.0.2 Rotation parametrization.

We use the same rotation parametrization as the one used for our single-view single-object network for which the equations are provided in Sec. 0.A of this appendix.

Appendix 0.D Datasets and metrics

In this section, we give details of the datasets used in our experiments.

The YCB-Video dataset is made of 92 scenes with around 1000 images per scene. The dataset is split into 80 scenes for training and 12 scenes for testing. It is mostly challenging due to the variations in lighting conditions, significant image noise and occlusions. The objects are picked from a subset of 21 objects from the YCB object set for which reconstructed 3D models are available. The models are presented in Fig. 7. These models are used to generate additional synthetic training images.

There is at most one instance of each object per scene and most of the objects are visually distinct, with the exception of the large and extra-large clamps. When testing, we follow previous works and evaluate on a subset of 2949 keyframes. The variety of the viewpoints for each scene is limited as the camera is usually moved in front of the scene, but not completely around it.

D.1.2 T-LESS.

The T-LESS dataset is made of 20 scenes featuring multiple industry-relevant objects. There are 30 object instances; all of them are textureless and most of them are symmetric. The reconstructed 3D models of these objects are presented in Fig. 7. Many objects have similar visual appearance, making the class prediction task challenging for the object detector. The images in the dataset are taken all around the scene. Scene complexity varies from 3 objects of different types to up to 18 objects with 7 belonging to the same type. In single-view experiments we consider all images of the testing scenes to provide meaningful comparison with . For multi-view experiments we consider the subset of the BOP19 challenge . We use the CAD models for generating synthetic images and for evaluation.

D.2 Metrics

In this section, we give some details about the metrics reported in the main paper. We refer to for more information about these metrics.

The ADD (average distance) metric is introduced in and is typically used to measure the accuracy of pose estimation for non-symmetric objects. Given a label $l$ of an object and following the notation introduced in Sec. 3.2 of the main paper, this metric is computed as :

where $T$ is the predicted object pose, $\hat{T}$ is the ground truth pose, $X_{l}^{h}$ are the vertices of the 3D models and $H_{l}$ is the number of vertices of the model of the object $l$ .

For symmetric objects, the average distance is computed using the closest point distance and noted ADD-S:

The notation ADD(-S) corresponds to computing ADD for non-symmetric objects and ADD-S for symmetric objects. It is also common to report the percentage of objects for which the pose is estimated within a given threshold such as 10% of its diameter. We use the notations ADD-S $<$ 0.1d and ADD(-S) $<$ 0.1d for this metric and report the mean computed over object types.
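The two errors and the thresholded accuracy can be sketched as follows, using the usual definitions: ADD averages the distance between corresponding vertices under the two poses, and ADD-S averages the distance from each vertex under one pose to its closest vertex under the other. This is a minimal sketch, not the evaluation code used for the reported numbers; the direction of the closest-point search and the function names are assumptions.

```python
import math


def transform(R, t, p):
    """Apply the rigid transformation (R, t) to the 3D point p."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def add_error(pose_pred, pose_gt, vertices):
    """ADD: mean distance between corresponding transformed vertices."""
    total = 0.0
    for x in vertices:
        total += math.dist(transform(pose_pred[0], pose_pred[1], x),
                           transform(pose_gt[0], pose_gt[1], x))
    return total / len(vertices)


def adds_error(pose_pred, pose_gt, vertices):
    """ADD-S: mean distance to the closest transformed ground truth vertex.

    Quadratic in the number of vertices; a spatial index would be used
    for large models.
    """
    gt_points = [transform(pose_gt[0], pose_gt[1], x) for x in vertices]
    total = 0.0
    for x in vertices:
        p = transform(pose_pred[0], pose_pred[1], x)
        total += min(math.dist(p, q) for q in gt_points)
    return total / len(vertices)


def is_correct(error, diameter, ratio=0.1):
    """True when the error is below the given fraction of the diameter."""
    return error < ratio * diameter
```

For an object that is symmetric under a half-turn, a prediction that is off by exactly this half-turn has a large ADD error but a zero ADD-S error, which is why ADD-S is used for symmetric objects.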

The authors of PoseCNN also proposed to report the area under the accuracy-threshold curve for a threshold (on ADD-S, or ADD(-S)) varying from 0 to 10 cm. We note this metric as AUC of ADD(-S) or AUC of ADD-S and we use the implementation provided with the evaluation code of YCB-Video (https://github.com/yuxng/YCB_Video_toolbox).

When evaluating on the T-LESS dataset, we also report the Visual Surface Discrepancy metric (vsd). This metric is invariant to object symmetries and takes into account the visibility of the object. As in , the pose is considered correct when the error is less than 0.3 with $\tau=20$ mm and $\delta=15$ mm. We note this metric $e_{\text{vsd}}<0.3$ and use the official implementation code of the BOP challenge (https://github.com/thodan/bop_toolkit). There are multiple instances of objects in multiple scenes of the T-LESS dataset. When comparing with prior work on all images of the Primesense camera, we only evaluate the prediction which has the highest detection score for each class, and only objects visible more than 10% are considered as ground truth targets. This corresponds to the SiSo task.

When evaluating our multi-view method, we follow the more recent 6D localization protocol of the ViVo BOP challenge which considers the top- $k$ predictions with highest score for each class in each image, where $k$ is the number of ground truth objects of the class in the scene. Note that the metrics of the BOP challenge do not penalize making many incorrect predictions for classes that are not in the scene, which happens in most methods and is problematic for practical applications. We thus propose to analyze the precision-recall tradeoff, similar to the standard practice in object detection, using ADD-S $<$ 0.1d to count true positives.
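The top- $k$ selection of this protocol can be sketched as follows. This is a minimal sketch and not the code of the BOP toolkit; the dictionary keys `label` and `score` and the function name are hypothetical.

```python
from collections import defaultdict


def select_top_k(predictions, gt_counts):
    """Keep, for each class, the k highest-scoring predictions of one image.

    predictions : list of dicts with keys "label" and "score"
    gt_counts   : dict mapping a class label to k, the number of ground
                  truth objects of this class in the scene
    Predictions of classes that are absent from the scene (k = 0) are
    dropped, so they are never counted as errors by this protocol.
    """
    by_label = defaultdict(list)
    for pred in predictions:
        by_label[pred["label"]].append(pred)
    kept = []
    for label, preds in by_label.items():
        k = gt_counts.get(label, 0)
        ranked = sorted(preds, key=lambda p: p["score"], reverse=True)
        kept.extend(ranked[:k])
    return kept
```

The silent removal of predictions for absent classes is what motivates the additional precision-recall analysis, in which every prediction is counted.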

When computing the mean of ADD-S errors in our scene refinement ablation, we only consider as true positives the predictions which have an ADD-S error lower than half of the diameter of the object, to ensure that the prediction is matched to the correct ground truth object. Without limiting the error to this threshold and using only class labels and scores, some predictions may be matched to ground truth objects which are at a very different location in the scene. This tends to increase the errors, which are then no longer representative of the 6D pose accuracy of the predictions alone.

Appendix 0.E Additional multi-view multi-object results

Each scene reconstruction is presented with a dedicated figure and we provide close-ups on various parts of the visualization to illustrate the different aspects in detail. The explanation is provided in the caption of each figure.

In each figure presented below, four (on T-LESS) or five (on YCB-Video) RGB images were used to reconstruct each scene. In each figure, each row corresponds to results associated with one image and different columns present the results of different stages of our method. The last column shows the ground truth scene. The different columns are described next.

“Input image” is the (RGB) image used as input to the method.

“2D detections” shows the detections obtained by the object detector (RetinaNet on T-LESS, PoseCNN on YCB-Video), after removing detections that have scores below $0.3$ . The color of each 2D bounding box illustrates the object label predicted for this detection; each color is associated with a unique type of 3D object in the object database. Note that the colors for each type of 3D object are shared for all visualizations corresponding to one scene (one figure) but not shared across the figures because of the high number of objects in the database.

“Object candidates” illustrates the 6D object poses predicted for each 2D detection. The candidates considered as outliers (those that have not been matched with a candidate from another view and are discarded) are marked with red color and are transparent. The candidates considered inliers are shown in green. Inliers are used in the final scene reconstruction. Note that the red and green colors in this (3rd) column are only used to indicate inliers and outliers and there is no correspondence with red and green colors in the 4th column that denote the different object types.

“Scene reconstruction” illustrates the scene reconstructed by our method using all the views presented in the figure. Once the scene is reconstructed, we use the recovered 6D poses of physical objects and cameras to render the scene imaged from each of the predicted viewpoints. The renderings are overlaid over the input image.

“Ground truth” corresponds to the ground truth scene viewed from the ground truth viewpoints. These images are shown to enable visual comparison with the results of our method. The ground truth information (number of objects, types of objects, poses of cameras, poses of objects) is not used by our method.

In the following, we illustrate the main capabilities of our system.

E.1 Highlights of the capabilities of our system

Our method is able to recover the state of complex scenes that contain multiple objects, even if parts of the scene are partially or completely occluded in some of the views. The poses of cameras and objects can be correctly recovered even if all objects in the scene are symmetric. An example is presented in Fig. 8. Note how some objects are missing in each individual view but our method is able to correctly recover all objects.

E.1.2 Multiple object instances.

Our method is able to successfully identify the correct number of objects and their labels even if there are multiple objects of the same type in the image, objects are partially occluded in some views and multiple types of objects have very similar visual appearance. An example is presented in Fig. 9.

E.1.3 Cluttered scenes with distractors.

Our method is also robust to distractor objects that are not in the database of objects. We present in Fig. 10 a complex example with many distractors where our method is able to successfully recover all objects in the scene that are in the object database, while filtering out the other ones. This is especially important for robotic applications in unstructured environments where the objects of interest are known and should not be confused with other background objects.

E.1.4 High accuracy.

One of the key components of our approach is scene refinement (Sec. 3.4 of the main paper), which significantly improves the accuracy of pose predictions using information from multiple views. In Fig. 11, we show an example of a reconstruction that highlights the accuracy that can be reached by our method using only 4 input images.

E.2 Detailed examples

We now explain in detail a few simpler examples that demonstrate how our system works and how it achieves the kind of results presented in the previous section.

In some situations, objects are partially or completely occluded in some of the views. As a result, 2D detections for one physical object are missing in some views. If this physical object is visible in other views, our reconstruction method is able to estimate its pose with respect to the other objects. If all cameras can be positioned with respect to the rest of the scene using other non-occluded objects, our approach can also position the partially occluded object with respect to all cameras, even if there were initially no candidates corresponding to the object in these views. An example is shown in Fig. 12.

E.2.2 Robustness to incorrect detections.

In T-LESS, many objects have similar visual appearance. As a result, the 2D detector often makes mistakes, predicting incorrect labels for some of the detections in some views. Our method is able to handle multiple 2D detections that have different labels at the same location in the image. In this case, a pose hypothesis is generated for each of the label hypotheses. If the object candidate cannot be matched with another view (either because the incorrect label is predicted in only one view or because the poses are not consistent), our method is able to discard this object candidate. An example is shown in Fig. 13. Please see the discussion “Duplicate objects” and Fig. 14 for examples where an object is consistently mis-identified across multiple views.

E.2.3 Duplicate objects.

When multiple objects share the same visual appearance, as is the case in the T-LESS dataset, there are often multiple label hypotheses that are consistent across views for the same physical object. Because these objects look similar to each other and match the observed image, the pose estimation network (which tries to match a rendering with the observed image, regardless of the object type) predicts reasonable poses for each label that are consistent across different views. These candidates are matched across views and multiple objects with different labels are predicted in the final scene at the same spatial position. In our visualization, we remove these duplicate objects by using a simple 3D non-maximum suppression (NMS) strategy on the estimated physical objects of the final scene. If multiple objects are too close to each other in the 3D scene, we keep the object with the highest score, defined as the sum of the 2D detection scores of all inlier object candidates that are associated with one physical 3D object. Duplicate objects and 3D non-maximum suppression are illustrated in Fig. 14, including one correct and one incorrect example. The column “Reconstruction” in all figures corresponds to the output of our method after the 3D NMS.
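The 3D non-maximum suppression can be sketched as follows. This is a minimal sketch under stated assumptions: closeness is measured as the Euclidean distance between object centers, the distance threshold `min_distance` is a hypothetical parameter whose value is not specified here, and the dictionary keys and function name are illustrative.

```python
import math


def nms_3d(objects, min_distance):
    """Greedy 3D non-maximum suppression on reconstructed objects.

    objects      : list of dicts with keys "label", "center" (x, y, z)
                   and "candidate_scores" (2D detection scores of the
                   inlier candidates associated with the object)
    min_distance : hypothetical threshold below which two objects are
                   considered duplicates (same unit as the centers)
    The score of an object is the sum of its candidate scores.
    """
    ranked = sorted(objects, key=lambda o: sum(o["candidate_scores"]),
                    reverse=True)
    kept = []
    for obj in ranked:
        far_from_all = all(
            math.dist(obj["center"], other["center"]) >= min_distance
            for other in kept)
        if far_from_all:
            kept.append(obj)
    return kept
```

Since the score sums detection scores over views, a label hypothesis supported by more views or by more confident detections is preferred; this choice can still keep the wrong label when the detector is consistently mistaken, as in the incorrect example of Fig. 14.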

E.2.4 Robustness to distractors and false positives.

The complex scenes in the T-LESS dataset also have background distractor objects that are not in the object database. Some of these distractors look similar to objects in the database and can be incorrectly detected, sometimes in multiple images. In these cases, the pose estimator most often produces 6D pose estimates that are not consistent across views because the input real images are outside of the training distribution (they display objects that are not used to generate the training data). Because these estimates are not consistent across views, our method is able to filter them and mark them as outliers (red), thus gaining robustness with respect to these distractors. An example is shown in Fig. 15.

E.3 Limitations

We now describe the most challenging scenarios that our method is currently not able to recover from. For each of these, we briefly discuss possible improvements.

If two incorrect 6D object candidates are consistent across at least two views, an (incorrect) object will be present in the reconstructed scene. Such a failure case typically happens when two viewpoints are similar to each other: if two views are very similar, the incorrect candidates will be matched together. An example is shown in Fig. 16. Note that this failure mode could be resolved by using a higher number of views, and by only considering physical objects that have a sufficiently high number of associated object candidates.

E.3.2 Limitation II: Objects missing in the final reconstruction.

Our current approach requires that a candidate in one view is matched with at least one candidate from another view. If a candidate detection and pose estimate is correct in one view but not in any other view, it will be missing from the final reconstruction. An example is presented in Fig. 17. Note that in this case, all camera poses are still estimated correctly. An interesting direction to overcome this problem would be to grow the number of object candidates in each view by reprojecting the detection from other views, as done in guided matching.

E.3.3 Limitation III: Incorrect estimates of camera pose.

To position the camera with respect to the scene, our method requires that there are at least three object candidate inliers in the view: two for positioning the camera with respect to the scene, and another one to validate the camera pose hypothesis. Sometimes, however, there is an insufficient number of inliers. This typically happens if only two objects are visible, or if there is a small number of objects visible and some of the detections are incorrect. An example is shown in Fig. 18.