Learning Analysis-by-Synthesis for 6D Pose Estimation in RGB-D Images

Alexander Krull, Eric Brachmann, Frank Michel, Michael Ying Yang, Stefan Gumhold, Carsten Rother

Introduction

Tremendous effort has focused on the tasks of object instance detection and pose estimation in images and videos. In this paper we consider pose estimation in a single RGB-D image, as shown in Fig. 1. Given the extra depth channel, it becomes feasible to extract the full 6D pose (3D rotation and 3D translation) of object instances present in the scene. Pose estimation has important applications in many areas, such as robotics, medical imaging, and augmented reality. Recently, Brachmann et al. achieved state-of-the-art results by adapting the analysis-by-synthesis approach to pose estimation in RGB-D images. They use a random forest to obtain pixelwise dense predictions. Building upon their system, we propose a novel method that learns to compare in the analysis-by-synthesis framework. We use a convolutional neural network (CNN) inside a probabilistic context to achieve this.

Analysis-by-synthesis has been a successful approach for many tasks in computer vision, such as object recognition, scene parsing, pose estimation and tracking. A forward synthesis model generates images from possible geometric interpretations of the world, and the interpretation that best agrees with the measured visual evidence is then selected. In particular for pose estimation, the idea is to compare the observation with the output of a forward process, such as a rendered image of the object of interest in a particular pose. When attempting pose estimation in RGB-D images, the comparison step of analysis-by-synthesis is nontrivial due to occlusion and complicated sensor noise. Kinect images, for example, contain areas without depth measurements, e.g. due to poor IR reflectance.

We achieve considerable improvements over state-of-the-art methods of pose estimation in RGB-D images with heavy occlusion.

To the best of our knowledge, this work is the first to utilize a convolutional neural network (CNN) as a probabilistic model to learn to compare rendered and observed images.

We observe that the CNN does not specialize to the geometry or appearance of specific objects, and it can be used with objects of vastly different shapes and appearances, and in different backgrounds.

The paper is organized as follows. Section 2 provides an overview of related work. Our proposed approach is described in Sec. 3. In Sec. 4 we present an evaluation of our method compared to the state of the art on two datasets. We conclude the paper in Sec. 5.

Related Work

A large body of work in computer vision has focused on the problem of object detection and pose estimation, including instance and category recognition, rigid and articulated objects, and coarse (quantized) and accurate (6D) poses. Pose estimation has been an active topic, with methods ranging from template-based approaches and sparse feature-based approaches to dense approaches. In the brief review below, we focus on techniques that specifically address CNNs and analysis-by-synthesis.

CNNs. CNNs have driven recent advances in computer vision, in tasks such as image classification, detection, recognition, semantic segmentation, and pose estimation. CNNs have shown remarkable performance in the large-scale visual recognition challenge (ILSVRC2012). The success of CNNs is attributed to their ability to learn rich feature representations, as opposed to the hand-designed features used in previous image classification methods. In , rich image and depth feature representations have been learned with CNNs to detect objects in RGB-D images. In , CNNs are used to generate an RGB image given the set of 3D chair models, the chair type, viewpoint and color. Very recent work from Gupta et al. uses object instance segmentation output to infer 3D object pose in RGB-D images. Another CNN is used to predict the coarse pose of the object. This CNN is trained using pixel normals in images containing rendered synthetic objects. This coarse pose is used to align a small number of prototypical models to the data, and to place the model that fits best into the scene. In contrast to the above approaches, we use a CNN as a probabilistic model to compare rendered and observed images. The output of our CNN is the energy value, while in the work of Gupta et al. the output of the CNN is the object pose. In , a similarity metric is learned. The learning process minimizes a discriminative loss function. A CNN with siamese architecture is used for mapping two face feature spaces. Similarly, Wohlhart and Lepetit train a CNN to map image patches to a descriptor space, where pose estimation and object recognition are solved using the nearest neighbor method. Our framework, in contrast, is probabilistic. The posterior distribution of the pose is modelled as a Gibbs distribution with a CNN as energy function. Zbontar and LeCun train a CNN to predict how well two image patches match and use it to compute the stereo matching cost. The cost is minimized by cross-based cost aggregation and semi-global matching, followed by a left-right consistency check to eliminate errors in the occluded regions. While their CNN is used for comparing two image patches, our CNN is used to compare rendered and observed images.

Analysis-by-synthesis has been a successful approach for many tasks in computer vision, such as object recognition, scene parsing, viewpoint synthesis, material classification, and gaze estimation. All these approaches use a forward model to synthesize some form of image, which is compared to observations. Many works learn a feature representation and compare in feature space. For instance, in , the analysis-by-synthesis strategy has been used for recognizing and reconstructing 3D objects in images. The forward model synthesizes visual templates defined on invariant features. Gall et al. propose an analysis-by-synthesis framework for motion capture and tracking. It combines patch-based and region-based matching to track body parts. Patch-based matching extracts correspondences between two successive frames for prediction and between the current image and a synthesized image for avoiding drift. Recently, Brachmann et al. achieved state-of-the-art results by adapting the classical analysis-by-synthesis approach for 6D pose estimation of specific objects from a single RGB-D image. They use a new representation in the form of a joint dense 3D object coordinate and object class labeling. The major difference to our work is that we learn to compare in the analysis-by-synthesis approach. For the problem of 6D pose estimation, due to occlusion or complicated sensor noise, it can be difficult to compare the observation with a rendered image of the object of interest in a particular pose. In this paper, we propose an approach which draws on recent successes of CNNs. Different from the aforementioned approaches, we model the posterior density of a particular object pose with a CNN that compares an observed and a rendered image. The network is trained with the maximum likelihood paradigm. One of the most closely related works is . They use a CNN as part of a probabilistic model. The CNN is fed in a sequential manner, first with the rendered image, then with the observed image. This produces two feature vectors, which are compared in the subsequent step to give the probability of the observed image. In contrast, we jointly input the rendered and observed images into a CNN to produce an energy value. The major difference is that our CNN is trained, while they take a pre-trained CNN as feature extractor.

We will now describe the system of Brachmann et al. in detail, because it is of particular relevance for our method. Brachmann et al. achieved state-of-the-art results by using a random forest to obtain pixelwise dense predictions, which facilitate pose estimation. Each tree in their forest is trained to jointly predict to which object a pixel belongs and where it is located on the surface of this object. A tree outputs a soft segmentation image for each object with values between 0 and 1, indicating whether a pixel belongs to the object or not. The predictions of different trees are then combined to a single object probability. Additionally, each tree outputs 3D object coordinates for each object and each pixel. The term object coordinates refers to the coordinates in the local coordinate system of the object. When estimating the pose of a particular object, Brachmann et al. utilize the forest predictions in two ways:

Firstly, they are used to define an energy function, which is minimized to obtain the final pose. All aspects of the energy follow the analysis-by-synthesis principle. It is based on a pixelwise comparison between the predictions, the recorded depth values and rendered images of the object in the particular pose. In detail, three comparisons are done: (a) the rendered depth image of the object is compared to the recorded depth image; (b) the rendered image of object coordinates is compared to the predicted object coordinates; (c) the rendered segmentation mask of the object is compared to the predicted object class probability for the object. The pixelwise error inside the segmentation mask is aggregated and divided by the area of the mask. Robust error measures are used to deal with outliers.

Secondly, they use the forest predictions for an efficient optimization scheme to minimize the energy described above. It consists of two steps. The pixelwise object class probabilities are used inside the RANSAC pose estimation. In detail, sets of three pixels are sampled depending on the object class probability. For each set a pose hypothesis is calculated using the 3D-3D-correspondences between the camera coordinates, provided by the depth camera, and the object coordinates predicted by the forest. The best hypotheses, according to the energy function, are refined in a final step. Refinement is done by repeatedly determining inlier pixels in the rendered mask of the object, and again using the correspondences they provide to calculate a better pose. Finally, the pose with the lowest energy is taken as the final estimate.

In our work we build upon the framework of Brachmann et al. Like them, we use the regression-classification random forest to obtain the predictions described above. We also use their optimization scheme, but replace the energy function with a novel, trained one based on a CNN. The key difference is that, while their energy function has only a few parameters, which can be trained via a discriminative cross-validation procedure, the CNN has around 600K parameters, which we train with a maximum likelihood procedure. We show that this richness of parameters makes a remarkable difference, and that practical challenges such as occlusion and noise are dealt with much better. This approach will be described in the next section.

Method

We will first give a description of the pose estimation task and introduce our terminology. Then we will describe our probabilistic model. The heart of this model is a CNN, which will be discussed subsequently. This is followed by a description of our maximum likelihood training procedure of the probabilistic model. Finally our inference procedure at test time is described. Fig. 2 gives an overview of our testing pipeline.

We will now formally define the task of 6D pose estimation. Our goal is to estimate the pose $H$ of a rigid object from a set of observations denoted by ${\bf x}$ , which will be discussed later. (It should be noted that we assume the object to be present in the field of view, i.e. we do not perform object recognition.) A pose describes the transformation from the local coordinate system of the object to the coordinate system of the camera. The local coordinate system has its origin in the center of the object. Each pose $H=(R,T)$ is a combination of two components. The rotational component $R$ is a $3\times 3$ matrix describing the rotation around the center of the object. The translational component $T$ is a 3D vector corresponding to the position of the object center in the camera coordinate system.

Let us now describe the observation ${\bf x}$ that is used to estimate the object pose. We use RGB-D images as input. However, since we use the same random forest predictions as in , the term observation or observed images will refer to two parts: (a) the forest predictions as described in , as well as (b) the recorded depth image. The reason for this simplified view is that the focus of our work lies on the modeling of the posterior density and aspects of the random forest prediction.

3.2 Probabilistic Model

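We model the posterior distribution of the pose $H$ given the observations ${\bf x}$ as a Gibbs distribution

$$p(H|{\bf x};\boldsymbol{\theta})=\frac{\exp\left(-E\left(H,{\bf x};\boldsymbol{\theta}\right)\right)}{\int\exp\left(-E\left(\hat{H},{\bf x};\boldsymbol{\theta}\right)\right)d\hat{H}}, \tag{1}$$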

where $E\left(H,{\bf x};\boldsymbol{\theta}\right)$ is the so called energy function. The energy function is a mapping from a pose $H$ and the observed images ${\bf x}$ to a real number, parametrized by the vector $\boldsymbol{\theta}$ . Note that using a Gibbs distribution to model the posterior is a common practice for conditional random fields (CRFs) . However, the underlying energies are quite different. While in a CRF the energy function is a sum of potential functions, we implement it by using a CNN which directly outputs the energy value. The parameter vector $\boldsymbol{\theta}$ holds the weights of our CNN.

3.3 Convolutional Neural Network

In order to implement the mapping from a pose $H$ and the observed images ${\bf x}$ to an energy value, we first render the object in pose $H$ to obtain rendered images ${\bf r}(H)$ . Our CNN then compares ${\bf x}$ with ${\bf r}(H)$ and outputs a value $f\big({\bf x},{\bf r}(H);\boldsymbol{\theta}\big)$ . We define the energy function as

$$E\left(H,{\bf x};\boldsymbol{\theta}\right)=f\big({\bf x},{\bf r}(H);\boldsymbol{\theta}\big). \tag{2}$$

Our network is trained to assign a low energy value when there is a large agreement between observed images and renderings, and a high energy value when there is little agreement. To perform the comparison we use a simple architecture, in which we feed all rendered and observed images as separate input channels into the CNN.

Note that we consider only a square window around the center of the object with pose $H$ . The width of the window is adjusted according to the size and distance of the object, as suggested by . For performance reasons, windows larger than $100\times 100$ pixels are downsampled to this size. We use six input channels in total for our network. Note that Fig. 2 shows the images from which these six input channels are derived.

One observed depth channel and one rendered depth channel that contain values in millimeters. They are normalized by subtracting the $z$ component of the object position according to $H$ .

One rendered mask channel of the object. Pixel values are either $+1$ for all pixels belonging to the object or $-1$ otherwise.

One depth mask channel indicating whether a depth value was measured in the pixel. Again, pixel values are either $+1$ for all pixels where a depth was measured or $-1$ otherwise.

One probability channel holding the combined pixelwise object probabilities from all trees. The values are re-scaled to lie between $-1$ and $+1$ .

One object coordinate channel holding the pixelwise Euclidean distances between the rendered object coordinates and the predicted object coordinates from the tree giving the highest object probability for the respective pixel. We divide all values by the object diameter for normalization.
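The channel normalizations listed above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the nested-list image layout, and the linear mapping $2p-1$ for the probability channel are assumptions (the text only states that the probabilities are re-scaled to lie between $-1$ and $+1$). The predicted object coordinates are assumed to have already been taken from the tree with the highest object probability, and the treatment of pixels without a depth measurement is left open.

```python
import math

def centered_depth(depth_mm, object_z_mm):
    """Subtract the z component of the object position (pose H) from every depth value."""
    return [[d - object_z_mm for d in row] for row in depth_mm]

def signed_mask(flags):
    """Encode a boolean image as +1 (True) / -1 (False)."""
    return [[1.0 if flag else -1.0 for flag in row] for row in flags]

def rescaled_probability(prob):
    """Map probabilities from [0, 1] to [-1, +1] (the linear mapping is an assumption)."""
    return [[2.0 * p - 1.0 for p in row] for row in prob]

def coordinate_distance(rendered_xyz, predicted_xyz, diameter_mm):
    """Per-pixel Euclidean distance between rendered and predicted object
    coordinates, divided by the object diameter."""
    return [
        [math.dist(r, p) / diameter_mm for r, p in zip(r_row, p_row)]
        for r_row, p_row in zip(rendered_xyz, predicted_xyz)
    ]

def build_channels(obs_depth, ren_depth, ren_mask, depth_valid, prob,
                   ren_xyz, pred_xyz, object_z_mm, diameter_mm):
    """Stack the six input channels in the order in which the text lists them."""
    return [
        centered_depth(obs_depth, object_z_mm),   # observed depth
        centered_depth(ren_depth, object_z_mm),   # rendered depth
        signed_mask(ren_mask),                    # rendered object mask
        signed_mask(depth_valid),                 # depth-measured mask
        rescaled_probability(prob),               # combined object probability
        coordinate_distance(ren_xyz, pred_xyz, diameter_mm),
    ]
```

Keeping all channels in a comparable numeric range is a plausible motivation for these normalizations, given that the network uses $\tanh$ activations, although the text does not state this explicitly.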

The $\tanh$ activation function is used after every convolution layer and after every fully connected layer. The first convolution layer $C_{1}$ consists of 128 convolution kernels of size $3\times 3\times 6$ . The second convolution layer $C_{2}$ consists of 128 kernels of size $3\times 3\times 128$ , and is followed by a $2\times 2$ max-pooling layer with stride 2 in each direction. The third convolution layer $C_{3}$ is identical to $C_{2}$ . The fourth convolution layer consists of $256$ kernels of size $3\times 3\times 128$ . It is followed by a max-pooling operation over the remaining image size. The $256$ channels are further processed by two fully connected layers with $256$ neurons each and finally forwarded to a single output unit.

3.4 Maximum Likelihood Training

In training we want to find an optimal set of parameters $\boldsymbol{\theta}^{*}$ based on labeled training data $L=({\bf x}_{1},H_{1})\dots({\bf x}_{n},H_{n})$ , where ${\bf x}_{i}$ denotes the observations of the $i$-th training image and $H_{i}$ the corresponding ground truth pose. We apply the maximum likelihood paradigm and define

$$\boldsymbol{\theta}^{*}=\operatorname*{arg\,max}_{\boldsymbol{\theta}}\sum_{i=1}^{n}\ln p(H_{i}|{\bf x}_{i};\boldsymbol{\theta}). \tag{3}$$

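In order to solve this optimization task we use stochastic gradient descent, which requires calculating the partial derivatives of the log likelihood for each training sample

$$\frac{\partial}{\partial\theta_{j}}\ln p(H_{i}|{\bf x}_{i};\boldsymbol{\theta})=-\frac{\partial}{\partial\theta_{j}}E(H_{i},{\bf x}_{i};\boldsymbol{\theta})+\mathbb{E}\left[\frac{\partial}{\partial\theta_{j}}E(H,{\bf x}_{i};\boldsymbol{\theta})\,\Big|\,{\bf x}_{i};\boldsymbol{\theta}\right]. \tag{4}$$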
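For readers who want to see where the expected value in the gradient comes from, the following is a short derivation sketch under the Gibbs model of Eq. (1). The symbol $Z$ for the normalizing integral is shorthand introduced here and is not part of the original notation; we also assume that differentiation and integration may be interchanged.

$$
\begin{aligned}
\ln p(H_{i}|{\bf x}_{i};\boldsymbol{\theta}) &= -E(H_{i},{\bf x}_{i};\boldsymbol{\theta})-\ln Z({\bf x}_{i};\boldsymbol{\theta}),\qquad Z({\bf x}_{i};\boldsymbol{\theta})=\int\exp\left(-E(H,{\bf x}_{i};\boldsymbol{\theta})\right)dH,\\
\frac{\partial}{\partial\theta_{j}}\ln Z({\bf x}_{i};\boldsymbol{\theta}) &= -\int\frac{\exp\left(-E(H,{\bf x}_{i};\boldsymbol{\theta})\right)}{Z({\bf x}_{i};\boldsymbol{\theta})}\,\frac{\partial}{\partial\theta_{j}}E(H,{\bf x}_{i};\boldsymbol{\theta})\,dH = -\mathbb{E}\left[\frac{\partial}{\partial\theta_{j}}E(H,{\bf x}_{i};\boldsymbol{\theta})\,\Big|\,{\bf x}_{i};\boldsymbol{\theta}\right],\\
\frac{\partial}{\partial\theta_{j}}\ln p(H_{i}|{\bf x}_{i};\boldsymbol{\theta}) &= -\frac{\partial}{\partial\theta_{j}}E(H_{i},{\bf x}_{i};\boldsymbol{\theta})+\mathbb{E}\left[\frac{\partial}{\partial\theta_{j}}E(H,{\bf x}_{i};\boldsymbol{\theta})\,\Big|\,{\bf x}_{i};\boldsymbol{\theta}\right].
\end{aligned}
$$

The first term lowers the energy at the ground truth pose, while the second raises the energy, on average, at poses that are probable under the current model. The integral over all poses is intractable, which is why the expectation has to be approximated.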

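Sampling. It is possible to approximate the expected value in Eq. (4) by a set of pose samples

$$\mathbb{E}\left[\frac{\partial}{\partial\theta_{j}}E(H,{\bf x}_{i};\boldsymbol{\theta})\,\Big|\,{\bf x}_{i};\boldsymbol{\theta}\right]\approx\frac{1}{N}\sum_{k=1}^{N}\frac{\partial}{\partial\theta_{j}}E(H_{k},{\bf x}_{i};\boldsymbol{\theta}), \tag{5}$$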

where $H_{1}\dots H_{N}$ are pose-samples drawn independently from the posterior $p(H|{\bf x};\boldsymbol{\theta})$ with the current parameters $\boldsymbol{\theta}$ . We use the Metropolis algorithm to generate these samples. It allows sampling from any distribution with a known density function that can be evaluated up to a constant factor. The algorithm generates a sequence of samples $H_{t}$ by repeating two steps:

Draw a new proposed sample $H^{\prime}$ according to a proposal distribution $Q(H^{\prime}|H_{t})$ .

Accept or reject the proposed sample according to an acceptance probability $A(H^{\prime}|H_{t})$ . If the proposed sample is accepted set $H_{t+1}=H^{\prime}$ . If it is rejected set $H_{t+1}=H_{t}$ .

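The proposal distribution $Q(H^{\prime}|H_{t})$ has to be symmetric, i.e. $Q(H^{\prime}|H_{t})=Q(H_{t}|H^{\prime})$ . Our particular proposal distribution will be described in detail in the next section. The acceptance probability is in our case defined as

$$A(H^{\prime}|H_{t})=\min\left(1,\frac{p(H^{\prime}|{\bf x};\boldsymbol{\theta})}{p(H_{t}|{\bf x};\boldsymbol{\theta})}\right), \tag{6}$$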

meaning that whenever the posterior density $p(H^{\prime}|{\bf x};\boldsymbol{\theta})$ at the proposed sample is greater than the posterior density $p(H_{t}|{\bf x};\boldsymbol{\theta})$ at the current sample, the proposed sample will automatically be accepted. If this is not the case, it will be accepted with the probability $p(H^{\prime}|{\bf x};\boldsymbol{\theta})/p(H_{t}|{\bf x};\boldsymbol{\theta})$ .

Proposal Distribution. A common choice for the proposal distribution is a normal distribution centered at the current sample. In our case this is not possible because the rotational component of the pose lives on the manifold $SO(3)$ , i.e. the group of rotations. We define $Q(H^{\prime}|H_{t})$ implicitly by describing a sampling procedure and ensuring that it is symmetric. The translational component $T^{\prime}$ of the proposed sample is directly drawn from a 3D isotropic normal distribution $\mathcal{N}(T_{t},{\Sigma}_{T})$ centered at the translational component $T_{t}$ of the current sample $H_{t}$ . The rotational component $R^{\prime}$ of the proposed sample $H^{\prime}$ is generated by applying a random rotation $\hat{R}$ to the rotational component $R_{t}$ of the current sample: $R^{\prime}=\hat{R}R_{t}$ .

We calculate $\hat{R}$ as the rotation matrix corresponding to an Euler vector ${\bf e}$ , which is drawn from a 3D zero centered isotropic normal distribution ${\bf e}\sim\mathcal{N}({\bf 0},{\Sigma}_{R})$ . (An Euler vector is a 3D vector that represents a rotation: the direction of the vector describes the axis of the rotation and the length corresponds to the angle.)
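The sampling procedure for the proposal can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the isotropic covariances ${\Sigma}_{T}$ and ${\Sigma}_{R}$ are represented by scalar standard deviations `sigma_t` and `sigma_r` whose values are not given in the text, and the Euler vector is converted to a rotation matrix with Rodrigues' formula.

```python
import math
import random

def rotation_from_euler_vector(e):
    """Rodrigues' formula: rotation axis e / |e|, rotation angle |e| (radians)."""
    angle = math.sqrt(sum(v * v for v in e))
    if angle < 1e-12:
        # A (near) zero vector corresponds to the identity rotation.
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    x, y, z = (v / angle for v in e)
    c, s = math.cos(angle), math.sin(angle)
    k = 1.0 - c
    return [
        [k * x * x + c,     k * x * y - s * z, k * x * z + s * y],
        [k * x * y + s * z, k * y * y + c,     k * y * z - s * x],
        [k * x * z - s * y, k * y * z + s * x, k * z * z + c],
    ]

def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][m] * b[m][j] for m in range(3)) for j in range(3)]
            for i in range(3)]

def propose_pose(R_t, T_t, sigma_t, sigma_r, rng=random):
    """Draw H' = (R', T') around the current sample H_t = (R_t, T_t)."""
    # Translation: isotropic normal centered at the current translation.
    T_new = [t + rng.gauss(0.0, sigma_t) for t in T_t]
    # Rotation: random Euler vector e ~ N(0, sigma_r^2 I), then R' = R_hat R_t.
    e = [rng.gauss(0.0, sigma_r) for _ in range(3)]
    R_new = matmul3(rotation_from_euler_vector(e), R_t)
    return R_new, T_new
```

Because ${\bf e}$ and $-{\bf e}$ are equally likely under a zero centered normal distribution and correspond to inverse rotations, the reverse move is as probable as the forward move, which is the symmetry the Metropolis algorithm requires.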

Initialization and Burn-in-phase. When the Metropolis algorithm is initialized in an area with low density it requires more iterations to provide a fair approximation of the expected value. To find a good initialization we run our inference procedure (described in the next section) using the current parameter set. We then perform the Metropolis algorithm for a total of 130 iterations, disregarding the samples from the first 30 iterations which are considered as burn-in-phase.
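The sampling loop with burn-in can be sketched generically as below. It is a minimal sketch, not the authors' implementation: `energy` and `propose` are hypothetical callables standing in for the CNN energy $E(H,{\bf x};\boldsymbol{\theta})$ (with ${\bf x}$ and $\boldsymbol{\theta}$ fixed) and for a symmetric proposal. For a Gibbs distribution the density ratio in the acceptance probability reduces to $\exp\big(E(H_{t})-E(H^{\prime})\big)$ , so the normalizing constant never has to be evaluated. The toy one-dimensional energy at the end only demonstrates the call.

```python
import math
import random

def metropolis(energy, propose, initial, n_iter=130, burn_in=30, rng=random):
    """Run the Metropolis algorithm and return the samples after the burn-in phase."""
    current = initial
    e_current = energy(current)
    samples = []
    for t in range(n_iter):
        candidate = propose(current, rng)
        e_candidate = energy(candidate)
        # Always accept a lower-energy (higher-density) candidate; otherwise
        # accept with probability p(H') / p(H_t) = exp(E(H_t) - E(H')).
        if e_candidate <= e_current or rng.random() < math.exp(e_current - e_candidate):
            current, e_current = candidate, e_candidate
        # A rejected proposal repeats the current sample: H_{t+1} = H_t.
        if t >= burn_in:
            samples.append(current)
    return samples

# Toy usage with a 1D "pose": quadratic energy and a symmetric Gaussian proposal.
def toy_energy(h):
    return 0.5 * h * h

def toy_propose(h, rng):
    return h + rng.gauss(0.0, 1.0)

toy_samples = metropolis(toy_energy, toy_propose, initial=3.0, rng=random.Random(0))
```

With the defaults taken from the text (130 iterations, the first 30 discarded), the call returns 100 samples, over which the energy derivatives can then be averaged.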

3.5 Inference Procedure

During test time we aim at finding the MAP estimate, i.e. the pose maximizing our posterior density as given in Eq. (1). Since the denominator in Eq. (1) is constant for any given observation ${\bf x}$ , finding the MAP estimate is equivalent to minimizing our energy function. To achieve this, we utilize the optimization scheme from , but replace their energy function with ours.

Experiments

In the following we compare our approach to the state-of-the-art method of Brachmann et al. on two different datasets. We first describe some implementation details of the competitor and introduce the datasets. After that we describe details of our training procedure, and finally present a quantitative and qualitative comparison. We will see that we achieve considerable improvements for both datasets. Additionally, we observe that our CNN generalizes from a single training object to a set of 11 test objects, with large variability in appearance and geometry.

Datasets. We use two datasets featuring heavy occlusion. The first dataset was created by Brachmann et al. by annotating the ground truth poses for eight partially occluded objects in images taken from the dataset of Hinterstoisser et al. We will refer to this dataset as the occlusion dataset from Brachmann et al. and Hinterstoisser et al. It includes a total of $8992$ test cases (images with different annotation), which are used for testing. We choose this dataset because it is more challenging than the original dataset from Hinterstoisser et al., on which already achieves an average of 98.3% correctly estimated poses.

The second dataset was introduced by Krull et al. in . It provides six annotated RGB-D sequences of three different objects and consists of a total of $3187$ images. We use three of the sequences for training and the other three (a total of $1715$ test images) for testing.

Evaluation Protocol. We use the evaluation procedure as described in . This means we calculate the percentage of correctly predicted poses for each sequence. As in we calculate the average distance between the 3D model vertices under the estimated pose and under the ground truth pose. A pose is considered correct, when the average distance is below 10% of the object diameter.
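The correctness criterion can be written down directly from this description. The following sketch uses hypothetical function names and a nested-list representation for the model vertices and poses; it covers only the average-distance test stated above.

```python
import math

def transform(R, T, v):
    """Apply pose (R, T) to a 3D model vertex v: R v + T."""
    return [sum(R[i][m] * v[m] for m in range(3)) + T[i] for i in range(3)]

def average_vertex_distance(vertices, R_est, T_est, R_gt, T_gt):
    """Mean distance between model vertices under the estimated and ground truth pose."""
    total = 0.0
    for v in vertices:
        total += math.dist(transform(R_est, T_est, v), transform(R_gt, T_gt, v))
    return total / len(vertices)

def pose_is_correct(vertices, R_est, T_est, R_gt, T_gt, diameter, threshold=0.1):
    """A pose counts as correct if the average distance is below 10% of the object diameter."""
    return average_vertex_distance(vertices, R_est, T_est, R_gt, T_gt) < threshold * diameter
```

For example, for an object with a diameter of 100 mm, a pure translation error of 5 mm yields an average distance of 5 mm and passes the 10 mm threshold, while an error of 20 mm does not.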

Competitors. We compare our method to the one presented by Brachmann et al. For doing so we needed to re-implement this method. (Our re-implementation is identical up to small details, which we discussed with the original authors.) We observed that our re-implementation gives on average slightly superior results. In the following, we mostly report two numbers: those of our re-implementation and those originally reported for the method. For completeness we additionally provide the numbers from LineMOD as reported in .

4.2 Training Procedure

Random Forests. We used different random forests for training and testing on both datasets. The forests were kindly provided to us by the authors of .

CNN. We trained three CNNs, each time using only a single object from the dataset provided by Krull et al. The sequences Toolbox_1, Cat_1, and Samurai_1 served as training sets (see Fig. 3). The first $100$ frames from Samurai_1 were removed in order to obtain a high percentage of frames with occlusion. Our validation set consists of $100$ randomly selected frames from the Cat_1 sequence, or from the Samurai_1 sequence (in the case where Cat_1 was used as training set). The weights of the CNN were randomly initialized. Before training, the random weights of the last layer were multiplied by a factor of 1000, in order to cover a greater range of possible energy values. After every 5th iteration of stochastic gradient descent, we perform inference on the validation set and adjust the learning rate. The learning rate at step $t$ was proportional to $\gamma_{t}=\gamma_{0}/(1+\gamma_{0}\lambda t)$ , with $\gamma_{0}=10$ and $\lambda=0.5$ . After training we pick the set of weights which achieved the highest percentage of correctly estimated poses on the validation set. We use the criterion from to classify a pose as correct. One training cycle consisting of five steps of stochastic gradient descent and validation took 9min 46sec (2min 27sec + 7min 19sec). (We used an Intel(R) Core(TM) i7-3820 CPU at 3.60GHz with a GeForce GTX 660 GPU. The Cat_1 sequence was used for training and 100 random frames from Samurai_1 for validation.) Further details on our training procedure can be found in the supplementary material.
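The decay schedule for the learning rate is simple enough to state as code. In this small sketch the function name is hypothetical, and what exactly counts as a step $t$ (a single gradient iteration or one adjustment cycle) is not specified in the text, so $t$ is treated here simply as a non-negative counter.

```python
def learning_rate_factor(t, gamma_0=10.0, lam=0.5):
    """Factor gamma_t = gamma_0 / (1 + gamma_0 * lambda * t) to which the learning rate is proportional."""
    return gamma_0 / (1.0 + gamma_0 * lam * t)

# Factors for the first few steps; the sequence starts at gamma_0 and decays
# roughly like 1 / (lambda * t) for large t.
first_factors = [learning_rate_factor(t) for t in range(4)]
```

With $\gamma_{0}=10$ and $\lambda=0.5$ the factor drops from $10$ at $t=0$ to $10/6\approx 1.67$ at $t=1$ , so most of the decay happens within the first few steps.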

4.3 Comparison

Occlusion Dataset from and . Quantitative results for this dataset are shown in Fig. 4, for all individual test and training objects. Considering the average over all objects we achieve an improvement of up to ${9.23\%}$ compared to our re-implementation of and 10.4% compared to the reported values in . Some qualitative results are illustrated in Fig. 7. In Fig. 5 we show another comparison of our method with respect to . It illustrates that we achieve the biggest gain for occlusion percentage between $50\%$ and $60\%$ .

Dataset of Krull et al. For this dataset we observe similar results as with the previous dataset. Since the other sequences were used in training and validation, we evaluated only with the Toolbox_2, Cat_2, and Samurai_2 sequences. When averaged over all objects we achieve an improvement of 10.97% compared to the results of . The quantitative results can be found in Fig. 6, and a few qualitative results are shown in Fig. 8.

Discussion of Failure Cases. The failure cases which are framed red in Fig. 7 have to be considered as failures of our learned energy function. However, the failure cases framed orange still exhibit a lower energy at the ground truth pose than at the estimate. This indicates a failure of the optimization scheme. It should be investigated in which cases the correct pose can be found using an alternative optimization scheme. In the dataset introduced by Krull et al. our accuracy for the Toolbox sequences is below that of our competitor (see Fig. 6). We attribute this to the fact that the Toolbox is the biggest object and is most strongly affected by the downsampling scheme described in Sec. 3.3.

Conclusion

We have presented a model for the posterior distribution in 6D pose estimation, which uses a CNN to map rendered and observed images to an energy value. We train the CNN based on the maximum likelihood paradigm. We have demonstrated that training on a single object is sufficient and that the CNN is able to generalize to different objects and backgrounds. Our system has been evaluated on two datasets featuring heavy occlusion. By using our energy as objective function for pose estimation, we were able to achieve considerable improvements compared to the best previously published results.

Our approach is not restricted to the feature channels, or even to the application, that we demonstrated. The architecture of the CNN can in principle be applied to any kind of observed and rendered image. We think it would be worth investigating whether the approach could be applied to other scenarios. An example could be pose estimation from pure RGB, without a recorded depth image and without a forest to calculate features. Pose estimation for object classes could also benefit from our approach. Considering the recent success of CNNs in recognition, it might be possible for a CNN to learn to compare observed images to renderings of an idealized model representing an object class instead of an instance. Our approach is not limited to comparing images of the same kind, such as rendered and observed depth images. Instead, it could learn to assess the plausibility of the shading in an observed RGB image by comparing it to a rendered depth image, which can be produced more easily than a realistic RGB rendering.

An interesting future line of research could be to train a CNN to predict pose updates from observed and rendered images. This could replace the refinement step and might improve the results.