Learning to Find Good Correspondences

Kwang Moo Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, Pascal Fua

Introduction

Recovering the relative camera motion between two images is one of the most basic tasks in Computer Vision, and a key component of wide-baseline stereo and Structure from Motion (SfM) pipelines. However, it remains a difficult problem when dealing with wide baselines, repetitive structures, and illumination changes, as depicted by Fig. 1. Most algorithms rely on sparse keypoints to establish an initial set of correspondences across images, then try to find a subset of reliable matches—inliers—which conform to a given geometric model, and use them to recover the pose . They rely on combinations of well-established techniques, such as SIFT , RANSAC , and the 8-point algorithm , which have been in use for decades.

With the advent of deep learning, there has been a push towards reformulating local feature extraction using neural networks . However, while these algorithms outperform earlier ones on point-matching benchmarks, incorporating them into pose estimation pipelines may not necessarily translate into a performance increase, as indicated by two recent studies . This suggests that the limiting factor may not lie in establishing the correspondences as much as in choosing those that are best to recover the pose.

This problem has received comparatively little attention and most algorithms still rely on non-differentiable hand-crafted techniques to solve it. DSAC is the only recent attempt we know of to tackle sparse outlier rejection in a differentiable manner. However, this method is designed to mimic RANSAC rather than outperform it. Furthermore, it is specific to 3D to 2D correspondences, rather than stereo.

By contrast, we propose a novel approach to finding geometrically consistent correspondences with a deep network. Given feature points in both images, and potential correspondences between them, we train a deep network that simultaneously solves a classification problem—retain or reject each correspondence—and a regression problem—retrieving the camera motion—by exploiting epipolar constraints. To do so we introduce a differentiable way to compute the pose through a simple weighted reformulation of the 8-point algorithm, with the weights predicting the likelihood of a correspondence being correct. In practice, we assume the camera intrinsics to be known, which is often true because they are stored in the image meta-data, and express camera motion in terms of the essential matrix .

As shown in Fig. 2, our architecture relies on Multi Layer Perceptrons (MLPs) that are applied independently on each correspondence, rendering the network invariant to the order of the input. This is inspired by PointNet , which runs an MLP on each individual point from a 3D cloud and then feeds the results to an additional network that generates a global feature, which is then appended to each point-wise feature. By contrast, we simply normalize the distribution of the feature maps over all correspondences every time they go through a perceptron. As the correspondences are constrained by the camera motion, this procedure provides context. We call this novel, non-parametric operation Context Normalization, and found it simpler and more effective than the global context features of for our purposes.

Our method has the following advantages: (i) it can double the performance of the state of the art; (ii) being keypoint-based, it generalizes better than image-based dense methods to unseen scenes, which we demonstrate with a single model that outperforms current methods on drastically different indoors and outdoors datasets; (iii) it requires only weak supervision through essential matrices for training; (iv) it can work effectively with very little training data, e.g., we can still outperform the state of the art on challenging outdoor scenes with only 59 training images.

Related Work

The traditional approach for estimating the relative camera motion between two images is to use sparse keypoints, such as SIFT , to establish an initial set of correspondences, and reject outliers with RANSAC , using the 5-point algorithm or the 8-point algorithm to retrieve the essential matrix.

Many works have focused on improving the outlier rejection step of this pipeline, i.e., RANSAC. MLESAC shows improvements when solving image geometry problems. PROSAC speeds up the estimation process. USAC combines multiple advancements together into a unified framework. Least median of squares (LMEDS) is also commonly used to replace RANSAC. A comprehensive study on this topic can be found in . In practice, however, RANSAC still remains to be the de facto standard.

A major drawback of these approaches is that they rely on small subsets of the data to generate the hypotheses, e.g., the 5-point algorithm considers only five correspondences at a time. This is sub-optimal, as image pairs with large baselines and imaging changes will contain a large percentage of outliers, thus making most of the hypotheses useless.

Recent works try to overcome this limitation by simultaneously rejecting outliers and estimating global motion. assumes that global motion should be piece-wise smooth. GMS further exploits this assumption by dividing the image into multiple grids and forming initial matches between the grid cells. Although they show improvements over traditional matching strategies, the piece-wise smoothness assumption is often violated in practice.

Solving image correspondence with deep networks has received a lot of interest recently. In contrast with traditional methods, many of these new techniques are dense, using the raw image as input. They have shown promising results on constrained stereo problems, such as matching two frames from a video sequence, but the general problem remains far from solved. As evidenced by our experiments, this approach can be harmful on scenes with occlusions or large baselines, which are commonplace in photo-tourism applications.

Dense methods aside, DSAC recently introduced a differentiable outlier rejection scheme for monocular pose estimation. However, it relies on a strategy to evaluate hypotheses that is particular to monocular pose estimation and difficult to extend to stereo. Furthermore, this technique amounts to replacing RANSAC with a differentiable counterpart for end-to-end training, which does not significantly improve the performance beyond RANSAC, as we do.

In , a dual architecture with separate networks for extracting sparse keypoints and forming the correspondences, assuming a homography model, was proposed. Because of its requirement for ground-truth annotations of object corners, this work was only demonstrated on synthetic, non-realistic images. By contrast, our method only requires essential matrices for training and thus, as demonstrated by our experiments, is effective in real-world scenarios.

Method

Our goal is to establish valid and geometrically consistent correspondences across images and use them to retrieve the camera motion. We rely on local features, which can be unreliable and require outlier rejection. Traditional methods approach this problem by iteratively sampling small subsets of matches, which makes it difficult to exploit the global context. By contrast, we develop a deep network that can leverage all the available information in a single shot.

Specifically, we extract local features over two images, create an overcomplete set of putative correspondences, and feed them to the network depicted by Fig. 2. It assigns a weight to each correspondence, which encodes their likelihood of being an inlier. We further use it to set the influence of each correspondence in a weighted reformulation of the eight-point algorithm . In other words, our method performs joint inlier/outlier classification and regression to the essential matrix. The relative rotation and translation between cameras can then be recovered with a singular value decomposition of the resulting essential matrix .

In the remainder of this section, we formalize the problem and describe the architecture of Fig. 2. We then reformulate the eight-point algorithm for weighted correspondences, and discuss our learning strategy with a hybrid loss.

Formally, let us consider a pair of images $(\mathbf{I},\mathbf{I}^{\prime})$ . We extract local features $(\mathbf{k}_{i},\mathbf{f}_{i})_{i\in[1,N]}$ from image $\mathbf{I}$ , where $\mathbf{k}_{i}=(u_{i},v_{i})$ contains keypoint locations, and $\mathbf{f}_{i}$ is a vector encoding a descriptor for the image patch centered at $(u_{i},v_{i})$ . The keypoint orientation and scale are used to extract these descriptors and discarded afterwards. Similarly, we obtain $(\mathbf{k}^{\prime}_{j},\mathbf{f}^{\prime}_{j})_{j\in[1,N^{\prime}]}$ from $\mathbf{I^{\prime}}$ , keeping $N=N^{\prime}$ for simplicity. Our formulation is thus amenable to traditional features such as SIFT or newer ones such as LIFT .

We then generate a list of $N$ putative correspondences by matching each keypoint in $\mathbf{I}$ to the nearest one in $\mathbf{I^{\prime}}$ based on the descriptor distances. We denote this list as

More complex strategies could be used, but we found this one sufficient. As many traditional algorithms, we use the camera intrinsics to normalize the coordinates to $ $, which makes the optimization numerically better-behaved . We discard the descriptors, so that the list of$ N$ location quadruplets is the only input to our network.

As will be discussed in Section 3.3, it is straightforward to extend the eight-point algorithm to take as input a set of correspondences with accompanying weights

where $w_{i}\in$ is the score assigned to correspondence $\mathbf{q}_{i}$ , with $w_{i}=0$ standing for $\mathbf{q}_{i}$ being an outlier. Then, let $g$ be a function that takes as input $\mathbf{x}$ and $\mathbf{w}$ and returns the essential matrix $\mathbf{E}$ . Now, given a set of $P$ image pairs $(\mathbf{I}_{k},\mathbf{I}_{k}^{\prime})_{k\in[1,P]}$ and their corresponding essential matrices $\mathbf{E}_{k}$ , we extract the set of correspondences $\mathbf{x}_{k}$ of Eq. (1), and our problem becomes that of designing a deep network that encodes a mapping $f$ with parameters $\Phi$ , such that

2 Network Architecture

We now describe the network that implements the mapping of Eq. (3). Since the order of the correspondences is arbitrary, permuting $\mathbf{x}_{k}$ should result in an equivalent permutation of $\mathbf{w}_{k}=f_{\Phi}(\mathbf{x}_{k})$ . To achieve this, inspired by PointNet , a deep network designed to operate on unordered 3D point clouds, we exploit Multi-Layer Perceptrons (MLP). As MLPs operate on individual correspondences, unlike convolutional or dense networks, incorporating information from other perceptrons, i.e., the context, is indispensable for it to work properly. The distinguishing factor between PointNet and our method is how this is done.

In PointNet, the point-wise features are explicitly combined by a separate network that generates a global context feature, which is then concatenated to each point-wise output. By contrast, as shown in Fig. 2, we introduce a simple strategy to normalize the feature maps according to their distribution, after every perceptron. This lets us process each correspondence separately while framing it in the global context defined by the rest, which encodes camera motion and scene geometry. We call this non-parametric operation Context Normalization (CN).

Formally, let $\mathbf{o}_{i}^{l}\in\mathds{R}^{C^{l}}$ be the output of layer $l$ for correspondence $i$ , where $C^{l}$ is the number of neurons in $l$ . We take the normalized version of $\mathbf{o}_{i}^{l}$ to be

This operation is mechanically similar to other normalization techniques , but is applied to a different dimension and plays a different role. We normalize each perceptron’s output across correspondences, but separately for each image pair. This allows the distribution of the feature maps to encode scene geometry and camera motion, embedding contextual information into context-agnostic MLPs.

By contrast, the other normalization techniques primarily focus on convergence speed, with little impact on how the networks operate. Batch Normalization normalizes the input to each neuron over a mini-batch, so that it follows a consistent distribution while training. Layer Normalization transposes this operation to channels, thus being independent on the number of observations for each neuron. They do not, however, add contextual information to the input. Batch Normalization assumes that every sample follows the same global statistics, and reduces to subtracting and dividing by fixed scalar values at test time. Layer Normalization is applied independently for each neuron, i.e., it considers the features for one correspondence at a time.

More closely related is Instance Normalization , which applies the same operation we do to full images to normalize their contrast, for image stylization. However, it is also limited to enhancing convergence, as in their application context is already captured by spatial convolutions, which are not amenable to the sparse data in our problem.

Note however that our technique is compatible with Batch Normalization, and we do in fact use it to speed up training. As shown in Fig. 2, our network is a 12-layer ResNet , where each layer contains two sequential blocks consisting of: a Perceptron with 128 neurons sharing weights for each correspondence, a Context Normalization layer, a Batch Normalization layer, and a ReLU. After the last perceptron, we apply a ReLU followed by a $\tanh$ to force the output in the range $\left[0,1\right)$ . We use this truncated $\tanh$ instead of, e.g., a sigmoid, so that the network can easily output $w_{i}=0$ to completely remove an outlier.

3 Computing the Essential Matrix

We now define the function $g$ of Eq. (3) that estimates the essential matrix from a weighted set of correspondences. If we are to integrate it into a Deep Learning framework, it must be differentiable with respect to the weights. We first outline the 8-point algorithm that tackles the unweighted scenario, and then extend it to the weighted case.

In practice, the 8-point algorithm can be severely affected by outliers and is used in conjunction with an outlier rejection scheme. In our case, given the weights $\mathbf{w}$ produced by the network, we can simply minimize $\left\|\mathbf{X}^{\top}\operatorname{diag}\left(\mathbf{w}\right)\mathbf{X}\operatorname{Vec}\left(\mathbf{E}\right)\right\|$ instead of $\left\|\mathbf{X}^{\top}\mathbf{X}\operatorname{Vec}\left(\mathbf{E}\right)\right\|$ , where $\operatorname{diag}$ is the diagonalization operator. This amounts to giving weight $w_{i}$ to the $i$ -th row in $\mathbf{X}$ , representing the contribution of the $i$ -th correspondence.

Since $\mathbf{X}^{\top}\operatorname{diag}\left(\mathbf{w}\right)\mathbf{X}$ is symmetric, the function $g$ that computes $\operatorname{Vec}\left(\mathbf{E}\right)$ , and therefore $\mathbf{E}$ , from $\mathbf{X}$ is a self-adjoint eigendecomposition, which has differentiable closed-form solution with respect to $\mathbf{w}$ . Note that $g(\mathbf{x},\mathbf{w})$ only depends on the inliers because any correspondence with zero weight has strictly no influence on the result. To compute the final essential matrix $\hat{\mathbf{E}}$ , we follow the same procedure as in the unweighted case, which is also differentiable.

4 Learning with Classification and Regression

We train the network $f_{\Phi}$ with a hybrid loss function

where $\Phi$ are the network parameters and $\mathbf{x}_{k}$ is the set of putative correspondences for image pair $k$ . $\mathcal{L}_{x}$ is a classification loss aggregated over each individual correspondence, and $\mathcal{L}_{e}$ is a regression loss over the essential matrix, obtained from the weighted 8-point algorithm of Section 3.3 with the weights produced by the network. $\alpha$ and $\beta$ are the hyper-parameters weighing the two loss terms.

Given a set of $N$ putative correspondences $\mathbf{x}_{k}$ and their respective labels $\mathbf{y}_{k}=\left[y_{k}^{1},...,y_{k}^{N}\right]$ , where $y_{k}^{i}\in\{0,1\}$ and $y_{k}^{i}=1$ denotes that the $i$ th correspondence is an inlier, we define

where $o_{k}^{i}$ is the linear output of the last layer for the $i$ -th correspondence in training pair $k$ , $S$ is the logistic function used in conjunction with the binary cross entropy $H$ , and $\gamma_{k}^{i}$ is the per-label weight to balance positive and negative examples.

To avoid exhaustive annotation, we generate the labels $\mathbf{y}$ by exploiting the epipolar constraint. Given a point in one image, if the corresponding point in the other image does not lie on the epipolar line, the correspondence is spurious. We can quantify this with the epipolar distance

where $\mathbf{p}=[u,v,1]^{T}$ and $\mathbf{p}^{\prime}=[u^{\prime},v^{\prime},1]^{T}$ indicate a putative correspondence between images $\mathbf{I}$ and $\mathbf{I}^{\prime}$ in homogenous coordinates, and $\mathbf{v}_{[j]}$ denotes the $j$ th element of vector $\mathbf{v}$ . In practice, we use the symmetric epipolar distance $d(\mathbf{p},\mathbf{E}\mathbf{p}^{\prime})+d(\mathbf{p}^{\prime},\mathbf{E}^{T}\mathbf{p})$ , with a threshold of $10^{-2}$ in normalized coordinates to determine valid correspondences.

Note that this is a weak supervisory signal because outliers can still have a small epipolar distance if they lie close to the epipolar line. However, a fully-supervised labeling would require annotating correspondences for every point in both images, which is prohibitively expensive, or depth data, which is not available in most scenarios. Our experiments show that, in practice, weak supervision is sufficient.

The weak supervision applied by the classification loss proves robust in our experience, but it may still let outliers in. We propose to improve it by using the weighted set of correspondences to extract the essential matrix and penalize deviations from the ground-truth. We therefore write

where $\mathbf{E}_{k}^{*}$ is the ground-truth essential matrix, and $g$ is the function defined in Section 3.3 that predicts an essential matrix from a set of weighted correspondences. We have to compute either the difference or the sum because the sign of $g(\mathbf{x},\mathbf{w})$ can be flipped with respect to that of $\mathbf{E}^{*}$ . Note that here we use the non-rank-constrained estimate of the essential matrix $\mathbf{E}$ , instead of $\hat{\mathbf{E}}$ , as defined in Section 3.3, because minimizing $\mathcal{L}_{e}$ already aims to get the correct rank.

We use Adam to minimize the loss $\mathcal{L}(\Phi)$ , with a learning rate of $10^{-4}$ and mini-batches of 32 image pairs. The classification loss can train accurate models by itself, and we observed that using the regression loss early on can actually harm performance. We find empirically that setting $\alpha=1$ and enabling the regression loss after $20$ k batches with $\beta=0.1$ works well. This allows us to increase relative performance by 5 to 20%.

5 RANSAC at Test Time

The loss function must be differentiable with respect to the network weights. However, this requirement disappears at test time. We can thus apply RANSAC on the correspondences labeled as inliers by our network to weed out any remaining outliers. We will show that this performs much better stand-alone RANSAC, which confirms our intuition that sampling is a sub-optimal way to approach “needle-in-the-haystack” scenarios with a large ratio of outliers.

Experimental Results

We first present the datasets and evaluation protocols, and then justify our implementation choices. Finally, we benchmark our approach against state-of-the-art techniques.

Sparse feature matching is particularly effective in outdoor scenes, and we will see that this is where our approach shines. By contrast, keypoint-based methods are less suitable for indoor scenes. Nevertheless, recent techniques have tackled this problem, and we show that our approach still outperforms the state of the art in this scenario.

To evaluate our method in outdoor settings, we aim to leverage many images featuring similar scenes seen from different viewpoints. As such, photo-tourism datasets are natural candidates. We rely on Yahoo’s YFCC100M dataset , a collection of 100 million publicly accessible Flickr images with accompanying metadata, which were later curated into 72 image collections suitable for Structure from Motion (SfM). We pick five sequences, depicted in Fig. 3. We use VisualSFM to recover the camera poses and generate the ground-truth.

We also consider the ‘Fountain’ and ‘HerzJesu’ sequences from . They are very small and we only use them for testing, to show that our networks trained on photo-tourism datasets can successfully generalize.

We use the SUN3D dataset , which comprises a series of indoor videos captured with a Kinect, with 3D reconstructions. They feature office-like scenes with few distinctive features, many repetitive elements, and substantial self-occlusions, which makes them extremely challenging for sparse keypoint methods. We select 9 sequences for training and testing, and use 15 sequences previously chosen by for testing only, to provide a fair comparison. We subsample the videos by a factor of 10.

For every sequence we train on, we randomly split the images into disjoint subsets for training (60%), validation (20%) and test (20%). To select valid image pairs on the ‘Outdoors’ subset, we sample two images randomly and check if they have a minimum number of 3D points in common from the SfM reconstruction, indicating that the problem is solvable. For the ‘Indoors’ set, we use the depth maps to determine visibility. We use $1$ k image pairs for validation and testing, and as many as possible for training.

2 Evaluation Protocols

They include well-established algorithms, RANSAC , MLESAC , and LMEDS , as well as the very recent GMS . For GMS, we incorporate an additional RANSAC step as described for our method in Section 3.5, which we empirically found mandatory to obtain good performance.

Note that GMS operates with a large ( $10$ k) pool of ORB features , and thus behaves similarly to dense methods. For all others, ours included, we evaluate both SIFT and LIFT features. For SIFT we use the OpenCV library, and for LIFT the publicly available models, which were trained on photo-tourism data different from ours.

We consider G3DR and DeMoN . For G3DR, we implement their architecture and train it using only the pose component of their loss function, as they argue that pose estimation is more accurate without the classification loss . For DeMoN, we use the publicly available models, which were trained on SUN3D sequences and on SfM reconstructions of outdoors sets.

Given two images, it is possible to estimate rotation exactly—in theory—and translation only up to a scale factor . We thus use the angular difference between the estimated and ground-truth vectors, i.e., the closest arc distance, in degrees, for both, as our error metric. We do so as follows. First, we generate a curve by classifying each pose as accurate or not, i.e. we compute the precision, given a threshold (0 to 180o), and build a normalized cumulative curve as in . Second, we compute the area under this curve (AUC) up to a maximum threshold of 5, 10 or 20o, because after a point it does not matter how inaccurate pose estimates are. As the curve measures precision, its AUC is equivalent to mean average precision (mAP). We apply the same threshold over rotation and translation, for simplicity.

3 Ablation Study

At the heart of our approach are two key ideas: to label the correspondences as inliers or outliers while simultaneously using them to estimate the essential matrix, and to leverage a PointNet-like architecture, replacing its global context feature with our Context Normalization, introduced in Section 3.2. We examine the impact of these two choices.

To show the benefits of our hybrid approach, we compare four different settings of our algorithm:

Ours: Our complete formulation with $\alpha$ and $\beta$ in Eq. (6) set to $1$ and initially and then to $1$ and $0.1$ after $20$ k batches. In other words, we first seek to assign reasonable weights to the correspondences before also trying to make the essential matrix accurate.

Essential: We disable the classification loss by setting $\alpha=0$ and $\beta=1$ in Eq. (6). This amounts to direct regression from correspondences to the essential matrix by assigning weights to the correspondences and performing least-squares fitting.

Classification: We disable the essential loss, setting $\alpha=1$ and $\beta=0$ in Eq. (6). The network then tries to classify correspondences into inliers and outliers.

Direct: We regress the essential matrix directly by average-pooling the output of our last ResNet block and adding a fully-connected layer that outputs nine values, which we take to be the output of $g$ in Eq. (9). In other words, we directly predict the coefficients of the essential matrix without resorting to a weighted version of the 8-point algorithm, as in Essential.

We ran these four variants on the ‘Sacre Coeur’ collection from YFCC100M, using SIFT keypoints, and report the results in Fig. 4. Essential and Direct, which do direct regression without classification, perform worse than Classification, as long as the number of keypoints is sufficient. Ours outperforms all the others by combining both classification and regression, by a margin of 12-24%. Note that the difference is larger for smaller error thresholds, suggesting that both Essential and Direct are learning the general trend of the dataset without providing truly accurate poses.

3.2 Context Normalization vs Context Feature

For comparison purposes, we reformulate our approach using the PointNet architecture . The specificity of PointNet is that it extracts a global context feature from the input features, which is then concatenated to each individual feature and re-injected into the network. By contrast, we simply embed it into each point-wise feature through Context Normalization. For a fair comparison, we use an architecture of similar complexity to ours, with 12 MLPs each for the extraction of point features, global features, and the final output. We replace max pooling by average pooling to extract the global feature, which gave better results. We refer to this architecture as PointNet, and train it using the hybrid loss of Eq. (6), as we do for Ours. For completeness, we also try our approach without Context Normalization.

We report results on the ‘Sacre Coeur’ sequence in Fig. 5. In all three cases, we present results without and with the final RANSAC stage of Section 3.5. As expected, our approach without Context Normalization performs poorly, whereas, with it, it does much better than PointNet.

4 Post-processing with RANSAC

Note that traditional keypoint-based methods do not work at all without RANSAC. By contrast, our network outperforms RANSAC in a single forward pass, using only the 8-point algorithm. However, at test time we can drop the differentiability constraints, as explained in Section 3.5, and apply RANSAC on our surviving inliers to further boost performance. We do this for the remainder of this paper.

Moreover, our method is much faster than stand-alone RANSAC because it can throw away most bad matches in a single step. Given $2$ k matches, a forward pass through our network takes 13 ms on GPU (or 25 ms on CPU) and returns on average ~300 inliers—RANSAC then removes a further ~100 matches in 9 ms. By contrast, a RANSAC loop with the full $2$ k matches needs 373 ms and returns ~300 inliers. Our method, including RANSAC, is thus not only more accurate but also 17x faster (please refer to the supplementary material for comprehensive numbers).

5 Comparison to the Baselines

We now turn to comparing our method against the baselines discussed above. Given the results presented in Section 4.3, we use Ours with context normalization turned on, and apply RANSAC post-processing to all of the sparse methods. We average the results over multiple sets—please refer to the supplementary material for additional results.

We first consider networks trained and tested on images from the same scene, and then on different scenes. Note that we split the images into disjoint sets for training an testing, as explained in Section 4.1, so there is no overlap between both sets in either case. For keypoint-based methods other than GMS, we consider both SIFT and LIFT.

Consider the 5 collections of outdoor images from YFCC100M and the 9 indoor sequences from SUN3D, described in Section 4.1 and depicted by Fig. 3. We report our comparative results in Figs. 6 and 8. The training and test sets are always disjoint, but drawn from the same collection.

For the five ‘Outdoors’ collections, whose images are feature-rich, we achieve our best results using LIFT features trained on photo-tourism datasets. Ours then delivers an mAP that is more than twice that of previous state-of-the-art methods. However, even when using the more popular SIFT, it still outperforms the other methods. Furthermore, the gap grows wider for small error thresholds, indicating that our approach performs better the more strict we are.

By contrast, the two ‘Indoors’ collections feature poorly textured images with repetitive patterns, static environments, and consistent scales, which makes them ill-suited for keypoint-based methods and more amenable to dense methods. Nevertheless, our method significantly outperforms all compared methods, including dense ones. On these images, Ours performs better with SIFT than LIFT, which is consistent with the fact that LIFT was trained on photo-tourism images. Note that for the numbers reported in Fig. 6 for DeMoN, we used their pre-trained models as it is not possible to re-train it due to the lack of depth information for the ‘Outdoors’ data. Note also that their training sets include not one but many sequences from SUN3D .

5.2 Generalization to Unknown Scenes

Here, we evaluate the generalization capability of our method by training and testing on different scenes. Fig. 7 reports results for the model trained with the combination of the ‘Saint Peter’s’ sequence from ‘Outdoors’ and the ‘Brown 1’ sequence from ‘Indoors’. For ‘Outdoors’ we report the average result for all sets excluding ‘Saint Peter’s’. For ‘Indoors’, we report the average result for the 15 test-only sequences selected by , for fair comparison.

We outperform all baselines on ‘Outdoors’ by a significant margin. Comparing Figs. 6 and 7 shows that our models generalize very well to unknown scenes. Note that the jump in the performance of several baselines between Figs. 6 and 7 is solely due to the addition of two easy sequences, ‘Fountain’ and ‘Herzjesu’, for testing in Fig. 7.

On ‘Indoors’ we outperform the state of the art by a small margin and lose some generalization power, probably limited by the capabilities of SIFT and LIFT for indoor scenes. Nevertheless, we outperform the state of the art on both subsets with a single model trained on only 2000 outdoors and 500 indoors images and tested on completely different scenes. Further results are given in the supplementary material, where we show that we can still outperform the state of the art with much smaller training sets, e.g., 59 images.

Conclusion

We have proposed a single-shot technique to recover the relative motion between two images with a deep network. In contrast with current trends, our method is sparse, indicating that keypoint-based robust estimation can still be relevant in the age of dense, black-box formulations. Our approach outperforms the state of the art by a significant margin with few images and limited supervision.

Our solution requires known intrinsics. In the future, we plan to investigate using the fundamental matrix instead of the essential. While the formalism would remain largely unchanged, we expect numerical stability problems in the regression component of the hybrid loss, which may require additional normalization layers or regularization terms.

References

Appendix A Supplementary Appendix for “Learning to Find Good Correspondences”

Here we detail the sequences used for training and testing in Table 1. The image numbers reported for SUN3D are after subsampling the video sequences by a factor of 10. For SUN3D we choose 9 sequences for training, and use the 15 sequences previously used by , collectively marked with $\ddagger$ , only for testing. We generate disjoint training, validation and test subsets by splitting the images in each set in a 60-20-20 ratio, as explained in Section 4.1. For $\ddagger$ we take up to 500 images with a 0-0-100 ratio, as we do not train any model on them.

The last column assigns a label (a-u) to each set for convenience, which is used in Section A.3. Our best model is trained concatenating the datasets marked with $\Diamond$ and $\triangle$ .

A.2 Training with limited data

Due to space constraints, the paper only reports results with our best model, which is the concatenation of ‘St. Peter’s’ ( $\Diamond$ ) and ‘Brown 1’ ( $\triangle$ ). Here we replicate the experiments of Section 4.5.1 and Section 4.5.2, i.e., we train a model and evaluate it first on the same sequence and then on every other ‘Outdoors’ sequence, respectively, but now using only our smallest training sequence. The dataset in question is ‘Reichstag’ (d) from the ‘Outdoors’ subset, which contains only 59 images for training, 8 for validation and 8 for testing. Note that after accounting for visibility constraints, this still lets us extract over 1500 image pairs for training and about 35 for each validation and testing.

Fig. 9 shows results training and testing on (different subsets of) the same sequence, and Fig. 10 shows how the model generalizes over every ‘Outdoors’ sequence other than itself, i.e., a-c, e, f, and $\Diamond$ . We follow the same protocols as in Section 4.5.1 and Section 4.5.2. The best results are obtained with LIFT features, which is consistent with our previous observations. Our method outperforms all the baselines, with LIFT plus RANSAC and GMS plus RANSAC being the closest competitors. More importantly, when generalizing to other scenes with so little training data (Fig. 10) we still outperform GMS by 65%-200% relative at different error thresholds, and RANSAC by about 50% relative.

A.3 Per-sequence results

We could not show per-sequence results in the paper due to space constraints. Fig. 11 provides separate results for every testing sequence for our approach and for every baseline at multiple error thresholds. Again, we use our models trained on a single sequence from each data type, marked respectively with $\Diamond$ and $\triangle$ (we do the same for G3DR ). The sequences used for testing include ‘Outdoors’ datasets a-f in Table 1, which are averaged in the column marked $*$ , and ‘Indoors’ datasets g-u, which are averaged in $\ddagger$ . We provide numbers on top of the bars for $*$ and $\ddagger$ . Our approach outperforms every baseline, with LIFT performing better than SIFT on the ‘Outdoors’ subset and the opposite for the ‘Indoors’ subset.

Note that DeMoN only achieves good performance for e and f in the ‘Outdoors’ sequences, and completely fails for YFCC100M sequences. G3DR shows even worse performance, hinting that sparse methods are preferable when it comes to photo-tourism datasets.

A.4 Ransac post-processing

As outlined in 4.4, RANSAC for post-processing allows us to greatly improve both the performance and the speed over RANSAC. In Table 2 we provide results for the generalization experiments of Section 4.5.2, for stand-alone RANSAC, and our method using either the 8-point algorithm or RANSAC for post-processing, which were not originally included in the paper due to spatial constraints. Note that this boost is only possible at test time, due to the differentiability requirement for training.

A.5 Differentiating through eigendecomposition

To differentiate through the eigendecomposition, we rely on the TensorFlow implementation. Here, we provide a short definition for completeness. For more details, we refer the interested readers to . In , it is shown that for a matrix $\mathbf{X}$ , which can be decomposed into $\mathbf{X}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{U}^{\top}$ , where $\mathbf{U}$ is the matrix of eigenvectors and $\boldsymbol{\Sigma}$ is a diagonal matrix with eigenvalues, the derivative w.r.t. the eigenvectors are

where $\mathbf{M}_{sym}=\frac{1}{2}(\mathbf{M}^{\top}+\mathbf{M})$ , and

and $\sigma_{i}$ is the $i$ -th eigenvalue.