LoFTR: Detector-Free Local Feature Matching with Transformers

Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, Xiaowei Zhou

Introduction

Local feature matching between images is the cornerstone of many 3D computer vision tasks, including structure from motion (SfM), simultaneous localization and mapping (SLAM), visual localization, etc. Given two images to be matched, most existing matching methods consist of three separate phases: feature detection, feature description, and feature matching. In the detection phase, salient points like corners are first detected as interest points from each image. Local descriptors are then extracted around neighborhood regions of these interest points. The feature detection and description phases produce two sets of interest points with descriptors, the point-to-point correspondences of which are later found by nearest neighbor search or more sophisticated matching algorithms.

The use of a feature detector reduces the search space of matching, and the resulting sparse correspondences are sufficient for most tasks, e.g., camera pose estimation. However, a feature detector may fail to extract enough interest points that are repeatable between images due to various factors such as poor texture, repetitive patterns, viewpoint change, illumination variation, and motion blur. This issue is especially prominent in indoor environments, where low-texture regions or repetitive patterns sometimes occupy most of the field of view. Fig. 1 shows an example. Without repeatable interest points, it is impossible to find correct correspondences even with perfect descriptors.

Several recent works have attempted to remedy this problem by establishing pixel-wise dense matches. Matches with high confidence scores can be selected from the dense matches, and thus feature detection is avoided. However, the dense features extracted by convolutional neural networks (CNNs) in these works have a limited receptive field, which may not suffice to distinguish indistinctive regions. In contrast, humans find correspondences in these indistinctive regions based not only on the local neighborhood but also on a larger global context. For example, low-texture regions in Fig. 1 can be distinguished according to their positions relative to the edges. This observation tells us that a large receptive field in the feature extraction network is crucial.

Motivated by the above observations, we propose Local Feature TRansformer (LoFTR), a novel detector-free approach to local feature matching. Inspired by the seminal work SuperGlue, we use a Transformer with self and cross attention layers to process (transform) the dense local features extracted from the convolutional backbone. Dense matches are first extracted between the two sets of transformed features at a low feature resolution ($1/8$ of the image dimension). Matches with high confidence are selected from these dense matches and later refined to a sub-pixel level with a correlation-based approach. The global receptive field and positional encoding of the Transformer enable the transformed feature representations to be context- and position-dependent. By interleaving the self and cross attention layers multiple times, LoFTR learns the densely-arranged globally-consented matching priors exhibited in the ground-truth matches. A linear Transformer is also adopted to reduce the computational complexity to a manageable level.

We evaluate the proposed method on several image matching and camera pose estimation tasks with indoor and outdoor datasets. The experiments show that LoFTR outperforms detector-based and detector-free feature matching baselines by a large margin. LoFTR also achieves state-of-the-art performance and ranks first among the published methods on two public benchmarks of visual localization. Compared to detector-based baseline methods, LoFTR can produce high-quality matches even in indistinctive regions with low texture, motion blur, or repetitive patterns.

Related Work

Detector-based Local Feature Matching. Detector-based methods have been the dominant approach for local feature matching. Before the age of deep learning, many renowned hand-crafted local features achieved good performance. SIFT and ORB are arguably the most successful hand-crafted local features and are widely adopted in many 3D computer vision tasks. The performance of local features under large viewpoint and illumination changes can be significantly improved with learning-based methods. Notably, LIFT and MagicPoint are among the first successful learning-based local features. They adopt the detector-based design of hand-crafted methods and achieve good performance. SuperPoint builds upon MagicPoint and proposes a self-supervised training method through homographic adaptation. Many learning-based local features along this line also adopt the detector-based design.

The above-mentioned local features use the nearest neighbor search to find matches between the extracted interest points. Recently, SuperGlue proposes a learning-based approach for local feature matching. SuperGlue accepts two sets of interest points with their descriptors as input and learns their matches with a graph neural network (GNN), which is a general form of Transformers. Since the priors in feature matching can be learned with a data-driven approach, SuperGlue achieves impressive performance and sets the new state of the art in local feature matching. However, being a detector-dependent method, it has the fundamental drawback of being unable to detect repeatable interest points in indistinctive regions. The attention range in SuperGlue is also limited to the detected interest points only. Our work is inspired by SuperGlue in terms of using self and cross attention in GNN for message passing between two sets of descriptors, but we propose a detector-free design to avoid the drawbacks of feature detectors. We also use an efficient variant of the attention layers in Transformer to reduce the computation costs.

Detector-free Local Feature Matching. Detector-free methods remove the feature detector phase and directly produce dense descriptors or dense feature matches. The idea of dense feature matching dates back to SIFT Flow. The first learning-based approaches learn pixel-wise feature descriptors with the contrastive loss. Similar to the detector-based methods, the nearest neighbor search is usually used as a post-processing step to match the dense descriptors. NCNet proposed a different approach by directly learning the dense correspondences in an end-to-end manner. It constructs 4D cost volumes to enumerate all the possible matches between the images and uses 4D convolutions to regularize the cost volume and enforce neighborhood consensus among all the matches. Sparse NCNet improves upon NCNet and makes it more efficient with sparse convolutions. Concurrently with our work, DRC-Net follows this line of work and proposes a coarse-to-fine approach to produce dense matches with higher accuracy. Although all the possible matches are considered in the 4D cost volume, the receptive field of 4D convolution is still limited to each match's neighborhood area. Apart from neighborhood consensus, our work focuses on achieving global consensus between matches with the help of the global receptive field in Transformers, which is not exploited in NCNet and its follow-up works. A dense matching pipeline for SfM with endoscopy videos has also been proposed. The recent line of research that focuses on bridging the tasks of local feature matching and optical flow estimation is also related to our work.

Transformers in Vision Related Tasks. Transformers have become the de facto standard for sequence modeling in natural language processing (NLP) due to their simplicity and computation efficiency. Recently, Transformers have also been getting more attention in computer vision tasks, such as image classification, object detection, and semantic segmentation. Concurrently with our work, Transformers have been proposed for disparity estimation. The computation cost of the vanilla Transformer grows quadratically with the length of the input sequence due to the multiplication between query and key vectors. Many efficient variants have recently been proposed in the context of processing long language sequences. Since these works make no assumptions about the input data, they are also well suited for processing images.

Methods

Given the image pair $I^{A}$ and $I^{B}$, existing local feature matching methods use a feature detector to extract interest points. We propose to tackle the repeatability issue of feature detectors with a detector-free design. An overview of the proposed method LoFTR is presented in Fig. 2.

Convolutional Neural Networks (CNNs) possess the inductive bias of translation equivariance and locality, which are well suited for local feature extraction. The downsampling introduced by the CNN also reduces the input length of the LoFTR module, which is crucial to ensure a manageable computation cost.

2 Local Feature Transformer (LoFTR) Module

Preliminaries: Transformer. We first briefly introduce the Transformer here as background. A Transformer encoder is composed of sequentially connected encoder layers. Fig. 3(a) shows the architecture of an encoder layer.

The key element in the encoder layer is the attention layer. The input vectors for an attention layer are conventionally named query, key, and value. Analogous to information retrieval, the query vector $Q$ retrieves information from the value vector $V$, according to the attention weight computed from the dot product of $Q$ and the key vector $K$ corresponding to each value $V$. The computation graph of the attention layer is presented in Fig. 3(b). Formally, the attention layer is denoted as:

$$\operatorname{Attention}(Q,K,V)=\operatorname{softmax}(QK^{T})V.$$

Intuitively, the attention operation selects the relevant information by measuring the similarity between the query element and each key element. The output vector is the sum of the value vectors weighted by the similarity scores. As a result, the relevant information is extracted from the value vector if the similarity is high. This process is also called “message passing” in graph neural networks.
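The weighted-sum view of attention described above can be illustrated with a minimal NumPy sketch. It assumes unscaled dot-product similarity followed by a softmax over the keys; the function names, shapes, and the absence of a scaling factor are illustrative choices rather than details confirmed by the text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q: (N, D) queries, K: (M, D) keys, V: (M, Dv) values.

    Returns the (N, Dv) output and the (N, M) attention weights.
    Assumption: no 1/sqrt(D) scaling is applied to the dot product.
    """
    weights = softmax(Q @ K.T, axis=-1)  # similarity of each query to each key
    return weights @ V, weights          # values summed with those weights
```

Each row of `weights` sums to one, so every output vector is a convex combination of the value vectors, dominated by the values whose keys are most similar to the query.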

Linear Transformer. Denoting the length of $Q$ and $K$ as $N$ and their feature dimension as $D$ , the dot product between $Q$ and $K$ in the Transformer introduces computation cost that grows quadratically ( $O(N^{2})$ ) with the length of the input sequence. Directly applying the vanilla version of Transformer in the context of local feature matching is impractical even when the input length is reduced by the local feature CNN. To remedy this problem, we propose to use an efficient variant of the vanilla attention layer in Transformer. Linear Transformer proposes to reduce the computation complexity of Transformer to $O(N)$ by substituting the exponential kernel used in the original attention layer with an alternative kernel function $\operatorname{sim}(Q,K)=\phi(Q)\cdot\phi(K)^{T},\text{where }\phi(\cdot)=\operatorname{elu}(\cdot)+1$ . This operation is illustrated by the computation graph in Fig. 3(c). Utilizing the associativity property of matrix products, the multiplication between $\phi(K)^{T}$ and $V$ can be carried out first. Since $D\ll N$ , the computation cost is reduced to $O(N)$ .
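The reordering of the matrix products can be sketched in NumPy as follows. This is a minimal illustration, not the implementation used in LoFTR: the row-wise normalization by the summed similarities and the small `eps` term are assumptions carried over from the usual formulation of linear attention, and the function names are hypothetical.

```python
import numpy as np

def elu_plus_one(x):
    # phi(x) = elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so it is always positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V, eps=1e-6):
    """Q: (N, D), K: (M, D), V: (M, Dv). Cost is linear in N and M."""
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    kv = phi_k.T @ V                     # (D, Dv): computed first, independent of N
    z = phi_q @ phi_k.sum(axis=0)        # (N,): normalizer, assumed row-sum of similarities
    return (phi_q @ kv) / (z[:, None] + eps)

def quadratic_reference(Q, K, V, eps=1e-6):
    """Same result, but forms the full (N, M) similarity matrix explicitly."""
    sim = elu_plus_one(Q) @ elu_plus_one(K).T
    return (sim @ V) / (sim.sum(axis=1, keepdims=True) + eps)
```

Both functions return the same values up to floating-point error; only `quadratic_reference` materializes the $N\times M$ matrix, which is what makes the vanilla ordering expensive for long inputs.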

3 Establishing Coarse-level Matches

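Match Selection. Based on the confidence matrix $\mathcal{P}_{c}$, we select matches with confidence higher than a threshold of $\theta_{c}$, and further enforce the mutual nearest neighbor (MNN) criterion, which filters out possible outlier coarse matches. We denote the coarse-level match predictions as:

$$\mathcal{M}_{c}=\{(\tilde{i},\tilde{j})\mid\forall(\tilde{i},\tilde{j})\in\operatorname{MNN}(\mathcal{P}_{c}),\ \mathcal{P}_{c}(\tilde{i},\tilde{j})\geq\theta_{c}\}.$$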
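As a rough illustration of this selection step, the sketch below thresholds a confidence matrix and keeps only entries that are the maximum of both their row and their column. The function name, the array layout, and the use of `>=` for the threshold test are assumptions made for the example.

```python
import numpy as np

def select_coarse_matches(conf, threshold):
    """conf: (N_A, N_B) confidence matrix between coarse grid cells of the two images.

    Returns a (K, 2) integer array of (i, j) index pairs that pass the
    threshold and are mutual nearest neighbors. Exact ties in a row or
    column are all kept by this simple sketch.
    """
    row_best = conf == conf.max(axis=1, keepdims=True)  # best j for each i
    col_best = conf == conf.max(axis=0, keepdims=True)  # best i for each j
    mask = (conf >= threshold) & row_best & col_best
    return np.argwhere(mask)
```

For example, with rows `[0.9, 0.05, 0.0]`, `[0.1, 0.15, 0.6]`, `[0.3, 0.1, 0.1]` and a threshold of 0.2, only the pairs (0, 0) and (1, 2) survive: entry (2, 0) is the best in its row but not in its column, and entry (1, 1) is the best in its column but not in its row.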

4 Coarse-to-Fine Module

5 Supervision

The final loss consists of the losses for the coarse-level and the fine-level: $\mathcal{L}=\mathcal{L}_{c}+\mathcal{L}_{f}$ .

Coarse-level Supervision. The loss function for the coarse-level is the negative log-likelihood loss over the confidence matrix $\mathcal{P}_{c}$ returned by either the optimal transport layer or the dual-softmax operator. We follow SuperGlue to use camera poses and depth maps to compute the ground-truth labels for the confidence matrix during training. We define the ground-truth coarse matches $\mathcal{M}_{c}^{gt}$ as the mutual nearest neighbors of the two sets of $1/8$-resolution grids. The distance between two grids is measured by the re-projection distance of their central locations. More details are provided in the supplementary. With the optimal transport layer, we use the same loss formulation as in prior work. When using dual-softmax for matching, we minimize the negative log-likelihood loss over the grids in $\mathcal{M}_{c}^{gt}$:

$$\mathcal{L}_{c}=-\frac{1}{|\mathcal{M}_{c}^{gt}|}\sum_{(\tilde{i},\tilde{j})\in\mathcal{M}_{c}^{gt}}\log\mathcal{P}_{c}(\tilde{i},\tilde{j}).$$

For the fine-level loss $\mathcal{L}_{f}$, the ground-truth position $\hat{j}^{\prime}_{gt}$ is calculated by warping each $\hat{i}$ from $\hat{F}^{A}_{tr}(\hat{i})$ to $\hat{F}^{B}_{tr}(\hat{j})$ with the ground-truth camera pose and depth. We ignore ($\hat{i}$, $\hat{j}^{\prime}$) if the warped location of $\hat{i}$ falls outside the local window of $\hat{F}^{B}_{tr}(\hat{j})$. The gradient is not backpropagated through $\sigma^{2}(\hat{i})$ during training.
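The coarse-level negative log-likelihood for the dual-softmax case can be sketched as below. Two things are assumptions here: the dual-softmax operator is taken to be the element-wise product of a row-wise and a column-wise softmax over a score matrix, and the loss is averaged over the ground-truth matches. Function names and the `eps` term are illustrative.

```python
import numpy as np

def dual_softmax(scores):
    """Assumed form: softmax over rows times softmax over columns of a score matrix."""
    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)
    return softmax(scores, axis=1) * softmax(scores, axis=0)

def coarse_nll_loss(conf, gt_matches, eps=1e-12):
    """conf: (N_A, N_B) confidence matrix; gt_matches: (K, 2) ground-truth (i, j) pairs.

    Returns the mean negative log-confidence at the ground-truth matches.
    """
    i, j = gt_matches[:, 0], gt_matches[:, 1]
    return -np.mean(np.log(conf[i, j] + eps))
```

With a uniform $3\times 3$ score matrix every confidence equals $1/9$ and the loss is $\log 9$; as the scores at the ground-truth entries grow relative to the rest, the loss approaches zero.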

6 Implementation Details

Experiments

In the first experiment, we evaluate LoFTR on the widely adopted HPatches dataset for homography estimation. HPatches contains 52 sequences under significant illumination changes and 56 sequences that exhibit large variation in viewpoints.

Evaluation protocol. In every test sequence, one reference image is paired with the remaining five images. All images are resized so that the shorter dimension equals 480. For each image pair, we extract a set of matches with LoFTR trained on MegaDepth. We use OpenCV to compute the homography estimation with RANSAC as the robust estimator. To make a fair comparison to methods that produce different numbers of matches, we follow prior work and compute the corner error between the images warped with the estimated $\hat{\mathcal{H}}$ and the ground-truth $\mathcal{H}$ as a correctness identifier. Also following prior work, we report the area under the cumulative curve (AUC) of the corner error up to threshold values of 3, 5, and 10 pixels, respectively. We report the results of LoFTR with a maximum of 1K output matches.
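The two metrics in this protocol can be sketched as follows. The sketch assumes that the corner error is the mean distance between the four image corners warped by the two homographies, and that the AUC is the area under the cumulative error curve up to each threshold, normalized by the threshold. All function names are hypothetical and the homography estimation itself is not shown.

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 homography H to (K, 2) points."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    warped = pts_h @ H.T
    return warped[:, :2] / warped[:, 2:3]

def corner_error(H_est, H_gt, width, height):
    """Mean distance (pixels) between image corners warped by H_est and by H_gt."""
    corners = np.array([[0, 0], [width - 1, 0],
                        [width - 1, height - 1], [0, height - 1]], dtype=float)
    diff = warp_points(H_est, corners) - warp_points(H_gt, corners)
    return np.linalg.norm(diff, axis=1).mean()

def error_auc(errors, thresholds):
    """Area under the cumulative error curve up to each threshold, normalized to [0, 1]."""
    errors = np.concatenate([[0.0], np.sort(np.asarray(errors, dtype=float))])
    recall = np.arange(len(errors)) / (len(errors) - 1)  # fraction of pairs with error <= errors[k]
    aucs = []
    for t in thresholds:
        last = np.searchsorted(errors, t)
        x = np.concatenate([errors[:last], [t]])
        y = np.concatenate([recall[:last], [recall[last - 1]]])
        area = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0)  # trapezoidal rule
        aucs.append(area / t)
    return aucs
```

A call such as `error_auc(per_pair_errors, [3, 5, 10])` would give one value per threshold, where `per_pair_errors` holds one corner error for each evaluated image pair.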

Baseline methods. We compare LoFTR with three categories of methods: 1) detector-based local features including R2D2, D2Net, and DISK, 2) a detector-based local feature matcher, i.e., SuperGlue on top of SuperPoint features, and 3) detector-free matchers including Sparse-NCNet and DRC-Net. For local features, we extract a maximum of 2K features, from which we extract mutual nearest neighbors as the final matches. For methods that directly output matches, we restrict the output to a maximum of 1K matches, the same as for LoFTR. We use the default hyperparameters in the original implementations for all the baselines.

Results. Tab. 1 shows that LoFTR outperforms the other baselines by a significant margin under all error thresholds. Specifically, the performance gap between LoFTR and other methods increases with a stricter correctness threshold. We attribute the top performance to the larger number of match candidates provided by the detector-free design and the global receptive field brought by the Transformer. Moreover, the coarse-to-fine module also contributes to the estimation accuracy by refining matches to a sub-pixel level.

2 Relative Pose Estimation

Datasets. We use ScanNet and MegaDepth to demonstrate the effectiveness of LoFTR for pose estimation in indoor and outdoor scenes, respectively.

ScanNet contains 1613 monocular sequences with ground truth poses and depth maps. Following the procedure from SuperGlue, we sample 230M image pairs for training, with overlap scores between 0.4 and 0.8. We evaluate our method on the 1500 testing pairs used in prior work. All images and depth maps are resized to $640\times 480$. This dataset contains image pairs with wide baselines and extensive texture-less regions.

MegaDepth consists of 1M internet images of 196 different outdoor scenes. The authors also provide sparse reconstruction from COLMAP and depth maps computed from multi-view stereo. We follow DISK to only use the scenes of “Sacre Coeur” and “St. Peter’s Square” for validation, from which we sample 1500 pairs for a fair comparison. Images are resized such that their longer dimensions are equal to 840 for training and 1200 for validation. The key challenge on MegaDepth is matching under extreme viewpoint changes and repetitive patterns.

Evaluation protocol. Following prior work, we report the AUC of the pose error at thresholds ($5^{\circ},10^{\circ},20^{\circ}$), where the pose error is defined as the maximum of the angular errors in rotation and translation. To recover the camera pose, we solve the essential matrix from predicted matches with RANSAC. We do not compare the matching precision between LoFTR and other detector-based methods due to the lack of a well-defined metric (e.g., matching score or recall) for detector-free image matching methods. We consider DRC-Net as the state-of-the-art method among detector-free approaches.
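The pose error defined above can be sketched as follows. The sketch assumes that the rotation error is the angle of the relative rotation between the estimated and ground-truth rotation matrices, and that the translation error is the angle between the two translation directions (translation from an essential matrix is only known up to scale). Function names are hypothetical, and any handling of a sign ambiguity in the translation is left out.

```python
import numpy as np

def rotation_angle_error(R_est, R_gt):
    """Angle in degrees of the relative rotation R_est^T R_gt."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_angle_error(t_est, t_gt):
    """Angle in degrees between two translation directions; scale is ignored."""
    cos = np.dot(t_est, t_gt) / (np.linalg.norm(t_est) * np.linalg.norm(t_gt))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def pose_error(R_est, t_est, R_gt, t_gt):
    """Maximum of the angular errors in rotation and translation, in degrees."""
    return max(rotation_angle_error(R_est, R_gt),
               translation_angle_error(t_est, t_gt))
```

The per-pair values returned by `pose_error` are what the AUC at $5^{\circ}$, $10^{\circ}$, and $20^{\circ}$ would then be computed over.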

Results of indoor pose estimation. LoFTR achieves the best performance in pose accuracy compared to all competitors (see Tab. 2 and Fig. 5). Pairing LoFTR with optimal transport or dual-softmax as the differentiable matching layer achieves comparable performance. Since the released model of DRC-Net ${\dagger}$ is trained on MegaDepth, we provide the results of LoFTR ${\dagger}$ trained on MegaDepth for a fair comparison. LoFTR ${\dagger}$ also outperforms DRC-Net ${\dagger}$ by a large margin in this evaluation (see Fig. 5), which demonstrates the generalizability of our model across datasets.

Results of outdoor pose estimation. As shown in Tab. 3, LoFTR outperforms the detector-free method DRC-Net by 61% at AUC@10°, demonstrating the effectiveness of the Transformer. For SuperGlue, we use the setup from the open-sourced localization toolbox HLoc. LoFTR outperforms SuperGlue by a large margin (13% at AUC@10°), which demonstrates the effectiveness of the detector-free design. Different from indoor scenes, LoFTR-DS performs better than LoFTR-OT on MegaDepth. More qualitative results can be found in Fig. 5.

3 Visual Localization

Visual Localization. Besides achieving competitive performance for relative pose estimation, LoFTR can also benefit visual localization, the task of estimating the 6-DoF poses of given images with respect to the corresponding 3D scene model. We evaluate LoFTR on the Long-Term Visual Localization Benchmark (referred to as the VisLoc benchmark in the following). It focuses on benchmarking visual localization methods under varying conditions, e.g., day-night changes, scene geometry changes, and indoor scenes with plenty of texture-less areas. Thus, the visual localization task relies on highly robust image matching methods.

Evaluation. We evaluate LoFTR on two tracks of VisLoc that consist of several challenges. First, the “visual localization for handheld devices” track requires a full localization pipeline. It benchmarks on two datasets, the Aachen-Day-Night dataset concerning outdoor scenes and the InLoc dataset concerning indoor scenes. We use the open-sourced localization pipeline HLoc with the matches extracted by LoFTR. Second, the “local features for long-term localization” track provides a fixed localization pipeline to evaluate the local feature extractors themselves and optionally the matchers. This track uses the Aachen v1.1 dataset. We provide the implementation details of testing LoFTR on VisLoc in the supplementary material.

Results. We provide evaluation results of LoFTR in Tab. 4 and Tab. 5. We have evaluated LoFTR pairing with either the optimal transport layer or the dual-softmax operator and report the one with better results. LoFTR-DS outperforms all baselines in the local feature challenge track, showing its robustness under day-night changes. Then, for the visual localization for handheld devices track, LoFTR-OT outperforms all published methods on the challenging InLoc dataset, which contains extensive appearance changes, more texture-less areas, symmetric and repetitive elements. We attribute the prominence to the use of the Transformer and the optimal transport layer, taking advantage of global information and jointly bringing global consensus into the final matches. The detector-free design also plays a critical role, preventing the repeatability problem of detector-based methods in low-texture regions. LoFTR-OT performs on par with the state-of-the-art method SuperPoint + SuperGlue on night queries of the Aachen v1.1 dataset and slightly worse on the day queries.

4 Understanding LoFTR

Ablation Study. To fully understand the different modules in LoFTR, we evaluate five different variants with results shown in Tab. 6: 1) Replacing the LoFTR module by convolution with a comparable number of parameters results in a significant drop in AUC as expected. 2) Using a smaller version of LoFTR with $1/16$ and $1/4$ resolution feature maps at the coarse and fine level, respectively, results in a running time of 104 ms and a degraded pose estimation accuracy. 3) Using a DETR-style Transformer architecture, which has positional encoding at each layer, leads to a noticeably declined result. 4) Increasing the model capacity by doubling the number of LoFTR layers to $N_{c}=8\text{ and }N_{f}=2$ barely changes the results. We conduct these experiments using the same training and evaluation protocol as indoor pose estimation on ScanNet with an optimal transport layer for matching.

Visualizing Attention. We visualize the attention weights in Fig. 6.

Conclusion

This paper presents a novel detector-free matching approach, named LoFTR, that can establish accurate semi-dense matches with Transformers in a coarse-to-fine manner. The proposed LoFTR module uses the self and cross attention layers in Transformers to transform the local features to be context- and position-dependent, which is crucial for LoFTR to obtain high-quality matches in indistinctive regions with low texture or repetitive patterns. Our experiments show that LoFTR achieves state-of-the-art performance on relative pose estimation and visual localization on multiple datasets. We believe that LoFTR provides a new direction for detector-free methods in local image feature matching and can be extended to more challenging scenarios, e.g., matching images with severe seasonal changes.

Acknowledgement. The authors would like to acknowledge the support from the National Key Research and Development Program of China (No. 2020AAA0108901), NSFC (No. 61806176), and ZJU-SenseTime Joint Lab of 3D Vision.