COTR: Correspondence Transformer for Matching Across Images

Wei Jiang, Eduard Trulls, Jan Hosang, Andrea Tagliasacchi, Kwang Moo Yi

Introduction

Finding correspondences across pairs of images is a fundamental task in computer vision, with applications ranging from camera calibration to optical flow, Structure from Motion (SfM), visual localization, point tracking, and human pose estimation. Traditionally, two fundamental research directions exist for this problem. One is to extract sets of sparse keypoints from both images and match them in order to minimize an alignment metric. The other is to interpret correspondence as a dense process, where every pixel in the first image maps to a pixel in the second image.

The divide between sparse and dense emerged naturally from the applications they were devised for. Sparse methods have largely been used to recover a single global camera motion, such as in wide-baseline stereo, using geometrical constraints. They rely on local features and further prune the putative correspondences formed with them in a separate stage with sampling-based robust matchers, or their learned counterparts. Dense methods, by contrast, usually model small temporal changes, such as optical flow in video sequences, and rely on local smoothness. Exploiting context in this manner allows them to find correspondences at arbitrary locations, including seemingly texture-less areas.

In this work, we present a solution that bridges this divide, a novel network architecture that can express both forms of prior knowledge – global and local – and learn them implicitly from data. To achieve this, we leverage the inductive bias that densely connected networks possess in representing smooth functions and use a transformer to automatically control the nature of priors and learn how to utilize them through its attention mechanism. For example, ground-truth optical flow typically does not change smoothly across object boundaries, and simple (attention-agnostic) densely connected networks would have challenges in modelling such a discontinuous correspondence map, whereas a transformer would not. Moreover, transformers allow encoding the relationship between different locations of the input data, making them a natural fit for correspondence problems.

Our work is the first to apply transformers to obtain accurate correspondences. Our main technical contributions are:

we propose a functional correspondence architecture that combines the strengths of dense and sparse methods;

we show how to apply our method recursively at multiple scales during inference in order to compute highly-accurate correspondences;

we demonstrate that COTR achieves state-of-the-art performance in both dense and sparse correspondence problems on multiple datasets and tasks, without retraining;

we substantiate our design choices and show that the transformer is key to our approach by replacing it with a simpler model, based on a Multi-Layer Perceptron (MLP).

Related works

We review the literature on both sparse and dense matching, as well as works that utilize transformers for vision.

Sparse methods generally consist of three stages: keypoint detection, feature description, and feature matching. Seminal detectors include DoG and FAST. Popular patch descriptors range from hand-crafted to learned ones. Learned feature extractors became popular with the introduction of LIFT, with many follow-ups. Local features are designed with sparsity in mind, but have also been applied densely in some cases. Learned local features are trained with intermediate metrics, such as descriptor distance or number of matches.

Feature matching is treated as a separate stage, where descriptors are matched, followed by heuristics such as the ratio test, and robust matchers, which are key to dealing with high outlier ratios. The latter are the focus of much research, whether hand-crafted, following RANSAC, consensus- or motion-based heuristics, or learned. The current state of the art builds on attentional graph neural networks. Note that while some of these theoretically allow feature extraction and matching to be trained end to end, this avenue remains largely unexplored. We show that our method, which does not divide the pipeline into multiple stages and is learned end-to-end, can outperform these sparse methods.

Dense methods

Dense methods aim to solve optical flow. This typically implies small displacements, such as the motion between consecutive video frames. The classical Lucas-Kanade method solves for correspondences over local neighbourhoods, while Horn-Schunck imposes global smoothness. More modern algorithms still rely on these principles, with different algorithmic choices, or focus on larger displacements. Estimating dense correspondences under large baselines and drastic appearance changes was not explored until methods such as DeMoN and SfMLearner appeared, which recovered both depth and camera motion – however, their performance fell somewhat short of sparse methods. Neighbourhood Consensus Networks explored 4D correlations – while powerful, this limits the image size they can tackle. More recently, DGC-Net applied CNNs in a coarse-to-fine approach, trained on synthetic transformations, GLU-Net combined global and local correlation layers in a feature pyramid, and GOCor improved the feature correlation layers to disambiguate repeated patterns. We show that we outperform DGC-Net, GLU-Net and GOCor over multiple datasets, while retaining our ability to query individual points.

Attention mechanisms

The attention mechanism enables a neural network to focus on part of the input. Hard attention was pioneered by Spatial Transformers, which introduced a powerful differentiable sampler, and was later improved upon. Soft attention was pioneered by transformers, which have since become the de-facto standard in natural language processing – their application to vision tasks is still in its early stages. Recently, DETR used Transformers for object detection, whereas ViT applied them to image recognition. Our method is the first application of transformers to image correspondence problems. A concurrent relevant work for feature-less image matching was proposed shortly after our work became public.

Functional methods using deep learning

While the idea existed already, e.g. to generate images, using neural networks in functional form has recently gained much traction. DeepSDF uses deep networks as a function that returns the signed distance field value of a query point. These ideas were recently extended to establish correspondences between incomplete shapes. While not directly related to image correspondence, this research has shown that functional methods can achieve state-of-the-art performance.

Method

We first formalize our problem (Section 3.1), then detail our architecture (Section 3.2), its recursive use at inference time (Section 3.3), and our implementation (Section 3.4).

Let $\boldsymbol{x}\in[0,1]^{2}$ be the normalized coordinates of the query point in image $\boldsymbol{I}$, for which we wish to find the corresponding point, $\boldsymbol{x}^{\prime}\in[0,1]^{2}$, in image $\boldsymbol{I}^{\prime}$. We frame the problem of learning to find correspondences as that of finding the best set of parameters $\boldsymbol{\Phi}$ for a parametric function $\mathcal{F}_{\boldsymbol{\Phi}}\left(\boldsymbol{x}|\boldsymbol{I},\boldsymbol{I}^{\prime}\right)$ minimizing

where $\mathcal{D}$ is the training dataset of ground-truth correspondences, $\mathcal{L}_{\text{corr}}$ measures the correspondence estimation errors, and $\mathcal{L}_{\text{cycle}}$ enforces correspondences to be cycle-consistent.
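The displayed objective is missing from this copy of the text. A reconstruction consistent with the definitions above, and with the later reference to the cycle consistency error as (3), could read as follows. The squared $\ell_2$ norms and the equal weighting of the two terms are assumptions, not details confirmed by the surrounding text.

```latex
\begin{align}
\underset{\boldsymbol{\Phi}}{\arg\min}\;
  \mathbb{E}_{(\boldsymbol{x},\boldsymbol{x}^{\prime},\boldsymbol{I},\boldsymbol{I}^{\prime})\sim\mathcal{D}}\;
  \mathcal{L}_{\text{corr}} + \mathcal{L}_{\text{cycle}}, \\
\mathcal{L}_{\text{corr}} =
  \left\| \boldsymbol{x}^{\prime}
  - \mathcal{F}_{\boldsymbol{\Phi}}\left(\boldsymbol{x}|\boldsymbol{I},\boldsymbol{I}^{\prime}\right) \right\|_{2}^{2}, \\
\mathcal{L}_{\text{cycle}} =
  \left\| \boldsymbol{x}
  - \mathcal{F}_{\boldsymbol{\Phi}}\left(
      \mathcal{F}_{\boldsymbol{\Phi}}\left(\boldsymbol{x}|\boldsymbol{I},\boldsymbol{I}^{\prime}\right)
      |\boldsymbol{I}^{\prime},\boldsymbol{I}\right) \right\|_{2}^{2}.
\end{align}
```

Under this reading, $\mathcal{L}_{\text{cycle}}$ maps the query to the second image and back again, and penalizes the distance between the round-trip result and the original query.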

3.2 Network architecture

We implement $\mathcal{F}_{\boldsymbol{\Phi}}$ with a transformer. Our architecture, inspired by , is illustrated in Figure 2. We first crop and resize each input into a $256\times 256$ image, and convert it into a downsampled feature map of size $16\times 16\times 256$ with a shared CNN backbone, $\mathcal{E}$. We then concatenate the representations of the two images side by side, forming a feature map of size $16\times 32\times 256$, to which we add the positional encoding $\mathcal{P}$ (with $N{=}256$ channels) of the coordinate function $\boldsymbol{\Omega}$ (i.e. $\text{MeshGrid}(0{:}1,0{:}2)$ of size $16{\times}32{\times}2$) to produce a context feature map $\mathbf{c}$ (of size $16\times 32\times 256$):

where $\left[\cdot\right]$ denotes concatenation along the spatial dimension – a subtly important detail novel to our architecture that we discuss in greater depth later on. We then feed the context feature map $\mathbf{c}$ to a transformer encoder $\mathcal{T}_{\mathcal{E}}$ , and interpret its results with a transformer decoder $\mathcal{T}_{\mathcal{D}}$ , along with the query point $\boldsymbol{x}$ , encoded by $\mathcal{P}$ – the positional encoder used to generate $\boldsymbol{\Omega}$ . We finally process the output of the transformer decoder with a fully connected layer $\mathcal{D}$ to obtain our estimate for the corresponding point, $\boldsymbol{x}^{\prime}$ .

For architectural details of each component, please refer to the supplementary material.

Concatenation of the feature maps along the spatial dimension is critical, as it allows the transformer encoder $\mathcal{T}_{\mathcal{E}}$ to relate between locations within the image (self-attention), and across images (cross-attention). Note that, to allow the encoder to distinguish between pixels in the two images, we employ a single positional encoding for the entire concatenated feature map; see Fig. 2. We concatenate along the spatial dimension rather than the channel dimension, as the latter would create artificial relationships between features coming from the same pixel locations in each image. Concatenation allows the features in each map to be treated in a way that is similar to words in a sentence . The encoder then associates and relates them to discover which ones to attend to given their context – which is arguably a more natural way to find correspondences.
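The side-by-side concatenation and the shared coordinate grid described above can be illustrated with a minimal, dependency-free sketch. Feature maps are plain nested lists here, and the names `coordinate_grid` and `concat_spatial` are hypothetical. Sampling the grid at cell centres is an assumption: the text only specifies the ranges $0{:}1$ and $0{:}2$.

```python
def coordinate_grid(height=16, width=32):
    """Coordinate function Omega over the concatenated map.

    Returns a height x width grid of (x, y) pairs with y in (0, 1) and
    x in (0, 2), so cells of the second image receive x > 1.
    Cell-centre sampling is an assumption made for this sketch.
    """
    single_width = width // 2
    return [
        [((col + 0.5) / single_width, (row + 0.5) / height) for col in range(width)]
        for row in range(height)
    ]


def concat_spatial(feat_a, feat_b):
    """Concatenate two H x W x C maps along the width, giving H x 2W x C."""
    if len(feat_a) != len(feat_b):
        raise ValueError("feature maps must have the same height")
    return [row_a + row_b for row_a, row_b in zip(feat_a, feat_b)]


# Toy maps with 4 channels instead of 256, to keep the example small.
feat_a = [[[1.0] * 4 for _ in range(16)] for _ in range(16)]
feat_b = [[[2.0] * 4 for _ in range(16)] for _ in range(16)]
context = concat_spatial(feat_a, feat_b)
grid = coordinate_grid()
```

Because a single grid spans both halves, a cell in the second image never shares a coordinate with a cell in the first, which is what lets the encoder tell the two images apart.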

Linear positional encoding

We found it critical to use a linear increase in frequency for the positional encoding, as opposed to the commonly used log-linear strategy, which made our optimization unstable; see supplementary material. Hence, for a given location $\boldsymbol{x}=[x,y]$ we write

where $N{=}256$ is the number of channels of the feature map. Note that $p_{k}$ generates four values, so that the output of the encoder $\mathcal{P}$ is of size $N$.
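The encoding formula itself is missing from this copy of the text. As a hedged sketch, one form consistent with the description (a frequency that grows linearly with the index $k$, and four values per $p_{k}$) is $p_{k}(\boldsymbol{x})=[\sin(k\pi x),\sin(k\pi y),\cos(k\pi x),\cos(k\pi y)]$ for $k=1,\dots,N/4$. The factor $k\pi$ and the ordering of the four values are assumptions, and `linear_positional_encoding` is a hypothetical name.

```python
import math


def linear_positional_encoding(x, y, num_channels=256):
    """Encode a 2D location with linearly increasing frequencies.

    Each index k contributes four values (sin/cos for x and for y), so
    num_channels / 4 frequencies fill the output. The k * pi frequency
    and the value ordering are assumptions made for this sketch.
    """
    if num_channels % 4 != 0:
        raise ValueError("num_channels must be divisible by 4")
    encoding = []
    for k in range(1, num_channels // 4 + 1):
        freq = k * math.pi  # linear (not log-linear) growth in frequency
        encoding.extend([
            math.sin(freq * x), math.sin(freq * y),
            math.cos(freq * x), math.cos(freq * y),
        ])
    return encoding


code = linear_positional_encoding(0.25, 0.75)
```

A log-linear variant would replace `k * math.pi` with a geometrically spaced frequency, which reaches far higher frequencies for the same number of channels.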

Querying multiple points

We have introduced our framework as a function operating on a single query point, $\boldsymbol{x}$ . However, as shown in Fig. 2, extending it to multiple query points is straightforward. We can simply input multiple queries at once, which the transformer decoder $\mathcal{T}_{\mathcal{D}}$ and the decoder $\mathcal{D}$ will translate into multiple coordinates. Importantly, while doing so, we disallow self attention among the query points in order to ensure that they are solved independently.

3.3 Inference

We next discuss how to apply our functional approach at inference time in order to obtain accurate correspondences.

Applying the powerful transformer attention mechanism to vision problems comes at a cost – it requires heavily downsampled feature maps, which in our case naturally translates to poorly localized correspondences; see Section 4.6. We address this by exploiting the functional nature of our approach, applying our network $\mathcal{F}_{\Phi}$ recursively. As shown in Fig. 3, we iteratively zoom into a previously estimated correspondence, on both images, in order to obtain a refined estimate. There is a trade-off between compute and the number of zoom-in steps. We ablated this carefully on the validation data and settled on a zoom-in factor of two at each step, with four zoom-in steps. It is worth noting that multiscale refinement is common in many computer vision algorithms, but thanks to our functional correspondence model, realizing such a multiscale inference process is not only possible, but also straightforward to implement.
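The recursive zoom-in can be sketched as a short loop. This is only an illustration under stated assumptions: `predict` is a hypothetical stand-in for the network, taking two crops and a query in crop-local coordinates and returning an estimate in crop-local coordinates; crops are square and described as `(centre_x, centre_y, size)` in normalized image coordinates; and one full-image pass precedes the four zoom-ins. Scale compensation and boundary handling are omitted.

```python
def to_local(point, crop):
    """Map a point from image coordinates into a crop's [0, 1]^2 frame."""
    cx, cy, size = crop
    return ((point[0] - cx) / size + 0.5, (point[1] - cy) / size + 0.5)


def to_global(point, crop):
    """Map a point from a crop's [0, 1]^2 frame back to image coordinates."""
    cx, cy, size = crop
    return ((point[0] - 0.5) * size + cx, (point[1] - 0.5) * size + cy)


def recursive_match(query, predict, num_zooms=4, zoom_factor=2.0):
    """Refine a correspondence by zooming in on both images.

    Returns the estimate after every pass (coarsest first), all expressed
    in the normalized coordinates of the second image.
    """
    crop_a = (0.5, 0.5, 1.0)  # first pass uses the whole images
    crop_b = (0.5, 0.5, 1.0)
    estimate = to_global(predict(crop_a, crop_b, to_local(query, crop_a)), crop_b)
    estimates = [estimate]
    size = 1.0
    for _ in range(num_zooms):
        size /= zoom_factor
        crop_a = (query[0], query[1], size)        # centred on the query
        crop_b = (estimate[0], estimate[1], size)  # centred on the last estimate
        local = predict(crop_a, crop_b, to_local(query, crop_a))
        estimate = to_global(local, crop_b)
        estimates.append(estimate)
    return estimates
```

Since every pass sees a crop half the size of the previous one at the same input resolution, a fixed localization error in crop coordinates shrinks by the zoom factor in image coordinates at each step.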

Compensating for scale differences

While matching images recursively, one must account for a potential mismatch in scale between images. We achieve this by making the scale of the patch to crop proportional to the commonly visible regions in each image, which we compute on the first step, using the whole images. To extract this region, we compute the cycle consistency error at the coarsest level, for every pixel, and threshold it at $\tau_{\text{visible}}{=}5$ pixels on the $256\times 256$ image; see Fig. 4. In subsequent stages – the zoom-ins – we simply adjust the crop sizes over $\boldsymbol{I}$ and $\boldsymbol{I}^{\prime}$ so that their relationship is proportional to the sum of valid pixels (the unmasked pixels in Fig. 4).

Dealing with images of arbitrary size

Our network expects images of fixed $256\times 256$ shape. To process images of arbitrary size, in the initial step we simply resize (i.e. stretch) them to $256\times 256$, and estimate the initial correspondences. In subsequent zoom-ins, we crop square patches from the original image around the estimated points, of a size commensurate with the current zoom level, and resize them to $256\times 256$. While this may seem a limitation on images with non-standard aspect ratios, our approach performs well on KITTI images, which are extremely wide (3.3:1). Moreover, we present a strategy to tile detections in Section 4.4.

Discarding erroneous correspondences

What should we do when a queried point is occluded or outside the viewport in the other image? Similarly to our strategy to compensate for scale, we resolve this problem by simply rejecting correspondences that induce a cycle consistency error (3) greater than $\tau_{\text{cycle}}{=}5$ pixels. Another heuristic we apply is to terminate correspondences that do not converge while zooming in. We compute the standard deviation of the zoom-in estimates, and reject correspondences that oscillate by more than $\tau_{\text{std}}{=}0.02$ of the long-edge of the image.
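Both rejection heuristics can be sketched in a few lines. The function names are hypothetical, `forward` and `backward` stand in for the network applied in each direction, and the cycle error is measured in pixels of a $256\times 256$ image. Taking the larger of the per-axis standard deviations is an assumption, since the text does not say how the deviation of 2D estimates is reduced to a scalar.

```python
import math
import statistics


def cycle_error_px(x, forward, backward, image_size=256):
    """Distance in pixels between a query and its round-trip estimate."""
    x_back = backward(forward(x))
    return math.hypot(x_back[0] - x[0], x_back[1] - x[1]) * image_size


def is_cycle_consistent(x, forward, backward, tau_cycle=5.0):
    """Keep a correspondence only if its cycle error is within tau_cycle pixels."""
    return cycle_error_px(x, forward, backward) <= tau_cycle


def has_converged(estimates_px, long_edge_px, tau_std=0.02):
    """Reject estimates that oscillate across zoom levels.

    estimates_px holds one (x, y) estimate per zoom level, in pixels of
    the original image. Using the max of the per-axis standard deviations
    is an assumption made for this sketch.
    """
    std_x = statistics.pstdev([p[0] for p in estimates_px])
    std_y = statistics.pstdev([p[1] for p in estimates_px])
    return max(std_x, std_y) <= tau_std * long_edge_px
```

A query would be kept only when both checks pass; occluded or out-of-view points tend to fail the cycle check because the backward pass has nothing consistent to return to.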

Interpolating for dense correspondence

While we could query every single point in order to obtain dense estimates, it is also possible to densify matches by computing sparse matches first, and then interpolating using barycentric weights on a Delaunay triangulation of the queries. This interpolation can be done efficiently using a GPU rasterizer.
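As an illustration of the interpolation step, the sketch below computes barycentric weights for a pixel inside a single triangle and uses them to blend the matches found at the triangle's vertices. It assumes the Delaunay triangulation has already been computed and that the triangle containing the pixel is known; it is a plain CPU sketch rather than the GPU rasterizer mentioned above, and the function names are hypothetical.

```python
def barycentric_weights(p, a, b, c):
    """Barycentric coordinates of 2D point p with respect to triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    if det == 0:
        raise ValueError("degenerate triangle")
    w_a = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    w_b = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return w_a, w_b, 1.0 - w_a - w_b


def interpolate_match(p, tri_queries, tri_matches):
    """Interpolate the match of p from the matches at the triangle vertices.

    tri_queries: the three query points (first image) forming the triangle.
    tri_matches: their estimated correspondences (second image).
    """
    weights = barycentric_weights(p, *tri_queries)
    x = sum(w * m[0] for w, m in zip(weights, tri_matches))
    y = sum(w * m[1] for w, m in zip(weights, tri_matches))
    return (x, y)


queries = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
matches = [(10.0, 20.0), (11.0, 20.0), (10.0, 21.0)]  # a pure translation
dense_estimate = interpolate_match((0.25, 0.25), queries, matches)
```

Weights that all lie in $[0,1]$ indicate the pixel is inside the triangle; any affine motion of the three vertices is reproduced exactly, which is why errors from this scheme concentrate at motion boundaries.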

3.4 Implementation details

We train our method on the MegaDepth dataset, which provides both images and corresponding dense depth maps, generated by SfM. These images come from photo-tourism and show large variations in appearance and viewpoint, which is required to learn invariant models. The accuracy of the depth maps is sufficient to learn accurate local features, as demonstrated in prior work. To find co-visible pairs of images we can train with, we first filter out those with no common 3D points in the SfM model. We then compute the common area between the remaining pairs of images, by projecting pixels from one image to the other. Finally, we compute the intersection over union of the projected pixels, which accounts for different image sizes. We keep, for each image, the 20 image pairs with the largest overlap. This simple procedure results in a good combination of images with a mixture of high/low overlap. We use 115 scenes for training and 1 scene for validation.

Implementation

We implement our method in PyTorch. For the backbone $\mathcal{E}$ we use a ResNet50, initialized with weights pre-trained on ImageNet. We use the feature map after its fourth downsampling step (after the third residual block), which is of size $16\times 16\times 1024$, and convert it into $16\times 16\times 256$ with $1\times 1$ convolutions. For the transformer, we use 6 layers for both encoder and decoder. Each encoder layer contains a self-attention layer with 8 heads, and each decoder layer contains an encoder-decoder attention layer with 8 heads, but with no self-attention layers, in order to prevent query points from communicating with each other. Finally, for the network that converts the Transformer output into coordinates, $\mathcal{D}$, we use a 3-layer MLP, with 256 units each, followed by ReLU activations.

On-the-fly training data generation

We select training pairs randomly, pick a random query point in the first image, and find its corresponding point on the second image using the ground truth depth maps. We then select a random zoom level among one of ten levels, uniformly spaced, in log scale, between 1 $\times$ and 10 $\times$ . We then crop a square patch at the desired zoom level, centered at the query point, from the first image, and a square patch that contains the corresponding point in the second image. Given this pair of crops, we sample 100 random valid correspondences across the two crops – if we cannot gather at least 100 valid points, we discard the pair and move to the next.
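The zoom levels used for training data generation can be made concrete with a short sketch: ten levels spaced uniformly in log scale between 1$\times$ and 10$\times$, so that consecutive levels differ by a constant ratio. The function names are hypothetical, and cropping and correspondence sampling are left out.

```python
import random


def zoom_levels(num_levels=10, lowest=1.0, highest=10.0):
    """Zoom factors spaced uniformly in log scale between lowest and highest."""
    ratio = highest / lowest
    return [lowest * ratio ** (i / (num_levels - 1)) for i in range(num_levels)]


def sample_zoom(rng):
    """Pick one of the discrete zoom levels uniformly at random."""
    return rng.choice(zoom_levels())


levels = zoom_levels()
zoom = sample_zoom(random.Random(0))
```

With these defaults each level is about 1.29 times the previous one, so small and large zooms are sampled equally often in relative terms.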

Staged training

Our model is trained in three stages. First, we freeze the pre-trained backbone $\mathcal{E}$, and train the rest of the network, for 300k iterations, with the ADAM optimizer, a learning rate of $10^{-4}$, and a batch size of 24. We then unfreeze the backbone and fine-tune everything end-to-end with a learning rate of $10^{-5}$ and a batch size of 16, to accommodate the increased memory requirements, for 2M iterations, at which point the validation loss plateaus. Note that in the first two stages we use the whole images, resized to $256\times 256$, as input, which allows us to load the entire dataset into memory. In the third stage we introduce zoom-ins, generated as explained above, and train everything end-to-end for a further 300k iterations.

Results

We evaluate our method with four different datasets, each aimed at a different type of correspondence task. We do not perform any kind of re-training or fine-tuning. They are:

HPatches: A dataset with planar surfaces viewed under different angles/illumination settings, and ground-truth homographies. We use this dataset to compare against dense methods that operate on the entire image.

KITTI: A dataset for autonomous driving, where the ground-truth 3D information is collected via LIDAR. With this dataset we compare against dense methods on complex scenes with camera and multi-object motion.

ETH3D : A dataset containing indoor and outdoor scenes captured using a hand-held camera, registered with SfM. As it contains video sequences, we use it to evaluate how methods perform as the baseline widens by increasing the interval between samples, following .

Image Matching Challenge (IMC2020): A dataset and challenge containing wide-baseline stereo pairs from photo-tourism images, similar to those we use for training (on MegaDepth). It takes matches as input and measures the quality of the poses estimated using said matches. We evaluate our method on the test set and compare against the state of the art in sparse methods.

On HPatches, we follow the evaluation protocol of prior work, which computes the Average End Point Error (AEPE) for all valid pixels, and the Percentage of Correct Keypoints (PCK) at a given reprojection error threshold – we use 1, 3, and 5 pixels. Image pairs are generated taking the first (out of six) images for each scene as reference, which is matched against the other five. We provide two results for our method: ‘COTR’, which uses 1,000 random query points for each image pair, and ‘COTR + Interp.’, which interpolates correspondences for the remaining pixels using the strategy presented in Section 3.3. We report our results in Table 1.
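For reference, the two metrics can be written down in a few lines. This is a minimal sketch with hypothetical function names; predictions and ground truth are lists of $(x, y)$ pixel locations for the valid pixels only. Counting an error exactly equal to the threshold as correct is an assumption.

```python
import math


def end_point_errors(predicted, ground_truth):
    """Euclidean distance in pixels between each prediction and its target."""
    return [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]


def aepe(predicted, ground_truth):
    """Average End Point Error over all valid points."""
    errors = end_point_errors(predicted, ground_truth)
    return sum(errors) / len(errors)


def pck(predicted, ground_truth, threshold_px):
    """Fraction of points whose error is within threshold_px pixels."""
    errors = end_point_errors(predicted, ground_truth)
    return sum(e <= threshold_px for e in errors) / len(errors)


predicted = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
ground_truth = [(0.0, 0.0), (0.0, 0.0), (10.0, 12.0)]
scores = {t: pck(predicted, ground_truth, t) for t in (1, 3, 5)}
```

AEPE is sensitive to a few large errors, whereas PCK only counts how many points fall under the threshold, which is why both are reported.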

Our method provides the best results, with and without interpolation, with the exception of PCK-1px, where it remains close to the best baseline. We note that the results for this threshold should be taken with a grain of salt, as several scenes do not satisfy the planar assumption for all pixels. To provide some evidence for this, we reproduce the results for GLU-Net using the code provided by the authors to measure PCK at 3 pixels, which was not computed in the paper. While GLU-Net+GOCor slightly edges out GLU-Net, code was not available at the time of submission. COTR outperforms it by a significant margin.

4.2 KITTI

To evaluate our method in an environment more complex than simple planar scenes, we use the KITTI dataset. Following prior work, we use the training split for this evaluation, as ground-truth for the test split remains private – all methods, including ours, were trained on a separate dataset. We report results both in terms of AEPE, and ‘Fl.’ – the percentage of optical flow outliers. As KITTI images are large, we randomly sample 40,000 points per image pair, from the regions covered by valid ground truth.
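The text does not restate how an outlier is defined for ‘Fl.’. The sketch below assumes the usual KITTI convention, in which a flow estimate is an outlier when its end-point error exceeds both 3 pixels and 5% of the ground-truth flow magnitude; the thresholds and the function name are assumptions, not taken from this paper.

```python
import math


def fl_outlier_percentage(pred_flow, gt_flow, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of flow vectors counted as outliers.

    A vector is an outlier when its end-point error exceeds abs_thresh
    pixels AND rel_thresh times the ground-truth magnitude (assumed
    KITTI-style definition).
    """
    outliers = 0
    for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow):
        error = math.hypot(pu - gu, pv - gv)
        magnitude = math.hypot(gu, gv)
        if error > abs_thresh and error > rel_thresh * magnitude:
            outliers += 1
    return 100.0 * outliers / len(gt_flow)


gt_flow = [(10.0, 0.0), (100.0, 0.0), (10.0, 0.0), (0.0, 0.0)]
pred_flow = [(10.0, 0.0), (104.0, 0.0), (14.0, 0.0), (0.0, 1.0)]
fl = fl_outlier_percentage(pred_flow, gt_flow)
```

The relative term makes the metric tolerant of a 4-pixel error on a 100-pixel motion, while the same error on a 10-pixel motion counts as an outlier.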

We report the results on both KITTI-2012 and KITTI-2015 in Table 2. Our method outperforms all the baselines by a large margin. Note that the interpolated version also performs similarly to the state of the art, slightly better in terms of flow accuracy, and slightly worse in terms of AEPE, compared to RAFT. It is important to understand here that, while COTR provides a drastic improvement over compared methods, we are evaluating only on points where COTR returns confident results, which is about 81.8% of the queried locations – among the 18.2% of rejected queries, 67.8% fall out of the borders of the other image, which indicates that our filtering is reasonable. This shows that COTR provides highly accurate results in the points we query and retrieve estimates for, and is currently limited by the interpolation strategy. This suggests that improved interpolation strategies based on CNNs, such as those used in prior work, would be a promising direction for future research.

In Fig. 5 we further highlight cases where our method shows clear advantages over the competitors – we see that the objects in motion, i.e., cars, result in high errors with GLU-Net, which is biased towards a single, global motion. Our method, on the other hand, successfully recovers the flow fields for these cases as well, with minor errors at the boundaries, due to interpolation. These examples clearly demonstrate the role that attention plays when estimating correspondences on scenes with moving objects.

Finally, we stress that while our method is trained on MegaDepth, an urban dataset exhibiting only global, rigid motion, for which ground truth is only available on stationary objects (mostly building facades), our method proves capable of recovering the motion of objects moving in different directions; see Fig. 5, bottom. In other words, it learns to find precise, local correspondences within images, rather than global motion.

4.3 ETH3D

We also report results on the ETH3D dataset, following prior work. This task is closer to the ‘sparse’ scenario, as performance is only evaluated on pixels corresponding to SfM locations with valid ground truth, which are far fewer than for HPatches or KITTI. We summarize the results in terms of AEPE in Table 3, sampling pairs of images with an increasing number of frames between them (the sampling “rate”), which correlates with baseline and, thus, difficulty. Our method produces the most accurate correspondences for every setting, tied with LiteFlowNet at a 3-frame difference, and drastically outperforms every method as the baseline increases; see qualitative results in Fig. 6. We could not report exact numbers for GLU-Net+GOCor as they were not reported, and their implementation was not yet publicly available at the time of submission, but our method should comfortably outperform it in every setting (see Fig. 4 of their paper).

4.4 Image Matching Challenge

Accurate, 6-DOF pose estimation in unconstrained urban scenarios remains too challenging a problem for dense methods. We evaluate our method on a popular challenge for pose estimation with local features, which measures performance in terms of the quality of the estimated poses, expressed as mean average accuracy (mAA) at a $5^{\circ}$ and $10^{\circ}$ error threshold.

The challenge features two tracks: stereo, and multi-view (SfM); we focus on the stereo task. Our approach works on arbitrary locations and has no notion of ‘keypoints’ (we use random points). For this reason, we do not consider the multiview task, as SfM requires “stable” points to generate 3D landmarks. We plan to re-train the model and explore its use on keypoint locations in the future. As this dataset contains images with unconstrained aspect ratios, instead of stretching the image before the first zoom level, we simply resize the short-edge to $256$ and tile our coarse, image-level estimates – e.g. an image with 2:1 aspect ratio would invoke two tiling instances. If this process generates overlapping tiles (e.g. with a 4:3 aspect ratio), we choose the estimate that gives the best cycle consistency among them. We pair our method with DEGENSAC to retrieve the final pose, as recommended in prior work and done by most participants.
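One way to realize the tiling described above is sketched below. The placement rule (the smallest number of $256$-pixel tiles that covers the long edge, spread evenly so the first and last tiles touch the image borders) is an assumption: the text only states that a 2:1 image yields two tiles and that a 4:3 image yields overlapping ones. The function name is hypothetical.

```python
import math


def tile_offsets(width, height, tile=256):
    """Offsets of square tiles along the long edge after resizing.

    The short edge is resized to `tile` pixels; tiles of size tile x tile
    are then spread evenly along the resized long edge. Even spacing is
    an assumption made for this sketch.
    """
    short_edge, long_edge = min(width, height), max(width, height)
    long_resized = round(long_edge * tile / short_edge)
    num_tiles = math.ceil(long_resized / tile)
    if num_tiles == 1:
        return [0]
    step = (long_resized - tile) / (num_tiles - 1)
    return [round(i * step) for i in range(num_tiles)]


wide = tile_offsets(1024, 512)      # 2:1, two tiles side by side
standard = tile_offsets(640, 480)   # 4:3, two overlapping tiles
```

Where tiles overlap, a query inside the overlap receives one estimate per tile, and the one with the lowest cycle consistency error would be kept, as described above.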

We summarize the results in Table 4. We consider the top performers in the 2020 challenge (a total of 228 entries can be found in the leaderboards). As the challenge places a limit on the number of keypoints, instead of matches, we consider both categories (up to 2k and up to 8k keypoints per image), for fairness – note that our method has no notion of keypoints, instead, we query at random locations. While we limit the number of matches for each image pair, because we use random points for each pair, the number of points we use per image may grow very large. Hence, our method does not fit into the ‘traditional’ image matching pipeline, requiring additional considerations to use this benchmark; we thank the organizers for accommodating our request.

With 2k matches and excluding the methods that feature semantic masking – a heuristic employed in the challenge by some participants to filter out keypoints on transient structures such as the sky or pedestrians – COTR ranks second overall. These results showcase the robustness and generality of our method, considering that it was not trained specifically to solve wide-baseline stereo problems. In contrast, the other top entries are engineered towards this specific application. We also provide results lowering the cap on the number of matches (see $N$ in Table 4), showing that our method outperforms vanilla SuperGlue (the winner of the 2k-keypoint category) with as few as 512 input matches, and DISK (the runner-up) with as few as 256 input matches. Qualitative examples on IMC are illustrated in Fig. 7.

4.5 Object-centric scenes

While our evaluation focuses on outdoor scenes, our models can be applied to very different images, such as those picturing objects. We show one such example in Fig. 8, where COTR successfully estimates dense correspondences for two objects moving in different directions – despite the fact that this data looks nothing like the images it was trained with. This shows the generality of our approach.

4.6 Ablation studies

We validate the effectiveness of filtering out bad correspondences (Section 3.3) on the ETH3D dataset, where it improves AEPE by roughly 5% relative. More importantly, it effectively removes correspondences with a potentially high error. This allows the dense interpolation step to produce better results. We find that on average 1.2% of the correspondences are filtered out on this dataset – below 1% up to ‘rate=9’, gradually increasing until 3.65% at ‘rate=15’.

On the role of the transformer

Transformers are powerful attention mechanisms, but also costly. It is fair to wonder whether a simpler approach would suffice. We explore the use of MLPs in place of transformers, forming a pipeline similar to , and train such a variant – see supplementary material for details. In Fig. 9, we see that the MLP yields globally-smooth estimates, as expected, which fail to model the discontinuities that occur due to 3D geometry. On the other hand, COTR with the transformer successfully aligns source and target even when such discontinuities exist.

Zooming

To evaluate how our zooming strategy affects the localization accuracy of the correspondences, we measure the errors in the estimation at each zoom level, in pixels. We use the HPatches dataset, with more granularity than we use for inference, and display the histogram of pixel errors at each zoom level in Fig. 10. As we zoom in, the distribution shifts to the left and gets squeezed, yielding more accurate estimates. While zooming in more is nearly always beneficial, we found empirically that four zoom-ins with a factor of two at each zoom provide a good balance between compute and accuracy.

Conclusions and future work

We introduced a functional network for image correspondence that is capable of addressing both sparse and dense matching problems. Through a novel architecture and recursive inference scheme, it achieves performance on par with or above the state of the art on HPatches, KITTI, ETH3D, and one scene from IMC2020. As future work, in addition to the improvements we have suggested throughout the paper, we intend to explore the application of COTR to semantic and multi-modal matching, and incorporate refinement techniques to further improve the quality of its dense estimates.

Acknowledgements

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant, Google’s Visual Positioning System, Compute Canada, and Advanced Research Computing at the University of British Columbia.

Appendix A Compute

The functional (and recursive) nature of our approach, coupled with the use of a transformer, means that our method has significant compute requirements. Our currently non-optimized prototype implementation queries one point at a time, and achieves 35 correspondences per second on an NVIDIA RTX 3090 GPU. This limitation could be addressed by careful engineering in terms of tiling and batching. Our preliminary experiments show no significant drop in performance when we query different points inside a given crop – we could thus potentially process any queries at the coarsest level in a single operation, and drastically reduce the number of operations in the zoom-ins (depending on how many queries overlap in a given crop). We expect this will speed up inference drastically. In addition to batching the queries at inference time, we plan to explore its use on non-random points (such as keypoints) and advanced interpolation techniques.

Appendix B Log-linear vs Linear

Here, we empirically demonstrate that linear positional encoding is important. We train two COTR models with different positional encoding strategies; see Section 3.2. One model uses a log-linear increase in the frequency of the sine/cosine function, and the other uses a linear increase instead. Fig. A shows that COTR successfully converges using the linear increase strategy. However, as shown in Fig. B, COTR fails to converge with the commonly used log-linear strategy. We suspect that this is because the task of finding correspondences does not involve very high frequency components, but further investigation is necessary and is left as future work.

Appendix C Architectural details for COTR

We use the lower layers of ResNet50 as our CNN backbone. We extract the feature map with 1024 channels after layer3, i.e., after the fourth downsampling step. We then project the feature maps with 1024 channels with $1\times 1$ convolution to 256 channels to reduce the amount of computation that happens within the transformers.

Transformers

We use 6 layers in both the transformer encoder and the decoder. Each encoder layer contains an 8-head self-attention module, and each decoder layer contains an 8-head encoder-decoder attention module. Note that we disallow the self-attention in the decoder, in order to maintain the independence between queries – queries should not affect each other.

MLP

Once the transformer decoder processes the results, we obtain a 256-dimensional vector that represents where the correspondence should be. We use a 3-layer MLP to regress the corresponding point coordinates from the 256-dimensional latent vector. Each layer contains 256 neurons, followed by ReLU activations.

Appendix D Architectural details for the MLP variant

We use the same backbone as in COTR. The difference here is that, once the feature map with 256 channels is obtained, we apply max pooling to extract the global latent vector for the image, as suggested in prior work. We also tried a variant where we do not apply global pooling and use a fully-connected layer to bring it down to a manageable size of 1024 neurons, but it quickly provided degenerate results, where all correspondence estimates were at the centre.

MLP

With the latent vectors from each image, we use a 3-layer MLP to regress the correspondence coordinates. Specifically, the input to the coordinate regressor is a 768-dimensional vector, which is the concatenation of the two 256-dimensional global latent vectors for the input images and the 256-dimensional positional encoding of the query point. Similarly to the MLP used in COTR, each linear layer contains 256 neurons, followed by ReLU activations.

Appendix E Comparing with RAFT [65]

RAFT performs better in KITTI-type scenarios, but not necessarily in other cases. To show this, we provide results for RAFT on all other datasets in Table A. On KITTI, sparse COTR still performs best, and with the interpolation strategy it is roughly on par with RAFT. On other datasets, COTR outperforms RAFT by a large margin. Note that RAFT requires two input images of the same size; we resize them to 1024 $\times$ 1024 for HPatches and the Image Matching Challenge.