Neighborhood Attention Transformer

Ali Hassani, Steven Walton, Jiachen Li, Shen Li, Humphrey Shi

Introduction

Convolutional neural networks (CNNs) have been the de facto standard architecture for computer vision models across different applications for years. AlexNet showed their usefulness on ImageNet , and many others followed suit with architectures such as VGG , ResNet , and EfficientNet . Transformers on the other hand, were originally proposed as attention-based models for natural language processing (NLP), trying to exploit the sequential structure of language. They were the basis upon which BERT and GPT were built, and they continue to be the state of the art architecture in NLP.

In late 2020, Vision Transformer (ViT) was proposed as an image classifier using only a Transformer Encoder operating on an embedded space of image patches, mostly for large-scale training. A number of other methods followed, attempting to increase data efficiency , eventually making such Transformer-like models the state of the art in ImageNet-1K classification (without pre-training on large-scale datasets such as JFT-300M).

These high-performing Transformer-like methods are all based on Self Attention (SA), the basic building block in the original Transformer . SA has a linear complexity with respect to the embedding dimension (excluding linear projections), but a quadratic complexity with respect to the number of tokens. In the scope of vision, the number of tokens is typically in linear correlation with image resolution. As a result, higher image resolution results in a quadratic increase in complexity and memory usage in models strictly using SA, such as ViT. The quadratic complexity has prevented such models from being easily applicable to downstream vision tasks, such as object detection and segmentation, in which image resolutions are usually much larger than classification. Another problem is that convolutions benefit from inductive biases such as locality, and the 2-dimensional spatial structure, while dot-product self attention is a global 1-dimensional operation by definition. This means that some of those inductive biases have to be learned with either large sums of data or advanced training techniques and augmentations .

Local attention modules were therefore proposed to alleviate these issues. Stand-Alone Self-Attention (SASA) was one of the earliest applications of local window-based attention to vision, where each pixel attends to a window around it. Its explicit sliding window pattern is identical to that of same convolutions, with zero paddings around and a simple 2-dimensional raster scan, therefore maintaining translational equivariance. SASA was aimed at replacing convolutions in a ResNet, and was shown to have a noticeable improvement over baselines. However, the authors noted SASA was limited in terms of speed due to the lack of an efficient implementation similar to that of convolutions. Swin on the other hand was one of the first hierarchical vision transformers based on local self attention. Its design and the shifted-window self attention allowed it to be easily applicable to downstream tasks, as they made it computationally feasible, while also boosting performance through the additional biases injected. Swin’s localized attention, however, first applies self attention to non-overlapping windows and then shifts the windows, the motivation of which was sliding window methods such as SASA suffering throughput bottlenecks. HaloNet used a haloing mechanism that localizes self attention for blocks of pixels at a time, as opposed to pixel-wise. One of their key motivations for this was also noted to be the lack of an efficient sliding window attention.

In this work, we revisit explicit sliding window attention mechanisms, and propose Neighborhood Attention (NA). NA localizes SA to each pixel’s nearest neighbors, which is not necessarily a fixed window around the pixel. This change in definition allows all pixels to maintain an identical attention span, which would otherwise be reduced for corner pixels in zero-padded alternatives (SASA). NA also approaches SA as its neighborhood size grows, and is equivalent to SA at maximum neighborhood. Additionally, NA has the added advantage of maintaining translational equivariance , unlike blocked and window self attention. We develop $\mathcal{N}ATTEN$ , a Python package with efficient C++ and CUDA kernels that allow NA to run even faster than Swin’s WSA in practice, while using less memory. We build Neighborhood Attention Transformer (NAT), which achieves competitive results across vision tasks.

To summarize, our main contributions are:

Proposing Neighborhood Attention (NA): A simple and flexible explicit sliding window attention mechanism that localizes each pixel’s attention span to its nearest neighborhood, approaches self attention as its span grows, and maintains translational equivariance. We compare NA in terms of complexity and memory usage to self attention, window self attention, and convolutions.

Developing efficient C++ and CUDA kernels for NA, including the tiled NA algorithm, which allow NA to run up to 40% faster than Swin’s WSA while using up to 25% less memory. We release them under a new Python package for explicit sliding window attention mechanisms, $\mathbfcal{N}\mathbf{ATTEN}$ , to provide easy-to-use modules with autograd support that can be plugged into any existing PyTorch pipeline.

Introducing Neighborhood Attention Transformer (NAT), a new efficient, accurate, and scalable hierarchical transformer based on NA. We demonstrate its effectiveness on both classification and downstream tasks. For instance, NAT-Tiny reaches 83.2% top-1 accuracy on ImageNet with only 4.3 GFLOPs and 28M parameters, and 51.4% box mAP on MS-COCO and 48.4% mIoU on ADE20K, significantly outperforming both Swin Transformer and ConvNeXt .

Related Works

In this section, we briefly review the original Self Attention (SA) , some of the notable vision transformers and Transformer-like architectures , some of the notable local attention-based vision transformers , and a recent CNN which provides an up-to-date baseline for attention-based models.

Scaled dot-product attention was defined by Vaswani et al. as an operation on a query and a set of key-value pairs. The dot product of query and key is computed and scaled. Softmax is applied to the output in order to normalize attention weights, and is then applied to the values. It can be expressed as follows:

2 Vision Transformer

Dosovitskiy et al. proposed a Transformer-based image classifier that merely consists of a Transformer encoder and an image tokenizer, named Vision Transformer (ViT). Previous works, such as DETR , explored CNN-Transformer hybrids for object detection. ViT on the other hand proposed a model that would only rely on a single non-overlapping convolutional layer (patching and embedding). ViT was pre-trained primarily on the private JFT-300M dataset, and was shown to outperform state-of-the-art CNNs on many benchmarks. However, it was also added that when ViT is pre-trained on medium-scale datasets, such as ImageNet-1K and ImageNet-21K, it no longer achieves competitive results. This was attributed to the lack of inductive biases that are inherent to CNNs, which the authors argued is trumped by large-scale training. While this effectively proved ViT inferior in medium-scale training, it provided empirical evidence that Transformer-based models outperform CNNs in larger scales. ViT paved the way for many more vision transformers, and attention-based models in general, that followed and transferred it to medium-scale learning , and even small-scale learning on much smaller datasets . Touvron et al. extended the study of Vision Transformers by exploring data efficiency. Their Data-efficient image Transformer (DeiT) model pushed ViT ahead with minimal architectural changes, and through the use of advanced augmentations and training techniques. Their efforts highlighted the true potential of a Transformer-based image classifier in medium-sized data regimes, and inspired many more to adopt their training techniques .

3 Local Attention

Stand Alone Self Attention (SASA) , is one of the earliest sliding window self attention patterns, aimed to replace convolutions in existing CNNs. It operates similarly to a convolution with zero padding, and extracts key-value pairs by striding the feature map. The authors reported a noticeable accuracy improvement, but observed that the implementation suffered high latency despite the lower theoretical cost. This attention pattern was also adopted in language processing in Longformer (sliding window attention), and later adopted in Vision Longformer (ViL) . While Longformer and ViL’s implementations were different from SASA, they were still not able to scale to larger windows and models as a result of both computational overhead. Additionally, the reduced receptive field in corner cases caused by padding was not addressed. Window and Shifted Window (Swin) Attention were introduced by Liu et al. as non-sliding window-based self attention mechanisms that partition feature maps and apply self attention to each partition separately. This operation has a similar theoretical complexity to SASA, but it can be easily parallelized through batched matrix multiplication. The shifted variant follows the regular, and as the name suggests shifts the partitioning to allow out-of-window interactions, which are necessary for receptive field growth. Their proposed model, Swin Transformer, is one of the earliest hierarchical vision transformers. It produces pyramid-like feature maps, reducing spatial dimensionality while increasing depth. This structure has been commonly used in CNNs over the years, and is why Swin can be easily integrated with other networks for application to downstream tasks, such as detection and segmentation. Swin outperformed DeiT, which uses a convolutional teacher, at ImageNet-1K classification. Moreover, Swin Transformer became the state-of-the-art method in object detection on MS-COCO and in semantic segmentation on ADE20K. Vaswani et al. proposed HaloNet, which aimed to avoid SASA’s speed issue by replacing it with a new blocked attention pattern. They noted that while this change relaxes translational equivariance, it can provide a reasonable trade-off with speed and memory. HaloNet’s attention mechanism consists of 3 stages: blocking, haloing, and attention. Input feature maps are blocked into non-overlapping subsets, which will serve as queries. Followed by that, “haloed” neighboring blocks are extracted, which will serve as keys and values. Attention is then applied to the extracted queries and key-value pairs. HaloNet was shown to be effective at both reducing cost (compared to SASA) and improving performance, especially when used in conjunction with convolutional layers in the network. Many works followed Swin in adopting WSA, such as RegionViT , in which a regional token is inserted into every local self attention layer for the purpose of introducing global context. This work and HaloNet highlight that the research community has lost interest in sliding window attention patterns in part because they are thought to be inefficient. We aim to change that by introducing $\mathcal{N}ATTEN$ .

4 New Convolutional Baselines

Liu et al. proposed a new CNN architecture influenced by models such as Swin, dubbed ConvNeXt. These models are not attention-based, and manage to outperform Swin across different vision tasks. This work has since served as a new CNN baseline for fair comparison of convolutional models and attention-based models.

We propose Neighborhood Attention, which by design localizes the receptive field to a window around each query, and therefore would not require additional techniques such as the cyclic shift used by Swin. Additionally, Neighborhood Attention maintains translational equivariance, which is traded off for efficiency in methods such as HaloNet and Swin. We demonstrate that Neighborhood Attention can run even faster than methods such as Swin, while using less memory, with our $\mathcal{N}ATTEN$ python package. We introduce a hierarchical transformer-like model with this attention mechanism, dubbed Neighborhood Attention Transformer, and demonstrate its performance compared to Swin on image classification, object detection, and semantic segmentation.

Method

In this section, we introduce Neighborhood Attention, a localization of self attention (see Eq. 1) considering the structure of visual data. This not only reduces computational cost compared to self attention, but also introduces local inductive biases, similar to that of convolutions. We show that NA is better alternative to the previously proposed SASA in terms of restricting self attention, while being equivalent in theoretical cost. We then introduce our Python package, $\mathcal{N}ATTEN$ , which provides efficient implementations of NA for both CPU and GPU acceleration. We discuss the novelties in the extension and how it manages to exceed the speed of Swin’s WSA and SWSA, while using less memory. We finally introduce our model, Neighborhood Attention Transformer (NAT), which uses this new mechanism instead of self attention. In addition, NAT utilizes a multi-level hierarchical design, similar to Swin , meaning that feature maps are downsampled between levels, as opposed to all at once. Unlike Swin, NAT uses overlapping convolutions to downsample feature maps, as opposed to non-overlapping (patched) ones, which have been shown to improve model performance by introducing useful inductive biases .

where $\rho_{j}{(i)}$ denotes $i$ ’s $j$ -th nearest neighbor. We then define neighboring values, $\mathbf{V}_{i}^{k}$ , as a matrix whose rows are the $i$ -th input’s $k$ nearest neighboring value projections:

Neighborhood Attention for the $i$ -th token with neighborhood size $k$ is then defined as:

where $\sqrt{d}$ is the scaling parameter. This operation is repeated for every pixel in the feature map. Illustrations of this operation are presented in Figs. 2 and VIII.

From this definition, it is easy to see that as $k$ grows, $\mathbf{A}_{i}^{k}$ approaches self attention weights, and $\mathbf{V}_{i}^{k}$ approaches $V_{i}$ itself, therefore Neighborhood Attention approaches self attention. This is the key difference between NA and SASA , where each pixel attends to a window around it with padding around the input to handle edge cases. It is thanks to this difference that NA approaches self attention as window size grows, which does not hold true in SASA, due to the zero padding around the input.

2 Tiled NA and 𝒩𝒜𝒯𝒯ℰ𝒩𝒩𝒜𝒯𝒯ℰ𝒩\mathbfcal{N}\mathbf{ATTEN}

Restricting self attention in a pixel-wise manner has not been well-explored in the past, primarily because it was considered a costly operation that would require lower-level reimplementation. That is because self attention itself is broken down to matrix multiplication, an operation that is easily parallelizable on accelerators, and has a myriad of efficient algorithms defined for different use cases in computational software (to name a few: LAPACK, cuBLAS, CUTLASS). Additionally, most deep learning platforms, such as PyTorch, are written on top of such software, and additional packages (such as cuDNN). This is very helpful to researchers, as it allows them to use abstractions of operations such as matrix multiplications or convolutions, while the backend decides which algorithm to run based on their hardware, software, and use case. They also typically handle automatic gradient computation, which makes designing and training deep neural networks very straightforward.

Because of the pixel-wise structure of NA (and other pixel-wise attention mechanisms, such as SASA ), and also the novelty of the definition of neighborhoods in NA, the only way to implement NA with these platforms is to stack a number of highly inefficient operations to extract the neighborhoods, store them as an intermediary tensor, and then compute attention. This results in a significantly slow operation, with an exponentially growing memory usage. To tackle these challenges, we developed a set of efficient CPU and CUDA kernels and packaged them as a Python package, Neighborhood Attention Extension ( $\mathbfcal{N}\mathbf{ATTEN}$ ). $\mathcal{N}ATTEN$ includes half precision support, support for both 1D and 2D data, and autograd-compatible integration with PyTorch. This means that users can simply import NA as a PyTorch module and integrate it into existing pipelines. We also add that SASA can also be easily implemented with this package with no change in the underlying kernels (simply by padding inputs with zeros), as it is a special case of NA. The reverse does not hold true. It also includes our tiled NA algorithm, which computes neighborhood attention weights by loading non-overlapping query tiles into shared memory to minimize global memory reads. Compared to the naive implementation, tiled NA can decrease latency up to an order of magnitude (see Appendix A for technical details), and it allows NA-based models to run up to 40% faster than similar Swin counterparts (see Fig. 4.) $\mathcal{N}ATTEN$ is open-sourced at: https://github.com/SHI-Labs/NATTEN.

3 Neighborhood Attention Transformer

NAT embeds inputs using 2 consecutive $3\times 3$ convolutions with $2\times 2$ strides, resulting in a spatial size $1/4$ th the size of the input. This is similar to using a patch and embedding layer with $4\times 4$ patches, but it utilizes overlapping convolutions instead of non-overlapping ones to introduce useful inductive biases . On the other hand, using overlapping convolutions would increase cost, and two convolutions incurs more parameters. However, we handle that by re-configuring the model, which results in a better trade-off.

NAT consists of 4 levels, each followed by a downsampler (except the last). Downsamplers cut spatial size in half, while doubling the number of channels. We use $3\times 3$ convolutions with $2\times 2$ strides, instead of $2\times 2$ non-overlapping convolutions that Swin uses (patch merge). Since the tokenizer downsamples by a factor of $4$ , our model produces feature maps of sizes $\frac{h}{4}\times\frac{w}{4}$ , $\frac{h}{8}\times\frac{w}{8}$ , $\frac{h}{16}\times\frac{w}{16}$ , and $\frac{h}{32}\times\frac{w}{32}$ . This change is motivated by previous successful CNN structures, and followed by other hierarchical attention-based methods, such as PVT , ViL , and Swin Transformer . Additionally, we use LayerScale for stability in larger variants. An illustration of the overall network architecture is presented in Fig. 5. We present a summary of different NAT variants in Tab. 1.

4 Complexity Analysis

We present a complexity and memory usage analysis in this subsection, which compares SA, WSA, NA, and convolutions in Tab. 2. For simplicity, we exclude attention heads. Given input feature maps of shape $h\times w\times d$ , where $d$ is the number of channels, and $h$ and $w$ are feature map height and width respectively, the $QKV$ linear projections are $3hwd^{2}$ FLOPs, which is the same for all three attention patterns. SA has a quadratic complexity, as both computing attention weights and output are $h^{2}w^{2}d$ FLOPs, and attention weights are of shape $hw\times hw$ . Swin’s WSA divides the queries, keys, and values into $\frac{h}{k}\times\frac{w}{k}$ windows of shape $k\times k$ , then applies self attention with each window, which is $hwdk^{2}$ FLOPs. WSA’s memory consumption, given that its attention weights are of shape $\frac{h}{k}\times\frac{w}{k}\times k^{2}\times k^{2}$ , is therefore $hwdk^{2}$ . In NA, $\mathbf{A}_{i}^{k}$ is of size $h\times w\times k^{2}$ , and the cost to compute it is $hwdk^{2}$ . $\mathbf{V}_{i}^{k}$ is of shape $h\times w\times k^{2}\times d$ , and therefore the cost of applying attention weights to it would be $hwdk^{2}$ . As for convolutions, computational cost is $hwd^{2}k^{2}$ , and memory usage would be only $d^{2}k^{2}$ . The summary in Tab. 2 clarifies that Swin’s WSA and NA have identical computational cost and memory usage in theory.

Experiments

We demonstrate NAT’s applicability and effectiveness by conducting experiments across different vision tasks, such as image classification, object detection, and semantic segmentation. We also present ablations on different attention patterns, as well as our NAT design. Additional experiments, including saliency analysis can be found in Appendix B.

We trained our variants on ImageNet-1K in order to compare to other transformer-based and convolutional image classifiers. This dataset continues to be one of the few benchmarks for medium-scale image classification, containing roughly 1.28M training, 50K validation, and 100K test images, categorized into 1000 classes. We train NAT with the commonly used timm (Apache License v2), and use the conventional augmentations (CutMix , Mixup , RandAugment , and Random Erasing ) and training techniques used in methods we compare to . We follow Swin’s training configuration (learning rate, iteration-wise cosine schedule, and other hyperparameters). Following convention, we train for 300 epochs, 20 of which warm up the learning rate, while the rest decay according to the scheduler, and do 10 additional cooldown epochs . Results are presented in Tab. 3, and visualized in Fig. 3. We observe that NAT-Mini outperforms Swin-Tiny by a margin of 0.5%, with fewer parameters, higher throughput and lower memory usage. As for the other three variants, we observe they consistently outperform both Swin and ConvNeXt counterparts with similar number of parameters and FLOPs. While our Small variant is slightly slower than its Swin counterpart due to the difference in architecture, our Base variant catches up to being faster than Swin-Base.

2 Object Detection and Instance Segmentation

We trained Mask and Cascade Mask R-CNN on MS-COCO , with NAT backbones, which were pre-trained on ImageNet. We followed Swin ’s training settings, using mmdetection (Apache License v2), and trained with the same accelerated $3\times$ LR schedule. Results are presented in Tab. 4. NAT-Mini outperforms Swin-Tiny with Mask R-CNN, while falling slightly short to it with Cascade Mask R-CNN, all while having significantly fewer FLOPs. NAT-Tiny outperforms both its Swin and ConvNeXt counterparts, with both Mask and Cascade Mask R-CNN, while having a slightly lower throughput compared to its Swin counterpart. NAT-Small reaches a competitive performance compared to its Swin counterpart, while being faster. NAT-Base can even outperform its Swin counterpart, while also enjoying a higher throughput.

3 Semantic Segmentation

To demonstrate NAT’s performance on semantic segmentation, we trained UPerNet with NAT backbones on ADE20K . We followed Swin’s configuration for training ADE20K, and used mmsegmentation (Apache License v2). Additionally, and following standard practice, input images are randomly resized and cropped at $512\times 512$ when training. Results are presented in Tab. 5. It is noticeable that NAT-Mini outperforms Swin-Tiny, and also comes very close to ConvNeXt-Tiny. NAT-Tiny outperforms ConvNeXt-Tiny significantly, while also slightly more efficient. NAT-Small outperforms Swin-Small on single-scale performance, while matching the multi-scale performance. NAT-Base similarly performs on-par with Swin-Base, while falling slightly short of ConvNeXt-Base. Note that both NAT-Small and NAT-Base bear fewer FLOPs with them compared to their Swin and ConvNeXt counterparts, while their performance is within the same region. It is also noteworthy that Swin especially suffers from more FLOPs even beyond the original difference due to the fact that the image resolution input in this task specifically ( $512\times 512$ ) will not result in feature maps that are divisible by $7\times 7$ , Swin’s window size, which forces the model to pad input feature maps with zeros to resolve that issue, prior to every attention operation. NAT on the other hand supports feature maps of arbitrary size.

4 Ablation Study

We compare Swin’s attention pattern (WSA+SASA) to sliding window patterns, namely SASA (implemented with our $\mathcal{N}ATTEN$ package and therefore enjoys approximately the same throughput and identical memory usage as NA), and our NA. We simply replace the attention blocks in Swin-Tiny, and run the model on ImageNet-1K classification, MSCOCO object detection and instance segmentation, and ADE20K segmentation. Results are presented in Tab. 6.

Separately, we investigate the effects of our NAT design (convolutional downsampling and deeper-thinner architecture), by performing an ablation with Swin-Tiny as baseline. We slowly transform the model into NAT-Tiny, and present the results in Tab. 7. We start by replacing the patched embedding and patched merge with the overlapping convolution design used in NAT. This results in almost 0.5% improvement in accuracy. After taking the second step to reduce the model size and compute, by making it deeper and thinner, we notice the model sees approximately an improvement in accuracy of 0.9% over the first step. We then try swapping the WSA and SWSA attention patterns in Swin with SASA , and see a slight drop in accuracy. However, swapping WSA and SWSA with our NA shows a further 0.5% improvement in accuracy.

We also present a kernel size experiment in Tab. 8, in which we try kernel sizes ranging from 3× 3 to 9 × 9, in an effort to analyze its affect on our model’s performance.

Conclusion

In this paper, we present Neighborhood Attention (NA), the first efficient and scalable sliding window attention mechanism for vision. NA is a pixel-wise operation which localizes self attention for each pixel to its nearest neighborhood, and therefore enjoys linear complexity. It also introduces local inductive biases and maintains translational equivariance, unlike blocked (HaloNet) and window self attention (Swin). Different from SASA, NA approaches self attention as its window size grows, and does not limit attention span at corner cases. We challenge the common notion that explicit sliding window attention patterns are not efficient or parallelizable by developing $\mathcal{N}ATTEN$ . Through using $\mathcal{N}ATTEN$ , NA-based models can run even faster than existing alternatives, despite the latter running primarily on highly optimized deep learning libraries built on top of lower-level computational packages. We also propose Neighborhood Attention Transformer (NAT) and show the power of such models: NAT outperforms both Swin Transformer and ConvNeXt in image classification, and outperforms or competes with both in downstream vision tasks. We open-source our entire project to encourage more research in this direction.

This work was in part supported by the Intelligence Advanced Research Projects Activity (IARPA) under Contract No. 2022-21102100004. We thank Picsart AI Research (PAIR) and Meta for their generous support that made this work possible.

References

Appendix

We present a more detailed illustration of neighborhoods in Fig. VIII. Note that the repeated windows at corner pixels is especially important to NA approaching SA. We also present details on our $\mathcal{N}ATTEN$ package in Appendix A, and additional experiments in Appendix B. Additionally, we discuss translational equivariance in self attention mechanisms in Appendix C.

Appendix A 𝒩𝒜𝒯𝒯ℰ𝒩𝒩𝒜𝒯𝒯ℰ𝒩\mathbfcal{N}\mathbf{ATTEN}

In this section, we outline the necessity for an extension such as $\mathcal{N}ATTEN$ for research in the direction of dynamic sliding window attention patterns, and describe how it aims to resolve such problems.

While many operations in deep neural networks can be broken down to matrix multiplication, certain point-wise operations, such as convolutions, require customized implementations for more optimal parallelization. As a result, convolutions, recurrent modules, and other similar operations are natively supported in most low-level computational packages, which are called by deep learning frameworks such as PyTorch. In other words, given any input set and an operation, deep learning frameworks select the most efficient implementation available for that particular case, considering the hardware and software running said operation.

This makes research significantly easier, while being inevitably constrained to operations that are well-implemented. To allow further flexibility, some deep learning frameworks also allow for extensions to be built on top of them when necessary. Extensions can therefore enjoy customized CPU and GPU implementations. Notable examples of such extensions are Deformable Convolutions and Deformable Attention , which have been implemented as CUDA extensions to PyTorch.

Sliding window attention mechanisms are no different, in that they require manual implementation to maximize parallelization and bandwidth. Without those implementations, the only alternative is a Python implementation, which typically does not scale. For instance, implementing Neighborhood Attention with PyTorch alone would include extracting sliding windows, repeating and re-arranging them to produce neighborhoods, and then performing two batched matrix multiplications. This would mean two separate C++/CUDA calls to generate significantly large intermediary tensors, which result in an exponential memory usage and latency increases. With the most optimized plain PyTorch implementation, NA would run at 13% the speed of Swin Transformer on a 56 × 56 feature map (first level of an ImageNet model), while using approximately 9 times as much memory. With a naive CUDA implementation, the same NA module runs at 102% the speed of Swin, while using approximately 20% less memory. With our Tiled NA algorithm, that same module runs at 132% the speed of Swin, with no change in memory usage. You can refer to Figs. II and I for more benchmarks comparing different NA implementations to Swin Transformer in terms of relative speed and memory usage.

This is why we developed $\mathcal{N}ATTEN$ , which currently serves as an extension to PyTorch, and provides torch modules NeighborhoodAttention1D and NeighborhoodAttention2D. This allows any PyTorch user to integrate NA into their models, for both tokens and pixels.

Each module consists of linear projections for queries, keys, and values, and a final linear projection, which is standard in most dot-product self attention modules. $\mathcal{N}ATTEN$ provides a single autograd function for each of Eq. 2 and Eq. 4. Once tensors q, k, and v are generated, attention weights are computed (see Eq. 2) by passing q, k, and positional biases b to the C function QK+RPB, which picks the appropriate kernel to call (CPU or CUDA; naive or special; half or full precision). Softmax and dropout are then applied to the output attention weights, a, with native torch implementations. NA’s final is computed by passing a and v to the C function AV.

A.2 Naive CUDA Kernels

Originally, we developed 7 naive CUDA kernels: 1 for QK+RPB, 1 for AV, and 5 to compute gradients for each of q, k, b, a, and v. Naive kernels simply divide computation across available threadblocks, and do not utilize shared memory or warp optimization. Despite their simplicity, they were able to benchmark between 80% up to 130% the speed of WSA+SWSA layers (both with kernel size 7 × 7). However, naive kernels are not optimal; they read directly from the global memory on the GPU, which bottlenecks throughput.

A.3 Half precision

Supporting mixed precision training is not too complicated. PyTorch’s ATen dispatchers compile all kernels for both double and single precision by default, since tensor data type is usually templated. By choosing a different dispatcher, kernels can be easily compiled for half tensors. However, simply support half precision rarely results in any significant bandwidth improvement without integrating CUDA’s vectorized half2 data type and operators. As a result, we separately define our half precision kernels to utilize vectorize load, multiply-add, and stores. This yields a more significant improvement in mixed-precision training speed.

A.4 Tiled Neighborhood Attention

CUDA allows easy allocation and utilization of shared memory between threadblocks. This, however, typically requires a change in the algorithm. Therefore, we implemented a tiled version of our attention weight kernel, and its backward kernel, which divides inputs into non-overlapping tiles, assigns each thread within the threadblock to read a specific number of adjacent cells from global memory, sync, and then compute outputs based on values in the shared memory. We present an illustration of that in Fig. IV. Using shared memory also presents new challenges, including, but not limited to: 1. Tile size bounds depending on kernel size, dimension, and shared memory limit on the GPU. 2. Bank conflicts between warps during computation. 3. Different number of reads from each input depending on tile size.

For instance, Fig. IV illustrates NA at kernel size 7 ×7, with tile size 3 × 3, which requires a key tile of size 9 × 9. The 3 × 3 tile size was chosen based on a number of factors, including the size of shared memory (48 KB), total number of threads per threadblock (1024 since compute capability 2.0), and other problem-specific factors such as embedding dimension. Key tile size is always equal to $t_{q}+k-1$ , where $t_{q}$ is the query tile size, and $k$ is kernel size, which is $3+7-1=9$ here.

Through a detailed internal analysis, we implemented and optimized Tiled NA for kernel sizes 3, 5, 7, 9, 11, and 13. Although not all bank conflicts were avoided in all use cases, they were minimized through profiling with NVIDIA NsightTM. Even though this implementation has resulted in a considerable bandwidth increase in NA training and inference, $\mathcal{N}ATTEN$ is still fairly at an early stage. We hope to improve existing kernels and add more optimal ones for different use cases, and add support for the new Hopper architecture with CUDA 12.

A.5 CPU Kernels

We extend $\mathcal{N}ATTEN$ to support CPU operations as well, both for training and inference. CPU functions for Neighborhood Attention are simple C++ implementations, with AVX vectorization support in newer PyTorch versions. As a result, they can easily utilize multi-threaded computation, which usually results in a relatively good latency compared to similar sized models on consumer CPUs. In total, there are 7 CPU kernels in the current version (similar to the naive implementations, 1 for each operation and 1 for each gradient.) We foresee further optimizations and additional CPU kernels in the near future.

A.6 Future efforts

We hope to continue supporting $\mathcal{N}ATTEN$ and help the community enjoy sliding window attention modules. Our hope is to eventually implement Neighborhood Attention with implicit GEMM (generalized matrix-matrix product), which will allow $\mathcal{N}ATTEN$ to be built on top of open-source packages (i.e. CUTLASS) and utilize the power of hardware accelerators to a greater extent.

Appendix B Additional experiments

We present an ablation on relative positional biases and pixel shifts (WSA only) in Tab. I.

B.2 Saliency analysis

In an effort to further illustrate the differences between attention mechanisms and models, we present salient maps from ViT-Base, Swin-Base, and NAT-Base. We selected a few images from the ImageNet validation set, sent them through the three models, and created the salient maps based on the outputs, which are presented in Fig. VII. All images are correctly predicted (Bald Eagle, Acoustic Guitar, Hummingbird, Steam Locomotive) except ViT’s Acoustic Guitar which predicts Stage. From these salient maps we can see that all models have relatively good interpretability, though they focus on slightly different areas. NAT appears to be slightly better at edge detection, which we believe is due to the localized attention mechanism, that we have presented in this work, as well as the convolutional downsamplers.

Appendix C Notes on translational equivariance

In this section, we discuss the translational equivariance property in attention-based models, which is often referenced as a useful property in convolutional models . To do that, we begin with defining equivariance and translations, and then move on to studying the existence translational equivariance in different modules.

In the context of computer vision, translation typically refers to a shift (and sometimes rotation) in pixels.

Equivariance.

A function $f$ is equivariant to a function $\mathcal{T}$ if $\mathcal{T}(f(x))=f(\mathcal{T}(x))$ .

Translational Equivariance.

An operation $f$ is equivariant to translations.

Linear projections.

A single linear layer, which can also be formulated as a 1×1 convolution, is by definition equivariant to any change in the order of pixels. Therefore, they are also translationally equivariant.

Convolutions.

Thanks to their dynamic sliding window structure, and their static kernel weights, convolutions are translationally equivariant , since every output pixel is the product of its corresponding input pixel centred in a window and multiplied by the static kernel weight.

Self Attention.

SA (Eq. 1) is translationally equivariant , because: 1. the linear projections maintain that property, and 2. self attention weights are also equivariant to any change in order.

SASA.

SASA extracts key-value pairs for every query according to the same raster-scan pattern convolutions follow, which suggests it maintains translational equivariance. However, convolutions apply static kernel weights, which allows them to maintain this property. On the other hand, even though SASA applies dynamic weights, those weights are still a function of the pixels within the window. Therefore, SASA also maintains translational equivariance. Note that SASA does not enjoy the same position-agnostic property in self attention.

HaloNet.

The blocked self attention pattern described in HaloNet is described to “relax” translational equivariance. It is simply due to the fact that pixels within the same region share their neighborhood, therefore their sliding window property is relaxed and with it translational equivariance.

WSA and SWSA.

The basic property present in both WSA and SWSA is the partitioning, which exists in only one of two forms (regular and shifted) and therefore not dynamically sliding like SASA or convolutions. This simply breaks translational equivariance, as translations move the dividing lines. To give an example, an object within the feature map could fit within a single WSA partition, but the translation could shift the object just enough so that it falls into two different partitions. To illustrate this, we provide visualizations of activations from a single Swin block (WSA + SWSA) in Fig. VI, where we compare translations applied to input and output. We replace all linear projections with the identity function (as those are already known to be equivariant) and remove positional biases for simplicity in visualization.

NA.

We note that our NA preserves translational equivariance for the most part, similar to SASA. However, NA relaxes translational equivaraince in corner cases in favor of maintaining attention span. We present translations applied to dummy inputs and their NA outputs in Fig. VI, similar to those of Swin. However, we also note that NA relaxes the translational equivariance in corner cases, particularly because of its definition of neighborhood which results in sliding windows being repeated at edge pixels. A visualization of this can be seen visualized with a larger kernel size (quarter of the image) compared to Swin and SASA in Fig. V.

The difference in how corner cases are handled is an important difference which should exist between sliding window attention mechanisms and convolutions. Repeating sliding windows at corner cases (which NA achieves with the neighborhood definition) is useful in the scope of attention, because the repeated windows are still subsets of the original self attention weights, which are being restricted. This does not hold true in convolutions, where repeated sliding windows produces repeated output pixels, because of the static kernel. On the other hand, zero padding in attention (no repetition at corner cases; like SASA) is less powerful because it limits attention span farther at corner cases. It also does not approach self attention as its window size grows, while NA does.