Dilated Neighborhood Attention Transformer

Ali Hassani, Humphrey Shi

Introduction

Transformers have made a significant contribution to AI research, starting with natural language understanding before being applied to other modalities such as speech and vision, thanks to their universal architecture built upon self attention. This success inspired efforts into attention-based models in vision, from backbone networks to more specific applications including image generation and density modeling, object detection, image segmentation, and more.

Vision Transformer (ViT) was one of the first major demonstrations of transformers as direct alternatives to Convolutional Neural Networks (CNNs), the de facto standard in vision. ViT treats an image as a sequence of patches and uses a plain transformer encoder to encode and classify images. It demonstrated performance competitive with CNNs on large-scale image classification, and resulted in a surge in vision research focused on transformer-based architectures as competitors to CNNs.

Vision transformers and CNNs are different not only in terms of architecture and building blocks, but also in how they treat data. CNNs typically downsample inputs gradually as they pass through the model and construct hierarchical feature maps. This hierarchical design is crucial for vision, as objects vary in scale, and high-resolution feature maps are important to dense tasks, such as segmentation. On the other hand, transformers are known for their fixed dimensionality throughout the model, and as a result, plain ViTs downsample inputs aggressively from the very beginning to alleviate the quadratic cost of self attention, which in turn hinders the application of plain ViTs as backbones to dense vision tasks.

While research in applying plain ViTs to dense vision tasks continues, research into hierarchical vision transformers quickly became dominant and continues to grow. A key advantage of these hierarchical transformer models is their ease of integration with existing hierarchical vision frameworks. Inspired by existing CNNs, hierarchical vision transformers consist of multiple (typically 4) levels of transformer encoders, with downsampling modules in between, and a less aggressive initial downsampling (i.e. $1/4$ instead of $1/16$). Earlier layers in hierarchical transformers, if using unrestricted self attention, would bear the same quadratically growing complexity and memory usage with respect to input resolution, making them intractable for higher resolution images. Therefore, hierarchical transformers typically employ certain local attention mechanisms.

Swin Transformer, one of the earliest hierarchical vision transformers, utilizes a Window Self Attention (WSA) module, followed by a pixel-shifted Window Self Attention (SWSA), both of which localize self attention to non-overlapping sub-windows. This reduces the cost of self attention, making its time and space complexity linear with respect to resolution. SWSA is identical to WSA, but with a shift in feature map pixels preceding it, and followed by a reverse shift. This is essential to its performance, as it allows out-of-window interactions, and therefore the expansion of its receptive field. One of the major advantages of Swin is efficiency, as pixel shifts and window partitioning are relatively cheap and easily parallelizable operations. Additionally, it involves little to no changes to the self attention module, making implementation easier. Swin became the state of the art across multiple vision tasks, and was followed by Swin-V2 to accommodate large-scale pre-training.

Neighborhood Attention Transformer (NAT) was introduced later, with a simple sliding-window based attention, Neighborhood Attention (NA). Unlike Stand Alone Self Attention (SASA), which applies attention in the style of convolutions, NA localizes self attention to the nearest neighbors around each token, which allows it by definition to approach self attention as the neighborhood grows, while keeping a fixed attention span. Such pixel-wise attention operations were assumed to be inefficient and challenging to parallelize, until the release of Neighborhood Attention Extension. With this extension, NA can run even faster than Swin’s SWSA in practice. NAT was able to significantly outperform Swin on image classification, and achieved competitive performance on downstream tasks, while also scaling up to be even faster than Swin despite the slightly different architecture.

Despite the efforts into hierarchical vision transformers with local attention, some of self attention’s most important properties, including global receptive field, and the ability to model long-range inter-dependencies, are weakened as a result of this localization.

This leads to a simple question: How does one maintain the tractability that local attention provides in hierarchical vision transformers, while avoiding its shortcomings? In other words, the optimal scenario is maintaining linear complexity while preserving the global receptive field and the ability to model long-range inter-dependencies of self attention. In this paper, we aim to answer this question and improve hierarchical transformers by extending a simple local attention mechanism, Neighborhood Attention, to Dilated Neighborhood Attention (DiNA): a flexible and powerful sparse global attention. Dilating neighborhoods in NA into larger sparse regions has multiple advantages: (1) it captures more global context, (2) it allows the receptive field to grow exponentially, as opposed to linearly, and (3) it comes at no additional computational cost. To demonstrate the effectiveness of DiNA, we propose Dilated Neighborhood Attention Transformer (DiNAT), which not only improves upon the existing NAT model in terms of downstream performance, but also outperforms strong modern CNN baselines, such as ConvNeXt, in downstream tasks by a noticeable margin.

Our main contributions can be summarized as follows:

Introducing DiNA, a simple and powerful sparse global attention pattern, which allows the receptive field to grow exponentially and captures longer-range context without any additional computational burden. DiNA does so while maintaining the symmetry in neighborhoods introduced in NA. It can also adapt to larger resolutions without expanding to larger window sizes.

Analyzing theoretical receptive field sizes in models based on convolutions, localized attention, and a DiNA-based model.

Introducing DiNAT, a new hierarchical vision transformer made of both dilated and non-dilated variants of NA. DiNAT utilizes a gradual dilation change through the model, which extends receptive fields more optimally and helps fine-to-coarse feature learning.

Conducting extensive experiments on image classification, object detection, and segmentation with DiNAT, and finding that it exhibits a noticeable improvement in downstream tasks over both attention-based and convolutional baselines. Additionally, we investigate isotropic and hybrid attention variants, scaling experiments with ImageNet-22K pre-training, and the effects of different dilation values. We also achieve state of the art image segmentation performance with advanced segmentation frameworks.

Extending $\mathcal{N}ATTEN$, NA’s CUDA extension for PyTorch, by adding dilation support and bfloat16 utilization, allowing the research in this direction to be extended to other tasks and applications.

While the initial experiments with DiNAT already exhibit significant improvements in downstream vision tasks, neither its performance nor applications stop here. NA’s local attention and DiNA’s sparse global attention complement each other: they can preserve locality, model longer-range inter-dependencies, expand the receptive field exponentially, and maintain a linear complexity. Their restriction of self attention can potentially improve convergence by avoiding self attention’s possible redundant interactions, such as those with repetitive, background, or distracting tokens. Combinations of local attention and sparse global attention can potentially empower various vision tasks and beyond. To support research in this direction, we open source our entire project, including our modified $\mathcal{N}ATTEN$, which can reduce runtime by orders of magnitude compared to naive implementations.

Related Work

We briefly review dot product self attention (DPSA), the Transformer, and Vision Transformer. We then move on to localized self attention modules such as SASA, SWSA in Swin Transformer, and NA in Neighborhood Attention Transformer, and discuss their limitations, which are our motivation behind this work. Finally, we discuss previous uses of sparse attention mechanisms in language processing and vision.

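Vaswani et al. define dot product attention as an operation between a query and a set of key-value pairs. The dot product of the query and keys is scaled and sent through a softmax activation to produce attention weights. Said attention weights are then applied to the values:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V,$$

where $Q$, $K$, and $V$ are the query, key, and value matrices, and $\sqrt{d}$ is the scaling parameter.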

2.2 Local Attention

SASA is one of the earliest local attention mechanisms that was specifically designed to be used in vision models, years before ViT. It sets the key-value pair to sliding windows over the feature map, therefore localizing attention for each query (pixel) to a window centered around it. Such an operation could easily replace convolutions in existing CNNs, such as ResNets, and theoretically even reduce computational complexity. Despite the promise it showed, the authors found that the resulting model runs slowly, due to the inefficient implementation of this module. Works succeeding it therefore switched to alternative methods that could run more efficiently, such as blocked self attention in HaloNet, and Window Self Attention in Swin.

Shifted Window Self Attention (SWSA).

Liu et al. proposed Window Self Attention (WSA) and its shifted variant SWSA, and used them in their hierarchical model for vision, Swin Transformer. They pointed out the inefficiency of sliding-window methods such as SASA as one of their motivations behind developing Window Self Attention. The shifted variant (SWSA), as the name suggests, shifts pixels before the attention operation, and reverses the shift afterwards, to create a different window partitioning compared to the previous layer, which allows for out-of-window interactions that are crucial to a growing receptive field (see Fig. 3). Swin initially became the state of the art in object detection and semantic segmentation. It also inspired other works that extended it to different tasks beyond the ones explored in the paper, such as generation, restoration, masked image modeling, video action recognition, and more. Additionally, the followup model, Swin-V2, became the new state of the art with its largest variant. It is noteworthy that Swin-V2 utilizes much larger window sizes to achieve such performance, which in turn increases time complexity and memory usage.
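The effect of the shift can be illustrated in one dimension. The sketch below is not Swin's implementation: it assumes a 1-D sequence, a cyclic shift done with `np.roll`, and it leaves out the attention masking that the real model applies to wrapped-around tokens. The function name `window_ids` is hypothetical.

```python
import numpy as np

def window_ids(n, window, shift=0):
    """Return, for each of n tokens, the index of the window it is assigned to.

    Tokens are cyclically shifted by `shift` positions and then cut into
    non-overlapping windows of length `window` (n must be divisible by window).
    """
    assert n % window == 0
    shifted = np.roll(np.arange(n), -shift)   # position p now holds token (p + shift) % n
    ids = np.empty(n, dtype=int)
    ids[shifted] = np.arange(n) // window     # window of each position, mapped back to tokens
    return ids

plain = window_ids(8, 4)              # WSA-style partition
shifted = window_ids(8, 4, shift=2)   # SWSA-style partition after a shift of 2
print(plain.tolist())                 # [0, 0, 0, 0, 1, 1, 1, 1]
print(shifted.tolist())               # [1, 1, 0, 0, 0, 0, 1, 1]
```

Tokens 3 and 4 sit in different windows under the plain partition but share a window after the shift, which is the out-of-window interaction described above.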

Neighborhood Attention (NA).

NA was proposed as a simple sliding-window attention, which localizes self attention for each pixel to its nearest neighbors. NA shares the same time and space complexity and number of parameters as Swin’s WSA and SWSA, but instead operates in overlapping sliding windows, and therefore preserves translation equivariance. While NA’s sliding-window pattern is similar to SASA, its formulation of nearest neighbors makes it a direct restriction of self attention, and therefore NA, unlike SASA, approaches SA as its window size grows. A major challenge of sliding-window attention was the lack of efficient implementations, as no existing deep learning or CUDA libraries support such operations directly. Therefore, NA was introduced along with $\mathcal{N}ATTEN$, an extension with efficient CPU and GPU kernels that allow NA to outperform modules such as WSA/SWSA in terms of both speed and memory usage. The model, Neighborhood Attention Transformer (NAT), is similar in its hierarchical design to Swin Transformer. The key difference, other than the attention modules, is that NAT utilizes overlapping convolutions in downsampling layers, as opposed to the patched ones used in Swin. As a result, to keep variants similar to Swin variants in terms of number of parameters and FLOPs, the models were made slightly deeper, with smaller inverted bottlenecks. NAT achieves superior results in image classification compared to Swin, and performs competitively on downstream tasks.

While local attention based models are able to perform well across different vision tasks due to their preservation of locality and efficiency, they fall short of capturing global context like self attention, which is also crucial to vision. Additionally, localized attention mechanisms utilize a smaller and slowly growing receptive field, similar to that of convolutions, compared to the full-sized receptive field in self attention. Besides self attention, several works also explored global receptive fields in vision, including but not limited to Non-local Neural Networks. However, operations with unrestricted global receptive field usually suffer from high computational complexities compared to restricted ones, which can be local, or sparse.

2.3 Sparse Attention

Child et al. proposed Sparse Transformers, which in addition to scaling to much deeper variants, utilized a sparse-kernel attention mechanism. Through this, the model was able to train much more efficiently on longer sequences of data. There have been other works in sparse attention, such as Longformer, Routing Transformers, and CCNet, all of which share a common feature: reducing the cost of self attention in cases where longer sequences of tokens are inevitable, but a global context is still necessary. Longformer specifically investigates using a combination of 1-D sliding window attention with and without dilation, along with global attention for specific tokens. This results in a model that is able to process long documents while maintaining the global context. CCNet uses axial attention to improve semantic segmentation heads by introducing global context without the quadratic cost of unrestricted self attention. More recently, MaxViT explored a hybrid model, which uses a combination of MBConv, Window Attention, and sparse grid attention, obtaining high ImageNet accuracy. However, the resulting model yields higher complexity and lower throughput compared to Swin.

Even though such non-local and sparse restrictions of self attention have been shown to be promising, they are not well-studied in the scope of hierarchical vision transformers. To expand the local receptive fields, and re-introduce global context into hierarchical vision transformers, we introduce Dilated Neighborhood Attention (DiNA), an extension of NA that spans neighborhoods over longer ranges by increasing the step size, while maintaining the overall attention span. DiNA can serve as a sparse and global operation and works most effectively when used in conjunction with NA as a local-only operation. We provide empirical evidence for this claim with our hierarchical vision transformer, Dilated Neighborhood Attention Transformer (DiNAT). We present an illustration of receptive fields in Fig. 4, where we compare fully connected layers to convolutions and dilated convolutions, and similarly self attention to NA and DiNA.

Method

In this section, we define DiNA as an extension to NA, analyze its effect on the receptive field, and move on to our model, DiNAT. We also provide brief details on implementation, and integration with the existing $\mathcal{N}ATTEN$ package.

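For simplicity, we keep the notation limited to single-dimensional NA and DiNA. Given an input $X\in\mathbb{R}^{n\times d}$ whose rows are $d$-dimensional token vectors, query and key linear projections of $X$, $Q$ and $K$, and relative positional biases $B_{(i,j)}$ between any two tokens $i$ and $j$, we define neighborhood attention weights for the $i$-th token with neighborhood size $k$, $\mathbf{A}_{i}^{k}$, from the dot products of the $i$-th token’s query projection with its $k$ nearest neighboring tokens’ key projections:

$$\mathbf{A}_{i}^{k}=\begin{bmatrix}Q_{i}K_{\rho_{1}{(i)}}^{T}+B_{(i,\rho_{1}{(i)})}\\ Q_{i}K_{\rho_{2}{(i)}}^{T}+B_{(i,\rho_{2}{(i)})}\\ \vdots\\ Q_{i}K_{\rho_{k}{(i)}}^{T}+B_{(i,\rho_{k}{(i)})}\end{bmatrix}^{T},$$

where $\rho_{j}{(i)}$ denotes $i$’s $j$-th nearest neighbor. We similarly define neighboring values, $\mathbf{V}_{i}^{k}$, as a matrix whose rows are the $i$-th token’s $k$ nearest neighboring value projections:

$$\mathbf{V}_{i}^{k}=\begin{bmatrix}V_{\rho_{1}{(i)}}^{T}&V_{\rho_{2}{(i)}}^{T}&\ldots&V_{\rho_{k}{(i)}}^{T}\end{bmatrix}^{T},$$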

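where $V$ is a linear projection of $X$. Neighborhood Attention output for the $i$-th token with neighborhood size $k$ is then defined as:

$$\mathrm{NA}_{k}(i)=\mathrm{softmax}\left(\frac{\mathbf{A}_{i}^{k}}{\sqrt{d}}\right)\mathbf{V}_{i}^{k},$$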

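where $\sqrt{d}$ is the scaling parameter, and $d$ is the embedding dimension. To extend this definition to DiNA, given a dilation value $\delta$, we simply define $\rho_{j}^{\delta}{(i)}$ as token $i$’s $j$-th nearest neighbor that satisfies: $j\bmod\delta=i\bmod\delta$. We can then define $\delta$-dilated neighborhood attention weights for the $i$-th token with neighborhood size $k$, $\mathbf{A}_{i}^{(k,\delta)}$, as follows:

$$\mathbf{A}_{i}^{(k,\delta)}=\begin{bmatrix}Q_{i}K_{\rho_{1}^{\delta}{(i)}}^{T}+B_{(i,\rho_{1}^{\delta}{(i)})}\\ Q_{i}K_{\rho_{2}^{\delta}{(i)}}^{T}+B_{(i,\rho_{2}^{\delta}{(i)})}\\ \vdots\\ Q_{i}K_{\rho_{k}^{\delta}{(i)}}^{T}+B_{(i,\rho_{k}^{\delta}{(i)})}\end{bmatrix}^{T},$$

where $Q$ and $K$ are the query and key projections of $X$, and $B_{(i,j)}$ is the relative positional bias between tokens $i$ and $j$.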

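We similarly define $\delta$-dilated neighboring values for the $i$-th token with neighborhood size $k$, $\mathbf{V}_{i}^{(k,\delta)}$:

$$\mathbf{V}_{i}^{(k,\delta)}=\begin{bmatrix}V_{\rho_{1}^{\delta}{(i)}}^{T}&V_{\rho_{2}^{\delta}{(i)}}^{T}&\ldots&V_{\rho_{k}^{\delta}{(i)}}^{T}\end{bmatrix}^{T}.$$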

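DiNA output for the $i$-th token with neighborhood size $k$ is then defined as:

$$\mathrm{DiNA}_{k}^{\delta}(i)=\mathrm{softmax}\left(\frac{\mathbf{A}_{i}^{(k,\delta)}}{\sqrt{d}}\right)\mathbf{V}_{i}^{(k,\delta)}.$$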
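The definition above can be sketched for a single head in one dimension with NumPy. This is only an illustration under stated assumptions, not the $\mathcal{N}ATTEN$ kernels: relative positional biases are omitted, $k$ is assumed odd, and edge tokens are handled by clamping the window inside the set of tokens that share the query's residue modulo $\delta$, so that every token keeps exactly $k$ neighbors. The names `dilated_neighbors` and `dina_1d` are hypothetical.

```python
import numpy as np

def dilated_neighbors(i, n, k, delta=1):
    """Indices of the k delta-dilated nearest neighbors of token i (1-D, n tokens)."""
    if k % 2 != 1:
        raise ValueError("neighborhood size k is assumed to be odd")
    if k * delta > n:
        raise ValueError("need k * delta <= n so that k dilated neighbors exist")
    group = np.arange(i % delta, n, delta)   # tokens j with j mod delta == i mod delta
    pos = i // delta                         # position of token i inside its group
    start = min(max(pos - k // 2, 0), len(group) - k)  # clamp the window at the edges
    return group[start:start + k]

def dina_1d(q, key, value, k, delta=1):
    """Single-head DiNA over (n, d) projections; delta=1 reduces to NA."""
    n, d = q.shape
    out = np.empty_like(value, dtype=float)
    for i in range(n):
        nbrs = dilated_neighbors(i, n, k, delta)
        logits = key[nbrs] @ q[i] / np.sqrt(d)    # scaled dot products with neighbors only
        weights = np.exp(logits - logits.max())   # numerically stable softmax
        weights /= weights.sum()
        out[i] = weights @ value[nbrs]
    return out

print(dilated_neighbors(8, 16, 3, 2).tolist())   # [6, 8, 10]
print(dilated_neighbors(15, 16, 3, 2).tolist())  # [11, 13, 15]
```

With $\delta=1$ and $k=n$, every token attends to all tokens, so the sketch coincides with plain self attention; with $\delta>1$ the same $k$ weights are spread over a span of roughly $k\delta$ tokens.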

3.2 Choice of Dilation

DiNA introduces a key new architectural hyperparameter: per-layer dilation values. We define the upper bound for dilation value to be $\lfloor\frac{n}{k}\rfloor$, where $n$ is the number of tokens, and $k$ is kernel/neighborhood size. This is simply to ensure exactly $k$ dilated neighbors exist for each token. The lower bound is always 1, which would be equivalent to vanilla NA. Therefore, dilation value in each layer of the model will be an input-dependent hyperparameter, which can take any integer $\delta\in[1,\lfloor\frac{n}{k}\rfloor]$. Because dilation values are changeable, they provide a flexible receptive field (discussed in Sec. 3.3). It is not feasible to try out all possible combinations; therefore, we explored a limited number of choices, which are discussed in Sec. 4.4.
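The bound can be written as a small helper. This is a sketch with hypothetical function names; it assumes $n$ is measured along a single axis of the feature map, so for 2-D inputs it would be applied per axis.

```python
def max_dilation(n, k):
    """Largest dilation for which every token still has exactly k dilated neighbors."""
    if n < k:
        raise ValueError("the axis must hold at least k tokens")
    return n // k  # floor(n / k)

def valid_dilations(n, k):
    """All integer dilation values in [1, floor(n / k)]; 1 is plain NA."""
    return list(range(1, max_dilation(n, k) + 1))

print(max_dilation(56, 7))     # 8
print(valid_dilations(14, 7))  # [1, 2]
```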

3.3 Receptive Fields

We analyze DiNA’s receptive field, as it is important to understanding the power of DiNA, especially in comparison to other models. We present a comparison of receptive field sizes in different attention patterns in Tab. 1, along with FLOPs and memory usage. We also include depth-wise separable convolution (DWSConv), the key component in ConvNeXt, for completeness.
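The difference between linear and exponential growth can be checked numerically. The sketch below assumes the standard stacking rule for sliding windows (the same one used for dilated convolutions): in one dimension, a layer with kernel size $k$ and dilation $\delta$ widens the receptive field by $(k-1)\delta$. Edge effects are ignored and the function name is hypothetical.

```python
def receptive_field(kernel_size, dilations):
    """1-D receptive field after stacking one sliding-window layer per dilation value."""
    rf = 1
    for delta in dilations:
        rf += (kernel_size - 1) * delta
    return rf

k, layers = 3, 4
na_only = receptive_field(k, [1] * layers)                    # layers * (k - 1) + 1
growing = receptive_field(k, [k ** l for l in range(layers)]) # dilations 1, k, k^2, ...
print(na_only, growing)  # 9 81
```

With dilation fixed at 1 the receptive field grows as $\ell(k-1)+1$ over $\ell$ layers, whereas dilations that grow by a factor of $k$ per layer give $k^{\ell}$.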

It is worth noting that while Swin enjoys a slightly larger receptive field compared to NAT and ConvNeXt thanks to its special shifted window design, it breaks an important property: symmetry. Since Swin’s feature maps are partitioned into non-overlapping windows, pixels within the same window only attend to each other, regardless of their position (whether at center or corner), leading to some pixels seeing asymmetric context around them.

3.4 DiNAT

For a fair evaluation of DiNA’s performance, we design DiNAT to be identical to the original NAT model in terms of architecture and configuration. It uses two 3×3 convolutional layers with 2×2 strides initially, resulting in feature maps that are a quarter of the input resolution. It also uses a single 3×3 convolution with 2×2 strides to downsample between levels, which cuts spatial resolution in half and doubles channels. Details are presented in Tab. 2. The key difference in DiNAT is that every other layer uses DiNA instead of NA. Dilation values for DiNA layers are set based on the task and input resolution. For ImageNet-1K at $224^{2}$ resolution, we set dilation values to 8, 4, 2, and 1 in levels one through four respectively. In downstream tasks, because of their larger resolution, we increase dilation values beyond that. All dilation values and other relevant architecture details are presented in Tab. II.
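The level layout described above can be sketched as follows. The helper names are hypothetical, the depth used in the example is illustrative only (actual depths are in Tab. 2), and the sketch assumes the strided convolutions halve the side length exactly. It places NA first and DiNA second within each pair of layers.

```python
def level_resolutions(image_size, levels=4):
    """Feature-map side length per level: 1/4 of the input, then halved between levels."""
    side = image_size // 4
    sizes = []
    for _ in range(levels):
        sizes.append(side)
        side //= 2
    return sizes

def layer_dilations(depth, level_dilation):
    """Alternate NA (dilation 1) and DiNA (level dilation) through one level."""
    return [1 if layer % 2 == 0 else level_dilation for layer in range(depth)]

sizes = level_resolutions(224)
print(sizes)                              # [56, 28, 14, 7]
print([side // 7 for side in sizes])      # [8, 4, 2, 1] with a 7x7 kernel
print(layer_dilations(4, 8))              # [1, 8, 1, 8]
```

With a 7×7 kernel, the largest dilation that fits each level at $224^{2}$ resolution is exactly the 8, 4, 2, 1 setting quoted above.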

3.5 Implementation

We implemented DiNA on top of the existing Neighborhood Attention Extension ($\mathcal{N}ATTEN$), allowing ease of use and identical memory usage to NA. The latest public version of the extension includes a more efficient “tiled” implementation of Neighborhood Attention, which is what allows it to compete with methods such as Swin in terms of speed. By adding a dilation element to all the existing CUDA kernels, and re-implementing the “tiled” kernel to support dilated memory format, we managed to implement DiNA without affecting the speed of the existing NA kernels. However, it should be noted that DiNA’s throughput will depend on dilation value, and is expected to be slightly slower than NA in practice. This is simply due to the break in memory access pattern, which would affect throughput overall (see Fig. 1(a)). We also note that these implementations are still fairly naive and do not fully utilize newer architecture standards in CUDA, such as Tensor Cores, and are therefore only working as a proof of concept. Despite this limitation, models using NA and DiNA can achieve competitive throughput levels compared to other methods that mostly utilize convolutions, linear projections, and self attention, all of which run through NVIDIA libraries that fully utilize the aforementioned standards. More information on implementation is provided in Appendix A.

Experiments

We conducted extensive experiments to study the effects of our proposed DiNAT model over existing baselines. Similar to existing methods, we pre-train models on image classification (ImageNet-1K and ImageNet-22K), and then transfer the learned weights to downstream vision tasks. We compare DiNAT to the original NAT model, Swin, and ConvNeXt. We also pair our model with Mask2Former and perform instance, semantic, and panoptic segmentation experiments.

We used timm (Apache License v2), which now serves as the community standard for ImageNet training in PyTorch, to train our model on ImageNet-1K. We use the same training configurations, regularization techniques, and augmentations (CutMix, Mixup, RandAugment, and Random Erasing) used in NAT and Swin. Models trained on ImageNet-1K directly are trained for 300 epochs with a batch size of 1024, and use an iteration-wise cosine learning rate schedule and a 20 epoch warmup, with a base learning rate of 1e-3, and weight decay rate of 0.05, cooled down for an additional 10 epochs. Larger variants are pre-trained on ImageNet-22K for 90 epochs with a batch size of 4096, but use a linear learning rate schedule and a 5 epoch warmup, with a base learning rate of 1e-3, and weight decay rate of 0.01, again following Swin. We fine-tune models pre-trained on ImageNet-22K to ImageNet-1K for 30 epochs, with a batch size of 512, a linear learning rate schedule with no warmup, a base learning rate of 5e-5, and weight decay rate of 1e-4. Final ImageNet-1K validation set accuracy levels, along with number of learnable parameters, FLOPs, throughput, and memory usage are provided in Tabs. 3 and 4. The reason for providing both FLOPs and throughput is to point out the necessity of distinguishing theoretical computational requirements from efficiency in practice with each method’s available implementation. This is especially important in this case because NA and DiNA are based on from-scratch implementations of the algorithms ($\mathcal{N}ATTEN$), and are not as well-optimized as ConvNeXt or Swin, which mostly run on native NVIDIA libraries designed for optimal throughput.
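For readers unfamiliar with this kind of schedule, the ImageNet-1K setting can be sketched in plain Python. This is not timm's scheduler: the function name is hypothetical, and the sketch assumes a linear warmup from a starting rate of 0, a cosine decay that begins once warmup ends, a floor of 0, and a cooldown that simply holds the floor. Only the epoch counts and the base learning rate come from the text.

```python
import math

def learning_rate(step, steps_per_epoch, base_lr=1e-3, epochs=300,
                  warmup_epochs=20, warmup_start_lr=0.0, min_lr=0.0):
    """Per-iteration learning rate: linear warmup, cosine decay, then held at min_lr."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch
    if step < warmup_steps:
        # Linear ramp from warmup_start_lr to base_lr (assumed shape of the warmup).
        return warmup_start_lr + (base_lr - warmup_start_lr) * step / warmup_steps
    if step >= total_steps:
        # Cooldown epochs: assumed to stay at the minimum learning rate.
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example with a hypothetical 100 iterations per epoch.
peak = learning_rate(20 * 100, steps_per_epoch=100)      # end of warmup
halfway = learning_rate(160 * 100, steps_per_epoch=100)  # middle of the cosine decay
```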

DiNAT does not show improvement over NAT in smaller variants. Improvement over NAT-Mini is less than 0.1%, and we found that while the Tiny variant converges faster than NAT-Tiny at first, it converges to a lower accuracy of 82.7%. On the Small and Base variants, DiNAT shows a slight improvement of at least 0.1% over NAT. We noticed that despite these small differences in classification accuracy, DiNAT consistently outperforms NAT across all four variants on downstream tasks.

ImageNet-22K.

We pre-trained our Large variant on ImageNet-22K, and fine-tuned it to ImageNet-1K at both $224^{2}$ and $384^{2}$ resolutions. We found that our large variant can successfully outperform Swin-Large and match ConvNeXt-Large’s accuracy at $224^{2}$ resolution. At $384^{2}$, our large variant exceeds its Swin counterpart’s reported accuracy without increasing its kernel size from $7^{2}$ to $12^{2}$. Upon increasing the large variant’s kernel size to $11^{2}$ and interpolating positional biases (similar to Swin), we see that our large variant matches ConvNeXt-Large’s accuracy as well. We note that NA/DiNA are in theory limited to odd-sized kernels, which is the reason behind picking $11^{2}$ instead of $12^{2}$.

Isotropic variants.

To further compare NA/DiNA to plain self attention, we also explore isotropic variants of NAT and DiNAT, similar to isotropic ConvNeXt variants. These models simply follow ViT in design: a single Transformer encoder operating on feature maps with a fixed spatial size ($14^{2}$), preceded by a single patch-and-embedding layer; they are not hierarchical transformers. To maintain fairness in comparison to self attention, we trained ViT models with relative positional biases (ViT+) to ensure the models are only different in attention patterns. Note that ViT variants with relative positional biases have previously been explored in timm, but we run our own to ensure similar training settings. We present a comparison of these models and their performance on ImageNet-1K in Tab. 5.

We find that isotropic variants of both NAT and DiNAT exhibit only minor throughput improvements over ViT+, which can again be attributed to the lack of fully optimized implementations. Note that these variants reduce FLOPs to almost the same number as isotropic ConvNeXt variants. They also noticeably reduce memory usage compared to ViT+. As for performance, we observe that isotropic NAT variants result in a drop in performance compared to ViT+, which is to be expected since NAT has half the attention span of ViT+. However, we find that isotropic DiNAT variants significantly improve upon NAT’s isotropic variants, without increasing kernel size. This further supports our claim that a combination of NA and DiNA is more effective at producing an alternative to self attention than simply using NA throughout the model.

To further study the effects of different attention mechanisms, and investigate whether or not a model fully based on self attention always yields the best result, we experiment with hybrid isotropic models utilizing both NA/DiNA layers and self attention. We present those results in Tab. 6. We found that a small-scale (22M parameter) model with only half the layers performing self attention and the other half neighborhood attention can reach an accuracy similar to that of a comparable model with all 12 layers utilizing self attention. We also found that changing the order of different attention layers can result in an approximately 0.2% change in accuracy.

4.2 Object Detection and Instance Segmentation

To explore DiNAT’s effectiveness in object detection and instance segmentation, we used its pre-trained weights as backbones for Mask R-CNN and Cascade Mask R-CNN, and trained those models on MS-COCO. We followed NAT and Swin’s training settings in mmdetection (Apache License v2), and trained with the same accelerated $3\times$ LR schedule. The results are presented in Tab. 7. We observe that DiNAT consistently shows noticeable improvement over NAT, with little-to-no drop in throughput. There are even instances where DiNAT surpasses NAT’s throughput, but within the margin of error. Additionally, we observe that this improvement over NAT pushes DiNAT ahead of ConvNeXt. At scale, we see DiNAT continues to outperform both Swin and ConvNeXt with ImageNet-22K pre-training.

4.3 Semantic Segmentation

We also trained UPerNet with our DiNAT as the backbone on ADE20K, with ImageNet-pre-trained backbones. We followed NAT’s mmsegmentation (Apache License v2) configurations, which in turn follow Swin’s configuration for training on ADE20K. The results are presented in Tab. 8. We find that DiNAT exhibits a noticeable improvement over the original NAT model. DiNAT also maintains its place ahead of both Swin and ConvNeXt at scale with ImageNet-22K pre-training.

4.4 Ablation study

In this section, we aim to study DiNAT in more depth by analyzing the effects of: dilation values, NA-DiNA order, kernel sizes, and test-time changes in dilation.

In Tab. 9, we present models with different dilation values, and their effect on classification, detection, instance segmentation and semantic segmentation performance levels. Note that the increased dilation (16, 8, 4, 2) is applicable to downstream tasks only, because in theory input feature maps should be larger than or equal to the product of kernel size and dilation. As a result, “8, 4, 2, 1” is the maximum applicable dilation to ImageNet at 224 × 224 resolution. Depending on image resolution, even higher dilation values are possible. We explored a “dynamic” dilation value, where DiNA layers apply the maximum possible dilation, which is the floor of resolution divided by kernel size (“Maximum” in Tab. 9). We finally settle on “gradual” dilation (see illustration in Fig. 4), in which we gradually increase dilation to the maximum level defined. For instance, if the maximum dilation for a specific level is 8, its layers will have dilation values 1, 2, 1, 4, 1, 6, 1, 8 (refer to Appendix B for details).

NA-DiNA vs. DiNA-NA.

We also experimented with models with DiNA layers before NA layers, as opposed to our final NA before DiNA choice. While the local-global order (NA-DiNA) was our initial choice, we’ve also found it to be the more effective choice. We also tried a model with only DiNA modules, and found that it performs significantly worse than other combinations. This highlights the importance of having a combination of both local and sparse global attention patterns in the model. The results are summarized in Tab. 10.

Kernel size.

We study the effect of kernel size on model performance in Tab. 11. We observed that DiNAT-Tiny sees a significant decay in performance with a smaller kernel size across all three tasks. However, we find that increasing kernel size beyond the default 7×7 does not yield a significant improvement.

Test-time dilation changes.

We present an analysis of sensitivity to dilation values, in which we attempt different dilation values on already trained models, and evaluate their performance. This can be particularly important to cases with varying resolutions, i.e. multi-scale testing. For DiNAT to be at its best, dilation level needs to be a near-maximum number to expand attention to a longer range. The results are presented in Tab. 12.

4.5 Image segmentation with Mask2Former

To analyze DiNAT’s segmentation performance further, we conducted experiments with Mask2Former. Mask2Former is an attention-based segmentation architecture, which can be trained on instance segmentation, semantic segmentation, and panoptic segmentation. It set a new state-of-the-art score for panoptic and instance segmentation on MS-COCO, as well as semantic segmentation on ADE20K. Mask2Former additionally used Swin-Large as the backbone, making it the perfect candidate for this experiment. We trained Mask2Former on MS-COCO, ADE20K, and Cityscapes, on all three segmentation objectives (instance, semantic, and panoptic), by simply replacing the Swin-Large backbone in a fork of their original repository. Following their reported environment, we used PyTorch 1.9 with Detectron2. We present instance segmentation results in Tab. 13, semantic segmentation results in Tab. 14, and panoptic segmentation results in Tab. 15. We note that DiNAT-L is using an $11^{2}$ kernel size, instead of Swin-L’s $12^{2}$, since even-sized windows break the symmetry in NA and are therefore not defined.

DiNAT-L outperforms Swin-L on all three tasks and datasets. It also sets new state of the art records for image segmentation without using extra data. According to PapersWithCode leaderboards, DiNAT-L with Mask2Former is the SOTA in panoptic segmentation on ADE20K and MS-COCO, and in instance segmentation on ADE20K and Cityscapes. It also ties with the current SOTA on ADE20K semantic segmentation, and ranks second on Cityscapes semantic segmentation (previous SOTA on both is SeMask).

Conclusion

Local attention modules are effective at reducing complexity, and are crucial when working with a hierarchical model that gradually downsamples inputs. Nevertheless, they cannot capture longer range inter-dependencies as well as global self attention, unless their receptive field size is increased, which defeats their initial purpose of efficiency and tractability. In this paper, we propose DiNA, a natural extension to NA that expands its local attention to sparse global attention at no additional cost. We build DiNAT with combinations of NA and DiNA, and show that it can improve performance significantly, especially in downstream tasks, without introducing any additional computational burden. Paired with new segmentation frameworks, our model achieves state-of-the-art image semantic, instance, and panoptic segmentation performance. While our experiments give insight into the power behind such flexible attention modules, neither their performance nor efficiency stop here. We believe that combinations of NA and DiNA will be able to empower various models in vision and beyond, wherever locality and global context matter. We open source our entire project, including our extension to $\mathcal{N}ATTEN$, and will continue to support it as a toolkit for the community to allow easy experimentation with sparse sliding-window attention.

We thank Picsart AI Research (PAIR), Meta/Facebook AI, and Intelligence Advanced Research Projects Activity (IARPA) for their generous support that made this work possible.

References

Appendix

Appendix A Implementation notes

As discussed in Sec. 3.5, we extend the existing $\mathcal{N}ATTEN$ package to support dilated neighborhoods. $\mathcal{N}ATTEN$ has a two-stage attention computation, similar to many other implementations: QK and AV. The former computes the dot product of queries and keys and produces attention weights, and the latter applies attention weights to the values. Scaling, softmax, and dropout are not included, so as to avoid re-implementing them. One of the advantages of this two-stage structure over manual implementations is that, as in implementations of convolutions, sliding windows are taken directly from the source tensor and not cached into an intermediary tensor, thus using significantly less memory. We refer readers to the $\mathcal{N}ATTEN$ documentation and NAT for further details.
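The two-stage structure can be illustrated with a minimal pure-Python sketch along a single axis. This is not $\mathcal{N}ATTEN$ code: the function names `neighborhood`, `qk`, `av`, and `softmax` are hypothetical, and the rule that shifts the window inward at the edges is an assumption made for illustration. The point it shows is that each query only ever produces `kernel_size` scores, read directly from the key and value sequences, and that softmax is applied outside the two stages.

```python
import math

def neighborhood(i, length, kernel_size):
    """Indices attended to by position i (assumed: window shifted inward at edges)."""
    start = min(max(i - kernel_size // 2, 0), length - kernel_size)
    return range(start, start + kernel_size)

def qk(queries, keys, kernel_size):
    """Stage 1 (QK): raw dot-product scores, kernel_size per query, unscaled."""
    n = len(queries)
    return [
        [sum(a * b for a, b in zip(queries[i], keys[j]))
         for j in neighborhood(i, n, kernel_size)]
        for i in range(n)
    ]

def softmax(row):
    """Applied between the two stages, outside of QK and AV."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [x / total for x in exps]

def av(weights, values, kernel_size):
    """Stage 2 (AV): weighted sum of each position's neighboring values."""
    n, dim = len(values), len(values[0])
    out = []
    for i in range(n):
        row = [0.0] * dim
        for w, j in zip(weights[i], neighborhood(i, n, kernel_size)):
            for c in range(dim):
                row[c] += w * values[j][c]
        out.append(row)
    return out

# Usage: identical scores give uniform weights, so each output is a local mean.
queries = [[0.0] for _ in range(5)]
keys = [[1.0] for _ in range(5)]
values = [[0.0], [1.0], [2.0], [3.0], [4.0]]
scores = qk(queries, keys, kernel_size=3)
out = av([softmax(r) for r in scores], values, kernel_size=3)
```

Note that the score tensor has shape (positions, kernel size) rather than (positions, positions), which is where the memory saving over unrestricted self attention comes from.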

Adding dilation to $\mathcal{N}ATTEN$’s naive kernels is mostly simple: instead of incrementing neighbors across each axis by $1$, we simply instruct the kernels to increment by a variable $d$. NA, however, has a special way of handling edge/corner pixels, which requires additional changes to support dilation. The greater challenge in adding dilation to $\mathcal{N}ATTEN$ was adding it to the “tiled” kernels that utilize shared memory. Tiled NA kernels are a more recent addition to $\mathcal{N}ATTEN$, and boost NA’s throughput significantly. Tiled implementations of matrix multiplication and convolutions are essential in parallelizing these operations efficiently, while minimizing DRAM accesses. As the name suggests, tiled implementations divide the operation into tiles and cache tiles of inputs from the global memory into the shared memory within each threadblock. Accessing values from shared memory is typically much faster than directly accessing global memory, but also comes with challenges such as bank conflicts. Tiled implementations also operate with the assumption that access patterns are not broken. Introducing dilation breaks those access patterns and requires a re-implementation that ensures dilated neighbors, rather than local neighbors, are cached. We present a layer-wise relative speed and memory usage comparison between NAT and DiNAT with respect to Swin in Fig. I.
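To make the indexing change concrete, the following Python sketch lists the neighbors of one position along a single axis, stepping by $d$ instead of $1$. It is an illustration rather than the kernel logic itself: the function name `dilated_neighbors` is hypothetical, and the edge rule used here (treat positions sharing the same remainder modulo $d$ as one sequence and shift the window inward within it) is an assumption about how edge/corner handling carries over to dilation.

```python
def dilated_neighbors(i, length, kernel_size, dilation):
    """Neighbor indices of position i along one axis, with a stride of `dilation`.

    Assumes kernel_size * dilation <= length, so every window fits on the axis.
    """
    if kernel_size * dilation > length:
        raise ValueError("kernel_size * dilation must not exceed length")
    residue = i % dilation
    # Number of positions on the axis that share this residue.
    group_len = (length - residue + dilation - 1) // dilation
    # Position of i inside its residue group.
    pos = i // dilation
    # Centre the window on pos, shifting it inward near the edges (assumed rule).
    start = min(max(pos - kernel_size // 2, 0), group_len - kernel_size)
    return [residue + (start + k) * dilation for k in range(kernel_size)]

# Usage: dilation 1 gives contiguous neighbors, dilation 2 skips every other one.
local = dilated_neighbors(4, length=8, kernel_size=3, dilation=1)
sparse = dilated_neighbors(4, length=8, kernel_size=3, dilation=2)
```

With dilation, consecutive queries no longer share a contiguous block of keys, which is the access-pattern change that complicates caching tiles in shared memory.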

Scaling and brain float support.

In order to train our larger models and avoid overflowing activation values in later layers of the model, we had to switch from automatic mixed-precision training with the default half precision data type, float16, which has 5 exponent bits and 10 mantissa bits, to bfloat16, which has 8 exponent bits but only 7 mantissa bits. bfloat16 is often recommended for cases that lead to large activations, which includes ours as we scale our model. However, switching to bfloat16 required a re-implementation of $\mathcal{N}ATTEN$’s half precision kernels to support and utilize bfloat16 correctly.
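The trade-off between the two formats follows directly from the bit counts above. The short Python sketch below derives the largest finite value and the relative step size of each format from its exponent and mantissa widths, assuming the usual IEEE-style layout (biased exponent, implicit leading one, all-ones exponent reserved for infinities and NaNs). The helper names `max_finite` and `relative_step` are ours, introduced only for illustration.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary floating point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent is reserved, so the largest unbiased exponent is the bias.
    return (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** bias

def relative_step(mantissa_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -mantissa_bits

FLOAT16_MAX = max_finite(exponent_bits=5, mantissa_bits=10)
BFLOAT16_MAX = max_finite(exponent_bits=8, mantissa_bits=7)
FLOAT16_STEP = relative_step(10)
BFLOAT16_STEP = relative_step(7)

print(f"float16 : max {FLOAT16_MAX:.6g}, step {FLOAT16_STEP:.6g}")
print(f"bfloat16: max {BFLOAT16_MAX:.6g}, step {BFLOAT16_STEP:.6g}")
```

Under these assumptions float16 saturates at 65504, while bfloat16 covers roughly the same range as single precision at the price of a coarser step between neighboring values.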

Appendix B Training settings

We provide additional details on training DiNAT in Tab. I. We also provide details on DiNATs, which utilizes non-overlapping patch embedding and downsampling, similar to Swin and ConvNeXt . DiNATs serves as an alternative DiNA-based model, which has an architecture identical to Swin. DiNATs can also serve as an ablation model, since it is identical to Swin in architecture, with WSA replaced with NA, and SWSA replaced with DiNA.

One of the most important architecture-related hyperparameters in DiNA-based models is dilation values. Both DiNAT and DiNATs use a combination of NA and DiNA layers. We typically set dilation values in DiNA layers to be the maximum possible value with respect to input resolutions, if known. For example, ImageNet classification at 224×224 is downsampled to a quarter of the original size initially, therefore Level 1 layers take feature maps of resolution 56×56 as input. With a kernel size of 7×7, the maximum possible dilation value is $\lfloor 56/7\rfloor=8$ . Level 2 will take feature maps of resolution 28×28 as input, leading to a maximum possible dilation value of $4$ . Because of this, we change dilation values depending on the task and resolution. We present the final dilation values we used in classification, detection, and segmentation in Tab. II. Note that we only change dilation values for DiNA layers, since we found that fine-tuning NA layers to DiNA layers may result in a slight decrease in initial performance (see Sec. 4.4, Tab. 12).
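The arithmetic in the example above can be written as a small Python helper. This is a sketch under stated assumptions: the name `max_dilations` is hypothetical, and the defaults (four levels, an initial downsampling factor of 4, and a halving of resolution between levels) describe a typical hierarchical layout rather than a confirmed configuration. The values it returns are upper bounds from the $\lfloor \text{resolution}/\text{kernel size}\rfloor$ rule, not necessarily the settings listed in Tab. II.

```python
def max_dilations(image_size, kernel_size, initial_stride=4, num_levels=4):
    """Upper bound on the dilation value at each level of a hierarchical model.

    Assumes square inputs and a 2x downsampling between consecutive levels.
    """
    dilations = []
    resolution = image_size // initial_stride
    for _ in range(num_levels):
        # floor(resolution / kernel_size), never below 1 (dilation 1 is plain NA).
        dilations.append(max(resolution // kernel_size, 1))
        resolution //= 2
    return dilations

# Usage: ImageNet classification at 224x224 with a 7x7 kernel.
imagenet_224 = max_dilations(image_size=224, kernel_size=7)
print(imagenet_224)
```

For the 224×224 case the first two entries reproduce the values of 8 and 4 derived above for Levels 1 and 2; the remaining entries follow from the same rule under the halving assumption.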

Appendix C Experiments with alternative architecture

We conducted all primary experiments with both our main model, DiNAT, as well as DiNATs. We found that DiNATs could serve as alternatives in certain cases, as they still provide noticeable improvements over Swin in terms of speed, accuracy, and memory usage. Classification results are provided in Tab. III, object detection and instance segmentation results are provided in Tab. IV, and semantic segmentation results are provided in Tab. VI.

In Sec. 4.4 we experimented with architecture-related hyperparameters that are introduced by DiNA: dilation values, and the ordering of NA and DiNA layers. We also complete those dilation experiments by adding DiNATs and Swin, and present the results in Tabs. V and VII.