EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers

Junting Pan, Adrian Bulat, Fuwen Tan, Xiatian Zhu, Lukasz Dudziak, Hongsheng Li, Georgios Tzimiropoulos, Brais Martinez

Introduction

Vision transformers (ViTs) have rapidly superseded convolutional neural networks (CNNs) on a variety of visual recognition tasks, particularly when the priors and successful designs of previous CNNs are reintroduced to exploit the inductive biases of visual data, such as local grid structures. Due to the quadratic complexity of ViTs and the high dimensionality of visual data, computational cost must be taken into account in the design. Three representative designs for making ViTs computationally viable are (1) a hierarchical architecture in which the spatial resolution (i.e., the token sequence length) is progressively down-sampled across the stages, (2) locally-grouped self-attention mechanisms that control the length of input token sequences and share parameters, and (3) pooling attention schemes that subsample the keys and values by a factor. The general trend has been to design more complicated and stronger ViTs that challenge the dominance of top-performing CNNs in computer vision by achieving ever higher accuracies. These advances, however, are still insufficient to satisfy the design requirements and constraints of mobile and edge platforms (e.g., smart phones, robotics, self-driving cars, AR/VR devices), where vision tasks need to be carried out in a timely manner under a given computational budget. Prior efficient CNNs (e.g., MobileNets, ShuffleNets, EfficientNets, etc.) remain the state-of-the-art network architectures for such platforms in the tradeoff between running latency and recognition accuracy (Fig. 1).

In this work, we focus on the development of largely under-studied efficient ViTs, with the aim of surpassing their CNN counterparts on mobile devices. We consider a set of practical design requirements for running a ViT model on a real-world target platform: (1) Inference efficiency needs to be high (e.g., low latency and energy consumption) so that the running cost is broadly affordable and more on-device applications can be supported. This is the direct metric that matters in practice. In contrast, the often-used efficiency metric, FLOPs (i.e., the number of multiply-adds), does not directly translate into latency and energy consumption on a specific device, because these depend on several further factors, including memory access cost, degree of parallelism, and the platform's characteristics. That is, not all operations of a model can be carried out at the same speed and energy cost on a device. Hence, FLOPs is merely an approximate and indirect metric of efficiency. (2) Model size (i.e., the number of parameters) must be affordable for average modern devices. Given the availability of ever cheaper and larger storage, this constraint has been relaxed significantly. For example, an average smart phone often comes with 32GB or more of storage. As a consequence, using it as a threshold metric is no longer valid in most cases. (3) Implementation friendliness is also critical in real-world applications. For a wider range of deployment, a model must be efficiently implementable using the standard operations supported and optimized in generic deep learning frameworks (e.g., ONNX, TensorRT, and TorchScript), without costly per-framework specialization. Otherwise, the on-device speed of a model might be unsatisfactory even with low FLOPs. For instance, the cyclic shift and its reverse operation introduced in Swin Transformers are rarely supported by mainstream frameworks, which makes them deployment-unfriendly.

In the literature, the very recent MobileViTs are the only series of ViTs designed for mobile devices. Architecturally, they are a straightforward combination of MobileNetv2 and ViTs. As an initial attempt in this direction, MobileViTs still lag behind their CNN counterparts. Further, their evaluation protocol takes the model size (i.e., the number of parameters) as the competitor selection criterion (i.e., comparing the accuracy of models only with similar parameter counts), which, as discussed above, is no longer a hard constraint with modern hardware and is hence out of date.

We present a family of light-weight attention based vision models, dubbed as EdgeViTs, for the first time, enabling ViTs to compete with the best light-weight CNNs (e.g., MobileNetv2 and EfficientNets ) in terms of accuracy-efficiency tradeoff on mobile devices. This sets a milestone in the landscape of light-weight ViTs vs. CNNs in the low resource regime. Our EdgeViTs are based on a novel factorization of the standard self-attention for more cost-effective information exchange within every individual layer. This is made possible by introducing a highly light-weight and easy-to-implement local-global-local (LGL) information exchange bottleneck characterized with three operations: (i) Local information aggregation from neighbor tokens (each corresponding to a specific patch) using efficient depth-wise convolutions; (ii) Forming a sparse set of evenly distributed delegate tokens for long-range information exchange by self-attention; (iii) Diffusing updated information from delegate tokens to the non-delegate tokens in local neighborhoods via transposed convolutions. As we show in experiments, this design presents a favorable hybrid of self-attention, convolutions, and transposed convolutions, achieving the best accuracy-efficiency tradeoff. It is efficient in that the self-attention is applied to a sparse set of delegate tokens. To support a variety of computational budgets, with our primitive module we establish a family of EdgeViT variants with three computational complexities: small (S), extra-small (XS), extra-extra-small (XXS).

We make the following contributions: (1) We investigate the design of light-weight ViTs from the practical on-device deployment and execution perspective. (2) For best scalability and deployment, we present a novel family of efficient ViTs, termed EdgeViTs, designed based on an optimal decomposition of self-attention using standard primitive operations. (3) For relevance to real-world deployment, we directly measure the on-device latency and energy consumption of different models rather than relying on high-level proxies such as the number of FLOPs or parameters. Our results experimentally verify the efficiency of our models in a practical setting and refute some of the claims made in the existing literature. More specifically, extensive experiments on three visual tasks show that our EdgeViTs can match or surpass state-of-the-art light-weight CNNs, whilst consistently outperforming the recent MobileViTs in accuracy-efficiency tradeoff, including the largely ignored on-device energy evaluation. Importantly, EdgeViTs are consistently Pareto-optimal in terms of both latency and energy efficiency, achieving strict dominance over other ViTs in almost all cases and competing with the most efficient CNNs. On ImageNet classification, our EdgeViT-XXS outperforms MobileNetv2 by 2.2% at a similar energy-aware efficiency.

Related Work

Since the advent of modern CNN architectures , there has been a steady stream of works focusing on efficient architecture design for on-device deployment. The first widely adopted families bring depthwise separable convolutions in a ResNet-like structure, e.g., MobileNets , ShuffleNets . These works define a space of well-performing efficient architectures, resulting in widespread usage. Successive works further exploit this design space by automating the architectural design choices . As a parallel line of research, net pruning creates efficient architectures by removing spurious parts of a larger network with close-to-zero weights , or via first training a super-network that is further slimmed to meet a pre-specified computational budget . Dynamic computing has also been explored, consisting of the mechanisms that condition the network parameters on the input data . Finally, using low bit-width is a very critical technique that can offer different tradeoffs between the accuracy and efficiency .

ViTs quickly popularized transformer-based architectures for computer vision. A series of works followed, offering large improvements over the original ViTs in terms of data efficiency and architecture design. Among these works, one of the main modifications is to introduce hierarchical, multi-stage designs borrowed from convolutional architectures. Several works also focus on improving the positional encoding by using a relative positional embedding, making it learnable, or even replacing it with an attention bias element. All these approaches mostly aim to improve model performance.

Recently, more efforts have been made towards finding efficient alternatives to the multi-head self-attention (MHSA) module, which is typically the computational bottleneck in the ViT architectures. A particularly effective solution is to reduce the internal spatial dimensions within the MHSA. The MHSA involves projecting the input tensor into key, query and value tensors. Several recent works, e.g. , find that the key and value tensors could be downsampled with a limited loss in accuracy, leading to a better efficiency-accuracy tradeoff. Our work extends this idea by also downsampling the query tensors, which further improves the efficiency, as shown in Fig. 2.

There are also alternative approaches that reduce the number of tokens dynamically. That is, in the forward pass, tokens deemed not to contain important information for the target task are pruned or pooled together, reducing the overall complexity thereafter. Finally, encouraged by their potential complementarity, many works have attempted to combine convolutional designs with self-attention. This ranges from using convolutions at the stem, to integrating convolutional operations into the MHSA block, to incorporating the MHSA block into ResNet-like architectures. It is interesting to note that even the original ViTs explored similar tradeoffs.

Whilst efficiency has been taken into account in designing the ViT variants discussed above, they are still not dedicated, satisfactory architectures for on-device applications. The only exception is MobileViTs, which were introduced very recently. However, compared to the current best light-weight CNNs such as MobileNets and EfficientNets, these ViTs are still clearly inferior in terms of the on-device accuracy-efficiency tradeoff. In this work, we present the first family of efficient ViTs that can deliver comparable or even superior tradeoffs in comparison to the best CNNs and ViTs. We also carry out extensive on-device evaluations, including energy consumption analysis, which are critical yet largely lacking in prior work.

EdgeViTs

For designing light-weight ViTs suitable for mobile/edge devices, we adopt a hierarchical pyramid network structure (Fig. 2(a)) used in recent ViT variants . A pyramid transformer model typically reduces the spatial resolution but expands the channel dimension across different stages. Each stage consists of multiple transformer-based blocks processing tensors of the same shape, mimicking the ResNet-like networks. The transformer-based blocks heavily rely on the self-attention operations at a quadratic complexity w.r.t the spatial resolution of the visual features. By progressively aggregating the spatial tokens, pyramid vision transformers are potentially more efficient than isotropic models . In this work, we dive deeper into the transformer-based block and introduce a cost-effective bottleneck, Local-Global-Local (LGL) (Fig. 2(b)). LGL further reduces the overhead of self-attention with a sparse attention module (Fig. 2(c)), achieving better accuracy-latency balancing.

Local-Global-Local bottleneck

Self-attention has been shown to be very effective for learning the global context or long-range spatial dependency of an image, which is critical for visual recognition. On the other hand, as images have high spatial redundancy (e.g., nearby patches are semantically similar) , applying attention to all the tokens, even in a down-sampled feature map, is inefficient. There is hence an opportunity to reduce the scope of tokens whilst still preserving the underlying information flows that model the global and local contexts. In contrast to previous transformer blocks that perform self-attention at each spatial location, our LGL bottleneck only computes self-attention for a subset of the tokens but enables full spatial interactions, as in the standard multi-head self-attention (MHSA) .

To achieve this, we decompose the self-attention into consecutive modules that process the spatial tokens within different ranges (Fig. 2(b)). We introduce three efficient operations: i) Local aggregation, which integrates signals only from locally proximate tokens; ii) Global sparse attention, which models long-range relations among a set of delegate tokens, each treated as a representative of a local window; iii) Local propagation, which diffuses the global contextual information learned by the delegates to the non-delegate tokens within the same window. Combining these, our LGL bottleneck enables information exchange between any pair of tokens in the same feature map at a low compute cost. Each of these components is described in detail below:

Local aggregation: for each token, we leverage depth-wise and point-wise convolutions to aggregate information in local windows with a size of $k\times k$ (Fig. 3(a)).

Global sparse attention: we sample a sparse set of delegate tokens distributed evenly across the space, one token for each $r\times r$ window. Here, $r$ denotes the sub-sample rate. We then apply self-attention on these selected tokens only (Fig. 3(b)). This is distinct from all the existing ViTs where all the spatial tokens are involved as queries in the self-attention computation.

Local propagation: We propagate the global contextual information encoded in the delegate tokens to their neighbor tokens by transposed convolutions (Fig. 3(c)).
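As a rough illustration of the delegate mechanism described above, the following pure-Python sketch maps every token on an $h\times w$ grid to the delegate of its $r\times r$ window and compares the number of attended token pairs with and without sub-sampling. The helper names (`delegate_index`, `delegate_grid`) are hypothetical, and the choice of the window centre as the delegate position (for even $r$, the lower-right of the four central tokens) is an assumption made only for this example.

```python
def delegate_index(i, j, r):
    """Return the (row, col) of the delegate that represents token (i, j).

    Assumption: the delegate is the centre token of each r x r window.
    """
    return (i // r) * r + r // 2, (j // r) * r + r // 2


def delegate_grid(h, w, r):
    """List all delegate positions for an h x w token map (h, w divisible by r)."""
    return [(i + r // 2, j + r // 2) for i in range(0, h, r) for j in range(0, w, r)]


h, w, r = 8, 8, 4
delegates = delegate_grid(h, w, r)

# Self-attention cost grows with the square of the number of attended tokens.
full_pairs = (h * w) ** 2
sparse_pairs = len(delegates) ** 2

print("delegates:", delegates)
print("pair reduction factor:", full_pairs // sparse_pairs)
```

In this toy setting, attention among delegates touches $r^{4}$ times fewer token pairs than attention among all tokens, and local propagation would later send each delegate's output back to the tokens that map to it via `delegate_index`.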

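Formally, our LGL bottleneck can be formulated as:

$$\begin{aligned} X &= \mathrm{LocalAgg}(\mathrm{Norm}(X_{in})) + X_{in}, \\ Y &= \mathrm{FFN}(\mathrm{Norm}(X)) + X, \\ Z &= \mathrm{LocalProp}(\mathrm{GlobalSparseAttn}(\mathrm{Norm}(Y))) + Y, \\ X_{out} &= \mathrm{FFN}(\mathrm{Norm}(Z)) + Z. \end{aligned}$$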

Here $X_{in}\in\mathcal{R}^{H\times W\times C}$ indicates the input tensor. Norm is the layer normalization operation. LocalAgg represents the local aggregation operator. FFN is a two-layer perceptron, similar to the position-wise feed-forward network introduced in . GlobalSparseAttn is the global sparse self-attention. LocalProp is the local propagation operator. For simplicity, positional encoding is omitted. Note that all these operators can be implemented with commonly used and highly optimized operations in standard deep learning platforms. Hence, our LGL bottleneck is implementation friendly.

Our LGL bottleneck shares a similar goal with the recent PVT and Twins-SVT models, which attempt to reduce the self-attention overhead. However, they differ in the core design. PVTs perform self-attention in which the number of keys and values is reduced by strided convolutions, whilst the number of queries remains the same. In other words, PVTs still perform self-attention at each grid location. In this work, we question the necessity of position-wise self-attention and explore to what extent the information exchange enabled by our LGL bottleneck can approximate the standard MHSA (see Section 4 for more details). Twins-SVTs combine local-window self-attention with the global pooled attention from PVTs. This is different from our hybrid design, which uses both self-attention and convolution operations distributed in a series of local-global-local operations. As demonstrated in the experiments (Tables 2 and 3), our design achieves a better tradeoff between model performance and computation overhead (e.g., latency, energy consumption, etc.).

Architectures

We build a family of EdgeViTs with the proposed LGL bottleneck at different computational complexities (i.e., 0.5G, 1G, and 2G FLOPs). The configurations are summarized in Table 1. Following the hierarchical ViTs, EdgeViTs consist of four stages with the spatial resolution (i.e., the token sequence length) gradually reduced throughout, and with the self-attention module replaced by our LGL bottleneck. For the stage-wise down-sampling, we use a conv layer with a kernel size of $2\times 2$ and stride 2, except for the first stage, where we down-sample the input feature by $\times 4$ using a $4\times 4$ kernel and a stride of 4. We adopt the conditional positional encoding, which has been shown to be superior to the absolute positional encoding. This can be implemented using 2D depth-wise convolutions with a residual connection. In our model, we use $3\times 3$ depth-wise convolutions with zero padding, placed before the local aggregation and the global sparse self-attention. The FFN consists of two linear layers with a GELU non-linearity in between. Our local aggregation operator is implemented as a stack of point-wise and depth-wise convolutions. The global sparse attention is composed of a spatially uniform sampler with sample rates of $(4,2,2,1)$ for the four stages, and a standard MHSA. The local propagation is implemented with a depth-wise separable transposed convolution with the kernel size and stride equal to the sample rate used in the global sparse attention. The exact architecture of the LGL bottleneck is described in the supplementary material.
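To make the stage-wise bookkeeping concrete, the short sketch below traces the token grid through the four stages for a $224\times 224$ input, using the down-sampling strides (4, 2, 2, 2) and sample rates (4, 2, 2, 1) stated above. The function name `stage_shapes` and the dictionary layout are hypothetical; channel widths are left out because they depend on the variant.

```python
def stage_shapes(image_size=224, strides=(4, 2, 2, 2), sample_rates=(4, 2, 2, 1)):
    """Trace the token grid and the delegate grid through the four stages."""
    shapes = []
    side = image_size
    for stride, rate in zip(strides, sample_rates):
        side //= stride                      # stage-wise down-sampling
        delegate_side = side // rate         # one delegate per rate x rate window
        shapes.append({
            "tokens_per_side": side,
            "tokens": side * side,
            "delegates_per_side": delegate_side,
            "delegates": delegate_side * delegate_side,
        })
    return shapes


for stage, info in enumerate(stage_shapes(), start=1):
    print(f"stage {stage}: {info}")
```

Under these settings the self-attention in every stage operates on at most a $14\times 14$ delegate grid, even though the first stage holds a $56\times 56$ token map.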

Experiments

We benchmark EdgeViTs on visual recognition tasks. We pre-train EdgeViTs on the ImageNet-1K recognition task, comparing their performance and computation overhead against alternative approaches. We also evaluate the generalization capacity of EdgeViTs on downstream dense prediction tasks: object detection and instance segmentation on the COCO benchmark, and semantic segmentation on the ADE20K Scene Parsing benchmark. For on-device execution, we report the execution time (latency) and energy consumption of all relevant models on ImageNet. We do not report on-device measurements on downstream tasks as they reuse ImageNet models.

ImageNet-1K provides 1.28 million training images and 50,000 validation images from 1000 categories. We follow the training recipe introduced in DeiT. We optimize the models using AdamW with a batch size of 1024, a weight decay of $5\times 10^{-2}$, and a momentum of 0.9. The models are trained from scratch for 300 epochs with a linear warm-up during the first 5 epochs. Our base learning rate is set to $1\times 10^{-3}$ and decays after the warm-up following a cosine schedule. We apply the same data augmentations as in , which include random cropping, random horizontal flipping, mixup, random erasing and label smoothing. During training, the images are randomly cropped to $224\times 224$. During testing, we use a single center crop of $224\times 224$. We report the top-1 accuracy on the validation set.
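The learning-rate schedule described above (linear warm-up for 5 epochs, then cosine decay over 300 epochs from a base rate of $1\times 10^{-3}$) can be sketched as follows. This is only an illustration: stepping once per epoch, starting the warm-up at `base_lr / warmup_epochs`, and decaying to a final rate of zero are assumptions of this example, not details given in the text.

```python
import math


def learning_rate(epoch, base_lr=1e-3, warmup_epochs=5, total_epochs=300, min_lr=0.0):
    """Learning rate at the start of a given (0-indexed) epoch."""
    if epoch < warmup_epochs:
        # Linear warm-up (assumed to ramp from base_lr / warmup_epochs to base_lr).
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 4, 5, 150, 299):
    print(epoch, learning_rate(epoch))
```

Training frameworks typically update the rate per iteration rather than per epoch; the per-epoch form is used here only to keep the sketch short.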

For latency measurements, we use a Samsung Galaxy S21 mobile phone equipped with a Snapdragon 888 chipset. All relevant models are benchmarked by running a forward pass 50 times using TorchScript lite interpreter via the Android benchmarking app provided by PyTorch . We use CPU implementation, full precision and batch 1 to execute all operations. This choice comes from the fact that this is the only combination that was able to robustly execute all of the models from our paper. In general, more efficient implementations exist, such as those utilizing specialized hardware like Neural Processing Units (NPUs). However, these put more restrictions on what can be executed and many models failed to run in our experiments when trying to use different hardware targets.

For energy measurements, we use a Monsoon High Voltage Power Monitor connected to a Snapdragon 888 Hardware Development Kit (HDK8350) to obtain accurate power readings over the course of running a forward pass of each test model 50 times. The same TorchScript runtime is used as in latency measurements. From the power signal reported by the monitor, we derive the average per-inference power and energy consumption by first subtracting background power consumption (i.e., power readings when not running any model) and then identifying 50 continuous regions of significantly higher power draw. Each region like that is considered a single inference and we calculate its total energy as the integral over the individual power samples. Analogously we also calculate average power consumption by averaging over the same set of samples. After energy and power are calculated for each inference, the final statistics of a model are obtained by again averaging over the 50 identified runs. Our methodology follows what can be found in the literature .
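The post-processing of the power trace described above can be sketched in a few lines. The function name `per_inference_energy`, the fixed sampling interval, the threshold used to detect regions of higher power draw, and the sample values are all hypothetical and chosen only to illustrate the steps: subtract the background power, find contiguous high-power regions, and integrate each one.

```python
def per_inference_energy(power_mw, dt_s, background_mw, threshold_mw):
    """Return the energy (in mJ) of each contiguous high-power region.

    power_mw:      raw power samples in milliwatts, taken every dt_s seconds
    background_mw: idle power subtracted from every sample
    threshold_mw:  net power above which a sample is attributed to an inference
                   (an assumed detection rule for this sketch)
    """
    energies = []
    current = None
    for sample in power_mw:
        net = sample - background_mw
        if net > threshold_mw:
            if current is None:
                current = 0.0
            current += net * dt_s        # rectangle rule: mW * s = mJ
        elif current is not None:
            energies.append(current)     # region ended, store its energy
            current = None
    if current is not None:
        energies.append(current)
    return energies


# Made-up trace containing two inferences, sampled every millisecond.
trace = [100, 100, 600, 600, 100, 700, 700, 700, 100]
energies = per_inference_energy(trace, dt_s=0.001, background_mw=100.0, threshold_mw=50.0)
mean_energy = sum(energies) / len(energies)
print(energies, mean_energy)
```

With a real trace one would expect 50 detected regions, one per forward pass, and the reported statistic would be the mean over those regions.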

We compare EdgeViTs to a variety of baseline models, including classic efficient CNNs, e.g., MobileNetV2, MobileNetV3, and EfficientNet, and state-of-the-art ViTs, e.g., MobileViT, PVT-v2, DeiT, and LeViT. As the original LeViT was optimized in a large-scale setting (i.e., 1000 epochs) with knowledge distillation, we perform the comparison by re-training LeViT under the same setting as EdgeViT (300 epochs) without knowledge distillation. We denote the retrained LeViT as LeViT-384 $\dagger$. We select baselines with a complexity of less than 2 GFLOPs because i) in real-world applications, the computational cost remains the top concern; ii) whilst FLOPs is an indirect metric for latency, it is the most used cost metric in prior works. This selection criterion differs from the prior protocol that instead uses the model size (i.e., the number of parameters), which has become a less restrictive factor on mobile devices.

From Table 2, we can observe: i) EdgeViTs significantly outperform other light-weight ViTs at a similar level of GFLOPs. Compared to the PVT-v2 family, our EdgeViT-XXS/EdgeViT-S achieve $3.9\%$ / $2.3\%$ improvements over PVT-v2-B0/PVT-v2-B1. Compared to MobileViTs, EdgeViTs achieve $5.4\%$, $2.8\%$ and $2.7\%$ gains in the three complexity settings. ii) ViTs vs. CNNs: Our EdgeViTs lift the performance of efficient ViTs to approach the level of well-established efficient CNNs. For example, EdgeViT-XXS outperforms MobileNet-v2 and MobileNet-v3-0.75 at a similar model size, but requires more GFLOPs. However, we observe that efficient CNNs still surpass efficient ViTs in the accuracy-FLOPs tradeoff by a small margin.

On the other hand, as discussed earlier, the numbers of FLOPs or parameters are merely indicative and do not fully reflect on-device efficiency. We therefore further consider on-device latency and energy consumption directly. Besides the representative ViTs and CNNs, we also compare two recent ViT variants with the number of channels and layers re-scaled to fit the target complexity. As presented in Table 2, EdgeViTs demonstrate strong performance with latencies comparable to MobileNets: EdgeViT-XXS achieves a gain of $2.4\%$ over MobileNet-V2 while running slightly faster. EdgeViT-XXS also surpasses MobileNet-V3 by $1.1\%$, but at the cost of being 9.8ms slower. EdgeViT-XS performs on par with the auto-searched EfficientNet-B0 model. We believe our models could also benefit from the automatic architecture search techniques used in MobileNet-V3 and EfficientNets. Our models yield clear advantages over alternative ViT models. Compared to MobileViTs in the three GFLOPs settings, EdgeViTs excel by 5.4%, 2.8%, and 2.7% while being $2\times$, $2.7\times$, and $2.6\times$ faster.

Energy results are presented in Table 3. In addition to the raw energy and power numbers, for ease of comparison we define an energy-aware efficiency metric as the average gain in top-1 accuracy (in percentages) from each consumed 1mJ of energy. We observe that less accurate models tend to be more efficient. This is not a surprise, in that improvements in accuracy scale sublinearly with model complexity. However, this also means that comparing the efficiency of models with very different top-1 scores might be severely biased by the sheer difficulty of achieving certain accuracy levels, which is independent of the model. Therefore, we limit our comparison to identifying Pareto-optimal models, i.e., those that no other model improves upon in either accuracy or energy efficiency without degrading the other metric. We can see that our EdgeViT family dominates almost all other ViTs, with the only exception being LeViT-384 $\dagger$, whose accuracy and efficiency fall between those of our EdgeViT-S and EdgeViT-XS. When compared to CNNs, our EdgeViTs compete with MobileNet-v3 and EfficientNet-B0, which are more efficient but also less accurate. MobileNet-v2 achieves decent results but is dominated by its newer version, MobileNet-v3. PVT-v2-B0, although high on the efficiency side, is rather inaccurate and hence loses out to highly efficient CNNs. At the far end of the spectrum are the latest MobileViT models, which turn out to be neither efficient nor accurate when compared to the rest. Unlike them, our EdgeViT models, although not as efficient as the best CNNs in the absolute sense, exhibit a favourable trade-off between efficiency and accuracy by being highly accurate while not sacrificing efficiency.
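The efficiency metric and the Pareto-optimality criterion used above can be illustrated with a small sketch. The model names and numbers below are placeholders invented for this example and are not taken from Table 3; the metric is read here as top-1 accuracy (in percent) divided by the per-inference energy in mJ, which is one plausible interpretation of the definition in the text.

```python
def energy_efficiency(top1_percent, energy_mj):
    """Top-1 accuracy (in percent) obtained per millijoule of energy."""
    return top1_percent / energy_mj


def pareto_optimal(models):
    """Return the names of models not dominated by any other model.

    models maps a name to (accuracy, efficiency); higher is better for both.
    A model is dominated if another is at least as good on both metrics
    and strictly better on at least one.
    """
    front = []
    for name, (acc, eff) in models.items():
        dominated = any(
            a >= acc and e >= eff and (a > acc or e > eff)
            for other, (a, e) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder measurements: (top-1 accuracy in %, energy per inference in mJ).
measurements = {
    "model_a": (70.0, 200.0),
    "model_b": (75.0, 300.0),
    "model_c": (72.0, 400.0),
    "model_d": (68.0, 250.0),
}
models = {name: (acc, energy_efficiency(acc, mj)) for name, (acc, mj) in measurements.items()}
front = pareto_optimal(models)
print(front)
```

In this toy example the most efficient model and the most accurate model both sit on the Pareto front, which mirrors why the comparison is restricted to the front rather than to a single efficiency ranking.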

We conduct detailed ablations to validate our design choices in the LGL bottleneck. We use EdgeViT-XXS as the base model and re-scale the alternative designs to $\sim$ 0.5GFLOPs for fair comparison.

Local aggregation. We compare our local aggregation (LA) operation to the Locally-grouped Self-Attention (LSA) used in . Table 4a shows that applying LA consistently improves the performance, and that our convolutional LA module performs better than the self-attention based operator (LSA). This validates our choice of using depth-wise convolutions in LA for local context learning.

Global sparse attention. We explore three options for delegate token sampling: max, avg, and center. All choices perform similarly in terms of accuracy, with our default design center being slightly faster.

Local propagation. We investigate two alternatives to the local propagation (LP) operator: i) w/o LP: we simply remove the local propagation. Note that EdgeViTs w/o LP have similar complexity to standard EdgeViTs. ii) Bilinear: we use bilinear interpolation, instead of the transposed convolution, to up-sample the delegate tokens. Table 4c shows that adding LP improves the top-1 accuracy by 0.5%, with only 0.4ms overhead.

Dense Prediction

Following , we also evaluate the proposed EdgeViTs on COCO Object Detection/Instance Segmentation and ADE20K Scene Parsing. Here, we use the EdgeViTs as the feature extractor for the main model and initialize it with the ImageNet-1K-pretrained weights obtained in our previous experiments.

We demonstrate the performance of our model in mainstream object detection and instance segmentation frameworks: RetinaNet for object detection, and Mask R-CNN with FPN for instance segmentation. Following the training protocol in , we resize the training images to have a shorter side of 800 pixels while keeping the longer side no larger than 1333 pixels. During testing, the images are re-scaled to have a shorter side of 800 pixels. The models are finetuned with a $1\times$ schedule (i.e., 12 epochs) by AdamW using an initial learning rate of $1\times 10^{-4}$ and a batch size of 16. We train the models on the COCO 2017 training set and report the mAP@100 score on the COCO 2017 validation set.
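The resizing rule for training images (shorter side 800, longer side capped at 1333) can be written as a small helper. The function name `resize_shape` and the rounding convention are assumptions of this sketch; detection toolkits differ slightly in how they round the resulting size.

```python
def resize_shape(height, width, short_side=800, max_long_side=1333):
    """Return the (height, width) after aspect-preserving resizing."""
    # First try to bring the shorter side to the target length.
    scale = short_side / min(height, width)
    # If that pushes the longer side past the cap, scale to the cap instead.
    if max(height, width) * scale > max_long_side:
        scale = max_long_side / max(height, width)
    return round(height * scale), round(width * scale)


print(resize_shape(480, 640))
print(resize_shape(400, 1000))
```

For elongated images the cap on the longer side takes over, so the shorter side ends up below 800 pixels.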

In Table 5, our EdgeViTs perform consistently better than other visual backbones on RetinaNet and Mask R-CNN. Our smallest variant EdgeViT-XXS, when used on RetinaNet , achieves $1.5$ higher AP than PVTv2-B0. When used on Mask R-CNN , EdgeViT-XXS also surpasses PVTv2-B0 by $1.7$ on the bounding box detection task (APb), and by $0.7$ on the mask segmentation task (APm). For EdgeViT-S, we observe even larger gains when comparing to PVTv2-B1: $+2.2$ on RetinaNet , $+3.0$ APb and $+1.2$ APm on Mask R-CNN.

ADE20K Scene Parsing

We incorporate the pretrained EdgeViT in the Semantic FPN segmentation model . As in , we create $512\times 512$ random crops of the images during training and resize the images to have a shorter side of 512 pixels during inference. The models are finetuned by AdamW using an initial learning rate of $1\times 10^{-4}$ and a batch size of 16. We train the models for 80K iterations on the ADE20K training set, and report the mean Intersection over Union (mIoU) score on the validation set.

In Table 6, we compare EdgeViTs to both CNN (ResNet-18) and ViT (PVT) backbones for FPN-based semantic segmentation. EdgeViTs achieve better performance than all counterparts at similar compute costs. In particular, EdgeViT-XXS outperforms PVTv2-B0 by 2.5% in mIoU, and EdgeViT-S surpasses PVTv2-B1 by a margin of 3.4%.

Conclusion

In this work, we investigate the design of efficient ViTs from the on-device deployment perspective. By introducing a novel decomposition of self-attention, we present a family of EdgeViTs that, for the first time, achieve comparable or even superior accuracy-efficiency tradeoff on generic visual recognition tasks, in comparison to a variety of state-of-the-art efficient CNNs and ViTs. We conduct extensive on-device experiments using practically critical and previously underestimated metrics (e.g., energy-aware efficiency) and reveal new insights and observations in the comparison of light-weight CNN and ViT models.

Acknowledgements. We thank Victor Escorcia, Yassine Ouali and Javier Fernandez for helpful discussions.

Appendix

We calculate the computational cost of the spatial context modeling involved in our proposed LGL bottleneck. We omit point-wise operations for simplicity, as the key difference lies in the spatial modeling part. Let us assume an input $X\in\mathcal{R}^{h\times w\times c}$, where $h$, $w$, and $c$ denote the height, the width, and the channel dimension, respectively. The cost of the local aggregation is $\mathcal{O}(k^{2}hwc)$, where $k^{2}$ is the local group size. By selecting one delegate out of every $r^{2}$ tokens, with $r$ the sub-sampling rate, the complexity of our global sparse attention is then $\mathcal{O}(\frac{h^{2}w^{2}}{r^{4}}c)$. Finally, the local propagation step has a cost of $\mathcal{O}(r^{2}hwc)$. Putting these together, the total cost of LGL is $\mathcal{O}(k^{2}hwc+\frac{h^{2}w^{2}}{r^{4}}c+r^{2}hwc)$. Comparing with the cost of a standard multi-head self-attention, $\mathcal{O}(h^{2}w^{2}c)$, we can see that our LGL significantly reduces the computation overhead when $k\ll h,w$ and $r>1$. In our experiments, for simplicity we set $k=3$, and $r$ to (4,2,2,1) for the four stages.
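To give a feel for the magnitudes involved, the sketch below evaluates the three cost terms and the MHSA cost for one example setting. The function names are hypothetical, the channel dimension $c=64$ is an arbitrary illustrative value, and because constant factors hidden by the $\mathcal{O}(\cdot)$ notation are dropped, the outputs are only meaningful relative to each other.

```python
def lgl_cost(h, w, c, k=3, r=4):
    """Sum of the three spatial-modeling terms of the LGL bottleneck."""
    local_agg = k * k * h * w * c                # O(k^2 hwc)
    sparse_attn = (h * w) ** 2 * c // r ** 4     # O(h^2 w^2 c / r^4)
    local_prop = r * r * h * w * c               # O(r^2 hwc)
    return local_agg + sparse_attn + local_prop


def mhsa_cost(h, w, c):
    """Spatial-modeling term of a standard multi-head self-attention."""
    return (h * w) ** 2 * c                      # O(h^2 w^2 c)


h, w, c = 56, 56, 64   # first-stage token grid for a 224 x 224 input; c is illustrative
print("LGL :", lgl_cost(h, w, c, k=3, r=4))
print("MHSA:", mhsa_cost(h, w, c))
print("ratio:", mhsa_cost(h, w, c) / lgl_cost(h, w, c, k=3, r=4))
```

With $r=1$ the attention term equals the MHSA term and the two local terms only add cost, which is consistent with the condition $r>1$ stated above.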

A.2 Implementation Details

All variants of EdgeViTs can be built upon these components according to the schematic overview (Fig. 2a of the main paper) and the model configuration parameters (Table 1a of the main paper). For the ablation studies in the main paper, we replace or remove one of these blocks, with the details given below.

(1) In Table 4a, for the case of w/o LA, we aim to test the importance of separate local and global context modeling. Thus we remove both LocalAgg and GlobalSparseAttn, and instead use the Spatial-Reduced Self-Attention (https://github.com/whai362/PVT/blob/v2/classification/pvt_v2.py#L54-L126) introduced in PVT, resulting in a single Self-Attention Block for both local and global context modeling. For the case of LA(LSA), we simply replace LocalAgg with the Local-grouped Self-Attention (https://github.com/Meituan-AutoML/Twins/blob/main/gvt.py#L32-L71) introduced in .

(2) In Table 4b, we replace the default sampler (Center) with Avg and Max functions, which can be implemented with AvgPool2d() and MaxPool2d() in PyTorch, respectively. Note that for both cases the kernel size is set to sample_rate.

(3) In Table 4c, in the case of w/o LP, we replace our GlobalSparseAttn with the Spatial-Reduced Self-Attention from PVT, but, different from w/o LA, we keep the LocalAgg. For the case of LP(Bilinear), the LocalProp is instantiated as a bilinear interpolation function (Upsample(mode='bilinear') in PyTorch).

Note, the number of layers for each of these variants is down-scaled to have 0.5GFLOPs for fair comparison.
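The three delegate samplers compared in Table 4b (Center, Avg, Max) can be mimicked on a toy grid in plain Python. The sketch uses scalar tokens instead of channel vectors, the function name `sample_delegates` is hypothetical, and the position taken as the window centre for even sample rates is an assumption of this example.

```python
def sample_delegates(grid, rate, mode="center"):
    """Pick one delegate value per rate x rate window of a 2D grid of scalars."""
    h, w = len(grid), len(grid[0])
    out = []
    for i in range(0, h, rate):
        row = []
        for j in range(0, w, rate):
            window = [grid[a][b] for a in range(i, i + rate) for b in range(j, j + rate)]
            if mode == "center":
                # Assumed centre position inside the window.
                row.append(grid[i + rate // 2][j + rate // 2])
            elif mode == "avg":
                row.append(sum(window) / len(window))   # analogous to average pooling
            elif mode == "max":
                row.append(max(window))                 # analogous to max pooling
            else:
                raise ValueError(f"unknown mode: {mode}")
        out.append(row)
    return out


grid = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
for mode in ("center", "avg", "max"):
    print(mode, sample_delegates(grid, rate=2, mode=mode))
```

The Center sampler only indexes existing tokens and performs no arithmetic over the window, which is one possible reason for it being slightly faster than the pooling variants.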

A.3 Accuracy-Speed Pareto-Optimal Models

To facilitate the interpretation of accuracy vs. speed, we identify Pareto-optimal models when comparing the trade-off between accuracy and latency. In our context, the accuracy-latency Pareto-optimal models are defined as those that no other model improves upon in either accuracy or latency without degrading the other metric. As shown in Fig. 1, our EdgeViTs are well comparable with the best efficient CNNs, whilst significantly dominating all prior ViT counterparts. Specifically, EdgeViTs are all Pareto-optimal in both trade-offs.

A.4 Efficiency in detection/segmentation

This work proposes a generic transformer-based network and demonstrates its efficacy when used as the backbone in detection/segmentation. We provide an evaluation by measuring only the inference time, energy, and efficiency of the backbones for detection/segmentation. As shown in Tab. 7 and 8, EdgeViTs demonstrate higher efficiency compared to the baselines.