HR-NAS: Searching Efficient High-Resolution Neural Architectures with Lightweight Transformers

Mingyu Ding, Xiaochen Lian, Linjie Yang, Peng Wang, Xiaojie Jin, Zhiwu Lu, Ping Luo

Introduction

Neural architecture search (NAS) has achieved remarkable success in automatically designing efficient models for image classification. NAS has also been applied to improve the efficiency of models for dense prediction tasks such as semantic segmentation and pose estimation. However, existing NAS methods for dense prediction either directly extend the search space designed for image classification, only search for a feature aggregation head, or organize network cells in a chain-like single-branch manner. This lack of consideration for the specific requirements of dense prediction hinders the performance of NAS methods compared to the best hand-crafted models.

In principle, dense prediction tasks require both the integrity of the global context and a high-resolution (HR) representation; the former is critical for disambiguating local features at each pixel, and the latter is useful for the accurate prediction of fine details, such as semantic boundaries and keypoint locations. However, these two aspects, especially HR representations, have not received enough attention in existing NAS algorithms for classification. The straightforward strategy to implement the principle is to manually combine multi-scale features at the end of the network, while recent approaches show that performance can be enhanced by putting multi-scale feature processing within the network backbone. Another observation from recent research is that multi-scale convolutional representations cannot guarantee a global view of the image, since dense prediction tasks often come with high input resolution while a network often covers only a fixed receptive field. Therefore, global attention strategies such as SENet and non-local networks have been proposed to enrich convolutional image features. Most recently, inspired by their success in natural language processing, Transformer architectures, which contain global attention with spatial encoding, have also shown superior results when combined with convolutional neural networks for image classification and object detection.

Motivated by the above observations, in this work we propose a NAS algorithm that incorporates these strategies, i.e., in-network multi-scale features and Transformers, and enables them to adapt to task objectives and resource constraints. In practice, it is non-trivial to put them together. First, the Transformer has a high computational cost that is quadratic w.r.t. the number of image pixels and is hence unfriendly to a NAS search space of efficient architectures. We solve this through a dynamic down-projection strategy, yielding a lightweight and plug-and-play Transformer architecture that can be combined with other convolutional neural architectures. In addition, searching a fused space of multi-scale convolutions and Transformers requires proper feature normalization, selection of fusion strategies, and balancing. We conducted extensive studies to calibrate various model choices that generalize to multiple tasks.

In summary, HR-NAS works as follows. We first set up a super network, where each layer contains a multi-branch parallel module followed by a fusion module. The parallel module contains searching blocks at multiple resolutions, and the fusion module contains searching blocks for feature fusion that determine how features from different resolutions are fused. Then, based on the computational budget and the task objective, a fine-grained progressive shrinking search strategy is introduced to prune redundant channels in convolutions and queries in Transformers, resulting in an efficient model that provides the best trade-off between performance and computational cost. In extensive experiments, HR-NAS achieves state-of-the-art results on multiple dense prediction tasks and competitive results on image classification under highly efficient settings with a single search. Fig. 1 shows a comprehensive comparison of our proposed approach with previous NAS approaches as well as manually designed networks on four different tasks.

Our main contributions are three-fold. (1) We introduce a novel lightweight and plug-and-play Transformer, which is highly efficient and can be easily combined with convolutional networks for computer vision tasks. (2) We propose a well-designed multi-resolution search space containing both convolutions and Transformers to model in-network multi-scale information and global contexts for dense prediction tasks. To the best of our knowledge, we are the first to integrate Transformers into a resource-constrained NAS search space for computer vision. (3) A resource-aware search strategy allows us to customize efficient architectures for different tasks. Extensive experiments show that models produced by our NAS algorithm achieve state-of-the-art results on three dense prediction tasks and four widely used benchmarks with lower computational costs.

Related Work

Transformers. Transformer, a model architecture relying on a self-attention mechanism to learn dependencies between input and target, is used primarily in natural language processing. Generative Pre-trained Transformer (GPT) uses language modeling as a pre-training task. BERT improves Transformer with a masked language model and a learned positional embedding to replace the sinusoidal positional encoding.

Since Transformer is suitable for capturing global information and pairwise interactions, some attempts have been made to adapt it to computer vision. Non-local networks proposed a self-attention architecture to capture long-range interactions, which can be viewed as a simplified version of Transformer. DETR formulates object detection as a set prediction problem, which is naturally modeled as a sequence prediction task by the Transformer. Visual Transformers represent images as a set of visual tokens and apply a Transformer-based structure to detect relationships between visual semantic concepts for semantic segmentation. iGPT uses a standard Transformer to learn generative relationships of image pixels in an unsupervised manner. However, since the computational complexity of self-attention grows quadratically with the number of pixels, such applications of Transformers in computer vision are computationally expensive. Some approaches leverage network compression techniques, such as dynamic routing and knowledge distillation, to improve the efficiency of Transformers in NLP. However, efficient Transformers are seldom explored in computer vision. In light of this, we formulate Transformer as an efficient and plug-and-play module that is seamlessly integrated into a well-designed NAS search space.

Neural Architecture Search for Efficient Models. Early approaches utilize reinforcement learning and evolutionary algorithms to find efficient and powerful network structures. However, these methods are usually computationally expensive. To improve the efficiency of the search process, differentiable search methods such as DARTS and ProxylessNAS formulate the search space as a super-graph where the probability of adopting an operator is represented by a continuous importance weight, allowing an efficient search of the architecture using gradient descent. Other approaches randomly sample sub-networks when training the super-net and search for the best model candidate after the network converges. Inspired by manually designed structures, some methods use a search space based on MobileNetV2 to search for efficient structures. Mixed convolution is also adopted in NAS search spaces due to its multi-scale feature modeling capability. Recently, model scaling techniques have been used to expand the search space from operators to other hyper-parameters such as input resolutions, channel numbers, and layer numbers. In order to search for efficient models, existing methods usually borrow efficient operators from manually designed networks, such as depthwise convolution and the Inverted Residual Block. To construct a search space with more powerful operators, we design a new efficient Transformer structure that can be directly inserted into existing NAS search spaces.

Neural Architecture Search for Dense Prediction. Current NAS algorithms either reuse search spaces designed for image classification or only search for a feature aggregation head for dense prediction tasks. A single-branch super-net structure is usually utilized for dense prediction tasks such as semantic segmentation, object detection, and human pose estimation. Structures of the feature aggregation head are also discovered using NAS algorithms for semantic segmentation. Recent explorations aim to find an optimal network layout in a hierarchical multi-scale search space. However, their search spaces use fixed layer widths, which results in computationally heavy models. In contrast, we propose a multi-branch search space where each branch specializes in a particular feature resolution. The same search space can be directly used for various dense prediction tasks that have different preferences on the granularity of features, and the computation budget is allocated to different resolutions through end-to-end optimization.

Methodology

Our method aims to search for network structures within a multi-branch search space containing both Convolutions and Transformers with a resource-aware search strategy. In this section, we first introduce our lightweight Transformers. We then detail our multi-branch search space and how to integrate our Transformers into it. Finally, we describe the resource-aware fine-grained search strategy.

The standard Transformer cannot be directly applied to high-resolution images and mobile scenarios, as its computational cost grows quadratically with the number of pixels. To solve this issue, we propose a lightweight Transformer that consists of a projector, an encoder, and a decoder (see Fig. 2).

The 2D positional map $P$ is very efficient as it contains only 2 channels (i.e., $d_{p}=2$). Later in the experiments, we show that this simple encoding outperforms the sinusoidal positional encoding and the learned embedding.

Time Complexity. The time complexities of our Multi-Head Self-Attention and our FFN are $O(4nds^{2}+2n^{2}d)$ and $O(8ns^{4})$, respectively, where $s^{2}$, $d$, and $n$ are in the projected low-dimensional space. Since $s^{2}$ is a projected small spatial size, the overall time complexity (FLOPs) $O_{\mathcal{T}}(n)$ of our Transformer is approximately linear with $n^{2}d$. In the following part, we will further introduce a fine-grained search strategy to reduce the number of tokens $n$ in order to make the lightweight Transformer more efficient.
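To make the dependence on the number of tokens $n$ concrete, the two expressions above can be evaluated directly. The following is a minimal sketch, not the authors' FLOPs counter: the function names are hypothetical, the constant factors are taken as written in the text, and the values $s=8$ and $d=64$ are the settings reported later in the experiments.

```python
def attention_flops(n: int, d: int, s: int) -> int:
    """Multi-Head Self-Attention term as stated in the text: 4*n*d*s^2 + 2*n^2*d."""
    return 4 * n * d * s**2 + 2 * n**2 * d


def ffn_flops(n: int, s: int) -> int:
    """FFN term as stated in the text: 8*n*s^4."""
    return 8 * n * s**4


def transformer_flops(n: int, d: int, s: int) -> int:
    """Sum of the two terms (hypothetical helper, constants kept as written)."""
    return attention_flops(n, d, s) + ffn_flops(n, s)


if __name__ == "__main__":
    s, d = 8, 64  # projected spatial size and dimension from the experiments
    for n in (1, 2, 4, 8):
        print(f"n={n}: {transformer_flops(n, d, s)} FLOPs")
```

Evaluating the function for a few values of $n$ shows how much each discarded token saves, which is the quantity the search strategy described later tries to exploit.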

In summary, the main difference between our lightweight Transformer and the standard Transformer lies in: (1) A projection function $\mathcal{P}(\cdot)$ is used to learn self-attention in a low-dimensional space. (2) A simpler yet effective 2D positional map $P$ is used for positional encoding. (3) The first Multi-Head Attention and the spatial encoding in the standard Transformer decoder are removed.

3.2 Multi-branch Search Space

Inspired by HRNet, we design a multi-branch search space for dense prediction that contains both multi-scale features and global contexts while maintaining high-resolution representations throughout the network.

Overview. The network consists of two kinds of modules: the parallel module and the fusion module. Both are constructed with our searching blocks. As shown in Fig. 3 (a), after two convolutions that decrease the feature resolution to $1/4$ of the input image size, we start with this high-resolution branch, gradually add high-to-low resolution branches through fusion modules, and connect the multi-resolution branches in parallel through parallel modules. Finally, the multi-branch features are resized, concatenated together, and connected to the final classification/regression layer without any additional heads.

The parallel module obtains larger receptive fields and multi-scale features by stacking searching blocks in each branch. It has $m$ branches containing $nc_{1},\ldots,nc_{m}$ convolutions with $nw_{1},\ldots,nw_{m}$ channels in each branch. A fusion module is used after a parallel module to exchange information across multiple branches. An extra lower-resolution branch is also generated from the previously lowest-resolution branch until a downsampling ratio of $1/32$ is reached. A fusion module takes $m_{in}$ branches from the previous parallel module as input and outputs $m_{out}$ branches. For each output branch, all its neighboring input branches are fused, using the searching block to unify their feature map sizes. For example, a $1/8$ output branch integrates information from the $1/4$, $1/8$, and $1/16$ input branches. In our fusion module, the high-to-low resolution feature transformation is realized by the reduction searching block, while the low-to-high resolution feature transformation is implemented with the normal searching block and upsampling.
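The neighbor rule of the fusion module can be sketched as a small connectivity table. This is an illustrative sketch only: `branch_scale` and `fusion_inputs` are hypothetical names, branches are indexed from the highest resolution ($1/4$) downward, and the operation applied to a same-resolution input is an assumption, since the text only specifies the high-to-low and low-to-high cases.

```python
from fractions import Fraction
from typing import Dict, List, Tuple


def branch_scale(index: int) -> Fraction:
    """Resolution of branch `index` relative to the input: 1/4, 1/8, 1/16, 1/32."""
    return Fraction(1, 4 * 2**index)


def fusion_inputs(m_in: int, m_out: int) -> Dict[int, List[Tuple[int, str]]]:
    """For each output branch, list the (input branch, operation) pairs it fuses."""
    plan: Dict[int, List[Tuple[int, str]]] = {}
    for out in range(m_out):
        sources = []
        for src in (out - 1, out, out + 1):  # neighboring input branches only
            if not 0 <= src < m_in:
                continue
            if src < out:
                op = "reduction block"  # high-to-low resolution
            elif src > out:
                op = "normal block + upsampling"  # low-to-high resolution
            else:
                op = "normal block"  # same resolution (assumed)
            sources.append((src, op))
        plan[out] = sources
    return plan


if __name__ == "__main__":
    for out, sources in fusion_inputs(m_in=3, m_out=4).items():
        print(branch_scale(out), [(str(branch_scale(s)), op) for s, op in sources])
```

With three input branches and four output branches, the $1/8$ output receives the $1/4$, $1/8$, and $1/16$ inputs, and the new $1/32$ branch is produced only from the previously lowest-resolution ($1/16$) branch, matching the description above.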

Searching block. As shown in Fig. 3 (c), our searching block contains two paths: one path is a MixConv, and the other is a lightweight Transformer that aims to provide more global context. The number of convolutional channels and the number of tokens in the Transformer are searchable parameters.

Formally, let $X$ be the input with $c$ feature channels (the spatial dimension is omitted for simplicity). In the MixConv path, the first layer is a point-wise convolution $\mathcal{C}_{0}$ which expands $X$ to $3r\times c$ channels (i.e., the expansion ratio is $3r$); the output is split into three parts with an equal number of channels (i.e., each with $r\times c$ channels), which are then fed into three depth-wise convolutions $\mathcal{C}_{1},\mathcal{C}_{2},\mathcal{C}_{3}$ with kernel sizes of $3\times 3$, $5\times 5$, and $7\times 7$, respectively. The outputs of these three layers are concatenated, followed by another point-wise convolution $\mathcal{C}_{4}$ that produces the feature map with the desired number of channels $c^{\prime}$. In the Transformer path, a lightweight Transformer $\mathcal{T}$ with $n$ tokens is applied to the input feature $X$ to obtain the global self-attention. The outputs of the two paths are added to form the final output $X^{\prime}$ of the searching block. Intuitively, the Transformer path can be regarded as a residual path for enhancing the global context within the searching block. The information flow in a searching block can be written as:

$$X^{\prime}=\mathcal{C}_{4}\Big(\mathrm{Concat}\big(\mathcal{C}_{1}(\mathcal{C}_{0}(X)_{1}),\ \mathcal{C}_{2}(\mathcal{C}_{0}(X)_{2}),\ \mathcal{C}_{3}(\mathcal{C}_{0}(X)_{3})\big)\Big)+\mathcal{T}(X),$$

where $\mathcal{C}_{0}(X)_{i}$ represents the $i$-th part of the output of $\mathcal{C}_{0}(X)$, as shown in Fig. 3 (c). Note that when the strides of the convolutions $\mathcal{C}_{1},\mathcal{C}_{2},\mathcal{C}_{3}$ are equal to $2$, as in the reduction searching block, the inverse projection $\widehat{\mathcal{P}}(\cdot)$ in the Transformer resizes its input to half of the original spatial size in order to match the output shape of $\mathcal{C}_{4}$.

3.3 Resource-aware Fine-grained Search

Our supernet is a multi-branch network where each branch is a chain of searching blocks operating at different resolutions; each searching block combines a MixConv and a Transformer. Unlike previous searching methods that are designed for specific tasks, we aim to customize the network for various tasks. Specifically, we propose a resource-aware channel/query-wise fine-grained search strategy to explore the optimal feature combination for different tasks.

We adopt a progressive shrinking NAS paradigm, which generates lightweight models by discarding some of the convolutional channels and Transformer queries during training. Because the channels in depth-wise convolutions are independent in our searching block, any channel of these convolutions can be removed without affecting the other searching blocks; we only need to remove the corresponding weights from the convolutions. Similarly, if a token of the Transformer is discarded, we just remove the corresponding weights from the $1\times 1$ convolutions of the projections $\mathcal{P}(\cdot)$ and $\widehat{\mathcal{P}}(\cdot)$, and the corresponding embedding from the queries $S$.

In the rest of this paper, we call a channel of the depth-wise convolutions or a token in a Transformer a search unit. A searching block with $c$ input channels, an expansion ratio of $3r$, and $n$ tokens has $3rc+n$ search units in total.
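As a quick worked example of this count, the sketch below tallies the search units of one searching block. The function name `search_units` is hypothetical; $r=4$ is the expansion setting reported in the experiments, $c=18$ is the channel number of the highest-resolution branch given in Appendix B, and $n=8$ tokens is chosen purely for illustration.

```python
from typing import Dict


def search_units(c: int, r: int, n: int) -> Dict[str, int]:
    """Count search units: r*c depth-wise channels per kernel size, plus n tokens."""
    per_kernel = r * c  # channels fed to each of the 3x3, 5x5, 7x7 depth-wise convs
    return {
        "3x3": per_kernel,
        "5x5": per_kernel,
        "7x7": per_kernel,
        "tokens": n,
        "total": 3 * per_kernel + n,  # 3rc + n, as in the text
    }


if __name__ == "__main__":
    # n=8 is an illustrative assumption, not a value taken from the paper.
    print(search_units(c=18, r=4, n=8))
```

For these values, each depth-wise convolution contributes $72$ units and the block has $3\cdot 4\cdot 18+8=224$ search units in total.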

Following DARTS, we introduce an importance factor $\alpha>0$ that can be learned jointly with the network weights for each search unit of the searching block. We then progressively discard those with low importance while maintaining overall performance. Inspired by works on channel pruning, we add a resource-aware L1 penalty on $\alpha$, which effectively pushes importance factors of high computational costs to zero. Specifically, the L1 penalty of a search unit is weighted by the amount of the reduction in computational cost $\Delta>0$ (i.e., FLOPs in this case):

where $O_{\mathcal{T}}$ is the FLOPs of the Transformer defined in Sec. 3.1, $i$ is the index of the search unit, and $n^{\prime}$ is the number of remaining tokens. Note that the $\Delta$'s for search units of convolutions are fixed, while in the Transformer, $\Delta$ is a function of the number of remaining tokens. It is worth mentioning that, although FLOPs is not always a good measure of latency, we use it because it is the most widely and easily used metric. Eq. 7 can be easily adapted to use other metrics, e.g., latency and energy cost.

With the added resource-aware penalty term, the overall training loss is:

where $L_{\text{task}}$ denotes the standard classification/regression loss, and $\lambda$ denotes the coefficient of the L1 penalty term.
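One form of the loss consistent with this description is the task loss plus $\lambda$ times the sum of $\Delta_{i}|\alpha_{i}|$ over all search units. The sketch below assumes exactly that form; the function name `penalized_loss` and the toy numbers are hypothetical, and any normalization of $\Delta$ used in practice is not specified here.

```python
from typing import Sequence


def penalized_loss(
    task_loss: float,
    alphas: Sequence[float],
    deltas: Sequence[float],
    lam: float,
) -> float:
    """Task loss plus a FLOPs-weighted L1 penalty on the importance factors.

    Assumed form: L = L_task + lam * sum_i(delta_i * |alpha_i|).
    """
    if len(alphas) != len(deltas):
        raise ValueError("each search unit needs one alpha and one delta")
    penalty = sum(d * abs(a) for a, d in zip(alphas, deltas))
    return task_loss + lam * penalty


if __name__ == "__main__":
    # Two toy search units: a cheap one and an expensive one (made-up costs).
    loss = penalized_loss(task_loss=1.0, alphas=[0.5, 0.1],
                          deltas=[100.0, 400.0], lam=1e-3)
    print(loss)
```

Because each importance factor is scaled by its own $\Delta$, a unit that is four times as expensive receives four times the shrinkage pressure for the same $\alpha$, which is the stated intent of the resource-aware weighting.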

During training, after every few epochs, we progressively remove the search units whose importance factors are below a predefined threshold $\epsilon$ and re-calibrate the running statistics of Batch Normalization (BN) layers. Note that if all tokens of a Transformer are removed, the Transformer will degenerate into a residual path, as shown in Fig. 2.
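The shrinking step itself amounts to a periodic threshold filter over the importance factors. The following sketch uses hypothetical names (`prune_units`, `is_prune_epoch`) and plain dictionaries in place of real network weights; the default threshold of $0.001$ and the five-epoch interval are the settings reported in the experiments, while the exact epoch offset is an assumption. Re-calibrating the BN statistics would follow each pruning step and is not modeled here.

```python
from typing import Dict


def is_prune_epoch(epoch: int, every: int = 5) -> bool:
    """True on epochs 5, 10, 15, ... (1-indexed; the offset is an assumption)."""
    return epoch > 0 and epoch % every == 0


def prune_units(alphas: Dict[str, float], epsilon: float = 1e-3) -> Dict[str, float]:
    """Keep only the search units whose importance factor is at least epsilon."""
    return {name: a for name, a in alphas.items() if a >= epsilon}


if __name__ == "__main__":
    # Toy importance factors for three search units (made-up values).
    alphas = {"conv3x3/0": 0.2, "conv5x5/3": 0.0004, "token/1": 0.001}
    for epoch in range(1, 11):
        if is_prune_epoch(epoch):
            alphas = prune_units(alphas)
            # BN running statistics would be re-calibrated here.
    print(sorted(alphas))
```

If every `token/*` entry of a block is filtered out, the Transformer path of that block has nothing left to compute, which corresponds to the degenerate residual-path case mentioned above.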

When the search ends, the remaining structure not only represents the best accuracy-efficiency trade-offs, but also has the optimal low-level/high-level and local/global feature combination for a specific task. In addition, since the network training and architecture search are conducted in a unified end-to-end manner, the resulting network can be used directly without fine-tuning.

Experiments

To validate the generalizability of our method, we select five benchmark datasets on four representative tasks for performance evaluation: image classification on ImageNet, human pose estimation on COCO keypoint, semantic segmentation on Cityscapes and ADE20K, and 3D object detection on KITTI. These benchmarks are carefully selected as they require different receptive fields, global/local contexts, and 2D/3D perceptions. In this work, the same supernet is used for all five benchmarks; it begins with two $3\times 3$ convolutions with stride 2, followed by five parallel modules (with 1, 2, 3, 4, and 4 branches, respectively); a fusion module is inserted between every two adjacent parallel modules to obtain multi-scale features. For Transformers, we set $s=8$, $d=s^{2}=64$, and $h=1$. In some evaluation experiments without search, we fix $d=8$. The expansion ratio $r$ of the searching block is set to 4. For the MixConv, we use the scales from the batch normalization layers after the depth-wise convolutions as the importance factors; for the Transformer, we use the scales from the batch normalization layer in the projector $\mathcal{P}$ as the importance factors. On each benchmark, we obtain HR-NAS-A and HR-NAS-B using different $\lambda$ values. Search units with $\alpha<0.001$ are deemed unimportant and removed every five epochs. Unless specified, all experiments in this paper use standard training protocols, e.g., we do not apply techniques like AutoAug, Mixup, and Cutout. All our models are trained from scratch without pretraining on the ImageNet dataset, and are evaluated with single-scale input and without multi-crop. Details of the datasets and the training settings for each task can be found in the Supplemental Materials.

4.2 Comparative Results

We compare against state-of-the-art methods on five benchmarks: image classification on ImageNet (Tab. 1), semantic segmentation on Cityscapes (Tab. 2), semantic segmentation on ADE20K (Tab. 3), human pose estimation on COCO keypoint (Tab. 4), and 3D object detection on KITTI (Tab. 5). From these tables we can see that: (1) Our method achieves state-of-the-art performance on all three dense prediction tasks and competitive results on the classification task. Compared with other tasks, classification usually benefits less from multi-scale and global contexts because it aggregates position-invariant features through global pooling. (2) Many existing methods utilize additional modules or pretraining on the ImageNet dataset to get better performance for a specific task. In contrast, our method is able to show superior results across multiple challenging datasets without any bells and whistles. (3) We evaluate the mean and standard deviation of 5 runs on Cityscapes with Random Search as a baseline. The results show that our method is stable, with a standard deviation of only about 0.3. (4) For NAS methods that target high segmentation accuracy rather than accuracy-efficiency trade-offs, we reduce their network width to 1/2 (and, where noted, depth to 1/2), thus obtaining tiny variants. Our method outperforms the second-best competitor by a large margin on Cityscapes (74.18 vs. 76.01), ADE20K (33.41 vs. 34.92), and COCO keypoint (74.9 vs. 75.5) using a much lighter model, showing its superiority and its ability to balance accuracy and efficiency on dense prediction tasks.

4.3 Ablation Study

Search Space. In this part, we study the design components of our search space. In Tab. 6 we show how the introduction of different components affects the performance and FLOPs, using the Cityscapes segmentation benchmark as an example. The baseline search space, “Single-branch” in Tab. 6, is a single-branch network with only $3\times 3$ convolutions, where the up-sampling operations are applied at the end for dense prediction tasks. Adding the multi-branch architecture increases the mIoU from $66.23\%$ to $68.65\%$ with fewer parameters and FLOPs, showing the effectiveness of our multi-branch design. The MixConv with a mix of $3\times 3$, $5\times 5$, and $7\times 7$ convolutions in the searching block further improves the mIoU by $3.34\%$. Finally, the lightweight Transformer provides another gain of $2.56\%$ ($71.99\%$ vs. $74.55\%$) with only $70$M extra FLOPs.

Lightweight Transformer. In Tab. 7, we study the choice of positional embeddings. It can be seen that using the proposed 2D positional map in the encoder of the Transformer achieves better performance than using the sinusoidal positional encoding or the learned positional embedding. This may be because our lightweight Transformer has fewer queries and smaller token dimensions than the other two, so a high-dimensional representation of position information is unnecessary. We also evaluate the alternative that uses the 2D positional map at both the encoder and the decoder; its performance is slightly worse than the encoder-only option.

The proposed Transformer can be used as a plug-and-play component. To show this, we add our Transformer to the Inverted Residual Blocks of two efficient models, ShuffleNetV2 and MobileNetV2, and evaluate their performance on both ImageNet classification and Cityscapes segmentation tasks. A PSP module is added as the segmentation head to all models. As shown in Tab. 8, our Transformer improves the two baseline models on both classification and segmentation tasks.

4.4 Visualization of Searched Networks

We visualize in Fig. 4 the smaller model (i.e., HR-NAS-A) found on each of the four benchmarks. We can observe that our method finds different architectures for different tasks, showing that it can automatically adapt to various tasks: (1) In the image classification task and the 3D detection task, at the high-resolution branches (i.e., the first and second branches), the models we found remove most of the search units; some searching blocks are even completely removed, as indicated by the completely gray circles in Fig. 4. The reason is that in these two tasks, global semantic information is more important than local information. (2) The model for the segmentation task still retains computation in the first two branches, as it is important to keep high-resolution representations for semantic segmentation. (3) The human pose estimation model mainly utilizes the second and the third branches, which means it may rely more on middle-resolution semantics than on high-resolution semantics. (4) Transformers are used more in the segmentation and the human keypoint estimation tasks, indicating that these dense prediction tasks benefit more from global contexts.

Conclusion

In this paper, we introduce a lightweight and plug-and-play Transformer that can be easily combined with convolutional networks to enrich global contexts for dense image prediction tasks. We then effectively encode both the proposed Transformers and convolutions into a well-designed high-resolution search space to model both global and multiscale contextual information. A channel/query-level fine-grained progressive shrinking strategy is applied to the search space for searching and customizing efficient models for various tasks. Our searched models achieve state-of-the-art trade-offs between performance and FLOPs for three dense prediction tasks and an image classification task, given only small computational budgets.

Acknowledgements

Ping Luo was supported by the General Research Fund of HK No.27208720. Zhiwu Lu was supported by National Natural Science Foundation of China (61976220 and 61832017), and Beijing Outstanding Young Scientist Program (BJJWZYJH012019100020098).

Appendix A Datasets and Settings

In this section, we provide details of the datasets and settings used. We use the same super network for training and evaluation in each task.

In practice, different hyperparameters are often tuned with a validation set for different tasks according to different datasets and losses. For example, HRNet is trained using two different settings for segmentation and keypoint estimation tasks. In this work, we follow the common training settings of each task, i.e., the setting in HRNet for segmentation and keypoint estimation, AtomNAS for classification, and PointPillars for 3D detection.

As for the choice of $\lambda$ for each task, we first empirically tuned it so that the FLOPs of HR-NAS-A are comparable to the lowest FLOPs among the baseline models; we then relaxed the restriction so that HR-NAS-B reaches SOTA while still using fewer FLOPs than the best baseline models. Currently, the size of the searched model cannot be controlled precisely by $\lambda$. We leave more precise control, by incorporating other techniques, as future work. See below for details.

ImageNet for Image Classification. The ILSVRC 2012 classification dataset consists of 1,000 classes, with 1.2 million training images and 50,000 validation images. Following common practice, we adopt an RMSProp optimizer with momentum 0.9 and weight decay 1e-5; exponential moving average (EMA) with decay 0.9999; and exponential learning rate decay. The input size is $224\times 224$. The initial learning rate is set to 0.064 with batch size 1024 on 16 Tesla V100 GPUs for 350 epochs, and decays by 0.97 every 2.4 epochs. By setting the coefficient of the L1 penalty term $\lambda$ to 1.8e-4 and 1.2e-4, we obtain our HR-NAS-A and HR-NAS-B. Unless specified, we adopt the ReLU activation and the basic data augmentation scheme, i.e., random resizing and cropping, and random horizontal flipping, and use single-crop for evaluation. For experiments of HR-NAS ${\dagger}{\ddagger}$, we also adopt the SE module, Swish activation, and RandAugment for better performance. We report the top-1 accuracy as the evaluation metric.
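The exponential learning rate decay described above can be written as a one-line schedule. The sketch below is illustrative: the function name `imagenet_lr` is hypothetical, and whether the decay is applied in discrete steps (staircase) or continuously is an assumption, so both options are exposed.

```python
import math


def imagenet_lr(
    epoch: float,
    base_lr: float = 0.064,
    decay: float = 0.97,
    decay_epochs: float = 2.4,
    staircase: bool = True,
) -> float:
    """Learning rate after `epoch` epochs: base_lr * decay ** (epoch / decay_epochs).

    With staircase=True the exponent is floored, so the rate drops in steps
    every `decay_epochs` epochs (an assumption about the exact schedule).
    """
    steps = epoch / decay_epochs
    if staircase:
        steps = math.floor(steps)
    return base_lr * decay**steps


if __name__ == "__main__":
    for epoch in (0, 2.4, 24, 350):
        print(epoch, imagenet_lr(epoch))
```

At epoch 0 the schedule returns the initial rate of 0.064, and after the first 2.4 epochs it returns $0.064\times 0.97=0.06208$.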

Cityscapes for Semantic Segmentation. The Cityscapes dataset contains high-quality pixel-level annotations of 5000 images of size $1024\times 2048$ (2975, 500, and 1525 for the training, validation, and test sets, respectively) and about 20000 coarsely annotated training images. Following prior work, 19 semantic labels are used for evaluation without considering the void label. In this work, the input size is set to $512\times 1024$. We use an AdamW optimizer with momentum 0.9 and weight decay 1e-5; exponential moving average (EMA) with decay 0.9999. The initial learning rate is set to 0.04 with batch size 32 on 8 Tesla V100 GPUs for 430 epochs. The learning rate and momentum follow the onecycle scheduler with a minimum learning rate of 0.0016. By setting the coefficient of the L1 penalty term $\lambda$ to 1.6e-4 and 6.0e-5, we obtain our HR-NAS-A and HR-NAS-B. We use basic data augmentation, i.e., random resizing and cropping, random horizontal flipping, and photometric distortion, for training, and single-crop testing with a test size of $1024\times 2048$. We report the mean Intersection over Union (mIoU), mean (macro-averaged) Accuracy (mAcc), and overall (micro-averaged) Accuracy (aAcc) as the evaluation metrics.
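For reference, the three reported metrics can all be computed from a class confusion matrix. The sketch below follows the standard definitions and is not the evaluation code used in the paper; the function name `segmentation_metrics` is hypothetical, and skipping classes that never occur is an assumption about how empty classes are handled.

```python
from typing import Dict, List


def segmentation_metrics(conf: List[List[int]]) -> Dict[str, float]:
    """Compute mIoU, mAcc, and aAcc from a confusion matrix.

    conf[i][j] is the number of pixels with ground-truth class i predicted as j.
    """
    k = len(conf)
    tp = [conf[i][i] for i in range(k)]
    gt = [sum(conf[i]) for i in range(k)]  # pixels per ground-truth class
    pred = [sum(conf[i][j] for i in range(k)) for j in range(k)]  # per predicted class
    ious, accs = [], []
    for i in range(k):
        union = gt[i] + pred[i] - tp[i]
        if union > 0:  # skip classes absent from both labels and predictions
            ious.append(tp[i] / union)
        if gt[i] > 0:  # skip classes absent from the labels
            accs.append(tp[i] / gt[i])
    return {
        "mIoU": sum(ious) / len(ious),  # mean of per-class IoU
        "mAcc": sum(accs) / len(accs),  # macro-averaged per-class accuracy
        "aAcc": sum(tp) / sum(gt),      # micro-averaged (overall pixel) accuracy
    }


if __name__ == "__main__":
    print(segmentation_metrics([[8, 2], [1, 9]]))
```

For the toy two-class matrix above, the per-class IoUs are $8/11$ and $9/12$, and both mAcc and aAcc equal $0.85$.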

ADE20K for Semantic Segmentation. The ADE20K dataset contains 150 classes and diverse scenes with 1,038 image-level labels. The dataset is divided into 20K/2K/3K images for training, validation, and testing, respectively. In this work, the input size and testing size are set to $512\times 512$ and $512\times 2048$, respectively. The model is trained with a batch size of 64 on 8 Tesla V100 GPUs for 200 epochs. We use the same optimizer, learning rate scheduler, data augmentation, and penalty weight $\lambda$ as for the Cityscapes dataset. We report the mean Intersection over Union (mIoU) as the evaluation metric.

COCO Keypoint for Human Pose Estimation. The COCO dataset contains over $200,000$ images and $250,000$ person instances labeled with $17$ keypoints. We train our model on the COCO train2017 set, including $57K$ images and $150K$ person instances. We evaluate our approach on the val2017 set, containing $5000$ images. In this work, we train the model using input sizes of $256\times 192$ and $384\times 288$ with batch sizes of 384 and 192, respectively, on 8 Tesla V100 GPUs for 210 epochs. Following HRNet, the initial learning rate is set to 1e-3 with a multistep scheduler (decayed by a factor of 0.1 at epochs 170 and 200). We use an Adam optimizer with momentum 0.9 and weight decay 1e-8; exponential moving average (EMA) with decay 0.9999. By setting the coefficient of the L1 penalty term $\lambda$ to 1e-6 and 1e-8, we obtain our HR-NAS-A and HR-NAS-B. We use random scaling and rotation as the only data augmentation for training, and single-crop testing. We report average precision (AP), recall scores (AR), $\text{AP}^{M}$ for medium objects, and $\text{AP}^{L}$ for large objects as evaluation metrics.

KITTI for 3D Object Detection. The KITTI 3D object detection dataset is widely used for monocular and LiDAR-based 3D detection. It consists of 7,481 training images and 7,518 test images as well as the corresponding point clouds and the calibration parameters, comprising a total of 80,256 2D-3D labeled objects with three object classes: Car, Pedestrian, and Cyclist. Each 3D ground truth box is assigned to one of three difficulty classes (easy, moderate, hard) according to the occlusion and truncation levels of objects. In this work, we follow the common train-val split, which contains 3,712 training and 3,769 validation images. The overall framework is based on PointPillars. The input points are projected into bird’s-eye view (BEV) feature maps by a voxel feature encoder (VFE). The projected BEV feature maps ($496\times 432$) are then used as the input of our 2D network for 3D/BEV detection. Following prior work, we set the pillar resolution to 0.16m, the maximum number of pillars to 12000, and the maximum number of points per pillar to 100. We use the onecycle scheduler with an initial learning rate of 2e-3, a minimum learning rate of 2e-4, and batch size 16 on 8 Tesla V100 GPUs for 80 epochs. We use an AdamW optimizer with momentum 0.9 and weight decay 1e-2. We apply the same data augmentation for 3D point clouds as in PointPillars, i.e., random mirroring and flipping, global rotation and scaling, and global translation. At inference time, we apply axis-aligned non-maximum suppression (NMS) with an overlap threshold of 0.5 IoU. We report standard average precision (AP) as the evaluation metric.

Appendix B Network Architecture

As shown in Fig. 5, we visualize our entire super network used in all experiments. It begins with two $3\times 3$ convolutions with stride 2 and number of channels 24, which are followed by five parallel modules (respectively with 1, 2, 3, 4, 4 branches); a fusion module is inserted between every two adjacent parallel modules, to obtain multi-scale features. The numbers of channels for the four branches in parallel modules are 18, 36, 72, 144, respectively.
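The supernet layout described above can be summarized as a small configuration object. This is a hypothetical sketch for illustration (the class `SupernetConfig` and its field names are not from the paper); the numbers are those stated in this appendix, and branch resolutions assume the $1/4$ to $1/32$ progression of the main text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SupernetConfig:
    """Hypothetical container for the supernet hyperparameters listed above."""

    num_stem_convs: int = 2  # two 3x3 convolutions
    stem_stride: int = 2     # each with stride 2, so the stem outputs 1/4 resolution
    stem_channels: int = 24
    branches_per_module: Tuple[int, ...] = (1, 2, 3, 4, 4)
    branch_channels: Tuple[int, ...] = (18, 36, 72, 144)

    def module_layout(self) -> List[List[Tuple[str, int]]]:
        """Per parallel module, list (resolution, channels) for each branch."""
        layout = []
        for m in self.branches_per_module:
            layout.append(
                [(f"1/{4 * 2**b}", self.branch_channels[b]) for b in range(m)]
            )
        return layout


if __name__ == "__main__":
    for idx, branches in enumerate(SupernetConfig().module_layout(), start=1):
        print(f"parallel module {idx}: {branches}")
```

Printing the layout makes it easy to check that the first parallel module has a single $1/4$-resolution branch with 18 channels and that the last two modules each carry all four branches.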

Appendix C Ablative Results for Transformer

In this section, we conduct two ablative experiments to study the impact of the projection size $s$, the encoder-decoder structure, and the attention mechanism on the performance of our lightweight Transformer. For both experiments, we take the searched network on the Multi-branch + MixConv space (without Transformer) in Tab. 6 of the main paper as a strong baseline.

Projection Sizes. We evaluate our Transformers with different projected spatial sizes $s$. From Tab. 9 we can see that when $s$ goes from 0 to 8, the mIoU keeps increasing at the expense of a small extra cost (i.e., FLOPs). Further increasing $s$ brings no gain in performance but drastically increases FLOPs. We therefore choose $s=8$ throughout the experiments.

Attention Structures and Mechanisms. We also conduct ablative experiments to validate the effectiveness of our Transformer. We discuss (1) encoder-decoder structures and (2) two kinds of attention mechanisms obtained by transposing the feature, i.e., ‘channel’, which uses each channel of the flattened feature map as a token, and ‘spatial’, which uses each spatial position as a token. As shown in Tab. 10, our Transformer obtains the best performance when both the encoder and the decoder are used on channel-wise tokens. Our Transformer also significantly outperforms its counterparts such as SE and Non-local on dense prediction tasks. Since the channel-wise lightweight Transformer shows better performance, we set it as the default in this work.
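The difference between the two tokenizations is only a transpose of the flattened feature map. The sketch below illustrates this on nested Python lists; the function names are hypothetical and no attention is computed, it only shows which axis becomes the token axis.

```python
from typing import List

Feature = List[List[List[float]]]  # C x H x W nested list


def flatten_feature(feature: Feature) -> List[List[float]]:
    """Flatten a C x H x W feature map into C rows of length H*W."""
    return [[v for row in channel for v in row] for channel in feature]


def channel_tokens(feature: Feature) -> List[List[float]]:
    """'channel': each channel is one token of dimension H*W (C tokens)."""
    return flatten_feature(feature)


def spatial_tokens(feature: Feature) -> List[List[float]]:
    """'spatial': each position is one token of dimension C (H*W tokens)."""
    return [list(col) for col in zip(*flatten_feature(feature))]


if __name__ == "__main__":
    # Toy feature map with C=2, H=2, W=3.
    feature = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
    print(len(channel_tokens(feature)), "channel tokens")
    print(len(spatial_tokens(feature)), "spatial tokens")
```

For a feature map with $C$ channels and $H\times W$ positions, the channel variant yields $C$ tokens while the spatial variant yields $H\cdot W$ tokens, so the number of tokens, and hence the attention cost, differs between the two.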

Appendix D Visualization of Visual Recognition Results

We visualize the results of HR-NAS-A on segmentation, human pose estimation, and 3D detection (Fig. 6, 7, 8).