Latency-aware Unified Dynamic Networks for Efficient Image Recognition

Yizeng Han, Zeyu Liu, Zhihang Yuan, Yifan Pu, Chaofei Wang, Shiji Song, Gao Huang

Introduction

Deep neural networks have demonstrated exceptional capabilities in various domains such as computer vision, natural language processing, and multi-modal understanding/generation. Despite their stellar performance, the intensive computational requirements of these deep networks often limit their deployment on resource-constrained platforms, like mobile phones and IoT devices, highlighting the need for more efficient deep learning models.

Unlike traditional static networks which process all inputs uniformly, dynamic models adaptively allocate computation in a data-dependent fashion. This adaptivity involves bypassing certain network layers or convolution channels conditionally, and executing spatially adaptive inference that concentrates computational effort on the most informative regions of an image. As the field evolves and various dynamic models show promise, a natural question arises: how can we design a dynamic network for practical use?

Addressing this question is challenging due to difficulties in fairly comparing different dynamic-computation paradigms. These challenges fall into three categories: 1) The lack of a unified framework to encompass different paradigms, as research in this area is often fragmented; 2) The focus on algorithm design, which often results in a mismatch between practical efficiency and theoretical computational potential, due to the significant impact of scheduling strategies and hardware properties on real-world latency (scheduling strategies are essential for practical efficiency because they optimize the use of GPU threads and memory in CUDA code); 3) The laborious task of evaluating a dynamic model’s latency on different hardware platforms, as common libraries (e.g. cuDNN) are not built to accelerate many dynamic operators.

In response, we introduce a Latency-Aware Unified Dynamic Network (LAUDNet), a framework that unifies three representative dynamic-inference paradigms. Specifically, we examine the algorithmic design of layer skipping, channel skipping, and spatially dynamic convolution, integrating them through a “mask-and-compute” scheme (Figure 1 (a)).

Next, we delve into the challenges of translating theoretical efficiency into tangible speedup, especially on multi-core processors such as GPUs. Traditional literature commonly adopts hardware-agnostic FLOPs (floating-point operations) as a crude efficiency measure, failing to provide latency-aware guidance for algorithm design. In dynamic networks, adaptive computation coupled with sub-optimal scheduling strategies intensifies the gap between FLOPs and latency. Moreover, most existing methods execute adaptive inference at the finest granularity. For instance, in spatial-wise dynamic inference, the decision to compute each feature pixel is made independently. This fine-grained flexibility results in non-contiguous memory access, necessitating specialized scheduling strategies (Figure 1 (b)).

Given that dynamic operators exhibit unique memory access patterns and scheduling strategies, libraries designed for static models, like cuDNN, fail to optimize dynamic models effectively. The lack of library support implies that each dynamic operator requires individualized scheduling optimization, code refinement, compilation, and deployment, making network latency evaluation across hardware platforms labor-intensive. To address this, we propose a novel latency prediction model that efficiently estimates network latency by taking into account algorithm design, scheduling strategies, and hardware properties. Compared to hardware-agnostic FLOPs, our predicted latency offers a more realistic representation of dynamic model efficiency.

Guided by the latency prediction model, we tackle the aforementioned challenges within our latency-aware unified dynamic network (LAUDNet) framework. For a given hardware device, we use the predicted latency as the guiding metric for algorithm design and scheduling optimization, as opposed to the conventionally used FLOPs (Figure 1 (c)). In this context, we propose coarse-grained dynamic networks where “whether-to-compute” decisions are made at the patch/group level rather than individual pixels/channels. Though less flexible than pixel/channel-level adaptability in prior works, this approach encourages contiguous memory access, enhancing real-world speedup on hardware. Our improved scheduling strategies further permit batching inference. We investigate dynamic inference paradigms, focusing on the accuracy-latency trade-off. Notably, previous research has established a correlation between latency and FLOPs on CPUs; hence, in this paper, we primarily target the GPU platform, a more challenging but less explored environment.

The LAUDNet is designed as a general framework in two ways: 1) Multiple adaptive inference paradigms can be easily implemented in various vision backbones, like ResNets, RegNets and vision Transformers; and 2) The latency predictor functions as an off-the-shelf tool that can be readily applied to diverse computing platforms, such as server-end GPUs (Tesla V100, RTX3090), desktop-level GPU (RTX3060) and edge devices (Jetson TX2, Nvidia Nano).

We evaluate LAUDNet’s performance across multiple backbones for image classification, object detection, and instance segmentation. Our results show that LAUDNet significantly improves the efficiency of deep CNNs, both in theory and practice. For instance, the inference latency of ResNet-101 on ImageNet is reduced by >50% on different types of GPUs (e.g., V100, RTX3090 and TX2), without compromising accuracy. Moreover, our method outperforms various lightweight networks in low-FLOPs scenarios.

Although parts of this work were initially published in a conference version, this paper significantly expands our previous efforts in several key areas:

A unified dynamic-inference framework is proposed. While the preliminary paper predominantly focused on spatially adaptive computation, this paper delves deeper into two additional and important dynamic paradigms, specifically, dynamic layer skipping and channel skipping (Figure 1 and Sec. 3.1). Furthermore, we integrate these paradigms into a unified framework, and provide a more thorough study on architecture design and complexity analysis (Sec. 3.2).

The latency predictor has been enhanced to support an expanded set of dynamic operators, including layer skipping and channel skipping (Sec. 3.3). Moreover, we adopt Nvidia Cutlass to optimize the scheduling strategies. Hardware evaluations demonstrate that our latency predictor can accurately predict the latency on real hardware (Figure 5).

The LAUDNet framework has been extended to accommodate Transformer architectures, as detailed in Sec. 3.2. This extension notably enhances latency optimization through the implementation of dynamic token skipping (spatially adaptive computation), head (channel) skipping, and layer skipping. Such advancements significantly broaden the applicability of LAUDNet. The empirical evaluation, illustrated in Figure 10 (c) and discussed in Sec. 4.3.2, yields valuable insights into the design of efficient Transformers, underpinning the framework’s versatility and efficacy.

For the first time, we incorporate batching inference for our dynamic operators (Sec. 3.4). This innovation leads to more consistent prediction outcomes and an enhanced speedup ratio on GPU platforms (Figures 8 and 12).

We undertake an exhaustive analysis of various dynamic granularities (Figure 9) and paradigms (Figures 10, 11, and 13; Tabs. II and III), spanning different vision tasks and platforms, with added evaluations on contemporary GPUs like RTX3060 and RTX3090. We are confident that our results will offer valuable insights to both researchers and practitioners.

Related works

Efficient deep learning has garnered substantial interest. Traditional solutions involve lightweight model design, network pruning, weight quantization, and knowledge distillation. However, these static methods have a sub-optimal inference strategy, leading to intrinsic redundancy since they process all inputs with equal computation.

Dynamic networks propose an appealing alternative to static models by enabling input-conditional dynamic inference. This adaptive approach has yielded superior results across various domains. In visual recognition, prevalent dynamic paradigms include early exiting, layer skipping, channel skipping, and spatial-wise dynamic computation. This paper primarily targets the latter three paradigms, as they can be readily applied to arbitrary visual backbones, thereby offering a generality advantage. Layer skipping and channel skipping explore structural redundancy within deep networks by selectively activating computation units, such as layers or convolution channels, when processing different inputs. Spatial-wise dynamic models alleviate spatial redundancy in image features and selectively assign computation to the regions most pertinent to the task at hand.

Despite their effectiveness, previous studies often fail to recognize the shared underlying formulation across different dynamic paradigms. In contrast, we introduce a unified framework that encompasses all three paradigms, facilitating a thorough exploration of dynamic networks. Additionally, existing methods primarily concentrate on algorithm design, which often results in a significant disparity between theoretical and practical efficiency. In our latency-aware co-design framework, we bridge this gap by utilizing latency directly from our latency predictor to guide both algorithm design and scheduling optimization. This approach results in improved latency performance across diverse platforms.

Hardware-aware network design. Researchers have acknowledged the necessity to bridge the gap between theoretical and practical efficiency of deep models by considering actual latency during network design. Two primary approaches have emerged: the first entails conducting speed tests on hardware and deriving guidelines to facilitate hand-designing lightweight models, and the second involves performing speed tests for various types of static operators and modeling the latency predictor as a small trainable model. Neural architecture search (NAS) techniques are then used to search for hardware-friendly models.

Our work distinguishes itself from these approaches in two significant ways: 1) while existing works predominantly focus on constructing static models that inherently exhibit computational redundancy by treating all inputs uniformly, our goal is to design latency-aware dynamic models that adjust their computation based on inputs; 2) conducting speed tests for dynamic operators across various hardware devices can be laborious and impractical. To circumvent this, we propose a latency prediction model that efficiently estimates the inference latency of dynamic operators on any given computing platform. This model accounts for algorithm design, scheduling strategies, and hardware properties simultaneously, providing valuable insights without the need for extensive speed testing.

Method

This section begins by providing an introduction to the foundational concepts underlying three dynamic inference paradigms (Sec. 3.1). We then present the architecture design of our LAUDNet framework, which unifies these paradigms under a cohesive mask-and-compute formulation (Sec. 3.2). Next, we explain the latency prediction model (Sec. 3.3), which guides the determination of granularity settings and scheduling optimization (Sec. 3.4). Finally, we describe the training strategies for our LAUDNet (Sec. 3.5).

During inference, the current scheduling strategy for spatial-wise dynamic convolutions generally involves three steps (Figure 1 (b)): 1) gathering, which re-organizes the selected pixels (if the convolution kernel size is greater than $1\times 1$, the neighbors are also required) along the batch dimension; 2) computation, which performs convolution on the gathered input; and 3) scattering, which fills the computed pixels on their corresponding locations of the output feature. Compared to performing convolutions on the entire feature map, this scheduling strategy reduces computation at the cost of overhead from mask generation and non-contiguous memory access. As a result, the overall latency could even be increased, particularly when the granularity of dynamic convolution is at the pixel level (Figure 6).
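The three steps can be illustrated with a minimal sketch for the $1\times 1$ case, where no neighboring pixels are needed. This is only an illustration in plain Python, not the CUDA implementation used in the paper; the function name `dynamic_conv1x1` and the choice to leave unselected output locations at zero are assumptions of the sketch.

```python
# Sketch of gather -> compute -> scatter for a spatially dynamic 1x1 convolution.
# Nested Python lists stand in for tensors (hypothetical helper, not the paper's code).

def dynamic_conv1x1(feature, mask, weight):
    """feature: [C_in][H][W], mask: [H][W] with 0/1 entries, weight: [C_out][C_in]."""
    c_in, h, w = len(feature), len(feature[0]), len(feature[0][0])
    c_out = len(weight)

    # 1) Gather: collect the selected pixels into one dense list.
    coords = [(i, j) for i in range(h) for j in range(w) if mask[i][j]]
    gathered = [[feature[c][i][j] for c in range(c_in)] for (i, j) in coords]

    # 2) Compute: a 1x1 convolution is a matrix-vector product per gathered pixel.
    computed = [
        [sum(weight[o][c] * pixel[c] for c in range(c_in)) for o in range(c_out)]
        for pixel in gathered
    ]

    # 3) Scatter: write results back to their locations.
    # Assumption of this sketch: unselected locations are left at zero.
    out = [[[0.0] * w for _ in range(h)] for _ in range(c_out)]
    for (i, j), vec in zip(coords, computed):
        for o in range(c_out):
            out[o][i][j] = vec[o]
    return out


feature = [[[1.0, 2.0], [3.0, 4.0]],       # input channel 0
           [[10.0, 20.0], [30.0, 40.0]]]   # input channel 1
mask = [[1, 0], [0, 1]]                    # compute only the two diagonal pixels
weight = [[1.0, 1.0]]                      # one output channel that sums both inputs
result = dynamic_conv1x1(feature, mask, weight)
```

Only two of the four pixels are multiplied by the weights here, but the gather and scatter steps touch scattered memory locations, which is the overhead the paragraph above refers to.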

3.2 LAUDNet architecture

Computational complexity. We first point out that the masker FLOPs are negligible compared to the backbone convolutions. Therefore, we mainly analyse the complexity of standard convolution blocks here.

Generalization in Transformer architectures. It is essential to highlight that the implementation of the three dynamic paradigms (spatial-wise adaptive computation, dynamic channel selection, and layer skipping) is inherently more straightforward in vision Transformers than in CNNs. These paradigms are not only more amenable to hardware considerations, requiring minimal scheduling optimization, but also benefit from the inherent structure of vision Transformers. For instance, spatial-wise dynamic computation can be efficiently executed through token indexing and selection, thanks to the image tokenization process in vision Transformers, thereby avoiding the complex pixel gathering required in convolution layers (Figure 2 (b)). From an algorithmic design perspective, the recent AdaViT framework introduces a method for adaptively skipping tokens, heads/channels in multi-head attention, and layers, thus enabling dynamic computation across spatial, width, and depth dimensions simultaneously. However, despite the theoretical comparisons presented in prior work, the practical efficacy of these paradigms on hardware remains uncertain. This paper leverages the architectural design principles of AdaViT to circumvent the need for foundational redesigns and utilizes our proposed latency predictor to conduct a thorough examination of the practical performance of these dynamic paradigms in vision Transformers.

3.3 Latency predictor

Hardware modeling. We model a device with multiple processing engines (PEs) for parallel computation (Figure 4). The memory system has three levels: 1) off-chip memory, 2) on-chip global memory, and 3) memory in PE. In practice, the latency mainly comes from two processes: data movement and parallel computation.

Latency prediction. Given hardware properties and model parameters, adopting a proper scheduling strategy is key to maximizing resource utilization through increased parallelism and reduced memory access. We use Nvidia Cutlass to search for the optimal scheduling (tiling and in-PE parallelism configurations) of dynamic operations. The data movement latency can then be easily obtained from data shapes and target device bandwidth. Furthermore, the computation latency is derived from hardware properties. Please refer to Appendix A for more details.
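As a rough illustration of how the two latency components could be combined, the sketch below adds a bandwidth-limited data movement term to a throughput-limited computation term. It is a simplified model under stated assumptions, not the predictor described in Appendix A: the additive form, the assumption of two FLOPs per FP32 unit per cycle, the linear scaling of computation with the activation rate, and all function and parameter names are hypothetical, and the device numbers are toy values rather than entries of Tab. IV.

```python
# Simplified two-term latency sketch (hypothetical names and toy numbers).

def data_movement_latency(bytes_moved, bandwidth):
    """Seconds needed to move `bytes_moved` bytes at `bandwidth` bytes/second."""
    return bytes_moved / bandwidth


def computation_latency(flops, num_pe, fp32_per_pe, freq_hz):
    """Seconds needed to execute `flops` at peak throughput.

    Assumption: each FP32 unit completes one multiply-add (2 FLOPs) per cycle.
    """
    peak_flops_per_second = 2 * num_pe * fp32_per_pe * freq_hz
    return flops / peak_flops_per_second


def predict_latency(dense_flops, bytes_moved, activation_rate,
                    num_pe, fp32_per_pe, freq_hz, bandwidth):
    """Assumed additive model: data movement plus computation scaled by the rate."""
    dynamic_flops = activation_rate * dense_flops
    return (data_movement_latency(bytes_moved, bandwidth)
            + computation_latency(dynamic_flops, num_pe, fp32_per_pe, freq_hz))


toy_device = dict(num_pe=4, fp32_per_pe=8, freq_hz=1e9, bandwidth=1e10)
latency_full = predict_latency(6.4e7, 1e7, activation_rate=1.0, **toy_device)
latency_half = predict_latency(6.4e7, 1e7, activation_rate=0.5, **toy_device)
```

In this toy setting, halving the activation rate halves only the computation term, so the total latency drops by a quarter rather than a half. This mirrors the gap between FLOPs and latency that motivates the predictor.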

Empirical validation. We evaluate the performance of our latency predictor with a ResNet-101 block on an RTX3090 GPU, varying the activation rate $r$. The blue curves represent the predictions, and the scattered dots are obtained via searching for a proper scheduling strategy (implemented with custom CUDA code) using Nvidia Cutlass. All three dynamic paradigms are tested. Figure 5 compares predictions to real GPU testing latency, showing accurate estimates across a wide range of activation rates.

3.4 Scheduling optimization

We use general optimization methods like fusing activation functions and batch normalization (BN) layers into convolution layers. We also optimize our dynamic convolution blocks as follows.

Operator fusion for spatial maskers. As mentioned in Sec. 3.2, spatial maskers have negligible computation but take the full feature map as input, making them memory-bounded (latency is dominated by memory access). Since the masker shares its input with the first $1\times 1$ conv (Masker-Conv1×1 in Figure 2 (b)), fusing them avoids repeated input reads. However, this makes the convolution spatially static, potentially increasing computation. For simplicity, we adopt such operator fusion in all tested models. In practice, we find that operator fusion improves efficiency in most scenarios.

Fusing gather and dynamic convolution. Traditional approaches first gather the input pixels of the first dynamic convolution in a block. The gather operation is also a memory-bounded operation. Furthermore, when the kernel size exceeds $1\times 1$, input patches overlap, leading to repeated loads/stores. We fuse gathering into dynamic convolution to reduce the memory access (Gather-Conv3x3 in Figure 2 (b)).

Note that for dynamic channel skipping (Figure 2 (c)), gathering is conducted on convolution kernels rather than features. The weight gather operation is also fused with convolution by our scheduling optimization.

Fusing scatter and add. Conventional methods scatter the final convolution outputs before the element-wise addition. We fuse these two operators (Scatter-Add in Figure 2 (b)) to reduce memory access costs. The ablation study in Sec. 4.2 validates the effectiveness of the proposed fusing methods.

Batching inference is enabled by recording patch, location, and sample correspondences during gathering and scattering (Figure 2 (b, c)). Inference with a larger batch size facilitates parallel computation, making latency more dependent on computation versus kernel launching or memory access. See Appendix C.1 for empirical analysis.
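The bookkeeping behind batched gathering and scattering can be sketched as follows. This is a minimal illustration in plain Python, assuming one binary patch mask per sample; the helper names `gather_patches` and `scatter_patches` are hypothetical, and the per-patch values stand in for the outputs of the actual convolution.

```python
# Sketch: record (sample, row, col) for every selected patch so that results
# computed in one dense batch can be scattered back to the right sample.

def gather_patches(masks):
    """masks: list over samples of 2-D 0/1 patch masks -> list of index records."""
    index = []
    for n, mask in enumerate(masks):
        for i, row in enumerate(mask):
            for j, keep in enumerate(row):
                if keep:
                    index.append((n, i, j))
    return index


def scatter_patches(values, index, batch, rows, cols, fill=0.0):
    """Place each computed value at the location stored in its index record."""
    out = [[[fill] * cols for _ in range(rows)] for _ in range(batch)]
    for value, (n, i, j) in zip(values, index):
        out[n][i][j] = value
    return out


masks = [[[1, 0], [0, 0]],   # sample 0 activates one patch
         [[0, 1], [1, 1]]]   # sample 1 activates three patches
index = gather_patches(masks)
values = [10.0 * (k + 1) for k in range(len(index))]  # stand-in for conv outputs
restored = scatter_patches(values, index, batch=2, rows=2, cols=2)
```

Because samples may activate different numbers of patches, the gathered batch has a data-dependent length; the stored indices are what allow a single dense computation to serve all samples.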

3.5 Training

We further propose to leverage the static counterparts of our dynamic networks as “teachers” to guide the optimization procedure. Let $\mathbf{y}$ and $\mathbf{y}^{\prime}$ denote the output logits of a dynamic “student” model and its static “teacher”, respectively. Our final loss is given in Eq. (3).
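The exact form of Eq. (3) is not reproduced in this text, so the following is only a hedged sketch of how a teacher-guided loss of this kind is commonly assembled. It assumes the total loss is a task loss, plus an $\alpha$-weighted sparsity regularizer, plus a $\beta$-weighted, temperature-scaled KL term between teacher and student predictions. The structure, the direction of the KL term, the $T^2$ scaling, and the names `sparsity_loss` and `total_loss` are assumptions; only the values $\alpha=10$, $\beta=0.5$ and $T=4.0$ come from the experimental settings.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_loss(task_loss, sparsity_loss, student_logits, teacher_logits,
               alpha=10.0, beta=0.5, temperature=4.0):
    """Assumed combination: task + alpha * sparsity + beta * T^2 * KL(teacher || student)."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    distill = temperature ** 2 * kl_divergence(teacher, student)
    return task_loss + alpha * sparsity_loss + beta * distill


same = total_loss(1.0, 0.1, [2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
different = total_loss(1.0, 0.1, [0.0, 0.0, 0.0], [2.0, 0.5, -1.0])
```

When the student reproduces the teacher's logits the distillation term vanishes, so `same` reduces to the task and sparsity terms, while `different` is strictly larger.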

Experiments

Image classification experiments are conducted on the ImageNet dataset. We implement our LAUDNet on five representative architectures spanning a broad spectrum of computational costs: four CNNs (ResNet-50, ResNet-101, RegNetY-400M, RegNetY-800M) and a vision Transformer, T2T-ViT. Different training settings are used for CNNs and Transformers. For CNNs, following established methodology, we initialize the backbone parameters from a torchvision pre-trained checkpoint (https://pytorch.org/vision/stable/models.html), and finetune the whole network for 100 epochs employing the loss function in Eq. (3). We fix $\alpha=10$, $\beta=0.5$ and $T=4.0$ for all dynamic models. Note that we adopt the pretrain-finetune paradigm mainly to reduce the training cost, as Gumbel Softmax usually requires longer training for convergence. For our study on T2T-ViT, we use the same setup as described in AdaViT and evaluate the efficiency of its various dynamic inference methods through our latency predictor.

Latency prediction. We evaluate our LAUDNet on various types of hardware platforms, including two server GPUs (Tesla V100 and RTX3090), a desktop GPU (RTX3060) and two edge devices (Jetson TX2 and Nvidia Nano). The major properties considered by our latency prediction model include the number of processing engines (#PE), the floating-point computation in a processing engine (#FP32), the frequency and the bandwidth. It can be observed from Tab. IV that server GPUs generally have a larger #PE than IoT devices. If not stated otherwise, the batch size is set as 128 for V100, RTX3090 and RTX3060 GPUs. On edge devices TX2 and Nano, the testing batch size is fixed as 1.

4.2 Latency prediction results

Despite the implementation of our optimized scheduling strategies, pixel-level dynamic convolution ($S=1$) does not consistently enhance practical efficiency. This approach to fine-grained adaptive inference has been adopted in previous works. Our findings help elucidate why these studies only managed to achieve realistic speedup on less potent CPUs or specialized devices;

An excessively large $S$ (indicating less flexible adaptive inference) provides negligible improvement on both devices. In particular, increasing $S$ from 7 to 14 in the second stage of LAUD-RegNetY-800MF on TX2 detrimentally impacts efficiency. This is hypothesized to be due to the oversized patch size causing additional memory access costs on this device, which has fewer processing engines (PEs);

Layer skipping (marked by $\star$) consistently outperforms spatial-wise dynamic computation (marked by $\bullet$). We will analyze their performance across various vision tasks in Sec. 4.3 and Sec. 4.5.

Ablation study of batch size. To establish a suitable testing batch size, we plot the relationship between latency per image and batch size for LAUD-ResNet-50 in Figure 8. Two server-end GPUs (V100 and RTX3090) are tested. The results show that latency decreases as the batch size increases, eventually reaching a stable plateau when the batch size exceeds 128 on both platforms. This is expected, since a larger batch size favors enhanced computation parallelism, resulting in latency becoming more dependent on theoretical computation. The results on the desktop-level GPU, RTX3060 (Figure 12 in Appendix C.1), show a similar phenomenon. Based on these observations, we report the latency on server-end and desktop-level GPUs with a batch size of 128 henceforth.

4.3 ImageNet classification

Elevating the channel granularity $G$ from 1 to 2 yields some speedup for ResNet-50, but gives comparable performance in the case of RegNetY-800M. We hypothesize that a larger $G$ is only beneficial for models with more channels, which also aligns with observations from Figure 7.

4.3.2 Comparison of dynamic paradigms

Having decided on the optimal granularities, we compare the different dynamic paradigms in more detail. Additionally, our LAUDNet is compared to various competitive baselines. The findings are illustrated in Figure 10.

Standard baseline comparison: ResNets. The compared baselines include various types of dynamic inference approaches: 1) layer skipping (SkipNet and Conv-AIG); 2) channel skipping (BAS); and 3) pixel-level spatial-wise dynamic network (DynConv). For our LAUDNet, we select the best granularity settings for spatial-wise and channel-wise dynamic inference. Layer skipping implemented in our framework is also included. We set training targets (cf. Sec. 3.5) $t\in\{0.4,\cdots,0.8\}$ for our dynamic models to evaluate their performance across different sparsity regimes. We apply scheduling optimization (Sec. 3.4) uniformly across all models for a fair comparison.

The results are exhibited in Figure 10 (a). On the left we plot the relationship between accuracy and FLOPs. It becomes obvious that our LAUD-ResNets, with various granularity settings, considerably outperform competing dynamic networks. Moreover, on ResNet-101, the three paradigms seem fairly comparable, whereas, on ResNet-50, layer skipping falls behind, especially when the training target is small. This is understandable because layer skipping might be overly aggressive for more shallow models.

Interestingly, the scenario alters as we explore real latency (middle on TX2 and right on V100). On the less potent TX2, latency generally exhibits a stronger correlation with theoretical FLOPs, given that inference is computation-bounded (i.e., the latency is dominated by computation) on such IoT devices. However, different dynamic paradigms yield varying acceleration impacts on the server-end GPU, V100, as latency could be impacted by the memory access cost. For instance, layer skipping takes precedence over the other two paradigms on the deeper ResNet-101. With the target activation rate $t=0.4$, our LAUDl-ResNet-101 reduces the inference latency of its static counterpart by $\sim$53%. On the shallower ResNet-50, channel skipping keeps pace with layer skipping on some low-FLOPs models. Although our proposed coarse-grained spatially adaptive inference trails behind the other two schemes, it significantly outperforms the previous work using pixel-level dynamic computation. The additional results in Appendix C.2 also demonstrate the preferable efficiency of layer skipping on RTX3060 and RTX3090. Channel skipping outperforms the other two paradigms only on the edge device, Nvidia Nano.

Lightweight baseline comparison: RegNets. We further evaluate our LAUDNet in lightweight CNN architectures, i.e. RegNets-Y. Two different sized models are tested: RegNetY-400MF and RegNetY-800MF. Compared baselines include other types of efficient models, e.g., MobileNets-v2, ShuffleNets-v2 and CondenseNets.

The results are presented in Figure 10 (b). We observe that while channel skipping surpasses the other two paradigms substantially in the accuracy-FLOPs trade-off, it is less efficient than layer skipping on most models except RegNet-Y-800M. Remarkably, layer skipping emerges as the most dominant paradigm. We theorize that this is due to the model width (number of channels) of RegNet-Y being limited, and the inference latency still being bounded by memory access. Moreover, layer skipping enables skipping the memory-bounded SE operation. The results on desktop-level and server-end GPUs (Appendix C.2) further showcase the superiority of layer skipping.

Experiments on vision Transformers. As described in Sec. 3.2, our LAUDNet integrates with vision Transformers using the AdaViT framework. Existing studies employ all three dynamic paradigms simultaneously without comparing them directly, which leaves open the question of which paradigm offers the best balance between accuracy and efficiency. We address this by showing the accuracy-latency trade-off curves for LAUD-T2T-ViT on TX2, RTX3060, and V100 (the performance on RTX3090 is similar to that on V100) in Figure 10 (c). The findings highlight several key insights:

Layer skipping and head (channel) skipping are more advantageous for maintaining high accuracy at high activation rates, though both experience a significant accuracy decline at reduced activation rates.

When evaluating the balance between practical latency and accuracy, layer skipping consistently outperforms head (channel) skipping on all platforms.

Despite its lower theoretical upper bound on accuracy, spatial-wise adaptive computation (token skipping) might excel over the other paradigms at lower activation rates. Its practical latency benefits stem from the straightforward implementation of indexing and selection operations on GPUs, which does not require specialized operators as in CNNs.

A synergistic application of all three paradigms further enhances the accuracy-efficiency trade-off, showing the complementary strengths of each approach.

4.4 Visualization and interpretability

We present visualization results of LAUDNet to delve into its interpretability from the perspectives of networks’ structural redundancy and images’ spatial redundancy.

Channel skipping results in activation rates that are more centered around 0.5 throughout the network.

4.5 Dense prediction tasks

Our LAUDNet is further evaluated on downstream tasks, i.e., COCO object detection (Tab. II) and instance segmentation (Tab. III). For object detection, the mean average precision (mAP) measures network efficacy. For instance segmentation, AP$^{mask}$ additionally measures the quality of dense prediction. The average backbone FLOPs and the average backbone latency on the validation set are used to measure network efficiency. Thanks to LAUDNet’s generality, we can directly replace the backbones in various detection and segmentation frameworks with our models pre-trained on ImageNet, then fine-tune them on the COCO dataset under the standard protocol for 12 epochs. The exception is models based on Mask2Former, which are trained for 50 epochs in line with the baseline configurations (detailed settings are given in Appendix B.3). For object detection, our experiments cover three frameworks: the two-stage Faster R-CNN with Feature Pyramid Network, the one-stage RetinaNet, and a DETR-based model, namely Dense Distinct Query (DDQ)-DETR. We compare our results against a range of recent advancements, such as Deformable DETR, DINO-DETR, Rank-DETR, and Stable-DINO. For instance segmentation, we utilize the well-established Mask R-CNN and the query-based Mask2Former. The results in Tab. II and Tab. III show that LAUDNet consistently improves both mAP and efficiency across classic and state-of-the-art (SOTA) frameworks. Notably, while channel and layer skipping generally surpass spatial-wise dynamic computation in efficiency, the ideal dynamic paradigm may vary depending on the specific detection framework, backbone architecture, and hardware platform.

Conclusion

In this paper, we propose to build latency-aware unified dynamic networks (LAUDNet) under the guidance of a latency prediction model. By collectively considering the algorithm, scheduling strategy, and hardware properties, we can accurately estimate the practical latency of different dynamic operators on any computing platform. Based on an empirical analysis of the correlation between latency and the granularity of spatial-wise and channel-wise adaptive inference, the algorithm and scheduling strategies are optimized to attain realistic speedup on a range of multi-core processors, such as Tesla V100 and Jetson TX2. Our experiments on image classification, object detection, and instance segmentation tasks affirm that the proposed method markedly boosts the practical efficiency of deep CNNs and surpasses numerous competing approaches. We believe our research brings useful insights into the design of dynamic networks. Future work includes exploring more types of model architectures (e.g. Transformers, large language models) and tasks (e.g. low-level vision tasks and vision-language tasks).

Acknowledgments

This work is supported in part by the National Key R&D Program of China under Grant 2021ZD0140407, the National Natural Science Foundation of China under Grants 42327901 and 62276150, and Guoqiang Institute of Tsinghua University. We also appreciate the generous donation of computing resources by High-Flyer AI.

Appendix A Latency prediction model

As the dynamic operators in our method are not supported by current deep learning libraries, we propose a latency prediction model to efficiently estimate the real latency of these operators on hardware devices. The inputs of the latency prediction model include: 1) the structural configuration and dynamic paradigm of a convolution block, 2) its activation rate $r$, which decides the computation amount, 3) the spatial (channel) granularity $S$ ($G$), and 4) the hardware properties mentioned in Table IV. The latency of a dynamic block is predicted as follows.

Operation-to-hardware mapping. Next, we map the operations to hardware. As illustrated in Figure 4, we model a hardware device as multiple processing engines (PEs). We assign the computation of each element in the output feature map to a PE. Specifically, we consecutively split the output feature map into multiple tiles. The shape of each tile is $T_{P}\times T_{C}\times T_{S1}\times T_{S2}$. These split tiles are assigned to multiple PEs. The computation of the elements in each tile is executed in a PE. We can configure different shapes of tiles. In order to determine the optimal shape of the tile, we make a search space of different tile shapes. The tile shape has 4 dimensions. The candidates of each dimension are powers of 2 and do not exceed the corresponding dimension of the feature map.

Latency estimation. Then, we evaluate the latency of each tile shape in the search space and select the optimal tile shape with the lowest latency. The latency includes the data movement latency and the computation latency.

The latency of data movement is affected by the granularity $S$ or $G$: when the granularity is small, the same input data is more likely to be sent to multiple PEs to compute different output patches, which significantly increases the amount of on-chip memory movement. Moreover, because only a small amount of data is transmitted each time and the data is randomly distributed, data movement is inefficient. This accounts for our experimental results in the paper showing that a larger $S$ effectively improves the practical efficiency.
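A simplified count illustrates why small granularity inflates data movement. The sketch below is not the latency model itself: it assumes a stride-1 $3\times 3$ convolution and a single isolated $S\times S$ output patch, and it ignores caching and any sharing between neighbouring activated patches. The function name is hypothetical.

```python
def input_reads_per_output(S, kernel=3):
    """Input elements fetched per output element for one isolated S x S patch.

    Assumes a stride-1 kernel x kernel convolution, so the patch needs an
    (S + kernel - 1) x (S + kernel - 1) input window.
    """
    window = S + kernel - 1
    return (window * window) / (S * S)


for S in (1, 2, 4, 8):
    print(f"S={S}: {input_reads_per_output(S):.4f} input reads per output element")
```

Under these assumptions the ratio falls from 9 at $S=1$ to 2.25 at $S=4$, consistent with the observation that coarser patches move less data per unit of computation.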

To summarize, our latency prediction model predicts the real latency of dynamic operators by considering both the data movement cost and the computation cost. Guided by the latency prediction model, we propose our LAUDNet with coarse-grained spatially adaptive inference ($S\!>\!1$ and $G\!>\!1$). The paper validates that LAUDNet achieves better efficiency than previous approaches ($S\!=\!1$), as it effectively reduces the data movement latency, which is rarely considered by other researchers.

Appendix B Detailed experimental settings

In this section, we present the detailed experiment settings which are not provided in the main paper due to the page limit.

Hardware properties considered by our latency prediction model include the number of processing engines (#PE), the floating-point computation in a processing engine (#FP32), the frequency and the bandwidth. We test four types of hardware devices, and their properties are listed in Table IV.

It can be seen that the server-end GPUs V100 and RTX3090 are more powerful hardware devices, in particular having the largest numbers of processing engines (#PE). Therefore, spatially adaptive inference and dynamic channel skipping can easily become memory-bound operations on these GPUs. Our experimental results in Figure 6 and Figure 9 of the paper reflect this phenomenon: the more flexible the computation is, the harder it is to improve the practical efficiency.

1) Fusing the masker and the first convolution. We mentioned in Sec. 3.4 of the paper that the masker operation is fused with the first $1\times 1$ convolution in a block to reduce the memory access cost. This is feasible because the two operators share the same input feature, and their convolution kernel sizes are both $1\times 1$.

Afterwards, we fuse the masker with the first convolution by performing a single convolution whose output channel number is $C+1$, where $C$ is the original output width of the first convolution. The output of this step is split into a feature map (for further computation) and a mask (for obtaining the indices for gathering). Such operator fusion avoids repeatedly reading the input feature and helps reduce the inference latency (Tab. I).

2) Fusing the gather operation and the dynamic convolution. To facilitate scheduling on hardware devices with multiple PEs, the masker generates the indices of activated patches instead of a sparse mask at inference time. In this way, the computation of output patches can be evenly distributed to different PEs, avoiding unbalanced workloads. Each element in the indices represents the index of an activated patch. Each PE fetches the input data from the corresponding positions on the feature map according to the index, and the output patches can be densely stored in memory. Such operator fusion benefits contiguous memory access and parallel computation on multiple PEs.

3) Fusing the scatter operation and the add operation. Similar to the previous operation, each PE fetches a tile of data from the residual feature map according to the index, adds it to the corresponding output of the preceding dynamic convolution, and then stores the result back to the corresponding position on the residual feature map according to the index. This optimization can significantly reduce the memory access cost.
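The data flow of the three fusions can be sketched in NumPy. This is an illustrative sketch under several assumptions, not the CUDA implementation: all sizes are hypothetical, patch decisions are obtained by averaging the mask logits inside each $S\times S$ patch and thresholding at zero, and a $1\times 1$ convolution stands in for the dynamic convolution so that no halo pixels are needed. A Python loop plays the role of the PEs.

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C, H, W, S = 8, 6, 8, 8, 4           # hypothetical sizes; S is the patch size

x = rng.standard_normal((C_in, H, W))      # block input, also used as the residual
w_conv = rng.standard_normal((C, C_in))    # first 1x1 convolution
w_mask = rng.standard_normal((1, C_in))    # 1x1 masker with one output channel
w_dyn = rng.standard_normal((C_in, C))     # stand-in for the dynamic convolution


def conv1x1(weight, feature):
    """Pointwise convolution: (out, in) weights applied to a (in, h, w) feature."""
    return np.einsum("oc,chw->ohw", weight, feature)


# 1) One convolution with C + 1 output channels, then split feature and mask.
fused = conv1x1(np.concatenate([w_conv, w_mask], axis=0), x)
feature, mask_logits = fused[:C], fused[C]

# Assumed patch-level decision rule; yields indices of activated patches.
patch_logits = mask_logits.reshape(H // S, S, W // S, S).mean(axis=(1, 3))
indices = np.flatnonzero(patch_logits > 0)


def patch_slices(idx):
    """Map a row-major patch index to its row and column slices."""
    i, j = divmod(int(idx), W // S)
    return slice(i * S, (i + 1) * S), slice(j * S, (j + 1) * S)


# 2) Gather by index, compute, and store the outputs densely.
dense_out = np.empty((len(indices), C_in, S, S))
for slot, idx in enumerate(indices):
    rows, cols = patch_slices(idx)
    dense_out[slot] = conv1x1(w_dyn, feature[:, rows, cols])

# 3) Scatter-add the dense outputs back onto the residual feature map.
out = x.copy()
for slot, idx in enumerate(indices):
    rows, cols = patch_slices(idx)
    out[:, rows, cols] += dense_out[slot]
```

Because the work list is a flat array of indices, it can be split evenly across PEs regardless of where the activated patches lie in the image.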

Speed test. We test the latency on real hardware devices to evaluate the accuracy of our latency prediction model. On GPUs, we use NVIDIA CUTLASS (https://github.com/NVIDIA/cutlass) and CUDA (version 11.6) for code generation and compilation, respectively. The results in Figure 5 of the paper validate that the predictions obtained from our latency predictor align well with the measured values.

B.2 ImageNet classification

We use pre-trained CNN models from the official torchvision website to initialize our backbone parameters, and finetune the overall models for 100 epochs. The initial learning rate is set as $0.01\times$ batch size$/128$ and decays with a cosine shape. The training batch size is determined by the model size and the GPU memory. For example, we train our LAUD-ResNet-101 on 8 RTX 3090 GPUs with a batch size of 512, and the batch size for LAUD-ResNet-50 is doubled. We use the same weight decay and the standard data augmentation as in the RegNet paper. The Gumbel temperature $\tau$ in Eq. (1) of the paper exponentially decreases from 5 to 0.1 during training. For the training hyper-parameters in Eq. (2), we simply fix $\alpha=10$, $\beta=0.5$ and $T=4.0$ for all dynamic models. We conduct a very simple grid search with a RegNet over $\beta\in\{0.3,0.5\}$ and $T\in\{1.0,4.0\}$ to determine their values.
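The two schedules described above can be sketched as follows. The exact functional forms are assumptions: the cosine schedule is taken to decay to zero with per-epoch updates, and the exponential temperature decay is written as a geometric interpolation between 5 and 0.1. The function names are hypothetical.

```python
import math


def initial_lr(batch_size):
    """Linear scaling rule from the text: 0.01 x batch size / 128."""
    return 0.01 * batch_size / 128


def cosine_lr(epoch, total_epochs, batch_size):
    """Cosine decay from the initial learning rate to zero (assumed end value)."""
    return 0.5 * initial_lr(batch_size) * (1 + math.cos(math.pi * epoch / total_epochs))


def gumbel_temperature(epoch, total_epochs, tau_start=5.0, tau_end=0.1):
    """Exponential (geometric) decay of the Gumbel temperature tau."""
    return tau_start * (tau_end / tau_start) ** (epoch / total_epochs)


for epoch in (0, 50, 100):
    lr = cosine_lr(epoch, 100, batch_size=512)
    tau = gumbel_temperature(epoch, 100)
    print(f"epoch {epoch:3d}: lr={lr:.4f}, tau={tau:.3f}")
```

With a batch size of 512, the scaling rule gives an initial learning rate of 0.04.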

B.3 COCO object detection & instance segmentation

We use the standard setting suggested in , except that we decrease the learning rate for our pre-trained backbone network. We simply set a learning rate multiplier of 0.5 for Faster R-CNN, 0.2 for RetinaNet, and 0.5 for Mask R-CNN. For DDQ-DETR and Mask2Former, we follow the standard setting and set the learning rate multiplier to 0.1. As for the additional loss terms, the hyper-parameters are kept the same as for training our classification models, except that the temperature is fixed at 0.1 throughout the 12 training epochs. The input images are resized to a short side of 800 with a long side not exceeding 1333.
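The resizing rule in the last sentence can be written as a small helper. This is a sketch of the common aspect-preserving convention; the function name and the rounding to the nearest integer are assumptions, and detection frameworks may round differently.

```python
def resized_shape(height, width, short_side=800, max_long_side=1333):
    """Scale so the short side is short_side unless the long side would exceed the cap."""
    scale = short_side / min(height, width)
    if max(height, width) * scale > max_long_side:
        # The long side is the binding constraint: scale to the cap instead.
        scale = max_long_side / max(height, width)
    return round(height * scale), round(width * scale)


print(resized_shape(480, 640))    # short side reaches 800
print(resized_shape(500, 1500))   # long side is capped at 1333
```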

Appendix C More experimental results

In this section, we report more experimental results which are not presented in the main paper.

Design of channel masker. We mentioned in Sec. 3.2 that our channel maskers are designed as a 2-layer MLP with reduced hidden units. This design is determined under the guidance of our latency predictor. Specifically, we compare the latency of two choices with a LAUDc-ResNet-101 on TX2: a single linear layer and our 2-layer MLP. The latency of the channel maskers in the 4 stages and their ratios to the latency of the ResNet blocks are reported in Table V. The results reveal that in late stages, where channel numbers are large, our 2-layer MLP with reduced hidden units significantly reduces the latency.

Batch size. The relationship between inference latency and batch size of LAUD-ResNet-50 on the desktop-level GPU, RTX3060, is presented in Figure 12. The trend is similar to that on the server-end GPUs presented in the main paper (Figure 8).

C.2 ImageNet classification

Results on more hardware devices. In Figure 10 of the paper, we report the ImageNet classification results of LAUD-ResNet on V100 and TX2, and those of LAUD-RegNet-Y on TX2 and Nano. Here we present the results on other hardware platforms. The results in Figure 13 show that the optimal dynamic-inference paradigm depends on the backbone and the hardware device. For example, channel skipping demonstrates its advantages on Nano for LAUD-ResNet-50 and ResNet-101, while layer skipping significantly outperforms the other two schemes on the more powerful devices, RTX3060 and RTX3090.

C.3 Visualization results