Computation-efficient Deep Learning for Computer Vision: A Survey
Yulin Wang, Yizeng Han, Chaofei Wang, Shiji Song, Qi Tian, Gao Huang
Introduction
Over the past decade, the field of computer vision has experienced significant advancements in deep learning. Innovations in model architectures and learning algorithms have allowed deep networks to approach or even exceed human-level performance on benchmark competition datasets for a wide range of visual tasks, such as image recognition, object detection, image segmentation, video understanding, and 3D perception. This considerable progress has stimulated interest in deploying deep models in practical applications, including self-driving cars, mobile devices, robotics, unmanned aerial vehicles, and internet of things devices.
However, the demands of real-world applications are distinct from those of competitions. Models achieving state-of-the-art accuracy in competitions are often computationally intensive and resource-hungry during inference. In practice, computation typically translates into latency, power consumption, and carbon emissions. Low-latency or real-time inference is crucial for ensuring security and enhancing user experience. Deep learning systems must prioritize low power consumption to improve battery life or reduce energy costs. Minimizing carbon emissions is also essential for environmental considerations. Motivated by these practical challenges, a substantial portion of recent literature focuses on achieving a balance between effectiveness and computational efficiency. Ideally, deep learning models should yield accurate predictions while minimizing the computational cost during inference. This topic has given rise to numerous intriguing research questions and garnered significant attention from both academic and industrial sectors.
In light of these developments, this survey presents a comprehensive and systematic review of the exploration towards computationally efficient deep learning. Our aim is to provide an overview of this rapidly evolving field, summarize recent advances, and identify important challenges and potential directions for future research. Specifically, we will discuss existing works from the perspective of the following five directions:
1) Efficient Backbone Models. Designing lightweight backbone networks that effectively extract discriminative deep representations from images, videos, or 3D scenes with minimal computation, by improving both the network micro-architectures (e.g., operators, modules, and layers) and the system-level organization of micro-architectures. Recent advances in neural architecture search (NAS) have further enabled the automatic design of backbones.
2) Dynamic Deep Networks. Developing dynamic networks is an important emerging research direction for improving computational efficiency. These networks break the limits of static computational graphs and adapt their structures or parameters to the input during inference. For example, the model can selectively activate certain model components (e.g., layers, channels, and sub-networks) based on each test input or allocate less computation to less informative spatial/temporal regions of each input.
3) Task-specialized Efficient Models. Numerous works focus on building task-specific heads on top of the features from lightweight static/dynamic backbones to efficiently accomplish specific computer vision tasks. Examples include fast one-stage models for real-time object detection, the efficient multi-branch architecture for semantic segmentation, and end-to-end instance segmentation frameworks.
4) Model Compression Techniques. Orthogonal to network architecture design, many algorithms have been proposed to compress relatively large models with minimal accuracy loss. This can be achieved by pruning less important network components, quantizing parameters, or distilling knowledge from large models to smaller models of interest.
5) Efficient Deployment on Hardware. To achieve high practical efficiency, it is necessary to consider hardware requirements when developing deep learning applications. Reducing latency on specific hardware devices is usually treated as an objective in network design or algorithm-hardware co-design. Additionally, several acceleration tools have been developed for efficient deployment of deep learning models.
While some relevant surveys exist, our survey is more up-to-date and comprehensive in several crucial aspects: 1) we systematically review model design techniques for images, videos, and 3D vision; 2) we summarize the recent works on designing dynamic deep neural networks for efficient inference; and 3) we thoroughly discuss the specialized models for accomplishing the most common and challenging computer vision tasks, e.g., object detection and image segmentation.
The rest of this survey is organized as follows (see Figure 1 for the overview). In Sec. 2 and 3, we introduce the design of efficient static and dynamic backbone networks, respectively. In Sec. 4, the methodology for designing task-specialized efficient models is reviewed. The techniques for compressing deep learning models are investigated in Sec. 5. Efficient hardware deployment approaches are summarized in Sec. 6. Lastly, we discuss existing challenges and future directions in Sec. 7.
Architecture Design of Backbone Networks
Typically, deep learning models for computer vision tasks incorporate two components: 1) a backbone network that extracts deep representations from the raw inputs (e.g., images, video frames, and point clouds); and 2) a task-specific head that is specialized for the task of interest. The deep features obtained from backbone networks are fed into the head to accomplish the corresponding task. The outputs of backbones (i.e., the inputs of the head) are usually assumed to have similar formats, while the outputs of the head are tailored for the tasks of interest.
In this section, we focus on how to design a computationally efficient general-purpose backbone network. Our discussion starts from the most fundamental data form, 2D images, where a lightweight network may be obtained by either manual design (Sec. 2.1) or automatic search (Sec. 2.2). Then we discuss the backbones for processing videos (Sec. 2.3) and understanding 3D scenes (Sec. 2.4).
A considerable number of efficient backbone networks are designed manually based on theoretical derivations, empirical observations, or heuristics. Existing works can be categorized into two levels according to the granularity of modifying the network: micro-architecture (Sec. 2.1.1) and macro-architecture (Sec. 2.1.2).
The micro-architecture refers to the individual layers, modules, and neural operators of backbones. These basic components are the foundation for constructing deep networks. Many works seek to attain higher computational efficiency by improving them. Notably, these works usually serve as off-the-shelf plug-in components that can be employed together with other techniques.
However, such dense layers tend to be computationally intensive. To address this issue, researchers have proposed to replace the dense connection with specially designed topologies, which dramatically reduces the computational complexity yet yields a competitive or stronger representation ability. Among existing works, one of the most popular designs is the split-transform-merge strategy (with a residual connection added as a fundamental component).
3) Feature Reusing. Conventionally, the successive linear connection is the dominant topology for network design. The inputs are fed into a layer and transformed to obtain the inputs of the next layer, so each feature is used only once. Although straightforward, this design is usually sub-optimal from the lens of computational efficiency. An important idea for lightweight models is to reuse previously computed features.
a) Inter-layer feature reusing. A basic idea is to reuse the features from previous layers. The skip-layer residual connection adds the inputs of each layer to the outputs, contributing to the effective training of very deep and computationally more efficient networks. A more general formulation is established by dense connection, where all the previous features are fed into the next layer. CondenseNets extend this architecture by automatically learning the inter-layer connection topology. In contrast, other works like ShuffleNetV2 and G-GhostNet focus on manually designing inter-layer interaction mechanisms.
b) Intra-layer feature reusing. The idea of feature reusing can also be leveraged within each network layer. For example, GhostNets demonstrate that there exists considerable redundancy in the outputs of each layer. They first obtain a small set of intrinsic output features, which are not only used as the inputs of the next layer, but also reused to generate other output features using cheap operations like linear transformations.
4) Feature Down-sampling. Extracting deep representations from image-based data typically yields feature maps, which inherently have spatial sizes (i.e., height and width). This property can be leveraged to reduce the computational cost of models, e.g., by introducing properly configured feature down-sampling modules.
a) Processing feature maps efficiently. The cost of processing feature maps grows quadratically with respect to their height/width. OctConv finds that processing all the features with the same resolution is not an optimal design. It proposes to process a group of features at a down-sampled scale to capture only the low-frequency information, while the remaining features are designed to recognize high-frequency patterns, and the two groups exchange information after each layer. Consequently, the overall computational cost is reduced. This idea is also effective in ViTs. Similarly, HRNets and HRFormer maintain multi-resolution features at each layer, aiming to efficiently extract multi-scale discriminative representations for various computer vision tasks.
b) Facilitating efficient self-attention. Particularly, feature down-sampling can be embedded into self-attention operations in ViTs to improve their efficiency. For example, PVTs and ShuntedViT propose to compute attention maps efficiently with down-sampled feature maps. Twins perform self-attention on low-resolution features to aggregate global information efficiently.
5) Efficient Self-attention. ViTs have achieved remarkable success in the fields of computer vision. Their self-attention mechanisms enable adaptively aggregating information across the entire image, yielding excellent scalability with the growing dataset scale or model size. However, vanilla self-attention suffers from high computational cost. A considerable number of recent visual backbones focus on developing more efficient self-attention modules without sacrificing their performance.
a) Locality-inspired Self-attention. In this direction, an important idea is drawn from the success of ConvNets: exploiting the locality of images, i.e., encouraging the models to aggregate more information from adjacent spatial regions. Swin Transformers achieve this by performing self-attention only within square windows. Some other works extend this idea by designing different shapes of attention windows or introducing soft local constraints to attention maps. An important challenge faced by these works is how to model the interaction of different windows effectively. Possible solutions to address this issue include changing window positions, shuffling the channels, designing specialized window shapes, or further introducing window-level global self-attention modules.
b) SoftMax-free Self-attention. To reduce the inherent high computation complexity of self-attention, another line of research proposes to replace the SoftMax function in self-attention with separate kernel functions, yielding linear attention. As representative examples, Performer approximates SoftMax with orthogonal random features, while Nyströmformer and SOFT attain this goal through matrix decomposition. Castling-ViT measures the spectral similarity between tokens with linear angular kernels. EfficientViT further leverages depth-wise convolution to improve the local feature extraction ability of linear attention. FLatten Transformer proposes a focused linear attention module to achieve high expressiveness.
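The reordering behind linear attention can be illustrated with a minimal pure-Python sketch. The feature map `phi` (ELU plus one) and all helper names are assumptions chosen for illustration and do not correspond to any specific method named above. The sketch only shows that computing phi(K)^T V first gives the same output as forming the full n x n attention map, while its cost grows linearly rather than quadratically with the number of tokens n.

```python
import math


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def phi(m):
    """Assumed positive kernel feature map (ELU + 1), applied element-wise."""
    return [[x + 1.0 if x > 0 else math.exp(x) for x in row] for row in m]


def quadratic_order(q, k, v):
    """(phi(Q) phi(K)^T) V: materialises an n x n attention map."""
    scores = matmul(phi(q), transpose(phi(k)))
    normed = [[s / sum(row) for s in row] for row in scores]
    return matmul(normed, v)


def linear_order(q, k, v):
    """phi(Q) (phi(K)^T V): only d x d_v and length-d summaries are stored."""
    fq, fk = phi(q), phi(k)
    kv = matmul(transpose(fk), v)              # d x d_v summary of keys and values
    k_sum = [sum(col) for col in zip(*fk)]     # length-d normaliser
    out = matmul(fq, kv)
    denom = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[x / z for x in row] for row, z in zip(out, denom)]


# Three tokens with two-dimensional queries, keys, and values.
q = [[0.1, -0.2], [0.5, 0.3], [-1.0, 0.7]]
k = [[0.4, 0.0], [-0.3, 0.9], [0.2, -0.5]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out_quadratic = quadratic_order(q, k, v)
out_linear = linear_order(q, k, v)
```

Both orderings agree up to floating-point error because matrix multiplication is associative once the SoftMax is replaced by a separable kernel.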
2.1.2 Macro-architecture
The macro-architecture refers to the system-level methodology of organizing micro-architectures (e.g., operators, modules and layers) and constructing the whole deep networks. Existing literature has revealed that, even with the same efficient micro-architectures, the approaches and configurations for combining them will significantly affect the computational efficiency of the resulting models. In the following, we will discuss the works and design principles relevant to this topic.
1) Marrying Convolution and Attention Modules. Convolution and self-attention are both important modules with their own strengths. A considerable amount of literature has been published to study how to combine them for a higher overall computational efficiency. At the per-layer level, convolution can be leveraged to generate the inputs of self-attention, e.g., queries/keys/values or position embeddings. In addition, some works simultaneously utilize self-attention and a convolutional layer, and fuse their outputs, which facilitates the learning of local features. Another promising idea is to integrate convolution into the feed-forward network after the self-attention module.
At the network level, many existing works focus on the placement order of self-attention and depth-wise convolution blocks. In particular, leveraging convolution at earlier layers is proven beneficial, which enables the efficient extraction of local representations. Besides, convolutional blocks are usually adopted as lightweight down-sampling layers. Another line of works runs a self-attention path and a convolution path in parallel within a single model, where the two paths typically interact in a layer-wise fashion.
2) Depth-width Relationship. In the context of ConvNets and hierarchical ViTs, the backbone models consist of multiple stages with progressively reduced feature resolution. The layers within each stage usually have the same width, while later stages are wider. The stage-wise width growing rule is an important configuration, where it is popular to adopt an exponential growth with base two. Beyond this, RegNets propose a more detailed principle: widths and depths of good networks can be explained by a quantized linear function.
3) Model Scaling. On top of designing a single efficient model, it is also important to obtain a family of models that can adapt to varying computational budgets. An important principle for addressing this issue is compound scaling, which indicates that simultaneously increasing the depth, width, and input resolution of a given base model will yield a family of efficient network architectures. Dollár et al. further study how to design a proper model scaling rule in terms of the actual runtime. In addition, TinyNets extend this idea to the shrinking of the model size.
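As a rough illustration of compound scaling, the sketch below derives depth, width, and resolution multipliers from a single exponent. The function name and the default coefficients are illustrative assumptions (chosen so that the cost roughly doubles per step), and the cost model assumes a ConvNet whose computation grows linearly with depth and quadratically with width and resolution.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution, cost) multipliers for exponent `phi`.

    alpha, beta, gamma are assumed per-dimension growth rates.
    """
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    # Assumed cost model: linear in depth, quadratic in width and resolution.
    cost = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, cost


for phi in range(4):
    d, w, r, c = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, cost x{c:.2f}")
```

Sweeping the exponent yields a family of models from one base architecture, one for each computational budget.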
2.2 Automatic Architecture Design
Compared to manually designing backbones, another appealing idea is to find proper network architectures automatically, which is usually referred to as neural architecture search (NAS). In recent years, a number of existing works have investigated this idea through the lens of computational efficiency. In the following, we will discuss the basic computation-aware formulation of NAS (Sec. 2.2.1) and how the practical speed is considered in NAS (Sec. 2.2.2).
Typically, NAS consists of two components: a search space that contains a number of candidate architectures, and an algorithm to search for an optimal architecture. The computational cost of model inference is usually treated as a constraint, which is either inherently controlled by the search space or strictly restricted by a pre-defined rule. The optimization objective is to maximize the validation accuracy.
1) Early Works. Early NAS methods propose to formulate a discrete search space. The network is viewed as a graph with a number of nodes connected by edges, where each edge corresponds to an operation and one needs to find the optimal operation for each edge. Such a problem can be solved with discrete optimization algorithms. For example, by viewing the validation performance as the reward, one can leverage off-the-shelf reinforcement learning methods. Moreover, evolutionary algorithms also achieve favorable performance for discrete NAS.
2) Efficient Searching Algorithms. The aforementioned NAS methods are able to find computationally more efficient network architectures than human design. However, their search cost is a notable limitation, since the search procedure usually involves training many candidate networks from scratch to convergence to evaluate their validation accuracy. Motivated by this issue, a large number of works focus on developing low-cost NAS algorithms. A basic idea in this direction is to reuse the previous candidates, e.g., adding/deleting layers and paths on top of currently found architectures or adopting existing architectures as network components.
Driven by these preliminary explorations, ENAS and DARTS propose a parameter-sharing paradigm. They construct a large computational graph that contains all possible connections and operations, such that each subgraph within it corresponds to a network architecture. The large graph is called a super-net, and all possible candidate networks share the same super-net parameters. Hence, one can train the super-net and directly sample architectures from it without retraining any specific candidate network. The network selection process is usually formulated to be differentiable and accomplished efficiently via gradient-based optimization methods. Besides, some recent works focus on improving this procedure by introducing progressive searching mechanisms, introducing hyper-networks, or training more proper super-nets for NAS.
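Under assumed notation (none of these symbols appear in the original text), this constrained formulation can be sketched as follows, where $\mathcal{A}$ is the search space, $w^{*}(a)$ denotes the weights of architecture $a$ after training, and $B$ is the computational budget:

```latex
\max_{a \in \mathcal{A}} \ \mathrm{Acc}_{\mathrm{val}}\bigl(a, w^{*}(a)\bigr)
\quad \text{s.t.} \quad
w^{*}(a) = \arg\min_{w} \mathcal{L}_{\mathrm{train}}(a, w), \quad
\mathrm{Cost}(a) \le B .
```

When the search space itself only contains architectures within the budget, the explicit constraint $\mathrm{Cost}(a) \le B$ can be dropped.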
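The differentiable selection step can be illustrated with a toy sketch of a single super-net edge. The scalar "operations" below are hypothetical stand-ins for real candidates such as convolutions or pooling, and the function names are assumptions. The sketch shows only the general idea of a softmax-weighted mixture of operations that is later discretized, not the exact procedure of any particular method.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate operations on one edge of the super-net.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def mixed_op(x, arch_logits):
    """Continuous relaxation: weighted sum of all candidate operations."""
    weights = softmax(arch_logits)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))


def discretize(arch_logits):
    """After the search, keep only the operation with the largest logit."""
    names = list(CANDIDATE_OPS)
    return names[max(range(len(names)), key=lambda i: arch_logits[i])]


uniform_output = mixed_op(1.0, [0.0, 0.0, 0.0])  # average of 1.0, 2.0, and 0.0
chosen = discretize([0.1, 2.0, -1.0])
```

Because the mixture is differentiable with respect to the architecture logits, they can be optimized with gradient descent alongside the shared weights.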
2.2.2 Latency-aware Neural Architecture Search
From the lens of practical efficiency, an important challenge faced by NAS is the inference speed on real hardware (e.g., GPUs or CPUs). Since NAS usually leads to irregular network architectures, the obtained model with low theoretical computational cost may not be efficient in practice. To address this issue, recent NAS methods explicitly incorporate real latency into the optimization objective to achieve a good trade-off between real speed and accuracy. As representative examples, MobileNetV3 leverages hardware-aware NAS to obtain the basic architecture and then modifies it manually. Once-for-all proposes to train a shared general super-net and perform NAS on top of it conditioned on the specific hardware, yielding state-of-the-art efficiency.
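One simple way to fold latency into the search objective is sketched below, assuming a per-operator latency lookup table and a soft penalty on the accuracy. The table values, the exponent `w`, and all names are hypothetical and are not taken from any of the methods mentioned above.

```python
# Hypothetical per-operator latencies (milliseconds), as if profiled once on the target device.
LATENCY_TABLE_MS = {"conv3x3": 1.8, "dwconv3x3": 0.6, "conv1x1": 0.5, "skip": 0.0}


def estimate_latency(architecture):
    """Approximate the latency of a sequential architecture by summing table entries."""
    return sum(LATENCY_TABLE_MS[op] for op in architecture)


def latency_aware_score(accuracy, latency_ms, target_ms, w=-0.07):
    """Soft-constrained objective: accuracy is discounted when latency exceeds the target."""
    return accuracy * (latency_ms / target_ms) ** w


candidate = ["conv3x3", "dwconv3x3", "conv1x1", "skip"]
latency = estimate_latency(candidate)
score = latency_aware_score(accuracy=0.75, latency_ms=latency, target_ms=2.0)
```

A lookup table avoids measuring every candidate on the device during the search, at the price of ignoring interactions between consecutive operators.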
2.3 Efficient Backbones for Video Understanding
In this subsection, we will focus on the efficient backbones for processing videos. Notably, videos consist of a series of frames, each of which is an image. In general, the aforementioned techniques for processing images are typically compatible with videos. Hence, here we mainly review the efficient modeling of the temporal relationships of video frames, including ConvNet-based (Sec. 2.3.1) and Transformer-based (Sec. 2.3.2) approaches.
The most straightforward approach to modeling temporal relationships may be introducing 3D convolutional layers, such that one can directly perform convolution in the space formed by frame height, width, and video duration. However, 3D convolution is computationally expensive, and many efficient backbones have been proposed to alleviate this problem.
1) Marrying 2D and 3D Convolution. A basic idea is to avoid designing a pure 3D ConvNet, i.e., most of the feature extraction process may be accomplished by the efficient 2D convolution, while 3D convolution is only introduced at several particular positions. From the lens of macro-architecture, this goal can be attained by sequentially mixing 2D and 3D blocks, either first using 3D and later 2D or first 2D and later 3D. At the micro-architecture level, group-wise or depth-wise 3D convolution can be integrated into the transform module of the 2D split-transform-merge architecture (Sec. 2.1.1).
2) (2+1)D Networks. Another elegant idea is to decompose 3D convolution into two components: a 2D convolution that extracts representations from video frames, and a temporal operation that only focuses on learning the temporal relationships. The former can directly adopt 2D neural operators, while the latter can be implemented using 1D temporal convolution, adaptive 1D convolution, and MLPs.
3) 2D Networks. In addition to the aforementioned approaches, the models with only 2D convolution may also be able to model temporal relationships. This is typically achieved by designing zero-parameter operations. For example, one can subtract the features of adjacent frames to extract the motion information. The temporal-shift-based models propose to shift part of the channels of 2D features along the temporal dimension, performing information exchange among neighboring frames efficiently.
4) Long/Short-term Separable Networks. Another important idea is modeling long/short-term temporal dynamics with separate network architectures. A representative work in this direction is SlowFast, which incorporates a slow pathway with lower temporal resolution and a fast pathway with higher temporal resolution. Many recent works further extend this idea.
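The zero-parameter temporal shift described in item 3) can be sketched in a few lines. The layout `frames[t][c]` (one scalar per channel and time step), the `fold` parameter, and the zero padding at the sequence boundaries are simplifying assumptions for illustration.

```python
def temporal_shift(frames, fold):
    """Shift part of the channels along time; frames[t][c] is channel c at step t.

    The first `fold` channels are taken from the previous frame, the next
    `fold` channels from the following frame, and the rest stay in place.
    Positions without a source frame are zero-padded.
    """
    num_steps, num_channels = len(frames), len(frames[0])
    out = [[0.0] * num_channels for _ in range(num_steps)]
    for t in range(num_steps):
        for c in range(num_channels):
            if c < fold:
                src = t - 1        # information flows forward in time
            elif c < 2 * fold:
                src = t + 1        # information flows backward in time
            else:
                src = t            # untouched channels
            if 0 <= src < num_steps:
                out[t][c] = frames[src][c]
    return out


frames = [[1, 10, 100], [2, 20, 200], [3, 30, 300]]
shifted = temporal_shift(frames, fold=1)
```

After the shift, a subsequent 2D convolution applied frame by frame already mixes features from neighboring time steps, without any extra parameters or multiply-accumulates.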
2.3.2 Transformer-based Video Backbones
Driven by the success of ViTs, a considerable number of recent works focus on facilitating efficient video understanding with self-attention-based models. In general, most of these works extend the aforementioned design ideas (including both image-based and video-based backbones) in the context of ViTs, e.g., performing spatial-temporal local self-attention, combining self-attention and convolution, and performing 1D temporal attention in (2+1)D designs.
2.4 Efficient Backbones for 3D Vision
The perception and understanding of 3D scenes is not only a key ability of human intelligence, but also an important computer vision task that is ubiquitous in real-world applications. In this subsection, we will review the backbones designed for processing 3D information efficiently. In general, the works in this direction can be categorized by the forms of model inputs, i.e., 3D point clouds (Sec. 2.4.1), 3D voxels (Sec. 2.4.2), and multi-view images (Sec. 2.4.3).
A fundamental type of 3D geometric data structure is the cloud of 3D points, where each point is represented by its three coordinates. PointNet is the pioneering work that leverages deep learning to process 3D point clouds. It adopts point-wise feature extraction with shared MLPs to maintain permutation invariance. PointNet++ improves PointNet by capturing local geometric structures. On top of them, a number of works focus on how to aggregate local information effectively without increasing the computational cost significantly. Representative approaches include introducing graph neural networks, projecting 3D points to regular grids to perform convolution, aggregating the features of adjacent points using weights determined by the local geometric structure, and applying self-attention. In particular, recent works have revealed that point-based models can achieve state-of-the-art computational efficiency with proper training and model scaling techniques.
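The permutation invariance obtained from point-wise shared processing can be checked with a toy sketch. The fixed function `shared_mlp` is a hypothetical stand-in for a learned MLP, and max pooling is assumed as the symmetric aggregation over points.

```python
def shared_mlp(point):
    """Hypothetical per-point feature extractor; the same function is applied to every point."""
    x, y, z = point
    return [max(0.0, x + y), max(0.0, y - z), max(0.0, 0.5 * x + z)]


def global_feature(points):
    """Channel-wise max over all points: a symmetric, order-independent aggregation."""
    feats = [shared_mlp(p) for p in points]
    return [max(channel) for channel in zip(*feats)]


cloud = [(0.1, 0.2, 0.3), (1.0, -0.5, 0.2), (-0.4, 0.9, -0.1), (0.3, 0.3, 0.8)]
reordered = [cloud[2], cloud[0], cloud[3], cloud[1]]
feature_a = global_feature(cloud)
feature_b = global_feature(reordered)
```

Since each point is processed independently and the pooling ignores order, any reordering of the input cloud yields the same global feature.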
2.4.2 Voxel-based Models
The 3D point clouds can be further transformed to voxels, which are regular and can be directly processed with 3D convolution. Typically, the 3D space is divided into cubic voxel grids, and the features of the points in each grid are averaged. The side length of the grid is called the voxel resolution. An important technique for processing voxels efficiently is sparse convolution, i.e., only performing convolution on the voxels with 3D points in them. Many works design backbone networks with this mechanism conditioned on the vision task of interest for an optimal efficiency-accuracy trade-off. In addition, the point-based and voxel-based models can be combined to reduce the memory and computational cost. Some recent works have explored automatic backbone design using NAS.
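The voxelization step can be sketched as follows, assuming one scalar feature per point and a dictionary keyed by integer voxel indices; the function name and the data layout are illustrative choices. Only occupied voxels are stored, which is the kind of sparse structure that sparse convolution exploits.

```python
import math
from collections import defaultdict


def voxelize(points, features, voxel_size):
    """Average the features of all points that fall into the same cubic voxel."""
    buckets = defaultdict(list)
    for (x, y, z), feature in zip(points, features):
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        buckets[key].append(feature)
    # Empty voxels never appear in the result, so the representation stays sparse.
    return {key: sum(values) / len(values) for key, values in buckets.items()}


points = [(0.1, 0.1, 0.1), (0.4, 0.2, 0.3), (1.2, 0.1, 0.1)]
features = [1.0, 3.0, 5.0]
voxels = voxelize(points, features, voxel_size=0.5)
```

A larger `voxel_size` merges more points per voxel and lowers the cost, at the price of coarser geometry.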
2.4.3 Multi-view-based Models
Multi-view projective analysis is another effective solution for understanding 3D shapes, where the 3D objects are projected into 2D images from varying visual angles and processed by 2D backbone networks. This idea can be implemented for recognition, retrieval, and pose estimation. An important challenge for these methods is how to fuse the multi-view features. Existing works have proposed to leverage LSTMs or graph convolutional networks.
Dynamic Backbone Networks
Although the advanced architectures introduced in Sec. 2 have achieved significant progress in improving the inference efficiency of deep models, they generally have an intrinsic limitation: the computational graphs are kept the same during inference when processing different inputs with varying complexity. Such a static inference paradigm inevitably brings redundant computation on some “easy” samples. To address this issue, dynamic neural networks have attracted great research interest in recent years due to their favorable efficiency, representation power, and adaptiveness.
Researchers have proposed various types of dynamic networks which can adapt their architectures/parameters to different inputs. Based on the granularity of adaptive inference, we categorize related works into sample-wise (Sec. 3.1), spatial-wise (Sec. 3.2), and temporal-wise (Sec. 3.3) dynamic networks. Compared to the previous work, which covers both vision and language models, we mainly focus on computationally efficient models for vision tasks in this survey. Moreover, more up-to-date works are included.
The most common adaptive inference paradigm is processing each input sample (e.g., an image) dynamically. There are mainly two lines of work in this direction: one aims at reducing the computation with decent network performance via dynamic architectures, and the other adjusts network parameters to boost the representation power with minor computational overhead. In this survey, we focus on the former line, which typically reduces redundant computation for improving efficiency. Popular approaches include three types: 1) dynamic depth, 2) dynamic width, and 3) dynamic routing in a super network (SuperNet).
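The inference procedure of a traditional (static) network applies the same computation to every input, whereas in a dynamic network the computation that is actually executed is decided based on the input itself.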
There are mainly two common implementations to realize dynamic depth. The first is early exiting, which means that the network predictions for some “easy” samples can be output at an intermediate layer without activating the deeper layers (Figure 2). Researchers have found that multiple classifiers in a deep model may interfere with each other and degrade the performance by forcing early layers to capture semantic-level features. To address this issue, multi-scale feature representation is adopted to quickly produce coarse-scale features with rich semantic information. Instead of constructing intermediate exits in convolutional networks, the recent Dynamic Vision Transformer (DVT) realizes early exiting in cascaded vision Transformers which process images with different token numbers. Dynamic Perceiver proposes to integrate intermediate features and perform early exiting by introducing an additional attention-based path. Apart from architectural design, researchers have also proposed specialized techniques for training early-exiting models.
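A minimal sketch of confidence-based early exiting is given below. The toy stages, the two-class exit head, and the fixed confidence threshold are all assumptions for illustration; real models use learned layers and classifiers, and other exit criteria are possible.

```python
import math


def early_exit_predict(x, stages, exit_heads, threshold):
    """Run stages in order and stop at the first sufficiently confident exit.

    Returns (predicted class, number of stages executed). The last exit is
    always used if no earlier one clears the threshold.
    """
    h = x
    for depth, (stage, head) in enumerate(zip(stages, exit_heads), start=1):
        h = stage(h)
        probs = head(h)
        confidence = max(probs)
        if confidence >= threshold or depth == len(stages):
            return probs.index(confidence), depth


def toy_head(h):
    """Hypothetical two-class classifier: sigmoid of the scalar feature."""
    p = 1.0 / (1.0 + math.exp(-h))
    return [p, 1.0 - p]


stages = [lambda h: 2.0 * h] * 3       # toy stages that amplify the evidence
heads = [toy_head] * 3

easy = early_exit_predict(2.0, stages, heads, threshold=0.9)
medium = early_exit_predict(-0.6, stages, heads, threshold=0.9)
hard = early_exit_predict(0.2, stages, heads, threshold=0.9)
```

In this toy setting the "easy" input leaves after one stage, while the "hard" input runs through all three, so the average cost depends on the difficulty of the inputs.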
3.1.2 Dynamic Width
Instead of skipping an entire layer, a less aggressive approach is adjusting the network width to different inputs. In this direction, the most popular implementation is dynamically skipping the channels in convolutional blocks via a gating module (Figure 4). Specifically, a gating module is first executed before conducting a convolution operation. The output of this gating module is a C-dimensional binary vector that decides whether to compute each channel, where C is the output channel number. This implementation is similar to that in the aforementioned layer-skipping scheme. The most prominent difference is that the output of the gating module in layer skipping is a scalar, whereas the gating module in channel skipping is required to output a vector controlling the computation of different channels. Apart from convolution layers, the same idea can also be applied in vision Transformers to dynamically skip channels in multi-layer perceptron (MLP) blocks.
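The effect of such a binary gate can be sketched for a single spatial position of a 1x1 convolution. The gate is passed in directly here; in practice it would be predicted from the input by a lightweight module, which is omitted. All names are illustrative assumptions.

```python
def gated_conv1x1(x, weight, gate):
    """Compute only the output channels selected by a binary gate.

    x: list of C_in input activations at one spatial position.
    weight[c]: the C_in weights of output channel c.
    gate: binary list with one entry per output channel.
    Returns (outputs, multiply-accumulates actually spent).
    """
    outputs, macs = [], 0
    for channel_weights, keep in zip(weight, gate):
        if keep:
            outputs.append(sum(w * v for w, v in zip(channel_weights, x)))
            macs += len(x)
        else:
            outputs.append(0.0)    # skipped channel: no computation is performed
    return outputs, macs


x = [1.0, 2.0]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, macs = gated_conv1x1(x, weight, gate=[1, 0, 1])
```

With one of the three output channels gated off, the layer spends two thirds of the multiply-accumulates of its static counterpart.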
3.1.3 Dynamic Routing in SuperNets
Many works have proposed different forms of SuperNets, such as tree structures, dynamic mixture-of-experts, and more general architectures.
3.2 Spatial-wise Dynamic Networks
It has been found that different spatial locations in an image contribute unequally to the performance of vision tasks. However, most existing deep models process different spatial locations with the same computation, leading to redundant computation on less important regions. To this end, spatial-wise dynamic networks are proposed to exploit the spatial redundancy in image data to achieve an improved efficiency. Based on the granularity of adaptive inference, we categorize related works into pixel level, region level, and resolution level.
A typical approach to spatial-wise adaptive inference is dynamically deciding whether to compute each pixel in a convolution block based on a binary mask. This form is similar to that in layer skipping and channel skipping (Sec. 3.1), except that the gating module is required to output a spatial mask. Each element of this spatial mask determines the computation of a feature pixel. In this way, the mask generators learn to locate the most discriminative regions in image features, and redundant computation on less informative pixels can be skipped.
The limitation of such pixel-level dynamic computation is that the acceleration is currently not supported by most deep learning libraries. The memory access cost can be heavier than that of static convolutions, and the computation parallelism is reduced due to sparse convolution. As a result, although the computation can be significantly reduced, the practical efficiency of these methods usually lags behind their theoretical efficiency. To this end, researchers have also proposed “coarse-grained” spatial-wise dynamic networks, which means that an element of a spatial mask can decide a patch rather than a pixel. In this way, more contiguous memory access is realized for realistic speedup. Moreover, the scheduling strategies are also proven to have a considerable effect on the inference latency. It is also promising to co-design algorithms, scheduling, and hardware devices to better harvest the theoretical efficiency of spatial-wise dynamic networks.
Apart from skipping the computation of certain pixels, another line of work breaks the static receptive field of traditional convolution and proposes deformable convolution. Specifically, a lightweight module is used to learn the offsets for each feature pixel, and the convolution neighbors are sampled from arbitrary locations based on the predicted offsets. This idea has also been implemented in vision Transformers to enhance the performance of the local attention mechanism.
3.2.2 Region-level Dynamic Networks
Instead of flexibly deciding which feature pixels to compute, another line of work aims at locating important regions (patches) in input images and cropping these patches for recognition tasks. For example, image recognition can be formulated as a sequential decision problem, in which an RNN is adopted to make predictions based on the cropped image patches . A multi-scale CNN with multiple sub-networks could also be used to perform the classification task based on cropped salient image patches . A lightweight module is placed between every two sub-networks to decide the coordinate and size of the salient patch.
Along this direction, the recent glance-and-focus network (GFNet) proposes a general framework for region-level dynamic inference which is compatible with various visual backbones. It first “glances” at a low-resolution input image, and then repeatedly “focuses” on salient regions using reinforcement learning (RL). Moreover, early exiting (Sec. 3.1) is allowed, which means that the number of “focus” steps can be dynamically adjusted for different input images.
2.3 Resolution-level Dynamic Networks
Most existing vision models process different images with the same resolution. However, the input complexity could vary, and not all images require a high-resolution representation. Ideally, low-resolution representations should be sufficient for those “easy” samples with large objects and canonical features. An early work proposes to adaptively zoom input images in the face detection task. The recent resolution adaptive network (RANet) builds a multi-scale architecture, in which inputs are first processed with a low resolution and a small sub-network. Large sub-networks and high-resolution representations are conditionally activated based on early predictions. Instead of using a specialized structure, the dynamic resolution network rescales each image with the resolution predicted by a small model and feeds the rescaled image to common CNNs.
Note that different spatial locations are still processed equally in the aforementioned methods. We categorize the related works in this section since they mainly utilize the spatial redundancy of image inputs for efficient inference.
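The conditional activation of high-resolution processing described above can be sketched as a simple cascade. This is a minimal illustration and not the actual RANet procedure: the stage interface `model(image, resolution)` and the use of the maximum class probability as the exit criterion are assumptions introduced here.

```python
from typing import Callable, List, Sequence, Tuple

Stage = Tuple[int, Callable[[object, int], List[float]]]


def adaptive_resolution_predict(image, stages: Sequence[Stage], threshold: float):
    """Run stages from cheap (low resolution) to expensive (high resolution).

    Each stage is a hypothetical (resolution, model) pair, where
    model(image, resolution) returns class probabilities.
    Returns (predicted_label, resolution_used, stage_index).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    for index, (resolution, model) in enumerate(stages):
        probs = model(image, resolution)
        confidence = max(probs)
        is_last = index == len(stages) - 1
        # Exit early when the cheap prediction is already confident enough.
        if confidence >= threshold or is_last:
            return probs.index(confidence), resolution, index
```

A lower `threshold` lets more inputs exit at the low-resolution stage, trading accuracy for computation.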
3 Temporal-wise Dynamic Networks
As video data can be viewed as a sequence of image data, adaptive computation could also be performed along the temporal dimension due to the considerable redundancy in video recognition tasks. Representative works can generally be divided into two lines: one processes video with recurrent models and dynamically saves computation at certain time steps; the other aims at sampling key frames/clips and allocating the computation to these sampled frames.
Different video frames are unequally informative. To this end, extensive studies propose to dynamically activate computation when updating the hidden state in recurrent models. For example, LiteEval establishes two LSTMs of different sizes. In each time step, a gating module is used to decide which LSTM should be executed for processing the current frame. AdaFuse dynamically skips the computation of some convolution channels, and these channels are filled with the hidden state from the previous step. Moreover, the numerical precision and image resolution of different frames can also be dynamically decided.
The aforementioned works generally require a ConvNet for encoding each input frame before updating the hidden state. A more flexible solution is allowing the network to learn “where to see”. In other words, networks can directly jump to an arbitrary temporal location in the video or perform early exiting instead of “watching” the entire video frame by frame.
3.2 Dynamic Key Frame Sampling
An alternative to skipping computation in recurrent networks is sampling key frames and then feeding the sampled frames rather than the whole video to a standard model. Reinforcement learning is a popular technique for training frame samplers .
A recent trend is simultaneously achieving dynamic inference from multiple perspectives. For example, AdaFocus and its variants make use of both spatial and temporal redundancy in video data. Dynamic architecture with 3D convolution is also an interesting topic.
Efficient Models for Downstream Computer Vision Tasks
In this section, we assume that a light-weighted backbone network has already been obtained, and discuss how to design task-specific heads or algorithms on top of it. The general aim is to facilitate accomplishing real-world computer vision tasks efficiently or even in real time. To this end, we will focus on three representative tasks, namely object detection (Sec. 4.1), semantic segmentation (Sec. 4.2), and instance segmentation (Sec. 4.3), all of which have a strong need for accurate and real-time applications. Note that most other, more complex computer vision tasks (e.g., visual object tracking) are mainly based on the three tasks we consider.
Object detection aims to answer two fundamental questions in computer vision: what visual objects are contained in the images, and where are they? The classification and localization results obtained by object detection usually serve as the basis of other vision tasks, e.g., instance segmentation, image captioning, and object tracking. The algorithms for object detection can be roughly categorized into two-stage (Sec. 4.1.1) and one-stage (Sec. 4.1.2). In the following, we will discuss them respectively from the lens of computational efficiency.
Object detection with deep learning starts from the two-stage paradigm. The pioneering work, R-CNN, proposes to first crop a set of object proposals from the images and then classify them with deep networks. On top of it, SPPNet avoids repeatedly running the backbone by adaptively pooling the features of the regions of interest. Fast R-CNN simultaneously trains a detector and a bounding box regressor in the same network, leading to a speedup of more than 200 times over R-CNN. Faster R-CNN and its improvements introduce a region proposal network that cheaply generates object proposals from the features, yielding the first nearly real-time deep learning detector. Feature pyramid networks further propose to leverage feature maps at varying scales to detect objects of different sizes, which improves the detection accuracy significantly without sacrificing efficiency.
1.2 One-stage Detectors
The major motivation behind two-stage detectors is “coarse-to-fine” refining, i.e., first obtaining coarse proposals, and then refining the localization and discrimination results on top of these proposals, such that an excellent detection performance can be achieved. Despite the aforementioned techniques proposed to improve the efficiency of this procedure, the speed and the complexity of two-stage detectors usually make them unsuitable for real-time applications. In contrast, one-stage detectors directly output the detection results in a single step, yielding much faster inference speed with a decent accuracy.
1) Bounding-box-based Methods. The first deep-learning-based one-stage detector is YOLO . YOLO divides the image into grid regions and simultaneously predicts the bounding boxes and the classification results conditioned on each region. The subsequent works of YOLO focus on further improving the localization performance or classification accuracy without affecting the practical speed. The latest version, YOLOv7 , achieves a state-of-the-art effectiveness-efficiency trade-off.
In addition to YOLO, SSD improves the accuracy of one-stage detectors by detecting objects at different scales on different layers of the network. RetinaNet proposes a focal loss to encourage the model to focus more on the difficult, misclassified examples, which effectively boosts the accuracy of one-stage detectors.
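To make the effect of the focal loss concrete, the sketch below implements its binary form, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), for a single prediction. The default values γ = 2 and α = 0.25 follow the commonly reported RetinaNet setting and are not stated in this survey; the function name is our own.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class; y: label in {0, 1}.
    """
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)           # guard against log(0)
    # (1 - p_t)^gamma shrinks the loss of well-classified (easy) examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` the expression reduces to an α-weighted cross-entropy, so the modulating factor is the only part that distinguishes easy from hard examples.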
2) Point-based Methods. The aforementioned detection methods mostly learn to produce the ground-truth bounding boxes on top of pre-defined anchor boxes. Despite its effectiveness, this paradigm suffers from a large number of design hyper-parameters and an imbalance between positive and negative boxes during training. To address this issue, CornerNet proposes to directly predict the top-left corner and bottom-right corner of candidate boxes. Many subsequent works extend this point-based setting. For example, FCOS predicts the distances from each location in feature maps to the four sides of the bounding box. ExtremeNet learns to detect the extreme points and the center of bounding boxes. CenterNet further considers each object to be a single center point and regresses all the attributes (2D/3D size, orientation, depth, locations, etc.) based on this point.
3) Transformer-based Methods. More recently, Carion et al. propose an end-to-end Transformer-based detection network, DETR. DETR views detection as a set prediction problem, where the results are obtained based on several object queries. Deformable DETR addresses the slow convergence of DETR by introducing a deformable mechanism to self-attention.
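The distance-based parameterization used by FCOS can be illustrated with a pair of helper functions that convert between a box and the four distances measured from a feature-map location. The function names and the (x0, y0, x1, y1) box convention are assumptions chosen for this sketch.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def encode_distances(x: float, y: float, box: Box) -> Tuple[float, float, float, float]:
    """Distances (left, top, right, bottom) from location (x, y) to the box sides."""
    x0, y0, x1, y1 = box
    return (x - x0, y - y0, x1 - x, y1 - y)


def decode_box(x: float, y: float, left: float, top: float,
               right: float, bottom: float) -> Box:
    """Recover the box from a location and its four predicted distances."""
    return (x - left, y - top, x + right, y + bottom)
```

A location lies inside the box exactly when all four encoded distances are positive, which is why no pre-defined anchor boxes are needed in this formulation.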
2 Semantic Segmentation
The aim of semantic segmentation is to predict the semantic label of each pixel, e.g., whether a pixel belongs to a car, a bike, etc. Here we summarize existing efficient semantic segmentation methodologies based on their paradigms, i.e., encoder-decoder (Sec. 4.2.1), multi-branch (Sec. 4.2.2) and others (Sec. 4.2.3).
A popular approach is to first extract low-resolution discriminative representations with a multi-stage backbone network, up-sample the deep features to the input resolution with a decoder, and then produce the pixel-wise predictions. This procedure is referred to as “encoder-decoder”. To improve the efficiency of this paradigm, many works propose to design light-weighted decoders. Representative methods include introducing split-transform-merge architectures, developing efficient approximations of the computationally intensive dilated convolution, and introducing dense connections. In addition, it is efficient to simultaneously feed the low-level and high-level features into the decoder, i.e., comprehensively leveraging both of them improves the accuracy without introducing notable computational overhead.
2.2 Multi-branch Models
Another popular efficient paradigm is designing multi-branch architectures. Typically, the model consists of two types of paths: 1) context paths with low-resolution feature maps and large receptive fields, aiming to extract discriminative information; and 2) spatial paths that preserve the low-level spatial information. These paths are fused in a parallel or cascade fashion, yielding high-resolution but semantically rich deep representations for segmentation.
2.3 Others
In recent years, some new ideas have been proposed to facilitate efficient semantic segmentation. Examples include processing deep features with self-attention layers, designing segmentation models with NAS, and adjusting the architecture of the decoder conditioned on the inputs. More recently, a considerable number of papers seek to design efficient semantic segmentation models on top of ViTs. These works mainly focus on achieving state-of-the-art performance with as little computational cost as possible.
3 Instance Segmentation
Instance segmentation can be seen as a combination of object detection and semantic segmentation, where the model needs to detect the instances of objects, demarcate their boundaries and recognize their categories. Existing works in this direction can be categorized into two-stage (Sec. 4.3.1) and end-to-end (Sec. 4.3.2).
From the lens of efficiency, a notable milestone of deep-learning-based instance segmentation is the introduction of Mask R-CNN. Mask R-CNN is developed by introducing mask segmentation branches on the basis of Faster R-CNN. It enjoys high computational efficiency by directly obtaining the regions of interest from the feature maps. In contrast, MaskLab improves Faster R-CNN by adding semantic segmentation and direction prediction paths. To improve the accuracy of Mask R-CNN, MS R-CNN predicts the quality of the predicted instance masks and prioritizes more accurate mask predictions during validation. PANet introduces a path augmentation mechanism to facilitate the bottom-up information interaction of feature maps. HTC proposes a hybrid task cascade framework to learn more discriminative features progressively while integrating complementary features in the meantime.
3.2 End-to-end Approaches
Another line of work focuses on realizing efficient end-to-end instance segmentation. SOLO achieves this by introducing the notion of “instance categories”, which assigns categories to each pixel within an instance according to the instance’s location and size, thus converting instance segmentation into a pure dense classification problem. YOLACT and BlendMask propose to first generate a set of prototype masks, and then combine them with per-instance mask coefficients or attention scores. Inspired by SSD and RetinaNet, TensorMask builds an efficient sliding-window-based instance segmentation framework.
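The prototype-based assembly step can be sketched in a few lines. The example follows the coefficient-based variant (a linear combination of prototypes per instance); passing the result through a sigmoid and the specific tensor layout are assumptions of this sketch, and the attention-based blending of BlendMask is not covered.

```python
import numpy as np


def assemble_masks(prototypes: np.ndarray, coefficients: np.ndarray) -> np.ndarray:
    """Combine shared prototype masks into per-instance masks.

    prototypes: (H, W, K) masks shared by all instances in the image.
    coefficients: (N, K), one coefficient vector per detected instance.
    Returns (N, H, W) soft masks with values in (0, 1).
    """
    logits = np.einsum("hwk,nk->nhw", prototypes, coefficients)
    return 1.0 / (1.0 + np.exp(-logits))  # element-wise sigmoid
```

Because the prototypes are computed once per image, the per-instance cost is only a K-dimensional weighted sum at each pixel.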
Model Compression Techniques
Deep networks necessitate substantial resources, including energy, processing capacity, and storage. These resource requirements diminish the suitability of deep networks for resource-constrained devices. Furthermore, the extensive resource requirements of deep networks become a bottleneck for real-time inference and for executing deep networks in browser-based applications. To address these drawbacks of deep networks, various model compression techniques have been proposed in existing literature. Several comprehensive reviews on model compression techniques exist. These reviews categorize model compression techniques, discuss challenges, and provide overviews, solutions, and future directions. We adopt their classification structure but place a greater emphasis on vision-related works. Specifically, we categorize existing research into network pruning, network quantization, low-rank decomposition, knowledge distillation, and other techniques. For readers interested in a particular category, we recommend consulting these more targeted reviews.
Network pruning is one of the most prevalent techniques for reducing the size of a deep learning model by eliminating unimportant components, such as channels, filters, neurons, or layers, resulting in a light-weighted model. Network pruning techniques can be categorized into four types: channel pruning, filter pruning, connection pruning, and layer pruning. These techniques help decrease the storage and computation requirements of deep networks. A typical pruning algorithm consists of two stages: evaluating and pruning unimportant parameters, followed by fine-tuning the pruned model to restore accuracy. The steps and categories are illustrated in Figure 5.
In deep networks, the inputs provided to each layer are organized into channels. Channel pruning involves removing unimportant channels to reduce computation and storage requirements. Various channel pruning schemes have been proposed. Convolutional operations in ConvNets incorporate a large number of filters to enhance performance. Increasing the number of filters results in a significant growth in the number of floating-point operations. Filter pruning eliminates unimportant filters, thus reducing computation. The number of input and output connections to a layer in deep networks determines the number of parameters. These parameters can be used to estimate the storage and computation requirements of deep networks. Connection pruning is a direct approach to reduce parameters by removing unimportant connections. Layer pruning involves selecting and deleting certain unimportant layers from the network, leading to ultra-high compression of the deep network. This is particularly useful for deploying deep networks on resource-constrained computing devices, where ultra-high compression is necessary. Some layer pruning approaches have been proposed to substantially reduce both storage and computation requirements. However, layer pruning may result in a higher accuracy compromise due to the structural deterioration of deep networks.
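The "evaluate, then prune" stage of filter pruning can be sketched as follows. The importance criterion used here, the L1 norm of each filter, is one common choice and is an assumption of this example rather than the criterion of any particular method cited above; fine-tuning after pruning is omitted.

```python
import numpy as np


def prune_filters_l1(weight: np.ndarray, keep_ratio: float):
    """Keep the filters with the largest L1 norm.

    weight: (C_out, C_in, kH, kW) convolution kernel.
    Returns (pruned_weight, kept_indices).
    """
    n_keep = max(1, int(round(weight.shape[0] * keep_ratio)))
    scores = np.abs(weight).sum(axis=(1, 2, 3))      # one score per filter
    kept = np.sort(np.argsort(scores)[::-1][:n_keep])
    return weight[kept], kept


def prune_next_layer_inputs(next_weight: np.ndarray, kept: np.ndarray) -> np.ndarray:
    """Drop the input channels of the following layer that no longer exist."""
    return next_weight[:, kept]
```

Removing a filter also removes one output channel, so the next layer's kernel has to be sliced accordingly, which is what the second helper does.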
2 Network Quantization
Network quantization aims to compress the original network by reducing the storage requirements of weights. It can be categorized into linear quantization and nonlinear quantization. Linear quantization focuses on minimizing the number of bits needed to represent each weight, while nonlinear quantization involves dividing weights into several groups, with each group sharing a single weight.
Utilizing 32-bit floating-point numbers to represent weights consumes a substantial amount of resources. Consequently, linear quantization employs low-bit number representation to approximate each weight. Gupta et al. contend that the weights of deep networks can be represented by 16-bit fixed-point numbers without significantly reducing classification accuracy. Some studies further compress ConvNets to 8-bit. In the extreme case of a 1-bit representation for each weight, binary weight neural networks emerge. The primary concept is to directly learn binary weights or activations during model training. Several works directly train ConvNets with binary weights, including BinaryConnect, BinaryNet, and XNOR.
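A minimal sketch of linear quantization is given below. It assumes symmetric, per-tensor quantization with a single scale factor, and a 1-bit variant that keeps the sign of each weight together with one scaling constant (the mean absolute value). These choices and the function names are illustrative assumptions, not the procedure of any specific work cited above.

```python
import numpy as np


def quantize_uniform(weights: np.ndarray, num_bits: int = 8):
    """Map float weights to signed integers in [-qmax, qmax]; returns (q, scale)."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int32)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from integers and the scale."""
    return q * scale


def binarize(weights: np.ndarray):
    """1-bit case: keep only the sign, plus one scaling constant."""
    alpha = float(np.mean(np.abs(weights)))
    signs = np.where(weights >= 0, 1.0, -1.0)
    return signs, alpha
```

In this setting the rounding error of each weight is bounded by half the scale, so halving the bit width roughly doubles the number of bits saved while increasing the worst-case error by the corresponding factor.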
2.2 Nonlinear Quantization
Nonlinear quantization entails dividing weights into several groups, with each group sharing a single weight. Gong et al. initially employ the k-means algorithm to cluster weight parameters and replace the parameter values with the clustering center values, substantially reducing the network’s storage space . Wu et al. further quantize convolution filters, fully connected layers, and other parameters . Chen et al. randomly assign weights to hash buckets, with each hash bucket sharing a single weight . Han et al. combine network pruning, parameter quantization, and Huffman coding to achieve significant reductions in storage and memory .
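The weight-sharing idea can be sketched with a one-dimensional k-means over the weights of a layer: every weight is replaced by the index of its cluster, and only the small codebook of cluster centers is stored in floating point. The linear initialization of the centers and the fixed iteration count are assumptions of this sketch.

```python
import numpy as np


def kmeans_weight_sharing(weights: np.ndarray, num_clusters: int, num_iters: int = 20):
    """Cluster scalar weights; returns (codebook, index array shaped like weights)."""
    flat = weights.ravel().astype(np.float64)
    # Assumed initialization: centers spread evenly over the weight range.
    codebook = np.linspace(flat.min(), flat.max(), num_clusters)
    for _ in range(num_iters):
        assign = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
        for k in range(num_clusters):
            members = flat[assign == k]
            if members.size > 0:          # leave empty clusters unchanged
                codebook[k] = members.mean()
    assign = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
    return codebook, assign.reshape(weights.shape)
```

The compressed layer is reconstructed as `codebook[indices]`; with K clusters each weight needs only about log2(K) bits for its index.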
3 Knowledge Distillation
Knowledge distillation (KD) is a widely adopted technique for transferring “dark knowledge” from a high-capacity model (teacher) to a more compact model (student) in order to achieve various types of efficiency. The two primary aspects of KD are knowledge representation and distillation schemes. In this section, we concentrate on existing research in these two technical areas and further summarize the theoretical exploration and application progress of KD in computer vision, as illustrated in Figure 6.
Drawing on , we examine different forms of knowledge in the following categories: response-based knowledge, feature-based knowledge, and relation-based knowledge. Response-based knowledge typically refers to the neural response of the teacher model’s final output layer, with the main idea being to directly emulate the teacher model’s final prediction. The most prevalent response-based knowledge for image classification is soft targets . In object detection tasks, the response may include logits along with the bounding box offset . For semantic landmark localization tasks, such as human pose estimation, the teacher model’s response may consist of a heatmap for each landmark .
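As a concrete example of response-based knowledge, the sketch below computes a soft-target loss between teacher and student logits. The temperature-scaled softmax, the KL divergence, and the multiplication by T² follow the commonly used formulation of vanilla knowledge distillation; the default temperature of 4 and the function names are assumptions made here.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                       # subtract the max for stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_logits, temperature: float = 4.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return temperature ** 2 * kl
```

In practice this term is usually added to the ordinary cross-entropy on ground-truth labels, with a weighting factor that is a further hyper-parameter.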
Feature-based knowledge pertains to the feature representation derived from intermediate layers. FitNets is the first to introduce intermediate representations, which subsequently inspires the development of various methods. Relation-based knowledge further investigates the relationships between different feature layers or data samples. For instance, Yim et al. propose calculating the relations between pairs of feature maps using the Gram matrix, while Liu et al. suggest transferring the instance relationship graph, which defines instance features and relationships as vertices and edges, respectively.
3.2 Distillation Schemes
The learning schemes of knowledge distillation can be classified into three main categories based on the synchronization of the teacher model’s update with the student model: offline distillation, online distillation, and self-distillation.
In offline distillation, the teacher model is usually assumed to be pre-trained. The primary focus of offline methods is to enhance various aspects of knowledge transfer, including knowledge representation and the design of loss functions. Vanilla knowledge distillation serves as a classic example of offline distillation methods. Most prior knowledge distillation methods operate in an offline manner.
In cases where a high-capacity, high-performance teacher model is unavailable, online distillation provides an alternative. In this approach, both the teacher model and the student model are updated simultaneously, allowing for an end-to-end trainable knowledge distillation framework. Deep mutual learning introduces a method for training multiple neural networks collaboratively, where any given network can serve as the student model while the others act as teachers. Numerous online knowledge distillation methods have been proposed, with multi-branch architecture and ensemble techniques being widely adopted.
Self-distillation refers to a learning process in which the student model acquires knowledge independently, without the presence of teacher models, whether pre-trained or virtual. Several studies have explored this idea in various contexts. For instance, Zhang et al. propose a method for distilling knowledge from deeper layers to shallower ones for image classification tasks. Similarly, Hou et al. employ attention maps from deeper layers as distillation targets for lower layers in object detection tasks. In contrast, Yang et al. introduce snapshot distillation, where checkpoints from earlier epochs are considered as teachers to distill knowledge for the models in later epochs. Additionally, Wang et al. suggest constraining the outputs of the backbone network using target class activation maps.
3.3 Theory and Applications
A wide range of knowledge distillation methods has been extensively employed in vision applications. Initially, most knowledge distillation methods were developed for image classification and later extended to other vision tasks, including face recognition , action recognition , object detection , semantic segmentation , depth estimation , image retrieval , video captioning , and video classification , among others.
Despite the significant practical success, relatively few works have focused on the theoretical or empirical understanding of knowledge distillation. Hinton et al. suggest that the success of KD could be attributed to learning similarities between categories. Yuan et al. posit that dark knowledge not only encompasses category similarities but also imposes regularization on student training. They indicate that KD is a learned label smoothing regularization (LSR). Tang et al. argue that, in addition to regularization and class relationships, another type of knowledge, instance-specific knowledge, is also used by the teacher to rescale the student model’s per-instance gradients. Chen et al. quantify the extraction of visual concepts from the intermediate layers of a deep learning model to explain knowledge distillation. Wang et al. connect KD with the information bottleneck and empirically validate that preserving more mutual information between feature representation and input is more important than improving the teacher model’s accuracy. Overall, theoretical research remains limited compared to the diverse and numerous applications.
4 Low-rank Factorization
Convolution kernels can be viewed as 3D tensors. Ideas based on tensor decomposition are derived from the intuition that there is structural sparsity in the 3D tensor. In the case of fully connected layers, they can be viewed as 2D matrices (or 3D tensors), and low-rankness can also be helpful. The key idea of low-rank factorization is to find an approximate low-rank tensor that is close to the real tensor and easy to decompose. Low-rank factorization is beneficial for both tensors and matrices.
There are several typical low-rank methods for compressing 3D convolutional layers. Lebedev et al. propose Canonical Polyadic (CP) decomposition for kernel tensors. They use nonlinear least squares to compute the CP decomposition for a better low-rank approximation. Since low-rank tensor decomposition is a non-convex problem and generally difficult to compute, Jaderberg et al. use iterative schemes to obtain an approximate local solution. Then, Tai et al. find that the particular form of low-rank decomposition used by Jaderberg et al. has an exact closed-form solution, which is the global optimum, and present a method for training low-rank constrained ConvNets from scratch.
Many classical works have exploited low-rankness in fully connected layers. Denil et al. reduce the number of dynamic parameters in deep models using the low-rank method . Zhang et al. introduce a Tucker decomposition model to compress weight tensors in fully connected layers . Lu et al. adopt truncated singular value decomposition to decompose the fully connected layer for designing compact multi-task deep learning architectures . Sainath et al. explore a low-rank matrix factorization of the final weight layer in deep networks for acoustic modeling .
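For fully connected layers, the truncated singular value decomposition mentioned above can be sketched directly with NumPy. The function names are our own, and the example only shows the factorization itself, not the fine-tuning that typically follows.

```python
import numpy as np


def factorize_fc(weight: np.ndarray, rank: int):
    """Approximate an (out, in) weight matrix by two thinner factors.

    Returns A of shape (out, rank) and B of shape (rank, in) with weight ≈ A @ B,
    so that one large layer becomes two smaller consecutive layers.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]      # absorb the singular values into A
    b = vt[:rank]
    return a, b


def parameter_counts(out_dim: int, in_dim: int, rank: int):
    """Parameters before and after factorization (biases ignored)."""
    return out_dim * in_dim, rank * (out_dim + in_dim)
```

The factorized layer is smaller whenever `rank < out_dim * in_dim / (out_dim + in_dim)`; the approximation is exact if the original matrix already has rank at most `rank`.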
5 Hybrid Techniques
Apart from the four categories of mainstream techniques mentioned above, there are other techniques for network compression. Some studies have attempted to integrate orthogonal techniques to achieve more significant performance gains. Some works have designed compact networks or efficient convolutions, which have been discussed in Sec. 2.
Efficient Deployment on Hardware
The aforementioned works mostly design network architectures based on their theoretical computation (e.g., floating-point operations, FLOPs). However, there is often a gap between theoretical computation and practical latency on hardware devices. Realistic efficiency can be influenced by other factors such as hardware properties and scheduling strategies. Along this direction, we review related works from the following perspectives: 1) hardware-aware neural architecture search (Sec. 6.1); 2) acceleration software libraries and hardware design (Sec. 6.2); and 3) algorithm-hardware co-design techniques.
As the practical latency of models can be influenced by many factors other than theoretical computation, the commonly used FLOPs metric is an inaccurate proxy for network efficiency. Ideally, one should develop efficient models based on specific hardware properties. However, hand-designing networks for different hardware devices can be laborious. Therefore, automatically searching for efficient architectures is emerging as a promising direction. Compared to traditional NAS methods, this line of work can generate appropriate models which satisfy different hardware constraints and gain realistic efficiency in practice. For example, ProxylessNAS establishes a latency prediction function based on realistic tests on targeted hardware, and the predicted latency is then directly used as a regularization term in the NAS objective. A similar idea is also implemented by MnasNet to search for efficient models on mobile devices. Subsequent works such as FBNet, FBNet-v2 and OFA further improve these NAS techniques.
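The role of predicted latency as a regularization term can be illustrated with a small sketch. It assumes that the latency of an architecture is the sum of per-operator latencies measured once on the target device and stored in a lookup table; the table entries, operator names, and trade-off weight below are hypothetical and serve only as an example.

```python
from typing import Dict, Sequence


def predict_latency(architecture: Sequence[str], latency_table: Dict[str, float]) -> float:
    """Sum the measured latency of every operator in the architecture."""
    return sum(latency_table[op] for op in architecture)


def hardware_aware_objective(task_loss: float, architecture: Sequence[str],
                             latency_table: Dict[str, float], weight: float = 0.1) -> float:
    """Task loss plus a latency penalty; lower is better."""
    return task_loss + weight * predict_latency(architecture, latency_table)


# Hypothetical per-operator latencies (ms) for one target device.
EXAMPLE_TABLE = {"conv3x3": 2.0, "conv5x5": 5.0, "skip": 0.0}
```

With such an objective, an architecture that is slightly more accurate but much slower on the target device can rank below a faster one, which is not captured by FLOPs alone.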
Apart from the traditional static models, the hardware-aware design paradigm has also been applied to develop spatial-wise dynamic networks (Sec. 3.2) .
Note that we mainly give a brief introduction of basic ideas in this work due to the page limit. For more detailed techniques we refer the readers to the survey which specifically focuses on this topic.
2 Acceleration Tools
In addition to architectural design, the efficient deployment of algorithms on hardware also requires acceleration software libraries or specific hardware accelerators.
1) Software Libraries. Extensive efforts have been made to accelerate model inference on different hardware platforms. For example, NVIDIA TensorRT is widely used to deploy models for optimized inference on GPUs. NNPACK (https://github.com/Maratyszcza/NNPACK), CoreML, and TinyEngine are representative tools for multi-core CPUs, Apple silicon, and microcontrollers (MCUs), respectively. Cross-platform tools such as Tencent TNN (https://github.com/Tencent/TNN) and Apache TVM have also emerged as popular development tools.
2) Hardware Accelerators. Apart from adapting neural architectures to given hardware devices, another line of work studies accelerators from the hardware perspective to enable fast inference of deep models. For example, DianNao focuses on memory behavior and proposes an accelerator that simultaneously improves the inference speed and energy consumption of deep models. An FPGA-based accelerator is proposed based on a quantitative analysis of the throughput of CNNs with the help of the classical roofline model. In addition to regular deep networks, researchers have also proposed accelerators to improve the inference efficiency of spatially sparse convolution.
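The roofline model referred to above bounds the attainable throughput of a kernel by the smaller of the peak compute rate and the product of memory bandwidth and operational intensity. The sketch below computes this bound together with a simple intensity estimate for a convolution layer; the estimate assumes that inputs, weights, and outputs are each moved exactly once, which is an idealization introduced here.

```python
def attainable_performance(peak_flops: float, memory_bandwidth: float,
                           operational_intensity: float) -> float:
    """Roofline bound in FLOP/s.

    peak_flops: FLOP/s; memory_bandwidth: bytes/s;
    operational_intensity: FLOP per byte of memory traffic.
    """
    return min(peak_flops, memory_bandwidth * operational_intensity)


def conv_operational_intensity(c_in: int, c_out: int, k: int,
                               h_in: int, w_in: int, h_out: int, w_out: int,
                               bytes_per_value: int = 4) -> float:
    """FLOPs per byte for a k x k convolution under ideal data reuse."""
    flops = 2 * c_in * c_out * k * k * h_out * w_out       # multiply-adds count twice
    values_moved = c_in * h_in * w_in + c_out * c_in * k * k + c_out * h_out * w_out
    return flops / (bytes_per_value * values_moved)
```

A layer is memory-bound when `memory_bandwidth * operational_intensity` is below `peak_flops`; in that regime reducing FLOPs alone does not reduce latency.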
3 Algorithm-Hardware Co-design
The aforementioned methods typically improve the inference efficiency from the perspective of either algorithm or hardware. Ideally, one should expect that algorithms and hardware can “cooperate” with each other to further push forward the Pareto frontier of the accuracy-efficiency trade-off. Along this direction, extensive efforts have been made based on the highly flexible and versatile Field Programmable Gate Array (FPGA) platform, and NAS techniques (Sec. 6.1) are widely used to search for hardware-friendly network structures. The recent MCUNet series has enabled both inference and training on MCUs based on algorithm-hardware co-design with the help of their proposed TinyEngine tool (Sec. 6.2).
The co-designing method has also been applied to the field of dynamic neural networks, especially for efficient spatially adaptive convolution and attention operations.
Challenges and Future Directions
Despite the significant advances in the field of computationally efficient deep learning in recent years, numerous open challenges warrant further research. In this section, we summarize these challenges and discuss potential future directions.
The efficient extraction of discriminative representations from raw inputs has been established as a critical cornerstone for practical deep learning applications, as demonstrated in the existing literature. Light-weighted backbone networks are commonly employed to achieve this goal. As a result, a significant challenge lies in the design of efficient, general-purpose backbones. Potential avenues of investigation in this area encompass enhancing current convolution and self-attention operators via manual design , employing automated architecture search methodologies , and amalgamating these approaches to create comprehensive efficient modules . Specifically, the exploration of innovative information aggregation methods beyond convolution and self-attention appears promising, for instance, clustering algorithms , LSTM , and graph convolution . Moreover, an emerging area of interest involves enabling backbone networks to accommodate multi-modal inputs (e.g., text, images, and videos) and execute multiple visual tasks (e.g., retrieval, classification, and visual question answering) . Consequently, the development of mobile-level multi-modal and multi-task visual foundation models could present an intriguing direction for future research.
2 Developing Task-specialized Models
In addition to the architectural advancements in backbone models, tailoring deep learning methodologies to specific computer vision tasks of interest has been demonstrated as crucial. Two research challenges of particular significance in this domain can be identified. Firstly, the exploitation of representations extracted by backbones to efficiently obtain task-specific features is essential, for example, multi-scale features for object detection and multi-path fused features for semantic segmentation. A potential solution to this challenge could involve designing specialized, efficient decoders (e.g., utilizing NAS ). Secondly, it is important to streamline the multi-stage design of visual tasks (e.g., two-stage object detection and instance segmentation algorithms) to achieve end-to-end paradigms with minimal performance compromises. Additionally, the removal of time-consuming components, such as non-maximum suppression (NMS) , is crucial. A promising area for future research may involve the development of an efficient, unified, and end-to-end learnable interface for a majority of prevalent computer vision tasks .
3 Deep Networks for Edge Computing
Existing research predominantly targets conventional hardware such as GPUs and CPUs. In edge computing, however, there is increasing demand for deploying deep learning models on Internet of Things (IoT) devices and microcontrollers. These devices are tiny, low-power, inexpensive, and ubiquitous. Developing deep learning algorithms specifically adapted to them is a pressing research direction. MCUNets have provided an initial exploration by optimizing the design, inference, and training of ConvNets for these devices. Another promising concept is spiking neural networks, which, when co-designed with hardware, can yield energy-efficient solutions.
4 Leveraging Large-scale Training Data
Contemporary large visual backbone models have exhibited remarkable scalability with respect to the volume of training data; that is, their performance improves consistently as more training data becomes available. However, it is generally difficult for computationally efficient models with fewer parameters to benefit from this high-data regime to the same extent as their larger counterparts. For example, the improvements obtained by pre-training lightweight models on the large ImageNet-22K/JFT datasets are typically smaller than those observed for larger models. Self-supervised learning algorithms face a similar challenge: methods that are effective for larger models frequently produce limited gains for smaller ones. A promising avenue of research is therefore the exploration of effective, scalable supervised and unsupervised learning algorithms for lightweight models, allowing them to benefit from an unlimited amount of data without incurring the expense of acquiring annotations. Some recent works on novel training algorithms have begun to explore this direction.
5 Practical Efficiency
While numerous existing studies have attained low theoretical computational costs, their practical efficiency may be limited. For example, certain irregular network architectures discovered through NAS may incur considerable latency on GPUs/CPUs, and models employing group or depth-wise convolution may achieve smaller actual speedups than their theoretical computational cost suggests. To tackle this challenge, researchers might consider incorporating the speed on practical hardware as an objective in architecture design or using efficient implementation software. From a hardware design standpoint, one potential direction is the creation of model-specialized hardware platforms.
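The theoretical side of this gap can be made concrete with a short calculation. The sketch below counts multiply-accumulate operations (MACs) for a standard convolution and for a depth-wise separable replacement; the function names and the example layer size are illustrative assumptions, not figures from a cited work. The count measures arithmetic only. It ignores memory traffic, which is one commonly cited reason why depth-wise layers tend to deliver less than the predicted speedup on real hardware.

```python
def conv_macs(h_out: int, w_out: int, c_in: int, c_out: int,
              k: int, groups: int = 1) -> int:
    """MACs of a k x k convolution producing an h_out x w_out x c_out output."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output element sums over (c_in / groups) * k * k products.
    return h_out * w_out * c_out * (c_in // groups) * k * k


def depthwise_separable_macs(h_out: int, w_out: int, c_in: int,
                             c_out: int, k: int) -> int:
    """MACs of a depth-wise k x k convolution followed by a 1 x 1 convolution."""
    depthwise = conv_macs(h_out, w_out, c_in, c_in, k, groups=c_in)
    pointwise = conv_macs(h_out, w_out, c_in, c_out, 1)
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 56 x 56 output, 128 -> 128 channels, 3 x 3 kernel.
    standard = conv_macs(56, 56, 128, 128, 3)
    separable = depthwise_separable_macs(56, 56, 128, 128, 3)
    print(f"standard:  {standard:,} MACs")
    print(f"separable: {separable:,} MACs")
    print(f"theoretical reduction: {standard / separable:.2f}x")
```

For this layer the theoretical reduction is roughly 8.4x. Measured latency on a target device is needed to know how much of that reduction is actually realized, which is why hardware-aware design treats measured speed as an objective alongside the operation count.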
6 Model Compression Approaches
Network compression algorithms, including network pruning, quantization, and knowledge distillation, have shown a strong capacity to reduce the inference cost of deep networks. However, several avenues of investigation remain unexplored. For instance, although the general concept of model compression is not confined to a particular vision task, most algorithms concentrate on image classification, and extending them to other tasks is non-trivial. An important research direction is therefore the development of general-purpose, task-agnostic model compression techniques. Furthermore, strategies such as network pruning may yield irregular architectural topologies, potentially impairing the practical efficiency of deep learning models. Consequently, the study of practically efficient compression methods is a promising area for future research.
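The point about irregular topologies can be illustrated with a toy weight matrix. The sketch below contrasts magnitude-based unstructured pruning with a simple form of structured pruning that removes whole rows (e.g., output channels). It is a simplified illustration under our own assumptions, with hypothetical function names, and does not reproduce any particular published algorithm.

```python
from typing import List

Matrix = List[List[float]]


def unstructured_prune(w: Matrix, sparsity: float) -> Matrix:
    """Zero the smallest-magnitude entries; the matrix shape is unchanged."""
    flat = sorted(abs(v) for row in w for v in row)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return [row[:] for row in w]
    # Entries tied with the threshold are pruned too, so slightly more than
    # n_prune entries may be zeroed when magnitudes repeat.
    threshold = flat[n_prune - 1]
    return [[0.0 if abs(v) <= threshold else v for v in row] for row in w]


def structured_prune(w: Matrix, n_keep: int) -> Matrix:
    """Keep the n_keep rows with the largest L1 norm, in original order."""
    norms = [sum(abs(v) for v in row) for row in w]
    ranked = sorted(range(len(w)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    return [w[i][:] for i in keep]


if __name__ == "__main__":
    w = [
        [0.9, -0.1, 0.4],
        [0.05, 0.02, -0.03],
        [-0.7, 0.6, 0.2],
        [0.1, -0.8, 0.3],
    ]
    for row in unstructured_prune(w, sparsity=0.5):
        print(row)  # same 4 x 3 shape, zeros scattered irregularly
    for row in structured_prune(w, n_keep=2):
        print(row)  # smaller 2 x 3 dense matrix
```

In the unstructured case, the zeros only save computation when a sparse kernel or suitable hardware can skip them. In the structured case, the result is a smaller dense matrix that runs faster with ordinary dense routines, typically at some cost in flexibility.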
Acknowledgments
This work is supported in part by the National Key R&D Program of China (2021ZD0140407), the National Natural Science Foundation of China (62022048, 62276150), Guoqiang Institute of Tsinghua University and Beijing Academy of Artificial Intelligence.