Joint-DetNAS: Upgrade Your Detector with NAS, Pruning and Dynamic Distillation

Lewei Yao, Renjie Pi, Hang Xu, Wei Zhang, Zhenguo Li, Tong Zhang

Introduction

Finding the optimal tradeoff between model performance and complexity has always been a core problem for the community. Three mainstream approaches address this issue: Neural Architecture Search (NAS) automatically searches for promising model architectures; pruning removes redundant parameters from a model while maintaining its performance; and Knowledge Distillation (KD) transfers the learnt knowledge from a cumbersome teacher model to a more compact student model. These methods share the same ultimate goal: boosting the model’s performance while making it more compact. However, jointly optimizing them is a challenging task, especially for detection, which is much more complex than classification. In this paper, we propose Joint-DetNAS, a unified framework for detection which jointly optimizes NAS, pruning and KD.

The aforementioned methods each have some limitations, as illustrated in Figure 1. NAS and pruning only focus on one aspect while neglecting the other: the current de facto paradigm of NAS considers the architecture to be the sole factor that impacts the model’s performance, while pruning only takes parameters into account and is structure-agnostic. A recent work has observed an interesting phenomenon: the pruned model’s final performance highly depends on its retraining initialization. This observation indicates that the architecture and parameters are closely coupled and that both play important roles in the model’s final performance, which motivates us to optimize them jointly.

On the other hand, in conventional KD the architectures of the student-teacher pair are chosen arbitrarily and fixed during training. However, recent works have pointed out the existence of structural knowledge in KD, which implies that the teacher’s architecture has to match the student’s to facilitate knowledge transfer. Therefore, we are inspired to incorporate dynamic KD into our framework, where the teacher is dynamically sampled to find the optimal match for the student.

We propose Joint-DetNAS, a unified framework consisting of two integrated processes: student morphism and dynamic distillation. Student morphism aims to optimize the student’s architecture while removing redundant parameters. To this end, an action space and a weight inheritance training strategy are carefully designed, which eliminate the prerequisite of ImageNet pre-training for the backbone and allow the student to flexibly adjust its architecture while fully utilizing the predecessor’s weights. Dynamic distillation aims at finding the optimal matching teacher and transferring its knowledge to the student. To facilitate teacher search without repeated training, an elastic teacher pool is built to provide sufficient powerful detectors: a super-network is trained only once, and all of its sub-networks obtain competitive performance. During the search, we adopt a neat hill climbing strategy to evolve the student-teacher pair. Thanks to weight inheritance and the elastic teacher pool, each student-teacher pair can be evaluated at the cost of only a few epochs, and the final student detector requires no additional training.

Our framework enables further exploration of the relationship between the architectures of the student-teacher pair. We observe two interesting phenomena: (1) a more powerful detector does not necessarily make a better teacher; (2) the capacities of the student and teacher are highly correlated. These facts indicate the existence of structural knowledge and architecture matching in KD for detection.

We conduct extensive experiments to verify the effectiveness of each component (i.e., KD, pruning and the proposed elastic teacher pool) on the detection task. Our Joint-DetNAS presents a clear performance improvement over 1) the input FPN baseline and 2) pipelining NAS->pruning->KD. Given a classic R101-FPN as the base detector, our framework is able to boost its AP from 41.4 to 43.9 on MS COCO and reduce its latency by 47%, which is on par with the SOTA EfficientDet while requiring less search cost.

Our contributions are as follows: 1) We investigate KD and pruning for detection and carefully analyze their effectiveness. 2) We propose an elastic teacher pool containing sufficient powerful detectors which can be directly sampled without training. 3) We develop a unified framework which jointly optimizes NAS, pruning and dynamic KD. 4) Extensive experiments are conducted to investigate the matching pattern between the student-teacher pair and verify the performance of our proposed framework.

Related Work

Object Detection. State-of-the-art detection networks can be classified as one-stage, two-stage and anchor-free detectors. One-stage detectors directly make predictions on the feature maps. Two-stage detectors use a region proposal network (RPN) to identify the foreground boxes and pass the corresponding features to an RCNN head for final prediction. Recently, anchor-free works propose to eliminate anchor priors and make predictions directly.

Neural Architecture Search. NAS aims at automatically finding an efficient network architecture for a task. Numerous works propose different NAS methods for classification tasks and detection tasks. One recent paper proposed to combine NAS with knowledge distillation by searching for the best student model given a fixed teacher model, which also supports the existence of structural knowledge in KD.

Knowledge Distillation. The effectiveness of KD for classification tasks has been validated by extensive works. However, few works have proposed KD methods for object detection, and these introduce only limited performance gains.

Pruning. Pruning methods, which focus on reducing model complexity without much performance degradation, have been well studied for classification tasks. However, few works have verified their effectiveness on detection tasks.

Proposed Method

As illustrated in Fig. 2, our Joint-DetNAS framework comprises two core processes: student morphism and dynamic distillation.

Student morphism aims to optimize the student’s architecture while removing redundant parameters. However, integrating the two objectives is non-trivial: pruning requires pre-trained weights, which is incompatible with the current NAS paradigm, since it is practically infeasible to obtain pre-trained weights that satisfy pruning requirements for all sampled architectures. To address this issue, we propose a carefully designed action space and a weight inheritance strategy, which enable the student to flexibly adjust its architecture while fully utilizing the predecessor’s weights.

Dynamic distillation aims at finding the optimal matching teacher to adapt to the student’s structural changes, which calls for a way of obtaining sufficient powerful teachers at low cost. The mainstream NAS approach of using a proxy task (e.g., training with fewer epochs) to train the teacher does not guarantee the quality of the teacher’s supervision. On the other hand, training every teacher detector from scratch is too costly. Therefore, inspired by recent work, we propose to construct an elastic teacher pool (ETP) containing high-performing sub-networks, which can be directly sampled as teachers to supervise the student. Empowered by the proposed ETP, teachers can be dynamically and efficiently optimized according to the current status of the student.

A neat hill climbing algorithm is adopted to integrate the two processes, which enables adjusting the student’s architecture and finding the matching teacher simultaneously. Owing to the weight inheritance strategy and the ETP, the search cost of our framework is significantly reduced.

Our goal is to adjust an input detector’s backbone and enable better adaptation to the given task. This is accomplished by continuously applying beneficial actions to the backbone while fully utilizing the predecessor’s parameters.

Weight Inheritance. The trained weights of the predecessor are inherited to (1) provide the initial pre-trained weights for pruning, and (2) eliminate the expensive ImageNet pre-training prerequisite for faster evaluation. Specifically, we define the inheritance process as a function $f_{evolve}$: $f_{evolve}(S_{old}^{\theta}, a) \longrightarrow S_{new}^{\theta'}$, which accepts a detector $S_{old}^{\theta}$ and an action $a$ as its inputs and outputs a new detector $S_{new}^{\theta'}$ with adjusted architecture and inherited parameters. The action space is highly compatible with weight inheritance, and the details of $f_{evolve}$ for each action are elaborated in the appendix.

Search with Dynamic Resolution. The resolution of input images plays an important role in the performance and inference speed of detectors. Instead of directly incorporating input resolutions into the search process, which would expand the search space considerably, we propose to train the student by dynamically sampling a resolution in each training iteration. Thus, multiple resolutions can be evaluated after training, which boosts the search efficiency.

We are inspired by recent work in which a progressive shrinking strategy is proposed to train a super-network only once and obtain all the subnets with competitive performances. This approach fits our requirement for building a pool of teachers containing sufficient powerful detectors. However, the complexity of the detection task introduces new challenges for this already complicated pipeline.

Subnet space. For a backbone with multiple stages, each subnet is determined by sampling the width and depth in a given range at each stage, and the combination of all the subnets forms the subnet space. Besides the backbone, the FPN (neck) also plays an important role in detection: it fuses the feature maps of different scales to obtain richer spatial information. Thus, we incorporate its width variations in our implementation. To facilitate the search, we design our subnet space to cover architectures ranging from that of ResNet18 (1.0x width) to ResNet101 (1.5x width), which contains roughly 765,000 networks (including different image resolutions) with competitive performances. More implementation details of the subnet space can be found in Section 4 and the Appendix.

Training with Integrated Progressive Shrinking (IPS). The training is divided into several phases. In the first phase, only the largest super-net is trained. In the following phases, subnets with shrunk depths and widths are gradually added into the subnet space, while the super-net acts as the teacher to distill all subnets using our KD method proposed in Section 3.2. In contrast to the original progressive shrinking (PS) strategy, where the shrinkage of width and depth is performed sequentially, we propose an integrated progressive shrinking strategy (IPS) that jointly optimizes smaller depths and widths, thus significantly reducing the training cost. More details can be found in the Appendix.

Different from mainstream NAS methods, our framework aims to upgrade a base detector $S_{base}$ rather than exploring the whole search space. Compared with sample-based search algorithms (e.g., RL and BO), the Hill Climbing (HL) approach efficiently evolves the student-teacher pairs and is highly compatible with the weight inheritance strategy.

During the search, we optimize the student and the teacher alternately. Specifically, the algorithm starts with an initial student-teacher pair. During each iteration, either the student is updated by applying an action (as described in Section 3.1.2) or the teacher is mutated by modifying the depth or width in each backbone stage. Benefiting from the weight inheritance strategy, each student-teacher pair can be evaluated with only a few epochs of training. We use the following scoring metric to evaluate a student-teacher pair:

where $S$ is the student detector; $C$ is the complexity metric, for which we adopt FLOPs since we do not target any particular device; $R$ is the resolution of the input image; $C_{base}$ and $R_{base}$ are the base complexity and base resolution; $\alpha$ is a coefficient that balances the performance and complexity trade-off; $\beta$ balances the complexity introduced by the architecture and the input resolution.

The search procedure is illustrated in Algorithm 1. Our framework can be parallelized on multiple machines to boost the search efficiency.
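To make the alternating search concrete, the following is a minimal Python sketch of a hill-climbing loop over student-teacher pairs. It is not a reproduction of Algorithm 1: the names `hill_climb`, `mutate_student`, `mutate_teacher` and `score_fn` are hypothetical, and strict alternation between student and teacher updates as well as uniform sampling of the parent pair from a top-k list are simplifying assumptions. In the actual framework the score would come from a few epochs of training with inherited weights rather than from a cheap function call.

```python
import random
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")  # student representation (hypothetical)
T = TypeVar("T")  # teacher representation (hypothetical)


def hill_climb(
    student: S,
    teacher: T,
    mutate_student: Callable[[S, random.Random], S],
    mutate_teacher: Callable[[T, random.Random], T],
    score_fn: Callable[[S, T], float],
    num_iters: int = 50,
    top_k: int = 5,
    seed: int = 0,
) -> Tuple[float, S, T]:
    """Alternately mutate student and teacher, keeping the top-k scored pairs."""
    rng = random.Random(seed)
    pool: List[Tuple[float, S, T]] = [(score_fn(student, teacher), student, teacher)]
    for it in range(num_iters):
        # Pick a parent pair from the current top-k list (uniformly, by assumption).
        _, s, t = rng.choice(pool)
        # Even iterations change the student, odd iterations change the teacher.
        if it % 2 == 0:
            s = mutate_student(s, rng)
        else:
            t = mutate_teacher(t, rng)
        pool.append((score_fn(s, t), s, t))
        # Keep only the k best pairs, highest score first.
        pool.sort(key=lambda entry: entry[0], reverse=True)
        del pool[top_k:]
    return pool[0]
```

Because the initial pair always competes for a slot in the top-k list, the best score returned can never be lower than the score of the starting pair, which mirrors the "upgrade a base detector" framing of the search.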

3.2 Knowledge Distillation for Detection

Detection KD requires delicate design to distill spatial and localization information. Our detection KD method includes two components: a) feature-level distillation maximizes the agreement between the teacher’s and the student’s backbone features in regions of interest; b) prediction-level distillation uses the prediction outputs from the teacher’s heads as soft labels to train the student.

Feature maps encode important semantic information. However, imitating the whole feature maps is hindered by the severe imbalance between foreground instances and background regions. To this end, we only distill the features of object proposals. The objective can be formulated as:

where $F_{S}$ and $F_{T}$ are features after ROI align; $f_{adap}(\cdot)$ is an adaptation function mapping $F_{S}$ and $F_{T}$ to the same dimension; $N_{p}$ is the number of mask’s positive points; $L$ is the number of FPN layers; $W,H,C$ are feature dimensions.

The prediction level KD loss can be expressed in terms of classification and regression KD loss: $L_{pred}=L_{cls}+L_{loc}$ .

Uncertainty from Classification. Similar to classification, the student is optimized by a soft cross entropy loss using the teacher’s logits as targets, which can be written as: $L_{cls}=-\frac{1}{N}\sum_{i}^{N}\mathbf{P}_{t}^{i}\log\mathbf{P}_{s}^{i}$, where $N$ is the number of training samples; $\mathbf{P}_{t}$ and $\mathbf{P}_{s}$ are the predicted score vectors of the teacher and the student, respectively.
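As an illustration of the formula for $L_{cls}$, the NumPy sketch below computes the soft cross entropy between teacher and student score vectors obtained by a softmax over logits. The function names are hypothetical, and no temperature is applied because the text does not mention one; this is a sketch of the loss definition, not the training code.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def soft_cross_entropy(student_logits: np.ndarray, teacher_logits: np.ndarray) -> float:
    """L_cls = -(1/N) * sum_i P_t^i . log P_s^i over N samples (rows)."""
    p_t = softmax(teacher_logits)
    # Stable log-softmax of the student logits.
    shifted = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_p_s = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Dot product per sample, then average over the N samples.
    return float(-(p_t * log_p_s).sum(axis=-1).mean())
```

For a fixed teacher, the loss is smallest when the student reproduces the teacher’s distribution, in which case it equals the entropy of the teacher’s scores.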

3.3 Model Pruning

Pruning is incorporated into the framework as a part of student morphism to reduce the student detector’s complexity. We utilize both layer-wise and channel-wise pruning, which reduce the student’s depth and width, respectively. Layer-wise pruning removes the entire backbone layer with the smallest L1 norm. Channel-wise pruning reduces the width of the detector’s backbone, for which we apply the network slimming approach. The method determines channel importance according to the magnitude of the BN weights; the channels with the least importance are then removed. To encourage channel sparsity, we add a regularization loss on the BN weight parameters $\gamma$: $L_{BN}=\sum_{\gamma\in\Gamma}\mid\gamma\mid$. A small pruning percentage is set during each student morphism to progressively shrink the student without causing much performance deterioration.
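The channel-wise part of this procedure can be sketched in a few lines of NumPy: compute $L_{BN}$ as the sum of absolute BN scales, then keep only the channels whose $\mid\gamma\mid$ lies above a global threshold. The function names and the strict-threshold tie handling are assumptions made for illustration, and the sketch ignores the skip-connection constraint on channels shared within a stage.

```python
from typing import List

import numpy as np


def bn_l1_penalty(gammas: List[np.ndarray]) -> float:
    """L_BN: sum of |gamma| over the scale parameters of all BN layers."""
    return float(sum(np.abs(g).sum() for g in gammas))


def channel_keep_masks(gammas: List[np.ndarray], prune_ratio: float) -> List[np.ndarray]:
    """Boolean keep-mask per BN layer after pruning the globally smallest |gamma|.

    Channels from all layers are ranked together; the `prune_ratio` fraction
    with the smallest magnitude is dropped. Ties at the threshold are pruned too.
    """
    magnitudes = np.sort(np.concatenate([np.abs(g).ravel() for g in gammas]))
    num_prune = int(len(magnitudes) * prune_ratio)
    if num_prune == 0:
        return [np.ones(g.shape, dtype=bool) for g in gammas]
    threshold = magnitudes[num_prune - 1]
    return [np.abs(g) > threshold for g in gammas]
```

With a small `prune_ratio` per step, repeated application shrinks the network gradually, in line with the progressive shrinking of the student described above.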

The total loss for training student detectors can be represented as: $L=L_{det}+L_{feat}+L_{pred}+\lambda L_{BN}$ , where $L_{det}$ denotes the normal detection training loss; $\lambda$ is the coefficient of the regularization loss for pruning, which is set to 0.00001. $L$ is enforced on the student throughout the search process.

Experiments

Datasets and evaluation metrics. We use MS COCO to conduct experiments. The mAP for IoU thresholds from 0.5 to 0.95 is used as the performance metric.

Implementation details. We use ResNet-based detectors to construct our elastic teacher pool. The subnet space for the backbone contains depth ranging from to for four stages, and the width for each backbone stage and the neck can be sampled from $[W, 1.25W, 1.5W]$, where $W$ is the width of standard ResNet. During search, each student-teacher pair is trained for 3 epochs for fast evaluation. For each teacher subnet sampled from the teacher pool, we reset its BN statistics by forwarding a batch of images, which is essential for performance recovery. More details are in the Appendix.

Each component plays an important role in the overall Joint-DetNAS framework. Thus, it is essential to decouple them from the framework and separately analyze their effectiveness in detail.

Quality of Elastic Teacher Pool. Our framework requires the teacher detectors sampled from the ETP to have competitive performances. To demonstrate the quality of our ETP, we compare its sampled subnets with their equivalent classic FPN detectors trained under standard 2x schedule and multi-scale training (for easier notation, we denote this as 2x+ms in later sections) strategy in Table 1. The former consistently outperforms the latter.

Pruning. We conduct experiments to prune the backbone of R50-FPN and R101-FPN detectors given different channel pruning percentages in Table 2. The detectors are pre-trained for 12 epochs before pruning and fine-tuned for an extra 3 epochs afterwards. The detector’s parameters can be effectively reduced without much performance degradation; e.g., for both detectors, the performance after pruning 30% of the channels is still comparable to the original.

Distillation. Our detection KD framework is simple yet effective. We compare our detection KD method with baselines and previous KD works in Table 3. The R18-FPN and R50-FPN detectors are adopted as the students, with R50-FPN and R101-FPN as the teachers, respectively. To demonstrate the effectiveness of our KD method, stronger baselines (2x+ms) are used. The results show that our method outperforms the others by a large margin.

We aim to verify the superiority of dynamic KD, i.e., whether a dynamic teacher is better than a fixed powerful teacher for transferring knowledge. Specifically, we fix the ResNet18-FPN detector as the student and follow the 3-epoch iterative training schedule, then conduct two experiments: (1) dynamic KD (DKD), where the teacher is dynamically sampled in every iteration, and (2) conventional KD (CKD), where the largest super-net in the ETP is used as the teacher. The results in Figure 4 show that DKD can boost the student faster and help it reach a higher final performance. This also implies the underlying structural knowledge in KD, for which we provide further analysis in Section 4.3.1.

4.2 Main Results

Joint-DetNAS can upgrade detectors with various backbone designs. We conduct experiments on FPN detectors with R18, R50, R101 and X101 as backbones to verify the effectiveness of our framework. We use $1333\times 800$ resolution with 2x+ms training for the baselines and compare them with our results using the searched resolution. As shown in Table 4, our method consistently boosts the detectors’ performances while substantially reducing their complexities. Notably, for R101-FPN, the upgraded detector achieves a $+2.5$ gain in $AP$ and a $47\%$ reduction in latency.

Intuitively, NAS, pruning and KD can be combined by pipelining: first search a detector with NAS, then prune it and train it with KD. We compare our joint optimization approach with pipelining methods: (1) start with a regular R101-FPN detector or a NAS-searched detector with lower complexity; (2) pre-train it with the pruning regularization loss; (3) prune the detector to a complexity comparable with the result of Joint-DetNAS (R101-based); (4) train the pruned detector with the proposed KD under the standard training strategy (2x+ms) and the same resolution ($1080\times 720$). In Table 4, we compare the results of NAS-prune-KD and R101-prune-KD and find that the performance gain brought by NAS diminishes after pruning and KD are applied, indicating that the naive pipelining strategy leads to suboptimal results. In contrast, our joint optimization method outperforms both pipelining methods by a large margin.

We compare our method with SOTA manually designed detectors (e.g., FCOS, RepPoints and CB-Net) and NAS-based approaches (e.g., NAS-FPN and SP-NAS). The results on COCO’s test-dev split are reported in Table 6. Our Joint-DetNAS outperforms SOTA manually designed detectors in terms of both FPS and AP; e.g., our searched detector based on R101 reaches 23.3 FPS and 43.9 AP, outperforming RepPoints-R101’s 13.7 FPS and 41.0 AP by a large margin. Furthermore, our method (R101-based) surpasses most mainstream detection NAS methods (e.g., SM-NAS and NAS-FPN) and reaches comparable performance with the SOTA EfficientDet (D2), while requiring much less search cost and no extra post-search training epochs.

Search efficiency is a key issue in NAS. We compare Joint-DetNAS with other SOTA detection NAS methods in Table 7. Our framework finds a better performance-complexity tradeoff for the detector with less search cost.

4.3 Looking into the Search Results: More Analysis

As observed in Section 4.1.2, larger detectors may not be better teachers, which naturally prompts us to further explore the matching pattern of promising teachers for different students. To this end, we apply dynamic KD to search for optimal matching teachers for students with various complexities (i.e., FPN with R18, R50 and R101 as the backbones). As shown in Figure 5, starting with the same teacher, each student can converge to a different teacher. The results present a clear pattern: smaller students tend to match teachers with lower capacities, and vice versa. This phenomenon implies an underlying interdependence of complexity within the student-teacher pair, which can provide useful insights for designing detection KD systems.

We study how the student evolves along the search process by analyzing the actions that improved the score function $H$ throughout the generations (the generation increases when the student’s performance is boosted) for our R50- and R101-based searches. Figure 6 shows the shift of focus in balancing the performance-complexity tradeoff. We can see that channel pruning contributes the largest score increment. In early phases, channel pruning occurs more often to adjust the network as a whole, while in later phases, Add-layer, Prune-layer and Rearrange follow to adjust the computation allocation at each stage in a fine-grained manner.

In Figure 7, we show the backbone’s computation allocation of our R50- and R101-based detectors before and after the search. The computation at stage 3 is reduced most dramatically, followed by stage 2 and stage 1. This implies the redundancy distribution in manually designed ResNet models, which provides the community with some prior knowledge for detector’s backbone design.

Conclusion

This paper presents a new way of jointly optimizing NAS, pruning and KD to boost the performance and reduce the complexity of object detectors. Extensive experiments are conducted to show the superior performance of our proposed Joint-DetNAS framework. We believe our method has the potential to be extended to tasks other than object detection.

Supplementary Materials

We adopt ResNet-based detectors as the search space due to their popularity in the detection community. Specifically, the backbone architecture is divided into four stages, where the feature resolution halves and the number of output channels doubles at the beginning of each stage. The Basic block is used for R18-based students, while the Bottleneck block is used for other students and the teacher pool. In the following sections, “layer” and “block” are used interchangeably.

The student’s action space contains four actions: (1) Channel Pruning, (2) Layer Pruning, (3) Add-Layer and (4) Rearrange.

We specify the definition of $f_{evolve}$ for each action.

Pruning. The parameters are first ranked globally by an importance measure, then the least important ones are removed while the rest are inherited. For Channel Pruning, the importance measure is the magnitude of each BN’s channel weights. For Layer Pruning, the importance measure is the parameter’s L1 norm.

Add-Layer. This action aims to introduce extra capacity into the detector while maintaining the performance of the predecessor. This is realized by initializing the block as an identity mapping. Specifically, for each block in ResNet whose output can be represented as $H(x)=F(x)+x$, we make $F(x)$ equal to 0 by applying Dirac initialization to the CONV layers and zero-initializing the last BN layer. The new layer is appended to the end of the selected stage.

Rearrange. A stage is first selected, then the layer at the beginning or the end of the stage is moved to its neighboring stage by modifying its stride. The parameters can then be directly inherited.
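The identity initialization used by Add-Layer can be illustrated with a toy residual block. The NumPy sketch below is an assumption-laden stand-in, not the detector code: a dense layer replaces the convolutions, the class name `ToyResidualBlock` is hypothetical, and only the zero-initialized last BN is modelled (Dirac initialization of the convolutions is omitted, since a zero BN scale and shift already force $F(x)=0$).

```python
import numpy as np


class ToyResidualBlock:
    """H(x) = F(x) + x with F(x) = gamma * normalize(x W^T) + beta."""

    def __init__(self, dim: int, rng: np.random.Generator):
        # Arbitrary weights: they do not matter while gamma and beta are zero.
        self.weight = rng.standard_normal((dim, dim))
        # Zero-initialising the last BN's scale and shift makes F(x) = 0.
        self.gamma = np.zeros(dim)
        self.beta = np.zeros(dim)

    def residual(self, x: np.ndarray) -> np.ndarray:
        h = x @ self.weight.T
        # Batch normalisation over the batch axis (axis 0).
        h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + 1e-5)
        return self.gamma * h + self.beta

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return self.residual(x) + x
```

A freshly added block therefore returns its input unchanged, so the detector’s predictions are preserved at insertion time and the new capacity only takes effect once training moves `gamma` away from zero.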

Subnet Space. In our implementation of the ETP, the super-network is set to have the same depth and 1.5x width as ResNet101. Specifically, the depths and the width coefficients are and [1.5, 1.5, 1.5, 1.5] at each stage, respectively. During our integrated progressive shrinking training, the subnet space is gradually expanded to include smaller subnets. At the final phase, the smallest subnet in the space has depths and width coefficients [1.0, 1.0, 1.0, 1.0] at each stage, all the subnets in between can be sampled and trained. The width coefficients can be 1.0, 1.25 or 1.5.

Dynamic Resolution. We use $512\times 512$ , $800\times 600$ , $1080\times 720$ and $1333\times 800$ as the predefined resolutions, from which one is randomly sampled during each training iteration.
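A minimal sketch of per-iteration resolution sampling is shown below. The four resolutions are those listed above; uniform sampling and the helper names `sample_resolution` and `resolution_schedule` are assumptions for illustration.

```python
import random
from typing import List, Tuple

# Predefined (width, height) input resolutions from the text.
RESOLUTIONS: List[Tuple[int, int]] = [(512, 512), (800, 600), (1080, 720), (1333, 800)]


def sample_resolution(rng: random.Random) -> Tuple[int, int]:
    """Draw one resolution uniformly at random for the current iteration."""
    return rng.choice(RESOLUTIONS)


def resolution_schedule(num_iters: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Return the resolution used at each of `num_iters` training iterations."""
    rng = random.Random(seed)
    return [sample_resolution(rng) for _ in range(num_iters)]
```

Because every resolution is seen during training, the same trained weights can afterwards be evaluated at each resolution without retraining.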

Phases of integrated progressive shrinking. (1) Training the super-network: the super-network is firstly trained with dynamic resolution, which is later used as the teacher detector to distill other subnets. (2) First shrinking phase: the depths and widths of the subnet space are expanded to and [1.25-1.5,1.25-1.5,1.25-1.5,1.25-1.5] , respectively. (3) Second shrinking phase: the depths and widths of the subnet space are expanded to and [1.0-1.5,1.0-1.5,1.0-1.5,1.0-1.5] , respectively. During (2) and (3), one subnet is randomly sampled from the subnet space and trained in each training iteration. Dynamic resolution is adopted throughout the training process.
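The per-iteration subnet sampling in phases (2) and (3) might look like the sketch below. The width choices per phase follow the text; everything else is an assumption: the function name `sample_subnet` is hypothetical, the neck width is assumed to follow the same phase-wise choices as the backbone stages, and the depth ranges are passed in by the caller because they are not specified here.

```python
import random
from typing import Dict, Sequence, Tuple

# Width coefficients available in each shrinking phase (from the text).
PHASE_WIDTHS: Dict[int, Tuple[float, ...]] = {
    1: (1.25, 1.5),        # first shrinking phase
    2: (1.0, 1.25, 1.5),   # second shrinking phase
}


def sample_subnet(
    phase: int,
    depth_ranges: Sequence[Tuple[int, int]],
    rng: random.Random,
) -> Dict[str, object]:
    """Sample one subnet: a depth and a width coefficient per backbone stage.

    `depth_ranges` holds an inclusive (min, max) depth per stage and must be
    supplied by the caller; no particular values are implied by this sketch.
    """
    choices = PHASE_WIDTHS[phase]
    return {
        "depths": [rng.randint(lo, hi) for lo, hi in depth_ranges],
        "widths": [rng.choice(choices) for _ in depth_ranges],
        "neck_width": rng.choice(choices),  # assumption: neck shrinks like the stages
    }
```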

Training details. The teacher pool is trained from scratch on 32 GPUs with batch size $2\times 32$ (2 for each GPU). Synchronized BN is adopted to normalize the input distribution across multiple nodes, which addresses the issue caused by the small per-GPU batch size. A step learning rate schedule is used throughout training. The initial learning rate and training epochs for the 3 phases are described in Table 8.

The student’s architecture is fixed during the first 5 search iterations to make the search more stable. At the beginning of each search iteration, one student-teacher pair is sampled from the top-k list according to the score ranking. The size of the top-k list is set to 5. In $f_{score}$, $\beta$ is set to 0.8 for all base detectors; $\alpha$ is set to 0.1 for X101 to encourage higher performance, while it is set to 0.4 for other base detectors. During the fast evaluation phase, $\{S_{new}^{\theta'}, T_{new}\}$ is trained for 3 epochs under a cosine learning rate schedule, where the initial learning rate is set to 0.01, the batch size is 4, and synchronized BN is adopted.
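For reference, a cosine learning rate schedule with the stated initial value of 0.01 can be written as below. The function name `cosine_lr`, the final learning rate of 0 and the absence of warm-up are assumptions; the text only specifies the schedule type, the initial learning rate and the 3-epoch length.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 0.01, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate at `step` (0-indexed) out of `total_steps`."""
    # Fraction of training completed, clamped to [0, 1].
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The rate starts at `base_lr`, passes through half of it at the midpoint, and decays smoothly to `min_lr` at the end of the 3-epoch evaluation.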

A.2 Knowledge Distillation

Adaptation function. The adaptation function $f_{adap}(\cdot)$ is implemented as a 3x3 Conv layer to match the feature dimensions of the student-teacher pair. The output dimension is set to 256 and the stride is set to 1.

Proposal matching. The student and the teacher have different proposals, leading to unmatched outputs which cannot be directly distilled. We solve this by sharing student’s proposals with the teacher.

A.3 Pruning

The existence of skip connections constrains the blocks in the same stage to have identical output dimensions. Thus, the channels cannot be arbitrarily pruned. To address this issue, the BN weights in the projection mapping (the skip connection of the stage’s first block) are used to prune the output channels of all blocks in the stage. The other channels inside each block are determined by the weights of the two BN modules in the middle.

To encourage channel sparsity, we enforce a regularization term on the weights of BN. We set the loss weight $\lambda$ to be $1\times 10^{-5}$ in our implementation.

Appendix B Encoding of the Searched Architecture

The student’s backbone architecture is encoded as the output channels of each convolutional layer in each block at every stage. Blocks and stages are separated by “-” and “], [”, respectively. We list the encodings of the students obtained with different base detectors and the corresponding input resolutions.

R18. Student: [(64, 64)], [(128, 128)-(128, 128)], [(256, 256)-(256, 256)], [(512, 512)-(512, 512)]; Input size: $1080\times 720$ .

R50. Student: [(58, 59, 205)-(60, 64, 205)-(63, 62, 205)], [(127, 128, 314)-(109, 122, 314)-(127, 123, 314)-(125, 124, 314)], [(256, 255, 591)-(243, 245, 591)-(237, 247, 591)-(243, 246, 591)-(252, 244, 591)-(252, 254, 591)], [(509, 507, 1856)-(509, 506, 1856)-(508, 507, 1856)]; Input size: $1080\times 720$ .

R101. Student: [(49, 62, 202)-(35, 33, 202)-(56, 62, 202)], [(123, 128, 300)-(57, 90, 300)-(117, 113, 300)-(124, 117, 300)], [(255, 254, 321)-(65, 127, 321)-(32, 47, 321)-(32, 63, 321)-(120, 161, 321)-(132, 181, 321)-(162, 232, 321)-(175, 241, 321)-(143, 237, 321)-(199, 246, 321)-(210, 238, 321)-(201, 225, 321)-(210, 215, 321)-(211, 222, 321)-(201, 208, 321)-(198, 206, 321)-(220, 213, 321)-(226, 221, 321)-(234, 221, 321)-(237, 222, 321)], [(249, 229, 321)-(245, 231, 321)-(511, 478, 2031)-(507, 503, 2031)-(491, 477, 2031)]; Input size: $1080\times 720$ .

X101. Student: [(128, 128, 256)-(112, 112, 256)-(124, 124, 256)], [(256, 256, 512)-(256, 256, 512)-(256, 256, 512)-(256, 256, 512)], [(512, 512, 1024)-(448, 448, 1024)-(480, 480, 1024)-(496, 496, 1024)-(512, 512, 1024)-(464, 464, 1024)-(416, 416, 1024)-(416, 416, 1024)-(416, 416, 1024)-(416, 416, 1024)-(432, 432, 1024)-(496, 496, 1024)-(400, 400, 1024)-(400, 400, 1024)-(464, 464, 1024)-(464, 464, 1024)-(432, 432, 1024)-(352, 352, 1024)-(400, 400, 1024)-(384, 384, 1024)-(272, 272, 1024)-(384, 384, 1024)-(384, 384, 1024)], [(384, 384, 1024)-(352, 352, 1024)-(1024, 1024, 2048)-(864, 864, 2048)-(384, 384, 2048)]; Input size: $1333\times 800$ .
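The encodings above are easy to parse programmatically. The sketch below, with the hypothetical helper `parse_backbone_encoding`, assumes each stage is wrapped in a matching pair of square brackets and each block is a parenthesised tuple of output-channel counts, as described at the start of this appendix.

```python
import re
from typing import List, Tuple


def parse_backbone_encoding(encoding: str) -> List[List[Tuple[int, ...]]]:
    """Parse '[(a, b)-(c, d)], [(e, f)]' into a list of stages of blocks."""
    stages: List[List[Tuple[int, ...]]] = []
    # Each stage is the text between a '[' and the next ']'.
    for stage_text in re.findall(r"\[([^\]]*)\]", encoding):
        blocks: List[Tuple[int, ...]] = []
        # Each block is a parenthesised, comma-separated tuple of channel counts.
        for block_text in re.findall(r"\(([^)]*)\)", stage_text):
            blocks.append(tuple(int(c) for c in block_text.split(",")))
        stages.append(blocks)
    return stages
```

The parsed structure gives the depth of each stage as the length of its block list and the width of each layer as the tuple entries, which is enough to rebuild the backbone configuration.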

Appendix C Illustration of the search process

In Figure 8, we show the $H$ score (defined in Section 3.1.4 of the paper) of sampled student detectors across generations. The results verify that Joint-DetNAS can consistently optimize the performance-complexity tradeoff for various base detectors. In addition, the weight inheritance strategy enables the student’s score to be consistently improved throughout the search. We exclude the $512\times 512$ input resolution from the plot since it presents a clear performance gap with other resolutions.

In Figure 9, we show the Pareto-optimal results of various base detectors. R101 almost dominates both R18 and R50, which indicates that, given the same score function, starting with a larger base detector is often the better choice: a base detector with higher capacity can be adjusted more flexibly and thus yields a better performance-complexity tradeoff.

Appendix D Post-search Fine-tuning Further Improves Performance

Although the obtained student detector can achieve competitive performance without additional training, we want to show that applying post-search fine-tuning to the student-teacher pair is able to further improve the student’s performance. The results are demonstrated in Table 9.

Appendix E Iterative Training Does Not Hurt Performance

In the framework, the student detector is trained iteratively in each search iteration during fast evaluation. Each iterative training process lasts for three epochs with a cosine learning rate schedule. We compare it with full training in this experiment. Specifically, we fix the student detector and use the super-net in the ETP as the teacher. Then we plot the change of AP with the training time for iterative training. Iterative training follows the same setting as mentioned in A.1.4. Full training adopts a 2x schedule with cosine learning rate decay and an initial learning rate of 0.02. The result in Figure 10 shows that: (1) the convergence speeds are comparable, and (2) the final performance of iterative training is on par with full training.

Appendix F Search with ETP

In fact, the ETP can already serve as a search space, from which detectors can be directly sampled. We compare the search result of the ETP with other NAS methods and our Joint-DetNAS in Table 10. The comparison shows that both ETP search and Joint-DetNAS outperform previous works: ETP search is more efficient, while Joint-DetNAS achieves higher performance. Furthermore, the Joint-DetNAS framework is applicable to different student architecture families without retraining the teacher pool, and is thus more flexible and economical.

Appendix G Ablation Study of Distillation for Object Detection

Comparison of different ways to distill feature-level information. Most previous detection KD methods aim to better distill the teacher’s feature-level information. We compare the mask-based methods with the adopted proposal feature distillation in Table 11 and find that the latter results in the largest performance gain, while being the simplest to implement.

Analysis of each component in our KD framework. The ablation study of each component is shown in Table 12. Our experiments demonstrate that both feature-level and prediction-level distillation bring considerable improvement. We can also see that our proposed class-aware localization loss brings a noticeable improvement relative to the original approach, which directly distills the localization outputs. The student is R18-FPN trained under the 1x schedule, while the teacher is trained under the 2x+ms schedule.

Appendix H Ablation Study of Pruning for Object Detection

We analyze the effect of the regularization term as well as the pattern of all BNs’ weights in the detector’s backbone in Figure 11. As shown in the graph, more BN’s channel weights are close to 0 after the regularization is enforced. In addition, BN’s weights in the third projection mapping are smaller, thus causing the third stage to be pruned the most. This also indicates that the third stage contains the most redundancy.