On the Efficacy of Knowledge Distillation

Jang Hyun Cho, Bharath Hariharan

Introduction

The past few years have seen dramatic improvements in visual recognition systems, but these improvements have been driven by deeper and larger convolutional network architectures. The large computational complexity of these architectures has limited their use in many downstream applications. As such, there has been a lot of recent research on achieving the same or similar accuracy with smaller models. Some of this work has involved building more efficient neural network families, pruning weights from larger neural networks, quantizing existing networks to use fewer bits for weights and activations, and distilling knowledge from larger networks into smaller ones.

The last of these, knowledge distillation, is a general-purpose technique that at first glance is widely applicable and complements all other ways of compressing neural networks. The key idea is to use soft probabilities (or ‘logits’) of a larger “teacher network” to supervise a smaller “student” network, in addition to the available class labels. These soft probabilities reveal more information than the class labels alone, and can purportedly help the student network learn better.

The appeal of this approach is in its apparent generality: any student can learn from any teacher. But does knowledge distillation fulfill this promise of generality? Unfortunately, in spite of the recent interest in variants of knowledge distillation, an empirical answer to this question is missing. Prior experiments have typically looked at a small number of carefully chosen architectures, with the implicit assumption that conclusions will generalize across student or teacher architectures. However, there are a few isolated reports of failed experiments with knowledge distillation that suggest that this might not be true. For example, Zagoruyko and Komodakis observe that they are “unable to achieve positive results with knowledge distillation on ImageNet”. What characterizes this, and other experiments where knowledge distillation does not seem to improve performance? Are there student-teacher combinations that perform better? And finally, is there something we can do to improve performance for other combinations?

In this paper, we seek to answer these questions. We find that in general, the teacher accuracy is a poor predictor of the student’s performance. Larger teachers, though they are more accurate by themselves, do not necessarily make for better teachers. We explore the reasons for this and demonstrate that as the teacher grows in capacity and accuracy, the student often finds it difficult to emulate the teacher (resulting in a high KL divergence from the teacher logits even during training). We show that this issue cannot be mitigated by solutions suggested in prior work, such as using a sequence of knowledge-distillation steps to increase the student accuracy. Finally, we find an effective solution to the problem: regularizing the teacher by stopping the training of the teacher early, and stopping knowledge distillation close to convergence to allow the student to fit the training loss better. Our solution is simple to implement and effective across the board at improving the efficacy of knowledge distillation.

The rest of the paper is organized as follows. After describing related work (Sec. 2), we first provide some background on knowledge distillation and attention transfer (Sec. 3). We then describe our experimental setting (Sec. 4). In Sec. 5 we discuss our findings and the empirical evidence for each.

Related Work

The notion of training smaller, cheaper models (“students”) to mimic larger ones (“teachers”) is an old one, first described in a seminal paper on model compression by Buciluǎ et al. This technique can be applied to deep neural networks almost out of the box. In this paper, we use the knowledge distillation framework described by Hinton et al. A brief description of knowledge distillation is provided in Section 3. The original paper on knowledge distillation experimented with the idea on a few small datasets, but a thorough empirical evaluation of knowledge distillation is missing.

Meanwhile, the focus of past work has been either on improving the quality of knowledge distillation or finding new applications for the idea. In the former direction, prior work has explored adding additional losses on intermediate feature maps of the student to bring them closer to those of the teacher. Zhang et al. train a pair of models, distilling knowledge bidirectionally at every epoch. Tarvainen et al. find that averaging consecutive student models over training steps tends to produce better performing students. Yang et al. modify the loss function of the teacher network to be more “tolerant” (that is, by adding more terms to make the model intentionally maintain high energy, benefiting from the teacher’s misclassified logits).

A particular approach to improving knowledge distillation is to perform knowledge distillation repetitively (we call it sequential knowledge distillation). One use of sequential knowledge distillation is as an alternative to ensembling to increase model accuracy. For example, Furlanello et al. suggest training an ensemble of networks using a sequence of knowledge-distillation steps where a network uses its own previous version as a teacher. Interestingly, our results suggest that this approach underperforms an ensemble trained from scratch, and furthermore, such sequential knowledge distillation reduces the ability of the network to act as a teacher. More generally, we find that these methods are highly dependent on the student capacity. In fact, we find them ineffective in many cases, particularly when student capacity is limited or the dataset is complex.

In terms of applications of knowledge distillation, prior work has found knowledge distillation to be useful for sequence modeling, semi-supervised learning, domain adaptation, multi-modal learning and so on. This wide applicability of the idea of knowledge distillation makes an exhaustive evaluation of knowledge distillation ideas even more important.

Background: Knowledge distillation

The key idea behind knowledge distillation is that the soft probabilities output by a trained “teacher” network contain a lot more information about a data point than just the class label. For example, if multiple classes are assigned high probabilities for an image, then that might mean that the image lies close to a decision boundary between those classes. Forcing a student to mimic these probabilities should thus cause the student network to imbibe some of this knowledge that the teacher has discovered above and beyond the information in the training labels alone.

Concretely, given any input image $x$, the teacher network produces a vector of scores $\mathbf{s}^{t}(x)=[s^{t}_{1}(x),s^{t}_{2}(x),\ldots,s^{t}_{K}(x)]$ that are converted into probabilities: $p^{t}_{k}(x)=\frac{e^{s^{t}_{k}(x)}}{\sum_{j}e^{s^{t}_{j}(x)}}$. Trained neural networks produce peaky probability distributions, which may be less informative. Hinton et al. therefore propose to “soften” these probabilities using temperature scaling: $\tilde{p}^{t}_{k}(x)=\frac{e^{s^{t}_{k}(x)/\tau}}{\sum_{j}e^{s^{t}_{j}(x)/\tau}}$.

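Here $\tau$ is the temperature and $\alpha$ sets the tradeoff between the knowledge distillation loss and the cross-entropy loss on the class labels. Both are hyperparameters; popular choices are $\tau\in\{3,4,5\}$ and $\alpha=0.9$.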
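As a rough illustration of how the softened probabilities enter the student's objective, the following sketch computes a distillation loss for a single example in plain Python. It is not the authors' implementation. Two points are assumptions and are marked as such in the code: the $\tau^{2}$ scaling of the distillation term (following the recommendation of Hinton et al.) and the choice that $\alpha$ multiplies the distillation term while $1-\alpha$ multiplies the cross-entropy term. The function names are hypothetical.

```python
import math

def log_softmax(scores, tau=1.0):
    """Log of the temperature-scaled softmax; tau=1 gives the ordinary softmax."""
    scaled = [s / tau for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def softmax(scores, tau=1.0):
    return [math.exp(lp) for lp in log_softmax(scores, tau)]

def kd_loss(student_scores, teacher_scores, tau=4.0):
    """Cross entropy between softened teacher and student distributions.

    ASSUMPTION: the tau**2 factor follows Hinton et al.'s recommendation.
    """
    p_teacher = softmax(teacher_scores, tau)
    log_p_student = log_softmax(student_scores, tau)
    return -(tau ** 2) * sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))

def cross_entropy(student_scores, label):
    """Standard cross entropy against the true class label (tau = 1)."""
    return -log_softmax(student_scores)[label]

def total_loss(student_scores, teacher_scores, label, alpha=0.9, tau=4.0):
    # ASSUMPTION: alpha weights the distillation term, (1 - alpha) the label term.
    ce = cross_entropy(student_scores, label)
    kd = kd_loss(student_scores, teacher_scores, tau)
    return (1.0 - alpha) * ce + alpha * kd
```

For a fixed teacher, the distillation term is smallest when the student's softened distribution matches the teacher's, which is the sense in which the student is asked to mimic the teacher.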

Methods

We perform experiments on both CIFAR10 and ImageNet. In each case we keep the student the same and use multiple teachers of varying capacity to perform knowledge distillation.

For experiments on CIFAR10, we run each model for 200 epochs using SGD with momentum $0.9$ and set the initial learning rate $\gamma=0.1$, dropping it by $0.2$ every 60 epochs. Standard data augmentation was applied to the dataset. For the hyperparameters regarding knowledge distillation, we stayed consistent with the popular choice: temperature $\tau=4$, $\alpha=0.9$, and $\beta=1000$ for attention transfer. Each experiment was repeated 5 times, and the median, mean, and standard deviation are reported. We consider three different network architectures: ResNet, WideResNet, and DenseNet.
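The CIFAR10 learning rate schedule can be written down in a few lines. The sketch below assumes that "dropping by $0.2$" means multiplying the learning rate by $0.2$ at each step and that epochs are counted from zero; neither detail is stated in the text, and the function name is hypothetical.

```python
def step_lr(epoch, base_lr=0.1, factor=0.2, step=60):
    """Learning rate at a 0-indexed epoch under a fixed-interval step schedule.

    ASSUMPTION: each drop multiplies the current rate by `factor`.
    """
    return base_lr * factor ** (epoch // step)

# Learning rate at the start of each phase of a 200-epoch run.
for start in (0, 60, 120, 180):
    print(start, step_lr(start))
```

Under these assumptions a 200-epoch run sees three drops, at epochs 60, 120, and 180.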

ImageNet

For ImageNet experiments we followed Zagoruyko et al. closely since it was the first successful work of knowledge distillation on ImageNet, to the best of our knowledge. We used SGD with Nesterov momentum $0.9$, initial learning rate $\gamma=0.1$, weight decay $1\times 10^{-4}$, and dropped the learning rate by $0.1$ every 30 epochs. As with CIFAR10, we set temperature $\tau=4$, $\alpha=0.9$, and $\beta=1000$ for attention transfer. For ImageNet experiments, we consider ResNet.

Results

The idea behind knowledge distillation is that soft probabilities from a trained teacher reflect more about the data than the true label alone. One might expect that as the teacher becomes more accurate, these soft probabilities will capture the underlying class distribution better and thus serve as better supervision to the student. Thus, intuitively, we might expect that bigger, more accurate models might form better teachers.

We first evaluate if this is true on the CIFAR10 dataset. In Figure 2, the red and blue lines show the accuracy for different student networks trained from different teachers; the left plot varies the “depth” of the teacher while the right plot varies the “width”. From these experiments, we find that the hypothesis that bigger, more accurate models make better teachers is incorrect: although the teacher accuracy continues to rise as the teacher becomes larger (see supplementary for teacher accuracies), the student accuracy rises and then begins to fall. One might wonder if this is an artifact of the CIFAR dataset. We repeated the experiment on ImageNet, with ResNet18 as the student and ResNet18, ResNet34, ResNet50, and ResNet152 as teachers. The results are shown in Table 1. As can be seen, as the teacher becomes larger and more accurate, the student becomes less accurate.

What might be the reason for this decrease? One possibility is that as the teacher becomes both more confident and more accurate, the output probabilities start resembling more and more a one-hot encoding of the true label, and thus the information available to the student decreases. However, softening the probabilities with high temperature did not change this result (detailed later in Figure 6, Table 10), invalidating this hypothesis. Below, we propose an alternative hypothesis.

Analyzing student and teacher capacity

There might be two reasons why a larger, more accurate teacher doesn’t lead to better student accuracy:

The student is able to mimic the teacher, but this does not improve accuracy. This would suggest a mismatch between the KD loss and the accuracy metric we care about.

The student is unable to mimic the teacher, suggesting a mismatch between student and teacher capacities.

We evaluated these hypotheses on CIFAR10 and ImageNet. In Table 2, we show the KD error for CIFAR: the fraction of examples for which the student and teacher predictions differ. Odd rows in Table 3 show the KD loss on ImageNet for a ResNet18 student trained with different teachers. (We show KD error instead of KD loss on CIFAR because of scale issues caused by peaky output distributions.)
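The KD error defined above is a simple disagreement rate. A minimal sketch, assuming each model's output is given as a list of per-class score vectors (one per example) and that ties are broken by taking the first maximal class; the function names are hypothetical:

```python
def predicted_class(scores):
    """Index of the highest-scoring class (first one on ties)."""
    return max(range(len(scores)), key=lambda k: scores[k])

def kd_error(student_outputs, teacher_outputs):
    """Fraction of examples on which student and teacher predictions differ."""
    if len(student_outputs) != len(teacher_outputs) or not student_outputs:
        raise ValueError("need two non-empty lists of equal length")
    disagreements = sum(
        predicted_class(s) != predicted_class(t)
        for s, t in zip(student_outputs, teacher_outputs)
    )
    return disagreements / len(student_outputs)
```

Note that this quantity compares the student to the teacher and not to the ground-truth labels, so a student can have low KD error and still be inaccurate if the teacher is.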

In both cases, the KD Error/Loss is much higher for the largest teacher, which in turn leads to the least accurate student. This suggests that the student is unable to mimic large teachers and points to the second hypothesis, namely, the issue is one of mismatched capacity. We therefore posit that on both ImageNet and CIFAR, due to much lower capacity, the student is unable to find a solution in its space that corresponds well to the largest teacher.

Distillation adversely affects training

Note that knowledge distillation performs particularly poorly on ImageNet, where all teachers lead to lower student accuracy than a student trained from scratch (Table 1). While the previous section suggests that the student may not have enough capacity to match a very large teacher, it is still a mystery why no teacher improves accuracy on ImageNet. Despite multiple recent papers on knowledge distillation, experiments on ImageNet are rarely reported. The few that do report them either find that the standard setting of knowledge distillation fails on ImageNet or experiment with only a small portion of ImageNet. But the reason for this has not been explored.

We dug deeper into the result. Figure 3 shows a comparison of validation accuracy plots between ResNet18 trained from scratch and using knowledge distillation with ResNet34. We find that while the KD loss improves validation accuracy initially, it begins to hurt accuracy towards the end of training.

We hypothesized that because ImageNet is a more challenging problem, the low-capacity student may be in the underfitting regime. The student may not have enough capacity to minimize both the training loss and the knowledge distillation loss, and might end up minimizing one loss (KD loss) at the expense of the other (cross entropy loss), especially towards the end of training.

This hypothesis suggests that we might want to stop the knowledge distillation early in the training process, and do gradient descent only on the cross-entropy loss for the rest of the training. We call this process “Early-stopped” knowledge distillation (“ESKD”) as opposed to standard knowledge distillation (“Full KD”).
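One way to picture early-stopped knowledge distillation is as a schedule on the weight of the distillation term. The sketch below is an illustration under assumptions, not the authors' code: `stop_epoch` is a hypothetical parameter (the text here does not say at which epoch distillation is stopped), $\alpha$ is assumed to weight the distillation term, and the cross-entropy weight is assumed to become $1$ once distillation is switched off.

```python
def kd_weight(epoch, stop_epoch, alpha=0.9):
    """Weight on the distillation term: alpha before stop_epoch, zero from then on."""
    return alpha if epoch < stop_epoch else 0.0

def eskd_loss(ce_loss, kd_loss, epoch, stop_epoch, alpha=0.9):
    """Combine the two losses for one step of early-stopped distillation.

    ASSUMPTION: after stop_epoch the student trains on cross entropy alone,
    with weight 1 on that term.
    """
    w = kd_weight(epoch, stop_epoch, alpha)
    return (1.0 - w) * ce_loss + w * kd_loss
```

Setting `stop_epoch` to the total number of epochs recovers standard ("Full KD") training under the same assumptions.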

Table 3 shows how this version compares to standard knowledge distillation, and also shows the loss values at the end of training. We find that the early-stopped version is better for all three teachers. We also find that, consistent with our hypothesis, the early-stopped version achieves a lower training cross-entropy loss and a higher KD loss than the baseline version, suggesting that the latter models are indeed trading off one loss against the other. Note also that this simple trick of stopping knowledge distillation early now gives us the promised benefit of knowledge distillation: all the early-stopped students in Table 3 perform better than a model of similar architecture trained from scratch (30.24% error).

However, early stopping does not change our original observation: that larger, more accurate teachers don’t result in more accurate students. Even with early stopping, we find that the KD loss on the test set increases with increasing teacher size, suggesting that the student is still struggling to mimic the teacher, and that it is indeed an issue of student capacity.

The efficacy of repeated knowledge distillation

If the difference between teacher and student capacities is very large, one possibility is to first distill from the large teacher to an intermediate teacher and then distill to the student, so that each knowledge distillation step has a better match between student and teacher capacity. This notion of sequential knowledge distillation has been proposed in the literature in other contexts. Recently Furlanello et al. attempted to train a sequence of models, with the $i$-th model in the sequence being trained with knowledge distillation with the $(i-1)$-th model as the teacher. They find that such sequential knowledge distillation may improve the performance compared to a model trained from scratch, and ensembling the sequence produces a better model.
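The control flow of sequential knowledge distillation is short enough to sketch. In the following, `train_student` is a hypothetical callable supplied by the user that trains one model given a teacher; passing `None` as the teacher is assumed to mean training from scratch, which is how the first model in the sequence is assumed to be obtained.

```python
def sequential_kd(train_student, num_generations):
    """Train a chain of models in which each one is distilled from its predecessor.

    train_student(teacher) must return a trained model; teacher=None is taken
    to mean "train from scratch" (ASSUMPTION for the first generation).
    Returns all models in order, so they can be evaluated or ensembled.
    """
    models = []
    teacher = None
    for _ in range(num_generations):
        student = train_student(teacher)
        models.append(student)
        teacher = student  # the new model teaches the next generation
    return models
```

Because every model after the first depends on the one before it, the models in the returned list are not trained independently of each other.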

We first test this claim on CIFAR with multiple networks and with both knowledge distillation and attention transfer (Table 4). We find that there are several caveats to Furlanello et al.’s result. First, for some models (ResNet 8 and ResNet 14), the last student in the sequence actually underperforms a student model trained from scratch. This suggests that the network architecture heavily determines the success of sequential knowledge distillation. Second, we find that although an ensemble of the student models from the entire sequence outperforms a single model trained from scratch, it does not outperform an ensemble of an equal number of models trained from scratch. This might be because the students obtained through a sequence of knowledge distillation steps may be correlated with each other and therefore may not produce a strong ensemble.

If sequential knowledge distillation does indeed improve the accuracy of a model, a natural question to ask is if the resulting model forms a better teacher. To evaluate this, we conducted the following experiment. We chose WRN16-1 as the student model and WRN16-3 as the teacher (note that this is the optimal teacher for this student as suggested by Figure 2). We then trained the teacher using a sequence of 5 iterations of knowledge distillation. We compared the efficacy of this model as a teacher against that of a teacher trained from scratch. As shown in Table 5, a teacher trained with a sequence of knowledge distillation iterations, though more accurate, is not in fact a better teacher.

As discussed above, we might be interested in a variant of this idea where we first attempt to distill from a “large” model to a “medium” model, and then from the “medium” model to a “small” model. If this worked, it might help us avoid the issue of differing student and teacher capacities. We compare this step-wise knowledge distillation to directly distilling from the large model to the small model, or from the medium model to the small model. This might be a way to get around the effect we observed in Figure 2, where the larger models were not necessarily better teachers. We performed this experiment using WRN16-1 as the small model, WRN16-3 (the optimal teacher for WRN16-1) as the medium model and WRN16-8 as the large model. We find that such a step-wise distillation does not work: it performs almost exactly the same as directly using the large model for distillation with the small model (Table 6). Sequential distillation cannot help make large models better teachers.

We repeat some of these experiments on ImageNet and show the results in Table 7. We use early-stopping when performing knowledge distillation based on results from the previous section. Sequential knowledge distillation is in fact ineffective on ImageNet, too. The best result corresponds to a single knowledge distillation from the “small” model to another “small” model, where “small” is ResNet18, “Med.” is ResNet50, and “Large” is ResNet152. All these results suggest that despite the initial promise of sequential distillation, it is not a panacea and it especially does not help us use a large teacher to train a small student of significantly different capacity.

Early-stopped teachers make better teachers

In the previous section we have shown that sequential knowledge distillation is ineffective. This might be because it doesn’t address the core problem: the solution the large teacher has found is simply not in the solution space of the small student. The only solution is to find a teacher whose discovered solution is in fact reachable by the student.

We may perform grid-search to find the optimal teacher network architecture, but that is too expensive. Instead, we propose to regularize the teacher when training it. In particular, we propose to stop the training of the large teacher early. There is some evidence that a large network trained for only a few epochs behaves as a small network, while still encompassing a greater search space than a small network. This method is extremely simple and cheap, since only a third to a fourth of the total number of epochs are needed.

We evaluate the effectiveness of this idea on both CIFAR10 and ImageNet. Figure 4 plots the error rates vs. total epochs on CIFAR10, where the x-axis represents the total number of epochs each teacher is trained. The same hyperparameters as in the other CIFAR10 experiments are used for training the teacher, except the total number of epochs and the learning rate schedule. For training the teacher network, the learning rate is dropped by $0.2$ every $\lfloor\frac{n-5}{3}\rfloor$ epochs, where $n$ is the total number of epochs. We chose $n\in\{35,50,65,80\}$. Notice that for both student models (WRN16-1 and WRN28-1), all early-stopped teachers produce better students than the optimal fully-trained teacher (WRN16-3 and WRN28-3).
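To make the early-stopped teacher schedule concrete, the sketch below evaluates the decay step size $\lfloor\frac{n-5}{3}\rfloor$ for each epoch budget and the resulting learning rate. As before, it assumes that "dropped by $0.2$" means multiplication by $0.2$ and that epochs are 0-indexed; the function names are hypothetical.

```python
def decay_step(n):
    """Epochs between learning rate drops for a teacher trained n epochs in total."""
    return (n - 5) // 3

def teacher_lr(epoch, n, base_lr=0.1, factor=0.2):
    """Learning rate of an early-stopped teacher at a 0-indexed epoch.

    ASSUMPTION: each drop multiplies the current rate by `factor`.
    """
    return base_lr * factor ** (epoch // decay_step(n))

for n in (35, 50, 65, 80):
    print(n, decay_step(n))
```

For $n\in\{35,50,65,80\}$ the formula gives step sizes of 10, 15, 20, and 25 epochs, so under these assumptions each early-stopped teacher sees three drops before training ends.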

Given these promising results, we next turn our attention to ImageNet. We choose $n\in\{35,50\}$ and a learning rate drop schedule of $(15,25,30)$ for 35 epochs and $(20,35,45)$ for 50 epochs. Other hyperparameters and settings are the same as those of the previous ImageNet experiments. Table 8 shows results on ImageNet, where we also compare our results with prior results using knowledge distillation or its variants. Simply early-stopping the knowledge distillation with the largest, fully-trained teacher outperforms most prior work ($\approx 29.45\%$). Our best teachers are the early-stopped ResNet34 and the fully-trained ResNet18 ($\approx 29.01\%$), which have a performance gain of $\approx 1.23$ points over the model trained from scratch and $\approx 0.2\%$ from the best known result for this architecture.
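The ImageNet drop schedules are given as explicit epochs, not as a fixed interval. A minimal sketch of such a milestone schedule follows, assuming a multiplicative factor of $0.1$ per drop (as in the other ImageNet experiments), 0-indexed epochs, and that a drop takes effect at the start of the listed epoch; the function name is hypothetical.

```python
import bisect

def milestone_lr(epoch, milestones, base_lr=0.1, factor=0.1):
    """Learning rate at a 0-indexed epoch given sorted drop epochs.

    ASSUMPTION: one multiplicative drop is applied for every milestone
    that is less than or equal to the current epoch.
    """
    drops = bisect.bisect_right(milestones, epoch)
    return base_lr * factor ** drops

# The two schedules used for early-stopped ImageNet teachers.
schedules = {35: (15, 25, 30), 50: (20, 35, 45)}
for n, milestones in schedules.items():
    print(n, [milestone_lr(e, milestones) for e in (0, milestones[0], n - 1)])
```

Both schedules apply three drops within the shortened budget, mirroring the number of drops in the CIFAR10 early-stopped schedule.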

Table 8 also shows variants using attention transfer, an improvement over knowledge distillation. Early stopping of the teacher and of the student are both very compatible with attention transfer, leading to improvements of 1.6 points over the baseline and 0.7 points over the best numbers obtained with attention transfer.

Other factors impacting knowledge distillation

In the experiments above, we drew both the student and the teacher from the same model family. We now experiment with teachers and students drawn from other, possibly different, model families. Figure 5 shows various combinations of DenseNets and Wide ResNets as students and teachers. Our conclusions, both the inefficacy of knowledge distillation from large teachers and the benefits from early stopping, are apparent in these results.

Impact of $\alpha$ and $\tau$

Till now we have fixed the tradeoff between KD and cross entropy, $\alpha=0.9$, and the temperature, $\tau=4$. Although the standard choice of the temperature is $\tau\in\{3,4,5\}$, one might wonder if our conclusions about early stopping are sensitive to these choices. As shown in Figure 6, we find that the early-stopped teacher consistently outperforms the fully-trained teacher across a range of these hyperparameter values on CIFAR10. We further investigate the high temperature case on the ImageNet dataset (Table 10); we use $\tau=20$. High temperature can theoretically mitigate the peakiness of the teacher logits and may result in better performance. We find that high temperature does increase the overall performance for early-stopped knowledge distillation (“ESKD”) but makes no visible difference for full knowledge distillation (“Full KD”). The early-stopped teacher still performed the best.
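The effect of temperature on peakiness can be checked numerically. The sketch below softens one made-up score vector (an assumption chosen only to be peaky, not taken from any experiment) at $\tau\in\{1,4,20\}$ and reports the largest probability and the entropy of the result.

```python
import math

def softened(scores, tau):
    """Temperature-scaled softmax of a score vector."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical teacher scores for one image over four classes.
PEAKY_SCORES = [10.0, 2.0, 1.0, 0.0]

for tau in (1, 4, 20):
    p = softened(PEAKY_SCORES, tau)
    print(tau, round(max(p), 3), round(entropy(p), 3))
```

As $\tau$ grows, the largest probability shrinks and the entropy rises toward $\log K$, which is why a high temperature is a natural candidate for countering a very confident teacher.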

Generalizability for transfer learning

Although we have seen variations in accuracies on ImageNet, an important aspect of convolutional networks is how well they transfer to other tasks. In Table 9, we examine whether the distilled network can be fine-tuned for classification on Places365 for a variety of students from the previous experiments. The results of transfer learning are consistent with the CIFAR and ImageNet experiments (full KD vs. early-stopped KD, small vs. large teachers, and regular vs. early-stopped teachers), indicating that our findings also apply to transfer.

Conclusion

In this paper, we have presented an exhaustive study of the factors influencing knowledge distillation. Our key finding is that knowledge distillation is not a panacea and cannot succeed when student capacity is too low to successfully mimic the teacher. We have presented an approach to mitigate this issue by stopping teacher training early, to recover a solution more amenable for the student. Finally, we have shown the benefits of this approach on CIFAR10 and ImageNet and also on transfer learning on Places365. We believe that further research into the nuances of distillation is necessary before it can succeed as a general and practical approach.

More Results on CIFAR10

Here we report more results and details of experiments in our work. Consistent with the main paper, “WRN” and “DN” stand for WideResNet and DenseNet, respectively. Tables 17 and 18 show the efficacy of early-stopped teachers for student networks WideResNet16-1 and WideResNet28-1 trained from teachers with varying width factor. As stated in the main paper, the number of total epochs $N\in\{35,50,65,80,200\}$ and learning rate decay step size $k\in\{10,15,20,25,60\}$ were considered in this experiment. Table 19 shows that our conclusions are consistent with a different knowledge distillation method such as attention transfer (“AT+KD”). Tables 11, 12, 13, and 16 show different experiment settings (different student-teacher pairs, learning method, etc.).

Details on ImageNet Experiments

Here we report more details of the ImageNet experiments. Figure 7 compares different student accuracy plots, showing the harmful effect of distillation. Table 14 shows the fully-trained and early-stopped models used as teachers for the ImageNet experiments.