Deeply-Supervised Nets

Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, Zhuowen Tu

Introduction

Much attention has been given to a resurgence of neural networks, deep learning (DL) in particular, which can be of unsupervised, supervised, or hybrid form. Significant performance gains have been observed, especially in the presence of large amounts of training data, when deep learning techniques are used for image classification and speech recognition. On the one hand, hierarchical and recursive networks have demonstrated great promise in automatically learning thousands or even millions of features for pattern recognition; on the other hand, concerns about deep learning have been raised and many fundamental questions remain open.

Some potential problems with the current DL frameworks include: reduced transparency and discriminativeness of the features learned at hidden layers; training difficulty due to exploding and vanishing gradients; lack of a thorough mathematical understanding of the algorithmic behavior, despite some attempts made on the theoretical side; dependence on the availability of large amounts of training data; and complexity of manual tuning during training. Nevertheless, DL is capable of automatically learning and fusing rich hierarchical features in an integrated framework. Recent activities in open-sourcing and experience sharing have also greatly helped the adoption and advancement of DL in the machine learning community and beyond.

Several techniques, such as dropout, dropconnect, pre-training, and data augmentation, have been proposed to enhance the performance of DL from various angles, in addition to a variety of engineering tricks used to fine-tune feature scale, step size, and convergence rate. Features learned automatically by the CNN algorithm are intuitive. Some portion of the features, especially those in the early layers, also demonstrate a certain degree of opacity. This finding is consistent with an observation that different initializations of the feature learning at the early layers make negligible difference to the final classification. In addition, the presence of vanishing gradients makes DL training slow and ineffective.

In this paper, we address the feature learning problem in DL by presenting a new algorithm, deeply-supervised nets (DSN), which enforces direct and early supervision for both the hidden layers and the output layer. We introduce a companion objective for the individual hidden layers, which is used as an additional constraint (or a new regularization) on the learning process. Our new formulation significantly enhances the performance of existing supervised DL methods. We also attempt to justify our formulation using stochastic gradient techniques. We show an improvement in the convergence rate of the proposed method over standard ones, assuming local strong convexity of the optimization function (a very loose assumption, but one pointing to a promising direction).

Several existing approaches are particularly worth mentioning for comparison. Layer-wise supervised pre-training has been performed in prior work. Our proposed method does not perform pre-training, and it emphasizes the importance of minimizing the output classification error while reducing the prediction error of each individual layer. This is important because backpropagation is performed jointly in an integrated framework. Label information has also been used for unsupervised learning, and semi-supervised learning has been carried out in deep learning. An SVM classifier has been used for the output layer, in place of the standard softmax function in the CNN. Our framework (DSN), with the choice of using SVM, softmax, or other classifiers, emphasizes the direct supervision of each intermediate layer. In the experiments, we show consistent improvement of DSN-SVM and DSN-Softmax over CNN-SVM and CNN-Softmax respectively. We observe state-of-the-art results on all of MNIST, CIFAR-10, CIFAR-100, and SVHN. It is also worth mentioning that our formulation is compatible with various recently proposed techniques such as averaging, dropconnect, and Maxout. We expect to see further reduction in classification error with careful engineering for DSN.

Deeply-Supervised Nets

In this section, we give the main formulation of the proposed deeply-supervised nets (DSN). We focus on building our infrastructure around supervised CNN-style frameworks by introducing a classifier, e.g. an SVM model, to each layer. An early attempt to combine SVM with DL has been made; it has, however, a different motivation from ours and studies only the output layer, with some preliminary experimental results.

We are motivated by the following simple observation: in general, a discriminative classifier trained on highly discriminative features will display better performance than a discriminative classifier trained on less discriminative features. If the features in question are the hidden layer feature maps of a deep network, this observation means that the performance of a discriminative classifier trained using these hidden layer feature maps can serve as a proxy for the quality/discriminativeness of those hidden layer feature maps, and further to the quality of the upper layer feature maps. By making appropriate use of this feature quality feedback at each hidden layer of the network, we are able to directly influence the hidden layer weight/filter update process to favor highly discriminative feature maps. This is a source of supervision that acts deep within the network at each layer; when our proxy for feature quality is good, we expect to much more rapidly approach the region of good features than would be the case if we had to rely on the gradual backpropagation from the output layer alone. We also expect to alleviate the common problem of having gradients that “explode” or “vanish”. One concern with a direct pursuit of feature discriminativeness at all hidden layers is that this might interfere with the overall network performance, since it is ultimately the feature maps at the output layer which are used for the final classification; our experimental results indicate that this is not the case.

Our basic network architecture will be similar to the standard one used in the CNN framework. Our additional deep feedback is brought in by associating a companion local output with each hidden layer. We may think of this companion local output as analogous to the final output that a truncated network would have produced. Backpropagation of error now proceeds as usual, with the crucial difference that we now backpropagate not only from the final layer but also simultaneously from our local companion output. Our empirical results suggest the following main properties of the companion objective: (1) it acts as a kind of feature regularization (although an unusual one), which leads to a significant reduction in the test error but not necessarily in the training error; (2) it results in faster convergence, especially in the presence of small training sets (see Figure 2 for an illustration on a running example).

Formulation

We focus on the supervised learning case and let $S=\{(\mathbf{X}_i, y_i), i=1..N\}$ be our set of input training data, where sample $\mathbf{X}_i \in R^n$ denotes the raw input data and $y_i \in \{1,..,K\}$ is the corresponding ground-truth label for sample $\mathbf{X}_i$. We drop $i$ for notational simplicity, since each sample is considered independently. The goal of deep nets, specifically convolutional neural networks (CNN), is to learn layers of filters and weights for the minimization of classification error at the output layer. Here, we absorb the bias term into the weight parameters, do not differentiate weights from filters, and denote a recursive function for each layer $m=1..M$ as:

$$\mathbf{Z}^{(m)} = f(\mathbf{Q}^{(m)}), \quad \text{and} \quad \mathbf{Z}^{(0)} \equiv \mathbf{X}, \tag{1}$$

$$\mathbf{Q}^{(m)} = \mathbf{W}^{(m)} * \mathbf{Z}^{(m-1)}. \tag{2}$$

$M$ denotes the total number of layers; $\mathbf{W}^{(m)}, m=1..M$ are the filters/weights to be learned; $\mathbf{Z}^{(m-1)}$ is the feature map produced at layer $m-1$; $\mathbf{Q}^{(m)}$ refers to the convolved/filtered responses on the previous feature map; $f()$ is a pooling function on $\mathbf{Q}$. Combining all layers of weights gives

$$\mathbf{W} = (\mathbf{W}^{(1)}, \ldots, \mathbf{W}^{(M)}).$$
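The layer recursion can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes dense matrix products in place of convolutions and an element-wise rectifier in place of the pooling function, and the names `forward` and `weights` are hypothetical.

```python
import numpy as np

def forward(x, weights, f=lambda q: np.maximum(q, 0.0)):
    """Return the list of feature maps [Z^(0), ..., Z^(M)] for input x.

    Assumption: each W^(m) is a dense matrix and f is a rectifier;
    the paper uses convolution filters and a pooling function instead.
    """
    z = [x]                      # Z^(0) is the raw input
    for w in weights:            # layers m = 1..M
        q = w @ z[-1]            # Q^(m): filtered response on Z^(m-1)
        z.append(f(q))           # Z^(m) = f(Q^(m))
    return z

rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 8)), rng.standard_normal((3, 5))]
feature_maps = forward(rng.standard_normal(8), weights)
```

Keeping every intermediate feature map, rather than only the last one, is what later allows a classifier to be attached to each hidden layer.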

Now we introduce a set of classifiers, e.g. SVM (other classifiers such as Softmax can be applied, and we show results using both SVM and Softmax in the experiments), one for each hidden layer,

$$\mathbf{w} = (\mathbf{w}^{(1)}, \ldots, \mathbf{w}^{(M-1)}),$$

in addition to the $\mathbf{W}$ in the standard CNN framework. We denote by $\mathbf{w}^{(out)}$ the SVM weights for the output layer. Thus, we build our overall combined objective function as:

To summarize, we describe this optimization problem as follows: we want to learn filters/weights $\mathbf{W}$ for the entire network such that an SVM classifier $\mathbf{w}^{(out)}$ trained on the output feature maps (which depend on those filters/features) will display good performance. We seek this output performance while also requiring some "satisfactory" level of performance on the part of the hidden layer classifiers. In effect, we restrict attention to the parts of feature space that, when considered at the internal layers, lead to highly discriminative hidden layer feature maps (as measured via our proxy of hidden-layer classifier performance). The main difference between eqn. (3) and previous attempts at layer-wise supervised training is that we perform the optimization jointly, with a robust measure (or regularization) of the hidden layers. For example, greedy layer-wise pretraining was performed as either initialization or fine-tuning, which results in some overfitting. The state-of-the-art benchmark results demonstrate the particular advantage of our formulation. As shown in Figure 2(c), both CNN and DSN reach training error near zero, but DSN demonstrates a clear advantage in generalization capability.
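As a rough illustration of how such a combined objective can be evaluated, the sketch below assumes squared hinge losses for both the output and hidden-layer classifiers, and a companion term that vanishes once a hidden layer reaches a "satisfactory" level. The per-layer weights `alphas`, the threshold `gamma`, and all function names are assumptions introduced here for illustration only.

```python
import numpy as np

def squared_hinge(w, z, y):
    """Multi-class squared hinge loss of a linear classifier.

    w: (K, d) classifier weights, z: (d,) feature vector, y: true label index.
    """
    scores = w @ z
    margins = np.maximum(0.0, 1.0 - (scores[y] - scores))
    margins[y] = 0.0             # the true class contributes no loss
    return float(np.sum(margins ** 2))

def dsn_objective(z_hidden, z_out, w_hidden, w_out, y, alphas, gamma):
    """Output objective plus thresholded companion objectives (assumed form)."""
    output_term = np.sum(w_out ** 2) + squared_hinge(w_out, z_out, y)
    companion_term = 0.0
    for alpha, w_m, z_m in zip(alphas, w_hidden, z_hidden):
        layer_objective = np.sum(w_m ** 2) + squared_hinge(w_m, z_m, y)
        # A hidden layer whose objective is already below gamma adds nothing.
        companion_term += alpha * max(layer_objective - gamma, 0.0)
    return output_term + companion_term
```

With a very large `gamma` the companion terms switch off and the value reduces to the output objective alone, which is one way to see the hidden-layer supervision as a regularizer on top of a standard network.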

To train the DSN model using SGD, the gradients of the objective function w.r.t the parameters in the model are:

The gradient w.r.t. $\mathbf{W}$ follows the conventional CNN-based model, plus the gradient that comes directly from the hidden layer supervision.
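The two gradient contributions can be made concrete on a toy network. The sketch below is an illustration under simplifying assumptions, not the paper's model: it uses two dense `tanh` layers, squared-error losses instead of hinge losses, and a single companion classifier `w1` on the first hidden layer weighted by a hypothetical `alpha`. The gradient for the first-layer weights is the sum of the path through the output layer and the path through the companion output.

```python
import numpy as np

def total_loss(W1, W2, w1, w_out, x, t, alpha):
    """Output loss plus alpha times the companion loss on the first hidden layer."""
    z1 = np.tanh(W1 @ x)
    z2 = np.tanh(W2 @ z1)
    output_loss = 0.5 * np.sum((w_out @ z2 - t) ** 2)
    companion_loss = 0.5 * np.sum((w1 @ z1 - t) ** 2)
    return output_loss + alpha * companion_loss

def grad_W1(W1, W2, w1, w_out, x, t, alpha):
    """Gradient of total_loss w.r.t. W1, summing both backpropagation paths."""
    z1 = np.tanh(W1 @ x)
    z2 = np.tanh(W2 @ z1)
    # Conventional path: output loss -> z2 -> z1
    d_z2 = w_out.T @ (w_out @ z2 - t)
    d_z1_output = W2.T @ (d_z2 * (1.0 - z2 ** 2))
    # Companion path: hidden-layer classifier -> z1 directly
    d_z1_companion = alpha * (w1.T @ (w1 @ z1 - t))
    d_q1 = (d_z1_output + d_z1_companion) * (1.0 - z1 ** 2)
    return np.outer(d_q1, x)

rng = np.random.default_rng(0)
x, t = rng.standard_normal(4), np.array([1.0, 0.0, 0.0])
W1, W2 = rng.standard_normal((5, 4)), rng.standard_normal((3, 5))
w1, w_out = rng.standard_normal((3, 5)), rng.standard_normal((3, 3))
alpha = 0.1
analytic = grad_W1(W1, W2, w1, w_out, x, t, alpha)
```

The companion path reaches the first layer without passing through `W2`, so it is not attenuated by the upper layer's Jacobian.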

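Next, we discuss our formulation, eqn. (3), further and try to build some intuition for it. For ease of reference, we write this objective function as

$$F(\mathbf{W}) \equiv \mathcal{P}(\mathbf{W}) + \mathcal{Q}(\mathbf{W}),$$

where $\mathcal{P}(\mathbf{W})$ denotes the output objective and $\mathcal{Q}(\mathbf{W})$ the companion objective.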

Stochastic Gradient Descent View

We focus on the convergence advantage of DSN, rather than on its regularization effect on generalization. In addition to the present problem in CNN that learned features are not always intuitive and discriminative, the difficulty of training deep neural networks has been discussed. As we can observe from eqn. (1) and (2), changes in the bottom-layer weights get propagated through layers of functions, leading to exploding or vanishing gradients. Various techniques and parameter tuning tricks have been proposed to better train deep neural networks, such as pre-training and dropout. Here we provide a somewhat loose analysis of our proposed formulation, in the hope of understanding its advantage in effectiveness.

The objective function in deep neural networks is highly non-convex. Here we make the following assumptions/observations: (1) the objective/energy function of DL exhibits a large "flat" area around the "optimal" solution, where any result has a similar performance; (2) locally we still assume a convex (or even $\lambda$-strongly convex) function, whose optimization is often performed with a stochastic gradient descent algorithm.

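The definition of $\lambda$-strongly convex is standard: a function $F(\mathbf{W})$ is $\lambda$-strongly convex if, $\forall\, \mathbf{W}, \mathbf{W}' \in \mathcal{W}$ and any subgradient $\mathbf{g}$ at $\mathbf{W}$,

$$F(\mathbf{W}') \geq F(\mathbf{W}) + \langle \mathbf{g}, \mathbf{W}' - \mathbf{W} \rangle + \frac{\lambda}{2} \lVert \mathbf{W}' - \mathbf{W} \rVert^{2}.$$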

We name $\mathcal{S}_{\gamma}(F) = \{\mathbf{W} : F(\mathbf{W}) \leq \gamma\}$ the $\gamma$-feasible set for a function $F(\mathbf{W}) \equiv \mathcal{P}(\mathbf{W}) + \mathcal{Q}(\mathbf{W})$.

First we show that a feasible solution for $\mathcal{Q}(\mathbf{W})$ leads to a feasible one for $\mathcal{P}(\mathbf{W})$. That is:

Lemma 1 shows that a good solution for $\mathcal{Q}(\mathbf{W})$ is also a good one for $\mathcal{P}(\mathbf{W})$, but the converse may not hold. That is, a $\mathbf{W}$ that makes $\mathcal{P}(\mathbf{W})$ small may not necessarily produce hidden-layer features discriminative enough to give a small $\mathcal{Q}(\mathbf{W})$. However, $\mathcal{Q}(\mathbf{W})$ can be viewed as a regularization term. Since $\mathcal{P}(\mathbf{W})$ exhibits a very flat area near, or even at, zero on the training data, and it is ultimately the test error that we care about, we focus only on the $\mathbf{W}$, denoted $\mathbf{W}^{\star}$, that makes both $\mathcal{Q}(\mathbf{W})$ and $\mathcal{P}(\mathbf{W})$ small. Therefore, it is not unreasonable to assume that $F(\mathbf{W}) \equiv \mathcal{P}(\mathbf{W}) + \mathcal{Q}(\mathbf{W})$ and $\mathcal{P}(\mathbf{W})$ share the same optimal $\mathbf{W}^{\star}$.

Let $\mathcal{P}(\mathbf{W})$ and $\mathcal{Q}(\mathbf{W})$ be strongly convex around $\mathbf{W}^{\star}$, for $\lVert \mathbf{W}' - \mathbf{W}^{\star} \rVert^{2} \leq D$ and $\lVert \mathbf{W} - \mathbf{W}^{\star} \rVert^{2} \leq D$, with $\mathcal{P}(\mathbf{W}') \geq \mathcal{P}(\mathbf{W}) + \langle \mathbf{gp}, \mathbf{W}' - \mathbf{W} \rangle + \frac{\lambda_{1}}{2} \lVert \mathbf{W}' - \mathbf{W} \rVert^{2}$ and $\mathcal{Q}(\mathbf{W}') \geq \mathcal{Q}(\mathbf{W}) + \langle \mathbf{gq}, \mathbf{W}' - \mathbf{W} \rangle + \frac{\lambda_{2}}{2} \lVert \mathbf{W}' - \mathbf{W} \rVert^{2}$, where $\mathbf{gp}$ and $\mathbf{gq}$ are the subgradients of $\mathcal{P}$ and $\mathcal{Q}$ at $\mathbf{W}$ respectively. It can be directly seen that $F(\mathbf{W})$ is also strongly convex and that, for a subgradient $\mathbf{gf}$ of $F(\mathbf{W})$ at $\mathbf{W}$, $\mathbf{gf} = \mathbf{gp} + \mathbf{gq}$.

Since $F(\mathbf{W}) = \mathcal{P}(\mathbf{W}) + \mathcal{Q}(\mathbf{W})$, adding the two inequalities above shows directly that

$$F(\mathbf{W}') \geq F(\mathbf{W}) + \langle \mathbf{gf}, \mathbf{W}' - \mathbf{W} \rangle + \frac{\lambda_{1} + \lambda_{2}}{2} \lVert \mathbf{W}' - \mathbf{W} \rVert^{2}.$$

Based on Lemma 1 in , this upper bound directly holds. $\square$
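The additivity of the strong-convexity constants can be checked numerically on a toy case. The sketch below is only a sanity check under assumed quadratics: `P` and `Q` are hypothetical diagonal quadratic functions with smallest curvatures `lam1` and `lam2`, and the gap between the two sides of the strong-convexity inequality for their sum is evaluated with constant `lam1 + lam2` on random pairs of points.

```python
import numpy as np

def convexity_gap(func, grad, lam, w, w_prime):
    """F(W') minus the strong-convexity lower bound; non-negative when the inequality holds."""
    diff = w_prime - w
    lower_bound = func(w) + grad(w) @ diff + 0.5 * lam * (diff @ diff)
    return func(w_prime) - lower_bound

lam1, lam2 = 0.5, 3.0                 # hypothetical constants
A = np.diag([lam1, 2.0])              # P is lam1-strongly convex
B = np.diag([lam2, 4.0])              # Q is lam2-strongly convex

def P(w): return 0.5 * w @ A @ w
def Q(w): return 0.5 * w @ B @ w
def F(w): return P(w) + Q(w)
def gf(w): return A @ w + B @ w       # gf = gp + gq

rng = np.random.default_rng(0)
pairs = [(rng.standard_normal(2), rng.standard_normal(2)) for _ in range(1000)]
gaps = [convexity_gap(F, gf, lam1 + lam2, w, wp) for w, wp in pairs]
```

All gaps are non-negative up to floating-point error, whereas asking for a constant larger than `lam1 + lam2` makes the inequality fail for these quadratics.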

Following the assumptions in Lemma 2, but now assuming $\eta_{t} = 1/t$ since $\lambda_{1}$ and $\lambda_{2}$ are not always readily available, then starting from $\lVert \mathbf{W}_{1} - \mathbf{W}^{\star} \rVert^{2} \leq D$ the convergence rate is bounded by

Let $\lambda = \lambda_{1} + \lambda_{2}$; we have

With $2\lambda/t$ being small, we have $1 - 2\lambda/t \approx e^{-2\lambda/t}$.

Lemma 1 shows the compatibility of the companion objective $\mathcal{Q}$ with the output objective $\mathcal{P}$. The first equation can be directly derived from Lemma 2, and the second equation can be seen from Lemma 3. In general $\lambda_{2} \gg \lambda_{1}$, which leads to a great improvement in convergence speed; the constraints in each hidden layer also help in learning filters that are directly discriminative. $\square$
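The effect of a larger strong-convexity constant under the step size $\eta_t = 1/t$ can be seen on a one-dimensional toy problem. The sketch below is an illustration only: it uses noiseless gradient descent on $F(w) = \frac{\lambda}{2} w^2$ rather than SGD on a network, and the values of `lam1` and `lam2` are hypothetical. It also evaluates the approximation $1 - 2\lambda/t \approx e^{-2\lambda/t}$ at one point.

```python
import math

def squared_distance_after(lam, steps, w0=1.0):
    """Squared distance to the optimum w* = 0 of F(w) = lam/2 * w^2
    after `steps` gradient steps with step size eta_t = 1/t."""
    w = w0
    for t in range(1, steps + 1):
        w -= (1.0 / t) * lam * w   # the gradient of F at w is lam * w
    return w * w

lam1, lam2 = 0.1, 0.8              # hypothetical values with lam2 >> lam1
dist_p = squared_distance_after(lam1, steps=100)          # output objective alone
dist_f = squared_distance_after(lam1 + lam2, steps=100)   # output plus companion

# Error of the exponential approximation at t = 100, lam = 0.5
approx_error = abs((1 - 2 * 0.5 / 100) - math.exp(-2 * 0.5 / 100))
```

In this toy setting the iterate for the combined objective ends up much closer to the optimum after the same number of steps, which mirrors the qualitative claim of the analysis.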

Experiments

We evaluate the proposed DSN method on four standard benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN. We follow a common training protocol used by Krizhevsky et al. in all experiments. We use an SGD solver with a mini-batch size of 128 and a fixed momentum value of 0.9. Initial values for the learning rate and weight decay factor are determined on the validation set. For a fair comparison and a clear illustration of the effectiveness of DSN, we match the complexity of our model with that of the network architectures used in previous work, so as to have a comparable number of parameters. We also incorporate two dropout layers with a dropout rate of 0.5. The companion objective at the convolutional layers is imposed to backpropagate the classification error guidance to the underlying convolutional layers. Learning rates are annealed during training by a factor of 20 according to an epoch schedule determined on the validation set. The proposed DSN framework is not difficult to train, and no particular engineering tricks are adopted. Our system is built on top of the widely used Caffe infrastructure. For the network architecture setup, we adopted the mlpconv layer and the global averaged pooling scheme introduced in previous work. DSN can be equipped with different types of loss functions, such as Softmax and SVM. We show a performance boost of DSN-SVM and DSN-Softmax over CNN-SVM and CNN-Softmax respectively (see Figure 2(a)). The performance gain is more evident in the presence of small training sets (see Figure 2(b)); this might partially ease the burden of requiring large training data for DL. Overall, we observe state-of-the-art classification error on all four datasets (without data augmentation): 0.39% for MNIST, 9.78% for CIFAR-10, 34.57% for CIFAR-100, and 1.92% for SVHN (8.22% for CIFAR-10 with data augmentation). All results are achieved without averaging, which could also be combined with our method. Figure 3 gives an illustration of some learned features.
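The solver settings described above can be sketched as follows. The momentum value and the annealing factor come from the text; everything else is an assumption for illustration: the rate is assumed to be divided by 20 at each scheduled epoch, the update follows a common momentum-SGD convention with L2 weight decay, and `base_lr`, `drop_epochs`, and `weight_decay` are hypothetical parameters whose values the text leaves to validation.

```python
import numpy as np

MOMENTUM = 0.9        # fixed momentum value stated in the text
ANNEAL_FACTOR = 20    # annealing factor stated in the text

def learning_rate(epoch, base_lr, drop_epochs):
    """Divide base_lr by ANNEAL_FACTOR once for every scheduled epoch already reached."""
    drops = sum(epoch >= e for e in drop_epochs)
    return base_lr / (ANNEAL_FACTOR ** drops)

def momentum_step(w, velocity, grad, lr, weight_decay):
    """One SGD-with-momentum update with L2 weight decay (assumed convention)."""
    velocity = MOMENTUM * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity

# Example: one update of a single weight at epoch 60 with a hypothetical schedule
lr = learning_rate(60, base_lr=0.1, drop_epochs=[50, 80])
w, v = momentum_step(np.array([1.0]), np.zeros(1), np.array([2.0]), lr, weight_decay=0.0)
```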

We first validate the effectiveness of the proposed DSN on the MNIST handwritten digits classification task, a widely adopted benchmark in machine learning. The MNIST dataset consists of images of 10 different classes (0 to 9) of size 28×28, with 60,000 training samples and 10,000 test samples. Figure 2(a) and (b) show results from four methods, namely: (1) conventional CNN with softmax loss (CNN-Softmax), (2) the proposed DSN with softmax loss (DSN-Softmax), (3) CNN with max-margin objective (CNN-SVM), and (4) the proposed DSN with max-margin objective (DSN-SVM). DSN-Softmax and DSN-SVM both outperform their competing CNN algorithms (DSN-SVM shows a classification error of 0.39% under a single model without data whitening and augmentation). Figure 2(b) shows the classification error of the competing methods when trained on varying numbers of training samples (26% gain of DSN-SVM over CNN-Softmax at 500 samples). Figure 2(c) shows a comparison of generalization error between CNN and DSN.

CIFAR-10 and CIFAR-100

The CIFAR-10 dataset consists of 32×32 color images. A total of 60,000 images are split into 50,000 training and 10,000 testing images. The dataset is preprocessed by global contrast normalization. To compare our results with the previous state-of-the-art, in this case we also augmented the dataset by zero-padding 4 pixels on each side, then applying corner cropping and random flipping on the fly during training. No model averaging is done at the test phase, and we only crop the center of a test sample. Table 2 shows our result. Our DSN model achieved an error rate of 9.78% without data augmentation and 8.22% with data augmentation (the best known result to our knowledge).
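The training-time augmentation described above might look like the following NumPy sketch. It is an interpretation rather than the authors' pipeline: "corner cropping" is read here as taking the crop from one of the four corners of the zero-padded image, the flip probability of 0.5 is an assumption, and the function name `augment` is hypothetical.

```python
import numpy as np

def augment(image, rng, pad=4):
    """Zero-pad an (H, W, C) image by `pad` pixels per side, crop an H x W window
    from a randomly chosen corner of the padded image, and flip it at random."""
    h, w, _ = image.shape
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))  # constant zeros
    top = rng.choice([0, 2 * pad])     # top or bottom corner
    left = rng.choice([0, 2 * pad])    # left or right corner
    crop = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]           # horizontal flip
    return crop

rng = np.random.default_rng(0)
sample = augment(np.ones((32, 32, 3)), rng)
```

Under this reading, every training crop keeps a 28×28 region of the original image and shows a 4-pixel zero border on two sides, while the center crop used at test time recovers the original image unchanged.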

DSN also provides added robustness to hyperparameter choice, in that the early layers are guided by a direct classification loss, leading to a faster convergence rate and a reduced burden of heavy hyperparameter tuning. We also compared the gradients in DSN with those in CNN, observing 4.55 times greater gradient variance in DSN than in CNN in the first convolutional layer. This is consistent with a previously reported observation, and with the assumptions and motivations of this work. To see what features have been learned in DSN vs. CNN, we select one example image from each of the ten categories of the CIFAR-10 dataset, run one forward pass, and show the feature maps learned from the first (bottom) convolutional layer in Figure 3. Only the top 30% of activations are shown in each of the feature maps. Feature maps learned by DSN appear more intuitive than those learned by CNN.
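Keeping only the strongest activations of a feature map for display can be sketched as below. This is a minimal illustration assuming a per-map quantile threshold; the function name `keep_top_activations` is hypothetical and the exact thresholding rule used for the figure is not stated in the text.

```python
import numpy as np

def keep_top_activations(feature_map, fraction=0.3):
    """Zero out all but the largest `fraction` of activations in a feature map."""
    threshold = np.quantile(feature_map, 1.0 - fraction)
    return np.where(feature_map >= threshold, feature_map, 0.0)

# Example on a synthetic 10 x 10 map with values 0..99
demo_map = np.arange(100.0).reshape(10, 10)
masked = keep_top_activations(demo_map)
```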

The CIFAR-100 dataset is similar to the CIFAR-10 dataset, except that it has 100 classes. The number of training images for each class is then 500 instead of 5,000 as in CIFAR-10, which makes the classification task more challenging. We use the same network settings as for CIFAR-10. Table 2 shows the previous best results, together with the 34.57% error obtained by DSN. The performance boost consistently shown on both CIFAR-10 and CIFAR-100 again demonstrates the advantage of the DSN method.

Street View House Numbers

The Street View House Numbers (SVHN) dataset consists of 73,257 digits for training, 26,032 digits for testing, and 531,131 extra training samples, all 32×32 color images. We followed previous work for data preparation, namely: we set aside 400 samples per class from the training set and 200 samples per class from the extra set. The remaining 598,388 images are used for training. Following previous work, we preprocess the dataset with Local Contrast Normalization (LCN). We do not use data augmentation in training and use only a single model in testing. Table 3 shows recent comparable results. Note that Dropconnect uses data augmentation and multiple model voting.
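The size of the resulting training set follows from the counts above, as the short check below shows; the variable names are illustrative.

```python
train_digits = 73_257       # SVHN training set
extra_digits = 531_131      # SVHN extra set
num_classes = 10

# 400 samples per class set aside from the training set, 200 per class from the extra set
set_aside = num_classes * (400 + 200)
remaining = train_digits + extra_digits - set_aside
print(remaining)  # 598388
```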

Conclusions

In this paper, we have presented a new formulation, deeply-supervised nets (DSN), which attempts to make the learning process for deep learning more transparent. Evident performance enhancement over existing approaches has been obtained. A stochastic gradient view also sheds light on the understanding of our formulation.

Acknowledgments

This work is supported by NSF award IIS-1216528 (IIS-1360566) and NSF award IIS-0844566 (IIS-1360568). We thank Min Lin, Naiyan Wang, Baoyuan Wang, Jingdong Wang, Liwei Wang, and David Wipf for helpful discussions. We are grateful for the generous donation of the GPUs by NVIDIA.

References