Empirical Evaluation of Rectified Activations in Convolutional Network

Bing Xu, Naiyan Wang, Tianqi Chen, Mu Li

Introduction

Convolutional neural networks (CNNs) have achieved great success in various computer vision tasks, such as image classification (Krizhevsky et al., 2012; Szegedy et al., 2014), object detection (Girshick et al., 2014) and tracking (Wang et al., 2015). Besides depth, one of the key characteristics of modern deep learning systems is the use of non-saturated activation functions (e.g. ReLU) in place of their saturated counterparts (e.g. sigmoid, tanh). The advantage of non-saturated activation functions is twofold: first, they alleviate the so-called “exploding/vanishing gradient” problem; second, they accelerate convergence.

Among these non-saturated activation functions, the most notable is the rectified linear unit (ReLU) (Nair & Hinton, 2010; Sun et al., 2014). Briefly, it is a piecewise linear function that sets the negative part to zero and retains the positive part. A desirable property is that the activations are sparse after passing through ReLU. It is commonly believed that the superior performance of ReLU comes from this sparsity (Glorot et al., 2011; Sun et al., 2014). In this paper, we ask two questions: First, is sparsity the most important factor for good performance? Second, can we design better non-saturated activation functions that beat ReLU?

We consider a broader class of activation functions, namely the rectified unit family. In particular, we are interested in the leaky ReLU and its variants. In contrast to ReLU, in which the negative part is dropped entirely, leaky ReLU assigns a non-zero slope to it. The first variant is called the parametric rectified linear unit (PReLU) (He et al., 2015). In PReLU, the slopes of the negative part are learned from data rather than predefined. The authors claimed that PReLU is the key factor in surpassing human-level performance on the ImageNet classification task (Russakovsky et al., 2015). The second variant is called the randomized rectified linear unit (RReLU). In RReLU, the slopes of the negative part are randomized within a given range during training and then fixed during testing. In a recent Kaggle National Data Science Bowl (NDSB) competition (https://www.kaggle.com/c/datasciencebowl), it was reported that RReLU could reduce overfitting due to its randomized nature.

In this paper, we empirically evaluate these four kinds of activation functions. Based on our experiments, we conclude that on small datasets, leaky ReLU and its variants are consistently better than ReLU in convolutional neural networks. RReLU is favorable because its randomness in training reduces the risk of overfitting. For large datasets, more investigation is needed in the future.

Rectified Units

In this section, we introduce the four kinds of rectified units: rectified linear (ReLU), leaky rectified linear (Leaky ReLU), parametric rectified linear (PReLU) and randomized rectified linear (RReLU). We illustrate them in Fig. 1 for comparison. In the sequel, we use $x_{ji}$ to denote the input of the $i$-th channel in the $j$-th example, and $y_{ji}$ to denote the corresponding output after passing through the activation function. In the following subsections, we introduce each rectified unit formally.

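1 Rectified Linear Unit

Rectified linear was first used in Restricted Boltzmann Machines (Nair & Hinton, 2010). Formally, the rectified linear activation is defined as:

$$y_{ji} = \begin{cases} x_{ji} & \text{if } x_{ji} \geq 0 \\ 0 & \text{if } x_{ji} < 0 \end{cases} \tag{1}$$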

2 Leaky Rectified Linear Unit

Leaky rectified linear activation was first introduced in acoustic models (Maas et al., 2013). Mathematically, we have

$$y_{ji} = \begin{cases} x_{ji} & \text{if } x_{ji} \geq 0 \\ \dfrac{x_{ji}}{a_{i}} & \text{if } x_{ji} < 0 \end{cases} \tag{2}$$

where $a_{i}$ is a fixed parameter in the range $(1,+\infty)$. In the original paper, the authors suggest setting $a_{i}$ to a large number such as 100. In addition to this setting, we also experiment with a smaller $a_{i}=5.5$ in our paper.
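As a minimal sketch of the two definitions so far, the snippet below assumes the divisor form implied by the range $(1,+\infty)$: a negative input is divided by $a_{i}$, so a larger $a_{i}$ means a flatter negative part. The function names `relu` and `leaky_relu` are illustrative only and are not taken from CXXNET.

```python
def relu(x):
    """ReLU: keep non-negative inputs, set negative inputs to zero."""
    return x if x >= 0 else 0.0


def leaky_relu(x, a=100.0):
    """Leaky ReLU in the divisor form assumed here: negative inputs become x / a."""
    if a <= 1:
        raise ValueError("a must lie in (1, +inf)")
    return x if x >= 0 else x / a


inputs = [-2.0, -0.5, 0.0, 1.5]
print([relu(v) for v in inputs])                 # negative entries are zeroed
print([leaky_relu(v, a=100.0) for v in inputs])  # negatives shrunk by a factor of 100
print([leaky_relu(v, a=5.5) for v in inputs])    # "very leaky": negatives shrunk by 5.5
```

With $a_{i}=100$ the negative part is almost flat, which is why this setting behaves much like ReLU, whereas $a_{i}=5.5$ lets noticeably more of the negative signal through.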

3 Parametric Rectified Linear Unit

Parametric rectified linear was proposed by He et al. (2015). The authors reported that its performance is much better than that of ReLU on large-scale image classification. It is the same as leaky ReLU (Eqn. 2), except that $a_{i}$ is learned during training via backpropagation.
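To make "learned via back propagation" concrete, here is a hedged sketch of the gradient with respect to $a_{i}$, assuming the same divisor form as leaky ReLU, i.e. $y = x/a$ for $x<0$, which gives $\partial y/\partial a = -x/a^{2}$. The helper names and the plain SGD update are assumptions made for illustration; the source does not specify the optimizer or any constraint on $a_{i}$, and He et al. (2015) parameterize the negative part with a multiplicative slope instead.

```python
def prelu_forward(x, a):
    """PReLU forward pass in the divisor form: x for x >= 0, x / a otherwise."""
    return x if x >= 0 else x / a


def prelu_grad_a(x, a):
    """Derivative of the output w.r.t. a: 0 for x >= 0, -x / a**2 for x < 0."""
    return 0.0 if x >= 0 else -x / (a * a)


def sgd_step_on_a(xs, upstream_grads, a, lr=0.1):
    """One plain SGD update of a shared (per-channel) a.

    xs: inputs seen by this channel; upstream_grads: dLoss/dy for each input.
    The chain rule sums dLoss/dy * dy/da over all positions sharing this a.
    """
    grad = sum(g * prelu_grad_a(x, a) for x, g in zip(xs, upstream_grads))
    return a - lr * grad


a = 4.0
a = sgd_step_on_a(xs=[-3.0, 2.0], upstream_grads=[1.0, 1.0], a=a)
print(a)  # only the negative input contributes to the update
```

Only negative inputs contribute to the gradient, so a channel whose inputs are all positive leaves its $a_{i}$ unchanged.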

4 Randomized Leaky Rectified Linear Unit

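Randomized leaky rectified linear is the randomized version of leaky ReLU. It was first proposed and used in the Kaggle NDSB competition. The highlight of RReLU is that during training, $a_{ji}$ is a random number sampled from a uniform distribution $U(l,u)$. Formally, we have:

$$y_{ji} = \begin{cases} x_{ji} & \text{if } x_{ji} \geq 0 \\ \dfrac{x_{ji}}{a_{ji}} & \text{if } x_{ji} < 0 \end{cases} \tag{3}$$

where $a_{ji} \sim U(l,u)$ and $l<u$.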

In the test phase, we take the average of all the $a_{ji}$ used in training, as in dropout (Srivastava et al., 2014), and thus set $a_{ji}$ to $\frac{l+u}{2}$ to obtain a deterministic result. As suggested by the NDSB competition winner, $a_{ji}$ is sampled from $U(3,8)$. We use the same configuration in this paper.
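The training and test behaviour described above can be sketched as follows. This is an illustration under assumptions, not the CXXNET implementation: it assumes the divisor form $y = x/a_{ji}$ for negative inputs, and the function name `rrelu` and its `training` flag are hypothetical.

```python
import random


def rrelu(x, lower=3.0, upper=8.0, training=True, rng=random):
    """Randomized leaky ReLU for a single activation.

    Training: draw a fresh divisor a ~ U(lower, upper) for each negative input.
    Testing:  use the fixed mean divisor (lower + upper) / 2.
    """
    if x >= 0:
        return x
    a = rng.uniform(lower, upper) if training else (lower + upper) / 2.0
    return x / a


rng = random.Random(0)  # seeded generator, only to make the sketch repeatable
train_out = [rrelu(-11.0, training=True, rng=rng) for _ in range(3)]
test_out = rrelu(-11.0, training=False)
print(train_out)  # three different values, one per sampled divisor
print(test_out)   # deterministic: -11.0 / 5.5
```

With $U(3,8)$ the test-time divisor is $(3+8)/2 = 5.5$, which matches the smaller leaky ReLU setting $a_{i}=5.5$ used in the experiments.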

Experiment Settings

We evaluate classification performance on the same convolutional network structure with different activation functions. Because the hyperparameter search space is large, we use two state-of-the-art convolutional network structures and the same hyperparameters for every activation setting. All models are trained using CXXNET (https://github.com/dmlc/cxxnet).

1 CIFAR-10 and CIFAR-100

The CIFAR-10 and CIFAR-100 datasets (Krizhevsky & Hinton, 2009) are tiny natural image datasets. CIFAR-10 contains images from 10 different classes and CIFAR-100 contains images from 100 different classes. Each image is an RGB image of size 32x32. Each dataset has 50,000 training images and 10,000 test images. We use the raw images directly, without any pre-processing or augmentation. The results are from single-view testing without any ensemble.

The network structure is shown in Table 1. It is taken from Network in Network (NIN) (Lin et al., 2013).

In the CIFAR-100 experiment, we also tested RReLU on the Batch Norm Inception Network (Ioffe & Szegedy, 2015). We use a subset of the Inception Network, starting from the inception-3a module. This network achieved 75.68% test accuracy without any ensemble or multi-view testing (CIFAR-100 reproduction code: https://github.com/dmlc/mxnet/blob/master/example/notebooks/cifar-100.ipynb).

2 National Data Science Bowl Competition

The task of the National Data Science Bowl competition is to classify plankton from images, with a prize of $170k. There are 30,336 labeled grayscale images in 121 classes and 130,400 test images. Since the test set is private, we divide the training set into two parts: 25,000 images for training and 5,336 images for validation. The competition uses multi-class log-loss to evaluate classification performance.
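For reference, multi-class log-loss is the mean negative log-probability assigned to the true class. The sketch below uses the standard definition; the function name `multiclass_log_loss` and the clipping constant `eps` are assumptions added for numerical safety and are not taken from the competition's scoring code.

```python
import math


def multiclass_log_loss(probs, labels, eps=1e-15):
    """Mean negative log-likelihood of the true class.

    probs:  one list of class probabilities per example.
    labels: integer index of the true class for each example.
    eps:    clipping bound that avoids log(0) (an assumption of this sketch).
    """
    total = 0.0
    for p_row, y in zip(probs, labels):
        p = min(max(p_row[y], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(labels)


# A classifier that predicts uniformly over the 121 plankton classes.
num_classes = 121
uniform = [[1.0 / num_classes] * num_classes for _ in range(3)]
print(multiclass_log_loss(uniform, labels=[0, 5, 120]))  # equals log(121)
```

A uniform prediction over 121 classes scores $\ln 121$, which gives a baseline against which log-loss values on this task can be read.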

We adopt the network and augmentation settings from team AuroraXie (winning documentation: https://github.com/auroraxie/Kaggle-NDSB), one of the competition winners. The network structure is shown in Table 5. We use only single-view testing in our experiment, which differs from the original multi-view, multi-scale testing.

Result and Discussion

Tables 3 and 4 show the results on the CIFAR-10 and CIFAR-100 datasets, respectively. Table 5 shows the NDSB result. We use the ReLU network as the baseline and compare its convergence curve pairwise with those of the other three activations in Figs. 2, 3 and 4, respectively. All three leaky ReLU variants are better than the baseline on the test set. We make the following observations based on our experiments:

Not surprisingly, we find that the performance of normal leaky ReLU ($a=100$) is similar to that of ReLU, but very leaky ReLU with a smaller $a=5.5$ (i.e., a larger negative slope) is much better.

On the training set, the error of PReLU is always the lowest, while the errors of leaky ReLU and RReLU are higher than that of ReLU. This indicates that PReLU may suffer from severe overfitting on small-scale datasets.

On NDSB, the superiority of RReLU is more significant than on CIFAR-10/CIFAR-100. We conjecture that this is because the NDSB training set is smaller than that of CIFAR-10/CIFAR-100, while the network we use is even bigger. This supports the effectiveness of RReLU in combating overfitting.

For RReLU, we still need to investigate how the randomness influences the network training and testing process.

Conclusion

In this paper, we analyzed four rectified activation functions using various network architectures on three datasets. Our findings strongly suggest that the most popular activation function, ReLU, is not the end of the story: all three types of (modified) leaky ReLU consistently outperform the original ReLU. However, the reasons for their superior performance still lack rigorous theoretical justification. How these activations perform on large-scale data also still needs to be investigated. This is an open question worth pursuing in the future.

Acknowledgement

We would like to thank Jason Rolfe from D-Wave system for helpful discussion on test network for randomized leaky ReLU.