Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units

Wenling Shang, Kihyuk Sohn, Diogo Almeida, Honglak Lee

Introduction

In recent years, convolutional neural networks (CNNs) have achieved great success in many problems of machine learning and computer vision (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2015; Girshick et al., 2014). In addition, a wide range of techniques have been developed to enhance the performance or ease the training of CNNs (Lin et al., 2013; Zeiler & Fergus, 2013; Maas et al., 2013; Ioffe & Szegedy, 2015). Despite the great empirical success, fundamental understanding of CNNs is still lagging behind. Towards addressing this issue, this paper aims to provide insight on the intrinsic property of convolutional neural networks.

To better comprehend the internal operations of CNNs, we investigate the well-known AlexNet (Krizhevsky et al., 2012) and discover that the network learns highly negatively-correlated pairs of filters for the first few convolution layers. Following our preliminary findings, we hypothesize that the lower convolution layers of AlexNet learn redundant filters to extract both positive and negative phase information of an input signal (Section 2.1). Based on the premise of our conjecture, we propose a novel, simple yet effective activation scheme called Concatenated Rectified Linear Unit (CReLU). The proposed activation scheme preserves both positive and negative phase information while enforcing non-saturated non-linearity. The unique nature of CReLU allows a mathematical characterization of convolution layers in terms of reconstruction property, which is an important indicator of how expressive and generalizable the corresponding CNN features are (Section 2.2).

In experiments, we evaluate CNN models with CReLU and compare them to models with ReLU and Absolute Value Rectification Units (AVR) (Jarrett et al., 2009) on benchmark object recognition datasets, such as CIFAR-10/100 and ImageNet (Section 3). We demonstrate that simply replacing ReLU with CReLU for the lower convolution layers of an existing state-of-the-art CNN architecture yields a substantial improvement in classification performance. In addition, when applied appropriately, CReLU attains a notable parameter reduction without sacrificing classification performance.

We analyze our experimental results from several viewpoints, such as regularization (Section 4.1) and invariant representation learning (Section 4.2). Retrospectively, we provide empirical evaluations on the reconstruction property of CReLU models; we also confirm that by integrating CReLU, the original “pair-grouping” phenomenon vanishes as expected (Section 4.3). Overall, our results suggest that by better understanding the nature of CNNs, we are able to realize their higher potential with a simple modification of the architecture.

CReLU and Reconstruction Property

In our initial exploration of classic CNNs trained on natural images such as AlexNet (Krizhevsky et al., 2012), we noted a curious property of the first convolution layer filters: these filters tend to form “pairs”. More precisely, assuming a unit length vector for each filter $\phi_{i}$, we define a pairing filter of $\phi_{i}$ in the following way: $\bar{\phi}_{i}=\arg\min_{\phi_{j}}\langle\phi_{i},\phi_{j}\rangle$. We also define their cosine similarity $\mu^{\phi}_{i}=\langle\phi_{i},\bar{\phi}_{i}\rangle$.
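The pairing definition can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function name `pairing_similarity` and the toy filter bank are hypothetical, and filters are flattened to row vectors before normalization.

```python
import numpy as np

def pairing_similarity(filters):
    """Return (pair_index, mu) for a bank of filters, one filter per row."""
    flat = filters.reshape(filters.shape[0], -1).astype(np.float64)
    # Normalize to unit l2 norm so that inner products are cosine similarities.
    unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    gram = unit @ unit.T
    # The pairing filter of phi_i is the filter with the smallest inner product.
    pair_index = np.argmin(gram, axis=1)
    mu = gram[np.arange(len(unit)), pair_index]
    return pair_index, mu

# Toy bank (hypothetical values): filter 1 points opposite to filter 0.
toy = np.array([[1.0, 0.0], [-2.0, 0.0], [0.0, 3.0]])
pair_index, mu = pairing_similarity(toy)
```

In the toy bank, filters 0 and 1 pair with each other at similarity -1, while filter 2 has no opposite-phase partner and its best match sits at 0.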

In Figure 1, we show each normalized filter of the first convolution layer from AlexNet with its pairing filter. Interestingly, they appear surprisingly opposite to each other, i.e., for each filter, there does exist another filter that is almost on the opposite phase. Indeed, AlexNet employs the popular non-saturated activation function, Rectified Linear Unit (ReLU) (Nair & Hinton, 2010), which zeros out negative values and produces sparse activation. As a consequence, if both the positive phase and negative phase along a specific direction participate in representing the input space, the network then needs to learn two linearly dependent filters of both phases.

To systematically study the pairing phenomenon in higher layers, we graph the histograms of the $\bar{\mu}^{w}_{i}$'s for conv1-conv5 filters from AlexNet in Figure 2. For comparison, we generate random Gaussian filters $r_{i}$ of unit norm (we sample each entry independently from a standard normal distribution and normalize the vector to have unit $l^{2}$ norm) and plot the histograms of the $\bar{\mu}^{r}_{i}$'s together. For the conv1 layer, we observe that the distribution of $\bar{\mu}^{w}_{i}$ is negatively centered; by contrast, the mean of $\bar{\mu}^{r}_{i}$ is only slightly negative with a small standard deviation. The center of $\bar{\mu}^{w}_{i}$ then shifts gradually towards zero when going deeper into the network. This implies that convolution filters of the lower layers tend to be paired up with one or a few others that represent their opposite phase, while the phenomenon gradually lessens in deeper layers.
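The random baseline described above can be sketched as follows. The filter count and dimension are illustrative assumptions (chosen to resemble a first layer of 96 filters of size 11x11x3), and the seed is fixed only for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice

def random_unit_filters(num_filters, dim, rng):
    """Sample i.i.d. standard normal entries, then normalize each row to unit l2 norm."""
    r = rng.standard_normal((num_filters, dim))
    return r / np.linalg.norm(r, axis=1, keepdims=True)

# Assumed sizes for illustration only.
r = random_unit_filters(96, 11 * 11 * 3, rng)
gram = r @ r.T
# Similarity of each random filter to its pairing filter (smallest inner product).
mu_r = gram.min(axis=1)
mean_mu_r, std_mu_r = mu_r.mean(), mu_r.std()
```

For high-dimensional random filters, `mean_mu_r` comes out only mildly negative, which is the reference point against which the strongly negative conv1 histogram stands out.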

Following these observations, we hypothesize that despite ReLU erasing negative linear responses, the first few convolution layers of a deep CNN manage to capture both negative and positive phase information through learning pairs or groups of negatively correlated filters. This conjecture implies that there exists a redundancy among the filters from the lower convolution layers.

In fact, for a very special class of deep architecture, the invariant scattering convolutional network (Bruna & Mallat, 2013), it is well-known that its set of convolution filters, which are wavelets, is overcomplete in order to be able to fully recover the original input signals. On the one hand, similar to ReLU, each individual activation within the scattering network preserves partial information of the input. On the other hand, different from ReLU but more similar to AVR, scattering network activation preserves the energy information, i.e., keeping the modulus of the responses but erasing the phase information; ReLU from a generic CNN, as a matter of fact, retains the phase information but eliminates the modulus information when the phase of a response is negative. In addition, while the wavelets for scattering networks are manually engineered, convolution filters from CNNs must be learned, which makes the rigorous theoretical analysis challenging.

Now suppose we can leverage the pairing prior and design a method to explicitly allow both positive and negative activation; we will then be able to alleviate the redundancy among convolution filters caused by the ReLU non-linearity and make more efficient use of the trainable parameters. To this end, we propose a novel activation scheme, Concatenated Rectified Linear Units, or CReLU. It simply makes an identical copy of the linear responses after convolution, negates them, concatenates both parts of the activation, and then applies ReLU altogether. More precisely, we denote ReLU as $[\cdot]_{+}\triangleq\max(\cdot,0)$, and define CReLU as follows:

Definition 2.1. CReLU is an activation scheme $\rho_{c}:\mathbb{R}\rightarrow\mathbb{R}^{2}$ defined as $\rho_{c}(x)\triangleq([x]_{+},[-x]_{+})$.

The rationale of our activation scheme is to allow a filter to be activated in both positive and negative direction while maintaining the same degree of non-saturated non-linearity.
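As a minimal sketch of the scheme described above (copy the linear responses, negate the copy, concatenate, then apply ReLU), assuming NumPy arrays with channels on the last axis; the function name `crelu` and the example values are illustrative:

```python
import numpy as np

def crelu(x, axis=-1):
    """Concatenate ReLU(x) and ReLU(-x) along `axis`, doubling its size."""
    x = np.asarray(x, dtype=np.float64)
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=axis)

# One spatial position with three linear responses (hypothetical values).
responses = np.array([[1.5, -2.0, 0.0]])
activated = crelu(responses)
```

The output has twice as many channels as the input: the first half carries the positive phase and the second half the negative phase, so a single filter can be active in either direction.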

A similar method, namely soft-thresholding (Coates & Ng, 2011), has been applied as a separate feature encoding step in unsupervised dictionary learning to generate more separable features for a linear SVM classifier. Concurrent to our work, other research groups independently conducted related studies and presented them as the MaxMin scheme (Blot et al., 2016), ON/OFF ReLU (Kim et al., 2015), or the Antirectifier (Chollet, 2016). The antirectifier (https://github.com/fchollet/keras/blob/master/examples/antirectifier.py) has a slightly different formulation from ours, as it involves a few preprocessing steps such as mean subtraction and normalization before concatenated rectification. In comparison, we provide comprehensive experiments on large-scale datasets using deeper network architectures as well as qualitative analysis.

An alternative way to allow negative activation is to employ the broader class of non-saturated activation functions including Leaky ReLU and its variants (Maas et al., 2013; Xu et al., 2015). Leaky ReLU assigns a small slope to the negative part instead of completely dropping it. These activation functions share a similar motivation with CReLU in the sense that they both tackle the two potential problems caused by hard zero thresholding: (1) the weights of a filter will not be adjusted if it is never activated, and (2) truncating all negative information can potentially hamper the learning. However, CReLU is based on an activation scheme rather than a function, which fundamentally differentiates it from Leaky ReLU or other variants. In our version, we apply ReLU after separating the negative and positive parts to compose CReLU, but it is not the only feasible non-linearity. For example, CReLU can be combined with other activation functions, such as Leaky ReLU, to add more diversity to the architecture.

Another natural analogy to draw is between CReLU and AVR, where the latter only preserves the modulus information but discards the phase information, similar to the scattering network. AVR has not been widely used recently for CNN models due to its suboptimal empirical performance. We confirm this common belief for the large-scale image recognition task (Section 3) and conclude that modulus information alone does not suffice to produce state-of-the-art deep CNN features.

2.2 Reconstruction Property

A notable property of CReLU is its information preservation nature: CReLU conserves both negative and positive linear responses after convolution. A direct consequence of information preservation is the reconstruction power of the convolution layers equipped with CReLU.
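One way to make the information-preservation claim concrete, using only the notation $[\cdot]_{+}=\max(\cdot,0)$ from above, is to note that the two halves of the CReLU output determine the linear response $x$ exactly, whereas their sum retains only the modulus, which is what AVR keeps:

```latex
\begin{align}
  x   &= [x]_{+} - [-x]_{+}, \\
  |x| &= [x]_{+} + [-x]_{+}.
\end{align}
```

Both identities follow from the fact that at most one of $[x]_{+}$ and $[-x]_{+}$ is non-zero for any real $x$.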

The reconstruction property of a CNN implies that its features are representative of the input data. This aspect of CNNs has gained interest recently: Mahendran & Vedaldi (2015) invert CNN features back to the input under simple natural image priors; Zhao et al. (2015) stack autoencoders with a reconstruction objective to build better classifiers. Bruna et al. (2013) theoretically investigate general conditions under which the max-pooling layer followed by ReLU is injective and measure the stability of the inverting process by computing the Lipschitz lower bound. However, their bounds are non-trivial only when the number of filters significantly outnumbers the input dimension, which is not realistic.

In our case, it becomes more straightforward to analyze the reconstruction property since CReLU preserves all the information after convolution. The rest of this section mathematically characterizes the reconstruction property of a single convolution layer followed by CReLU with or without a max-pooling layer.

We first analyze the reconstruction property of convolution followed by CReLU without max-pooling. This case is directly pertinent, as deep networks replacing max-pooling with stride have become more prominent in recent studies (Springenberg et al., 2014). The following proposition states that the part of an input signal spanned by the shifts of the filters is well preserved.

See Section A.1 in the supplementary materials for proof.
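The idea that the part of the input spanned by the filters survives CReLU can be illustrated with a small numerical sketch. This is not the reconstruction algorithm from the supplementary materials; it assumes a dense Gaussian matrix `W` standing in for the shifted filters, hypothetical sizes `D` and `K`, and a pseudo-inverse as the inversion step.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 8, 3                      # hypothetical input dimension and filter count
W = rng.standard_normal((D, K))  # columns play the role of (shifted) filters
x = rng.standard_normal(D)

# Forward pass: linear responses followed by CReLU.
lin = W.T @ x
z = np.concatenate([np.maximum(lin, 0.0), np.maximum(-lin, 0.0)])

# Inversion: subtracting the two halves recovers the linear responses exactly.
y = z[:K] - z[K:]
# The pseudo-inverse gives the minimum-norm solution, which lies in range(W).
x_rec = np.linalg.pinv(W.T) @ y

# Orthogonal projection of x onto range(W), computed independently for comparison.
Q, _ = np.linalg.qr(W)
x_proj = Q @ (Q.T @ x)
```

Under these assumptions `x_rec` coincides with `x_proj`: the component of `x` inside the span of the filters is recovered, and only the component orthogonal to every filter is lost.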

Next, we add max-pooling into the picture. To reach a non-trivial bound, we need additional constraints on the input space. Due to space limit, we carefully explain the constraints and the theoretical consequence in Section A.2 of the supplementary materials. We will revisit this subject after the experiment section (Section 4.3).

Benchmark Results

We evaluate the effectiveness of the CReLU activation scheme on three benchmark datasets: CIFAR-10, CIFAR-100 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009). To directly assess the impact of CReLU, we employ existing CNN architectures with ReLU that have already shown a good recognition baseline and demonstrate improved performance on top by replacing ReLU with CReLU. Note that the models with CReLU activation do not need significant hyperparameter tuning from the baseline ReLU model, and in most of our experiments, we only tune the dropout rate while other hyperparameters (e.g., learning rate, mini-batch size) remain the same. We also replace ReLU with AVR for comparison with CReLU. The details of the network architecture are in Section F of the supplementary materials.

The CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009) each consist of 50,000 training and 10,000 testing examples of $32\times 32$ images evenly drawn from 10 and 100 classes, respectively. We subtract the mean and divide by the standard deviation for preprocessing and use random horizontal flip for data augmentation.

We use the ConvPool-CNN-C model (Springenberg et al., 2014) as our baseline model, which is composed of convolution and pooling followed by ReLU without fully-connected layers. This baseline model serves our purpose well since it has a clearly outlined network architecture with only convolution, pooling, and ReLU. It has also shown competitive recognition performance using a fairly small number of model parameters.

First, we integrate CReLU into the baseline model by simply replacing ReLU while keeping the number of convolution filters the same. This doubles the number of output channels at each convolution layer, and the total number of model parameters is doubled. To see whether the performance gain comes from the increased model capacity, we conduct additional experiments with the baseline model while doubling the number of filters and the CReLU model while halving the number of filters. We also evaluate the performance of the AVR model while keeping the number of convolution filters the same as the baseline model.
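The parameter arithmetic behind these variants can be sketched as below. The layer widths, the 3x3 kernel size and the helper names are assumptions chosen for illustration, not the actual ConvPool-CNN-C configuration.

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """Weights (plus optional biases) of one convolution layer."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

def stack_params(filters, kernel=3, in_channels=3, crelu=False):
    """Total parameters of a plain stack of convolution layers."""
    total, c_in = 0, in_channels
    for c_out in filters:
        total += conv_params(kernel, c_in, c_out)
        # CReLU doubles the channels seen by the next layer; ReLU leaves them as is.
        c_in = 2 * c_out if crelu else c_out
    return total

widths = [96, 96, 192]  # hypothetical layer widths
baseline = stack_params(widths)
crelu_same = stack_params(widths, crelu=True)
crelu_half = stack_params([w // 2 for w in widths], crelu=True)
```

In this toy stack, keeping the filter count roughly doubles the parameters (the first layer is unaffected because its input is the image), while halving the filter count halves them even though every layer after the first still receives as many input channels as in the baseline.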

Since the datasets do not provide a predefined validation set, we conduct two different cross-validation schemes:

“Single”: we hold out a subset of the training set for initial training and retrain the network from scratch using the whole training set until we reach the same loss on the held-out set (Goodfellow et al., 2013). For this case, we also report the corresponding train error rates.

10-folds: we divide the training set into 10 folds and validate on each of the 10 folds while training the networks on the remaining 9 folds. The mean error rate of a single network (“Average”) and the error rate with model averaging of 10 networks (“Vote”) are reported.
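The two reported numbers for the 10-folds scheme can be sketched as follows. The text does not specify how the models are combined, so averaging the predicted class probabilities is an assumption here, and the function name and toy values are hypothetical.

```python
import numpy as np

def average_and_vote_error(probs, labels):
    """probs: (models, examples, classes) test predictions; labels: (examples,)."""
    # "Average": mean of the individual models' error rates.
    per_model_pred = probs.argmax(axis=2)
    average_error = (per_model_pred != labels).mean(axis=1).mean()
    # "Vote": error rate after averaging the models' probabilities (assumed rule).
    vote_pred = probs.mean(axis=0).argmax(axis=1)
    vote_error = (vote_pred != labels).mean()
    return average_error, vote_error

# Toy predictions from 3 models on 2 examples with 2 classes.
probs = np.array([
    [[0.90, 0.10], [0.40, 0.60]],  # correct on both examples
    [[0.60, 0.40], [0.70, 0.30]],  # wrong on the second example
    [[0.45, 0.55], [0.20, 0.80]],  # wrong on the first example
])
labels = np.array([0, 1])
average_error, vote_error = average_and_vote_error(probs, labels)
```

In the toy case two of the three models each make one mistake, yet the averaged prediction is correct on both examples, which is the kind of gap between “Average” and “Vote” that the two columns are meant to expose.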

The recognition results are summarized in Table 1. On CIFAR-10, we observe significant improvement with the CReLU activation over ReLU. In particular, CReLU models consistently improve over ReLU models with the same number of neurons (or activations) while reducing the number of model parameters by half (e.g., the CReLU + half model and the baseline model have the same number of neurons while the numbers of model parameters are 0.7M and 1.4M, respectively). On CIFAR-100, the models with larger capacity generally improve the performance for both activation schemes. Nevertheless, we still find a clear benefit of using CReLU activation, which shows a significant performance gain when compared to the model with the same number of neurons, i.e., half the number of model parameters. One possible explanation for the benefit of using CReLU is its regularization effect, as can be confirmed in Table 1: the CReLU models show a significantly smaller gap between train and test set error rates than the baseline ReLU models.

To our slight surprise, AVR outperforms the baseline ReLU model on CIFAR-100 with respect to all evaluation metrics and on CIFAR-10 with respect to single-model evaluation. It also reaches promising single-model recognition accuracy compared to CReLU on CIFAR-10; however, when averaging or voting across 10-folds validation models, AVR becomes clearly inferior to CReLU.

We conduct experiments with a very deep CNN that has a similar network architecture to the VGG network (Simonyan & Zisserman, 2014). Specifically, we follow the model architecture and training procedure in Zagoruyko (2015). Besides the convolution and pooling layers, this network contains batch normalization (Ioffe & Szegedy, 2015) and fully connected layers. Due to the sophistication of the network composition, which may introduce complicated interactions with CReLU, we only integrate CReLU into the first few layers. Similarly, we subtract the mean and divide by the standard deviation for preprocessing and use horizontal flip and random shifts for data augmentation.

In this experiment, we gradually replace ReLU after the first, third, and fifth convolution layers with CReLU while halving the number of filters, resulting in a reduced number of model parameters. (We attempted to replace ReLU with AVR on various layers, but we observed a significant performance drop with the AVR non-linearity when used for deeper networks. Integrating CReLU into the second or fourth layer before max-pooling layers did not improve the performance.) We report the test set error rates using the same cross-validation schemes as in the previous experiments. As shown in Table 2, there is a substantial performance gain on both datasets by replacing ReLU with CReLU. Overall, the proposed CReLU activation improves the performance of the state-of-the-art VGG network significantly, achieving error rates highly competitive with other state-of-the-art methods, as summarized in Table 3.

3.2 ImageNet

To assess the impact of CReLU on a large-scale dataset, we perform experiments on the ImageNet dataset (Deng et al., 2009), which contains about 1.3M images for training and 50,000 for validation from 1,000 object categories (we used the version of the ImageNet dataset for ILSVRC 2012). For preprocessing, we subtract the mean and divide by the standard deviation for each input channel, and follow the data augmentation described in Krizhevsky et al. (2012).

We take the All-CNN-B model (Springenberg et al., 2014) as our baseline model. The network architecture of All-CNN-B is similar to that of AlexNet (Krizhevsky et al., 2012), where the max-pooling layer is replaced by convolution with the same kernel size and stride, the fully connected layer is replaced by $1\times 1$ convolution layers followed by average pooling, and the local response normalization layers are discarded. In sum, the layers other than convolution layers are replaced or discarded, and the final network consists of convolution layers only. We choose this model since it reduces the potential complication introduced by CReLU interacting with other types of layers, such as batch normalization or fully connected layers.

We gradually integrate more convolution layers with CReLU (e.g., conv1–4, conv1–7, conv1–9), while keeping the same number of filters. These models contain more parameters than the baseline model. We also evaluate two models, one replacing all ReLU layers with CReLU and the other replacing only conv1, conv4 and conv7, where both models halve the number of filters in the convolution layers preceding CReLU. Hence, these models contain fewer parameters than the baseline model. For comparison, AVR models are also constructed by gradually replacing ReLU in the same manner as in the CReLU experiments (conv1–4, conv1–7, conv1–9). The network architectures and the training details are in Section F and Section E of the supplementary materials.

The results are provided in Table 4. We report the top-1 and top-5 error rates with center crop only and by averaging scores over 10 patches from the center crop and four corners and with horizontal flip (Krizhevsky et al., 2012). Interestingly, integrating CReLU into conv1-4 achieves the best results, whereas going deeper with higher model capacity does not further benefit the classification performance. In fact, this parallels our initial observation on AlexNet (Figure 2 in Section 2.1): there exists less “pairing” in the deeper convolution layers, and thus there is not much gain by decomposing the phase in the deeper layers. AVR networks exhibit the same trend but do not noticeably improve upon the baseline performance, which implies that AVR is not the most suitable candidate for large-scale deep representation learning. Another interesting observation, which we will discuss further in Section 4.2, is that the model integrating CReLU into the conv1, conv4 and conv7 layers also achieves highly competitive recognition results with even fewer parameters than the baseline model. In sum, we believe that such a significant improvement over the baseline model by simply modifying the activation scheme is a pleasantly surprising result. (We note that Springenberg et al. (2014) reported a slightly better result, a 41.2% top-1 error rate with center crop only, than our replication result, but the improvement is still significant.)
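The 10-patch evaluation (center crop, four corners, and their horizontal flips) can be sketched as follows. The helper names and the `score_fn` callback are hypothetical, and images are assumed to be arrays of shape (height, width, channels).

```python
import numpy as np

def ten_crops(image, crop):
    """image: (H, W, C) array; returns a (10, crop, crop, C) stack of patches."""
    h, w = image.shape[:2]
    corners = [
        ((h - crop) // 2, (w - crop) // 2),  # center crop
        (0, 0), (0, w - crop),               # top-left, top-right
        (h - crop, 0), (h - crop, w - crop), # bottom-left, bottom-right
    ]
    crops = [image[t:t + crop, l:l + crop] for t, l in corners]
    crops += [c[:, ::-1] for c in crops]     # horizontal flips of the five crops
    return np.stack(crops)

def averaged_scores(score_fn, image, crop):
    """Average the class scores returned by `score_fn` over the ten patches."""
    return np.mean([score_fn(p) for p in ten_crops(image, crop)], axis=0)

# Tiny synthetic image, used only to show the shapes involved.
image = np.arange(6 * 6).reshape(6, 6, 1)
patches = ten_crops(image, 4)
```

Here `score_fn` stands for whatever maps one patch to a vector of class scores; the prediction is then taken from the averaged vector rather than from the center crop alone.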

We also compare our best models with AlexNet and other variants in Table 5. Even though reducing the number of parameters is not our primary goal, it is worth noting that our model with only 4.6M parameters (CReLU + all) outperforms FastFood-32-AD (FriedNet) (Yang et al., 2015) and Pruned AlexNet (PrunedNet) (Han et al., 2015), whose designs directly aim at parameter reduction. Therefore, besides the performance boost, another significance of the CReLU activation scheme is in designing more parameter-efficient deep neural networks.

Discussion

In this section, we discuss qualitative properties of the CReLU activation scheme from several viewpoints, such as regularization of the network and learning invariant representations.

In general, a model with more trainable parameters is more prone to overfitting. However, somewhat counter-intuitively, for the all-conv CIFAR experiments, the models with CReLU display much less overfitting than the baseline models with ReLU, even though they have twice as many parameters (Table 1). We conjecture that keeping both positive and negative phase information makes the training more challenging, and that this effect has been leveraged to better regularize deep networks, especially when working on small datasets.

Besides the empirical evidence, we can also describe the regularization effect by deriving a Rademacher complexity bound for the CReLU layer followed by a linear transformation as follows:

The proof is in Section B of the supplementary materials. Theorem 4.1 says that the complexity bound of CReLU + linear transformation is the same as that of ReLU + linear transformation, which is proved by Wan et al. (2013). In other words, although the number of model parameters is doubled by CReLU, the model complexity does not necessarily increase.

4.2 Towards Learning Invariant Features

We measure the invariance scores using the evaluation metrics from Goodfellow et al. (2009) and draw another comparison between the CReLU models and the ReLU models. For a fair evaluation, we compare all 7 conv layers from the all-conv ReLU model with those from the all-conv CReLU model trained on CIFAR-10/100. In the case of the ImageNet experiments, we choose the model where CReLU replaces ReLU for the first 7 conv layers and compare the invariance scores with the first 7 conv layers from the baseline ReLU model. Section D in the supplementary materials details how the invariance scores are measured.

Figure 4 plots the invariance scores for networks trained on CIFAR-10, CIFAR-100, and ImageNet, respectively. The invariance scores of CReLU models are consistently higher than those of ReLU models. For CIFAR-10 and CIFAR-100, there is a big increase between conv2 and conv3 and then again between conv4 and conv6, which is due to the max-pooling layers extracting shift-invariant features. We also observe that although, as a general trend, the invariance scores increase while going deeper into the networks, consistent with the observations from Goodfellow et al. (2009), the progression is not monotonic. This interesting observation suggests the potentially diverse functionality of different layers in the CNN, which would be worthwhile for future investigation.

In particular, the scores of the ImageNet ReLU model attain local maxima at the conv1, conv4 and conv7 layers. This inspires us to design the architecture where CReLU is placed after the conv1, conv4, and conv7 layers to encourage invariant representations while halving the number of filters to limit model capacity. Interestingly, this architecture achieves the best top-1 and top-5 recognition results when averaging scores from 10 patches.

4.3 Revisiting the Reconstruction Property

In Section 2.1, we observe that lower layer convolution filters from ReLU models form negatively-correlated pairs. Does the pairing phenomenon still exist for CReLU models? We take our best CReLU model trained on ImageNet (where the first 4 conv layers are integrated with CReLU) and repeat the histogram experiments to generate Figure 3. In clear contrast to Figure 2, the distributions of $\bar{\mu}^{w}_{i}$ from the CReLU model align well with the distributions of $\bar{\mu}^{r}_{i}$ from random Gaussian filters. In other words, each lower layer convolution filter now uniquely spans its own direction without a negatively correlated pairing filter, while CReLU implicitly plays the role of “pair-grouping”.

The empirical gap between CReLU and AVR justifies that both modulus and phase information are essential in learning deep CNN features. In addition, to ensure that the outgoing weights for the positive and negative phase are not merely negations of each other, we measure their correlations for the conv1-7 CReLU model trained on ImageNet. Table 6 compares the averaged correlation between the (normalized) positive-negative-pair (pair) outgoing weights and the (normalized) unmatched-pair (non-pair) outgoing weights. The pair correlations are marginally higher than the non-pair ones, but both are on average far below 1 for all layers. This suggests that, in contrast to AVR, the CReLU network does not simply focus on the modulus information but imposes different manipulation over the opposite phases.

In Section 2.2, we mathematically characterize the reconstruction property of convolution layers with CReLU. Proposition 2.1 claims that the part of an input spanned by the shifts of the filters can be fully recovered. ImageNet contains a large number of training images from a wide variety of categories; the convolution filters learned from ImageNet are thus expected to be diverse enough to describe the domain of natural images. Hence, to qualitatively verify the result from Proposition 2.1, we can directly invert features from our best CReLU model trained on ImageNet via the simple reconstruction algorithm described in the proof of Proposition 2.1 (Algorithm 1 in the supplementary materials). Figure 5 shows an image from the validation set along with its reconstructions using conv1-conv4 features (see Section G in the supplementary materials for more reconstruction examples). Unlike other reconstruction methods (Dosovitskiy & Brox, 2015; Mahendran & Vedaldi, 2015), our algorithm does not involve any additional learning. Nevertheless, it still produces reasonable reconstructions, which supports our theoretical claim in Proposition 2.1.

For the convolution layers involving the max-pooling operation, it is less straightforward to perform direct reconstruction. Yet we evaluate the conv+CReLU+max-pooling reconstruction power by measuring properties of the convolution filters; the details are elaborated in Section C of the supplementary materials.

We are grateful to Erik Brinkman, Harry Altman and Mark Rudelson for their helpful comments and support. We acknowledge Yuting Zhang and Anna Gilbert for discussions during the preliminary stage of this work. H. Lee was supported in part by ONR N00014-13-1-0762 and NSF CAREER IIS-1453651. We thank Technicolor Research for providing resources and NVIDIA for the donation of GPUs.

References

Appendix

A.2 Max-Pooling Case

To reach a non-trivial bound when max-pooling is present, we put a constraint on the input space $\mathcal{V}$: $\forall x\in\mathcal{V}$, there exists $\{c_{i}^{j}\}_{i=1,\cdots,K}^{j=1,\cdots,n}$ such that

$$x=\sum_{i=1}^{K}\sum_{j=1}^{n}c_{i}^{j}w_{i}^{j},\quad\text{where}\quad\sum_{j=1}^{n}\mathbf{1}\{c_{i}^{j}>0\}\leq 1,\ \forall i.$$

In other words, we assume that an input $x$ is a linear combination of the shifted convolution filters $\{w_{i}^{j}\}_{i=1,\cdots,K}^{j=1,\cdots,n}$ such that over a single max-pooling region, only one of the shifts participates: $\sum_{j=1}^{n}\mathbf{1}\{c_{i}^{j}>0\}\leq 1$. The rationale is that a slight translation of an object or a viewpoint change does not alter the nature of a natural image, which is how max-pooling generates shift-invariant features by taking away some fine-scaled locality information.

Next, we denote the matrix consisting of the shifts whose corresponding $c_{i}^{j}$'s are non-zero by $W_{x}$, and the vector consisting of the non-zero $c_{i}^{j}$'s by $\mathbf{c}_{x}$, i.e., $W_{x}\mathbf{c}_{x}=x$. Also, we denote the matrix consisting of the shifts whose activation is positive and selected after the max-pooling operation by $\widehat{W}^{+}_{x}$, and negative by $\widehat{W}^{-}_{x}$. Let $\widehat{W}_{x}\triangleq\left[\widehat{W}^{+}_{x},\widehat{W}^{-}_{x}\right]$. Finally, we use $\widetilde{W}_{x}$ to denote the matrix consisting of a subset of $\widehat{W}_{x}$, such that the $i$th column comes from $\widehat{W}^{+}_{x}$ if $c_{i}^{j}\geq 0$ or from $\widehat{W}^{-}_{x}$ otherwise.

Before proceeding to the main theorem and its proof, we would like to introduce more tools from Frame Theory.

A frame is a set of elements of a vector space $V$, $\{\phi_{k}\}_{k=1,\cdots,K}$, which satisfies the frame condition: there exist two real numbers $C_{1}$ and $C_{2}$, the frame bounds, such that $0<C_{1}\leq C_{2}<\infty$, and $\forall v\in V$

$$C_{1}\|v\|_{2}^{2}\leq\sum_{k=1}^{K}|\langle v,\phi_{k}\rangle|^{2}\leq C_{2}\|v\|_{2}^{2}.$$

The Frame Operator is always invertible. (Christensen, 2003)

The optimal lower frame bound $C_{1}$ is the smallest eigenvalue of $\mathcal{S}$; the optimal upper frame bound $C_{2}$ is the largest eigenvalue of $\mathcal{S}$. (Christensen, 2003)

Now we are ready to present the theorem that characterizes the reconstruction property of the conv+CReLU+max-pooling operation.

where $\lambda_{\min}$ and $\widetilde{\lambda}_{\max}$ are the squares of the minimum and maximum singular values of $W_{x}$ and $\widetilde{W}_{x}$, respectively.

We refer to the term $\frac{\|x-x^{\prime}\|_{2}}{\|x\|_{2}}$ as the reconstruction ratio in later discussions.

Appendix B Proof of Model Complexity Bound

(Rademacher Complexity) For a sample S={x1,,xL}S=\{x_{1},\cdots,x_{L}\} generated by a distribution DD on set XX and a real-valued function class F\mathcal{F} in domain XX, the empirical Rademacher complexity of F\mathcal{F} is the random variable:

where $\sigma_{i}$'s are independent uniform $\{\pm 1\}$-valued (Rademacher) random variables. The Rademacher complexity of $\mathcal{F}$ is $R_{L}(\mathcal{F})=\mathbf{E}_{S}\left[\hat{R}_{L}(\mathcal{F})\right]$.
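To make the definition concrete, the sketch below estimates an empirical Rademacher complexity by Monte Carlo for a norm-bounded linear class $\{x\mapsto\langle w,x\rangle:\|w\|_{2}\leq B\}$. The normalization used here, $\mathbf{E}_{\sigma}[\sup_{f}\frac{1}{L}\sum_{i}\sigma_{i}f(x_{i})]$, is one common convention and is an assumption; other texts include an absolute value or a factor of 2. The function name and the sample data are illustrative.

```python
import numpy as np

def empirical_rademacher_linear(X, B, n_draws=2000, seed=0):
    """Monte Carlo estimate for the class {x -> <w, x> : ||w||_2 <= B}.

    Assumed convention: R_hat = E_sigma[ sup_f (1/L) sum_i sigma_i f(x_i) ].
    X: (L, d) array of sample points.
    """
    rng = np.random.default_rng(seed)
    L = X.shape[0]
    # Each row is one draw of the Rademacher vector (sigma_1, ..., sigma_L).
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, L))
    # By Cauchy-Schwarz, sup_{||w|| <= B} <w, sum_i sigma_i x_i>
    # equals B * ||sum_i sigma_i x_i||_2.
    sups = B * np.linalg.norm(sigma @ X, axis=1) / L
    return sups.mean()

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
B = 2.0
estimate = empirical_rademacher_linear(X, B)
# Jensen's inequality gives the closed-form upper bound B * sqrt(sum_i ||x_i||^2) / L.
jensen_bound = B * np.sqrt(np.sum(X ** 2)) / X.shape[0]
print(estimate, jensen_bound)
```

The closed-form supremum is available only because the class is linear; for a general function class the supremum would have to be computed or bounded differently.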

By Lemma B.1, Proposition B.2, and the fact that ${\rm ReLU}$ is $1$-Lipschitz, we know that $\hat{R}_{L}({\rm ReLU}\circ\mathcal{G})=\hat{R}_{L}(\mathcal{G})$ and that $\hat{R}_{L}(\mathcal{H}\circ{\rm ReLU}\circ\mathcal{G})\leq\sqrt{d_{\textrm{in}}}B\hat{R}_{L}(\mathcal{F})$.

Recall from Definition 2.1 that $\rho_{c}$ is the ${\rm CReLU}$ formulation.

From (S1) to (S2), use the definition of linear transformation and inner product. From (S2) to (S3), use the Cauchy-Schwarz inequality and the assumption that $\|W\|_{2}\leq B$. From (S3) to (S4), use the definition of ${\rm CReLU}$ and the $l^{2}$ norm. From (S4) to (S5), use the definition of the $l^{2}$ norm and the $\sup$ operator. From (S5) to (S6), use the definition of $\hat{R}_{L}$.
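One property that is consistent with the use of the ${\rm CReLU}$ definition and the $l^{2}$ norm in the steps above is that concatenating the positive and negative parts of a vector preserves its $l^{2}$ norm, even though the output dimension doubles. The sketch below checks this numerically; it assumes ${\rm CReLU}(z)=([z]_{+},[-z]_{+})$ and is an illustration, not a reproduction of the omitted derivation.

```python
import numpy as np

def crelu(z):
    """Concatenated ReLU: stack the positive part and the negated negative part."""
    return np.concatenate([np.maximum(z, 0.0), np.maximum(-z, 0.0)], axis=-1)

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 16))
out = crelu(z)

# Every coordinate of z appears with its magnitude in exactly one half,
# so the squared norms of z and CReLU(z) coincide.
norms_in = np.linalg.norm(z, axis=-1)
norms_out = np.linalg.norm(out, axis=-1)
print(out.shape, np.allclose(norms_in, norms_out))
```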

We see that ${\rm CReLU}$ followed by linear transformation reaches the same Rademacher complexity bound as ${\rm ReLU}$ followed by linear transformation with the same input dimension.

Appendix C Reconstruction Ratio

Recall that Theorem A.5 characterizes the reconstruction property when max-pooling is added after ${\rm CReLU}$. As an example, we study the all-conv ${\rm CReLU}$ (half) models used for the CIFAR-10/100 experiments. In this model, the conv2 and conv5 layers are followed by max-pooling. CIFAR images are much less diverse than those from ImageNet. Instead of directly inverting features all the way back to the original images, we empirically calculate the reconstruction ratio, $\|x-x^{\prime}\|_{2}/\|x\|_{2}$. We sample testing examples, extract pooled features after the conv2 (conv5) layer, and reconstruct the features of the previous layer via Algorithm 2. For comparison, we perform the same procedure on random convolution filters, where each entry is sampled from a standard normal distribution. Essentially, convolution imposes structured zeros on the random $\widetilde{W}_{x}$; there are no published results on random subspace projection with such structured zeros. In a simplified setting without structured zeros, i.e. no convolution, it is straightforward to show that the expected reconstruction ratio is $\sqrt{\frac{D-K}{D}}$ (Theorem C.1), where, in our case, $D=48(96)\times 5\times 5$ and $K=48(96)$ for the conv2 (conv5) layer. Table S1 compares the empirical mean of the reconstruction ratios using learned filters and random filters: random filters only recover $1\%$ of the original input, whereas the learned filters span more of the input domain.

Without loss of generality, let $\|x\|_{2}=1$. Projecting a fixed $x$ onto a random subspace of dimension $D_{s}$ is equivalent to projecting a random unit-norm vector $z=(z_{1},z_{2},\cdots,z_{D})^{T}$ onto a fixed subspace of dimension $D_{s}$, thanks to the rotational invariance of the inner product. Without loss of generality, assume the fixed subspace is spanned by the first $D_{s}$ standard basis vectors, covering the first $D_{s}$ coordinates of $z$. Then the resulting projection is $z_{s}=(z_{1},z_{2},\cdots,z_{D_{s}},0,\cdots,0)$.

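Because the entries $z_{i}$ of $z$ are identically distributed and $\sum_{i=1}^{D}z_{i}^{2}=1$, we have $\mathbf{E}\left[z_{i}^{2}\right]=\frac{1}{D}$ for every $i$, and hence $\mathbf{E}\left[\|z_{s}\|_{2}^{2}\right]=\frac{D_{s}}{D}$ and $\mathbf{E}\left[\|z-z_{s}\|_{2}^{2}\right]=\frac{D-D_{s}}{D}$.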
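The simplified setting without structured zeros can be simulated directly. The sketch below projects a fixed unit vector onto random $K$-dimensional subspaces of $\mathbb{R}^{D}$ and compares the residual with $\sqrt{\frac{D-K}{D}}$. It assumes the random subspace is the column span of a Gaussian matrix, and it uses plain orthogonal projection rather than Algorithm 2; the function name and the number of trials are illustrative.

```python
import numpy as np

def residual_sq_norms(D, K, trials, seed=0):
    """Squared residual norms ||x - P x||^2 for a fixed unit vector x and
    random K-dimensional subspaces (column span of a D x K Gaussian matrix)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(D)
    x[0] = 1.0  # fixed unit-norm input
    out = np.empty(trials)
    for t in range(trials):
        # Reduced QR gives a D x K matrix with orthonormal columns.
        Q, _ = np.linalg.qr(rng.standard_normal((D, K)))
        proj = Q @ (Q.T @ x)
        out[t] = np.sum((x - proj) ** 2)
    return out

D, K = 1200, 48  # conv2 setting from the text: D = 48 * 5 * 5, K = 48
res = residual_sq_norms(D, K, trials=30)
mean_sq = res.mean()               # compare with (D - K) / D
mean_ratio = np.sqrt(res).mean()   # compare with sqrt((D - K) / D)
print(mean_sq, (D - K) / D, mean_ratio, np.sqrt((D - K) / D))
```

In this setting the residual norm stays close to 1, which matches the observation that random filters recover only a small fraction of the input.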

Appendix D Invariance Score

We use terminology consistent with that of Goodfellow et al. (2009) to illustrate the calculation of the invariance scores.

For CIFAR-10/100, we utilize all 50k testing images to calculate the invariance scores; for ImageNet, we take the center crop from 5k randomly sampled validation images.

For each individual filter, we calculate its own firing threshold, such that it is fired one percent of the time, i.e. the global firing rate is $0.01$. For ${\rm ReLU}$ models, we zero out all the negative responses when calculating the threshold; for ${\rm CReLU}$ models, we take the absolute value.
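The threshold selection can be sketched as a quantile computation over a filter's responses. This is a minimal illustration under stated assumptions: a filter is taken to "fire" when its preprocessed response strictly exceeds the threshold, the threshold is the empirical $(1-0.01)$-quantile, and the synthetic Gaussian responses stand in for real filter outputs. The function name `firing_threshold` is hypothetical.

```python
import numpy as np

def firing_threshold(responses, mode, rate=0.01):
    """Threshold such that roughly a fraction `rate` of responses exceed it.

    mode="relu":  negative responses are zeroed out first.
    mode="crelu": the absolute value of the responses is used.
    """
    r = np.asarray(responses, dtype=float)
    if mode == "relu":
        r = np.maximum(r, 0.0)
    elif mode == "crelu":
        r = np.abs(r)
    else:
        raise ValueError("mode must be 'relu' or 'crelu'")
    return np.quantile(r, 1.0 - rate)

rng = np.random.default_rng(0)
resp = rng.standard_normal(100_000)  # stand-in for one filter's responses

t_crelu = firing_threshold(resp, "crelu")
t_relu = firing_threshold(resp, "relu")
fired_crelu = np.mean(np.abs(resp) > t_crelu)
fired_relu = np.mean(np.maximum(resp, 0.0) > t_relu)
print(t_relu, t_crelu, fired_relu, fired_crelu)
```

Because the absolute value counts both signs, the ${\rm CReLU}$ threshold is higher than the ${\rm ReLU}$ threshold for symmetric responses at the same firing rate.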

To build the set of semantically similar stimuli for each testing image xx, we apply horizontal flip, 15 degree rotation and translation. For CIFAR-10/100, translation is composed of horizontal/vertical shifts by 3 pixels; for ImageNet, translation is composed of cropping from the 4 corners.

Because our setup is convolutional, we consider a filter to be fired only if both the transformed stimulus and the original testing example fire the same convolution filter at the same spatial location.

At the end, for each convolution layer, we average the invariance scores of all the filters at this layer to form the final score.

Appendix E Implementation Details on ImageNet Models

The networks from Table S13 and S13, where the number of convolution filters after ${\rm CReLU}$ is reduced by half, are optimized using Adam with an initial learning rate of $0.0002$ and a mini-batch size of $64$ examples for $100$ epochs.

Appendix F Details of Network Architecture

Appendix G Image Reconstruction

In this section, we provide more image reconstruction examples.