i-RevNet: Deep Invertible Networks

Jörn-Henrik Jacobsen, Arnold Smeulders, Edouard Oyallon

Introduction

A CNN may be very effective in classifying images of all sorts (He et al., 2016; Krizhevsky et al., 2012), but the cascade of linear and nonlinear operators reveals little about the contribution of the internal representation to the classification. The learning process is characterized by a steady reduction of large amounts of uninformative variability in the images while simultaneously revealing the essence of the visual class. It is widely believed that this process is based on progressively discarding uninformative variability about the input with respect to the problem at hand (Dosovitskiy & Brox, 2016; Mahendran & Vedaldi, 2016; Shwartz-Ziv & Tishby, 2017; Achille & Soatto, 2017). However, the extent to which information is discarded is lost somewhere in the intermediate non-linear processing steps. In this paper, we aim to provide insight into the variability reduction process by proposing an invertible convolutional network that does not discard any information about the input.

The difficulty of recovering images from their hidden representations is found in many commonly used network architectures (Dosovitskiy & Brox, 2016; Mahendran & Vedaldi, 2016). This raises the question of whether a substantial loss of information is necessary for successful classification. We show that information does not have to be discarded. By using homeomorphic layers, the invariance can be built only at the very last layer via a projection.

In Shwartz-Ziv & Tishby (2017), minimal sufficient statistics are proposed as a candidate to explain the reduction of variability. Tishby & Zaslavsky (2015) introduce the information bottleneck principle, which states that an optimal representation must reduce the mutual information between an input and its representation in order to remove as much uninformative variability as possible. At the same time, the network should maximize the mutual information between the desired output and its representation to effectively prevent each class from collapsing onto other classes. The effect of the information bottleneck was demonstrated on small datasets in Shwartz-Ziv & Tishby (2017) and Achille & Soatto (2017).

Several works (Oyallon, 2017; Zeiler & Fergus, 2014) observed a phenomenon of progressive separation and contraction in non-invertible networks on limited datasets. Those progressive improvements can be interpreted as the creation of progressively stronger invariants for classification. Ideally, the contraction should not be too abrupt, to avoid removing important information from the intermediate signal. This shows that a good trade-off between discriminability and invariance has to be progressively built. In this paper, we extend some findings of Zeiler & Fergus (2014) and Oyallon (2017) to ImageNet (Russakovsky et al., 2015) and, most importantly, show that a loss of information is not necessary for observing a progressive contraction.

The duality between invariance and separation of the classes is discussed in Mallat (2016). Here, intra-class variabilities are modeled as Lie groups that are processed by performing a parallel transport along those symmetries. Filters are adapted through learning to the specific bias of the dataset and avoid contracting along discriminative directions. However, using groups beyond the Euclidean case for image classification is hard, mainly because groups associated with abstract variabilities are difficult to estimate due to their high-dimensional nature, as is the appropriate degree of invariance required. An illustration of this framework on the Euclidean group is given by the scattering transform (Mallat, 2012), which builds invariance to small translations while being recoverable to a certain extent. In this work, we introduce a network that cannot discard any information except at the final classification stage, while we demonstrate numerically progressive contraction and separation of the signal classes.

We introduce the i-RevNet, an invertible deep network (code is available at https://github.com/jhjacobsen/pytorch-i-revnet). i-RevNets retain all information about the input signal in any of their intermediate representations up until the last layer. Our architecture builds upon the recently introduced RevNet (Gomez et al., 2017), where we replace the non-invertible components of the original RevNets by invertible ones. i-RevNets achieve the same performance on ImageNet as similar non-invertible RevNet and ResNet architectures (Gomez et al., 2017; He et al., 2016). To shed light on the mechanism underlying the generalization ability of the learned representation, we show that i-RevNets progressively separate and contract signals with depth. Our results are evidence for an effective reduction of variability through a contraction with a recoverable input obtained from a series of one-to-one mappings.

Related Work

Several recent works show that significant information about the input images is lost with depth in successful ImageNet classification CNNs (Dosovitskiy & Brox, 2016; Mahendran & Vedaldi, 2016). To understand the loss of information, the references propose to invert the representations by means of learned or hand-engineered priors. The approximate inversions indicate increased geometric and photometric invariance with depth. Multiple other works report progressive properties of deep networks that may be linked to discarded information in the representations as well, such as linearization (Radford et al., 2015), linear separability (Zeiler & Fergus, 2014), contraction (Oyallon, 2017) and low-dimensional embeddings (Aubry & Russell, 2015). However, it is not clear from the above observations whether the loss of information is a necessity for the observed progressive phenomena. In this work, we show that progressive separation and contraction can be obtained while at the same time allowing an exact reconstruction of the signal.

Multiple frameworks have been introduced that make it possible to learn invertible representations under certain conditions. Parseval networks (Cisse et al., 2017) have been introduced to increase the robustness of learned representations with respect to adversarial attacks. In this framework, the spectrum of convolutional operators is constrained to norm 1 during learning. The linear operator is thus injective.

As a consequence, the input of Parseval networks can be recovered if and only if the built-in non-linearities are invertible as well, which is typically not the case. Bruna et al. (2013) derive conditions under which pooling representations are invertible, but our method directly overcomes this issue. The scattering transform (Mallat, 2012) is an example of a predefined deep representation, approximately invariant to translations, that can be reconstructed when the degree of invariance specified is small. Yet, it requires a gradient descent optimization, and no convergence guarantees are known. In summary, the references make clear that invertibility requires special care in designing the architecture or special care in designing the optimization procedure. In this paper, we introduce a network that overcomes these issues and has an exact inverse by construction.

Our main inspiration for this work is the recent reversible residual network (RevNet), introduced in Gomez et al. (2017). RevNets are in turn closely related to NICE and Real-NVP architectures (Dinh et al., 2016; 2014), which make use of constrained Jacobian determinants for generative modeling. All these architectures are similar to the lifting scheme (Sweldens, 1998) and Feistel cipher diagrams (Menezes et al., 1996), as we will show. RevNets illustrate how to build invertible ResNet-type blocks that avoid storing intermediate activations necessary for the backward pass. However, RevNets still employ multiple non-invertible operators like max-pooling and downsampling operators as part of the network. As such, RevNets are not invertible by construction. In this paper, we show how to build an invertible type of RevNet architecture that performs competitively with RevNets on ImageNet, which we call i-RevNet for invertible RevNet.

The i-RevNet

This section introduces the general framework of the i-RevNet architecture and explains how to explicitly build an inverse or a left-inverse to an i-RevNet. Its practical implementation is discussed, and we demonstrate competitive numerical results.

In this way, we avoid the non-invertible modules of a RevNet (e.g. max-pooling or strides) which are necessary to train them in a reasonable time and are designed to build invariance w.r.t. translation variability. Our method shows that we can replace them by linear and invertible modules $\mathcal{S}_j$ that can reduce the spatial resolution (we refer to it as a spatial down-sampling for the sake of simplicity) while maintaining the layer’s size by increasing the number of channels.

We keep the computational cost manageable by tightly coupling downsampling and increase in width of the network. Reducing the spatial resolution can be undesirable, so $\mathcal{S}_j$ can potentially be the identity. We refer to such networks as i-RevNets. This leads to the following equations:
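As a rough illustration of why such a block has an exact inverse, the sketch below assumes the additive coupling used in RevNets (Gomez et al., 2017): one stream receives the residual update, the other is carried through an invertible operator. The names `block_forward` and `block_inverse`, the toy residual function `F` and the permutation standing in for $\mathcal{S}_j$ are hypothetical choices for this example, not the released implementation.

```python
import numpy as np

def block_forward(x, x_tilde, F, S):
    """One coupling step: (x_j, x~_j) -> (x_{j+1}, x~_{j+1})."""
    x_next = x_tilde + F(x)   # residual update; F may be non-invertible
    x_tilde_next = S(x)       # x_j is carried over through the invertible S
    return x_next, x_tilde_next

def block_inverse(x_next, x_tilde_next, F, S_inv):
    """Inverse of block_forward; F is re-evaluated, never inverted."""
    x = S_inv(x_tilde_next)   # recover x_j first
    x_tilde = x_next - F(x)   # then subtract the same F(x_j)
    return x, x_tilde

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
F = lambda v: np.maximum(W @ v, 0.0)   # toy residual function (linear map + ReLU)
perm = rng.permutation(8)
S = lambda v: v[perm]                  # toy invertible linear operator (assumption)
S_inv = lambda v: v[np.argsort(perm)]  # undoes the permutation

x, x_tilde = rng.standard_normal(8), rng.standard_normal(8)
y, y_tilde = block_forward(x, x_tilde, F, S)
x_rec, x_tilde_rec = block_inverse(y, y_tilde, F, S_inv)
```

The point of the design is that invertibility of the block never depends on `F`: only the addition and the operator `S` have to be undone.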

Our downsampling layer can be written, with $u$ the spatial variable and $\lambda$ the channel index:
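One common way to realize an invertible downsampling of this kind is a space-to-depth rearrangement, which moves each small spatial block into the channel dimension. The sketch below assumes a (channels, height, width) layout and a block size of 2; the function names are hypothetical and the exact index ordering of the paper's operator is not reproduced here.

```python
import numpy as np

def space_to_depth(x, b=2):
    """(C, H, W) -> (C*b*b, H/b, W/b): each b x b spatial block becomes channels."""
    c, h, w = x.shape
    assert h % b == 0 and w % b == 0, "spatial size must be divisible by b"
    x = x.reshape(c, h // b, b, w // b, b)
    x = x.transpose(2, 4, 0, 1, 3)        # (b, b, C, H/b, W/b)
    return x.reshape(c * b * b, h // b, w // b)

def depth_to_space(y, b=2):
    """Exact inverse of space_to_depth."""
    cbb, hb, wb = y.shape
    c = cbb // (b * b)
    y = y.reshape(b, b, c, hb, wb)
    y = y.transpose(2, 3, 0, 4, 1)        # (C, H/b, b, W/b, b)
    return y.reshape(c, hb * b, wb * b)

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
y = space_to_depth(x)
x_rec = depth_to_space(y)
```

The operation only permutes coefficients, so the number of values per layer is unchanged while the spatial resolution is halved and the channel count is multiplied by four.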

Architecture, training and performances

In this subsection, we describe two models that we trained: an injective i-RevNet (a) and a bijective i-RevNet (b), with fewer parameters. The hyper-parameters were selected to be either close to the ResNet and RevNet baselines in terms of the number of layers (a) or parameters (b) while keeping performance competitive. For the same reasons as in Gomez et al. (2017), our scheme also avoids storing any intermediate activations at training time, so memory consumption is not an issue in practice even for very deep i-RevNets. We compare our implementation with a RevNet with 56 layers corresponding to 28M parameters, as provided in the open source release of Gomez et al. (2017), and with a standard ResNet of 50 layers, with 26M parameters (He et al., 2016).

Each block $\mathcal{F}_j$ is a bottleneck block, which consists of a succession of 3 convolutional operators, each preceded by batch normalization (Ioffe & Szegedy, 2015) and a ReLU non-linearity. The second layer has four times fewer channels than the other two, while their corresponding kernel sizes are respectively $1\times 1$, $3\times 3$, $1\times 1$.

The final representation is spatially averaged and projected onto the 1000 classes after a ReLU non-linearity. We now discuss how we progressively decrease the spatial resolution, while increasing the number of channels per layer by use of the operators $\mathcal{S}_j$.

We first describe the model (a), which consists of 56 layers that have been optimized to match the performances of a RevNet or a ResNet with approximately the same number of layers. In particular, we explain how we progressively decrease the spatial resolution, while increasing the number of channels per block by use of the operators $\mathcal{S}_j$.

We report the training loss (i.e. cross entropy) curves of our i-RevNet (b) and the ResNet baseline in Figure 3, displayed as a moving average over 100 iterations. Observe that the decrease of both training losses is very similar, which indicates that the constraint of invertibility does not interfere negatively with the learning process. However, we observed one third longer wall-clock times for i-RevNets compared to plain RevNets because the channel size becomes larger. Table 1 reports the performances of our i-RevNets alongside a comparable RevNet and ResNet. First, we compare the i-RevNet (a) with the RevNet and ResNet. Indeed, those CNNs have the same number of layers, and the i-RevNet (a) increases the channel width of the initial layer as done in Gomez et al. (2017). The drawback of this technique is that the kernel sizes will be larger for all subsequent layers.

The i-RevNet (a) has about 6 times more parameters than a RevNet and a ResNet but leads to a similar accuracy on the validation set of ImageNet. In contrast, the i-RevNet (b) is designed to have roughly the same number of parameters as the RevNet and ResNet, while being bijective. Its accuracy decreases by 1.5% absolute on ImageNet compared to the RevNet baseline, which is not surprising because the number of channels was not drastically increased in the earlier layers as done in the baselines (Gomez et al., 2017; Krizhevsky et al., 2012; He et al., 2016); we did not explore wide ranges of hyper-parameters, thus the gap between (a) and (b) can likely be reduced with additional engineering.

Analysis of the inverse

We now analyze the representation $\Phi$ built by our bijective neural network i-RevNet (b) and its inverse $\Phi^{-1}$, as trained on ILSVRC-2012. We first explain why obtaining $\Phi^{-1}$ is challenging, even locally. We then discuss the reconstruction, while displaying in the image space linear interpolations between representations.

In the previous section, we have described the i-RevNet architecture, which permits defining a deep network with an explicit inverse. We now explain why this is normally difficult, by studying its local inversion. We study the local stability of a network $\Phi$ and its inverse $\Phi^{-1}$ w.r.t. its input, which means that we quantify locally the variations of the network and its inverse w.r.t. small variations of an input. As $\Phi$ is differentiable (and its inverse as well), an equivalent way to perform this study is to analyze the singular values of the differential $\partial\Phi$ at some point, as for $(a,b)$ close the following first-order expansion holds:

$$\Phi(a) = \Phi(b) + \partial\Phi(b)(a-b) + o(\|a-b\|)$$

Ideally, a well-conditioned operator has all its singular values equal to 1, as achieved for instance by the isometric operators of Cisse et al. (2017).

In our numerical application to an image $x$, $\partial\Phi_x$ corresponds to a very large matrix (square of the number of coefficients of the image at least) whose computations are expensive. Figure 4 shows the singular values of the differential (i.e. the square roots of the eigenvalues of $\partial\Phi^{*}\partial\Phi$), in decreasing order, for a given natural image from ImageNet. The example we plot is typical of the behavior of $\partial\Phi$. Observe there is a fast decay: numerically, the first $10^3$ and $10^4$ singular values are responsible respectively for $80\%$ and $97\%$ of the cumulated energy (i.e. sum of squared singular values). This indicates that $\Phi$ linearizes the space locally in a considerably smaller space in comparison to the original input dimension. However, the dimensionality is still quite large (i.e. $>10$) and thus we cannot infer that $\Phi$ lies locally in a low-dimensional manifold. It also shows that inverting $\Phi$ is difficult and is an ill-conditioned problem. Thus obtaining this inverse implicitly would be a challenging task that we avoided, thanks to the formal reconstruction algorithm provided by Subsection 3.1.
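The cumulated-energy statistic described above can be sketched on a small matrix. The example assumes a toy diagonal matrix `J` standing in for the differential (the real one is far too large to form this way), and `energy_rank` is a hypothetical helper name.

```python
import numpy as np

def energy_rank(J, fraction):
    """Smallest k such that the top-k singular values of J hold `fraction` of the energy."""
    s = np.linalg.svd(J, compute_uv=False)   # singular values, in decreasing order
    energy = np.cumsum(s ** 2)
    energy = energy / energy[-1]             # normalized cumulated energy, ends at 1
    return int(np.searchsorted(energy, fraction) + 1)

# Toy stand-in for the differential, with fast-decaying singular values.
J = np.diag([3.0, 2.0, 1.0, 0.1])
k80 = energy_rank(J, 0.80)   # number of singular values needed for 80% of the energy
k97 = energy_rank(J, 0.97)   # number needed for 97%
cond = np.linalg.cond(J)     # largest / smallest singular value
```

A large condition number together with a small `energy_rank` is the toy analogue of the situation described in the text: most of the energy sits in few directions, and the local inverse is ill-conditioned.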

Linear interpolation and reconstruction

Visualizing or understanding the important directions in the representation of inner layers of a CNN, and in particular the final layer, is complex because typically the cascade is either not invertible or unstable. One approach to reconstruct from an output layer consists in finding the input image that matches the activation via gradient descent. However, this technique leads only to a partial or informal reconstruction (Mahendran & Vedaldi, 2015).

Another method consists in embedding the representation in a lower dimensional space and comparing the common attributes of nearest neighbors (Szegedy et al., 2013). It is also possible to train a CNN to reconstruct the representation (Dosovitskiy & Brox, 2016). Yet these methods require a priori knowledge in order to find the appropriate embeddings or training sets. We now discuss the improvements achieved by the i-RevNet.

Our main claim is that while the local inversion is ill-conditioned, the computations of the inverse $\Phi^{-1}$ do not involve significant round-off errors. The forward pass of the network does not seem to suffer from significant instabilities, thus it seems coherent to assume that this will hold for $\Phi^{-1}$ as well. For example, adding constraints beyond vanishing moments in the case of a Lifting scheme is difficult (Sweldens, 1998; Mallat, 1999), and this is a weakness of this method. We validate our claim by computing the empirical relative error on several subsets $\mathcal{X}$ of data:

We evaluate this measure on a set $\mathcal{X}_1$ of $|\mathcal{X}_1| = 10^4$ independent uniform noise samples and on the validation set $\mathcal{X}_2$ of ImageNet. We report $\epsilon(\mathcal{X}_1) = 5\times 10^{-6}$ and $\epsilon(\mathcal{X}_2) = 3\times 10^{-6}$ respectively, which are close to the machine error and indicate that the inversion does not suffer from significant round-off errors.
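A measure of this kind can be sketched as follows. The example assumes that the relative error is the mean over the set of $\|x - \Phi^{-1}\Phi x\| / \|x\|$ (the formula itself is not shown in the text above), and it uses a small single-precision coupling map as a hypothetical stand-in for the trained network.

```python
import numpy as np

def relative_error(phi, phi_inv, X):
    """Mean over x in X of ||x - phi_inv(phi(x))|| / ||x|| (assumed definition)."""
    errs = [np.linalg.norm(x - phi_inv(phi(x))) / np.linalg.norm(x) for x in X]
    return float(np.mean(errs))

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16)).astype(np.float32)
V = rng.standard_normal((16, 16)).astype(np.float32)

def phi(x):
    """Toy invertible map: two additive coupling steps in float32."""
    a, b = x[:16], x[16:]
    b = b + np.tanh(W @ a)
    a = a + np.tanh(V @ b)
    return np.concatenate([a, b])

def phi_inv(z):
    """Undo the two coupling steps in reverse order."""
    a, b = z[:16], z[16:]
    a = a - np.tanh(V @ b)
    b = b - np.tanh(W @ a)
    return np.concatenate([a, b])

X = rng.uniform(size=(100, 32)).astype(np.float32)   # uniform noise inputs
eps = relative_error(phi, phi_inv, X)
```

For this toy map the error stays near single-precision round-off, because the inverse only repeats the same additions as subtractions.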

Given a pair of images $\{x^0, x^1\}$, we propose to study linear interpolations between the pair of representations $\{\Phi x^0, \Phi x^1\}$ in the feature domain. Those interpolations correspond to existing images as $\Phi^{-1}$ is an exact inverse. We reconstruct a convex path between two input points; it means that if:

$$\phi^t = t\,\Phi x^0 + (1-t)\,\Phi x^1, \quad t \in [0,1],$$

then $x^t = \Phi^{-1}\phi^t$ is a signal that corresponds to an image.

We discretize $[0,1]$ into $\{t_1, \ldots, t_k\}$, adapt the step size manually and reconstruct the sequence $\{x^{t_1}, \ldots, x^{t_k}\}$. Results are displayed in Figure 5. We selected images from the Basel Face dataset (Paysan et al., 2009), the Describable Textures Dataset (Cimpoi et al., 2014) and ImageNet.
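The reconstruction of such a path can be sketched with a toy invertible map. The example assumes the convention that $t=1$ gives the first input and $t=0$ the second, uses a uniform grid of $t$ values rather than a manually adapted one, and the map `phi` is a hypothetical stand-in for the trained network.

```python
import numpy as np

def phi(x):
    """Toy invertible map (one additive coupling step), not the trained network."""
    h = x.size // 2
    a, b = x[:h], x[h:]
    return np.concatenate([a, b + a ** 2])

def phi_inv(z):
    """Exact inverse of phi."""
    h = z.size // 2
    a, b = z[:h], z[h:]
    return np.concatenate([a, b - a ** 2])

def feature_path(x0, x1, ts):
    """Invert the convex combinations t*phi(x0) + (1-t)*phi(x1) for each t in ts."""
    z0, z1 = phi(x0), phi(x1)
    return [phi_inv(t * z0 + (1.0 - t) * z1) for t in ts]

x0 = np.array([1.0, 0.0])
x1 = np.array([-1.0, 0.0])
ts = np.linspace(0.0, 1.0, 5)
path = feature_path(x0, x1, ts)
midpoint = path[2]   # reconstruction at t = 0.5
```

Because `phi` is non-linear, the reconstructed midpoint differs from the image-space average `(x0 + x1) / 2`, which mirrors the observation that interpolating features is not the same as interpolating images.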

We now interpret the results. First, observe that a linear interpolation in the feature space is not a linear interpolation in the image space and that intermediary images are noisy, even for small deformations, yet they mostly remain recognizable. However, some geometric transformations such as a 3D-rotation seem to have been linearized, as suggested in Aubry & Russell (2015). In the next section, we thus investigate how the linear separation progresses with depth.

A contraction

In this section, we study again the bijective i-RevNet. We first show that a localized or linear classifier progressively improves with depth. Then, we describe the linear subspace spanned by $\Phi$, namely the feature space, showing that the classification can be performed on a much smaller subspace, which can be built via a PCA.

We show that both a ResNet and an i-RevNet build a progressively more linearly separable and contracted representation as measured in Oyallon (2017). Observe that this property holds for the i-RevNet despite the fact that it cannot discard any information.

We observe that both classifiers progressively improve similarly with depth for each model, the linear SVM performing slightly better than the nearest neighbor classifier because it is the more robust and discriminative classifier of the two. In the case of the i-RevNet, the classification performed by the CNN leads to $77\%$, and the linear SVM performs slightly better because we did not fine-tune the model to 100 classes. Observe that there is a more intense jump of performance on the last 3 layers, which seems to indicate that the earlier layers have prepared the representation to be more contracted and linearly separated for the final layers.

The results suggest a low-dimensional embedding of the data, but this is difficult to validate as estimating local dimensionality in high dimensions is an open problem. However, in the next section, we try to compute the dimension of the discriminative part of the representation built by an i-RevNet.

Dimensionality analysis of the feature space

In this section, we investigate if we can refine the dimensionality of informative variabilities in the final layer of an i-RevNet. Indeed, the cascade of convolutional operators has been trained on the training set to separate the 1000 different classes while being a homeomorphism on its feature space. Thus, the dimensionality of the feature space is potentially large.

As shown in the previous subsection, the final layer is progressively prepared to be projected on the final probes corresponding to the classes. This indicates that the non-informative variabilities for classification can be removed via a linear projection on the final layer $\Phi$, which lies in a space of dimension 1000, at most. However, this projection has been built via supervision, which can still retain directions that have been contracted and thus will not be selected by an algorithm such as PCA. We show that in fact a PCA retains the necessary information for classification in a small subspace.

To do so, we build the linear projectors $\pi_d$ on the subspace of the first $d$ principal components, and we propose to measure the classification power of the projected representation with a supervised classifier, e.g. nearest neighbor or a linear SVM, on the previous 100 class task. Again, the feature representations $\{\Phi x^n\}_{n\leq N}$ are spatially averaged to remove the translation variability, and standardized on the training set. We apply both classifiers, and we report the classification accuracy of $\{\pi_d\Phi x^n\}_{n\leq N}$ w.r.t. $d$ in Figure 7. A linear projection removes some information that cannot be recovered by a linear classifier, therefore we observe that the classification accuracy only decreases significantly for $d\leq 200$. This shows that the signal indeed lies in a subspace of much lower dimension than the original feature dimensions, which can be extracted simply with a PCA that only considers directions of largest variances, illustrating a successful contraction of the representation.
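The projection-then-classify procedure can be sketched on synthetic data. The example assumes features that are already spatially averaged, skips the standardization step, and uses a two-class toy problem with a nearest neighbor classifier; `pca_projector` and `nearest_neighbor_accuracy` are hypothetical helper names.

```python
import numpy as np

def pca_projector(feats, d):
    """Return (mean, basis); basis columns are the d leading principal directions."""
    mean = feats.mean(axis=0)
    _, _, vt = np.linalg.svd(feats - mean, full_matrices=False)
    return mean, vt[:d].T

def nearest_neighbor_accuracy(train_x, train_y, test_x, test_y):
    """1-nearest-neighbor accuracy under the Euclidean distance."""
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    pred = train_y[d2.argmin(axis=1)]
    return float((pred == test_y).mean())

rng = np.random.default_rng(0)

def make_features(n, dim=50):
    """Toy features: the class is encoded along one high-variance direction."""
    y = rng.integers(0, 2, size=n)
    x = 0.1 * rng.standard_normal((n, dim))
    x[:, 0] = (2 * y - 1) * 3.0 + 0.5 * rng.standard_normal(n)
    return x, y

train_x, train_y = make_features(200)
test_x, test_y = make_features(100)

mean, basis = pca_projector(train_x, d=1)   # fit the projector on the training set only
acc = nearest_neighbor_accuracy((train_x - mean) @ basis, train_y,
                                (test_x - mean) @ basis, test_y)
```

Sweeping `d` and recording `acc` would give a toy version of the accuracy-versus-dimension curve discussed in the text.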

Conclusion

Invertible representations and their relationship to loss of information have been on the agenda of deep learning for some time. Understanding how transformations in feature space are related to the corresponding input is an important step towards interpretable deep networks. Invertible deep networks may play an important role in such analysis since, for example, one could potentially back-track a property from the feature space to the input space. To the best of our knowledge, this work provides the first empirical evidence that learning invertible representations that do not discard any information about their input on large-scale supervised problems is possible.

To achieve this, we introduce the i-RevNet class of CNNs, which is fully invertible and allows the input to be recovered exactly from its last convolutional layer. i-RevNets achieve the same classification accuracy on complex datasets, as illustrated on ILSVRC-2012, as the RevNet (Gomez et al., 2017) and ResNet (He et al., 2016) architectures with a similar number of layers. Furthermore, the inverse network is obtained for free when training an i-RevNet, requiring only minimal adaptation to recover inputs from the hidden representations.

The absence of loss of information is surprising, given the widespread belief that discarding information is essential for learning representations that generalize well to unseen data. We show that this is not the case and propose to explain the generalization property with empirical evidence of progressive separation and contraction with depth, on ImageNet.

Acknowledgements

Jörn-Henrik Jacobsen was partially funded by the STW perspective program ImaGene. Edouard Oyallon was partially funded by the ERC grant InvariantClass 320959, via a grant for PhD Students of the Conseil régional d’Ile-de-France (RDM-IdF), and a postdoctoral grant from DPEI of Inria (AAR 2017POD057) for the collaboration with CWI. We thank Berkay Kicanaoglu for the Basel Face data and Mathieu Andreux, Eugene Belilovsky, Amal Rannen and Kyriacos Shiarlies for feedback on drafts of the paper.

References