Invertible Residual Networks

Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, Jörn-Henrik Jacobsen

Introduction

One of the main appeals of neural network-based models is that a single model architecture can often be used to solve a variety of related tasks. However, many recent advances are based on special-purpose solutions tailored to particular domains. State-of-the-art architectures in unsupervised learning, for instance, are becoming increasingly domain-specific (Van Den Oord et al., 2016b; Kingma & Dhariwal, 2018; Parmar et al., 2018; Karras et al., 2018; Van Den Oord et al., 2016a). On the other hand, among the most successful feed-forward architectures for discriminative learning are deep residual networks (He et al., 2016; Zagoruyko & Komodakis, 2016), which differ considerably from their generative counterparts. This divide makes it complicated to choose or design a suitable architecture for a given task. It also makes it hard for discriminative tasks to benefit from unsupervised learning. We bridge this gap with a new class of architectures that perform well in both domains.

To achieve this, we focus on reversible networks which have been shown to produce competitive performance on discriminative (Gomez et al., 2017; Jacobsen et al., 2018) and generative (Dinh et al., 2014, 2017; Kingma & Dhariwal, 2018) tasks independently, albeit in the same model paradigm. They typically rely on fixed dimension splitting heuristics, but common splittings interleaved with non-volume conserving elements are constraining and their choice has a significant impact on performance (Kingma & Dhariwal, 2018; Dinh et al., 2017). This makes building reversible networks a difficult task. In this work we show that these exotic designs, necessary for competitive density estimation performance, can severely hurt discriminative performance.

To overcome this problem, we leverage the viewpoint of ResNets as an Euler discretization of ODEs (Haber & Ruthotto, 2018; Ruthotto & Haber, 2018; Lu et al., 2017; Ciccone et al., 2018) and prove that invertible ResNets (i-ResNets) can be constructed by simply changing the normalization scheme of standard ResNets.

As an intuition, Figure 1 visualizes the differences in the dynamics learned by standard and invertible ResNets.

This approach allows unconstrained architectures for each residual block, while only requiring a Lipschitz constant smaller than one for each block. We demonstrate that this restriction negligibly impacts performance when building image classifiers - they perform on par with their non-invertible counterparts on classifying MNIST, CIFAR10 and CIFAR100 images.

We then show how i-ResNets can be trained as maximum likelihood generative models on unlabeled data. To compute likelihoods, we introduce a tractable approximation to the Jacobian determinant of a residual block. Like FFJORD (Grathwohl et al., 2019), i-ResNet flows have unconstrained (free-form) Jacobians, allowing them to learn more expressive transformations than the triangular mappings used in other reversible models. Our empirical evaluation shows that i-ResNets perform competitively with both state-of-the-art image classifiers and flow-based generative models, bringing general-purpose architectures one step closer to reality.

Official code release: https://github.com/jhjacobsen/invertible-resnet

Enforcing Invertibility in ResNets

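There is a remarkable similarity between ResNet architectures and Euler’s method for ODE initial value problems:

$$x_{t+1}\leftarrow x_{t}+g_{\theta_{t}}(x_{t}),\qquad x_{t+1}\leftarrow x_{t}+hf_{\theta_{t}}(x_{t}),$$

where $x_{t}$ denotes the activations or state, $t$ the layer index or time, $h>0$ a step size, and $g_{\theta_{t}}$ a residual block. Running these dynamics backwards in time gives

$$x_{t}\leftarrow x_{t+1}-g_{\theta_{t}}(x_{t}),\qquad x_{t}\leftarrow x_{t+1}-hf_{\theta_{t}}(x_{t}),$$

which amounts to the implicit backward Euler discretization. In particular, solving the dynamics backwards in time would implement an inverse of the corresponding ResNet. The following theorem states that a simple condition suffices to make the dynamics solvable and thus renders the ResNet invertible: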

Note that this condition is not necessary for invertibility. Other approaches (Dinh et al., 2014, 2017; Jacobsen et al., 2018; Chang et al., 2018; Kingma & Dhariwal, 2018) rely on partitioning dimensions or autoregressive structures to create analytical inverses.

Thus, the convergence rate is exponential in the number of iterations $n$ and smaller Lipschitz constants will yield faster convergence.
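The fixed-point iteration behind this convergence statement can be sketched numerically. The block below is a minimal sketch under stated assumptions, not the released implementation: the two-layer residual function, the helper names `spectral_normalize`, `make_residual_block` and `invert_fixed_point`, and the choice of starting the iteration at the output are all illustrative.

```python
import numpy as np

def spectral_normalize(W, c):
    """Scale W so that its spectral norm is at most c (exact norm via SVD)."""
    sigma = np.linalg.norm(W, 2)
    return W * min(1.0, c / sigma)

def make_residual_block(d, hidden, c, rng):
    """Hypothetical residual function g(x) = W2 tanh(W1 x) with Lip(g) <= c**2 < 1."""
    W1 = spectral_normalize(rng.standard_normal((hidden, d)), c)
    W2 = spectral_normalize(rng.standard_normal((d, hidden)), c)
    return lambda x: W2 @ np.tanh(W1 @ x)

def invert_fixed_point(g, y, n_iters=100):
    """Solve y = x + g(x) for x by iterating x <- y - g(x), starting from x = y."""
    x = y.copy()
    for _ in range(n_iters):
        x = y - g(x)
    return x

rng = np.random.default_rng(0)
g = make_residual_block(d=8, hidden=32, c=0.9, rng=rng)
x_true = rng.standard_normal(8)
y = x_true + g(x_true)            # forward pass of one residual layer
x_rec = invert_fixed_point(g, y)  # numerical inverse
recon_error = np.linalg.norm(x_rec - x_true)
```

Because the map $x\mapsto y-g(x)$ is a contraction when the Lipschitz constant of $g$ is below one, the error shrinks by at least that factor per iteration, so a smaller constant needs fewer iterations for the same accuracy.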

In addition to invertibility, a contractive residual block also renders the residual layer bi-Lipschitz.

Hence by design, invertible ResNets offer stability guarantees for both their forward and inverse mapping. In the following section, we discuss approaches to enforce the Lipschitz condition.
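The bi-Lipschitz property can be seen from the triangle inequality. A short derivation, writing $F(x)=x+g(x)$ and $L=\mathrm{Lip}(g)<1$ (the symbol $L$ is introduced here only for illustration):

```latex
\begin{align*}
\|F(x)-F(y)\|_2 &\le \|x-y\|_2 + \|g(x)-g(y)\|_2 \le (1+L)\,\|x-y\|_2, \\
\|F(x)-F(y)\|_2 &\ge \|x-y\|_2 - \|g(x)-g(y)\|_2 \ge (1-L)\,\|x-y\|_2, \\
\text{hence}\quad \mathrm{Lip}(F) &\le 1+L, \qquad \mathrm{Lip}(F^{-1}) \le \frac{1}{1-L}.
\end{align*}
```

The second bound degrades as $L$ approaches one, which is consistent with the inverse becoming harder to compute for weak normalization.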

We implement residual blocks as a composition of contractive nonlinearities $\phi$ (e.g. ReLU, ELU, tanh) and linear mappings.

For example, in our convolutional networks $g=W_{3}\phi(W_{2}\phi(W_{1}))$ , where $W_{i}$ are convolutional layers. Hence, the Lipschitz constant $\mathrm{Lip}(g)$ satisfies

$$\mathrm{Lip}(g)<1\quad\text{if}\quad\|W_{i}\|_{2}<1,$$

where $\|\cdot\|_{2}$ denotes the spectral norm. Note that regularizing the spectral norm of the Jacobian of $g$ (Sokolić et al., 2017) only reduces it locally and does not guarantee the above condition. Thus, we will enforce $\|W_{i}\|_{2}<1$ for each layer.
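One common way to enforce such a layer-wise constraint is to estimate the spectral norm by power iteration and rescale the weight when the estimate exceeds a coefficient $c<1$. The block below is a sketch for a dense matrix only; the function names are hypothetical, and convolutional layers would need the iteration applied to the convolution operator itself rather than to a reshaped kernel.

```python
import numpy as np

def estimate_spectral_norm(W, n_iters=200, seed=0):
    """Power iteration on W^T W; returns ||W v||, a lower bound on the spectral norm."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(np.linalg.norm(W @ v))

def normalize_to_coefficient(W, c, n_iters=200):
    """Rescale W so that its estimated spectral norm does not exceed c."""
    sigma = estimate_spectral_norm(W, n_iters)
    return W * (c / sigma) if sigma > c else W

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 24))
sigma_est = estimate_spectral_norm(W)
W_hat = normalize_to_coefficient(W, c=0.9)
```

Since power iteration approaches the largest singular value from below, a small number of iterations can leave the true norm slightly above $c$; the sketch uses many iterations to keep that gap negligible.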

Generative Modelling with i-ResNets

For an invertible map $F$ with $z=F(x)$ and latent density $p_{z}$ , the change of variables formula gives the likelihood of any $x$ under the model:

$$\ln p_{x}(x)=\ln p_{z}(z)+\ln|\det J_{F}(x)|,\tag{3}$$

where $J_{F}(x)$ is the Jacobian of $F$ evaluated at $x$ . Models of this form are known as Normalizing Flows (Rezende & Mohamed, 2015). They have recently become a popular model for high-dimensional data due to the introduction of powerful bijective function approximators whose Jacobian log-determinant can be efficiently computed (Dinh et al., 2014, 2017; Kingma & Dhariwal, 2018; Chen et al., 2018) or approximated (Grathwohl et al., 2019).

Since i-ResNets are guaranteed to be invertible we can use them to parameterize $F$ in Equation (3). Samples from this model can be drawn by first sampling $z\sim p(z)$ and then computing $x=F^{-1}(z)$ with Algorithm 1. In Figure 2 we show an example of using an i-ResNet to define a generative model on some two-dimensional datasets compared to Glow (Kingma & Dhariwal, 2018).

While the invertibility of i-ResNets allows us to use them to define a Normalizing Flow, we must compute $\ln|\det J_{F}(x)|$ to evaluate the data-density under the model. Computing this quantity has a time cost of $\mathcal{O}(d^{3})$ in general which makes naïvely scaling to high-dimensional data impossible.

To bypass this constraint we present a tractable approximation to the log-determinant term in Equation (3), which will scale to high dimensions $d$ . Previously, Ramesh & LeCun (2018) introduced the application of log-determinant estimation to non-invertible deep generative models without the specific structure of i-ResNets.

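First, we note that the Lipschitz constrained perturbations $x+g(x)$ of the identity yield positive determinants, hence

$$|\det J_{F}(x)|=\det J_{F}(x).$$

Combined with the matrix identity $\ln\det(A)=\operatorname{tr}(\ln(A))$ , this yields

$$\ln|\det J_{F}(x)|=\operatorname{tr}(\ln J_{F}),$$

where $\operatorname{tr}$ denotes the matrix trace and $\ln$ the matrix logarithm. Thus for $z=F(x)=(I+g)(x)$ , it is

$$\ln p_{x}(x)=\ln p_{z}(z)+\operatorname{tr}\big(\ln(I+J_{g}(x))\big).$$

The trace of the matrix logarithm can be expressed as a power series (Hall, 2015)

$$\operatorname{tr}\big(\ln(I+J_{g}(x))\big)=\sum_{k=1}^{\infty}(-1)^{k+1}\frac{\operatorname{tr}(J_{g}^{k})}{k},\tag{4}$$

which converges if $\|J_{g}\|_{2}<1$ . Hence, due to the Lipschitz constraint, we can compute the log-determinant via the above power series with guaranteed convergence.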
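The convergence of the power series can be checked numerically on a small matrix. The sketch below stands in for the Jacobian $J_{g}$ with a random matrix rescaled to spectral norm 0.7 (an arbitrary illustrative value) and compares the truncated series with a brute-force log-determinant; `logdet_power_series` is a hypothetical helper name.

```python
import numpy as np

def logdet_power_series(J, n_terms):
    """Truncated series sum_{k=1}^{n} (-1)^(k+1) tr(J^k) / k for ln det(I + J)."""
    total = 0.0
    J_power = np.eye(J.shape[0])
    for k in range(1, n_terms + 1):
        J_power = J_power @ J                      # J^k
        total += (-1) ** (k + 1) * np.trace(J_power) / k
    return total

rng = np.random.default_rng(0)
d = 6
J = rng.standard_normal((d, d))
J *= 0.7 / np.linalg.norm(J, 2)                    # spectral norm 0.7 < 1

sign, exact = np.linalg.slogdet(np.eye(d) + J)     # brute-force reference
approx = logdet_power_series(J, n_terms=60)
```

The sign returned by `slogdet` is positive here, matching the observation that contractive perturbations of the identity have positive determinant.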

Stochastic Approximation of log-determinant

Expressing the log-determinant with the power series in (4) has three main computational drawbacks: 1) Computing $\operatorname{tr}(J_{g})$ exactly costs $\mathcal{O}(d^{2})$ , or approximately needs $d$ evaluations of $g$ as each entry of the diagonal of the Jacobian requires the computation of a separate derivative of $g$ (Grathwohl et al., 2019). 2) Matrix powers $J_{g}^{k}$ are needed, which requires the knowledge of the full Jacobian. 3) The series is infinite.

While stochastic trace estimation (Hutchinson’s estimator) allows for an unbiased estimate of the matrix trace, to achieve bounded computational costs, the power series (4) will be truncated at index $n$ to address drawback 3). Algorithm 2 summarizes the basic steps. The truncation turns the unbiased estimator into a biased estimator, where the bias depends on the truncation error. Fortunately, this error can be bounded as we demonstrate below.
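A stochastic version of the truncated series can be sketched with Hutchinson's trace estimator, $\operatorname{tr}(A)=\mathbb{E}_{v}[v^{T}Av]$ for $v$ with zero mean and identity covariance, using only vector-Jacobian products. This is an illustrative sketch and not Algorithm 2 itself: the Jacobian is an explicit small matrix here, whereas in a network the product `vjp(w)` would come from reverse-mode automatic differentiation, and all names below are assumptions.

```python
import numpy as np

def truncated_logdet_estimate(vjp, d, n_terms, n_samples, rng):
    """Estimate sum_{k=1}^{n} (-1)^(k+1) tr(J^k)/k using products w -> w^T J only."""
    samples = np.empty(n_samples)
    for s in range(n_samples):
        v = rng.standard_normal(d)       # E[v] = 0, Cov(v) = I
        w = v.copy()
        total = 0.0
        for k in range(1, n_terms + 1):
            w = vjp(w)                   # now w^T = v^T J^k
            total += (-1) ** (k + 1) * float(w @ v) / k
        samples[s] = total
    mean = samples.mean()
    std_err = samples.std(ddof=1) / np.sqrt(n_samples)
    return mean, std_err

rng = np.random.default_rng(0)
d = 5
J = rng.standard_normal((d, d))
J *= 0.5 / np.linalg.norm(J, 2)          # stand-in Jacobian with spectral norm 0.5
vjp = lambda w: w @ J                    # computes w^T J

estimate, std_err = truncated_logdet_estimate(vjp, d, n_terms=30, n_samples=4000, rng=rng)
_, exact = np.linalg.slogdet(np.eye(d) + J)
```

Each sample costs `n_terms` vector-Jacobian products and never forms $J_{g}^{k}$, which addresses drawbacks 1) and 2); the remaining error splits into sampling variance and truncation bias.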

To improve the stability of optimization when using this estimator we recommend using nonlinearities with continuous derivatives such as ELU (Clevert et al., 2015) or softplus instead of ReLU (See Appendix C.3).

Error of Power Series Truncation

We estimate $\ln|\det(I+J_{g})|$ with the finite power series

Let $g$ denote the residual function and $J_{g}$ the Jacobian as before. Then, the error of a truncated power series at term $n$ is bounded as

While the result above gives an error bound for evaluation of the loss, during training the error in the gradient of the loss is of greater interest. Similarly, we can obtain the following bound. The proofs are given in Appendix A.

In practice, only 5-10 terms must be taken to obtain a bias less than .001 bits per dimension, which is typically reported up to .01 precision (See Appendix E).
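A worst-case view of the truncation error follows from $|\operatorname{tr}(J_{g}^{k})|\leq d\,\mathrm{Lip}(g)^{k}$, which bounds the discarded tail of the series by $-d\big(\ln(1-\mathrm{Lip}(g))+\sum_{k=1}^{n}\mathrm{Lip}(g)^{k}/k\big)$. The sketch below evaluates this tail bound (in nats, for a single block) and searches for the smallest number of terms meeting a tolerance; the function names and the example values are illustrative assumptions.

```python
import math

def truncation_bound(lipschitz, d, n_terms):
    """Tail bound d * sum_{k>n} L^k / k = -d * (ln(1 - L) + sum_{k<=n} L^k / k)."""
    partial = sum(lipschitz ** k / k for k in range(1, n_terms + 1))
    return -d * (math.log(1.0 - lipschitz) + partial)

def terms_needed(lipschitz, d, tol):
    """Smallest n such that the worst-case tail bound is at most tol."""
    n = 1
    while truncation_bound(lipschitz, d, n) > tol:
        n += 1
    return n

# Illustrative values only: Lipschitz constant 0.9 and d = 32 * 32 * 3 inputs.
n_terms = terms_needed(lipschitz=0.9, d=3072, tol=1e-3)
```

This bound assumes every eigenvalue of the Jacobian sits at the Lipschitz limit, so it can be far more conservative than the bias observed empirically.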

Related Work

We put our focus on invertible architectures with efficient inverse computation, namely NICE (Dinh et al., 2014), i-RevNet (Jacobsen et al., 2018), Real-NVP (Dinh et al., 2017), Glow (Kingma & Dhariwal, 2018) and Neural ODEs (Chen et al., 2018) and their stochastic density estimator FFJORD (Grathwohl et al., 2019). A summary of the comparison between different reversible networks is given in Table 1.

The dimension-splitting approach used in NICE, i-RevNet, Real-NVP and Glow allows for both analytic forward and inverse mappings. However, this restriction required the introduction of additional steps like invertible $1\times 1$ convolutions in Glow (Kingma & Dhariwal, 2018). These $1\times 1$ convolutions need to be inverted numerically, making Glow altogether not analytically invertible. In contrast, i-ResNet can be viewed as an intermediate approach, where the forward mapping is given analytically, while the inverse can be computed via a fixed-point iteration.

Furthermore, an i-ResNet block has a Lipschitz bound both for forward and inverse (Lemma 2), while other approaches do not have this property by design. Hence, i-ResNets could be an interesting avenue for stability-critical applications like inverse problems (Ardizzone et al., 2019) or invariance-based adversarial vulnerability (Jacobsen et al., 2019).

Neural ODEs (Chen et al., 2018) allow free-form dynamics similar to i-ResNets, meaning that any architecture could be used as long as the input and output dimensions are the same. To obtain discrete forward and inverse dynamics, Neural ODEs rely on adaptive ODE solvers, which allows for an accuracy vs. speed trade-off. Yet, scalability to very high input dimension such as high-resolution images remains unclear.

Ordinary Differential Equations

Due to the similarity of ResNets and Euler discretizations, there are many connections between the i-ResNet and ODEs, which we review in this section.

Relationship of i-ResNets to Neural ODEs: The view of deep networks as dynamics over time offers two fundamental learning approaches: 1) Direct learning of dynamics using discrete architectures like ResNets (Haber & Ruthotto, 2018; Ruthotto & Haber, 2018; Lu et al., 2017; Ciccone et al., 2018). 2) Indirect learning of dynamics via parametrizing an ODE with a neural network as in Chen et al. (2018); Grathwohl et al. (2019).

The dynamics $x(t)$ of a fixed ResNet $F_{\theta}$ are only defined at time points $t_{i}$ corresponding to each block $g_{\theta_{t_{i}}}$ . However, a linear interpolation in time can be used to generate continuous dynamics. See Figure 1, where the continuous dynamics of a linearly interpolated invertible ResNet are shown against those of a standard ResNet. Invertible ResNets are bijective along the continuous path while regular ResNets may result in crossing or merging paths. The indirect approach of learning an ODE, on the other hand, adapts the discretization based on an ODE-solver, but does not have a fixed computational budget compared to an i-ResNet.

Spectral Sum Approximations

The approximation of spectral sums like the log-determinant is of broad interest for many machine learning problems such as Gaussian Process regression (Dong et al., 2017). Among others, Taylor approximation (Boutsidis et al., 2017) of the log-determinant similar to our approach or Chebyshev polynomials (Han et al., 2016) are used. In Boutsidis et al. (2017), error bounds on the estimation via truncated power series and stochastic trace estimation are given for symmetric positive definite matrices. However, $I+J_{g}$ is not symmetric and thus, their analysis does not apply here.

Recently, unbiased estimates (Adams et al., 2018) and unbiased gradient estimators (Han et al., 2018) were proposed for symmetric positive definite matrices. Furthermore, Chebyshev polynomials have been used to approximate the log-determinant of the Jacobian of deep neural networks in Ramesh & LeCun (2018) for density matching and evaluation of the likelihood of GANs.

Experiments

We complete a thorough experimental survey of invertible ResNets. First, we numerically verify the invertibility of i-ResNets. Then, we investigate their discriminative abilities on a number of common image classification datasets. Furthermore, we compare the discriminative performance of i-ResNets to other invertible networks. Finally, we study how i-ResNets can be used to define generative models.

To compare the discriminative performance and invertibility of i-ResNets with standard ResNet architectures, we train both models on CIFAR10, CIFAR100, and MNIST. The CIFAR and MNIST models have 54 and 21 residual blocks, respectively, and we use identical settings for all other hyperparameters. We replace strided downsampling with “invertible downsampling” operations (Jacobsen et al., 2018) to ensure bijectivity, see Appendix C.2 for training and architectural details. We increase the number of input channels to 16 by padding with zeros. This is analogous to the standard practice of projecting the data into a higher-dimensional space using a standard convolutional layer at the input of a model, but this mapping is reversible. To obtain the numerical inverse, we apply 100 fixed point iterations (Equation (1)) for each block. This number is chosen to ensure that the poor reconstructions for vanilla ResNets (see Figure 3) are not due to using too few iterations. In practice far fewer iterations suffice, as the trade-off between reconstruction error and number of iterations analyzed in Appendix D shows.

Classification and reconstruction results for a baseline pre-activation ResNet-164, a ResNet with architecture like i-ResNets without Lipschitz constraint (denoted as vanilla) and five invertible ResNets with different spectral normalization coefficients are shown in Table 2. The results illustrate that for larger settings of the layer-wise Lipschitz constant $c$ , our proposed invertible ResNets perform competitively with the baselines in terms of classification performance, while being provably invertible. When applying very conservative normalization (small $c$ ), the classification error becomes higher on all datasets tested.

To demonstrate that our normalization scheme is effective and that standard ResNets are not generally invertible, we reconstruct inputs from the features of each model using Algorithm 1. Intriguingly, our analysis also reveals that unconstrained ResNets are invertible after training on MNIST (see Figure 7 in Appendix B), whereas on CIFAR10/100 they are not. Further, we find ResNets with and without BatchNorm are not invertible after training on CIFAR10, which can also be seen from the singular value plots in Appendix B (Figure 6). The runtime on 4 GeForce GTX 1080 GPUs with 1 spectral norm iteration was 0.5 sec for a forward and backward pass of a batch with 128 samples, while it took 0.2 sec without spectral normalization. See section C.1 (appendix) for details on the runtime.

The reconstruction error decays quickly and the errors are already imperceptible after 5-20 iterations, which is the cost of 5-20 times the forward pass and corresponds to 0.15-0.75 seconds for reconstructing 100 CIFAR10 images.

Computing the inverse is fast even for the largest normalization coefficient, but becomes faster with stronger normalization. The number of iterations needed for full convergence is approximately cut in half when reducing the spectral normalization coefficient by 0.2, see Figure 8 (Appendix D) for a detailed plot. We also ran an i-RevNet (Jacobsen et al., 2018) with comparable hyperparameters as ResNet-164 and it performs on par with ResNet-164 with 5.6%. Note however, that i-RevNets, like NICE (Dinh et al., 2014), are volume-conserving, making them less well-suited to generative modeling.

In summary, we observe that invertibility without additional constraints is unlikely, but possible, whereas it is hard to predict if networks will have this property. In our proposed model, we can guarantee the existence of an inverse without significantly harming classification performance.

Comparison with Other Invertible Architectures

In this section we compare i-ResNet classifiers to the state-of-the-art invertible flow-based model Glow. We take the implementation of Kingma & Dhariwal (2018) and modify it to classify CIFAR10 images (with no generative modeling component). We create an i-ResNet that is as close as possible in structure to the default Glow model on CIFAR10 (denoted as i-ResNet Glow-style) and compare it to two variants of Glow, one that uses learned ($1\times 1$ convolutions) and affine block structure, and one with reverse permutations (like Real-NVP) and additive block structure. Results of this experiment can be found in Table 3. We can see that i-ResNets outperform all versions of Glow on this discriminative task, even when adapting the network depth and width to that of Glow. This indicates that i-ResNets have a more suitable inductive bias in their block structure for discriminative tasks than Glow.

We also find that i-ResNets are considerably easier to train than these other models. We are able to train i-ResNets using SGD with momentum and a learning rate of $0.1$ whereas all versions of Glow we tested needed Adam or Adamax (Kingma & Ba, 2014) and much smaller learning rates to avoid divergence.

Generative Modeling

We run a number of experiments to verify the utility of i-ResNets in building generative models. First, we compare i-ResNet Flows with Glow (Kingma & Dhariwal, 2018) on simple two-dimensional datasets. Figure 2 qualitatively shows the density learned by a Glow model with 100 coupling layers and 100 invertible linear transformations. We compare against an i-ResNet where the coupling layers are replaced by invertible residual blocks with the same number of parameters and the invertible linear transformations are replaced by actnorm (Kingma & Dhariwal, 2018). This results in the i-ResNet model having slightly fewer parameters, while maintaining an equal number of layers. In this experiment we train i-ResNets using the brute-force computed log-determinant since the data is two-dimensional. We find that i-ResNets are able to more accurately fit these simple densities. As stated in Grathwohl et al. (2019), we believe this is due to our model’s ability to avoid partitioning dimensions.

Next we evaluate i-ResNets as a generative model for images on MNIST and CIFAR10. Our models consist of multiple i-ResNet blocks followed by invertible downsampling or dimension “squeezing” to downsample the spatial dimensions. We use multi-scale architectures like those of Dinh et al. (2017); Kingma & Dhariwal (2018). In these experiments we train i-ResNets using the log-determinant approximation, see Algorithm 2. Full architecture, experimental, and evaluation details can be found in Appendix C.3. Samples from our CIFAR10 model are shown in Figure 5 and samples from our MNIST model can be found in Appendix F. Compared to the classification model, the log-determinant approximation with 5 series terms roughly increased the computation times by a factor of 4. The bias and variance of our log-determinant estimator are shown in Figure 4.

Results and comparisons to other generative models can be found in Table 4. While our models did not perform as well as Glow and FFJORD, we find it intriguing that ResNets, with very little modification, can create a generative model competitive with these highly engineered models. We believe the gap in performance is mainly due to our use of a biased log-determinant estimator and that the use of an unbiased method (Han et al., 2018) can help close this gap.

Other Applications

In many applications, a secondary unsupervised learning or generative modeling objective is formulated in combination with a primary discriminative task. i-ResNets are appealing here, as they manage to achieve competitive performance on both discriminative and generative tasks. We summarize some application areas to highlight that there is a wide variety of tasks for which i-ResNets would be promising to consider:

Hybrid density and discriminative models for joint classification and detection or fairness applications (Nalisnick et al., 2018; Louizos et al., 2016)

Unsupervised learning for downstream tasks (Hjelm et al., 2019; Van Den Oord et al., 2018)

Semi-supervised learning from few labeled examples (Oliver et al., 2018; Kingma et al., 2014)

Solving inverse problems with hybrid regression and generative losses (Ardizzone et al., 2019)

Adversarial robustness with likelihood-based generative models (Schott et al., 2019; Jacobsen et al., 2019)

Finally, it is plausible that the Lipschitz bounds on the layers of the i-ResNet could aid with the stability of gradients for optimization, as well as adversarial robustness.

Conclusions

We introduced a new architecture, i-ResNets, which allow free-form layer architectures while still providing tractable density estimates. The unrestricted form of the Jacobian allows expansion and contraction via the residual blocks, while partitioning-based models (Dinh et al., 2014, 2017; Kingma & Dhariwal, 2018) must include affine blocks and scaling layers to be non-volume preserving.

Several challenges remain to be addressed in future work. First, our estimator of the log-determinant is biased. However, there have been recent advances in building unbiased estimators for the log-determinant (Han et al., 2018), which we believe could improve the performance of our generative model. Second, learning and designing networks with a Lipschitz constraint is challenging. For example, we need to constrain each linear layer in the block instead of being able to directly control the Lipschitz constant of a block, see Anil et al. (2018) for a promising approach for addressing this problem.

Acknowledgments

We thank Rich Zemel for very helpful comments on an earlier version of the manuscript. We thank Yulia Rubanova for spotting a mistake in one of the proofs. We also thank everyone else at Vector for helpful discussions and feedback.

We gratefully acknowledge the financial support from the German Science Foundation for RTG 2224 “$\pi^{3}$: Parameter Identification - Analysis, Algorithms, Applications”.

References

Appendix A Additional Lemmas and Proofs

The condition above was also stated in Zhao et al. (2019) (Appendix D), however, their proof restricts the domain of the residual block $g$ to be bounded and applies only to linear operators $g$ , because the inverse was given by a convergent Neumann-series.

where $\lambda_{i}$ denotes the eigenvalues.

(Lemma 7) First, the sum over the layers is due to the function composition, because $J_{F}(x)=\prod_{t}J_{F^{t}}(x)$ and

where we use the positivity of the determinant, see Lemma 6. Furthermore, note that

(Theorem 4) First, we derive the gradient by differentiating the power series and using the linearity of the trace operator. We obtain

which is why, we consider an arbitrary $i$ from now on. It is

where we used the same arguments as in estimation (A).

In order to bound $\left\|\frac{\partial J_{g}(x,\theta)}{\partial\theta_{i}}\right\|_{2}$ , we need to look into the design of the residual block. We assume contractive and element-wise activation functions (hence $\phi^{\prime}(\cdot)<1$ ) and $N$ linear layers $W_{i}$ in a residual block. Then, we can write the Jacobian as a matrix product

Since we need to bound the derivative of the Jacobian with respect to weights $\theta_{i}$ , double backpropagation (Drucker & Lecun, 1992) is necessary. In general, the terms $\|W_{i}^{T}\|_{2}$ , $\|D_{i}\|_{2}$ , $\|D^{*}_{i}\|_{2}:=\|diag(\phi^{\prime\prime}(z_{i-1}))\|_{2}$ , $\left\|\left(\frac{\partial W_{i}}{\partial\theta_{i}}\right)\right\|_{2}$ and $\|x\|_{2}$ appear in the bound of the derivative. Hence, in order to bound $\left\|\frac{\partial J_{g}(x,\theta)}{\partial\theta_{i}}\right\|_{2}$ , we bound the previous terms as follows

In particular, (12) is due to the assumption of a Lipschitz activation function and (13) due to assuming a Lipschitz derivative of the activation function. Note that we are using continuously differentiable activation functions (hence, not ReLU), where this assumption holds for common functions like ELU, softplus and tanh. Furthermore, (14) holds by assuming bounded inputs and due to the network being Lipschitz. To understand the bound (15), we denote $s$ as the amount of parameter sharing of $\theta_{i}$ . For example, if $\theta_{i}$ is an entry from a convolution kernel, $s=w*h$ with $w$ spatial width and $h$ spatial height. Then

Hence, as each term appearing in the second derivative $\left\|\frac{\partial J_{g}(x,\theta)}{\partial\theta_{i}}\right\|_{2}$ is bounded, we can introduce the constant $a(g,\theta,x)<\infty$ which depends on the parameters, the implementation of $g$ and the inputs $x$ . Note that we do not give an exact bound on $\left\|\frac{\partial J_{g}(x,\theta)}{\partial\theta_{i}}\right\|_{2}$ , since we are only interested in the existence of such a bound in order to prove the convergence in the claim.

which proves the claimed convergence rate. ∎

Appendix B Verification of Invertibility

In this experiment we train standard ResNets and i-ResNets with various layer-wise Lipschitz coefficients ($c\in\{.3,.5,.7,.9\}$). After training, we inspect the learned transformations at each layer by computing the largest singular value of each linear mapping based on the approach in Sedghi et al. (2019). It can be seen clearly (Figure 6 left) that the standard and BatchNorm models have many singular values above 1, making their residual connections non-invertible. Conversely, in the i-ResNet models (Figure 6 right), all singular values are below 1 (and roughly equal to $c$ ) indicating their residual connections are invertible.

Computing Inverses with Fixed-Point Iteration

Here we numerically compute inverses in our trained models using the fixed-point iteration, see Algorithm 1. We invert each residual connection using 100 iterations (to ensure convergence). We see that i-ResNets can be inverted using this method whereas with standard ResNets this is not guaranteed (Figure 7 top). Interestingly, we find that standard ResNets are indeed invertible after training on MNIST (Figure 7 bottom).

Appendix C Experimental Details

C.2 Classification

We use pre-activation ResNets with 39 convolutional bottleneck blocks with 3 convolution layers each and kernel sizes of 3x3, 1x1, 3x3 respectively. All models use the ELU nonlinearity (Clevert et al., 2015). In the BatchNorm version, we apply batch normalization before every nonlinearity and in the invertible models we use ActNorm (Kingma & Dhariwal, 2018) before each residual block. The network has 2 down-sampling stages after 13 and 26 blocks where a dimension squeezing operation is used to decrease the spatial resolution. This reduces the spatial dimension by a factor of two in each direction, while increasing the number of channels by a factor of four. All models transform the data to an 8x8x256 tensor to which we apply BatchNorm, a nonlinearity, and average pooling to a 256-dimensional vector. A linear classifier is used on top of this representation.

Injective Padding

Since our invertible models are not able to increase the dimension of their latent representation, we use injective padding (Jacobsen et al., 2018) which concatenates channels of 0’s to the input, increasing the size of the transformed tensor. This is analogous to the standard practice of projecting the data into a higher-dimensional space using a non-ResNet convolution at the input of a model, but this mapping is reversible. We add 13 channels of 0’s to all models tested, thus the input to our first residual block is a tensor of size 32x32x16. We experimented with removing this step but found it led to approximately a 2% decrease in accuracy for our CIFAR10 models.
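Injective padding and its inverse can be sketched in a few lines. The block assumes a channel-last batch layout `(N, H, W, C)` to match the 32x32x16 size quoted above; the layout and the function names are illustrative choices, not taken from the released code.

```python
import numpy as np

def injective_pad(x, n_pad):
    """Append n_pad zero channels to a batch in (N, H, W, C) layout."""
    return np.pad(x, ((0, 0), (0, 0), (0, 0), (0, n_pad)))

def injective_unpad(x_padded, n_pad):
    """Inverse of injective_pad: drop the trailing n_pad channels."""
    return x_padded[..., : x_padded.shape[-1] - n_pad]

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 32, 32, 3))      # e.g. two RGB images
x_padded = injective_pad(x, n_pad=13)        # 3 + 13 = 16 channels
x_back = injective_unpad(x_padded, n_pad=13)
```

The padding is injective but not surjective: only tensors whose extra channels are zero correspond to an input, so the inverse is exact on the range of the padding.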

Training

We train for 200 epochs with momentum SGD and a weight decay of 5e-4. The learning rate is set to 0.1 and decayed by a factor of 0.2 after 60, 120 and 160 epochs. For data augmentation, we apply random shifts of up to two pixels for MNIST and shifts and random horizontal flips for CIFAR(10/100) during training. The inputs for MNIST are normalized to [-0.5,0.5], and the inputs for CIFAR(10/100) are normalized by subtracting the mean and dividing by the standard deviation of the training set.
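The step schedule described here can be written as a small function. This is a sketch with a hypothetical name and zero-based epoch indexing, so epoch 60 is the first epoch after 60 completed epochs.

```python
def learning_rate(epoch, base_lr=0.1, decay=0.2, milestones=(60, 120, 160)):
    """Step schedule: multiply base_lr by `decay` once per milestone already reached."""
    n_decays = sum(epoch >= m for m in milestones)
    return base_lr * decay ** n_decays

# Learning rate for each of the 200 training epochs.
schedule = [learning_rate(epoch) for epoch in range(200)]
```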

C.3 Generative Modeling

We used 100 residual blocks, where each residual connection is a multilayer perceptron with state sizes of 2-64-64-64-2 and ELU nonlinearities (Clevert et al., 2015). We used ActNorm (Kingma & Dhariwal, 2018) after each residual block. The change in log density was computed exactly by constructing the full Jacobian during training and visualization.

MNIST and CIFAR

The structure of our generative models closely resembles that of Glow. The model consists of “scale-blocks” which are groups of i-ResNet blocks that operate at different spatial resolutions. After each scale-block, apart from the last, we perform a squeeze operation which decreases the spatial resolution by 2 in each dimension and multiplies the number of channels by 4 (invertible downsampling).
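The squeeze operation is a lossless rearrangement of pixels into channels, which is why it is invertible. The sketch below assumes a channel-last layout `(N, H, W, C)` and one particular ordering of the new channels; the ordering used in the released code may differ, and the function names are illustrative.

```python
import numpy as np

def squeeze(x, factor=2):
    """Space-to-depth: (N, H, W, C) -> (N, H/f, W/f, f*f*C) without losing values."""
    n, h, w, c = x.shape
    x = x.reshape(n, h // factor, factor, w // factor, factor, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)            # group each f x f patch together
    return x.reshape(n, h // factor, w // factor, factor * factor * c)

def unsqueeze(y, factor=2):
    """Inverse of squeeze: (N, H, W, f*f*C) -> (N, H*f, W*f, C)."""
    n, h, w, c_big = y.shape
    c = c_big // (factor * factor)
    y = y.reshape(n, h, w, factor, factor, c)
    y = y.transpose(0, 1, 3, 2, 4, 5)            # undo the patch grouping
    return y.reshape(n, h * factor, w * factor, c)

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 32, 32, 16))
y = squeeze(x)
x_back = unsqueeze(y)
```

Because the operation only permutes entries, its Jacobian is a permutation matrix and it contributes nothing to the log-determinant.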

Our MNIST and CIFAR10 models have three scale-blocks. Each scale-block has 32 i-ResNet blocks. Each i-ResNet block consists of three convolutions of $3\times 3$ , $1\times 1$ , $3\times 3$ filters with ELU (Clevert et al., 2015) nonlinearities in between. Each convolutional layer has 32 filters in the MNIST model and 512 filters in the CIFAR10 model.

We train for 200 epochs using the Adamax (Kingma & Ba, 2014) optimizer with a learning rate of $.003$ . Throughout training we estimate the log-determinant in Equation (3) using the power-series approximation (Equation (4)) with ten terms for the MNIST model and 5 terms for the CIFAR10 model.

Evaluation

During evaluation we use the bound presented in Section 3.3 to determine the number of terms needed to give an estimate with bias less than .0001 bit/dim. We then average over enough samples from Hutchinson’s estimator such that the standard error is less than .0001 bit/dim, thus we can safely report our model’s bit/dim accurate up to a tolerance of .0002.

Choice of Nonlinearity

Differentiating our log-determinant estimator requires us to compute second derivatives of our neural network’s output. If we were to use a nonlinearity with discontinuous derivatives (e.g. ReLU), then these values are not defined in certain regions. This can lead to unstable optimization. To guarantee the quantities required for optimization always exist, we recommend using nonlinearities which have continuous derivatives such as ELU (Clevert et al., 2015) or softplus. In all of our experiments we use ELU.

Appendix D Fixed Point Iteration Analysis

Appendix E Evaluating the Bias of Our Log-determinant Estimator

Here we numerically evaluate the bias of the log-determinant estimator used to train our generative models (Equation (4)). We compare the true value (computed via brute-force) with the estimator’s mean and standard deviation as the number of terms in the power series is increased. After 10 terms, the estimator’s bias is negligible and after 20 terms it is numerically 0. This is averaged over 1000 test examples.