Robust and interpretable blind image denoising via bias-free convolutional neural networks

Sreyas Mohan, Zahra Kadkhodaie, Eero P. Simoncelli, Carlos Fernandez-Granda

Introduction and Contributions

The problem of denoising consists of recovering a signal from measurements corrupted by noise, and is a canonical application of statistical estimation that has been studied since the 1950s. Achieving high-quality denoising results requires (at least implicitly) quantifying and exploiting the differences between signals and noise. In the case of photographic images, the denoising problem is both an important application and a useful test-bed for our understanding of natural images. In the past decade, convolutional neural networks (LeCun et al., 2015) have achieved state-of-the-art results in image denoising (Zhang et al., 2017; Chen & Pock, 2017). Despite their success, these solutions are mysterious: we lack both intuition and formal understanding of the mechanisms they implement. Network architecture and functional units are often borrowed from the image-recognition literature, and it is unclear which of these aspects contribute to, or limit, the denoising performance. The goal of this work is to advance our understanding of deep-learning models for denoising. Our contributions are twofold: First, we study the generalization capabilities of deep-learning models across different noise levels. Second, we provide novel tools for analyzing the mechanisms implemented by neural networks to denoise natural images.

An important advantage of deep-learning techniques over traditional methodology is that a single neural network can be trained to perform denoising at a wide range of noise levels. Currently, this is achieved by simulating the whole range of noise levels during training (Zhang et al., 2017). Here, we show that this is not necessary. Neural networks can be made to generalize automatically across noise levels through a simple modification in the architecture: removing all additive constants. We find this holds for a variety of network architectures proposed in previous literature. We provide extensive empirical evidence that the main state-of-the-art denoising architectures systematically overfit to the noise levels in the training set, and that this is due to the presence of a net bias. Suppressing this bias makes it possible to attain state-of-the-art performance while training over a very limited range of noise levels.

The data-driven mechanisms implemented by deep neural networks to perform denoising are almost completely unknown. It is unclear what priors are being learned by the models, and how they are affected by the choice of architecture and training strategies. Here, we provide novel linear-algebraic tools to visualize and interpret these strategies through a local analysis of the Jacobian of the denoising map. The analysis reveals locally adaptive properties of the learned models, akin to existing nonlinear filtering algorithms. In addition, we show that the deep networks implicitly perform a projection onto an adaptively-selected low-dimensional subspace capturing features of natural images.

Related Work

The classical solution to the denoising problem is the Wiener filter (Wiener, 1950), which assumes a translation-invariant Gaussian signal model. The main limitation of Wiener filtering is that it over-smooths, eliminating fine-scale details and textures. Modern filtering approaches address this issue by adapting the filters to the local structure of the noisy image (e.g. Tomasi & Manduchi (1998); Milanfar (2012)). Here we show that neural networks implement such strategies implicitly, learning them directly from the data.

In the 1990s, powerful denoising techniques were developed based on multi-scale ("wavelet") transforms. These transforms map natural images to a domain where they have sparser representations. This makes it possible to perform denoising by applying nonlinear thresholding operations in order to discard components that are small relative to the noise level (Donoho & Johnstone, 1995; Simoncelli & Adelson, 1996; Chang et al., 2000). From a linear-algebraic perspective, these algorithms operate by projecting the noisy input onto a lower-dimensional subspace that contains plausible signal content. The projection eliminates the orthogonal complement of the subspace, which mostly contains noise. This general methodology laid the foundations for the state-of-the-art models in the 2000s (e.g. Dabov et al., 2006), some of which added a data-driven perspective, learning sparsifying transforms (Elad & Aharon, 2006) and nonlinear shrinkage functions (Hel-Or & Shaked, 2008; Raphan & Simoncelli, 2008) directly from natural images. Here, we show that deep-learning models learn similar priors in the form of local linear subspaces capturing image features.

In the past decade, purely data-driven models based on convolutional neural networks (LeCun et al., 2015) have come to dominate all previous methods in terms of performance. These models consist of cascades of convolutional filters, and rectifying nonlinearities, which are capable of representing a diverse and powerful set of functions. Training such architectures to minimize mean square error over large databases of noisy natural-image patches achieves current state-of-the-art results (Zhang et al., 2017; Huang et al., 2017; Ronneberger et al., 2015; Zhang et al., 2018a).

Network Bias Impairs Generalization

Based on equation 1, which expresses the network locally as $f(y)=A_{y}y+b_{y}$ , we can perform a first-order decomposition of the error or residual of the neural network for a specific input: $y-f(y)=(I-A_{y})y-b_{y}$ . Figure 1 shows the magnitude of the residual and the constant, which is equal to the net bias $b_{y}$ , for a range of noise levels. Over the training range, the net bias is small, implying that the linear term is responsible for most of the denoising (see Figures 9 and 10 for a visualization of both components). However, when the network is evaluated at noise levels outside of the training range, the norm of the bias increases dramatically, and the residual is significantly smaller than the noise, suggesting a form of overfitting. Indeed, network performance generalizes very poorly to noise levels outside the training range. This is illustrated for an example image in Figure 2, and demonstrated through extensive experiments in Section 5.
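The first-order decomposition can be checked numerically on a toy model. The following is a minimal sketch, assuming a small fully connected ReLU network with random (untrained) weights as a stand-in for a denoising CNN; the names `forward` and `local_affine` and the layer sizes are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fully connected ReLU network with biases (a stand-in for a CNN).
sizes = [8, 16, 16, 8]
weights = [rng.standard_normal((m, n)) / np.sqrt(n)
           for n, m in zip(sizes[:-1], sizes[1:])]
biases = [0.1 * rng.standard_normal(m) for m in sizes[1:]]

def forward(y):
    """Evaluate the network; ReLU after every layer except the last."""
    h = y
    for i, (W, c) in enumerate(zip(weights, biases)):
        h = W @ h + c
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)
    return h

def local_affine(y):
    """Return (A_y, b_y) such that forward(y) == A_y @ y + b_y."""
    A = np.eye(len(y))
    b = np.zeros(len(y))
    h = y
    for i, (W, c) in enumerate(zip(weights, biases)):
        pre = W @ h + c
        A, b = W @ A, W @ b + c
        if i < len(weights) - 1:
            mask = (pre > 0).astype(float)  # ReLU activation pattern at y
            A, b = mask[:, None] * A, mask * b
            h = np.maximum(pre, 0.0)
        else:
            h = pre
    return A, b

y = rng.standard_normal(sizes[0])
A_y, b_y = local_affine(y)
residual = y - forward(y)                        # y - f(y)
decomposed = (np.eye(len(y)) - A_y) @ y - b_y    # (I - A_y) y - b_y
```

Comparing `np.linalg.norm(b_y)` with `np.linalg.norm(residual)` for inputs at different noise levels is the kind of measurement summarized in Figure 1, although a random toy network will not reproduce the trained behavior.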

Proposed Methodology: Bias-Free Networks

Section 3 shows that CNNs overfit to the noise levels present in the training set, and that this is associated with wild fluctuations of the net bias $b_{y}$ . This suggests that the overfitting might be ameliorated by removing additive (bias) terms from every stage of the network, resulting in a bias-free CNN (BF-CNN). Note that bias terms are also removed from the batch-normalization used during training. This simple change in the architecture has an interesting consequence. If the CNN has ReLU activations the denoising map is locally homogeneous, and consequently invariant to scaling: rescaling the input by a constant value simply rescales the output by the same amount, just as it would for a linear system.

We can write the action of a bias-free neural network with $L$ layers in terms of the weight matrix $W_{i}$ , $1\leq i\leq L$ , of each layer and a rectifying operator $\mathcal{R}$ , which sets to zero any negative entries in its input. Multiplying by a nonnegative constant does not change the sign of the entries of a vector, so for any $z$ with the right dimension and any $\alpha>0$ , $\mathcal{R}(\alpha z)=\alpha\mathcal{R}(z)$ , which implies

$$f_{\operatorname{BF}}(\alpha y)=W_{L}\mathcal{R}(W_{L-1}\cdots\mathcal{R}(W_{1}\alpha y))=\alpha W_{L}\mathcal{R}(W_{L-1}\cdots\mathcal{R}(W_{1}y))=\alpha f_{\operatorname{BF}}(y).$$

Note that networks with nonzero net bias are not scaling invariant because scaling the input may change the activation pattern of the ReLUs. Scaling invariance is intuitively desirable for a denoising method operating on natural images; a rescaled image is still an image. Note that this scaling invariance (Lemma 1) also holds for networks with skip connections where the feature maps are concatenated or added, because both of these operations are linear.
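The scaling property is easy to verify numerically. The sketch below assumes a three-layer fully connected ReLU network with random weights; `f_bf`, `f_bias`, the layer widths and the value of `alpha` are illustrative choices, not the architectures studied here.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((32, 10))
W2 = rng.standard_normal((32, 32))
W3 = rng.standard_normal((10, 32))
b1 = rng.standard_normal(32)
b2 = rng.standard_normal(32)

def relu(z):
    return np.maximum(z, 0.0)

def f_bf(y):
    """Bias-free network: only weight matrices and ReLUs."""
    return W3 @ relu(W2 @ relu(W1 @ y))

def f_bias(y):
    """Same weights, with additive constants in the hidden layers."""
    return W3 @ relu(W2 @ relu(W1 @ y + b1) + b2)

y = rng.standard_normal(10)
alpha = 3.7

# Largest deviation from f(alpha * y) == alpha * f(y).
scaling_gap_bf = np.max(np.abs(f_bf(alpha * y) - alpha * f_bf(y)))
scaling_gap_bias = np.max(np.abs(f_bias(alpha * y) - alpha * f_bias(y)))
```

For the bias-free network the gap is at the level of floating-point round-off, whereas the network with biases generally violates the identity because rescaling the input changes which ReLUs are active.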

In the following sections we demonstrate that removing all additive terms in CNN architectures has two important consequences: (1) the networks gain the ability to generalize to noise levels not encountered during training (as illustrated by Figure 2, the improvement is striking), and (2) the denoising mechanism can be analyzed locally via linear-algebraic tools that reveal intriguing ties to more traditional denoising methodology such as nonlinear filtering and sparsity-based techniques.

Bias-Free Networks Generalize Across Noise Levels

In order to evaluate the effect of removing the net bias in denoising CNNs, we compare several state-of-the-art architectures to their bias-free counterparts, which are exactly the same except for the absence of any additive constants within the networks (note that this includes the batch-normalization additive parameter). These architectures include popular features of existing neural-network techniques in image processing: recurrence, multiscale filters, and skip connections. More specifically, we examine the following models (see Section A for additional details):

DnCNN (Zhang et al., 2017): A feedforward CNN with $20$ convolutional layers, each consisting of $3\times 3$ filters, $64$ channels, batch normalization (Ioffe & Szegedy, 2015), a ReLU nonlinearity, and a skip connection from the initial layer to the final layer.

Recurrent CNN: A recurrent architecture inspired by Zhang et al. (2018a) where the basic module is a CNN with 5 layers, $3\times 3$ filters and $64$ channels in the intermediate layers. The order of the recurrence is 4.

UNet (Ronneberger et al., 2015): A multiscale architecture with 9 convolutional layers and skip connections between the different scales.

Simplified DenseNet: CNN with skip connections inspired by the DenseNet architecture (Huang et al., 2017; Zhang et al., 2018b).

We train each network to denoise images corrupted by i.i.d. Gaussian noise over a range of standard deviations (the training range of the network). We then evaluate the network for noise levels that are both within and beyond the training range. Our experiments are carried out on $180\times 180$ natural images from the Berkeley Segmentation Dataset (Martin et al., 2001) to be consistent with previous results (Schmidt & Roth, 2014; Chen & Pock, 2017; Zhang et al., 2017). Additional details about the dataset and training procedure are provided in Section B.
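The evaluation protocol can be sketched as follows. This is only an illustration under stated assumptions: the noise level is drawn uniformly from the given range for each image, pixel values lie in $[0,255]$, and the ranges `(0, 10)` and `(50, 100)` are hypothetical examples rather than the ranges used in the experiments. The helper names `add_gaussian_noise` and `psnr` are also illustrative.

```python
import numpy as np

def add_gaussian_noise(clean, sigma_range, rng):
    """Corrupt `clean` with i.i.d. Gaussian noise of a randomly drawn level."""
    low, high = sigma_range
    sigma = rng.uniform(low, high)  # assumption: uniform sampling of sigma
    noisy = clean + sigma * rng.standard_normal(clean.shape)
    return noisy, sigma

def psnr(clean, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB for images with maximum value `peak`."""
    mse = np.mean((clean - estimate) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 255.0, size=(180, 180))  # stand-in for a real image

# Hypothetical training range, and a hypothetical range beyond it.
noisy_in, sigma_in = add_gaussian_noise(clean, (0.0, 10.0), rng)
noisy_out, sigma_out = add_gaussian_noise(clean, (50.0, 100.0), rng)
```

A trained denoiser would be applied to `noisy_in` and `noisy_out`, and `psnr(clean, denoised)` compared against the input value `psnr(clean, noisy)` at each noise level.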

Figures 3, 11 and 12 show our results. For a wide range of different training ranges, and for all architectures, we observe the same phenomenon: the performance of CNNs is good over the training range, but degrades dramatically at new noise levels; in stark contrast, the corresponding BF-CNNs provide strong denoising performance over noise levels outside the training range. This holds for both PSNR and the more perceptually-meaningful Structural Similarity Index (Wang et al., 2004) (see Figure 12). Figure 2 shows an example image, demonstrating visually the striking difference in generalization performance between a CNN and its corresponding BF-CNN. Our results provide strong evidence that removing net bias in CNN architectures results in effective generalization to noise levels out of the training range.

Revealing the Denoising Mechanisms Learned by BF-CNNs

In this section we perform a local analysis of BF-CNN networks, which reveals the underlying denoising mechanisms learned from the data. A bias-free network is locally linear, and its net action can be expressed as

$$f_{\operatorname{BF}}(y)=W_{L}\mathcal{R}(W_{L-1}\cdots\mathcal{R}(W_{1}y))=A_{y}y,\tag{4}$$

where $A_{y}$ is the Jacobian of $f_{\operatorname{BF}}(\cdot)$ evaluated at $y$ . The Jacobian at a fixed input provides a local characterization of the denoising map. In order to study the map we perform a linear-algebraic analysis of the Jacobian. Our approach is similar in spirit to visualization approaches, proposed in the context of image classification, that differentiate neural-network functions with respect to their input (e.g. Simonyan et al. (2013); Montavon et al. (2017)).

The linear representation of the denoising map given by equation 4 implies that the $i$ th pixel of the output image is computed as an inner product between the $i$ th row of $A_{y}$ , denoted $a_{y}(i)$ , and the input image:

$$f_{\operatorname{BF}}(y)(i)=\sum_{j=1}^{N}A_{y}(i,j)\,y(j)=a_{y}(i)^{T}y.$$

The vectors $a_{y}(i)$ can be interpreted as adaptive filters that produce an estimate of the denoised pixel via a weighted average of noisy pixels. Examination of these filters reveals their diversity, and their relationship to the underlying image content: they are adapted to the local features of the noisy image, averaging over homogeneous regions of the image without blurring across edges. This is shown for two separate examples and a range of noise levels in Figures 4, 13, 14 and 15 for the architectures described in Section 5. We observe that the equivalent filters of all architectures adapt to image structure.
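To make the notion of an equivalent filter concrete, the sketch below builds the Jacobian of a toy bias-free network and reads off one of its rows. It assumes a three-layer 1-D convolutional network with random, untrained kernels, so the resulting filters are not expected to adapt to image structure; the names `masks_at`, `f_bf` and `jacobian` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
kernels = [rng.standard_normal(5) for _ in range(3)]  # random 1-D filters

def conv(x, k):
    return np.convolve(x, k, mode="same")

def masks_at(y):
    """ReLU activation pattern of the bias-free network at input y."""
    masks, h = [], y
    for k in kernels[:-1]:
        pre = conv(h, k)
        masks.append(pre > 0)
        h = np.maximum(pre, 0.0)
    return masks

def f_bf(y):
    """Bias-free network: convolutions and ReLUs, linear final layer."""
    h = y
    for k in kernels[:-1]:
        h = np.maximum(conv(h, k), 0.0)
    return conv(h, kernels[-1])

def jacobian(y):
    """A_y: the network with ReLUs frozen at y, applied to each basis vector."""
    masks = masks_at(y)

    def frozen(v):
        h = v
        for k, m in zip(kernels[:-1], masks):
            h = conv(h, k) * m
        return conv(h, kernels[-1])

    return np.stack([frozen(e) for e in np.eye(len(y))], axis=1)

y = rng.standard_normal(32)
A_y = jacobian(y)
i = 10
a_i = A_y[i]  # equivalent filter producing output sample i
```

In practice the Jacobian of a trained network would more likely be obtained with automatic differentiation; freezing the activation pattern is used here only because it keeps the example dependency-free.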

Classical Wiener filtering (Wiener, 1950) denoises images by computing a local average dependent on the noise level. As the noise level increases, the averaging is carried out over a larger region. As illustrated by Figures 4, 13, 14 and 15, the equivalent filters of BF-CNNs also display this behavior. The crucial difference is that the filters are adaptive. The BF-CNNs learn such filters implicitly from the data, in the spirit of modern nonlinear spatially-varying filtering techniques designed to preserve fine-scale details such as edges (e.g. Tomasi & Manduchi (1998), see also Milanfar (2012) for a comprehensive review, and Choi et al. (2018) for a recent learning-based approach).

6.2 Projection onto adaptive low-dimensional subspaces

The local linear structure of a BF-CNN facilitates analysis of its functional capabilities via the singular value decomposition (SVD). For a given input $y$ , we compute the SVD of the Jacobian matrix: $A_{y}=USV^{T}$ , with $U$ and $V$ orthogonal matrices, and $S$ a diagonal matrix. We can decompose the effect of the network on its input in terms of the left singular vectors $\{U_{1},U_{2},\ldots,U_{N}\}$ (columns of $U$ ), the singular values $\{s_{1},s_{2},\ldots,s_{N}\}$ (diagonal elements of $S$ ), and the right singular vectors $\{V_{1},V_{2},\ldots,V_{N}\}$ (columns of $V$ ):

$$f_{\operatorname{BF}}(y)=A_{y}y=USV^{T}y=\sum_{i=1}^{N}s_{i}(V_{i}^{T}y)U_{i}.$$

The output is a linear combination of the left singular vectors, each weighted by the projection of the input onto the corresponding right singular vector, and scaled by the corresponding singular value.
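This decomposition can be reproduced in a few lines. The sketch assumes a random matrix as a stand-in for the Jacobian $A_{y}$ of a trained network; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 16
A_y = rng.standard_normal((N, N))  # stand-in for the Jacobian at some input y
y = rng.standard_normal(N)

# np.linalg.svd returns U, the singular values, and V^T (not V).
U, s, Vt = np.linalg.svd(A_y)

coeffs = Vt @ y  # projections V_i^T y onto the right singular vectors
output = sum(s[i] * coeffs[i] * U[:, i] for i in range(N))
```

Truncating the sum to the terms with non-negligible `s[i]` gives the low-dimensional approximation of the network output discussed in the text.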

Analyzing the SVD of a BF-CNN on a set of ten natural images reveals that most singular values are very close to zero (Figure 5(a)). The network is thus discarding all but a very low-dimensional portion of the input image. We also observe that the left and right singular vectors corresponding to the singular values with non-negligible amplitudes are approximately the same (Figure 5(b)). This means that the Jacobian is (approximately) symmetric, and we can interpret the action of the network as projecting the noisy signal onto a low-dimensional subspace, as is done in wavelet thresholding schemes. This is confirmed by visualizing the singular vectors as images (Figure 6). The singular vectors corresponding to non-negligible singular values are seen to capture features of the input image; those corresponding to near-zero singular values are unstructured. The BF-CNN therefore implements an approximate projection onto an adaptive signal subspace that preserves image structure, while suppressing the noise.

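We can define an "effective dimensionality" of the signal subspace as $d:=\sum_{i=1}^{N}s_{i}^{2}$ , the amount of variance captured by applying the linear map to an $N$ -dimensional Gaussian noise vector with variance $\sigma^{2}$ , normalized by the noise variance. The remaining variance equals

$$E_{n}||A_{y}n||^{2}=E_{n}||USV^{T}n||^{2}=E_{n}||Sn||^{2}=\sum_{i=1}^{N}s_{i}^{2}E_{n}(n_{i}^{2})=\sigma^{2}\sum_{i=1}^{N}s_{i}^{2},$$

where the second and third steps use the orthogonality of $U$ and $V$ (so that $V^{T}n$ has the same distribution as $n$ ), and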

where $E_{n}$ indicates expectation over noise $n$ , so that $d=E_{n}||A_{y}n||^{2}/\sigma^{2}=\sum_{i=1}^{N}s_{i}^{2}$ .
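The identity between the sum of squared singular values and the normalized noise variance can be checked by simulation. The sketch assumes an idealized Jacobian that is an exact orthogonal projection onto a random 5-dimensional subspace, which is an illustrative simplification of the approximate projection described above.

```python
import numpy as np

rng = np.random.default_rng(4)
N, sigma = 64, 0.5

# Idealized stand-in for A_y: projection onto a random 5-dimensional subspace.
Q, _ = np.linalg.qr(rng.standard_normal((N, 5)))
A_y = Q @ Q.T

# Effective dimensionality from the singular values.
s = np.linalg.svd(A_y, compute_uv=False)
d = np.sum(s ** 2)

# Monte Carlo estimate of E||A_y n||^2 / sigma^2 over Gaussian noise samples.
noise = sigma * rng.standard_normal((20000, N))
d_mc = np.mean(np.sum((noise @ A_y.T) ** 2, axis=1)) / sigma ** 2
```

For an exact projection, `d` coincides with the subspace dimension, and `d_mc` agrees with it up to sampling error.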

When we examine the preserved signal subspace, we find that the clean image lies almost completely within it. For inputs of the form $y:=x+n$ (where $x$ is the clean image and $n$ the noise), we find that the subspace spanned by the singular vectors up to dimension $d$ contains $x$ almost entirely, in the sense that projecting $x$ onto the subspace preserves most of its energy. This holds for the whole range of noise levels over which the network is trained (Figure 7).

We also find that for any given clean image, the effective dimensionality of the signal subspace ( $d$ ) decreases systematically with noise level (Figure 5(c)). At lower noise levels the network detects a richer set of image features, and constructs a larger signal subspace to capture and preserve them. Empirically, we found that (on average) $d$ is approximately proportional to $\frac{1}{\sigma}$ (see dashed line in Figure 5(c)). These signal subspaces are nested: the subspaces corresponding to lower noise levels contain more than 95% of the subspace axes corresponding to higher noise levels (Figure 7).

Finally, we note that this behavior of the signal subspace dimensionality, combined with the fact that it contains the clean image, explains the observed denoising performance across different noise levels (Figure 3). Specifically, if we assume $d\approx\alpha/\sigma$ , the mean squared error is proportional to $\sigma$ :

$$\operatorname{MSE}=E_{n}||A_{y}(x+n)-x||^{2}\approx E_{n}||A_{y}n||^{2}=\sigma^{2}d\approx\alpha\sigma.$$

Note that this result runs contrary to the intuitive expectation that MSE should be proportional to the noise variance, which would be the case if the denoiser operated by projecting onto a fixed subspace. The scaling of MSE with the square root of the noise variance implies that the PSNR of the denoised image should be a linear function of the input PSNR, with a slope of $1/2$ , consistent with the empirical results shown in Figure 3. Note that this behavior holds even when the networks are trained only on modest levels of noise (e.g., $\sigma\in$ ).
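The slope of $1/2$ follows from a short calculation. As a sketch, write $M$ for the peak pixel value and $N$ for the number of pixels (both symbols are introduced here for illustration), and take the MSE to be the summed squared error, as in the text, so that the per-pixel error is $\operatorname{MSE}/N$:

```latex
\begin{align*}
\mathrm{PSNR}_{\mathrm{in}} &= 10\log_{10}\frac{M^{2}}{\sigma^{2}}, \\
\mathrm{PSNR}_{\mathrm{out}} &= 10\log_{10}\frac{M^{2}N}{\mathrm{MSE}}
  \approx 10\log_{10}\frac{M^{2}N}{\alpha\sigma} \\
 &= \tfrac{1}{2}\,\mathrm{PSNR}_{\mathrm{in}} + 10\log_{10}\frac{MN}{\alpha}.
\end{align*}
```

By contrast, a projection onto a fixed subspace of dimension $d_{0}$ would give $\operatorname{MSE}\approx\sigma^{2}d_{0}$ and hence an output PSNR with slope $1$ in the input PSNR.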

Discussion

In this work, we show that removing constant terms from CNN architectures ensures strong generalization across noise levels, and also provides interpretability of the denoising method via linear-algebra techniques. We provide insights into the relationship between bias and generalization through a set of observations. Theoretically, we argue that if the denoising network operates by projecting the noisy observation onto a linear space of “clean” images, then that space should include all rescalings of those images, and thus, the origin. This property can be guaranteed by eliminating bias from the network. Empirically, in networks that allow bias, the net bias of the trained network is quite small within the training range. However, outside the training range the net bias grows dramatically, resulting in poor performance, which suggests that the bias may be the cause of the failure to generalize. In addition, when we remove bias from the architecture, we preserve performance within the training range, but achieve near-perfect generalization, even to noise levels more than $10\times$ those in the training range. These observations do not fully elucidate how our network achieves its remarkable generalization; they show only that bias prevents that generalization, and that its removal allows it.

It is of interest to examine whether bias removal can facilitate generalization in noise distributions beyond Gaussian, as well as other image-processing tasks, such as image restoration and image compression. We have trained bias-free networks on uniform noise and found that they generalize outside the training range. In fact, bias-free networks trained for Gaussian noise generalize well when tested on uniform noise (Figures 18 and 19). In addition, we have applied our methodology to image restoration (simultaneous deblurring and denoising). Preliminary results indicate that bias-free networks generalize across noise levels for a fixed blur level, whereas networks with bias do not (Figure 20). An interesting question for future research is whether it is possible to achieve generalization across blur levels. Our initial results indicate that removing bias is not sufficient to achieve this.

Finally, our linear-algebraic analysis uncovers interesting aspects of the denoising map, but these interpretations are very local: small changes in the input image change the activation patterns of the network, resulting in a change in the corresponding linear mapping. Extending the analysis to reveal global characteristics of the neural-network functionality is a challenging direction for future research.

References

Appendix A Description of denoising architectures

In this section we describe the denoising architectures used for our computational experiments in more detail.

A.1 DnCNN

We implement BF-DnCNN based on the architecture of the Denoising CNN (DnCNN) (Zhang et al., 2017). DnCNN consists of $20$ convolutional layers, each consisting of $3\times 3$ filters and $64$ channels, batch normalization (Ioffe & Szegedy, 2015), and a ReLU nonlinearity. It has a skip connection from the initial layer to the final layer, which has no nonlinear units. To construct a bias-free DnCNN (BF-DnCNN) we remove all sources of additive bias, including the mean parameter of the batch-normalization in every layer (note however that the scaling parameter is preserved).
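One way to realize a batch-normalization layer without additive terms is sketched below. This is an assumption about how such a layer could look at inference time, not a description of the exact implementation: the function `bf_batchnorm_inference`, the use of a stored running standard deviation, and the absence of mean subtraction are all illustrative choices.

```python
import numpy as np

def bf_batchnorm_inference(x, running_std, gamma, eps=1e-5):
    """Scale-only normalization: no mean subtraction and no additive shift.

    x:            activations with shape (batch, channels, height, width)
    running_std:  per-channel standard deviation stored during training
    gamma:        per-channel learned scaling parameter (preserved)
    """
    scale = gamma / (running_std + eps)
    return x * scale[None, :, None, None]

rng = np.random.default_rng(5)
x = rng.standard_normal((2, 4, 8, 8))
running_std = rng.uniform(0.5, 2.0, size=4)
gamma = rng.standard_normal(4)
out = bf_batchnorm_inference(x, running_std, gamma)
```

Because the layer is a fixed per-channel rescaling at inference time, it maps zero to zero and commutes with rescaling of its input, so it does not break the scaling invariance of the network.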

A.2 Recurrent CNN

Inspired by Zhang et al. (2018a), we consider a recurrent framework that produces a denoised image estimate of the form $\hat{x}_{t}=f(\hat{x}_{t-1},y_{\text{noisy}})$ at time $t$ , where $f$ is a neural network. We use a 5-layer fully convolutional network with $3\times 3$ filters in all layers and $64$ channels in each intermediate layer to implement $f$ . We initialize the denoised estimate as the noisy image, i.e., $\hat{x}_{0}:=y_{\text{noisy}}$ . For the version of the network with net bias, we add trainable additive constants to every filter in all but the last layer. During training, we run the recurrence for a maximum of $T$ times, sampling $T$ uniformly at random from $\{1,2,3,4\}$ for each mini-batch. At test time we fix $T=4$ .

A.3 UNet

Our UNet model (Ronneberger et al., 2015) has the following layers:

conv1 - Takes in input image and maps to $32$ channels with $5\times 5$ convolutional kernels.

conv2 - Input: $32$ channels. Output: $32$ channels. $3\times 3$ convolutional kernels.

conv3 - Input: $32$ channels. Output: $64$ channels. $3\times 3$ convolutional kernels with stride 2.

conv4 - Input: $64$ channels. Output: $64$ channels. $3\times 3$ convolutional kernels.

conv5 - Input: $64$ channels. Output: $64$ channels. $3\times 3$ convolutional kernels with dilation factor of 2.

conv6 - Input: $64$ channels. Output: $64$ channels. $3\times 3$ convolutional kernels with dilation factor of 4.

conv7 - Transpose convolution layer. Input: $64$ channels. Output: $64$ channels. $4\times 4$ filters with stride $2$ .

conv8 - Input: $96$ channels. Output: $32$ channels. $3\times 3$ convolutional kernels. The input to this layer is the concatenation of the outputs of layers conv7 and conv2.

conv9 - Input: $32$ channels. Output: $1$ channel. $5\times 5$ convolutional kernels.

The structure is the same as in Zhang et al. (2018a), but without recurrence. For the version with bias, we add trainable additive constants to all the layers other than conv9. This configuration of UNet assumes even width and height, so we remove one row or column from images with odd height or width.

A.4 Simplified DenseNet

Our simplified version of the DenseNet architecture (Huang et al., 2017) has $4$ blocks in total. Each block is a fully convolutional $5$ -layer CNN with $3\times 3$ filters and $64$ channels in the intermediate layers with ReLU nonlinearity. The first three blocks have an output layer with $64$ channels while the last block has an output layer with only one channel. The output of the $i^{th}$ block is concatenated with the input noisy image and then fed to the $(i+1)^{th}$ block, so the last three blocks have $65$ input channels. In the version of the network with bias, we add trainable additive parameters to all the layers except for the last layer in the final block.

Appendix B Datasets and training procedure

Our experiments are carried out on $180\times 180$ natural images from the Berkeley Segmentation Dataset (Martin et al., 2001). We use a training set of $400$ images. The training set is augmented via downsampling, random flips, and random rotations of patches in these images (Zhang et al., 2017). A test set containing $68$ images is used for evaluation. We train DnCNN and its bias-free counterpart on patches of size $50\times 50$ , which yields a total of 541,600 clean training patches. For the remaining architectures, we use patches of size $128\times 128$ for a total of 22,400 training patches.

We train DnCNN and its bias-free counterpart using the Adam Optimizer (Kingma & Ba, 2014) over $70$ epochs with an initial learning rate of $10^{-3}$ and a decay factor of $0.5$ at the $50^{th}$ and $60^{th}$ epochs, with no early stopping. We train the other models using the Adam optimizer with an initial learning rate of $10^{-3}$ and train for $50$ epochs with a learning rate schedule which decreases by a factor of $0.25$ if the validation PSNR decreases from one epoch to the next. We use early stopping and select the model with the best validation PSNR.
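The step schedule used for DnCNN can be written as a small function. This is a sketch under one assumption: epochs are counted from 1 and each decay takes effect at the start of the stated epoch. The function name `dncnn_learning_rate` is illustrative.

```python
def dncnn_learning_rate(epoch, base_lr=1e-3, decay=0.5, milestones=(50, 60)):
    """Learning rate for a 1-indexed epoch with multiplicative step decay."""
    num_decays = sum(epoch >= m for m in milestones)
    return base_lr * decay ** num_decays

# Learning rate for each of the 70 training epochs.
schedule = [dncnn_learning_rate(epoch) for epoch in range(1, 71)]
```

The schedule for the other models depends on the validation PSNR and therefore cannot be tabulated in advance in the same way.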

Appendix C Additional results

In this section we report additional results of our computational experiments:

Figure 8 shows the first-order analysis of the residual of the different architectures described in Section A, except for DnCNN which is shown in Figure 1.

Figures 9 and 10 visualize the linear and net bias terms in the first-order decomposition of an example image at different noise levels.

Figure 11 shows the PSNR results for the experiments described in Section 5.

Figure 12 shows the SSIM results for the experiments described in Section 5.

Figures 13, 14 and 15 show the equivalent filters at several pixels of two example images for different architectures (see Section 6.1).

Figure 16 shows the singular vectors of the Jacobian of different BF-CNNs (see Section 6.2).

Figure 17 shows the singular values of the Jacobian of different BF-CNNs (see Section 6.2).

Figures 18 and 19 show that networks trained on noise samples drawn from a Gaussian distribution with mean generalize to noise drawn from a uniform distribution with mean at test time. Experiments follow the procedure described in Section 5, except that the networks are evaluated on a different noise distribution at test time.

Figure 20 shows the application of BF-CNN and CNN to the task of image restoration, where the image is corrupted with both noise and blur at the same time. We show that BF-CNNs can generalize outside the training range for noise levels for a fixed blur level, but do not outperform CNN when generalizing to unseen blur levels.