Deep Decoder: Concise Image Representations from Untrained Non-convolutional Networks

Reinhard Heckel, Paul Hand

Introduction

Data models are central for signal and image processing and play a key role in compression and inverse problems such as denoising, super-resolution, and compressive sensing. These data models impose structural assumptions on the signal or image, which are traditionally based on expert knowledge. For example, imposing the assumption that an image can be represented with few non-zero wavelet coefficients enables modern (lossy) image compression [Ant+92] and efficient denoising [Don95].

In recent years, it has been demonstrated that for a wide range of imaging problems, from compression to denoising, deep neural networks trained on large datasets can often outperform methods based on traditional image models [Tod+16, Agu+17, The+17, Bur+12, Zha+17]. This success can largely be attributed to the ability of deep networks to represent realistic images when trained on large datasets. Examples include learned representations via autoencoders [HS06] and generative adversarial models [Goo+14]. Almost exclusively, three common features of the recent success stories of using deep neural network for imaging related tasks are i) that the corresponding networks are over-parameterized (i.e., they have much more parameters than the dimension of the image that they represent or generate), ii) that the networks have a convolutional structure, and perhaps most importantly, iii) that the networks are trained on large datasets.

An important exception that breaks with the latter feature is a recent work by Ulyanov et al. [Uly+18], which provides an algorithm, called the deep image prior (DIP), based on deep neural networks, that can solve inverse problems well without any training. Specifically, Ulyanov et al. demonstrated that fitting the weights of an over-parameterized deep convolutional network to a single image, together with strong regularization by early stopping of the optimization, performs competitively on a variety of image restoration problems. This result is surprising because it does not involve a training dataset, which means that the notion of what makes an image ‘natural’ is contained in a combination of the network structure and the regularization. However, without regularization the proposed network has sufficient capacity to overfit to noise, preventing meaningful image denoising.

These prior works demonstrating the effectiveness of deep neural networks for image generation beg the question whether there may be a deep neural network model of natural images that is underparameterized and whose architecture alone, without algorithmic assistance, forms an efficient model for natural images.

In this paper, we propose a simple image model in the form of a deep neural network that can represent natural images well while using very few parameters. This model thus enables image compression, denoising, and solving a variety of inverse problems with close to or state of the art performance. We call the network the deep decoder, due to its resemblance to the decoder part of an autoencoder. The network does not require training, and contrary to previous approaches, the network itself incorporates all assumptions on the data, is under-parameterized, does not involve convolutions, and has a simplicity that makes it amenable to theoretical analysis. The key contributions of this paper are as follows:

The network is under-parameterized. Thus, the network maps a lower-dimensional space to a higher-dimensional space, similar to classical image representations such as sparse wavelet representations. This feature enables image compression by storing the coefficients of the network after its weights are optimized to fit a single image. In Section 2, we demonstrate that the compression is on-par with wavelet thresholding [Ant+92], a strong baseline that underlies JPEG-2000. An additional benefit of underparameterization is that it provides a barrier to overfitting, which enables regularization of inverse problems.

The network itself acts as a natural data model. Not only does the network require no training (just as the DIP [Uly+18]); it also does not critically rely on regularization, for example by early stopping (in contrast to the DIP). The property of not involving learning has at least two benefits: The same network and code is usable for a number of applications, and the method is not sensitive to a potential misfit of training and test data.

The network does not use convolutions. Instead, the network does have pixelwise linear combinations of channels, and, just like in a convolutional neural network, the weights are shared among spatial positions. Nonetheless, these are not convolutions because they provide no spatial coupling between pixels, despite how pixelwise linear combinations are sometimes called ‘1x1 convolutions.’ In contrast, the majority of the networks for image compression, restoration, and recovery have convolutional layers with filters of nontrivial spatial extent [Tod+16, Agu+17, The+17, Bur+12, Zha+17]. This work shows that relationships characteristic of nearby pixels of natural images can be imposed directly by upsampling layers.

The network only consists of a simple combination of few building blocks, which makes it amenable to analysis and theory. For example, we prove that the deep decoder can only fit a small proportion of noise, which, combined with the empirical observation that it can represent natural images well, explains its denoising performance.

The remainder of the paper is organized as follows. In Section 2, we first demonstrate that the deep decoder enables concise image representations. We formally introduce the deep decoder in Section 3. In Section 4, we show the performance of the deep decoder on a number of inverse problems such as denoising. In Section 5 we discuss related work, and finally, in Section 6 we provide theory and explanations on what makes the deep decoder work.

Concise image representations with a deep image model

Intuitively, a model describes a class of signals well if it is able to represent or approximate a member of the class with few parameters. In this section, we demonstrate that the deep decoder, an untrained, non-convolutional neural network, defined in the next section, enables concise representation of an image—on par with state of the art wavelet thresholding.

We draw 100 images from the ImageNet validation set uniformly at random and crop the center to obtain a 512x512 pixel color image. For each image $\mathbf{x}^{\ast}$ , we fit a deep decoder model $G(\mathbf{C})$ by minimizing the loss

We compare the compression performance to wavelet compression [Ant+92] by representing each image with the $N$ -largest wavelet coefficients. Wavelets—which underly JPEG 2000, a standard for image compression—are one of the best methods to approximate images with few coefficients. In Fig. 1 we depict the results. It can be seen that for large compression factors ( $3\cdot 512^{2}/N=32.3$ ), the representation by the deep decoder is slightly better for most images (i.e., is above the red line), while for smalle compression factors ( $3\cdot 512^{2}/N=8$ ), the wavelet representation is slightly better. This experiment shows that deep neural networks can represent natural images well with very few parameters and without any learning.

The observation that, for small compression factors, wavelets enable more concise representations than the deep decoder is intuitive because any image can be represented exactly with sufficiently many wavelet coefficients. In contrast, there is no reason to believe a priori that the deep decoder has zero representation error because it is underparameterized.

The main point of this experiment is to demonstrate that the deep decoder is a good image model, which enables applications like solving inverse problems, as in Section 4. However, it also suggest that the deep decoder can be used for lossy image compression, by quantizing the coefficients $\mathbf{C}$ and saving the quantized coefficients. In the appendix, we show that image representations of the deep decoder are not sensitive to perturbations of its coefficients, thus quantization does not have a detrimental effect on the image quality. Deep networks were used successfully before for the compression of images [Tod+16, Agu+17, The+17]. In contrast to our work, which is capable of compressing images without any learning, the aforementioned works learn an encoder and decoder using convolutional recurrent neural networks [Tod+16] and convolutional autoencoders [The+17] based on training data.

The deep decoder

We finally note that naturally variations of the deep decoder are possible; for example in a previous version of this manuscript, we applied upsampling after applying the relu-nonlinearity, but found that applying it before yields slightly better results.

While the deep decoder does not use convolutions, its structure is closely related to that of a convolutional neural network. Specifically, the network does have pixelwise linear combinations of channels, and just like in a convolutional neural network, the weights are shared among spatial positions. Nonetheless, pixelwise linear combinations are not proper convolutions because they provide no spatial coupling of pixels, despite how they are sometimes called $1\times 1$ convolutions. In the deep decoder, the source of spatial coupling is only from upsampling operations.

In contrast, a large number of networks for image compression, restoration, and recovery have convolutional layers with filters of nontrivial spatial extent [Tod+16, Agu+17, The+17, Bur+12, Zha+17]. Thus, it is natural to ask whether using linear combinations as we do, instead of actual convolutions yields better results.

Our simulations indicate that, indeed, linear combinations yield more concise representations of natural images than $p\times p$ convolutions, albeit not by a huge factor. Recall that the number of parameters of the deep decoder with $d$ layers, $k$ channels at each layer, and $1\times 1$ convolutions is $N(d,k;1)=dk^{2}+3k+2dk.$ If we consider a deep decoder with convolutional layers with filters of size $p\times p$ , then the number of parameters is: $N(d,k;p)=p^{2}(dk^{2}+3k)+2dk.$ If we fix the number of channels, $k$ , but increase $p$ to 3, the representation error only decreases since we increase the number of parameters (by a factor of approximately $3^{2}$ ). We consider image reconstruction as described in Section 2. For a meaningful comparison, we keep the number of parameters fixed, and compare the representation error of a deep decoder with $p=1$ and $k=64$ (the default architecture in our paper) to a variant of the deep decoder with $p=3$ and $k=22$ , so that the number of parameters is essentially the same in both configurations. We find that the representation of the deep decoder with $p=1$ is better (by about 1dB, depending on the image), and thus for concise image representations, linear combinations ( $1\times 1$ convolutions) appear to be more effective than convolutions of larger spatial extent.

Solving inverse problems with the deep decoder

In this section, we use the deep decoder as a structure-enforcing model or regularizers for solving standard inverse problems: denoising, super-resolution, and inpainting. In all of those inverse problems, the goal is to recover an image $\mathbf{x}$ from a noisy observation $\mathbf{y}=f(\mathbf{x})+\eta$ . Here, $f$ is a known forward operator (possibly equal to identity), and $\eta$ is structured or unstructured noise.

We recover the image $\mathbf{x}$ with the deep decoder as follows. Motivated by the finding from the previous section that a natural image $\mathbf{x}$ can (approximately) be represented with the deep decoder as $G(\mathbf{C})$ , we estimate the unknown image from the noisy observation $\mathbf{y}$ by minimizing the loss

with respect to the model parameters $\mathbf{C}$ . Let $\hat{\mathbf{C}}$ be the result of the optimization procedure. We estimate the image as $\hat{\mathbf{x}}=G(\hat{\mathbf{C}})$ .

We start with the perhaps most basic inverse problem, denoising. The motivation to study denoising is at least threefold: First, denoising is an important problem in practice, second, many inverse problem can be solved as a chain of denoising steps [Rom+17], and third, the denoising problem is simple to model mathematically, and thus a common entry point for gaining intuition on a new method. Given a noisy observation $\mathbf{y}=\mathbf{x}+\eta$ , where $\eta$ is additive noise, we estimate an image with the deep decoder by minimizing the least squares loss ${\left\|G(\mathbf{C})-\mathbf{y}\right\|}_{2}^{2}$ , as described above.

The results in Fig. 2 and Table 1 demonstrate that the deep decoder has denoising performance on-par with state of the art untrained denoising methods, such as the related Deep Image Prior (DIP) method [Uly+18] (discussed in more detail later) and the BM3D algorithm [Dab+07]. Since the deep decoder is an untrained method, we only compared to other state-of-the-art untrained methods (as opposed to learned methods such as [Zha+17]).

Why does the deep decoder denoise well? In a nutshell, from Section 2 we know that the deep decoder can represent natural images well even when highly underparametrized. In addition, as a consequence of being under-parameterized, the deep decoder can only represent a small proportion of the noise, as we show analytically in Section 6, and as demonstrated experimentally in Fig. 4. Thus, the deep decoder “filters out” a significant proportion of the noise, and retains most of the signal.

How to choose the parameters of the deep decoder? The larger $k$ , the larger the number of latent parameters and thus the smaller the representation error, i.e., the error that the deep decoder makes when representing a noise-free image. On the other hand, the smaller $k$ , the fewer parameters, and the smaller the range space of the deep decoder $G(\mathbf{C})$ , and thus the more noise the method will remove. The optimal $k$ trades off those two errors; larger noise levels require smaller values of $k$ (or some other form of regularization). If the noise is significantly larger, then the method requires either choosing $k$ smaller, or it requires another means of regularization, for example early stopping of the optimization. For example $k=64$ or $128$ performs best out of $\{32,64,128\}$ , for a PSNR of around 20dB, while for a PSNR of about 14dB, $k=32$ performs best.

2 Superresolution

We next super-resolve images with the deep denoiser. We define a forward model $f$ that performs downsampling with the Lanczos filter by a factor of four. We then downsample a given image by a factor of four, and then reconstruct it with the deep decoder (with $k=128$ , as before). We compare performance to bi-cubic interpolation and to the deep image prior, and find that the deep decoder outperforms bicubic interpolation, and is on-par with the deep image prior (see Table 1 in the appendix).

3 Inpainting

Finally, we use the deep decoder for inpainting, where we are given an inpainted image $\mathbf{y}$ , and a forward model $f$ mapping a clean image to an inpainted image. The forward model $f$ is defined by a mask that describes the inpainted region, and simply maps that part of the image to zero. Fig. 3 and Table 1 demonstrate that the deep decoder performs well on the inpainting problems; however, the deep image prior performs slightly better on average over the examples considered. For the impainting problem we choose a significantly more expressive prior, specifically $k=320$ .

Related work

Image compression, restoration, and recovery algorithms are either trained or untrained. Conceptually, the deep decoder image model is most related to untrained methods, such as sparse representations in overcomplete dictionaries (for example wavelets [Don95] and curvelets [Sta+02]). A number of highly successful image restoration and recovery schemes are not directly based on generative image models, but rely on structural assumptions about the image, such as exploiting self-similarity in images for denoising [Dab+07] and super-resolution [Gla+09].

Since the deep decoder is an image-generating deep network, it is also related to methods that rely on trained deep image models. Deep learning based methods are either trained end-to-end for tasks ranging from compression [Tod+16, Agu+17, The+17, Bur+12, Zha+17] to denoising [Bur+12, Zha+17], or are based on learning a generative image model (by training an autoencoder or GAN [HS06, Goo+14]) and then using the resulting model to solve inverse problems such as compressed sensing [Bor+17, HV18], denoising [Hec+18], phase retrieval [Han+18, SA18], and blind deconvolution [Asi+18], by minimizing an associated loss. In contrast to the deep decoder, where the optimization is over the weights of the network, in all the aforementioned methods, the weights are adjusted only during training and then are fixed upon solving the inverse problem.

Most related to our work is the Deep Image Prior (DIP), recently proposed by Ulyanov et al. [Uly+18]. The deep image prior is an untrained method that uses a network with an hourglass or encoder-decoder architecture, similar to the U-net and related architectures that work well as autoencoders. The key differences to the deep decoder are threefold: i) the DIP is over-parameterized, whereas the deep decoder is under-parameterized. ii) Since the DIP is highly over-parameterized, it critically relies on regularization through early stopping and adding noise to its input, whereas the deep decoder does not need to be regularized (however, regularization can enhance performance). iii) The DIP is a convolutional neural network, whereas the deep decoder is not.

We further illustrate point ii) comparing the DIP and deep decoder by denoising the astronaut image from Fig. 2. In Fig. 4(a) we plot the Mean Squared Error (MSE) over the number of iterations of the optimizer for fitting the noisy astronaut image $\mathbf{x}+\eta$ . Note that to fit the model, we minimize the error ${\left\|G(\mathbf{C})-(\mathbf{x}+\eta)\right\|}_{2}^{2}$ , because we are only given the noisy image, but we plot the MSE between the representation and the actual, true image ${\left\|G(\mathbf{C}^{t})-\mathbf{x}\right\|}_{2}^{2}$ at iteration $t$ . Here, $\mathbf{C}^{t}$ are the parameters of the deep decoder after $t$ iterations of the optimizer. In Fig. 4(b) and (c), we plot the loss or MSE associated with fitting the noiseless astronaut image, $\mathbf{x}$ ( ${\left\|G(\mathbf{C}^{t})-\mathbf{x}\right\|}_{2}^{2}$ ) and the noise itself, $\eta$ , ( ${\left\|G(\mathbf{C}^{t})-\eta\right\|}_{2}^{2}$ ). Models are fitted independently for the noisy image, the noiseless image, and the noise.

The plots in Fig. 4 show that with sufficiently many iterations, both the DIP and the DD can fit the image well. However, even with a large number of iterations, the deep decoder can not fit the noise well, whereas the DIP can. This is not surprising, given that the DIP is over-parameterized and the deep decoder is under-parameterized. In fact, in Section 6 we formally show that due to the underparameterization, the deep decoder can only fit a small proportion of the noise, no matter how and how long we optimize. As a consequence, it filters out much of the noise when applied to a natural image. In contrast, the DIP relies on the empirical observation that the DIP fits a structured image faster than it fits noise, and thus critically relies on early stopping.

Discussion on what makes the decoder work

In the previous sections we empirically showed that the deep decoder can represent images well and at the same time cannot fit noise well. In this section, we formally show that the deep decoder can only fit a small proportion of the noise, relative to the degree of underparameterization. In addition, we provide insights into how the components of the deep decoder contribute to representing natural images well, and we provide empirical observations on the sensitivity of the parameters and their distribution.

We start by showing that an under-parameterized deep decoder can only fit a proportion of the noise relative to the degree of underparameterization. At the heart of our argument is the intuition that a method mapping from a low- to a high-dimensional space can only fit a proportion of the noise relative to the number of free parameters. For simplicity, we consider a one-layer network, and ignore the batch normalization operation. Then, the networks output is given by

Here, we take $\mathbf{C}=(\mathbf{C}_{0},\mathbf{c}_{1})$ , where $\mathbf{C}_{0}$ is a $k\times k$ matrix and $\mathbf{c}_{1}$ is a $k$ -dimensional vector, assuming that the number of output channels is $1$ . While for the performance of the deep decoder the choice of upsampling matrix is important, it is not relevant for showing that the deep decoder cannot represent noise well. Therefore, the following statement makes no assumptions about the upsampling matrix $\mathbf{U}_{0}$ .

The proposition asserts that the deep decoder can only fit a small portion of the noise energy, precisely a proportion determined by its number of parameters relative to the output dimension, $n$ . Our simulations and preliminary analytic results suggest that this statement extends to multiple layers in that the lower bound becomes $\left(1-c{\frac{k^{2}\log(\prod_{i=1}^{d}n_{i-1})}{n}}\right)$ , where $c$ is a numerical constant. Note that the lower bound does not directly depend on the noise variance $\sigma$ since both sides of the inequality scale with $\sigma^{2}$ .

2 Upsampling

Upsampling is a vital part of the deep decoder because it is the only way that the notion of locality explicitly enters the signal model. In contrast, most convolutional neural networks have spatial coupling between pixels both by unlearned upsampling, but also by learned convolutional filters of nontrivial spatial extent. The choice of the upsampling method in the deep decoder strongly affects the ‘character’ of the resulting signal estimates. We now discuss the impacts of a few choices of upsampling matrices $\mathbf{U}_{i}$ , and their impact on the images the model can fit.

No upsampling: If there is no upsampling, or, equivalently, if $\mathbf{U}_{i}=\mathbf{I}$ , then there is no notion of locality in the resulting image. All pixels become decoupled, and there is then no notion of which pixels are near to each other. Specifically, a permutation of the input pixels (the rows of $\mathbf{B}_{0}$ ) simply induces the identical permutation of the output pixels. Thus, if a deep decoder without upsampling could fit a given image, it would also be able to fit random permutations of the image equally well, which is practically equivalent to fitting random noise.

Nearest neighbor upsampling: If the upsampling operations perform nearest neighbor upsampling, then the output of the deep decoder consists of piecewise constant patches. If the upsampling doubles the image dimensions at each layer, this would result in patches of $2^{d}\times 2^{d}$ pixels that are constant. While this upsampling method does induce a notion of locality, it does so too strongly in the sense that squares of nearby pixels become identical and incapable of fitting local variation within natural images.

Linear and convex, non-linear upsampling: The specific choice of upsampling matrix affects the multiscale ‘character’ of the signal estimates. To illustrate this, Figure 5 shows the signal estimate from a 1-dimensional deep decoder with upsampling operations given by linear upsampling $(x_{0},x_{1},x_{2},\ldots)\mapsto(x_{0},0.5x_{0}+0.5x_{1},x_{1},0.5x_{1}+0.5x_{2},x_{2},\ldots)$ and convex nonlinear upsampling given by $(x_{0},x_{1},x_{2},\ldots)\mapsto(x_{0},0.75x_{0}+0.25x_{1},x_{1},0.75x_{1}+0.25x_{2},x_{2},\ldots)$ . Note that while both models are able to capture the coarse signal structure, the convex upsampling results in a multiscale fractal-like structure that impedes signal representation. In contrast, linear upsampling is better able to represent smoothly varying portions of the signal. Linear upsampling in a deep decoder indirectly encodes the prior that natural signals are piecewise smooth and in some sense have approximately linear behavior at multiple scales

3 Network input

Throughout, the network input is fixed. We choose the network input $\mathbf{B}_{1}$ by choosing its entries uniformly at random. The particular choice of the input is not very important; it is however desirable that the rows are incoherent. To see this, as an extreme case, if any two rows of $\mathbf{B}_{1}$ are equal and if the upsampling operation preserves the values of those pixels exactly (for example, as with the linear upsampling from the previous section), then the corresponding pixels of the output image is also exactly the same, which restricts the range space of the deep decoder unrealistically, since for any pair of pixels, the majority of natural images does not have exactly the same value at this pair of pixels.

4 Image generation by successive approximation

Code to reproduce the results is available at https://github.com/reinhardh/supplement_deep_decoder

Acknowledgments

RH is partially supported by NSF award IIS-1816986, an NVIDIA Academic GPU Grant, and would like to thank Ludwig Schmidt for helpful discussions on the deep decoder in general, and in particular for suggestions on the experiments in Section 2.

References

Appendix

Appendix A Proof of Proposition 1

Next, fix the matrixes $\{{\mathbf{W}_{0j}}\}_{j}$ . As $G(\mathbf{C})$ lies in an at-most- $k^{2}$ -dimensional subspace, let $S$ be a $k^{2}$ -dimensional subspace that contains the range of $G$ for these fixed $\{{\mathbf{W}_{0j}}\}_{j}$ . It follows that

Now, we make use of the following bound on the projection of the noise $\eta$ onto a subspace.

From [LM00, Lem. 1 ], if $X\sim\chi^{2}_{n}$ , then

Since the number of matrices $\{{\mathbf{W}_{0j}}\}_{j}$ is bounded by ${n_{0}}^{k^{2}}$ , by a union bound,

where the last inequality follows with choosing $\beta=2\log({n_{0}})$ and by the assumption that $k^{2}<\frac{n}{32\log{n_{0}}}$ . This proves the claim in Proposition 1.

Our goal is to count the number of sign patterns $(\mathbf{A}\mathbf{v}>0)\in\{0,1\}$ . Note that this number is equal to the maximum number of partitions one can get when cutting a $k$ -dimensional space with $n$ many hyperplanes that all pass through the origin, and are perpendicular to the rows of $\mathbf{A}$ . This number if well known (see for example [Win66]) and is upper bounded by

where the last inequality holds for $k\geq 5$ .

Appendix B Sensitivity to parameter perturbations and distribution of parameters

The deep decoder is not overly sensitive to perturbations of its coefficients. To demonstrate this, fit the standard test image Barbara with a deep decoder with $6$ layers and $k=128$ , as before. We then perturb the weights in a given layer $i$ (i.e., the matrix $\mathbf{C}_{i}$ ) with Gaussian noise of a certain signal-to-noise ratio relative to $\mathbf{C}_{i}$ and leave the other weights and the input untouched. We then measure the peak signal-to-noise ratio in the image domain, and plot the corresponding curve for each layer (see Fig. 7). It can be seen that the representation provided by the deep decoder is relatively stable with respect to perturbations of its coefficients, and that it is more sensitive to perturbations in higher levels.

Finally, in Fig. 8 we depict the distribution of the weights of the network after fitted to the Barbara test image, and note that the weights are approximately Gaussian distributed.