Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations

Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, Luca Benini, Luc Van Gool

Introduction

In recent years, deep neural networks (DNNs) have led to many breakthrough results in machine learning and computer vision , and are now widely deployed in industry. Modern DNN models often have millions or tens of millions of parameters, leading to highly redundant structures, both in the intermediate feature representations they generate and in the model itself. Although overparametrization of DNN models can have a favorable effect on training, in practice it is often desirable to compress DNN models for inference, e.g., when deploying them on mobile or embedded devices with limited memory. The ability to learn compressible feature representations, on the other hand, has a large potential for the development of (data-adaptive) compression algorithms for various data types such as images, audio, video, and text, for all of which various DNN architectures are now available.

DNN model compression and lossy image compression using DNNs have both independently attracted a lot of attention lately. In order to compress a set of continuous model parameters or features, we need to approximate each parameter or feature by one representative from a set of quantization levels (or vectors, in the multi-dimensional case), each associated with a symbol, and then store the assignments (symbols) of the parameters or features, as well as the quantization levels. Representing each parameter of a DNN model or each feature in a feature representation by the corresponding quantization level will come at the cost of a distortion $D$ , i.e., a loss in performance (e.g., in classification accuracy for a classification DNN with quantized model parameters, or in reconstruction error in the context of autoencoders with quantized intermediate feature representations). The rate $R$ , i.e., the entropy of the symbol stream, determines the cost of encoding the model or features in a bitstream.

To learn a compressible DNN model or feature representation we need to minimize $D+\beta R$ , where $\beta>0$ controls the rate-distortion trade-off. Including the entropy into the learning cost function can be seen as adding a regularizer that promotes a compressible representation of the network or feature representation. However, two major challenges arise when minimizing $D+\beta R$ for DNNs: i) coping with the non-differentiability (due to quantization operations) of the cost function $D+\beta R$ , and ii) obtaining an accurate and differentiable estimate of the entropy (i.e., $R$ ). To tackle i), various methods have been proposed. Among the most popular ones are stochastic approximations and rounding with a smooth derivative approximation . To address ii) a common approach is to assume the symbol stream to be i.i.d. and to model the marginal symbol distribution with a parametric model, such as a Gaussian mixture model , a piecewise linear model , or a Bernoulli distribution (in the case of binary symbols).

In this paper, we propose a unified end-to-end learning framework for learning compressible representations, jointly optimizing the model parameters, the quantization levels, and the entropy of the resulting symbol stream to compress either a subset of feature representations in the network or the model itself (see inset figure). We address both challenges i) and ii) above with methods that are novel in the context DNN model and feature compression. Our main contributions are:

We provide the first unified view on end-to-end learned compression of feature representations and DNN models. These two problems have been studied largely independently in the literature so far.

Our method is simple and intuitively appealing, relying on soft assignments of a given scalar or vector to be quantized to quantization levels. A parameter controls the “hardness” of the assignments and allows to gradually transition from soft to hard assignments during training. In contrast to rounding-based or stochastic quantization schemes, our coding scheme is directly differentiable, thus trainable end-to-end.

Our method does not force the network to adapt to specific (given) quantization outputs (e.g., integers) but learns the quantization levels jointly with the weights, enabling application to a wider set of problems. In particular, we explore vector quantization for the first time in the context of learned compression and demonstrate its benefits over scalar quantization.

Unlike essentially all previous works, we make no assumption on the marginal distribution of the features or model parameters to be quantized by relying on a histogram of the assignment probabilities rather than the parametric models commonly used in the literature.

We apply our method to DNN model compression for a 32-layer ResNet model and full-resolution image compression using a variant of the compressive autoencoder proposed recently in . In both cases, we obtain performance competitive with the state-of-the-art, while making fewer model assumptions and significantly simplifying the training procedure compared to the original works .

The remainder of the paper is organized as follows. Section 2 reviews related work, before our soft-to-hard vector quantization method is introduced in Section 3. Then we apply it to a compressive autoencoder for image compression and to ResNet for DNN compression in Section 4 and 5, respectively. Section 6 concludes the paper.

Related Work

There has been a surge of interest in DNN models for full-resolution image compression, most notably , all of which outperform JPEG and some even JPEG 2000 The pioneering work showed that progressive image compression can be learned with convolutional recurrent neural networks (RNNs), employing a stochastic quantization method during training. both rely on convolutional autoencoder architectures. These works are discussed in more detail in Section 4.

In the context of DNN model compression, the line of works adopts a multi-step procedure in which the weights of a pretrained DNN are first pruned and the remaining parameters are quantized using a $k$ -means like algorithm, the DNN is then retrained, and finally the quantized DNN model is encoded using entropy coding. A notable different approach is taken by , where the DNN compression task is tackled using the minimum description length principle, which has a solid information-theoretic foundation.

It is worth noting that many recent works target quantization of the DNN model parameters and possibly the feature representation to speed up DNN evaluation on hardware with low-precision arithmetic, see, e.g., . However, most of these works do not specifically train the DNN such that the quantized parameters are compressible in an information-theoretic sense.

Gradually moving from an easy (convex or differentiable) problem to the actual harder problem during optimization, as done in our soft-to-hard quantization framework, has been studied in various contexts and falls under the umbrella of continuation methods (see for an overview). Formally related but motivated from a probabilistic perspective are deterministic annealing methods for maximum entropy clustering/vector quantization, see, e.g., . Arguably most related to our approach is , which also employs continuation for nearest neighbor assignments, but in the context of learning a supervised prototype classifier. To the best of our knowledge, continuation methods have not been employed before in an end-to-end learning framework for neural network-based image compression or DNN compression.

Proposed Soft-to-Hard Vector Quantization

Compressible representations. We say that a weight parameter $\mathbf{w}_{i}$ or a feature $\mathbf{x}^{(i)}$ has a compressible representation if it can be serialized to a binary stream using few bits. For DNN compression, we want the entire network parameters $\mathbf{W}$ to be compressible. For image compression via an autoencoder, we just need the features in the bottleneck, $\mathbf{x}^{(b)}$ , to be compressible.

Our generic goal is hence to optimize the rate distortion trade-off between the expected loss and the entropy of $E(\mathsf{Z})$ :

where $\hat{F}$ is the architecture where $\mathbf{z}$ has been replaced with $\hat{\mathbf{z}}$ , and $\beta>0$ controls the trade-off between compressibility of $\mathbf{z}$ and the distortion it imposes on $\hat{F}$ .

However, we cannot optimize (3) directly. First, we do not know the distribution of $\mathsf{X}$ and $\mathsf{Y}$ . Second, the distribution of $\mathsf{Z}$ depends in a complex manner on the network parameters $\mathbf{W}$ and the distribution of $\mathsf{X}$ . Third, the encoder $E$ is a discrete mapping and thus not differentiable. For our first approximation we consider the sample entropy instead of $H(E(\mathsf{Z}))$ . That is, given the data $\mathcal{X}$ and some fixed network parameters $\mathbf{W}$ , we can estimate the probabilities $P(E(\mathsf{Z})=\mathbf{e})$ for $\mathbf{e}\in[L]^{m}$ via a histogram. For this estimate to be accurate, we however would need $|\mathcal{X}|\gg L^{m}$ . If $\mathbf{z}$ is the bottleneck of an autoencoder, this would correspond to trying to learn a single histogram for the entire discretized data space. We relax this by assuming the entries of $E(\mathsf{Z})$ are i.i.d. such that we can instead compute the histogram over the $L$ distinct values. More precisely, we assume that for $\mathbf{e}=(e_{1},\cdots,e_{m})\in[L]^{m}$ we can approximate $P(E(\mathsf{Z})=\mathbf{e})\approx\prod_{l=1}^{m}p_{e_{l}},$ where $p_{j}$ is the histogram estimate

where we denote the entries of $E(\mathbf{z})=(e_{1}(\mathbf{z}),\cdots,e_{m}(\mathbf{z}))$ and $\mathbf{z}_{i}$ is the output feature $\mathbf{z}$ for training data point $\mathbf{x}_{i}\in\mathcal{X}$ . We then obtain an estimate of the entropy of $\mathsf{Z}$ by substituting the approximation (3.1) into (2),

where the first (exact) equality is due to , Thm. 2.6.6, and $H(p):=-\sum_{j=1}^{L}p_{j}\log p_{j}$ is the sample entropy for the (i.i.d., by assumption) components of $E(\mathsf{Z})$ In fact, from , Thm. 2.6.6, it follows that if the histogram estimates $p_{j}$ are exact, (5) is an upper bound for the true $H(E(\mathsf{Z}))$ (i.e., without the i.i.d. assumption)..

We note that so far we have assumed that $\mathbf{z}$ is a feature output in $F$ , i.e., $\mathbf{z}=\mathbf{x}^{(k)}$ for some $k\in[K]$ . However, the above treatment would stay the same if $\mathbf{z}$ is the concatenation of multiple feature outputs. One can also obtain a separate sample entropy term for separate feature outputs and add them to the objective in (6).

In case $\mathbf{z}$ is composed of one or more parameter vectors, such as in DNN compression where $\mathbf{z}=\mathbf{W}$ , $\mathbf{z}$ and $\hat{\mathbf{z}}$ cease to be random variables, since $\mathbf{W}$ is a parameter of the model. That is, opposed to the case where we have a source $\mathcal{X}$ that produces another source $\hat{\mathsf{Z}}$ which we want to be compressible, we want the discretization of a single parameter vector $\mathbf{W}$ to be compressible. This is analogous to compressing a single document, instead of learning a model that can compress a stream of documents. In this case, (3) is not the appropriate objective, but our simplified objective in (6) remains appropriate. This is because a standard technique in compression is to build a statistical model of the (finite) data, which has a small sample entropy. The only difference is that now the histogram probabilities in (4) are taken over $\mathbf{W}$ instead of the dataset $\mathcal{X}$ , i.e., $N=1$ and $\mathbf{z}_{i}=\mathbf{W}$ in (4), and they count towards storage as well as the encoder $E$ and decoder $D$ .

Challenges. Eq. (6) gives us a unified objective that can well describe the trade-off between compressible representations in a deep architecture and the original training objective of the architecture.

However, the problem of finding a good encoder $E$ , a corresponding decoder $D$ , and parameters $\mathbf{W}$ that minimize the objective remains. First, we need to impose a form for the encoder and decoder, and second we need an approach that can optimize (6) w.r.t. the parameters $\mathbf{W}$ . Independently of the choice of $E$ , (6) is challenging since $E$ is a mapping to a finite set and, therefore, not differentiable. This implies that neither $H(p)$ is differentiable nor $\hat{F}$ is differentiable w.r.t. the parameters of $\mathbf{z}$ and layers that feed into $\mathbf{z}$ . For example, if $\hat{F}$ is an autoencoder and $\mathbf{z}=\mathbf{x}^{(b)}$ , the output of the network will not be differentiable w.r.t. $\mathbf{w}_{1},\cdots,\mathbf{w}_{b}$ and $\mathbf{x}^{(0)},\cdots,\mathbf{x}^{(b-1)}$ .

These challenges motivate the design decisions of our soft-to-hard annealing approach, described in the next section.

2 Our Method

The idea is then to relax $E$ and $D$ into continuous mappings via soft assignments instead of the hard nearest neighbor assignment of $E$ .

where $\text{softmax}(y_{1},\cdots,y_{L})_{j}:=\frac{e^{y_{j}}}{e^{y_{1}}+\cdots+e^{y_{L}}}$ is the standard softmax operator, such that $\mathbf{\phi}(\bar{\mathbf{z}})$ has positive entries and $\|\mathbf{\phi}(\bar{\mathbf{z}})\|_{1}=1$ . We denote the $j$ -th entry of $\mathbf{\phi}(\bar{\mathbf{z}})$ with $\phi_{j}(\bar{\mathbf{z}})$ and note that

such that $\hat{\mathbf{\phi}}(\bar{\mathbf{z}}):=\lim_{\sigma\to\infty}\mathbf{\phi}(\bar{\mathbf{z}})$ converges to a one-hot encoding of the nearest center to $\bar{\mathbf{z}}$ in $\mathcal{C}$ . We therefore refer to $\hat{\mathbf{\phi}}(\bar{\mathbf{z}})$ as the hard assignment of $\bar{\mathbf{z}}$ to $\mathcal{C}$ and the parameter $\sigma>0$ as the hardness of the soft assignment $\phi(\bar{\mathbf{z}})$ .

Using soft assignment, we define the soft quantization of $\bar{\mathbf{z}}$ as

Entropy estimation. Using the soft assignments, we can similarly define a soft histogram, by summing up the partial assignments to each center instead of counting as in (4):

This gives us a valid probability mass function $q=(q_{1},\cdots,q_{L})$ , which is differentiable but converges to $p=(p_{1},\cdots,p_{L})$ as $\sigma\to\infty$ .

We can now define the “soft entropy” as the cross entropy between $p$ and $q$ :

such that we get an additive loss over the samples $\mathbf{x}_{i}\in\mathcal{X}$ and the components $l\in[m]$ .

To this end, we anneal $\sigma$ from some initial value $\sigma_{0}$ to infinity during training, such that the soft approximation gradually becomes a better approximation of the final hard quantization we will use. Choosing the annealing schedule is crucial as annealing too slowly may allow the network to invert the soft assignments (resulting in large weights), and annealing too fast leads to vanishing gradients too early, thereby preventing learning. In practice, one can either parametrize $\sigma$ as a function of the iteration, or tie it to an auxiliary target such as the difference between the network losses incurred by soft quantization and hard quantization (see Section 4 for details).

Image Compression

We now show how we can use our framework to realize a simple image compression system. For the architecture, we use a variant of the convolutional autoencoder proposed recently in (see Appendix A.1 for details). We note that while we use the architecture of , we train it using our soft-to-hard entropy minimization method, which differs significantly from their approach, see below.

We trained different models using Adam , see Appendix A.2. Our training set is composed similarly to that described in . We used a subset of 90,000 images from ImageNET , which we downsampled by a factor 0.7 and trained on crops of $128\times 128$ pixels, with a batch size of 15. To estimate the probability distribution $p$ for optimizing (8), we maintain a histogram over 5,000 images, which we update every 10 iterations with the images from the current batch. Details about other hyperparameters can be found in Appendix A.2.

To evaluate the image compression performance of our Soft-to-Hard Autoencoder (SHA) method we use four datasets, namely Kodak , B100 , Urban100 , ImageNET100 (100 randomly selected images from ImageNET ) and three standard quality measures, namely peak signal-to-noise ratio (PSNR), structural similarity index (SSIM) , and multi-scale SSIM (MS-SSIM), see Appendix A.5 for details. We compare our SHA with the standard JPEG, JPEG 2000, and BPG , focusing on compression rates $<1$ bits per pixel (bpp) (i.e., the regime where traditional integral transform-based compression algorithms are most challenged). As shown in Fig. 1, for high compression rates ( $<0.4$ bpp), our SHA outperforms JPEG and JPEG 2000 in terms of MS-SSIM and is competitive with BPG. A similar trend can be observed for SSIM (see Fig. 4 in Appendix A.6 for plots of SSIM and PSNR as a function of bpp). SHA performs best on ImageNET100 and is most challenged on Kodak when compared with JPEG 2000. Visually, SHA-compressed images have fewer artifacts than those compressed by JPEG 2000 (see Fig. 1, and Appendix A.7).

Related methods and discussion.

JPEG 2000 uses wavelet-based transformations and adaptive EBCOT coding. BPG , based on a subset of the HEVC video compression standard, is the current state-of-the art for image compression. It uses context-adaptive binary arithmetic coding (CABAC) .

The recent works of also showed competitive performance with JPEG 2000. While we use the architecture of , there are stark differences between the works, summarized in the inset table. The work of build a deep model using multiple generalized divisive normalization (GDN) layers and their inverses (IGDN), which are specialized layers designed to capture local joint statistics of natural images. Furthermore, they model marginals for entropy estimation using linear splines and also use CABAC coding. Concurrent to our work, the method of builds on the architecture proposed in , and shows that impressive performance in terms of the MS-SSIM metric can be obtained by incorporating it into the optimization (instead of just minimizing the MSE).

In contrast to the domain-specific techniques adopted by these state-of-the-art methods, our framework for learning compressible representation can realize a competitive image compression system, only using a convolutional autoencoder and simple entropy coding.

DNN Compression

For DNN compression, we investigate the ResNet architecture for image classification. We adopt the same setting as and consider a 32-layer architecture trained for CIFAR-10 . As in , our goal is to learn a compressible representation for all 464,154 trainable parameters of the model.

Our compressible model achieves a comparable test accuracy of $92.1\%$ while compressing the DNN by a factor $19.15$ with Huffman and $20.15$ using arithmetic coding. Table 1 compares our results with state-of-the-art approaches reported by . We note that while the top methods from the literature also achieve accuracies above $92\%$ and compression factors above $20\times$ , they employ a considerable amount of hand-designed steps, such as pruning, retraining, various types of weight clustering, special encoding of the sparse weight matrices into an index-difference based format and then finally use entropy coding. In contrast, we directly minimize the entropy of the weights in the training, obtaining a highly compressible representation using standard entropy coding.

In Fig. 5 in Appendix A.8, we show how the sample entropy $H(p)$ decays and the index histograms develop during training, as the network learns to condense most of the weights to a couple of centers when optimizing (6). In contrast, the methods of manually impose as the most frequent center by pruning $\approx 80\%$ of the network weights. We note that the recent works by also manages to tackle the problem in a single training procedure, using the minimum description length principle. In contrast to our framework, they take a Bayesian perspective and rely on a parametric assumption on the symbol distribution.

Conclusions

In this paper we proposed a unified framework for end-to-end learning of compressed representations for deep architectures. By training with a soft-to-hard annealing scheme, gradually transferring from a soft relaxation of the sample entropy and network discretization process to the actual non-differentiable quantization process, we manage to optimize the rate distortion trade-off between the original network loss and the entropy. Our framework can elegantly capture diverse compression tasks, obtaining results competitive with state-of-the-art for both image compression as well as DNN compression. The simplicity of our approach opens up various directions for future work, since our framework can be easily adapted for other tasks where a compressible representation is desired.

References

Appendix A Image Compression Details

We rely on a variant of the compressive autoencoder proposed recently in , using convolutional neural networks for the image encoder and image decoder We note that the image encoder (decoder) refers to the left (right) part of the autoencoder, which encodes (decodes) the data to (from) the bottleneck (not to be confused with the symbol encoder (decoder) in Section 3). . The first two convolutional layers in the image encoder each downsample the input image by a factor 2 and collectively increase the number of channels from 3 to 128. This is followed by three residual blocks, each with 128 filters. Another convolutional layer then downsamples again by a factor 2 and decreases the number of channels to $c$ , where $c$ is a hyperparameter ( use 64 and 96 channels). For a $w\times h$ -dimensional input image, the output of the image encoder is the $w/8\times h/8\times c$ -dimensional “bottleneck tensor”.

The image decoder then mirrors the image encoder, using upsampling instead of downsampling, and deconvolutions instead of convolutions, mapping the bottleneck tensor into a $w\times h$ -dimensional output image. In contrast to the “subpixel” layers used in , we use standard deconvolutions for simplicity.

A.2 Hyperparameters

We do vector quantization to $L=1000$ centers, using $(p_{w},p_{h})=(2,2)$ , i.e., $m=d/(2\cdot 2)$ .We trained different combinations of $\beta$ and $c$ to explore different rate-distortion tradeoffs (measuring distortion in MSE). As $\beta$ controls to which extent the network minimizes entropy, $\beta$ directly controls bpp (see top left plot in Fig. 3). We evaluated all pairs $(c,\beta)$ with $c\in\{8,16,32,48\}$ and $m\beta\in\{1e^{-4},\dots,9e^{-4}\}$ , and selected 5 representative pairs (models) with average bpps roughly corresponding to uniformly spread points in the interval $[0.1,0.8]$ bpp. This defines a “quality index” for our model family, analogous to the JPEG quality factor.

We experimented with the other training parameters on a setup with $c=32$ , which we chose as follows. In the first stage we train for $250k$ iterations using a learning rate of $1e^{-4}$ . In the second stage, we use an annealing schedule with $T=50k,K_{G}=100$ , over $800k$ iterations using a learning rate of $1e^{-5}$ . In both stages, we use a weak $l_{2}$ regularizer over all learnable parameters, with $\lambda=1e^{-12}$ .

A.3 Effect of Vector Quantization and Entropy Loss

To investigate the effect of vector quantization, we trained models as described in Section 4, but instead of using vector quantization, we set $L=6$ and quantized to $1\times 1$ -dimensional (scalar) centers, i.e., $(p_{h},p_{w})=(1,1),m=d$ . Again, we chose 5 representative pairs $(c,\beta)$ . We chose $L=6$ to get approximately the same number of unique symbol assignments as for $2\times 2$ patches, i.e., $6^{4}\approx 1000$ .

To investigate the effect of the entropy loss, we trained models using $2\times 2$ centers for $c\in\{8,16,32,48\}$ (as described above), but used $\beta=0$ .

Fig. 2 shows how both vector quantization and entropy loss lead to higher compression rates at a given reconstruction MSE compared to scalar quantization and training without entropy loss, respectively.

A.4 Effect of Annealing

A.5 Data Sets and Quality Measure Details

Kodak is the most frequently employed dataset for analizing image compression performance in recent years. It contains 24 color $768\times 512$ images covering a variety of subjects, locations and lighting conditions.

B100 is a set of 100 content diverse color $481\times 321$ test images from the Berkeley Segmentation Dataset .

Urban100 has 100 color images selected from Flickr with labels such as urban, city, architecture, and structure. The images are larger than those from B100 or Kodak, in that the longer side of an image is always bigger than 992 pixels. Both B100 and Urban100 are commonly used to evaluate image super-resolution methods.

ImageNET100 contains 100 images randomly selected by us from ImageNET , also downsampled and cropped, see above.

PSNR (peak signal-to-noise ratio) is a standard measure in direct monotonous relation with the mean square error (MSE) computed between two signals. SSIM and MS-SSIM are the structural similarity index and its multi-scale SSIM computed variant proposed to measure the similarity of two images. They correlate better with human perception than PSNR.

We compute quantitative similarity scores between each compressed image and the corresponding uncompressed image and average them over whole datasets of images. For comparison with JPEG we used libjpeghttp://libjpeg.sourceforge.net/, for JPEG 2000 we used the Kakadu implementationhttp://kakadusoftware.com/, subtracting in both cases the size of the header from the file size to compute the compression rate. For comparison with BPG we used the reference implementationhttps://bellard.org/bpg/ and used the value reported in the picture_data_length header field as file size.

A.6 Image Compression Performance

A.7 Image Compression Visual Examples

An online supplementary of visual examples is available at http://www.vision.ee.ethz.ch/~aeirikur/compression/visuals2.pdf, showing the output of compressing the first four images of each of the four datasets with our method, BPG, JPEG, and JPEG 2000, at low bitrates.