A Deep Generative Deconvolutional Image Model

Yunchen Pu, Xin Yuan, Andrew Stevens, Chunyuan Li, Lawrence Carin

Introduction

Convolutional neural networks (CNN) (LeCun et al., 1989) are effective tools for image and video analysis (Chatfield et al., 2014; Krizhevsky et al., 2012; Mnih et al., 2013; Sermanet et al., 2013). The CNN is characterized by feedforward (bottom-up) sequential application of convolutional filterbanks, pointwise nonlinear functions (e.g., sigmoid or hyperbolic tangent), and pooling. Supervision in CNN is typically implemented via a fully-connected layer at the top of the deep architecture, usually with a softmax classifier (Ciresan et al., 2011; He et al., 2014; Jarrett et al., 2009; Krizhevsky et al., 2012).

A parallel line of research concerns dictionary learning (Mairal et al., 2008; Zhang and Li, 2010; Zhou et al., 2012) based on a set of image patches. In this setting one imposes sparsity constraints on the dictionary weights with which the data are represented. For image analysis/processing tasks, rather than using a patch-based model, there has been recent interest in deconvolutional networks (DN) (Chen et al., 2011, 2013; Zeiler et al., 2010). In a DN one uses dictionary learning on an entire image (as opposed to the patches of an image), and each dictionary element is convolved with a sparse set of weights that exist across the entire image. Such models are termed “deconvolutional” because, given a learned dictionary, the features at test are found through deconvolution. One may build deep deconvolutional models, which typically employ a pooling step like the CNN (Chen et al., 2011, 2013). The convolutional filterbank of the CNN is replaced in the DN by a library of convolutional dictionaries.

In this paper we develop a new deep generative model for images, based on convolutional dictionary learning. At test, after the dictionary elements are learned, deconvolutional inference is employed, like in the aforementioned DN research. The proposed method is related to Chen et al. (2011, 2013), but a complete top-down generative model is developed, with stochastic unpooling connecting model layers (this is distinct from almost all other models, which employ bottom-up pooling). Chen et al. (2011, 2013) trained each layer separately, sequentially, with no final coupling of the overall model (significantly undermining classification performance). Further, in Chen et al. (2011, 2013) Bayesian posterior inference was approximated for all model parameters (e.g., via Gibbs sampling), which scales poorly. Here we employ Monte Carlo expectation maximization (MCEM) (Wei and Tanner, 1990), with a point estimate learned for the dictionary elements and the parameters of the classifier, allowing learning on large-scale data and fast testing.

Forms of stochastic pooling have been applied previously (Lee et al., 2009; Zeiler and Fergus, 2013). Lee et al. (2009) defined stochastic pooling in the context of an energy-based Boltzmann machine, and Zeiler and Fergus (2013) proposed stochastic pooling as a regularization technique. Here unpooling is employed, yielding a top-down generative process.

To impose supervision, we employ the Bayesian support vector machine (SVM) (Polson and Scott, 2011), which has been used for supervised dictionary learning (Henao et al., 2014) (but not previously for deep learning). The proposed generative model is amenable to Bayesian analysis, and here the Bayesian SVM is learned simultaneously with the deep model. The models in Donahue et al. (2014); He et al. (2014); Zeiler and Fergus (2014) do not train the SVM jointly, as we do – instead, the SVM is trained separately using the learned CNN features (with CNN supervised learning implemented via softmax).

This paper makes several contributions: (i) A new deep model is developed for images, based on convolutional dictionary learning; this model is a generative form of the earlier DN. (ii) A new stochastic unpooling method is proposed, linking consecutive layers of the deep model. (iii) An SVM is integrated with the top layer of the model, enabling max-margin supervision during training. (iv) The algorithm is implemented on a GPU, for large-scale learning and fast testing; we demonstrate state-of-the-art classification results on several benchmark datasets, and demonstrate scalability through analysis of the ImageNet dataset.

Supervised Deep Deconvolutional Model

The form of (1) motivates choices for the priors in the proposed generative model. Specifically, consider

with $S_{i,j}^{(n,k)}$ denoting element $(i,j)$ of ${\bf S}^{(n,k)}$ , drawn $S_{i,j}^{(n,k)}\sim\mathsf{Laplace}(0,b)=\frac{1}{2b}\exp(-|S_{i,j}^{(n,k)}|/b)$ . We have “vectorized” the matrices ${\bf E}^{(n)}$ and ${\bf D}^{(k)}$ (from the standpoint of the distributions from which they are drawn), and ${\bf I}$ is an appropriately sized identity matrix. The maximum a posterior (MAP) solution to (3), with the Laplace prior imposed independently on each component of ${\bf S}^{(n,k)}$ , corresponds to the optimization problem in (1), and the hyperparameters $\gamma_{e}$ and $b$ play roles analogous to $\lambda_{1}$ and $\lambda_{2}$ .

where $z_{i,j}^{(n,k)}\in\{0,1\}$ , $\delta_{0}$ is a unit point measure concentrated at zero, and $(a_{0},b_{0})$ are set to encourage that most $\pi^{(n,k)}$ are small (Paisley and Carin, 2009), i.e., $a_{0}=1/K$ and $b_{0}=1-1/K$ . For parameters $\gamma_{e}$ and $\gamma_{s}$ we impose the priors $\gamma_{s}\sim\mathsf{Gamma}(a_{s},b_{s})$ and $\gamma_{e}\sim\mathsf{Gamma}(a_{e},b_{e})$ , with hyperparameters $a_{s}=b_{s}=a_{e}=b_{e}=10^{-6}$ to impose diffuse priors (Tipping, 2001).

2 Generative Deep Model via Stochastic Unpooling

The model in (2) is motivated by the idea that each image ${\bf X}^{(n)}$ may be represented in terms of convolutional dictionary elements ${\bf D}^{(k)}$ that are shared across all $N$ images. In the proposed deep model, we similarly are motivated by the idea that the feature maps ${\bf S}^{(n,k)}$ may also be represented in terms of convolutions of (distinct) dictionary elements. Consider a two-layer model, with

where ${\bf X}^{(n,1)}={\bf X}^{(n)}$ . Dictionary elements ${\bf D}^{(k_{1},1)}$ replace ${\bf D}^{(k)}$ in (2), for representation of ${\bf X}^{(n)}$ . The weights ${\bf S}^{(n,k_{1},1)}$ are connected to ${\bf X}^{(n,2)}$ via the stochastic operation $\mathsf{unpool}(\cdot)$ , detailed below. Motivated as discussed above, ${\bf X}^{(n,2)}$ is represented in terms of convolutions with second-layer dictionary elements ${\bf D}^{(k_{2},2)}$ . The forms of the priors on ${\bf D}^{(k_{1},1)}$ and ${\bf D}^{(k_{2},2)}$ are as above for ${\bf D}^{(k)}$ , and the prior on ${\bf E}^{(n)}$ is unchanged.

Let ${\boldsymbol{z}}_{i,j}^{(n,k_{1},1)}\in\{0,1\}^{p_{x}^{(1)}p_{y}^{(1)}}$ be a vector of all zeros, and a single one, and the location of the non-zero element of ${\boldsymbol{z}}_{i,j}^{(n,k_{1},1)}$ identifies the location of the single non-zero element of pooling block ${\bf S}^{(n,k_{1},1)}_{i,j}$ which is set as $X^{(n,k_{1},2)}_{i,j}$ . The function $\mathsf{unpool}(\cdot)$ is a stochastic operation that defines ${\boldsymbol{z}}_{i,j}^{(n,k_{1},1)}$ , and hence the way ${\bf X}^{(n,k_{1},2)}$ is unpooled to constitute the sparse ${\bf S}^{(n,k_{1},2)}$ . We impose

where $\mathsf{Dir}(\cdot)$ denotes the symmetric Dirichlet distribution; the Dirichlet distribution has a set of parameters, and here they are all equal to the value indicated in $\mathsf{Dir}(\cdot)$ .

When introducing the above two-layer model, ${\bf S}^{(n,k_{2},2)}$ is drawn from a spike-slab prior, as in (4). However, we may extend this to a three-layer model, with pooling blocks defined in ${\bf S}^{(n,k_{2},2)}$ . A convolutional dictionary representation is similarly constituted for ${\bf X}^{(n,3)}$ , and this is stochastically unpooled to generate ${\bf S}^{(n,k_{2},2)}$ . This may continued for $L$ layers, where the hierarchical convolutional dictionary learning learns multi-scale structure in the weights on the dictionary elements. At the top layer in the $L$ -layer model, the weights ${\bf S}^{(n,k_{L},L)}$ are drawn from a spike-slab prior of the form in (4).

Consider an $L$ -layer model, and assume that $\{\{{\bf D}^{(k_{l},l)}\}_{k_{l}=1}^{K_{l}}\}_{l=1}^{L}$ have been learned/specified. An image is generated by starting at the top, drawing $\{{\bf S}^{(n,k_{L},L)}\}_{k_{L}=1}^{K_{L}}$ from a spike-slab model. Then $\{{\bf S}^{(n,k_{L-1},L-1)}\}_{k_{L-1}=1}^{K_{L-1}}$ are constituted by convolving $\{{\bf S}^{(n,k_{L},L)}\}_{k_{L}=1}^{K_{L}}$ with $\{{\bf D}^{(k_{L},L)}\}_{k_{L}=1}^{K_{L}}$ , summing over the $K_{L}$ dictionary elements, and then performing stochastic unpooling. This process of convolution and stochastic unpooling proceeds for $K_{L}$ layers, ultimately yielding $\sum_{k_{1}=1}^{K_{1}}{\bf D}^{(k_{1},1)}*{\bf S}^{(n,k_{1},1)}$ at the bottom (first) layer. With the added stochastic residual ${\bf E}^{(n)}$ , the image ${\bf X}^{(n)}$ is specified.

We note an implementation detail that has been found useful in experiments. In (9), the unpooling was performed such that each pooling block in ${\bf S}^{(n,k_{l},l)}$ has a single non-zero element, with the non-zero element defined in ${\bf X}^{(n,k_{l},l+1)}$ . The unpooling for block $(i,j)$ was specified by the $p_{x}^{(l)}p_{y}^{(l)}$ -dimensional ${\boldsymbol{z}}_{i,j}^{(n,k_{l},l)}$ vector of all-zeros and a single one. In our slightly modified implementation, we have considered a $(p_{x}^{(l)}p_{y}^{(l)}+1)$ -dimensional ${\boldsymbol{z}}_{i,j}^{(n,k_{l},l)}$ , which is again all zeros with a single one, and ${\boldsymbol{\theta}}^{(n,k_{l},l)}$ is also $p_{x}^{(l)}p_{y}^{(l)}+1$ dimensional. If the single one in ${\boldsymbol{z}}_{i,j}^{(n,k_{l},l)}$ is located among the first $p_{x}^{(l)}p_{y}^{(l)}$ elements of ${\boldsymbol{z}}_{i,j}^{(n,k_{l},l)}$ , then the location of this non-zero element identifies the location of the single non-zero element in the $(i,j)$ pooling block, as before. However, if the non-zero element of ${\boldsymbol{z}}_{i,j}^{(n,k_{l},l)}$ is in position $p_{x}^{(l)}p_{y}^{(l)}+1$ , then all elements of pooling block $(i,j)$ are set to zero. This imposes further sparsity on the feature maps and, as demonstrated in the Supplementary Material (SM), it yields a model in which the elements of the feature map that are relatively small are encouraged to be zero. This turning off of dictionary elements with small weights is analogous to dropout (Srivastava et al., 2014), which has been used in CNN, and in our model it also has been found to yield slightly better classification performance.

3 Supervision via Bayesian SVMs

Given a feature vector ${\boldsymbol{s}}$ , the goal of the SVM is to find an $f({\boldsymbol{s}})$ that minimizes the objective function

where $\max(1-y_{n}f({\boldsymbol{s}}_{n}),0)$ is the hinge loss, $R(f({\boldsymbol{s}}))$ is a regularization term that controls the complexity of $f({\boldsymbol{s}})$ , and $\gamma$ is a tuning parameter controlling the trade-off between error penalization and the complexity of the classification function. The decision boundary is defined as $\{{\boldsymbol{s}}:f({\boldsymbol{s}})=0\}$ and $\mbox{sign}(f({\boldsymbol{s}}))$ is the decision rule, classifying ${\boldsymbol{s}}$ as either $-1$ or $1$ (Vapnik, 1995).

Recently, Polson and Scott (2011) showed that for the linear classifier $f({\boldsymbol{s}})={\boldsymbol{\beta}}^{T}{\boldsymbol{s}}$ , minimizing (10) is equivalent to estimating the mode of the pseudo-posterior of ${\boldsymbol{\beta}}$

where ${\boldsymbol{y}}=[y_{1}\ \ldots\ y_{N}]^{T}$ , ${\bf S}=[{\boldsymbol{s}}_{1}\ \ldots\ {\boldsymbol{s}}_{N}]$ , $\mathcal{L}(y_{n}|{\boldsymbol{s}}_{n},{\boldsymbol{\beta}},\gamma)$ is the pseudo-likelihood function, and $p({\boldsymbol{\beta}}|\cdot)$ is the prior distribution for the vector of coefficients ${\boldsymbol{\beta}}$ . Choosing ${\boldsymbol{\beta}}$ to maximize the log of (11) corresponds to (10), where the prior is associated with $R(f({\boldsymbol{s}}))$ . Polson and Scott (2011) showed that $\mathcal{L}(y_{n}|{\boldsymbol{s}}_{n},{\boldsymbol{\beta}},\gamma)$ admits a location-scale mixture of normals representation by introducing latent variables $\lambda_{n}$ , such that

Note that the exponential in (12) is Gaussian wrt ${\boldsymbol{\beta}}$ . As described in Polson and Scott (2011), this encourages data augmentation for variable $\lambda_{n}$ ( $\lambda_{n}$ is treated as a new random variable), which permits efficient Bayesian inference (see Polson and Scott (2011); Henao et al. (2014) for details). One of the benefits of a Bayesian formulation for SVMs is that we can flexibly specify the behavior of ${\boldsymbol{\beta}}$ while being able to adaptively regularize it by specifying a prior $p(\gamma)$ as well.

We impose shrinkage (near sparsity) (Polson and Scott, 2010) on ${\boldsymbol{\beta}}$ using the Laplace distribution; letting $\beta_{i}$ denote $i^{\rm th}$ element of ${\boldsymbol{\beta}}$ , we impose

and similar to $\kappa$ and $\lambda_{n}$ , a diffuse Gamma prior is imposed on $\gamma$ .

Model Training

The previous section described a supervised deep generative model for images, based on deep convolutional dictionary learning, stochastic unpooling, and the Bayesian SVM. The conditional posterior distribution for each model parameter can be written in closed form, assuming the other model parameters are fixed (see the SM). For relatively small datasets we can therefore employ a Gibbs sampler for both training and deconvolutional inference, yielding an approximation to the posterior distribution on all parameters. Large-scale datasets prohibit the application of standard Gibbs sampling. For large data we use stochastic MCEM (Wei and Tanner, 1990) to find a maximum a posterior (MAP) estimate of the model parameters.

We consolidate the “local” model parameters (latent data-sample-specific variables) as

which can be interpreted as an EM problem:

Perform an expectation with respect to the local variables, using $p({\boldsymbol{\Phi}}_{n}|{\bf Y}_{n},{\boldsymbol{\Psi}}_{t-1})\,\forall n$ , where ${\boldsymbol{\Psi}}_{t-1}$ is the estimate of the global parameters from iteration $(t-1)$ .

M-step:

where ${\boldsymbol{\Phi}}_{n}^{s}$ is a sample from the full conditional posterior distribution, and $N_{s}$ is the number of samples; we seek to maximize $Q({\boldsymbol{\Psi}}|{\boldsymbol{\Psi}}_{t-1})$ wrt ${\boldsymbol{\Psi}}$ , constituting ${\boldsymbol{\Psi}}_{t}$ . Recall from above that each of the conditional distributions in a Gibbs sampler of the model is analytic; this allows convenient sampling of local parameters, conditioned on specified global parameters ${\boldsymbol{\Psi}}_{t-1}$ , and therefore the aforementioned sampling is implemented efficiently (using mini-batches of data, where $\mathcal{I}_{t}\subset\{1,\dots,N\}$ identifies the stochastically defined subset of data in mini-batch $t$ ). An approximation to the M-step is implemented via stochastic gradient descent (SGD). The stochastic MCEM gradient at iteration $t$ is

We solve (15) using RMSprop (Dauphin et al., 2015; Tieleman and Hinton, 2012) with the gradient approximation in (17).

In the learning phase, the MCEM method is used to learn a point estimate for the global parameters ${\boldsymbol{\Psi}}$ . During testing, we follow the same MCEM setup with ${\boldsymbol{\Phi}}^{\text{test}}=\big{(}\{{\boldsymbol{z}}^{(*,l)}\}_{l=1}^{L-1},{{\boldsymbol{\gamma}}}_{s}^{(*)},{{\bf E}}^{(*)}\big{)}$ , $\quad{\boldsymbol{\Psi}}^{\text{test}}={\bf S}^{(*,L)},$ when given a new image ${\bf X}^{*}$ . We find a MAP estimator:

using MCEM (gradient wrt ${\boldsymbol{\Psi}}^{\text{test}}$ ). In this form of the MCEM, all data-dependent latent variables ${\boldsymbol{\Phi}}^{\text{test}}$ are integrated (summed) out in the expectation, except for the top-layer feature map ${\boldsymbol{\Psi}}^{\text{test}}$ , for which the gradient descent M step yields a point estimate. The top-layer features are then sent to the trained SVM to predict the label. Details for training and inference are provided in the SM.

Experimental Results

We present results for the MNIST, CIFAR-10 & 100, Caltech 101 & 256 and ImageNet 2012 datasets. The same hyperparameter settings (discussed at the end of Section 2.1) were used in all experiments; no tuning was required between datasets.

For the first five (small/modest-sized) datasets, the model is learned via Gibbs sampling. We found that it is effective to use layer-wise pretraining as employed in some deep generative models (Erhan et al., 2010; Hinton and Salakhutdinov, 2006). The pretraining is performed sequentially from the bottom layer (touching the data), to the top layer, in an unsupervised manner. Details on the layerwise pretraining are discussed in the SM. In the pretraining step, we average 500 collection samples, to obtain parameter values (e.g., dictionary elements) after first discarding 1000 burn-in samples. Following pre-training, we refine the entire model jointly using the complete set of Gibbs conditional distributions. 1000 burn-in iterations are performed followed by 500 collection draws, retaining one of every 50 iterations. During testing, the predictions are based on averaging the decision values of the collected samples.

For each of these first five datasets, we show three classification results, using part of or all of our model (to illustrate the role of each component): 1) Pretraining only: this model (in an unsupervised manner) is used to extract features and the futures are sent to a separate linear SVM, yielding a 2-step procedure. 2) Unsupervised model: this model includes the deep generative developed in Sec. 2.2, but is also trained in an unsupervised manner (this is the unsupervised model after refinement). The features extracted by this model are sent to a separate linear SVM, and therefore this is also a 2-step procedure. 3) Supervised model: this is the complete refined supervised model developed in Sec. 2.2 and Sec. 2.3.

ImageNet 2012 is used to assess the scalability of our model to large datasets. In this case, we learn the supervised model initialized from the priors (without layerwise pretraining). The proposed online learning method, MCEM, based on RMSProp (Dauphin et al., 2015; Tieleman and Hinton, 2012), is developed for both training and inference with mini-batch size 256 and decay rate 0.95. Our implementation of MCEM learning is based on the publicly available CUDA C++ Caffe toolbox (August 2015 branch) (Jia et al., 2014), but contains significant modifications for our model. Our model takes around one week to train on ImageNet 2012 using a nVidia GeForce GTX TITAN X GPU with 12GB memory. Testing for the validation set of ImageNet 2012 (50K images) takes less than 12 minutes. In the subsequent tables providing classification results, the best results achieved by our model are bold.

The MNIST data (http://yann.lecun.com/exdb/mnist/) has 60,000 training and 10,000 testing images, each $28\times 28$ , for digits 0 through 9. A two-layer model is used with dictionary element size $8\times 8$ and $6\times 6$ at the first and second layer, respectively. The pooling size is $3\times 3$ ( $p_{x}=p_{y}=3$ ) and the number of dictionary elements at layers 1 and 2 are $K_{1}=39$ and $K_{2}=117$ , respectively. These numbers of dictionary elements are obtained by setting the initial dictionary number to a relatively large value ( $K_{1}=50$ and $K_{2}=200$ ) in the pretraining step and discarding infrequently used elements by counting the corresponding binary indicator ${\boldsymbol{z}}$ – effectively inferring the number of needed dictionary elements, as in Chen et al. (2011, 2013).

Table 1 summarizes the classification results for MNIST. Our 2-layer supervised model outperforms most other modern approaches. The methods that outperforms ours are the complicated (6-layer) ConvNet model with elastic disortions (Ciresan et al., 2011) and the MCDNN, which combines several deep convolutional neural networks (Ciresan et al., 2012). Specifically, (Ciresan et al., 2012) used a committee of 35 convolutional networks, width normalization, and elastic distortions of the data; (Ciresan et al., 2011) used elastic distortions and a single convolutional neural network to achieve the similar error as our approach.

To further examine the performance of the proposed model, we plot a selection of top-layer dictionary elements (projected through the generative process down to the data plane) learned by our supervised model, on the right of Fig. 2, and on the left we show the corresponding elements inferred by our unsupervised model. It can be seen that the elements inferred by the supervised model are clearer (“unique” to a single number), whereas the elements learned by the unsupervised model are blurry (combinations of multiple numbers). Similar results were reported in Erhan et al. (2010).

Since our model is generative, using it to generate digits after training on MNIST is straightforward, and some examples are shown in Fig. 3 (based on random draws of the top-layer weights). We also demonstrate the ability of the model to predict missing data (generative nature of the model); reconstructions are shown in Fig. 4. More results are provided in the SM.

2 CIFAR-10 & 100

The CIFAR-10 dataset (Krizhevsky and Hinton, 2009) is composed of 10 classes of natural $32\times 32$ RGB images with 50000 images for training and 10000 images for testing. We apply the same preprocessing technique of global contrast normalization and ZCA whitening as used in the Maxout network (Goodfellow et al., 2013). A three-layer model is used with dictionary element size $5\times 5$ , $5\times 5$ , $4\times 4$ at the first, second and third layer. The pooling sizes are both $2\times 2$ and the numbers of dictionary elements for each layer are $K_{1}=48$ , $K_{2}=128$ and $K_{3}=128$ . If we augment the data by translation and horizontal flipping as used in other models (Goodfellow et al., 2013), we achieve $8.27\%$ error. Our result is competitive with the state-of-art, which integrates supervision on every hidden layer (Lee et al., 2015). In constrast, we only impose supervision at the top layer. Table 2 summarizes the classification accuracy of our models and some related models.

The CIFAR-100 dataset (Krizhevsky and Hinton, 2009) is the same as CIFAR-10 in size and format, except it contains 100 classes. We use the same settings as in the CIFAR-10. Table 3 summarizes the classification accuracy of our model and some related models. It can be seen that our results ( $34.62\%$ ) are also very close to the state-of-the-art: ( $34.57\%$ ) in Lee et al. (2015).

3 Caltech 101 & 256

To balance speed and performance, we resize the images of Caltech 101 and Caltech 256 to $128\times 128$ , followed by local contrast normalization (Jarrett et al., 2009). A three layer model is adopted. The dictionary element sizes are set to $7\times 7$ , $5\times 5$ and $5\times 5$ , and the size of the pooling regions are $4\times 4$ (layer 1 to layer 2) and $2\times 2$ (layer 2 to layer 3).

The dictionary sizes for each layer are set to $K_{1}=48$ , $K_{2}=84$ and $K_{3}=84$ for Caltech 101, and $K_{1}=48$ , $K_{2}=128$ and $K_{3}=128$ for Caltech 256. Tables 4 and 5 summarize the classification accuracy of our model and some related models. Using only the data inside Caltech 101 and Caltech 256 (without using other datasets) for training, our results ( $87.82\%,\,66.4\%$ ) exceed the previous state-of-art results ( $83\%,\,58\%$ ) by a substantial margin ( $4\%,\,12.4\%$ ), which are the best results obtained by models without using deep convolutional models (using hand-crafted features).

As a baseline, we implemented the neural network consisting of three convolutional layers and two fully-connected layers with a final softmax classifier. The architecture of three convolutional layers is the same as our model. The fully-connected layers have 1024 neurons each. The results of neural network trained with dropout (Srivastava et al., 2014), after carefully parameter tuning, are also shown in Tables 4 and 5.

The state-of-the-art results on these two datasets are achieved by pretraining the deep network on a large dataset, ImageNet (Donahue et al., 2014; He et al., 2014; Zeiler and Fergus, 2014). We consider similar ImageNet pretraining in Sec. 4.4. We also observe from Table 5 that when there are fewer training images, our accuracy diminishes. This verifies that the model complexity needs to be selected based on the size of the data. This is also consistent with the results reported by Zeiler and Fergus (2014), in which the classification performance is very poor without training the model on ImageNet.

Fig. 5 shows selected dictionary elements learned from the unsupervised and the supervised model, to illustrate the differences. It is observed that the dictionaries without supervision tend to reconstruct the data while the dictionary elements with supervision tend to extract features that will distinguish different classes. For example, the dictionaries learned with supervision have double sides on the image edges. Our model is generative, and as an example we generate images using the dictionaries trained from the “Faces easy” category, with random top-layer dictionary weights (see Fig. 6). Similar to the MNIST example, we also show in Fig. 7 the interpolation results of face data with half the image missing. Though the background is a little noisy, each face is recovered in great detail by the third (top) layer dictionaries. More results are provided in the SM.

4 ImageNet 2012

We train our model on the 1000-category ImageNet 2012 dataset, which consists of 1.3M/50K/100K training/validation/test images. Our training process follows the procedure of previous work (Howard, 2013; Krizhevsky et al., 2012; Zeiler and Fergus, 2014). The smaller image dimension is scaled to 256, and a $224\times 224$ crop is chosen at 1024 random locations within the image. The data are augmented by color alteration and horizontal flips (Howard, 2013; Krizhevsky et al., 2012). A five layer convolutional model is employed ( $L=5$ ); the numbers (sizes) of dictionary elements for each layer are set to $96\,(5\times 5)$ , $256\,(5\times 5)$ , $512\,(3\times 3)$ , $1024\,(3\times 3)$ and $\,512(3\times 3)$ ; the pooling ratios are $4\times 4\,$ (layer 1 to 2) and $2\times 2\,$ (others). The number of parameters in our model is around 30 million.

We emphasize that our intention is not to directly compete with the best performance in the ImageNet challenge (Szegedy et al., 2015; Simonyan and Zisserman, 2015), which requires consideration of many additional aspects, but to provide a comparison on this dataset with a CNN with a similar network architecture (size). Table 6 summarizes our results compared with the “ZF”-net developed in Zeiler and Fergus (2014) which has a similiar architecture with ours.

The MAP estimator of our model, described in Sec. 3, achieves a top-5 error rate of $16.1\%$ on the testing set, which is close to Zeiler and Fergus (2014). Model averaging used in Bayesian inference often improves performance, and is considered here. Specifically, after running the MCEM algorithm, we have a (point) estimate of the global parameters. Using a mini-batch of data, one can leverage our analytic Gibbs updates to sample from the posterior (starting from the MAP estimate), and therefore obtain multiple samples for the global model parameters. We collect the approximate posterior samples every 1000 iterations, and retain 20 samples. Averaging the predictions of these 20 samples (model averaging) gives a top-5 error rate of $13.6\%$ , which outperforms the combination of 6 “ZF”-nets. Limited additional training time (one day) is required for this model averaging.

To illustrate that our model can generalize to other datasets, we follow the setup in (Donahue et al., 2014; He et al., 2014; Zeiler and Fergus, 2014), keeping five convolutional layers of our ImageNet-trained model fixed and train a new Bayesian SVM classifier on the top using the training images of Caltech 101 and Caltech 256, with each image resized to $256\times 256$ (effectively, we are using ImageNet to pretrain the model, which is then refined for Caltech 101 and 256). The results are shown in Tables 4 and 5. We obtain state-of-art results ( $77.9\%$ ) on Caltech 256. For Caltech 101, our result ( $93.15\%$ ) is competitive with the state-of-the-art result ( $94.11\%$ ), which combines spatial pyramid matching and deep convolutional networks (He et al., 2014). These results demonstrate that we can provide comparable results to the CNN in data generalization tasks, while also scaling well.

Conclusions

A supervised deep convolutional dictionary-learning model has been proposed within a generative framework, integrating the Bayesian support vector machine and a new form of stochastic unpooling. Extensive image classification experiments demonstrate excellent classification performance on both small and large datasets. The top-down form of the model constitutes a new generative form of the deep deconvolutional network (DN) (Zeiler et al., 2010), with unique learning and inference methods.

References

More Results

2 Missing data interpolation

MCEM algorithm

Algorithms 1 and 2 detail the training and testing process. The steps are explained in the next two sections.

Gibbs Sampling

In the remainder of this discussion, we use the following definitions.

$ceil(x)=\lceil x\rceil$ is the smallest integer that is not less than $x$ .

The summation and the quadratic summation of all elements in a matrix:

Thus, the unpooling process (equation(6) in the main paper) can be formed as:

This “generative” function measures how much the $k^{\rm th}$ band of $l^{\rm th}$ layer feature is “responsible” for the of input image ${{\bf X}}^{(n)}$ in the current model:

It can be considered as if $k^{\rm th}$ band of $l^{\rm th}$ layer feature changes ${\bf X}$ (i.e. ${\bf X}^{(n,k,l)}\rightarrow{\bf X}^{(n,k,l)}+{\bf X}$ ), the corresponding data layer representation will change $g({{\bf X}},n,k,l)$ (i.e. ${\bf X}^{(n)}\rightarrow{\bf X}^{(n)}+g({{\bf X}},n,k,l)$ ). Thus, for $l=2,\dots,L$ , we have

Note that $g()$ is a linear function for ${\bf X}$ , which means:

For convenience, we also use the following notations:

We use ${{\bf Z}}^{(n,k_{l},l)}$ to represent $\{{\boldsymbol{z}}^{(n,k_{l},l)}_{i,j};\forall i,j\}$ , where the vector version of the $(i,j)^{\rm th}$ block of ${{\bf Z}}^{(n,k_{l},l)}$ is equal to ${\boldsymbol{z}}^{(n,k_{l},l)}_{i,j}$ .

$\bf{0}$ denotes the all 0 vector or matrix. $\bf{1}$ denotes the all one vector or matrix. ${\boldsymbol{e}}_{m}$ denotes a “one-hot” vector with the $m^{\rm th}$ element equal to 1.

2 Full Conditional Posterior Distribution

where $X^{-(p,q)}_{i,j}$ is a term which is independent of $D_{p,q}$ but related by the index $(i,j,p,q)$ ; so is $S_{(i+N_{dx}-p,j+N_{dy}-q)}$ . Following this, for every elements in ${\bf D}$ , we can represent ${\bf X}$ as:

where matrices ${{\bf C}}^{(n,k_{l},l)}_{p,q}$ and ${{\bf F}}^{(n,k_{l},l)}_{i,j}$ are independent of ${D}^{(n,k_{l},l)}_{p,q}$ but related by the index $(n,k_{l},l,p,q)$ .

Similarly, for every elements in ${\boldsymbol{z}}$ , we have

The conditional posterior of ${\bf D}_{i,j}^{(k_{l-1},k_{l},l)}$ :

The conditional posterior of ${\boldsymbol{z}}_{i,j}^{(n,k_{l},l)}$ :

The conditional posterior of ${\boldsymbol{\theta}}^{(n,k_{l},l)}$

The conditional posterior of $S^{(n,k_{L},L)}_{i,j}$ :

The conditional posterior of $\gamma_{s}^{(n,k_{L})}$ :

The conditional posterior of $\gamma_{e}^{(n)}$ :

where ${\cal{IG}}$ denotes the inverse Gaussian distribution.

MCEM algorithm Details

where $const$ denotes the terms which are not a function of ${{\boldsymbol{\Psi}}}$ .

Obtaining a closed form of the exact $Q$ function is analytically intractable. We here approximate the expectations in (58) by samples collected from the posterior distribution of the hidden variables developed in Section 8.2. The $Q$ function in (58) can be approximated by:

where $\bar{{\bf S}}^{(L,s,t)}$ , $\bar{{\boldsymbol{\gamma}}_{e}}^{(s,t)}$ , $\bar{{\boldsymbol{\lambda}}}^{(s,t)}$ and $\bar{{\bf Z}}^{(s,t)}$ are a sample of the corresponding variables from the full conditional posterior at the $t^{\rm th}$ iteration. $N_{s}$ is the number of collected samples.

2 M step

We can maximize $\bar{Q}({\boldsymbol{\Psi}}|{\boldsymbol{\Psi}}^{(t)})$ via the following updates:

For $l=1,\dots,L$ , $k_{l-1}=1,\dots,K_{L-1}$ and $k_{l}=1,\dots,K_{L}$ , the gradient wrt ${{\bf D}}^{(k_{l-1},k_{l},l)}$ is:

Following this, the update rule of ${\bf D}$ based on RMSprop is:

3 Testing

During testing, when given a test image ${{\bf X}}^{(*)}$ , we treat ${{\bf S}}^{(*,L)}$ as model parameters and use MCEM to find a MAP estimator:

where $/{{\bf Z}}^{(L)}=\{{{\bf Z}}^{(l)}\}_{l=1}^{L-1}$ . Let ${\boldsymbol{\Psi}}_{test}=\{{{\bf W}}^{(*,L)},{{\bf Z}}^{(*,L)}\}$ and ${\boldsymbol{\Phi}}_{test}=\{\{{{\bf Z}}^{(l)}\}_{l=1}^{L-1},{{\boldsymbol{\gamma}}}_{s}^{*},{{\bf E}}^{*}\}$ . The $Q$ function for testing can be represented as:

In the E-step we collect the samples of ${\boldsymbol{\gamma}}_{e}$ , ${\boldsymbol{\gamma}}_{s}$ and $\{{{\bf Z}}^{(l)}\}_{l=1}^{L-1}$ from conditional posterior distributions, which is similar to the training process. $Q_{test}$ can thus be approximated by:

In the M-step, we maximize $\bar{Q}_{test}$ via the following updates:

The gradient $w.r.t.$ ${{\bf W}}^{(*,K_{L},L)}$ is:

where ${\boldsymbol{\delta}}^{(*,k_{L-1},L,t)}$ is the same as (66). Therefore, the update rule of ${\bf W}$ based on RMSprop is:

The update rule ${{\bf Z}}^{(*,k_{L},L)}$ is

where ${\eta}^{(*,k_{L},L)}_{i,j}$ is the same as (45).

Bottom-Up Pretraining

The model is pretrained sequentially from the bottom layer to the top layer. We consider here pretraining the relationship between layer $l$ and layer $l+1$ , and this process may be repeated up to layer $L$ . The basic framework of this pretraining process is closely connected to top-down generative process, with a few small but important modifications. Matrix ${\bf X}^{(n,l)}$ represents the pooled and stacked activation weights from layer $l-1$ , image $n$ ( $K_{l-1}$ “spectral bands” in ${\bf X}^{(n,l)}$ , due to $K_{l-1}$ dictionary elements at layer $l-1$ ). We constitute the representation

The features ${\bf{S}}^{(n,k_{l},l)}$ can be partitioned into contiguous blocks with dimension $p_{x}^{l}\times p_{y}^{l}$ . In our generative model, ${\bf{S}}^{(n,k_{l},l)}$ is generated from ${\bf X}^{(n,k_{l},l+1)}$ and ${\boldsymbol{z}}^{(n,k_{l},l)}$ , where the non-zero element within the $(i,j)-$ th pooling block of ${\bf{S}}^{(n,k_{l},l)}$ is set equal to $X_{i,j}^{(n,k_{l},l+1)}$ , and its location within the pooling block is determined by ${\boldsymbol{z}}^{(n,k_{l},l)}_{i,j}$ , a $p_{x}^{l}\times p_{y}^{l}$ binary vector (Sec. 2.2 in the main paper). Now the matrix ${\bf X}^{(n,k_{l},l+1)}$ is constituted by “stacking” the spatially-aligned and pooled versions of ${\bf{S}}^{(n,k_{l},l)}_{k_{l}=1,K_{l}}$ . Thus, we need to place a prior on the $(i,j)-$ th pooling block of ${\bf{S}}^{(n,k_{l},l)}$ :

If all the elements of ${\boldsymbol{z}}^{(n,l)}_{i,j,k_{l}}$ are zero, the corresponding pooling block in ${\bf S}_{i,j}^{(n,k_{l},l)}$ will be all zero and $X_{i,j}^{(n,k_{l},l+1)}$ will be zero.

where the vector version of the $(i,j)$ -th block of ${\bf{Z}}^{(n,k_{l},l)}$ is equal to ${\boldsymbol{z}}^{(n,k_{l},l)}_{i,j}$ and $\odot$ is the Hadamard (element-wise) product operator. The hyperparameters are set as $a_{e}=b_{e}=a_{w}=b_{w}=10^{-6}$ .

We summarize distinctions between pretraining, and the top-down generative model.

A pair of consecutive layers is considered at a time during pretraining.

During the pretraining process, the residual term ${\bf E}^{(n,l)}$ is used to fit each layer.

In the top-down generative process, the residual is only employed at the bottom layer to fit the data.

During pretraining, the pooled activation weights ${\bf X}^{(n,l+1)}$ are sparse, encouraging a parsimonious convolutional dictionary representation.

The model parameters learned from pretraining are used to initialize the model when executing top-down model refinement, using the full generative model.

2 Conditional Posterior Distribution for Pretraining

$\displaystyle D^{(k_{l-1},k_{l},l)}_{i,j}|-\sim{\mathcal{N}}({{\boldsymbol{\Phi}}}^{(k_{l-1},k_{l},l)}_{i,j},{\boldsymbol{\Sigma}}_{i,j}^{(k_{l-1},k_{l},l)})$

$\displaystyle W_{i,j}^{(n,k_{l},l)}|-\sim{\mathcal{N}}({\boldsymbol{\Xi}}_{i,j}^{(n,k_{l},l)},{\boldsymbol{\Lambda}}_{i,j}^{(n,k_{l},l)})\\$

$\displaystyle\gamma_{w}^{(n,k_{l},l)}|-\sim\text{Gamma}\left(a_{w}+\frac{N_{sx}^{l}\times N_{sy}^{l}}{2},b_{w}+\frac{\|{{\bf W}}^{(k_{l-1},k_{l},l)}\|_{2}^{2}}{2}\right)$

Let $m\in\{1,...,p_{x}^{l}p_{y}^{l}\}$ ; from

$\displaystyle{\boldsymbol{\theta}}^{(n,k_{l},l)}|-\sim\text{Dir}({\alpha}^{(n,k_{l},l)})$

$\displaystyle\gamma_{e}^{(n,l)}|-\sim\text{Gamma}\left(a_{e}+\frac{N_{x}^{l}\times N_{y}^{l}\times K_{l-1}}{2},b_{e}+\sum_{k_{l-1}=1}^{K_{l-1}}\frac{\|{{\bf X}}^{-(n,k_{l-1},l)}\|_{2}^{2}}{2}\right)$