Learning Descriptor Networks for 3D Shape Synthesis and Analysis

Jianwen Xie, Zilong Zheng, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, Ying Nian Wu

1 Introduction

Recently, with the introduction of large 3D CAD datasets such as ShapeNet, some interesting attempts have been made at object recognition and synthesis based on voxelized 3D shape data. From the perspective of statistical modeling, the existing 3D models can be grouped into two main categories: (1) 3D discriminators, such as Voxnet, which aim to learn a mapping from 3D voxel input to semantic labels for the purpose of 3D object classification and recognition, and (2) 3D generators, such as 3D-GAN, which are latent variable models that assume that the 3D voxel signals are generated by some latent variables. The training of discriminators usually relies on large annotated datasets and is accomplished by direct minimization of the prediction errors, while the training of generators learns a mapping from the latent space to the 3D voxel data space.

The generator model, while useful for synthesizing 3D shape patterns, involves a challenging inference step (i.e., sampling from the posterior distribution) in maximum likelihood learning. Variational inference and adversarial learning methods are therefore commonly used, where an extra network is incorporated into the learning algorithm to get around the difficulty of the posterior inference.

The past few years have witnessed impressive progress in developing discriminator models and generator models for 3D shape data. However, there has not been much work in the literature on modeling 3D shape data with energy-based models. We call this type of model a descriptive model or descriptor network, because the model describes the data based on bottom-up descriptive features learned from the data. The focus of the present paper is to develop a volumetric 3D descriptor network for voxelized shape data. It can be considered an alternative to 3D-GAN for 3D shape generation.

1.2 3D shape descriptor network

Specifically, we present a novel framework for probabilistic modeling of volumetric shape patterns by combining the merits of the energy-based model and the volumetric convolutional neural network. The model is a probability density function directly defined on the voxelized shape signal, in the form of a deep convolutional energy-based model, where the feature statistics or the energy function is defined by a bottom-up volumetric ConvNet that maps the 3D shape signal to the features. We call the proposed model the 3D DescriptorNet, because it uses a volumetric ConvNet to extract 3D shape features from the voxelized data.

The training of the proposed model follows an “analysis by synthesis” scheme. Unlike variational inference or adversarial learning, the proposed model does not need to incorporate an extra inference network or an adversarial discriminator in the learning process. The learning and sampling processes are guided by the same set of parameters of a single model, which makes it a particularly natural and statistically rigorous framework for probabilistic 3D shape modeling.

Modeling 3D shape data by a probability density function provides distinctive advantages. First, it is able to synthesize realistic 3D shape patterns by sampling examples from the distribution via MCMC, such as Langevin dynamics. Second, the model can be modified into a conditional version, which is useful for 3D object recovery and 3D object super-resolution. Specifically, a conditional probability density function that maps the corrupted (or low resolution) 3D object to the recovered (or high resolution) 3D object is trained, and then the 3D recovery (or 3D super-resolution) can be achieved by sampling from the learned conditional distribution given the corrupted or low resolution 3D object as the conditional input. Third, the model can be used in a cooperative training scheme, as opposed to adversarial training, to train a 3D generator model via MCMC teaching. The training of the 3D generator in such a scheme is stable and does not suffer from mode collapse. Fourth, the model is useful for semi-supervised learning. After learning the model from unlabeled data, the learned features can be used to train a classifier on the labeled data.

We show that the proposed 3D DescriptorNet can be used to synthesize realistic 3D shape patterns, and its conditional version is useful for 3D object recovery and 3D object super-resolution. The 3D generator trained by 3D DescriptorNet in a cooperative scheme carries semantic information about 3D objects. The feature maps trained by 3D DescriptorNet in an unsupervised manner are useful for 3D object classification.

1.3 Related work

3D object synthesis. Researchers in the fields of graphics and vision have studied 3D object synthesis problems. However, most of these object synthesis methods are nonparametric, and they generate new patterns by retrieving and merging parts from an existing database. Our model is a parametric probabilistic model that requires learning from the observed data. 3D object synthesis can be achieved by running MCMC, such as Langevin dynamics, to draw samples from the learned distribution.

3D deep learning. Recently, the vision community has witnessed the success of deep learning, and researchers have used deep learning models, such as convolutional deep belief networks, deep convolutional neural networks, and deep convolutional generative adversarial nets (GAN), to model 3D objects for the sake of synthesis and analysis. Our proposed 3D model is also powered by ConvNets. It incorporates a bottom-up 3D ConvNet structure for defining the probability density, and learns the parameters of the ConvNet by an “analysis by synthesis” scheme.

Descriptive models for synthesis. Our model is related to the following descriptive models: the FRAME (Filters, Random field, And Maximum Entropy) model, which was developed for modeling stochastic textures, and the sparse FRAME model, which was used for modeling object patterns. Inspired by the successes of deep convolutional neural networks (CNNs or ConvNets), a deep FRAME model has been proposed, where the linear filters used in the original FRAME model are replaced by the non-linear filters at a certain convolutional layer of a pre-trained deep ConvNet. Instead of using filters from a pre-trained ConvNet, the ConvNet filters can also be learned from the observed data by maximum likelihood estimation. The resulting model is called the generative ConvNet, which can be considered a recursive multi-layer generalization of the original FRAME model.

Building on earlier work, an introspective learning method has recently been developed to learn the energy-based model, where the energy function is discriminatively learned.

1.4 Contributions

(1) We propose a 3D deep convolutional energy-based model that we call 3D DescriptorNet for modeling 3D object patterns by combining the volumetric ConvNets and the generative ConvNets. (2) We present a mode seeking and mode shifting interpretation of the learning process of the model. (3) We present an adversarial interpretation of the zero temperature limit of the learning process. (4) We propose a conditional learning method for recovery tasks. (5) We propose metrics that can be useful for evaluating 3D generative models. (6) We provide a 3D cooperative training scheme as an alternative to the adversarial learning method for training a 3D generator.

2 3D DescriptorNet

The 3D DescriptorNet is a 3D deep convolutional energy-based model defined on the volumetric data $Y$, in the form of exponential tilting of a reference distribution:

$$p(Y;\theta)=\frac{1}{Z(\theta)}\exp\left[f(Y;\theta)\right]p_{0}(Y), \tag{1}$$

where $p_{0}(Y)$ is the reference distribution, such as the Gaussian white noise model $p_{0}(Y)\propto\exp\left(-{\|Y\|^{2}}/{2s^{2}}\right)$, and $f(Y;\theta)$ is defined by a bottom-up 3D volumetric ConvNet whose parameters are denoted by $\theta$. $Z(\theta)=\int\exp\left[f(Y;\theta)\right]p_{0}(Y)dY$ is the normalizing constant or partition function, which is analytically intractable. The energy function is

$${\cal E}(Y;\theta)=\frac{\|Y\|^{2}}{2s^{2}}-f(Y;\theta). \tag{2}$$

We may also take $p_{0}(Y)$ to be the uniform distribution within a bounded range. Then ${\cal E}(Y;\theta)=-f(Y;\theta)$.
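As an illustration of the energy function under the Gaussian reference distribution, the following is a minimal sketch on a flattened voxel vector. The function `toy_f` is a hypothetical stand-in for the ConvNet score $f(Y;\theta)$ (a single linear filter followed by a ReLU), not the architecture used in the experiments, and the names `energy` and `unnormalized_density` are chosen here for illustration only.

```python
import math


def toy_f(y, theta):
    """Hypothetical stand-in for f(Y; theta): one linear filter plus ReLU."""
    response = sum(t * v for t, v in zip(theta, y))
    return max(0.0, response)


def energy(y, theta, s=0.5, f=toy_f):
    """Energy ||Y||^2 / (2 s^2) - f(Y; theta) for a Gaussian reference."""
    sq_norm = sum(v * v for v in y)
    return sq_norm / (2.0 * s * s) - f(y, theta)


def unnormalized_density(y, theta, s=0.5, f=toy_f):
    """exp(-energy), i.e. the density up to the intractable constant Z(theta)."""
    return math.exp(-energy(y, theta, s, f))
```

Only the unnormalized density is computable in this sketch, which reflects the fact that $Z(\theta)$ is intractable and never evaluated during learning.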

2.2 Analysis by synthesis

The maximum likelihood estimation (MLE) of the 3D DescriptorNet follows an “analysis by synthesis” scheme. Suppose we observe 3D training examples $\{Y_{i},i=1,...,n\}$ from an unknown data distribution $P_{\rm data}(Y)$. The MLE seeks to maximize the log-likelihood function $L(\theta)=\frac{1}{n}\sum_{i=1}^{n}\log p(Y_{i};\theta)$. If the sample size $n$ is large, the maximum likelihood estimator minimizes ${\rm KL}(P_{\rm data}\parallel p_{\theta})$, the Kullback-Leibler divergence from the data distribution $P_{\rm data}$ to the model distribution $p_{\theta}$. The gradient of $L(\theta)$ is

$$\frac{\partial}{\partial\theta}L(\theta)=\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial\theta}f(Y_{i};\theta)-{\rm E}_{\theta}\left[\frac{\partial}{\partial\theta}f(Y;\theta)\right], \tag{3}$$

where ${\rm E}_{\theta}$ denotes the expectation with respect to $p(Y;\theta)$. The expectation term in equation (3) is due to $\frac{\partial}{\partial\theta}\log Z(\theta)={\rm E}_{\theta}[\frac{\partial}{\partial\theta}f(Y;\theta)]$, which is analytically intractable and has to be approximated by MCMC, such as Langevin dynamics, which iterates the following step:

$$Y_{\tau+\Delta\tau}=Y_{\tau}-\frac{\Delta\tau}{2}\frac{\partial}{\partial Y}{\cal E}(Y_{\tau};\theta)+\sqrt{\Delta\tau}\,\epsilon_{\tau}, \tag{4}$$

where $\tau$ indexes the time steps of the Langevin dynamics, $\Delta\tau$ is the discretized step size, and $\epsilon_{\tau}\sim{\rm N}(0,I)$ is the Gaussian white noise term. The Langevin dynamics consists of a deterministic part, which is a gradient descent on a landscape defined by ${\cal E}(Y;\theta)$, and a stochastic part, which is a Brownian motion that helps the chain to escape spurious local minima of the energy ${\cal E}(Y;\theta)$.
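The Langevin update described above can be sketched in a few lines. This is an illustrative sketch only: it assumes the standard discretization with a gradient term scaled by half the step size and a noise term scaled by the square root of the step size, and `toy_grad_energy` is a hypothetical quadratic energy used in place of the ConvNet-defined energy.

```python
import math
import random


def langevin_step(y, grad_energy, step_size, rng, noise=True):
    """One update: Y <- Y - (step/2) * dE/dY + sqrt(step) * eps."""
    grad = grad_energy(y)
    scale = math.sqrt(step_size) if noise else 0.0
    return [v - 0.5 * step_size * g + scale * rng.gauss(0.0, 1.0)
            for v, g in zip(y, grad)]


def run_langevin(y0, grad_energy, step_size=0.1, n_steps=20, seed=0, noise=True):
    """Run a finite number of Langevin steps starting from y0."""
    rng = random.Random(seed)
    y = list(y0)
    for _ in range(n_steps):
        y = langevin_step(y, grad_energy, step_size, rng, noise)
    return y


def toy_grad_energy(y, mode=(1.0, -2.0), s=0.5):
    """Gradient of the hypothetical energy ||y - mode||^2 / (2 s^2)."""
    return [(v - m) / (s * s) for v, m in zip(y, mode)]
```

With `noise=False` the chain reduces to plain gradient descent on the energy and settles at the mode of the toy energy, while with noise enabled it fluctuates around that mode.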

2.3 Mode seeking and mode shifting

The above “analysis by synthesis” learning scheme can be interpreted as a mode seeking and mode shifting process. We can rewrite equation (5) in the form of

Equation (6) reveals that the gradient of the log-likelihood $L(\theta)$ coincides with the gradient of $V$.

The training algorithm of the 3D DescriptorNet is presented in Algorithm 1.

2.4 Alternating back-propagation

Both the mode seeking (sampling) and mode shifting (learning) steps involve the derivatives of $f(Y;\theta)$ with respect to $Y$ and $\theta$ respectively. Both derivatives can be computed efficiently by back-propagation. The algorithm is thus in the form of alternating back-propagation that iterates the following two steps: (1) Sampling back-propagation: Revise the synthesized examples by Langevin dynamics or gradient descent. (2) Learning back-propagation: Update the model parameters given the synthesized and the observed examples by gradient ascent.
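The two alternating steps can be illustrated on a one-dimensional toy problem. The sketch below is not the training algorithm of the 3D DescriptorNet; it assumes a hypothetical score $f(y;\theta)=\theta y$ with a Gaussian reference of standard deviation $s$, for which the model is Gaussian with mean $s^{2}\theta$, so the learned $\theta$ can be checked by hand. Persistent chains initialized at zero and the learning rate are also assumptions made for illustration.

```python
import math
import random


def learn_theta(data, s=0.5, step_size=0.1, n_langevin=20, n_iters=300,
                lr=0.5, n_chains=25, noise=True, seed=0):
    """Toy analysis by synthesis for f(y; theta) = theta * y."""
    rng = random.Random(seed)
    theta = 0.0
    chains = [0.0] * n_chains  # synthesized examples, kept across iterations
    data_mean = sum(data) / len(data)
    for _ in range(n_iters):
        # Sampling step: revise synthesized examples by Langevin dynamics
        # on the energy y^2 / (2 s^2) - theta * y.
        for _ in range(n_langevin):
            revised = []
            for y in chains:
                grad_energy = y / (s * s) - theta
                eps = rng.gauss(0.0, 1.0) if noise else 0.0
                revised.append(y - 0.5 * step_size * grad_energy
                               + math.sqrt(step_size) * eps)
            chains = revised
        # Learning step: gradient ascent with df/dtheta = y, i.e. the
        # observed mean minus the synthesized mean.
        synth_mean = sum(chains) / len(chains)
        theta += lr * (data_mean - synth_mean)
    return theta
```

For data with mean 0.5 and $s=0.5$, the maximum likelihood solution in this toy model is $\theta=0.5/s^{2}=2$, which the noise-free run approaches closely; the noisy run fluctuates around the same value.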

2.5 Zero temperature limit

We can add a temperature term to the model, $p_{T}(Y;\theta)=\exp(-{\cal E}(Y;\theta)/T)/Z_{T}(\theta)$, where the original model corresponds to $T=1$. At the zero temperature limit, as $T\rightarrow 0$, the Langevin sampling becomes gradient descent, since the noise term diminishes in comparison to the gradient descent term. The resulting algorithm approximately solves the minimax problem below

2.6 Conditional learning for recovery

The conditional distribution $p(Y|C(Y)=c;\theta)$ can be derived from $p(Y;\theta)$. This conditional form of the 3D DescriptorNet can be used for recovery tasks such as inpainting and super-resolution. In inpainting, $C(Y)$ consists of the visible part of $Y$. In super-resolution, $C(Y)$ is the low resolution version of $Y$. For such tasks, we can learn the model from the fully observed training data $\{Y_{i},i=1,...,n\}$ by maximizing the conditional log-likelihood

$$L(\theta)=\frac{1}{n}\sum_{i=1}^{n}\log p(Y_{i}\mid C(Y_{i})=c_{i};\theta),$$

where $c_{i}$ is the observed value of $C(Y_{i})$. The learning and sampling algorithm is essentially the same as maximizing the original log-likelihood, except that in the Langevin sampling step, we need to sample from the conditional distribution, which amounts to fixing $C(Y_{\tau})$ in the sampling process. The zero temperature limit (with the noise term in the Langevin dynamics disabled) approximately solves the following minimax problem

3 Teaching 3D generator net

We can let a 3D generator network learn from the MCMC sampling of the 3D DescriptorNet, so that the 3D generator network can be used as an approximate direct sampler of the 3D DescriptorNet.

The 3D generator model is a 3D non-linear multi-layer generalization of the traditional factor analysis model. The generator model has the following form

$$Z\sim{\rm N}(0,I_{d}),\qquad Y=g(Z;\alpha)+\epsilon,$$

where $Z$ is a $d$-dimensional vector of latent factors that follow ${\rm N}(0,1)$ independently, and the 3D object $Y$ is generated by first sampling $Z$ from its known prior distribution ${\rm N}(0,I_{d})$ and then transforming $Z$ to the $D$-dimensional $Y$ by a top-down deconvolutional network $g(Z;\alpha)$ plus the white noise $\epsilon$. $\alpha$ denotes the parameters of the generator.
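The generative process described above (sample the latent factors, transform them top-down, add white noise) can be sketched as follows. The function `toy_generator` is a hypothetical single linear layer with a tanh non-linearity standing in for the deconvolutional network $g(Z;\alpha)$, and `noise_std` is an illustrative parameter for the scale of the white noise.

```python
import math
import random


def toy_generator(z, alpha):
    """Hypothetical stand-in for g(Z; alpha): linear map followed by tanh."""
    weights, bias = alpha
    return [math.tanh(sum(w * zj for w, zj in zip(row, z)) + b)
            for row, b in zip(weights, bias)]


def sample_shape(alpha, d, rng, noise_std=0.0):
    """Draw Z ~ N(0, I_d), then return (Z, g(Z; alpha) + noise)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    y = [v + noise_std * rng.gauss(0.0, 1.0) for v in toy_generator(z, alpha)]
    return z, y
```

Here the latent dimension is the length of each weight row and the output dimension is the number of rows, mirroring the mapping from the $d$-dimensional $Z$ to the $D$-dimensional $Y$.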

3.2 MCMC teaching of 3D generator net

The 3D generator model can be trained simultaneously with the 3D DescriptorNet in a cooperative training scheme. The basic idea is to use the 3D generator to generate examples to initialize a finite step Langevin dynamics for training the 3D DescriptorNet. In return, the 3D generator learns from how the Langevin dynamics changes the initial examples it generates.
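The cooperative idea above can be sketched on a one-dimensional toy problem. This is a simplified illustration under stated assumptions, not the training algorithm of the paper: the descriptor score is the hypothetical $f(y;\theta)=\theta y$ with a Gaussian reference, the generator is the hypothetical linear map $g(z;\alpha)=az+b$, and the generator update is assumed to be a least-squares regression step of the Langevin-revised examples on the latent factors that produced the initial examples.

```python
import math
import random


def cooperative_train(data, s=0.5, step_size=0.1, n_langevin=20, n_iters=500,
                      lr_descriptor=0.5, lr_generator=0.1, batch=50,
                      noise=True, seed=0):
    """Toy cooperative training of a descriptor (theta) and generator (a, b)."""
    rng = random.Random(seed)
    theta = 0.0       # descriptor parameter, f(y; theta) = theta * y
    a, b = 1.0, 0.0   # generator parameters, g(z; alpha) = a * z + b
    data_mean = sum(data) / len(data)
    for _ in range(n_iters):
        # The generator proposes initial examples from latent factors.
        zs = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        ys = [a * z + b for z in zs]
        # The descriptor revises them with a finite number of Langevin steps.
        for _ in range(n_langevin):
            revised = []
            for y in ys:
                grad_energy = y / (s * s) - theta
                eps = rng.gauss(0.0, 1.0) if noise else 0.0
                revised.append(y - 0.5 * step_size * grad_energy
                               + math.sqrt(step_size) * eps)
            ys = revised
        # Descriptor update: observed statistics minus synthesized statistics.
        theta += lr_descriptor * (data_mean - sum(ys) / batch)
        # Generator update: regress the revised examples on the same z.
        residuals = [y - (a * z + b) for y, z in zip(ys, zs)]
        grad_a = sum(r * z for r, z in zip(residuals, zs)) / batch
        grad_b = sum(residuals) / batch
        a += lr_generator * grad_a
        b += lr_generator * grad_b
    return theta, a, b
```

In the noise-free setting the revised examples collapse toward the mode of the descriptor, so the generator offset `b` moves to the data mean while the slope `a` shrinks; with noise enabled the generator is instead taught by stochastic revisions of its own samples.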

4 Experiments

Project page: The code and more results and details can be found at http://www.stat.ucla.edu/~jxie/3DDescriptorNet/3DDescriptorNet.html

We conduct experiments on synthesizing 3D objects of categories from the ModelNet dataset. Specifically, we use ModelNet10, a 10-category subset of ModelNet that is commonly used as a benchmark for 3D object analysis. The categories are chair, sofa, bathtub, toilet, bed, desk, table, nightstand, dresser, and monitor. The size of the training set for each category ranges from 100 to 700.

For the qualitative experiment, we learn one 3-layer 3D DescriptorNet for each object category in ModelNet10. The first layer has 200 $16\times 16\times 16$ filters with sub-sampling of 3, the second layer has 100 $6\times 6\times 6$ filters with sub-sampling of 2, and the final layer is a fully connected layer with a single filter that covers the whole voxel grid. We add ReLU layers between convolutional layers. We fix the standard deviation of the reference distribution of the model to be $s=0.5$. The number of Langevin dynamics steps in each learning iteration is $l=20$ and the step size is $\Delta\tau=0.1$. We use Adam for optimization with $\beta_{1}=0.5$ and $\beta_{2}=0.999$. The learning rate is 0.001. The number of learning iterations is 3,000. We disable the noise term in the Langevin step after 100 iterations. The training data are of size $32\times 32\times 32$ voxels, whose values are 0 or 1. We prepare the training data by subtracting the mean value from the data. Each voxel value of the synthesized data is discretized into 0 or 1 by comparing with a threshold of 0.5. The mini-batch size is 20. The number of parallel sampling chains for each batch is 25.

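To quantitatively evaluate our model, we adopt the Inception score, which uses a reference convolutional neural network to compute

$$I(\{\tilde{Y}_{i}\})=\exp\left({\rm E}_{\tilde{Y}}\left[{\rm KL}\left(p(c|\tilde{Y})\parallel p(c)\right)\right]\right),$$

where $c$ denotes the category label, $p(c|\tilde{Y})$ is the class distribution that the reference network predicts for a synthesized example $\tilde{Y}$, and $p(c)$ is the marginal of these predictions over the synthesized examples.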

We learn a single model from mixed 3D objects from the training sets of the 10 3D object categories of the ModelNet10 dataset. Table 1 reports the Inception scores of our model as well as a comparison with some baseline models including 3D-GAN, 3D ShapeNets, and 3D-VAE.

We also evaluate the quality of the 3D shapes synthesized by the model learned from a single category, using the average softmax class probability that the reference network assigns to the synthesized examples for the underlying category. Table 2 displays the results for all 10 categories. It can be seen that our model generates 3D shapes with higher softmax class probabilities than the other baseline models.
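Both evaluation metrics operate on the softmax outputs of the reference network. The sketch below assumes those outputs are already available as a list of per-example class probability vectors (the reference network itself is not modeled), and uses the standard definition of the Inception score as the exponential of the average KL divergence between each predicted class distribution and their marginal. The function names are chosen here for illustration.

```python
import math


def inception_score(class_probs):
    """exp of the mean KL(p(c|Y) || p(c)) over the synthesized examples."""
    n = len(class_probs)
    k = len(class_probs[0])
    marginal = [sum(p[c] for p in class_probs) / n for c in range(k)]
    total_kl = 0.0
    for p in class_probs:
        total_kl += sum(pc * math.log(pc / mc)
                        for pc, mc in zip(p, marginal) if pc > 0.0)
    return math.exp(total_kl / n)


def mean_class_probability(class_probs, category):
    """Average softmax probability assigned to the underlying category."""
    return sum(p[category] for p in class_probs) / len(class_probs)
```

The score is 1 when every example receives the same prediction and reaches the number of categories when predictions are confident and evenly spread over all categories.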

4.2 3D object recovery

4.3 3D object super-resolution

We test the conditional 3D DescriptorNet on the 3D object super-resolution task. Similar to Experiment 4.2, we can perform super-resolution on a low resolution 3D object by sampling from a conditional 3D DescriptorNet $p(Y_{\rm high}|Y_{\rm low},\theta)$, where $Y_{\rm high}$ denotes a high resolution version of $Y_{\rm low}$. The sampling of the conditional model $p(Y_{\rm high}|Y_{\rm low},\theta)$ is accomplished by the Langevin dynamics initialized with the given low resolution 3D object that needs to be super-resolved. In the learning stage, we learn the conditional model from the fully observed training 3D objects as well as their low resolution versions. To specialize the learned model to this super-resolution task, in the training process, we down-scale each fully observed training 3D object $Y_{\rm high}$ into a low resolution version $Y_{\rm low}$, which leads to information loss. In each iteration, we first up-scale $Y_{\rm low}$ by expanding each voxel of $Y_{\rm low}$ into a $d\times d\times d$ block of constant values (where $d$ is the ratio between the sizes of $Y_{\rm high}$ and $Y_{\rm low}$) to obtain an up-scaled version $Y^{\prime}_{\rm high}$ of $Y_{\rm low}$. The up-scaled $Y^{\prime}_{\rm high}$ is not identical to the original high resolution $Y_{\rm high}$, since the high resolution details are lost. We then run Langevin dynamics starting from $Y^{\prime}_{\rm high}$. The parameters $\theta$ are then updated by gradient ascent according to (5). Figure 3 shows some qualitative results of 3D super-resolution, where we use a 2-layer conditional 3D DescriptorNet. The first layer has 200 $16\times 16\times 16$ filters with sub-sampling of 3. The second layer is a fully connected layer with a single filter. The Langevin step size is 0.01.
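The down-scaling and up-scaling operations described above can be sketched on a cubic voxel grid stored as nested lists. The sketch assumes that down-scaling averages each $d\times d\times d$ block, which is one possible choice, and that the grid size is divisible by $d$; the function names are illustrative.

```python
def downscale(y, d):
    """Average each d x d x d block of a cubic voxel grid (nested lists)."""
    m = len(y) // d
    out = [[[0.0] * m for _ in range(m)] for _ in range(m)]
    for i in range(m):
        for j in range(m):
            for k in range(m):
                total = sum(y[i * d + a][j * d + b][k * d + c]
                            for a in range(d)
                            for b in range(d)
                            for c in range(d))
                out[i][j][k] = total / d ** 3
    return out


def upscale(y_low, d):
    """Expand each voxel into a d x d x d block of constant values."""
    n = len(y_low) * d
    return [[[y_low[i // d][j // d][k // d] for k in range(n)]
             for j in range(n)]
            for i in range(n)]
```

Down-scaling an up-scaled grid returns the low resolution grid, whereas up-scaling a down-scaled grid does not in general return the original, which is the information loss the Langevin dynamics is meant to recover.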

To be more specific, let $Y_{\rm low}=CY_{\rm high}$, where $C$ is the down-scaling matrix, e.g., each voxel of $Y_{\rm low}$ is the average of the corresponding $d\times d\times d$ block of $Y_{\rm high}$. Let $C^{-}$ be the pseudo-inverse of $C$, e.g., $C^{-}Y_{\rm low}$ gives us a high resolution shape by expanding each voxel of $Y_{\rm low}$ into a $d\times d\times d$ block of constant values. Then the sampling of $p(Y_{\rm high}|Y_{\rm low};\theta)$ is similar to sampling the unconditioned model $p(Y_{\rm high};\theta)$, except that for each step of the Langevin dynamics, with $\Delta Y$ denoting the change of $Y$, we update $Y\leftarrow Y+(I-C^{-}C)\Delta Y$, i.e., we project $\Delta Y$ onto the null space of $C$, so that the low resolution version of $Y$, i.e., $CY$, remains fixed. From this perspective, super-resolution is similar to inpainting, except that the visible voxels are replaced by low resolution voxels.
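The projected update can be sketched on a flattened one-dimensional signal, where blocks of length $d$ play the role of the $d\times d\times d$ voxel blocks. The sketch assumes $C$ is block averaging, so that applying $C^{-}C$ replaces every block by its mean; `masked_step` is an illustrative analogue for inpainting in which visible entries are simply left untouched. All function names are assumptions made for this example.

```python
def block_mean_expand(v, d):
    """Apply C^- C: replace every length-d block by its block mean."""
    out = []
    for start in range(0, len(v), d):
        block = v[start:start + d]
        mean = sum(block) / len(block)
        out.extend([mean] * len(block))
    return out


def project_update(delta, d):
    """Apply (I - C^- C) to delta, removing its low resolution component."""
    smooth = block_mean_expand(delta, d)
    return [dv - sv for dv, sv in zip(delta, smooth)]


def constrained_step(y, delta, d):
    """Update Y <- Y + (I - C^- C) delta so that block means stay fixed."""
    return [yv + pv for yv, pv in zip(y, project_update(delta, d))]


def masked_step(y, delta, visible):
    """Inpainting analogue: only the non-visible entries are updated."""
    return [yv if vis else yv + dv for yv, dv, vis in zip(y, delta, visible)]
```

After a constrained step, the block means of the signal are unchanged, which is the one-dimensional counterpart of $CY$ remaining fixed during sampling.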

4.4 Analyzing the learned 3D generator

We evaluate a 3D generator trained by a 3D DescriptorNet via MCMC teaching. The generator network $g(Z;\alpha)$ has 4 layers of volumetric deconvolution with $4\times 4\times 4$ kernels, with up-sampling factors $\{1,2,2,2\}$ at the respective layers. The numbers of channels at the respective layers are 256, 128, 64, and 1. There is a fully connected layer under the 100-dimensional latent factors $Z$. The output size is $32\times 32\times 32$. Batch normalization and ReLU layers are used between deconvolution layers, and a tanh non-linearity is added at the bottom layer. We train a 3D DescriptorNet with the above 3D generator as a sampler in the cooperative training scheme presented in Algorithm 2 for the categories of toilet, sofa, and nightstand in the ModelNet10 dataset independently. The 3D DescriptorNet has a 4-layer network, where the first layer has 64 $9\times 9\times 9$ filters, the second layer has 128 $7\times 7\times 7$ filters, the third layer has 256 $4\times 4\times 4$ filters, and the fourth layer is a fully connected layer with a single filter. The sub-sampling factors are $\{2,2,2,1\}$. ReLU layers are used between convolutional layers.

We use Adam for optimization of the 3D DescriptorNet with $\beta_{1}=0.4$ and $\beta_{2}=0.999$, and for optimization of the 3D generator with $\beta_{1}=0.6$ and $\beta_{2}=0.999$. The learning rates for the 3D DescriptorNet and the 3D generator are 0.001 and 0.0003 respectively. The number of parallel chains is 50, and the mini-batch size is 50. The training data are scaled into a fixed range, and the synthesized data are re-scaled back for visualization. Figure 5 shows some examples of 3D objects generated by the 3D generators trained by the 3D DescriptorNet via MCMC teaching.

We show results of interpolating between two latent vectors $Z$ in Figure 5. For each row, the 3D objects at the two ends are generated from $Z$ vectors that are randomly sampled from ${\rm N}(0,I_{d})$. Each object in the middle is obtained by first interpolating the $Z$ vectors of the two end objects, and then generating the objects using the 3D generator. We observe smooth transitions in 3D shape structure, and most intermediate objects are also physically plausible. This experiment demonstrates that the learned 3D generator embeds the 3D object distribution into a smooth low dimensional manifold. Another way to investigate the learned 3D generator is to show shape arithmetic in the latent space. As shown in Figure 6, the 3D generator is able to encode semantic knowledge of 3D shapes in its latent space such that arithmetic can be performed on $Z$ vectors for visual concept manipulation of 3D shapes.
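The latent-space operations described above can be sketched as follows. The sketch assumes linear interpolation between the two end vectors and a simple "a minus b plus c" form of latent arithmetic; neither the interpolation rule nor the exact arithmetic used for the figures is specified in the text, so both are illustrative choices. Each resulting vector would then be passed through the generator to obtain a shape.

```python
def interpolate(z_a, z_b, n_points):
    """Linearly interpolate between two latent vectors (n_points >= 2)."""
    path = []
    for i in range(n_points):
        t = i / (n_points - 1)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path


def latent_arithmetic(z_a, z_b, z_c):
    """Hypothetical concept manipulation: z_a - z_b + z_c, element-wise."""
    return [a - b + c for a, b, c in zip(z_a, z_b, z_c)]
```

The first and last vectors of the interpolation path coincide with the two sampled end points, so the corresponding generated objects are the ones shown at the two ends of each row.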

4.5 3D object classification

We evaluate the feature maps learned by our 3D DescriptorNet. We perform a classification experiment on the ModelNet10 dataset. We first train a single model on all categories of the training set in an unsupervised manner. The network architecture and learning configuration are the same as those used for synthesis in Section 4.1. Then we use the model as a feature extractor. Specifically, for each input 3D object, we use the model to extract its first and second layers of feature maps, apply max pooling of kernel sizes $4\times 4\times 4$ and $2\times 2\times 2$ respectively, and concatenate the outputs as a feature vector of length 8,100. We train a multinomial logistic regression classifier from labeled data based on the extracted feature vectors for classification. We evaluate the classification accuracy of the classifier on the testing data using the one-versus-all rule. For comparison, Table 4 lists 8 published results on this dataset obtained by other baseline methods. Our method outperforms the other methods in terms of classification accuracy on this dataset.
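The stated feature length can be checked with a short calculation. The padding convention is not given in the text; the sketch below assumes "same"-style padding, so that each strided layer maps a spatial size $n$ to $\lceil n/\text{stride}\rceil$, and assumes max pooling with a stride equal to its kernel size. Under these assumptions, and with the 200 and 100 filters and sub-sampling factors of 3 and 2 from the synthesis configuration, the arithmetic reproduces the length of 8,100.

```python
import math


def same_out(size, stride):
    """Spatial output size under the assumed 'same'-style padding."""
    return math.ceil(size / stride)


def feature_length(grid=32, layers=((200, 3, 4), (100, 2, 2))):
    """Sum pooled feature sizes; each layer is (channels, stride, pool)."""
    size = grid
    total = 0
    for channels, stride, pool in layers:
        size = same_out(size, stride)     # feature map size after the layer
        pooled = same_out(size, pool)     # size after max pooling
        total += channels * pooled ** 3
    return total
```

With these assumptions the first layer contributes $200\times 3^{3}=5{,}400$ values and the second contributes $100\times 3^{3}=2{,}700$, for a total of 8,100.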

Conclusion

We propose the 3D DescriptorNet for volumetric object synthesis, and the conditional 3D DescriptorNet for 3D object recovery and 3D object super resolution. The proposed model is a deep convolutional energy-based model, which can be trained by an “analysis by synthesis” scheme. The training of the model can be interpreted as a mode seeking and mode shifting process, and the zero temperature limit has an adversarial interpretation. A 3D generator can be taught by the 3D DescriptorNet via MCMC teaching. Experiments demonstrate that our models are able to generate realistic 3D shape patterns and are useful for 3D shape analysis.

Acknowledgment

The work is supported by Hikvision gift fund, DARPA SIMPLEX N66001-15-C-4035, ONR MURI N00014-16-1-2007, DARPA ARO W911NF-16-1-0579, and DARPA N66001-17-2-4029. We thank Erik Nijkamp for his help on coding. We thank Siyuan Huang for helpful discussions.

References