LION: Latent Point Diffusion Models for 3D Shape Generation

Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, Karsten Kreis

Introduction

Generative modeling of 3D shapes has extensive applications in 3D content creation and has become an active area of research . However, to be useful as a tool for digital artists, generative models of 3D shapes have to fulfill several criteria: (i) Generated shapes need to be realistic and of high-quality without artifacts. (ii) The model should enable flexible and interactive use and refinement: For example, a user may want to refine a generated shape and synthesize versions with varying details. Or an artist may provide a coarse or noisy input shape, thereby guiding the model to produce multiple realistic high-quality outputs. Similarly, a user may want to interpolate different shapes. (iii) The model should output smooth meshes, which are the standard representation in most graphics software.

Existing 3D generative models build on various frameworks, including generative adversarial networks (GANs), variational autoencoders (VAEs), normalizing flows, autoregressive models, and more. Most recently, denoising diffusion models (DDMs) have emerged as powerful generative models, achieving outstanding results not only on image synthesis but also for point cloud-based 3D shape generation. In DDMs, the data is gradually perturbed by a diffusion process, while a deep neural network is trained to denoise. This network can then be used to synthesize novel data in an iterative fashion when initialized from random noise. However, existing DDMs for 3D shape synthesis struggle with simultaneously satisfying all criteria discussed above for practically useful 3D generative models.

Here, we aim to develop a DDM-based generative model of 3D shapes overcoming these limitations. We introduce the Latent Point Diffusion Model (LION) for 3D shape generation (see Fig. 1). Similar to previous 3D DDMs, LION operates on point clouds, but it is constructed as a VAE with DDMs in latent space. LION comprises a hierarchical latent space with a vector-valued global shape latent and another point-structured latent space. The latent representations are predicted with point cloud processing encoders, and two latent DDMs are trained in these latent spaces. Synthesis in LION proceeds by drawing novel latent samples from the hierarchical latent DDMs and decoding back to the original point cloud space. Importantly, we also demonstrate how to augment LION with modern surface reconstruction methods to synthesize smooth shapes as desired by artists. LION has multiple advantages:

Expressivity: By mapping point clouds into regularized latent spaces, the DDMs in latent space are effectively tasked with learning a smoothed distribution. This is easier than training on potentially complex point clouds directly, thereby improving expressivity. However, point clouds are, in principle, an ideal representation for DDMs. Because of that, we use latent points, that is, we keep a point cloud structure for our main latent representation. Augmenting the model with an additional global shape latent variable in a hierarchical manner further boosts expressivity. We validate LION on several popular ShapeNet benchmarks and achieve state-of-the-art synthesis performance.

Varying Output Types: Extending LION with Shape As Points (SAP) geometry reconstruction allows us to also output smooth meshes. Fine-tuning SAP on data generated by LION’s autoencoder reduces synthesis noise and enables us to generate high-quality geometry. LION combines (latent) point cloud-based modeling, ideal for DDMs, with surface reconstruction, desired by artists.

Flexibility: Since LION is set up as a VAE, it can be easily adapted for different tasks without re-training the latent DDMs: We can efficiently fine-tune LION’s encoders on voxelized or noisy inputs, which a user can provide for guidance. This enables multimodal voxel-guided synthesis and shape denoising. We also leverage LION’s latent spaces for shape interpolation and autoencoding. Optionally training the DDMs conditioned on CLIP embeddings enables image- and text-driven 3D generation.

In summary, we make the following contributions: (i) We introduce LION, a novel generative model for 3D shape synthesis, which operates on point clouds and is built on a hierarchical VAE framework with two latent DDMs. (ii) We validate LION’s high synthesis quality by reaching state-of-the-art performance on widely used ShapeNet benchmarks. (iii) We achieve high-quality and diverse 3D shape synthesis with LION even when trained jointly over many classes without conditioning. (iv) We propose to combine LION with SAP-based surface reconstruction. (v) We demonstrate the flexibility of our framework by adapting it to relevant tasks such as multimodal voxel-guided synthesis.

Background

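Traditionally, DDMs were introduced in a discrete-step fashion: Given samples $\mathbf{x}_0 \sim q(\mathbf{x}_0)$ from a data distribution, DDMs use a Markovian fixed forward diffusion process defined as

$$q(\mathbf{x}_{1:T}|\mathbf{x}_0) := \prod_{t=1}^{T} q(\mathbf{x}_t|\mathbf{x}_{t-1}), \qquad q(\mathbf{x}_t|\mathbf{x}_{t-1}) := \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \bm{I}), \tag{1}$$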

where $T$ denotes the number of steps and $q(\mathbf{x}_t|\mathbf{x}_{t-1})$ is a Gaussian transition kernel, which gradually adds noise to the input with a variance schedule $\beta_1,...,\beta_T$. The $\beta_t$ are chosen such that the chain approximately converges to a standard Gaussian distribution after $T$ steps, $q(\mathbf{x}_T) \approx \mathcal{N}(\mathbf{x}_T; \bm{0}, \bm{I})$. DDMs learn a parametrized reverse process (model parameters $\bm{\theta}$) that inverts the forward diffusion:

$$p_{\bm{\theta}}(\mathbf{x}_{0:T}) := p(\mathbf{x}_T) \prod_{t=1}^{T} p_{\bm{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t), \qquad p_{\bm{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t) := \mathcal{N}(\mathbf{x}_{t-1}; \mathbf{\mu}_{\bm{\theta}}(\mathbf{x}_t, t), \rho_t^2 \bm{I}). \tag{2}$$

This generative reverse process is also Markovian with Gaussian transition kernels, which use fixed variances $\rho_t^2$. DDMs can be interpreted as latent variable models, where $\mathbf{x}_1,...,\mathbf{x}_T$ are latents, and the forward process $q(\mathbf{x}_{1:T}|\mathbf{x}_0)$ acts as a fixed approximate posterior, to which the generative $p_{\bm{\theta}}(\mathbf{x}_{0:T})$ is fit. DDMs are trained by minimizing the variational upper bound on the negative log-likelihood of the data $\mathbf{x}_0$ under $p_{\bm{\theta}}(\mathbf{x}_{0:T})$. Up to irrelevant constant terms, this objective can be expressed as

$$\min_{\bm{\theta}} \mathbb{E}_{t \sim U\{1,T\},\, \mathbf{x}_0 \sim q(\mathbf{x}_0),\, \bm{\epsilon} \sim \mathcal{N}(\bm{0}, \bm{I})} \left[ w(t)\, \|\bm{\epsilon} - \bm{\epsilon}_{\bm{\theta}}(\alpha_t \mathbf{x}_0 + \sigma_t \bm{\epsilon}, t)\|_2^2 \right], \qquad w(t) = \frac{\beta_t^2}{2\rho_t^2 (1-\beta_t)(1-\alpha_t^2)}, \tag{3}$$

where $\alpha_t = \sqrt{\prod_{s=1}^{t}(1-\beta_s)}$ and $\sigma_t = \sqrt{1-\alpha_t^2}$ are the parameters of the tractable diffused distribution after $t$ steps, $q(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \alpha_t \mathbf{x}_0, \sigma_t^2 \bm{I})$. Furthermore, Eq. (3) employs the widely used parametrization $\mathbf{\mu}_{\bm{\theta}}(\mathbf{x}_t, t) := \tfrac{1}{\sqrt{1-\beta_t}} \left( \mathbf{x}_t - \tfrac{\beta_t}{\sqrt{1-\alpha_t^2}} \bm{\epsilon}_{\bm{\theta}}(\mathbf{x}_t, t) \right)$. It is common practice to set $w(t)=1$, instead of the one in Eq. (3), which often promotes perceptual quality of the generated output. In the objective of Eq. (3), the model $\bm{\epsilon}_{\bm{\theta}}$ is, for all possible steps $t$ along the diffusion process, effectively trained to predict the noise vector $\bm{\epsilon}$ that is necessary to denoise an observed diffused sample $\mathbf{x}_t$. After training, the DDM can be sampled with ancestral sampling in an iterative fashion:

$$\mathbf{x}_{t-1} = \frac{1}{\sqrt{1-\beta_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\alpha_t^2}} \bm{\epsilon}_{\bm{\theta}}(\mathbf{x}_t, t) \right) + \rho_t \bm{\eta}, \tag{4}$$

where $\bm{\eta} \sim \mathcal{N}(\bm{\eta}; \bm{0}, \bm{I})$. This sampling chain is initialized from a random sample $\mathbf{x}_T \sim \mathcal{N}(\mathbf{x}_T; \bm{0}, \bm{I})$. Furthermore, the noise injection in Eq. (4) is usually omitted in the last sampling step.
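To make the discrete-time formulation concrete, the following is a minimal NumPy sketch of the variance schedule, the diffused distribution $q(\mathbf{x}_t|\mathbf{x}_0)$, the $w(t)=1$ objective, and ancestral sampling. It is an illustration under stated assumptions, not the paper's implementation: the linear schedule and its endpoints, the choice $\rho_t^2 = \beta_t$, and all function names are hypothetical, and `eps_model` stands in for a trained network.

```python
import numpy as np


def make_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    """Assumed linear variance schedule; returns beta_t, alpha_t, sigma_t for t = 1..T."""
    betas = np.linspace(beta_min, beta_max, T)
    alphas = np.sqrt(np.cumprod(1.0 - betas))   # alpha_t = sqrt(prod_{s<=t} (1 - beta_s))
    sigmas = np.sqrt(1.0 - alphas ** 2)         # sigma_t = sqrt(1 - alpha_t^2)
    return betas, alphas, sigmas


def diffuse(x0, t, alphas, sigmas, rng):
    """Draw x_t ~ q(x_t | x_0) = N(alpha_t x_0, sigma_t^2 I); t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    return alphas[t - 1] * x0 + sigmas[t - 1] * eps, eps


def noise_prediction_loss(eps_model, x0, t, alphas, sigmas, rng):
    """Single-sample estimate of the training objective with w(t) = 1."""
    x_t, eps = diffuse(x0, t, alphas, sigmas, rng)
    return float(np.sum((eps - eps_model(x_t, t)) ** 2))


def ancestral_sample(eps_model, shape, betas, sigmas, rng):
    """Iterate the reverse chain from x_T ~ N(0, I), assuming rho_t^2 = beta_t."""
    x = rng.standard_normal(shape)
    for t in range(len(betas), 0, -1):
        beta, sigma = betas[t - 1], sigmas[t - 1]
        x = (x - beta / sigma * eps_model(x, t)) / np.sqrt(1.0 - beta)
        if t > 1:  # no noise injection in the last sampling step
            x = x + np.sqrt(beta) * rng.standard_normal(shape)
    return x
```

As a sanity check, if the data were exactly standard Gaussian, the ideal noise predictor would be $\sigma_t \mathbf{x}_t$, and sampling with it should return approximately unit-variance samples.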

DDMs can also be expressed with a continuous-time framework . In this formulation, the diffusion and reverse generative processes are described by differential equations. This approach allows for deterministic sampling and encoding schemes based on ordinary differential equations (ODEs). We make use of this framework in Sec. 3.1 and we review this approach in more detail in App. B.

Hierarchical Latent Point Diffusion Models

We first formally introduce LION, then discuss various applications and extensions in Sec. 3.1, and finally recapitulate its unique advantages in Sec. 3.2. See Fig. 1 for a visualization of LION.

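First Stage Training. Initially, LION is trained by maximizing a modified variational lower bound on the data log-likelihood (ELBO) with respect to the encoder and decoder parameters $\bm{\phi}$ and $\bm{\xi}$, with the expectation taken over shapes $\mathbf{x}$ from the data distribution $p(\mathbf{x})$:

$$\mathcal{L}_{\textrm{ELBO}}(\bm{\phi}, \bm{\xi}) = \mathbb{E}_{p(\mathbf{x}),\, q_{\bm{\phi}}(\mathbf{z}_0|\mathbf{x}),\, q_{\bm{\phi}}(\mathbf{h}_0|\mathbf{x},\mathbf{z}_0)} \Big[ \log p_{\bm{\xi}}(\mathbf{x}|\mathbf{h}_0,\mathbf{z}_0) - \lambda_{\mathbf{z}} D_{\textrm{KL}}\big(q_{\bm{\phi}}(\mathbf{z}_0|\mathbf{x}) \,\|\, p(\mathbf{z}_0)\big) - \lambda_{\mathbf{h}} D_{\textrm{KL}}\big(q_{\bm{\phi}}(\mathbf{h}_0|\mathbf{x},\mathbf{z}_0) \,\|\, p(\mathbf{h}_0)\big) \Big]. \tag{5}$$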

Here, the global shape latent $\mathbf{z}_0$ is sampled from the posterior distribution $q_{\bm{\phi}}(\mathbf{z}_0|\mathbf{x})$, which is parametrized by factorial Gaussians, whose means and variances are predicted via an encoder network. The point cloud latent $\mathbf{h}_0$ is sampled from a similarly parametrized posterior $q_{\bm{\phi}}(\mathbf{h}_0|\mathbf{x},\mathbf{z}_0)$, while also conditioning on $\mathbf{z}_0$ ($\bm{\phi}$ denotes the parameters of both encoders). Furthermore, $p_{\bm{\xi}}(\mathbf{x}|\mathbf{h}_0,\mathbf{z}_0)$ denotes the decoder, parametrized as a factorial Laplace distribution with predicted means and fixed unit scale parameter (corresponding to an $L_1$ reconstruction loss). $\lambda_{\mathbf{z}}$ and $\lambda_{\mathbf{h}}$ are hyperparameters balancing reconstruction accuracy and Kullback-Leibler regularization (note that only for $\lambda_{\mathbf{z}}=\lambda_{\mathbf{h}}=1$ we are optimizing a rigorous ELBO). The priors $p(\mathbf{z}_0)$ and $p(\mathbf{h}_0)$ are $\mathcal{N}(\bm{0},\bm{I})$. Also see Fig. 1 again.
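The ingredients of this first-stage objective can be written out in a few lines. The sketch below evaluates the negative of the modified ELBO for a single shape, given already-computed encoder outputs and decoder means; it assumes diagonal Gaussian posteriors parametrized by mean and log-variance, and all function and argument names are hypothetical.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over all dimensions."""
    return 0.5 * float(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))


def laplace_nll(x, x_mean):
    """Negative log-likelihood of a factorial Laplace with unit scale.

    Equals the L1 distance plus the constant D * log(2), where D = x.size.
    """
    return float(np.sum(np.abs(x - x_mean)) + x.size * np.log(2.0))


def negative_modified_elbo(x, x_mean, mu_z, log_var_z, mu_h, log_var_h,
                           lambda_z=1.0, lambda_h=1.0):
    """Reconstruction term plus weighted KL terms for one shape.

    x, x_mean: (N, 3) point cloud and decoder means.
    mu_z, log_var_z: posterior parameters of the global shape latent.
    mu_h, log_var_h: posterior parameters of the latent points.
    """
    recon = laplace_nll(x, x_mean)
    kl_z = kl_to_standard_normal(mu_z, log_var_z)
    kl_h = kl_to_standard_normal(mu_h, log_var_h)
    return recon + lambda_z * kl_z + lambda_h * kl_h
```

With `lambda_z = lambda_h = 1` this is the negative of a proper single-sample ELBO estimate; other values trade reconstruction accuracy against regularization, as described above.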

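Second Stage Training. In principle, we could use the VAE’s priors to sample encodings and generate new shapes. However, the simple Gaussian priors will not accurately match the encoding distribution from the training data and therefore produce poor samples (prior hole problem). This motivates training highly expressive latent DDMs. In particular, in the second stage we freeze the VAE’s encoder and decoder networks and train two latent DDMs on the encodings $\mathbf{z}_0$ and $\mathbf{h}_0$ sampled from $q_{\bm{\phi}}(\mathbf{z}_0|\mathbf{x})$ and $q_{\bm{\phi}}(\mathbf{h}_0|\mathbf{x},\mathbf{z}_0)$ for shapes $\mathbf{x}$ from the data distribution $p(\mathbf{x})$, minimizing score matching (SM) objectives similar to Eq. (3):

$$\mathcal{L}_{\textrm{SM}^{\mathbf{z}}}(\bm{\theta}) = \mathbb{E}_{t \sim U\{1,T\},\, p(\mathbf{x}),\, q_{\bm{\phi}}(\mathbf{z}_0|\mathbf{x}),\, \bm{\epsilon} \sim \mathcal{N}(\bm{0},\bm{I})} \|\bm{\epsilon} - \bm{\epsilon}_{\bm{\theta}}(\mathbf{z}_t, t)\|_2^2, \tag{6}$$

$$\mathcal{L}_{\textrm{SM}^{\mathbf{h}}}(\bm{\psi}) = \mathbb{E}_{t \sim U\{1,T\},\, p(\mathbf{x}),\, q_{\bm{\phi}}(\mathbf{z}_0|\mathbf{x}),\, q_{\bm{\phi}}(\mathbf{h}_0|\mathbf{x},\mathbf{z}_0),\, \bm{\epsilon} \sim \mathcal{N}(\bm{0},\bm{I})} \|\bm{\epsilon} - \bm{\epsilon}_{\bm{\psi}}(\mathbf{h}_t, \mathbf{z}_0, t)\|_2^2, \tag{7}$$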

where $\mathbf{z}_t = \alpha_t \mathbf{z}_0 + \sigma_t \bm{\epsilon}$ and $\mathbf{h}_t = \alpha_t \mathbf{h}_0 + \sigma_t \bm{\epsilon}$ are the diffused latent encodings. Furthermore, $\bm{\theta}$ denotes the parameters of the global shape latent DDM $\bm{\epsilon}_{\bm{\theta}}(\mathbf{z}_t, t)$, and $\bm{\psi}$ refers to the parameters of the conditional DDM $\bm{\epsilon}_{\bm{\psi}}(\mathbf{h}_t, \mathbf{z}_0, t)$ trained over the latent point cloud (note the conditioning on $\mathbf{z}_0$).
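A single-sample estimate of the two second-stage objectives might look as follows. This is a sketch only: `eps_theta` and `eps_psi` are hypothetical callables standing in for the two latent DDM networks, `z0` and `h0` are assumed to come from the frozen encoders, and drawing one shared step `t` for both losses is a simplification.

```python
import numpy as np


def latent_sm_losses(z0, h0, eps_theta, eps_psi, alphas, sigmas, rng):
    """Noise-prediction losses for the global latent DDM and the latent point DDM.

    eps_theta(z_t, t) models the global shape latent; eps_psi(h_t, z0, t)
    models the latent points and is conditioned on the clean z0.
    """
    T = len(alphas)
    t = int(rng.integers(1, T + 1))            # t ~ U{1, ..., T}
    a, s = alphas[t - 1], sigmas[t - 1]
    eps_z = rng.standard_normal(z0.shape)
    eps_h = rng.standard_normal(h0.shape)
    z_t = a * z0 + s * eps_z                   # diffused global shape latent
    h_t = a * h0 + s * eps_h                   # diffused latent points
    loss_z = float(np.sum((eps_z - eps_theta(z_t, t)) ** 2))
    loss_h = float(np.sum((eps_h - eps_psi(h_t, z0, t)) ** 2))
    return loss_z, loss_h
```

Only the parameters of the two noise predictors would receive gradients here; the encoders that produced `z0` and `h0` stay frozen in this stage.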

Generation. With the latent DDMs, we can formally define a hierarchical generative model $p_{\bm{\xi},\bm{\psi},\bm{\theta}}(\mathbf{x},\mathbf{h}_0,\mathbf{z}_0) = p_{\bm{\xi}}(\mathbf{x}|\mathbf{h}_0,\mathbf{z}_0)\, p_{\bm{\psi}}(\mathbf{h}_0|\mathbf{z}_0)\, p_{\bm{\theta}}(\mathbf{z}_0)$, where $p_{\bm{\theta}}(\mathbf{z}_0)$ denotes the distribution of the global shape latent DDM, $p_{\bm{\psi}}(\mathbf{h}_0|\mathbf{z}_0)$ refers to the DDM modeling the point cloud-structured latents, and $p_{\bm{\xi}}(\mathbf{x}|\mathbf{h}_0,\mathbf{z}_0)$ is LION’s decoder. We can hierarchically sample the latent DDMs following Eq. (4) and then translate the latent points back to the original point cloud space with the decoder.
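The order of operations in hierarchical generation can be sketched as below: first sample the global shape latent, then sample the latent points conditioned on it, then decode. The noise predictors, the decoder, the latent shapes, and the choice $\rho_t^2 = \beta_t$ are all placeholders or assumptions made for illustration.

```python
import numpy as np


def ancestral_sample(eps_fn, shape, betas, rng):
    """Reverse chain from N(0, I) noise, assuming rho_t^2 = beta_t."""
    alphas = np.sqrt(np.cumprod(1.0 - betas))
    sigmas = np.sqrt(1.0 - alphas ** 2)
    x = rng.standard_normal(shape)
    for t in range(len(betas), 0, -1):
        beta, sigma = betas[t - 1], sigmas[t - 1]
        x = (x - beta / sigma * eps_fn(x, t)) / np.sqrt(1.0 - beta)
        if t > 1:  # last step without noise injection
            x = x + np.sqrt(beta) * rng.standard_normal(shape)
    return x


def generate(eps_theta, eps_psi, decoder, z_dim, h_shape, betas, rng):
    """Sample z0 ~ p_theta(z0), then h0 ~ p_psi(h0 | z0), then decode to a point cloud."""
    z0 = ancestral_sample(eps_theta, (z_dim,), betas, rng)
    # The latent point DDM sees the same clean z0 at every denoising step.
    h0 = ancestral_sample(lambda h, t: eps_psi(h, z0, t), h_shape, betas, rng)
    return decoder(h0, z0)
```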

Network Architectures and DDM Parametrization. Let us briefly summarize key implementation choices. The encoder networks, as well as the decoder and the latent point DDM, operating on point clouds $\mathbf{x}$, are all implemented based on Point-Voxel CNNs (PVCNNs), following Zhou et al. PVCNNs efficiently combine the point-based processing of PointNets with the strong spatial inductive bias of convolutions. The DDM modeling the global shape latent uses a ResNet structure with fully-connected layers (implemented as $1{\times}1$-convolutions). All conditionings on the global shape latent are implemented via adaptive Group Normalization in the PVCNN layers. Furthermore, following Vahdat et al., we use a mixed score parametrization in both latent DDMs. This means that the score models are parametrized to predict a residual correction to an analytic standard Gaussian score. This is beneficial since the latent encodings are regularized towards a standard Gaussian distribution during the first training stage (see App. D for all details).
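One simple reading of the mixed score parametrization is sketched below. If the encodings were exactly standard Gaussian, the diffused latent $\mathbf{z}_t$ would also be standard Gaussian with score $-\mathbf{z}_t$, and the ideal noise prediction would be $\sigma_t \mathbf{z}_t$; the network then only has to model the deviation from this term. The exact form used in LION is given in App. D and may differ (for example, by a learned mixing coefficient); `residual_net` and the function name are hypothetical.

```python
import numpy as np


def mixed_eps(z_t, t, sigmas, residual_net):
    """Noise prediction as analytic standard-Gaussian term plus a learned residual.

    sigmas[t - 1] * z_t is the exact noise predictor when z_0 ~ N(0, I);
    residual_net(z_t, t) is a placeholder for the trainable correction.
    """
    return sigmas[t - 1] * z_t + residual_net(z_t, t)
```

With a zero residual, the model already matches the prior toward which the first training stage regularizes the encodings, which is why the residual is expected to be small.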

3.1 Applications and Extensions

Here, we discuss how LION can be used and extended for different relevant applications.

Multimodal Generation. We can synthesize different variations of a given shape, enabling multimodal generation in a controlled manner: Given a shape, i.e., its point cloud $\mathbf{x}$, we encode it into latent space. Then, we diffuse its encodings $\mathbf{z}_0$ and $\mathbf{h}_0$ for a small number of steps $\tau < T$ towards intermediate $\mathbf{z}_\tau$ and $\mathbf{h}_\tau$ along the diffusion process such that only local details are destroyed. Running the reverse generation process from this intermediate $\tau$, starting at $\mathbf{z}_\tau$ and $\mathbf{h}_\tau$, leads to variations of the original shape with different details (see, for instance, Fig. 2). We refer to this procedure as diffuse-denoise (details in App. C.1). Similar techniques have been used for image editing.
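The diffuse-denoise procedure can be sketched for a single latent as follows: jump to step $\tau$ with the closed-form diffused distribution, then run the reverse chain from $\tau$ back to 0. The function name, the choice $\rho_t^2 = \beta_t$, and the placeholder `eps_model` are assumptions; in LION the procedure is applied to both $\mathbf{z}_0$ and $\mathbf{h}_0$, and the details are in App. C.1.

```python
import numpy as np


def diffuse_denoise(x0, tau, eps_model, betas, rng):
    """Diffuse a latent for tau steps, then denoise it again with the reverse chain."""
    if tau == 0:
        return x0.copy()
    alphas = np.sqrt(np.cumprod(1.0 - betas))
    sigmas = np.sqrt(1.0 - alphas ** 2)
    # Closed-form jump to step tau: x_tau = alpha_tau * x0 + sigma_tau * eps.
    x = alphas[tau - 1] * x0 + sigmas[tau - 1] * rng.standard_normal(x0.shape)
    for t in range(tau, 0, -1):
        beta, sigma = betas[t - 1], sigmas[t - 1]
        x = (x - beta / sigma * eps_model(x, t)) / np.sqrt(1.0 - beta)
        if t > 1:  # last step without noise injection
            x = x + np.sqrt(beta) * rng.standard_normal(x0.shape)
    return x
```

A small `tau` keeps the result close to the input and changes only fine detail, while `tau` close to `T` approaches unconditional sampling.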

Encoder Fine-tuning for Voxel-Conditioned Synthesis and Denoising. In practice, an artist using a 3D generative model may have a rough idea of the desired shape. For instance, they may be able to quickly construct a coarse voxelized shape, to which the generative model then adds realistic details. In LION, we can support such applications: using a similar ELBO as in Eq. (5), but with a frozen decoder, we can fine-tune LION’s encoder networks to take voxelized shapes as input (we simply place points at the voxelized shape’s surface) and map them to the corresponding latent encodings $\mathbf{z}_0$ and $\mathbf{h}_0$ that reconstruct the original non-voxelized point cloud. Now, a user can utilize the fine-tuned encoders to encode voxelized shapes and generate plausible detailed shapes. Importantly, this can be naturally combined with the diffuse-denoise procedure to clean up imperfect encodings and to generate different possible detailed shapes (see Fig. 4).

Furthermore, this approach is general. Instead of voxel-conditioned synthesis, we can also fine-tune the encoder networks on noisy shapes to perform multimodal shape denoising, also potentially combined with diffuse-denoise. LION supports these applications easily without re-training the latent DDMs due to its VAE framework with additional encoders and decoders, in contrast to previous works that train DDMs on point clouds directly . See App. C.2 for technical details.

Shape Interpolation. LION also enables shape interpolation: We can encode different point clouds into LION’s hierarchical latent space and use the probability flow ODE (see App. B) to further encode into the latent DDMs’ Gaussian priors, where we can safely perform spherical interpolation and expect valid shapes along the interpolation path. We can use the intermediate encodings to generate the interpolated shapes (see Fig. 7; details in App. C.3).
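Spherical interpolation between two encodings in the Gaussian prior space can be sketched as below. This is the standard slerp formula rather than a confirmed reproduction of the paper's procedure (see App. C.3); the function name and the fallback to linear interpolation for nearly parallel inputs are choices made here.

```python
import numpy as np


def slerp(a, b, w):
    """Spherical interpolation between arrays a and b for w in [0, 1].

    The angle is measured between the flattened arrays, so the same code
    works for a global latent vector and for a set of latent points.
    """
    a_flat, b_flat = a.ravel(), b.ravel()
    cos = np.dot(a_flat, b_flat) / (np.linalg.norm(a_flat) * np.linalg.norm(b_flat))
    omega = np.arccos(np.clip(cos, -1.0, 1.0))
    if omega < 1e-6:  # nearly parallel: fall back to linear interpolation
        return (1.0 - w) * a + w * b
    return (np.sin((1.0 - w) * omega) * a + np.sin(w * omega) * b) / np.sin(omega)
```

Unlike linear interpolation, slerp keeps the norm of the interpolant close to that of the endpoints, so intermediate points stay in the region where a high-dimensional Gaussian concentrates its mass. Each interpolated encoding would then be mapped back through the generative ODE and the decoder.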

Surface Reconstruction. While point clouds are an ideal 3D representation for DDMs, artists may prefer meshed outputs. Hence, we propose to combine LION with modern geometry reconstruction methods (see Figs. 2, 4 and 5). We use Shape As Points (SAP) , which is based on differentiable Poisson surface reconstruction and can be trained to extract smooth meshes from noisy point clouds. Moreover, we fine-tune SAP on training data generated by LION’s autoencoder to better adjust SAP to the noise distribution in point clouds generated by LION. Specifically, we take clean shapes, encode them into latent space, run a few steps of diffuse-denoise that only slightly modify some details, and decode back. The diffuse-denoise in latent space results in noise in the generated point clouds similar to what is observed during unconditional synthesis (details in App. C.4).

3.2 LION’s Advantages

We now recapitulate LION’s unique advantages. LION’s structure as a hierarchical VAE with latent DDMs is inspired by latent DDMs on images . This framework has key benefits:

(i) Expressivity: First training a VAE that regularizes the latent encodings to approximately fall under standard Gaussian distributions, which are also the DDMs’ equilibrium distributions towards which the diffusion processes converge, results in an easier modeling task for the DDMs: They have to model only the remaining mismatch between the actual encoding distributions and their own Gaussian priors . This translates into improved expressivity, which is further enhanced by the additional decoder network. However, point clouds are, in principle, an ideal representation for the DDM framework, because they can be diffused and denoised easily and powerful point cloud processing architectures exist. Therefore, LION uses point cloud latents that combine the advantages of both latent DDMs and 3D point clouds. Our point cloud latents can be interpreted as smoothed versions of the original point clouds that are easier to model (see Fig. 1). Moreover, the hierarchical VAE setup with an additional global shape latent increases LION’s expressivity even further and results in natural disentanglement between overall shape and local details captured by the shape latents and latent points (Sec. 5.2).

(ii) Flexibility: Another advantage of LION’s VAE framework is that its encoders can be fine-tuned for various relevant tasks, as discussed previously, and it also enables easy shape interpolation. Other 3D point cloud DDMs operating on point clouds directly do not offer simultaneously as much flexibility and expressivity out-of-the-box (see quantitative comparisons in Secs. 5.1 and 5.4).

(iii) Mesh Reconstruction: As discussed, while point clouds are ideal for DDMs, artists likely prefer meshed outputs. As explained above, we propose to use LION together with modern surface reconstruction techniques , again combining the best of both worlds—a point cloud-based VAE backbone ideal for DDMs, and smooth geometry reconstruction methods operating on the synthesized point clouds to generate practically useful smooth surfaces, which can be easily transformed into meshes.

Related Work

We are building on DDMs , which have been used most prominently for image and speech synthesis . We train DDMs in latent space, an idea that has been explored for image and music generation, too. However, these works did not train separate conditional DDMs. Hierarchical DDM training has been used for generative image upsampling , text-to-image generation , and semantic image modeling . Most relevant among these works is Preechakul et al. , which extracts a high-level semantic representation of an image with an auxiliary encoder and then trains a DDM that adds details directly in image space. We are the first to explore related concepts for 3D shape synthesis and we also train both DDMs in latent space. Furthermore, DDMs and VAEs have also been combined in such a way that the DDM improves the output of the VAE .

Most related to LION are “Point-Voxel Diffusion” (PVD) and “Diffusion Probabilistic Models for 3D Point Cloud Generation” (DPM) . PVD trains a DDM directly on point clouds, and our decision to use PVCNNs is inspired by this work. DPM, like LION, uses a shape latent variable, but models its distribution with Normalizing Flows , and then trains a weaker point-wise conditional DDM directly on the point cloud data (this allows DPM to learn useful representations in its latent variable, but sacrifices generation quality). As we show below, neither PVD nor DPM easily enables applications such as multimodal voxel-conditioned synthesis and denoising. Furthermore, LION achieves significantly stronger generation performance. Finally, neither PVD nor DPM reconstructs meshes from the generated point clouds. Point cloud and 3D shape generation have also been explored with other generative models: PointFlow , DPF-Net and SoftFlow rely on Normalizing Flows . SetVAE treats point cloud synthesis as set generation and uses VAEs. ShapeGF learns distributions over gradient fields that model shape surfaces. Both IM-GAN , which models shapes as neural fields, and l-GAN train GANs over latent variables that encode the shapes, similar to other works , while r-GAN generates point clouds directly. PDGN proposes progressive deconvolutional networks within a point cloud GAN. SP-GAN uses a spherical point cloud prior. Other progressive and graph-based architectures have been used, too. Also generative cellular automata (GCAs) can be employed for voxel-based 3D shape generation . In orthogonal work, point cloud DDMs have been used for generative shape completion .

Recently, image-driven training of 3D generative models as well as text-driven 3D generation have received much attention. These are complementary directions to ours; in fact, augmenting LION with additional image-based training or including text-guidance are promising future directions. Finally, we are relying on SAP for mesh generation. Strong alternative approaches for reconstructing smooth surfaces from point clouds exist .

Experiments

We provide an overview of our most interesting experimental results in the main paper. All experiment details and extensive additional experiments can be found in App. E and App. F, respectively.

Datasets. To compare LION against existing methods, we use ShapeNet, the most widely used dataset to benchmark 3D shape generative models. Following previous works, we train on three categories: airplane, chair, car. Also like previous methods, we primarily rely on PointFlow’s dataset splits and preprocessing, which normalizes the data globally across the whole dataset. However, some baselines require per-shape normalization; hence, we also train on such data. Furthermore, training SAP requires signed distance fields (SDFs) for volumetric supervision, which the PointFlow data does not offer. Hence, for simplicity we follow Peng et al. and also use their data splits and preprocessing, which includes SDFs. We also train LION, DPM, PVD, and IM-GAN (which synthesizes shapes as SDFs) on this dataset version (denoted as ShapeNet-vol here). This data is also per-shape normalized. Dataset details are in App. E.1.

Evaluation. Model evaluation follows previous works . Various metrics to evaluate point cloud generative models exist, with different advantages and disadvantages, discussed in detail by Yang et al. . Following recent works , we use 1-NNA (with both Chamfer distance (CD) and earth mover distance (EMD)) as our main metric. It quantifies the distributional similarity between generated shapes and validation set and measures both quality and diversity . For fair comparisons, all metrics are computed on point clouds, not meshed outputs (App. E.2 discusses different metrics; further results on coverage (COV) and minimum matching distance (MMD) in App. F.2).
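For readers unfamiliar with 1-NNA, the following sketch computes it with a Chamfer distance. 1-NNA is the leave-one-out accuracy of a 1-nearest-neighbor classifier that tries to tell generated from reference shapes: values near 50% mean the two sets are indistinguishable, and values near 100% mean they are easily separated. The particular Chamfer variant (mean of squared nearest-neighbor distances in both directions) and the function names are assumptions; conventions differ between codebases.

```python
import numpy as np


def chamfer(a, b):
    """Symmetric Chamfer distance between point clouds a (N, 3) and b (M, 3)."""
    d = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (N, M) squared distances
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())


def one_nna(generated, reference, dist=chamfer):
    """Leave-one-out 1-nearest-neighbor accuracy over the union of both sets."""
    shapes = list(generated) + list(reference)
    labels = np.array([0] * len(generated) + [1] * len(reference))
    n = len(shapes)
    D = np.full((n, n), np.inf)  # inf on the diagonal excludes each shape itself
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dist(shapes[i], shapes[j])
    nearest = D.argmin(axis=1)
    return float(np.mean(labels[nearest] == labels))
```

Swapping `dist` for an earth mover distance gives the EMD variant of the metric; the quadratic number of pairwise distances is what makes the evaluation expensive for large sets.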

Results. Samples from LION are shown in Fig. 6 and quantitative results in Tabs. 1-3 (see Sec. 4 for details about baselines; to reduce the number of baselines to train, we focus on the most recent and competitive ones). LION outperforms all baselines and achieves state-of-the-art performance on all classes and dataset versions. Importantly, we outperform both PVD and DPM, which also leverage DDMs, by large margins. Our samples are diverse and appear visually pleasing.

Mesh Reconstruction. As explained in Sec. 3.1, we combine LION with mesh reconstruction, to directly synthesize practically useful meshes. We show generated meshes in Fig. 2, which look smooth and of high quality. In Fig. 2, we also visually demonstrate how we can vary the local details of synthesized shapes while preserving the overall shape with our diffuse-denoise technique (Sec. 3.1). Details about the number of diffusion steps for all diffuse-denoise experiments are in App. E.

Shape Interpolation. As discussed in Sec. 3.1, LION also enables shape interpolation, potentially useful for shape editing applications. We show this in Fig. 7, combined with mesh reconstruction. The generated shapes are clean and semantically plausible along the entire interpolation path. In App. F.12.1, we also show interpolations from PVD and DPM for comparison.

5.2 Many-class Unconditional 3D Shape Generation

13-Class LION Model. We train a LION model jointly without any class conditioning on 13 different categories (airplane, chair, car, lamp, table, sofa, cabinet, bench, telephone, loudspeaker, display, watercraft, rifle) from ShapeNet (ShapeNet-vol version). Training a single model without conditioning over such diverse shapes is challenging, as the data distribution is highly complex and multimodal. We show LION’s generated samples in Fig. 3, including meshes: LION synthesizes high-quality and diverse plausible shapes even when trained on such complex data. We report the model’s quantitative generation performance in Tab. 4, and we also trained various strong baseline methods under the same setting for comparison. We find that LION outperforms all baselines by a large margin. We further observe that the hierarchical VAE architecture of LION becomes crucial: The shape latent variable $\mathbf{z}_0$ captures global shape, while the latent points $\mathbf{h}_0$ model details. This can be seen in Fig. 8, where we show samples obtained by fixing the global shape latent $\mathbf{z}_0$ and sampling only $\mathbf{h}_0$ (details in App. F.3).

55-Class LION Model. Encouraged by these results, we also trained a LION model, again jointly and without any class conditioning, on all 55 different categories from ShapeNet. Note that we deliberately did not use class conditioning in these experiments, in order to create a difficult 3D generation task and thereby explore LION’s scalability to highly complex and multimodal datasets. We show generated point cloud samples in Fig. 9 (we did not train an SAP model on the 55-class data): LION synthesizes high-quality and diverse shapes. It can even generate samples from the cap class, which contributes only 39 training samples, indicating that LION has excellent mode coverage that includes even very rare classes. To the best of our knowledge, no previous 3D shape generative models have demonstrated satisfactory generation performance for such diverse and multimodal 3D data without relying on conditioning information (details in App. F.4). In conclusion, we observe that LION scales out-of-the-box to highly complex multi-category shape generation.

5.3 Training LION on Small Datasets

Next, we explore whether LION can also be trained successfully on very small datasets. To this end, we train models on the Mug and Bottle ShapeNet classes. The number of training samples is 149 and 340, respectively, which is much smaller than the common classes like chair, car and airplane. Furthermore, we also train LION on 553 animal assets from the TurboSquid data repository. Generated shapes from the three models are shown in Fig. 10. LION is able to generate correct mugs and bottles as well as diverse and high-quality animal shapes. We conclude that LION also performs well even when training in the challenging low-data setting (details in Apps. F.5 and F.6).

5.4 Voxel-guided Shape Synthesis and Denoising with Fine-tuned Encoders

Next, we test our strategy for multimodal voxel-guided shape synthesis (see Sec. 3.1) using the airplane class LION model (experiment details in App. E, more experiments in App. F.7). We first voxelize our training set and fine-tune our encoder networks to produce the correct encodings to decode back the original shapes. When processing voxelized shapes with our point-cloud networks, we sample points on the surface of the voxels. As discussed, we can use different numbers of diffuse-denoise steps in latent space to generate various plausible shapes and correct for poor encodings. Instead of voxelizations, we can also consider different noisy inputs (we use normal, uniform, and outlier noise, see App. F.7) and achieve multimodal denoising with the same approach. The same tasks can be attempted with the important DDM-based baselines PVD and DPM, by directly—not in a latent space—diffusing and denoising voxelized (converted to point clouds) or noisy point clouds.

Fig. 13 shows the reconstruction performance of LION, DPM and PVD for different numbers of diffuse-denoise steps (we voxelized or noised the validation set to measure this). We see that for almost all inputs—voxelized or different noises—LION performs best. PVD and DPM perform acceptably for normal and uniform noise, which is similar to the noise injected during training of their DDMs, but perform very poorly for outlier noise or voxel inputs, which is the most relevant case to us, because voxels can be easily placed by users. It is LION’s unique framework with additional fine-tuned encoders in its VAE and only latent DDMs that makes this possible. Performing more diffuse-denoise steps means that more independent, novel shapes are generated. These will be cleaner and of higher quality, but also correspond less to the noisy or voxel inputs used for guidance. In Fig. 13, we show this trade-off for the voxel-guidance experiment (other experiments in App. F.7), where (top) we measured the outputs’ synthesis quality by calculating 1-NNA with respect to the validation set, and (bottom) the average intersection over union (IOU) between the input voxels and the voxelized outputs. We generally see a trade-off: More diffuse-denoise steps result in lower 1-NNA (better quality), but also lower IOU. LION strikes the best balance by a large gap: Its additional encoder network directly generates plausible latent encodings from the perturbed inputs that are both high quality and also correspond well to the input. This trade-off is visualized in Fig. 11 for LION, DPM, and PVD, where we show generated point clouds and voxelizations (note that performing no diffuse-denoise at all for PVD and DPM corresponds to simply keeping the input, as these models’ DDMs operate directly on point clouds). We see that running 50 diffuse-denoise steps to generate diverse outputs for DPM and especially PVD results in a significant violation of the input voxelization. 
In contrast, LION generates realistic outputs that also obey the driving voxels. Overall, LION wins out both in this task and also in unconditional generation with large gaps over these previous DDM-based point cloud generative models. We conclude that LION does not only offer state-of-the-art 3D shape generation quality, but is also very versatile. Note that guided synthesis can also be combined with mesh reconstruction, as shown in Fig. 4.
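The IOU measurement between input voxels and voxelized outputs can be sketched as follows. The grid resolution, the bounding box, and the function names are assumptions made for illustration; the experimental settings are described in App. E.

```python
import numpy as np


def voxelize(points, resolution=16, lo=-1.0, hi=1.0):
    """Mark every cell of a resolution^3 grid over [lo, hi]^3 that contains a point."""
    idx = np.floor((points - lo) / (hi - lo) * resolution).astype(int)
    idx = np.clip(idx, 0, resolution - 1)  # points on or outside the boundary go to edge cells
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid


def voxel_iou(a, b):
    """Intersection over union of two boolean occupancy grids of equal shape."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty grids are treated as identical
    return float(np.logical_and(a, b).sum() / union)
```

A guided output that obeys the input voxels has an IOU near 1 with the input grid, while an output that drifts away from the guidance, for example after many diffuse-denoise steps, has a lower IOU.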

5.5 Sampling Time

While our main experiments use 1,000-step DDPM-based synthesis, which takes $\approx 27.12$ seconds, we can significantly accelerate generation without significant loss in quality. Using DDIM-based sampling, we can generate high quality shapes in under one second (Fig. 15), which would enable real-time interactive applications. More analyses in App. F.9.
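The speed-up from DDIM-style sampling comes from visiting only a subset of the diffusion steps with a deterministic update. A minimal sketch of the deterministic variant, written in this paper's $\alpha_t$, $\sigma_t$ notation, is given below. The evenly spaced step selection, the linear schedule used to build `betas`, and the function name are assumptions; `eps_model` is a placeholder for a trained noise predictor.

```python
import numpy as np


def ddim_sample(eps_model, shape, betas, num_steps, rng):
    """Deterministic DDIM-style sampling on a subsequence of the T training steps."""
    alphas = np.sqrt(np.cumprod(1.0 - betas))
    sigmas = np.sqrt(1.0 - alphas ** 2)
    T = len(betas)
    steps = np.linspace(T, 1, num_steps).round().astype(int)  # e.g. T, ..., 1
    x = rng.standard_normal(shape)                            # x_T ~ N(0, I)
    for i, t in enumerate(steps):
        eps = eps_model(x, t)
        x0_hat = (x - sigmas[t - 1] * eps) / alphas[t - 1]    # predicted clean sample
        if i + 1 < len(steps):
            t_next = steps[i + 1]
            a_next, s_next = alphas[t_next - 1], sigmas[t_next - 1]
        else:
            a_next, s_next = 1.0, 0.0                         # final step returns x0_hat
        x = a_next * x0_hat + s_next * eps                    # no fresh noise is injected
    return x
```

Because no noise is injected after the initial draw, the output is a deterministic function of $\mathbf{x}_T$, and the cost scales with `num_steps` rather than with `T`.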

5.6 Overview of Additional Experiments in Appendix

(i) In App. F.1, we perform various ablation studies. The experiments quantitatively validate LION’s architecture choices and the advantage of our hierarchical VAE setup with conditional latent DDMs. (ii) In App. F.8, we measure LION’s autoencoding performance. (iii) To demonstrate the value of directly outputting meshes, in App. F.10 we use Text2Mesh to generate textures based on text prompts for synthesized LION samples (Fig. 14). This would not be possible, if we only generated point clouds. (iv) To qualitatively show that LION can be adapted easily to other relevant tasks, in App. F.11 we condition LION on CLIP embeddings of the shapes’ rendered images, following CLIP-Forge (Fig. 16). This enables text-driven 3D shape generation and single view 3D reconstruction (Fig. 17). (v) We also show many more samples (Apps. F.2-F.6) and shape interpolations (App. F.12) from our models, more examples of voxel-guided and noise-guided synthesis (App. F.7), and we further analyze our 13-class LION model (App. F.3.2).

Conclusions

We introduced LION, a novel generative model of 3D shapes. LION uses a VAE framework with hierarchical DDMs in latent space and can be combined with SAP for mesh generation. LION achieves state-of-the-art shape generation performance and enables applications such as voxel-conditioned synthesis, multimodal shape denoising, and shape interpolation. LION is currently trained on 3D point clouds only and cannot directly generate textured shapes. A promising extension would be to include image-based training by incorporating neural or differentiable rendering and to also synthesize textures. Furthermore, LION currently focuses on single object generation only. It would be interesting to extend it to full 3D scene synthesis. Moreover, synthesis could be further accelerated by building on works on accelerated sampling from DDMs.

Broader Impact. We believe that LION can potentially improve 3D content creation and assist the workflow of digital artists. We designed LION with such applications in mind and hope that it can grow into a practical tool enhancing artists’ creativity. Although we do not see any immediate negative use-cases for LION, it is important that practitioners apply an abundance of caution to mitigate impacts given generative modeling more generally can also be used for malicious purposes, discussed for instance in Vaccari and Chadwick , Nguyen et al. , Mirsky and Lee .

References

Appendix A Funding Disclosure

Appendix B Continuous-Time Diffusion Models and Probability Flow ODE Sampling

Here, we provide additional background on denoising diffusion models (DDMs). In Sec. 2, we introduced DDMs in the “discrete-time” setting, where we have a fixed number $T$ of diffusion and denoising steps. However, DDMs can also be expressed in a continuous-time framework, in which the fixed forward diffusion process and the generative denoising process gradually perturb and denoise the data, respectively, in a continuous manner. In this formulation, these processes can be described by stochastic differential equations (SDEs). In particular, for the “variance-preserving” SDE, which we use (other diffusion processes are possible), the fixed forward diffusion process is given by

$$\mathrm{d}\mathbf{x}_t = -\frac{1}{2}\beta_t\,\mathbf{x}_t\,\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}\mathbf{w}_t, \tag{8}$$

where time $t\in[0,1]$ and $\mathbf{w}_t$ is a standard Wiener process. In the continuous-time formulation, we usually consider times $t\in[0,1]$, while in the discrete-time setting it is common to consider discrete time values $t\in\{0,...,T\}$ (with $t=0$ corresponding to no diffusion at all). This is just a convention and we can easily translate between them as $t_{\textrm{cont.}}=\frac{t_{\textrm{disc.}}}{T}$. We always take care of these conversions here when appropriate without explicitly noting this to keep the notation concise. The function $\beta_t$ in Eq. (8) above is a continuous-time generalization of the set of $\beta_t$’s used in the discrete formulation (denoted as variance schedule in Sec. 2). Usually, the $\beta_t$’s in the discrete-time setting are generated by discretizing an underlying continuous function $\beta_t$ (in our case, $\beta_t$ is simply a linear function of $t$), which is now used in Eq. (8) above directly.

It can be shown that a corresponding reverse diffusion process exists that effectively inverts the forward diffusion from Eq. (8):

$$\mathrm{d}\mathbf{x}_t = -\frac{1}{2}\beta_t\left[\mathbf{x}_t + 2\nabla_{\mathbf{x}_t}\log q_t(\mathbf{x}_t)\right]\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}\mathbf{w}_t.$$

Here, $q_t(\mathbf{x}_t)$ is the marginal diffused data distribution after time $t$, and $\nabla_{\mathbf{x}_t}\log q_t(\mathbf{x}_t)$ is the score function. Hence, if we had access to this score function, we could simulate this reverse SDE in reverse time direction, starting from random noise $\mathbf{x}_1\sim\mathcal{N}(\mathbf{x}_1;\bm{0},\bm{I})$, and thereby invert the forward diffusion process and generate novel data. Consequently, the problem reduces to learning a model for the usually intractable score function. This is where the discrete-time and continuous-time frameworks connect: Indeed, the objective in Eq. (3) for training the denoising model also corresponds to denoising score matching, i.e., it represents an objective to learn a model for the score function. We have

$$\nabla_{\mathbf{x}_t}\log q_t(\mathbf{x}_t) \approx -\frac{\bm{\epsilon}_{\bm{\theta}}(\mathbf{x}_t,t)}{\sigma_t},$$

where $\sigma_t$ denotes the standard deviation of the Gaussian perturbation kernel $q(\mathbf{x}_t|\mathbf{x}_0)$ at time $t$.

However, we trained $\bm{\epsilon}_{\bm{\theta}}(\mathbf{x}_t,t)$ for $T$ discrete steps only, rather than for continuous times $t$. In principle, the objective in Eq. (3) can be easily adapted to the continuous-time setting by simply sampling continuous time values rather than discrete ones. In practice, $T=1000$ steps, as used in our models, represents a fine discretization of the full integration interval and the model generalizes well when queried at continuous $t$ “between” steps, due to the smooth cosine-based time step embeddings.

A unique advantage of the continuous-time framework based on differential equations is that it allows us to construct an ordinary differential equation (ODE), which, when simulated with samples from the same random noise distribution $\mathbf{x}_1\sim\mathcal{N}(\mathbf{x}_1;\bm{0},\bm{I})$ as inputs (where $t=1$, with $\mathbf{x}_{t=1}$, denotes the end of the diffusion for continuous $t\in[0,1]$), leads to the same marginal distributions along the reverse diffusion process and can therefore also be used for synthesis:

$$\mathrm{d}\mathbf{x}_t = -\frac{1}{2}\beta_t\left[\mathbf{x}_t + \nabla_{\mathbf{x}_t}\log q_t(\mathbf{x}_t)\right]\mathrm{d}t.$$

This is an instance of continuous normalizing flows and is often called the probability flow ODE. Plugging in our score function estimate, we have

$$\mathrm{d}\mathbf{x}_t = -\frac{1}{2}\beta_t\left[\mathbf{x}_t - \frac{\bm{\epsilon}_{\bm{\theta}}(\mathbf{x}_t,t)}{\sigma_t}\right]\mathrm{d}t,$$

which we refer to as the generative ODE. Given a sample $\mathbf{x}_1\sim\mathcal{N}(\mathbf{x}_1;\bm{0},\bm{I})$, the generative process of this generative ODE is fully deterministic. Similarly, we can also use this ODE to encode given data into the DDM’s own prior distribution $\mathbf{x}_1\sim\mathcal{N}(\mathbf{x}_1;\bm{0},\bm{I})$ by simulating the ODE in the other direction.

These properties allow us to perform interpolation: Due to the deterministic generation process with the generative ODE, smoothly changing an encoding $\mathbf{x}_1$ will result in a similarly smoothly changing generated output $\mathbf{x}_0$. We use this for our interpolation experiments (see Sec. 3.1 and App. C.3).

Appendix C Technical Details on LION’s Applications and Extensions

In this section, we provide additional methodological details on the different applications and extensions of LION that we discussed in Sec. 3.1 and demonstrated in our experiments.

Our diffuse-denoise technique is essentially a tool to inject diversity into the generation process in a controlled manner and to “clean up” imperfect encodings when working with encoders operating on noisy or voxelized data (see Sec. 3.1 and App. C.2). It is related to similar methods that have been used for image editing.

Specifically, assume we are given an input shape $\mathbf{x}$ in the form of a point cloud. We can now use LION’s encoder networks to encode it into the latent spaces of LION’s autoencoder and obtain the shape latent encoding $\mathbf{z}_0$ and the latent points $\mathbf{h}_0$. Now, we can diffuse those encodings for $\tau<T$ steps (using the Gaussian transition kernel defined in Eq. (1)) to obtain intermediate $\mathbf{z}_\tau$ and $\mathbf{h}_\tau$ along the diffusion process. Next, we can denoise them back to new $\bar{\mathbf{z}}_0$ and $\bar{\mathbf{h}}_0$ using the generative stochastic sampling defined in Eq. (4), starting from the intermediate $\mathbf{z}_\tau$ and $\mathbf{h}_\tau$. Note that we first need to generate the new $\bar{\mathbf{z}}_0$, since denoising $\mathbf{h}_\tau$ is conditioned on $\bar{\mathbf{z}}_0$ according to LION’s hierarchical latent DDM setup.

The forward diffusion of DDMs progressively destroys more and more details of the input data. Hence, diffusing LION’s latent encodings only for small $\tau$, and then denoising again, results in new $\bar{\mathbf{z}}_0$ and $\bar{\mathbf{h}}_0$ that have only changed slightly compared to the original $\mathbf{z}_0$ and $\mathbf{h}_0$. This observation was also made by Meng et al. Similarly, we find that when $\bar{\mathbf{z}}_0$ and $\bar{\mathbf{h}}_0$ are sent through LION’s decoder network, the corresponding point cloud $\bar{\mathbf{x}}$ resembles the input point cloud $\mathbf{x}$ well in overall shape, and only has different details. Diffusing for more steps, i.e., larger $\tau$, corresponds to resampling the shape also more globally (with $\tau=T$ meaning that an entirely new shape is generated), while using smaller $\tau$ implies that the original shape is preserved more faithfully (with $\tau=0$ meaning that the original shape is preserved entirely). Hence, we can use this technique to inject diversity into any given shape and resample different details in a controlled manner (as shown, for instance, in Fig. 2).

We can use this diffuse-denoise approach not only for resampling different details from clean shapes, but also to “clean up” poor encodings. For instance, when LION’s encoders operate on very noisy or coarsely voxelized input point clouds (see Sec. 3.1 and App. C.2), the predicted shape encodings may be poor. The encoder networks may roughly recognize the overall shape but not capture any details due to the noise or voxelizations. Hence, we can perform some diffuse-denoise to essentially partially discard the poor encodings and regenerate them from the DDMs, which have learnt a model of clean detailed shapes, while preserving the overall shape. This allows us to perform multimodal generation when using voxelized or noisy input point clouds as guidance, because we can sample various different plausible versions using diffuse-denoise, while always approximately preserving the overall input shape (see examples in Figs. 4, 29, 30, and 31).
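
The diffuse-denoise procedure can be illustrated with a minimal, self-contained sketch. It is not LION's implementation: the linear variance schedule values, the helper names (`diffuse`, `denoise`, `diffuse_denoise`), and the placeholder `eps_model` (the exact noise predictor if the latents were standard normal) are assumptions made for illustration, and the sketch treats a single latent vector, omitting the conditioning of the latent points on the shape latent. The denoising loop uses the common DDPM ancestral update with variance $\beta_t$.

```python
import math
import random

T = 1000
# Hypothetical linear variance schedule (illustrative values only).
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bar.append(prod)  # alpha_bar[t - 1] = prod_{s <= t} (1 - beta_s)


def eps_model(x, t):
    """Placeholder denoiser: exact noise prediction if the data were N(0, I)."""
    sigma = math.sqrt(1.0 - alpha_bar[t - 1])
    return [sigma * v for v in x]


def diffuse(x0, tau, rng):
    """Jump from step 0 to step tau with the closed-form Gaussian kernel."""
    if tau == 0:
        return list(x0)
    a = math.sqrt(alpha_bar[tau - 1])
    s = math.sqrt(1.0 - alpha_bar[tau - 1])
    return [a * v + s * rng.gauss(0.0, 1.0) for v in x0]


def denoise(x_tau, tau, rng):
    """Stochastic ancestral sampling from step tau back down to step 0."""
    x = list(x_tau)
    for t in range(tau, 0, -1):
        beta = betas[t - 1]
        sigma = math.sqrt(1.0 - alpha_bar[t - 1])
        eps = eps_model(x, t)
        mean = [(v - beta / sigma * e) / math.sqrt(1.0 - beta)
                for v, e in zip(x, eps)]
        noise_scale = math.sqrt(beta) if t > 1 else 0.0  # no noise on the last step
        x = [m + noise_scale * rng.gauss(0.0, 1.0) for m in mean]
    return x


def diffuse_denoise(z0, tau, seed=0):
    """Diffuse an encoding for tau steps, then denoise it back to step 0."""
    rng = random.Random(seed)
    return denoise(diffuse(z0, tau, rng), tau, rng)


z0 = [0.5, -1.0, 2.0]
print(diffuse_denoise(z0, 0))    # tau = 0 keeps the encoding unchanged
print(diffuse_denoise(z0, 10))   # small tau: small perturbation
print(diffuse_denoise(z0, 500))  # large tau: mostly resampled
```

In this toy setting, `tau` plays the role of $\tau$ above: larger values discard more of the original encoding before regenerating it.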

C.2 Encoder Fine-Tuning for Voxel-Conditioned Synthesis and Denoising

A crucial advantage of LION’s underlying VAE framework with latent DDMs is that we can adapt the encoder neural networks for different relevant tasks, as discussed in Sec. 3.1 and demonstrated in our experiments. For instance, a digital artist may have a rough idea about the shape they desire to synthesize and they may be able to quickly put together a coarse voxelization according to whatever they imagine. Or similarly, a noisy version of a shape may be available and the user may want to guide LION’s synthesis accordingly.

When training the encoder to denoise uniform or Gaussian noise added to the point cloud, we use the same reconstruction objective as during original LION training, i.e.,

However, when training with voxelized inputs or outlier noise, there is no good correspondence to define the point-wise reconstruction loss with the Laplace distribution (corresponding to an $L_1$ loss). Therefore, in these cases we instead rely on Chamfer Distance (CD) and Earth Mover Distance (EMD) for the reconstruction term:

Here, $\mathcal{L}^{\textrm{CD}}$ and $\mathcal{L}^{\textrm{EMD}}$ denote CD and EMD losses:

where $\gamma$ denotes a bijection between the point clouds $\mathbf{x}$ and $\mathbf{y}$ (with the same number of points). Note that we are using an $L_1$ loss for the distance calculation in the CD, which we found to work well and corresponds to the $L_1$ loss we are relying on during original LION training.
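
As a small illustration of the two distances, the following sketch computes a symmetric Chamfer distance with $L_1$ point-to-point distances and an EMD defined as the cheapest bijection between two equally sized point sets. The exact forms are assumptions based on the common definitions (sum of nearest-neighbor distances in both directions for CD, Euclidean matching cost for EMD) and not necessarily the precise losses used for LION. The brute-force search over bijections is only feasible for a handful of points; practical implementations use approximate or assignment-based solvers.

```python
from itertools import permutations


def l1(p, q):
    """L1 distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def l2(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def chamfer_l1(xs, ys):
    """Symmetric Chamfer distance with L1 point-to-point distances."""
    forward = sum(min(l1(x, y) for y in ys) for x in xs)
    backward = sum(min(l1(x, y) for x in xs) for y in ys)
    return forward + backward


def emd_bruteforce(xs, ys):
    """EMD as the cheapest bijection; O(n!), so only for tiny point sets."""
    if len(xs) != len(ys):
        raise ValueError("EMD requires point sets of equal size")
    return min(
        sum(l2(x, ys[j]) for x, j in zip(xs, perm))
        for perm in permutations(range(len(ys)))
    )


cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x, y, z + 1.0) for x, y, z in cloud]
print(chamfer_l1(cloud, shifted), emd_bruteforce(cloud, shifted))
```

Neither distance requires a fixed point-wise correspondence between input and output, which is why they remain usable for voxelized or outlier-corrupted inputs.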

One question that naturally arises is regarding the processing of the noisy or voxelized input shapes. Our PVCNN-based encoder networks can easily process noisy point clouds, but not voxels. Therefore, given a voxelized shape, we uniformly distribute points over the voxelized shape’s surface, such that it can be consumed by LION’s point cloud processing networks (see details in App. E.4).

We would like to emphasize that LION supports these applications easily without re-training the latent DDMs due to its VAE framework with additional encoders and decoders, in contrast to previous works that train DDMs on point clouds directly. For instance, PVD operates directly on the voxelized or perturbed point clouds with its DDM. Because of that, PVD needs to perform many diffuse-denoise steps to remove all the noise from the input, as there is no encoder that can help with that. However, this induces significant shape variations that do not correspond well to the original noisy or voxelized inputs (see experiments and discussion in Sec. 5.4).

C.3 Shape Interpolation

Here, we explain in detail how exactly we perform shape interpolation. It may be instructive to take a step back first and motivate our approach. Of course, we cannot simply linearly interpolate two point clouds, that is, the points’ $xyz$-coordinates, directly. This would result in unrealistic outputs along the interpolation path. Rather, we should perform interpolation in a space where semantically similar point clouds are mapped near each other. One option that comes to mind is to use the latent space, that is, both the shape latent space and the latent points, of LION’s point cloud VAE. We could interpolate two point clouds’ encodings, and then decode back to point cloud space. However, we do not have any guarantees in this situation either, due to the VAE’s prior hole problem, that is, the problem that the distribution of all encodings of the training data will not perfectly form the Gaussian that it was regularized towards during VAE training (see Eq. (5)). Hence, when simply interpolating directly in the VAE’s latent space, we would pass through regions in latent space for which the decoder does not produce a realistic sample. This would result in poor outputs.

Therefore, we rather interpolate in the prior spaces of our latent DDMs themselves, that is, the spaces that emerge at the end of the forward diffusion processes. Since the diffusion process of DDMs by construction perturbs all data points into almost perfectly Gaussian $\mathbf{x}_1\sim\mathcal{N}(\mathbf{x}_1;\bm{0},\bm{I})$ (where $t=1$ denotes the end of the diffusion for continuous $t\in[0,1]$), DDMs do not suffer from any prior hole challenges: the denoising model is essentially well trained for all possible $\mathbf{x}_1\sim\mathcal{N}(\mathbf{x}_1;\bm{0},\bm{I})$. Hence, given two $\mathbf{x}_1^A$ and $\mathbf{x}_1^B$, in DDMs we can safely interpolate them according to

$$\mathbf{x}_1^s=\sqrt{s}\,\mathbf{x}_1^A+\sqrt{1-s}\,\mathbf{x}_1^B \tag{18}$$

for $s\in[0,1]$ and expect meaningful outputs when generating the corresponding denoised samples.

But why do we choose the square root-based interpolation? Since we are working in a very high-dimensional space, we know that according to the Gaussian annulus theorem both $\mathbf{x}_1^A$ and $\mathbf{x}_1^B$ almost certainly lie on a thin (high-dimensional) spherical shell that supports almost all probability mass of $p_1(\mathbf{x}_1)\approx\mathcal{N}(\mathbf{x}_1;\bm{0},\bm{I})$. Furthermore, since $\mathbf{x}_1^A$ and $\mathbf{x}_1^B$ are almost certainly orthogonal to each other, again due to the high dimensionality, our above interpolation in Eq. (18) between $\mathbf{x}_1^A$ and $\mathbf{x}_1^B$ corresponds to performing spherical interpolation along the spherical shell where almost all probability mass concentrates. In contrast, linear interpolation would leave this shell, which led to poorer results, because the model was not well trained for denoising samples outside the typical set. Note that we found spherical interpolation to be crucial (in DDMs of images, linear interpolation tends to still work decently; for our latent point DDM, however, linear interpolation performed very poorly).
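
The norm-preservation argument can be checked numerically with a short sketch. The dimensionality (4096) and the helper names are arbitrary choices for illustration; the sketch only compares the norm of the midpoint under square root-based and linear interpolation for two independent standard normal samples.

```python
import math
import random


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def sqrt_interp(a, b, s):
    """Square root-based interpolation: sqrt(s) * a + sqrt(1 - s) * b."""
    wa, wb = math.sqrt(s), math.sqrt(1.0 - s)
    return [wa * x + wb * y for x, y in zip(a, b)]


def linear_interp(a, b, s):
    """Plain linear interpolation: s * a + (1 - s) * b."""
    return [s * x + (1.0 - s) * y for x, y in zip(a, b)]


rng = random.Random(0)
dim = 4096  # hypothetical latent dimensionality
a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
b = [rng.gauss(0.0, 1.0) for _ in range(dim)]

mid_sqrt = norm(sqrt_interp(a, b, 0.5))
mid_lin = norm(linear_interp(a, b, 0.5))
# Both endpoints have norm close to sqrt(dim); the sqrt-based midpoint stays
# near that shell, while the linear midpoint shrinks by roughly 1/sqrt(2).
print(norm(a), norm(b), mid_sqrt, mid_lin)
```

For exactly orthogonal unit vectors, the square root-based midpoint has norm 1, whereas the linear midpoint has norm $1/\sqrt{2}$, which is the effect described above.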

In LION, we have two DDMs operating on the shape latent variables $\mathbf{z}_0$ and the latent points $\mathbf{h}_0$. Concretely, for interpolating two shapes $\mathbf{x}^A$ and $\mathbf{x}^B$ in LION, we first encode them into $\mathbf{z}_0^A$ and $\mathbf{h}_0^A$, as well as $\mathbf{z}_0^B$ and $\mathbf{h}_0^B$. Now, using the generative ODE (see App. B) we further encode these latents into the DDMs’ prior distributions, resulting in encodings $\mathbf{z}_1^A$ and $\mathbf{h}_1^A$, as well as $\mathbf{z}_1^B$ and $\mathbf{h}_1^B$ (note that we need to correctly capture the conditioning when using $\bm{\epsilon}_{\bm{\psi}}(\mathbf{h}_t,\mathbf{z}_0,t)$ in the generative ODE for $\mathbf{h}_t$). Next, we first interpolate the shape latent DDM encodings $\mathbf{z}_1^s=\sqrt{s}\,\mathbf{z}_1^A+\sqrt{1-s}\,\mathbf{z}_1^B$ and use the generative ODE to deterministically generate all $\mathbf{z}_0^s$ along the interpolation path. Then, we also interpolate the latent point DDM encodings $\mathbf{h}_1^s=\sqrt{s}\,\mathbf{h}_1^A+\sqrt{1-s}\,\mathbf{h}_1^B$ and, conditioned on the corresponding $\mathbf{z}_0^s$ along the interpolation path, also deterministically generate all $\mathbf{h}_0^s$ along the interpolation path using the generative ODE. Finally, we can decode all $\mathbf{z}_0^s$ and $\mathbf{h}_0^s$ along the interpolation path $s\in[0,1]$ back to point cloud space and obtain the interpolated point clouds $\mathbf{x}^s$, which we can optionally convert into meshes with SAP.

Note that instead of using given shapes and encoding them into the VAE’s latent space and further into the DDMs’ prior, we can also directly sample novel encodings in the DDM priors and interpolate those.

In practice, to solve the generative ODE both for encoding and generation, we use an adaptive step size Runge-Kutta 4(5) solver with error tolerances $10^{-5}$. Furthermore, we do not actually solve the ODE all the way to exactly $0$, but only up to a small time $10^{-5}$ for numerical reasons (hence, the actual integration interval for the ODE solver is $[10^{-5},1]$). We generally rely on our LION models whose latent DDMs were trained with 1000 discrete time steps (see objectives Eqs. (6) and (7)) and found them to generalize well to the continuous-time setting where the model is also queried for intermediate times $t$ (see discussion in App. B).
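
To make the encode-then-decode use of the generative ODE concrete, the following toy sketch integrates it over $[10^{-5},1]$ in both directions for one-dimensional Gaussian data, for which the ideal noise prediction is available in closed form. Several things here are assumptions for illustration only: the linear $\beta_t$ endpoints, the toy data standard deviation, all function names, and the use of a classic fixed-step fourth-order Runge-Kutta scheme in place of the adaptive Runge-Kutta 4(5) solver mentioned above.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0  # hypothetical endpoints of a linear beta(t)
DATA_STD = 2.0                  # toy 1D data distribution N(0, DATA_STD^2)
T_EPS = 1e-5                    # lower end of the integration interval


def beta(t):
    return BETA_MIN + (BETA_MAX - BETA_MIN) * t


def alpha_sq(t):
    """alpha_t^2 = exp(-int_0^t beta(s) ds) for the variance-preserving SDE."""
    return math.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t))


def marginal_var(t):
    """Variance of the diffused toy data at time t."""
    a2 = alpha_sq(t)
    return a2 * DATA_STD ** 2 + (1.0 - a2)


def eps_model(x, t):
    """Exact noise prediction for Gaussian toy data (eps = -sigma_t * score)."""
    sigma = math.sqrt(1.0 - alpha_sq(t))
    return sigma * x / marginal_var(t)


def ode_rhs(x, t):
    """Generative ODE: dx/dt = -0.5 * beta_t * (x - eps / sigma_t)."""
    sigma = math.sqrt(1.0 - alpha_sq(t))
    return -0.5 * beta(t) * (x - eps_model(x, t) / sigma)


def integrate(x, t_start, t_end, steps=1000):
    """Classic fixed-step RK4 from t_start to t_end (either direction)."""
    h = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        k1 = ode_rhs(x, t)
        k2 = ode_rhs(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = ode_rhs(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = ode_rhs(x + h * k3, t + h)
        x += h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t += h
    return x


x0 = 1.5
x1 = integrate(x0, T_EPS, 1.0)      # encode into the prior
x0_rec = integrate(x1, 1.0, T_EPS)  # deterministically decode back
print(x1, x0_rec)
```

Because the ODE is deterministic, decoding the encoded value recovers the starting point up to numerical error, which is the property the interpolation procedure relies on.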

C.4 Mesh Reconstruction with Shape As Points

Before explaining in App. C.4.2 how we incorporate Shape As Points into LION to reconstruct smooth surfaces, we first provide background on Shape As Points in App. C.4.1.

After upsampling the point cloud and predicting normals, SAP solves a Poisson partial differential equation (PDE) to recover the function $\chi$ from the densified point cloud. Casting surface reconstruction as a Poisson problem is a widely used approach first introduced by Kazhdan et al. Unlike Kazhdan et al., which encodes $\chi$ as a linear combination of sparse basis functions and solves the PDE using a finite element solver on an octree, SAP represents $\chi$ in a discrete Fourier basis on a dense grid and solves the problem using a spectral solver. This spectral approach has the benefits of being fast and differentiable, at the expense of cubic (with respect to the grid size) memory consumption.

To train the upsampling network $f$, SAP minimizes the $L_2$ distance between the predicted indicator function $\chi$ (sampled on a dense, regular grid) and a pseudo-ground-truth indicator function $\chi_{\text{gt}}$ recovered by solving the same Poisson PDE on a dense set of points and normals. Denoting the differentiable Poisson solve as $\chi=\text{Poisson}(X^{\prime},N^{\prime})$, we can write the loss minimized by SAP as

where $\mathcal{D}$ is the training data distribution of indicator functions $\chi_i$ for shapes and point samples on the surface of those shapes $X_i$.

Intuitively, we would like the recovered $\chi$ to change sharply between a positive value and a negative value at the surface boundary along the direction orthogonal to the surface. Thus, PSR treats the surface normals $N$ as noisy samples of the gradient of $\chi$. In practice, PSR first constructs a smoothed vector field $\vec{V}$ from $N$ by convolving these with a filter (e.g., a Gaussian), and recovers $\chi$ by minimizing

$$\left\|\nabla\chi-\vec{V}\right\|_2^2 \tag{20}$$

over the input domain. Observe that applying the (linear) divergence operator to the problem in Eq. (20) does not change the solution. Thus, we can apply the divergence operator to Eq. (20) to transform it into a Poisson problem

$$\Delta\chi=\nabla\cdot\vec{V},$$

which can be solved using standard numerical methods for solving elliptic PDEs. Since PSR is effectively integrating $\vec{V}$ to recover $\chi$, the solution is ambiguous up to an additive constant. To remedy this, PSR subtracts the mean value of $\chi$ at the input points, i.e., $\frac{1}{N}\sum_{i=1}^{N}\chi(x_i)$, yielding a unique solution.
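
A one-dimensional analogue may help to illustrate a spectral Poisson solve and the additive-constant ambiguity. The sketch below is a simplification under stated assumptions and not SAP's solver: it works on a periodic 1D grid, uses a naive discrete Fourier transform, and fixes the free constant by zeroing the zero-frequency mode (so the grid mean of $\chi$ is zero) rather than subtracting the mean at the input points. The function names are hypothetical.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (fine for tiny grids)."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * m / n)
                  for m, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out


def solve_poisson_1d(v, length=2.0 * math.pi):
    """Recover chi with chi'' = v' on a periodic grid of the given length."""
    n = len(v)
    v_hat = dft(v)
    chi_hat = [0j] * n
    for k in range(n):
        freq = k if k <= n // 2 else k - n  # signed integer frequency
        if freq == 0 or 2 * k == n:
            continue  # constant mode is undetermined; Nyquist mode is dropped
        omega = 2.0 * math.pi * freq / length
        # (i * omega)^2 * chi_hat = (i * omega) * v_hat
        chi_hat[k] = v_hat[k] / (1j * omega)
    return [c.real for c in dft(chi_hat, inverse=True)]


n = 32
grid = [2.0 * math.pi * m / n for m in range(n)]
field = [math.cos(x) for x in grid]  # plays the role of the vector field V
chi = solve_poisson_1d(field)        # should match sin(x), which has zero mean
print(max(abs(c - math.sin(x)) for c, x in zip(chi, grid)))
```

In Fourier space the Laplacian becomes a multiplication, so the solve reduces to a per-frequency division, which is what makes the spectral approach fast and differentiable.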

C.4.2 Incorporating Shape As Points in LION

SAP is commonly trained using slightly noisy perturbed point clouds as input to its neural network $f_\theta$. This results in robustness and generalization to noisy shapes during inference. Also, the point clouds generated by LION are not perfectly clean and smooth but subject to some noise. In principle, to make our SAP model ideally suited for reconstructing surfaces from LION’s generated point clouds, it would be best to train SAP using inputs that are subject to the same noise as generated by LION. Although we do not know the exact form of LION’s noise, we propose to nevertheless specialize the SAP model for LION: Specifically, we take SAP’s clean training data (i.e., densely sampled point clouds from which accurate pseudo-ground-truth indicator functions can be calculated via PSR; see previous App. C.4.1) and encode it into LION’s latent spaces $\mathbf{z}_0$ and $\mathbf{h}_0$. Then, we perform a few diffuse-denoise steps in latent space (see App. C.1) that create small shape variations of the input shapes when decoded back to point clouds. When doing these diffuse-denoise steps, we are using exactly LION’s generation mechanism, i.e., the stochastic sampling in Eq. (4), to generate the slightly perturbed encodings. Hence, we are injecting the same noise that is also seen in generation. Therefore, the correspondingly generated point clouds can serve as slightly noisy versions of the original clean point clouds before encoding, diffuse-denoise, and decoding, and we can use this data to train SAP. We found experimentally that this LION-specific training of SAP can indeed improve SAP’s performance when reconstructing meshes from LION’s generated point clouds (see App. F.1.4).

Note that in principle an even tighter integration of SAP with LION would be possible. In future versions of LION, it would be interesting to study joint end-to-end LION and SAP training, where LION’s decoder directly predicts a dense set of points with normals that is then matched to a pseudo-ground-truth indicator function using differentiable PSR. However, we are leaving this to future research. To the best of our knowledge, LION is the first point cloud generative model that directly incorporates modern surface and mesh reconstruction at all. In conclusion, using SAP we can convert LION into a mesh generation model, while under the hood still leveraging point clouds, which are ideal for DDM-based modeling.

Appendix D Implementation

In Fig. 19, we plot the building blocks used in LION:

Multilayer perceptron (MLP), point-voxel convolution (PVC), set abstraction (SA), and feature propagation (FP) represent the building modules for our PVCNNs. The Grouper block (in SA) consists of the sampling layer and grouping layer introduced by PointNet++ .

PVCNN visualizes a typical network used in LION. The latent points encoder, the decoder, and the latent points prior all share this high-level architecture design, which is modified from the base network of PVD (https://github.com/alexzhou907/PVD). It consists of several set abstraction levels and feature propagation levels. The details of these levels can be found in PointNet++.

ResSE denotes a ResNet block with squeeze-and-excitation (SE) layers.

AdaGN is the adaptive group normalization (GN) layer that is used for conditioning on the shape latent.

Our VAE backbone consists of two encoder networks, and a decoder network. The PVCNNs we used are based on PointNet++ with point-voxel convolutions .

We show the details of the shape latent encoder in Tab. 5, the latent points encoder in Tab. 6, and the details of the decoder in Tab. 7.

We use a dropout probability of 0.1 for all dropout layers in the VAE. All group normalization layers in the latent points encoder as well as in the decoder are replaced by adaptive group normalization (AdaGN) layers to condition on the shape latent. For the AdaGN layers, we initialize the weight of the linear layer with scale $0.1$. The bias for the output factor is set to $1.0$ and the bias for the output bias is set to $0.0$. The AdaGN layer is also plotted in Fig. 19.

Model Initialization. We initialize our VAE model such that it acts as an identity mapping between the input, the latent space, and reconstructed points at the beginning of training. We achieve this by scaling down the variances of encoders and by weighting the skip connections accordingly.

Weighted Skip Connection. We add skip connections in different places to improve information propagation. In the latent points encoder, the clean point cloud coordinates (in 3 channels) are added to the mean of the predicted latent point coordinates (in 3 channels), which is multiplied by $0.01$ before the addition. In the decoder, the sampled latent points coordinates are added to the output point coordinates (in 3 channels). The predicted output point coordinates are multiplied by $0.01$ before the addition.

Variance Scaling. We subtract a constant value from the log of the standard deviation of the posterior Normal distribution. The constant value helps push the variance of the posterior towards zero when the LION model is initialized. In our experiments, we set this offset value to $6$.

With the above techniques, at the beginning of training the input point cloud is effectively copied into the latent point cloud and then directly decoded back to point cloud space, and the shape latent variables are not active. This prevents diverging reconstruction losses at the beginning of training.
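
The effect of these initialization choices can be sketched in a few lines. The sketch stands in for the real networks with precomputed raw outputs (all-zero here), and the function names are hypothetical; only the scale $0.01$ and the offset $6$ are taken from the description above.

```python
import math

SKIP_SCALE = 0.01        # weight applied to network outputs before the skip addition
LOG_SIGMA_OFFSET = 6.0   # constant subtracted from the predicted log std


def encode_latent_points(points, raw_mean, raw_log_sigma):
    """Posterior mean = input coordinates + 0.01 * raw network output."""
    mean = [[p + SKIP_SCALE * m for p, m in zip(pt, out)]
            for pt, out in zip(points, raw_mean)]
    sigma = [[math.exp(ls - LOG_SIGMA_OFFSET) for ls in out]
             for out in raw_log_sigma]
    return mean, sigma


def decode(latent_points, raw_out):
    """Output coordinates = latent point coordinates + 0.01 * raw network output."""
    return [[h + SKIP_SCALE * o for h, o in zip(pt, out)]
            for pt, out in zip(latent_points, raw_out)]


points = [[0.1, 0.2, 0.3], [-0.4, 0.5, 0.0]]
zeros = [[0.0, 0.0, 0.0] for _ in points]  # stand-in for untrained network outputs

mean, sigma = encode_latent_points(points, zeros, zeros)
recon = decode(mean, zeros)
print(recon == points, sigma[0][0])  # near-identity mapping, std = exp(-6)
```

With small raw outputs, the posterior is sharply peaked around the input coordinates and the decoder returns them almost unchanged, matching the identity-mapping behavior at the start of training.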

D.2 Shape Latent DDM Prior

We show the details of the shape latent DDM prior in Tab. 8. We use a dropout probability of 0.3, 0.3, and 0.4 for the airplane, car, and chair category, respectively. The time embeddings are added to the features of each ResSE layer.

D.3 Latent Points DDM Prior

We show the details of the latent points DDM prior in Tab. 9. We use a dropout probability of 0.1 for all dropout layers in this DDM prior. All group normalization layers are replaced by adaptive group normalization layers to condition on the shape latent variable. The time embeddings are concatenated with the point features for the inputs of each SA and FP layer.

Note that both latent DDMs use a mixed denoising score network parametrization, directly following Vahdat et al. In short, the DDM’s denoising model is parametrized as the analytically ideal denoising network assuming a normal data distribution plus a neural network-based correction. This can be advantageous if the distribution that is modeled by the DDM is close to normal. This is indeed the case in our situation, because during the first training stage all latent encodings were regularized to fall under a standard normal distribution due to the VAE objective’s Kullback-Leibler regularization. We refer the reader to Vahdat et al. for further details.

D.4 Two-stage Training

The training of LION consists of two stages:

First Stage Training. LION optimizes the modified ELBO of Eq. (5) with respect to the two encoders and the decoder as shown in the main paper. We use the same value for $\lambda_{\mathbf{z}}$ and $\lambda_{\mathbf{h}}$. These KL weights, starting at $10^{-7}$, are annealed linearly for the first 50% of the maximum number of epochs. Their final value is set to $0.5$ at the end of the annealing process.
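
The annealing schedule for the KL weights can be written as a small function. This is a sketch under the assumption that "annealed linearly" means a linear ramp of the weight value itself, evaluated once per epoch; the function name and its parameters are hypothetical.

```python
def kl_weight(epoch, max_epochs, start=1e-7, end=0.5, anneal_frac=0.5):
    """Linear ramp from `start` to `end` over the first `anneal_frac` of training."""
    anneal_epochs = anneal_frac * max_epochs
    if epoch >= anneal_epochs:
        return end  # annealing finished: keep the final value
    return start + (end - start) * epoch / anneal_epochs


# Example: a hypothetical 100-epoch run, same value used for both KL weights.
for epoch in (0, 25, 50, 99):
    print(epoch, kl_weight(epoch, max_epochs=100))
```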

Second Stage Training. In this stage, the encoders and the decoder are frozen, and only the two DDM prior networks are trained using the objectives in Eqs. (6) and (7). During training, we first encode the clean point clouds $\mathbf{x}$ with the encoders and sample $\mathbf{z}_0\sim q_{\bm{\phi}}(\mathbf{z}_0|\mathbf{x}),\ \mathbf{h}_0\sim q_{\bm{\phi}}(\mathbf{h}_0|\mathbf{x},\mathbf{z}_0)$. We then draw the time steps $t$ uniformly from $U\{1,...,T\}$, then sample the diffused shape latent $\mathbf{z}_t$ and latent points $\mathbf{h}_t$. Our shape latent DDM prior takes $\mathbf{z}_t$ with $t$ as input, and the latent points DDM prior takes $(\mathbf{z}_0,t,\mathbf{h}_t)$ as input. We use the un-weighted training objective (i.e., $w(t)=1$).

During second stage training, we regularize the prior DDM neural networks by adding spectral normalization (SN) and a group normalization (GN) loss similar to Vahdat et al. Furthermore, we record the exponential moving average (EMA) of the latent DDMs’ weight parameters, and use the parameter EMAs during inference when calling the DDM priors.
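
For readers unfamiliar with parameter EMAs, a minimal sketch of the update is given below. The decay value and the dictionary-based parameter container are assumptions for illustration; the text above does not state the decay used for LION.

```python
def ema_update(ema_params, params, decay=0.999):
    """In-place EMA: ema <- decay * ema + (1 - decay) * current parameter."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params


# Toy usage: the EMA copy trails the (hypothetical) training weights.
weights = {"w": 0.0}
ema = dict(weights)
for step in range(1, 4):
    weights["w"] = float(step)  # stand-in for an optimizer step
    ema_update(ema, weights, decay=0.9)
print(weights["w"], ema["w"])
```

At inference time, the EMA copy rather than the raw training weights would be loaded into the DDM priors.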

Appendix E Experiment Details

For the unconditional 3D point cloud generation task, we follow previous works and use the ShapeNet dataset, as pre-processed and released by PointFlow . Also following previous works and to be able to compare with many different baseline methods, we train on three categories: airplane, chair and car. The ShapeNet dataset released by PointFlow consists of 15k points for each shape. During training, 2,048 points are randomly sampled from the 15k points at each iteration. The training set consists of 2,832, 4,612, and 2,458 shapes for airplane, chair and car, respectively. The sample quality metrics are reported with respect to the standard reference set, which consists of 405, 662, and 352 shapes for airplane, chair and car, respectively. During training, we use the same normalization as in PointFlow and PVD , where the data is normalized globally across all shapes. We compute the means for each axis across the whole training set, and one standard deviation across all axes and the whole training set. Note that there is a typo in the caption of Tab. 3 in the main text: In fact, this kind of global normalization using standard deviation does not result in $$ point coordinate bounds, but the coordinate values usually extend beyond that.

When reproducing the baselines on the ShapeNet dataset released by PointFlow, we found that some methods require per-shape normalization, where the mean is computed for each axis for each shape, and the scale is computed as the maximum length across all axes for each shape. As a result, the $xyz$-values of the point coordinates will be bounded within $$. We train and evaluate LION following this convention when comparing it to these methods. Note that these different normalizations imply different generative modeling problems. Therefore, it is important to carefully distinguish these different setups for fair comparisons.

When training the SAP model, we follow Peng et al. , Mescheder et al. and also use their data splits and data pre-processing to get watertight meshes. Watertight meshes are required to properly determine whether points are in the interior of the meshes or not, and to define signed distance fields (SDFs) for volumetric supervision, which the PointFlow data does not offer. More details of the data processing can be found in Mescheder et al. (Sec. 1.2 in the Supplementary Material). This dataset variant is denoted as ShapeNet-vol. This data is per-shape normalized, i.e., the points’ coordinates are bounded by $$. To combine LION and SAP, we also train LION on the same data used by the SAP model. Therefore, we report sample quality of LION as well as the most relevant baselines DPM, PVD, and also IM-GAN (which synthesizes shapes as SDFs) also on this dataset variant. The number of training shapes is 2,832, 1,272, 1,101, 5,248, 4,746, 767, 1,624, 1,134, 1,661, 2,222, 5,958, 737, and 1,359 for airplane, bench, cabinet, car, chair, display, lamp, loudspeaker, rifle, sofa, table, telephone, and watercraft, respectively. The number of shapes in the reference set is 404, 181, 157, 749, 677, 109, 231, 161, 237, 317, 850, 105, and 193 for airplane, bench, cabinet, car, chair, display, lamp, loudspeaker, rifle, sofa, table, telephone, and watercraft, respectively.

E.2 Evaluation Metrics

Several metrics have been proposed to quantitatively evaluate point cloud generative models, and some of them suffer from drawbacks. Given a generated set of point clouds $S_g$ and a reference set $S_r$, the most popular metrics are (we are following Yang et al. ):
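
Coverage (COV), written out under the assumption that it matches the definition given by Yang et al.:

```latex
\mathrm{COV}(S_g, S_r) =
  \frac{\left|\left\{ \arg\min_{Y \in S_r} D(X, Y) \;\middle|\; X \in S_g \right\}\right|}{|S_r|}
```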

where $D(\cdot,\cdot)$ is either the Chamfer distance (CD) or the earth mover’s distance (EMD). COV measures the fraction of reference point clouds that are matched to at least one generated shape. COV can quantify diversity and is sensitive to mode dropping, but it does not quantify the quality of the generated point clouds: low-quality but diverse generated point clouds can also achieve high coverage scores.
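
Minimum matching distance (MMD), again written out under the assumption that it matches the definition given by Yang et al.:

```latex
\mathrm{MMD}(S_g, S_r) =
  \frac{1}{|S_r|} \sum_{Y \in S_r} \min_{X \in S_g} D(X, Y)
```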

where $D(\cdot,\cdot)$ is again either CD or EMD. The idea behind MMD is to calculate the average distance between the point clouds in the reference set and their closest neighbors in the generated set. However, MMD is not sensitive to low-quality point clouds in $S_g$, as they are most likely not matched to any shapes in $S_r$. Therefore, it is also not a reliable metric to measure overall generation quality, and it does not quantify diversity or mode coverage either.

1-nearest neighbor accuracy (1-NNA): To overcome the drawbacks of COV and MMD, Yang et al. proposed to use 1-NNA as a metric to evaluate point cloud generative models:
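
Assuming the definition of Yang et al., with $N_X$ denoting the nearest neighbor of $X$ in $(S_g \cup S_r) \setminus \{X\}$ and $\mathbb{1}[\cdot]$ the indicator function, the metric reads:

```latex
\text{1-NNA}(S_g, S_r) =
  \frac{\sum_{X \in S_g} \mathbb{1}[N_X \in S_g] + \sum_{Y \in S_r} \mathbb{1}[N_Y \in S_r]}
       {|S_g| + |S_r|}
```

This is the leave-one-out accuracy of a 1-nearest-neighbor classifier that tries to separate generated from reference shapes. A value close to 50% means the two sets are hard to tell apart, which is the desired outcome.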

Following Yang et al. , we can conclude that COV and MMD are potentially unreliable metrics to quantify point cloud generation performance and 1-NNA seems like a more suitable evaluation metric. Also the more recent and very relevant PVD follows this and uses 1-NNA as its primary evaluation metric. Note that also Jensen-Shannon Divergence (JSD) is sometimes used to quantify point cloud generation performance. However, it measures only the “average shape” similarity by marginalizing over all point clouds from the generated and reference set, respectively. This makes it an almost meaningless metric to quantify individual shape quality (see discussion in Yang et al. ).

In conclusion, we are following Yang et al. and Zhou et al. and use 1-NNA as our primary evaluation metric to quantify point cloud generation performance and we evaluate it generally both using CD and EMD distances, according to the following standard definitions:
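
Written out in the form used by Yang et al. (the squared norm in CD and the unsquared norm in EMD are assumptions carried over from that work):

```latex
\begin{aligned}
\mathrm{CD}(X, Y)  &= \sum_{x \in X} \min_{y \in Y} \|x - y\|_2^2
                    + \sum_{y \in Y} \min_{x \in X} \|x - y\|_2^2, \\
\mathrm{EMD}(X, Y) &= \min_{\gamma: X \to Y} \sum_{x \in X} \|x - \gamma(x)\|_2,
\end{aligned}
```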

where $\gamma$ is a bijection between point clouds $X$ and $Y$ with the same number of points. We use released code to compute CD (https://github.com/ThibaultGROUEIX/ChamferDistancePytorch, MIT License) and EMD (https://github.com/daerduoCarey/PyTorchEMD).
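
To make the three set-level metrics concrete, the following pure-Python sketch computes COV, MMD, and 1-NNA for small sets of point clouds. It is an illustration only and is unrelated to the GPU implementations linked above: the function names are hypothetical, `chamfer` uses the sum form with squared Euclidean norms (an assumption), and any other distance function, such as an EMD implementation, could be passed in as `dist`.

```python
def chamfer(x, y):
    """Sum-form Chamfer distance with squared Euclidean norms."""
    def sq(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return (sum(min(sq(p, q) for q in y) for p in x)
            + sum(min(sq(p, q) for p in x) for q in y))

def cov_mmd(gen, ref, dist=chamfer):
    """Coverage and minimum matching distance of gen with respect to ref."""
    d = [[dist(g, r) for r in ref] for g in gen]  # d[i][j] = D(gen_i, ref_j)
    # COV: fraction of reference shapes that are the nearest reference
    # shape of at least one generated shape.
    matched = {min(range(len(ref)), key=row.__getitem__) for row in d}
    cov = len(matched) / len(ref)
    # MMD: average distance from each reference shape to its closest sample.
    mmd = sum(min(d[i][j] for i in range(len(gen)))
              for j in range(len(ref))) / len(ref)
    return cov, mmd

def one_nna(gen, ref, dist=chamfer):
    """Leave-one-out 1-NN accuracy over the union of both sets."""
    shapes = list(gen) + list(ref)
    labels = [0] * len(gen) + [1] * len(ref)
    correct = 0
    for i, s in enumerate(shapes):
        nn = min((j for j in range(len(shapes)) if j != i),
                 key=lambda j: dist(s, shapes[j]))
        correct += labels[nn] == labels[i]
    return correct / len(shapes)
```

A practical implementation would precompute the full pairwise distance matrix once instead of calling `dist` repeatedly.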

Since COV and MMD are still widely used in the literature, though, we are also reporting COV and MMD for all our models in App. F, even though they may be unreliable as metrics for generation quality. Note that for the more meaningful 1-NNA metric, LION generally outperforms all baselines in all experiments.

For fair comparisons and to quantify LION’s performance in isolation without SAP-based mesh reconstruction, all metrics are computed directly on LION’s generated point clouds, not meshed outputs. However, we also do calculate generation performance after the SAP-based mesh reconstruction in a separate ablation study (see App. F.1.4). In those cases, we sample points from the SAP-generated surface to create the point clouds for evaluation metric calculation. Similarly, when calculating metrics for the IM-GAN baseline we sample points from the implicitly defined surfaces generated by IM-GAN. Analogously, for the GCA baseline we sample points from the generated voxels’ surfaces.

E.3 Details for Unconditional Generation

We list the hyperparameters used for training the unconditional generative LION models in Tab. 10. The hyperparameters are the same for both the single class model and many-class model. Notice that we do not perform any hyperparameter tuning on the many-class model, i.e., it is likely that the many-class LION can be further improved with some tuning of the hyperparameters.

When tuning the model for unconditional generation, we found that the dropout probability and the hidden dimension for the shape latent DDM prior have the largest impact on the model performance. The other hyperparameters, such as the size of the encoders and decoder, matter less.

E.4 Details for Voxel-guided Synthesis

Setup. We use a voxel size of 0.6 for both training and testing. During training, the training data (after normalization) are first voxelized, and the six faces of all voxels are collected. The faces that are shared by two or more voxels are discarded. To create point clouds from the voxels, we sample the voxels’ faces and then randomly sample points within the faces. In our experiments, 2,048 points are sampled from the voxel surfaces for each shape. We randomly sample a similar number of points at each face.
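
The voxel-to-point-cloud procedure above can be sketched in a few steps: voxelize, keep only the faces that are not shared with another occupied voxel, and sample points on the remaining faces. The code below is a minimal sketch under stated assumptions, not the implementation used in the experiments: all function names are hypothetical, voxels are assumed to be axis-aligned cells of an integer grid, and "a similar number of points at each face" is realized by cycling through the shuffled faces.

```python
import math
import random

# The six faces of a voxel, encoded as (axis, direction).
FACES = [(axis, sign) for axis in range(3) for sign in (-1, 1)]

def voxelize(points, size):
    """Return the set of integer voxel indices occupied by the points."""
    return {tuple(math.floor(c / size) for c in p) for p in points}

def boundary_faces(voxels):
    """Collect faces that are not shared with a neighboring occupied voxel."""
    faces = []
    for v in sorted(voxels):
        for axis, sign in FACES:
            neighbor = list(v)
            neighbor[axis] += sign
            if tuple(neighbor) not in voxels:
                faces.append((v, axis, sign))
    return faces

def sample_surface(faces, size, n_points, rng):
    """Spread n_points roughly evenly over the faces, uniformly within each face."""
    order = list(range(len(faces)))
    rng.shuffle(order)
    points = []
    for k in range(n_points):
        v, axis, sign = faces[order[k % len(faces)]]
        p = [(v[a] + rng.random()) * size for a in range(3)]
        # Pin the coordinate along the face normal to the face plane.
        p[axis] = (v[axis] + (1 if sign > 0 else 0)) * size
        points.append(tuple(p))
    return points
```

For example, `sample_surface(boundary_faces(voxelize(pts, 0.6)), 0.6, 2048, random.Random(0))` would produce 2,048 surface points for a point cloud `pts`, using the voxel size and point count stated above.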

Encoder Fine-Tuning. For encoder fine-tuning, we initialize the model weights from the LION model trained on the same categories with clean data. Both the shape latent encoder and the latent points encoder are fine-tuned on the voxel inputs, while the decoder and the latent DDMs are frozen. We set the maximum number of training epochs to 10,000 and perform early stopping when the reconstruction loss on the validation set reaches a minimum value. In our experiments, training is usually stopped early after around 500 epochs. For example, our models for the airplane, chair, and car categories are stopped at 129, 470, and 189 epochs, respectively. All other hyperparameters are the same as for the unconditional generation experiments. The training objective can be found in Eq. (13) and Eq. (15).

In Fig. 13 and Fig. 13, we report the reconstruction of input points and IOU of the voxels on the test set. We also evaluate the output shape quality by having the models encode and decode the whole training set, and compute the sample quality metrics with respect to the reference set.
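
As an illustration of the voxel IOU, the sketch below compares the sets of voxels occupied by two point clouds. How exactly the IOU is computed in the experiments is not spelled out here, so this is an assumption: both point clouds are voxelized on the same grid (voxel size 0.6, as in the setup above), and the function name `voxel_iou` is hypothetical.

```python
import math

def voxel_iou(points_a, points_b, size=0.6):
    """Intersection over union of the voxel sets occupied by two point clouds."""
    occ_a = {tuple(math.floor(c / size) for c in p) for p in points_a}
    occ_b = {tuple(math.floor(c / size) for c in p) for p in points_b}
    union = occ_a | occ_b
    # Two empty inputs are treated as identical.
    return len(occ_a & occ_b) / len(union) if union else 1.0
```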

Note that we also tried fine-tuning the encoder of the DPM baseline ; however, the results did not substantially change. Hence, we kept using standard DPM models.

Multimodal Generation. When performing multimodal generation for the voxel-guided synthesis experiments, we encode the voxel inputs into the shape latent $\mathbf{z}_0$ and the latent points $\mathbf{h}_0$, and run the forward diffusion process for a few steps to obtain their diffused versions. The diffused shape latent $\mathbf{z}_\tau$ is then denoised by the shape latent DDM. The diffused latent points $\mathbf{h}_\tau$ are denoised by the latent points DDM, conditioned on the shape latent generated by the shape latent DDM (also see App. C.1). The number of diffuse-denoise steps can be found in Figs. 11, 13, and 13.
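
The diffuse-denoise procedure can be sketched as below. This is a schematic under several assumptions, not LION's code: latents are flat lists of floats, the forward process is the standard DDPM closed form $q(\mathbf{x}_\tau|\mathbf{x}_0)=\mathcal{N}(\sqrt{\bar\alpha_\tau}\,\mathbf{x}_0,(1-\bar\alpha_\tau)\mathbf{I})$ with a user-supplied list `betas`, and `denoise_shape` and `denoise_points` are hypothetical callables standing in for running the two latent DDMs backwards from step $\tau$ to step 0.

```python
import math
import random

def alpha_bar(betas, tau):
    """Cumulative product of (1 - beta_t) over the first tau steps."""
    prod = 1.0
    for beta in betas[:tau]:
        prod *= 1.0 - beta
    return prod

def diffuse(x0, tau, betas, rng):
    """Draw x_tau from q(x_tau | x_0) in closed form."""
    a = alpha_bar(betas, tau)
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for v in x0]

def diffuse_denoise(z0, h0, tau, betas, denoise_shape, denoise_points, rng):
    """Perturb both latents for tau steps, then denoise them hierarchically."""
    z_tau = diffuse(z0, tau, betas, rng)
    h_tau = diffuse(h0, tau, betas, rng)
    z_new = denoise_shape(z_tau, tau)
    # The latent points are denoised conditioned on the newly generated shape latent.
    h_new = denoise_points(h_tau, tau, z_new)
    return z_new, h_new
```

Larger values of `tau` inject more noise and therefore allow outputs that deviate more from the encoded input, while `tau = 0` returns the encodings unchanged.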

E.5 Details for Denoising Experiments

Setup. We perturb the input data using different types of noise and show how well different methods denoise the inputs. The experimental setting for each noise type is listed below:

Normal Noise: for each coordinate of a point, we first sample the standard deviation value of the noise uniformly from 0 to 0.25; then, we perturb the point with the noise, sampled from a normal distribution with zero mean and the sampled standard deviation value.

Uniform Noise: for each coordinate of a point, we add noise sampled from the uniform distribution $U(0, 0.25)$.

Outlier Noise: for a shape consisting of $N$ points, we replace 50% of its points with points drawn uniformly from the 3D bounding box of the original shape. The remaining 50% of the points are kept at their original location.
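
The three perturbations listed above can be sketched as follows. This is a minimal illustration with hypothetical function names. It follows a literal reading of the descriptions: for normal noise, a separate standard deviation is drawn for every coordinate, which is an assumption about the intended granularity.

```python
import random

def normal_noise(points, rng, max_std=0.25):
    """Per coordinate: draw std ~ U(0, max_std), then add N(0, std^2) noise."""
    noisy = []
    for p in points:
        q = []
        for c in p:
            std = rng.uniform(0.0, max_std)
            q.append(c + rng.gauss(0.0, std))
        noisy.append(tuple(q))
    return noisy

def uniform_noise(points, rng, high=0.25):
    """Per coordinate: add noise drawn from U(0, high)."""
    return [tuple(c + rng.uniform(0.0, high) for c in p) for p in points]

def outlier_noise(points, rng, fraction=0.5):
    """Replace a fraction of the points by uniform samples from the bounding box."""
    lo = [min(p[a] for p in points) for a in range(3)]
    hi = [max(p[a] for p in points) for a in range(3)]
    noisy = list(points)
    for i in rng.sample(range(len(points)), int(fraction * len(points))):
        noisy[i] = tuple(rng.uniform(lo[a], hi[a]) for a in range(3))
    return noisy
```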

Similar to the encoder fine-tuning for voxel-guided synthesis (App. E.4), when fine-tuning LION’s encoder networks for the different denoising experiments, we freeze the latent DDMs and the decoder and only update the weights of the shape latent encoder and the latent points encoder. The maximum number of epochs is set to 4,000 and the training process is stopped early based on the reconstruction loss on the validation set. The other hyperparameters are the same as for the unconditional generation experiments. To get different generations from the same noisy inputs, we again diffuse and denoise in the latent space. The operations are the same as for the multimodal generation during voxel-guided synthesis (App. E.4).

E.6 Details for Fine-tuning SAP on LION

Training the Original SAP. We first train the SAP model on the clean data with normal noise injected, following the practice in SAP . We set the standard deviation of the noise to 0.005.

Data Preparation. The training data for SAP fine-tuning is obtained by having LION encode the whole training set, diffuse and denoise in the latent space for some steps, and then decode the point cloud using the decoder. We ablate the number of steps for the diffuse-denoise process in App. F.1.4. In our experiments, we randomly sample the number of steps from $\{20, 30, 35, 40, 50\}$. The number of points used in this preparation process is 3,000, since the SAP model takes 3,000 points as input (as LION is constructed only from PointNet-based and convolutional networks, it can be run with any number of points). To prevent SAP from overfitting to the sampled points, we generate 4 different samples for each shape, with the same number of diffuse-denoise steps. During fine-tuning, SAP randomly draws one sample as input.
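
A schematic of this data preparation is given below. It is a sketch only: `diffuse_denoise_decode` is a hypothetical callable standing in for LION's encode, diffuse-denoise, and decode pipeline, the dictionary layout is made up for illustration, and drawing one step count per shape is one possible reading of the description.

```python
import random

STEP_CHOICES = (20, 30, 35, 40, 50)  # diffuse-denoise step counts listed in the text
SAMPLES_PER_SHAPE = 4                # samples generated per training shape
NUM_POINTS = 3000                    # number of input points expected by SAP

def build_finetune_set(shape_ids, diffuse_denoise_decode, rng):
    """Generate several LION outputs per shape, all with the same step count."""
    dataset = []
    for shape_id in shape_ids:
        steps = rng.choice(STEP_CHOICES)
        samples = [diffuse_denoise_decode(shape_id, steps, NUM_POINTS)
                   for _ in range(SAMPLES_PER_SHAPE)]
        dataset.append({"shape_id": shape_id, "steps": steps, "inputs": samples})
    return dataset

def draw_training_input(entry, rng):
    """During fine-tuning, one of the stored samples is drawn at random."""
    return rng.choice(entry["inputs"])
```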

Fine-Tuning. When fine-tuning SAP, we use the same learning rate, batch size, and other hyperparameters as during training of the original SAP model, except that we change the input and reduce the maximum number of epochs to 1,000.

E.7 Training Times

For single-class LION models, the total training time is $\approx 550$ GPU hours ($\approx 110$ GPU hours for training the backbone VAE; $\approx 440$ GPU hours for training the two latent diffusion models). Sampling time analyses can be found in App. F.9.

E.8 Used Codebases

Here, we list all external codebases and datasets we use in our project.

To compare to baselines, we use the following codes:

r-GAN, l-GAN : https://github.com/optas/latent_3d_points (MIT License)

PointFlow : https://github.com/stevenygd/PointFlow (MIT License)

SoftFlow : https://github.com/ANLGBOY/SoftFlow

Set-VAE : https://github.com/jw9730/setvae (MIT License)

DPF-NET : https://github.com/Regenerator/dpf-nets

DPM : https://github.com/luost26/diffusion-point-cloud (MIT License)

PVD : https://github.com/alexzhou907/PVD (MIT License)

ShapeGF : https://github.com/RuojinCai/ShapeGF (MIT License)

SP-GAN : https://github.com/liruihui/sp-gan (MIT License)

PDGN : https://github.com/fpthink/PDGN (MIT License)

IM-GAN : https://github.com/czq142857/implicit-decoder (MIT license) and https://github.com/czq142857/IM-NET-pytorch (MIT license)

GCA : https://github.com/96lives/gca (MIT license)

We use further codebases in other places:

We use the Mitsuba renderer for visualizations : https://github.com/mitsuba-renderer/mitsuba2 (License: https://github.com/mitsuba-renderer/mitsuba2/blob/master/LICENSE), and the code to generate the scene description files for Mitsuba : https://github.com/zekunhao1995/PointFlowRenderer.

We rely on SAP for mesh generation with the code at https://github.com/autonomousvision/shape_as_points (MIT License).

For calculating the evaluation metrics, we use the implementation for CD at https://github.com/ThibaultGROUEIX/ChamferDistancePytorch (MIT License) and for EMD at https://github.com/daerduoCarey/PyTorchEMD.

We use Text2Mesh for per-sample text-driven texture synthesis: https://github.com/threedle/text2mesh (MIT License)

ShapeNet . Its terms of use can be found at https://shapenet.org/terms.

The Cars dataset from http://ai.stanford.edu/~jkrause/cars/car_dataset.html with ImageNet License: https://image-net.org/download.php.

The TurboSquid data repository, https://www.turbosquid.com. We obtained a custom license from TurboSquid to use this data.

Redwood 3DScan Dataset : https://github.com/isl-org/redwood-3dscan (Public Domain)

Pix3D : https://github.com/xingyuansun/pix3d. (Creative Commons Attribution 4.0 International License).

E.9 Computational Resources

The total amount of compute used in this research project is roughly 340,000 GPU hours. We used an in-house GPU cluster of V100 NVIDIA GPUs.

Appendix F Additional Experimental Results

In App. F.1.1, we present an ablation study on LION’s hierarchical architecture.

In App. F.1.2, we present an ablation study on the point cloud processing backbone neural network architecture.

In App. F.1.3, we present an ablation study on the extra dimensions of the latent points.

In App. F.1.4, we show an ablation study on the number of diffuse-denoise steps used during SAP fine-tuning.

In App. F.2, we provide additional experimental results on single-class unconditional generation. We show MMD and COV metrics, and also incorporate additional baselines in the extended tables. Furthermore, in App. F.2.1 we visualize additional samples from the LION models.

In App. F.3, we provide additional experimental results for the 13-class unconditional generation LION model. In App. F.3.1 we show more samples from our many-class LION model. Additionally, in App. F.3.2 we analyze LION’s shape latent space via a two-dimensional t-SNE projection .

In App. F.4, we provide additional experimental results for the 55-class unconditional generation LION model.

In App. F.5, we provide additional experimental results for the LION models trained on ShapeNet’s Mug and Bottle classes.

In App. F.6, we provide additional experimental results for the LION model trained on 3D animal shapes.

In App. F.7, we provide additional results on voxel-guided synthesis and denoising for the chair and car categories.

In App. F.8, we quantify LION’s autoencoding performance and compare to various baselines, which we all outperform.

In App. F.9, we provide additional results on significantly accelerated DDIM-based synthesis in LION .

In App. F.10, we use Text2Mesh to generate textures based on text prompts for synthesized LION samples.

In App. F.11, we condition LION on CLIP embeddings of the shapes’ rendered images, following CLIP-Forge . This allows us to perform text-driven 3D shape generation and single view 3D reconstruction.

In App. F.12, we demonstrate more shape interpolations using the three single-class and also the 13-class LION models and we also show shape interpolations of the relevant PVD and DPM baselines.

F.1.1 Ablation Study on LION’s Hierarchical Architecture

We perform an ablation experiment with the car category over the different components of LION’s architecture. We consider three settings:

LION model without shape latents. But it still has latent points and a corresponding latent points DDM prior.

LION model without latent points. But it still has the shape latents and a corresponding shape latent DDM.

LION model without any latent variables at all, i.e., a DDM operates on the point clouds directly (this is somewhat similar to PVD ).

When simply dropping the different architecture components, the model “loses” parameters. Hence, a decrease in performance could also simply be due to the smaller model rather than an inferior architecture. Therefore, we also increase the model sizes in the above ablation study (by scaling up the channel dimensions of all networks), such that all models have approximately the same number of parameters as our main LION model that has all components. The results on the car category can be found in Tab. 11. The results show that the full LION setup with both shape latents and latent points performs best on all metrics, sometimes by a large margin. Furthermore, for the models with no or only one type of latent variables, increasing model size does not compensate for the loss of performance due to the different architectures. This ablation study demonstrates the unique advantage of the hierarchical setup with both shape latent variables and latent points, and two latent DDMs. We believe that the different latent variables complement each other—the shape latent variables model overall global shape, while the latent points capture details. This interpretation is supported by the experiments in which we keep the shape latent fixed and only observe small shape variations due to different local point latent configurations (Sec. 5.2 and Fig. 8).

F.1.2 Ablation Study on the Backbone Point Cloud Processing Network Architecture

We ablate different point cloud processing neural network architectures used for implementing LION’s encoder, decoder and the latent points prior. Results are shown in Tab. 12 and Tab. 13, using the LION model on the car category as in the other ablation studies. We choose three different popular backbones used in the point cloud processing literature: Point-Voxel CNN (PVCNN) , Dynamic Graph CNN (DGCNN) and PointTransformer . For the ablation on the encoder and decoder backbones, we train LION’s VAE model (without prior) with different backbones, and compare the reconstruction performance for different backbones. We select the PVCNN as it provides the strongest performance (Tab. 12). For the ablation on the prior backbone, we first train the VAE model with the PVCNN architecture, as in all of our main experiments, and then train the prior with different backbones and compare the generation performance. Again, PVCNN performs best as network to implement the latent points diffusion model (Tab. 13). In conclusion, these experiments support choosing PVCNN as our point cloud neural network backbone architecture for implementing LION.

Note that all ablations were run with similar hyperparameters and the neural networks were generally set up in such a way that different architectures consumed the same GPU memory.

F.1.3 Ablation Study on Extra Dimensions for Latent Points

Next, we ablate the extra dimension $D_{\mathbf{h}}$ for the latent points in Tab. 14, again using LION models on the car category. We see that $D_{\mathbf{h}}=1$ provides the overall best performance. With a relatively large number of extra dimensions, the 1-NNA scores generally get worse. We use $D_{\mathbf{h}}=1$ for all other experiments.

F.1.4 Ablation Study on SAP Fine-Tuning

After applying SAP to extract meshes from the generated point clouds, it is possible to again sample points from the meshed surface and evaluate the points’ quality with the generation metrics that we used for unconditional generation. We call this process resampling.
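
One common way to sample points from a triangle mesh is area-weighted triangle selection followed by uniform barycentric sampling. The sketch below illustrates this technique. The sampling procedure used for the resampling experiments is not specified here, so this is an assumption, and the function names are hypothetical.

```python
import math
import random

def triangle_area(a, b, c):
    """Area of the triangle (a, b, c) from the norm of the edge cross product."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    cross = (u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
    return 0.5 * math.sqrt(sum(x * x for x in cross))

def sample_mesh(vertices, faces, n_points, rng):
    """Sample points uniformly with respect to surface area."""
    tris = [(vertices[i], vertices[j], vertices[k]) for i, j, k in faces]
    areas = [triangle_area(*t) for t in tris]
    points = []
    # Larger triangles are chosen proportionally more often.
    for a, b, c in rng.choices(tris, weights=areas, k=n_points):
        r1, r2 = rng.random(), rng.random()
        s = math.sqrt(r1)
        # Barycentric weights that are uniform over the triangle.
        w = (1.0 - s, s * (1.0 - r2), s * r2)
        points.append(tuple(w[0] * a[i] + w[1] * b[i] + w[2] * c[i]
                            for i in range(3)))
    return points
```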

See Tab. 15 for an ablation over the results of resampling from SAP with or without fine-tuning. It also contains the ablation over different numbers of diffuse-denoise steps used to generate the training data for the SAP fine-tuning. Without fine-tuning, the reconstructed mesh has slightly lower quality according to 1-NNA, presumably since the noise within the generated points is different from the noise which the SAP model is trained on. For the “mixed” number of steps entry in the table, SAP randomly chooses one number of diffuse-denoise steps from the above five values at each iteration when producing the training shapes. This setting tends to give an overall good sample quality in terms of the 1-NNA evaluation metrics. We use this setting in all experiments.

To visually demonstrate the improvement of SAP’s mesh reconstruction performance with and without fine-tuning, we show the reconstructed meshes before and after finetuning in Fig. 20. The original SAP is trained with clean point clouds augmented with small Gaussian noise. As a result, SAP can handle small scale Gaussian noise in the point clouds. However, it is less robust to the generated points where the noise is different from the Gaussian noise which SAP is trained with. With our proposed fine-tuning, SAP produces smoother surfaces and becomes more robust to the noise distribution in the point clouds generated by LION.

F.2 Single-Class Unconditional Generation

For our three single-class LION models, we show the full evaluation metrics for different dataset splits, and different data normalizations, in Tab. 16, Tab. 17 and Tab. 18. Under all settings and datasets, LION achieves state-of-the-art performance on the 1-NNA metrics, and is competitive on the MMD and COV metrics, which, however, can be unreliable with respect to quality (see discussion in App. E.2).

More visualizations of the generated shapes from the LION models trained on airplane, chair and car classes can be found in Fig. 32, Fig. 33 and Fig. 34. LION is generating high quality samples with high diversity. We visualize both point clouds and meshes generated with the SAP that is fine-tuned on the VAE-encoded training set.

F.3 Unconditional Generation of 13 ShapeNet Classes

See Tab. 19 for the evaluation metrics of the sample quality of LION and other baselines, trained on the 13-class dataset. To evaluate the models, we sub-sample 1,000 shapes from the reference set and sample 1,000 shapes from the models. We can see that LION is better than all baselines under this challenging setting. The results are also consistent with our observations on the single-class models. For baseline comparisons, we picked PVD and DPM , because they are also DDM-based and most relevant. We also picked TreeGAN , as it is also trained on diverse data in their original paper, and DPF-Net , as it represents a modern competitive flow-based method that we could train relatively quickly. We did not run all other baselines that we ran for the single-class models due to limited compute resources.

See Fig. 35 for more visualizations of the generated shapes from LION trained on the 13-class data. We visualize both point clouds and meshes generated with the SAP that is fine-tuned on the VAE-encoded training set. LION is again able to generate diverse and high quality shapes even when training in the challenging 13-class setting. We also show in Fig. 36 additional samples from the 13-class LION model with fixed shape latent variables, where only the latent points are sampled, similar to the experiments in Sec. 5.2 and Fig. 8. We again see that the shape latent variables seem to capture overall shape, while the latent points are responsible for generating different details.

F.3.2 Shape Latent Space Visualization

We project the shape latent variables learned by LION’s 13-classes VAE into the 2D plane and create a t-SNE plot in Fig. 37. It can be seen that many categories are separated, such as the rifle, car, watercraft, airplane, telephone, lamp, and display classes. The other categories that are hard to distinguish such as bench and table are mixing a bit, which is also reasonable. This indicates that LION’s shape latent is learning to represent the category information, presumably capturing overall shape information, as also supported by our experiments in Sec. 5.2 and Fig. 8. Potentially, this also means that the representation learnt by the shape latents could be leveraged for downstream tasks, such as shape classification, similar to Luo and Hu .

F.4 Unconditional Generation of all 55 ShapeNet Classes

We train a LION model jointly without any class conditioning on all 55 different categories from ShapeNet (the 55 classes are airplane, bag, basket, bathtub, bed, bench, birdhouse, bookshelf, bottle, bowl, bus, cabinet, camera, can, cap, car, cellphone, chair, clock, dishwasher, earphone, faucet, file, guitar, helmet, jar, keyboard, knife, lamp, laptop, mailbox, microphone, microwave, monitor, motorcycle, mug, piano, pillow, pistol, pot, printer, remote control, rifle, rocket, skateboard, sofa, speaker, stove, table, telephone, tin can, tower, train, vessel, washer). The total number of training shapes is 35,708. Training a single model without conditioning over such a large number of categories is challenging, as the data distribution is highly complex and multimodal. Note that we deliberately did not use class-conditioning, in order to explore LION’s scalability to such complex and multimodal datasets. Furthermore, the number of training samples across different categories is imbalanced in this setting: 15 categories have fewer than 100 training samples and 5 categories have more than 2,000 training samples. We adopt the same model hyperparameters as for the single class LION models here without any tuning.

We show LION’s generated samples in Fig. 21: LION synthesizes high-quality and diverse shapes. It can even generate samples from the cap class, which contributes with only 39 training samples, indicating that LION has an excellent mode coverage that even includes the very rare classes. Note that we did not train an SAP model on the 55 classes data. Hence, we only show the generated point clouds in Fig. 21.

This experiment is run primarily as a qualitative scalability test of LION and due to limited compute resources, we did not train baselines here. Furthermore, to the best of our knowledge no previous 3D shape generative models have demonstrated satisfactory generation performance for such diverse and multimodal 3D data without relying on conditioning information. That said, to make sure future works can compare to LION, we report the generation performance over 1,000 samples in Tab. 20. We would like to emphasize that hyperparameter tuning and using larger LION models with more parameters will almost certainly significantly improve the results even further. We simply used the single-class training settings out of the box.

F.5 Unconditional Generation of ShapeNet’s Mug and Bottle Classes

Next, we explore whether LION can also be trained successfully on very small datasets. To this end, we train LION on the Mug and Bottle classes in ShapeNet. The number of training samples is 149 and 340, respectively, which is much smaller than for common classes like chair, car and airplane. All the hyperparameters are the same as for the models trained on single classes. We show generated shapes in Fig. 22 and Fig. 23 (to extract meshes from the generated point clouds, for convenience we are using the SAP model that was trained for the 13-class LION experiment). We find that LION is also able to generate correct mugs and bottles in this very small training data setting. We report the performance of the generated samples in Tab. 21, such that future work can compare to LION on this task.

F.6 Unconditional Generation of Animal Shapes

Furthermore, we also train LION on 553 animal assets from the TurboSquid data repository (https://www.turbosquid.com; we obtained a custom license from TurboSquid to use this data). The animal data includes shapes of cats, bears, goats, etc. All the hyperparameters are again the same as for the models trained on single classes. See Fig. 24 for visualizations of the generated shapes from LION trained on the animal data. We visualize both point clouds and meshes. For simplicity, the meshes are generated again with the SAP model that was trained on the ShapeNet 13-classes data. LION is again able to generate diverse and high quality shapes even when training in the challenging low-data setting.

F.7 Voxel-guided Synthesis and Denoising

We additionally provide the results for voxel-guided synthesis and denoising experiments on the chair and car categories. In Fig. 26 and Fig. 28, we show the reconstruction metrics for different types of input: voxelized input, input with outlier noise, input with uniform noise, and input with normal noise. LION outperforms the other two baselines (PVD and DPM), especially for the voxelized input and the input with outliers, similar to the results presented in the main paper on the airplane class (Sec. 5.4). In Fig. 26 and Fig. 28, we show the output quality metrics and the voxel IOU for voxel-guided synthesis on the chair and car categories, respectively. LION achieves high output quality while obeying the voxel input constraint well.

More on Multimodal Visualization. In Fig. 30, we show visualizations of multimodal voxel-guided synthesis on different classes. As discussed, we generate various plausible shapes using different numbers of diffuse-denoise steps. We show two different plausible shapes (with the corresponding latent points, and reconstructed meshes) given the same input at each row under different settings. LION is able to capture the structure indicated by the voxel grids: the shapes obey the voxel grid constraints. For example, the tail of the airplane, the legs of the chair, and the back part of the car are consistent with the input. Meanwhile, LION generates diverse and reasonable details in the output shapes.

See Fig. 29 for denoising experiments, with comparisons to other baselines. In Fig. 31, we also show the visualizations for different classes. LION handles different types of input noises and generates reasonable and diverse details given the same input. See the car examples in the first column for the normal noise, uniform noise and the outlier noise.

Notice that we applied the SAP model here only for visualizing the meshed output shapes. The SAP model is not fine-tuned on voxel or noisy input data. This is potentially one reason why some reconstructed meshes do not have high quality.

We additionally compare LION’s performance on voxel-guided synthesis to Deep Marching Tetrahedra (DMTet) on the airplane category (see Tab. 22). We train and evaluate DMTet with the same data as was used in our voxel-guided shape synthesis experiments (see Sec. 5.4 and Apps. C.2 and E.4). To compute the evaluation metrics on the DMTet output, we randomly sample points on DMTet’s output meshes. LION achieves reconstruction results of similar or slightly better quality than DMTet. However, note that DMTet was specifically designed for such reconstruction tasks and is not a general generative model that could synthesize novel shapes from scratch without any guidance signal, unlike LION, which is a highly versatile general 3D generative model. Furthermore, as we demonstrated in the main paper, LION can generate multiple plausible de-voxelized shapes, while DMTet is fully deterministic and can only generate a single reconstruction.

F.8 Autoencoding

We report the auto-encoding performance of LION and other baselines in Tab. 23 for single-class models. We are calculating the reconstruction performance of LION’s VAE component. Additional results for the LION model trained on many classes can be found in Tab. 24. LION achieves much better reconstruction performance compared to all other baselines. The hierarchical latent space is expressive enough for the model to perform high quality reconstruction. At the same time, as we have shown above, LION also achieves state-of-the-art generation quality. Moreover, as shown in App. F.3.2, its shape latent space is also still semantically meaningful.

F.9 Synthesis Time and DDIM Sampling

Our main results in the paper are all generated using standard 1,000-step DDPM-based ancestral sampling (see Sec. 2). Generating a point cloud sample (with 2,048 points) from LION takes $\approx 27.12$ seconds, where $\approx 4.04$ seconds are used in the shape latent diffusion model and $\approx 23.05$ seconds in the latent points diffusion model. Optionally running SAP for mesh reconstruction requires an additional $\approx 2.57$ seconds.

A simple and popular way to accelerate sampling in diffusion models is based on the Denoising Diffusion Implicit Models (DDIM) framework. We show DDIM-sampled shapes in Fig. 38 and the generation performance in Tab. 25. For all DDIM sampling, we use $\eta=0.5$ as stochasticity hyperparameter and the quadratic time schedule as also proposed in DDIM . We also tried deterministic DDIM sampling, but it performed worse (for 50-step sampling). We find that we can produce good-looking shapes in under one second with only 25 synthesis steps. Performance significantly degrades when using $\leq 10$ steps.
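
The two DDIM ingredients mentioned above can be sketched as follows. This is an illustration with hypothetical function names, not the sampler used for the reported results: the quadratic schedule is written as $\tau_i = \lfloor c\, i^2 \rfloor$ with $c$ chosen so that the last step is the final training step, which is one possible instantiation, and the noise scale follows the standard DDIM expression in which $\eta=0$ is deterministic and $\eta=1$ recovers DDPM-like ancestral sampling.

```python
import math

def quadratic_timesteps(num_train_steps, num_sample_steps):
    """Sub-sequence of timesteps that is denser at small t (quadratic spacing)."""
    if num_sample_steps < 2:
        return [num_train_steps - 1]
    last = num_sample_steps - 1
    # Integer arithmetic avoids rounding issues at the final step.
    steps = {(num_train_steps - 1) * i * i // (last * last)
             for i in range(num_sample_steps)}
    return sorted(steps)  # duplicates, if any, are merged

def ddim_sigma(alpha_bar_t, alpha_bar_prev, eta):
    """Standard deviation of the noise injected in one DDIM update."""
    return (eta
            * math.sqrt((1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t))
            * math.sqrt(1.0 - alpha_bar_t / alpha_bar_prev))
```

With 1,000 training steps, `quadratic_timesteps(1000, 25)` selects 25 of them for the accelerated sampler, and `ddim_sigma(..., eta=0.5)` lies halfway between the deterministic and the fully stochastic update.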

F.10 Per-sample Text-driven Texture Synthesis

To demonstrate the value to artists of being able to synthesize meshes and not just point clouds, we consider a downstream application: We apply Text2Mesh (https://github.com/threedle/text2mesh) on some generated meshes from LION to additionally synthesize textures in a text-driven manner, leveraging CLIP . Optionally, Text2Mesh can also locally refine the mesh and displace vertices for enhanced visual effects. See Fig. 39 for results where we show different objects made of snow and potato chips, respectively. In Fig. 40, we apply different text prompts on the same generated airplane. We show more diverse results on other categories in Fig. 41. Note that this is only possible because of our SAP-based mesh reconstruction.

F.11 Single View Reconstruction and Text-driven Shape Synthesis

Although our main goal in this work was to develop a strong generative model of 3D shapes, here we qualitatively show how to extend LION to also allow for single view reconstruction (SVR) from RGB data. We rendered 2D images from the 3D ShapeNet shapes, extracted the images’ CLIP image embeddings, and trained LION’s latent diffusion models while conditioning on the shapes’ CLIP image embeddings. At test time, we then take a single view 2D image, extract the CLIP image embedding, and generate corresponding 3D shapes, thereby effectively performing SVR. We show SVR results from real RGB data in Fig. 42, Fig. 45 and Fig. 46. The RGB images of the chairs are from Pix3D (downloaded from https://github.com/xingyuansun/pix3d; the Pix3D dataset is licensed under a Creative Commons Attribution 4.0 International License) and the Redwood 3DScan dataset (public domain; downloaded from https://github.com/isl-org/redwood-3dscan), respectively. The RGB images of the cars are from the Cars dataset (downloaded from http://ai.stanford.edu/~jkrause/cars/car_dataset.html; the Cars dataset is licensed under the ImageNet License: https://image-net.org/download.php). For each input image, LION is able to generate different feasible shapes, showing LION’s ability to perform multi-modal generation. Qualitatively, our results appear to be of similar quality as the results of PVD for that task, and at least as good as or better than the results of AutoSDF. Note that this approach only requires RGB images. In contrast, PVD requires RGB-D images, including depth. Hence, our approach can be considered more flexible. Using CLIP’s text encoder, our method additionally allows for text-guided generation as demonstrated in Fig. 43 and Fig. 44. Overall, this approach is inspired by CLIP-Forge. Note that this is a simple qualitative demonstration of LION’s extensibility. We did not perform any hyperparameter tuning here and believe that these results could be improved with more careful tuning and training.

F.12 More Shape Interpolations

We show more shape interpolation results for single-class LION models in Figs. 47, 48, 49, and the many-class LION model in Figs. 50, 51. We can see that LION is able to interpolate two shapes from different classes smoothly. For example, when it tries to interpolate a chair and a table, it starts to make the chair wider and wider, and gradually removes the back of the chair. When it tries to interpolate an airplane and a chair, it starts with making the wings more chair-like, and reduces the size of the rest of the body. The shapes in the middle of the interpolation provide a smooth and reasonable transition.

To be able to better judge the performance of LION’s shape interpolation results, we now also show shape interpolations with PVD and DPM in Fig. 52 and Fig. 53, respectively. We apply the spherical interpolation (see Sec. C.3) on the noise inputs for both PVD and DPM. DPM leverages a Normalizing Flow, which already offers deterministic generation given the noise inputs of the Flow’s normal prior. For PVD, just like for LION, we again use the diffusion model’s ODE formulation to obtain deterministic generation paths. In other words, to avoid confusion, in both cases we are interpolating in the normal prior distribution, just like for LION.
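As an illustration of interpolating in the normal prior, the sketch below implements the common great-circle form of spherical interpolation between two noise tensors. The exact formula of Sec. C.3 is not reproduced in this section and may differ from this variant; the function name `slerp` and the fallback threshold are assumptions.

```python
import numpy as np

def slerp(z0, z1, t):
    """Great-circle interpolation between two noise tensors of equal shape.

    t = 0 returns z0 and t = 1 returns z1. The angle is measured between
    the flattened tensors; nearly parallel inputs fall back to linear
    interpolation to avoid dividing by sin(theta) close to zero.
    """
    a, b = z0.ravel(), z1.ravel()
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if theta < 1e-6:
        return (1.0 - t) * z0 + t * z1
    return (np.sin((1.0 - t) * theta) * z0 + np.sin(t * theta) * z1) / np.sin(theta)

# Example: interpolate between two Gaussian noise inputs for a 2,048-point cloud.
rng = np.random.default_rng(0)
z_a = rng.standard_normal((2048, 3))
z_b = rng.standard_normal((2048, 3))
path = [slerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, 9)]
```

Each tensor in `path` would then be passed through the deterministic generation path of the respective model (the ODE formulation for diffusion models, or the flow itself for DPM) to obtain the interpolated shapes.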

Although PVD is also able to interpolate between two shapes, the transition from the source shapes to the target shapes appears less smooth than for LION; see, for example, the chair interpolation results of PVD. Furthermore, DPM’s generated shape interpolations appear fairly noisy. When interpolating very different shapes using the 13-class models, both PVD and DPM essentially break down and no longer produce sensible outputs. All shapes along the interpolation paths appear noisy.

In contrast, LION generally produces coherent interpolations, even when using the multimodal model that was trained on 13 ShapeNet classes (see Figs. 47, 48, 49 and 50 for LION interpolations for reference).