CvxNet: Learnable Convex Decomposition

Boyang Deng, Kyle Genova, Soroosh Yazdani, Sofien Bouaziz, Geoffrey Hinton, Andrea Tagliasacchi

Introduction

While images admit a standard representation in the form of a scalar function uniformly discretized on a grid, the curse of dimensionality has prevented the effective usage of analogous representations for learning 3D geometry. Voxel representations have shown some promise at low resolution , while hierarchical representations have attempted to reduce the memory footprint required for training , but at the significant cost of complex implementations. Rather than representing the volume occupied by a 3D object, one can resort to modeling its surface via a collection of points , polygons , or surface patches . Alternatively, one might follow Cezanne’s advice and “treat nature by means of the cylinder, the sphere, the cone, everything brought into proper perspective”, and think to approximate 3D geometry as geons – collections of simple to interpret geometric primitives , and their composition . Hence, one might rightfully start wondering “why so many representations of 3D data exist, and why would one be more advantageous than the other?” One observation is that multiple equivalent representations of 3D geometry exist because real-world applications need to perform different operations and queries on this data ( [9, Ch.1]). For example, in computer graphics, points and polygons allow for very efficient rendering on GPUs, while volumes allow artists to sculpt geometry without having to worry about tessellation or assembling geometry by smooth composition , while primitives enable highly efficient collision detection and resolution . In computer vision and robotics, analogous trade-offs exist: surface models are essential for the construction of low-dimensional parametric templates essential for tracking , volumetric representations are key to capturing 3D data whose topology is unknown , while part-based models provide a natural decomposition of an object into its semantic components. Part-based models create a representation useful to reason about extent, mass, contact, … quantities that are key to describing the scene, and planning motions .

In this paper, we propose a novel representation for geometry based on primitive decomposition. The representation is parsimonious, as we approximate geometry via a small number of convex elements, while we seek to allow low-dimensional representation to be automatically inferred from data – without any human supervision. More specifically, inspired by recent works we train our pipeline in an unsupervised manner: predicting the primitive configuration as well as their parameters by checking whether the reconstructed geometry matches the geometry of the target. We note how we inherit a number of interesting properties from several of the aforementioned representations. As it is part-based it is naturally locally supported, and by training on a shape collection, parts have a semantic association (i.e. the same element is used to represent the backs of chairs). Although part-based, each of them is not restricted to belong to the class of boxes , ellipsoids , or sphere-meshes , but to the more general class of convexes. As a convex is defined by a collection of half-space constraints, it can be simultaneously decoded into an explicit (polygonal mesh), as well as implicit (indicator function) representation. Because our encoder decomposes geometry into convexes, it is immediately usable in any application requiring real-time physics simulation, as collision resolution between convexes is efficiently decided by GJK (Figure 1). Finally, parts can interact via structuring to generate smooth blending between parts.

Related works

One of the simplest high-dimensional representations is voxels, and they are the most commonly used representation for discriminative models, due to their similarity to image based convolutions. Voxels have also been used successfully for generative models . However, the memory requirements of voxels makes them unsuitable for resolutions larger than $64^{3}$ . One can reduce the memory consumption significantly by using octrees that take advantage of the sparsity of voxels . This can extend the resolution to $512^{3}$ , for instance, but comes at the cost of more complicated implementation.

In computer graphics, polygonal meshes are the standard representation of 3D objects. Meshes have also been considered for discriminative classification by applying graph convolutions to the mesh . Recently, meshes have also been considered as the output of a network . A key weakness of these models is the fact that they may produce self-intersecting meshes. Another natural high-dimensional representation that has garnered some traction in vision is the point cloud representation. Point clouds are the natural representation of objects if one is using sensors such as depth cameras or LiDAR, and they require far less memory than voxels. Qi et al. used point clouds as a representation for discriminative deep learning tasks. Hoppe et al. used point clouds for surface mesh reconstruction (see also for a survey of other techniques). Fan et. al. and Lin et. al. used point clouds for 3D reconstruction using deep learning. However, these approaches require additional non-trivial post-processing steps to generate the final 3D mesh.

Primitives

Far more common is to approximate the input shape by set of volumetric primitives. With this perspective in mind, representing shapes as voxels will be a special case, where the primitives are unit cubes in a lattice. Another fundamental way to describe 3D shapes is via Constructive Solid Geometry . Sherma et. al. presents a model that will output a program (i.e. set of Boolean operations on shape primitives) that generate the input image or shape. In general, this is a fairly difficult task. Some of the classical primitives used in graphics and computer vision are blocks world , generalized cylinders , geons , and even Lego pieces . In , a deep CNN is used to interpret a shape as a union of simple rectangular prisms. They also note that their model provides a consistent parsing across shapes (i.e. the head is captured by the same primitive), allowing some interpretability of the output. In , they extended cuboids to superquadrics, showing that the extra flexibility will result in better reconstructions.

Implicit surfaces

If one generalizes the shape primitives to analytic surfaces (i.e. level sets of analytic functions), then new analytic tools become available for generating shapes. In , for instance, they train a model to discriminate inside coordinates from outside coordinates (referred to as an occupancy function in the paper, and as an indicator function in the graphics community). Park et. al. used the signed distance function to the surface of the shape to achieve the same goal. One disadvantage of the implicit description of the shape is that most of the interpretability is missing from the final answer. In , they take a more geometric approach and restrict to level sets of axis-aligned Gaussians. Partly due to the restrictions of these functions, their representation struggles on shapes with angled parts, but they do recover the interpretability that offers.

Convex decomposition

In graphics, a common method to represent shapes is to describe them as a collection of convex objects. Several methods for convex decomposition of meshes have been proposed . In machine learning, however, we only find early attempts to approach convex hull computation via neural networks . Splitting the meshes into exactly convexes generally produces too many pieces . As such, it is more prudent to seek small number of convexes that approximate the input shape . Recently also extended convex decomposition to the spatio-temporal domain, by considering moving geometry. Our method is most related to and , in that we train an occupancy function. However, we choose our space of functions so that their level sets are approximately convex, and use these as building blocks.

Method – CvxNet

We denote the collection of hyperplane parameters as $\mathbf{h}=\{(\mathbf{n}_{h},d_{h})\}$ , and the overall set of parameters for a convex as ${\boldsymbol{\beta}}=[\mathbf{h},\sigma]$ . We treat $\sigma$ as a hyperparameter, and consider the rest as the learnable parameters of our representation. As illustrated in Figure 2, the parameter $\delta$ controls the smoothness of the generated convex, while $\sigma$ controls the sharpness of the transition of the indicator function. Similar to the smooth maximum function, the soft classification boundary created by Sigmoid facilitates training.

In summary, given a collection of hyperplane parameters, this differentiable module generates a function that can be evaluated at any position $\mathbf{x}$ .

2 Convex encoder/decoder – Figure 3

A sufficiently large set of hyperplanes can represent any convex object, but one may ask whether it would be possible to discover some form of correlation between their parameters. Towards this goal, we employ an auto-encoder architecture illustrated in Figure 3. Given an input, the encoder $\mathcal{E}$ derives a bottleneck representation $\boldsymbol{\lambda}$ from the input. Then, a decoder $\mathcal{D}$ derives the collection of hyperplane parameters. While in theory permuting the $H$ hyperplanes generates the same convex, the decoder $\mathcal{D}$ correlates a particular hyperplane with a corresponding orientation. This is visible in Figure 4, where we color-code different 2D hyperplanes and indicate their orientation distribution in a simple 2D auto-encoding task for a collection of axis-aligned ellipsoids. As ellipsoids and oriented cuboids are convexes, we argue that the architecture in Figure 3 allows us to generalize the core geometric primitives proposed in VP and SIF ; we verify this claim in Figure 5.

3 Explicit interpretation – Figure 6

What is significantly different from other methods that employ indicator functions as trainable representations of 3D geometry, is that convexes generated by our network admit an explicit interpretation: they can be easily converted into polygonal meshes. This is in striking contrast to , where a computationally intensive iso-surfacing operation needs to be executed to extract their surface (e.g. Marching Cubes ). More specifically, iso-surfacing techniques typically suffer the curse of dimensionality, with a performance that scales as $1/\varepsilon^{d}$ , where $\varepsilon$ the desired spatial resolution and $d$ is usually $3$ . Conversely, as we illustrate in Figure 6, we only require the execution of two duality transforms, and the computations of two convex hulls of $H$ points. The complexity of these operations is clearly independent of any resolution parameter $\varepsilon$ .

4 Multi convex decomposition – Figure 7

Having a learnable pipeline for a single convex object, we can now expand the expressivity of our model by representing generic non-convex objects as compositions of convexes . To achieve this task an encoder $\mathcal{E}$ outputs a low-dimensional bottleneck representation of all $K$ convexes $\boldsymbol{\lambda}$ that $\mathcal{D}$ decodes into a collection of $K$ parameter tuples. Each tuple (indexed by $k$ ) is comprised of a shape code ${\boldsymbol{\beta}}_{k}$ , and corresponding transformation $\mathbf{T}_{k}(\mathbf{x})=\mathbf{x}+\mathbf{c}_{k}$ that transforms the point from world coordinates to local coordinates. $\mathbf{c}_{k}$ is the predicted translation vector (Figure 7).

5 Training losses

First and foremost, we want the (ground truth) indicator function of our object $\mathcal{O}$ to be well approximated:

where $\hat{\mathcal{O}}(\mathbf{x})=\max_{k}\{\mathcal{C}_{k}(\mathbf{x})\}$ , and $\mathcal{C}_{k}(\mathbf{x})=\mathcal{C}(\mathbf{T}_{k}(\mathbf{x})|{\boldsymbol{\beta}}_{k})$ . The application of the $\max$ operator produces a perfect union of convexes. While constructive solid geometry typically applies the $\min$ operator to compute the union of signed distance functions, note that we apply the $\max$ operator to indicator functions instead with the same effect; see Section 6 in the supplementary material for more details. We couple the approximation loss with several auxiliary losses that enforce the desired properties of our decomposition.

We seek a parsimonious decomposition of an object akin to Tulsiani et al. . Hence, overlap between elements should be discouraged:

where we use a permissive $\tau\!=\!2$ , and note how the ReLU activates the loss only when an overlap occurs.

Unique parameterization loss (auxiliary)

While each convex is parameterized with respect to the origin, there is a nullspace of solutions – we can move the origin to another location within the convex, and update offsets $\{d_{h}\}$ and transformation $\mathbf{T}$ accordingly – see Figure 8(left). To remove such a null-space, we simply regularize the magnitudes of the offsets for each of the $K$ elements:

In the supplementary material, we prove that minimizing $\mathcal{L}_{\text{unique}}$ leads to a unique solution and centers the convex body to the origin. This loss further ensures that “inactive” hyperplane constraints can be readily re-activated during learning. Because they fit tightly around the surface they are therefore sensitive to shape changes.

Guidance loss (auxiliary)

As we will describe in Section 3.6, we use offline sampling to speed-up training. However, this can cause severe issues. In particular, when a convex “falls within the cracks” of sampling (i.e. $\nexists x\>|\>\mathcal{C}(x)\!>\!0.5$ ), it can be effectively removed from the learning process. This can easily happen when the convex enters a degenerate state (i.e. $d_{h}\!\!=\!0\>\>\forall h$ ). Unfortunately these degenerate configurations are encouraged by the loss (5). We can prevent collapses by ensuring that each of them represents a certain amount of information (i.e. samples):

where $\mathcal{N}^{N}_{k}$ is the subset of $N$ samples from the set $\mathbf{x}~{}\sim~{}\{\mathcal{O}\}$ with the smallest distance value $\Phi_{k}(\mathbf{x})$ from $\mathcal{C}_{k}$ . In other words, each convex is responsible for representing at least the $N$ closest interior samples.

Localization loss (auxiliary)

When a convex is far from interior points, the loss in (6) suffers from vanishing gradients due to the sigmoid function. We overcome this problem by adding a loss with respect to $\mathbf{c}_{k}$ , the translation vector of the $k$ -th convex:

Observations

Note that we supervise the indicator function $\mathcal{C}$ rather than $\Phi$ , as the latter does not represent the signed distance function of a convex (e.g. $\|\nabla\Phi(x)\|\neq 1$ ). Also note how the loss in (4) is reminiscent of SIF [21, Eq.1], where the overall surface is modeled as a sum of meta-ball implicit functions – which the authors call “structuring”. The core difference lies in the fact that SIF models the surface of the object $\partial\mathcal{O}$ as an iso-level of the function post structuring – therefore, in most cases, the iso-surface of the individual primitives do not approximate the target surface, resulting in a slight loss of interpretability in the generated representation.

6 Implementation details

To increase training speed, we sample a set of points on the ground-truth shape offline, precompute the ground truth quantities, and then randomly sub-sample from this set during our training loop. For volumetric samples, we use the samples from OccNet , while for surface samples we employ the “near-surface” sampling described in SIF . Following SIF , we also tune down $\mathcal{L}_{\text{approx}}$ of “near-surface” samples by $0.1$ . We draw $100k$ random samples from the bounding box of $\mathcal{O}$ and $100k$ samples from each of $\partial\mathcal{O}$ to construct the points samples and labels. We use a sub-sample set (at training time) with $1024$ points for both sample sources. Although Mescheder et al. claims that using uniform volumetric samples are more effective than surface samples, we find that balancing these two strategies yields the best performance – this can be attributed to the complementary effect of the losses in (3) and (4).

Experiments

We use the ShapeNet dataset in our experiments. We use the same voxelization, renderings, and data split as in Choy et. al. . Moreover, we use the same multi-view depth renderings as for our {Depth}-to-3D experiments, where we render each example from cameras placed on the vertices of a dodecahedron. Note that this problem is a harder problem than 3D auto-encoding with point cloud input as proposed by OccNet and resembles more closely the single view reconstruction problem. At training time we need ground truth inside/outside labels, so we employ the watertight meshes from – this also ensures a fair comparison to this method. For the quantitative evaluation of semantic decomposition, we use labels from PartNet and exploit the overlap with ShapeNet.

We quantitatively compare our method to a number of self-supervised algorithms with different characteristics. First, we consider VP that learns a parsimonious approximation of the input via (the union of) oriented boxes. We also compare to the Structured Implicit Function SIF method that represents solid geometry as an iso-level of a sum of weighted Gaussians; like VP , and in contrast to OccNet , this methods provides an interpretable encoding of geometry. Finally, from the class of techniques that directly learn non-interpretable representations of implicit functions, we select OccNet , P2M , and AtlasNet ; in contrast to the previous methods, these solutions do not provide any form of shape decomposition. As OccNet only report results on RGB-to-3D tasks, we extend the original codebase to also solve {Depth}-to-3D tasks. We follow the same data pre-processing used by SIF .

Metrics

1 Abstraction – Figure 9, 10, 11

As our convex decomposition is learnt on a shape collection, the convexes produced by our decoder are in natural correspondence – e.g. we expect the same $k$ -th convex to represent the leg of a chair in the chairs dataset. We analyze this quantitatively on the PartNet dataset . We do so by verifying whether the $k$ -th component is consistently mapped to the same PartNet part label; see Figure 11 (left) for the distribution of PartNet labels within each component. We can then assign the most commonly associated label to a given convex to segment the PartNet point cloud, achieving a relatively high accuracy; see Figure 11 (right). This reveals how our representation captures the semantic structure in the dataset. We also evaluate our shape abstraction capabilities by varying the number of components and evaluating the trade-off between representation parsimony and reconstruction accuracy; we visualize this via Pareto-optimal curves in the plot of Figure 9. We compare with SIF , and note that thanks to the generalized shape space of our model, our curve dominates theirs regardless of the number of primitives chosen. We further investigate the use of natural correspondence in a part-based retrieval task. We first encode an input into our representation, allow a user to select a few parts of interest, and then use this (incomplete) shape-code to fetch the elements in the training set with the closest (partial) shape-code; see Figure 10.

2 Reconstruction – Table 1 and Figure 12

We quantitatively evaluate the reconstruction performance against a number of state-of-the-art methods given inputs as multiple depth map images ({Depth}-to-3D) and a single color image (RGB-to-3D); see Table 1. A few qualitative examples are displayed in Figure 12. We find that CvxNet is: \raisebox{-.6pt}{1}⃝ consistently better than other part decomposition methods (SIF, VP, and SQ) which share the common goal of learning shape elements; \raisebox{-.6pt}{2}⃝ in general comparable to the state-of-the-art reconstruction methods; \raisebox{-.6pt}{3}⃝ better than the leading technique (OccNet ) when evaluated in terms of F-score, and tested on multi-view depth input. Note that SIF first trains for the template parameters on ({Depth}-to-3D) with a reconstruction loss, and then trains the RGB-to-3D image encoder with a parameter regression loss; conversely, our method trains both encoder and decoder of the RGB-to-3D task from scratch.

3 Ablation studies

We summarize here the results of several ablation studies found in the supplementary material. Our analysis reveals that the method is relatively insensitive to the dimensionality of the bottleneck $|\boldsymbol{\lambda}|$ . We also investigate the effect of varying the number of convexes $K$ and number of hyperplanes $H$ in terms of reconstruction accuracy and inference/training time. Moreover, we quantitatively demonstrate that using signed distance as supervision for $\mathcal{L}_{\text{approx}}$ produces significantly worse results and at the cost of slightly worse performance we can collapse $\mathcal{L}_{\text{guide}}$ and $\mathcal{L}_{\text{loc}}$ into one. Finally, we perform an ablation study with respect to our losses, and verify that each is beneficial towards effective learning.

Conclusions

We propose a differentiable representation of convex primitives that is amenable to learning. The inferred representations are directly usable in graphics/physics pipelines; see Figure 1. Our self-supervised technique provides more detailed reconstructions than very recently proposed part-based techniques (SIF in Figure 9), and even consistently outperforms the leading reconstruction technique on multi-view input (OccNet in Table 1). In the future we would like to generalize the model to be able to predict a variable number of parts , understand symmetries and modeling hierarchies , and include the modeling of rotations . Leveraging the invariance of hyperplane ordering, it would be interesting to investigate the effect of permutation-invariant encoders , or remove encoders altogether in favor of auto-decoder architectures .

We would like to acknowledge Luca Prasso and Timothy Jeruzalski for their help with preparing the rigid-body simulations, Avneesh Sud and Ke Li for reviewing our draft, and Anton Mikhailov, Tom Funkhouser, and Erwin Coumans for fruitful discussions.

References

Union of smooth indicator functions

We define the smooth indicator function for the $k$ -th object:

where $\Phi_{\delta}^{k}(\mathbf{x})$ is the $k$ -th object signed distance function. In constructive solid geometry the union of signed distance function is defined using the $\min$ operator. Therefore the union operator for our indicator function can be written:

Note that the $\max$ operator is commutative with respect to monotonically increasing functions allowing us to extract the $\max$ operator from the $\text{{Sigmoid}}_{\sigma}(\cdot)$ function in (10).

Merged guidance loss and localization loss

While the guidance loss (6) and localization loss (7) are designed by different motivations, they are inherently consistent in encouraging the convex elements to reside close to the ground truth. Therefore, we propose an alternative training strategy to merge these two losses into a single one:

The idea is that while $\Phi_{k}(\mathbf{x})$ is approximate, it can still be used to get weak gradients. Essentially, this means that each convex needs to “explain” the N closest samples of $\mathcal{O}$ .

Proof of auxiliary null-space loss

Consider a collection of hyperplanes $\mathbf{h}=\{(\mathbf{n}_{h},d_{h})\}$ where each pair $(\mathbf{n}_{h},d_{h})$ denotes $\mathcal{H}_{h}(\mathbf{x})=\mathbf{n}_{h}\cdot\mathbf{x}+d_{h}$ . We like to show that amongst all possible translations of this collection, there is a unique one that minimizes $\mathcal{L}_{\text{unique}}=\tfrac{1}{H}\sum_{h}\mathcal{H}_{h}(\mathbf{x})^{2}$ . We can rewrite this sum as

where $\mathbf{d_{h}}{=}(d_{h})_{h}$ is a vector with entries $d_{h}$ , and $N$ is the matrix formed by $\mathbf{n}_{h}$ as columns. However it is well known that the above minimization has a unique solution, and the minimizing solution can be computed explicitly (e.g. by using the Moore-Penrose inverse of $N$ ).

We also note that $\mathcal{L}_{\text{unique}}$ has a nice geometric description. In fact, minimizing $\mathcal{L}_{\text{unique}}$ is essentially “centring” the convex body at the origin. To see this we first prove the following:

Let $\mathcal{H}_{h}(\mathbf{x})=\mathbf{n}_{h}\cdot\mathbf{x}+d_{h}$ be a hyperplane and let $\mathbf{x}_{0}$ be any point. Then the distance between $\mathbf{x}_{0}$ and the hyperplane $\mathcal{H}_{h}$ is given by ${|\mathcal{H}_{h}(\mathbf{x}_{0})|}/{\|\mathbf{n}_{h}\|}$ .

Let $X$ be any point on the hyperplane $h$ , so $\mathcal{H}_{h}(X)=0$ . Then the distance of $\mathbf{x}_{0}$ to $\mathcal{H}_{h}$ is the norm of the projection of $X-\mathbf{x}_{0}$ onto the norm of $\mathcal{H}_{h}$ , $\mathbf{n}_{h}$ . Using standard formulas we get, as desired:

Note that if we assume that $\mathbf{n}_{h}$ are all normalized, then $|\mathcal{H}_{h}(\mathbf{x}_{0})|$ gives us the distance to the hyperplane $h$ for each hyperplane $h$ . Therefore, given a convex polytope defined by collection of hyperplanes $\mathbf{h}=\{(\mathbf{n}_{h},d_{h})\}$ , the $\sum_{h}|\mathcal{H}_{h}(\mathbf{x})|^{2}$ is the sum of squares of distance of $\mathbf{x}$ to the sides of the polytope. Also note that if we translate $\mathbf{h}$ by $\mathbf{x}_{0}$ , then we get the new set of hyperplane $\mathbf{h}^{\prime}=\{(\mathbf{n}_{h},d_{h}+\mathbf{n}_{h}\cdot\mathbf{x}_{0})\}=\{(\mathbf{n}_{h},\mathcal{H}_{h}(\mathbf{x}_{0}))$ . Therefore,

can be interpreted as the square distance of the origin to the collection of hyperplanes $\mathbf{h}$ .