SAL: Sign Agnostic Learning of Shapes from Raw Data

Matan Atzmon, Yaron Lipman

Introduction

Recently, deep neural networks have been used to reconstruct, learn and generate 3D surfaces. There are two main approaches: parametric and implicit . In the parametric approach neural nets are used as parameterization mappings, while the implicit approach represents surfaces as zero level-sets of neural networks:

In this paper we advocate Sign Agnostic Learning (SAL), defined by a family of loss functions that can be used directly with raw (unsigned) geometric data ${\mathcal{X}}$ and produce signed implicit representations of surfaces. An important application for SAL is in generative models such as variational auto-encoders , learning shape spaces directly from the raw 3D data. Figure 1 depicts an example where collectively learning a dataset of raw human scans using SAL overcomes many imperfections and artifacts in the data (left in every gray pair) and provides high quality surface reconstructions (right in every gray pair) and shape space (interpolations of latent representations are in gold).

We have experimented with SAL for surface reconstruction from point clouds as well as learning a human shape space from the raw scans of the D-Faust dataset . Comparing our results to current approaches and baselines we found SAL to be the method of choice for learning shapes from raw data, and believe SAL could facilitate many computer vision and computer graphics shape learning applications, allowing the user to avoid the tedious and unsolved problem of surface reconstruction in preprocess. Our code is available at https://github.com/matanatz/SAL.

Previous work

Learning collections of shapes is done using Generative Adversarial Networks (GANs) , auto-encoders and variational auto-encoders , and auto-decoders . Wu et al. use GAN on a voxel grid encoding of the shape, while Ben-Hamu et al. apply GAN on a collection of conformal charts. Dai et al. use encoder-decoder architecture to learn a signed distance function to a complete shape from a partial input on a volumetric grid. Stutz et al. use variational auto-encoder to learn an implicit surface representations of cars using a volumetric grid. Baqautdinov et al. use variational auto-encoder with a constant mesh to learn parametrizations of faces shape space. Litany et al. use variational auto-encoder to learn body shape embeddings of a template mesh. Park et al. use auto-decoder to learn implicit neural representations of shapes, namely directly learns a latent vector for every shape in the dataset. In our work we also make use of a variational auto-encoder but differently from previous work, learning is done directly from raw 3D data.

2 Surface reconstruction.

Many surface reconstruction methods require normal or inside/outside information. Carr et al. were among the first to suggest using a parametric model to reconstruct a surface by computing its implicit representation; they use radial basis functions (RBFs) and regress at inside and outside points computed using oriented normal information. Kazhdan et al. solve a Poisson equation on a volumetric discretization to extend points and normals information to an occupancy indicator function. Walder et al. use radial basis functions and solve a variational hermite problem (i.e., fitting gradients of the implicit to the normal data) to avoid trivial solution. In general our method works with a non-linear parameteric model (MLP) and therefore does not require a-priori space discretization nor works with a fixed linear basis such as RBFs.

More related to this paper are surface reconstruction methods that work with unsigned data such as point clouds and triangle soups. Zhao et al. use the level-set method to fit an implicit surface to an unoriented point cloud by minimizing a loss penalizing distance of the surface to the point cloud achieving a sort of minimal area surface interpolating the points. Walder et al. formulates a variational problem fitting an implicit RBF to an unoriented point cloud data while minimizing a regularization term and maximizing the norm of the gradients; solving the variational problem is equivalent to an eigenvector problem. Mullen et al. suggests to sign an unsigned distance function to a point cloud by a multi-stage algorithm first dividing the problem to near and far field sign estimation, and propagating far field estimation closer to the zero level-set; then optimize a convex energy fitting a smooth sign function to the estimated sign function. Takayama et al. suggested to orient triangle soups by minimizing the Dirichlet energy of the generalized winding number noting that correct orientation yields piecewise constant winding number. Xu et al. suggested to compute robust signed distance function to triangle soups by using an offset surface defined by the unsigned distance function. Zhiyang et al. fit an RBF implicit by optimizing a non-convex variational problem minimizing smoothness term, interpolation term and unit gradient at data points term. All these methods use some linear function space; when the function space is global, e.g. when using RBFs, model fitting and evaluation are costly and limit the size of point clouds that can be handled efficiently, while local support basis functions usually suffer from inferior smoothness properties . In contrast we use a non-linear function basis (MLP) and advocate a novel and simple sign agnostic loss to optimize it. Evaluating the non-linear neural network model is efficient and scalable and the training process can be performed on a large number of points, e.g., with stochastic optimization techniques.

Sign agnostic learning

We introduce the Sign Agnostic Learning (SAL) defined by a loss of the form

To theoretically motivate the loss family in equation 2 we will prove that it possess a plane reproduction property. That is, if the data ${\mathcal{X}}$ is contained in a plane, there is a critical weight ${\bm{\theta}}^{*}$ reconstructing this plane as the zero level-set of $f({\bm{x}};{\bm{\theta}}^{*})$ . Plane reproduction is important for surface approximation since surfaces, by definition, have an approximate tangent plane almost everywhere .

We will explore instantiations of SAL based on different choices of unsigned distance functions $h_{\mathcal{X}}$ , as follows.

We consider two $p$ -distance functions: For $p=2$ we have the standard $L^{2}$ (Euclidean) distance

Although many choices exist for the unsigned similarity function, in this paper we take

The choice of $D_{\mathcal{X}}$ is depending on the particular choice of $h_{\mathcal{X}}$ . For $L^{2}$ distance, it is enough to make the simple choice of splatting an isotropic Gaussian, ${\mathcal{N}}({\bm{x}},\sigma^{2}I)$ , at every point (uniformly randomized) ${\bm{x}}\in{\mathcal{X}}$ ; we denote this probability ${\mathcal{N}}_{\sigma}({\mathcal{X}})$ ; note that $\sigma$ can be taken to be a function of ${\bm{x}}\in{\mathcal{X}}$ to reflect local density in ${\mathcal{X}}$ . In this case, the loss takes the form

For the $L^{0}$ distance however, $h_{\mathcal{X}}({\bm{x}})\neq 1$ only for ${\bm{x}}\in{\mathcal{X}}$ and therefore a non-continuous density should be used; we opt for ${\mathcal{N}}({\bm{x}},\sigma^{2}I)+\delta_{\bm{x}}$ , where $\delta_{\bm{x}}$ is the delta distribution measure concentrated at ${\bm{x}}$ . The loss takes the form

Remarkably, the latter loss requires only randomizing points ${\bm{z}}$ near the data samples without any further computations involving ${\mathcal{X}}$ . This allows processing of large and/or complex geometric data.

Although SAL can work with different parametric models, in this paper we consider a multilayer perceptron (MLP) defined by

In this paper we use $\varphi(a)=a$ or $\varphi(a)=\tanh(a)+\gamma a$ , where $\gamma\geq 0$ is a parameter. Furthermore, similarly to previous work we have incorporated a skip connection layer $s$ , concatenating the input ${\bm{x}}$ to the middle hidden layer, that is $s({\bm{y}})=({\bm{y}},{\bm{x}})$ , where here ${\bm{y}}$ is a hidden variable in $f$ .

Notice that both $h_{\mathcal{X}}({\bm{x}})$ and its signed version are local minima of the loss in equation 2. These local minima are stable in the sense that there is an energy barrier when moving from one to the other. For example, to get to a solution as in Figure 2(b) from the solution in Figure 2(d) one needs to flip the sign in the interior or exterior of the region defined by the black line. Changing the sign continuously will result in a considerable increase to the SAL loss value.

We elaborate on our initialization method, ${\bm{\theta}}={\bm{\theta}}^{0}$ , that in practice favors the signed version of $h_{\mathcal{X}}$ in the next section.

Geometric network initialization

A key aspect of our method is a proper, geometrically motivated initialization of the network’s parameters. For MLPs, equations 8-9, we develop an initialization of its parameters, ${\bm{\theta}}={\bm{\theta}}^{0}$ , so that $f({\bm{x}};{\bm{\theta}}^{0})\approx\varphi(\left\|{\bm{x}}\right\|-r)$ , where $\left\|{\bm{x}}\right\|-r$ is the signed distance function to an $r$ -radius sphere. The following theorem specify how to pick ${\bm{\theta}}^{0}$ to achieve this:

Figure 3 depicts level-sets (zero level-sets in bold) using the initialization of Theorem 1 with the same 8-layer MLP (using $\varphi(a)=a$ ) and increasing width of 100, 200, and 2000 neurons in the hidden layers. Note how the approximation $f({\bm{x}};{\bm{\theta}}^{0})\approx\left\|{\bm{x}}\right\|-r$ improves as the layers’ width increase, while the sphere-like (in this case circle-like) zero level-set remains topologically correct at all approximation levels.

The proof to Theorem 1 is provided in the supplementary material; it is a corollary of the following theorem, showing how to chose the initial weights for a single hidden layer network:

Properties

Plane reproduction is a key property to surface approximation methods since, in essence, surfaces are locally planar, i.e., have an approximating tangent plane almost everywhere . In this section we provide a theoretical justification to SAL by proving a plane reproduction property. We first show this property for a linear model (i.e., a single layer MLP) and then show how this implies local plane reproduction for general MLPs.

This theorem can be applied locally when optimizing a general MLP (equation 8) with SAL to prove local plane reproduction. See supplementary for more details.

2 Convergence to the limit signed function

The SAL loss pushes the neural implicit function $f$ towards a signed version of the unsigned distance function $h_{\mathcal{X}}$ . In the $L^{0}$ case it is the inside/outside indicator function of the surface, while for $L^{2}$ it is a signed version of the Euclidean distance to the data ${\mathcal{X}}$ . Figure 4 shows advanced epochs of the 2D experiment in Figure 2; note that the $f$ in these advanced epochs is indeed closer to the signed version of the respective $h_{\mathcal{X}}$ . Since the indicator function and the signed Euclidean distance are discontinuous across the surface, they potentially impose quantization errors when using standard contouring algorithms, such as Marching Cubes , to extract their zero level-set. In practice, this phenomenon is avoided with a standard choice of stopping criteria (learning rate and number of iterations). Another potential solution is to add a regularization term to the SAL loss; we mark this as future work.

Experiments

2 Learning shape space from raw scans

In the main experiment of this paper we trained on the D-Faust scan dataset , consisting of approximately 41k raw scans of 10 humans in multiple posesDue to the dense temporal sampling in this dataset we experimented with a 1:5 sample.. Each scan is a triangle soup, ${\mathcal{X}}_{i}$ , where common defects include holes, ghost geometry, and noise, see Figure 1 for examples.

We use SAL loss with $L^{2}$ distance, i.e., $h_{2}({\bm{z}})=\min_{{\bm{x}}\in{\mathcal{X}}_{i}}\left\|{\bm{z}}-{\bm{x}}\right\|_{2}$ the unsigned distance to the triangle soup ${\mathcal{X}}_{i}$ , and combine it with a variational auto-encoder type loss :

where ${\bm{\theta}}=({\bm{\theta}}_{1},{\bm{\theta}}_{2})$ , $\left\|\cdot\right\|_{1}$ is the 1-norm, $\left\|{\bm{\mu}}\right\|_{1}$ encourages the latent prediction ${\bm{\mu}}$ to be close to the origin, while $\left\|{\bm{\eta}}+\mathbf{1}\right\|_{1}$ encourages the variances $\bm{\Sigma}$ to be constant $\exp{(-1)}$ ; together, these enforce a regularization on the latent space. $\lambda$ is a balancing weight chosen to be $10^{-3}$ .

We compared versus three baseline methods. First, AtlasNet , one of the only existing algorithms for learning a shape collection from raw point clouds. AtlasNet uses a parametric representation of surfaces, which is straight-forward to sample. On the down side, it uses a collection of patches that tend to not overlap perfectly, and their loss requires computation of closest points between the generated and input point clouds which poses a challenge for learning large point clouds. Second, we approximate a signed distance function, $\bar{h}_{2}$ , to the data ${\mathcal{X}}_{i}$ in two different ways, and regress them using an MLP as in DeepSDF ; we call these methods SignReg. Note that Occupancy Networks and regress a different signed distance function and perform similarly.

To approximate the signed distance function, $\bar{h}_{2}$ , we first tried using a state of the art surface reconstruction algorithm to produce watertight manifold surfaces. However, only 28684 shapes were successfully reconstructed ( $69\%$ of the dataset), making this option infeasible to compute $\bar{h}_{2}$ . We have opted to approximate the signed distance function similar to with $\bar{h}_{2}({\bm{z}})={\bm{n}}_{*}^{T}({\bm{z}}-{\bm{x}}_{*})$ , where ${\bm{x}}_{*}=\operatorname*{arg\,min}_{{\bm{x}}\in{\mathcal{X}}_{i}}\left\|{\bm{z}}-{\bm{x}}\right\|_{2}$ is the closest point to ${\bm{z}}$ in ${\mathcal{X}}_{i}$ and ${\bm{n}}_{*}$ is the normal at ${\bm{x}}_{*}\in{\mathcal{X}}_{i}$ . To approximate the normal ${\bm{n}}_{*}$ we tested two options: (i) taking ${\bm{n}}^{*}$ directly from the original scan ${\mathcal{X}}_{i}$ with its original orientation; and (ii) using local normal estimation using Jets followed by consistent orientation procedure based on minimal spanning tree using the CGAL library .

Table 1 and Figure 6 show the result on a random 75%-25% train-test split on the D-Faust raw scans. We report the 5%, 50% (median), and 95% percentiles of the Chamfer distances between the surface reconstructions and the raw scans (one-sided Chamfer from reconstruction to scan), and ground truth registrations. The SAL and SignReg reconstructions were generated by a forward pass $({\bm{\mu}},{\bm{\eta}})=g({\bm{X}};{\bm{\theta}}_{1})$ of a point cloud ${\bm{X}}\subset{\mathcal{X}}_{i}$ sampled from the raw unseen scans, yielding an implicit function $f({\bm{x}};{\bm{\mu}},{\bm{\theta}}_{2})$ . We used the Marching Cubes algorithm to mesh the zero level-set of this implicit function. Then, we sampled uniformly 30K points from it and compute the Chamfer Distance.

Table 2 demonstrates that the latent optimization method further improves predictions quality, compared to a single forward pass. In 7 and 8, we demonstrate few representatives examples, where we plot left to right in each column: input test scan, SAL reconstruction with forward pass alone, and SAL reconstruction with latent optimization. Failure cases are shown in the bottom-right. Despite the little variability of humans in the training dataset (only 8 humans), 7 shows that SAL can usually fit a pretty good human shape to the unseen human scan using a single forward pass reconstruction; using latent optimization further improves the approximation as can be inspected in the different examples in this figure.

Figure 8 shows how a single forward reconstruction is able to predict the pose correctly, where latent optimization improves the prediction in terms of shape and pose.

SAL’s limitation is mainly in capturing thin structures. Figure 9 shows reconstructions (obtained similarly to 6.1) of a chair and a plane from the ShapeNet dataset; note that some parts in the chair back and the plane wheel structure are missing.

Conclusions

We introduced SAL: Sign Agnostic Learning, a deep learning approach for processing raw data without any preprocess or need for ground truth normal data or inside/outside labeling. We have developed a geometric initialization formula for MLPs to approximate the signed distance function to a sphere, and a theoretical justification proving planar reproduction for SAL. Lastly, we demonstrated the ability of SAL to reconstruct high fidelity surfaces from raw point clouds, and that SAL easily integrates into standard generative models to learn shape spaces from raw geometric data. One limitation of SAL was mentioned in Section 5, namely the stopping criteria for the optimization.

Using SAL in other generative models such as generative adversarial networks could be an interesting follow-up. Another future direction is global reconstruction from partial data. Combining SAL with image data also has potentially interesting applications. We think SAL has many exciting future work directions, progressing geometric deep learning to work with unorganized, raw data.

The research was supported by the European Research Council (ERC Consolidator Grant, ”LiftMatch” 771136), the Israel Science Foundation (Grant No. 1830/17) and by a research grant from the Carolito Stiftung (WAIC).

References

Appendix

We train the network in the surface reconstruction experiment with Adam optimizer , learning rate $0.0001$ for 5000 epochs. Training was done on an Nvidia V-100 GPU, using pytorch deep learning framework .

1.2 Learning shape space

Our network architecture is Encoder-Decoder based, where for the encoder we have used PointNet and DeepSets layers. Each layer is composed of

where $\left[\cdot,\cdot\right]$ is the concat operation. Our Architecture is

Training our networks for learning the shape space of the D-Faust dataset was done with the following choices. We have used the Adam optimizer , initialized with learning rate $0.0005$ and batch size of $64$ . We scheduled the learning rate to decrease every 500 epochs by a factor of 0.5. We stopped the training process after 2000 epochs. Training was done on 4 Nvidia V-100 GPUs, using pytorch deep learning framework .

2 Additional Experiments

One of the key advantages of SAL is that it can be used for reconstructing a surface from a single input scan or incorporated into a VAE architecture for learning a shape space from an entire scans dataset. This raises an interesting question, whether learning a shape space has also an impact on the quality of the reconstructions. To answer this question, we ran SAL surface reconstruction on each of the scans used for training the main experiment of the paper (See table 2 for more details). When comparing our SAL VAE training results on the registrations (ground truth) versus SAL single reconstruction we see differences in favor of our VAE learner, whereas the results on the original scans are comparable. That is, SAL single reconstruction results are $0.10,0.17,0.22;0.07,0.08,0.10$ on the registrations and scans for the $5\%,50\%,95\%$ percentiles respectively.

Figure 10 shows reconstructions of test scans for different stages of training on the D-Faust dataset. Given the main paper discussion on SAL limit signed function, we additionally add reconstructions from relatively advanced epoch as $3500$ , showing that no error in contouring occur.

3 Proofs

For simplicity, we restrict our attention to absolutely continuous measures $D_{\mathcal{X}}$ , that is defined by a continuous density function $\mu({\bm{x}})$ . Generalizing to measures with a discrete part (such as the one we use for the $L^{0}$ distance, for example) can be proven similarly.

Denoting ${\bm{\theta}}=({\bm{w}},b)$ , the loss in equation 2 can be written as

where $\nabla_{{\bm{w}},b}f({\bm{x}};{\bm{\theta}})=\varphi^{\prime}({\bm{w}}^{T}{\bm{x}}+b)[{\bm{x}},1]$ .

As $\varphi$ is anti-symmetric, $\varphi^{\prime}$ is symmetric, i.e., $\varphi^{\prime}(a)=\varphi^{\prime}(-a)$ and therefore

Plugging these in the integral after the change of variables we reach

The last integral is scalar and we denote its integrand by $g_{\alpha}({\bm{x}})$ (remember that $({\bm{w}},b)=\alpha({\bm{n}},c)$ ), i.e.,

Let us show the existence of $\alpha_{+}$ . Let ${\bm{x}}\in\Omega_{+}$ such that $\gamma={\bm{n}}^{T}{\bm{x}}+c>0$ and $\mu({\bm{x}})>0$ . Then,

Furthermore, since $\beta^{-1}\geq\varphi^{\prime}(a)\geq\beta>0$ we have that

the existence of $\alpha_{+}$ is established. The case of $\alpha_{-}$ is proven similarly. ∎

3.2 Local plane reproduction with MLP

By ”reconstructs ${\mathcal{P}}$ locally” we mean that ${\bm{\theta}}^{*}$ is critical for the loss if $D_{\mathcal{X}}$ is sufficiently concentrated around any point in ${\mathcal{P}}\cap\Omega$ .

where $H(a)={\tiny\begin{cases}1&a\geq 0\\ 0&a<0\end{cases}}$ is the Heaviside function.

and $D_{i}$ are constant diagonal matrices. Over this domain we have

3.3 Proof of Theorem 1

It is enough to consider a single layer: $h({\bm{x}})=\nu({\bm{W}}{\bm{x}}+{\bm{b}})$ . For brevity let $k=d^{out}$ . Now,

Note that the entries of $\sqrt{k{\bm{W}}_{i,:}}$ are distributed i.i.d. ${\mathcal{N}}(0,\sqrt{2})$ . Hence by the law of large numbers the last term converge to

3.4 Proof of Theorem 2

For brevity we denote $k=d^{out}$ . Note that plugging ${\bm{w}},{\bm{b}},c$ in $f$ we get $f({\bm{x}})=\frac{\sqrt{2\pi}}{\sigma k}\sum_{i=1}^{k}\nu({\bm{w}}_{i}\cdot{\bm{x}})-r$ , where ${\bm{w}}_{i}$ is the $i^{th}$ row of ${\bm{W}}$ . Let $\mu$ denote the density of multivariate normal distribution ${\mathcal{N}}=(0,\sigma^{2}I_{k})$ . By the law of large numbers, the first term converges to