OctNet: Learning Deep 3D Representations at High Resolutions

Gernot Riegler, Ali Osman Ulusoy, Andreas Geiger

Introduction

Over the last several years, convolutional networks have lead to substantial performance gains in many areas of computer vision. In most of these cases, the input to the network is of two-dimensional nature, e.g., in image classification , object detection or semantic segmentation . However, recent advances in 3D reconstruction and graphics allow capturing and modeling large amounts of 3D data. At the same time, large 3D repositories such as ModelNet , ShapeNet or 3D Warehousehttps://3dwarehouse.sketchup.com as well as databases of 3D object scans are becoming increasingly available. These factors have motivated the development of convolutional networks that operate on 3D data.

Most existing 3D network architectures replace the 2D pixel array by its 3D analogue, i.e., a dense and regular 3D voxel grid, and process this grid using 3D convolution and pooling operations. However, for dense 3D data, computational and memory requirements grow cubically with the resolution. Consequently, existing 3D networks are limited to low 3D resolutions, typically in the order of $30^{3}$ voxels. To fully exploit the rich and detailed geometry of our 3D world, however, much higher resolution networks are required.

In this work, we build on the observation that 3D data is often sparse in nature, e.g., point clouds, or meshes, resulting in wasted computations when applying 3D convolutions naïvely. We illustrate this in Fig. 1 for a 3D classification example. Given the 3D meshes of we voxelize the input at a resolution of $64^{3}$ and train a simple 3D convolutional network to minimize a classification loss. We depict the maximum of the responses across all feature maps at different layers of the network. It is easy to observe that high activations occur only near the object boundaries.

Motivated by this observation, we propose OctNet, a 3D convolutional network that exploits this sparsity property. Our OctNet hierarchically partitions the 3D space into a set of unbalanced octrees . Each octree splits the 3D space according to the density of the data. More specifically, we recursively split octree nodes that contain a data point in its domain, i.e., 3D points, or mesh triangles, stopping at the finest resolution of the tree. Therefore, leaf nodes vary in size, e.g., an empty leaf node may comprise up to $8^{3}=512$ voxels for a tree of depth 3 and each leaf node in the octree stores a pooled summary of all feature activations of the voxel it comprises. The convolutional network operations are directly defined on the structure of these trees. Therefore, our network dynamically focuses computational and memory resources, depending on the 3D structure of the input. This leads to a significant reduction in computational and memory requirements which allows for deep learning at high resolutions. Importantly, we also show how essential network operations (convolution, pooling or unpooling) can be efficiently implemented on this new data structure.

We demonstrate the utility of the proposed OctNet on three different problems involving three-dimensional data: 3D classification, 3D orientation estimation of unknown object instances and semantic segmentation of 3D point clouds. In particular, we show that the proposed OctNet enables significant higher input resolutions compared to dense inputs due to its lower memory consumption, while achieving identical performance compared to the equivalent dense network at lower resolutions. At the same time we gain significant speed-ups at resolutions of $128^{3}$ and above. Using our OctNet, we investigate the impact of high resolution inputs wrt. accuracy on the three tasks and demonstrate that higher resolutions are particularly beneficial for orientation estimation and semantic point cloud labeling. Our code is available from the project websitehttps://github.com/griegler/octnet.

Related Work

While 2D convolutional networks have proven very successful in extracting information from images , there exists comparably little work on processing three-dimensional data. In this Section, we review existing work on dense and sparse models.

Dense Models: Wu et al. trained a deep belief network on shapes discretized to a $30^{3}$ voxel grid for object classification, shape completion and next best view prediction. Maturana et al. proposed VoxNet, a feed-forward convolutional network for classifying $32^{3}$ voxel volumes from RGB-D data. In follow-up work, Sedaghat et al. showed that introducing an auxiliary orientation loss increases classification performance over the original VoxNet. Similar models have also been exploited for semantic point cloud labeling and scene context has been integrated in .

Recently, generative models and auto-encoders have demonstrated impressive performance in learning low-dimensional object representations from collections of low-resolution ( $32^{3}$ ) 3D shapes. Interestingly, these low-dimensional representations can be directly inferred from a single image or a sequence of images .

Due to computational and memory limitations, all aforementioned methods are only able to process and generate shapes at a very coarse resolution, typically in the order of $30^{3}$ voxels. Besides, when high-resolution outputs are desired, e.g., for labeling 3D point clouds, inefficient sliding-window techniques with a limited receptive field must be adopted . Increasing the resolution naïvely reduces the depth of the networks and hence their expressiveness. In contrast, the proposed OctNets allow for training deep architectures at significant higher resolutions.

Sparse Models: There exist only few network architectures which explicitly exploit sparsity in the data. As these networks do not require exhaustive dense convolutions they have the potential of handling higher resolutions.

Engelcke et al. proposed to calculate convolutions at sparse input locations by pushing values to their target locations. This has the potential to reduce the number of convolutions but does not reduce the amount of memory required. Consequently, their work considers only very shallow networks with up to three layers.

A similar approach is presented in where sparse convolutions are reduced to matrix operations. Unfortunately, the model only allows for $2\times 2$ convolutions and results in indexing and copy overhead which prevents processing volumes of larger resolution (the maximum resolution considered in is $80^{3}$ voxels). Besides, each layer decreases sparsity and thus increases the number of operations, even at a single resolution. In contrast, the number of operations remains constant in our model.

Li et al. proposed field probing networks which sample 3D data at sparse points before feeding them into fully connected layers. While this reduces memory and computation, it does not allow for exploiting the distributed computational power of convolutional networks as field probing layers can not be stacked, convolved or pooled.

Jampani et al. introduced bilateral convolution layers (BCL) which map sparse inputs into permutohedral space where learnt convolutional filters are applied. Their work is related to ours with respect to efficiently exploiting the sparsity in the input data. However, in contrast to BCL our method is specifically targeted at 3D convolutional networks and can be immediately dropped in as a replacement in existing network architectures.

Octree Networks

To decrease the memory footprint of convolutional networks operating on sparse 3D data, we propose an adaptive space partitioning scheme which focuses computations on the relevant regions. As mathematical operations of deep networks, especially convolutional networks, are best understood on regular grids, we restrict our attention to data structures on 3D voxel grids. One of the most popular space partitioning structures on voxel grids are octrees which have been widely adopted due to their flexible and hierarchical structure. Areas of application include depth fusion , image rendering and 3D reconstruction . In this paper, we propose 3D convolutional networks on octrees to learn representations from high resolution 3D data.

An octree partitions the 3D space by recursively subdividing it into octants. By subdividing only the cells which contain relevant information (e.g., cells crossing a surface boundary or cells containing one or more 3D points) storage can be allocated adaptively. Densely populated regions are modeled with high accuracy (i.e., using small cells) while empty regions are summarized by large cells in the octree.

Unfortunately, vanilla octree implementations have several drawbacks that hamper its application in deep networks. While octrees reduce the memory footprint of the 3D representation, most versions do not allow for efficient access to the underlying data. In particular, octrees are typically implemented using pointers, where each node contains a pointer to its children. Accessing an arbitrary element (or the neighbor of an element) in the octree requires a traversal starting from the root until the desired cell is reached. Thus, the number of memory accesses is equal to the depth of the tree. This becomes increasingly costly for deep, i.e., high-resolution, octrees. Convolutional network operations such as convolution or pooling require frequent access to neighboring elements. It is thus critical to utilize an octree design that allows for fast data access.

We tackle these challenges by leveraging a hybrid grid-octree data structure which we describe in Section 3.1. In Section 3.2, we show how 3D convolution and pooling operations can be implemented efficiently on this data structure.

The above mentioned problems with the vanilla octree data structure increase with the octree depth. Instead of representing the entire high resolution 3D input with a single unbalanced octree, we leverage a hybrid grid-octree structure similar to the one proposed by Miller et al. . The key idea is to restrict the maximal depth of an octree to a small number, e.g., three, and place several such shallow octrees along a regular grid (Fig. 2). While this data structure may not be as memory efficient as the standard octree, significant compression ratios can still be achieved. For instance, a single shallow octree that does not contain input data stores only a single vector, instead of $8^{3}=512$ vectors for all voxels at the finest resolution at depth $3$ .

An additional benefit of a collection of shallow octrees is that their structure can be encoded very efficiently using a bit string representation which further lowers access time and allows for efficient GPGPU implementations . Given a shallow octree of depth $3$ , we use $73$ bit to represent the complete tree. The first bit with index indicates, if the root node is split, or not. Further, bits $1$ to $8$ indicate if one of the child nodes is subdivided and bits $9$ to $72$ denote splits of the grandchildren, see Fig. 3. A tree depth of $3$ gives a good trade-off between memory consumption and computational efficiency. Increasing the octree depth results in an exponential growth in the required bits to store the tree structure and further increases the cell traversal time.

Using this bit-representation, a single voxel in the shallow octree is fully characterised by its bit index. This index determines the depth of the voxel in the octree and therefore also the voxel size. Instead of using pointers to the parent and child nodes, simple arithmetic can be used to retrieve the corresponding indices of a voxel with bit index $i$ :

In contrast to , we associate a data container (for storing features vectors) with all leaf nodes of each shallow tree. We allocate the data of a shallow octree in a contiguous data array. The offset associated with a particular voxel in this array can be computed as follows:

Here, $\operatorname{mod~{}}$ denotes the modulo operator and $\operatorname{bit}$ returns the tree bit-string value at $i$ . See supp. document for an example. Both sum operations can be efficiently implemented using bit counting intrinsics (popcnt). The data arrays of all shallow octrees are concatenated into a single contiguous data array during training and testing to reduce I/O latency.

2 Network Operations

Given the hybrid grid-octree data structure introduced in the previous Section, we now discuss the efficient implementation of network operations on this data structure. We will focus on the most common operations in convolutional networks : convolution, pooling and unpooling. Note that point-wise operations, like activation functions, do not differ in their implementation as they are independent of the data structure.

Let us first introduce the notation which will be used throughout this Section. $T_{i,j,k}$ denotes the value of a 3D tensor $T$ at location $(i,j,k)$ . Now assume a hybrid grid-octree structure with $D\times H\times W$ unbalanced shallow octrees of maximum depth $3$ . Let $O[i,j,k]$ denote the value of the smallest cell in this structure which comprises the voxel $(i,j,k)$ . Note that in contrast to the tensor notation, $O[i_{1},j_{1},k_{1}]$ and $O[i_{2},j_{2},k_{2}]$ with $i_{1}\neq i_{2}\vee j_{1}\neq j_{2}\vee k_{1}\neq k_{2}$ may refer to the same voxel in the hybrid grid-octree, depending on the size of the voxels. We obtain the index of the shallow octree in the grid via $(\lfloor\tfrac{i}{8}\rfloor,\lfloor\tfrac{j}{8}\rfloor,\lfloor\tfrac{k}{8}\rfloor)$ and the local index of the voxel at the finest resolution in that octree by $(\operatorname{mod~{}}(i,8),\operatorname{mod~{}}(j,8),\operatorname{mod~{}}(k,8))$ .

Given this notation, the mapping from a grid-octree $O$ to a tensor $T$ with compatible dimensions is given by

Similarly, the reverse mapping is given by

where $\operatorname*{pool\_voxels~{}}(\cdot)$ is a pooling function (e.g., average- or max-pooling) which pools all voxels in $T$ over the smallest grid-octree cell comprising location $(i,j,k)$ , denoted by $\Omega[{i,j,k}]$ . This pooling is necessary as a single voxel in $O$ can cover up to $8^{3}=512$ elements of $T$ , depending on its size $|\Omega[{i,j,k}]|$ .

Remark: With the two functions defined above, we could wrap any network operation $f$ defined on 3D tensors via

However, this would require a costly conversion from the memory efficient grid-octrees to a regular 3D tensor and back. Besides, storing a dense tensor in memory limits the maximal resolution. We therefore define our network operations directly on the hybrid grid-octree data structure.

with $\hat{i}=i-l+\lfloor L/2\rfloor$ , $\hat{j}=j-m+\lfloor M/2\rfloor$ , $\hat{k}=k-n+\lfloor N/2\rfloor$ . Similarly, the convolutions on the grid-octree data structure are defined as

While this calculation yields the same result as the tensor convolution in Eq. (7) with the $\operatorname{oc2ten},\operatorname{ten2oc}$ wrapper, we are now able to define a computationally more efficient convolution operator. Our key observation is that for small convolution kernels and large voxels, $T_{i,j,k}$ is constant within a small margin of the voxel due to its constant support $O^{\mathsf{in}}[\hat{i},\hat{j},\hat{k}]$ . Thus, we only need to compute the convolution within the voxel once, followed by convolution along the surface of the voxel where the support changes due to adjacent voxels taking different values (Fig. 4). This minimizes the number of calculations by a factor of $4$ for voxels of size $8^{3}$ , see supp. material for a detailed derivation. At the same time, it enables a better caching mechanism.

Pooling

Another important operation in deep convolutional networks is pooling. Pooling reduces the spatial resolution of the input tensor and aggregates higher-level information for further processing, thereby increasing the receptive field and capturing context. For instance, strided $2^{3}$ max-pooling divides the input tensor $T^{\mathsf{in}}$ into $2^{3}$ non-overlapping regions and computes the maximum value within each region. Formally, we have

To implement pooling on the grid-octree data structure we reduce the number of shallow octrees. For an input grid-octree $O^{\mathsf{in}}$ with $2D\times 2H\times 2W$ shallow octrees, the output $O^{\mathsf{out}}$ contains $D\times H\times W$ shallow octrees. Each voxel of $O^{\mathsf{in}}$ is halved in size and copied one level deeper in the shallow octree. Voxels at depth $3$ in $O^{\mathsf{in}}$ are pooled. This can be formalized as

where $\operatorname{vxd}(\cdot)$ computes the depth of the indexed voxel in the shallow octree. A visual example is depicted in Fig. 5.

Unpooling

Again, we can define the analogous operation on the hybrid grid-octree data structure by

This operation also changes the data structure: The number of shallow octrees increases by a factor of $8$ , as each node at depth spawns a new shallow octree. All other nodes double their size. Thus, after this operation the tree depth is decreased. See Fig. 6 for a visual example of this operation.

Remark: To capture fine details, voxels can be split again at the finest resolution according to the original octree of the corresponding pooling layer. This allows us to take full advantage of skip connections. We follow this approach in our semantic 3D point cloud labeling experiments.

Experimental Evaluation

In this Section we leverage our OctNet representation to investigate the impact of input resolution on three different 3D tasks: 3D shape classification, 3D orientation estimation and semantic segmentation of 3D point clouds. To isolate the effect of resolution from other factors we consider simple network architectures. Orthogonal techniques like data augmentation, joint 2D/3D modeling or ensemble learning are likely to further improve the performance of our models.

We implemented the grid-octree data structure, all layers including the necessary forward and backward functions, as well as utility methods to create the data structure from point clouds and meshes, as a stand-alone C++/CUDA library. This allows the usage of our code within all existing deep learning frameworks. For our experimental evaluation we used the Torchhttp://torch.ch/ framework.

1 3D Classification

We use the popular ModelNet10 dataset for the 3D shape classification task. The dataset contains $10$ shape categories and consists of $3991$ 3D shapes for training and $908$ 3D shapes for testing. Each shape is provided as a triangular mesh, oriented in a canonical pose. We convert the triangle meshes to dense respective grid-octree occupancy grids, where a voxel is set to $1$ if it intersects the mesh. We scale each mesh to fit into a 3D grid of $(N-P)^{3}$ voxels, where $N$ is the number of voxels in each dimension of the input grid and $P=2$ is a padding parameter.

We first study the influence of the input resolution on memory usage, runtime and classification accuracy. Towards this goal, we create a series of networks of different input resolution from $8^{3}$ to $256^{3}$ voxels. Each network consists of several blocks which reduce resolution by half until we reach a resolution of $8^{3}$ . Each block comprises two convolutional layers ( $3^{3}$ filters, stride $1$ ) and one max-pooling layer ( $2^{3}$ filters, stride $2$ ). The number of feature maps in the first block is $8$ and increases by $6$ with every block. After the last block we add a fully-connected layer with $512$ units and a final output layer with $10$ units. Each convolutional layer and the first fully-connected layer are followed by a rectified linear unit as activation function and the weights are initialized as described in . We use the standard cross-entropy loss for training and train all networks for $20$ epochs with a batch size of $32$ using Adam . The initial learning rate is set to $0.001$ and we decrease the learning rate by a factor of $10$ after $15$ epochs.

Overall, we consider three different types of networks: the original VoxNet architecture of Maturana et al. which operates on a fixed $32^{3}$ voxel grid, the proposed OctNet and a dense version of it which we denote “DenseNet” in the following. While performance gains can be obtained using orthogonal approaches such as network ensembles or a combination of 3D and 2D convolutional networks , in this paper we deliberately focus on “pure” 3D convolutional network approaches to isolate the effect of resolution from other influencing factors.

Fig. 7 shows our results. First, we compare the memory consumption and run-time of our OctNet wrt. the dense baseline approach, see Fig. 7(a) and 7(b). Importantly, OctNets require significantly less memory and run-time for high input resolutions compared to dense input grids. Using a batch size of $32$ samples, our OctNet easily fits in a modern GPU’s memory (12GB) for an input resolution of $256^{3}$ . In contrast, the corresponding dense model fits into the memory only for resolutions ${\leq}64^{3}$ . A more detailed analysis of the memory consumption wrt. the sparsity in the data is provided in the supp. document. OctNets also run faster than their dense counterparts for resolutions ${>}64^{3}$ . For resolutions ${\leq}64^{3}$ , OctNets run slightly slower due to the overhead incurred by the grid-octree representation and processing.

Leveraging our OctNets, we now compare the impact of input resolution with respect to classification accuracy. Fig. 7(c) shows the results of different OctNet architectures where we keep the number of convolutional layers per block fixed to 1, 2 and 3. Fig. 7(d) shows a comparison of accuracy with respect to DenseNet and VoxNet when keeping the capacity of the model, i.e., the number of parameters, constant by removing max-pooling layers from the beginning of the network. We first note that despite its pooled representation, OctNet performs on par with its dense equivalent. This confirms our initial intuition (Fig. 1) that sparse data allows for allocating resources adaptively without loss in performance. Furthermore, both models outperform the shallower VoxNet architecture, indicating the importance of network depth.

Regarding classification accuracy we observed improvements for lower resolutions but diminishing returns beyond an input resolution of $32^{3}$ voxels. Taking a closer look at the confusion matrices in Fig. 9, we observe that higher input resolution helps for some classes, e.g., bathtub, while others remain ambiguous independently of the resolution, e.g., dresser vs. night stand. We visualize this lack of discriminative power by showing voxelized representations of 3D shapes from the ModelNet10 database Fig. 8. While bathtubs look similar to beds (or sofas, tables) at low resolution they can be successfully distinguished at higher resolutions. However, a certain ambiguity between dresser and night stand remains.

2 3D Orientation Estimation

In this Section, we investigate the importance of input resolution on 3D orientation estimation. Most existing approaches to 3D pose estimation assume that the true 3D shape of the object instance is known. To assess the generalization ability of 3D convolutional networks, we consider a slightly different setup where only the object category is known. After training a model on a hold-out set of 3D shapes from a single category, we test the ability of the model to predict the 3D orientation of unseen 3D shapes from the same category.

More concretely, given an instance of an object category with unknown pose, the goal is to estimate the rotation with respect to the canonical pose. We utilize the 3D shapes from the chair class of the ModelNet10 dataset and rotate them randomly between $\pm 15^{\circ}$ around each axis. We use the same network architectures and training protocol as in the classification experiment, except that the networks regress orientations. We use unit quaternions to represent 3D rotations and train our networks with an Euclidean loss. For small angles, this loss is a good approximation to the rotation angle $\phi=\arccos(2\langle q_{1},q_{2}\rangle^{2}-1)$ between quaternions $q_{1},q_{2}$ .

Fig. 10 shows our results using the same naming convention as in the previous Section. We observe that fine details are more important compared to the classification task. For the OctNet 1-3 architectures we observe a steady increase in performance, while for networks with constant capacity across resolutions (Fig. 10(b)), performance levels beyond $128^{3}$ voxels input resolution. Qualitative results of the latter experiment are shown in Fig. 11. Each row shows 10 different predictions for two randomly selected chair instance over several input resolutions, ranging from $16^{3}$ to $128^{3}$ . Darker colors indicate larger errors which occur more frequently at lower resolutions. In contrast, predictions at higher network resolutions cluster around the true pose. Note that learning a dense 3D representation at a resolution of $128^{3}$ voxels or beyond would not be feasible.

3 3D Semantic Segmentation

In this Section, we evaluate the proposed OctNets on the problem of labeling 3D point cloud with semantic information. We use the RueMonge2014 dataset that provides a colored 3D point cloud of several Haussmanian style facades, comprising ${\sim}1$ million 3D points in total. The labels are window, wall, balcony, door, roof, sky and shop.

For this task, we train a U-shaped network on three different input resolutions, $64^{3}$ , $128^{3}$ and $256^{3}$ , where the voxel size was selected such that the height of all buildings fits into the input volume. We first map the point cloud into the grid-octree structure. For all leaf nodes which contain more than one point, we average the input features and calculate the majority vote of the ground truth labels for training. As features we use the binary voxel occupancy, the RGB color, the normal vector and the height above ground. Due to the small number of training samples, we augment the data for this task by applying small rotations.

Our network architecture comprises an encoder and a decoder part. The encoder part consists of four blocks which comprise $2$ convolution layers ( $3^{3}$ filters, stride $1$ ) followed by one max-pooling layer each. The decoder consists of four blocks which comprise $2$ convolutions ( $3^{3}$ filters, stride $1$ ) followed by a guided unpooling layer as discussed in the previous Section. Additionally, after each unpooling step all features from the last layer of the encoder at the same resolution are concatenated to provide high-resolution details. All networks are trained with a per voxel cross entropy loss using Adam and a learning rate of $0.0001$ .

Table 1 compares the proposed OctNet to several state of the art approaches on the facade labeling task following the extended evaluation protocol of . The 3D points of the test set are assigned the label of the corresponding grid-octree voxels. As evaluation measures we use overall pixel accuracy $\tfrac{\text{TP}}{\text{TP}+\text{FN}}$ over all 3D points, average class accuracy, and intersection over union $\tfrac{\text{TP}}{\text{TP}+\text{FN}+\text{FP}}$ over all classes. Here, FP, FN and TP denote false positives, false negatives and true positives, respectively.

Our results clearly show that increasing the input resolution is essential to obtain state-of-the-art results, as finer details vanish at coarser resolutions. Qualitative results for one facade are provided in Fig. 12. Further results are provided in the supp. document.

Conclusion and Future Work

We presented OctNet, a novel 3D representation which makes deep learning with high-resolution inputs tractable. We analyzed the importance of high resolution inputs on several 3D learning tasks, such as object categorization, pose estimation and semantic segmentation. Our experiments revealed that for ModelNet10 classification low-resolution networks prove sufficient while high input (and output) resolution matters for 3D orientation estimation and 3D point cloud labeling. We believe that as the community moves from low resolution object datasets such as ModelNet10 to high resolution large scale 3D data, OctNet will enable further improvements. One particularly promising avenue for future research is in learning representations for multi-view 3D reconstruction where the ability to process high resolution voxelized shapes is of crucial importance.

References

A Appendix

In the following supplemental material we present details regarding our operations on the hybrid grid-octree data structure, additional experimental results and all details on the used network architectures. In Section A.2 we provide details on the data index used to efficiently access data in the grid-octree data structure. Section A.3 yields more insights into the efficient implementation of the convolution operation on octrees. We present further quantitative and qualitative results in Section A.4 and in Section A.8 we specify all details of the network architectures used in our experiments.

A.2 Data Index

An important implementation detail of the hybrid grid-octree data structure is the memory alignment for fast data access. We store all data associated with the leaf nodes of a shallow octree in a compact contiguous array. Thus, we need a fast way to compute the offset in this data array for any given voxel. In the main text we presented the following equation:

As explained in the main text the whole octree structure is stored as a bit-string and a voxel is uniquely identified by the bit index $i$ , i.e., the index within the bit string. The data is aligned breadth-first and only the leaf nodes have data associated. Consequently, the first part of the equation above counts the number of split and leaf nodes up to the voxel with bit index $i$ . The second term subtracts the number of split nodes before the particular voxel as data is only associated with leaf nodes. Finally, we need to get the offset within the voxel’s neighborhood. This is done by the last term of the equation.

Let us illustrate this with a simple example: For ease of visualization we will consider a quadtree. Hence, each voxel can be split into $4$ instead of $8$ children. The equation for the offset changes to

Now consider the following bit string for instance: $1\,0101\,0000\,1001\,0000\,0100$ . According to our definition, this bit string corresponds to the tree structure visualized in Fig. 13(a) and 13(b), where s indicates a split node and v a leaf node with associated data. In Fig. 13(c) we show the bit indices for all nodes. Note that the leaf nodes at depth $3$ do not need to be stored in the bit string as this information is implicit. Finally, the data index for all leaf nodes is visualized in Fig. 13(d). Now we can verify equation (14) using a simple example. Assume the bit index $51$ : The parent bit index is given by equation (15) as $12$ . To compute the data index we first count the number of nodes before $49$ as it is the first node within its siblings (first term of equation), which is $17$ . Next, we count the number of split nodes up to $49$ (second term of equation), which is $6$ . Finally, we look up the position $51$ within its siblings (last term of equation), which is $2$ . Combining those three terms yields the data index $17-6+2=13$ .

A.3 Efficient Convolution

In the main text we discussed that the convolution for larger octree cells and small convolution kernels can be efficiently implemented. A naïve implementation applies the convolution kernel at every location $(i,j,k)$ comprised by the cell $\Omega[i,j,k]$ . Therefore, for an octree cell of size $8^{3}$ and a convolution kernel kernel of $3^{3}$ this would require $8^{3}\cdot 3^{3}=13,824$ multiplications. However, we can implement this calculation much more efficiently as depicted in Fig. 14. We observe that the value inside the cell of size $8^{3}$ is constant. Thus, we only need to evaluate the convolution once inside this cell and multiply the result with the size of the cell $8^{3}$ , see Fig. 14(a). Additionally, we only need to evaluate a truncated versions of the kernel on the corners, edges and faces of the voxel, see Fig. 14(b)-d. This implementation is more efficient, as we need only $27$ multiplications for the constant part, $8\cdot 19$ multiplications for the corners, $12\cdot 6\cdot 15$ multiplications for the edges, and $6\cdot 6^{2}\cdot 9$ multiplications for the faces of the voxel. In total this yields $3203$ multiplications, or $23.17\%$ of the multiplications required by the naïve implementation.

A.4 Additional Results

In this Section we show additional quantitative and qualitative results for 3D shape classification, 3D orientation estimation and semantic 3D point labeling.

A.5 3D Classification

In the main text of our work we analyzed the runtime and memory consumption of OctNet compared with the equivalent dense networks on ModelNet10 . Additionally, we demonstrated that without further data augmentation, ensemble learning, or more sophisticated architectures the accuracy saturates at an input resolution of about $16^{3}$ , when keeping the number of network parameters fixed across all resolutions. In this Section we show the same experiment on ModelNet40 . The results are summarized in Fig. 15. In contrast to ModelNet10, we see an increase in accuracy up to an input resolution of $32^{3}$ . Beyond this resolution the classification performance does not further improve. Note that the only form of data augmentation we used in this experiment was rotation around the up-vector as the 3D models in this dataset vary in pose. We conclude that object classification on the ModelNet40 dataset is more challenging than on the ModelNet10 dataset, but both datasets are relatively easy in the sense that details do not matter as much as in the datasets used for our other experiments.

We further use this experiment to investigate the question at which level of sparsity OctNet becomes useful. To answer this, we visualize the memory consumption of a dense representation vs. our data structure at different resolutions and occupancies by sampling 500 shapes from the ModelNet40 dataset (see Fig. 16). It can be observed that even at very low resolutions ( $8^{3}$ ), our data structure is superior compared to the dense representation, even up to an occupancy level of $50\%$ . As the voxel resolution increases, the occupancy levels (x-axis) decreases since the data becomes sparser. Our OctNet exploits this sparsity to achieve a significant reduction in memory consumption.

A.6 3D Orientation Estimation

To demonstrate that OctNet can handle input resolutions larger than $256^{3}$ we added results for the 3D orientation estimation experiment with an input resolution of $512^{3}$ . The results are presented in Fig. 17. As for the orientation experiment in the main paper, we can observe the trend that performance increases with increasing input resolution.

We evaluated our OctNet also on the Biwi Kinect Head Pose Database as an additional experiment on 3D pose estimation. The dataset consists of $24$ sequences of $20$ individuals sitting in front of a Kinect depth sensor. For each frame the head center and the head pose in terms of its 3D rotation is annotated. We split the dataset into a training set of $18$ individuals for training and $2$ individuals for testing and project the depth map to 3D points with the given camera parameters. We then create the hybrid grid-octree structure from the 3D points that belong to the head (In this experiment we are only interested in 3D orientation estimation, as the head can be reliable detected in the color images). As in the previous experiment we parameterize the orientation with unit quaternions and train our OctNet using the same settings as in the previous 3D orientation estimation experiment. Fig. 18 shows the quantitative results over varying input resolutions. We see a reasonable improvement of accuracy from $8^{3}$ up to $64^{3}$ . Beyond this input resolution the octree resolution becomes finer than the resolution of the 3D point cloud. Thus, further improvements can not be expected. In Fig. 19 we show some qualitative results.

A.7 3D Semantic Segmentation

In this Section we present additional qualitative results of the semantic 3D point labeling task in Fig. 20 to 22. In these visualizations we show the color part of the voxelized input, the result of the labeling in the voxel representation, and the result back-projected to the 3D point cloud for different houses in the test set of .

A.8 Network Architecture Details

In this Section we detail the network architectures used throughout our experimental evaluations. We use the following notation for brevity: $\textrm{conv}(x,y)$ denotes a $3^{3}$ convolutional layer with $x$ input feature maps and $y$ output feature maps. Similarly, $\textrm{maxpool}(f)$ is a max-pooling operation that decreases dimensionality by a factor of $f$ along each axis. All convolutional and fully-connected layers, except the very last one, are followed by a ReLU as activation function.

In the first two experiments, 3D classification and 3D orientation estimation, we show two different classes of architectures. In the first one we keep the number of convolution layers per block fixed and add blocks depending on the input resolution of the network. We call those networks OctNet1, OctNet2, and OctNet3, depending on the number of convolution layers per block. Therefore, the number of parameters increases along with the input resolution. The detailed architectures are depicted in Table 2, 3, and 4 for the classification task and in Table 6, 7, and 8 for the orientation estimation tasks, respectively. Second, we trained network architectures where we keep the number of parameters fixed, independently of the input resolution. The detailed network architectures for those experiments are presented in Table 5 and 9.

Finally, for semantic 3D point labeling, we use the U-Net type architecture shown in Table 10. We use a concatenation layer $\textrm{concat}(\cdot,\cdot)$ to combine the outputs from the decoder and encoder parts of the networks to preserve details.