GIFT: A Real-time and Scalable 3D Shape Search Engine

Song Bai, Xiang Bai, Zhichao Zhou, Zhaoxiang Zhang, Longin Jan Latecki

Introduction

3D shape retrieval is a fundamental issue in computer vision and pattern recognition. With the rapid development of large scale public 3D repositories, e.g., Google 3D Warehouse or TurboSquid, and large scale shape benchmarks, e.g., ModelNet , SHape REtrieval Contest (SHREC) , the scalability of 3D shape retrieval algorithms becomes increasingly important for practical applications. However, efficiency issue has been more or less ignored by previous works, though enormous efforts have been devoted to retrieval effectiveness, that is to say, to design informative and discriminative features to boost the retrieval accuracy. As suggested in , plenty of these algorithms do not scale up to large 3D shape databases due to their high time complexity.

Meanwhile, owing to the fact that human visual perception of 3D shapes depends upon 2D observations, projective analysis has became a basic and inherent tool in 3D shape domain for a long time, with applications to segmentation , matching , reconstruction, etc.. Specifically in 3D shape retrieval, projection-based methods demonstrate impressive performances. Especially in recent years, the success of planar image representation , makes it easier to describe 3D models using depth or silhouette projections.

Generally, a typical 3D shape search engine is comprised of the following four components (see also Fig. 1):

Projection rendering. With a 3D model as input, the output of this component is a collection of projections. Most methods set an array of virtual cameras at pre-defined view points to capture views. These view points can be the vertices of a dodecahedron , located on the unit sphere , or around the lateral surface of a cylinder . In most cases, pose normalization is needed for the sake of invariance to translation, rotation and scale changes.

View feature extraction. The role of this component is to obtain multiple view representations, which affects the retrieval quality largely. A widely-used paradigm is Bag-of-Words (BoW) model, since it has shown its superiority as natural image descriptors. However, in order to get better performances, many features are of extremely high dimension. As a consequence, raw descriptor extraction (e.g., SIFT ), quantization and distance calculation are all time-consuming.

Multi-view matching. This component establishes the correspondence between two sets of view features, and returns a matching cost between two 3D models. Since at least a set-to-set matching strategy is required, this stage suffers from high time complexity even when using the simplest Hausdorff matching. Hence, the usage of algorithms incorporated with some more sophisticated matching strategies on large scale 3D datasets is limited due to their heavy computational cost.

Re-ranking. It aims at refining the initial ranking list by using some extra information. For retrieval problems, since no prior or supervised information is available, contextual similarity measure is usually utilized. A classic context-based re-ranking methodology for shape retrieval is diffusion process , which exhibits outstanding performance on various datasets. However, as graph-based and iterative algorithms, many variants of diffusion process (e.g., locally constrained diffusion process ), generally require the computational complexity of O(TN3)O(TN^{3}), where NN is the total number of shapes in the database and TT is the number of iterations. In this sense, diffusion process does not seem to be applicable for real-time analysis.

In this paper, we present a real-time 3D shape search engine using projections that includes all the aforementioned components. It combines Graphics Processing Unit (GPU) acceleration and Inverted File (Twice), hence we name it GIFT. In on-line processing, once a user submits a query shape, GIFT can react and present the retrieved shapes within one second (the off-line preprocessing operations, such as CNN model training and inverted file establishment, are excluded). GIFT is evaluated on several popular 3D benchmarks datasets, especially on one track of SHape REtrieval Contest (SHREC) which focuses on scalable 3D retrieval. The experimental results on retrieval accuracy and query time demonstrate the capability of GIFT in handling large scale data.

In summary, our main contributions are as follows. Firstly, GPU is used to speed up the procedure of projection rendering and feature extraction. Secondly, in multi-view matching procedure, a robust version of Hausdorff distance for noise data is approximated with an inverted file, which allows for extremely efficient matching between two view sets without impairing the retrieval performances too much. Thirdly, in the re-ranking component, a new context-based algorithm based on fuzzy set theory is proposed. Different from diffusion processes of high time complexity, our re-ranking here is ultra time efficient on the account of using inverted file again.

Proposed Search Engine

Prior to projection rendering, pose normalization for each 3D shape is needed in order to attain invariance to some common geometrical transformations. However, apart from many pervious algorithms that require rotation normalization using some Principal Component Analysis (PCA) techniques, we only normalize the scale and the translation in our system. Our concerns are two-fold: 1) PCA techniques are not always stable, especially when dealing with some specific geometrical characteristics such as symmetries, large planar or bumpy surfaces; 2) the view feature used in our system can tolerate the rotation issue to a certain extent, though cannot be completely invariant to such changes. In fact, we observe that if enough projections (more than 2525 in our experiments) are used, one can achieve reliable performances.

The projection procedure is as follows. Firstly, we place the centroid of each 3D shape at the origin of a spherical coordinate system, and resize the maximum polar distance of the points on the surface of the shape to unit length. Then NvN_{v} virtual cameras are set on the unit sphere evenly, and they are located by the azimuth θaz\theta_{az} and the elevation θel\theta_{el} angles. At last, we render one projected view in depth buffer at each combination of θaz\theta_{az} and θel\theta_{el}. For the sake of speed, GPU is utilized here such that for each 3D shape, the average time cost of rendering 6464 projections is only 30ms30ms.

2 Feature Extraction via GPU Acceleration

Feature design has been a crucial problem in 3D shape retrieval for a long time owing to its great influence on the retrieval accuracy. Though extensively studied, almost all the existing algorithms ignore the efficiency of the feature extraction.

To this end, our search engine adopts GPU to accelerate the procedure of feature extraction. Impressed by the superior performance of deep learning approaches in various visual tasks, we propose to use the activation of a Convolutional Neural Network (CNN). The CNN used here takes depth images as input, and the loss function is exerted on the classification error for projections. The network architecture consists of five successive convolutional layers and three fully connected layers as in . We normalize each activation in its Euclidean norm to avoid scale changes. It only takes 56ms56ms on average to extract the view features for a 3D model.

Since no prior information is available to judge the discriminative power of activations of different layers, we propose a robust re-ranking algorithm described in Sec. 2.4. It can fuse those homogenous features efficiently based on fuzzy set theory.

3 Inverted File for Multi-view Matching

Consider a query shape xqx_{q} and a shape xpx_{p} from the database X={x1,x2,,xN}\mathcal{X}=\{x_{1},x_{2},\dots,x_{N}\}. Let V\mathcal{V} denote a mapping function from 3D shapes to their feature sets. We can obtain two sets V(xq)={q1,q2,,qNv}\mathcal{V}(x_{q})=\{q_{1},q_{2},\dots,q_{N_{v}}\} and V(xp)={p1,p2,,pNv}\mathcal{V}(x_{p})=\{p_{1},p_{2},\dots,p_{N_{v}}\} respectively, where NvN_{v} is the number of views. qiq_{i} (or pip_{i}) denotes the view feature assigned to the ii-th view of shape xqx_{q} (or xpx_{p}).

A 3D shape search engine requires a multi-view matching component to establish a correspondence between two sets of view features. These matching strategies are usually metrics defined on sets (e.g., Hausdorff distance) or graph matching algorithms (e.g., Hungarian method, Dynamic Programming, clock-matching). However, these pairwise strategies are time-consuming for a real-time search engine. Among them, Hausdorff distance may be the most efficient one, since it only requires some simple algebraic operations without sophisticated optimizations.

Recall that the standard Hausdorff distance measures the difference between two sets, and it is defined as

where function d()d(\cdot) measures the distance between two input vectors. In order to eliminate the disturbance of isolated views in the query view set, a more robust version of Hausdorff distance is given by

For the convenience of analysis, we consider its dual form in the similarity space as

where s()s(\cdot) measures the similarity between the two input vectors. In this paper, we adopt the cosine similarity.

As can be seen from Eq. (2) and Eq. (3), Hausdorff matching requires the time complexity O(N×Nv2)O(N\times{N_{v}}^{2}) for retrieving a given query (assuming that there are NN shapes in the database). Though the complexity grows linearly with respect to the database size, it is still intolerable when NN gets larger. However, by analyzing Eq. (3), we can make several observations: (1) let s(qi)=max1jNvs(qi,pj)s^{\ast}(q_{i})=\max_{1\leq j\leq N_{v}}s(q_{i},p_{j}), the similarity calculations of s(qi,pj)s(q_{i},p_{j}) are unnecessary when s(qi,pj)<s(qi)s(q_{i},p_{j})<s^{\ast}(q_{i}), since these similarity values are unused due to the maxmax operation, i.e., only s(qi)s^{\ast}(q_{i}) is kept; (2) when considering from the query side, we can find that s(qi)s^{\ast}(q_{i}) counts little to the final matching cost if s(qi)<ξs^{\ast}(q_{i})<\xi and ξ\xi is a small threshold. Those observations suggest that although the matching function in Eq. (3) requires the calculation of all the pairwise similarities between two view sets, some similarity calculations, which generate small values, can be eliminated without impairing the retrieval performance too much.

In order to avoid these unnecessary operations and improve the efficiency of multi-view matching procedure, we adopt inverted file to approximate Eq. (3) by adding the Kronecker delta response as

where δx,y=1\delta_{x,y}=1 if x=yx=y, and δx,y=0\delta_{x,y}=0 if xyx\neq y. The quantizer c(x)=argmin1iKxbi2c(x)=\arg\min_{1\leq i\leq K}\|x-b_{i}\|^{2} maps the input feature into an integer index that corresponds to the nearest codeword of the given vocabulary B={b1,b2,,bK}B=\{b_{1},b_{2},\dots,b_{K}\}. As a result, the contribution of pjp_{j}, which satisfies c(qi)c(pj)c(q_{i})\neq c(p_{j}), to the similarity measure can be directly set to zero, without estimating s(qi,pj)s(q_{i},p_{j}) explicitly.

In conclusion, our inverted file for multi-view matching is built as illustrated in Fig. 2. For each view feature, we store it and its corresponding shape ID in the nearest codeword. It should be mentioned that we can also use Multiple Assignment (MA), i.e., assign each view to multiple codewords, to improve the matching precision at the sacrifice of memory cost and on-line query time.

4 Inverted File for Re-ranking

A typical search engine usually involves a re-ranking component , aiming at refining the initial candidate list by using some contextual information. In GIFT, we propose a new contextual similarity measure called Aggregated Contextual Activation (ACA), which follows the same principles as diffusion process , i.e., the similarity between two shapes should go beyond their pairwise formulation and is influenced by their contextual distributions along the underlying data manifold. However, apart from diffusion process which has high time complexity, ACA enables real-time re-ranking, which can be applied to large scale data.

Let Nk(xq)\mathcal{N}_{k}(x_{q}) denote the neighbor set of xqx_{q}, which contains its top-kk neighbors. Similar to , our basic idea is that the similarity between two shapes can be more reliably measured by comparing their neighbors using Jaccard similarity as

One can find that the neighbors are treated equally in Eq. (5). However the top-ranked neighbors are more likely to be true positives. So a more proper behavior is increasing the weights of top-ranked neighbors.

To achieve this, we propose to define the neighbor set using fuzzy set theory. Different from classical (crisp) set theory where each element either belongs or does not belong to the set, fuzzy set theory allows a gradual assessment of the membership of elements in a set. We utilize S(xq,xi)S(x_{q},x_{i}) to measure the membership grade of xix_{i} in the neighbor set of xqx_{q}. Accordingly, Eq. (5) is re-written as

Based on this definition we replace Eq. 6 with

Considering vector FqF_{q} is sparse, we can view it as sparse activation of shape xqx_{q}, where the activation at coordinate ii is the membership grade of xix_{i} in the neighbor set Nk(xq)\mathcal{N}_{k}(x_{q}). Eq. (8) utilizes the sparse activations FqF_{q} and FpF_{p} to define the new contextual shape similarity measure.

Note that all the above analysis is carried out for only one similarity measure. However, in our specific scenario, the outputs of different layers of CNN are usually at different abstraction resolutions.

For example, two different layers of CNN lead to two different similarities S(1)S^{(1)} and S(2)S^{(2)} by Eq. (3), which in turn yield two different sparse activations Fq(1)F_{q}^{(1)} and Fq(2)F_{q}^{(2)} by Eq. (7). Since no prior information is available to assess their discriminative power, our goal now is to fuse them in a unsupervised way. For this we utilize the aggregation operation in fuzzy set theory, by which several fuzzy sets are combined in a desirable way to produce a single fuzzy set. We consider two fuzzy sets represented by the sparse activations Fq(1)F_{q}^{(1)} and Fq(2)F_{q}^{(2)} (the extension to more than two activations is similar) . Their aggregation is then defined as

which computes the element-wise generalized means with exponent α\alpha of Fq(1)F_{q}^{(1)} and Fq(2)F_{q}^{(2)}. Instead of using arithmetic mean, we use this generalized means (α\alpha is set to 0.50.5 throughout our experiments). Our concern for this is to avoid the problem that some artificially large elements in FqF_{q} dominate the similarity measure. This motivation is very similar to handling bursty visual elements in Bag-of-Words (BoW) model (see for examples).

In summary, we call the feature in Eq. (9) Aggregated Contextual Activation (ACA). Next, we will introduce some improvements of Eq. (9) concerning its retrieval accuracy and computational efficiency.

Similar to diffusion process, the proposed ACA requires an accurate estimation of the context in the data manifold. Here we provide two alternative ways to improve the retrieval performance of ACA without depriving its efficiency.

Neighbor Augmentation. The first one is to augment FqF_{q} using the neighbors of second order, i.e., the neighbors of the neighbors of xqx_{q}. Inspired by query expansion , the second order neighbors are added as

Neighbor Co-augmentation. Our second improvement is to use a so-called “neighbor co-augmentation”. Specifically, the neighbors generated by one similarity measure are used to augment contextual activations of the other similarity measure, formally defined as

This formula is inspired by “co-training” . Essentially, one similarity measure tells the other one that “I think these neighbors to be true positives, and lend them to you such that you can improve your own discriminative power”.

Note that the size of neighbor set used here may be different from that used in Eq. (7). In order to distinguish them, we denote the size of neighbor set in Eq. (7) as k1k_{1}, while that used in Eq. (10) and Eq. (11) as k2k_{2}.

4.2 Improving Efficiency

Considering that the length of FqF_{q} is NN, one may doubt the efficiency of similarity computation in Eq. (8), especially when the database size NN is large. In fact, FqF_{q} is a sparse vector, since FqF_{q} only encodes the neighborhood structure of xqx_{q}, and the number of non-zero values is only determined by the size of Nk(xq)\mathcal{N}_{k}(x_{q}). This observation motivate us to utilize an inverted file again to leverage the sparsity of FqF_{q}. Now we derive the feasibility of applying inverted file in Jaccard similarity theoretically.

Since all values of the aggregated contextual activation are non-negative, the last two items in Eq. (12) are equal to zero. Consequently, Eq. (12) can be simplified as

which only requires accessing non-zero entries of the query, and hence can be computed efficiently on-the-fly.

Although the calculation of the denominator in Eq. (8) seems sophisticated, it can be expressed as

Besides the query-dependent operations (the first and the last items), Eq. (14) only involves an operation of L1L_{1} norm calculation of FpF_{p}, which is simply equal to the cardinality of the fuzzy set Nk(xp)\mathcal{N}_{k}(x_{p}) and can be pre-computed off-line.

Our inverted file for re-ranking is built as illustrated in Fig. 3. It has exactly NN entries, and each entry corresponds to one shape in the database. For each entry, we first store the cardinality of its fuzzy neighbor set. Then, we find those shapes which have non-negative membership values in this entry. Those shape IDs and the membership values are stored in this entry.

Experiments

In this section, we evaluate the performance of GIFT on different kinds of 3D shape retrieval tasks. The evaluation metrics used in this paper include mean average precision (MAP), area under curve (AUC), Nearest Neighbor (NN), First Tier (FT) and Second Tier (ST). Refer to for their detailed definitions.

If not specified, we adopt the following setup throughout our experiments. The projection rendered for each shape is Nv=64N_{v}=64. For multi-view matching procedure, the approximate Hausdorff matching defined in Eq. (4) with an inverted file of 256256 entries is used. Multiple Assignment is set to 22. We use two pairwise similarity measures, which are calculated using features from convolutional layer L5L_{5} and fully-connected layer L7L_{7} respectively. In re-ranking component, each similarity measure generates one sparse activation FqF_{q} to capture the contextual information for the 3D shape xqx_{q}, and neighbor co-augmentation in Eq. (11) is used to produce Fq(1)F^{(1)}_{q} and Fq(2)F^{(2)}_{q}. Finally, both Fq(1)F^{(1)}_{q} and Fq(2)F^{(2)}_{q} are integrated by (9) with exponent α=0.5\alpha=0.5.

ModelNet is a large-scale 3D CAD model dataset introduced by Wu et al. recently, which contains 151,128151,128 3D CAD models divided into 660660 object categories. Two subsets are used for evaluation, i.e., ModelNet40 and ModelNet10. The former one contains 12,31112,311 models, and the latter one contains 4,8994,899 models. We evaluate the performance of GIFT on both subsets and adopt the same training and test split as in , namely randomly selecting 100100 unique models per category from the subset, in which 8080 models are used for training the CNN model and the rest for testing the retrieval performance.

For comparison, we collected all the retrieval results publicly availablehttp://modelnet.cs.princeton.edu/. The chosen methods are (Spherical Harmonic) SPH , (Light Field descriptor) LFD , PANORAMA , 3D ShapeNet , DeepPano and MVCNN . As Table 1 shows, GIFT outperforms all the state-of-the-art methods remarkably. We also present the performance of two baseline methods, i.e., feature L5L_{5} or L7L_{7} with exact Hausdorff matching. As can be seen, L7L_{7} achieves a better performance than L5L_{5}, and GIFT leads to a significant improvement over L7L_{7} of 5.82% in AUC, 5.31% in MAP for ModelNet40 dataset, and 3.32% in AUC, 3.07% in MAP for ModelNet10 dataset.

Fig. 4 compares the precision-recall curves. It demonstrates again the discriminative power of the proposed search engine in 3D shape retrieval. Note that ModelNet also defines the 3D shape classification tasks. Considering GIFT is initially developed for real-time retrieval, its classification results are given in the supplementary material.

2 Large Scale Competition

As the most authoritative 3D retrieval competition held each year, SHape REtrieval Contest (SHREC) pays much attention to the development of scalable algorithms gradually. Especially in recent years, several large scale tracks , such as SHREC14LSGTB , are organized to test the scalability of algorithms. However, most algorithms that the participants submit are of high time complexity, and cannot be applied when the dataset becomes larger (millions or more). Here we choose SHREC14LSGTB dataset for a comprehensive evaluation. This dataset contains 8,9878,987 3D models classified into 171171 classes, and each 3D shape is taken in turn as the query. As for the feature extractor, we collected 54,72854,728 unrelated models from ModelNet divided into 461461 categories to train a CNN model.

To keep the comparison fair, we choose two types of results from the survey paper to present in Table 2. The first type consists of the top-55 best-performing methods on retrieval accuracy, including PANORAMA , DBSVC, MR-BF-DSIFT, MR-D1SIFT and LCDR-DBSVC. The second type is the most efficient one, i.e., ZFDR .

As can be seen from the table, excluding GIFT, the best performance is achieved by LCDR-DBSVC. However, it requires 668.6s668.6s to return the retrieval results per query, which means that 6969 days are needed to finish the query task on the whole dataset. The reason behind such a high complexity lies in two aspects: 1) its visual feature is 270K270K dimensional, which is time consuming to compute, store and compare; 2) it adopts locally constrained diffusion process (LCDP) for re-ranking, while it is known that LCDP is an iterative graph-based algorithm of high time complexity. As for ZFDR, its average query time is shortened to 1.77s1.77s by computing parallel on 1212 cores. Unfortunately, ZFDR achieves much less accurate retrieval performance, and its FT is 13%13\% smaller than LCDR-DBSVC. In summary, a conclusion can be drawn that no method can achieve a good enough performance at a low time complexity.

By contrast, GIFT outperforms all these methods, including a very recent algorithm called Two Layer Coding (TLC) which reports 0.5850.585 in FT. What is more important that GIFT can provide the retrieval results within 63.14ms63.14ms, which is 44 orders of magnitude faster than LCDR-DBSVC. Meanwhile, the two baseline methods L5L_{5} and L7L_{7} incur heavy query cost due to the usage of exact Hausdorff matching, which testifies the advantage of the proposed F-IF.

3 Generic 3D Retrieval

Following , we select three popular datasets for a generic evaluation, including Princeton Shape Benchmark (PSB) , Watertight Models track of SHape REtrieval Contest 2007 (WM-SHREC07) and McGill dataset . Among them, PSB dataset is probably the first widely-used generic shape benchmark, and it consists of 907907 polygonal models divided into 9292 categories. WM-SHREC07 contains 400400 watertight models evenly distributed in 2020 classes, and is a representative competition held by SHREC community. McGill dataset focuses on non-rigid analysis, and contains 255255 articulated objects classified into 1010 classes. We train CNN on an independent TSB dataset , and then use the trained CNN to extract view features for the shapes in all the three testing datasets.

In Table 3, a comprehensive comparison between GIFT and various state-the-art methods is presented, including LFD , the curve-based method of Tabia et al. , DESIRE descriptor , total Bregman Divergences (tBD) , Covariance descriptor , the Hybrid of 2D and 3D descriptor , Two Layer Coding (TLC) and PANORAMA . As can be seen, GIFT exhibits encouraging discriminative ability in retrieval accuracy and achieves state-of-the-art performances consistently on all the three evaluation metrics.

4 Execution Time

In addition to state-of-the-art performances on several datasets and competitions, the most important property of GIFT is the “real-time” performance with the potential of handling large scale shape corpora. In Table 4, we give a deeper analysis of the time cost. The off-line operations mainly include projection rendering and feature extraction for database shapes, training CNN, and building two inverted files. As the table shows, the time cost of off-line operations varies significantly for different datasets. Among them, the most time-consuming operation is training CNN, followed by building the first inverted file with k-means. However, the average query time for different datasets can be controlled within one second, even for the biggest SHREC14LSGTB dataset.

5 Parameter Discussion

Due to the space limitation, the discussion is conducted only on PSB dataset.

Improvements Over Baseline. In Table 5, a thorough discussion is given about the influence of various components of GIFT. We can observe a consistent performance boost by those improvements. The performance jumps a lot especially when re-ranking component is embedded. One should note a slight performance decrease when approximate Hausdorff matching with F-IF is used as compared with its exact version. However, as discussed below, the embedding with inverted file does not necessarily result in a poorer performance, but shortens the query time significantly.

Discussion on F-IF. In Fig. 5, we plot the retrieval performance and the average query time using feature L7L_{7}, as the number of entries used in the first inverted file changes. As Fig. 5 shows, the retrieval performance generally decreases with more entries, and multiple assignment can boost the retrieval performance significantly. However, it should be addressed that a better approximation to Eq. (3) using fewer entries (decreasing KK) or larger multiple assignments (increasing MA) does not necessarily imply a better retrieval performance. For example, when K=256K=256 and MA=2=2, the performance of approximate Hausdorff matching using inverted file surpasses the baseline using exact Hausdorff matching. The reason for this “abnormal” observation is that the principle of inverted file here is to reject those view matching operations that lead to smaller similarities, and sometimes they are noisy and false matching pairs which can be harmful to retrieval performance.

As can be seen from Fig. 5, the average query time is higher at smaller KK and larger MA, since the two cases both increase the number of candidate matchings in each entry. The baseline query time using exact Hausdorff matching is 0.69s0.69s, which is at least one order of magnitude larger than the approximate one.

Discussion on S-IF. Two parameters, k1k_{1} and k2k_{2}, are involved in the second inverted file, which are determined empirically. We plot the influence of them in Fig. 6. As can be drawn from the figure, when k1k_{1} increases, the retrieval performance increases at first. Since noise contextual information can be included at a larger k1k_{1}, we can observe the performance decreases after k1>10k_{1}>10. Meanwhile, neighbor augmentation can boost the performance further. For example, the best performance is achieved when k2=4k_{2}=4. However, when k2=5k_{2}=5, the performance tends to decrease. One may find that the optimal value of k2k_{2} is much smaller than that of k1k_{1}. The reason for this is that k2k_{2} defines the size of the second order neighbor, which is more likely to return noise context compared with the first order neighbor defined by k1k_{1}.

Conclusions

In the past years, 3D shape retrieval was evaluated with only small numbers of shapes. In this sense, the problem of 3D shape retrieval has stagnated for a long time. Only recently, shape community started to pay more attention to the scalable retrieval issue gradually. However, as suggested in , most classical methods encounter severe obstacles when dealing with larger databases.

In this paper, we focus on the scalability of 3D shape retrieval algorithms, and build a well-designed 3D shape search engine called GIFT. In our retrieval system, GPU is utilized to accelerate the speed of projection rendering and view feature extraction, and two inverted files are embedded to enable real-time multi-view matching and re-ranking. As a result, the average query time is controlled within one second, which clearly demonstrates the potential of GIFT for large scale 3D shape retrieval. What is more impressive is that while preserving the high time efficiency, GIFT outperforms state-of-the-art methods in retrieval accuracy by a large margin. Therefore, we view the proposed search engine as a promising step towards larger 3D shape corpora.

We submitted a version of GIFT to the latest SHREC2016 large scale track (the results are available in https://shapenet.cs.stanford.edu/shrec16/), and won the first place on perturbed dataset.

References