Learning Local Image Descriptors with Deep Siamese and Triplet Convolutional Networks by Minimising Global Loss Functions

Vijay Kumar B G, Gustavo Carneiro, Ian Reid

Introduction

The design of effective local image descriptors has been instrumental for the application of computer vision methods in several problems involving the matching of local image patches, such as wide baseline stereo , structure from motion , image classification , just to name a few. Over the last decades, numerous hand-crafted and automatically learned local image descriptors have been proposed and used in the applications above. Despite their conceptual differences, these two types of local descriptors are formed based on similar goals: descriptors extracted from local image patches of the same 3-D location of a scene must be unique (compared with descriptors from different 3-D locations) and robust to brightness and geometric deformations. Given the difficulty in guaranteeing such goals for hand-crafted local descriptors , the field has gradually focused more on the automatic learning of such local descriptors, where an objective function that achieves the goals above is used in an optimisation procedure. In particular, the most common objective function minimises the distance between the descriptors from the same 3-D location (i.e., same class) extracted under varying imaging conditions and different viewpoints and, at the same time, maximises that distance between patches from different 3-D locations (or different classes) .

The more recently proposed approaches based on deep ConvNets optimise slightly new objective functions that have the same goal as mentioned above. Specifically, Zagoruyko and Komodakis and Han et al. minimise a pairwise similarity loss of local image patches using a siamese network (see Fig. 1-(b)), where the patches can belong to the same or different classes (a class is for example a specific 3-D location). Dosovitskiy et al. minimise a multi-class classification loss (Fig. 1-(c)), where the model outputs the classification of a single input patch into one of the many descriptor classes (estimated in an unsupervised manner). Moreover, Masci et al. propose a siamese network trained with a pairwise loss that minimises the distance (in the embedded space) between patches of the same class and maximises the distance between patches of different classes (Fig. 1-(b)). Even though these methods show substantial gains compared to the previous state of the art in public benchmark datasets , we believe that the loss functions and network structures being explored for this task can be improved. For instance, the triplet network (see Fig. 1-(d)) has been shown to improve the siamese network on several classification problems, and the training of the siamese and triplet networks can involve loss functions based on global classification results, which has the potential to generalise better.

In this paper, we propose the use of the triplet network (Fig. 1-(d)) and a new global loss function to train local image descriptor learning models that can be applied to the siamese and triplet networks (Fig. 1-(b),(d)). The global loss to produce a feature embedding minimises the variance of the distance between descriptors (in the embedded space) belonging to the same and different classes, minimises the mean distance between descriptors belonging to the same class and maximises the mean distance between descriptors belonging to different classes (Fig. 1-(b),(d)). For the case of pairwise similarity in siamese networks, this global loss minimises the variances of the pairwise similarity between descriptors belonging to the same and different classes, maximises the mean similarity between descriptors belonging to the same class and minimises the mean similarity between descriptors belonging to different classes (Fig. 1-(b)). We first extend the siamese network to a triplet network, trained with a triplet loss and regularised by the proposed global loss (embedding). Then we take the siamese network and train it exclusively with the global loss (pairwise similarity). Finally, we take the central-surround siamese network , which is the current state-of-the-art model for the problem of local image descriptor learning, and train it with the global loss (pairwise similarity). We show on the public benchmark UBC dataset that: 1) the triplet network shows better classification results than the siamese network ; 2) the combination of the triplet and the global loss functions improves the results produced by the triplet loss from item (1) above, resulting in the best embedding result in the field for the UBC dataset; and 3) the global loss function used to train the central-surround siamese network produces the best classification result on the UBC dataset.

Related Work

In this section, we first discuss metric learning methods, which form the basis for several local image descriptor learning approaches. Then, we discuss relevant local image descriptor learning methods recently proposed in the field, and highlight our contributions.

Extending (1) to a non-linear transformation can be done by re-formulating $\mathbf{S}_{k}$ such that it involves inner products, which can then be kernelised , and the optimisation is again solved with generalised Eigenvalue problem . Alternatively, this non-linear transform can be learned with a ConvNet using a siamese network that minimises a pairwise loss (Fig. 1-(b)) by reducing the distance of patches (in the embedded space) belonging to the same class and increasing the distance of patches from different classes, similarly to the objective function derived from (1). Note that this siamese network can produce either an embedding or a pairwise similarity estimation, depending on the architecture and loss function. This siamese network has been extended to a triplet network that uses a triplet loss (Fig. 1-(d)), which has been shown not only to produce the best classification results in several problems (e.g., STL10 , LineMOD , Labelled Faces in the Wild), but also to produce effective feature embeddings.

2 Local Image Descriptor

In the past, many local image descriptor learning methodologies have been proposed, with most based on the linear or non-linear distance metric learning, and explored in different ways . However, these methods have been shown to produce significantly worse classification results on the UBC dataset than the recently proposed siamese deep ConvNets (note that the UBC dataset is a benchmark dataset that has been used to compare local image descriptors). Even though the triplet network has been demonstrated to improve the results produced by the siamese networks, it has yet to be applied to the problem of local image descriptor learning. Finally, another relevant method is the discriminative unsupervised learning of local descriptors , which uses a single deep ConvNet to classify input local patches into many classes, which are generated in an unsupervised manner (Fig. 1-(c)). However, the latter method has not been applied to the UBC dataset mentioned above. It is also important to notice that none of the deep ConvNets methods above use the whole training set in a global loss function during the learning process, which can improve the generalisation ability of the model.

Methodology

The siamese network is trained with a two-tower deep ConvNet (Fig. 1-(b)), where the weights on both towers are initialised at the same values and during stochastic gradient descent, they receive the same gradients (i.e., the weights on both towers are tied). We consider the following definition of a deep ConvNet:

where the parameter $\theta_{f}$ is formed by the network weight matrices, bias vectors, and normalisation parameters for each layer $l\in\{1,...,L\}$ , $f_{l}(.)$ represents the pre-activation function (i.e., the linear transforms in the convolutional layers), $h_{l}(.)$ represents a normalisation function, and $r_{l}(.)$ is a non-linear activation function (e.g., ReLU ). Also note that $\mathbf{f}_{l}=[\mathbf{f}_{l,1},...,\mathbf{f}_{l,n_{l}}]$ is an array of $n_{l}$ pre-activation functions, representing the number of features in layer $l$ . A siamese network is then represented by two identical deep ConvNets trained using pairs of labelled inputs, where one possible loss function (called pairwise loss) minimises the distance between embedded features of the same class and maximises the distance between embedded features of different classes, as follows :

where $\delta(.)$ is the Dirac delta function and $f^{(1)}(\mathbf{x},\theta_{f})$ is constrained to equal to $f^{(2)}(\mathbf{x},\theta_{f})$ . Alternatively, the siamese network can be trained as a pairwise similarity estimator, with a pairwise similarity loss that can be defined as:

where the ConvNet $g(\mathbf{x}_{i},\mathbf{x}_{j},\theta_{g})$ returns the similarity between the descriptors $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ , with $\kappa$ representing a small positive constant. Note that the loss functions used by recently proposed methods are conceptually similar to (4), but not exactly the same, where the idea is to produce a ConvNet $g(\mathbf{x}_{i},\mathbf{x}_{j},\theta_{g})$ that returns large similarity values when the inputs belong to the same class and small values, otherwise. It is important to emphasise that the local descriptor learning model that currently produces the smallest error on the UBC dataset (Central-surround two-stream network) consists of a siamese network, trained with a loss similar to (4), where the input patch is sampled twice at half the resolution of the input image: one sample containing the whole patch is input to the surround stream and another sample containing a sub-patch at the centre of the original patch is input to the central stream . The output of these two streams are combined to obtain a similarity score.

The triplet network (Fig. 1-(d)) is an extension of the siamese network that is trained with triplets at the input (which produces an embedding) using the triplet loss function, as follows:

where $m$ is the margin, $\mathbf{x}^{+}$ and $\mathbf{x}$ belong to the same class, $\mathbf{x}^{-}$ and $\mathbf{x}$ are from different classes, and $f^{(1)}(.)$ , $f^{(2)}(.)$ and $f^{(3)}(.)$ are constrained to be the same network. Note that the losses in (3) and (5) are apparently similar, but they have a noticeable difference, which is the fact that a triplet of similar and dissimilar inputs gives context for the optimisation process, as opposed to the pairwise loss that the siamese network minimises (same class) or maximises (different classes) as much as possible for each pair independently .

2 Global Loss function

The siamese and triplet networks presented in Sec. 3.1 typically contain a large number of parameters, which means that a large number of pairs or triplets must be sampled from the training data such that a robust model can be learned. However, sampling all possible pairs or triplets from the training dataset can quickly become intractable, where the majority of those samples may produce small costs in (3)-(5), resulting in slow convergence . An alternative is to have a smart sampling strategy, where one must be careful to avoid focusing too much on the hard training cases because of the possibility of overfitting . In this paper, we propose a simple, yet effective, loss function that can overcome these drawbacks.

The main idea behind our proposed loss function is the avoidance of the over- or under-sampling problems mentioned above with the assumption that the distances (or similarities) between descriptors of the same class (i.e., matching pairs) or different classes (i.e., non-matching pairs) are samples from two distinct distributions. This allows us to formulate a loss function (for the embedding case) that globally tries to: 1) minimise the variance of the two distributions and the mean value of the distances between matching pairs, and 2) maximise the mean value of the distances between non-matching pairs. Fig. 2-(a) depicts the reasoning behind the design of the proposed global loss function, which is defined for the feature embedding case by:

where $\mu^{+}=\sum_{i=1}^{N}d_{i}^{+}/N,~{}~{}\mu^{-}=\sum_{i=1}^{N}d_{i}^{-}/N$ , $\sigma^{2+}=\sum_{i=1}^{N}(d_{i}^{+}-\mu^{+})^{2}/N,~{}~{}~{}\sigma^{2-}=\sum_{i=1}^{N}(d_{i}^{-}-\mu^{-})^{2}/N$ , with $\mu^{+}$ and $\sigma^{2+}$ denoting the mean and variance of the matching pair distance distribution, $\mu^{-}$ and $\sigma^{2-}$ representing the mean and variance of the non-matching pair distance distribution, $d_{i}^{+}=\frac{\lVert f^{(1)}(\mathbf{x}_{i},\theta_{f})-f^{(2)}(\mathbf{x}_{i}^{+},\theta_{f})\rVert_{2}^{2}}{4}$ , $d_{i}^{-}=\frac{\lVert f^{(1)}(\mathbf{x}_{i},\theta_{f})-f^{(3)}(\mathbf{x}_{i}^{-},\theta_{f})\rVert_{2}^{2}}{4}$ , $\lambda$ is a term that balances the importance of each term, $t$ is the margin between the mean of the matching and non-matching distance distributions and $N$ is the size of the training set. Note in (6), that we assume a triplet network (i.e., $f^{(1)}(.)$ , $f^{(2)}(.)$ and $f^{(3)}(.)$ are the same network), where the squared Euclidean distances of the matching and non-matching pairs of the $i^{th}$ triplet are constrained to be $0\leq d_{i}^{+},d_{i}^{-}\leq 1$ because of the division by 4, and the normalisation layer enforces that the norm of the embedding is 1.

Given that we use SGD for the optimisation process, we need to derive the gradient of the global loss function, as follows:

where the dependence on $\theta_{f}$ and the channel index $f^{(.)}$ are dropped to simplify the notation, and $\mathds{1}(a)$ is an indicator function with value $1$ when $a$ is true. It is important to note that the gradient ${\partial J_{1}^{t}}{/\partial f(\mathbf{x}_{i})}$ of the triplet loss in (5) depends only on the $i^{th}$ triplet of the training set, whereas the gradient ${\partial J_{1}^{g}}{/\partial f(\mathbf{x}_{i})}$ of the global loss in (LABEL:eq:loss_global_gradient) depends on $\mu^{+}$ and $\mu^{-}$ , which in turn depends on the statistics of the samples in the whole training set. This dependence on global training set statistics has the potential to suppress the spurious gradients computed from outliers and thus improving the generalisation of the trained model.

This global loss can be slightly modified to train a siamese network that estimates pairwise similarities, where the objective consists of: 1) minimising the variance of the two distributions and the mean value of the similarities between non-matching pairs, and 2) maximising the mean value of the similarities between matching pairs. Fig. 2-(b) shows the idea behind the design of the proposed pairwise similarity global loss function, which is defined by:

3 Proposed Models

We propose four models for the the problem of local image descriptor learning. The first model consists of a triplet network trained with the triplet loss in (5), which produces an embedding - this is labelled as TNet, TLoss. The second model is a triplet network that also produces an embedding and uses the following loss function that combines the original triplet loss (5) and the proposed global loss (6) for the learning process:

In terms of the ConvNet structure, we use an architecture similar to the one described by Zagoruyko and Komodakis . Specifically, the triplet network has the following structure: B(96,7,3)-P(2,2)-B(192,5,1)-P(2,2)-B(256,3,1)-B(256,1,1)-B(256,1,1). The siamese network has the following architecture: B(96,7,3)-P(2,2)-B(192,5,1)-P(2,2)-B(256,3,1)-B(256,1,1)-C(1,1,1). Furthermore, the central-surround siamese network has the following structure: B(95,5,1)-P(2,2)-B(96,3,1)-P(2,2)-B(192,3,1)-B(192,3,1) and the final block that combines the outputs of the two input streams has components B(768,2,1)-C(1,1,1). In the description above, P( $p,q$ ) is a max pooling layer of size $p\times p$ and stride $q$ , and B( $n,k,s$ ) is a block with the components C( $n,k,s$ )-bnorm( $n$ ), where C( $n,k,s$ ) is a convolutional layer with $n$ filters of kernel size $k$ and stride $s$ , bnorm( $n$ ) is the batch normalisation unit with $2n$ parameters. Each $B$ and $C$ is followed by a rectified linear unit except the final layer. Finally, the output feature from the exmedding networks are normalised to have unit norm, as mentioned in Sec. 3.2.

Toy Problem

To illustrate the robustness of the proposed global loss function to outliers, we generated a toy dataset in two dimensions with two classes (80 samples from two Gaussian distributions) represented by two distinct cloud of points, as indicated by the red and green points in Fig. 3-(a). We introduce outliers by switching the labels of randomly selected points (i.e., we switch the labels of $5\%$ of the training set, or $4$ samples). We generate a set of triplets from this training set and train a ConvNet that maps the input points to an output embedding space with $128$ dimensions with the following structure: B(256,2,1)-B(512,1,1)-C(128,1,1), where the output is normalised to have unit norm and these blocks are defined in Sec. 3.3. Three separate trainings are run: the first training uses the triplet loss function in (5), the second uses a combination of the triplet and global losses in (10), and the third uses only the global loss in (6). To ensure a fair comparison, we run the experiments with identical settings, where the only difference is the loss function. We evaluate the models learned from each loss function by computing the embedding of a grid of points from the input space, and labelling them based on the label of the nearest neighbour from the training set, found in the embedding space.

Figure 3-(b) shows the input space labelled according to the nearest neighbour classifier run in the embedding space generated by the triplet loss. Similarly, Fig. 3-(c) shows the same result for the combined triplet and global losses and Fig. 3-(d) displays the results for the global loss. In general, it is clear that outliers affect more the classifier in (b), which seems to be over-fitting the training data. Such labelling mistakes are reduced when we use the combination of the triplet and global losses as show in Fig. 3-(c). The label map in Fig. 3-(d) generated by the embedding that uses global loss is coherent even at the locations, where outliers can be found in the training set, indicating that the global loss function is robust to outliers.

Experiments

In this section, we first describe the dataset used for assessing our proposed models, then we explain the model setup, followed by a presentation of the results.

The experiments are based on the performance evaluation of local image patches using the standard UBC benchmark dataset , which contains three sets: Yosemite, Notre Dame, and Liberty. Using these sets, we run six combinations of training and testing sets, where we use one set for training and another for testing. Each one of this sets has more than $450,000$ local image patches (with normalised orientation and scale) of size $64\times 64$ sampled using a Difference of Gaussians (DoG) detector. In each of these sets there are more than $100,000$ patch classes that are determined based on their 3-D locations obtained from multi-view stereo depth maps. These patch classes are used to produce $500,000$ pairs of matching (i.e., from the same class) and non-matching (i.e., different classes) image patches. Each model is assessed using the false positive at $95\%$ recall (FPR95) on each of the six combinations of training and testing sets, the mean over all combinations, and the receiver operating characteristic (ROC) curve also for each of the six combinations. The test set contains $100,000$ pairs with equal number of matching and non-matching pairs and is chosen as specified in .

2 Training Setup and Implementation Details

The model training is based on stochastic gradient descent (SGD) that involves: 1) the use of a learning rate of $0.01$ that gradually (automatically computed based on the number of epochs set for training) decreases after each epoch until it reaches $0.0001$ ; 2) a momentum set at $0.9$ , 3) weight decay of $0.0005$ , and 4) data augmentation by rotating the pair of patches by $90$ , $180$ , and $270$ degrees, and flipping the images horizontally and vertically (i.e., augmented 5 times: 3 rotations and 2 flippings) . The training set for the triplet and siamese networks consists of a set of $250,000$ triplets, which are sampled randomly from the aforementioned set of $500,000$ pairs of matching and non-matching image patches, where it is important to make sure that the triplet contains one pair of matching image patches and one patch that belongs to a different class of this pair. The mini-batch of the SGD optimisation consists of $250$ triplets (randomly picked from this $250K$ set of triplets), which is used to compute the global loss in (6) and (8). Our Matlab implementation takes $\approx 56$ hours for training a model and processes $16K$ images/sec during testing on a GTX 980 GPU.

3 Results on UBC Benchmark Dataset

Tables 1 and 2 summarises the performance of the proposed models and compares them with the current state-of-the-art methods for the UBC dataset using the FPR95 on each of the six combinations of training and testing sets, and the mean over all combinations. Note that we separate the results in terms of the comparison of descriptors obtained by pairwise similarity methods (Tab. 1) and embedding (Tab. 2). We also show the ROC curves for each of the six combinations of training and testing sets in Fig. 4 for our proposed models, in addition to the current state-of-the-art models .

From the results in Tab. 2 and Fig. 4, we observe that the proposed triplet network trained with a combination of the triplet and global losses (i.e., the TNet-TGLoss) shows the best result in the field in terms of feature embedding. The pairwise similarity results in Tab. 1 and Fig. 4 indicate that our centre-surround siamese network trained with global loss (i.e., the CS SNet, GLoss) produces a result that is almost half of the previous state-of-the-art result, i.e., the 2ch-2stream .

Conclusions

We have presented new methods for patch matching based on learning using triplet and siamese networks trained with a combination of triplet loss and global loss applied to mini-batches - this is the first time such global loss and triplet network have been applied in patch matching. This new loss overcomes a number of the issues that have previously arisen when using triplet loss, most notably slow or even unreliable convergence.

We argue that the superior results provided by our models are due to the better regularisation provided by the global loss, as shown in Sec. 4. We have shown our models to be very effective on the UBC benchmark dataset, delivering state-of-the-art results.

A natural extension of our models is the use of the global loss with the triplet network, but our preliminary results (not shown in this paper) indicate that this model does not produce better results than the ones in Table 2. We plan to extend this method to other applications, such as pre-training in visual class recognition problems. Acknowledgements: This research was supported by the Australian Research Council through the Centre of Excellence in Robotic Vision, CE140100016, and through Laureate Fellowship FL130100102 to IDR