Learning to rank in person re-identification with metric ensembles
Sakrapee Paisitkriangkrai, Chunhua Shen, Anton van den Hengel
Introduction
The task of person re-identification (re-id) is to match pedestrian images observed from multiple cameras. It has recently gained popularity in research community due to its several important applications in video surveillance. An automated re-id system could save a lot of human labour in exhaustively searching for a person of interest from a large amount of video sequences.
Despite several years of research in the computer vision community, person re-id is still a very challenging task and remains unsolved due to (a) large variation in visual appearance (person’s appearance often undergoes large variations across different camera views); (b) significant changes in human poses at the time the image was captured; (c) large amount of illumination changes and (d) background clutter and occlusions. Moreover the problem becomes increasingly difficult when persons share similar appearance, e.g., people wearing similar clothing style with similar color.
To address these challenges, existing research on this topic has concentrated on the development of sophisticated and robust features to describe the visual appearance under significant changes. However the system that relies heavily on one specific type of visual cues, e.g., color, texture or shape, would not be practical and powerful enough to discriminate individuals with similar visual appearance. Existing studies have tried to address the above problem by seeking a combination of robust and distinctive feature representation of person’s appearance, ranging from color histogram , spatial co-occurrence representation , LBP , color SIFT , etc.
One simple approach to exploit multiple visual features is to build an ensemble of distance functions, in which each distance function is learned using a single feature and the final distance is calculated from a weighted sum of these distance functions . However existing works on person re-id often pre-define these weights, which need to be re-estimated beforehand for different data sets. Since different re-id benchmark data sets can have very different characteristics, i.e., variation in view angle, lighting and occlusion, combining multiple distance functions using pre-determined weights is undesirable as highly discriminative features in one environment might become irrelevant in another environment.
In this paper, we introduce effective approaches to learn weights of these distance functions. The first approach optimizes the relative distance using the triplet information and the second approach maximizes the average rank- recognition rate, in which is chosen to be small, e.g., . Setting the value of to be small is crucial for many real-world applications since most surveillance operators typically inspect only the first ten or twenty items retrieved.
The main contributions of this paper are twofold: 1) We propose two principled approaches to build an ensemble of person re-id algorithms. The first approach aims at maximizing the relative distance between images of different individuals and images of the same individual such that the CMC curve approaches one with a minimal number of returned candidates. The second approach directly optimizes the probability that any of these top matches are correct using structured learning. Our ensemble-based approaches are highly flexible and can be combined with linear and non-linear metrics. 2) Extensive experiments are carried out to demonstrate that by building an ensemble of person re-id algorithms learned from different visual features, notable improvement on rank- recognition rate can be obtained. Experimental results show that our approach achieves the state-of-the-art performance on most person re-id benchmark data sets evaluated. In addition, our ensemble approach is complementary to any existing distance learning methods.
Existing person re-id systems consist of two major components: feature representation and metric learning. In feature representation, robust and discriminative features are constructed such that they can be used to describe the appearance of the same individual across different camera views under various changes and conditions . We briefly discuss some of these work below. More feature representations, which have been applied in person re-id, can be found in .
Bazzani et al. represent a person by a global mean color histogram and recurrent local patterns through epitomic analysis . Farenzena et al. propose the symmetry-driven accumulation of local features which exploits both symmetry and asymmetry, and represents each part of a person by a weighted color histogram, maximally stable color regions and texture information . Gray and Tao introduce an ensemble of local features which combines three color channels with texture channels . Schwartz and Davis propose a discriminative appearance based model using partial least squares, in which multiple visual features: texture, gradient and color features are combined . Zhao et al. propose dcolorSIFT which combines SIFT features with color histogram. The same authors also propose mid-level filters for person re-identification by exploring the partial area under the ROC curve (pAUC) score .
Although a large number of existing algorithms have exploited state-of-the-art visual features and advanced metric learning algorithms, we observe that the best obtained overall performance on commonly evaluated person re-id benchmarks, e.g., iLIDS and VIPeR, is still far from the performance needed for most real-world surveillance applications.
Our Approach
In this section, we propose two approaches that can learn an ensemble of base metrics. We then discuss base metrics and visual features that will be used in our experiment.
The most commonly used performance measure for evaluating person re-id is known as a cumulative matching characteristic (CMC) curve , which is analogous to the ROC curve in detection problems. The CMC curve represents results of an identification task by plotting the probability of correct identification (y-axis) against the number of candidates returned (x-axis). The faster the CMC curve approaches one, the better the person re-id algorithm. Since a better rank- recognition rate is often preferred , our aim is to improve the recognition rate among the best candidates, e.g., , which is crucial for many real-world surveillance applications. Note that, in practice, the system that achieves the best recognition rate when is large (e.g., ) is of little interest since most users inspect or consider only the first ten or twenty returned candidates.
In this section, we propose two different approaches which learn an ensemble of base metrics (discussed in the next section). The first approach, CMC, aims at minimizing the number of returned list of candidates in order to achieve a perfect identification, i.e., minimizing such that the rank- recognition rate is equal to one. The second approach, CMC, optimizes the probability that any of these best matches are correct.
In order to minimize such that the rank- recognition rate is equal to , we consider learning an ensemble of distance functions based on relative comparison of triplets . Given a set of triplets , in which is taken from camera view A and are taken from camera view B, the basic idea is to learn a distance function such that images of the same individual are closer than any images of different individuals, i.e., is closer to than any . For a triplet , the following condition must hold . Following the large margin framework with the hinge loss, the condition should be satisfied. This condition means that the distance between two images of different individuals should be larger by at least a unit than the distance between two images of the same individual. Since the above condition cannot be satisfied by all triplets, we introduce a slack variable to enable soft margin. By generalizing the above idea to the entire training set, the primal problem that we want to optimize can be written as,
Here is the regularization parameter and = , = and ,, represent a set of base metrics. Note that we introduce the regularization term to avoid the trivial solution of arbitrarily large .
where being the triplet index set. In this paper, we consider the hinge loss but other convex loss functions can be applied.
Since the number of constraints in (1) is quadratic in the number of training examples, directly solving (1) using off-the-shelf optimization toolboxes can only solve problems with up to a few thousand training examples. In the following, we present an equivalent reformulation of (1), which can be efficiently solved in a linear runtime using cutting-plane algorithms. We first reformulate (1) by writing it as:
Note that the new formulation has a single slack variable. Later on in this section, we show how the cutting-plane method can be applied to solve (3).
1.2 Top recognition at rank-k𝑘k (CMCtoptop{}^{\,\rm top})
Our previous formulation assumes that, for any triplets, images belonging to the same individual should be closer than images belonging to different individuals. Our second formulation is motivated by the nature of the problem, in which person re-id users often browse only the first few retrieved matches. Hence we propose another approach, in which the objective is no longer to minimize (the number of returned matches before achieving recognition rate), but to maximize the correct identification among the top best candidates. Built upon the structured learning framework , we optimize the performance measure commonly used in the CMC curve (recognition rate at rank-) using structured learning. The difference between our work and is that assumes training samples consist of positive instances and negative instances, while our work assumes that there are individuals in camera view A and individuals in camera view B. However there exists ranking in both works: attempts to rank all positive samples before a subset of negative samples while our works attempt to rank a pair of the same individual above a pair of different individuals. Both also apply structure learning of to solve the optimization problem.
The correct relative ordering of can be defined as where . The loss among the top candidates can then be written as,
where denotes the index of the retrieved candidates ranked in the -th position among all top best candidates. We define the joint feature map, , of the form:
where represent a set of triplets generated from the training data, = and = . The choice of guarantees that the variable , which optimizes , will also produce the distance function that achieves the optimal average recognition rate among the top candidates. The above problem can be summarized as the following convex optimization problem:
and . Here denote the correct relative ordering and denote any arbitrary orderings. Similar to CMC, we use the cutting-plane method to solve (7).
1.3 Cutting-plane optimization
In this section, we illustrate how the cutting-plane method can be used to solve both optimization problems: (3) and (7). The key idea of the cutting-plane is that a small subset of the constraints are sufficient to find an -approximate solution to the original problem. The cutting-plane algorithm begins with an empty initial constraint set and iteratively adds the most violated constraint set. At each iteration, the algorithm computes the solution over the current working set. The algorithm then finds the most violated constraint and add it to the working set. The cutting-plane algorithm continues until no constraint is violated by more than . Since the quadratic program is of constant size, the cutting-plane method converges in a constant number of iterations. We present our proposed CMC in Algorithm LABEL:ALG:cutting.
The optimization problem for finding the most violated constraint (Algorithm LABEL:ALG:cutting, step ②) can be written as,
where . Since in (8) is independent, the solution to (8) can be solved by maximizing over each element . Hence that most violates the constraint corresponds to,
For CMC, one replaces in Algorithm LABEL:ALG:cutting with and repeats the same procedure.
In this section we assume that the base metrics, ,,, are provided. In the next section, we introduce two base metrics adopted in our proposed approaches.
2 Base metrics
Metric learning can be divided into two categories: linear and non-linear methods . In the linear case, the goal is to learn a linear mapping by estimating a matrix such that the distance between images of the same individual, , is less than the distance between images of different individuals, . The linear method can be easily extended to learn non-linear mapping by kernelization . The basic idea is to learn a linear mapping in the feature space of some non-linear function, , such that the distance is less than the distance .
The basic idea of KISS metric learning (KISS ML) , is to learn the Mahalanobis distance by considering a log likelihood ratio test of two Gaussian distributions. The likelihood ratio test between dissimilar pairs and similar pairs can be written as,
where , , , and are covariance matrices of dissimilar pairs and similar pairs, respectively. By taking log and discarding constant terms, (12) can be simplified as,
Hence the Mahalanobis distance matrix can be written as . The authors of clip the spectrum of by eigen-analysis to ensure is positive semi-definite. This simple algorithm has shown to perform surprisingly well on the person re-id problem .
There exist several non-linear extensions to metric learning. In this section, we introduce recently proposed kernel-based metric learning, known as kernel Local Fisher Discriminant Analysis (kLFDA) , which is a non-linear extension to the previously proposed LFDA and has demonstrated the state-of-the-art performance on iLIDS, CAVIAR and 3DPeS data sets. The basic idea of kLFDA is to find a projection matrix which maximizes the between-class scatter matrix while minimizing the within-class scatter matrix using the Fisher discriminant objective. Similar to LFDA, the projection matrix can be estimated using generalized Eigenvalues. Unlike LFDA, kLFDA represent the projection matrix with the data samples in the kernel space .
3 Visual features
We introduce visual features which have been applied in our person re-id approaches.
Scale-invariant feature transform (SIFT) has gained a lot of research attention due to its invariance to scaling, orientation and illumination changes . The descriptor represents occurrences of gradient orientation in each region. In this work, we combine discriminative SIFT with color histogram extracted from the LAB colorspace.
Local Binary Pattern (LBP) is another feature descriptor that has received a lot of attention in the literature due to its effectiveness and efficiency . The standard version of -neighbours LBP has a radius of and is formed by thresholding the neighbourhood with the centre pixel’s value. To improve the classification accuracy of LBP, we combine LBP histograms with color histograms extracted from the RGB colorspace.
Region covariance is another texture descriptor which has shown promising results in texture classification . The covariance descriptor is extracted from the covariance of several image statistics inside a region of interest . Covariance matrix provides a measure of the relationship between two or more set of variates. The diagonal entries of covariance matrices represent the variance and the non-diagonal entries represent the correlation value between low-level features.
Large amount of available training data and increasing computing power have lead to a recent success of deep convolutional neural networks (CNN) on a large number of computer vision applications. CNN exploits the strong spatially local correlation present in natural images by enforcing a local connectivity pattern between neurons of adjacent layers. In the deep CNN architecture, convolutional layers are placed alternatively between max-pooling and contrast normalization layers .
See supplementary for detailed implementation.
Experiments
There exist several challenging benchmark data sets for person re-identification. In this experiment, we select four commonly used data sets (iLIDS, 3DPES, PRID2011, VIPeR) and two recently introduced data sets with a large number of individuals (CUHK01 and CUHK03). The iLIDS data set has individuals captured from eight cameras with different viewpoints . The number of images for each individual varies from to , i.e., eight cameras are used to capture individuals. The data set consists of large occlusions caused by people and luggages. The 3DPeS data set is designed mainly for people tracking and person re-identification . It contains numerous video sequences taken from a real surveillance environment with eight different surveillance cameras and consists of individuals. The number of images for each individual varies from to images. The Person RE-ID 2011 (PRID2011) data set consists of images extracted from multiple person trajectories recorded from two surveillance static cameras . Camera view A contains individuals, camera view B contains individuals, with of them appearing in both views. Hence, there are person image pairs in the dataset.
VIPeR is one of the most popular used data sets for person re-identification . It conntains individuals taken from two cameras with arbitrary viewpoints and varying illumination conditions. The CUHK01 data set contains persons captured from two camera views in a campus environment . Camera view A captures the frontal or back view of the individuals while camera view B captures the profile view. Finally, the CUHK03 data set consists of persons taken from six cameras The data set consists of manually cropped pedestrian images and images cropped from the pedestrian detector of . Due to the imperfection in the pedestrian detector, which causes some misalignments of cropped images, we use images which are manually annotated by hand.
In this paper, we adopt a single-shot experiment setting, similar to . For all data sets except CUHK03, all the individuals in the data set are randomly divided into two subsets so that the training set and the test set contains half of the available individuals with no overlap on person identities. For data set with two cameras, we randomly select one image of the individual taken from camera view A as the probe image and one image of the same individual taken from camera view B as the gallery image. For multi-camera data sets, two images of the same individual are chosen: one is used as the probe image and the other as the gallery image. For CUHK03, we set the number of individuals in the traintest split to as conducted in . To be more specific, there are , , , , and individuals in each of the test split for the iLIDS, 3DPeS, PRID2011, VIPeR, CUHK01 and CUHK03 data sets, respectively. The number of probe images (test phase) is equal to the number of gallery images in all data sets except PRID2011, in which the number of probe images is and the number of test gallery images is (all images from camera view B except the training samples). This procedure is repeated times and the average of cumulative matching charateristic (CMC) curves across partitions is reported. The CMC curve provides a ranking for every image in the gallery with respect to the probe.
For the linear base metric (KISS ML ), we apply principal component analysis (PCA) to reduce the dimensionality and remove noise. Without performing PCA, it is computationally infeasible to inverse covariance matrices of both similar and dissimilar pairs as discussed in . For each visual feature, we reduce the feature dimension to dimensional subspaces. For the non-linear base metric (kLFDA ), we set the regularization parameter for class scatter matrix to , i.e., we add a small identity matrix to the class scatter matrix. For both SIFTLAB and LBPRGB features, we apply the RBF- kernel. For region covariance and CNN features, we apply the Gaussian RBF kernel . The kernel parameter is tuned to an appropriate value for each data set. In this experiment, we set the value of to be the same as the first quantile of all distances .
For CMC, we choose the regularization parameter ( in (1)) from ,,, by cross-validation on the training data. For CMC, we choose the regularization parameter ( in (7)) from ,,, by cross-validation on the training data. We set the cutting-plane termination threshold to . The recall parameter ( in (6)) is set to be for iLIDS, 3DPeS, PRID2011 and VIPeR and for larger data sets (CUHK01 and CUHK03). Since the success of metric learning algorithms often depends on the choice of good parameters, we train multiple metric learning for each feature. Specifically, for KISS ML, we reduce their feature dimensionality to , and dimensions and use all three to learn the weight for CMC and CMC. Similarly, for kFLDA, we set the to be the same as the \nth5, the \nth10 and the first quantile of all distances.
1 Evaluation and analysis
We investigate the impact of low-level and high-level visual features on the recognition performance of person re-identification. Fig. 1 shows the CMC performance of different visual features and their rank- recognition rates when trained with the kernel-based LFDA (non-linear metric learning) on six benchmark data sets. On VIPeR, CUHK01 and CUHK03 data sets, we observe that both SIFTLAB and LBPRGB significantly outperform covariance descriptor and CNN features. This result is not surprising since SIFTLAB combines edges and color features while LPBRGB combines texture and color features. We suspect the use of color helps improve the overall recognition performance of both features. We observe that CNN features perform poorer than hand-crafted low-level features in our experiments. We suspect that the CNN pre-trained model has been designed for ImageNet object categories , in which color information might be less important. However on many person re-id data sets, a large number of persons wear similar types of clothing, e.g., t-shirt and jeans, but with different color. Therefore color information becomes an important cue for recognizing two different individuals. Overall, we observe that SIFTLAB features perform well consistently on all data sets evaluated.
Next we compare the performance of our approach with two different base metrics: linear metric learning and non-linear metric learning (introduced in Sec. 2.2). In this experiment, we use CMC to learn an ensemble. Experimental results are shown in Fig. 2. Two observations can be made from the figure: 1) Both approaches perform similarly when the number of traintest individuals is small, e.g., on iLIDS and 3DPeS data sets; 2) Non-linear base metrics outperforms linear base metric when the number of individuals increase. We suspect that there is less diversity when the number of individuals is small. No further improvement is observed when we replace linear base metrics with non-linear base metrics.
Next we compare the performance of the proposed CMC with CMC. Both optimization algorithms optimize the recognition rate of person re-id but with different objective criteria. We compare the performance of both algorithms with the baseline approach, in which we simply set the value of to a uniform weight. Since distance functions of different features have different scales, we normalize the distance between each probe image to all images in the gallery to be between zero and one. In other words, we set the distance between the probe image and the nearest gallery image to be zero and the distance between the probe image and the furthest gallery image to be one. The matching accuracy is shown in Table 1. We observe that CMC achieves the best recognition rate performance at a small recall value. At a large recall value (rank ), both CMC and CMC perform similarly. Interestingly, a simple averaging performs quite well on VIPeR, in which the number of individuals in the test set is small.
2 Comparison with state-of-the-art results
Fig. 3 compares our results with other person re-id algorithms on two major benchmark data sets: VIPeR and CUHK01. Our approach outperforms all existing person re-id algorithms. Next we compare our results with the best reported results in the literature. The algorithm proposed in achieves state-of-the-art results on iLIDS and 3DPeS data sets ( and recognition rate at rank-, respectively). Our approach outperforms on the iLIDS () and achieve a comparable result on 3DPeS (). Zhao et al. propose mid-level filters for person re-identification , which achieve state-of-the-art results on the VIPeR and CUHK01 data sets ( and recognition rate at rank-, respectively). Our approach outperforms by achieving a recognition rate of and on the VIPeR and CUHK01 data sets, respectively. Table 2 compares our results with other state-of-the-art methods on other person re-identification data sets.
Conclusion
In this paper, we present an effective structured learning based approach for person re-id by combining multiple low-level and high-level visual features into a single framework. Our approach is practical to real-world applications since the performance can be concentrated in the range of most practical importance. Moreover our proposed approach is flexible and can be applied to any metric learning algorithms. Experimental results demonstrate the effectiveness of the proposed approach on six major person re-id data sets.