Person Re-Identification by Discriminative Selection in Video Ranking

Taiqing Wang, Shaogang Gong, Xiatian Zhu, Shengjin Wang

Introduction

For making sense of the vast quantity of video data generated by large scale surveillance camera networks in public spaces, automatically (re-)identifying individual persons across non-overlapping camera views distributed at different physical locations is essential. This task is known as person re-identification (ReID). Automatic ReID enables the discovery and analysis of person-specific long-term activities over widely expanded areas and is fundamental to many important surveillance applications such as multi-camera people tracking and forensic search. Specifically, for performing cross-view person ReID, one matches a probe (or query) person against a set of gallery people for generating a ranked list according to their matching similarity. Typically, it is assumed that the correct match is assigned to one of the top ranks, ideally the top- $1$ rank . As the probe and gallery people are often captured from a pair of disjoint camera views at different times, cross-view visual appearance variations can be significant. Person ReID by visual matching is thus inherently challenging . The state-of-the-art methods perform this task mostly by matching spatial appearance features (e.g. colour and texture) using a pair of single-shot person images . However, single-shot appearance features of people are intrinsically limited due to the inherent visual ambiguity caused by clothing similarity among people in public spaces and appearance changes from cross-camera viewing condition variations (Fig. 1). It is desirable to explore space-time information from image sequences of people for ReID.

Space-time information has been explored extensively for action recognition . Moreover, discriminative space-time video patches have also been exploited for action recognition . Nonetheless, action recognition approaches are not directly applicable to person ReID because pedestrians in public spaces exhibit similar walking activities without distinctive and semantically categorisable action patterns unique to different identities.

On the other hand, gait recognition techniques have been developed for person recognition using image sequences by discriminating subtle distinctiveness in the style of walking. Different from action recognition, gait is a behavioural biometric that measures the way people walk. An advantage of gait recognition is that it makes no assumption about either subject cooperation (framing) or person-distinctive actions (posing). These characteristics are similar to person ReID situations. However, existing gait recognition models are subject to stringent requirements on person foreground segmentation and accurate alignment over time throughout a gait image sequence (a walking cycle). It is also assumed that complete gait/walking cycles are captured in the target image sequences. Most gait recognition methods do not cope well with cluttered background and/or random occlusions with unknown covariate conditions. Person ReID is hence inherently challenging for gait recognition techniques (Fig. 1).

In this study, we aim to construct a discriminative video matching framework for person re-identification by selecting more reliable space-time features from person videos, beyond the often-adopted spatial appearance features. To that end, we assume the availability of image sequences of people which may be highly noisy, i.e., with arbitrary sequence duration and starting/ending frames, unknown camera viewpoint/lighting variations during each image sequence, incomplete frames due to uncontrolled occlusions, no guaranteed high frame rates, and possible clothing changes over time. We call these videos unregulated image sequences of people (Fig. 1 and Fig. 5). More specifically, we propose a novel approach to Discriminative Video fragment selection and Ranking (DVR) based on a robust space-time and appearance feature representation given unregulated person image sequences.

The main contributions of this study are: (1) We derive a multi-fragment based appearance and space-time feature representation of image sequences of people. This representation is based on a combination of HOG3D, colour and the optic flow energy profile of an image sequence, and is designed to automatically break down unregulated video clips of people into multiple fragments. (2) We formulate a discriminative video ranking model for cross-view person re-identification by simultaneously selecting and matching more reliable appearance and space-time features from video fragments. The model is formulated using a multi-instance ranking strategy for learning from pairs of image sequences over non-overlapping camera views. The proposed method significantly relaxes the strict assumptions made by gait recognition techniques. (3) We provide extensive comparative evaluations of the proposed model against a wide range of contemporary methods (e.g. gait recognition, holistic sequence matching and state-of-the-art person ReID models) on three challenging image sequence based datasets.

Related Work

Space-time features Space-time feature representations have been extensively explored in action/activity recognition . One common representation is constructed based on space-time interest points . They facilitate a compact description of image sequences based on sparse interest points, but are somewhat sensitive to shadows and highlights in appearance and may lose discriminative information . Therefore, they may not be suitable for person ReID scenarios where lighting variations and viewpoints are unknown and uncontrolled. Relatively, space-time volume/patch based representations can be richer and more robust. Mostly these representations are spatial-temporal extensions of corresponding image descriptors, e.g. HoGHoF , 3D-SIFT and HOG3D . In this study, we adopt HOG3D as the space-time feature of video fragment because: (1) It can be computed efficiently; (2) It contains both spatial gradient and temporal dynamic information, and is therefore potentially more expressive ; (3) It is more robust against cluttered background and occlusions . The choice of space-time feature is independent of our model.

Gait recognition Space-time information of sequences has been extensively exploited by gait recognition . However, these methods often make stringent assumptions on the image sequences, e.g. uncluttered background, consistent silhouette extraction and alignment, accurate gait phase estimation and complete gait cycles, most of which are unrealistic in ordinary person ReID scenarios. It is challenging to extract a suitable gait representation from typical ReID data. In contrast, our approach relaxes significantly these assumptions by simultaneously selecting discriminative video fragments from noisy sequences, learning and matching them without temporal alignment.

Temporal sequence matching One approach to exploiting image sequences for ReID is holistic sequence matching. For instance, Dynamic Time Warping (DTW) is a popular sequence matching method widely used for action recognition , and recently also for person ReID . However, given two unregulated sequences, it is difficult to align sequence pairs for accurate matching, especially when the image sequences are subject to significant noise caused by unknown camera viewpoint changes, background clutter and drastic lighting changes. Our approach is designed to address this problem while avoiding any implicit assumptions on sequence alignment and camera view similarity among image frames both within and between sequences.

Multi-shot person re-identification Multiple images from a sequence of the same person have been exploited for person re-identification. For example, interest points were accumulated across images for capturing appearance variability , manifold geometric structures in image sequences of people were utilised to construct more compact spatial descriptors of people , and the time index of image frames and identity consistency of a sequence were used to constrain spatial feature similarity estimation . There were also attempts on training a person appearance model from image sets or by selecting best pairs . Multiple images of a person sequence were often used either to enhance spatial feature descriptions of local image regions or patches , or to extract additional appearance information such as appearance change statistics . In contrast, the proposed model aims to simultaneously select and match discriminative video appearance and space-time features for maximising cross-view identity ranking. Our experiments show the advantages of the proposed model over existing multi-shot models for person ReID.

Discriminative Video Ranking

We formulate the person re-identification problem as a ranking problem. Although image sequences of people may provide intuitively richer content to learn discriminative information about an individual’s visual appearance when compared to a single still image widely used by existing person ReID methods, the availability of more (and often redundant) data poses additional challenges in model learning, e.g. more random inter-object occlusions and thus incomplete frames, arbitrary sequence duration and uncertain starting/ending postures, and potential clothing variations of some people over time. Moreover, human annotators may implicitly and unconsciously tend to select clearer and better-segmented person images for learning image-based ReID models. On the other hand, tracked sequences of person bounding boxes in typical surveillance videos are inherently noisier and less complete. Directly utilising all the sequence data for constructing ReID models can easily result in unstable models, which is undesirable. A selection mechanism is therefore required as part of the learning method in order to optimally explore the redundant information available in sequence data.

In the context of relative ranking based person ReID model learning, it is non-trivial to automatically learn a robust discriminative ranking function from such contaminated and uncontrolled image sequence data. Inherently, one needs to address the problem of how to mitigate the negative influence of unknown noisy observations, e.g. various types of occlusion and clutter in the background. This is beyond solving the more common problem of misalignment over time in sequence matching. In this work, we formulate a novel discriminative re-identification model capable of simultaneously selecting and ranking informative video fragments from pairs of unregulated person image sequences captured in two non-overlapping camera views. Our model not only mitigates unwanted data whilst exploring useful information from image sequences for person ReID, but also requires no rigid sequence alignment as in the case of traditional methods, e.g. dynamic time warping. Specifically, our model is based on : (i) Video fragmentation by motion energy profiling (Fig. 2(b,c) and Sec. 3.2) ; (ii) Learning a sequence based relative ranking function by simultaneously selecting and ranking cross-view video fragment pairs (Fig. 2(d,e) and Sec. 3.3) . Once learned, our model can then be deployed to re-identify previously unseen people given cross-view unregulated image sequences (Sec. 3.4). An overview diagram of the proposed approach is presented in Fig. 2.

Suppose we have a collection of image sequence pairs $\{(Q_{i}^{a},Q_{i}^{b})\}_{i=1}^{N}$ , where $Q_{i}^{a}$ and $Q_{i}^{b}$ denote the image sequences of person $p_{i}$ captured by two disjoint cameras $a$ and $b$ , and $N$ the total number of training people. Each image sequence $Q$ is defined by a set of consecutive frames $I$ as $Q=(I_{1},...,I_{T})$ , where $T$ is not a constant because in typical surveillance videos, tracked person image sequences are not guaranteed to have (1) a uniform duration (arbitrary frame numbers), (2) the same number of walking cycles, (3) similar starting/ending postures, (4) high video frame rates, or (5) invariant clothing over time.

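For model training, we aim to learn a ranking function $f(Q^{a},Q^{b})$ of image sequence pairs that satisfies the following ranking constraints:

$$f(Q_{i}^{a},Q_{i}^{b})>f(Q_{i}^{a},Q_{j}^{b}),\quad\forall i\in\{1,\dots,N\},\;\forall j\neq i,\qquad(1)$$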

i.e. the sequence pair $(Q_{i}^{a},Q_{i}^{b})$ of the same person $p_{i}$ is constrained/optimised to have a higher rank over any cross-view sequence pairing of person $p_{i}$ and $p_{j}$ with $j\neq i$ .

Learning a ranking function holistically without discrimination and selection from pairs of unsegmented and temporally unaligned person image sequences will subject the learned model to significant noise and degrade any meaningful discriminative information contained in the image sequences. This is an inherent drawback of any holistic sequence matching approach, including those with dynamic time warping applied for non-linear mapping (see experiments in Sec. 4). Reliable human parsing/pose detection or occlusion detection may help, but such approaches are difficult to scale, especially with image sequences from crowded public scenes. The challenge is to learn a robust ranking model effective in coping with incomplete and partial image sequences by identifying and selecting discriminative/informative video fragments from each sequence suitable for extracting trustworthy fragment features. Let us first consider generating a pool of candidate fragments for each video, i.e. video fragmentation.

3.2 Video Fragmentation

Given unregulated image sequences of people, it is too noisy to attempt to holistically locate and extract reliable discriminative features from entire image sequences. Instead, we consider breaking down each sequence into a pool of localised video fragments to allow a learning model to automatically select the discriminative fragments (Sec. 3.3).

It can be observed that motion energy intensity induced by the activity of human muscles during walking exhibits regular periodicity . This motion energy intensity can be approximately estimated by optic flow computation. We call this a Flow Energy Profile (FEP), see Fig. 3. This FEP signal is particularly suitable to address our video fragmentation problem due to: (i) the local minima and maxima landmarks probably correspond to characteristic gestures of a walking process, and thus help in detecting them (e.g. one foot is about to land); (ii) it is relatively robust to changes in camera viewpoint. More specifically, we first compute the optic flow field $(v_{x},v_{y})$ for each image frame $I$ from a sequence $Q$ . Its flow energy is defined as

$$e(I)=\sum_{(x,y)\in U}||[v_{x}(x,y),\,v_{y}(x,y)]||_{2},\qquad(2)$$

where $U$ is the pixel set of the lower body, e.g. the lower half of $I$ . The FEP $\mathcal{E}$ of $Q$ is then obtained as $\mathcal{E}=[e(I_{1}),...,e(I_{T})]$ , which is further smoothed by a Gaussian filter to suppress noise.

Subsequently, we locate the local minima and maxima landmarks $\{\,t\,\}$ of $\mathcal{E}$ and for each landmark create a video fragment $s$ by extracting the surrounding frames $s=\{I_{t-L},...,I_{t},...,I_{t+L}\}$ . We fix $L=10$ for all our experiments, determined by cross-validation on the iLIDS-VID dataset. Finally, we build a candidate set of video fragments $S=\{\,s\,\}$ by pooling all the fragments from $Q$ . Note that some fragments of each sequence can have similar walking phases since the local minima/maxima landmarks of the FEP signal are likely to correspond to certain characteristic walking postures (Fig. 3). This increases the possibility of finding temporally aligned video fragment pairs (i.e. centred at similar walking postures) given a pair of video fragment sets $(S^{a},S^{b})$ from two disjoint camera views, facilitating discriminative video fragment selection and matching during model learning. Also, Fig. 3 shows that the FEP signal can be sensitive to random occlusions and background clutter that could lead to non-characteristic fragments. However, this has limited impact on the overall effectiveness of the proposed selection-and-ranking model (Sec. 3.3), as it is designed specifically to identify and exploit automatically discriminative video fragments from largely redundant sets for training a ReID model.
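The fragmentation procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, the per-pixel energy is taken to be the flow-vector magnitude, the optic flow fields are assumed to be precomputed, and the filter width `sigma`, the replicate-edge padding and the rule of discarding landmarks whose window would leave the sequence are choices the text does not specify.

```python
import math


def flow_energy(vx, vy):
    """Sum of flow-vector magnitudes over the lower half of one frame.

    vx, vy: 2-D lists (rows x cols) holding the optic flow components.
    """
    rows = len(vx)
    total = 0.0
    for r in range(rows // 2, rows):  # lower half stands in for the lower body
        for fx, fy in zip(vx[r], vy[r]):
            total += math.hypot(fx, fy)
    return total


def gaussian_smooth(signal, sigma=1.0):
    """Smooth a 1-D signal with a truncated Gaussian (edges replicated)."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    norm = sum(kernel)
    n = len(signal)
    smoothed = []
    for t in range(n):
        acc = 0.0
        for k, weight in zip(offsets, kernel):
            idx = min(max(t + k, 0), n - 1)
            acc += weight * signal[idx]
        smoothed.append(acc / norm)
    return smoothed


def local_extrema(signal):
    """Indices of strict local minima and maxima (flat plateaus are skipped)."""
    return [t for t in range(1, len(signal) - 1)
            if (signal[t] - signal[t - 1]) * (signal[t + 1] - signal[t]) < 0]


def cut_fragments(num_frames, landmarks, L=10):
    """Inclusive (start, end) frame ranges of 2L+1 frames centred on landmarks.

    Assumption: landmarks too close to either end of the sequence are dropped.
    """
    return [(t - L, t + L) for t in landmarks
            if t - L >= 0 and t + L < num_frames]
```

A sequence would be processed by computing `flow_energy` per frame, passing the resulting profile through `gaussian_smooth`, and feeding the output of `local_extrema` to `cut_fragments`.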

Video fragment representation To encode both the dynamic and static appearance information of the subjects, we represent video fragments with both space-time and colour features. They complement each other, especially in the context of person ReID. Colour features have been shown to be significant for person ReID , implicitly capturing the chromatic patterns of clothing independent from space-time characteristics of a person’s appearance, such as the way people walk. In contrast, the latter is encoded by the space-time features.

Notations – Formally, for the $m$ -th fragment $s^{a}_{i,m}$ from the person $p_{i}$ ’s image sequence captured in camera $a$ , its descriptor is denoted by $\bm{x}^{a}_{i,m}$ . The same is for $s^{b}_{i,m}$ and $\bm{x}^{b}_{i,m}$ . We denote $X^{a}_{i}=\{\bm{x}^{a}_{i,m}\}_{m=1}^{|X^{a}_{i}|}$ and $X^{b}_{i}=\{\bm{x}^{b}_{i,m}\}_{m=1}^{|X^{b}_{i}|}$ as the descriptor set for the fragments segmented from the sequences $Q_{i}^{a}$ and $Q_{i}^{b}$ of person $p_{i}$ in camera $a$ and $b$ respectively, where $|\cdot|$ represents the set cardinality. The entire collection of descriptors for $N$ training image sequence pairs $\{(Q^{a}_{i},Q^{b}_{i})\}^{N}_{i=1}$ is denoted as $\{(X^{a}_{i},X^{b}_{i})\}^{N}_{i=1}$ .

3.3 Selection and Ranking

As shown in Fig. 3, the fragments of a person image sequence can be contaminated by unknown occlusions and background dynamics, and may also be extracted at an arbitrary time-instance of a walking cycle. Given such noisy fragment pair collections generated from cross-view image sequences, a significant challenge for sequence matching based ReID is how to identify and select discriminative/informative and temporally aligned fragment pairs (rather than the entire sequences) to learn a suitable ranking model. Formally, the objective is to learn a linear ranking function on the entry-wise absolute difference of two cross-view fragments $\bm{x}^{a}$ and $\bm{x}^{b}$ :

$$h(\bm{x}^{a},\bm{x}^{b})=\bm{w}^{\top}|\bm{x}^{a}-\bm{x}^{b}|.\qquad(3)$$

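We assume that for each person, there exists at least one cross-view fragment pair that is sufficiently aligned over time and carries desired identity-sensitive information for this person. Our aim is to construct a model capable of automatically discovering and locating not only the best cross-view fragment pair but also multiple cross-view fragment pairs that are sufficiently aligned and discriminative for person ReID. For model training with the best fragment pair, it is equivalent to constraining a ranking function $h$ to prefer the most discriminative cross-view fragment pair of the same person $p_{i}$ to the pairings over $p_{i}$ and any other person $p_{j}$ , $i\neq j$ , i.e.

$$\max_{m,n}\,h(\bm{x}^{a}_{i,m},\bm{x}^{b}_{i,n})>h(\bm{x}^{a}_{i,m^{\prime}},\bm{x}^{b}_{j,n^{\prime}}),\quad\forall\bm{x}^{a}_{i,m^{\prime}}\in X^{a}_{i},\;\forall\bm{x}^{b}_{j,n^{\prime}}\in X^{b}_{j},\;\forall j\neq i.\qquad(4)$$

In multi-instance ranking notation, let the positive bag $B^{+}_{i}=\{\bm{y}^{+}_{i,\cdot}\}$ collect the entry-wise absolute differences $|\bm{x}^{a}_{i,m}-\bm{x}^{b}_{i,n}|$ of the cross-view fragment pairs of person $p_{i}$ , and let the negative bag $B^{-}_{i}=\{\bm{y}^{-}_{i,m}\}$ collect the differences $|\bm{x}^{a}_{i,m^{\prime}}-\bm{x}^{b}_{j,n^{\prime}}|$ between fragments of $p_{i}$ and of any other person $p_{j}$ . Eqn. (4) then reads

$$\max_{\bm{y}^{+}\in B^{+}_{i}}\bm{w}^{\top}\bm{y}^{+}>\bm{w}^{\top}\bm{y}^{-}_{i,m},\quad\forall\bm{y}^{-}_{i,m}\in B^{-}_{i},\qquad(5)$$

which can be written with a binary selection vector $\bm{v}_{i}$ as

$$\bm{w}^{\top}(\bm{Y}_{i}\bm{v}_{i})>\bm{w}^{\top}\bm{y}^{-}_{i,m},\quad\forall\bm{y}^{-}_{i,m}\in B^{-}_{i},\qquad(6)$$

where each column of $\bm{Y}_{i}$ corresponds to one $\bm{y}^{+}\in B^{+}_{i}$ , $||\bm{v}_{i}||_{0}=1,\;\bm{e}^{\top}\bm{v}_{i}=1$ , and $\bm{e}$ denotes a vector of all “1”s.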
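To make the bag notation concrete, the following Python sketch builds $B^{+}_{i}$ and $B^{-}_{i}$ for one person from per-person lists of fragment descriptors. It is a hedged illustration only: the function names and the nested-list data layout are hypothetical, and the bags are assumed to contain the entry-wise absolute differences on which the ranking function operates.

```python
def abs_diff(xa, xb):
    """Entry-wise absolute difference |xa - xb| of two descriptors."""
    return [abs(p - q) for p, q in zip(xa, xb)]


def build_bags(Xa, Xb, i):
    """Positive and negative bags of person i.

    Xa[j], Xb[j]: lists of fragment descriptors of person j in cameras a and b.
    Positive instances pair person i with themselves across views; negative
    instances pair person i in camera a with every other person in camera b.
    """
    positive = [abs_diff(xa, xb) for xa in Xa[i] for xb in Xb[i]]
    negative = [abs_diff(xa, xb)
                for j in range(len(Xb)) if j != i
                for xa in Xa[i] for xb in Xb[j]]
    return positive, negative
```

A one-hot $\bm{v}_{i}$ then simply indexes one element of the returned `positive` list.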

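To achieve good generalisation ability for the ranking model given the ranking constraints in Eqn. (6), we formulate our problem as a max-margin ranking problem by defining the objective function as:

$$\begin{aligned}(\bm{w}^{*},\bm{v}^{*},\bm{\xi}^{*})=\arg\min_{\bm{w},\bm{v},\bm{\xi}}\;&\frac{1}{2}||\bm{w}||^{2}+\bm{e}^{\top}\bm{\xi}\\ \text{s.t.}\;&\bm{w}^{\top}(\bm{Y}_{i}\bm{v}_{i}-\bm{y}^{-}_{i,m})\geq 1-\xi_{i,m},\;\;\xi_{i,m}\geq 0,\\ &||\bm{v}_{i}||_{0}=1,\;\;\bm{e}^{\top}\bm{v}_{i}=1,\;\;i=1,\dots,N,\;\;m=1,\dots,|B^{-}_{i}|,\end{aligned}\qquad(7)$$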

where $\bm{w}$ is the parameter of the objective ranking function defined in Eqn. (3), and $N$ the number of people in the training set. $\bm{v}$ is the concatenation of the binary selection variables of all persons: $\bm{v}=[\bm{v}_{1};\bm{v}_{2};...\,\bm{v}_{N}]$ . $\bm{\xi}$ is the flattened slack variable, formed by all the possible $\xi_{i,m}$ . We solve Eqn. (7) by iteratively optimising $\bm{w}$ and $\bm{v}$ between a ranking step and a selecting step.

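Ranking step We fix $\bm{v}$ to optimise $\bm{w}$ . Eqn. (7) turns into

$$\begin{aligned}\min_{\bm{w},\bm{\xi}}\;&\frac{1}{2}||\bm{w}||^{2}+\bm{e}^{\top}\bm{\xi}\\ \text{s.t.}\;&\bm{w}^{\top}(\bm{Y}_{i}\bm{v}_{i}-\bm{y}^{-}_{i,m})\geq 1-\xi_{i,m},\;\;\xi_{i,m}\geq 0,\;\;i=1,\dots,N,\;\;m=1,\dots,|B^{-}_{i}|.\end{aligned}\qquad(8)$$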

With the fragment selections $\bm{v}$ known, Eqn. (8) is a standard RankSVM problem and can be efficiently solved with a primal training algorithm .
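For illustration, the ranking step with fixed selections can be sketched as follows. Note that this sketch swaps in plain subgradient descent on the hinge-loss form of the objective; it is not the primal Newton RankSVM solver referred to above, and the function names, learning rate and iteration count are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(u, v))


def ranking_step(pairs, dim, lr=0.1, iters=200):
    """Approximately minimise 0.5*||w||^2 + sum of hinge losses.

    pairs: list of (y_pos, y_neg), where y_pos is the currently selected
    positive instance of a person and y_neg one of that person's negatives.
    """
    w = [0.0] * dim
    for _ in range(iters):
        grad = list(w)  # gradient of the regulariser 0.5*||w||^2
        for y_pos, y_neg in pairs:
            d = [p - q for p, q in zip(y_pos, y_neg)]
            if 1.0 - dot(w, d) > 0.0:  # margin violated, hinge term is active
                grad = [g - dk for g, dk in zip(grad, d)]
        w = [wk - lr * g for wk, g in zip(w, grad)]
    return w
```

With a fixed step size the iterate oscillates in a small neighbourhood of the optimum rather than converging exactly, which is acceptable for a sketch but is one reason a second-order primal solver is preferable in practice.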

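Selecting step We fix $\bm{w}$ to optimise $\bm{v}$ . The term on $\bm{w}$ (i.e. $\frac{1}{2}||\bm{w}||^{2}$ ) can be eliminated and Eqn. (7) becomes

$$\begin{aligned}\min_{\bm{v},\bm{\xi}}\;&\bm{e}^{\top}\bm{\xi}\\ \text{s.t.}\;&\bm{w}^{\top}(\bm{Y}_{i}\bm{v}_{i}-\bm{y}^{-}_{i,m})\geq 1-\xi_{i,m},\;\;\xi_{i,m}\geq 0,\\ &||\bm{v}_{i}||_{0}=1,\;\;\bm{e}^{\top}\bm{v}_{i}=1,\;\;i=1,\dots,N,\;\;m=1,\dots,|B^{-}_{i}|.\end{aligned}\qquad(9)$$

Considering that the person-wise $\bm{v}_{i}$ is associated only with $\{\xi_{i,m}\}^{|B^{-}_{i}|}_{m=1}$ and we are optimising the summation of all possible $\xi_{i,m}$ , Eqn. (9) is equivalent to optimising $\bm{v}_{i}$ for each person $p_{i}$ separately, as

$$\begin{aligned}\min_{\bm{v}_{i},\bm{\xi}_{i}}\;&\bm{e}^{\top}\bm{\xi}_{i}\\ \text{s.t.}\;&\bm{w}^{\top}(\bm{Y}_{i}\bm{v}_{i}-\bm{y}^{-}_{i,m})\geq 1-\xi_{i,m},\;\;\xi_{i,m}\geq 0,\;\;m=1,\dots,|B^{-}_{i}|,\\ &||\bm{v}_{i}||_{0}=1,\;\;\bm{e}^{\top}\bm{v}_{i}=1,\end{aligned}\qquad(10)$$

where $\bm{\xi}_{i}=[\xi_{i,1},\dots,\xi_{i,|B^{-}_{i}|}]^{\top}$ . The inequality constraints in Eqn. (10) can be transformed as

$$\xi_{i,m}\geq 1-\bm{w}^{\top}(\bm{Y}_{i}\bm{v}_{i}-\bm{y}^{-}_{i,m}),\quad\xi_{i,m}\geq 0.\qquad(11)$$

Therefore, for any particular $\bm{v}_{i}\in V$ that holds $||\bm{v}_{i}||_{0}=1$ and $\bm{e}^{\top}\bm{v}_{i}=1$ in the selecting space $V$ , the entries $\xi^{*}_{i,m}$ of the optimal $\bm{\xi}^{*}_{i}$ that minimises the summation $\bm{e}^{\top}\bm{\xi}_{i}$ shall be

$$\xi^{*}_{i,m}=\max\{0,\;1-\bm{w}^{\top}(\bm{Y}_{i}\bm{v}_{i}-\bm{y}^{-}_{i,m})\}.\qquad(12)$$

It is obvious that the summation $\bm{e}^{\top}\bm{\xi}_{i}$ is a function of $\bm{v}_{i}$ ,

$$q(\bm{v}_{i})=\bm{e}^{\top}\bm{\xi}^{*}_{i}=\sum_{m=1}^{|B^{-}_{i}|}\max\{0,\;1-\bm{w}^{\top}(\bm{Y}_{i}\bm{v}_{i}-\bm{y}^{-}_{i,m})\}.\qquad(13)$$

Finally we can obtain the $\bm{v}^{*}_{i}$ by optimising $q(\bm{v}_{i})$ via:

$$\bm{v}^{*}_{i}=\arg\min_{\bm{v}_{i}\in V}q(\bm{v}_{i}).\qquad(14)$$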

For each person $p_{i}$ , we only have a limited number of $\bm{v}_{i}$ in $V$ . Therefore Eqn. (14) can be efficiently solved even with a greedy search.
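Because each admissible $\bm{v}_{i}$ is one-hot, the search amounts to evaluating the summed slack once per positive instance and keeping the smallest. A minimal Python sketch of this selecting step is given below; the function names are hypothetical, and the slack of each constraint is assumed to take the usual hinge form implied by the constraints above.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(u, v))


def summed_slack(w, y_pos, negatives):
    """q for the one-hot selection that picks y_pos from the positive bag."""
    s_pos = dot(w, y_pos)
    return sum(max(0.0, 1.0 - (s_pos - dot(w, y_neg))) for y_neg in negatives)


def select_best(w, positives, negatives):
    """Index of the positive instance with the smallest summed slack.

    Ties are broken in favour of the earliest instance.
    """
    costs = [summed_slack(w, y_pos, negatives) for y_pos in positives]
    return min(range(len(positives)), key=costs.__getitem__)
```

The cost of the exhaustive search grows with $|B^{+}_{i}|\cdot|B^{-}_{i}|$ per person, which stays small for the fragment counts produced by typical sequences.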

To begin the model training process, we set $\bm{v}_{i}=\frac{1}{|B_{i}^{+}|}\bm{e}$ to initiate a balanced/moderate start since the quality of $\bm{y}_{i,\cdot}^{+}$ is unknown a priori. The iteration terminates when $\bm{v}_{i}$ does not change any more. Typically, the training process stops after $4\sim 5$ iterations. For learning efficiency, $10\%$ out of all the $\bm{y}^{-}_{i,\cdot}$ are randomly selected to form $B_{i}^{-}$ . Since only a single $\bm{y}_{i,\cdot}^{+}$ for each person $p_{i}$ is selected and utilised for model learning, we call this model DVR(single).

Thus far we have detailed the procedure of training our DVR(single) model via identifying the best cross-view fragment pair in each positive bag $B^{+}_{i}$ (corresponding to person $p_{i}$ ) for learning the ranking function (Eqn. (3)). This allows us to largely avoid the contamination effect from harmful data. Nonetheless, we may simultaneously lose some useful information from discarding the majority of instances $\bm{y}_{i,\cdot}^{+}$ of each bag $B^{+}_{i}$ , because some of these ignored $\bm{y}_{i,\cdot}^{+}$ can be of good quality. Identifying and exploiting these “good though not the best” fragment data $\bm{y}_{i,\cdot}^{+}$ is likely to benefit the model learning. To that end, we shall describe next our multiple cross-view fragment pair selection algorithm for better exploring image sequence data.

Our multiple fragment-pair selection algorithm is based on a goodness/quality measure of individual $\bm{y}_{i,\cdot}^{+}$ . Once all instances $\bm{y}_{i,\cdot}^{+}$ of person $p_{i}$ are measured by assigning a score $\gamma_{i,\cdot}$ (higher is better) to each instance, we can easily locate multiple (top $k$ ) discriminative $\bm{y}_{i,\cdot}^{+}$ from the ranked list of all $\bm{y}_{i,\cdot}^{+}$ sorted in descending order of $\gamma_{i,\cdot}$ . Formally, we define $\gamma_{i,\cdot}$ for each $\bm{y}_{i,\cdot}^{+}$ as

$$\gamma_{i,\cdot}=\sum_{m=1}^{|B^{-}_{i}|}(1-\xi^{*}_{i,m}).\qquad(15)$$

We denote $1-\xi_{i,m}^{*}$ as the ranking margin of $\bm{y}_{i,\cdot}^{+}$ against $\bm{y}_{i,m}^{-}$ , which can be obtained by Eqn. (12). Given Eqn. (15), the $\bm{y}_{i,\cdot}^{+}$ with a larger cumulated ranking margin over all the negative instances $\bm{y}_{i,m}^{-}$ is preferred. This formulation generalises the single selection case that searches for the best $\bm{v}_{i}^{*}$ (Eqn. (14)), i.e. the $\bm{v}_{i}^{*}$ and the highest $\gamma_{i,\cdot}$ lead to the same selection of positive instance $\bm{y}_{i,\cdot}^{+}$ .

After the top $k$ $\bm{y}_{i,\cdot}^{+}$ for each person $p_{i}$ are found and selected, we can obtain multiple (i.e. $k$ ) $\bm{v}_{i}^{*}$ s by setting the corresponding entry of each $\bm{v}_{i}^{*}$ to “ $1$ ” whilst the remaining entries to “ $0$ ”. We call this model DVR(top $k$ ). Similar to the single selection model DVR(single), these ranking constraints associated with the selected top $k$ $\bm{y}_{i,\cdot}^{+}$ are then employed for optimising $\bm{w}$ with Eqn. (8). In Sec. 4.1, we shall evaluate the effect of different top $k$ positive instances on the person ReID performance. An overview of learning the proposed DVR model is presented in Algorithm 1.
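The top-$k$ selection can be sketched in Python as below. This is an illustrative sketch with hypothetical function names; it assumes the quality score of a positive instance is the ranking margin $1-\xi^{*}$ accumulated over all negative instances of the same person, as the text describes.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(u, v))


def quality_scores(w, positives, negatives):
    """Cumulated ranking margin of every positive instance (higher is better)."""
    scores = []
    for y_pos in positives:
        s_pos = dot(w, y_pos)
        gamma = 0.0
        for y_neg in negatives:
            slack = max(0.0, 1.0 - (s_pos - dot(w, y_neg)))
            gamma += 1.0 - slack
        scores.append(gamma)
    return scores


def select_top_k(w, positives, negatives, k):
    """Indices of the k positive instances with the highest quality score."""
    gammas = quality_scores(w, positives, negatives)
    order = sorted(range(len(positives)), key=lambda m: gammas[m], reverse=True)
    return order[:k]
```

With `k=1` the returned index is the instance that minimises the summed slack, so the single-selection case is recovered.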

3.3.2 Model Complexity

We analyse the training complexity of the DVR model, focusing on the ranking and selecting steps. For model training, we adopt the primal RankSVM scheme as the ranking solver. Its complexity is $O(cd^{2})+O(d^{3})$ due to Hessian computation and the linear search in Newton direction respectively, with $c$ and $d$ denoting the number of ranking constraints (see Equations (4) and (8)) and the feature dimensions. Suppose $k$ positive instances per person are selected in the training stage, then $c=k\sum_{i=1}^{N}|B_{i}^{-}|$ , where $N$ is the total number of training people.

The cost for the selection process mainly involves measuring the quality score of each positive instance of all training people with Eqn. (12) and Eqn. (15). Its complexity is $O(cdu)$ , where $u=\sum_{i=1}^{N}|B_{i}^{+}|$ denotes the total number of positive instances across all training data. The total complexity of model training is thus $O(cd^{2}+d^{3}+cdu)$ . We evaluated and reported the model training cost in our experiments (Sec. 4.1).
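As a small worked example of these quantities, the following Python sketch computes $c$ and $u$ from bag sizes and evaluates the three terms of the complexity expression. The function names and the toy sizes are hypothetical, and the returned values are operation-count proxies that ignore constant factors, not running times.

```python
def constraint_counts(pos_bag_sizes, neg_bag_sizes, k):
    """Return (c, u): ranking constraints and total positive instances.

    pos_bag_sizes[i], neg_bag_sizes[i]: |B_i^+| and |B_i^-| of person i.
    k: number of positive instances selected per person.
    """
    c = k * sum(neg_bag_sizes)
    u = sum(pos_bag_sizes)
    return c, u


def complexity_terms(c, d, u):
    """The three terms of O(c*d^2 + d^3 + c*d*u) for feature dimension d."""
    return {"hessian": c * d * d, "newton": d ** 3, "selection": c * d * u}
```

For example, two people with positive bags of sizes 4 and 6, negative bags of sizes 10 and 20, and $k=2$ give $c=60$ and $u=10$.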

3.4 Re-Identification by DVR

Once learned, the ranking model (Eqn. (3)) can be deployed to perform person re-identification by matching a given probe person image sequence $Q^{p}$ observed in one camera view against a gallery set $\{Q^{g}\}$ in another disjoint camera. Formally, the ranking/matching score of a gallery person sequence $Q^{g}$ with respect to $Q^{p}$ is computed as

$$f(Q^{p},Q^{g})=\max_{\bm{x}_{i}\in X^{p},\,\bm{x}_{j}\in X^{g}}\bm{w}^{\top}|\bm{x}_{i}-\bm{x}_{j}|,\qquad(16)$$

where $X^{p}$ and $X^{g}$ are the feature sets of the video fragments extracted from the sequences $Q^{p}$ and $Q^{g}$ , respectively. The same video fragmentation process as used for model training (Sec. 3.2) is employed for deploying a trained model. Finally, the gallery people are sorted in descending order of their assigned matching scores to generate a ranking list.
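The deployment stage can be sketched in Python as follows. This is a hedged illustration: the function names and the dictionary layout of the gallery are hypothetical, and the sequence-level score is assumed to be the maximum of the learned fragment-level ranking function over all cross-view fragment pairs, in line with the best-pair selection used in training.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(u, v))


def match_score(w, Xp, Xg):
    """Best fragment-pair score between a probe and a gallery sequence."""
    return max(dot(w, [abs(p - q) for p, q in zip(xp, xg)])
               for xp in Xp for xg in Xg)


def rank_gallery(w, Xp, gallery):
    """Gallery identities sorted by descending matching score.

    gallery: dict mapping an identity to its list of fragment descriptors.
    """
    scores = {gid: match_score(w, Xp, Xg) for gid, Xg in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Because only the best pair contributes, a sequence with many occluded fragments is not penalised as long as one well-aligned, clean fragment pair exists.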

Combination with prior spatial feature based models Our approach can complement existing spatial feature based person re-identification approaches. In particular, we incorporate Eqn. (16) into the ranking scores $\mathcal{R}_{i}$ obtained by other models as

$$\hat{f}(Q^{p},Q^{g})=\sum_{i}\alpha_{i}\mathcal{R}_{i}(Q^{p},Q^{g})+f(Q^{p},Q^{g}),\qquad(17)$$

where $\alpha_{i}$ refers to the weighting assigned to the $i$ -th method, which is estimated by cross-validation.
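A minimal Python sketch of such a combination is given below. It assumes an additive weighted fusion and that all scores are oriented so that higher means a better match; the function name is hypothetical, and the weights would in practice come from cross-validation as stated above.

```python
def fuse_scores(dvr_score, other_scores, alphas):
    """Add weighted scores of other ReID models to the DVR matching score.

    other_scores[i] is the score of the i-th additional model for the same
    probe/gallery pair, and alphas[i] its weight.
    """
    if len(other_scores) != len(alphas):
        raise ValueError("one weight is needed per additional model")
    return dvr_score + sum(a * r for a, r in zip(alphas, other_scores))
```

If a spatial model outputs distances rather than similarities, its scores would need to be negated (or otherwise re-oriented) before fusion.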

3.5 Discussions on Related Models

We discuss the relationship of our proposed DVR model with other relevant contemporary models in the literature, with a focus on their differences. First, most existing max-margin ranking methods do not consider uncertainty in the ranking constraints during model optimisation. In contrast, the proposed DVR model jointly optimises both the selection of the ranking constraints and the ranking function. This is necessary because the bag-level (e.g. image sequences) supervision cannot directly determine the instance-level (e.g. fragments) constraints (Sec. 3.3).

Second, our model also differs notably from other multi-instance ranking models in a number of aspects. (1) Bergeron et al. relaxed the selection vectors $\bm{v}_{i}$ (Eqn. (6)) to be continuous during model optimisation, whilst our model searches for exact solutions of instance selection. As shown in our evaluation (Sec. 4.1), Bergeron et al.’s relaxation method can significantly increase the cost of constraint selection when the training set is large, though it does not compromise the model performance. (2) The model presented in focuses on encoding bag-level (or sample-level) constraints into the ranking function by modelling instance-level constraints, assuming all instances can provide contribution to model optimisation. In contrast, we emphasise the selection of discriminative/informative instance data (e.g. fragments) for robust learning, necessary for coping with very noisy and incomplete data (e.g. unregulated image sequences), whilst the stronger assumption made in is less valid. (3) Different from all these multi-instance models , the proposed DVR model is unique in its capability for allowing different quantities of explicit discriminative instance selection and then exploitation, due to our formulation of a principled instance quality measure (Eqn. (15)). This can potentially increase the flexibility and scalability of our model in a variety of problem settings (e.g. varying degrees of noise) and applications (e.g. other sequence matching based tasks).

Experiments

Datasets Extensive experiments were conducted on three image sequence datasets designed for person ReID, iLIDS Video re-IDentification (iLIDS-VID) , PRID $2011$ , and HDA+ . All three datasets are very challenging due to clothing similarities among people, lighting and viewpoint variations across camera views, cluttered background and occlusions (Fig. 1 and Fig. 5).

iLIDS-VID – Our new iLIDS-VID person sequence dataset was created based on two non-overlapping camera views from the i-LIDS Multiple-Camera Tracking Scenario (MCTS) , which was captured at an airport arrival hall under a multi-camera CCTV network (Fig. 5(a)). It consists of $600$ image sequences for $300$ randomly sampled people, with one pair of image sequences from two disjoint camera views for each person. Each image sequence has a variable length consisting of $23$ to $192$ image frames, with an average number of $73$ .

PRID $2011$ – The PRID $2011$ dataset includes $400$ image sequences for $200$ people from two camera views that are adjacent to each other (Fig. 5(b)). Each image sequence has a variable length consisting of $5$ to $675$ image frames, with an average number of $100$ . We used sequences of more than $21$ frames from $178$ people in the evaluation. Compared with the iLIDS-VID dataset, it is less challenging due to being captured in non-crowded outdoor scenes with relatively simple and clean backgrounds and rare occlusions.

HDA+ – The HDA+ dataset contains a total of $83$ labelled people across $13$ indoor cameras in an office environment (Fig. 5(c,d)). HDA+ is characterised by (i) low and variable frame rates, e.g. $2\sim 5$ fps (frames per second) of HDA+ versus $25$ fps of both PRID $2011$ and iLIDS-VID; and (ii) clothing variation over time. One limitation of HDA+ is the small number of people re-appearing between camera pairs whilst re-appearance is required for evaluating ReID. In our experiments, we selected two camera pairs, $(19,40)$ and $(50,57)$ , that satisfy: (1) a sufficiently large number of people reappearing across the camera views; (2) very low video frame rates to evaluate its effect on space-time feature based ReID models; (3) some people’s clothing changes to evaluate the clothing-variation challenge. In particular, camera pair $(19,40)$ provides pairwise image sequences of $28$ different people at $5$ fps. Each video has $15\sim 227$ frames with an average of $88$ frames. In contrast, camera pair $(50,57)$ contains pairwise videos of $10$ people at only $2$ fps, with sequence length varying between $1\sim 136$ frames with an average of $31$ frames. For sequences of fewer than $21$ frames, we expanded them up to $21$ frames by interpolating new frames using duplicates of the temporally-nearest frames in a sequence. This is to enable fragmentation on them. Note, little or no space-time information is available in very short sequences, e.g. $1$ frame. This is designed to test how a space-time feature based model degrades with decreasing space-time information available in the input video data.
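The expansion of short sequences can be sketched as follows. The threshold of $21$ frames matches the fragment length $2L+1$ for $L=10$ . The exact interpolation scheme is not specified in the text, so the index mapping below (each output slot copies the source frame nearest to it on a uniformly stretched timeline) and the function name are assumptions made for illustration.

```python
def pad_sequence(frames, min_len=21):
    """Stretch a short sequence to min_len frames by duplicating frames.

    Temporal order is preserved; sequences that are already long enough
    are returned unchanged (as a new list).
    """
    n = len(frames)
    if n == 0:
        raise ValueError("cannot pad an empty sequence")
    if n >= min_len:
        return list(frames)
    # Output slot t copies source frame floor(t * n / min_len).
    return [frames[min(n - 1, (t * n) // min_len)] for t in range(min_len)]
```

A single-frame sequence becomes $21$ identical frames, for which the temporal gradients are zero, consistent with the remark that such sequences carry little or no space-time information.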

Evaluation settings – From every dataset, all sequence pairs are randomly split into two subsets of equal size, one for training and one for testing. Following the evaluation protocol on the PRID $2011$ dataset, in the testing phase, the sequences from one camera are used as the probe set while those from the other camera form the gallery set. The results are measured by Cumulated Matching Characteristics (CMC). Specifically, we show top rank matching rates. As CMC values depend on the dataset size (the overall population for the ranked pairs), we adopt Ranks $1\sim20$ for PRID $2011$ and iLIDS-VID, and Ranks $1\sim4$ for HDA+ (less than $1/5$ the size of iLIDS-VID and PRID $2011$), so that these values are approximately comparable across all four datasets. To obtain stable statistical results, we repeat the experiments for $10$ trials and report the average results.
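The evaluation protocol above can be sketched in a few lines. The sketch assumes a square distance matrix in which probe $i$ and gallery item $i$ are the same person and smaller distances mean better matches; the function names are hypothetical and the code is illustrative rather than the authors' evaluation script.

```python
import numpy as np


def random_half_split(n_pairs, rng):
    """Randomly split sequence-pair indices into train and test halves."""
    perm = rng.permutation(n_pairs)
    half = n_pairs // 2
    return perm[:half], perm[half:]


def cmc_curve(dist, max_rank):
    """CMC matching rates at ranks 1..max_rank.

    dist[i, j] is the distance between probe i and gallery j; the true match
    of probe i is assumed to be gallery i.
    """
    dist = np.asarray(dist, dtype=float)
    true_dist = np.diag(dist)[:, None]
    # 0-based rank of the true match: gallery items strictly closer than it
    ranks = (dist < true_dist).sum(axis=1)
    return np.array([(ranks < r).mean() for r in range(1, max_rank + 1)])


def average_cmc(dist_per_trial, max_rank):
    """Average the CMC curves over repeated random-split trials."""
    return np.mean([cmc_curve(d, max_rank) for d in dist_per_trial], axis=0)
```

With a gallery of $N$ people the rate at Rank-$N$ is always $1$, which is why the reported rank range is scaled down for the much smaller HDA+ test sets.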

We evaluated and analysed the proposed DVR model in three aspects: (1) effectiveness of the selection mechanisms; (2) effectiveness of the fragment representations; (3) robustness against low and variable video frame rates.

Effectiveness of the selection mechanisms – For the selection mechanism, we conducted two comparisons: (a) the DVR(single) model versus our preliminary model, which we call DVR(float) since its selection involves a (float) weighted combination of instances, in contrast to our new single or multiple explicit instance selection strategies; (b) single versus multiple fragment-pair selection (Sec. 3). The results in Table I (the first two rows) show that identical scores are obtained by DVR(single) and DVR(float). This is further verified by the observation that both models select almost identical discriminative video fragments. On the other hand, the computational cost differs between the two models, in particular when the visual content is more crowded and selection becomes harder. More specifically, for model training including both the ranking and selecting steps, Table I shows that both models require similar time for the ranking step on all datasets. This is because they are subject to the same number of ranking constraints (Eqn. (8)). However, although the time required for the selection routine is similar for PRID $2011$ and HDA+, DVR(single) is significantly faster than DVR(float) on iLIDS-VID, e.g. over $7\times$ speed-up. Timings were measured on a $64$-bit Intel CPU at $2.7$ GHz with a MATLAB implementation under Linux. These observations suggest no advantage in treating the selection as a float weighted combination of instances, as originally proposed in our preliminary work.

One may ask how many discriminative fragment pairs should be selected from each cross-view image sequence pair of a person during model training. To that end, we evaluated the performance of ReID using different numbers of positive fragment pairs per person on PRID $2011$ and iLIDS-VID (this multi-fragment selection evaluation is not performed on HDA+ as some short image sequences have only one fragment). It is evident from Table I that the use of additional discriminative fragment pairs can further boost the overall performance of person ReID at the price of increased model training time. This empirically supports our analysis on the potential benefits of multiple fragment pair selection and exploitation as discussed in Sec. 3.3.1. However, the margin of improvement from additional fragment data quickly diminishes. In our experiments, we utilised up to the top $5$ fragment pairs per person. Any further addition of more pairs had very limited effect in improving the learned ranking model. Moreover, it is also observed that the construction of ranking constraints in RankSVM is a time-consuming process and its complexity is linear in the number of constraints. Empirically, selecting the top $3$ discriminative fragment pairs from a matched training image sequence pair for model learning provides a good trade-off between ReID accuracy and model learning cost. For the remaining experiments reported in this section, DVR(top $3$) models were trained for PRID $2011$ and iLIDS-VID and DVR(single) models for HDA+ in the comparative evaluation against other baseline methods.

Effectiveness of the fragment representations – It is worth pointing out that our preliminary work is somewhat limited in fragment representation, as no colour appearance information is considered. Here we report a significant improvement in performance from combining the space-time features (HOG3D) with colour features (Sec. 3.2). For the DVR(single) model, Table II shows $34.6\%$, $57.9\%$, $35.8\%$ and $188.9\%$ increases in Rank-1 recognition rate on PRID $2011$, iLIDS-VID, HDA+($5$ fps) and HDA+($2$ fps) respectively, when comparing the results of ColHOG3D against those of HOG3D alone. This suggests that colour plays an important role in re-identifying people, which is also evident from the colour-only ReID performance in the table. These results demonstrate the importance of utilising both space-time and colour appearance information for person ReID in image sequence data, further supporting previous studies on the importance of leveraging colour information for ReID. Throughout the following experiments, ColHOG3D is adopted as the default fragment representation in our DVR model, unless specified otherwise.
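The percentage increases quoted here and in the following subsections are best read as relative gains over the baseline rate, not as absolute percentage-point differences (an increase above $100\%$ can only be relative). A small helper makes the arithmetic explicit; the function name is hypothetical, and the numbers in the usage note are illustrative, not values from the tables.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline rate must be positive")
    return 100.0 * (improved - baseline) / baseline
```

For instance, moving from an illustrative Rank-1 rate of $20.0\%$ to $30.0\%$ gives `relative_gain(20.0, 30.0)`, i.e. a $50\%$ relative gain, although the absolute difference is only $10$ percentage points.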

Robustness against low and variable video frame-rates – The proposed DVR model is expected to benefit more from higher frame-rate videos, whilst its advantage over appearance-only based models diminishes gradually with a decrease in video frame rate as less space-time information is available. The results in Table II show that the space-time feature (HOG3D) only based DVR model produces very competitive ReID accuracy compared to models using colour features alone, given high ( $25$ fps) frame rate videos from PRID $2011$ and iLIDS-VID. Encouragingly, HOG3D-only based DVR retains credible ReID accuracies on $5$ fps sequences from HDA+. However, when the frame rate decreases more significantly to $2$ fps, the performance of the HOG3D-only based model degrades considerably whilst the colour-only based DVR is less affected. These results are consistent with the expectation that space-time feature alone based ReID models degrade when very limited or no space-time information is available in very low frame rate videos. Nevertheless, the space-time information selected by the DVR model is still useful for ReID even at such a low frame rate. It is also evident that the full DVR model using the ColHOG3D representation selectively explores the complementary information from both space-time and colour appearance features for significant improvements on ReID accuracies in all situations including very low video frame-rates (the bottom row in Table II). This illustrates the strength and robustness of the DVR model in utilising complementary visual information, even when space-time information is very poor or even absent. This also demonstrates the robustness and flexibility of the DVR model in coping with significant variations in video frame rate when extracting and exploiting discriminative space-time information from unregulated surveillance videos.

2 Comparing Gait Recognition and Temporal Sequence Matching

We compared the proposed DVR model with contemporary gait recognition and temporal sequence matching methods for person (re-)identification. (I) Gait recognition (GEI+RSVM) is a state-of-the-art gait recognition model using Gait Energy Image (GEI) (computed from pre-segmented silhouettes) as sequence representation and RankSVM for recognition. A challenge for applying gait recognition to unregulated image sequences in ReID scenarios is to generate good gait silhouettes as input. To that end, we first deployed the DPAdaptiveMedianBGS algorithm provided by the BGSLibrary to extract silhouettes from image sequences given by each dataset. This approach produces better foreground masking than other alternatives. (II) ColLBP/HoGHoF/ColLBPHoGHoF+DTW applies Dynamic Time Warping to compute the similarity between two sequences, using either ColLBP or HoGHoF or their combination as the per-frame feature descriptor. This is similar to the approach of Simonnet et al. , except that they only used colour features. In comparison, ColLBP is a stronger representation as it encodes both colour and texture. Alternatively, HoGHoF encodes both texture and motion information.
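For reference, the DTW baselines can be sketched as classic dynamic time warping over per-frame feature vectors. The sketch assumes a Euclidean per-frame cost and the standard three-way step pattern; the exact local cost and constraints used in the experiments are not stated in this section, so these choices are assumptions.

```python
import numpy as np


def dtw_distance(seq_a, seq_b):
    """Dynamic Time Warping distance between two feature sequences.

    seq_a: (n, d) array and seq_b: (m, d) array of per-frame descriptors
    (e.g. ColLBP or HoGHoF). Euclidean frame-to-frame cost is an assumption.
    """
    a = np.asarray(seq_a, dtype=float)
    b = np.asarray(seq_b, dtype=float)
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)  # accumulated alignment cost
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # extend the cheapest of the three admissible predecessor paths
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[n, m]
```

Because every frame of both sequences must be aligned to some frame of the other, DTW matches sequences holistically and has no means of discarding occluded or noisy segments.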

Table III presents the comparative ReID results among DVR, GEI+RSVM (gait), ColLBP+DTW, HoGHoF+DTW, and ColLBPHoGHoF+DTW. It is evident that the proposed DVR outperforms significantly any competitor on all datasets. Gait recognition gives significantly weaker performance than the DVR model on every dataset. In comparison, its ReID accuracy on PRID $2011$ and HDA+ is much better than that on iLIDS-VID. This is because the GEI gait features are very sensitive to background clutter and occlusions, as shown by the examples in Fig. 6. It is obvious that the extracted gait foreground masks from the iLIDS-VID person sequence (middle) are contaminated more heavily by cluttered background and other moving objects, compared to those from either PRID $2011$ (top) or HDA+ (bottom). Our DVR model trains itself by simultaneously selecting and ranking only those video fragments which suffer the least from occlusions and noise. Moreover, DTW based sequence matching methods using either ColLBP, HoGHoF, or their combination also suffer notably from the inherently uncertain nature of ReID sequences and perform significantly poorer than the proposed DVR approach. This is largely due to: (1) Person sequences have different durations with arbitrary starting/ending frames, also potentially different numbers of walking cycles. Therefore, attempts to match entire sequences holistically inevitably suffer from mismatching with erroneous similarity measurement; (2) There is no clear (explicit) mechanism to avoid incomplete/missing data, typical in crowded scenes; (3) Direct sequence matching is less discriminative than learning an inter-camera discriminative mapping function, which is explicitly built into the DVR model by exploring multi-instance (fragment-pair) selection and ranking.

3 Comparing Spatial Feature Representations

To evaluate the effectiveness of discriminative video fragment selection and ranking using both spatial appearance and space-time features for person ReID, we compared the proposed DVR model against a wide range of contemporary ReID models using spatial features, either in single-shot or multi-shot (multi-frame) form. In order to process the iLIDS-VID dataset for our experiments, we mainly considered contemporary methods with publicly available code. They include: (1) SDALF (single-/multi-shot versions); (2) eSDC (the eSDC model cannot be evaluated on the small HDA+ dataset as it additionally requires saliency statistics modelling with two large reference sets, which are not available on HDA+); (3) SS-ColLBP, which uses RankSVM as the model and colour&LBP as the representation; (4) MS-ColLBP, our extension of SS-ColLBP to multi-shot, which averages the ColLBP features of each frame over an image sequence to focus on stable appearance cues and suppress noise, in a manner similar to previous work. Moreover, we discuss the effect of clothing variation on person ReID methods, a challenging topic which is mostly ignored and under-investigated currently in the literature.

Comparing with spatial feature based methods – The results in Table IV show that the proposed DVR model outperforms significantly all the spatial feature based methods on all datasets, e.g. it gains $55.0\%$ and $287.3\%$ Rank- $1$ improvement over eSDC; it also yields $16.6\%$ , $70.3\%$ , $13.4\%$ and $52.9\%$ Rank- $1$ improvement over MS-ColLBP on PRID $2011$ , iLIDS-VID, HDA+( $5$ fps) and HDA+( $2$ fps) respectively. Note that the improvement margin achieved by the DVR model on iLIDS-VID (a more challenging dataset) is much more significant than those on PRID $2011$ and HDA+. This demonstrates the effectiveness of the proposed selective sequence matching method in coping with challenging real-world data for learning a robust re-identification ranking function. More concretely, the power of our DVR model can be largely attributed to identity-sensitive space-time gradient cues learned by our discriminative fragment selection based matching and ranking mechanism, beyond the conventional models of only learning from the spatial appearance data, e.g. colour and texture.

Clothing change challenge – Existing person ReID studies typically assume no changes in clothing. However, this assumption is not always valid. Realistically, clothing may change for some people within and/or across camera views. Specifically, while there is no ( $0\%$ ) explicit clothing change among the people in both PRID $2011$ and iLIDS-VID, $35.7\%$ people changed their jacket/coat/shirt in HDA+( $5$ fps) and $50.0\%$ in HDA+( $2$ fps), resulting in substantial change in appearance (Fig. 5(c,d)). Whilst only partial appearance variation may arise from changes in viewpoint and lighting, severe occlusion can also cause significant appearance change (Fig. 5(a,b)). Given this observation, we compared the performance of DVR against other appearance-based ReID models on the four different datasets with different degrees of clothing changes. We pay special attention to multi-shot models as they are expected to be more robust under clothing changes. The results in Table IV show that MS-SDALF benefits consistently from multiple shots on all four datasets, either with clothing changes or not. This is largely due to its body-part selective matching strategy, i.e. using the best-matched patch pairs during matching. However, this method can also give weak ReID accuracy due to the inherent difficulties in obtaining explicitly reliable body-part segmentation in surveillance images. In comparison, MS-ColLBP suffers considerably more from clothing changes, evident from a decreased performance advantage over SS-ColLBP on HDA+( $5$ fps) and worse still on HDA+( $2$ fps), when compared with those on PRID $2011$ and iLIDS-VID. This suggests that the advantage of MS-ColLBP over SS-ColLBP decreases when clothing changes are abrupt at low frame rates. Under such conditions, averaging without selection is a poor strategy to cope with clothing changes. 
In contrast, the proposed DVR model not only explores discriminative space-time ReID information less sensitive to appearance change, but also selects automatically the best-matched fragments for appearance consistency, sharing a similar principle of MS-SDALF but being more flexible and robust without requiring explicit part segmentation. We show in Fig. 7 two examples of model selected discriminative fragment pairs across camera views for person ReID. Note, this selection is driven by both static appearance and dynamic motion information embedded in our DVR model design. This demonstrates the potential advantage of the DVR model in addressing the clothing change challenge in person ReID, a problem under-studied in the current literature.

4 Complementary to Spatial Features

We further evaluated the complementary effect between the DVR model and existing colour/texture feature based ReID approaches. The results are reported in Table V. It is evident that for any existing appearance model, significant performance gain is achieved by incorporating the DVR ranking score (Eqn. (17)) into its ranking result. More specifically, on PRID $2011$ and iLIDS-VID, the Rank-$1$ ReID performance of using multi-shot colour and texture features (MS-ColLBP) is boosted by $23.9\%$ and $76.7\%$; Rank-$1$ of eSDC is improved by $86.8\%$ and $302.0\%$; Rank-$1$ of eSDC+MS-SDALF is increased by $92.4\%$ and $304.9\%$, respectively. Similar improvements are gained on low frame rate sequences from HDA+ by MS-ColLBP and MS-SDALF. Such a performance step-change in improving conventional spatial feature based models is primarily due to the exploration of discriminative space-time features and the fragment selection based matching scheme of the proposed DVR model. This space-time selective matching process discovers a source of information that is mostly independent of static appearance features, and therefore plays a significant complementary and beneficial role to contemporary spatial feature based models. It is also worth pointing out that most existing spatial feature based methods benefit more from combining with DVR when tested on iLIDS-VID, and less on PRID $2011$ and HDA+. This observation highlights the importance and necessity of discriminative fragment selection for robust model learning given video data from more crowded public scenarios, where blind learning from all the sequence data without selection leads to poorer models.
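As an illustration of how a DVR ranking score might be incorporated into another model's ranking, the sketch below uses a simple late fusion: each score vector is min-max normalised over the gallery and the two are summed. This is an assumption made for illustration only; Eqn. (17) is not reproduced in this section and may differ. The function names and the `weight` parameter are hypothetical, and higher scores are assumed to mean better matches.

```python
import numpy as np


def min_max(scores):
    """Rescale a score vector to [0, 1]; constant vectors map to zeros."""
    s = np.asarray(scores, dtype=float)
    span = s.max() - s.min()
    return np.zeros_like(s) if span == 0 else (s - s.min()) / span


def fuse_scores(dvr_scores, other_scores, weight=1.0):
    """Late fusion of two per-gallery score vectors for one probe.

    Assumed form (not Eqn. (17) itself): normalised DVR score plus a
    weighted normalised score from the other appearance model.
    """
    return min_max(dvr_scores) + weight * min_max(other_scores)


def ranking(scores):
    """Gallery indices sorted from best (highest score) to worst."""
    return np.argsort(-np.asarray(scores, dtype=float), kind="stable")
```

Normalising before summation keeps one model from dominating merely because its raw scores live on a larger scale.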

In addition, it is evident from Table V that the DVR model can benefit from combining with other spatial feature based ReID models, although slightly. This gain may be explained as the result of drawing from diverse sources of spatial features.

5 Evaluation of Space-time Fragment Selection

To evaluate the space-time video fragment selection mechanism in the proposed DVR model, we implemented two baseline methods without this selection mechanism: (1) SS-ColHOG3D represents each image sequence by ColHOG3D features of a single fragment randomly selected from the image sequence; (2) MS-ColHOG3D represents each image sequence by the averaged ColHOG3D features of four fragments uniformly selected from the sequence. In both baseline methods, RankSVM is used to rank the person sequence representations. For a fair comparison, the length of these fragments used for both baselines is set the same as that in our DVR model.
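The two selection-free baselines can be sketched as follows, given a matrix with one ColHOG3D feature vector per candidate fragment. The sketch assumes that "uniformly selected" means evenly spaced along the sequence (it could also be read as uniform random sampling); the function names are hypothetical.

```python
import numpy as np


def ss_representation(fragment_feats, rng):
    """SS baseline: features of one fragment picked at random."""
    feats = np.asarray(fragment_feats, dtype=float)
    return feats[rng.integers(len(feats))]


def ms_representation(fragment_feats, n_select=4):
    """MS baseline: mean features of n_select evenly spaced fragments.

    Assumption: 'uniformly selected' is interpreted as evenly spaced in time.
    """
    feats = np.asarray(fragment_feats, dtype=float)
    idx = np.rint(np.linspace(0, len(feats) - 1, num=n_select)).astype(int)
    return feats[idx].mean(axis=0)
```

Neither function looks at the content of the fragments, so an occluded or cluttered fragment is as likely to enter the representation as a clean one; this is the behaviour the DVR selection step is meant to avoid.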

The results are presented in Table VI. The DVR model outperforms SS-ColHOG3D and MS-ColHOG3D in Rank- $1$ by $55.6\%$ and $35.1\%$ on PRID $2011$ , by $33.4\%$ and $7.1\%$ on HDA+( $5$ fps), and by $73.3\%$ and $73.3\%$ on HDA+( $2$ fps). The performance advantage of DVR over SS-ColHOG3D and MS-ColHOG3D is even greater on the more challenging iLIDS-VID dataset, i.e. yielding $154.8\%$ and $98.5\%$ Rank- $1$ improvement respectively. This demonstrates clearly that in the presence of significant noise and given unregulated person image sequences, it is indispensable to automatically select discriminative space-time fragments from raw image sequences in order to construct a more robust model for person ReID. It is also noted that MS-ColHOG3D outperforms SS-ColHOG3D by suppressing noise using temporal averaging. Although such a straightforward averaging approach can have some benefits over single-shot methods, it loses out on discriminative information selection due to uniform temporal smoothing.

Conclusion and Future Work

Conclusion – We have presented a novel DVR framework for person re-identification by video ranking using discriminative space-time and appearance feature selection. Our extensive evaluations show that this model outperforms a wide range of contemporary techniques, from gait recognition and temporal sequence matching to state-of-the-art single-/multi-shot (or multi-frame) spatial feature representation based ReID models. In contrast to existing ReID approaches that often employ spatial appearance of people alone, the proposed method is capable of capturing more accurately both appearance and space-time information discriminative for person ReID through learning a cross-view multi-instance ranking function. This is made possible by the ability of our model to discover and exploit automatically the most reliable and informative video fragments extracted from inherently incomplete and inaccurate person image sequences captured against cluttered backgrounds, without any guarantee on person walking cycles, starting/ending frame alignment, video frame rates, and clothing stability. Moreover, the proposed DVR model significantly complements and improves existing spatial appearance features when combined for person ReID. Extensive comparative evaluations were conducted to validate the advantages of the proposed model over a variety of baseline methods on three challenging image sequence based ReID datasets.

Future work – Person re-identification remains largely an unsolved problem, and our future work includes: (1) in addition to space-time information, how to exploit automatically other knowledge sources, e.g. the topology structure of a camera network, or the semantic description (e.g. mid-level attributes nameable by humans) of people’s appearance and walking style; (2) how to cope with open-world person re-identification settings where the probe people are not guaranteed to appear in the gallery set.

Acknowledgement

We thank Dario Figueira of IST for providing the HDA+ dataset and for assisting in extracting the person bounding boxes from raw videos required for person ReID evaluations and gait experiments; and Martin Hirzer, Peter Roth and Csaba Beleznai of AIT for providing the additional raw videos of PRID $2011$ required for gait recognition experiments. Corresponding authors: Shaogang Gong and Shengjin Wang.