Diversity Regularized Spatiotemporal Attention for Video-based Person Re-identification

Shuang Li, Slawomir Bak, Peter Carr, Xiaogang Wang

Introduction

Person re-identification matches images of pedestrians in one camera with images of pedestrians from another, non-overlapping camera. This task has drawn increasing attention in recent years due to its importance in applications such as surveillance, activity analysis, and tracking. It remains a challenging problem because of complex variations in camera viewpoints, human poses, lighting, occlusions, and background clutter.

In this paper, we investigate the problem of video-based person re-identification, which is a generalization of the standard image-based re-identification task. Instead of matching image pairs, the algorithm must match pairs of video sequences (possibly of different durations). A key challenge in this paradigm is developing a good latent feature representation of each video sequence.

Existing video-based person re-identification methods represent each frame as a feature vector and then compute an aggregate representation across time using average or maximum pooling. Unfortunately, this approach has several drawbacks when applied to datasets where occlusions are frequent (Fig. 1). The feature representation generated for each image is often corrupted by the visual appearance of occluders. However, the remaining visible portions of the person may provide strong cues for re-identification. Assembling an effective representation of a person from these various glimpses should be possible, but aggregating features across time is not straightforward. A person’s pose will change over time, which means any aggregation method must account for spatial misalignment (in addition to occlusion) when comparing features extracted from different frames.

In this paper, we propose a new spatiotemporal attention scheme that effectively handles the difficulties of video-based person re-identification. Instead of directly encoding the whole image (or a predefined decomposition, such as a grid), we use multiple spatial attention models to localize discriminative image regions, and pool these extracted local features across time using temporal attention. Our approach has several useful properties:

Spatial attention explicitly solves the alignment problem between images and prevents features from being corrupted by occluded regions.

Although many discriminative image regions correspond to body parts, accessories such as sunglasses, backpacks, and hats are prevalent and useful for re-identification. Because these categories are hard to predefine, we employ an unsupervised learning approach and let the neural network automatically discover a set of discriminative object part detectors (spatial attention models).

We employ a novel diversity regularization term based on the Hellinger distance to ensure multiple spatial attention models do not discover the same body part.

We use temporal attention models to compute an aggregate representation of the features extracted by each spatial attention model. These aggregate representations are then concatenated into a final feature vector that represents all of the information available from the entire video.

We demonstrate the effectiveness of our approach on three challenging video re-identification datasets. Our technique outperforms the state-of-the-art methods under multiple evaluation metrics.

Related Work

Person re-identification was first proposed for multi-camera tracking. Gheissari et al. designed a spatial-temporal segmentation method to extract visual cues and employed color and salient edges for foreground detection. This work defined image-based person re-identification as a specific computer vision task.

Image-based person re-identification mainly focuses on two categories: extracting discriminative features and learning robust metrics. In recent years, researchers have proposed numerous deep learning based methods to jointly handle both aspects. Ahmed et al. input a pair of cropped pedestrian images to a specifically designed CNN with a binary verification loss function for person re-identification. Ding et al. minimize feature distances between images of the same person and maximize the distances among different people by employing a triplet loss function when training deep neural networks. Xiao et al. jointly train pedestrian detection and person re-identification in a single CNN model. They propose an Online Instance Matching loss function which learns features more efficiently in large-scale verification problems.

Video-based person re-identification. Video-based person re-identification is an extension of image-based approaches. Instead of pairs of images, the learning algorithm is given pairs of video sequences. You et al. present a top-push distance learning model accompanied by the minimization of intra-class variations to optimize the matching accuracy at the top rank for person re-identification. McLaughlin et al. introduce an RNN model to encode temporal information. They utilize temporal pooling to select the maximum activation over each feature dimension and compute the feature similarity of two videos. Wang et al. select reliable space-time features from noisy/incomplete image sequences while simultaneously learning a video ranking function. Ma et al. encode multiple granularities of spatiotemporal dynamics to generate latent representations for each person. A Time Shift Dynamic Time Warping model is derived to select and match data between inaccurate and incomplete sequences.

Attention models for person re-identification. Attention models have grown in popularity. Zhou et al. combine spatial and temporal information by building an end-to-end deep neural network. An attention model assigns importance scores to input frames according to the hidden states of an RNN. The final feature is a temporal average pooling of the RNN’s outputs. However, if trained in this way, corresponding weights at different time steps of the attention model tend to have the same values. Liu et al. proposed a multi-directional attention module to exploit the global and local contents for image-based person re-identification. However, jointly training multiple attentions might cause mode collapse: the network has to be carefully trained to avoid attention models focusing on similar regions with high redundancy. In this paper, we combine spatial and temporal attention into spatiotemporal attention models to address the challenges in video-based person re-identification. For spatial attention, we use a penalization term to regularize multiple redundant attentions. We employ temporal attention to assign weights to different salient regions on a per-frame basis to take full advantage of discriminative image regions. Our method demonstrates better empirical performance and decomposes into an intuitive network architecture.

Method

We propose a new deep learning architecture (Fig. 2) to better handle video re-identification by automatically organizing the data into sets of consistent salient subregions. Given an input video sequence, we first use a restricted random sampling strategy to select a subset of video frames (Sec. 3.1). Then we send the selected frames to a multi-region spatial attention module (Sec. 3.2) to generate a diverse set of discriminative spatial gated visual features, each roughly corresponding to a specific salient region of a person (Sec. 3.3). The overall representation of each salient region across the duration of the video is generated using temporal attention (Sec. 3.4). Finally, we concatenate all temporal gated features and send them to a fully-connected layer which represents the latent spatiotemporal encoding of the original input video sequence. An OIM loss function, proposed by Xiao et al., is built on top of the FC layer to supervise the training of the whole network in an end-to-end fashion. However, any traditional loss function (like softmax) could also be employed.

3.1 Restricted Random Sampling

Previous video-based person re-identification methods do not model long-range temporal structure because the input video sequences are relatively short. To some degree, this paradigm is only slightly more complicated than image-based re-identification since consecutive video frames are highly correlated, and the visual features extracted from one frame do not change drastically over the course of a short sequence. However, when input video sequences are long, any re-identification methodology must be able to cope with significant visual changes over time, such as different body poses and angles relative to the camera.

Wang et al. proposed a temporal segment network to generate video snippets for action recognition. Inspired by them, we propose a restricted random sampling strategy to generate compact representations of long video sequences that still provide good representations of the original data. Our approach enables models to utilize visual information from the entire video and avoids the redundancy between sequential frames. Given an input video $\mathbf{V}$, we divide it into $N$ chunks $\{C_{n}\}_{n=1,N}$ of equal duration. From each chunk $C_{n}$, we randomly sample an image $I_{n}$. The video is then represented by the ordered set of sampled frames $\{I_{n}\}_{n=1,N}$.
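
The sampling step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the `rng` argument, and the integer-arithmetic handling of videos whose length is not divisible by $N$ are assumptions made for illustration.

```python
import random

def restricted_random_sample(num_frames, num_chunks, rng=None):
    """Pick one random frame index from each of `num_chunks` equal-duration chunks."""
    if num_frames < num_chunks:
        raise ValueError("need at least one frame per chunk")
    rng = rng or random.Random()
    indices = []
    for n in range(num_chunks):
        # Chunk n covers frame indices [start, end). Integer arithmetic keeps
        # chunk sizes within one frame of each other when the division is uneven.
        start = n * num_frames // num_chunks
        end = (n + 1) * num_frames // num_chunks
        indices.append(rng.randrange(start, end))
    return indices

# Example: a 73-frame video represented by N = 6 sampled frames.
sampled = restricted_random_sample(num_frames=73, num_chunks=6, rng=random.Random(0))
```

Because exactly one index is drawn per chunk, the returned indices are always in temporal order and span the whole video, while repeated calls yield different frames from the same chunks.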

3.2 Multiple Spatial Attention Models

We employ multiple spatial attention models to automatically discover salient image regions (body parts or accessories) useful for re-identification. Instead of pre-defining a rigid spatial decomposition of input images (e.g. a grid structure), our approach automatically identifies multiple disjoint salient regions in each image that consistently occur across multiple training videos. Because the network learns to identify and localize these regions (e.g. automatically discovering a set of object part detectors), our approach mitigates registration problems that arise from pose changes, variations in scale, and occlusion. Our approach is not limited to detecting human body parts. It can focus on any informative image regions, such as hats, bags and other accessories often found in re-identification datasets. Feature representations directly generated from entire images can easily miss fine-grained visual cues (Fig. 1). Multiple diverse spatial attention models, on the other hand, can simultaneously discover discriminative visual features while reducing the distraction of background contents and occlusions. Although spatial attention is not a new concept, to the best of our knowledge, this is the first time that a network has been designed to automatically discover a diverse set of attentions within image frames that are consistent across multiple videos.

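For each image $I_{n}$, we generate $K$ spatial gated visual features $\{\mathbf{x}_{n,k}\}_{k=1,K}$ using attention-weighted averaging:

$$\mathbf{x}_{n,k}=\sum_{l=1}^{L}s_{n,k,l}\,\mathbf{f}_{n,l},$$

where $\mathbf{f}_{n,l}$ denotes the local feature vector at spatial location $l$ of image $I_{n}$, $L$ is the number of spatial locations, and $s_{n,k,l}$ is the weight that the $k$-th attention model assigns to location $l$ (the weights of each attention model sum to one over the locations).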

Similar to fine-grained object recognition, we pool information across frames to create an enhanced variant $\widehat{\mathbf{x}}_{n,k}=E(\mathbf{x}_{n,k})$ of each spatial gated feature. The enhancement function $E()$ follows the past work on second-order pooling. See the supplementary material for further details.
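
The attention-weighted averaging can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it assumes each frame is described by $L$ local feature vectors (for example, the cells of a CNN feature map) and that each of the $K$ attention models has already produced one unnormalized score per location. The function names are hypothetical, and the enhancement function $E()$ is not modeled.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_gated_features(local_features, attention_scores):
    """
    local_features:   L local feature vectors (lists of D floats) for one frame.
    attention_scores: K lists of L unnormalized scores, one list per attention model.
    Returns K gated features, each an attention-weighted average of the local features.
    """
    dim = len(local_features[0])
    gated = []
    for scores in attention_scores:
        weights = softmax(scores)  # one weight per location, summing to 1
        gated.append([
            sum(w * f[d] for w, f in zip(weights, local_features))
            for d in range(dim)
        ])
    return gated

# Three locations with 2-D features; the second model focuses on location 0.
local = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [[0.0, 0.0, 0.0], [50.0, 0.0, 0.0]]
gated = spatial_gated_features(local, scores)
```

With uniform scores the gated feature reduces to plain average pooling, while a sharply peaked attention model returns (almost exactly) the feature at the attended location, which is how occluded or background locations can be suppressed.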

3.3 Diversity Regularization

The outlined approach for learning multiple spatial attention models can easily produce a degenerate solution. For a given image, there is no constraint that the receptive field generated by one attention model needs to be different from the receptive field of another model. In other words, multiple attention models could easily learn to detect the same body part. In practice, we need to ensure each of the $K$ spatial attention models focuses on different regions of the given image.

A natural way to compare two attention distributions is the Kullback-Leibler divergence. However, the attention matrix typically has many values close to zero after the $\text{softmax}()$ function, and these small values drop sharply when passed through the $\log()$ operation in the Kullback-Leibler divergence. In this case, empirical evidence suggests that the training process is unstable.

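To encourage the spatial attention models to focus on different salient regions, we design a penalty term which measures the overlap between different receptive fields. Suppose $\mathbf{s}_{n,i}$ and $\mathbf{s}_{n,j}$ are two attention vectors in attention matrix $\mathbf{S}_{n}$. Because each attention vector is a probability mass function (its entries are non-negative and sum to one), we use the Hellinger distance to measure the similarity of $\mathbf{s}_{n,i}$ and $\mathbf{s}_{n,j}$. The distance is defined as

$$H(\mathbf{s}_{n,i},\mathbf{s}_{n,j})=\frac{1}{\sqrt{2}}\left\|\sqrt{\mathbf{s}_{n,i}}-\sqrt{\mathbf{s}_{n,j}}\right\|_{2},$$

where the square root is applied element-wise. Since the entries of each attention vector sum to one, $H^{2}(\mathbf{s}_{n,i},\mathbf{s}_{n,j})=1-\sum_{l}\sqrt{s_{n,i,l}\,s_{n,j,l}}$, where $l$ indexes the entries of the attention vectors.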

To ensure diversity of the receptive fields, we need to maximize the distance between $\mathbf{s}_{n,i}$ and $\mathbf{s}_{n,j}$, which is equivalent to minimizing $1-H^{2}(\mathbf{s}_{n,i},\mathbf{s}_{n,j})$. We introduce $\mathbf{R}_{n}=\sqrt{\mathbf{S}_{n}}$ for notational convenience, where each element in $\mathbf{R}_{n}$ is the square root of the corresponding element in $\mathbf{S}_{n}$. Thus, taking the $K$ attention vectors as the rows of $\mathbf{S}_{n}$, the regularization term to measure the redundancy between receptive fields per image is

$$Q=\left\|\mathbf{R}_{n}\mathbf{R}_{n}^{\top}-\mathbf{I}\right\|_{F}^{2},$$

where $\|\cdot\|_{F}$ denotes the Frobenius norm of a matrix and $\mathbf{I}$ is a $K$-dimensional identity matrix. This regularization term $Q$ will be multiplied by a coefficient, and added to the original OIM loss.
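
The link between the Hellinger distance and the Frobenius-norm penalty can be checked numerically. The sketch below assumes the penalty has the form $\|\mathbf{R}_{n}\mathbf{R}_{n}^{\top}-\mathbf{I}\|_{F}^{2}$ with the $K$ attention vectors as the rows of $\mathbf{S}_{n}$, which is consistent with the definitions of $\mathbf{R}_{n}$, the Frobenius norm, and the $K$-dimensional identity given above; the function names are hypothetical. Under that assumption, entry $(i,j)$ of $\mathbf{R}_{n}\mathbf{R}_{n}^{\top}$ equals $1-H^{2}(\mathbf{s}_{n,i},\mathbf{s}_{n,j})$, and every diagonal entry equals one because each attention vector sums to one.

```python
import math

def hellinger_sq(p, q):
    """Squared Hellinger distance between two discrete distributions."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def diversity_penalty(attention):
    """
    attention: K attention vectors (rows of S_n), each a probability
    distribution over the same L locations.
    Returns ||R R^T - I||_F^2, where R is the element-wise square root of S_n.
    """
    roots = [[math.sqrt(v) for v in row] for row in attention]
    k = len(roots)
    penalty = 0.0
    for i in range(k):
        for j in range(k):
            gram = sum(a * b for a, b in zip(roots[i], roots[j]))  # (R R^T)[i][j]
            target = 1.0 if i == j else 0.0
            penalty += (gram - target) ** 2
    return penalty

disjoint = diversity_penalty([[1.0, 0.0], [0.0, 1.0]])   # no overlap
identical = diversity_penalty([[0.5, 0.5], [0.5, 0.5]])  # full overlap
```

In this two-model example the penalty is zero when the attention vectors have disjoint support and reaches its largest value when they coincide, so minimizing it pushes the receptive fields apart.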

Diversity regularization was recently employed for text embedding using recurrent networks. In this case, the authors employed a variant of the regularization term, which we denote $Q^{\prime}$ and compare against in our experiments.

3.4 Temporal Attention

Recall that each frame $I_{n}$ is represented by a set $\{\widehat{\mathbf{x}}_{n,1},\ldots,\widehat{\mathbf{x}}_{n,K}\}$ of $K$ enhanced spatial gated features, each generated by one of the $K$ spatial attention models. We now consider how best to combine these features extracted from individual frames to produce a compact representation of the entire input video.

All parts of an object are seldom visible in every video frame—either because of self-occlusion or from an explicit foreground occluder (Fig. 1). Therefore, pooling features across time using a per-frame weight $t_{n}$ is not sufficiently robust, since some frames could contain valuable partial information about an individual (e.g. face, presence of a bag or other accessory, etc.).

Instead of applying the same temporal attention weight $t_{n}$ to all features extracted from frame $I_{n}$, we apply multiple temporal attention weights $\{t_{n,1},\ldots,t_{n,K}\}$ to each frame, one for each spatial component. With this approach, our temporal attention model is able to assess the importance of a frame based on the merits of the different salient regions. Temporal attention models which only operate on whole frame features could easily lose fine-grained cues in frames with moderate occlusion.

Similarly, basic temporal aggregation techniques (compared to temporal attention models) like average pooling or max pooling generally weaken or overemphasize the contribution of discriminative features (regardless of whether the pooling is applied per-frame, or per-region). In our experiments, we compare our proposed per-region-per-frame temporal attention model to average and maximum pooling applied on a per-region basis, and indeed find that maximum performance is achieved with our temporal attention model.

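Similar to spatial attention, we define the temporal attention $t_{n,k}$ for the spatial component $k$ in frame $n$ to be the softmax, taken over the $N$ sampled frames, of a linear response function of the enhanced feature $\widehat{\mathbf{x}}_{n,k}$:

$$e_{n,k}=\mathbf{w}_{t,k}^{\top}\widehat{\mathbf{x}}_{n,k}+b_{t,k},\qquad t_{n,k}=\frac{\exp(e_{n,k})}{\sum_{m=1}^{N}\exp(e_{m,k})},$$

where $\mathbf{w}_{t,k}$ and $b_{t,k}$ are the learned weight vector and bias of the $k$-th temporal attention model.

The temporal attentions are then used to gate the enhanced spatial features on a per-component basis by weighted averaging:

$$\mathbf{x}_{k}=\sum_{n=1}^{N}t_{n,k}\,\widehat{\mathbf{x}}_{n,k}.$$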
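
A minimal sketch of per-region temporal attention follows. It is an illustration under assumptions rather than the authors' implementation: the linear response is taken to be a dot product with a per-region weight vector plus a bias, the softmax is taken over frames so that the weights of each region sum to one across time, and all names (`temporal_attention_pool`, `weights`, `biases`) are hypothetical.

```python
import math

def temporal_attention_pool(enhanced, weights, biases):
    """
    enhanced: N frames x K regions x D floats (enhanced spatial gated features).
    weights:  K vectors of D floats; biases: K floats (one linear scorer per region).
    Returns K pooled feature vectors, one per region.
    """
    num_frames = len(enhanced)
    num_regions = len(enhanced[0])
    dim = len(enhanced[0][0])
    pooled = []
    for k in range(num_regions):
        # Linear response for region k in every frame.
        scores = [
            sum(w * x for w, x in zip(weights[k], enhanced[n][k])) + biases[k]
            for n in range(num_frames)
        ]
        # Softmax over frames (not regions): weights for region k sum to 1 across time.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        t = [e / total for e in exps]
        pooled.append([
            sum(t[n] * enhanced[n][k][d] for n in range(num_frames))
            for d in range(dim)
        ])
    # Concatenating the K pooled vectors gives the video-level input to the FC layer.
    return pooled

# Two frames, one region, 2-D features.
frames = [[[1.0, 0.0]], [[3.0, 2.0]]]
uniform = temporal_attention_pool(frames, weights=[[0.0, 0.0]], biases=[0.0])
peaked = temporal_attention_pool(frames, weights=[[10.0, 0.0]], biases=[0.0])
```

With zero weights every frame receives the same attention and the result equals average pooling; with a strongly selective scorer the result approaches the feature of the single best frame, similar to max pooling. Learned weights can fall anywhere between these two extremes, independently for each region.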

3.5 Re-Identification Loss

In this paper, we adopt the Online Instance Matching (OIM) loss function to train the whole network. Typically, re-identification uses a multi-class softmax layer as the objective loss. Often, the number of mini-batch samples is much smaller than the number of identities in the training dataset, and network parameter updates can be biased. Instead, the OIM loss function uses a lookup table to store features of all identities appearing in the training set. In each forward iteration, a mini-batch sample is compared against all the identities when computing classification probabilities. This loss function has been shown to be more effective than softmax when training re-identification networks.
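
The lookup-table idea can be sketched as follows. This is a simplified illustration of the mechanism described above, not the loss as implemented by Xiao et al. or by the authors: the temperature scaling, the momentum-style table update, and their values (`temperature=0.1`, `momentum=0.5`) are assumptions chosen for the example, and the function names are hypothetical.

```python
import math

def oim_loss(feature, label, lookup_table, temperature=0.1):
    """
    Negative log-probability of `label` for an L2-normalized `feature`.
    Class scores are similarities to every row of the lookup table, so the
    sample is compared against all identities, not only those in the mini-batch.
    """
    sims = [
        sum(a * b for a, b in zip(feature, row)) / temperature
        for row in lookup_table
    ]
    peak = max(sims)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in sims))
    return log_norm - sims[label]

def update_lookup_table(lookup_table, feature, label, momentum=0.5):
    """Blend the stored feature for `label` with the new one and re-normalize."""
    mixed = [
        momentum * old + (1.0 - momentum) * new
        for old, new in zip(lookup_table[label], feature)
    ]
    norm = math.sqrt(sum(v * v for v in mixed)) or 1.0
    lookup_table[label] = [v / norm for v in mixed]

table = [[1.0, 0.0], [0.0, 1.0]]           # one stored feature per identity
loss_correct = oim_loss([1.0, 0.0], 0, table)
loss_wrong = oim_loss([1.0, 0.0], 1, table)
```

Unlike a softmax classifier layer, the table rows are not trained by gradient descent in this sketch; they are refreshed from the features produced by the network, which is why every identity can take part in the normalization even when it is absent from the current mini-batch.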

Experiments

We evaluate the proposed algorithm on three commonly used video-based person re-identification datasets: PRID2011, iLIDS-VID, and MARS. PRID2011 consists of person videos from two camera views, containing $385$ and $749$ identities, respectively. Only the first $200$ people appear in both cameras. The length of each image sequence varies from 5 to 675 frames. iLIDS-VID consists of 600 image sequences of 300 subjects. For each person we have two videos with the sequence length ranging from 23 to 192 frames with an average duration of 73 frames. The MARS dataset is the largest video-based person re-identification benchmark with 1,261 identities and around 20,000 video sequences generated by the DPM detector and the GMMCP tracker. Each identity is captured by at least 2 cameras and has 13.2 sequences on average. There are 3,248 distractor sequences in the dataset.

For the PRID2011 and iLIDS-VID datasets, we follow the evaluation protocol from prior work. Datasets are randomly split into probe/gallery identities. This procedure is repeated $10$ times for computing averaged accuracies. For the MARS dataset, we follow the original splits, which use the predefined 631 identities for training and the remaining identities for testing.

4.2 Implementation details and evaluation metrics

We divide each input video sequence into $N=6$ chunks of equal duration. We first pretrain the ResNet-50 model on image-based person re-identification datasets, including CUHK01, CUHK03, 3DPeS, VIPeR, DukeMTMC-reID, and CUHK-SYSU, and then fine-tune it on the PRID2011, iLIDS-VID, and MARS training sets. Once finished, we fix the CNN model and train the set of multiple spatial attention models with average temporal pooling and the OIM loss function. Finally, the whole network, except the CNN model, is trained jointly. Each input image is resized to $256\times 128$. The network is updated using batched Stochastic Gradient Descent with an initial learning rate of $0.1$, which is later dropped to $0.01$. The aggregated feature vector after the last FC layer is embedded into $128$ dimensions and L2-normalized to represent each video sequence. During the training stage, we use restricted random sampling to select training samples. For each video, we extract its L2-normalized feature and send it to the OIM loss function to supervise the training process. During testing, we use the first image from each of the $N$ segments as a testing sample, and the L2-normalized spatiotemporal gated features generated for the pair of videos being assessed are used to compute their similarity.
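
The test-time procedure can be sketched in a few lines. The sketch makes two assumptions that the text does not spell out: chunk boundaries are computed with integer division, and similarity between two L2-normalized features is measured by their dot product (cosine similarity). The function names are hypothetical.

```python
import math

def first_frame_per_chunk(num_frames, num_chunks):
    """Index of the first frame in each of `num_chunks` equal-duration chunks."""
    return [n * num_frames // num_chunks for n in range(num_chunks)]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

def video_similarity(feature_a, feature_b):
    """Dot product of the two L2-normalized video features (cosine similarity)."""
    a, b = l2_normalize(feature_a), l2_normalize(feature_b)
    return sum(x * y for x, y in zip(a, b))

test_frames = first_frame_per_chunk(num_frames=73, num_chunks=6)
same_direction = video_similarity([3.0, 4.0], [6.0, 8.0])
```

Using the first frame of every chunk makes the test-time representation deterministic, in contrast to the random draw used during training.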

Re-identification performance is reported using the rank-1 accuracy. On the MARS dataset we also evaluate the mean average precision (mAP). Since mAP takes recall into consideration, it is more suitable for the MARS dataset which has multiple videos per identity.
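
Both metrics can be computed from a query-by-gallery similarity matrix, as in the sketch below. It is a simplified illustration with a hypothetical function name: it ranks the gallery by similarity only and omits dataset-specific conventions (such as discarding gallery entries from the query's own camera) that full benchmark protocols may apply.

```python
def rank1_and_map(similarity, query_ids, gallery_ids):
    """
    similarity[q][g]: similarity between query video q and gallery video g.
    Returns (rank-1 accuracy, mean average precision).
    """
    hits = 0
    average_precisions = []
    for q, row in enumerate(similarity):
        # Gallery indices sorted from most to least similar.
        order = sorted(range(len(row)), key=lambda g: row[g], reverse=True)
        matches = [gallery_ids[g] == query_ids[q] for g in order]
        if not any(matches):
            continue  # queries with no true match in the gallery are skipped
        hits += int(matches[0])
        found = 0
        precisions = []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precisions.append(found / rank)  # precision at each correct hit
        average_precisions.append(sum(precisions) / len(precisions))
    evaluated = len(average_precisions)
    if evaluated == 0:
        raise ValueError("no query has a matching gallery entry")
    return hits / evaluated, sum(average_precisions) / evaluated

similarity = [[0.9, 0.2, 0.8],   # query of identity 1
              [0.7, 0.1, 0.3]]   # query of identity 2
rank1, mean_ap = rank1_and_map(similarity, query_ids=[1, 2], gallery_ids=[1, 2, 1])
```

In this toy example the first query retrieves both of its gallery videos at the top of the ranking, while the second finds its only match at rank 3. Rank-1 accuracy looks only at the top result, whereas mAP rewards ranking every matching video highly, which matters when an identity has several gallery sequences.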

4.3 Component Analysis of the Proposed Model

We investigate the effect of each component of our model by conducting several analytic experiments. In Tab. 1, we list the results of each component in the proposed network. Baseline corresponds to ResNet-50 trained with OIM loss on image-based person re-id datasets and then jointly fine-tuned on video datasets: PRID2011, iLIDS-VID, and MARS. SpaAtn consists of the subnetwork of ResNet-50 (from $res2x$ to $res5x$) and multiple spatial attention models. All spatial gated features generated by the same attention model are grouped together and averaged over all frames. For each video sequence, there will be $K$ averaged feature vectors. We concatenate the $K$ features and then send them to the last FC layer and OIM loss function to train the neural network. Compared with Baseline, SpaAtn improves the rank-1 accuracy by $1.5\%$, $3.7\%$, and $1.1\%$ on PRID2011, iLIDS-VID and MARS, respectively. This shows that multiple spatial attention models are effective at finding persistent discriminative image regions which are useful for boosting re-identification performance.

SpaAtn+Q’ has the same network architecture as SpaAtn but with the text embedding diversity regularization term $Q^{\prime}$. SpaAtn+Q uses our proposed diversity regularization term $Q$ based on Hellinger distance. From the results, we can see that our proposed Hellinger regularization improves accuracy. We believe the improvement comes from being able to learn multiple attention models with sufficiently large (but minimally overlapping) receptive fields (see Fig. 3 for sample receptive fields generated for the learned attention models using SpaAtn+Q). SpaAtn+Q and SpaAtn+Q+MaxPool use average temporal pooling and maximum temporal pooling, respectively. SpaAtn+Q+TemAtn applies multiple temporal attentions to each frame, one for each diverse spatial attention model. The assigned temporal attention weights reflect the pertinence of each spatially attended region (e.g. is the part fully visible and easy to detect?). We finally fine-tune the whole network, including the CNN model, to each video dataset independently. SpaAtn+Q+TemAtn+Ind is the final result of our proposed framework.

Different number of spatial attention models:

We also carry out experiments to investigate the effect of varying the number $K$ of spatial attention models (Tab. 2). When $K=1$, the framework is limited to a single spatial attention model, which tends to cover the whole body. As $K$ is increased, the network is able to discover a larger set of body parts, and since the receptive fields are regularized to have minimal overlap, the receptive fields tend to shrink as $K$ gets bigger. Interestingly, there is a general drop in performance when $K$ is increased from $1$ to $2$. This implies that treating a person as a single region is better than treating them as two distinct body parts. However, when a sufficiently large number of spatial attention models ($K=6$) is used, the network achieves maximum performance.

Example learned spatial attention models and corresponding receptive fields are shown in Fig. 3. The receptive fields generally correspond to specific body parts and have varying sizes dependent on the discovered concept. In contrast, the receptive fields generated by prior work tend to include background clutter and exhibit substantial overlap between different attention models. Our receptive fields, on the other hand, have minimal overlap and focus primarily on the foreground regions.

4.4 Comparison with the State-of-the-art Methods

Table 3 compares the performance of our approach with other state-of-the-art techniques. On each dataset, our method attains the highest performance. We achieve the largest improvement on the MARS dataset, where we improve the state of the art by 11.7%. The previous best reported results are from PAM-LOMO+KISSME (which learns signature representations to cater for high variance in a person’s appearance) and from SeeForest (which combines six spatial RNNs and temporal attention followed by a temporal RNN to encode the input video). In contrast, our network architecture is intuitive and straightforward to train. MARS is the most challenging dataset (it contains distractor sequences and has a substantially larger gallery set) and our methodology achieves a significant increase in mAP accuracy. This result suggests our spatiotemporal model is very effective for video-based person re-identification in challenging scenarios.

Summary

A key challenge for successful video-based person re-identification is developing a latent feature representation of each video as a basis for making comparisons. In this work, we propose a new spatiotemporal attention mechanism to achieve better video representations. Instead of extracting a single feature vector per frame, we employ a diverse set of spatial attention models to consistently extract similar local patches across multiple images (Fig. 3). This approach automatically solves two common problems in video re-identification: aligning corresponding image patches across frames (because of changes in body pose, orientation relative to the camera, etc.) and determining whether a particular part of the body is occluded or not.

To avoid learning redundant spatial attention models, we employ a diversity regularization term based on Hellinger distance. This encourages the network to discover a set of spatial attention models that have minimal overlap between receptive fields generated for each image. Although diversity regularization is not a new topic, we are the first to learn a diverse set of spatial attention models for video sequences, and illustrate the importance of Hellinger distance for this task (our experiments illustrate how a diversity regularization term used in text embedding is less effective for images).

Finally, temporal attention is used to aggregate features across frames on a per-spatial attention model basis (e.g. all features from the facial region are combined). This allows the network to represent each discovered body part based on the most pertinent image regions within the video. We evaluated our proposed approach on three datasets and performed a series of experiments to analyze the effect of each component. Our method outperforms the state-of-the-art approaches by large margins, which demonstrates its effectiveness in video-based person re-identification.