Deep Continuous Conditional Random Fields with Asymmetric Inter-object Constraints for Online Multi-object Tracking

Hui Zhou, Wanli Ouyang, Jian Cheng, Xiaogang Wang, Hongsheng Li

I Introduction

Robust tracking of multiple objects is a challenging problem in computer vision and an important component of many real-world applications. It aims to reliably recover trajectories and maintain identities of objects of interest in an image sequence. State-of-the-art Multi-Object Tracking (MOT) methods mostly utilize the tracking-by-detection strategy, which generates per-frame object detection results from the image sequence and associates the detections into object trajectories. This strategy is able to handle newly appearing objects and is robust to tracking drift. Tracking-by-detection methods can be categorized into offline and online methods. Offline methods use detection results from both past and future frames, together with global optimization techniques, to link detections into object trajectories. Online methods, on the other hand, use only detection results up to the current time to incrementally generate object trajectories. Our proposed method focuses on online MOT, which is more suitable for real-time applications including autonomous driving and intelligent surveillance.

In MOT methods, the tracked objects usually show consistent or slowly varying appearance across time. Visual features of the objects are therefore important cues for associating detection boxes into tracklets. In recent years, deep learning techniques have shown great potential in learning discriminative visual features for single-object and multi-object tracking. However, visual cues alone cannot guarantee robust tracking results. When tracked objects with similar appearances occlude or are close to each other, their trajectories might be wrongly associated with other objects. In addition, imperfect object detectors produce mis-detections and inaccurate detections. Such difficulties escalate when the camera is held by hand or fixed on a car, because each object then moves according to its own movement pattern as well as the global camera motion. Such problems have been addressed by modeling interactions between tracked objects in the optimization model. For online MOT methods, there were investigations on modeling inter-object interactions with social force models, relative spatial and speed differences, and relative motion constraints. Most of the previous methods model pairwise inter-object interactions in symmetric mathematical forms, i.e., pairs of objects influence each other with the same magnitude.

However, such pairwise object interactions are directional and should be modeled in an asymmetric form. For instance, large-size detection boxes are more likely to be noisy (if measured in actual pixels). Smaller boxes should therefore influence larger boxes more than larger boxes influence smaller ones, because the smaller ones usually provide more accurate localization for objects. Similarly, high-confidence trajectories should influence low-confidence ones more, and low-confidence ones should have minimal impact on the high-confidence ones. In this way, the more accurate detections or trajectories can help correct errors of the inaccurate ones without being much affected by them. Moreover, in existing methods, individual objects’ movements and inter-object interactions are usually modeled separately. The relations between the two terms are mostly manually tuned and not effectively studied in a unified framework.

To tackle these difficulties, we propose a Deep Continuous Conditional Random Field (DCCRF) with asymmetric inter-object constraints for solving the problem of online MOT. The DCCRF takes as inputs a pair of consecutive images at time $t-1$ and time $t$, and the tracked objects’ past trajectories up to time $t-1$. It estimates the locations of the tracked objects at time $t$. The DCCRF optimizes an objective function with two terms: the unary terms, which estimate each individual object’s movement pattern, and the asymmetric pairwise terms, which model interactions between tracked objects. The unary terms are modeled by a deep Convolutional Neural Network (CNN), which is trained to estimate each individual object’s displacement between time $t-1$ and time $t$ from the object’s visual appearance. The asymmetric pairwise terms aim to tackle the problems caused by object occlusions, object mis-detections and global camera motion. For two neighboring tracked trajectories, the pairwise influence is different along each direction, so that the high-confidence trajectory assists the low-confidence one more. Our proposed DCCRF utilizes mean-field approximation for inference and is trained in an end-to-end manner to estimate the optimal displacement for each tracked object. Based on the estimated object locations, a visual-similarity CNN is proposed for generating the final detection association results.

The contribution of our proposed online MOT framework is two-fold. (1) A novel DCCRF model is proposed for solving the online MOT problem. Each object’s individual movement patterns as well as inter-object interactions are studied in a unified framework and trained in an end-to-end manner. In this way, the unary terms and pairwise terms of our DCCRF can better adapt to each other to achieve more accurate tracking performance. (2) An asymmetric inter-object interaction term is proposed to model the directional influence between pairs of objects, which aims to correct errors of low-confidence trajectories while maintaining the estimated displacements of the high-confidence ones. Extensive experiments on two public datasets show the effectiveness of our proposed MOT framework.

II Related Work

There are a large number of methods for solving the multi-object tracking problem. We focus on reviewing online MOT methods that utilize interactive constraints, as well as single-object and multi-object tracking algorithms with deep neural networks.

Interaction models for MOT. Social force models were adopted in MOT methods to model pairwise interactions (attraction and repulsion) between objects. These methods required objects’ 3D positions for modeling inter-object interactions, which were obtained by visual odometry.

Grabner et al. assumed that the relative positions between feature points and objects were more or less fixed over short-time intervals. The generalized Hough transform was therefore used to predict each target’s location with the assistance of supporter feature points. Duan et al. proposed mutual relation models to describe the spatial relations between tracked objects and to handle occluded objects. Such constraints were learned by an online structured SVM. Zhang and Maaten incorporated spatial constraints between objects into an MOT framework to track objects with similar appearances.

The CRF algorithm was used frequently in segmentation tasks to model the relationship between different pixels in the spatial domain. There were also many works that modeled the multi-object tracking problem with CRF models. Yang and Nevatia proposed an online-learned CRF model for MOT, and assumed linear and smooth motion of the objects to associate past and future tracklets. Andriyenko et al. modeled multi-object tracking as optimizing discrete and continuous CRF models. A continuous CRF was used for enforcing motion smoothness, and a discrete CRF with a temporal interaction pairwise term was optimized for data association. Milan et al. designed new CRF potentials for modeling spatio-temporal constraints between pairs of trajectories to tackle detection and trajectory-level occlusions.

Deep learning based object tracking. Most existing deep learning based tracking methods focused on single-object tracking, because deep neural networks were able to learn powerful visual features for distinguishing the tracked objects from the background and other similar objects. Early single-object tracking methods with deep learning focused on learning discriminative appearance features through online training. However, due to the large learning capacity of deep neural networks, it is easy to overfit the data. Other methods pretrained deep convolutional neural networks on large-scale image datasets to learn discriminative visual features, and updated the classifier online with new training samples. More recently, methods that did not require model updating were proposed. Tao et al. utilized Siamese CNNs to determine visual similarities between image patches for tracking. Bertinetto et al. changed the network into a fully convolutional setting and achieved real-time running speed.

Recently, deep models have been applied to multi-object tracking. Milan et al. proposed an online MOT framework with two RNNs. One RNN was used for state (object locations, motions, etc.) prediction and update, and the other for associating objects across time. However, this method did not utilize any visual feature and relied solely on spatial locations of the detection results. Other methods replaced the hand-crafted features (e.g., color histograms) with features learned between image patches by a Siamese CNN, which increased the discriminative ability. However, those methods focused on modeling individual objects’ movement patterns with deep learning. Inter-object relations were not integrated into deep neural networks.

III Method

The overall framework of our proposed MOT method is illustrated in Fig. 1. We propose a Deep Continuous Conditional Random Field (DCCRF) model for solving the online MOT problem. At each time $t$ , the framework takes past tracklets up to time $t-1$ and detection boxes at time $t$ as inputs, and generates new tracklets up to time $t$ . At each time $t$ , new tracklets are also initialized and current tracklets are terminated if tracked objects disappear from the scene.

The core components of the proposed DCCRF consist of unary terms and asymmetric pairwise terms. The unary terms of our DCCRF are modeled by a deep CNN that estimates each individual tracked object’s displacement between consecutive times $t-1$ and $t$. The asymmetric pairwise terms aim to model inter-object interactions, which consider differences of speeds, visual confidences, and object sizes between neighboring objects. Unlike interaction terms in existing MOT methods, which treat inter-object interactions in a symmetric way, asymmetric relationship terms are proposed in our DCCRF. For pairs of tracklets in our DCCRF model, the proposed asymmetric pairwise term models the two directions differently, so that high-confidence trajectories with small-size detection boxes can help correct errors of low-confidence trajectories with large-size detection boxes. Based on the object displacements estimated by the DCCRF, we adopt a visual-similarity CNN and the Hungarian algorithm to obtain the final tracklet-detection associations.

The goal of our conditional random field $({\bf r},{\bf d})$ is to maximize the following conditional distribution,

$$P({\bf d}|{\bf r},I)=\frac{1}{Z({\bf r},I)}\exp\left(-E({\bf d},{\bf r},I)\right),\qquad(1)$$

where $E({\bf d},{\bf r},I)$ represents the Gibbs energy and $Z({\bf r},I)=\int_{\bf d}\exp(-E({\bf d},{\bf r},I))\,d{\bf d}$ is the partition function. Maximizing the conditional distribution w.r.t. ${\bf d}$ is equivalent to minimizing the Gibbs energy function,

$$E({\bf d},{\bf r},I)=\sum_{i}\phi(d_{i},r_{i},I)+\sum_{i}\sum_{j\neq i}\psi(d_{i},d_{j},r_{i},r_{j}),\qquad(2)$$

where $\phi(d_{i},r_{i},I)$ and $\psi(d_{i},d_{j},r_{i},r_{j})$ are the unary terms and pairwise terms.

After the displacements ${\bf d}$ of tracked objects between time $t-1$ and time $t$ are obtained, individual object’s estimated locations at time $t$ can be easily calculated for associating tracklets and detection boxes to generate tracklets up to time $t$ . Such displacements are then iteratively calculated for the following time frames. Without loss of generality, we only discuss the approach for optimizing object displacements between time $t-1$ and time $t$ in this section.

For the $i$th object tracklet, the unary term $\phi(d_{i},r_{i},I)$ of our DCCRF model is defined as

$$\phi(d_{i},r_{i},I)=w_{i,1}\left(d_{i}-f_{d}(r_{i},I)\right)^{2}.\qquad(3)$$

This term penalizes the quadratic deviation between the final output displacement $d_{i}$ and the displacement estimated by a visual displacement estimation function $f_{d}$. $w_{i,1}$ is an online adaptive parameter for the $i$th object that controls whether the final displacement relies more on the $i$th object’s visual cues (the unary terms) or on inter-object relations (the pairwise terms). Intuitively, when the visual displacement estimator $f_{d}$ has higher confidence in its estimated displacement, $w_{i,1}$ should be larger to bias the final output $d_{i}$ towards the visually inferred displacement. On the other hand, when $f_{d}$ has lower confidence in its estimation, due to object occlusion or the appearance of similar objects, $w_{i,1}$ should be smaller to let the final displacement $d_{i}$ be mainly inferred from inter-object constraints.

In our framework, the visual displacement estimation function $f_{d}$ is modeled as a deep Convolutional Neural Network (CNN) that utilizes only the tracked object’s visual information for estimating its location displacement between time $t-1$ and time $t$. For each tracked object $r_{i}$, our visual-displacement CNN takes a pair of image patches cropped from frames $t-1$ and $t$ as inputs, and outputs the object’s inferred displacement. A network structure similar to ResNet-101 (except for the topmost layer) is adopted for our visual-displacement CNN.

The network inputs and outputs are illustrated in Fig. 2. For the inputs, given the currently tracked object $r_{i}$’s bounding box location $b_{i}$ at time $t-1$, a larger bounding box $\bar{b}_{i}$ centered at $b_{i}$ is first created. Two image patches are cropped at the same spatial location $\bar{b}_{i}$ but from different frames at time $t-1$ and time $t$. They are then concatenated along the channel dimension to serve as the inputs for our visual-displacement CNN. The larger bounding box $\bar{b}_{i}$ is used instead of the original box $b_{i}$ to tolerate large possible displacements between the two consecutive frames and to incorporate more visual contextual information of the object for more accurate displacement estimation. After training with thousands of such pairs, the visual-displacement CNN is able to capture important visual cues from image-patch pairs to infer object displacements between time $t-1$ and time $t$.
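The input construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `enlarge_box`, `crop` and `make_input_pair` are hypothetical, frames are represented as nested lists of per-pixel channel tuples, boxes are clipped to the frame (the text does not say how out-of-frame regions are handled), and the default scale factors are the ones reported in the implementation details.

```python
def enlarge_box(box, width_scale=5.0, height_scale=2.0):
    """Return a box with the same centre as `box`, scaled in width and height.

    `box` is (x_min, y_min, x_max, y_max) in pixels.
    """
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    half_w = (x_max - x_min) * width_scale / 2.0
    half_h = (y_max - y_min) * height_scale / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def crop(frame, box):
    """Crop a frame (list of rows of pixels), clipping the box to the frame.

    Clipping is an assumption of this sketch; padding would also be possible.
    """
    height, width = len(frame), len(frame[0])
    x0, y0 = max(0, int(round(box[0]))), max(0, int(round(box[1])))
    x1, y1 = min(width, int(round(box[2]))), min(height, int(round(box[3])))
    return [row[x0:x1] for row in frame[y0:y1]]


def make_input_pair(frame_prev, frame_curr, box):
    """Crop both frames at the same enlarged location and stack the channels."""
    big = enlarge_box(box)
    patch_prev, patch_curr = crop(frame_prev, big), crop(frame_curr, big)
    # Each pixel is a tuple of channels, so tuple concatenation stacks the
    # channels of frame t-1 and frame t for the same spatial position.
    return [[p + c for p, c in zip(row_p, row_c)]
            for row_p, row_c in zip(patch_prev, patch_curr)]
```

Because both patches come from the same location $\bar{b}_{i}$, any shift of the object between the two channel groups is what the network has to read off; resizing the stacked patch to the network input size is left out of this sketch.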

The confidence weight $w_{i,1}$ is obtained by the following equation,

$$w_{i,1}=\sigma\left(a_{1}\max({\bf c}_{i})+b_{1}\right),\qquad(4)$$

where $\sigma$ is the sigmoid function constraining $w_{i,1}$ to lie between 0 and 1, $\max({\bf c}_{i})$ obtains the maximal confidence of ${\bf c}_{i}=\{c_{i}^{1},c_{i}^{2},\cdots,c_{i}^{m}\}$, and $a_{1}$ and $b_{1}$ are learnable scalar parameters. In our experiments, the learned parameter $a_{1}$ is generally positive after training. This means that, if the visual-displacement CNN is more confident about its displacement estimation, the value of $w_{i,1}$ is larger and the final output displacement $d_{i}$ is biased more towards the visually inferred displacement $f_{d}(r_{i},I)$. Otherwise, the final displacement $d_{i}$ is biased towards being inferred from inter-object constraints.
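As a small illustration of the unary term, the sketch below assumes that the weight is a sigmoid of an affine function of the maximal confidence, as the description of $\sigma$, $a_{1}$ and $b_{1}$ suggests, and that the penalty is the weighted squared deviation summed over the displacement coordinates. The function names and the parameter values used in the example are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def unary_weight(confidences, a1, b1):
    """Confidence weight w_{i,1} computed from the CNN's confidences c_i."""
    return sigmoid(a1 * max(confidences) + b1)


def unary_energy(disp, visual_disp, weight):
    """Weighted squared deviation of `disp` from the visually inferred one.

    Both displacements are (dx, dy) tuples; summing over the two coordinates
    is an assumption of this sketch.
    """
    return weight * sum((d - f) ** 2 for d, f in zip(disp, visual_disp))


# A peaked confidence vector yields a larger weight than a flat one
# (a1 > 0, as reported after training; the values are illustrative only).
w_sure = unary_weight([0.05, 0.9, 0.05], a1=4.0, b1=-2.0)
w_unsure = unary_weight([0.3, 0.4, 0.3], a1=4.0, b1=-2.0)
```

With a positive $a_{1}$, `w_sure` exceeds `w_unsure`, so the same deviation from the visual estimate costs more energy for the confidently tracked object.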

If the energy function $E$ in Eq. (2) consists of only the unary terms $\phi(d_{i},r_{i},I)$ , the final output displacement $d_{i}$ can be solely dependent on each tracked object’s visual information without considering inter-object constraints.

III-A2 Asymmetric pairwise terms

The pairwise terms in Eq. (2) are utilized to model asymmetric inter-object relations between object tracklets for regularizing the final displacement results ${\bf d}$. To handle global camera motion, we assume that from time $t-1$ to time $t$, the speed differences between two tracked objects should be maintained, i.e.,

$$\psi(d_{i},d_{j},r_{i},r_{j})=\sum_{k}w_{ij,2}^{(k)}\left(\Delta d_{ij}-\Delta s_{ij}\right)^{2},\qquad(5)$$

where $\Delta d_{ij}=d_{i}-d_{j}$ is the displacement (which can be viewed as speed) difference between objects $i$ and $j$ at time $t$, $\Delta s_{ij}=s_{i}-s_{j}$ is the speed difference at the previous time $t-1$, and $w_{ij,2}^{(k)}$ are a series of weighting functions (two in our experiments) that control the directional influences between the pair of objects.

To better model inter-object relations, two important observations are made to define the asymmetric weighting functions $w_{ij,2}^{(k)}$. 1) For detection boxes, in terms of localization accuracy, larger object detection boxes are more likely to be noisy, while smaller ones tend to be more stable (as shown in Fig. 3). This is because the displacements of both large and small detection boxes are recorded in pixels in our tracking framework. Noisy large detection boxes would significantly influence the displacement estimation for other boxes. This problem is illustrated in Fig. 4. The two targets in Fig. 4(a) have accurate locations and speeds which can be used to build inter-object constraints at time $t-1$. When the detector outputs roughly accurate bounding boxes for both targets at time $t$, symmetric inter-object constraints could well refine the objects’ locations (see Fig. 4(b)). However, since the larger-size detection boxes are more likely to be noisy, using the symmetric inter-object constraints would significantly affect tracking results of the small-size objects (see Fig. 4(c)). In contrast, small-size objects have smaller localization errors and could better infer larger-size objects’ locations. Asymmetric small-to-large-size inter-object constraints are robust, even when the smaller-size detection box is noisy (see Fig. 4(d)). Therefore, between a pair of tracked objects, the one with the smaller detection box should have more influence on inferring the displacement of the one with the larger detection box, and the object with the larger box should have less chance to deteriorate the displacement estimation of the smaller one.

2) If our above mentioned visual-displacement CNN has high confidence for an object’s displacement, this object’s visually inferred displacement should be used more to infer other objects’ displacements. On the other hand, the objects with low confidences on their visually inferred displacements should not affect other objects with high-confidence displacements. Based on the two observations, we model the weighting function $w_{ij,2}^{(k)}$ by a product of a size-based weighting function and a confidence-based weighting function between a pair of tracked objects as

where $\sigma$ denotes the sigmoid function, $area_{i}$ denotes the size of the $i$th tracked object at time $t-1$, $\max({\bf c}_{i})$ obtains the maximal displacement confidence from $\{c_{i}^{1},c_{i}^{2},\cdots,c_{i}^{m}\}$ by our proposed visual-displacement CNN, and $a_{21}^{(k)}$, $b_{21}^{(k)}$, $a_{22}^{(k)}$, $b_{22}^{(k)}$ are learnable scalar parameters. In our DCCRF, these parameters can be learned by the back-propagation algorithm with mean-field approximation. If we use the mean-field approximation for DCCRF inference, the influence from object $r_{i}$ to $r_{j}$ and that from $r_{j}$ to $r_{i}$ are different (see next subsection for details). After training, we see that $a_{21}^{(k)}>0$ and $a_{22}^{(k)}<0$, which means that smaller $area_{i}/area_{j}$ and larger $\max({\bf c}_{i})-\max({\bf c}_{j})$ lead to greater weights. It validates our above-mentioned observations that objects with smaller sizes and greater visual-displacement confidences should have greater influence on other objects, but not the other way around.
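The directional behaviour of such a weighting function can be illustrated with a short sketch. It assumes a product of two sigmoids, one applied to an affine function of the raw area ratio $area_{i}/area_{j}$ and one to an affine function of the confidence difference; whether the original uses the raw ratio or a transformed one is not recoverable from the text, so that choice, the function name `pairwise_weight` and all parameter values are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pairwise_weight(area_i, area_j, conf_i, conf_j, a21, b21, a22, b22):
    """Weight w_ij applied to the message that object i receives from object j.

    conf_i and conf_j are the maximal displacement confidences max(c_i) and
    max(c_j). The raw area ratio is an assumption of this sketch.
    """
    size_term = sigmoid(a21 * (area_i / area_j) + b21)
    conf_term = sigmoid(a22 * (conf_i - conf_j) + b22)
    return size_term * conf_term


# Object i: large box, low confidence. Object j: small box, high confidence.
# Illustrative parameters with the reported signs a21 > 0 and a22 < 0.
params = dict(a21=1.0, b21=-2.0, a22=-4.0, b22=0.0)
w_ij = pairwise_weight(8000.0, 2000.0, 0.4, 0.9, **params)
w_ji = pairwise_weight(2000.0, 8000.0, 0.9, 0.4, **params)
```

With these illustrative values `w_ij` is much larger than `w_ji`: the message arriving at the large, uncertain object from the small, confident one is weighted heavily, while the reverse message is almost suppressed.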

In Fig. 5, we show example values of one learned weighting function $w_{ij,2}^{(k)}$ . In Fig. 5(a), compared with object 6, objects 2-4 are of smaller sizes and also higher visual confidences. With the directional weighting functions, they have greater influence to correct errors of tracking object 6 (red vs. green rectangles of object 6) and are not affected much by the erroneous estimation of object 6. Similar directional weighting function values can be found in Fig. 5(b), where objects 1, 3, 4 with high visual-displacement confidences are able to correct tracking errors of object 5 with low visual-displacement confidence.

III-A3 Inference

For our unary terms, we utilize forward propagation of the visual-displacement CNN for calculating objects’ estimated displacements and displacement confidences $\{c_{i}^{1},c_{i}^{2},\cdots,c_{i}^{m}\}$ . After the unary term inference, the overall maximum posterior marginal inference is achieved by mean-field approximation. This approximation yields an iterative message passing for approximate inference. Our unary terms and pairwise terms are both of quadratic form. The energy function is convex and the optimal displacement is obtained as the mean value of the energy function,

In each iteration, node $i$ receives messages from all other objects to update its displacement estimation. The mean-field approximation usually converges in 5-10 iterations. The above displacement update equation clearly shows the difference between the messages transmitted from object $i$ to $j$ and those from object $j$ to $i$, which arises from the asymmetric weighting functions $w_{ij,2}^{(k)}$. For a pair of objects, $w_{ij,2}^{(k)}$ and $w_{ji,2}^{(k)}$ are generally different. Even if $w_{i,1}=w_{j,1}$, when $w_{ij,2}^{(k)}>w_{ji,2}^{(k)}$, object $j$ has greater influence on $i$ than $i$ has on $j$.
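The iterative message passing can be sketched as below. The sketch assumes that each update is a weighted average of the visual estimate and the messages $d_{j}+\Delta s_{ij}$ from the other objects, with the message arriving at $i$ from $j$ weighted by $w_{ij,2}$ only (already summed over $k$). This form follows from the quadratic terms under that assumption and is not copied from the original update equation; the function name and argument layout are hypothetical.

```python
def mean_field_displacements(unary_disp, unary_w, pair_w, prev_speed, num_iters=5):
    """Iteratively refine per-object displacements.

    unary_disp[i]: (dx, dy) predicted from visual cues for object i.
    unary_w[i]:    confidence weight w_{i,1} (assumed strictly positive).
    pair_w[i][j]:  weight on the message arriving at i from j.
    prev_speed[i]: (sx, sy) speed of object i at time t-1.
    """
    n = len(unary_disp)
    disp = [tuple(d) for d in unary_disp]  # start from the visual estimates
    for _ in range(num_iters):
        updated = []
        for i in range(n):
            num = [unary_w[i] * unary_disp[i][0], unary_w[i] * unary_disp[i][1]]
            den = unary_w[i]
            for j in range(n):
                if j == i:
                    continue
                w = pair_w[i][j]
                for axis in (0, 1):
                    # Displacement of i that would preserve the previous
                    # speed difference s_i - s_j given j's current estimate.
                    ds = prev_speed[i][axis] - prev_speed[j][axis]
                    num[axis] += w * (disp[j][axis] + ds)
                den += w
            updated.append((num[0] / den, num[1] / den))
        disp = updated  # synchronous update of all objects
    return disp


# Object 0 is reliable; object 1 has a poor visual estimate and only listens.
result = mean_field_displacements(
    unary_disp=[(2.0, 0.0), (10.0, 0.0)],
    unary_w=[1.0, 1.0],
    pair_w=[[0.0, 0.0], [10.0, 0.0]],
    prev_speed=[(2.0, 0.0), (2.0, 0.0)],
)
```

In this toy case the one-directional weight pulls object 1 from its visual estimate of 10 pixels towards object 0's displacement, while object 0 keeps its own estimate unchanged.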

A detailed derivation of Eq. (7) is given as follows. The mean-field method is to approximate the distribution $P({\bf d}|{\bf r},I)$ with a distribution $Q({\bf d}|{\bf r},I)$, which can be expressed as a product of independent marginals $Q({\bf d}|{\bf r},I)=\prod_{i=1}^{N}Q_{i}(d_{i}|{\bf r},I)$. The optimal approximation of $Q$ is obtained by minimizing the Kullback-Leibler (KL) divergence between $P$ and $Q$. The solution for $Q$ has the following form,

where $E_{i\neq j}$ denotes expectation under $Q$ distributions over all variables $d_{j}$ for $j\neq i$ . The inference is formulated as

Each $\log(Q_{i}(d_{i}|{\bf r},I))$ is a quadratic form with respect to $d_{i}$ and its means therefore are

The inference task is to maximize $P({\bf d}|{\bf r},I)$. Since we approximate the conditional distribution with a product of independent marginals, an estimate of each $d_{i}$ is obtained as the expected value $\mu_{i}$ of the corresponding quadratic function,
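The steps of this derivation can be sketched in the following form. It is a sketch under two assumptions rather than a reproduction of the original equations: the pairwise term is taken to be the weighted squared deviation of $\Delta d_{ij}$ from $\Delta s_{ij}$, and only the terms weighted by $w_{ij,2}^{(k)}$ are collected for object $i$, which matches the directional messages described above. The standard mean-field fixed point gives

$$\log Q_{i}(d_{i}|{\bf r},I)=E_{i\neq j}\left[\log P({\bf d}|{\bf r},I)\right]+\text{const}.$$

Taking the expectation over the other objects replaces each $d_{j}$ by its current mean $\mu_{j}$ (the variance of $d_{j}$ does not depend on $d_{i}$ and is absorbed into the constant), so that

$$\log Q_{i}(d_{i}|{\bf r},I)=-w_{i,1}\left(d_{i}-f_{d}(r_{i},I)\right)^{2}-\sum_{k}\sum_{j\neq i}w_{ij,2}^{(k)}\left(d_{i}-\mu_{j}-\Delta s_{ij}\right)^{2}+\text{const}.$$

This is quadratic in $d_{i}$, hence $Q_{i}$ is Gaussian, and setting the derivative with respect to $d_{i}$ to zero yields its mean,

$$\mu_{i}=\frac{w_{i,1}f_{d}(r_{i},I)+\sum_{k}\sum_{j\neq i}w_{ij,2}^{(k)}\left(\mu_{j}+\Delta s_{ij}\right)}{w_{i,1}+\sum_{k}\sum_{j\neq i}w_{ij,2}^{(k)}}.$$

Under these assumptions, each update is a weighted average of the visual estimate and the messages from the other objects, so a larger $w_{i,1}$ keeps $\mu_{i}$ closer to $f_{d}(r_{i},I)$.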

III-B The Overall MOT Algorithm

The overall algorithm with our proposed DCCRF is shown in Algorithm 1. At each time $t$, the DCCRF inputs are existing tracklets up to time $t-1$, and consecutive frames at time $t-1$ and time $t$. It outputs each tracklet’s displacement estimation. After obtaining the displacement estimation $\widehat{d_{i}}$ for each tracklet $r_{i}$ by the DCCRF, its estimated location at time $t$ can be simply calculated as the summation of its location $b_{r_{i}}$ at time $t-1$ and its estimated displacement $\widehat{d_{i}}$, i.e.,

$$\widehat{b_{r_{i}}}=b_{r_{i}}+\widehat{d_{i}}.\qquad(12)$$

Based on such estimated locations, we utilize a visual-similarity CNN (Section III-B1) as well as the Intersection-over-Union value as the criterion for tracklet-detection association to generate longer tracklets (Section III-B2). To make our online MOT system complete, we also specify our detailed strategies for tracklet initialization (Section III-B3), occlusion handling and tracklet termination (Section III-B4).

The tracklet-detection associations need to be determined based on visual cues and spatial cues simultaneously. We propose a visual-similarity CNN for calculating visual similarities between image patches cropped at bounding box locations in the same frame. The visual-similarity CNN has a network structure similar to that of our visual-displacement CNN in Section III-A1. However, the network takes image patches in the same video frame as inputs and outputs the confidence that the input pair represents the same object. It is therefore trained with a binary cross-entropy loss. In addition, the training samples are generated differently for the visual-similarity CNN. Instead of cropping two consecutive video frames at the same bounding box location as for the visual-displacement CNN, the visual-similarity CNN requires positive pairs to be cropped at different locations of the same object at any time in the same video, and negative pairs to be image patches belonging to different objects. For cropping image patches, we do not enlarge the object’s bounding box, which also differs from our visual-displacement CNN. During training, the ratio between positive and negative pairs is set to $1$:$3$, and the network is trained similarly to the visual-displacement CNN.

III-B2 Tracklet-detection association

Given the estimated tracklet locations and detection boxes at time $t$, the tracklets are associated with detection boxes based on the visual and spatial similarities between them. The associated detection boxes can then be appended to their corresponding tracklets to form longer ones up to time $t$. Let $\widehat{b_{r_{i}}}$ and $b_{j}$ denote the $i$th tracklet’s estimated location and the $j$th detection box at time $t$. Their visual similarity calculated by the visual-similarity CNN in Section III-B1 is denoted as $V(\widehat{b_{r_{i}}},b_{j})$. The spatial similarity between an estimated tracklet location and a detection box is measured as their box Intersection-over-Union value $IoU(\widehat{b_{r_{i}}},b_{j})$. If a detection box could be associated with multiple tracklets, the Hungarian algorithm is utilized to determine the optimal associations with the following overall similarity,

where $\lambda$ is the weight balancing the visual and spatial similarities and is set to 1 in our experiments. After the box association by the Hungarian algorithm, if a tracklet is associated with a detection box that has an IoU value greater than 0.5 with it, the associated detection box is directly appended to the end of the tracklet. If the IoU value is between 0.3 and 0.5, the average of the associated detection box and the estimated tracklet box is appended to the tracklet to compensate for the possibly noisy detection box. If the IoU value is smaller than 0.3, the tracklet might be considered as being terminated or temporarily occluded (Section III-B4).
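The rule applied after association can be sketched as follows. The IoU thresholds are the ones stated above; the additive form `v + lam * overlap` for the overall similarity is an assumption (the text only says that $\lambda$ balances the two similarities), and the function names are hypothetical. The assignment step itself is not shown.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes (x_min, y_min, x_max, y_max)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def overall_similarity(visual_similarity, overlap, lam=1.0):
    """Assumed additive combination of visual and spatial similarity."""
    return visual_similarity + lam * overlap


def box_to_append(estimated, detection):
    """Box to append to a tracklet after it was matched to `detection`.

    Returns None when the overlap is too small, in which case the tracklet
    is handled by the occlusion / termination logic instead.
    """
    overlap = iou(estimated, detection)
    if overlap > 0.5:
        return tuple(detection)            # trust the detection as it is
    if overlap >= 0.3:
        # Average the detection with the estimate to damp a noisy detection.
        return tuple((e + d) / 2.0 for e, d in zip(estimated, detection))
    return None
```

For example, an estimated box `(0, 0, 10, 10)` matched to a detection `(4, 0, 14, 10)` has an IoU of about 0.43, so the averaged box `(2, 0, 12, 10)` would be appended.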

III-B3 Tracklet initialization

If an object detection box at time $t-1$ is not associated with any tracklet in the above tracklet-detection association step, it is treated as a candidate box for initializing new tracklets. For each such candidate box at time $t-1$, its visually inferred displacement between time $t-1$ and $t$ is first obtained by our visual-displacement CNN in Section III-A1. Its estimated box location can be easily calculated following Eq. (12). The visual similarities $V$ and spatial similarities $IoU$ between the estimated box at $t$ and candidate boxes at $t$ are calculated. To form a new candidate tracklet, the candidate box at time $t-1$ is only associated with a candidate box at time $t$ that has 1) an $IoU$ greater than 0.3 and 2) a visual similarity greater than 0.8 with its estimated box location. If there are multiple candidate associations, the Hungarian algorithm is utilized to associate the candidate box at $t$ to its optimal candidate association at $t-1$ according to the overall similarities (Eq. (13)). If none of the candidate associations at time $t$ satisfies the above two conditions with the candidate box at $t-1$, the candidate box is ignored and is not used for tracklet initialization. Such operations are iterated over time to generate longer candidate tracklets. If a candidate tracklet is over $k$ frames long ($k=4$ for pedestrian tracking with 25-fps videos), it is initialized as a new tracklet.

III-B4 Occlusion handling and tracklet termination

If a past tracklet is not associated with any detection box at time $t$, the tracked object is considered as being possibly occluded or temporarily missed. For a possibly occluded object, we directly associate its past tracklet to its estimated location by our DCCRF at time $t$ to create a virtual tracklet. The same operation is iterated for $m$ frames, i.e., if the virtual tracklet is not associated with any detection box for more than $m$ time steps, the virtual tracklet is terminated. For pedestrian tracking, we empirically set $m=5$.
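The bookkeeping for occlusion handling and termination can be sketched with a small class. The class name `Tracklet`, its fields and the choice to keep the virtual boxes already appended when the tracklet is terminated are assumptions of this sketch; only the rule "terminate after more than $m$ consecutive unmatched steps" and the value $m=5$ come from the text.

```python
class Tracklet:
    """Minimal tracklet state with a counter of consecutive unmatched steps."""

    def __init__(self, box, max_missed=5):
        self.boxes = [box]
        self.missed = 0              # consecutive steps without a detection
        self.max_missed = max_missed
        self.terminated = False

    def step(self, estimated_box, detection=None):
        """Advance one time step with the DCCRF estimate and optional match."""
        if self.terminated:
            return
        if detection is not None:
            self.boxes.append(detection)
            self.missed = 0          # a real detection ends the virtual phase
            return
        self.missed += 1
        if self.missed > self.max_missed:
            self.terminated = True   # unmatched for more than m steps
        else:
            # Extend the tracklet virtually with the estimated location.
            self.boxes.append(estimated_box)
```

A tracklet that is re-detected within $m$ steps simply continues, with the virtual boxes filling the gap left by the occlusion.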

IV Experiments

In this section, we present experimental results of the proposed online MOT algorithm. We first introduce evaluation datasets and implementation details for our proposed framework in Sections IV-A and IV-B. In Section IV-C, we compare the proposed method with state-of-the-art approaches on the public MOT datasets. The individual components of our proposed method are evaluated in Section IV-D.

We conduct experiments on the 2DMOT15 and 2DMOT16 benchmarks, which are widely used to evaluate the performance of MOT methods. Both of them have two tracks: public detection boxes and private detection boxes. To compare only the performance of the tracking algorithms, we evaluate our method with the provided public detection boxes.

The 2DMOT15 dataset is one of the largest datasets with moving or static cameras, different viewpoints and different weather conditions. It contains a total of 22 sequences, half for training and half for testing, with a total of 11286 frames (or 996 seconds). The training sequences contain over 5500 frames, 500 annotated trajectories and 39905 annotated bounding boxes. The testing sequences contain over 5700 frames, 721 annotated trajectories and 61440 annotated bounding boxes. The public detection boxes in 2DMOT15 are generated with aggregated channel features (ACF).

IV-A2 2DMOT16

This dataset is an extension to 2DMOT15. Compared to 2DMOT15, new sequences are added and the dataset contains almost 3 times more bounding boxes for training and testing. Most sequences are in high resolution, and the average pedestrian number in each video frame is 3 times higher than that of the 2DMOT15. In 2DMOT16, deformable part models (DPM) based methods are used to generate public detection boxes, which are more accurate than boxes in 2DMOT15.

IV-A3 Evaluation Metric

For the quantitative evaluation, we adopt the popular CLEAR MOT metrics, which include:

MOTA: Multiple Object Tracking Accuracy. This metric is usually chosen as the main performance indicator for MOT methods. It combines three types of errors: false positives, false negatives, and identity switches.

MOTP: Multiple Object Tracking Precision. The misalignment between the annotated and the predicted bounding boxes.

MT: Mostly Tracked targets. The ratio of ground-truth trajectories that are covered by a track hypothesis for at least 80% of their respective life span.

ML: Mostly Lost targets. The ratio of ground-truth trajectories that are covered by a track hypothesis for at most 20% of their respective life span.

FN: The total number of false negatives (missed targets).

ID Sw: The total number of identity switches. Please note that we follow the stricter definition of identity switches as described in MOT challenge.

Frag: The total number of times a trajectory is fragmented (i.e., interrupted during tracking).
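As a worked illustration of how these quantities combine, the sketch below uses the standard CLEAR MOT definition of MOTA, one minus the sum of false negatives, false positives and identity switches divided by the total number of ground-truth boxes, together with the 80% and 20% coverage thresholds for MT and ML. The function names and the example counts are hypothetical and are not results from this paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt_boxes):
    """Multiple Object Tracking Accuracy over a whole sequence."""
    if num_gt_boxes <= 0:
        raise ValueError("num_gt_boxes must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt_boxes


def classify_track(covered_frames, lifespan_frames):
    """Label a ground-truth trajectory as mostly tracked, mostly lost or other."""
    ratio = covered_frames / lifespan_frames
    if ratio >= 0.8:
        return "MT"
    if ratio <= 0.2:
        return "ML"
    return "other"


# Made-up counts: 100 misses, 50 false alarms, 10 switches over 1000 boxes.
example = mota(100, 50, 10, 1000)
```

Because the three error types are simply summed, MOTA can become negative when a tracker makes more errors than there are ground-truth boxes.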

IV-B Implementation details

For the visual-displacement and visual-similarity CNNs, we adopt ResNet-101 as the network structure and replace the topmost layer to output displacement confidence or same-object confidence. Both CNNs are pretrained on the ImageNet dataset. For cropping image patches from $\bar{b}_{i}$, we enlarge each detection box $b_{i}$ by a factor of 5 in width and 2 in height to obtain $\bar{b}_{i}$. Image patches for the two CNNs are cropped at the same locations from consecutive frames as described in Section III-A1, which are then resized to $224\times 224$ as the CNN inputs.

We train our proposed DCCRF in three stages. In the first stage, the proposed visual-displacement CNN is trained with the cross-entropy loss and batch Stochastic Gradient Descent (SGD) with a batch size of 5. The initial learning rate is set to $10^{-6}$ and is decreased by a factor of 10 every 50,000 iterations. The training generally converges after 600,000 iterations. In the second stage, the learned visual-displacement CNN from stage 1 is fixed and the other parameters in our DCCRF are trained with the $L_{1}$ loss,

where $\widehat{d_{i}}$ and $d_{i}^{gt}$ are estimated displacements and the ground-truth displacements for the $i$ th tracked object. In the final stage, the DCCRF is trained in an end-to-end manner with the above $L_{1}$ loss and the cross-entropy loss for visual-displacement CNN in unary terms. We find that 5 iterations of the mean-field approximation generate satisfactory results. The DCCRF is trained with an initial learning rate of $10^{-4}$ , which is decreased by a factor of 1/3 every 5,000 iterations. The training typically converges after 3 epochs.
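To make the training schedule concrete, the sketch below evaluates a step learning-rate schedule with the settings quoted above and an $L_{1}$ loss over per-object displacements. It is an illustration only: the function names are hypothetical, and summing the absolute differences over all objects and coordinates (without normalization) is an assumption.

```python
def step_learning_rate(base_lr, gamma, step_size, iteration):
    """Multiply base_lr by gamma once every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)


def l1_displacement_loss(estimated, ground_truth):
    """Sum of absolute differences between estimated and ground-truth displacements.

    Both arguments are sequences with one displacement tuple per tracked object.
    """
    return sum(
        abs(e - g)
        for est, gt in zip(estimated, ground_truth)
        for e, g in zip(est, gt)
    )


# Stage 1: 1e-6, multiplied by 1/10 every 50,000 iterations.
stage1_lr = step_learning_rate(1e-6, 0.1, 50_000, iteration=100_000)
# Final stage: 1e-4, multiplied by 1/3 every 5,000 iterations.
stage3_lr = step_learning_rate(1e-4, 1.0 / 3.0, 5_000, iteration=5_000)
```

With these settings, the stage-1 learning rate has been reduced twelve times by the 600,000th iteration.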

Our code is implemented in MATLAB and Caffe. The overall tracking speed of the proposed method on the MOT16 test sequences is 0.1 fps using a 2.4 GHz CPU and a Maxwell TITAN X GPU, without using acceleration library packages.

IV-B2 Data augmentation

To introduce more variation into the training data and thus reduce possible overfitting, we augment the training data. For pre-training the visual-displacement CNN, the input images are image patches centered at detection boxes. We augment the training samples by random flipping and by randomly shifting the cropping positions by no more than $\pm 1/5$ of the detection box width and height in the $x$ and $y$ dimensions, respectively. For end-to-end training of the DCCRF, in addition to random flipping of whole video frames, the time interval between the two input video frames is randomly sampled from the interval of $$ to generate more frame pairs with larger possible displacements between them.
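The patch-level augmentation for pre-training can be sketched as below. This is a minimal illustration under stated assumptions: boxes are stored as (center x, center y, width, height), shifts are drawn uniformly, flipping is horizontal, and a patch is represented as a list of pixel rows. The helper names are hypothetical, and the frame-interval sampling used for end-to-end training is not covered.

```python
import random


def shift_crop_center(box, rng, max_shift_ratio=0.2):
    """Shift the crop center by at most max_shift_ratio of the box width/height."""
    cx, cy, w, h = box
    dx = rng.uniform(-max_shift_ratio, max_shift_ratio) * w
    dy = rng.uniform(-max_shift_ratio, max_shift_ratio) * h
    return (cx + dx, cy + dy, w, h)


def flip_patch(patch):
    """Mirror a patch (list of pixel rows) left to right."""
    return [row[::-1] for row in patch]


def augment_patch(patch, rng, flip_probability=0.5):
    """Flip the patch with probability flip_probability; report whether it was flipped."""
    flipped = rng.random() < flip_probability
    return (flip_patch(patch) if flipped else patch), flipped


rng = random.Random(0)  # fixed seed so the sketch is reproducible
shifted_box = shift_crop_center((100.0, 100.0, 20.0, 50.0), rng)
```

If the patches from both frames are mirrored horizontally, the horizontal component of the displacement label would presumably need to change sign as well, which is why `augment_patch` reports whether a flip was applied.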

IV-C Quantitative results on 2DMOT15 and 2DMOT16

On the MOT2015 and MOT2016 datasets, we test our proposed method and compare it with state-of-the-art MOT methods including SMOT, MDP, SCEA, CEM, RNN_LSTM, RMOT, TC_ODAL, CNNTCM, SiameseCNN, oICF, NOMT, and CDA_DDAL. Note that only methods in peer-reviewed publications are compared in this paper; arXiv papers that have not undergone peer review are not included. The results of the compared methods are listed in Tables I and II. We focus on the MOTA value as the main performance indicator, which combines false negatives (FN), false positives (FP) and identity switches (ID Sw). Offline methods generally have higher MOTA than online methods because they can utilize not only past but also future information for object tracking; they are listed here only for reference. Our proposed online MOT method outperforms all compared online methods and most offline methods. As shown by the quantitative results, our proposed method is able to alleviate the difficulties caused by object mis-detections, noisy detections, and short-term occlusion. The qualitative results are shown in Fig. 6.

Compared with SCEA, which also models inter-object interactions and speed differences to handle mis-detections caused by global camera motion, our learned DCCRF shows better performance, especially in FN, because our more accurate displacement prediction is able to recover more mis-detections. Our proposed method also outperforms MDP in terms of MOTA and FP by a large margin. MDP learns to predict four target states (active, tracked, lost and inactive) for each tracked object. However, it models a tracked object’s movement pattern only with a constant speed assumption, which is likely to result in false tracklet-detection associations and thus increases FP. CDA_DDAL focuses on using discriminative visual features from a siamese CNN for tracklet-detection associations, which is not robust to occlusions and tends to increase FN. Compared with DCO_X and LTTSC-CRF, which also use conditional random field approximations to solve MOT problems, our proposed DCCRF achieves a clearly higher MOTA.

However, our method produces more ID switches than some compared methods, which is due to long-term occlusions that our method cannot handle.

IV-D Component analysis on 2DMOT16

To analyze the effectiveness of different components in our proposed framework, we also design a series of baseline methods for comparison. The results of these baselines and our final method are reported in Table III. Similar to the above experiments, we focus on the MOTA value as the main performance indicator. 1) Unary-only: this baseline uses only the unary terms of our DCCRF, i.e., the visual-displacement CNN, within our overall MOT algorithm. Such a baseline considers only the tracked objects’ appearance information. Compared with our proposed DCCRF, it has a $3\%$ MOTA drop, which indicates that the inter-object relations are crucial for regularizing each object’s estimated displacement and should not be ignored. 2) Unary-only+ $L_{1}$ -loss (reg): since our visual-displacement CNN is trained with the proposed cross-entropy loss instead of the conventional $L_{1}$ or $L_{2}$ losses for regression problems, we train a visual-displacement CNN with the smooth $L_{1}$ loss and test it in the same way as the above unary-only baseline. Compared with the unary-only baseline, unary-only+ $L_{1}$ -loss has a significant $7\%$ MOTA drop, which demonstrates that our proposed cross-entropy loss results in much better displacement estimation accuracy. 3) DCCRF w/o cfd-asym and DCCRF w/o size-asym: the weighting functions of the pairwise term in our proposed DCCRF have two terms, a confidence-asymmetric term and a size-asymmetric term. We test using only one of them in our DCCRF’s pairwise terms. Both baselines show a drop of more than $1\%$ in MOTA compared with our proposed DCCRF, which validates the need for both terms in the weighting functions. 4) DCCRF w/ symmetry: this baseline replaces the asymmetric pairwise term in our DCCRF with a symmetric one,

where $l_{i}$ denotes the coordinates of the $i$ th object’s center position and $a_{2}^{(k)}$ are learnable Gaussian kernel bandwidth parameters. Such a symmetric term assumes that the speed differences between close-by objects should be better maintained across time, while those between far-away objects are less regularized. There is a $1\%$ MOTA drop compared with our proposed DCCRF, which shows that our asymmetric term is beneficial for the final performance. We also tried directly replacing the sigmoid function in Eq. (5) with a Gaussian-like function in the weighting function (Eq. (15)), which results in even worse performance.

In addition to the above, we also conduct experiments to analyze the effects of different hyper-parameters and to show the robustness of our DCCRF. 1) $\lambda$ controls the weight between the visual-similarity term and the DCCRF location prediction term for tracklet-detection association in Eq. (13). We test three different values of $\lambda$ and report the results in Table IV, which show that the final performance is not sensitive to the value of $\lambda$. 2) $k$ is the length that a candidate tracklet must reach to create an actual tracklet in Section III-B3. We additionally test $k=8$ in Table V, which shows a slight performance drop, because a larger $k$ causes more low-confidence detections to be ignored. 3) $m$ denotes the number of consecutive frames in which an object is missing before its associated tracklet is terminated in Section III-B4. We additionally test $m=8$, and the results in Table VI show that the performance is not sensitive to the choice of $m$.
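The roles of $k$ and $m$ can be illustrated with a small state machine. This is a hedged sketch rather than the procedure of Sections III-B3 and III-B4: the class name, the status strings, and the rule that a candidate's hit counter resets on a missed frame are assumptions made for illustration.

```python
class TrackletState:
    """Confirmation and termination counters for a single tracklet."""

    def __init__(self, k, m):
        self.k = k  # matched frames a candidate needs to become an actual tracklet
        self.m = m  # consecutive missed frames after which the tracklet is terminated
        self.hits = 0
        self.misses = 0
        self.status = "candidate"

    def update(self, matched):
        """Advance one frame; matched tells whether a detection was associated."""
        if self.status == "terminated":
            return self.status
        if matched:
            self.hits += 1
            self.misses = 0
            if self.status == "candidate" and self.hits >= self.k:
                self.status = "actual"
        else:
            self.misses += 1
            if self.status == "candidate":
                self.hits = 0  # assumption: candidates need k consecutive matches
            if self.misses >= self.m:
                self.status = "terminated"
        return self.status


tracklet = TrackletState(k=3, m=2)
history = [tracklet.update(flag) for flag in (True, True, True, False, False)]
```

Under these assumptions, a detection that is matched in fewer than $k$ consecutive frames never becomes an actual tracklet, which is consistent with the observation that a larger $k$ discards more low-confidence detections.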

V Conclusion

In this paper, we present the Deep Continuous Conditional Random Field (DCCRF) model with asymmetric inter-object constraints for solving the MOT problem. The unary terms are modeled as a visual-displacement CNN that estimates object displacements across time with visual information. The asymmetric pairwise terms regularize inter-object speed differences across time with both size-based and confidence-based weighting functions, which place more weight on high-confidence tracklets to correct tracking errors. By jointly training the two terms in DCCRF, the relations between objects’ individual movement patterns and complex inter-object constraints can be better modeled and regularized to achieve more accurate tracking performance. Extensive experiments demonstrate the effectiveness of our proposed MOT framework as well as the individual components of our DCCRF.