Detecting events and key actors in multi-person videos

Vignesh Ramanathan, Jonathan Huang, Sami Abu-El-Haija, Alexander Gorban, Kevin Murphy, Li Fei-Fei

Introduction

Event recognition and detection in videos have benefited greatly from the introduction of recent large-scale datasets and models. However, this progress is mainly confined to the domain of single-person actions, where the videos contain one actor performing a primary activity. Another equally important problem is event recognition in videos with multiple people. In our work, we present a new model and dataset for this specific setting.

Videos captured in sports arenas, market places or other outdoor areas typically contain multiple people interacting with each other. Most people are doing “something”, but not all of them are involved in the main event. The main event is dominated by a smaller subset of people. For instance, a “shot” in a game is determined by one or two people (see Figure 1). In addition to recognizing the event, it is also important to isolate these key actors. This is a significant challenge which differentiates multi-person videos from single-person videos.

Identifying the people responsible for an event is thus an interesting task in its own right. However, acquiring such annotations is expensive, and it is therefore desirable to use models that do not require key-actor annotations during training. This can also be viewed as a problem of weakly supervised key person identification. In this paper, we propose a method to classify events by using a model that is able to “attend” to this subset of key actors. We do this without ever explicitly telling the model who or where the key actors are.

Recently, several papers have proposed to use “attention” models for aligning elements from a fixed input to a fixed output. Examples include translating sentences from one language to another, attending to different words in the input; generating an image caption, attending to different regions in the image; and generating a video caption, attending to different frames within the video.

In our work, we use attention to decide which of several people is most relevant to the action being performed; this attention mask can change over time. Thus we are combining spatial and temporal attention. Note that while the person detections vary from one frame to another, they can be associated across frames through tracking. We show how to use a recurrent neural network (RNN) to represent information from each track; the attention model is tasked with selecting the most relevant track in each frame. In addition to being able to isolate the key actors, we show that our attention model results in better event recognition.

In order to evaluate our method, we need a large number of videos illustrating events involving multiple people. Most prior activity and event recognition datasets focus on actions involving just one or two people. Existing multi-person datasets are usually restricted to fewer videos. Therefore we collected our own dataset. In particular, we propose a new dataset of basketball events with time-stamp annotations for all occurrences of 11 different events across 257 videos, each about 1.5 hours long. This dataset is comparable to the THUMOS detection dataset in terms of number of annotations, but contains longer videos in a multi-person setting.

In summary, the contributions of our paper are as follows. First, we introduce a new large-scale basketball event dataset with 14K dense temporal annotations for long video sequences. Second, we show that our method outperforms state-of-the-art methods for the standard tasks of classifying isolated clips and of temporally localizing events within longer, untrimmed videos. Third, we show that our method learns to attend to the relevant players, despite never being told which players are relevant in the training set.

Related Work

Action recognition in videos. Traditionally, well engineered features have proved quite effective for video classification and retrieval tasks. The improved dense trajectory (IDT) features achieve competitive results on standard video datasets. In the last few years, end-to-end trained deep network models were shown to be comparable to, and at times better than, these features for various video tasks. Other works explore methods for pooling such features for better performance. Recent works using RNNs have achieved state-of-the-art results for both event recognition and caption-generation tasks. We follow this line of work with the addition of an attention mechanism to attend to the event participants.

Another related line of work jointly identifies the region of interest in a video while recognizing the action. Gkioxari et al. and Raptis et al. automatically localize a spatio-temporal tube in a video. Jain et al. merge super-voxels for action localization. While these methods perform weakly-supervised action localization, they target single-actor videos in short clips where the action is centered around the actor. Other methods require annotations during training to localize the action.

Multi-person video analysis. Activity recognition models for events with well defined group structures, such as parades, have been presented in prior work. They utilize the structured layout of participants to identify group events. More recent work uses context as a cue for recognizing interaction-based group activities. While they work with multi-person events, these methods are restricted to smaller datasets such as UT-Interaction, Collective activity and Nursing home.

Attention models. Itti et al. explored the idea of saliency-based attention in images, with other works using eye-gaze data as a means for learning attention. Mnih et al. attend to regions of varying resolutions in an image through an RNN framework. Along similar lines, attention has been used for image classification and detection as well.

Bahdanau et al. showed that attention-based RNN models can effectively align input words to output words for machine translation. Following this, Xu et al. and Yao et al. used attention for image-captioning and video-captioning respectively. In all these methods, attention aligns a sequence of input features with words of an output sentence. However, in our work we use attention to identify the most relevant person to the overall event during different phases of the event.

Action recognition datasets. Action recognition in videos has evolved with the introduction of more sophisticated datasets, from the smaller KTH and HMDB to the larger UCF101, TRECVID-MED and Sports-1M datasets. More recently, THUMOS and ActivityNet also provide a detection setting with temporal annotations for actions in untrimmed videos. There are also fine-grained datasets in specific domains such as MPII cooking and breakfast. However, most of these datasets focus on single-person activities with hardly any need for recognizing the people responsible for the event. On the other hand, publicly available multi-person activity datasets are restricted to a very small number of videos. One of the contributions of our work is a multi-player basketball dataset with dense temporal event annotations in long videos.

Person detection and tracking. There is a very large literature on person detection and tracking, including specific methods for tracking players in sports videos. Here we just mention a few key methods. For person detection, we use a CNN-based multibox detector. For person tracking, we use a KLT tracker. There is also work on player identification, but in this work, we do not attempt to distinguish players.

NCAA Basketball Dataset

A natural choice for collecting multi-person action videos is team sports. In this paper, we focus on basketball games, although our techniques are general purpose. In particular, we use a subset of the 296 NCAA games available from YouTube (https://www.youtube.com/user/ncaaondemand). These games are played in different venues over different periods of time. We only consider the most recent 257 games, since older games used slightly different rules than modern basketball. The videos are typically 1.5 hours long. We manually identified 11 key event types listed in Tab. 1. In particular, we considered 5 types of shots, each of which could be successful or failed, plus a steal event.

Next we launched an Amazon Mechanical Turk task, where the annotators were asked to annotate the “end-point” of these events if and when they occur in the videos; end-points are usually well-defined (e.g., the ball leaves the shooter’s hands and lands somewhere else, such as in the basket). To determine the starting time, we assumed that each event was 4 seconds long, since it is hard to get raters to agree on when an event started. This gives us enough temporal context to classify each event, while still being fairly well localized in time.

The videos were randomly split into 212 training, 12 validation and 33 test videos. We split each of these videos into 4 second clips (using the annotation boundaries), and subsampled these to 6fps. We filter out clips which are not profile shots (such as those shown in Figure 3) using a separately trained classifier; this excludes close-up shots of players, as well as shots of the viewers and instant replays. This resulted in a total of 11436 training, 856 validation and 2256 test clips, each of which has one of 11 labels. Note that this is comparable in size to the THUMOS’15 detection challenge (150 trimmed training instances for each of the 20 classes and 6553 untrimmed validation instances). The distribution of annotations across all the different events is shown in Tab. 1. To the best of our knowledge, this is the first dataset with dense temporal annotations for such long video sequences.
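
The clip construction described above (a fixed 4 second window that ends at the annotated end-point, subsampled to 6fps) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `clip_frame_times` and the choice to clamp the start at the beginning of the video are assumptions.

```python
def clip_frame_times(end_time_s, clip_len_s=4.0, fps=6.0):
    """Return the timestamps (in seconds) of the frames sampled from one clip.

    The clip is assumed to end at the annotated end-point and to start
    clip_len_s seconds earlier (clamped at 0, an assumption for events
    that occur very early in a video).
    """
    start = max(0.0, end_time_s - clip_len_s)
    n_frames = int(round((end_time_s - start) * fps))
    return [start + i / fps for i in range(n_frames)]
```

With the default values, a full-length clip yields 24 sampled frames.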

In addition to annotating the event label and start/end time, we collected AMT annotations on 850 video clips in the test set, where the annotators were asked to mark the position of the ball on the frame where the shooter attempts a shot.

We also used AMT to annotate the bounding boxes of all the players in a subset of 9000 frames from the training videos. We then trained a Multibox detector with these annotations, and ran the trained detector on all the videos in our dataset. We retained all detections above a confidence of 0.5 per frame; this resulted in 6–8 person detections per clip, as listed in Tab. 1. The multibox model achieves an average overlap of 0.7 at a recall of 0.8 with ground-truth bounding boxes in the validation videos.

We plan to release our annotated data, including time stamps, ball location, and player bounding boxes.

Our Method

All events in a team sport are performed in the same scene by the same set of players. The only basis for differentiating these events is the action performed by a small subset of people at a given time. For instance, a “steal” event in basketball is completely defined by the action of the player attempting to pass the ball and the player stealing from him. To understand such an event, it is sufficient to observe only the players participating in the event.

This motivates us to build a model (overview in Fig. 3) which can reason about an event by focusing on specific people during the different phases of the event. In this section, we describe our unified model for classifying events and simultaneously identifying the key players.

Each video-frame is represented by a 1024 dimensional feature vector $f_t$, which is the activation of the last fully connected layer of the Inception7 network. In addition, we compute spatially localized features for each person in the frame. In particular, we compute a 2805 dimensional feature vector $p_{ti}$ which contains both appearance (1365 dimensional) and spatial information (1440 dimensional) for the $i$’th player bounding box in frame $t$. Similar to the RCNN object detector, the appearance features were extracted by feeding the cropped and resized player region from the frame through the Inception7 network and spatially pooling the response from a lower layer. The spatial feature corresponds to a $32\times 32$ spatial histogram, combined with a spatial pyramid, to indicate the bounding box location at multiple scales. While we have only used static CNN representations in our work, these features can also be easily extended with flow information.

Event classification

Given $f_t$ and $p_{ti}$ for each frame $t$, our goal is to train the model to classify the clip into one of 11 categories. As a side effect of the way we construct our model, we will also be able to identify the key player in each frame.

First we compute a global context feature for each frame, $h_t^f$, derived from a bidirectional LSTM applied to the frame-level feature as shown by the blue boxes in Fig. 3. This is a concatenation of the hidden states from the forward and reverse LSTM components of a BLSTM and can be compactly represented as:
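
One compact way to write this, assuming the BLSTM takes the hidden states of the neighboring frames together with the current frame feature as inputs (the name $\mathrm{BLSTM}_{\mathrm{frame}}$ is notation assumed here), is:

```latex
\begin{equation}
h_t^f = \mathrm{BLSTM}_{\mathrm{frame}}\left(h_{t-1}^f,\; h_{t+1}^f,\; f_t\right) \tag{1}
\end{equation}
```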

Next we use a unidirectional LSTM to represent the state of the event at time $t$:
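
A form consistent with the description, assuming the event LSTM receives its own previous state, the global context feature $h_t^f$ and the player-derived feature $a_t$ as inputs, is:

```latex
\begin{equation}
h_t^e = \mathrm{LSTM}\left(h_{t-1}^e,\; h_t^f,\; a_t\right) \tag{2}
\end{equation}
```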

where $a_t$ is a feature vector derived from the players, as we describe below. From this, we can predict the class label for the clip using $w_k^{\intercal} h_t^e$, where the weight vector corresponding to class $k$ is denoted by $w_k$. We measure the squared-hinge loss as follows:
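
A squared-hinge loss over the class scores $w_k^{\intercal} h_t^e$ can be written as below. The sums over the $T$ time-steps and $K$ classes and the factor of $\tfrac{1}{2}$ are assumptions made here for concreteness:

```latex
\begin{equation}
L = \frac{1}{2} \sum_{t=1}^{T} \sum_{k=1}^{K} \max\left(0,\; 1 - y_k\, w_k^{\intercal} h_t^e\right)^2 \tag{3}
\end{equation}
```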

where $y_k$ is $1$ if the video belongs to class $k$, and is $-1$ otherwise.
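
To make the loss concrete, the following plain-Python sketch evaluates a squared-hinge loss from per-frame class scores. It assumes the loss is summed over all time-steps and classes and scaled by one half; the function name `squared_hinge_loss` and the nested-list input layout are hypothetical choices for illustration.

```python
def squared_hinge_loss(scores, labels):
    """Squared-hinge loss for one clip.

    scores[t][k] is the class score w_k^T h_t^e at time-step t.
    labels[k] is +1 if the clip belongs to class k and -1 otherwise.
    Summing over t and k and scaling by 0.5 are assumptions.
    """
    total = 0.0
    for frame_scores in scores:
        for score, y in zip(frame_scores, labels):
            margin = max(0.0, 1.0 - y * score)  # zero once the margin is met
            total += margin ** 2
    return 0.5 * total
```

A score that is on the correct side of the margin contributes nothing, while a score of zero contributes a unit penalty before scaling.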

Attention models

Unlike past attention models we need to attend to a different set of features at each time-step. There are two key issues to address in this setting.

First, although we have different detections in each frame, they can be connected across the frames through an object tracking method. This could lead to better feature representation of the players.

Second, player attention depends on the state of the event and needs to evolve with the event. For instance, during the start of a “free-throw” it is important to attend to the player making the shot. However, towards the end of the event the success or failure of the shot can be judged by observing the person in possession of the ball.

With these issues in mind, we first present our model which uses player tracks and learns a BLSTM based representation for each player track. We then also present a simple tracking-free baseline model.

Attention model with tracking. We first associate the detections belonging to the same player into tracks using a standard method. We use a KLT tracker combined with bipartite graph matching to perform the data association.

The player tracks can now be used to incorporate context from adjacent frames while computing their representation. We do this through a separate BLSTM which learns a latent representation for each player at a given time-step. The latent representation of player $i$ in frame $t$ is given by the hidden state $h_{ti}^p$ of the BLSTM across the player-track:
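
One way to write this, assuming the track BLSTM takes the hidden states of the same player in the neighboring frames together with the current player feature $p_{ti}$ (the name $\mathrm{BLSTM}_{\mathrm{track}}$ is notation assumed here), is:

```latex
\begin{equation}
h_{ti}^p = \mathrm{BLSTM}_{\mathrm{track}}\left(h_{t-1,i}^p,\; h_{t+1,i}^p,\; p_{ti}\right) \tag{4}
\end{equation}
```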

At every time-step we want the most relevant player at that instant to be chosen. We achieve this by computing $a_t$ as a convex combination of the player representations at that time-step:
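
A convex combination of this kind can be written as below. The weight symbol $\gamma_{ti}^{\mathrm{track}}$ and the exact inputs to $\phi$ (the global context, the player representation and the previous event state) are assumptions made here, chosen to agree with the statement that attention depends on the state of the event:

```latex
\begin{equation}
a_t^{\mathrm{track}} = \sum_{i=1}^{N_t} \gamma_{ti}^{\mathrm{track}}\, h_{ti}^p,
\qquad
\gamma_{ti}^{\mathrm{track}} = \mathrm{softmax}\left(\phi\left(h_t^f,\; h_{ti}^p,\; h_{t-1}^e\right);\ \tau\right) \tag{5}
\end{equation}
```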

where $N_t$ is the number of detections in frame $t$, and $\phi(\cdot)$ is a multi-layer perceptron. $\tau$ is the softmax temperature parameter. This attended player representation is input to the unidirectional event recognition LSTM in Eq. 2. This model is illustrated in Figure 3.
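
The attention step can be illustrated with a small plain-Python sketch. It takes one scalar relevance score per player, standing in for the output of the multi-layer perceptron, and returns the attended feature as a weighted sum. Dividing the scores by the temperature before the softmax is an assumption about how the temperature enters, and the names `softmax` and `attend` are hypothetical.

```python
import math


def softmax(scores, tau=0.25):
    """Temperature softmax; scores are divided by tau (an assumption)."""
    scaled = [s / tau for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def attend(player_feats, scores, tau=0.25):
    """Return (a_t, weights) for one frame.

    player_feats: list of N_t feature vectors (lists of equal length).
    scores: list of N_t scalars, standing in for the MLP outputs.
    """
    weights = softmax(scores, tau)
    dim = len(player_feats[0])
    a_t = [sum(w * f[d] for w, f in zip(weights, player_feats))
           for d in range(dim)]
    return a_t, weights
```

Under this assumption, a smaller temperature concentrates the weights on the highest-scoring player, which pushes the convex combination towards selecting a single player per frame.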

Attention model without tracking. Often, tracking people in a crowded scene can be very difficult due to occlusions and fast movements. In such settings, it is beneficial to have a tracking-free model. This could also allow the model to be more flexible in switching attention between players as the event progresses. Motivated by this, we present a model where the detections in each frame are considered to be independent from other frames.

We compute the (no track) attention based player feature as shown below:
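
A form that mirrors the tracking-based attention, with the raw detection feature $p_{ti}$ used directly, is given below. The weight symbol $\gamma_{ti}^{\mathrm{notrack}}$ and the inputs to $\phi$ are assumptions made here:

```latex
\begin{equation}
a_t^{\mathrm{notrack}} = \sum_{i=1}^{N_t} \gamma_{ti}^{\mathrm{notrack}}\, p_{ti},
\qquad
\gamma_{ti}^{\mathrm{notrack}} = \mathrm{softmax}\left(\phi\left(h_t^f,\; p_{ti},\; h_{t-1}^e\right);\ \tau\right) \tag{6}
\end{equation}
```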

Note that this is similar to the tracking based attention equations except for the direct use of the player detection feature $p_{ti}$ in place of the BLSTM representation $h_{ti}^p$.

Experimental evaluation

In this section, we present three sets of experiments on the NCAA basketball dataset: 1. event classification, 2. event detection and 3. evaluation of attention.

We used a hidden state dimension of 256 for all the LSTM and BLSTM RNNs, and an embedding layer with ReLU non-linearity and 256 dimensions for embedding the player features and frame features before feeding to the RNNs. We used $32\times 32$ bins with spatial pyramid pooling for the player location feature. All the event video clips were four seconds long and subsampled to 6fps. The $\tau$ value was set to 0.25 for the attention softmax weighting. We used a batch size of 128, and a learning rate of 0.005 which was reduced by a factor of 0.1 every 10000 iterations with RMSProp. The models were trained on a cluster of 20 GPUs for 100k iterations over one day. The hyperparameters were chosen by cross-validating on the validation set.

Event classification

In this section, we compare the ability of methods to classify isolated video-clips into 11 classes. We do not use any additional negatives from other parts of the basketball videos. We compare our results against different control settings and baseline models explained below:

IDT. We use the publicly available implementation of dense trajectories with Fisher encoding.

IDT player. We use IDT along with averaged features extracted from the player bounding boxes.

C3D. We use the publicly available pre-trained model for feature extraction with an SVM classifier.

LRCN. We use an LRCN model with frame-level features. However, we use a BLSTM in place of an LSTM. We found this to improve performance. Also, we do not back-propagate into the CNN extracting the frame-level features to be consistent with our model.

MIL. We use a multi-instance learning method to learn bag (frame) labels from the set of player features.

Only player. We only use our player features from Sec. 4.1 in our model without frame-level features.

Avg. player. We combine the player features by simple averaging, without using attention.

Attention no track. Our model without tracks (Eq. 6).

Attention with track. Our model with tracking (Eq. 5).

The mean average precision (mAP) for each setting is shown in Tab. 2. We see that the method that uses both global information and local player information outperforms the model only using local player information (“Only player”) and only using global information (“LRCN”). We also show that combining the player information using a weighted sum (i.e., an attention model) is better than uniform averaging (“Avg. player”), with the tracking based version of attention slightly better than the track-free version. Also, a standard weakly-supervised approach such as MIL seems to be less effective than any of our modeling variants.

The performance varies by class. In particular, performance is much poorer (for all methods) for classes such as “slam dunk fail” for which we have very little data. However, performance is better for shot-based events like “free-throw”, “layups” and “3-pointers”, where attending to the shot-making person or defenders can be useful.

Event detection

In this section, we evaluate the ability of methods to temporally localize events in untrimmed videos. We use a sliding window approach, where we slide a 4 second window through all the basketball videos and try to classify the window into a negative class or one of the 11 event classes. We use a stride length of 2 seconds. We treat all windows which do not overlap more than 1 second with any of the 11 annotated events as negatives. We use the same setting for training, test and validation. This leads to 90200 negative examples across all the videos. We compare with the same baselines as before. However, we were unable to train the MIL model due to computational limitations.
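
The negative-window rule can be sketched in plain Python as follows. The sketch enumerates 4 second windows at a 2 second stride and keeps those whose temporal overlap with every annotated event is at most 1 second. The function names and the representation of events as `(start, end)` pairs in seconds are assumptions for illustration; how windows with larger but partial overlap are labeled is not specified in the text and is left out.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def negative_windows(video_len_s, events, win=4.0, stride=2.0,
                     max_overlap=1.0):
    """Return (start, end) windows treated as negatives.

    events: list of (start_s, end_s) pairs for the annotated events.
    A window is a negative if it overlaps no event by more than
    max_overlap seconds.
    """
    negatives = []
    start = 0.0
    while start + win <= video_len_s:
        end = start + win
        if all(overlap(start, end, s, e) <= max_overlap for s, e in events):
            negatives.append((start, end))
        start += stride
    return negatives
```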

The detection results are presented in Tab. 3. We see that, as before, the attention models beat previous state of the art methods. Not surprisingly, all methods are slightly worse at temporal localization than for classifying isolated clips. We also note a significant difference in classification and detection performance for “steal” in all methods. This can be explained by the large number of negative instances introduced in the detection setting. These negatives often correspond to players passing the ball to each other. The “steal” event is quite similar to a “pass” except that the ball is passed to a player of the opposing team. This makes the “steal” detection task considerably more challenging.

Analyzing attention

We have seen above that attention can improve the performance of the model at tasks such as classification and detection. Now, we evaluate how accurate the attention models are at identifying the key players. (Note that the models were never explicitly trained to identify key players).

To evaluate the attention models, we labeled the player who was closest (in image space) to the ball as the “shooter”. (The ball location is annotated in 850 test clips.) We used these annotations to evaluate if our “attention” scores were capable of classifying the “shooter” correctly in these frames.
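
The labeling rule for the “shooter” can be sketched as a nearest-box lookup. The sketch below measures the distance from the annotated ball position to the center of each player bounding box; using the box center is an assumption, since the text only says closest in image space, and the function name `closest_player` and the `(x_min, y_min, x_max, y_max)` box layout are hypothetical.

```python
import math


def closest_player(ball_xy, boxes):
    """Return the index of the player box nearest to the ball.

    ball_xy: (x, y) ball position in image coordinates.
    boxes: list of (x_min, y_min, x_max, y_max) player bounding boxes.
    Distance is measured to the box center (an assumption).
    """
    bx, by = ball_xy

    def dist(box):
        cx = (box[0] + box[2]) / 2.0
        cy = (box[1] + box[3]) / 2.0
        return math.hypot(cx - bx, cy - by)

    return min(range(len(boxes)), key=lambda i: dist(boxes[i]))
```

The index returned for each annotated frame can then serve as the ground-truth label against which the per-player attention scores are ranked.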

The mean AP for this “shooter” classification is listed in Tab. 4. The results show that the track-free attention model is quite consistent in picking the shooter for several classes like “free-throw succ./fail”, “layup succ./fail.” and “slam dunk succ.”. This is a very promising result which shows that attention on player detections alone is capable of localizing the player making the shot. This could be a useful cue for providing more detailed event descriptions including the identity and position of the shooter as well.

In addition to the above quantitative evaluation, we also inspected the attention masks visually. Figure 4 shows sample videos. In order to make results comparable across frames, we annotated 5 points on the court and aligned all the attended boxes for an event to one canonical image. Fig. 5 shows a heatmap visualizing the spatial distributions of the attended players with respect to the court. It is interesting to note that our model consistently focuses under the basket for a layup, at the free-throw line for free-throws and outside the 3-point ring for 3-pointers.

Another interesting observation is that the attention scores for the tracking based model are less selective in focusing on the shooter. We observed that the tracking model is often reluctant to switch attention between frames and focuses on a single player throughout the event. This biases the model towards players who are present throughout the video. For instance, in free-throws (Fig. 6) the model always attends to the defender at a specific position, who is visible throughout the entire event unlike the shooter.

Conclusion

We have introduced a new attention based model for event classification and detection in multi-person videos. Apart from recognizing the event, our model can identify the key people responsible for the event without being explicitly trained with such annotations. Our method can generalize to any multi-person setting. However, for the purpose of this paper we introduced a new dataset of basketball videos with dense event annotations and compared our performance with state-of-the-art methods on this dataset. We also evaluated the ability of our model to recognize the “shooter” in the events with visualizations of the spatial locations attended by our model.
