Hierarchical Deep Temporal Models for Group Activity Recognition

Mostafa S. Ibrahim, Srikanth Muralidharan, Zhiwei Deng, Arash Vahdat, Greg Mori

Introduction

We could describe the action that is happening in Figure 1 at numerous levels of abstraction. For instance, we could describe the scene in terms of what each individual player is doing. This task of person-level action recognition is an important component of visual understanding. At another level of detail, we could instead ask what overarching group activity is depicted. For example, this frame could be labeled as “right team setting.” In this paper, we focus on this higher-level group activity task, devising methods for classifying a video according to the activity that is being performed by the group as a whole.

Human activity recognition is a challenging computer vision problem and has received a lot of attention from the research community. It is a challenging problem due to factors such as the variability within action classes, background clutter, and similarity between different action classes, to name a few. Group activity recognition finds a lot of applications in the context of video surveillance, sport analytics, video search and retrieval. A particular challenge of group activity recognition is the fact that the inference of the label for a scene can be quite sensitive to context. For example, in the volleyball scene shown in Fig. 1, the group activity hinges on the action of one key individual who is performing the “setting” action – though other people in the scene certainly provide helpful information to resolve ambiguity. In contrast, for group activity categories such as “talking” or “queuing” (e.g. Fig. 4), the group activity label depends on the actions of many inter-related people in a scene. As such, successful models likely require the ability to aggregate information across the many people present in a scene and make distinctions utilizing all of this information.

Spatio-temporal relations among the people in the scene have been at the crux of several past approaches to group activity recognition. The literature shows that spatio-temporal appearance/motion properties of an individual and their relations can discern which group activity is present. A volume of research has explored models for this type of reasoning. These approaches utilize underlying person-level action recognition based on hand-crafted feature representations, including histograms of oriented gradients (HOG) or motion boundary histograms (MBH), computed in both a dense and a sparse fashion. However, since they rely on shallow hand-crafted feature representations, they are limited in their ability to model a complex learning task. Similarly, the higher-level group activity recognition models rely on probabilistic or discriminative models built from relatively limited components.

On the other hand, deep representations have overcome this limitation and yielded state of the art results in several computer vision benchmarks. A naive approach to group activity recognition with a deep model would be to simply treat an image as a holistic input. One could train a model to classify this image according to the group activity taking place. However, it isn’t clear if this will work given the redundancy in the training data: with a dataset of volleyball videos, frames will be dominated by features of volleyball courts. As a result, we might have a dominant set of features from the non-discriminative, uninteresting regions in the frame that are common to multiple classes. This might result in poorer performance of the classifier.

The inter-class distinctions in group activity recognition arise from the variations in spatio-temporal relations between people, beyond just global appearance. Utilizing a deep model to learn invariance to translation, to focus on the relations between people, presents a significant challenge to the learning algorithm. Similar challenges exist in the object recognition literature, and research often focuses on designing pooling operators for deep networks that enable the network to learn effective classifiers.

Group activity recognition presents a similar challenge – appropriate networks need to be designed that allow the learning algorithm to focus on differentiating higher-level classes of activities. A simple solution to come up with such a representation is to have a layered approach in which each layer focuses on a subset of the image, and a given layer collects the information learnt from its previous layer to learn the higher level information. Hence, we develop a novel hierarchical deep temporal model. This consists of one dedicated layer which reasons about individual people and a second higher level layer that collects the information from the previous layer and learns discriminative frame level information for group activity recognition.

Our method starts with a set of detected and tracked people. We use temporal deep networks (LSTMs) to analyze each individual person. These person-level LSTMs are aggregated over the people in a scene into a higher level deep temporal model. This allows the deep model to learn the relations between the people (and their appearances) that contribute to recognizing a particular group activity. Through this work we show that we can use LSTMs as a plausible deep learning alternative to the graphical models previously used for this task.

The contribution of this paper is this novel deep architecture, which models group activities in a principled structured temporal framework. Our 2-stage approach models individual person activities in its first stage, and then combines person-level information to represent group activities. The model’s temporal representation is based on the long short-term memory (LSTM): recurrent neural networks such as these have recently demonstrated successful results in sequential tasks such as image captioning and speech recognition. Through the model structure, we aim to construct a representation that leverages the discriminative information in the hierarchical structure between individual person actions and group activities.

We show that our algorithm works in two scenarios. First, we demonstrate performance on the Collective Activity Dataset, a surveillance-type video dataset. We also propose a new Volleyball Dataset that offers person detections, together with both person action labels and group activity labels. Experimentally, the model is effective in recognizing the overall team activity based on recognizing and integrating player actions.

This paper builds upon a previous version of this work. Here, we present a modified model for alternative pooling structures, an enlarged Volleyball Dataset, and additional empirical evaluations and analyses.

This paper is organized as follows. In Section 2, we provide a brief overview of the literature related to activity recognition. In Section 3, we elaborate details of the proposed group activity recognition model. In Section 4, we tabulate the performance of the approach, and end in Section 5 with a conclusion of this work.

Related Work

Human activity recognition is an active area of research, with many existing algorithms. Surveys by Weinland et al. and Poppe explore the vast literature in activity recognition. Here, we will focus on the group activity recognition problem and recent related advances in deep learning.

Group activity recognition has attracted a large body of work recently. Most previous work has used hand-crafted features fed to structured models that represent information between individuals in space and/or time domains. For example, Choi et al. craft spatio-temporal feature representations of relative human actions. Lan et al. proposed an adaptive latent structure learning that represents hierarchical relationships ranging from lower person-level information to higher group-level interactions.

Lan et al. and Ramanathan et al. explore the idea of social roles, the expected behaviour of an individual person in the context of a group, in fully supervised and weakly supervised frameworks respectively. Lan et al. map the features defined on individuals to group activity by constructing a hierarchical model consisting of individual action, role-based unary components, pairwise roles, and scene-level group activities. The interactions and unary roles/activities are represented using an undirected graphical model. The parameters of this model are learnt using a structured SVM formulation in a max-margin framework, and the model operates in a fully supervised setting.

Ramanathan et al. define a CRF-based social role model under a weakly supervised setting. To learn model parameters and role labels, a joint variational inference procedure is adopted. HOG3D, spatio-temporal features, object interaction features, and social role features are used as unary component representations. A subsequent layer consisting of pairwise spatio-temporal interaction features is used to refine the noisy unary component features. Finally, variational inference is used to learn the unknown role labels and model parameters.

Choi and Savarese unified tracking multiple people, recognizing individual actions, interactions and collective activities in a joint framework. The model is based on the premise that strong correlation exists between an individual’s activity, and the activities of the other nearby people. Following this intuition, they come up with a hierarchical structure of activity types that maps the individual activity to overall group activity. In this process, they simultaneously track atomic activities, interactions and overall group activities. The parameters of this model (and the inference) are learnt by combining belief propagation with the branch and bound algorithm.

Chang et al. employ a probabilistic grouping strategy to perform high level recognition tasks happening in the scene. Specifically, group structure is determined by soft grouping structures to facilitate the representation of dynamics present in the scene. Secondly, they also use a probabilistic motion analysis to extract interesting spatio-temporal patterns for scenario recognition. Vascon et al. detect conversational groups in crowded scenes of people. The approach uses pairwise affinities between people based on pose and a game-theoretic clustering procedure.

In other work, a random forest structure is used to sample discriminative spatio-temporal regions from input video, which are fed to a 3D Markov random field to localize collective activities in a scene. Shu et al. detect group activities from aerial video using an AND-OR graph formalism. The above-mentioned methods use shallow hand-crafted features, and typically adopt a linear model that suffers from representational limitations.

Sport Video Analysis

Computer vision-based analysis of sports video is a burgeoning area for research, with many recent papers and workshops focused on this topic. Work on sports video analysis has spanned a range of topics from individual player detection, tracking, and action recognition, to player-player interactions, to team-level activity classifications. Much work spans many of these taxonomy elements, including the seminal work of Intille and Bobick, who examined stochastic representations of American football plays.

Player tracking: Nillius et al. link player trajectories to maintain identities via reasoning in a Bayesian network formulation. Morariu et al. track players, infer part locations, and reason about temporal structure in 1-on-1 basketball games. In Soomro et al., a graph-based optimization technique is applied to the task of tracking in broadcast soccer videos, where a disjoint temporal sequence of soccer videos is present. They first extract panoramic view video clips, and subsequently detect and track multiple players with a two-step bipartite matching algorithm. Bo et al. introduced a novel approach to scale and rotation invariant tracking of human body parts. They use a dynamic programming based approach that optimizes the assembly of body part region proposals, given spatio-temporal constraints under a loopy body part graph construction, to enable scale and rotation invariance.

Actions and player roles: Turchini et al. perform activity recognition by first obtaining dense trajectories, clustering them, and finally employing a cluster set kernel to learn action representations. Kwak et al. optimize based on a rule-based depiction of interactions between people.

Wei et al. compute a role-ordered feature representation to predict the ball owner at each time instance in a given video. They start from the annotated positions of each player, permute them, and obtain the feature representation ordered by relative position (called a role) with respect to other players.

Team activities: Siddiquie et al. proposed sparse multiple kernel learning to select features incorporated in a spatio-temporal pyramid. In Bialkowski et al., two detection-based representations, based on a team occupancy map and a team centroid map respectively, are shown to effectively detect team activities in field hockey videos. First, players are detected in each of the eight camera views that are used, and then team-level aggregations are computed after classifying each player into one of the two teams. Finally, using these aggregated representations, team activity labels are computed.

Atmosukarto et al. define an automated approach for recognizing offensive team formation in American football. First, the frame pertaining to the offensive team formation is identified; next, the line of scrimmage is obtained; finally, the team formation label is obtained by learning an SVM classifier on top of the offensive team side’s features inferred using the line of scrimmage. Direkoglu and O’Connor solved a Poisson equation to generate a holistic player location representation. Swears et al. used the Granger Causality statistic to automatically constrain the temporal links of a Dynamic Bayesian Network (DBN) for handball videos.

In Gade et al., player occupancy heat maps are employed to handle sport type classification. People are first detected, and the occupancy maps are obtained by summing their locations over time. Finally, a sport type classifier is trained on top of Fisher vector representations of the heat maps to infer the sport type taking place in a test scene.

Deep Learning

Deep Convolutional Neural Networks (CNNs) have shown impressive performance by unifying feature and classifier learning, enabled by the availability of large labeled training datasets. Successes have been demonstrated on a variety of computer vision tasks including image classification and action recognition. More flexible recurrent neural network (RNN) based models are used for handling variable length space-time inputs. Specifically, LSTM models are popular among RNN models due to the tractable learning framework that they offer when it comes to deep representations. These LSTM models have been applied to a variety of tasks.

For instance, in Donahue et al., the so-called Long-term Recurrent Convolutional Network, formed by stacking an LSTM on top of pre-trained CNNs, is proposed for handling sequential tasks such as activity recognition, image description, and video description. In this work, they showed that it is possible to jointly train LSTMs along with convolutional networks and achieve results comparable to the state of the art for time-varying tasks. For example, in video captioning, they first construct a semantic representation of the video using maximum a posteriori estimation of a conditional random field. This is then used to construct a natural sentence using LSTMs.

In Karpathy et al., structured objectives are used to align CNNs over image regions and bi-directional RNNs over sentences. A deep multi-modal RNN architecture is used for generating image descriptions using the deduced alignments. In the first stage, words and image regions are embedded onto an alignment space. Image regions are represented by RCNN embeddings, and words are represented using bi-directional recurrent neural network embeddings. In the second stage, using the image regions and textual snippets, or full image and sentence descriptions, a generative model based on an RNN is constructed that outputs a probability map of the next word.

In this work, we aim at building a hierarchical structured model that incorporates a deep LSTM framework to recognize individual actions and group activities. Previous work in the area of deep structured learning includes Tompson et al. for pose estimation, and Zheng et al. and Schwing et al. for semantic image segmentation.

In Deng et al., a similar framework is used for group activity recognition, where a neural network-based hierarchical graphical model refines person action labels and learns to predict the group activity simultaneously. While these methods use neural network-based graphical representations, in our current approach, we leverage LSTM-based temporal modelling to learn discriminative information from time-varying sports activity data. In other work, a new dataset is introduced that contains dense multiple labels per frame for the underlying actions, and a novel Multi-LSTM is used to model the temporal relations between labels present in the dataset. Ramanathan et al. develop LSTM-based methods for analyzing sports videos, using an attention mechanism to determine who is the principal actor in a scene. In a sense, this work is complementary to our pooling-based models that represent aggregations of all people involved in a group activity.

Datasets

Popular datasets for activity recognition include the Sports-1M dataset, the UCF 101 database, and the HMDB movie database. These datasets were part of a shift in focus toward unconstrained Internet videos as a domain for action recognition research. These datasets are challenging because they contain substantial intra-class variation and clutter, both in terms of extraneous background objects and varying temporal duration of the action of interest. However, these datasets tend to focus on individual human actions, as opposed to the group activities we consider in our work.

Scenes involving multiple, potentially interacting people present significant challenges. In the context of surveillance video, the TRECVid Surveillance Event Detection, UT-Interaction, VIRAT, and UCLA Courtyard datasets are examples of challenging tasks including individual and pairwise interactions.

Datasets for analyzing group activities include the Collective Activity Dataset. This dataset consists of real world pedestrian sequences where the task is to find the high level group activity. The S-HOCK dataset focuses on crowds of spectators and contains more than 100 million annotations ranging from person body poses to actions to social relations among spectators. In this paper, we experiment with the Collective Activity Dataset, and also introduce a new dataset for group activity recognition in sport footage which is annotated with player pose, location, and group activities. The dataset is available for download: https://github.com/mostafa-saad/deep-activity-rec.

Proposed Approach

Our goal in this paper is to recognize activities performed by a group of people in a video sequence. The input to our method is a set of tracklets of the people in a scene. The group of people in the scene could range from players in a sports video to pedestrians in a surveillance video. In this paper we consider three cues that can aid in determining what a group of people is doing:

Person-level actions collectively define a group activity. Person action recognition is a first step toward recognizing group activities.

Temporal dynamics of a person’s action is higher-order information that can serve as a strong signal for group activity. Knowing how each person’s action is changing over time can be used to infer the group’s activity.

Temporal evolution of group activity represents how a group’s activity is changing over time. For example, in a volleyball game a team may move from defence phase to pass and then attack.

Many classic approaches to the group activity recognition problem have modeled these elements in a form of structured prediction based on hand-crafted features. Inspired by the success of deep learning based solutions, in this paper, a novel hierarchical deep learning based model is proposed that is potentially capable of learning low-level image features, person-level actions, their temporal relations, and temporal group dynamics in a unified end-to-end framework.

Given the sequential nature of group activity analysis, our proposed model is based on a Recurrent Neural Network (RNN) architecture. RNNs consist of non-linear units with internal states that can learn dynamic temporal behavior from a sequential input with arbitrary length. Therefore, they overcome the limitation of CNNs that expect constant length input. This makes them widely applicable to video analysis tasks such as activity recognition.

Our model is inspired by the success of hierarchical models. Here, we aim to mimic a similar intuition using recurrent networks. We propose a deep model by stacking several layers of RNN-type structures to model a range of low-level to high-level dynamics defined on top of people and entire groups. Fig. 2 provides an overview of our model. We describe the use of these RNN structures for individual and group activity recognition next.

Given tracklets of each person in a scene, we use long short-term memory (LSTM) models to represent temporally the action of each individual person. Such temporal information is complementary to spatial features and is critical for performance. LSTMs, originally proposed by Hochreiter and Schmidhuber, have been used successfully for many sequential problems in computer vision. Each LSTM unit consists of several cells with memory that stores information for a short temporal interval. The memory content of an LSTM makes it suitable for modeling complex temporal relationships that may span a long time range.

The content of the memory cell is regulated by several gating units that control the flow of information in and out of the cells. The control they offer also helps in avoiding spurious gradient updates that can typically happen in training RNNs when the length of a temporal input is large. This property enables us to stack a large number of such layers in order to learn complex dynamics present in the input in different ranges.

We use a deep Convolutional Neural Network (CNN) to extract features from the bounding box around the person at each time step of a person trajectory. The output of the CNN, denoted by $x_t$, can be considered a complex image-based feature describing the spatial region around a person. Taking $x_t$ as the input of an LSTM cell at time $t$, the cell activation follows the standard LSTM formulation.
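For reference, the standard LSTM cell update can be written as below. This is a sketch in commonly used notation rather than a transcription of a specific implementation: the weight matrices $W_{\cdot}$, the biases $b_{\cdot}$, and the gate names $i_t$, $f_t$, $o_t$, $g_t$ are assumed symbols, $\sigma$ is the logistic sigmoid, $\phi$ is the hyperbolic tangent, and $\odot$ is element-wise multiplication.

```latex
\begin{align}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
g_t &= \phi(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \phi(c_t)
\end{align}
```

In this notation $x_t$ is the input, $h_t$ the hidden state, and $c_t$ the memory cell, while $i_t$, $f_t$, $o_t$, and $g_t$ play the roles of input gate, forget gate, output gate, and input modulation gate at time $t$.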

When modeling individual actions, the hidden state $h_t$ could be used to model the action a person is performing at time $t$. Note that the cell output evolves over time based on the past memory content. Due to the deployment of gates on the information flow, the hidden state will be formed based on a short-range memory of the person’s past behaviour. Therefore, we can simply pass the output of the LSTM cell at each time step to a softmax classification layer (more precisely, a fully connected layer fed to a softmax loss layer) to predict the individual person-level action for each tracklet.

The LSTM layer on top of person trajectories forms the first stage of our hierarchical model. This stage is designed to model person-level actions and their temporal evolution. Our training proceeds in a stage-wise fashion, first training to predict person-level actions, and then passing the hidden states of the LSTM layer to the second stage for group activity recognition, as discussed in the next section.
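The person-level stage can be illustrated with a small NumPy sketch: one LSTM runs over the per-frame features of a single tracklet, and a fully connected layer followed by softmax scores the action at every time step. This is a toy illustration under stated assumptions, not the Caffe implementation: the function names, the stacked gate layout `[i, f, o, g]`, the random weights, and the tiny dimensions are all hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4N, D), U: (4N, N), b: (4N,).

    The four gates are stacked as [input, forget, output, modulation]
    (an assumed layout chosen for this sketch).
    """
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:n])
    f = sigmoid(z[n:2 * n])
    o = sigmoid(z[2 * n:3 * n])
    g = np.tanh(z[3 * n:4 * n])
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t


def person_action_probs(features, W, U, b, W_cls, b_cls):
    """Run the LSTM over one tracklet and classify the action at each step."""
    n = U.shape[1]
    h, c = np.zeros(n), np.zeros(n)
    probs = []
    for x_t in features:  # features: (T, D), one CNN feature per time step
        h, c = lstm_step(x_t, h, c, W, U, b)
        probs.append(softmax(W_cls @ h + b_cls))
    return np.stack(probs)  # (T, number of actions)


# Toy sizes (hypothetical): 8-d features, 4 hidden units, 3 actions, 5 steps.
rng = np.random.default_rng(0)
D, N, A, T = 8, 4, 3, 5
W = rng.normal(scale=0.1, size=(4 * N, D))
U = rng.normal(scale=0.1, size=(4 * N, N))
b = np.zeros(4 * N)
W_cls = rng.normal(scale=0.1, size=(A, N))
b_cls = np.zeros(A)
tracklet = rng.normal(size=(T, D))

action_probs = person_action_probs(tracklet, W, U, b, W_cls, b_cls)
print(action_probs.shape)  # (5, 3)
```

In the two-stage model it is the hidden states `h`, rather than the class probabilities, that would be handed to the group-level stage.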

Hierarchical Model for Group Activity Recognition

At each time step, the memory content of the first LSTM layer contains discriminative information describing the subject’s action as well as past changes in his action. If the memory content is collected over all people in the scene, it can be used to describe the group activity in the whole scene.

Moreover, it can also be observed that direct image-based features extracted from the spatial domain around a person carry a discriminative signal for the current activity. Therefore, a deep CNN model is used to extract complex features for each person in addition to the temporal features captured by the first LSTM layer.

At this point, the concatenation of the CNN features and the LSTM layer output represents the temporal features for a person. Various pooling strategies can be used to aggregate these features over all people in the scene at each time step. The output of the pooling layer forms our representation for the group activity. The second LSTM network, working on top of this temporal representation, is used to directly model the temporal dynamics of group activity. The LSTM layer of the second network is directly connected to a classification layer in order to detect group activity classes in a video sequence.

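Mathematically, the pooling layer can be expressed as follows, where $K$ is the number of people in the scene:

$$P_{tk} = x_{tk} \oplus h_{tk}$$

$$Z_t = P_{t1} \diamond P_{t2} \diamond \cdots \diamond P_{tK}$$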

In this equation, $h_{tk}$ corresponds to the first-stage LSTM output, and $x_{tk}$ corresponds to the AlexNet fc7 feature, both obtained for the $k$-th person at time $t$. We concatenate these two features (concatenation is denoted by $\oplus$) to obtain the temporal feature representation $P_{tk}$ for the $k$-th person. We then construct the frame-level feature representation $Z_t$ at time $t$ by applying a max pooling operation (denoted by $\diamond$) over the features of all the people. Finally, we feed the frame-level representation to our second LSTM stage, which operates similarly to the person-level LSTMs described in the previous subsection, and learn the group-level dynamics. $Z_t$, passed through a fully connected layer, is given as input to the second-stage LSTM layer. The hidden state of this LSTM layer, denoted by $h_t^{group}$, carries temporal information for the whole group dynamics. $h_t^{group}$ is fed to a softmax classification layer to predict group activities.
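The concatenate-then-max-pool step for a single frame can be sketched in a few lines of NumPy. The function name and the tiny feature sizes are hypothetical, chosen only to keep the example readable; real fc7 and LSTM features are far larger.

```python
import numpy as np


def frame_representation(fc7, lstm_hidden):
    """Build the frame-level feature for one time step.

    fc7:         (K, Dx) CNN features, one row per person
    lstm_hidden: (K, Dh) first-stage LSTM outputs, one row per person
    Returns a vector of length Dx + Dh.
    """
    # Per-person temporal feature: concatenation of the two descriptors.
    person_feats = np.concatenate([fc7, lstm_hidden], axis=1)
    # Element-wise max over people gives the frame-level representation.
    return person_feats.max(axis=0)


# Three people, 2-d "fc7" features and 1-d "LSTM" outputs (toy values).
fc7 = np.array([[0.1, 0.9],
                [0.5, 0.2],
                [0.3, 0.4]])
lstm_hidden = np.array([[1.0],
                        [-1.0],
                        [0.0]])

z_t = frame_representation(fc7, lstm_hidden)
print(z_t.shape)  # (3,)
```

Because the maximum is taken independently per dimension, the result does not depend on the order or the number of people, which is what lets the same second-stage LSTM handle scenes of varying size.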

Handling sub-groups

In team sports, there might be several sub-groups of players with common responsibilities within a team. For example, the front players of a volleyball team are responsible for blocking the ball. Max pooling all players’ representations into a single representation reduces the model’s capabilities (e.g. it causes confusion between left team and right team activities). To address this, we propose a modified model where we split the players into several sub-groups and recognize the team activity based on the concatenation of each sub-group’s representation. In our experiments we consider a set of different possible spatial sub-groupings of players (cf. standard spatial pyramids). Figure 3 illustrates this variant of the model, showing splitting into two team-based groups.

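Mathematically, the pooling layer can be re-expressed as follows:

$$P_{tk} = x_{tk} \oplus h_{tk}$$

$$G_{tm} = P_{tS_m} \diamond P_{t(S_m+1)} \diamond \cdots \diamond P_{tE_m}$$

$$Z_t = G_{t1} \oplus G_{t2} \oplus \cdots \oplus G_{td}$$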

where again, $t$ indexes time, $k$ indexes players, $h_{tk}$ corresponds to the first-stage LSTM output, $x_{tk}$ to fc7 features, and $P_{tk}$ is the temporal feature representation for the player. Assume that the $K$ players are ordered in a list (e.g. based on the top-left point of a bounding box), $d$ is the number of sub-groups and $m$ indexes the groups. $S_m$ and $E_m$ are the start and end positions of the players in the $m$-th group. $G_{tm}$ is the $m$-th group representation: a max pooling over the representations of all players in this group. $Z_t$ is the frame-level feature representation constructed by concatenating (denoted by $\oplus$) the $d$ sub-group representations.
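A minimal NumPy sketch of the sub-group variant follows. Two choices here are assumptions made for illustration and are not specified by the text: players are ordered by the left x-coordinate of their bounding boxes only, and the ordered list is cut into `num_groups` contiguous, near-equal chunks (so the start and end positions come from `np.array_split`). The function name is hypothetical.

```python
import numpy as np


def subgroup_pooling(person_feats, box_left_x, num_groups):
    """Max-pool within spatial sub-groups, then concatenate the groups.

    person_feats: (K, D) per-person features for one frame
    box_left_x:   length-K left x-coordinates used to order the players
    num_groups:   number of sub-groups d (1 <= d <= K)
    Returns a vector of length d * D.
    """
    person_feats = np.asarray(person_feats, dtype=float)
    if not 1 <= num_groups <= len(person_feats):
        raise ValueError("num_groups must be between 1 and the number of players")
    order = np.argsort(box_left_x, kind="stable")  # left-to-right ordering
    ordered = person_feats[order]
    groups = np.array_split(ordered, num_groups, axis=0)  # contiguous chunks
    pooled = [g.max(axis=0) for g in groups]  # one representation per group
    return np.concatenate(pooled)


# Four players with 2-d features (toy values), split into two groups.
feats = [[1, 0], [0, 2], [5, 1], [3, 4]]
left_x = [30, 10, 200, 150]
z_t = subgroup_pooling(feats, left_x, num_groups=2)
print(z_t.shape)  # (4,)
```

With `num_groups=1` this reduces to the single max pooling over all players, so the original model is a special case of the sub-group variant.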

Implementation Details

We trained our model in two steps. In the first step, the person-level CNN and the first LSTM layer are trained in an end-to-end fashion using a set of training data consisting of person tracklets annotated with action labels. We implement our model using Caffe. Similar to other approaches, we initialize our CNN model with the pre-trained AlexNet network and we fine-tune the whole network for the first LSTM layer.

After training the first LSTM layer, we concatenate the fc7 layer of AlexNet and the LSTM layer for every person and pool over all people in a scene. The pooled features, which correspond to frame level features, are fed to the second LSTM network.

For training all our models, we follow the same training protocol. We use a fixed learning rate of 0.00001 and a momentum of 0.9. For tracking subjects in a scene, we used the tracker by Danelljan et al., implemented in the Dlib library. The baseline models are structured and trained in a similar manner as our two-stage model.
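To make the two hyperparameters concrete, the sketch below applies one update with the common SGD-with-momentum rule (velocity = momentum × velocity − learning rate × gradient). Only the learning rate and the momentum value come from the text; that this particular update rule is the one used, along with the function name and the toy weights and gradients, are assumptions.

```python
def sgd_momentum_step(weights, grads, velocity, lr=1e-5, momentum=0.9):
    """One SGD-with-momentum update on flat lists of parameters."""
    # Blend the previous update direction with the new gradient step.
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    # Move the weights along the updated velocity.
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy parameters and gradients (made-up values for illustration).
weights = [1.0, -2.0]
velocity = [0.0, 0.0]
grads = [10.0, -10.0]

new_weights, new_velocity = sgd_momentum_step(weights, grads, velocity)
print(new_weights, new_velocity)
```

With a learning rate this small, a single step moves each weight by only about 1e-4 for a gradient of magnitude 10, which is typical of fine-tuning a pre-trained network rather than training from scratch.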

Experiments

In this section, we evaluate our model by running ablation studies using several baselines and comparing to previously published works on the Collective Activity Dataset. First, we describe our baseline models for the ablation studies. Then, we present our results on the Collective Activity Dataset followed by experiments on the Volleyball Dataset.

The following baselines are considered in all our experiments in order to assess the contributions of components of our proposed model.

Image Classification: This baseline is the basic AlexNet model fine-tuned for group activity recognition in a single frame.

Person Classification: In this baseline, the AlexNet CNN model is deployed on each person, fc7 features are pooled over all people, and are fed to a softmax classifier to recognize group activities in each single frame.

Fine-tuned Person Classification: This baseline is similar to the previous baseline with one distinction. The AlexNet model on each player is fine-tuned to recognize person-level actions. Then, fc7 is pooled over all players to recognize group activities in a scene without any fine-tuning of the AlexNet model.

The rationale behind this baseline is to examine a scenario where person-level action annotations as well as group activity annotations are used in a deep learning model that does not model the temporal aspect of group activities. This is very similar to our two-stage model without the temporal modeling.

Temporal Model with Image Features: This baseline is a temporal extension of the first baseline. It examines the idea of feeding image-level features directly to an LSTM model to recognize group activities. In this baseline, the AlexNet model is deployed on the whole image and the resulting fc7 features are fed to an LSTM model. This baseline can be considered a reimplementation of Donahue et al.

Temporal Model with Person Features: This baseline is a temporal extension of the second baseline: fc7 features pooled over all people are fed to an LSTM model to recognize group activities.

Two-stage Model without LSTM 1: This baseline is a variant of our model, omitting the person-level temporal model (LSTM 1). Instead, the person-level classification is done only with the fine-tuned person CNN.

Two-stage Model without LSTM 2: This baseline is a variant of our model, omitting the group-level temporal model (LSTM 2). In other words, we do the final classification based on the outputs of the temporal models for individual person action labels, but without an additional group-level LSTM.

Experiments on the Collective Activity Dataset

The Collective Activity Dataset has been widely used for evaluating group activity recognition approaches in the computer vision literature. This dataset consists of 44 videos, eight person-level pose labels (not used in our work), five person-level action labels, and five group-level activities. A scene is assigned a group activity label based on the majority of what people are doing. We follow the train/test split used in prior work. In this section, we present our results on this dataset.
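The majority rule for scene labels can be sketched as a simple vote over the person-level actions in a frame. The function name is hypothetical, the example actions are made up, and the tie-breaking behaviour (the action seen first wins) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter


def group_activity_label(person_actions):
    """Return the most frequent person-level action in a scene."""
    if not person_actions:
        raise ValueError("need at least one person action")
    # most_common(1) yields the (label, count) pair with the highest count;
    # among equal counts, the label encountered first is returned.
    return Counter(person_actions).most_common(1)[0][0]


scene = ["walking", "queuing", "queuing", "talking", "queuing"]
label = group_activity_label(scene)
print(label)  # queuing
```

This counting view of the labels is also why cardinality-based models are a natural fit for this dataset.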

Model details: In the Collective Activity Dataset, 9 timesteps and 3000 hidden nodes are used for the first LSTM layer and a softmax layer is deployed for the classification layer in this stage. The second network consists of a 3000-node fully connected layer followed by a 9-timestep 500-node LSTM layer which is passed to a softmax layer trained to recognize group activity labels.

Ablation studies: In Table I, the classification results of our proposed architecture are compared with the baselines. As shown in the table, our two-stage LSTM model significantly outperforms the baseline models. A comparison can be made between temporal and frame-based counterparts, including B1 vs. B4, B2 vs. B5 and B3 vs. our two-stage model. We observe that adding temporal information using LSTMs improves the performance of these baselines.

Comparison to other methods: Table II compares our method with state-of-the-art methods for group activity recognition. Fig. 4 provides visualizations of example results. The performance of our two-stage model is comparable to the state-of-the-art methods. Note that only Deng et al. is a previously published deep learning model. The cardinality kernel approach, however, outperforms our model. It should be noted that this approach works on hand-crafted features fed to a model highly optimized for a cardinality problem (i.e. counting the number of actions in the scene), which is exactly the way group activities are defined in this dataset.

The confusion matrix obtained for the Collective Activity Dataset using our two-stage model is shown in Figure 7. We observe that the model performs almost perfectly for the talking and queuing classes, and gets confused between crossing, waiting, and walking. Such behaviour is perhaps due to a lack of consideration of spatial relations between people in the group, which has been shown to boost the performance of previous group activity recognition methods: e.g. crossing involves the walking action, but is confined to a path that people follow in an orderly fashion. Therefore, our model, which is designed only to learn the dynamic properties of group activities, often confuses crossing with walking.
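For readers less familiar with this kind of analysis, the sketch below shows one common way to build a row-normalized confusion matrix from ground-truth and predicted labels. It is a generic illustration, not the evaluation code used for Figure 7. The function names are hypothetical and the labels in the toy example are invented.

```python
from typing import List, Sequence


def confusion_matrix(true_labels: Sequence[str],
                     predicted_labels: Sequence[str],
                     classes: Sequence[str]) -> List[List[int]]:
    """Count matrix: rows are true classes, columns are predicted classes."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def row_normalize(matrix: Sequence[Sequence[int]]) -> List[List[float]]:
    """Divide each row by its sum so each row reads as per-class rates."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Toy example with invented labels: one crossing clip is mistaken for walking.
classes = ["crossing", "walking"]
truth = ["crossing", "crossing", "walking", "walking"]
preds = ["crossing", "walking", "walking", "walking"]
counts = confusion_matrix(truth, preds, classes)
rates = row_normalize(counts)
```

Off-diagonal mass in a row, as in the crossing row of this toy example, is what the text refers to as confusion between classes.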

It is clear that our two-stage model improves performance compared to the baselines. The temporal information improves performance. Further, finding and describing the elements of a video (i.e. persons) provides benefits over utilizing frame-level features.

3 Experiments on the Volleyball Dataset

In order to evaluate the performance of our model for team activity recognition on sport footage, we collected a new dataset using publicly available YouTube volleyball videos. We annotated 4830 frames that were handpicked from 55 videos with nine player action labels and eight team activity labels. We used frames from two-thirds of the videos for training, and the remaining one-third for testing. The list of action and activity labels and related statistics are tabulated in Tables III and IV.

From the tables, we observe that the group activity labels are relatively more balanced compared to the player action labels. This follows from the fact that people are more often engaged in static actions like standing than in dynamic actions (setting, spiking, etc.). Therefore, our dataset presents a challenging team activity recognition task, in which the interesting actions that can directly determine the group activity occur rarely. The dataset will be made publicly available at https://github.com/mostafa-saad/deep-activity-rec to facilitate future comparisons.

Model details: The model hyperparameters for the Volleyball Dataset include 5 timesteps and 3000 hidden nodes for the first LSTM layer. The second network uses 10 timesteps and 2000 hidden nodes for the second LSTM layer.

We further experiment with a set of different player sub-grouping approaches for pooling. To find the sub-groups, we follow a simple strategy. First, we order players based on their top-left bounding box point (x-axis first). To split players into two groups (e.g. left/right teams), we consider the first half of the players as group one and the second half as group two. Similarly, to split into four groups, we consider the first quarter of the players as group one, the second quarter as group two, and so on. If players cannot be divided evenly (missing players), the last sub-groups will have fewer players.
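The sub-grouping strategy can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the chunk size is taken to be the ceiling of the player count divided by the number of groups (one way to make the last sub-groups smaller), each sub-group is max pooled and the results are concatenated into a scene vector, and an empty sub-group contributes a zero vector. All function names and toy values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # top-left corner (x, y) of a player's bounding box


def split_into_subgroups(top_left: Sequence[Point], num_groups: int) -> List[List[int]]:
    """Partition player indices into `num_groups` left-to-right sub-groups."""
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    # Order players by top-left corner: x first, y as the tie-breaker.
    order = sorted(range(len(top_left)), key=lambda i: (top_left[i][0], top_left[i][1]))
    # Assumption: chunks of ceil(n / num_groups) players, so earlier groups
    # fill first and the last sub-groups end up with fewer (possibly zero).
    size = math.ceil(len(order) / num_groups)
    return [order[g * size:(g + 1) * size] for g in range(num_groups)]


def pooled_scene_representation(features: Sequence[Sequence[float]],
                                groups: Sequence[Sequence[int]]) -> List[float]:
    """Max-pool features inside each sub-group, then concatenate the results."""
    if not features:
        raise ValueError("need at least one player feature vector")
    dim = len(features[0])
    scene: List[float] = []
    for members in groups:
        if members:
            scene.extend(max(features[i][d] for i in members) for d in range(dim))
        else:
            scene.extend([0.0] * dim)  # assumption: empty group gives zeros
    return scene


# Toy example: five players with made-up corners and 2-D features.
corners = [(50.0, 10.0), (10.0, 20.0), (30.0, 5.0), (70.0, 0.0), (20.0, 0.0)]
features = [[0.5, 0.1], [0.2, 0.9], [0.3, 0.3], [0.8, 0.0], [0.1, 0.4]]
two_groups = split_into_subgroups(corners, 2)
four_groups = split_into_subgroups(corners, 4)
scene_vector = pooled_scene_representation(features, two_groups)
```

Note that the scene vector grows linearly with the number of sub-groups, so the group-level model sees a larger input when more sub-groups are used.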

Ablation studies: In Table V, the classification performance of our proposed model is compared against the baselines. Similar to the results on the Collective Activity Dataset, our two-stage LSTM model outperforms the baseline models.

Moreover, explicitly modeling people is necessary for obtaining better performance on this dataset, since the background changes rapidly due to a fast-moving camera and therefore corrupts the temporal dynamics of the foreground. This can be seen from the performance of our baseline model B4, a temporal model that does not consider people explicitly, which is inferior to that of baseline B1, a non-temporal image classification style model. On the other hand, baseline model B5, a temporal model that explicitly considers people, performs comparably to the image classification baseline, in spite of the problems that arise due to tracking and motion artifacts.

In both datasets, an observation from the tables is that while both LSTMs contribute to overall classification performance, having the first layer LSTM (B7 baseline) is relatively more critical to the performance of the system, compared to the second layer LSTM (B6 baseline).

To further investigate player sub-grouping, in Table VI, we run experiments over 4 sub-groups: left-team-back players, left-team-front players, right-team-back players and right-team-front players.

It seems from the results that more brute-force sub-grouping does not improve the performance of the system for this dataset. This suggests that segregating players on the basis of their position gives more weight to information from static or insignificant players, which results in more confusion and perhaps degrades performance. Therefore, this experiment indicates that not all types of explicit spatio-temporal relation modelling lead to an improvement in performance.

To evaluate the effect of the number of LSTM nodes in the model's two networks, we conducted the set of experiments outlined in Table VII. Similarly, to evaluate the effect of the number of timesteps in the model's two networks, we conducted the set of experiments outlined in Table VIII.

Comparison to other methods: In Table IX, we compare our model to the improved dense trajectory approach. Dense trajectories is a hand-crafted approach that competes strongly with deep learning features. In addition, we created two variations of this approach in which only the trajectories inside the players' bounding boxes are considered, in other words, ignoring background trajectories. The variations emulate the one-group and two-group styles of our model. That is, the first variation represents all players as a single group, while the second represents each team separately and then concatenates the two representations to get the whole scene representation.

The traditional dense trajectories approach and its one-group variation show close performance, but the two-group trajectories variation yields higher performance. This is probably due to the reduction of confusion between the left and right teams' activities. However, our model outperforms these strong dense trajectory-based baseline methods.

Figure 8 shows the confusion matrix obtained for the Volleyball Dataset using our two-stage model by grouping all players (no sub-groups) in one representation using a max pooling operation, similar to . From the confusion matrix, we observe that our model generates accurate high-level action labels. Nevertheless, our model has some confusion between left winpoint and right winpoint activities. In contrast to , the confusion between set and pass activities is resolved, probably due to using more data.

Figure 9 shows the confusion matrix obtained for the Volleyball Dataset using our two-stage model, but with players first sub-grouped into left team and right team. From the confusion matrix, we observe that our model generates more accurate high-level action labels than when using no sub-groups. In addition, the confusion between left winpoint and right winpoint activities is reduced.

In Figure 10, we show the visualizations of our detected activities with different failure and success scenarios.

Conclusion

In this paper, we presented a novel deep structured architecture to deal with the group activity recognition problem. Through a two-stage process, we learn a temporal representation of person-level actions and combine the representation of individual people to recognize the group activity. We created a new Volleyball Dataset to train and test our model, and also evaluated our model on the Collective Activity Dataset. Results show that our architecture can improve upon baseline methods lacking hierarchical consideration of individual and group activities using deep learning.

Acknowledgements

This work was supported by grants from NSERC and Disney Research.

References