Joint Inference of Groups, Events and Human Roles in Aerial Videos

Tianmin Shu, Dan Xie, Brandon Rothrock, Sinisa Todorovic, Song-Chun Zhu

Introduction

Video surveillance of large spatial areas using unmanned aerial vehicles (UAVs) is becoming increasingly important in a wide range of civil, military and homeland security applications. For example, identifying suspicious human activities in aerial videos has the potential to save human lives and prevent catastrophic events. Yet, there is scant prior work on aerial video analysis, which for the most part is focused on tracking people and vehicles (with few exceptions) in relatively sanitized settings.

Towards advancing aerial video understanding, this paper presents a new problem of parsing extremely low-resolution aerial videos of large spatial areas, such as picnic areas rich with co-occurring group events, viewed top-down under camera motion, as illustrated in Fig. 1 and 2. Given an aerial video, our objectives include:

Recognizing events present in each group;

Recognizing roles of people involved in these events.

1.2 Scope and Challenges

As illustrated in Fig. 1, we focus on videos of relatively wide spatial areas (e.g., parks with parking lots) with interesting terrains, taken on board a UAV flying at a high altitude (25 m above the ground). People in such videos form groups engaged in different events, involving complex $n$-ary interactions among themselves (e.g., a Guide leading Tourists in Group Tour), as well as interactions with objects (e.g., Play Frisbee). Also, people play particular roles in each event (e.g., Deliverer and Receiver roles in Exchange Box).

1. Low resolution. People and their portable objects are viewed at an extremely low resolution. Typically, the size of a person is only $15\times 15$ pixels in a frame, and small objects critical for distinguishing one event from another may not even be distinguishable by the human eye.

2. Camera motion makes important cues for event recognition (e.g., objects such as Car) only partially visible or even out of view, so their reliable detection may require longer video footage.

3. Shadows in top view make background subtraction very challenging.

Unfortunately, popular appearance-based approaches to detecting people and objects, which are used to produce input for recognizing group events and interactions, do not handle the above three challenges. Thus, we have to depart from appearance-based event recognition.

In addition, in the face of these challenges, state-of-the-art methods for people and vehicle tracking frequently fail to track moving foreground, and typically produce short, broken tracklets with a high rate of switched track IDs.

4. Space-time dynamics. Our events are characterized by both very large and very small space-time dynamics within a group of people. For example, in the event of a line forming in front of a vending machine, called Queue for Vending Machine, the participants may be initially scattered across a large spatial area, and may form the line very slowly, while partially occluding one another when standing close together in the line.

1.3 Overview of Our Approach

As Fig. 2 illustrates, our approach consists of two main steps:

1. Preprocessing. We ground our approach in noisy detections and tracking. Foreground tracking under camera motion is made feasible by registering video frames onto a reference plane. By frame registration, we generate a panorama for scene labeling. Due to the challenges mentioned in Sec. 1.2, tracking of small portable objects and people produces highly unreliable, frequently broken tracklets, with a high miss rate. We improve the initial tracking results by agglomeratively clustering tracklets into longer trajectories based on their spatial layout and velocity. We detect large objects (e.g., buildings, cars) using an existing detection approach, and classify superpixels of the panorama for scene labeling.

2. Inference. We seek event occurrences in the space-time patterns of the foreground trajectories and their relations with the detections of objects in the scene. To constrain our recognition hypotheses under uncertainty, we resort to domain knowledge represented by a probabilistic grammar, namely a spatiotemporal AND-OR graph (ST-AOG). The ST-AOG encodes decompositions of events into temporal sequences of sub-events. Sub-events are defined by our new formalism called latent spatiotemporal templates of $n$-ary relations among people and objects. The templates jointly encode varying spatiotemporal relations of characteristic roles of all people, as well as their interactions with objects, while engaged in the event.

We specify an iterative algorithm based on Markov Chain Monte Carlo (MCMC) along with dynamic programming (DP) to jointly infer groups, events and human roles.

1.4 Prior Work and Our Contributions

Our work is related to three research streams.

Event Recognition in Aerial Videos. Prior work on aerial image and video understanding typically restricts its settings to limited tasks. For example, some approaches require robust motion segmentation and learning of object shapes for tracking objects; others recognize people based on background subtraction and motion, or depend on an appearance-based regressor and background subtraction for tracking vehicles. Regarding the objectives, these approaches mainly focus on detecting and tracking people or vehicles. We advance prior work by relaxing its assumptions about the setting, and by extending its objectives to jointly infer groups, events, and human roles.

Group Activity Recognition. Simultaneous tracking of multiple people, discovering groups of people, and recognizing their collective activities have been addressed only in every-day videos, rather than aerial videos. Also, work on recognizing group activities in large spatial scenes requires high-resolution videos for a “digital zoom-in”. As input, these approaches use person detections along with cues about human appearance, pose, and orientation, i.e., information that cannot be reliably extracted from our aerial videos. There are also some trajectory-based methods for event recognition, but they focus on simpler events than those discussed in this paper. Regarding the representation of collective activities, prior work has used a descriptor of human locations and orientations, similar to shape-context. We advance prior work with our new formalism of latent spatiotemporal templates of human roles and their interactions with other actors and objects.

Recognition of Human Roles. Existing work on recognizing social roles and social interactions of people typically requires perfect tracking results, reliable estimation of face direction and attention in 3D space, or detection of agents’ feet locations in the scene, and thus is not applicable to our domain. Our approach is related to recent approaches aimed at jointly recognizing events and social roles by identifying interactions of sub-groups.

Our contributions include:

Addressing a more challenging setting of aerial videos;

A new formalism of latent spatiotemporal templates of $n$-ary relations among human roles and objects;

Efficient inference using dynamic programming aimed at grouping, recognizing, and localizing the temporal extents of events and human roles;

A new dataset of aerial videos with per-frame annotations of people’s trajectories, object labels, roles, events and groups.

Representation

Similar to existing hierarchical representations, domain knowledge is formalized as an ST-AOG, depicted in Fig. 3. Its nodes represent the following five sets of concepts: events $\Delta_{\text{E}}=\{E_{i}\}$; sub-events $\Delta_{L}=\{L_{a}\}$; human roles $\Delta_{\text{R}}=\{R_{j}\}$; small objects that people interact with $\Delta_{\text{O}}=\{O_{j}\}$; and large objects and scene surfaces $\Delta_{\text{S}}=\{S_{j}\}$. A particular pattern of foreground trajectories observed in a given time interval gives rise to a sub-event, and a particular sequence of sub-events defines an event.

Edges of the ST-AOG represent decomposition and temporal relations in the domain. In particular, the nodes are hierarchically connected by decomposition edges into three levels, where the root level corresponds to events, middle level encodes sub-events, and leaf level is grounded onto foreground tracklets and object detections in the video. The nodes of sub-events are also laterally connected for capturing “followed-by” temporal relations of sub-events within the corresponding events.

ST-AOG has special types of nodes. An AND node, $\wedge$ , encodes a temporal sequence of latent sub-events required to occur in the video so as to enable the event occurrence (e.g., in order to Exchange Box, the Deliverers first need to approach the Receivers, give the Box to the Receivers, and then leave). For a given event, an OR node, $\vee$ , serves to encode alternative space-time patterns of distinct sub-events.

2.2 Sub-events as Latent Spatiotemporal Templates

A temporal segment of foreground trajectories corresponds to a sub-event. The ST-AOG represents a sub-event as the latent spatiotemporal template of $n$-ary spatiotemporal relations among foreground trajectories within a time interval, as illustrated in Fig. 4. In particular, as an event is unfolding in the video, foreground trajectories form characteristic space-time patterns, which may not be semantically meaningful. As they frequently occur in the data, they can be robustly extracted from training videos through unsupervised clustering. Our spatiotemporal templates formalize these patterns within the Bayesian framework using unary, pairwise, and $n$-ary relations among the foreground trajectories. In addition, our unsupervised learning of spatiotemporal templates addresses structured and unstructured events in a unified manner. Namely, more structured events need more templates, while an unstructured one is represented by a single template.

Unary attributes. A foreground trajectory, $\Gamma=[\Gamma^{1},...,\Gamma^{k},...]$, can be viewed as spanning a number of time intervals, $\tau_{k}=[t_{k-1},t_{k}]$, where $\Gamma^{k}=\Gamma(\tau_{k})$. Each trajectory segment, $\Gamma^{k}$, is associated with unary attributes, $\bm{\phi}=[\bm{r}^{k},s^{k},\bm{c}^{k}]$. Elements of the role indicator vector $\bm{r}^{k}(l)=1$ if $\Gamma^{k}$ belongs to a person with role $l\in\Delta_{\text{R}}$ or object class $l\in\Delta_{\text{O}}$; otherwise, $\bm{r}^{k}(l)=0$. The speed indicator $s^{k}=1$ when the normalized speed of $\Gamma^{k}$ is greater than a threshold (we use 2 pixels/sec); otherwise, $s^{k}=0$. Elements of the closeness indicator vector $\bm{c}^{k}(l)=1$ when $\Gamma^{k}$ is within a threshold distance (70 pixels) of any of the large objects or types of surfaces detected in the scene, indexed by $l\in\Delta_{\text{S}}$, such as Building or Car; otherwise, $\bm{c}^{k}(l)=0$.
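As a minimal sketch of the speed and closeness indicators, the snippet below uses the two thresholds stated above. The function names, the use of path length divided by duration as the "normalized speed", and the representation of each large object by a single center point are assumptions made for illustration only.

```python
import numpy as np

SPEED_THRESHOLD = 2.0        # pixels/sec, as stated in the text
CLOSENESS_THRESHOLD = 70.0   # pixels, as stated in the text


def speed_indicator(points, duration_sec):
    """Return 1 if the average speed of a trajectory segment exceeds the threshold.

    Assumption: speed is path length divided by segment duration.
    """
    points = np.asarray(points, dtype=float)
    path_length = np.linalg.norm(np.diff(points, axis=0), axis=1).sum()
    return int(path_length / duration_sec > SPEED_THRESHOLD)


def closeness_indicator(points, object_centers):
    """Return one bit per large object: 1 if any segment point is within the threshold.

    Assumption: each large object is summarized by a single center point.
    """
    points = np.asarray(points, dtype=float)
    bits = []
    for center in object_centers:
        dists = np.linalg.norm(points - np.asarray(center, dtype=float), axis=1)
        bits.append(int(dists.min() < CLOSENESS_THRESHOLD))
    return np.array(bits)
```

For instance, a segment moving 10 pixels in 2 seconds yields a speed indicator of 1, and an object center 40 pixels from the nearest segment point yields a closeness bit of 1.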

Pairwise relations of a pair of trajectory segments, $\Gamma_{j}^{k}$ and $\Gamma_{j^{\prime}}^{k}$, are aimed at capturing spatiotemporal relations of human roles or objects represented by the two trajectories, as illustrated in Fig. 4. The pairwise relations are specified as: $\bm{\phi}_{jj^{\prime}}=[d_{jj^{\prime}}^{k},\theta_{jj^{\prime}}^{k},\bm{r}_{jj^{\prime}}^{k},s_{jj^{\prime}}^{k},\bm{c}_{jj^{\prime}}^{k}]$, where $d_{jj^{\prime}}^{k}$ is the mean distance between $\Gamma_{j}^{k}$ and $\Gamma_{j^{\prime}}^{k}$; $\theta_{jj^{\prime}}^{k}$ is the angle subtended between $\Gamma_{j}^{k}$ and $\Gamma_{j^{\prime}}^{k}$; and the remaining three pairwise relations check for compatibility between the aforementioned binary unary attributes as: $\bm{r}_{jj^{\prime}}^{k}=\bm{r}_{j}^{k}\oplus\bm{r}_{j^{\prime}}^{k}$, $s_{jj^{\prime}}^{k}=s_{j}^{k}\oplus s_{j^{\prime}}^{k}$, $\bm{c}_{jj^{\prime}}^{k}=\bm{c}_{j}^{k}\oplus\bm{c}_{j^{\prime}}^{k}$, where $\oplus$ denotes the Kronecker product.
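The pairwise relations can be illustrated with a short sketch. The function name and the dictionary layout are hypothetical. The snippet assumes that both segments are sampled at the same time stamps, and it reads the "angle subtended" as the angle between the two net displacement vectors, which is only one possible interpretation.

```python
import numpy as np


def pairwise_relations(seg_a, seg_b, r_a, r_b, s_a, s_b, c_a, c_b):
    """Compute [d, theta, r, s, c] for two time-aligned trajectory segments."""
    seg_a = np.asarray(seg_a, dtype=float)
    seg_b = np.asarray(seg_b, dtype=float)
    # Mean point-wise distance between the two segments.
    d = float(np.linalg.norm(seg_a - seg_b, axis=1).mean())
    # Assumed reading of the angle: between the net displacement vectors.
    va = seg_a[-1] - seg_a[0]
    vb = seg_b[-1] - seg_b[0]
    cross = va[0] * vb[1] - va[1] * vb[0]
    theta = float(abs(np.arctan2(cross, np.dot(va, vb))))
    return {
        "d": d,
        "theta": theta,
        "r": np.kron(r_a, r_b),   # Kronecker product of role indicator vectors
        "s": s_a * s_b,           # Kronecker product of two scalars is their product
        "c": np.kron(c_a, c_b),   # Kronecker product of closeness indicator vectors
    }
```

With two roles, the Kronecker product of the indicators `[1, 0]` and `[0, 1]` is `[0, 1, 0, 0]`, i.e., a one-hot code of the role pair.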

$n$-ary relations. Towards encoding unique spatiotemporal patterns of a set of trajectories, we specify the following $n$-ary attribute. A set of trajectory segments, $G_{i}(\tau_{k})=G_{i}^{k}=\{\Gamma_{j}^{k}\}$, can be described by an 18-bin histogram $\bm{h}^{k}$ of their velocity vectors. $\bm{h}^{k}$ counts orientations of velocities at every point along the trajectories in a polar coordinate system: 6 bins span the orientations in $[0,2\pi]$, and 3 bins encode the locations of trajectory points relative to a given center, giving $6\times 3=18$ bins in total. As the polar-coordinate origin, we use the center location of a given event in the scene.
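A minimal sketch of such an 18-bin histogram follows. The text does not specify how the 3 location bins are defined, so the snippet assumes concentric distance rings around the event center with hypothetical radii. It also assumes that velocity orientation is measured in the image frame and that velocities are finite differences of consecutive points.

```python
import numpy as np


def velocity_histogram(trajectories, center, radii=(50.0, 150.0)):
    """Return an 18-bin histogram (3 location rings x 6 orientation bins).

    `radii` are hypothetical ring boundaries in pixels.
    """
    hist = np.zeros((3, 6))
    center = np.asarray(center, dtype=float)
    for traj in trajectories:
        traj = np.asarray(traj, dtype=float)
        vel = np.diff(traj, axis=0)        # finite-difference velocities
        pts = traj[:-1]                    # point at which each velocity is measured
        angles = np.arctan2(vel[:, 1], vel[:, 0]) % (2 * np.pi)
        o_bin = np.minimum((angles / (np.pi / 3)).astype(int), 5)
        dist = np.linalg.norm(pts - center, axis=1)
        l_bin = np.digitize(dist, radii)   # 0, 1 or 2
        np.add.at(hist, (l_bin, o_bin), 1)
    return hist.ravel()
```

The flattened index is `6 * location_bin + orientation_bin`, so a point in the middle ring moving along the positive y-axis falls into bin 7.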

Unsupervised Extraction of Templates. We are given training videos with a ground-truth partition of all their foreground trajectories $G$ into disjoint subsets $G=\{G_{i}\}$. Every $G_{i}$ can be further partitioned into equal-length time intervals $G_{i}=\{G_{i}^{k}\}$ ($|\tau^{k}|=2\text{ sec}$). We use K-means clustering to group all $\{\Gamma_{i,j}^{k}\}$, and then estimate spatiotemporal templates $\{L_{a}\}$ as representatives of the resulting clusters $a$. For K-means clustering, we use ground-truth values of the aforementioned unary and pairwise relations of $\{\Gamma_{i,j}^{k}\}$. In our setting of 11 categories of events occurring in aerial videos, we estimate $|\Delta_{L}|=27$ templates.
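The clustering step can be sketched with a plain Lloyd-style K-means in NumPy. This is a generic illustration, not the authors' implementation: the feature matrix is assumed to stack one hypothetical attribute vector per 2-second segment, and the cluster centroids play the role of template representatives.

```python
import numpy as np


def kmeans(features, k, n_iter=50, seed=0):
    """Cluster row vectors into k groups; return (centroids, labels)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(features, dtype=float)
    # Initialize centroids with k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every segment to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        new_centers = np.array([
            X[labels == a].mean(axis=0) if np.any(labels == a) else centers[a]
            for a in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

In the setting described above, `k` would be 27 and each row of `features` would hold the unary and pairwise relations of one segment.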

Formulation and Learning of Templates

Given the spatiotemporal templates, $\Delta_{L}=\{L_{a}\}$ , extracted by K-means clustering from training videos (see Sec. 2.2), we will conduct inference by seeking these latent templates in foreground trajectories of the new video. To this end, we define the log-likelihood of a set of foreground trajectories $G=\{\Gamma_{j}\}$ given $L_{a}\in\Delta_{L}$ as

where the bottom equation of (1) formalizes every template as a set of parameters $\bm{w}_{a}=[\bm{w}_{a}^{1},\bm{w}_{a}^{2},\bm{w}_{a}^{3}]$ appropriately weighting the unary, pairwise and $n$-ary relations $\bm{\psi}$ of $G$. Recall that our spatiotemporal templates are extracted from unit-time segments of foreground trajectories in training. Thus, the log-likelihood in (1) is defined only for sets $G$ consisting of unit-time trajectory segments.

From (1), the parameters $\bm{w}_{a}$ can be learned by maximizing the log-likelihood of $\{\bm{\psi}_{a}^{k}\}$ extracted from the corresponding clusters $a$ of training trajectories.

The log-posterior of assigning template $L_{a}$ to longer temporal segments of trajectories, falling in $\tau=(t^{\prime},t)$ , $t^{\prime}<t$ , is specified as

where $p(L_{a}(\tau))$ is a log-normal prior on assigning $L_{a}$ to a time interval of length $|\tau|$. The hyper-parameters of $p(L_{a}(\tau))$ are estimated by maximum likelihood estimation (MLE) on training data.
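For a log-normal distribution, the maximum likelihood estimates are the mean and the (biased) standard deviation of the log-durations. The sketch below illustrates this for a single template; the function names and the input list of training durations are hypothetical.

```python
import math


def fit_lognormal(durations):
    """MLE of log-normal parameters (mu, sigma) from positive durations."""
    logs = [math.log(d) for d in durations]
    mu = sum(logs) / len(logs)
    # The MLE of the variance divides by n, not n - 1.
    var = sum((x - mu) ** 2 for x in logs) / len(logs)
    return mu, math.sqrt(var)


def lognormal_logpdf(duration, mu, sigma):
    """Log-density of a log-normal distribution evaluated at `duration`."""
    z = (math.log(duration) - mu) / sigma
    return -math.log(duration * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z
```

The value `lognormal_logpdf(length, mu, sigma)` can then serve as the log-prior term for a candidate interval of the given length.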

Probabilistic Model

A parse graph is an instance of the ST-AOG, explaining the event, sequence of sub-events, and human role and object label assignment. The solution of our video parsing is a set of parse graphs, $W=\{pg_{i}\}$, where every $pg_{i}$ explains a subset of foreground trajectories, $G_{i}\subset G$, as

$$pg_{i}=\{e_{i},\ \tau_{i},\ \{L(\tau_{i,u})\},\ \{\bm{r}_{i,j}\}\},\qquad(3)$$

where $e_{i}\in\Delta_{\text{E}}$ is the recognized event conducted by $G_{i}$; $\tau_{i}=[t_{i,0},t_{i,T}]$ is the temporal extent of $e_{i}$ in the video starting from frame $t_{i,0}$ and ending at frame $t_{i,T}$; $\{L(\tau_{i,u})\}$ are the templates (i.e., latent sub-events) assigned to non-overlapping, consecutive time intervals $\tau_{i,u}\subset\tau_{i}$, such that $|\tau_{i}|=\sum_{u}|\tau_{i,u}|$; and $\bm{r}_{i,j}$ is the human role or object class assignment to the $j$th trajectory $\Gamma_{i,j}$ of $G_{i}$.

Our objective is to infer $W$ that maximizes the log-posterior $\log p(W|G)\propto-\mathcal{E}(W|G)$ , given all foreground trajectories $G$ extracted from the video. The corresponding energy $\mathcal{E}(W|G)$ is specified for a given partitioning of $G$ into $N$ disjoint subsets $G_{i}$ as

where $G_{i}(\tau_{i,u})$ denotes temporal segments of foreground trajectories falling in time intervals $\tau_{i,u}$, $|\tau_{i}|=\sum_{u}|\tau_{i,u}|$, and $\log p(L(\tau_{i,u})|G_{i}(\tau_{i,u}))$ is given by (2). Also, $\log p(\wedge_{e_{i}}|\vee_{\text{root}})$ and $\log p(\wedge_{L_{a}}|\vee_{e_{i}})$ are the log-probabilities of the corresponding switching OR nodes in the ST-AOG for selecting particular events $e_{i}\in\Delta_{\text{E}}$ and spatiotemporal templates $L_{a}\in\Delta_{L}$. These two switching probabilities are simply estimated as the frequencies of the corresponding selections observed in training data.

Inference

Given an aerial video, we first build a video panorama and extract foreground trajectories $G$. Then, the goal of inference is to: (1) partition $G$ into disjoint groups of trajectories $\{G_{i}\}$ and assign an event label $e_{i}\in\Delta_{\text{E}}$ to every $G_{i}$; (2) assign human roles and object labels $\bm{r}_{i,j}$ to trajectories $\Gamma_{i,j}$ within each group $G_{i}$; and (3) assign latent spatiotemporal templates $L(\tau_{i,u})\in\Delta_{L}$ to temporal segments $\tau_{i,u}$ of foreground trajectories within every $G_{i}$. For steps (1) and (2) we use two distinct MCMC processes. Given groups $G_{i}$, event labels $e_{i}$ and role assignments $\bm{r}_{i,j}$ proposed in (1) and (2), step (3) uses dynamic programming for efficient estimation of sub-events $L(\tau)$ and their temporal extents $\tau$. Steps (1)–(3) are iterated until convergence, i.e., when $\mathcal{E}(W|G)$, given by (4), stops decreasing after a sufficiently large number of iterations.

Given $G$, we first perform an initial clustering of foreground trajectories into atomic groups using an existing method. Then, we apply the first MCMC to iteratively propose either to merge two smaller groups into a larger one, with probability $p(1)=0.7$, or to split a merged group into two smaller groups, with probability $p(2)=0.3$. Given the proposal, each resulting group $G_{i}$ is labeled with an event $e_{i}\in\Delta_{\text{E}}$ (we enumerate all possible labels). In each proposal, the MCMC jumps from the current solution $W$ to a new solution $W^{\prime}$ generated by one of the dynamics. The acceptance rate is $\alpha=\min\left\{1,\frac{Q(W^{\prime}\rightarrow W)p(W^{\prime}|G)}{Q\left(W\rightarrow W^{\prime}\right)p\left(W|G\right)}\right\}$, where the proposal distribution $Q(W\rightarrow W^{\prime})$ is one of $p(1)$ or $p(2)$ depending on the proposal, and $p\left(W|G\right)$ is given by (4).

5.2 Human Role Assignment

Given a partitioning of $G$ into groups $\{G_{i}\}$ and their event labels $\{e_{i}\}$, we use the second MCMC process within every $G_{i}$ to assign human roles and object labels to trajectories. Each trajectory $\Gamma_{i,j}$ in $G_{i}$ is randomly assigned an initial human-role/object label $\bm{r}_{i,j}$ for solution $pg_{i}$. In each iteration, we randomly select $\Gamma_{i,j}$ and change its role label to generate a new proposal $pg^{\prime}_{i}$. The acceptance rate is $\alpha=\min\left\{1,\frac{Q(pg^{\prime}_{i}\rightarrow pg_{i})p(pg^{\prime}_{i}|G_{i})}{Q(pg_{i}\rightarrow pg^{\prime}_{i})p\left(pg_{i}|G_{i}\right)}\right\}$, where $\frac{Q(pg^{\prime}_{i}\rightarrow pg_{i})}{Q(pg_{i}\rightarrow pg^{\prime}_{i})}=1$ and $p\left(pg^{\prime}_{i}|G_{i}\right)$ is maximized by the dynamic programming specified in Sec. 5.3.
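Because the proposal ratio equals 1, the role-assignment step reduces to a Metropolis sampler with a symmetric proposal. The sketch below illustrates this under stated assumptions: `log_posterior` is a hypothetical callable standing in for the DP-maximized $\log p(pg_{i}|G_{i})$, the role list must contain at least two labels, and the sampler additionally keeps track of the best assignment it has visited.

```python
import math
import random


def sample_roles(n_traj, roles, log_posterior, n_iter=1000, seed=0):
    """Metropolis sampling over role assignments with a symmetric proposal."""
    rng = random.Random(seed)
    current = [rng.choice(roles) for _ in range(n_traj)]   # random initialization
    current_lp = log_posterior(current)
    best, best_lp = list(current), current_lp
    for _ in range(n_iter):
        # Propose: pick one trajectory and switch it to a different role.
        proposal = list(current)
        j = rng.randrange(n_traj)
        proposal[j] = rng.choice([r for r in roles if r != current[j]])
        lp = log_posterior(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        if lp >= current_lp or rng.random() < math.exp(lp - current_lp):
            current, current_lp = proposal, lp
            if current_lp > best_lp:
                best, best_lp = list(current), current_lp
    return best, best_lp
```

A toy usage would pass a scoring function that rewards agreement with a known assignment, e.g. `lambda a: 3.0 * sum(x == y for x, y in zip(a, target))`.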

5.3 Detection of Latent Sub-events with DP

From steps (1) and (2), we have obtained the trajectory groups $\{G_{i}\}$, and their event labels $\{e_{i}\}$ and role labels $\{\bm{r}_{i,j}\}$. Every $G_{i}$ can be viewed as occupying the time interval $\tau_{i}=[t_{i,0},t_{i,T}]$. The results of steps (1) and (2) are jointly used with detections of large objects $\{S_{i}\}$ to estimate all unary, pairwise, and $n$-ary relations $\bm{\psi}_{i}$ of every $G_{i}$. Then, we apply dynamic programming for every $G_{i}$ in order to find latent templates $L(\tau_{i,u})\in\Delta_{L}$ and their optimal durations $\tau_{i,u}\subset[t_{i,0},t_{i,T}]$. In the sequel, we drop the group index $i$ for simplicity.

The optimal assignment of sub-events can be formulated using a graph, shown in Fig. 5. To this end, we partition $[t_{0},t_{T}]$ into equal-length time intervals $\{[t_{k-1},t_{k}]\}$ , where $t_{k}-t_{k-1}=\delta t$ , $\delta t=2\text{sec}$ . Nodes $L_{a}^{k}$ in the graph represent the assignment of templates $L_{a}\in\Delta_{L}$ to the intervals $[t_{k-1},t_{k}]$ . The graph also has the source and sink nodes.

Directed edges in the graph are established only between nodes $L_{a}^{k^{\prime}}$ and $L_{a}^{k}$ , $1\leq k^{\prime}<k$ , to denote a possible assignment of the very same template $L_{a}$ to the temporal sequence $[t_{k^{\prime}},t_{k}]$ . The directed edges are assigned weights (a.k.a. belief messages), $m(L_{a}^{k^{\prime}},L_{a}^{k})$ , defined as

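where $\log p(L_{a}(t_{k^{\prime}},t_{k})|G_{i}(t_{k^{\prime}},t_{k}))$ is given by (2). Consequently, the belief of node $L_{a}^{k}$ is defined as

$$b(L_{a}^{k})=\max_{k^{\prime}<k,\,a^{\prime}}\ b(L_{a^{\prime}}^{k^{\prime}})+m(L_{a}^{k^{\prime}},L_{a}^{k}).\qquad(6)$$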

Here $b(L_{a}^{0})=0$ . We compute the optimal assignment of latent sub-events using the above graph in two passes. In the forward pass, we compute the beliefs of all nodes in the graph using (6). Then, in the backward pass, we backtrace the optimal path between the sink and source nodes, in the following steps:

Find the optimal sub-event assignment at time $t_{k}$ as $L_{a^{*}}^{k}=\arg\max_{a}b(L_{a}^{k})$; let $a\leftarrow a^{*}$;

Find the best time moment in the past $t_{k^{*}}$, $k^{*}<k$, and its best sub-event assignment as $L_{a^{*}}^{k^{*}}=\arg\max_{a^{\prime},k^{\prime}}b(L_{a^{\prime}}^{k^{\prime}})+m(L_{a}^{k^{\prime}},L_{a}^{k})$; let $a\leftarrow a^{*}$ and $k\leftarrow k^{*}$.
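The forward and backward passes can be sketched as a segmental dynamic program. The snippet assumes the recursion implied by the backtracing steps, namely that the belief of a node is the best predecessor belief plus the message of the connecting edge, with the source belief set to 0. `seg_score(a, k_prev, k)` is a hypothetical callable standing in for the message of assigning template `a` to the intervals between $t_{k^{\prime}}$ and $t_{k}$.

```python
def segment_subevents(n_intervals, templates, seg_score):
    """Return the best (template, start, end) segments and their total score."""
    neg_inf = float("-inf")
    belief = [{a: neg_inf for a in templates} for _ in range(n_intervals + 1)]
    back = [{a: None for a in templates} for _ in range(n_intervals + 1)]
    best_at = [0.0] + [neg_inf] * n_intervals   # best belief at each time; source is 0
    best_a = [None] * (n_intervals + 1)

    # Forward pass: compute the beliefs of all nodes.
    for k in range(1, n_intervals + 1):
        for a in templates:
            for k_prev in range(k):
                s = best_at[k_prev] + seg_score(a, k_prev, k)
                if s > belief[k][a]:
                    belief[k][a] = s
                    back[k][a] = k_prev
        best_a[k] = max(templates, key=lambda a: belief[k][a])
        best_at[k] = belief[k][best_a[k]]

    # Backward pass: trace the optimal path from the sink to the source.
    segments = []
    k = n_intervals
    while k > 0:
        a = best_a[k]
        k_prev = back[k][a]
        segments.append((a, k_prev, k))
        k = k_prev
    segments.reverse()
    return segments, best_at[n_intervals]
```

With $K$ intervals and $|\Delta_{L}|$ templates, this sketch evaluates $O(K^{2}|\Delta_{L}|)$ candidate segments, which is why restricting edges to nodes of the same template keeps the search tractable.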

Experiment

Existing Datasets. Existing datasets on aerial videos, group events or human roles are inappropriate for our evaluation. Some aerial videos or images do show group events, but the events are not annotated. Most aerial datasets are compiled for tracking evaluation only. Existing group-activity videos or social-role videos are captured on or near the ground surface, and have sufficiently high resolution for robust people detection. Thus, we have prepared and released a new aerial video dataset with the new challenges listed in Sec. 1.2. The dataset can be downloaded from http://www.stat.ucla.edu/~tianmin.shu/AerialVideo/AerialVideo.html.

Aerial Events Dataset. A hex-rotor with a GoPro camera was used to shoot aerial videos at an altitude of 25 meters from the ground. The videos show two different scenes, viewed top-down from the flying hex-rotor. The dataset contains 27 videos (86 minutes, 60 fps, resolution of $1920\times 1080$), with about 15 actors in each video. All video frames are registered onto a reference plane of the video panorama. Annotations are provided as: bounding boxes around groupings of people, events, human roles, and small and large objects. The objects include: 1. Building, 2. Vending Machine, 3. Table & Seat, 4. BBQ Oven, 5. Trash Bin, 6. Shelter, 7. Info Booth, 8. Box, 9. Frisbee, 10. Car, 11. Desk, 12. Blanket. The events include: 1. Play Frisbee, 2. Serve Table, 3. Sell BBQ, 4. Info Consult, 5. Exchange Box, 6. Pick Up, 7. Queue for Vending Machine, 8. Group Tour, 9. Throw Trash, 10. Sit on Table, 11. Picnic. The human roles include: 1. Player, 2. Waiter, 3. Customer, 4. Chef, 5. Buyer, 6. Consultant, 7. Visitor, 8. Deliverer, 9. Receiver, 10. Driver, 11. Queuing Person, 13. Guide, 14. Tourist, 15. Trash Thrower, 16. Picnic Person.

Evaluation Metrics. We split the 27 videos into 3 sets, such that different event categories are evenly distributed, and use three-fold cross-validation for our evaluation. Although our training and test videos show the same two scenes, we assume that the layout of ground surfaces and large objects is unknown. Also, different videos in our dataset cover different parts of these large scenes, which are also assumed unknown. We evaluate the accuracy of: i) grouping people, ii) event recognition, and iii) role assignment. While our approach also estimates sub-events, note that they are latent and not annotated. All results are time-averaged over the lengths of trajectories in each video. For specifying evaluation metrics we use the following notation. $G=\{G_{i}\}$ and $G^{\prime}=\{G_{i}^{\prime}\}$ are the sets of groups in the ground truth and in the inference results, respectively. $\Gamma_{ij}$ is the $j$th trajectory in the $i$th ground-truth group, with duration $|\tau_{ij}|$, group label $g_{ij}$, event type $e_{ij}$ and human role $r_{ij}$. $\Gamma^{\prime}_{ij}$ is defined analogously for our inference results. For group $G_{i}$, we denote the best-matched (i.e., most overlapping) group in $G^{\prime}$ by $M_{i}$. For group $G^{\prime}_{i}$, we denote the best-matched group in $G$ by $M^{\prime}_{i}$. Then, precision and recall of grouping are

Accuracy of grouping is the harmonic mean of precision and recall, $F_{g}=2/(1/Pr_{g}+1/Rc_{g})$.
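As a small worked example of the grouping accuracy, the sketch below evaluates the harmonic mean for given precision and recall values. The function name is hypothetical, and returning 0 when either input is 0 is an added convention to avoid division by zero.

```python
def grouping_f_score(precision, recall):
    """Harmonic mean of grouping precision and recall."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0   # assumed convention for the degenerate case
    return 2.0 / (1.0 / precision + 1.0 / recall)
```

For example, a precision of 0.5 and a recall of 1.0 give $F_{g}=2/(2+1)=2/3$.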

Event recognition accuracy $E_{e}$ and role assignment accuracy $E_{r}$ are defined as

Baselines. To evaluate the effectiveness of each module of our approach, we compare with baselines and variants of our method defined in Tab. 1. For the baselines we extract the following low-level features on trajectories: a shape-context-like feature, average velocity, aligned orientation, and distance from each type of large object. All elements of the feature vectors are normalized to a fixed range.

Results. We register raw videos by RANSAC over Harris corner feature points, then apply an existing tracking method based on background subtraction. We also use an existing detector to detect buildings and cars, while other static objects are inferred in scene labeling. We do not detect portable objects, e.g., Frisbee and Box.

We evaluate our approach on both annotated bounding boxes and real tracking results. Example qualitative results are presented in Fig. 6. As can be seen, the results are reasonably good. The quantitative results are shown in Tab. 1. Confusion matrices of event recognition and role assignment are shown in Fig. 7. Additional results are presented in the supplementary material.

Conclusion

We collected a new aerial video dataset with detailed annotations, which presents new challenges to computer vision and complements existing benchmarks. We specified a framework for joint inference of events, human roles and people groupings using noisy input. Our experiments showed that addressing each of these inference tasks in isolation is very difficult in aerial videos, and thus provided justification for our holistic framework. Our results demonstrated significant performance improvements over baselines when we constrained uncertainty in input features with domain knowledge.

Our model is limited and can be extended in two directions. First, we currently infer the function of objects only implicitly, based on the group events. In the future, we wish to explicitly infer the functional map for a given site, in the sense that certain areas correspond to specific human activities, e.g., dining area, parking lot, etc. Unlike appearance-based aerial image parsing, the spatial segmentation will be guided by the spatiotemporal characteristics of human activities. Second, similar to prior work on predicting individual intention, we would like to reason about the intention of a group as another extension of our work.

Acknowledgements

This research has been sponsored in part by grants DARPA MSEE FA 8650-11-1-7149, ONR MURI N00014-10-1-0933 and NSF IIS-1423305. The authors would like to thank Dr. Michael Ryoo at JPL for the helpful discussions.