The Kinetics Human Action Video Dataset

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, Andrew Zisserman

cs.CV

Introduction

In this paper we introduce a new, large, video dataset for human action classification. We developed this dataset principally because there is a lack of such datasets for human action classification, and we believe that having one will facilitate research in this area – both because the dataset is large enough to train deep networks from scratch, and also because the dataset is challenging enough to act as a performance benchmark where the advantages of different architectures can be teased apart.

Our aim is to provide a large scale high quality dataset, covering a diverse range of human actions, that can be used for human action classification, rather than temporal localization. Since the use case is classification, only short clips of around 10s containing the action are included, and there are no untrimmed videos. However, the clips also contain sound so the dataset can potentially be used for many purposes, including multi-modal analysis. Our inspiration in providing a dataset for classification is ImageNet, where the significant benefits of first training deep networks on this dataset for classification, and then using the trained network for other purposes (detection, image segmentation, non-visual modalities such as sound and depth, etc.) are well known.

The Kinetics dataset can be seen as the successor to the two human action video datasets that have emerged as the standard benchmarks for this area: HMDB-51 and UCF-101. These datasets have served the community very well, but their usefulness is now expiring. This is because they are simply not large enough, and do not have sufficient variation, to train and test the current generation of human action classification models based on deep learning. Coincidentally, one of the motivations for introducing the HMDB dataset was that the then current generation of action datasets was too small. The increase then was from 10 to 51 classes, and we in turn increase this to 400 classes.

Table 1 compares the size of Kinetics to a number of recent human action datasets. In terms of variation, although the UCF-101 dataset contains 101 actions with 100+ clips for each action, all the clips are taken from only 2.5k distinct videos. For example there are 7 clips from one video of the same person brushing their hair. This means that there is far less variation than if the action in each clip was performed by a different person (and different viewpoint, lighting, etc). This problem is avoided in Kinetics as each clip is taken from a different video.

The clips are sourced from YouTube videos. Consequently, for the most part, they are not professionally videoed and edited material (as in TV and film videos). There can be considerable camera motion/shake, illumination variations, shadows, background clutter, etc. More importantly, there are a great variety of performers (since each clip is from a different video) with differences in how the action is performed (e.g. its speed), clothing, body pose and shape, age, and camera framing and viewpoint.

Our hope is that the dataset will enable a new generation of neural network architectures to be developed for video. For example, architectures including multiple streams of information (RGB/appearance, optical flow, human pose, object category recognition), architectures using attention, etc. That will enable the virtues (or otherwise) of the new architectures to be demonstrated. Issues such as the tension between static and motion prediction, and the open question of the best method of temporal aggregation in video (recurrent vs convolutional) may finally be resolved.

The rest of the paper is organized as follows: Section 2 gives an overview of the new dataset; Section 3 describes how it was collected and discusses possible imbalances in the data and their consequences for classifier bias; and Section 4 gives the performance of a number of ConvNet architectures that are trained and tested on the dataset. Our companion paper explores the benefit of pre-training an action classification network on Kinetics, and then using the features from the network for action classification on other (smaller) datasets.

The URLs of the YouTube videos and temporal intervals of the dataset can be obtained from http://deepmind.com/kinetics.

An Overview of the Kinetics Dataset

The dataset is focused on human actions (rather than activities or events). The list of action classes covers: Person Actions (singular), e.g. drawing, drinking, laughing, pumping fist; Person-Person Actions, e.g. hugging, kissing, shaking hands; and, Person-Object Actions, e.g. opening present, mowing lawn, washing dishes. Some actions are fine grained and require temporal reasoning to distinguish, for example different types of swimming. Other actions require more emphasis on the object to distinguish, for example playing different types of wind instruments.

There is not a deep hierarchy, but instead there are several (non-exclusive) parent-child groupings, e.g. Music (playing drums, trombone, violin, …); Personal Hygiene (brushing teeth, cutting nails, washing hands, …); Dancing (ballet, macarena, tap, …); Cooking (cutting, frying, peeling, …). The full list of classes is given in the appendix, together with parent-child groupings. Figure 1 shows clips from a sample of classes.

Statistics:

The dataset has 400 human action classes, with 400–1150 clips for each action, each from a unique video. Each clip lasts around 10s. The current version has 306,245 videos, and is divided into three splits, one for training having 250–1000 videos per class, one for validation with 50 videos per class and one for testing with 100 videos per class. The statistics are given in table 2. The clips are from YouTube videos and have a variable resolution and frame rate.

Non-exhaustive annotation.

Each class contains clips illustrating that action. However, a particular clip can contain several actions. Interesting examples in the dataset include: “texting” while “driving a car”; “Hula hooping” while “playing ukulele”; “brushing teeth” while “dancing” (of some type). In each case both of the actions are Kinetics classes, and the clip will probably appear under only one of these classes, not both, i.e. clips do not have complete (exhaustive) annotation. For this reason, when evaluating classification performance, a top-5 measure is more suitable than top-1. This is similar to the situation in ImageNet, where one of the reasons for using a top-5 measure is that each image is labelled with only a single class, although it may contain multiple classes.
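To make the top-5 measure concrete, the following is a minimal sketch of a top-k accuracy computation. The function name `top_k_accuracy` and the toy scores are illustrative assumptions, not part of the Kinetics evaluation code.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int = 5) -> float:
    """Fraction of clips whose labelled class is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one clip")
    hits = 0
    for clip_scores, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score for this clip.
        ranked = sorted(range(len(clip_scores)),
                        key=lambda c: clip_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example with 3 classes (made-up scores): the second clip is labelled
# class 0 but the model ranks class 2 first, as could happen when a clip
# shows two actions and is annotated with only one of them.
toy_scores = [[0.7, 0.2, 0.1], [0.4, 0.1, 0.5]]
toy_labels = [0, 0]
top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```

With a larger k, a clip that contains a second valid action is not penalised as long as the annotated class is still ranked near the top.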

How the Dataset was Built

In this section we describe the collection process: how candidate videos were obtained from YouTube, and then the processing pipeline that was used to select the candidates and clean up the dataset. We then discuss possible biases in the dataset due to the collection process.

Clips for each class were obtained by first searching on YouTube for candidates, and then using Amazon Mechanical Turk (AMT) workers to decide if the clip contains the action or not. Three or more confirmations (out of five) were required before a clip was accepted. The dataset was de-duplicated, by checking that only one clip is taken from each video, and that clips do not contain common video material. Finally, classes were checked for overlap and de-noised.

We now describe these stages in more detail.

1 Stage 1: Obtaining an action list

Curating a large list of human actions is challenging, as there is no single listing available at this scale with suitable visual action classes. Consequently, we had to combine numerous sources together with our own observations of actions that surround us. These sources include: (i) Action datasets – existing datasets like ActivityNet, HMDB, UCF101, MPII Human Pose and ACT have useful classes and a suitable subset of these were used; (ii) Motion capture – there are a number of motion capture datasets which we looked through and extracted file titles. These titles described the motion within the file and were often quite creative; and, (iii) Crowdsourced – we asked Mechanical Turk workers to come up with a more appropriate action if the label we had presented to them for a clip was incorrect.

2 Stage 2: Obtaining candidate clips

The chosen method, which combines a number of different internal efforts, is detailed in the steps below:

Step 1: obtaining videos.

Videos are drawn from the YouTube corpus by matching video titles with the Kinetics actions list.

Step 2: temporal positioning within a video.

Image classifiers are available for a large number of human actions. These classifiers are obtained by tracking user actions on Google Image Search. For example, for a search query “climbing tree”, user relevance feedback on images is collected by aggregating across the many times that the search query is issued. This relevance feedback is used to select a high-confidence set of images that can be used to train a “climbing tree” image classifier. These classifiers are run at the frame level over the videos found in step 1, and clips are extracted around the top $k$ responses (where $k=2$).

It was found that the action list had a better match to relevant classifiers if action verbs are formatted to end with ‘ing’. Thinking back to image search, this makes sense as typically if you are searching for an example of someone performing an action you would issue queries like ‘running man’ or ‘brushing hair’ over other tenses like ‘man ran’ or ‘brush hair’.

The output of this stage is a large number of videos and a position in all of them where one of the actions is potentially occurring. 10 second clips are created by taking 5 seconds either side of that position (there are length exceptions when the position is within 5 seconds of the start or end of the video leading to a shorter clip length). The clips are then passed onto the next stage of cleanup through human labelling.
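The clip extraction rule above can be sketched as follows. This is a minimal illustration, assuming the classifier response position and the video duration are both known in seconds; the function name `clip_interval` is hypothetical.

```python
from typing import Tuple


def clip_interval(position_s: float,
                  video_duration_s: float,
                  half_window_s: float = 5.0) -> Tuple[float, float]:
    """Return (start, end) in seconds of a clip centred on position_s.

    The window is clamped to the video, so positions within half_window_s
    of the start or end yield a clip shorter than 10 seconds.
    """
    if not 0.0 <= position_s <= video_duration_s:
        raise ValueError("position must lie inside the video")
    start = max(0.0, position_s - half_window_s)
    end = min(video_duration_s, position_s + half_window_s)
    return start, end


full_clip = clip_interval(42.0, 120.0)   # a full 10 second clip
short_clip = clip_interval(3.0, 120.0)   # truncated at the start of the video
```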

3 Stage 3: Manual labelling process

The key aim of this stage was to identify whether the supposed action was actually occurring during a clip or not. A human was required in the loop for this phase and we chose to use Amazon’s Mechanical Turk (AMT) for the task due to the large numbers of high quality workers using the platform.

A single-page webapp was built for the labelling task and optimised to maximise the number of clips presented to the workers whilst maintaining a high quality of annotation. The labelling interface is shown in figure 2. The user interface design and theme were chosen to differentiate the task from many others on the platform as well as make the task as stimulating and engaging as possible. This certainly paid off as the task was one of the highest rated on the platform and would frequently get more than 400 distinct workers as soon as a new run was launched.

The workers were given clear instructions at the beginning. There were two screens of instruction, the second reinforcing the first. After acknowledging they understood the task they were presented with a media player and several response icons. The interface would fetch a set of videos from the available pool for the worker at that moment and embed the first clip. The task consisted of 20 videos each with a different class where possible; we randomised all the videos and classes to make it more interesting for the workers and prevent them from becoming stuck on classes with low yields. Two of the video slots were used by us to inject groundtruth clips. This allowed us to get an estimate of the accuracy for each worker. If a worker fell below a 50% success rating on these, we showed them a ‘low accuracy’ warning screen. This helped address many low accuracies.

In the labelling interface, workers were asked the question “Can you see a human performing the action class-name?”. The following response options were available on the interface as icons:

Yes, this contains a true example of the action

No, this does not contain an example of the action

You are unsure if there is an example of the action

Video does not play, does not contain a human, is an image, cartoon or a computer game.

When a worker responded with ‘Yes’ we also asked the question “Does the action last for the whole clip?” in order to use this signal later during model training.

Note, the AMT workers didn’t have access to the audio to ensure that the video can be classified purely based on its visual content.

In order for a clip to be added to the dataset, it needed to receive at least 3 positive responses from workers. Each clip could be annotated up to 5 times, but annotation stopped early as soon as any single response had been given more than twice. For example, if 3 out of 3 workers had said it did not contain an example of the action, we would immediately remove it from the pool rather than continue until 5 workers had annotated it.
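The acceptance rule can be sketched as a small decision function. The response labels (`"yes"`, `"no"`, `"unsure"`, `"bad"`) and the status names are hypothetical, and treating a clip that reaches 5 annotations without 3 positives as rejected is an assumption that follows from the 3-positive requirement.

```python
from collections import Counter
from typing import Iterable

MAX_ANNOTATIONS = 5   # each clip is shown to at most 5 workers
VOTES_TO_DECIDE = 3   # "more than 2 of a specific response"


def clip_status(responses: Iterable[str]) -> str:
    """Classify a clip as 'accepted', 'rejected' or 'pending' from its responses."""
    counts = Counter(responses)
    if counts["yes"] >= VOTES_TO_DECIDE:
        return "accepted"
    # Any other response reaching 3 votes removes the clip from the pool early.
    if any(n >= VOTES_TO_DECIDE for r, n in counts.items() if r != "yes"):
        return "rejected"
    # Assumption: 5 annotations without 3 positives means the clip is not added.
    if sum(counts.values()) >= MAX_ANNOTATIONS:
        return "rejected"
    return "pending"


early_reject = clip_status(["no", "no", "no"])
still_open = clip_status(["yes", "no", "yes"])
accepted = clip_status(["yes", "no", "yes", "unsure", "yes"])
```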

Due to the large scale of the task it was necessary to quickly remove classes that were made up of low quality or completely irrelevant candidates. Failing to do this would have meant that we spent a lot of money paying workers to mark videos as negative or bad. Accuracies for each class were calculated after 20 clips from that class had been annotated. We adjusted the accuracy threshold between runs but would typically start at a high accuracy of 50% (1 in 2 videos were expected to contain the action).
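As an illustration of this early class filter, the sketch below keeps a class in the pool until enough of its clips have been annotated, and then compares its yield with a threshold. Defining class accuracy as the fraction of annotated clips that were accepted is an assumption, as is the function name `keep_class`.

```python
def keep_class(num_accepted: int,
               num_annotated: int,
               min_clips: int = 20,
               threshold: float = 0.5) -> bool:
    """Decide whether a class should stay in the labelling pool."""
    if num_accepted > num_annotated:
        raise ValueError("cannot accept more clips than were annotated")
    if num_annotated < min_clips:
        # Not enough evidence yet: keep collecting annotations.
        return True
    return num_accepted / num_annotated >= threshold


too_early = keep_class(num_accepted=1, num_annotated=5)
good_class = keep_class(num_accepted=12, num_annotated=20)
poor_class = keep_class(num_accepted=4, num_annotated=20)
```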

Following annotating, the video ids, clip times and labels were exported from the database and handed on to be used for model training.

We found that more specific classes like ‘riding mule’ were producing much less noise than more general classes like ‘riding’. However, occasionally using more general classes was a benefit as they could subsequently be split into a few distinct classes that were not previously present and the candidates resent out to workers e.g. ‘gardening’ was split into ‘watering plants’, ‘trimming trees’ and ‘planting trees’.

The amount of worker traffic that the task generated meant that we could not rely on direct fetching and writes to the database even with appropriate indexes and optimised queries. We therefore created many caches which were made up of groups of clips for each worker. When a worker started a new task, the interface would fetch a set of clips for that specific worker. The cache was replenished often by background processes as clips received a sufficient number of annotations. This also prevented labelling collisions, where previously more than one worker might pick up the same video to annotate and we would quickly exceed 5 responses for any one clip.

4 Stage 4: Cleaning up and de-noising

One of the dataset design goals was having a single clip from each given video sequence, different from existing datasets which slice videos containing repetitive actions into many (correlated) training examples. We also employed mechanisms for identifying structural problems as we grew the dataset, such as repeated classes due to synonymy or different word order (e.g. riding motorbike, riding motorcycle), classes that are too general and co-occur with many others (e.g. talking) and which are problematic for typical 1-of-K classification learning approaches (instead of multi-label classification). We will now describe these procedures.

We de-duplicated videos using two complementary approaches. First, in order to have only one clip from each YouTube link, we randomly selected a single clip from amongst those validated by Turkers for that video. This stage filtered out around 20% of Turker-approved examples, but we visually found that it still left many duplicates. The reason is that YouTube users often create videos reusing portions of other videos, for example as part of video compilations or promotional adverts. Sometimes they are cropped, resized and generally pre-processed in different ways (but, nevertheless, the image classifier could localize the same clip). So even though each clip is from a distinct video there were still duplications.

We devised a process for de-duplicating across YouTube links which operated independently for each class. First we computed Inception-V1 feature vectors (taken after last average pooling layer) on $224\times 224$ center crops of 25 uniformly sampled frames from each video, which we then averaged. Afterwards we built a class-wise matrix having all cosine similarities between these feature vectors and thresholded it. Finally, we computed connected components and kept a random example from each. We found this to work well for most classes using the same threshold of 0.97, but adjusted it in a few cases where classes were visually similar, such as some taking place in the snow or in the water. This process reduced the number of Turker-approved examples by a further 15%.
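The per-class de-duplication can be sketched in pure Python as below. It assumes one already-averaged feature vector per video, treats a similarity greater than or equal to the threshold as a link (the text does not say whether the comparison is strict), and uses a simple union-find to obtain connected components. All function names are hypothetical, and the toy 2-D vectors stand in for the Inception-V1 features.

```python
import math
import random
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def duplicate_groups(features: Sequence[Sequence[float]],
                     threshold: float = 0.97) -> List[List[int]]:
    """Connected components of the thresholded similarity graph for one class."""
    parent = list(range(len(features)))

    def find(i: int) -> int:
        # Follow parent links to the component root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if cosine_similarity(features[i], features[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two components

    groups: Dict[int, List[int]] = {}
    for i in range(len(features)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


# Toy features: videos 0 and 1 are near-duplicates, video 2 is distinct.
toy_features = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
groups = duplicate_groups(toy_features)
rng = random.Random(0)
kept = [rng.choice(group) for group in groups]  # one random video per component
```

Because components are transitive, two videos can end up in the same group without being directly similar, provided a chain of near-duplicates connects them.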

Detecting noisy classes.

Classes can be ‘noisy’ in that they may overlap with other classes or they may contain several quite distinct (in terms of the action) groupings due to an ambiguity in the class name. For example, ‘skipping’ can be ‘skipping with a rope’ and also ‘skipping stones across water’. We trained two-stream action classifiers repeatedly throughout the dataset development to identify these noisy classes. This allowed us to find the top confusions for each class, which sometimes were clear even by just verifying the class names (but went unnoticed due to the scale of the dataset), and other times required eyeballing the data to understand if the confusions were alright and the classes were just difficult to distinguish because of shortcomings of the model. We merged, split or outright removed classes based on these detected confusions.
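One simple way to surface such confusions is to count the most frequent (true class, predicted class) pairs among misclassified clips. The sketch below is an assumption about how this could be done, not the procedure actually used, and the labels and predictions are made up for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def top_confusions(true_labels: Sequence[str],
                   predicted_labels: Sequence[str],
                   n: int = 10) -> List[Tuple[str, str, int]]:
    """Return the n most frequent (true, predicted, count) misclassification pairs."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    counts = Counter((t, p) for t, p in zip(true_labels, predicted_labels) if t != p)
    return [(t, p, c) for (t, p), c in counts.most_common(n)]


# Made-up predictions for illustration only.
toy_true = ["long jump", "long jump", "long jump", "skipping", "skipping"]
toy_pred = ["triple jump", "triple jump", "long jump", "skipping", "long jump"]
confusions = top_confusions(toy_true, toy_pred, n=2)
```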

Final filtering.

After all the data was collected, de-duplicated and the classes were selected, we ran a final manual clip filtering stage. Here the class scores from the two-stream model were again useful as they allowed sorting the examples from most confident to least confident – a measure of how prototypical they were. We found that noisy examples were often among the lowest ranked examples and focused on those. The ranking also made adjacent any remaining duplicate videos, which made it easier to filter out those too.

5 Discussion: dataset bias I

We are familiar with the notion of dataset bias leading to lack of generalization: where a classifier trained on one dataset, e.g. Caltech 256, does not perform well when tested on another, e.g. PASCAL VOC. Indeed it is even possible to train a classifier to identify which dataset an image belongs to.

There is another sense of bias which could arise from unbalanced categories within a dataset. For example, gender imbalance in a training set could lead to a corresponding performance bias for classifiers trained on this set. There are precedents for this, e.g. in publicly available face detectors not being race agnostic (https://www.media.mit.edu/posts/media-lab-student-recognized-for-fighting-bias-in-machine-learning/), and more recently in learning a semantic bias in written texts. It is thus an important question as to whether Kinetics leads to such bias.

To this end we carried out a preliminary study on (i) whether the data for each action class of Kinetics is gender balanced, and (ii) if there is an imbalance, whether it leads to a biased performance of the action classifiers.

The outcome of (i) is that in 340 action classes out of the 400, the data is either not dominated by a single gender, or it is mostly not possible to determine the gender – the latter arises in classes where, for example, only hands appear, or the ‘actors’ are too small or heavily clothed. The classes that do show gender imbalance include ‘shaving beard’ and ‘dunking basketball’, that are mostly male, and ‘filling eyebrows’ and ‘cheerleading’, that are mostly female.

The outcome of (ii) is that we found little evidence of classifier bias for action classes with gender imbalance. For example in ‘playing poker’, which tends to have more male players, all videos with female players are correctly classified. The same happens for ‘Hammer throw’. We can conjecture that this lack of bias is because the classifier is able to make use of both the objects involved in an action as well as the motion patterns, rather than simply physical appearance.

Imbalance can also be examined on other ‘axes’, for example age and race. Again, in a preliminary investigation we found very little clear bias. There is one exception where there is clear bias to babies – in ‘crying’, where many of the videos of non-babies crying are misclassified; another example is ‘wrestling’, where the opposite happens: adults wrestling in a ring seem to be better classified than children wrestling in their homes, but it is hard to tell whether the deciding factor is age or the scenes where the actions happen. Nevertheless, these issues of dataset imbalance and any resulting classifier bias warrant a more thorough investigation, and we return to this in section 5.

6 Discussion: dataset bias II

Another type of bias could arise because classifiers are involved in the dataset collection pipeline: it could be that these classifiers lead to a reduction in the visual variety of the clips obtained, which in turn leads to a bias in the action classifier trained on these clips. In more detail, although the videos are selected based on their title (which is provided by the person uploading the video to YouTube), the position of the candidate clip within the video is provided by an image (RGB) classifier, as described above. In practice, using a classifier at this point does not seem to constrain the variety of the clips – since the video is about the action, the particular frame chosen as part of the clip may not be crucial; and, in any case, the clip contains hundreds more frames where the appearance (RGB) and motion can vary considerably. For these reasons we are not so concerned about the intermediate use of image classifiers.

Benchmark Performance

In this section we first briefly describe three standard ConvNet architectures for human action recognition in video. We then use these architectures as baselines and compare their performance by training and testing on the Kinetics dataset. We also include their performance on UCF-101 and HMDB-51.

We consider three typical approaches for video classification: ConvNets with an LSTM on top; two-stream networks; and a 3D ConvNet. There have been many improvements over these basic architectures, but our intention here is not to perform a thorough study on what is the very best architecture on Kinetics, but instead to provide an indication of the level of difficulty of the dataset. A rough graphical overview of the three types of architectures we compare is shown in figure 3, and the specification of their temporal interfaces is given in table 3.

For the experiments on the Kinetics dataset all three architectures are trained from scratch using Kinetics. However, for the experiments on UCF-101 and HMDB-51 the architectures (apart from the 3D ConvNet) are pre-trained on ImageNet (since these datasets are too small to train the architectures from scratch).

1 ConvNet+LSTM

The high performance of image classification networks makes it appealing to try to reuse them with as minimal change as possible for video. This can be achieved by using them to extract features independently from each frame then pooling their predictions across the whole video. This is in the spirit of bag of words image modeling approaches, but while convenient in practice, it has the issue of entirely ignoring temporal structure (e.g. models potentially can’t distinguish opening from closing a door).

In theory, a more satisfying approach is to add a recurrent layer to the model, such as an LSTM, which can encode state, and capture temporal ordering and long range dependencies. We position an LSTM layer with batch normalization (as proposed by Cooijmans et al.) after the last average pooling layer of a ResNet-50 model, with 512 hidden units. We then add a fully connected layer on top of the output of the LSTM for the multi-way classification. At test time the classification is taken from the model output for the last frame.

2 Two-Stream networks

LSTMs on features from the last layers of ConvNets can model high-level variation, but may not be able to capture fine low-level motion which is critical in many cases. They are also expensive to train, as training requires unrolling the network through multiple frames for backpropagation-through-time.

A different, very practical approach, introduced by Simonyan and Zisserman, models short temporal snapshots of videos by averaging the predictions from a single RGB frame and a stack of $10$ externally computed optical flow frames, after passing them through two replicas of an ImageNet-pretrained ConvNet. The flow stream has an adapted input convolutional layer with twice as many input channels as flow frames (because flow has two channels, horizontal and vertical), and at test time multiple snapshots are sampled from the video and the action prediction is averaged. This was shown to get very high performance on existing benchmarks, while being very efficient to train and test.
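The fusion step can be illustrated with a short sketch that averages the two streams for each snapshot and then averages across snapshots. It assumes each stream already outputs a vector of class scores per snapshot; the function names and toy scores are hypothetical, and the networks themselves are not modelled.

```python
from typing import List, Sequence

FLOW_FRAMES = 10
# Each flow frame has a horizontal and a vertical component, so the flow
# stream's first convolution sees 2 * 10 = 20 input channels.
FLOW_INPUT_CHANNELS = 2 * FLOW_FRAMES


def mean_vectors(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized score vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def two_stream_scores(rgb_snapshots: Sequence[Sequence[float]],
                      flow_snapshots: Sequence[Sequence[float]]) -> List[float]:
    """Average RGB and flow scores per snapshot, then average over snapshots."""
    if len(rgb_snapshots) != len(flow_snapshots):
        raise ValueError("both streams must score the same snapshots")
    fused = [mean_vectors([rgb, flow])
             for rgb, flow in zip(rgb_snapshots, flow_snapshots)]
    return mean_vectors(fused)


# Made-up scores for two snapshots and two classes.
rgb = [[0.6, 0.4], [0.2, 0.8]]
flow = [[0.2, 0.8], [0.4, 0.6]]
video_prediction = two_stream_scores(rgb, flow)
```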

3 3D ConvNets

3D ConvNets seem like a natural approach to video modeling. They are just like standard 2D convolutional networks, but with spatio-temporal filters, and have a very interesting characteristic: they directly create hierarchical representations of spatio-temporal data. One issue with these models is that they have many more parameters than 2D ConvNets because of the additional kernel dimension, and this makes them harder to train. Also, they seem to preclude the benefits of ImageNet pre-training and previous work has defined relatively shallow custom architectures and trained them from scratch. Results on benchmarks have shown promise but have not yet matched the state-of-the-art, possibly because they require more training data than their 2D counterparts. Thus 3D ConvNets are a good candidate for evaluation on our larger dataset.

For this paper we implemented a small variation of C3D, which has $8$ convolutional layers, $5$ pooling layers and $2$ fully connected layers at the top. The inputs to the model are short $16$-frame clips with $112\times 112$-pixel crops. Differently from the original paper we use batch normalization after all convolutional and fully connected layers. Another difference to the original model is in the first pooling layer, where we use a temporal stride of $2$ instead of $1$, which reduces the memory footprint and allows for bigger batches – this was important for batch normalization (especially after the fully connected layers, where there is no weight tying). Using this stride we were able to train with 15 videos per batch per GPU using standard K40 GPUs.

At test time, we split the video uniformly into crops of 16 frames and apply the classifier separately on each. We then average the class scores, as in the original paper.
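A minimal sketch of this test-time procedure is given below. It assumes consecutive, non-overlapping 16-frame clips and drops any trailing partial clip, which the text does not specify; `dummy_classifier` is a stand-in for the trained network and the frame list is a placeholder.

```python
from typing import Callable, List, Sequence, Tuple

CLIP_LEN = 16


def clip_ranges(num_frames: int, clip_len: int = CLIP_LEN) -> List[Tuple[int, int]]:
    """(start, end) frame indices of consecutive non-overlapping clips."""
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, clip_len)]


def video_scores(frames: Sequence,
                 classifier: Callable[[Sequence], Sequence[float]],
                 clip_len: int = CLIP_LEN) -> List[float]:
    """Apply the classifier to every clip and average the class scores."""
    per_clip = [classifier(frames[s:e]) for s, e in clip_ranges(len(frames), clip_len)]
    if not per_clip:
        raise ValueError("video is shorter than one clip")
    return [sum(column) / len(per_clip) for column in zip(*per_clip)]


def dummy_classifier(clip: Sequence[int]) -> List[float]:
    # Stand-in for the network: votes for class 0 on early clips, class 1 later.
    return [1.0, 0.0] if clip[0] < 128 else [0.0, 1.0]


frames = list(range(250))          # 10 seconds at 25 frames per second
ranges = clip_ranges(len(frames))
averaged = video_scores(frames, dummy_classifier)
```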

4 Implementation details

The ConvNet+LSTM and Two-Stream architectures use ResNet-50 as the base architecture. In the case of the Two-Stream architecture, a separate ResNet-50 is trained independently for each stream. As noted earlier, for these architectures the ResNet-50 model is pre-trained on ImageNet for the experiments on UCF-101 and HMDB-51, and trained from scratch for experiments on Kinetics. The 3D-ConvNet is not pre-trained.

We trained the models on videos using standard SGD with momentum in all cases, with synchronous parallelization across 64 GPUs for all models. We trained models on Kinetics for up to 100k steps, with a 10x reduction of learning rate when validation loss saturated, and tuned weight decay and learning rate hyperparameters on the validation set of Kinetics. All the models were implemented in TensorFlow.

The original clips have variable resolution and frame rate. In our experiments they are all normalized so that the larger image side is 340 pixels wide for models using ResNet-50 and 128 pixels wide for the 3D ConvNet. We also resample the videos so they have 25 frames per second.
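The spatial normalization can be sketched as a computation of the target frame size. This assumes the aspect ratio is preserved and the result is rounded to whole pixels, neither of which is stated in the text; the function name `resized_dims` is hypothetical.

```python
from typing import Tuple


def resized_dims(width: int, height: int, target_larger_side: int = 340) -> Tuple[int, int]:
    """Scale (width, height) so that the larger side equals target_larger_side."""
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    scale = target_larger_side / max(width, height)
    return round(width * scale), round(height * scale)


landscape = resized_dims(1280, 720)            # ResNet-50 based models
portrait = resized_dims(480, 640)
small_3d = resized_dims(1280, 720, target_larger_side=128)  # 3D ConvNet
```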

Data augmentation is known to be of crucial importance for the performance of deep architectures. We used random cropping both spatially – randomly cropping a $299\times 299$ patch (respectively $112\times 112$ for the 3D ConvNet) – and temporally, when picking the starting frame among those early enough to guarantee a desired number of frames. For shorter videos, we looped the video as many times as necessary to satisfy each model’s input interface. We also applied random left-right flipping consistently for each video during training.
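The sampling side of this augmentation can be sketched as follows. The sketch only draws frame indices, a crop offset and a flip decision, and does not touch pixel data. The 0.5 flip probability and the function names are assumptions, and a single flip decision is drawn per video so that it can be applied to all of its frames.

```python
import random
from typing import List, Tuple


def sample_frame_indices(num_frames: int, clip_len: int, rng: random.Random) -> List[int]:
    """Indices of clip_len consecutive frames, looping the video if it is too short."""
    if num_frames <= 0:
        raise ValueError("video has no frames")
    # Latest starting frame that still leaves clip_len frames (0 for short videos).
    last_start = max(0, num_frames - clip_len)
    start = rng.randint(0, last_start)
    return [(start + i) % num_frames for i in range(clip_len)]


def sample_crop_and_flip(width: int, height: int, crop: int,
                         rng: random.Random) -> Tuple[int, int, bool]:
    """Random top-left corner of a crop x crop patch and one flip decision per video."""
    if width < crop or height < crop:
        raise ValueError("frame is smaller than the crop")
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    flip = rng.random() < 0.5  # assumed probability; reused for every frame of the video
    return left, top, flip


rng = random.Random(0)
looped = sample_frame_indices(num_frames=10, clip_len=16, rng=rng)
window = sample_frame_indices(num_frames=250, clip_len=64, rng=rng)
left, top, flip = sample_crop_and_flip(340, 340, 299, rng)
```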

At test time, we sample from up to 10 seconds of video, again looping if necessary. Better performance could be obtained by also considering left-right flipped videos at test time and by adding additional augmentation, such as photometric, during training. We leave this to future work.

5 Baseline evaluations

In this section we compare the performance of the three baseline architectures whilst varying the dataset used for training and testing.

Table 4 shows the classification accuracy when training and testing on either UCF-101, HMDB-51 or Kinetics. We train and test on split 1 of UCF-101 and HMDB-51, and on the train/val set and held-out test set of Kinetics.

There are several noteworthy observations. First, the performance is far lower on Kinetics than on UCF-101, an indication of the different levels of difficulty of the two datasets. On the other hand, the performance on HMDB-51 is worse than on Kinetics – it seems to have a truly difficult test set, and it was designed to be difficult for appearance-centered methods, while having little training data. The parameter-rich 3D-ConvNet model is not pre-trained on ImageNet, unlike the other baselines. This translates into poor performance on all datasets but especially on UCF-101 and HMDB-51 – on Kinetics it is much closer to the performance of the other models, thanks to the much larger training set of Kinetics.

Class difficulty. We include a full list of Kinetics classes sorted by classification accuracy under the two-stream model in figure 4. Eating classes are among the hardest, as they sometimes require distinguishing what is being eaten, such as hotdogs, chips and doughnuts – and these may appear small and already partially consumed, in the video. Dancing classes are also hard, as well as classes centered on a specific body part, such as “massaging feet”, or “shaking head”.

Class confusion. The top 10 class confusions are provided in table 5. They mostly correspond to fine-grained distinctions that one would expect to be hard, for example confusing ‘long jump’ with ‘triple jump’, or burgers with doughnuts. The confusion between ‘swing dancing’ and ‘salsa dancing’ raises the question of how accurate motion modeling is in the two-stream model, since ‘swing dancing’ is typically much faster-paced and has a peculiar style that makes it easy for humans to distinguish from salsa.

Classes where motion matters most. We tried to analyze for which classes motion is more important and which ones were recognized correctly using just appearance information, by comparing the recognition accuracy ratios when using the flow and RGB streams of the two-stream model in isolation. We show the five classes where this ratio is largest and smallest in table 6.
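The ratio-based ranking can be sketched as below. The per-class accuracies and class names are made up for illustration and are not taken from table 6; the function name `rank_by_flow_rgb_ratio` is hypothetical, and classes with zero RGB accuracy are skipped to avoid division by zero.

```python
from typing import Dict, List, Tuple


def rank_by_flow_rgb_ratio(rgb_acc: Dict[str, float],
                           flow_acc: Dict[str, float]) -> List[Tuple[str, float]]:
    """Classes sorted by flow/RGB accuracy ratio, largest (most motion-dependent) first."""
    ratios = {c: flow_acc[c] / rgb_acc[c]
              for c in rgb_acc if c in flow_acc and rgb_acc[c] > 0}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Made-up accuracies for three hypothetical classes.
toy_rgb = {"class_a": 0.20, "class_b": 0.80, "class_c": 0.50}
toy_flow = {"class_a": 0.60, "class_b": 0.40, "class_c": 0.50}
ranking = rank_by_flow_rgb_ratio(toy_rgb, toy_flow)
most_motion = ranking[0][0]    # flow helps most relative to RGB
least_motion = ranking[-1][0]  # appearance alone is mostly sufficient
```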

Conclusion

We have described the Kinetics Human Action Video dataset, which has an order of magnitude more videos than previous datasets of its type. We have also discussed the procedures we employed in collecting the data and for ensuring its quality. We have shown that the performance of standard existing models on this dataset is much lower than on UCF-101 and on par with HMDB-51, whilst allowing large models such as 3D ConvNets to be trained from scratch, unlike the existing human action datasets.

We have also carried out a preliminary analysis of dataset imbalance and whether this leads to bias in the classifiers trained on the dataset. We found little evidence that the resulting classifiers demonstrate bias along sensitive axes, such as gender. This is, however, a complex area that deserves further attention. We leave a thorough analysis for future work, in collaboration with specialists from complementary areas, namely social scientists and critical humanists.

We will release trained baseline models (in TensorFlow), so that they can be used, for example, to generate features for new action classes.

The collection of this dataset was funded by DeepMind. We are very grateful for help from Andreas Kirsch, John-Paul Holt, Danielle Breen, Jonathan Fildes, James Besley and Brian Carver. We are grateful for advice and comments from Tom Duerig, Juan Carlos Niebles, Simon Osindero, Chuck Rosenberg and Sean Legassick; we would also like to thank Sandra and Aditya for data clean-up.

References

Appendix A List of Kinetics Human Action Classes

This is the list of classes included in the human action video dataset. The number of clips for each action class is given by the number in brackets following each class name.

passing American football (in game) (863)

passing American football (not in game) (1045)

skiing (not slalom or crosscountry) (1140)

using remote controller (not gaming) (549)

Appendix B List of Parent-Child Groupings

These lists are not exclusive and are not intended to be comprehensive. Rather, they are a guide for related human action classes.

arts and crafts (12): arranging flowers, blowing glass, brush painting, carving pumpkin, clay pottery making, decorating the christmas tree, drawing, getting a tattoo, knitting, making jewelry, spray painting, weaving basket

athletics – jumping (6): high jump, hurdling, long jump, parkour, pole vault, triple jump

athletics – throwing + launching (9): archery, catching or throwing frisbee, disc golfing, hammer throw, javelin throw, shot put, throwing axe, throwing ball, throwing discus

auto maintenance (4): changing oil, changing wheel, checking tires, pumping gas

ball sports (25): bowling, catching or throwing baseball, catching or throwing softball, dodgeball, dribbling basketball, dunking basketball, golf chipping, golf driving, golf putting, hitting baseball, hurling (sport), juggling soccer ball, kicking field goal, kicking soccer ball, passing American football (in game), passing American football (not in game), playing basketball, playing cricket, playing kickball, playing squash or racquetball, playing tennis, playing volleyball, shooting basketball, shooting goal (soccer), shot put

body motions (16): air drumming, applauding, baby waking up, bending back, clapping, cracking neck, drumming fingers, finger snapping, headbanging, headbutting, pumping fist, shaking head, stretching arm, stretching leg, swinging legs

cleaning (13): cleaning floor, cleaning gutters, cleaning pool, cleaning shoes, cleaning toilet, cleaning windows, doing laundry, making bed, mopping floor, setting table, shining shoes, sweeping floor, washing dishes

cloths (8): bandaging, doing laundry, folding clothes, folding napkins, ironing, making bed, tying bow tie, tying knot (not on a tie), tying tie

communication (11): answering questions, auctioning, bartending, celebrating, crying, giving or receiving award, laughing, news anchoring, presenting weather forecast, sign language interpreting, testifying

cooking (22): baking cookies, barbequing, breading or breadcrumbing, cooking chicken, cooking egg, cooking on campfire, cooking sausages, cutting pineapple, cutting watermelon, flipping pancake, frying vegetables, grinding meat, making a cake, making a sandwich, making pizza, making sushi, making tea, peeling apples, peeling potatoes, picking fruit, scrambling eggs, tossing salad

dancing (18): belly dancing, breakdancing, capoeira, cheerleading, country line dancing, dancing ballet, dancing charleston, dancing gangnam style, dancing macarena, jumpstyle dancing, krumping, marching, robot dancing, salsa dancing, swing dancing, tango dancing, tap dancing, zumba

eating + drinking (17): bartending, dining, drinking, drinking beer, drinking shots, eating burger, eating cake, eating carrots, eating chips, eating doughnuts, eating hotdog, eating ice cream, eating spaghetti, eating watermelon, opening bottle, tasting beer, tasting food

electronics (5): assembling computer, playing controller, texting, using computer, using remote controller (not gaming)

garden + plants (10): blowing leaves, carving pumpkin, chopping wood, climbing tree, decorating the christmas tree, egg hunting, mowing lawn, planting trees, trimming trees, watering plants

golf (3): golf chipping, golf driving, golf putting

gymnastics (5): bouncing on trampoline, cartwheeling, gymnastics tumbling, somersaulting, vault

hair (14): braiding hair, brushing hair, curling hair, dying hair, fixing hair, getting a haircut, shaving head, shaving legs, trimming or shaving beard, washing hair, waxing back, waxing chest, waxing eyebrows, waxing legs

hands (9): air drumming, applauding, clapping, cutting nails, doing nails, drumming fingers, finger snapping, pumping fist, washing hands

head + mouth (17): balloon blowing, beatboxing, blowing nose, blowing out candles, brushing teeth, gargling, headbanging, headbutting, shaking head, singing, smoking, smoking hookah, sneezing, sniffing, sticking tongue out, whistling, yawning

heights (15): abseiling, bungee jumping, climbing a rope, climbing ladder, climbing tree, diving cliff, ice climbing, jumping into pool, paragliding, rock climbing, skydiving, slacklining, springboard diving, swinging on something, trapezing

interacting with animals (19): bee keeping, catching fish, feeding birds, feeding fish, feeding goats, grooming dog, grooming horse, holding snake, ice fishing, milking cow, petting animal (not cat), petting cat, riding camel, riding elephant, riding mule, riding or walking with horse, shearing sheep, training dog, walking the dog

juggling (6): contact juggling, hula hooping, juggling balls, juggling fire, juggling soccer ball, spinning poi

makeup (5): applying cream, doing nails, dying hair, filling eyebrows, getting a tattoo

martial arts (10): arm wrestling, capoeira, drop kicking, high kick, punching bag, punching person, side kick, sword fighting, tai chi, wrestling

miscellaneous (9): digging, extinguishing fire, garbage collecting, laying bricks, moving furniture, spraying, stomping grapes, tapping pen, unloading truck

mobility – land (20): crawling baby, driving car, driving tractor, faceplanting, hoverboarding, jogging, motorcycling, parkour, pushing car, pushing cart, pushing wheelchair, riding a bike, riding mountain bike, riding scooter, riding unicycle, roller skating, running on treadmill, skateboarding, surfing crowd, using segway, waiting in line

mobility – water (10): crossing river, diving cliff, jumping into pool, scuba diving, snorkeling, springboard diving, swimming backstroke, swimming breast stroke, swimming butterfly stroke, water sliding

music (29): beatboxing, busking, playing accordion, playing bagpipes, playing bass guitar, playing cello, playing clarinet, playing cymbals, playing didgeridoo, playing drums, playing flute, playing guitar, playing harmonica, playing harp, playing keyboard, playing organ, playing piano, playing recorder, playing saxophone, playing trombone, playing trumpet, playing ukulele, playing violin, playing xylophone, recording music, singing, strumming guitar, tapping guitar, whistling

paper (12): bookbinding, counting money, folding napkins, folding paper, opening present, reading book, reading newspaper, ripping paper, shredding paper, unboxing, wrapping present, writing

personal hygiene (6): brushing teeth, taking a shower, trimming or shaving beard, washing feet, washing hair, washing hands

playing games (13): egg hunting, flying kite, hopscotch, playing cards, playing chess, playing monopoly, playing paintball, playing poker, riding mechanical bull, rock scissors paper, shuffling cards, skipping rope, tossing coin

racquet + bat sports (8): catching or throwing baseball, catching or throwing softball, hitting baseball, hurling (sport), playing badminton, playing cricket, playing squash or racquetball, playing tennis

snow + ice (18): biking through snow, bobsledding, hockey stop, ice climbing, ice fishing, ice skating, making snowman, playing ice hockey, shoveling snow, ski jumping, skiing (not slalom or crosscountry), skiing crosscountry, skiing slalom, sled dog racing, snowboarding, snowkiting, snowmobiling, tobogganing

swimming (3): swimming backstroke, swimming breast stroke, swimming butterfly stroke

touching person (11): carrying baby, hugging, kissing, massaging back, massaging feet, massaging legs, massaging person’s head, shaking hands, slapping, tickling

using tools (13): bending metal, blasting sand, building cabinet, building shed, changing oil, changing wheel, checking tires, plastering, pumping gas, sanding floor, sharpening knives, sharpening pencil, welding

water sports (8): canoeing or kayaking, jetskiing, kitesurfing, parasailing, sailing, surfing water, water skiing, windsurfing

waxing (4): waxing back, waxing chest, waxing eyebrows, waxing legs