PyTorchVideo: A Deep Learning Library for Video Understanding

Haoqi Fan, Tullie Murrell, Heng Wang, Kalyan Vasudev Alwala, Yanghao Li, Yilei Li, Bo Xiong, Nikhila Ravi, Meng Li, Haichuan Yang, Jitendra Malik, Ross Girshick, Matt Feiszli, Aaron Adcock, Wan-Yen Lo, Christoph Feichtenhofer

Introduction

Recording, storing, and viewing videos has become an ordinary part of our lives; in 2022, video traffic will amount to 82 percent of all Internet traffic (Cisco, 2020). With the increasing amount of video material readily available (e.g. on the web), it is now more important than ever to develop ML frameworks for video understanding.

With the rise of deep learning, significant progress has been made in video understanding research, with novel neural network architectures, better training recipes, advanced data augmentation, and model acceleration techniques. However, the sheer amount of data that video brings often makes these tasks computationally demanding; therefore efficient solutions are non-trivial to implement.

To date, there exist several popular video understanding frameworks that provide implementations of state-of-the-art video models, including PySlowFast (Fan et al., 2020), MMAction (Yue Zhao, 2019), MMAction2 (Contributors, 2020), and Gluon-CV (Guo et al., 2020). However, unlike a modularized library that can be imported into different projects, all of these frameworks are designed around training workflows, which limits their adoption beyond applications tailored to one specific codebase.

More specifically, we see the following limitations in prior efforts. First, reproducibility – an important requirement for deep learning software – varies across frameworks; e.g. identical models are reproduced with varying accuracy on different frameworks (Fan et al., 2020; Yue Zhao, 2019; Contributors, 2020; Guo et al., 2020). Second, regarding input modalities, the frameworks are mainly focused on visual-only data streams. Third, supported tasks only encompass human action classification and detection. Fourth, none of the existing codebases support on-device acceleration for real-time inference on mobile hardware.

We believe that a modular, component-focused video understanding library that addresses the aforementioned limitations will strongly support the video research community. Our aim is to provide fast and easily extensible components that benefit researchers and practitioners in academia and industry.

We present PyTorchVideo – an efficient, modular and reproducible deep learning library for video understanding which supports the following (see Fig. 1 for an overview):

a modular design with extendable interface for video modeling using Python

a full stack of video understanding machine learning components from established datasets to state-of-the-art models

real-time video classification through hardware accelerated on-device support

multiple tasks, including human action classification and detection, self-supervised learning, and low-level vision tasks

reproducible models and datasets, benchmarked in a comprehensive model zoo

multiple input modalities, including visual, audio, optical-flow and IMU data

PyTorchVideo is distributed under an Apache 2.0 License, and is available on GitHub at https://github.com/facebookresearch/pytorchvideo.

Library Design

Our library follows four design principles, outlined next (§2.1-2.4).

1. Modularity

PyTorchVideo is built to be component-centric: it provides independent components that are plug-and-play and ready to mix-and-match for any research or production use case. We achieve this by designing models, datasets and data transformations (transforms) independently, only enforcing consistency through general argument naming guidelines. For example, in the pytorchvideo.data module all datasets provide a data_path argument, and in the pytorchvideo.models module any reference to input dimensions uses the name dim_in. This form of duck-typing provides flexibility and straightforward extensibility for new use cases.

2. Compatibility

PyTorchVideo is designed to be compatible with other frameworks and domain specific libraries. In contrast to existing video frameworks (Fan et al., 2020; Yue Zhao, 2019; Contributors, 2020; Guo et al., 2020), PyTorchVideo does not rely on a configuration system. To maximize the compatibility with Python based frameworks that can have arbitrary config-systems, PyTorchVideo uses keyword arguments in Python as a “naive configuration” system.
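The "naive configuration" idea can be illustrated with a minimal sketch in plain Python. The factory `create_toy_classifier` and all of its parameters are hypothetical and are not part of the PyTorchVideo API; the point is only that any external config system can drive a keyword-argument interface once its values are available as a mapping.

```python
# Hypothetical factory: the name and parameters are illustrative only,
# they are not taken from the PyTorchVideo API.
def create_toy_classifier(*, dim_in=3, dim_out=400, depth=50, dropout_rate=0.5):
    """Return a plain description of the model that would be built."""
    return {
        "dim_in": dim_in,
        "dim_out": dim_out,
        "depth": depth,
        "dropout_rate": dropout_rate,
    }


# A configuration coming from any external system (YAML file, argparse,
# a project-specific config object, ...) only has to end up as a mapping
# from argument names to values.
external_config = {"dim_out": 101, "dropout_rate": 0.2}

# Keyword arguments act as the configuration; unspecified values keep
# the defaults declared in the function signature.
model_spec = create_toy_classifier(**external_config)
print(model_spec)
```

Because the defaults live in the function signature, a caller overrides only what differs from the reference setting, and a misspelled key fails immediately with a `TypeError` instead of being silently ignored.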

PyTorchVideo is designed to be interoperable with other standard domain specific libraries by setting canonical modality based tensor types. For videos, we expect a tensor of shape [..., C, T, H, W], where T × H × W are the spatiotemporal dimensions and C is the number of color channels, allowing any TorchVision model or transform to be used together with PyTorchVideo. For raw audio waveforms, we expect a tensor of shape [..., T], where T is the temporal dimension, and for spectrograms, we expect a tensor of shape [..., T, F], where T is time and F is frequency, aligning with TorchAudio.
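To make the shape conventions concrete, the following sketch splits a shape into its leading (batch-like) dimensions and its named canonical dimensions. The helper `describe_layout` is hypothetical and works on plain tuples of integers, so it assumes nothing about the tensor library beyond a shape that can be converted to a tuple.

```python
# Trailing dimension names per modality, following the conventions in the text.
CANONICAL_DIMS = {
    "video": ("C", "T", "H", "W"),   # channels, time, height, width
    "waveform": ("T",),              # time
    "spectrogram": ("T", "F"),       # time, frequency
}


def describe_layout(shape, modality):
    """Split `shape` into leading dims and a dict of named canonical dims.

    Hypothetical helper for illustration; not part of PyTorchVideo.
    """
    names = CANONICAL_DIMS[modality]
    if len(shape) < len(names):
        raise ValueError(
            f"{modality} needs at least {len(names)} dims, got {len(shape)}"
        )
    split = len(shape) - len(names)
    return tuple(shape[:split]), dict(zip(names, shape[split:]))


# A batch of 8 RGB clips, each with 16 frames at 224x224 resolution.
print(describe_layout((8, 3, 16, 224, 224), "video"))
# A single unbatched one-second waveform sampled at 16 kHz.
print(describe_layout((16000,), "waveform"))
```

The leading `...` in each convention is what allows the same component to accept both batched and unbatched inputs.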

3. Customizability

One of PyTorchVideo’s primary use cases is supporting the latest research methods; we want researchers to easily contribute their work without requiring refactoring and architecture modifications. To achieve this, we designed the library to reduce overhead for adding new components or sub-modules. Notably, in the pytorchvideo.models module, we use a dependency-injection-inspired API. We have a composable interface, which contains injectable skeleton classes, and a factory function interface that builds reproducible implementations using the composable classes. We anticipate this injectable class design to be useful for researchers who want to easily plug new sub-components (e.g. a new type of convolution) into the structure of larger models such as a ResNet (He et al., 2016) or SlowFast (Feichtenhofer et al., 2019). The factory functions are more suitable for reproducible benchmarking of complete models, or usage in production. An example for a customized SlowFast network is in Algorithm 1.
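The split between injectable skeleton classes and factory functions can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the library's code: the names `Net` and `create_toy_net` are hypothetical, and the "layers" are ordinary callables on numbers rather than neural network modules.

```python
class Net:
    """Skeleton class: fixes only the order stem -> stages -> head.

    Every sub-component is injected by the caller.
    """

    def __init__(self, stem, stages, head):
        self.stem = stem
        self.stages = list(stages)
        self.head = head

    def __call__(self, x):
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
        return self.head(x)


def create_toy_net(*, stem=None, stage_fn=None, depth=3, head=None):
    """Factory function: builds a default configuration of the skeleton."""
    if stem is None:
        stem = lambda x: x + 1       # default "stem"
    if stage_fn is None:
        stage_fn = lambda x: x * 2   # default "stage"
    if head is None:
        head = lambda x: x           # default "head" (identity)
    return Net(stem, [stage_fn] * depth, head)


# The factory with defaults gives the reference configuration ...
default_net = create_toy_net()
# ... and a researcher can swap a single sub-component without touching Net.
custom_net = create_toy_net(stage_fn=lambda x: x * 3)

print(default_net(1))  # (1 + 1) * 2 * 2 * 2 = 16
print(custom_net(1))   # (1 + 1) * 3 * 3 * 3 = 54
```

The skeleton never constructs its own sub-components, which is what makes a new building block a one-argument change rather than a subclass or a fork of the model definition.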

4. Reproducibility

PyTorchVideo maintains reproducible implementations of all models and datasets. Each component is benchmarked against the performance reported in the respective original publication. We report performance and release model files online (numbers and model weights can be found at https://github.com/facebookresearch/pytorchvideo/blob/master/docs/source/model_zoo.md) as well as on PyTorch Hub (https://pytorch.org/hub/). We rely on test coverage and recurrent benchmark jobs to verify and monitor performance and to detect potential regressions introduced by codebase updates.

Library Components

PyTorchVideo allows training of state-of-the-art models on multi-modal input data, and deployment of an accelerated real-time model on mobile devices. Example components are shown in Algorithm 2.

1. Data

Video contains rich information streams from various sources, and, in comparison to image understanding, video is more computationally demanding. PyTorchVideo provides a modular and efficient data loader to decode visual, motion (optical-flow), acoustic, and Inertial Measurement Unit (IMU) information from raw video.

PyTorchVideo supports a growing list of data loaders for popular video datasets across different tasks: video classification for UCF-101 (Soomro et al., 2012), HMDB-51 (Kuehne et al., 2011), Kinetics (Kay et al., 2017), Charades (Sigurdsson et al., 2016), and Something-Something (Goyal et al., 2017), egocentric tasks for Epic-Kitchens (Damen et al., 2018) and DomSev (Silva et al., 2018), as well as video detection in AVA (Gu et al., 2018).

All data loaders support several file formats and are data storage agnostic. For encoded video datasets (e.g. videos stored in mp4 files), we provide PyAV, TorchVision, and Decord decoders. For long videos – when decoding is an overhead – PyTorchVideo provides support for pre-decoded video datasets in the form of image files.

2. Transforms

Transforms, a key component for improving model generalization, are designed to be flexible and easy to use in PyTorchVideo. PyTorchVideo provides factory transforms that include common recipes for training state-of-the-art video models (Feichtenhofer et al., 2019; Feichtenhofer, 2020; Fan et al., 2021). Recent data augmentations are also provided by the library (e.g. MixUp, CutMix, RandAugment (Cubuk et al., 2020), and AugMix (Hendrycks et al., 2020)). Finally, users have the option to create custom transforms by composing individual ones.
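Composing individual transforms into a custom pipeline can be sketched as follows. This is a simplified stand-in written in plain Python, assuming a clip is a list of frames and a frame is a list of pixel values; `Compose`, `uniform_temporal_subsample` and `divide_pixels` are defined here for illustration and are not the library's implementations.

```python
class Compose:
    """Apply a sequence of transforms one after another."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, clip):
        for transform in self.transforms:
            clip = transform(clip)
        return clip


def uniform_temporal_subsample(num_samples):
    """Keep `num_samples` frames spread evenly from first to last frame."""

    def transform(frames):
        last = len(frames) - 1
        if num_samples == 1:
            indices = [0]
        else:
            # Integer arithmetic avoids floating-point rounding surprises.
            indices = [(i * last) // (num_samples - 1) for i in range(num_samples)]
        return [frames[i] for i in indices]

    return transform


def divide_pixels(divisor):
    """Rescale every pixel value, e.g. from [0, 255] to [0, 1]."""

    def transform(frames):
        return [[pixel / divisor for pixel in frame] for frame in frames]

    return transform


# Toy clip: 8 frames with 2 "pixels" each.
clip = [[t, 255] for t in range(8)]

pipeline = Compose([uniform_temporal_subsample(4), divide_pixels(255.0)])
out = pipeline(clip)
print(len(out), out[0])
```

Each transform is an ordinary callable from clip to clip, so a custom augmentation only needs to respect that contract to be inserted anywhere in the pipeline.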

3. Models

PyTorchVideo contains highly reproducible implementations of popular models and backbones for video classification, acoustic event detection, human action localization (detection) in video, as well as self-supervised learning algorithms.

The current set of models includes standard single stream video backbones such as C2D (Wang et al., 2018), I3D (Wang et al., 2018), Slow-only (Feichtenhofer et al., 2019) for RGB frames and acoustic ResNet (Xiao et al., 2020) for audio signal, as well as efficient video networks such as SlowFast (Feichtenhofer et al., 2019), CSN (Tran et al., 2019), R(2+1)D (Tran et al., 2018), and X3D (Feichtenhofer, 2020) that provide state-of-the-art performance. PyTorchVideo also provides multipathway architectures such as Audiovisual SlowFast networks (Xiao et al., 2020) which enable state-of-the-art performance by disentangling spatial, temporal, and acoustic signals across different pathways.

It further supports methods for low-level vision tasks for researchers to build on the latest trends in video representation learning.

PyTorchVideo models can be used in combination with different downstream tasks: supervised classification and detection of human actions in video (Feichtenhofer et al., 2019), as well as self-supervised (i.e. unsupervised) video representation learning with Momentum Contrast (He et al., 2021), SimCLR (Chen et al., 2020), and Bootstrap Your Own Latent (Grill et al., 2020).

4. Accelerator

PyTorchVideo provides a complete environment (Accelerator) for hardware-aware design and deployment of models for fast inference on device, including efficient blocks and kernel optimization. The deployment flow is illustrated in Figure 2.

Specifically, we perform kernel-level latency optimization for common kernels in video understanding models (e.g. conv3d). This optimization brings two benefits: (1) latency of these kernels for floating-point inference is significantly reduced; (2) quantized operation (int8) of these kernels is enabled, which is not supported for mobile devices by vanilla PyTorch. Our Accelerator provides a set of efficient blocks that build upon these optimized kernels, with their low latency validated by on-device profiling. In addition, our Accelerator provides kernel optimization in the deployment flow, which can automatically replace modules in the original model with efficient blocks that perform equivalent operations. Overall, the PyTorchVideo Accelerator provides a complete environment for hardware-aware model design and deployment for fast inference.
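The idea of automatically replacing modules with equivalent efficient blocks can be sketched as a recursive walk over a module tree. Everything below is a toy under stated assumptions: `Module`, `Conv3d`, `EfficientConv3d` and `replace_modules` are hypothetical stand-ins, no weights are transferred, and numerical equivalence of the swapped blocks is simply assumed. In PyTorch, a comparable walk would typically iterate over `named_children()` of an `nn.Module`.

```python
class Module:
    """Toy module: just a named collection of child modules."""

    def __init__(self, **children):
        self.children = dict(children)


class Conv3d(Module):
    """Stand-in for a generic 3D convolution."""


class EfficientConv3d(Module):
    """Hypothetical stand-in for a latency-optimized equivalent block."""


def replace_modules(module, replacements):
    """Recursively swap children whose type has a registered replacement."""
    for name, child in list(module.children.items()):
        replace_modules(child, replacements)          # handle nested modules first
        new_type = replacements.get(type(child))
        if new_type is not None:
            # Rebuild the child as the efficient type, keeping its sub-tree.
            module.children[name] = new_type(**child.children)
    return module


def type_tree(module):
    """Return a nested {name: (type_name, subtree)} view for inspection."""
    return {
        name: (type(child).__name__, type_tree(child))
        for name, child in module.children.items()
    }


model = Module(stem=Conv3d(), block=Module(conv=Conv3d(), act=Module()))
replace_modules(model, {Conv3d: EfficientConv3d})
print(type_tree(model))
```

Keeping the replacement table separate from the traversal means that supporting a new efficient block only requires a new table entry, not a change to the deployment logic.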

Benchmarks

PyTorchVideo supports popular video understanding tasks, such as video classification (Kay et al., 2017; Goyal et al., 2017; Feichtenhofer et al., 2019), action detection (Gu et al., 2018), video self-supervised learning (Feichtenhofer et al., 2021) and efficient video understanding on mobile hardware. We provide comprehensive benchmarks for different tasks and a large set of model weights for state-of-the-art methods. This section provides a snapshot of the benchmarks. A comprehensive listing, covering the full set of supported models on different datasets, can be found in the model zoo at https://pytorchvideo.readthedocs.io/en/latest/model_zoo.html. The models in PyTorchVideo’s model zoo reproduce the performance of the original publications and can seed future research that builds upon them, bypassing the need for re-implementation and costly re-training.

1. Video classification

PyTorchVideo implements classification for various datasets, including UCF-101 (Soomro et al., 2012), HMDB-51 (Kuehne et al., 2011), Kinetics (Kay et al., 2017), Charades (Sigurdsson et al., 2016), Something-Something (Goyal et al., 2017) and Epic-Kitchens (Damen et al., 2018). Table 1 shows a benchmark snapshot on Kinetics-400 for three popular state-of-the-art methods, measured in Top-1 and Top-5 accuracy. Further classification models with pre-trained weights, which can be used for a variety of downstream tasks, are available online.
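For readers unfamiliar with the metric, Top-k accuracy counts a prediction as correct when the ground-truth class is among the k highest-scoring classes. The sketch below is a plain-Python illustration on made-up scores (the toy data has only three classes, so k = 2 is used in place of k = 5); it is not the evaluation code behind Table 1.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k top-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Made-up class scores for 4 clips over 3 classes, with their true labels.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
labels = [1, 1, 0, 2]

print(top_k_accuracy(scores, labels, k=1))  # 0.25
print(top_k_accuracy(scores, labels, k=2))  # 0.5
```

Top-k accuracy can never decrease as k grows, which is why Top-5 numbers are always at least as high as the Top-1 numbers for the same model.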

2. Video action detection

The video action detection task aims to perform spatiotemporal localization of human actions in videos. Table 2 shows detection performance in mean Average Precision (mAP) on the AVA dataset (Gu et al., 2018) using Slow and SlowFast networks (Feichtenhofer et al., 2019).

3. Video self-supervised learning

We provide reference implementations of popular self-supervised learning methods for video (Feichtenhofer et al., 2021), which can be used to perform unsupervised spatiotemporal representation learning on large-scale video data. Table 3 summarizes the results on 5 downstream datasets.

4. Efficient mobile video understanding

The Accelerator environment for efficient video understanding on mobile devices is benchmarked in Table 4. We show several efficient X3D models (Feichtenhofer, 2020) on a Samsung Galaxy S9 mobile phone. With the efficient optimization strategies in PyTorchVideo, X3D achieves a 4.6-5.6× inference speed-up compared to the default PyTorch implementation. With quantization (int8), it can be further accelerated by 1.4×. The resulting PyTorchVideo-accelerated X3D model runs around 6× faster than real time, requiring roughly 165 ms to process one second of video, directly on the mobile phone. The source code for an on-device demo is available for iOS (https://github.com/pytorch/ios-demo-app/tree/master/TorchVideo) and Android (https://github.com/pytorch/android-demo-app/tree/master/TorchVideo).
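The relation between the quoted numbers can be checked with a short calculation. The real-time factor follows directly from the 165 ms figure; the combined speed-up range assumes that the kernel-level and quantization gains compose multiplicatively, which is our reading of "further accelerated" rather than a figure reported in the text.

```python
# Figures quoted in the text.
latency_ms_per_video_second = 165   # time to process one second of video
fp32_speedup_range = (4.6, 5.6)     # over the default PyTorch implementation
int8_extra_speedup = 1.4            # additional gain from quantization

# One second of video is 1000 ms, so the real-time factor is:
realtime_factor = 1000 / latency_ms_per_video_second
print(round(realtime_factor, 2))    # 6.06, i.e. "around 6x faster than real time"

# Assumption: the two speed-ups multiply.
combined_speedup_range = tuple(
    round(s * int8_extra_speedup, 2) for s in fp32_speedup_range
)
print(combined_speedup_range)       # (6.44, 7.84)
```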

Conclusion

We introduce PyTorchVideo, an efficient, flexible, and modular deep learning library for video understanding that scales to a variety of research and production applications. Our library welcomes contributions from the research and open-source community, and will be continuously updated to support further innovation. In future work, we plan to continue enhancing the library with optimized components and reproducible state-of-the-art models.