Audiovisual SlowFast Networks for Video Recognition

Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, Christoph Feichtenhofer

Introduction

Joint audiovisual learning is core to human perception. However, most contemporary models for video analysis exploit only the visual signal and ignore the audio signal. For many video understanding tasks, audio could be very helpful. Audio has the potential to influence action recognition not only in obvious cases where sound dominates, like “playing saxophone”, but also in visually subtle cases where the action itself is difficult to see in the video frames, like “whistling”, or in closely related actions where sound helps disambiguate, like “closing” vs. “slamming” the door.

This line of thinking is supported by perceptual and neuroscience studies suggesting interesting ways in which visual and audio signals are combined in the brain. A classic example is the McGurk effect https://www.youtube.com/watch?v=G-lN8vWm3m0 – when one is listening to an audio clip (e.g., sounding “ba-ba”), alongside watching a video of fabricated lip movements (indicating “va-va”), the sound one perceives changes (in this case from “ba-ba” to “va-va”).

This effect demonstrates that there is tight entanglement between audio and visual signals (known as the multisensory integration process). Importantly, research has suggested this fusion between audio and visual signals happens at a fairly early stage.

Given its high potential in facilitating video understanding, researchers have attempted to utilize audio in videos. However, there are a few challenges in making effective use of audio. First, audio does not always correspond to the visual frames (e.g., in a “dunking basketball” video, there can be class-unrelated background music playing). Conversely, audio does not always contain information that can help understand the video (e.g., “shaking hands” does not have a particular sound signature). There are also challenges from a technical perspective. Specifically, we identify the incompatibility of “learning dynamics” between the visual and audio pathways – audio pathways generally train much faster than visual ones, which can lead to generalization issues during joint audiovisual training. Due in part to these various difficulties, a principled approach for audiovisual modeling is currently lacking. Many previous methods adopt an ad-hoc scheme that consists of a separate audio network that is integrated with the visual pathway via “late-fusion”.

The objective of this paper is to build an architecture for integrated audiovisual perception. We aim to go beyond previous work that performs “late-fusion” of independent audio and visual pathways, to instead learn hierarchies of integrated audiovisual features, enabling unified audiovisual perception. We propose a new architecture, Audiovisual SlowFast Networks (AVSlowFast), to perform fusion at multiple levels (Fig. 1). AVSlowFast Networks build on SlowFast, a class of architectures that has two pathways, of which one (Slow) is designed to capture more static but semantic-rich information whereas the other (Fast) is tasked to capture motion. AVSlowFast hierarchically intertwines the Slow and Fast pathways with a Faster Audio pathway (“Faster” because audio has a higher sampling rate), yielding a model that learns end-to-end from vision and sound. The Audio pathway can be lightweight ( $<$ 20% of computation), but requires careful design and training strategies to be useful in practice.

We evaluate our approach on standard datasets in the human action recognition community and find consistent improvement from integrating audio. The improvement in accuracy varies across datasets and classes but comes with a relatively small increase in computational cost. For example, on the leading dataset for egocentric video, EPIC-Kitchens, audio boosts the top-1 accuracy for verb/noun/action recognition by +2.9/+4.3/+2.3 at 20% of overall compute; on Kinetics action classification it adds +1.4 top-1 accuracy at 11% of compute; and on AVA action detection it adds +1.2 mAP at only 2% of the overall compute.

(i) We present AVSlowFast, which fuses audio and visual information at multiple levels in the network hierarchy (i.e., hierarchical fusion) so that audio can contribute to the formation of visual concepts at different levels of abstraction. In contrast to late-fusion, this enables the audio signal to participate in the process of forming visual features.

(ii) To overcome the incompatibility of learning dynamics between the visual and audio pathways, we propose DropPathway, which randomly drops the Audio pathway during training as a simple and effective regularization technique to tune the pace of the learning process. This enables us to train our joint audiovisual model with hierarchical fusion connections across modalities.

(iii) Inspired by the multisensory integration process mentioned above and prior work in neuroscience , which suggests that there exist audiovisual mirror neurons in monkey brains that respond to “any evidence of the action, be it auditory or visual”, we propose to perform audio visual synchronization (AVS) at multiple layers to learn features that generalize across modalities.

(iv) We conduct extensive experiments on six video recognition datasets for human action classification and detection. We report state-of-the-art results and provide ablation studies to understand the trade-offs of various design choices. In addition to evaluating the performance of AVSlowFast for established supervised video classification and detection tasks, we validate the generalization of the audiovisual representation to self-supervised learning, revealing that strong video features can be learned with AVSlowFast using standard pretraining objectives.

Related Work

Significant progress has been made in video recognition in recent years. Some notable directions are two-stream networks in which one stream processes RGB frames and the other processes optical flow , 3D ConvNets as an extension of 2D networks to the spatiotemporal domain , and recent SlowFast Networks that have two pathways to process videos at different temporal frequencies . Despite all these efforts on harnessing temporal information in videos, research is relatively lacking when it comes to another important information source – audio in video.

Audiovisual activity recognition.

Joint modeling of audio and visual signals has been largely conducted in a “late-fusion” manner in video recognition literature. For example, all the entries that utilize audio in the 2018 ActivityNet challenge report have adopted this paradigm – meaning that there are networks processing visual and audio inputs separately, and then they either concatenate the output features or average the final class scores across modalities. Recently, an interesting audiovisual fusion approach has been proposed using flexible binding windows when fusing audio and visual features. With three similar network streams, this approach fuses audio features with the features from RGB and optical flow at the final stage before classification. In contrast, AVSlowFast builds a hierarchically integrated audiovisual representation.

Multi-modal learning.

Researchers have long been interested in developing models that can learn from multiple modalities (e.g., audio, vision, language). Beyond audio and visual modalities, extensive research has been conducted in other instantiations of multi-modal learning, including vision and language , vision and locomotion , and learning from physiological data .

Other audiovisual tasks.

Audio has also been extensively utilized outside of video recognition, e.g. for learning audiovisual representations in a self-supervised manner by exploiting audio-visual correspondence. Other audiovisual tasks include audio-visual speech recognition , lip reading , biometric matching , sound-source localization , audio-visual source separation , and audiovisual question answering .

Audiovisual SlowFast Networks

Inspired by research in neuroscience , which suggests that audio and visual signals fuse at multiple levels, we propose to fuse audio and visual features at multiple stages, from intermediate-level features to high-level semantic concepts. This way, audio can participate in the formation of visual concepts at different levels. AVSlowFast Networks are conceptually simple: SlowFast has Slow and Fast pathways to process visual input (§3.1), and AVSlowFast extends this with an Audio pathway (§3.2).

We begin by briefly reviewing the SlowFast architecture . The Slow pathway (Fig. 1, top row) is a convolutional network that processes videos with a large temporal stride (i.e., it samples one frame out of $\tau$ frames). The primary goal of the Slow pathway is to produce features that capture semantic contents of the video, which has a low refresh rate (semantics do not change all of a sudden). The Fast pathway (Fig. 1, middle row) is another convolutional model with three key properties. First, it has an $\alpha_{F}$ times higher frame rate (i.e., with temporal stride $\tau/\alpha_{F}$ , $\alpha_{F}>1$ ) so that it can capture fast motion information. Second, it preserves fine temporal resolution by avoiding any temporal downsampling. Third, it has a lower channel capacity ( $\beta_{F}$ times the Slow pathway channels, where $\beta_{F}<1$ ) as it is demonstrated to be a desired trade-off . We refer readers to for more details.

Audio pathway

A key property of the Audio pathway is that it has an even finer temporal structure than the Slow and Fast pathways (with waveform sampling rate on the order of kHz). As standard processing, we take a log-mel-spectrogram (2-D representation in time and frequency of audio) as input and set the temporal stride to $\tau/\alpha_{A}$ frames, where $\alpha_{A}$ can be much larger than $\alpha_{F}$ (e.g., 32 vs. 8). In a sense, it serves as a “Faster” pathway with respect to Slow and Fast pathways. Another notable property of the Audio pathway is its low computation cost, as audio signals, due to their lower-dimensional nature, are cheaper to process than visual signals. To control this, we set the number of channels of the Audio pathway to $\beta_{A}$ $\times$ the Slow pathway channels. By default, we set $\beta_{A}$ to $1/2$. Depending on the specific instantiation, the Audio pathway typically requires only 10% to 20% of the overall computation of AVSlowFast.
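To make the relation between the three pathways concrete, the following minimal Python sketch computes the temporal stride and channel width of each pathway from $\tau$, $\alpha_{F}$, $\alpha_{A}$, $\beta_{F}$ and $\beta_{A}$. The function name `pathway_config` and the Slow pathway width of 64 channels are assumptions made for illustration only; the values $\tau=16$, $\alpha_{F}=8$, $\alpha_{A}=32$, $\beta_{F}=1/8$ and $\beta_{A}=1/2$ are the ones quoted in the text.

```python
from fractions import Fraction


def pathway_config(tau, alpha_f, alpha_a, beta_f, beta_a, slow_channels):
    """Temporal stride (in video frames) and channel width of each pathway.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    return {
        "slow": {"stride": Fraction(tau), "channels": slow_channels},
        "fast": {"stride": Fraction(tau, alpha_f),
                 "channels": int(slow_channels * beta_f)},
        "audio": {"stride": Fraction(tau, alpha_a),
                  "channels": int(slow_channels * beta_a)},
    }


# slow_channels=64 is an assumed, illustrative width.
cfg = pathway_config(tau=16, alpha_f=8, alpha_a=32,
                     beta_f=Fraction(1, 8), beta_a=Fraction(1, 2),
                     slow_channels=64)
for name, c in cfg.items():
    print(name, "stride:", c["stride"], "channels:", c["channels"])
```

With these values the Audio pathway has a stride of half a video frame, which is why it can be viewed as a “Faster” pathway.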

Lateral connections

In addition to the lateral connections between the Slow and Fast pathways in , we add lateral connections between the Audio, Slow, and Fast pathways to fuse audio and visual features. Following , lateral connections are added after ResNet “stages” (e.g., pool1, res2, res3, res4 and pool5). However, unlike , which has lateral connections after each stage, we found that it is most beneficial to have lateral connections between audio and visual features starting from intermediate levels (we ablate this in Sec. 4.2). This is conceptually intuitive as very low-level visual features such as edges and corners might not have a particular sound signature. Next, we discuss several concrete AVSlowFast instantiations.

Instantiations

AVSlowFast Networks define a generic class of models that follow the same design principles. In this section, we exemplify a specific instantiation in Table 1. We denote spatiotemporal size by $T\times S^{2}$ for Slow/Fast pathways and $F\times T$ for the Audio pathway, where $T$ is the temporal length, $S$ is the height and width of a square spatial crop, and $F$ is the number of frequency bins for audio.

For Slow and Fast pathways, we follow the basic instantiation of SlowFast 4 $\times$ 16, R50 model defined in . It has a Slow pathway that samples $T=4$ frames from a 64-frame raw clip with a temporal stride $\tau=16$ . There is no temporal downsampling in the Slow pathway, since the input stride is large. Also, it only applies non-degenerate temporal convolutions (temporal kernel size $>1$ ) in res4 and res5 (see Table 1), as this is more effective.

The Fast pathway has a higher frame rate ( $\alpha_{F}=8$ ) and a lower channel capacity ( $\beta_{F}=1/8$ ), such that it can better capture motion while trading off model capacity. To preserve fine temporal resolution, the Fast pathway has non-degenerate temporal convolutions in every residual block. Spatial downsampling is performed with a stride $2^{2}$ convolution in the center (“bottleneck”) filter of the first residual block in each stage of both the Slow and Fast pathways.

Audio pathway.

The Audio pathway takes as input the log-mel-spectrogram representation, which is a 2-D representation with one axis being time and the other one denoting frequency bins. In Table 1, we use 128 spectrogram frames (corresponding to 2 seconds of audio) with 80 bins.

Similar to the Slow and Fast pathways, the Audio pathway is also based on a ResNet, but with a specific design to better fit the audio inputs. First, we do not pool after the initial convolutional filter (i.e. there is no downsampling layer at stage pool1) to preserve information along both the temporal and frequency axes. Downsampling in time-frequency space is performed by a stride $2^{2}$ convolution in the center (“bottleneck”) filter of the first residual block in each stage from res2 to res5. Second, we decompose the 3 $\times$ 3 convolution filters in res2 and res3 into 1 $\times$ 3 filters for frequency and 3 $\times$ 1 filters for time. This increases accuracy slightly (by 0.2% on Kinetics) and also reduces computation. Conceptually, it allows the network to treat time and frequency separately in early stages (as opposed to 3 $\times$ 3 filters, which imply both axes are uniform). While for spatial filters it is reasonable to perform filtering in the $x$ and $y$ dimensions symmetrically, this might not be optimal for early filtering in the time and frequency dimensions, as the statistics of spectrograms are different from those of natural images, which instead are approximately isotropic and shift-invariant .
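The claim that the decomposition reduces computation can be checked with a small weight count. The sketch below is only illustrative: it assumes the same number of channels $C$ at the input, between the two decomposed filters, and at the output, and it ignores biases; the helper names and the width of 32 channels are hypothetical.

```python
def conv_weights(kernel_h, kernel_w, c_in, c_out):
    """Number of weights in a 2-D convolution (bias ignored)."""
    return kernel_h * kernel_w * c_in * c_out


def full_3x3(c):
    """A single 3x3 filter bank with c input and c output channels."""
    return conv_weights(3, 3, c, c)


def decomposed(c):
    """A 1x3 filter followed by a 3x1 filter, with c channels throughout
    (the intermediate width is an assumption of this sketch)."""
    return conv_weights(1, 3, c, c) + conv_weights(3, 1, c, c)


c = 32  # illustrative channel width
print(full_3x3(c), decomposed(c))
ratio = decomposed(c) / full_3x3(c)
```

Under these assumptions the decomposed pair uses $6C^{2}$ weights instead of $9C^{2}$, i.e. two thirds of the original, and the multiply-adds per output position scale in the same way.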

Lateral connections.

There are many options for how to fuse audio features into the visual pathways. Here, we describe several instantiations and the motivation behind them. Note that this section discusses the lateral connections between the Audio and SlowFast pathways. For the fusion connection between the two visual pathways (Slow and Fast), we adopt the temporal strided convolution as it is demonstrated to be most effective in .

(i) A $\rightarrow$ F $\rightarrow$ S: In this approach (Fig. 2 left), the Audio pathway (A) is first fused to the Fast pathway (F), and then fused to the Slow pathway (S). Specifically, audio features are subsampled to the temporal length of the Fast pathway and then fused into the Fast pathway with a sum operation. After that, the resulting features are further subsampled by $\alpha_{F}$ (e.g., 4 $\times$ subsample) and fused with the Slow pathway (as is done in SlowFast). The key property of this approach is that it enforces strong temporal alignment between audio and visual features, as audio features are fused into the Fast pathway which preserves fine temporal resolution.

(ii) A $\rightarrow$ FS: An alternative way is to fuse the Audio pathway into the output of the SlowFast fusion (Fig. 2 center), which is coarser in temporal resolution. We adopt this design as our default choice as it imposes a less stringent requirement on temporal alignment between audio and visual features, which we found to be important in our experiments. Similar ideas of relaxing the alignment requirement are also explored in , in the context of combining RGB, flow, and audio streams.

(iii) Audiovisual Nonlocal: One might also be interested in using audio as a modulating signal to visual features. Specifically, instead of directly summing or concatenating audio features into the visual stream, one might expect audio to play a more subtle role of modulating the visual concepts, through attention mechanisms such as Non-Local (NL) blocks . One example would be audio serving as a probing signal indicating where the interesting event is happening in the video, both spatially and temporally, and then focusing the attention of visual pathways on those locations. To materialize this, we adapt NL blocks to take both audio and visual features as inputs (Fig. 2 right). Audio features are then matched to different locations within visual features (along $H$ , $W$ and $T$ axis), and the affinity is used to generate a new visual feature that combines information from locations deemed important by audio features.
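As a rough illustration of the default A $\rightarrow$ FS connection in (ii), the following Python sketch subsamples audio features along time to the (coarser) temporal length of the visual features and then sums the two. It makes several simplifying assumptions that are not stated in the text: features are already pooled to one vector per time step, both modalities have the same channel dimension, and temporal subsampling is done by average pooling. The function names are hypothetical.

```python
def subsample_time(features, target_len):
    """Average-pool a list of per-timestep vectors down to target_len steps."""
    assert len(features) % target_len == 0, "lengths must divide evenly here"
    group = len(features) // target_len
    pooled = []
    for t in range(target_len):
        chunk = features[t * group:(t + 1) * group]
        # Average each channel over the time steps in this chunk.
        pooled.append([sum(col) / group for col in zip(*chunk)])
    return pooled


def fuse_a_to_fs(visual, audio):
    """Sum temporally subsampled audio features into the visual features."""
    audio_sub = subsample_time(audio, len(visual))
    return [[v + a for v, a in zip(v_t, a_t)]
            for v_t, a_t in zip(visual, audio_sub)]


visual = [[1.0, 1.0], [2.0, 2.0]]                          # 2 time steps
audio = [[0.0, 4.0], [2.0, 4.0], [4.0, 0.0], [4.0, 8.0]]   # 4 time steps
fused = fuse_a_to_fs(visual, audio)
print(fused)
```

Because each visual time step receives an average over several audio steps, small temporal offsets between sound and frames are smoothed out, which matches the relaxed alignment requirement described for this design.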

Joint audiovisual training

Unlike SlowFast, AVSlowFast trains with multiple modalities. As noted in Sec. 1, this leads to challenging training dynamics (i.e., different training speed of audio and visual pathways). To tackle this, we propose two training strategies that enable joint training.

We first discuss a possible reason why previous methods employ audio in a late-fusion manner. Analyzing the model training dynamics, we observe that audio and visual pathways are very different in terms of their “learning speed”.

Taking the curves in Fig. 3 as an example, the green curve is for training a visual-only SlowFast model, whereas the red curve is for training an Audio-only model. It shows that the Audio-only model requires fewer training iterations before it starts to overfit (at $\scriptstyle\sim$ 70 epochs, which is $\scriptstyle\sim$ 1/3 of the visual model’s training epochs). One modality dominating multi-modal training has also been observed for lip-reading applications and optical flow streams in action recognition and video object segmentation .

The discrepancy in learning pace leads to overfitting if we naively train both modalities jointly. To unlock the potential of joint training, we propose a simple strategy of randomly dropping the Audio pathway during training (DropPathway). Specifically, at each training iteration, we drop the Audio pathway altogether with probability ${P}_{d}$ . This way, we slow down the learning of the Audio pathway and make its learning dynamics more compatible with its visual counterpart. When dropping the Audio pathway, we sum zero tensors with the visual pathways (we also explored feeding the running average of audio features, and found similar results, possibly due to BN).
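The DropPathway rule described above can be sketched in a few lines of Python. This is a simplified illustration on plain lists rather than network tensors; the function name `fuse_with_drop_pathway` and the `training` flag (audio is always used outside training) are assumptions of the sketch. In an actual network, dropping the pathway would also mean that the Audio pathway receives no gradient in that iteration.

```python
import random


def fuse_with_drop_pathway(visual, audio, p_drop, rng, training=True):
    """Sum audio into visual features, dropping audio with probability p_drop."""
    if training and rng.random() < p_drop:
        # Dropped: a zero tensor is summed in place of the audio features.
        audio = [0.0] * len(audio)
    return [v + a for v, a in zip(visual, audio)]


rng = random.Random(0)
visual, audio = [1.0, 2.0], [0.5, 0.5]
n_iters = 10000
# Count iterations in which the fused output equals the visual input,
# i.e. the Audio pathway was dropped.
dropped = sum(
    fuse_with_drop_pathway(visual, audio, p_drop=0.8, rng=rng) == visual
    for _ in range(n_iters)
)
drop_rate = dropped / n_iters
print(drop_rate)
```

With ${P}_{d}=0.8$, the Audio pathway contributes in only about one fifth of the iterations, which is how the strategy slows its learning relative to the visual pathways.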

Our ablation studies in the next section show that this simple strategy provides good generalization and is essential for jointly training AVSlowFast. Note that DropPathway is different from simply setting different learning rates for the audio/visual pathways in that it 1) ensures the Audio pathway has fewer parameter updates, 2) prevents the visual pathway from ‘shortcutting’ training by memorizing audio information, and 3) provides extra regularization as different audio clips are dropped in each epoch.

Hierarchical audiovisual synchronization.

As noted in Sec. 2, temporal synchronization (that comes for free) between audio and visual sources has been explored as a self-supervisory signal to learn feature representations . In this work, we use audiovisual synchronization to encourage the network to produce feature representations that are generalizable across modalities (inspired by the audiovisual mirror neurons in primate vision ). Specifically, we add an auxiliary task to classify whether a pair of audio and visual frames are in-sync or not and adopt a curriculum schedule used in that starts with easy negatives (audio and visual frames come from different clips), and transitions to a mix of easy and hard negatives (audio and visual frames are from the same clip, but with a temporal shift) after 50% of training epochs. In our experiments, we study the effect of audiovisual synchronization for both supervised and self-supervised audiovisual feature learning.
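The curriculum over negatives can be illustrated with the following Python sketch. It only models the choice of negative type per training epoch; the function name and the `hard_fraction` parameter (the share of hard negatives in the second half of training) are hypothetical, since the text does not specify the mixing ratio.

```python
import random


def sample_negative_kind(epoch, total_epochs, rng, hard_fraction=0.5):
    """Pick the type of out-of-sync pair for the synchronization task."""
    if epoch < 0.5 * total_epochs:
        # First half: audio and frames come from different clips.
        return "easy"
    # Second half: mix of easy negatives and hard negatives
    # (same clip, temporally shifted). hard_fraction is an assumed value.
    return "hard" if rng.random() < hard_fraction else "easy"


rng = random.Random(0)
first_half = {sample_negative_kind(e, 100, rng) for e in range(50)}
second_half = {sample_negative_kind(e, 100, rng) for e in range(50, 100)}
print(first_half, second_half)
```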

Experiments: Action Classification

We evaluate our approach on six video recognition datasets using standard evaluation protocols. For the action classification experiments in this section we use EPIC-Kitchens , Kinetics-400 , and Charades . For action detection, we use the AVA dataset covered in Sec. 5, and the AVSlowFast self-supervised representation is evaluated on UCF101 & HMDB51 in Sec. 6.

Datasets. The EPIC-Kitchens dataset consists of daily activities captured in various kitchen environments with egocentric video and sound recordings. It has 39k segments in 432 videos. For each segment, the task is to predict a verb (e.g., “turn-on”), a noun (e.g., “switch”), and an action by combining the two (“turn on switch”). Performance is measured as top-1 and top-5 accuracy. We use the train/val split in . Test results are obtained from the evaluation server.

Kinetics-400 (abbreviated as K400) is a large-scale video dataset of $\scriptstyle\sim$ 240k training videos and 20k validation videos in 400 action categories. Results on Kinetics are reported as top-1 and top-5 classification accuracy (%).

Charades is a dataset of $\scriptstyle\sim$ 9.8k training videos and 1.8k validation videos in 157 classes. Each video has multiple labels of activities spanning $\scriptstyle\sim$ 30 seconds. Performance is measured in mean Average Precision (mAP).

Audio pathway. Following previous work , we extract log-mel-spectrograms from the raw audio waveform to serve as the input to the Audio pathway. Specifically, we sample audio data with 16 kHz sampling rate, then compute a spectrogram with window size of 32ms and step size of 16ms. The length of the audio input is exactly matched to the duration spanned by the RGB frames. For example, under 30 FPS, for AVSlowFast with $T\times\tau$ $=$ 8 $\times$ 8 frames (2 secs) input, we sample 128 frames (2 secs) in log-mel.
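The numbers in this paragraph can be reproduced with a short calculation. The sketch below converts the window and step sizes to samples and compares the duration covered by 128 spectrogram steps with the duration of an 8 $\times$ 8 frame clip at 30 FPS. It approximates the spectrogram duration as the number of frames times the step size and ignores window overlap at the clip boundaries; the function names are illustrative.

```python
SAMPLE_RATE_HZ = 16_000
WINDOW_MS = 32
STEP_MS = 16

window_samples = SAMPLE_RATE_HZ * WINDOW_MS // 1000   # 512 samples per window
step_samples = SAMPLE_RATE_HZ * STEP_MS // 1000       # 256 samples per step


def spectrogram_duration_s(num_frames):
    """Approximate audio duration covered by num_frames spectrogram steps."""
    return num_frames * STEP_MS / 1000


def video_duration_s(t, tau, fps):
    """Duration spanned by T frames sampled with temporal stride tau."""
    return t * tau / fps


audio_s = spectrogram_duration_s(128)            # about 2.05 seconds
video_s = video_duration_s(t=8, tau=8, fps=30)   # about 2.13 seconds
print(window_samples, step_samples, audio_s, video_s)
```

Both durations are close to the 2 seconds quoted in the text.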

Training. We train our AVSlowFast models on Kinetics from scratch without pre-training. We use synchronous SGD and follow the training recipe (learning rate, weight decay, warm-up) used in . Given a training video, we randomly sample $T$ frames with stride $\tau$ and extract the corresponding log-mel-spectrogram. We randomly crop 224 $\times$ 224 pixels from a video, randomly flip horizontally, and resize it to a shorter side sampled in .

Inference. Following previous work , we uniformly sample 10 clips from a video along its temporal axis. For each clip, we resize the shorter spatial side to 256 pixels and take 3 crops of 256 $\times$ 256 along the longer side to cover the spatial dimensions. Video-level predictions are computed by averaging softmax scores. We report the actual inference-time computation as in , by listing the FLOPs per spacetime “view” of spatial size $256^{2}$ (temporal clip with spatial crop) at inference and the number of views (i.e. 30 for 10 temporal clips each with 3 spatial crops).
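The multi-view inference protocol amounts to averaging per-view softmax scores. A minimal Python sketch is given below; the logits are made-up placeholder values for a three-class problem, and the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def video_prediction(view_logits):
    """Average per-view softmax scores into one video-level score vector."""
    scores = [softmax(logits) for logits in view_logits]
    n = len(scores)
    return [sum(col) / n for col in zip(*scores)]


NUM_CLIPS, NUM_CROPS = 10, 3
num_views = NUM_CLIPS * NUM_CROPS   # 30 views per video

# Hypothetical logits: every view slightly prefers class 1.
view_logits = [[0.0, 1.0, -1.0] for _ in range(num_views)]
video_scores = video_prediction(view_logits)
predicted = max(range(len(video_scores)), key=video_scores.__getitem__)
print(num_views, predicted)
```

The total inference cost is then the per-view FLOPs multiplied by the number of views.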

Kinetics, EPIC, Charades details are in A.4, A.5, & A.6.

We compare to state-of-the-art methods on EPIC in Table 2. First, AVSlowFast improves SlowFast by +2.9 / +4.3 / +2.3 top-1 accuracy for verb / noun / action, which highlights the benefits of audio in egocentric video recognition.

Second, as a system-level comparison, AVSlowFast exhibits higher performance in all three categories (verb/noun/action) and two test sets (seen/unseen) vs. state-of-the-art under Kinetics-400 pretraining.

Comparing to LFB , which uses an object detector to localize objects, AVSlowFast achieves similar performance for nouns (objects) on both the seen and unseen test sets, whereas SlowFast without audio is largely lagging behind (-4.4% vs. LFB on val noun), which is intuitive as sound can be beneficial for recognizing objects.

We observe large performance gains over the previous best (which utilizes RGB, audio and flow) on the unseen split (i.e., novel scenes) of the test set (+3.1 / +4.8 / +2.3 for verb / noun / action), showing AVSlowFast’s strength on test data.

Kinetics.

Table 3 shows a comparison on the well-established Kinetics dataset. Comparing AVSlowFast with SlowFast shows a margin of 1.4% top-1 for R50 and 0.9% top-1 accuracy for R101, given the same network backbone and input size. This demonstrates the effectiveness of the audio stream despite its modest cost of only $\approx$ 10% $-$ 20% of the overall computation. Comparatively, going deeper from R50 to R101 increases computation by 194%.

At the system level, AVSlowFast compares favorably to existing methods that utilize various modalities, i.e., audio (A), visual frames (V) and optical flow (F). Adding optical flow streams brings roughly similar gains as audio but doubles computation (TS in Table 3), not counting optical flow computation; by contrast, audio processing is lightweight (e.g. 11% computation overhead for AVSlowFast, R50). Further, AVSlowFast does not rely on pretraining and is competitive with multi-modal approaches that pretrain individual modality streams (✓).

As Kinetics is a visual-heavy dataset (for many classes, e.g. “writing”, audio is not useful), we also study audiovisual learning on “Kinetics-Sounds”, a subset of 34 classes potentially manifested both visually and aurally. We test on Kinetics-Sounds in the “KS” column of Table 3. The gain from SlowFast to AVSlowFast more than doubles, to +3.2%/+2.3% for R50/R101, showing the potential on relevant data. Further Kinetics results on standalone Audio-only classification and class-level analysis are in A.2 and A.3.

Charades.

We test the effectiveness of AVSlowFast on videos of longer range activities on Charades in Table 4. We observe that audio can facilitate recognition (+1.2% over a strong SlowFast baseline) and we achieve state-of-the-art performance under Kinetics-400 pre-training.

Discussion.

Overall, our experiments on action classification indicate that, on standard, visually created datasets for classification, a consistent improvement over very strong visual baselines can be achieved by modeling audio with AVSlowFast. In some cases the improvements are exceptionally high (e.g. EPIC) and in others lower (e.g. Charades), and all results suggest that with AVSlowFast, audio can serve as an economical modality that supplements visual input.

Ablation Studies

We ablate our approach on Kinetics as it represents the largest unconstrained dataset for human action recognition.

We study the effectiveness of fusion in Table 6. The first interesting phenomenon is that direct ensembling (late-fusion) of audio/visual models produces only modest gains (76.1% vs 75.6%), whereas joint training with late-fusion (“pool5”) does not help (75.6% $\rightarrow$ 75.4%).

For hierarchical, multi-level fusion, Table 6 shows it is beneficial to fuse audio and visual features at multiple levels. Specifically, we found that recognition accuracy steadily increases from 75.4% to 77.0% when we increase the number of fusion connections from one (i.e., only concatenating pool5 outputs) to three (res3,4 + pool5) where it peaks. Adding another lateral connection at res2 decreases accuracy. This suggests that it is beneficial to start fusing audio and visual features from intermediate levels (res3) all the way to the top of the network. We hypothesize that this is because audio facilitates the formation of visual concepts, but only when features mature to intermediate concepts that are generalizable across modalities (e.g. local edges do not have a general sound pattern).

Lateral connections.

We ablate the effect of different types of lateral connections between audio and visual pathways in Table 5a. First, A $\rightarrow$ F $\rightarrow$ S, which enforces strong temporal alignment between audio and visual streams, produces lower classification accuracy compared to A $\rightarrow$ FS, which relaxes the requirement on alignment. This is consistent with findings in that it is beneficial to have tolerance on alignment between the modalities, since class-level audio signals might happen out-of-sync with visual frames (e.g., when shooting three-pointers in basketball, the net-touching sound only comes after the action finishes). Finally, the straightforward A $\rightarrow$ FS connection performs similarly to the more complex AV Nonlocal fusion (77.0% vs 77.2%). We use A $\rightarrow$ FS as our default lateral connection for its good performance and simplicity.

Audio pathway capacity.

We study the impact of the number of channels of the Audio pathway ( $\beta_{A}$ ) in Table 5b. As expected, when we increase the number of channels (e.g., increasing $\beta_{A}$ from 1/8 to 1/2, which is the ratio between Audio and Slow pathway’s channels), accuracy improves at the cost of increased computation. However, performance starts to degrade when we further increase it to 1, likely due to overfitting. We use $\beta_{A}=1/2$ across all our experiments.

DropPathway.

We apply Audio pathway dropping to adjust the incompatibility of learning speed across modalities. Here we conduct ablative experiments to study the effects of different drop rates ${P}_{d}$ . The results are shown in Table 5c. As shown in the table, a high value of $P_{d}$ (0.5 or 0.8) is required to slow down the Audio pathway when training audio and visual pathways jointly. If we train AVSlowFast without DropPathway (“-”), the accuracy degrades to be even worse than visual-only models (75.2% vs 75.6%). This is because the Audio pathway learns too fast and starts to dominate the visual feature learning. The gain from 75.2% $\rightarrow$ 77.0% reflects the full impact of DropPathway.

Hierarchical audiovisual synchronization.

We study the effectiveness of hierarchical audiovisual synchronization in Table 5d. We use AVSlowFast with and without AVS, and vary the layers for multiple losses. We observe that adding AVS as an auxiliary task is beneficial (+0.6% gain). Furthermore, having synchronization loss at multiple levels slightly increases the performance (without extra inference cost). This suggests that it is beneficial to have a feature representation that is generalizable across audio and visual modalities, and that hierarchical AVS could help produce such a representation.

Experiments: AVA Action Detection

In addition to the action classification tasks, we also apply AVSlowFast models on action detection which requires both localizing and recognizing actions.

The AVA dataset focuses on spatiotemporal localization of human actions. Spatiotemporal labels are provided for one frame per second, with people annotated with a bounding box and (possibly multiple) actions. There are 211k training and 57k validation video segments. We follow the standard protocol of evaluating on 60 classes. The metric is mean Average Precision (mAP) over 60 classes, using a frame-level IoU threshold of 0.5.
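For reference, the frame-level IoU used for matching detections to annotations can be computed as in the Python sketch below. Boxes are assumed to be given as `(x1, y1, x2, y2)` corner coordinates, and the `is_match` helper with a `>=` comparison at the 0.5 threshold is an illustrative assumption; the full mAP computation (ranking detections by score and averaging precision per class) is not shown.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.5):
    """Whether a predicted box overlaps a ground-truth box enough to match."""
    return box_iou(pred, gt) >= threshold


print(box_iou((0, 0, 2, 2), (1, 0, 3, 2)))
```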

Detection architecture.

We follow the detection architecture introduced in , which is adapted from Faster R-CNN for video. During training, the input to our audiovisual detector is $\alpha_{F}T$ RGB frames sampled with temporal stride $\tau$ and spatial size 224 $\times$ 224, to SlowFast pathways, and the corresponding log-mel-spectrogram covering this time window to the Audio pathway. During testing, the backbone feature is computed fully convolutionally with RGB frames of shorter side being 256 pixels, as is standard in Faster R-CNN.

For details on architecture, training and inference, please refer to appendix A.7.

Results.

We compare to several other existing methods in Table 7. AVSlowFast, with both R50 and R101 backbones, outperforms SlowFast with a consistent margin of $\scriptstyle\sim$ 1.2%, and only increases FLOPs slightly, e.g. for R50 by only 2%, whereas going from SlowFast R50 to R101 (without audio) increases computation significantly by 180%. (We report FLOPs for fully-convolutional inference of a clip with 256 $\times$ 320 spatial size for SlowFast and AVSlowFast models; the full test-time computational cost for these models is directly proportional to this.)

Interestingly, the ActivityNet Challenge 2018 hosted a separate track for multiple modalities, but no team could achieve gains using audio information on AVA data. Our result shows, for the first time, that audio can be beneficial for action detection, where spatiotemporal localization is required, even with low computation overhead of just 2%.

For system-level comparison to other approaches, Table 7 shows that AVSlowFast achieves state-of-the-art performance on AVA under Kinetics-400 pretraining.

For comparisons with future work, we show results on the newer v2.2 of AVA, which provides updated annotations. The results are consistent with those for v2.1. As for per-class results, we found that classes like [“swim” +30.2%], [“dance” +10.0%], [“shoot” +8.6%], and [“hit (an object)” +7.6%] have the largest gains from audio; please see appendix A.3 and Fig. A.1 for more details.

Experiments: Self-supervised Learning

To further study the generalization of AVSlowFast models, we apply them to self-supervised learning (SSL). The goal here is not to propose a new SSL pretraining task. Instead, we are interested in how well a self-supervised video representation can be learned with AVSlowFast using existing tasks. We use the audiovisual synchronization and image rotation prediction ($0^\circ$, $90^\circ$, $180^\circ$, $270^\circ$; as a four-way softmax classification) losses as pretraining tasks. With the learned AVSlowFast weights, we then re-train the last fc layer of AVSlowFast on UCF101 and HMDB51, following standard practice, to evaluate the SSL feature representation. Table 8 lists the results. Using off-the-shelf pretext tasks, our smallest AVSlowFast R50 model compares favorably to state-of-the-art SSL approaches on both datasets, with an absolute margin of +23.4 and +12.7 top-1 accuracy over the previous best, CBT . This highlights the strength of the architecture and of the features learned by AVSlowFast. For more details and results, please refer to appendix A.1.
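As an illustration of the rotation pretext task, the sketch below builds one (rotated frame, label) training pair, where the label indexes the four rotations and would serve as the target of a four-way softmax classifier. The helper names, the list-of-rows frame representation, and the counter-clockwise convention are assumptions made for this example, not details taken from our implementation.

```python
import random


def rotate90(frame, k):
    """Rotate a 2-D frame (list of rows) counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # One counter-clockwise quarter turn: transpose, then reverse the row order.
        frame = [list(row) for row in zip(*frame)][::-1]
    return frame


def make_rotation_example(frame, rng):
    """Pick one of the four rotations at random; the index is the class label."""
    label = rng.randrange(4)  # 0, 1, 2, 3 correspond to 0, 90, 180, 270 degrees
    return rotate90(frame, label), label


rotated, label = make_rotation_example([[1, 2], [3, 4]], random.Random(0))
```

For a video clip, the same rotation would presumably be applied to every frame so that the clip stays temporally coherent.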

Conclusion

This work has presented AVSlowFast Networks, an architecture for integrated audiovisual perception. We show the effectiveness of the AVSlowFast representation with state-of-the-art performance on six datasets for video action classification, detection, and self-supervised learning tasks. We hope that AVSlowFast, as a unified audiovisual backbone, will foster further research in video understanding.

Appendix A Appendix

In this section, we provide more results and detailed analysis on self-supervised learning using AVSlowFast. Training schedule and details are provided in §A.8.

First, we pretrain AVSlowFast with self-supervised objectives of audiovisual synchronization (AVS) and image rotation prediction (ROT) on Kinetics-400. Then, following the standard linear classification protocol used for image recognition tasks , we use the pretrained network as a fixed, frozen feature extractor and train a linear classifier on top of the features learned with self-supervision. In Table A.1 (top), we compare to previous work that follows the same protocol. We note this is the same experiment as in Table 8, but with additional ablations on our models. The results indicate that features learned by AVSlowFast are significantly better than baselines including the recently introduced CBT method (+23.4% for UCF101 and +14.6% for HMDB51), which also uses ROT as well as a contrastive bidirectional transformer (CBT) loss and is pretrained on the larger Kinetics-600.

In addition, we ablate the contribution of the individual AVS and ROT tasks in Table A.1 (bottom). On UCF101, SlowFast/AVSlowFast trained under either the ROT or the AVS objective shows strong individual performance, while their combination performs best. On the smaller HMDB51, in contrast, all three variants of our method perform similarly well and audio seems less important.

Another aspect is that, although many previous approaches to self-supervised feature learning focus on reporting the number of parameters, FLOPs are another important factor to consider – as shown in Table A.1 (top), the performance keeps increasing when we take higher temporal resolution clips by varying $T\times\tau$ (i.e. larger FLOPs), even though the model parameters remain identical.

Although we think the linear classification protocol serves as a better method to evaluate self-supervised feature learning (as features are frozen and therefore less sensitive to hyper-parameter settings such as learning schedule and regularization, especially when these datasets are relatively small), we also evaluate by fine-tuning all layers of AVSlowFast on the target datasets to compare to a larger corpus of previous work on self-supervised feature learning. Table A.2 shows that AVSlowFast achieves competitive performance compared to prior work under this setting. When using this protocol, we believe it is reasonable to also consider methods that train multiple layers on UCF/HMDB from scratch, such as optical-flow based motion streams . It is interesting that this stream, despite being an AlexNet-like model , is comparable to or better than many newer models pretrained on (the large) Kinetics-400 using self-supervised learning techniques.

A.2 Results: Audio-only Classification

To understand the effectiveness of our Audio pathway, we evaluate it in terms of Audio-only classification accuracy on Kinetics (in addition to Kinetics-400, we also train and evaluate on Kinetics-600 to be comparable to methods that use this data in challenges ). In Table A.3, we compare our Audio-only network to several other audio models. We observe that our Audio-only model performs better than existing methods by solid margins (+3.3% top-1 accuracy on Kinetics-600 and +3.2% on Kinetics-400, compared to best-performing methods), which demonstrates the effectiveness of our Audio pathway design. Note also that unlike some other methods in Table A.3, we train our audio network from scratch on Kinetics, without any pretraining.

A.3 Results: Classification & Detection Analysis

Comparing AVSlowFast to SlowFast (77.0% vs. 75.6% for 4 $\times$ 16, R50 backbone), classes that benefited most from audio include [“dancing macarena” +24.5%], [“whistling” +24.0%], [“beatboxing” +20.4%], [“salsa dancing” +19.1%] and [“singing” +16.0%], etc. Clearly, all these classes have distinct sound signatures by which they can be recognized. On the other hand, classes like [“skiing (not slalom or crosscountry)” -12.3%], [“triple jump” -12.2%], [“dodgeball” -10.2%] and [“massaging legs” -10.2%] have the largest performance loss, as the sound of these classes tends to be much less correlated with the action.

Per-class analysis on AVA

We compare per-class results of AVSlowFast to its SlowFast counterparts in Fig. A.1. As mentioned in the main paper, the classes with the largest absolute gain (marked with bold black font) are “swim”, “dance”, “shoot”, “hit (an object)” and “cut”. Further, the classes “push (an object)” (3.2 $\times$ ) and “throw” (2.0 $\times$ ) largely benefit from audio in relative terms (marked with orange font in Fig. A.1). As expected, all these classes have strong sound signatures that are easy to recognize from audio. On the other hand, the largest performance loss arises for classes such as “watch (e.g., TV)”, “read”, “eat” and “work on a computer”, which either do not have a distinct sound signature (“read”, “work on a computer”) or have strong background noise (“watch (e.g., TV)”). We believe explicitly modeling foreground and background sound might be a fruitful future direction to alleviate these challenges.

A.4 Details: Kinetics Action Classification

We train our models on Kinetics from scratch without any pretraining. Our training and testing closely follows . We use a synchronous SGD optimizer and train with 128 GPUs using the recipe in . The mini-batch size is 8 clips per GPU (so the total mini-batch size is 1024). The initial base learning rate $\eta$ is 1.6 and we decrease it according to a half-period cosine schedule : the learning rate at the $n$-th iteration is $\eta\cdot 0.5[\cos(\frac{n}{n_{\text{max}}}\pi)+1]$, where $n_{\text{max}}$ is the maximum number of training iterations. We adopt a linear warm-up schedule for the first 8k iterations. We use a scale jittering range of pixels for R101 model to improve generalization . To aid convergence, we initialize all models that use Non-Local blocks (NL) from their counterparts that are trained without NL. We only use NL on res4 (instead of res3+res4 used in ).
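The schedule above can be sketched as a small function. The cosine term follows the formula in the text; the warm-up starting value (`warmup_start_lr`) and the choice to ramp linearly towards the cosine value at the end of warm-up are assumptions, since the text only states that warm-up is linear over the first 8k iterations.

```python
import math


def cosine_lr(n, n_max, base_lr=1.6, warmup_iters=8000, warmup_start_lr=0.0):
    """Half-period cosine learning rate with an assumed linear warm-up."""

    def cosine(i):
        # eta * 0.5 * [cos(i / n_max * pi) + 1], as given in the text.
        return base_lr * 0.5 * (math.cos(i / n_max * math.pi) + 1.0)

    if n < warmup_iters:
        # Assumption: ramp linearly from warmup_start_lr to the cosine value
        # that the schedule reaches at the end of the warm-up phase.
        alpha = n / warmup_iters
        return warmup_start_lr + alpha * (cosine(warmup_iters) - warmup_start_lr)
    return cosine(n)


# Learning rate at a few iterations of a 60k-iteration run.
schedule = [cosine_lr(n, 60_000) for n in (0, 8_000, 30_000, 60_000)]
```

With these settings the rate starts at the assumed warm-up value, reaches half of $\eta$ at the midpoint of training, and decays to zero at $n_{\text{max}}$.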

We train with Batch Normalization (BN) , and the BN statistics are computed within each group of 8 clips. Dropout with rate 0.5 is used before the final classifier layer. In total, we train for 256 epochs (60k iterations with batch size 1024, for $\sim$240k Kinetics videos) when $T\leq$ 4 frames, and 196 epochs when the Slow pathway has $T>$ 4 frames: it is sufficient to train for fewer epochs when a clip has more frames. We use momentum of 0.9 and weight decay of $10^{-4}$.
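As a quick check of the epoch count, using the rounded figure of roughly 240k training videos quoted above:

$$\text{epochs} \approx \frac{\text{iterations} \times \text{batch size}}{\text{videos}} = \frac{60{,}000 \times 1024}{240{,}000} = 256.$$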

A.5 Details: EPIC-Kitchens Classification

We fine-tune from the Kinetics pretrained AVSlowFast 8 $\times$ 8, R101 (w/o NL) for this experiment. For fine-tuning, we freeze all BNs by converting them into affine layers. We train using a single machine with 8 GPUs. The initial base learning rate $\eta$ is set to 0.01 for verb and 0.0006 for noun. We train with batch size 32 for 24k iterations for verb and 30k iterations for noun. We use a step-wise decay of the learning rate by a factor of 10 $\times$ at 2/3 and 5/6 of full training. For simplicity, we only use a single center crop for testing.
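The step-wise decay can be sketched as follows. The function name and signature are hypothetical, and whether the iteration that falls exactly on a milestone already uses the reduced rate is an assumption. Exact fractions are used so that the milestone comparison is not affected by floating-point rounding.

```python
from fractions import Fraction


def step_lr(n, total_iters, base_lr,
            milestones=(Fraction(2, 3), Fraction(5, 6)), gamma=0.1):
    """Learning rate at iteration n, divided by 10 at each passed milestone."""
    # Count how many milestones (fractions of full training) have been reached.
    drops = sum(1 for m in milestones if n >= m * total_iters)
    return base_lr * gamma ** drops


# Verb schedule from the text: 24k iterations, eta = 0.01,
# so the drops fall at 16k and 20k iterations.
verb_lrs = [step_lr(n, 24_000, 0.01) for n in (0, 16_000, 20_000)]
```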

A.6 Details: Charades Action Classification

We fine-tune from the Kinetics pretrained AVSlowFast 16 $\times$ 8, R101 + NL model, to account for the longer activity range of this dataset, and a per-class sigmoid output is used to account for the multi-class nature of the data. We train on a single machine (8 GPUs) for 40k iterations using a batch size of 8 and a base learning rate $\eta$ of 0.07 with one 10 $\times$ decay after 32k iterations. We use a Dropout rate of 0.7. For inference, we temporally max-pool scores . All other settings are the same as those of Kinetics.
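A minimal sketch of the per-class sigmoid output and temporal max-pooling at inference is given below. It assumes that scores are pooled over several clips sampled along the video and that logits arrive as plain Python lists; both the helper names and this data layout are illustrative assumptions.

```python
import math


def sigmoid(x):
    """Independent per-class probability, so several classes can be active at once."""
    return 1.0 / (1.0 + math.exp(-x))


def video_scores(clip_logits):
    """Max-pool per-class sigmoid scores over the clips of one video.

    clip_logits: list of clips, each a list with one logit per class.
    """
    clip_probs = [[sigmoid(z) for z in clip] for clip in clip_logits]
    # zip(*...) regroups the scores by class; take the maximum over time.
    return [max(per_class) for per_class in zip(*clip_probs)]


# Two clips, two classes: each class keeps its highest-scoring clip.
scores = video_scores([[0.0, -2.0], [2.0, -3.0]])
```

Because the sigmoid is monotonic, pooling the probabilities gives the same result as pooling the logits and applying the sigmoid afterwards.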

A.7 Details: AVA Action Detection

We follow the detection architecture introduced in , which is adapted from Faster R-CNN for video. Specifically, we change the spatial stride of res5 from 2 to 1, thus increasing the spatial resolution of res5 by 2 $\times$ . RoI features are then computed by applying RoIAlign spatially and global average pooling temporally. These features are then fed to a per-class, sigmoid-based classifier for multi-label prediction. Again, we initialize from Kinetics pretrained models and train for 52k iterations with an initial learning rate $\eta$ of 0.4 and a batch size of 16 per machine (we train across 16 machines, so the effective batch size is 16 $\times$ 16 = 256). We pre-compute proposals using an off-the-shelf Faster R-CNN person detector with a ResNeXt-101-FPN backbone. It is pretrained on ImageNet and the COCO human keypoint data; more details can be found in .

A.8 Details: Self-supervised Evaluation

For self-supervised pretraining, we train on Kinetics-400 for 120k iterations with per-machine batch size 64 across 16 machines and initial learning rate 1.6, similar to §A.4, but with a step-wise schedule. The learning rate is decayed by 10 $\times$ three times, at 80k, 100k and 110k iterations. We use linear warm-up (starting from learning rate 0.001) for the first 10k iterations. As noted in Sec. 6, we adopt the curriculum learning idea for audiovisual synchronization: we first train with easy negatives for the first 60k iterations and then switch to a mix of easy and hard negatives (1 $/$ 4 hard, 3 $/$ 4 easy) for the remaining 60k iterations. The easy negatives come from different videos, while hard negatives have a temporal displacement of at least 0.5 seconds.
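The negative-sampling curriculum can be sketched as below. Drawing the negative type independently per sample, the upper bound on the displacement (`max_shift`), and the function names are assumptions for illustration; the text only fixes the 60k switch point, the 1/4 versus 3/4 mix, and the minimum displacement of 0.5 seconds.

```python
import random


def sample_negative_type(iteration, rng, switch_iter=60_000, hard_fraction=0.25):
    """Easy negatives only before the switch; afterwards a 1/4 hard, 3/4 easy mix."""
    if iteration < switch_iter:
        return "easy"
    return "hard" if rng.random() < hard_fraction else "easy"


def hard_negative_offset(rng, min_shift=0.5, max_shift=2.0):
    """Signed audio displacement in seconds for a hard negative.

    The magnitude is at least min_shift; max_shift is a hypothetical cap.
    """
    shift = rng.uniform(min_shift, max_shift)
    return shift if rng.random() < 0.5 else -shift


rng = random.Random(0)
early = [sample_negative_type(1_000, rng) for _ in range(100)]
late = [sample_negative_type(90_000, rng) for _ in range(10_000)]
```

An easy negative pairs the visual clip with audio from a different video, whereas a hard negative reuses audio from the same video, shifted by the sampled offset.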

For the “linear classification protocol” experiments on UCF and HMDB, we train for 320k iterations (echoing , we found it beneficial to train for many iterations in this setting) with an initial learning rate of 0.01, a half-period cosine decay schedule and a batch size of 64 on a single machine with 8 GPUs. For the “train all layers” setting, we train for 80k $/$ 30k iterations with batch size 16 (also on a single machine), an initial learning rate of 0.005 $/$ 0.01 and a half-period cosine decay schedule, for UCF and HMDB, respectively.

Appendix B Details: Kinetics-Sound dataset

The original 34 classes selected in are based on an earlier version of the Kinetics dataset. Some classes have been removed since then. Therefore, we use the following 32 classes that are kept in the current version of the Kinetics-400 dataset: “blowing nose”, “blowing out candles”, “bowling”, “chopping wood”, “dribbling basketball”, “laughing”, “mowing lawn”, “playing accordion”, “playing bagpipes”, “playing bass guitar”, “playing clarinet”, “playing drums”, “playing guitar”, “playing harmonica”, “playing keyboard”, “playing organ”, “playing piano”, “playing saxophone”, “playing trombone”, “playing trumpet”, “playing violin”, “playing xylophone”, “ripping paper”, “shoveling snow”, “shuffling cards”, “singing”, “stomping grapes”, “strumming guitar”, “tap dancing”, “tapping guitar”, “tapping pen”, “tickling”.
