Audio-Visual Event Localization in Unconstrained Videos

Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu

Introduction

Studies in neurobiology suggest that the perceptual benefits of integrating visual and auditory information are extensive. In computational models, these benefits are reflected in lip reading, where correlations between speech and lip movements provide a strong cue for linguistic understanding; in music performance, where vibrato articulations and hand motions enable the association between sound tracks and the performers; and in sound synthesis, where physical interactions with different types of material give rise to plausible sound patterns. Despite these advances, these models are limited to constrained domains.

Indeed, our community has begun to explore marrying computer vision with audition in the wild for learning a good representation. For example, a sound network has been learned from a visual teacher network with a large amount of unlabeled videos, which shows better performance than learning in a single modality. However, these works all assume that the audio and visual contents in a video are matched (which is often not the case, as we will show), and they have yet to explore whether joint audio-visual representations can facilitate understanding unconstrained videos.

In this paper, we study a family of audio-visual event temporal localization tasks (see Fig. 1) as a proxy to the broader audio-visual scene understanding problem for unconstrained videos. We pose and seek to answer the following questions: (Q1) Does inference jointly over auditory and visual modalities outperform inference over them independently? (Q2) How does the result vary under noisy training conditions? (Q3) How does knowing one modality help model the other modality? (Q4) How do we best fuse information over both modalities? (Q5) Can we locate the content in one modality given its observation in the other modality? Although individual questions may have been studied in the literature, we are not aware of any work that conducts a systematic study to answer these questions as a whole.

In particular, we define an audio-visual event as an event that is both visible and audible in a video segment, and we establish three tasks to explore the aforementioned research questions: 1) supervised audio-visual event localization, 2) weakly-supervised audio-visual event localization, and 3) event-agnostic cross-modality localization. The first two tasks aim to predict which temporal segment of an input video has an audio-visual event and what category the event belongs to. The weakly-supervised setting assumes that, during training, we have access only to a video-level event tag and not to the temporal event boundaries. Q1-Q4 will be explored within these two tasks. In the third task, we aim to locate the corresponding visual sound source temporally within a video from a given sound segment and vice versa, which will answer Q5.

We propose both baselines and novel algorithms to solve the above three tasks. For the first two tasks, we start with a baseline model treating them as a sequence labeling problem. We use CNNs to encode audio and visual inputs, adopt LSTMs to capture temporal dependencies, and apply a fully connected (FC) network to make the final predictions. Upon this baseline model, we introduce an audio-guided visual attention mechanism to verify whether audio can help attend to visual features; it also indicates the spatial locations of sounding objects as a side output. Furthermore, we investigate several audio-visual feature fusion methods and propose a novel dual multimodal residual fusion network that achieves the best fusion results. For weakly-supervised learning, we formulate it as a Multiple Instance Learning (MIL) task, and modify our network structure by adding a MIL pooling layer to handle the problem. To address the harder cross-modality localization task, we propose an audio-visual distance learning network that measures the relatedness of any given pair of audio and visual content. It projects audio and visual features into subspaces with the same dimension. A contrastive loss is used to train the network.

Observing that there is no publicly available dataset directly suitable for our tasks, we collect a large video dataset that consists of 4143 10-second videos with both audio and video tracks for 28 audio-visual events and annotate their temporal boundaries. Videos in our dataset originate from YouTube; thus they are unconstrained. Our extensive experiments support the following findings: modeling jointly over auditory and visual modalities outperforms modeling independently over them, audio-visual event localization in a noisy condition can still achieve promising results, the audio-guided visual attention can capture semantic regions covering sounding objects and can even distinguish audio-visual unrelated videos, temporal alignment is important for audio-visual fusion, the proposed dual multimodal residual network is effective in addressing the fusion task, and strong correlations between the two modalities enable cross-modality localization. These findings pave the way for our community to solve harder, high-level understanding problems in the future, such as video captioning and MovieQA, where the auditory modality plays an important role in understanding video but lacks effective modeling.

Our work makes the following contributions: (1) a family of three audio-visual event localization tasks; (2) an audio-guided visual attention model to adaptively explore the audio-visual correlations; (3) a novel dual multimodal residual network to fuse audio-visual features; (4) an effective audio-visual distance learning network to address cross-modality localization; (5) a large audio-visual event dataset containing more than 4K unconstrained and annotated videos, which, to the best of our knowledge, is the largest dataset for sound event detection. We will release our dataset along with implementations of various methods.

Related Work

In this section, we first describe how our work differs from three closely-related topics: sound event detection, temporal action localization and multimodal machine learning, then discuss relations to various recent works in modeling vision-and-sound.

Sound event detection, as considered in the audio signal processing community, aims to detect and temporally locate sound events in an acoustic scene. Approaches based on Hidden Markov Models (HMM), Gaussian Mixture Models (GMM), feed-forward Deep Neural Networks (DNN), and Bidirectional Long Short-Term Memory (BLSTM) have been developed. These methods focus on audio signals and do not explore visual signals. Corresponding datasets for sound event detection, e.g., TUT Acoustic Scenes, only contain sound tracks and are not suitable for audio-visual scene understanding.

Temporal action localization aims to detect and locate actions in videos. Most works cast it as a classification problem and utilize a temporal sliding window approach, where each window is considered as an action candidate subject to classification. Escorcia et al. present a deep action proposal network that is effective in generating temporal action proposals for long videos and can speed up temporal action localization. Recently, Shou et al. propose an end-to-end Segment-based 3D CNN method (S-CNN), and Lea et al. develop an Encoder-Decoder Temporal Convolutional Network (ED-TCN) to hierarchically model actions. Different from these works, an audio-visual event in our consideration may contain multiple actions or motionless sounding objects, and we model over both audio and visual domains. Nevertheless, we extend the ED-TCN method to address our supervised audio-visual event localization task and compare it in Sec. 6.3.

Multimodal machine learning aims to learn joint representations over multiple input modalities, e.g., speech and video, image and text. Feature fusion is one of the most important parts of multimodal learning, and many different fusion models have been developed, such as statistical models, Multiple Kernel Learning (MKL), and graphical models. Although some multimodal deep networks have been studied, they mainly focus on joint audio-visual representation learning based on autoencoders or deep Boltzmann machines, whereas we are interested in investigating the best models to fuse learned audio and visual features for localization purposes.

Recently, some inspiring works have been developed for modeling vision-and-sound. Aytar et al. use a visual teacher network to learn powerful sound representations from unlabeled videos. Owens et al. leverage ambient sounds as supervision to learn visual representations. Arandjelovic and Zisserman learn both visual and audio representations in an unsupervised manner through an audio-visual correspondence task, and they further locate sound sources spatially in an image based on an extended correspondence network. Aside from works in representation learning, audio-visual cross-modal synthesis has been studied, and associations between natural image scenes and accompanying free-form spoken audio captions have been explored. Unlike the previous works, in this paper, we systematically investigate audio-visual event localization tasks.

Dataset and Problems

To the best of our knowledge, there is no publicly available dataset directly suitable for our purpose. Therefore, we introduce the Audio-Visual Event (AVE) dataset, a subset of AudioSet, that contains 4143 videos covering 28 event categories; videos in AVE are temporally labeled with audio-visual event boundaries. (The supplementary material contains the details of gathering the dataset.) Each video contains at least one 2s long audio-visual event. The dataset covers a wide range of audio-visual events (e.g., man speaking, woman speaking, dog barking, playing guitar, and frying food) from different domains, e.g., human activities, animal activities, music performances, and vehicle sounds. We provide examples from different categories and show the statistics in Fig. 2. Each event category contains a minimum of 60 videos and a maximum of 188 videos, and 66.4$\%$ of the videos in AVE contain audio-visual events that span the full 10 seconds. Next, we introduce three different tasks based on AVE to explore the interactions between auditory and visual modalities.

3.2 Fully and Weakly-Supervised Event Localization

The goal of event localization is to predict the event label for each video segment, which contains both audio and visual tracks, for an input video sequence. Concretely, for a video sequence, we split it into $T$ non-overlapping segments $\{V_{t},A_{t}\}_{t=1}^{T}$, where each segment is 1s long (since our event boundary is labeled at second-level), and $V_{t}$ and $A_{t}$ denote the visual content and its corresponding audio counterpart in a video segment, respectively. Let $\mathbf{y}_{t}=\{y_{t}^{k}|y_{t}^{k}\in\{0,1\},k=1,...,C,\sum_{k=1}^{C}y_{t}^{k}=1\}$ be the event label for that video segment. Here, $C$ is the total number of AVE events plus one background label.
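As a small illustration of this labeling scheme, the sketch below builds the one-hot labels for a single video from one annotated event. The function name, the half-open interval convention for the boundaries, and the choice to store the background class at the last index are assumptions made for this example only.

```python
def segment_labels(num_segments, num_events, event_index, start, end):
    """Build one-hot labels y_t for T one-second segments.

    Segments t with start <= t < end receive the event class; every other
    segment receives the background class, stored here at index num_events
    (an assumed convention), so that C = num_events + 1.
    """
    background = num_events
    labels = []
    for t in range(num_segments):
        one_hot = [0] * (num_events + 1)
        cls = event_index if start <= t < end else background
        one_hot[cls] = 1
        labels.append(one_hot)
    return labels


# A 10 s video whose event (class 3 of 28) covers seconds 2, 3, and 4.
y = segment_labels(num_segments=10, num_events=28, event_index=3, start=2, end=5)
```

Each row sums to one, matching the constraint $\sum_{k=1}^{C}y_{t}^{k}=1$.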

For the supervised event localization task, the event label $\mathbf{y}_{t}$ of each visual segment $V_{t}$ or audio segment $A_{t}$ is known during training. We are interested in event localization in audio space alone, visual space alone, and the joint audio-visual space. This task explores whether or not audio and visual information can help each other improve event localization. Unlike the supervised setting, in the weakly-supervised setting we only have access to a video-level event tag during training, and we still aim to predict segment-level labels during testing. The weakly-supervised task allows us to alleviate the reliance on well-annotated data for audio, visual, and audio-visual modeling.

3.3 Cross-Modality Localization

In the cross-modality localization task, given a segment of one modality (auditory/visual), we would like to find the position of its synchronized content in the other modality (visual/auditory). Concretely, for visual localization from audio (A2V), given an $l$-second audio segment $\hat{A}$ from $\{A_{t}\}_{t=1}^{T}$, where $l<T$, we want to find its synchronized $l$-second visual segment within $\{V_{t}\}_{t=1}^{T}$. Similarly, for audio localization from visual content (V2A), given an $l$-second video segment $\hat{V}$ from $\{V_{t}\}_{t=1}^{T}$, we would like to find its $l$-second audio segment within $\{A_{t}\}_{t=1}^{T}$. This task is conducted in the event-agnostic setting, so the models developed for it are expected to work for general videos where event labels are not available. For evaluation, we only use short-event videos, in which the audio-visual events are all shorter than 10s.

Methods for Audio-Visual Event Localization

First, we present the overall framework that treats audio-visual event localization (defined in Sec. 3.2) as a sequence labeling problem in Sec. 4.1. Upon this framework, we propose our audio-guided visual attention in Sec. 4.2 and a novel dual multimodal residual fusion network in Sec. 4.3. Finally, we extend this framework to work in the weakly-supervised setting in Sec. 4.4.

where $F_{t}$ refers to $v^{att}_{t}$ or $a_{t}$ in our model. To evaluate the proposed attention mechanism, we compare against baseline models that do not use attention; in these baselines, we directly feed globally average-pooled visual features and audio features into the LSTMs. To better incorporate the two modalities, we introduce a multimodal fusion network (see details in Sec. 4.3). The audio-visual representation $h_{t}^{*}$ is learned by the multimodal fusion network with the audio and visual hidden state output vectors $h_{t}^{v}$ and $h_{t}^{a}$ as inputs. This joint audio-visual representation is used to output the event category for each video segment. For this, we use a shared FC layer with the Softmax activation function to predict a probability distribution over $C$ event categories for the input segment, and the whole network can be trained with a multi-class cross-entropy loss.

4.2 Audio-Guided Visual Attention

Psychophysical and physiological evidence shows that sound is not only informative about its source but also its location. Based on this, Hershey and Movellan introduce an exploratory work on localizing sound sources utilizing audio-visual synchrony. It shows that the strong correlations between the two modalities can be used to find image regions that are highly correlated to the audio signal. Recently, it has been shown that sound indicates object properties even in unconstrained images or videos. These works inspire us to use the audio signal as a means of guidance for visual modeling.

Given that attention mechanisms have shown superior performance in many applications such as neural machine translation and image captioning, we use one to implement our audio-guided visual attention (see Fig. 3(a) and Fig. 4(a)). The attention network adaptively learns which visual regions in each segment of a video to look at for the corresponding sounding object or activity.

Concretely, we define an attention function $f_{att}$ that is adaptively learned from the visual feature map $v_{t}$ and the audio feature vector $a_{t}$. At each time step $t$, the visual context vector $v_{t}^{att}$ is computed by:

$$v_{t}^{att}=f_{att}(a_{t},v_{t})=\sum_{i=1}^{k}w_{t}^{i}v_{t}^{i},$$

where $w_{t}$ is an attention weight vector corresponding to the probability distribution over $k$ visual regions that are attended by its audio counterpart. The attention weights can be computed based on MLP with a Softmax activation function:
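A minimal sketch of one such attention computation is given here. It assumes an additive scoring MLP in which each of the $k$ region features and the audio feature are linearly projected, summed, passed through tanh, and reduced to a scalar score; the parameter names `W_v`, `W_a`, and `w_f`, and this particular scoring form, are illustrative assumptions rather than a statement of the exact network.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_guided_attention(regions, audio, W_v, W_a, w_f):
    """Return (context vector, attention weights) for one time step.

    regions: k visual region features v_t^i; audio: audio feature a_t.
    W_v, W_a, w_f are hypothetical learned parameters of the scoring MLP.
    """
    proj_a = matvec(W_a, audio)  # shared across all regions
    scores = []
    for v in regions:
        proj_v = matvec(W_v, v)
        hidden = [math.tanh(pv + pa) for pv, pa in zip(proj_v, proj_a)]
        scores.append(sum(wf * h for wf, h in zip(w_f, hidden)))
    weights = softmax(scores)  # w_t, a distribution over the k regions
    dim = len(regions[0])
    context = [sum(w * v[j] for w, v in zip(weights, regions)) for j in range(dim)]
    return context, weights
```

With all scores equal, the weights become uniform and the context vector reduces to global average pooling, which is the attention-free baseline described earlier.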

4.3 Audio-Visual Feature Fusion

Our fusion method follows the philosophy of processing multiple features separately and then learning a joint representation using a middle layer. To combine features coming from the visual and audio modalities, inspired by the Multimodal Residual Network (MRN), which works for text-and-image, we introduce a Dual Multimodal Residual Network (DMRN). The MRN adopts a textual residual branch and feeds transformed visual features into different textual residual blocks, where only textual features are updated. In contrast, the proposed DMRN shown in Fig. 4(b) updates both audio and visual features simultaneously.

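Given audio and visual features $h_{t}^{a}$ and $h_{t}^{v}$ from LSTMs, the DMRN will compute the updated audio and visual features:

$$h_{t}^{a^{\prime}}=\tanh(h_{t}^{a}+f(h_{t}^{a},h_{t}^{v})),\qquad h_{t}^{v^{\prime}}=\tanh(h_{t}^{v}+f(h_{t}^{a},h_{t}^{v})),$$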

where $f(\cdot)$ is an additive fusion function, and the average of $h_{t}^{a^{\prime}}$ and $h_{t}^{v^{\prime}}$ is used as the joint representation $h_{t}^{*}$ for labeling the video segment. Here, the update strategy in DMRN can both preserve useful information in the original modality and add complementary information from the other modality. We can stack multiple residual blocks to learn a deep fusion network, with the updated $h_{t}^{a^{\prime}}$ and $h_{t}^{v^{\prime}}$ as inputs of new residual blocks. However, we empirically find that stacking many blocks does not improve performance for either MRN or DMRN. We argue that the network becomes harder to train with increasing parameters and that one block is enough to handle this simple fusion task well.
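The residual update can be sketched for a single time step as follows. This is an illustration under stated assumptions: the additive fusion is taken to be the sum of two linear projections with hypothetical weight matrices `U_a` and `U_v`, and tanh is assumed as the nonlinearity.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dmrn_block(h_a, h_v, U_a, U_v):
    """One dual residual block: both modalities are updated with the same
    fused signal, then averaged into the joint representation h_t^*.

    U_a and U_v are hypothetical projection matrices (an assumption of
    this sketch); h_a and h_v must have the same dimension.
    """
    # Additive fusion f(h_a, h_v): sum of the projected features.
    fused = [pa + pv for pa, pv in zip(matvec(U_a, h_a), matvec(U_v, h_v))]
    # Residual updates: keep the original modality, add the fused signal.
    new_a = [math.tanh(a + f) for a, f in zip(h_a, fused)]
    new_v = [math.tanh(v + f) for v, f in zip(h_v, fused)]
    # Joint representation: average of the two updated features.
    joint = [(a + v) / 2 for a, v in zip(new_a, new_v)]
    return new_a, new_v, joint
```

Stacking blocks amounts to calling `dmrn_block` again on `new_a` and `new_v`, each time with its own projection matrices.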

We would like to underline the importance of fusing audio-visual features after the LSTMs for our task. We empirically find that late fusion (fusion after temporal modeling) is much better than early fusion (fusion before temporal modeling). We suspect that the auditory and visual modalities are not temporally aligned. Temporal modeling by LSTMs can implicitly learn certain alignments, which can help achieve better audio-visual fusion. The empirical evidence is shown in Tab. 2.

4.4 Weakly-Supervised Event Localization

To address weakly-supervised event localization, we formulate it as a MIL problem and extend our framework to handle the noisy training condition. Since only video-level labels are available, we infer the label of each audio-visual segment pair in the training phase, and aggregate these individual predictions into a video-level prediction by MIL pooling:

$$\hat{m}=g(m_{1},m_{2},...,m_{T})=\frac{1}{T}\sum_{t=1}^{T}m_{t},$$

where $m_{1},...,m_{T}$ are predictions from the last FC layer of our audio-visual event localization network, and $g(\cdot)$ averages over all predictions. The probability distribution over event categories for the video sequence is computed by applying the Softmax to $\hat{m}$. During testing, we predict the event category for each segment according to the computed $m_{t}$.
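The pooling step can be illustrated with a few lines of plain Python. The function names are hypothetical, and taking the arg-max of $m_{t}$ as the test-time segment prediction is an assumption of this sketch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mil_video_prediction(segment_scores):
    """Average the T segment-level FC outputs m_t, then apply Softmax
    to obtain the video-level distribution used for training."""
    T = len(segment_scores)
    C = len(segment_scores[0])
    pooled = [sum(m[c] for m in segment_scores) / T for c in range(C)]
    return softmax(pooled)


def segment_predictions(segment_scores):
    """At test time, label each segment by the arg-max of its own m_t."""
    return [max(range(len(m)), key=m.__getitem__) for m in segment_scores]
```

During training only `mil_video_prediction` is compared against the video-level tag; the per-segment scores receive no direct supervision.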

Method for Cross-Modality Localization

To address the cross-modality localization problem (defined in Sec. 3.3), we propose an audio-visual distance learning network (AVDLN) as illustrated in Fig. 3(b); we notice that similar networks are studied in concurrent works. Our network measures the distance $D_{\theta}(V_{i},A_{i})$ for a given pair of $V_{i}$ and $A_{i}$. At test time, for visual localization from audio (A2V), we use a sliding window method and optimize the following objective:

$$t^{*}=\operatorname*{argmin}_{t}\sum_{s=1}^{l}D_{\theta}(V_{s+t-1},\hat{A}_{s}),$$

where $t^{*}\in\{1,...,T-l+1\}$ denotes the start time when visual and audio content synchronize, $T$ is the total length of a testing video sequence, and $l$ is the length of the audio query $\hat{A}$. This objective function computes an optimal matching by minimizing the cumulative distance between the audio segments and the visual segments. Therefore, $\{V_{i}\}_{i=t^{*}}^{t^{*}+l-1}$ is the matched visual content. Audio localization from visual content (V2A) can be defined similarly; we omit it here for conciseness. Next, we describe the network used to implement the matching function.
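The sliding-window search can be sketched as follows, assuming the per-second embeddings have already been computed. The function names are hypothetical, a plain Euclidean distance stands in for the learned $D_{\theta}$, and the returned start index is 0-based (so it equals $t^{*}-1$).

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def localize(query, sequence, distance=euclidean):
    """Slide an l-second query over a T-second sequence and return the
    0-based start index with the smallest cumulative distance.

    query: l per-second embeddings from one modality.
    sequence: T per-second embeddings from the other modality.
    """
    l, T = len(query), len(sequence)
    best_start, best_cost = None, float("inf")
    for t in range(T - l + 1):  # the T - l + 1 candidate windows
        cost = sum(distance(sequence[t + s], query[s]) for s in range(l))
        if cost < best_cost:
            best_start, best_cost = t, cost
    return best_start, best_cost
```

The same routine covers both directions: pass audio embeddings as `query` for A2V, or visual embeddings for V2A.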

Let $\{V_{i},A_{i}\}_{i=1}^{N}$ be $N$ training samples and $\{y_{i}\}_{i=1}^{N}$ be their labels, where $V_{i}$ and $A_{i}$ are a pair of 1s visual and audio segments and $y_{i}\in\{0,1\}$. Here, $y_{i}=1$ means that $V_{i}$ and $A_{i}$ are synchronized. The AVDLN learns to measure distances between these pairs. The network encodes them using pre-trained CNNs, and then performs dimensionality reduction on the encoded audio and visual representations using two different two-layer FC networks. The outputs of the final FC layers are $\{R_{i}^{v},R_{i}^{a}\}_{i=1}^{N}$. The distance between $V_{i}$ and $A_{i}$ is measured by the Euclidean distance between $R_{i}^{v}$ and $R_{i}^{a}$:

$$D_{\theta}(V_{i},A_{i})=\|R_{i}^{v}-R_{i}^{a}\|_{2}.$$

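To optimize the parameters $\theta$ of the distance metric $D_{\theta}$, we introduce the contrastive loss proposed by Hadsell et al. The contrastive loss function is:

$$L_{C}=y_{i}D_{\theta}^{2}(V_{i},A_{i})+(1-y_{i})\left(\max(0,\,th-D_{\theta}(V_{i},A_{i}))\right)^{2},$$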

where $th>0$ is a margin. If the distance of a dissimilar pair is less than $th$, the loss pushes the distance $D_{\theta}$ to be larger; if the distance is larger than the margin, the pair does not contribute to the loss.
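The behavior of the margin can be checked with a small sketch of the per-pair loss. It assumes the standard form of the contrastive loss of Hadsell et al. (squared distance for synchronized pairs, squared hinge on the margin for unsynchronized ones); the function name is hypothetical, and the default margin of 2.0 follows the value reported in the implementation details.

```python
def contrastive_loss(distance, label, margin=2.0):
    """Contrastive loss for one audio-visual pair.

    label == 1: synchronized pair, penalize any non-zero distance.
    label == 0: unsynchronized pair, penalize only distances below the margin.
    """
    if label == 1:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


# A synchronized pair at distance 0.5 is pulled together,
# an unsynchronized pair at the same distance is pushed apart,
# and an unsynchronized pair beyond the margin is ignored.
examples = [contrastive_loss(0.5, 1), contrastive_loss(0.5, 0), contrastive_loss(3.0, 0)]
```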

Experiments

First, we introduce the visual and audio representations used in Sec. 6.1. Then, we describe the compared baseline models and evaluation metrics in Sec. 6.2. Finally, we show and analyze experimental results of different models in Sec. 6.3. (The supplementary material contains additional results on audio-visual event localization with C3D features, visual-guided audio attention and co-attention, and implementation details of our models.)

It has been suggested that CNN features learned from a large-scale dataset (e.g., ImageNet, AudioSet) are highly generic and powerful for other vision or audition tasks. So, we adopt pre-trained CNN models to extract features for visual segments and their corresponding audio segments.

Visual Representation. For each 1s visual segment, we extract $pool5$ feature maps from $16$ sampled RGB video frames using a VGG-19 network pre-trained on ImageNet, and then apply global average pooling over the 16 frames to generate one $512\times 7\times 7$-D feature map. We also explore temporal visual features extracted by C3D, which is capable of learning spatio-temporal visual features, but we do not observe significant improvements when combining C3D features.

Audio Representation. We extract a 128-D audio representation for each 1s audio segment via a VGG-like network pre-trained on AudioSet.

6.2 Baselines and Evaluation Metrics

To validate the effectiveness of the joint audio-visual modeling, we use single-modality models as baselines, which only use audio-alone or visual-alone features and share the same structure with our audio-visual models. To evaluate the audio-guided visual attention, we compare our V-att and A+V-att models with V and A+V models in fully and weakly supervised settings. Here, V-att models adopt audio-guided visual attention to pool visual feature maps, and the other V models use global average pooling to compute visual feature vectors. We visualize generated attention maps for subjective evaluation. To further demonstrate the effectiveness of the proposed networks, we also compare them with a state-of-the-art temporal labeling network: ED-TCN.

We compare our fusion method, DMRN, with several network-based multimodal fusion methods: Additive, Maxpooling (MP), Gated, Multimodal Bilinear (MB), Gated Multimodal Bilinear (GMB), Gated Multimodal Unit (GMU), Concatenation (Concat), and MRN. Three different fusion strategies are explored: early, late, and decision fusion. Here, early fusion methods directly fuse audio features from pre-trained CNNs and attended visual features; late fusion methods fuse audio and visual features from the outputs of two LSTMs; and decision fusion methods fuse the two modalities before the Softmax layer. In addition, to further enhance the performance of DMRN, we introduce a variant called the dual multimodal residual fusion ensemble (DMRFE) method, which feeds audio and visual features into two separate blocks and then uses an average ensemble to combine the two predicted probabilities.

For supervised and weakly-supervised event localization, we use overall accuracy as the evaluation metric. For cross-modality localization, i.e., V2A and A2V, if a matched audio/visual segment is exactly the same as its ground truth, we regard it as a good matching; otherwise, it is a bad matching. We compute the percentage of good matchings over all testing samples as the prediction accuracy to evaluate the performance of cross-modality localization. To validate the effectiveness of the proposed model, we also compare it with the deep canonical correlation analysis (DCCA) method.
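Both metrics are simple to state in code. The sketch below assumes that overall accuracy is the fraction of correctly labeled one-second segments pooled over all test videos, and that a cross-modality match is identified by its start index; the function names are hypothetical.

```python
def overall_accuracy(predicted, truth):
    """Fraction of segments whose predicted class equals the ground truth,
    pooled over all videos (an assumed reading of "overall accuracy").

    predicted, truth: one list of per-segment class indices per video.
    """
    pairs = [(p, g) for pv, gv in zip(predicted, truth) for p, g in zip(pv, gv)]
    return sum(p == g for p, g in pairs) / len(pairs)


def exact_match_accuracy(predicted_starts, true_starts):
    """Fraction of queries whose matched window starts exactly at the
    ground-truth position; any offset counts as a bad matching."""
    hits = sum(p == g for p, g in zip(predicted_starts, true_starts))
    return hits / len(true_starts)
```

The exact-match criterion is strict: a window that overlaps the ground truth but is shifted by one second still counts as a failure.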

6.3 Experimental Comparisons

Table 6 compares different variations of our proposed models on supervised and weakly-supervised audio-visual event localization tasks. Table 2 shows event localization performance of different fusion methods. Figures 5 and 6 illustrate generated audio-guided visual attention maps.

To benchmark our models with state-of-the-art temporal action labeling methods, we extend the ED-TCN to address the supervised audio-visual event localization, and train it on AVE. The ED-TCN achieves 46.9% overall accuracy. For comparison, our V model with the same features achieves 55.3%.

Audio and Visual. From Tab. 6, we observe that A outperforms V and W-A is also better than W-V. This demonstrates that audio features are more powerful for the audio-visual event localization task on the AVE dataset. However, when we look at each individual event, using audio is not always better than using visual. We observe that V is better than A for some events (e.g., car, motorcycle, train, bus). Most of these events are outdoor. Audio in these videos can be very noisy: several different sounds may be mixed together (e.g., people cheering alongside a racing car), and sounds may have very low intensity (e.g., a horse sound from far away). Under these conditions, visual information gives us more discriminative and accurate information to understand events in videos. A is much better than V for some events (e.g., dog, man and woman speaking, baby crying). Sounds provide clear cues for recognizing these events. For example, if we hear a barking sound, we know that there may be a dog. We also observe that A+V is better than both A and V, and W-A+V is better than W-A and W-V, which validates that combining audio and visual modalities significantly improves the event localization performance.

From the above results and analysis, we can conclude that auditory and visual modalities will provide complementary information for us to understand events in videos. The results also demonstrate that our AVE dataset is suitable for studying audio-visual scene understanding tasks.

Audio-Guided Visual Attention. The quantitative results (see Tab. 6) show that V-att is much better than V (a 3.3$\%$ absolute improvement) and A+V-att outperforms A+V by 1.3$\%$, which demonstrates the effectiveness of the proposed audio-guided visual attention mechanism. We show qualitative results of our attention method in Fig. 5. We observe that a range of semantic regions in many different categories and examples can be attended by audio, which validates that our attention network can learn which visual regions to look at for sounding objects (even for some challenging cases: two babies crying, playing flute surrounded by a crowd, a rat with weak sound). An interesting observation is that the audio-guided visual attention tends to focus on sounding regions, such as a man's mouth or the head of a crying boy, rather than whole objects in some examples. Figure 6 illustrates two challenging cases. For the first example, the sounding helicopter is quite small in the first several frames but our attention model can still capture its locations. For the second example, the first five frames do not contain an audio-visual event, which means that either the sound source is not visible or the sound is not audible; in this case, attention is spread over different background regions. When the rat appears in the 5th frame but is not making any sound, the attention does not focus on the rat. When the rat sound becomes audible, the attention focuses on the sounding rat. This observation validates that the audio-guided attention mechanism helps to distinguish audio-visual unrelated videos, and does not merely capture a saliency map of objects.

Audio-Visual Fusion. Table 2 shows the audio-visual event localization prediction accuracy of different multimodal feature fusion methods on the AVE dataset. Our DMRN model in the late fusion setting achieves better performance than all compared methods, and our DMRFE model further improves performance. We also observe that late fusion is better than early fusion and decision fusion. The superiority of late fusion over early fusion demonstrates that temporal modeling before audio-visual fusion is useful. Auditory and visual modalities are not completely aligned, and temporal modeling can implicitly learn certain alignments between the two modalities, which is helpful for the audio-visual feature fusion task. Decision fusion can be regarded as a type of late fusion that uses lower-dimensional features (with dimension equal to the number of categories). Late fusion outperforms decision fusion, which validates that processing multiple features separately and then learning a joint representation using a middle layer rather than the bottom layer is an effective way to fuse.

Full and Weak Supervision. As expected, supervised event localization models are better than weakly supervised ones, but quantitative comparisons show that weakly-supervised approaches achieve promising event localization performance, which demonstrates the effectiveness of the MIL networks in addressing this task, and validates that the audio-visual event localization task can be addressed even in a noisy condition.

Cross-Modality Localization. Table 3 reports the prediction accuracy of our method and DCCA on the cross-modality localization task. Our AVDLN outperforms DCCA by a large margin on both the A2V and V2A tasks. Even under the strict evaluation metric (which counts only exact matches), our models show promising results on both subtasks, A2V and V2A, which further demonstrates that there are strong correlations between audio and visual modalities, and that it is possible to address cross-modality localization for unconstrained videos.

Conclusion

In this work, we study a suite of five research questions in the context of three audio-visual event localization tasks. We propose both baselines and novel algorithms to address each of the three tasks. Our systematic study supports the following findings: modeling jointly over auditory and visual modalities outperforms independent modeling, audio-visual event localization in a noisy condition is still tractable, the audio-guided visual attention is able to capture semantic regions of sound sources and can even distinguish audio-visual unrelated videos, temporal alignments are important for audio-visual feature fusion, the proposed dual residual network is capable of audio-visual fusion, and strong correlations existing between the two modalities enable cross-modality localization.

Acknowledgement

This work was supported by NSF BIGDATA 1741472. We gratefully acknowledge the gift donations of Markable, Inc., Tencent and the support of NVIDIA Corporation with the donation of the GPUs used for this research. This article solely reflects the opinions and conclusions of its authors and not those of NSF, Markable, Tencent, or NVIDIA.


AVE: The Audio-Visual Event Dataset

Our Audio-Visual Event (AVE) dataset contains 4143 videos covering 28 event categories. The video data is a subset of AudioSet with the given event categories, based on which the temporal boundaries of the audio-visual events are manually annotated.

With the proliferation of video content, YouTube has become a good resource for finding unconstrained videos. The AudioSet released by Google is a large-scale audio-visual dataset that contains 2M 10-second video clips from YouTube. Each video clip corresponds to one of the 632 event labels, which are manually annotated to describe the audio event. In general, the events cover a variety of category types such as human and animal sounds, musical instruments and genres, and common everyday environmental sounds. Although the videos in AudioSet contain both audio and visual tracks, many of them are not suitable for the audio-visual event localization task. For example, visual and audio content can be completely unrelated (e.g., a train horn but no train appears, wind sound but no corresponding visual signals, the absence of audible sound, etc.).

To prepare our dataset, we select 34 categories comprising around 10,000 videos from AudioSet. We then hire trained in-house annotators to select a subset of them as the desired videos, and to mark the start and end times of each audio-visual event at a resolution of 1 second as its temporal boundaries. All annotators followed one criterion during annotation: a desired video should contain the given event category for a segment of at least two seconds, in which the sound source is visible and the sound is audible. This results in a total of 4143 desired videos covering a wide range of audio-visual events (e.g., woman speaking, dog barking, playing guitar, and frying food) from different domains, e.g., human activities, animal activities, music performances, and vehicle sounds.
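
As an illustration of the annotation criterion above, the following minimal sketch checks whether an annotated event is long enough and converts its boundaries into per-second labels. The function names, the half-open interval convention `[event_start, event_end)`, and the fixed 10-second clip length are assumptions made for this example; they are not taken from the annotation tool actually used.

```python
def is_desired_video(event_start, event_end, min_event_seconds=2, clip_seconds=10):
    """Return True if the annotated event lasts at least `min_event_seconds`.

    Boundaries are whole seconds (1 s resolution), as in the annotation protocol.
    The half-open convention [event_start, event_end) is an assumption.
    """
    if not (0 <= event_start < event_end <= clip_seconds):
        raise ValueError("event boundaries must lie within the clip")
    return (event_end - event_start) >= min_event_seconds


def segment_labels(event_start, event_end, clip_seconds=10):
    """Mark each 1-second segment with 1 if it lies inside the event, else 0."""
    return [1 if event_start <= t < event_end else 0 for t in range(clip_seconds)]


if __name__ == "__main__":
    # Hypothetical annotation: the event is visible and audible from 3 s to 7 s.
    print(is_desired_video(3, 7))
    print(segment_labels(3, 7))
```

The visibility and audibility parts of the criterion require human judgment and are not modeled here; only the duration check is mechanical.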

Implementation Details

Videos in the AVE dataset are divided into training (3339), validation (402), and testing (402) sets. For the supervised and weakly-supervised audio-visual event localization tasks, we randomly sample videos from each event category to build the train/val/test sets. For evaluating cross-modality localization performance, we sample testing videos only from short-event videos (the events in these videos are all strictly shorter than the full 10s duration). We implement our models using PyTorch and Keras with TensorFlow as the backend. Networks are optimized with Adam. The LSTM hidden state size and the contrastive loss margin are set to 128 and 2.0, respectively.
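
To make the role of the contrastive loss margin concrete, here is a minimal sketch of a common margin-based contrastive loss on one audio-visual pair. The exact formulation used in our models is given in the main paper; the form below, the Euclidean distance, and all names (`contrastive_loss`, `is_match`) are assumptions for illustration. Only the default margin of 2.0 is taken from the text above.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(audio_vec, visual_vec, is_match, margin=2.0):
    """Margin-based contrastive loss for a single audio-visual pair.

    Matched pairs are pulled together (loss = d^2); unmatched pairs are
    pushed apart until their distance reaches the margin
    (loss = max(0, margin - d)^2). This specific form is an assumption.
    """
    d = euclidean_distance(audio_vec, visual_vec)
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2


if __name__ == "__main__":
    # Hypothetical 2-D embeddings of one audio segment and one visual segment.
    print(contrastive_loss([0.0, 0.0], [1.0, 0.0], is_match=True))
    print(contrastive_loss([0.0, 0.0], [1.0, 0.0], is_match=False))
```

With a margin of 2.0, an unmatched pair contributes no loss once its embeddings are at least 2.0 apart.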

Additional Experiments

Here, we compare different supervised audio-visual event localization models with different features in Sec. 11.1. The audio-visual event localization results with different attention mechanisms are shown in Sec. 11.2.

Although 2D CNNs pre-trained on ImageNet are effective in extracting high-level visual representations from static images, they fail to capture the dynamic features that model motion information in videos. To analyze whether temporal information is useful for the audio-visual event localization task, we use a deep 3D convolutional neural network (C3D) to extract spatio-temporal visual features. In our experiments, we extract C3D feature maps from the pool5 layer of a C3D network pre-trained on Sports-1M, and obtain feature vectors by global average pooling. Tables 4 and 5 show supervised audio-visual event localization results with different features on the AVE dataset.

2 Different Attention Mechanisms

In our paper, we propose an audio-guided visual attention mechanism that adaptively learns which visual regions in each segment of a video to attend to for the corresponding sounding object or activity. Here, we further explore a visual-guided audio attention mechanism and an audio-visual co-attention mechanism, where the latter integrates audio-guided visual attention and visual-guided audio attention. These attention mechanisms serve as a weighted global pooling method to generate audio or visual feature vectors. The visual-guided audio attention function is similar to that of the audio-guided visual attention model, and the co-attention model uses both the attended audio and the attended visual feature vectors.
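
The view of attention as weighted global pooling can be sketched as follows: each region of a feature map receives a score from the guiding modality, the scores are normalized with a softmax, and the output is the weighted sum of the region features. In this sketch the score is a plain dot product between a region feature and the guide vector, which assumes both already live in a common space; the learned projections of the actual attention function are omitted, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(region_features, guide_vector):
    """Pool region features into one vector using guide-dependent weights.

    region_features: list of regions, each a list of floats (same length).
    guide_vector:    feature vector from the other modality (same length).
    The dot-product scoring is an assumption standing in for the learned
    attention function.
    """
    scores = [sum(f * g for f, g in zip(region, guide_vector))
              for region in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    pooled = [sum(w * region[d] for w, region in zip(weights, region_features))
              for d in range(dim)]
    return pooled, weights


if __name__ == "__main__":
    # Two hypothetical regions; the guide vector favors the first one.
    pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])
    print(weights)
    print(pooled)
```

When the guide carries no preference (all scores equal), the weights become uniform and the operation reduces to global average pooling. Swapping the roles of the two inputs gives the visual-guided audio variant under the same assumptions.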

To implement the visual-guided audio attention mechanism, we extract audio features from the last pooling layer of a pre-trained VGG-like model. Note that the network uses a log-mel spectrogram patch with $96 \times 64$ bins to represent a 1s waveform signal, so its pool5 layer produces feature maps with spatial resolution. This differs from the audio features of the A models in our main paper and in Tabs. 4 and 5 of this supplementary file, which are 128-D vectors extracted from the last fully-connected layer. We denote a model using the audio features of this section as A′ to differentiate it from the model A used in our main paper and in Tabs. 4 and 5.

Table 6 shows supervised audio-visual event localization results for the different attention models. The A′ model in Tab. 6 is worse than the A model in Tab. 5, which indicates that the audio features extracted from the last FC layer are more powerful. As in our main paper, V-att outperforms V. However, A′-att is not better than A′, and A′+V-co-att is slightly worse than A′+V, which suggests that visual-guided audio attention and audio-visual co-attention cannot effectively improve audio-visual event localization performance. Figure 6 shows visual results of the audio attention and visual attention mechanisms. Audio-guided visual attention clearly locates semantic regions containing sounding objects. We also observe that visual-guided audio attention tends to capture certain frequency patterns, but its results are hard to interpret; we leave this for future work.