Audio-Visual Instance Discrimination with Cross-Modal Agreement

Pedro Morgado, Nuno Vasconcelos, Ishan Misra

Introduction

Imagine the sound of waves. This sound can evoke the memory of many scenes: a beach, a pond, a river, etc. A single sound serves as a bridge that connects multiple scenes. It can group visual scenes that ‘go together’, and set apart the ones that do not. We leverage this property of freely occurring audio to learn video representations in a self-supervised manner.

A common technique is to set up a verification task that requires predicting if an input pair of video and audio is ‘correct’ or not. A correct pair is an ‘in-sync’ video and audio, and an incorrect pair can be constructed by using ‘out-of-sync’ audio or audio from a different video. However, a task that uses a single pair at a time misses a key opportunity to reason about the data distribution at large.

In our work, we propose a contrastive learning framework to learn cross-modal representations in a self-supervised manner by contrasting video representations against multiple audios at once (and vice versa). We leverage recent advances in contrastive learning to set up an Audio-Visual Instance Discrimination (AVID) task that learns a cross-modal similarity metric by grouping video and audio instances that co-occur. We show that the cross-modal discrimination task, i.e., predicting which audio matches a video, is more powerful than the within-modal discrimination task, i.e., predicting which video clips are from the same video. With this insight, our technique learns powerful visual representations that improve upon prior self-supervised methods on action recognition benchmarks like UCF-101 and HMDB-51.

We further identify important limitations of the AVID task and propose improvements that allow us to 1) reason about multiple instances and 2) optimize for visual similarity rather than just cross-modal similarity. We use Cross-Modal Agreement (CMA) to group together videos with high similarity in video and audio spaces. This grouping allows us to directly relate multiple videos as being semantically similar, and thus directly optimize for visual similarity in addition to cross-modal similarity. We show that CMA can identify semantically related videos, and that optimizing visual similarity among related videos significantly improves the learned visual representations. Specifically, CMA is shown to improve upon AVID on action recognition tasks such as Kinetics, UCF-101 and HMDB-51 under both linear probing and full fine-tuning evaluation protocols.

Related work

Self-supervised learning is a well-studied problem. Self-supervised methods often try to reconstruct the input data or impose constraints on the representation, such as sparsity, noise or invariance, to learn a useful and transferable feature representation. An emerging area of research uses the structural or domain-specific properties of visual data to algorithmically define ‘pretext tasks’. Pretext tasks are generally not useful by themselves and are used as a proxy to learn semantic representations. They can use the spatial structure in images, color, or temporal information in videos, among other sources of ‘self’ or naturally available supervision. We propose an unsupervised learning technique that leverages the naturally available signal in video and audio alignment.

Self-supervised learning can also make use of multiple modalities, rather than the visual data alone. As pointed out in prior work, co-occurring modalities such as audio can help learn powerful representations. For example, audio self-supervision has been shown to be useful for sound source localization and separation, lip-speech synchronization, visual representation learning and audio spatialization.

Audio-Visual Correspondence (AVC) is a standard task used in audio-video cross-modal learning. This task tries to align the visual and audio inputs by solving a binary classification problem. However, most methods use only a single video and a single audio at a time for learning. Thus, the model must reason about the distribution over multiple samples implicitly. In our work, we use a contrastive loss that opposes a large number of samples simultaneously. We show in section 5 that our method performs better than recent methods that use AVC.

Contrastive Learning techniques use a contrastive loss to learn representations either by predicting parts of the data, or by discriminating between individual training instances. Contrastive learning has also been used for learning representations from video alone. Tian et al. also use a contrastive approach, but propose to learn with a cross-modal objective applied to images and depth, or video and flow. In contrast, our method learns visual representations using audio as cross-modal targets. Compared to prior work, we present a new insight for audio-visual learning: optimizing cross-modal similarity is more beneficial than within-modal similarity. We also identify important limitations of cross-modal discrimination and present an approach that goes beyond instance discrimination by modeling Cross-Modal Agreement. This identifies groups of related videos and allows us to optimize for within-modal similarity between related videos. A concurrently proposed method uses alternating optimization to find clusters in the visual and audio feature spaces independently, and uses them to improve cross-modal features. While our CMA method bears a resemblance to theirs, we do not use alternating optimization, and we use agreements between the visual and audio representations to directly improve visual similarity rather than only cross-modal similarity. Finally, similar to our work, another concurrently proposed method also uses co-occurring modalities (optical flow and RGB) to expand the positive set. However, instead of mining positives based on an agreement between both modalities, it relies on the opposite modality alone.

Multi-view Learning.

Multi-view learning aims to find common representations from multiple views of the same phenomenon, and has been widely used to provide learning signals in unsupervised and semi-supervised applications. Classical approaches can be broadly categorized in co-training procedures that maximize the mutual agreement between views, multiple kernel learning procedures which use kernels to model different views, and subspace learning procedures which seek to find the latent space that generates all views of the data.

Multi-view data is an effective source of supervision for self-supervised representation learning. Examples include the motion and appearance of a video, depth and appearance, luminance and chrominance of an image, or, as in our work, sound and video.

Audio-Visual Instance Discrimination

We learn visual representations in a self-supervised manner from unconstrained video and audio by building upon recent advances in instance discrimination and contrastive learning.

Consider a dataset of $N$ samples (instances) $\mathcal{S}=\{s_{i}\}_{i=1}^{N}$ where each instance $s_{i}$ is a video $s^{v}_{i}$ with a corresponding audio $s^{a}_{i}$. The goal of Audio-Visual Instance Discrimination (AVID) is to learn visual and audio representations $(\mathbf{v}_{i},\mathbf{a}_{i})$ from the training instances $s_{i}$. The learned representations are optimized for ‘instance discrimination’, i.e., they must be discriminative of $s_{i}$ itself as opposed to other instances $s_{j}$ in the training set. Prior work shows that such a discriminative objective among instances learns semantic representations that capture similarities between the instances.

To accomplish this, two neural networks extract unit norm feature vectors $\mathbf{v}_{i}=f_{v}(s_{i}^{v})$ and $\mathbf{a}_{i}=f_{a}(s_{i}^{a})$ from the video and audio independently. Slow moving (exponential moving average) representations for both video and audio features $\{(\bar{\mathbf{v}}_{i},\bar{\mathbf{a}}_{i})\}_{i=1}^{N}$ are maintained as ‘memory features’ and used as targets for contrastive learning. The AVID task learns representations $(\mathbf{v}_{i},\mathbf{a}_{i})$ that are more similar to the memory features of the instance $(\bar{\mathbf{v}}_{i},\bar{\mathbf{a}}_{i})$ than to the memory features of other instances $(\bar{\mathbf{v}}_{j},\bar{\mathbf{a}}_{j})$, $j\!\neq\!i$. However, unlike previous approaches defined on a single modality (but similar to prior multi-modal contrastive work), AVID uses multiple modalities, and thus can assume multiple forms as depicted in fig. 2.

Self-AVID requires instance discrimination within the same modality: $\mathbf{v}_{i}$ to $\bar{\mathbf{v}}_{i}$ and $\mathbf{a}_{i}$ to $\bar{\mathbf{a}}_{i}$. This is equivalent to prior work independently applied to the two modalities.

Cross-AVID optimizes for cross-modal discrimination, i.e., the visual representation $\mathbf{v}_{i}$ is required to discriminate the accompanying audio memory $\bar{\mathbf{a}}_{i}$ and vice-versa.

Joint-AVID combines the Self-AVID and Cross-AVID objectives.

It is not immediately obvious what the relative advantages, if any, of these variants are. In section 3.3, we provide an in-depth empirical study of the impact of these choices on the quality of the learned representations. We now describe the training procedure in detail.

3.2 AVID training procedure

AVID is trained using a contrastive learning framework, where instance representations are contrasted to those of other (negative) samples.

While various loss functions have been defined for contrastive learning, we focus on noise contrastive estimation (NCE). Let $\bar{\mathbf{x}}_{i}$ denote the (memory) target representation for a sample $s_{i}$. The probability that a feature $\mathbf{x}$ belongs to sample $s_{i}$ is modeled by a generalized softmax function

$$P(s_{i}|\mathbf{x})=\frac{1}{N\bar{Z}}\exp(\mathbf{x}^{T}\bar{\mathbf{x}}_{i}/\tau), \tag{1}$$

where $\bar{Z}=\tfrac{1}{N}\sum_{\bar{\mathbf{x}}}[\exp(\mathbf{x}^{T}\bar{\mathbf{x}}/\tau)]$ is the normalized partition function and $\tau$ is a temperature hyper-parameter that controls the softness of the distribution. In the case of AVID, $\mathbf{x}$ and $\bar{\mathbf{x}}$ may or may not be from the same modality.

The network $f$ is trained to learn representations by solving multiple binary classification problems where it must choose its own target representation $\bar{\mathbf{x}}_{i}$ over representations $\bar{\mathbf{x}}_{j}$ in a negative set. The negative set consists of $K$ ‘other’ instances drawn uniformly from $\mathcal{S}$, i.e., $\mathcal{N}_{i}=\mathcal{U}(\mathcal{S})^{K}$. The probability of a feature $\mathbf{x}$ being from instance $s_{i}$ as opposed to the instances from the uniformly sampled negative set $\mathcal{N}_{i}$ is given as $P(D=1|\mathbf{x},\bar{\mathbf{x}}_{i})=\frac{P(s_{i}|\mathbf{x})}{P(s_{i}|\mathbf{x})+K/N}.$ The NCE loss is defined as the negative log-likelihood

$$\mathcal{L}_{\text{NCE}}(\mathbf{x}_{i};\bar{\mathbf{x}}_{i},\mathcal{N}_{i})=-\log P(D=1|\mathbf{x}_{i},\bar{\mathbf{x}}_{i})-\sum_{j\in\mathcal{N}_{i}}\log P(D=0|\mathbf{x}_{i},\bar{\mathbf{x}}_{j}), \tag{2}$$

where $P(D=0|\cdot)=1-P(D=1|\cdot)$.

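The three variants of AVID depicted in fig. 2 are trained to optimize variations of the NCE loss of eq. 2, by varying the target representations $\bar{\mathbf{x}}_{i}$:

$$\mathcal{L}_{\text{Self-AVID}}=\mathcal{L}_{\text{NCE}}(\mathbf{v}_{i};\bar{\mathbf{v}}_{i},\mathcal{N}_{i})+\mathcal{L}_{\text{NCE}}(\mathbf{a}_{i};\bar{\mathbf{a}}_{i},\mathcal{N}_{i}) \tag{3}$$

$$\mathcal{L}_{\text{Cross-AVID}}=\mathcal{L}_{\text{NCE}}(\mathbf{v}_{i};\bar{\mathbf{a}}_{i},\mathcal{N}_{i})+\mathcal{L}_{\text{NCE}}(\mathbf{a}_{i};\bar{\mathbf{v}}_{i},\mathcal{N}_{i}) \tag{4}$$

$$\mathcal{L}_{\text{Joint-AVID}}=\mathcal{L}_{\text{Self-AVID}}+\mathcal{L}_{\text{Cross-AVID}} \tag{5}$$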

We analyze these variants next and show that the seemingly minor differences between them translate to significant differences in performance.
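To make the difference between the variants concrete, the following is a minimal pure-Python sketch of an NCE objective and of how Self-, Cross- and Joint-AVID differ only in which memory bank supplies the targets. It assumes the standard NCE form (one positive term plus one term per sampled negative) and a fixed partition constant; the names `model_prob`, `nce_loss` and `avid_losses` are hypothetical and not taken from the authors' implementation.

```python
import math

def dot(x, y):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def model_prob(x, target, N, Z, tau=0.07):
    """Generalized softmax P(s_i | x) for one memory target, with a fixed Z."""
    return math.exp(dot(x, target) / tau) / (N * Z)

def nce_loss(x, pos, negs, N, Z, tau=0.07):
    """NCE negative log-likelihood for one positive target and K negatives."""
    noise = len(negs) / N                          # K / N, uniform noise mass
    p_pos = model_prob(x, pos, N, Z, tau)
    loss = -math.log(p_pos / (p_pos + noise))      # -log P(D=1 | x, pos)
    for neg in negs:
        p_neg = model_prob(x, neg, N, Z, tau)
        loss -= math.log(noise / (p_neg + noise))  # -log P(D=0 | x, neg)
    return loss

def avid_losses(v, a, v_mem, a_mem, i, neg_idx, Z=1.0, tau=0.07):
    """Self-, Cross- and Joint-AVID losses for instance i (illustrative only)."""
    N = len(v_mem)
    v_negs = [v_mem[j] for j in neg_idx]
    a_negs = [a_mem[j] for j in neg_idx]
    # Self-AVID: each modality is contrasted against its own memory bank.
    self_avid = (nce_loss(v, v_mem[i], v_negs, N, Z, tau)
                 + nce_loss(a, a_mem[i], a_negs, N, Z, tau))
    # Cross-AVID: video features target audio memories and vice versa.
    cross_avid = (nce_loss(v, a_mem[i], a_negs, N, Z, tau)
                  + nce_loss(a, v_mem[i], v_negs, N, Z, tau))
    return {"self": self_avid, "cross": cross_avid,
            "joint": self_avid + cross_avid}
```

In this sketch the negatives are passed in as indices (`neg_idx`), so the caller is responsible for drawing them uniformly from the dataset.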

3.3 Analyzing AVID

We present experiments to analyze various properties of the AVID task and understand the key factors that enable the different variants of AVID to learn good representations.

Experimental Setup. We briefly describe the experimental setup for analysis and provide the full details in the supplemental.

Pre-training Dataset. All models are trained using the Audioset dataset, which contains 1.8M videos focusing on audio events. We randomly subsample 100K videos from this dataset to train our models. We use input video and audio clips of 1 and 2-second duration, respectively. The video model is trained on 16 frames of size $112\!\times\!112$ with standard data augmentation. We preprocess the audio by randomly sampling the audio within 0.5 seconds of the video and compute a log spectrogram of size $100\!\times\!129$ (100 time steps with 129 frequency bands).

Video and audio models. The video model is a smaller, 9-layer version of the R(2+1)D models proposed in prior work. The audio network is a 9-layer 2D ConvNet with batch normalization. In both cases, output activations are max-pooled, projected into a 128-dimensional feature using a multi-layer perceptron (MLP), and normalized onto the unit sphere. The MLP is composed of three fully connected layers with 512 hidden units.
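The projection step described above (max-pool, three-layer MLP, unit-norm output) can be sketched as follows. This is only an illustration in pure Python: the ReLU non-linearity between layers, the random initialization and the helper names `build_head` and `project` are assumptions, since the text specifies only the layer count, the 512 hidden units and the 128-dimensional output.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(x, layer):
    """Apply one fully connected layer to the vector x."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def build_head(n_in, hidden=512, n_out=128, seed=0):
    """Three fully connected layers: n_in -> 512 -> 512 -> 128."""
    rng = random.Random(seed)
    return [init_layer(n_in, hidden, rng),
            init_layer(hidden, hidden, rng),
            init_layer(hidden, n_out, rng)]

def project(feature_map, layers):
    """Max-pool over positions, apply the MLP, then L2-normalize."""
    # feature_map is a list of positions, each a list of channel activations.
    h = [max(channel) for channel in zip(*feature_map)]  # global max-pool
    for k, layer in enumerate(layers):
        h = linear(h, layer)
        if k < len(layers) - 1:
            h = [max(0.0, t) for t in h]                 # ReLU (assumed)
    norm = math.sqrt(sum(t * t for t in h))
    return [t / norm for t in h]                         # unit-norm feature
```

Because the output is unit-norm, inner products between projected features are cosine similarities, which is what the contrastive objective compares.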

Pre-training details. AVID variants are trained to optimize the loss in Equations 3-5 with 1024 random negatives. In early experiments, we increased the number of negatives up to 8192 without seeing noticeable differences in performance. Following prior work, we set the temperature hyper-parameter $\tau$ to 0.07 and the EMA update constant to 0.5, and we approximate the normalized partition function $\bar{Z}$ during the first iteration and keep it constant thereafter ($\bar{Z}=2.2045$). All models are trained with the Adam optimizer for 400 epochs with a learning rate of 1e-4, weight decay of 1e-5, and batch size of 256.
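As an illustration of the memory features, the sketch below shows one plausible form of the exponential moving average update with constant 0.5. It assumes that the constant weights the old memory and that the result is re-normalized to unit length; neither detail is stated in the text, and `update_memory` is a hypothetical name.

```python
import math

def update_memory(memory, feature, momentum=0.5):
    """Blend the stored memory with a fresh feature and re-normalize.

    Assumption: `momentum` is the weight on the old memory vector.
    """
    mixed = [momentum * m + (1.0 - momentum) * f
             for m, f in zip(memory, feature)]
    norm = math.sqrt(sum(t * t for t in mixed))
    return [t / norm for t in mixed]   # keep memories on the unit sphere
```

With a constant of 0.5 the memory tracks the network closely: after a few updates, older features contribute very little to the stored target.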

Downstream tasks. We evaluate both the visual and audio features using transfer learning.

Visual Features: We use the Kinetics dataset for action recognition. We evaluate the pre-trained features by linear probing, where we keep the pre-trained network fixed and train linear classifiers. We report top-1 accuracy on held-out data by averaging predictions over 25 clips per video.

Audio Features: We evaluate the audio features on the ESC-50 dataset by training linear classifiers on fixed features from the pre-trained audio network. Similar to the video case, we report top-1 accuracy by averaging predictions over 25 clips per video.

Cross vs. within-modal instance discrimination

We study the three variants of AVID depicted in fig. 2 to understand the differences between cross-modal and within-modal instance discrimination and their impact on the learned representations. We evaluate the video and audio feature representations from these variants and report results in table 1. We observe that Self-AVID is consistently outperformed by the Cross-AVID variant on both visual and audio tasks.

We believe the reason is that Self-AVID uses within-modality instance discrimination, which is an easier pretext task and can be partially solved by matching low-level statistics of the data. This hypothesis is supported by the fact that Joint-AVID, which combines the objectives of both Cross-AVID and Self-AVID, also gives worse performance than Cross-AVID. These results highlight that one cannot naively use within-modality instance discrimination when learning audio-visual representations. In contrast, Cross-AVID uses a “harder” cross-modal instance discrimination task where the video features are required to match the corresponding audio and vice-versa. As a result, it generalizes better to downstream tasks.

Beyond Instance Discrimination: Cross-Modal Agreement

We will show in section 5 that Cross-AVID achieves state-of-the-art performance on action recognition downstream tasks. However, we identify three important limitations in the instance discrimination framework of eq. 2 and the cross-modal loss of eq. 4.

Limited to instances: Instance discrimination does not account for interactions between instances. Thus, two semantically related instances are never grouped together and considered ‘positives’.

False negative sampling: The negative set $\mathcal{N}_{i}$, which consists of all other instances $s_{j}$, may include instances semantically related to $s_{i}$. To make matters worse, contrastive learning requires a large number $K$ of negatives, increasing the likelihood that semantically related samples are used as negatives. This contradicts the goal of representation learning, which is to generate similar embeddings of semantically related inputs.

No within-modality calibration: The Cross-AVID loss of Equation 4 does not directly optimize for visual similarity $\mathbf{v}_{i}^{T}\mathbf{v}_{j}$. In fact, as shown experimentally in section 3.3, doing so can significantly hurt performance. Nevertheless, the lack of within-modality calibration is problematic, as good visual representations should reflect visual feature similarities.

We extend AVID with Cross-Modal Agreement (CMA) to address these shortcomings. CMA builds upon insights from prior work in multi-view learning. We hypothesize that, if two samples are similar in both visual and audio feature space, then they are more likely to be semantically related than samples that agree in only one feature space (or do not agree at all). We thus consider instances that agree in both feature spaces to be ‘positive’ samples for learning representations. Similarly, examples with a poor agreement in either (or both) spaces are used as negatives. When compared to instance discrimination methods, CMA uses a larger positive set of semantically related instances and a more reliable negative set.

4.2 CMA Learning Objective

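We define an agreement score for two instances $s_{i}$ and $s_{j}$ as

$$\rho_{ij}=\min\left(\rho(\mathbf{v}_{i},\mathbf{v}_{j}),\rho(\mathbf{a}_{i},\mathbf{a}_{j})\right), \tag{6}$$

where $\rho(\cdot,\cdot)$ is a similarity measure between two feature vectors.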

This is large only when both the audio and video similarities are large. A set of positives and negatives is then defined per instance $s_{i}$. The positive set $\mathcal{P}_{i}$ contains the samples that are most similar to $s_{i}$ in both spaces, while the negative set $\mathcal{N}_{i}$ is the complement of $\mathcal{P}_{i}$:

$$\mathcal{P}_{i}=\underset{j=1,\ldots,N}{\text{TopK}}(\rho_{ij}),\qquad\mathcal{N}_{i}=\{j\,|\,s_{j}\in(\mathcal{S}\setminus\mathcal{P}_{i})\}. \tag{7}$$
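The construction of the positive and negative sets can be sketched in a few lines of pure Python. The sketch assumes that the agreement score is the minimum of the video and audio inner products of unit-norm memory features (one choice that is large only when both are large) and that an instance is excluded from its own positive set; the function names are hypothetical.

```python
def dot(x, y):
    """Inner product; equals cosine similarity for unit-norm features."""
    return sum(a * b for a, b in zip(x, y))

def agreement(v_mem, a_mem, i, j):
    """Agreement of instances i and j: high only if BOTH modalities agree."""
    return min(dot(v_mem[i], v_mem[j]), dot(a_mem[i], a_mem[j]))

def positive_set(v_mem, a_mem, i, k):
    """Indices of the k instances with the highest agreement with i."""
    scored = [(agreement(v_mem, a_mem, i, j), j)
              for j in range(len(v_mem)) if j != i]   # skip i (assumption)
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]

def negative_set(n, i, positives):
    """All remaining instances: the complement of the positive set."""
    excluded = set(positives) | {i}
    return [j for j in range(n) if j not in excluded]
```

For example, an instance that looks identical to $s_{i}$ but sounds different receives a low agreement score and ends up in the negative set rather than the positive set.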

Furthermore, CMA enables self-supervision beyond single instances. This is achieved with a generalization of the AVID task, which accounts for the correspondences of eq. 7. At training time, $K_{n}$ negative instances are drawn per sample $s_{i}$ from the associated negative set $\mathcal{N}_{i}$ to form the set $\mathcal{N}^{\prime}_{i}=\mathcal{U}(\mathcal{N}_{i})^{K_{n}}$. The networks $f_{v},f_{a}$ are learned to optimize a combination of cross-modal instance discrimination and within-modal positive discrimination (wMPD). The former is encouraged through the Cross-AVID loss of Equation 4. The latter exploits the fact that CMA defines multiple positive instances $\mathcal{P}_{i}$, thus enabling the optimization of within-modality positive discrimination

$$\mathcal{L}_{\text{wMPD}}=\frac{1}{K_{p}}\sum_{p\in\mathcal{P}_{i}}\left[\mathcal{L}_{\text{NCE}}(\mathbf{v}_{i};\bar{\mathbf{v}}_{p},\mathcal{N}^{\prime}_{i})+\mathcal{L}_{\text{NCE}}(\mathbf{a}_{i};\bar{\mathbf{a}}_{p},\mathcal{N}^{\prime}_{i})\right], \tag{8}$$

where $K_{p}$ is the number of positives in $\mathcal{P}_{i}$.

Note that, unlike the Self-AVID objective of Equation 3, this term calibrates within-modal similarities between positive samples. This avoids within-modal comparisons to the instance itself, which was experimentally shown to produce weak representations in section 3.3. We then minimize the weighted sum of the two losses

$$\mathcal{L}_{\text{CMA}}=\mathcal{L}_{\text{Cross-AVID}}+\lambda\,\mathcal{L}_{\text{wMPD}}, \tag{9}$$

where $\lambda>0$ is a hyper-parameter that controls the relative weight of the two losses.
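The weighted combination can be illustrated with a small pure-Python sketch. It assumes a standard NCE loss with a fixed partition constant and assumes that the within-modal term is averaged over the positives; `cma_loss` and its arguments are hypothetical names, not the authors' API.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def nce_loss(x, pos, negs, N, Z=1.0, tau=0.07):
    """NCE loss for one positive target and a list of negative targets."""
    noise = len(negs) / N
    p_pos = math.exp(dot(x, pos) / tau) / (N * Z)
    loss = -math.log(p_pos / (p_pos + noise))
    for neg in negs:
        p_neg = math.exp(dot(x, neg) / tau) / (N * Z)
        loss -= math.log(noise / (p_neg + noise))
    return loss

def cma_loss(v, a, v_mem, a_mem, i, positives, negatives, lam=1.0):
    """Return (total, cross, wmpd) for instance i (illustrative only)."""
    N = len(v_mem)
    v_negs = [v_mem[j] for j in negatives]
    a_negs = [a_mem[j] for j in negatives]
    # Cross-modal instance discrimination against the instance's own memory.
    cross = (nce_loss(v, a_mem[i], a_negs, N)
             + nce_loss(a, v_mem[i], v_negs, N))
    # Within-modal discrimination of the positives, never of i itself.
    wmpd = sum(nce_loss(v, v_mem[p], v_negs, N)
               + nce_loss(a, a_mem[p], a_negs, N)
               for p in positives) / len(positives)
    return cross + lam * wmpd, cross, wmpd
```

Setting `lam` close to zero recovers the Cross-AVID objective, which is the baseline that the CMA variants are compared against.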

After Cross-AVID pre-training, cross-modal disagreements are corrected by finetuning the audio and video networks to minimize the loss in eq. 9. Models are initialized with the Cross-AVID model at epoch 200, and trained for 200 additional epochs. We compare these models to a Cross-AVID model trained for 400 epochs, thus controlling for the total number of parameter updates. For each sample, we find 32 positive instances using the CMA criterion of eq. 7 applied to video and audio memory bank representations. For efficiency purposes, the positive set is updated every 50 epochs. In each iteration, 1024 negative memories (not overlapping with positives) were sampled. These positive and negative memories were then used to minimize the CMA loss of Equations 8-9. For evaluation purposes, we use the same protocol as in section 3.3.

4.3 Analyzing CMA

The CMA objective consists of two terms that optimize cross-modal (eq. 4) and within-modal (eq. 8) similarity. We observed in section 3.3 that within-modal comparisons for instance discrimination result in poor visual representations due to the relatively easy task of self-discrimination. Intuitively, since CMA identifies groups of instances ($\mathcal{P}_{i}$) that are likely related, calibrating within-modal similarity within these groups (instead of within the instance itself) should result in a better visual representation. To study this, we use CMA to obtain a positive set $\mathcal{P}_{i}$ and analyze the CMA objective of eq. 9 by evaluating with different values of the hyper-parameter $\lambda$. The results shown in fig. 3 validate the advantages of CMA over Cross-AVID.

To understand the effect of the CMA procedure on within-modal similarities, we analyzed the embedding space defined by memory bank representations obtained with AVID and CMA trained on the Kinetics dataset. Since representations are restricted to the unit sphere (due to normalization), the average inner-product between two randomly chosen samples should be 0 (assuming a uniform distribution of samples over the sphere). However, when training with Cross-AVID, the average inner-product is 0.23. This means that Cross-AVID learns collapsed representations (i.e., features are on average closer to other random features than the space permits). This is likely due to the lack of within-modal negatives when training for cross-modal discrimination. By seeking within-modal discrimination of positive samples, CMA effectively addresses the feature collapsing problem observed for Cross-AVID, and yields an average dot-product between random memories of 0 as expected.
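The expectation of a zero average inner product can be checked numerically. The sketch below, a toy simulation and not the authors' analysis, samples vectors uniformly on the unit sphere and compares them with an artificially ‘collapsed’ set obtained by shifting every vector along a shared direction; the shift size and all helper names are illustrative assumptions.

```python
import math
import random

def normalize(x):
    norm = math.sqrt(sum(t * t for t in x))
    return [t / norm for t in x]

def random_unit_vector(dim, rng):
    """Uniform sample on the unit sphere via a normalized Gaussian."""
    return normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])

def collapse(x, offset):
    """Shift a vector along the first axis and re-normalize."""
    return normalize([t + offset if k == 0 else t for k, t in enumerate(x)])

def mean_inner_product(vectors, n_pairs, rng):
    """Average inner product over randomly drawn pairs of distinct vectors."""
    total = 0.0
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(vectors)), 2)
        total += sum(a * b for a, b in zip(vectors[i], vectors[j]))
    return total / n_pairs

rng = random.Random(0)
uniform = [random_unit_vector(128, rng) for _ in range(300)]
collapsed = [collapse(x, 0.55) for x in uniform]
uniform_mean = mean_inner_product(uniform, 2000, rng)      # close to zero
collapsed_mean = mean_inner_product(collapsed, 2000, rng)  # clearly positive
```

A shared offset of this kind is only one way to produce a positive average similarity; it is used here to show what a non-zero mean looks like, not to explain why Cross-AVID features collapse.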

CMA vs. within-modal expansion.

CMA expands the positive set $\mathcal{P}_{i}$ to include instances that agree in both video and audio spaces. We inspected whether modeling this agreement is necessary for relating instances by exploring alternatives that do not model agreements in both spaces (see fig. 4(a)). We consider alternatives that expand the set $\mathcal{P}_{i}$ by looking at instances that are similar in 1) only the audio space; 2) only the video space; or 3) either video or audio space. Each method in fig. 4(a) is trained to optimize the objective of eq. 9 with the corresponding $\mathcal{P}_{i}$ . We also compare against the Cross-AVID baseline that uses only the instance itself as the positive set. Transfer performance is reported in fig. 4(b).

Compared to Cross-AVID, expanding the set of positives using only audio similarity (third row) hurts performance on Kinetics, and relying on video similarities alone (second row) only provides marginal improvements. We believe that expanding the set of positives only based on visual similarity does not improve the performance of visual features since the positives are already close in the feature space, and do not add extra information. CMA provides consistent gains over all methods on Kinetics, suggesting that modeling agreement can provide better positive sets for representation learning of visual features.

Qualitative Understanding.

We show examples of positive and negative samples found by CMA in fig. 5 and observe that CMA can group together semantically related concepts. As it uses agreement between both spaces, visually similar concepts, like ‘ambulance’ and ‘bus’ (second row), can be distinguished based on audio similarity. This leads to more precise positive sets $\mathcal{P}_{i}$, as can be verified by inspecting the precision $@K$ of $\mathcal{P}_{i}$ measured against ground truth labels (fig. 4(c)). CMA consistently finds more precise positives compared to within-modal expansion methods, showing the advantages of modeling agreement.
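For reference, precision $@K$ of a positive set can be computed as in the short sketch below. It assumes one ground-truth label per instance and a positive list already ranked by decreasing agreement; `precision_at_k` is a hypothetical helper.

```python
def precision_at_k(positives, labels, i, k):
    """Fraction of the top-k positives of instance i that share its label."""
    top = positives[:k]                       # positives ranked by agreement
    hits = sum(1 for j in top if labels[j] == labels[i])
    return hits / len(top)

# Toy usage: instance 0 has label "siren"; two of its three positives match.
labels = ["siren", "siren", "bus", "siren"]
ranked_positives = [1, 2, 3]
p_at_2 = precision_at_k(ranked_positives, labels, 0, 2)
p_at_3 = precision_at_k(ranked_positives, labels, 0, 3)
```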

Cross-AVID and CMA at scale

Previous sections provide experimental validation for the proposed Cross-AVID and CMA procedures when training on a medium-sized dataset (100K videos from Audioset). We now study the proposed methods on large-scale datasets. We also compare Cross-AVID and CMA to prior work, including video-based self-supervised learning methods, and methods that leverage the natural correspondence between audio and video.

We briefly describe the experimental setup, and refer the reader to the supplementary material for full details. We use the 18-layer R(2+1)D network as the video encoder and a 9-layer (2D) CNN with batch normalization as the audio encoder. Models are trained on Kinetics-400 and the full Audioset datasets, containing 240K and 1.8M video instances, respectively. Video clips composed of $8$ frames of size $224\!\times\!224$ are extracted at a frame rate of 16fps with standard data augmentation procedures. Two seconds of audio is randomly sampled within 0.5 seconds of the video at a 24kHz sampling rate, and spectrograms of size $200\times 257$ (200 time steps with 257 frequency bands) are used as the input to the audio network. For Cross-AVID, the cross-modal discrimination loss of Equation 4 is optimized with $K=1024$ negative instances. We then find 128 positive instances for each sample using cross-modal agreements (eq. 7), and optimize the CMA criterion of eq. 9 with $K_{p}=32$ positives, $K_{n}=1024$ negatives and $\lambda=1.0$. Video representations are evaluated on action recognition (section 5.1), and audio representations on sound classification (section 5.2).

5.1 Action recognition

We first evaluate the visual representations learned by Cross-AVID and AVID+CMA by training a linear classifier for the task of action recognition on the Kinetics dataset. The top-1 accuracy is reported for clip and video-level predictions. Clip-level predictions are obtained from a single 8-frame clip, while video-level predictions are computed by averaging clip-level predictions from 10 clips uniformly sampled from the whole video. The results shown in table 2 clearly demonstrate the advantage of calibrating AVID representations using the CMA procedure, yielding significant gains across both metrics and pretraining datasets. These results demonstrate the value of the CMA procedure in large-scale datasets, thus showing that its effect goes beyond a simple regularization procedure to prevent overfitting.

To compare to prior work, we follow common practice and evaluate visual representations on the UCF-101 and HMDB-51 datasets, by full network fine-tuning. Due to the large variability of experimental setups used in the literature, it is unrealistic to provide a direct comparison to all methods, as these often use different network encoders trained on different datasets with input clips of different lengths. To increase the range of meaningful comparisons, we fine-tuned our models using clips with both 8 and 32 frames. At inference time, video-level predictions are provided by averaging clip-level predictions for 10 uniformly sampled clips. We report top-1 accuracy averaged over the three train/test splits provided with the original datasets.
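The video-level inference step can be sketched as follows. The sketch assumes that ‘predictions’ are per-class probabilities and that uniform sampling means evenly spaced clip start frames; both are assumptions, and the helper names are hypothetical.

```python
def uniform_clip_starts(n_frames, clip_len, n_clips=10):
    """Evenly spaced start frames covering the whole video (assumed scheme)."""
    last_start = max(0, n_frames - clip_len)
    if n_clips == 1:
        return [0]
    return [round(k * last_start / (n_clips - 1)) for k in range(n_clips)]

def video_level_prediction(clip_probs):
    """Average clip-level class probabilities and return (label, mean)."""
    n_clips = len(clip_probs)
    mean = [sum(per_class) / n_clips for per_class in zip(*clip_probs)]
    label = max(range(len(mean)), key=mean.__getitem__)
    return label, mean

# Toy usage: three clips, two classes.
starts = uniform_clip_starts(n_frames=100, clip_len=8, n_clips=10)
label, mean = video_level_prediction([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
```

Averaging lets confident clips outweigh ambiguous ones, which is why video-level accuracy is typically reported alongside clip-level accuracy.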

Table 3 compares the transfer performance of Cross-AVID and CMA with previous self-supervised approaches. To enable well-grounded comparisons, we also list for each method the pre-training dataset and clip dimensions used while finetuning on UCF and HMDB. Despite its simplicity, Cross-AVID achieves state-of-the-art performance for equivalent data settings in most cases. In particular, when pre-trained on Audioset, Cross-AVID outperformed other audio-visual SSL methods such as L3 and AVTS by at least 1.0% on UCF and 2.5% on HMDB. Similar to Cross-AVID, L3 and AVTS propose to learn audio-visual representations by predicting whether audio/video pairs are in-sync. However, these methods optimize for the audio-visual correspondence task, which fails to reason about the data distribution at large. Cross-AVID also outperformed the concurrently proposed XDC under equivalent data settings. When pretrained on Audioset and finetuned on UCF with 32 frames, XDC does report higher accuracy, but that model was pretrained and finetuned using 32 frames, while we pretrain using only 8 frames. It should be noted that, when pretraining and finetuning with clips of 8 frames, Cross-AVID outperforms XDC by 3.4% (88.3% vs. 84.9%). CMA further improves the performance of Cross-AVID on all settings considered (i.e., using both Kinetics and Audioset pretraining datasets, and evaluating on UCF and HMDB). We observed, however, that the improvements of CMA over Cross-AVID are smaller under the fine-tuning protocol than under the linear evaluation of table 2. Prior work observes that full fine-tuning significantly modifies the visual features and tests the network initialization aspect of pre-training rather than the semantic quality of the representation. Thus, we believe that the feature calibration benefits of CMA are diminished under the full finetuning protocol.

5.2 Sound recognition

Audio representations are evaluated on the ESC-50 and DCASE datasets by linear probing for the task of sound recognition. Following prior work, both ESC and DCASE results are obtained by training a linear one-vs-all SVM classifier on the audio representations generated by the pre-trained models at the final layer before pooling. For training, we extract 10 clips per sample on the ESC dataset and 60 clips per sample on DCASE. At test time, sample-level predictions are obtained by averaging 10 clip-level predictions, and the top-1 accuracy is reported in table 4. For the ESC dataset, performance is the average over the 5 original train/test splits. Similarly to video, audio representations learned by Cross-AVID and CMA outperform prior work, surpassing ConvRBM on the ESC dataset by 2.7% and AVTS on DCASE by 3%.

Discussion

We proposed a self-supervised method to learn visual and audio representations by contrasting visual representations against multiple audios, and vice versa. Our method, Audio-Visual Instance Discrimination (AVID), builds upon recent advances in contrastive learning to learn state-of-the-art representations that outperform prior work on action recognition and sound classification. We propose and analyze multiple variants of the AVID task to show that optimizing for cross-modal similarity, and not within-modal similarity, matters for learning from video and audio.

We also identified key limitations of the instance discrimination framework and proposed CMA to use agreement in the video and audio feature spaces to group together related videos. CMA helps us relate multiple instances by identifying more related videos. CMA also helps us reject ‘false positives’, i.e., videos that are similar visually but differ in the audio space. We show that using these groups of related videos allows us to optimize for within-modal similarity, in addition to cross-modal similarity, and improve visual and audio representations. The generalization of CMA suggests that cross-modal agreements provide non-trivial correspondences between samples and are a useful way to learn improved representations in a multi-modal setting.

Acknowledgements

We are grateful to Rob Fergus and Laurens van der Maaten for their feedback and support; Rohit Girdhar for feedback on the manuscript; and Bruno Korbar for help with the baselines.

Appendix A Experimental setup

The architecture details of the video and audio networks used in the analysis experiments are shown in appendix C, and those used for comparison to prior work are shown in appendix C.

Pre-training hyper-parameters

Optimization and data augmentation hyper-parameters for AVID and CMA pre-training are provided in table 7.

Action recognition hyper-parameters

Optimization and data augmentation hyper-parameters for action recognition tasks are provided in table 8.

Video pre-processing

Video clips are extracted at 16 fps and augmented with standard techniques, namely random multi-scale cropping with 8% minimum area, random horizontal flipping and color and temporal jittering. Color jittering hyper-parameters are shown in table 7 for pre-training and table 8 for transfer into downstream tasks.

Audio pre-processing

Audio signals are loaded at 24kHz, instead of 48kHz, because a large number of Audioset audio samples do not contain these high frequencies. The spectrogram is computed by taking the FFT on 20ms windows with either 10ms (§4, §5) or 20ms (§6) hop-size. We then convert the spectrogram to a log scale, and Z-normalize its intensity using mean and standard deviation values computed on the training set. We use volume and temporal jittering for data augmentation. Volume jittering is accomplished by multiplying the audio waveform by a constant factor randomly sampled between 0.9 and 1.1, and applied uniformly over time. Temporal jittering is done by randomly sampling the audio starting time within 0.5s of the video, randomly selecting the total audio duration between 1.4s and 2.8s, and rescaling back to the expected number of audio frames.
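The numbers above can be tied together with a short sketch of the spectrogram geometry and the two jittering steps. It assumes an FFT size equal to the next power of two above the 20ms window (which yields 257 frequency bands at 24kHz), one frame per hop, and a symmetric start-time offset; these choices and all function names are illustrative assumptions rather than the authors' settings.

```python
import random

SAMPLE_RATE = 24000  # Hz, as stated in the text

def next_pow2(n):
    """Smallest power of two that is >= n."""
    p = 1
    while p < n:
        p *= 2
    return p

def spectrogram_shape(duration_s, win_ms=20, hop_ms=10, sr=SAMPLE_RATE):
    """(time steps, frequency bands) for a clip of the given duration."""
    win = sr * win_ms // 1000            # 480 samples for a 20 ms window
    hop = sr * hop_ms // 1000            # 240 samples for a 10 ms hop
    n_fft = next_pow2(win)               # assumed FFT size
    n_frames = int(round(duration_s * sr / hop))  # one frame per hop (assumed)
    return n_frames, n_fft // 2 + 1

def volume_jitter(waveform, rng, low=0.9, high=1.1):
    """Scale the whole waveform by one random gain, constant over time."""
    gain = rng.uniform(low, high)
    return [gain * s for s in waveform]

def temporal_jitter(video_start_s, rng, max_offset_s=0.5,
                    min_dur_s=1.4, max_dur_s=2.8):
    """Random audio start near the video start and a random duration."""
    start = video_start_s + rng.uniform(-max_offset_s, max_offset_s)
    duration = rng.uniform(min_dur_s, max_dur_s)
    return max(0.0, start), duration
```

Under these assumptions a 2-second clip with a 10ms hop gives 200 time steps and 257 bands, matching the $200\times 257$ input size quoted for the large-scale experiments.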

Appendix B Longer AVID pre-training

To ensure that the benefits of CMA are not caused by longer training, we trained Cross-AVID for the same number of epochs as AVID+CMA. The Cross-AVID performance on Kinetics after 200 and 400 training epochs is shown in table 5. Cross-AVID transfer performance seems to have already saturated after 200 epochs of pre-training.

Appendix C CMA calibration

To further study the effect of the CMA procedure, we measured the classification performance of memory representations obtained with both AVID and CMA trained on the Kinetics dataset. We randomly split the 220K training samples, for which memory representations are available, into a train/validation set (70/30% ratio). We then train a linear classifier on the training set (using either video, audio or the concatenation of both, with the ConvNet kept fixed), and evaluate the performance on the validation set. The train/validation splits are sampled 5 times and average performance is reported. The top-1 accuracies are shown in table 6.