SoundNet: Learning Sound Representations from Unlabeled Video

Yusuf Aytar, Carl Vondrick, Antonio Torralba

Introduction

The fields of object recognition, speech recognition, and machine translation have been revolutionized by the emergence of massive labeled datasets and learned deep representations. However, natural sound understanding has not yet seen comparable progress. We attribute this partly to the lack of large labeled datasets of sound, which are often both expensive and ambiguous to collect. We believe that large-scale sound data can also significantly advance natural sound understanding. In this paper, we leverage over one year of sounds collected in-the-wild to learn semantically rich sound representations.

We propose to scale up by capitalizing on the natural synchronization between vision and sound to learn an acoustic representation from unlabeled video. Unlabeled video has the advantage that it can be economically acquired at massive scales, yet contains useful signals about sound. Recent progress in computer vision has enabled machines to recognize scenes and objects in images and videos with good accuracy. We show how to transfer this discriminative visual knowledge into sound using unlabeled video as a bridge.

We present a deep convolutional network that learns directly on raw audio waveforms, which is trained by transferring knowledge from vision into sound. Although the network is trained with visual supervision, the network has no dependence on vision during inference. In our experiments, we show that the representation learned by our network obtains state-of-the-art accuracy on three standard acoustic scene classification datasets. Since we can leverage large amounts of unlabeled sound data, it is feasible to train deeper networks without significant overfitting, and our experiments suggest deeper models perform better. Visualizations of the representation suggest that the network is also learning high-level detectors, such as recognizing bird chirps or crowds cheering, even though it is trained directly from audio without ground truth labels.

The primary contribution of this paper is the development of a large-scale and semantically rich representation for natural sound. We believe large-scale models of natural sounds can have a large impact in many real-world applications, such as robotics and cross-modal understanding. The remainder of this paper describes our method and experiments in detail. We first review related work. In section 2, we describe our unlabeled video dataset and in section 3 we present our network and training procedure. Finally in section 4 we conclude with experiments on standard benchmarks and show several visualizations of the learned representation. Code, data, and models will be released.

Sound Recognition: Although large-scale audio understanding has been extensively studied in the context of music and speech recognition, we focus on understanding natural, in-the-wild sounds. Acoustic scene classification, the task of classifying sound excerpts into existing acoustic scene/object categories, is predominantly based on applying a variety of general classifiers (SVMs, GMMs, etc.) to manually crafted sound features (MFCC, spectrograms, etc.). Even though unsupervised and supervised deep learning methods have been applied to sound classification, these models are limited by the amount of available labeled natural sound data. We distinguish ourselves from the existing literature by training a deep fully convolutional network on a large-scale dataset (2M videos). This allows us to train much deeper networks. Another key advantage of our approach is that we supervise our sound recognition network through semantically rich visual discriminative models, which have proved their robustness on a variety of large-scale object/scene categorization challenges. Other work also investigates the relation between vision and sound modalities, but focuses on producing sound from image sequences. Concurrent work also explores video as a form of weak labeling for audio event classification.

Transfer Learning: Transfer learning is widely studied within computer vision, for example in transferring knowledge for object detection and segmentation; however, transferring from vision to other modalities has only recently become possible with the emergence of high-performance visual models. Our method builds upon teacher-student models and dark knowledge transfer. In these approaches, the basic idea is to compress (i.e., transfer) discriminative knowledge from a well-trained complex model to a simpler model without losing considerable accuracy. In that prior work, both the teacher and the student operate in the same modality, whereas in our approach the teacher operates on vision to train the student model in sound. Related work also transfers visual supervision into depth models.

Cross-Modal Learning and Unlabeled Video: Our approach is broadly inspired by efforts to model cross-modal relations and works that leverage large amounts of unlabeled video. In this work, we leverage the natural synchronization between vision and sound to learn a deep representation of natural sounds without ground truth sound labels.

Large Unlabeled Video Dataset

We seek to learn a representation for sound by leveraging massive amounts of unlabeled videos. While there are a variety of sources available on the web (e.g., YouTube, Flickr), we chose to use videos from Flickr because they are natural, not professionally edited, short clips that capture various sounds in everyday, in-the-wild situations. We downloaded over two million videos from Flickr by querying for popular tags and dictionary words, which resulted in over one year of continuous natural sound and video, which we use for training. The length of each video varies from a few seconds to several minutes. We show a small sample of frames from the video dataset in Figure 2.

We wish to process sound waves in the raw. Hence, the only post-processing we did on the videos was to convert sound to MP3s, reduce the sampling rate to $22$ kHz, and convert to single channel audio. Although this slightly degrades the quality of the sound, it allows us to more efficiently operate on large datasets. We also scaled the waveform to be in the range $$. We did not need to subtract the mean because it was naturally near zero already.

Learning Sound Representations

Convolutional Network: We present a deep convolutional architecture for learning sound representations. We propose to use a series of one-dimensional convolutions followed by nonlinearities (i.e., ReLU layers) in order to process sound. Convolutional networks are well-suited for audio signals for two reasons. Firstly, as with images, we desire our network to be invariant to translations, a property that reduces the number of parameters we need to learn and increases efficiency. Secondly, convolutional networks allow us to stack layers, which enables us to detect higher-level concepts through a series of lower-level detectors.

Variable Length Input/Output: Since sound can vary in temporal length, we desire our network to handle variable-length inputs. To do this, we use a fully convolutional network. As convolutional layers are invariant to location, we can convolve each layer over however many samples the input contains. Consequently, in our architecture, we only use convolutional and pooling layers. Since the representation adapts to the input length, we must design the output layers to work with variable-length inputs as well. While we could have used a global pooling strategy to down-sample variable-length inputs to a fixed-dimensional vector, such a strategy may unnecessarily discard information useful for high-level representations. Since we ultimately aim to train this network with video, which is also variable length, we instead use a convolutional output layer to produce an output over multiple timesteps in video. This strategy is similar to a spatial loss in images, but applied temporally instead.
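As a rough illustration of why a fully convolutional design accepts variable-length input, the sketch below computes how many output timesteps a stack of strided 1-D layers produces for waveforms of different durations. The layer sizes in `HYPOTHETICAL_LAYERS` and the helper names are illustrative assumptions, not the configuration reported in the paper's tables.

```python
def conv_output_length(length, kernel, stride, padding=0):
    """Number of output steps of a 1-D convolution or pooling layer."""
    padded = length + 2 * padding
    if padded < kernel:
        return 0  # input too short for a single window
    return (padded - kernel) // stride + 1


# (kernel, stride, padding) per layer. Illustrative values only; these are
# NOT the layer settings from the paper.
HYPOTHETICAL_LAYERS = [(64, 2, 32), (8, 8, 0), (32, 2, 16), (8, 8, 0)]


def network_output_length(num_samples, layers=HYPOTHETICAL_LAYERS):
    """Propagate an input length through every layer in order."""
    length = num_samples
    for kernel, stride, padding in layers:
        length = conv_output_length(length, kernel, stride, padding)
    return length


if __name__ == "__main__":
    sample_rate = 22000  # 22 kHz, as in the preprocessing described earlier
    for seconds in (1, 5, 20):
        steps = network_output_length(seconds * sample_rate)
        print(f"{seconds:>2d} s of audio -> {steps} output timesteps")
```

Because no layer fixes the input size, a longer waveform simply yields more output timesteps, which is what allows the output layer to be matched against several video frames.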

Network Depth: Since we will use a large amount of video to train, it is feasible to use deep architectures without significant over-fitting. We experiment with both five-layer and eight-layer networks. We visualize the eight-layer network architecture in Figure 1, which consists of $8$ convolutional layers and $3$ max-pooling layers. We show the layer configuration in Table 1 and Table 2.

Visual Transfer into Sound

The main idea in this paper is to leverage the natural synchronization between vision and sound in unlabeled video in order to learn a representation for sound. We model the learning problem from a student-teacher perspective. In our case, state-of-the-art networks for vision will teach our network for sound to recognize scenes and objects.

Let $x_{i}\in\mathbb{R}^{D}$ be a waveform and $y_{i}\in\mathbb{R}^{3\times T\times W\times H}$ be its corresponding video for $1\leq i\leq N$, where $W$, $H$, and $T$ are the width, height, and number of sampled frames in the video, respectively. During learning, we aim to use the posterior probabilities from a teacher vision network $g_{k}(y_{i})$ in order to train our student network $f_{k}(x_{i})$ to recognize concepts given sound. As we wish to transfer knowledge from both object and scene networks, $k$ enumerates the concepts we are transferring. During learning, we optimize $\min_{\theta}\;\sum_{k=1}^{K}\sum_{i=1}^{N}D_{\textrm{KL}}\left(g_{k}(y_{i})||f_{k}(x_{i};\theta)\right)$, where $D_{\textrm{KL}}(P||Q)=\sum_{j}P_{j}\log\frac{P_{j}}{Q_{j}}$ is the KL-divergence. While there are a variety of distance metrics we could have used, we chose KL-divergence because the outputs from the vision network $g_{k}$ can be interpreted as a distribution over categories. As KL-divergence is differentiable, we optimize it using back-propagation and stochastic gradient descent. We transfer from both scene and object visual networks ($K=2$).
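The objective above can be sketched in a few lines. The following is a minimal illustration for a single example and a single output timestep, assuming the student produces unnormalized scores (logits) that are turned into a distribution with a softmax; the function names and the `eps` clamp are assumptions made for this sketch, not details taken from the paper.

```python
import math


def softmax(logits):
    """Convert unnormalized scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_j P_j * log(P_j / Q_j).

    Terms with P_j == 0 contribute nothing; Q_j is clamped to `eps`
    (an assumption of this sketch) to avoid division by zero.
    """
    return sum(pj * math.log(pj / max(qj, eps)) for pj, qj in zip(p, q) if pj > 0)


def transfer_loss(teacher_posteriors, student_logits):
    """Sum the KL terms over the K teacher heads for one example.

    teacher_posteriors: list of K distributions g_k(y_i)
    student_logits:     list of K logit vectors, one per student head f_k
    """
    return sum(
        kl_divergence(g, softmax(z))
        for g, z in zip(teacher_posteriors, student_logits)
    )


if __name__ == "__main__":
    # K = 2 heads (e.g. objects and scenes), toy category counts.
    teachers = [[0.7, 0.2, 0.1], [0.5, 0.5]]
    students = [[2.0, 0.5, 0.1], [0.0, 0.0]]
    print(transfer_loss(teachers, students))
```

The loss is zero exactly when each student head reproduces the corresponding teacher distribution, and summing it over all examples $i$ gives the quantity minimized over $\theta$.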

Sound Classification

Although we train SoundNet to classify visual categories, the categories we wish to recognize may not appear in visual models (e.g., sneezing). Consequently, we use a different strategy to attach semantic meaning to sounds. We ignore the output layer of our network and use the internal representation as features for training classifiers, using a small amount of labeled sound data for the concepts of interest. We pick a layer in the network to use as features and train a linear SVM. For multi-class classification, we use a one-vs-all strategy. We perform cross-validation to pick the margin regularization hyper-parameter. For robustness, we follow a standard data augmentation procedure where each training sample is split into overlapping fixed length sound excerpts, which we compute features on and use for training. During inference, we average predictions across all windows.
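The windowing and averaging procedure can be sketched as follows. This is an illustrative outline only: `score_fn` stands in for the feature extractor plus the trained one-vs-all SVMs, and the window and hop lengths are placeholders chosen by the caller.

```python
def split_into_windows(waveform, window, hop):
    """Split a waveform into overlapping fixed-length excerpts.

    A clip shorter than `window` is returned as a single excerpt.
    """
    if len(waveform) <= window:
        return [list(waveform)]
    starts = range(0, len(waveform) - window + 1, hop)
    return [list(waveform[s:s + window]) for s in starts]


def average_predictions(window_scores):
    """Average per-class scores across all windows of one recording."""
    count = len(window_scores)
    num_classes = len(window_scores[0])
    return [sum(s[c] for s in window_scores) / count for c in range(num_classes)]


def classify(waveform, score_fn, window, hop):
    """Predict a class by averaging `score_fn` outputs over all windows.

    `score_fn` is a hypothetical stand-in mapping one excerpt to a list of
    per-class scores (for example, one-vs-all SVM decision values).
    """
    scores = [score_fn(w) for w in split_into_windows(waveform, window, hop)]
    mean_scores = average_predictions(scores)
    return max(range(len(mean_scores)), key=mean_scores.__getitem__)


if __name__ == "__main__":
    toy_clip = [0.1 * i for i in range(10)]
    toy_scorer = lambda w: [sum(w), -sum(w)]  # two fake classes
    print(classify(toy_clip, toy_scorer, window=4, hop=2))
```

At training time each excerpt returned by `split_into_windows` would be treated as its own sample with the label of the full recording.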

Implementation

Our approach is implemented in Torch7. We use the Adam optimizer with a fixed learning rate of $0.001$ and a momentum term of $0.9$ throughout our experiments. We experimented with several batch sizes, and found $64$ to produce good results. We initialized all the weights to zero-mean Gaussian noise with a standard deviation of $0.01$. After every convolution, we use batch normalization and rectified linear activation units. We train the network for $100,000$ iterations. Optimization typically took $1$ day on a GPU.
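For readers unfamiliar with these settings, the sketch below shows a plain-Python version of the weight initialization and of one Adam update. It is not the Torch7 code used in the paper. It assumes that the momentum term of $0.9$ corresponds to Adam's first-moment coefficient `beta1`, and it uses the commonly cited defaults `beta2 = 0.999` and `eps = 1e-8`, which the text does not specify.

```python
import math
import random


def init_weights(count, std=0.01, seed=0):
    """Zero-mean Gaussian initialization with the given standard deviation."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(count)]


def init_adam_state(count):
    """Running first and second moment estimates plus a step counter."""
    return {"t": 0, "m": [0.0] * count, "v": [0.0] * count}


def adam_step(params, grads, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new parameter list.

    beta2 and eps are assumed defaults; only lr and the 0.9 momentum term
    are stated in the text.
    """
    state["t"] += 1
    t = state["t"]
    updated = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)  # bias correction
        v_hat = state["v"][i] / (1 - beta2 ** t)
        updated.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return updated


if __name__ == "__main__":
    # Toy problem: minimize x^2, whose gradient is 2x.
    x = [1.0]
    state = init_adam_state(1)
    for _ in range(100):
        x = adam_step(x, [2.0 * x[0]], state)
    print(x[0])
```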

Experiments

Experimental Setup: We split the unlabeled video dataset into a training set and a held-out validation set. We use $2,000,000$ videos for training, and the remaining $140,000$ videos for validation. After training the network, we use the hidden representation as a feature extractor for learning on smaller, labeled, sound-only datasets. We extract features for a given layer, and train an SVM on the task of interest. For training the SVM, we use the standard training/test splits of the datasets. We report classification accuracy.

Baselines: In addition to published baselines on standard datasets, we explored an additional baseline trained on our unlabeled videos. We experimented with a convolutional autoencoder for sound, trained over our video dataset. We use an autoencoder with $4$ encoder layers and $4$ decoder layers. For the encoder layers, we used the same first four convolutional layers as SoundNet. For the decoder layers, we used fractionally strided convolutional layers (in order to upsample instead of downsample). Note that we experimented with deeper autoencoders, but they performed worse. We used mean squared error for the reconstruction loss, and trained the autoencoders for several days.
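To make the upsampling behaviour of a fractionally strided (transposed) convolution concrete, the sketch below implements a single-channel version without padding, together with the mean squared error used as the reconstruction loss. The function names and the toy kernel are assumptions for illustration, not the autoencoder's actual layers.

```python
def transposed_conv1d(signal, kernel, stride):
    """Single-channel fractionally strided 1-D convolution (no padding).

    Each input sample scatters a scaled copy of the kernel into the output,
    spaced `stride` steps apart, so the output is longer than the input:
    output length = (len(signal) - 1) * stride + len(kernel).
    """
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, x in enumerate(signal):
        for j, w in enumerate(kernel):
            out[i * stride + j] += x * w
    return out


def mean_squared_error(target, reconstruction):
    """Average squared difference between two equal-length sequences."""
    return sum((t - r) ** 2 for t, r in zip(target, reconstruction)) / len(target)


if __name__ == "__main__":
    coarse = [1.0, 2.0, 3.0]
    upsampled = transposed_conv1d(coarse, kernel=[1.0, 1.0], stride=2)
    print(upsampled)  # twice as many samples as the input
    print(mean_squared_error([1.0, 1.0, 2.0, 2.0, 3.0, 3.0], upsampled))
```

With a stride of 2 and a length-2 kernel of ones, each input sample is simply repeated, which is the simplest case of the learned upsampling a decoder layer performs.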

We evaluate the SoundNet representation for acoustic scene classification. The aim in this task is to categorize sound clips into one of many acoustic scene categories. We use three standard, publicly available datasets: DCASE Challenge, ESC-50, and ESC-10.

DCASE: One of the tasks in the Detection and Classification of Acoustic Scenes and Events Challenge (DCASE) is to recognize scenes from natural sounds. In the challenge, there are $10$ acoustic scene categories, $10$ training examples per category, and $100$ held-out testing examples. Each example is a 30-second audio recording. The task is to categorize natural sounds into the $10$ existing acoustic scene categories. Multi-class classification accuracy is used as the performance metric.

ESC-50 and ESC-10: The ESC-50 dataset is a collection of 2000 short (5-second) environmental sound recordings spread evenly over 50 categories selected from 5 major groups (animals, natural soundscapes, human non-speech sounds, interior/domestic sounds, and exterior/urban noises). Each category has 40 samples. The data is prearranged into 5 folds and the accuracy results are reported as the mean of 5 leave-one-fold-out evaluations. The performance of untrained human participants on this dataset is $81.3\%$. ESC-10 is a subset of ESC-50 which consists of 10 classes (dog bark, rain, sea waves, baby cry, clock tick, person sneeze, helicopter, chainsaw, rooster, and fire crackling). The human performance on this dataset is $95.7\%$.
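The leave-one-fold-out protocol described above can be sketched as follows. The `train_and_score` callback is a hypothetical placeholder for training a classifier on four folds and returning its accuracy on the fifth; a trivial majority-class scorer is included only to make the example runnable.

```python
from collections import Counter


def leave_one_fold_out(folds, train_and_score):
    """Mean accuracy over evaluations that each hold out one fold.

    folds:           list of folds, each a list of (features, label) pairs
    train_and_score: hypothetical callback (train, test) -> accuracy in [0, 1]
    """
    accuracies = []
    for held_out in range(len(folds)):
        train = [ex for i, fold in enumerate(folds) if i != held_out for ex in fold]
        test = folds[held_out]
        accuracies.append(train_and_score(train, test))
    return sum(accuracies) / len(accuracies)


def majority_class_accuracy(train, test):
    """Toy scorer: always predict the most frequent training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(1 for _, label in test if label == majority) / len(test)


if __name__ == "__main__":
    # Five toy folds with placeholder features.
    toy_fold = [(None, "rain"), (None, "rain"), (None, "rooster")]
    folds = [list(toy_fold) for _ in range(5)]
    print(leave_one_fold_out(folds, majority_class_accuracy))
```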

We have two major evaluations in this section: (a) a comparison with the existing state-of-the-art results, and (b) a diagnostic performance evaluation of the inner layers of SoundNet as generic features for this task. In DCASE we used 5-second excerpts, and in the ESC datasets we used 1-second windows. In both evaluations, a multi-class SVM (multiple one-vs-all classifiers) is trained over extracted SoundNet features. The same data augmentation procedure is also applied during testing, and the mean score of all sound excerpts is used as the final score of a test recording for any particular category.

Comparison to State-of-the-Art: Tables 3 and 4 compare the recognition performance of SoundNet features against previous state-of-the-art features on three datasets. In all cases, SoundNet features outperformed the existing results by around $10\%$. Interestingly, SoundNet features approach human performance on the ESC-10 dataset; however, we stress that this dataset may be easy. We report the confusion matrix across all folds on ESC-50 in Figure 3. The results suggest our approach obtains very good performance on categories such as toilet flush (97% accuracy) or door knocks (95% accuracy). Common confusions are laughing confused as hens, footsteps confused as door knocks, and insects confused as washing machines.

Ablation Analysis

To better understand our approach, we perform an ablation analysis in Table 5 and Table 6.

Comparison of Loss and Teacher Net (Table 5): We tried training with different subsets of target categories. In general, performance improves with increasing visual supervision. As expected, our results suggest that using both ImageNet and Places networks as supervision performs better than using a single one. This indicates that progress in sound understanding may be furthered by building stronger vision models. We also experimented with using an $\ell_{2}$ loss on the target outputs instead of the KL loss, which performed significantly worse.

Comparison of Network Depth (Table 5): We quantified the impact of network depth. We use a five-layer version of SoundNet (instead of the full eight) as the feature extractor. The five-layer SoundNet architecture performed 8% worse than the eight-layer architecture, suggesting depth is helpful for sound understanding. Interestingly, the five-layer network still generally outperforms previous state-of-the-art baselines, but the margin is smaller. We hypothesize even deeper networks may perform better, which can be trained without significant over-fitting by leveraging large amounts of unlabeled video.

Comparison of Supervision (Table 5): We also experimented with training the network without video by using only the labeled target training set, which is relatively small (thousands of examples). We simply change the network to output the class probabilities, and train it from random initialization with a cross-entropy loss. Hence, the only change is that this baseline does not use any unlabeled video, allowing us to quantify the contribution of unlabeled video. The five-layer SoundNet achieves slightly better results than a prior convolutional network trained with the same data but with a different architecture, suggesting our five-layer architecture is comparable. Increasing the depth from five layers to eight layers decreases the performance from $65\%$ to $51\%$, probably because it overfits to the small training set. However, when trained with visual transfer from unlabeled video, the eight-layer SoundNet achieves a significant gain of around $20\%$ compared to the five-layer version. This suggests that unlabeled video is a powerful signal for sound understanding, and it can be acquired at large enough scales to support training high-capacity deep networks.

Comparison of Layer and Teacher Network (Table 6): We analyze the discriminative performance of each SoundNet layer. Generally, features from the pool5 layer give the best performance. We also compared different teacher networks for visual supervision (either VGGNet or AlexNet). The results are inconclusive on which teacher network to use: VGG is a better teacher network for DCASE, while AlexNet is a better teacher network for ESC-50.

Figure 4: (a) t-SNE embedding of visual features. (b) t-SNE embedding of sound features.

Multi-Modal Recognition

In order to compare sound features with visual features on scene/object categorization, we annotated an additional 9,478 videos (vision + sound) that were not previously seen by the trained networks. This new dataset consists of 44 categories from 6 major groups of concepts (i.e., urban, nature, work/home, music/entertainment, sports, and vehicles). It is annotated by Amazon Mechanical Turk workers. The frequency of the categories depends on their natural occurrence on the web; hence, the dataset is unbalanced.

Vision vs. Sound Embeddings: In order to show the semantic relevance of the features, we performed a two-dimensional t-SNE embedding and visualized our dataset in Figure 4. The visual features are the concatenated fc7 features of the two VGG networks trained using the ImageNet and Places2 datasets. We computed the visual features from 4 uniformly selected frames for each video and used the mean feature as the final visual representation. The sound features are the conv7 features extracted using SoundNet trained with VGG supervision. This visualization suggests that sound features alone also contain a considerable amount of semantic information.

Object and Scene Classification: We also performed a quantitative comparison between sound features and visual features. We used $60\%$ of our dataset for training and the rest for testing. The chance level of the task is $2.2\%$, and always choosing the most common category (i.e., music performance) yields $14\%$ accuracy. Similar to acoustic scene classification methods, we trained a multi-class SVM over both sound and visual features individually and then jointly. The results are displayed in Table 7. Visual features alone obtained an accuracy of $49.4\%$. The SoundNet features obtained $32.4\%$ accuracy. This suggests that even though sound is not as informative as vision, it still contains a considerable amount of discriminative information. Furthermore, sound and vision together resulted in a modest improvement of $2\%$ over vision-only models.

Visualizations

To gain better insight into what the network learned, we visualize its representation. Figure 5 displays the first 16 convolutional filters applied to the raw input audio. The learned filters are diverse, including low and high frequencies, wavelet-like patterns, and filters with increasing and decreasing amplitude. We also visualize some of the hidden units in the last hidden layer (conv7) of our sound representation by finding inputs that maximally activate a hidden unit. These visualizations are displayed in Figure 6. Note that visual frames are not used during computation of activations; they are only included in the figure for visualization purposes.
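Finding the inputs that maximally activate a hidden unit amounts to ranking clips by that unit's response. The sketch below assumes activations have already been computed and stored per clip as a list of timesteps, each holding one value per unit, and it scores a clip by its maximum response over time. That scoring choice and all names here are assumptions for illustration.

```python
def unit_score(clip_activations, unit):
    """Peak response of one hidden unit over all timesteps of a clip."""
    return max(step[unit] for step in clip_activations)


def top_activating_clips(activations, unit, k=3):
    """Return the ids of the k clips that most strongly activate `unit`.

    activations: dict mapping clip id -> list of timesteps, where each
                 timestep is a list of per-unit activation values.
    """
    ranked = sorted(
        activations,
        key=lambda clip_id: unit_score(activations[clip_id], unit),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    toy = {
        "clip_a": [[0.1, 0.9], [0.2, 0.1]],
        "clip_b": [[0.5, 0.0]],
        "clip_c": [[0.3, 2.0]],
    }
    print(top_activating_clips(toy, unit=1, k=2))
```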

Conclusion

We propose to train deep sound networks (SoundNet) by transferring knowledge from established vision networks and large amounts of unlabeled video. The synchronous nature of videos (sound + vision) allows us to perform such a transfer, which results in semantically rich audio representations for natural sounds. Our results show that transfer with unlabeled video is a powerful paradigm for learning sound representations. All of our experiments suggest that one may obtain better performance simply by downloading more videos, creating deeper networks, and leveraging richer vision models.

Acknowledgements: We thank MIT TIG, especially Garrett Wollman, for helping store 26 TB of video. We are grateful for the GPUs donated by NVidia. This work was supported by NSF grant #1524817 to AT and the Google PhD fellowship to CV.