Learning to Separate Object Sounds by Watching Unlabeled Video
Ruohan Gao, Rogerio Feris, Kristen Grauman
Introduction
Understanding scenes and events is inherently a multi-modal experience. We perceive the world by both looking and listening (and touching, smelling, and tasting). Objects generate unique sounds due to their physical properties and interactions with other objects and the environment. For example, perception of a coffee shop scene may include seeing cups, saucers, people, and tables, but also hearing the dishes clatter, the espresso machine grind, and the barista shouting an order. Human developmental learning is also inherently multi-modal, with young children quickly amassing a repertoire of objects and their sounds: dogs bark, cats mew, phones ring.
However, while recognition has made significant progress by “looking” (detecting objects, actions, or people based on their appearance), it often does not listen. Despite a long history of audio-visual video indexing, objects in video are often analyzed as if they were silent entities in silent environments. A key challenge is that in a realistic video, object sounds are observed not as separate entities, but as a single audio channel that mixes all their frequencies together. Audio source separation, though studied extensively in the signal processing literature, remains a difficult problem with natural data outside of lab settings. Existing methods perform best by capturing the input with multiple microphones, or else assume a clean set of single source audio examples is available for supervision (e.g., a recording of only a violin, another recording containing only a drum, etc.), both of which are very limiting prerequisites. The blind audio separation task evokes challenges similar to image segmentation, and perhaps more, since all sounds overlap in the input signal.
Our goal is to learn how different objects sound by both looking at and listening to unlabeled video containing multiple sounding objects. We propose an unsupervised approach to disentangle mixed audio into its component sound sources. The key insight is that observing sounds in a variety of visual contexts reveals the cues needed to isolate individual audio sources; the different visual contexts lend weak supervision for discovering the associations. For example, having experienced various instruments playing in various combinations before, then given a video with a guitar and a saxophone (Fig. 1), one can naturally anticipate what sounds could be present in the accompanying audio, and therefore better separate them. Indeed, neuroscientists report that the mismatch negativity of event-related brain potentials, which is generated bilaterally within auditory cortices, is elicited only when the visual pattern promotes the segregation of the sounds . This suggests that synchronous presentation of visual stimuli should help to resolve sound ambiguity due to multiple sources, and promote either an integrated or segregated perception of the sounds.
We introduce a novel audio-visual source separation approach that realizes this intuition. Our method first leverages a large collection of unannotated videos to discover a latent sound representation for each object. Specifically, we use state-of-the-art image recognition tools to infer the objects present in each video clip, and we perform non-negative matrix factorization (NMF) on each video’s audio channel to recover its set of frequency basis vectors. At this point it is unknown which audio bases go with which visible object(s). To recover the association, we construct a neural network for multi-instance multi-label learning (MIML) that maps audio bases to the distribution of detected visual objects. From this audio basis-object association network, we extract the audio bases linked to each visual object, yielding its prototypical spectral patterns. Finally, given a novel video, we use the learned per-object audio bases to steer audio source separation.
Prior attempts at visually-aided audio source separation tackle the problem by detecting low-level correlations between the two data streams for the input video, and they experiment with somewhat controlled domains of musical instruments in concert or human speakers facing the camera. In contrast, we propose to learn object-level sound models from hundreds of thousands of unlabeled videos, and generalize to separate new audio-visual instances. We demonstrate results for a broad set of “in the wild” videos. While a resurgence of research on cross-modal learning from images and audio also capitalizes on synchronized audio-visual data for various tasks, these methods treat the audio as a single monolithic input, and thus cannot associate different sounds to different objects in the same video.
The main contributions in this paper are as follows. Firstly, we propose to enhance audio source separation in videos by “supervising” it with visual information from image recognition results. (Our task can hence be seen as “weakly supervised”, though the weak “labels” themselves are inferred from the video, not manually annotated.) Secondly, we propose a novel deep multi-instance multi-label learning framework to learn prototypical spectral patterns of different acoustic objects, and inject the learned prior into an NMF source separation framework. Thirdly, to our knowledge, we are the first to study audio source separation learned from large scale online videos. We demonstrate state-of-the-art results on visually-aided audio source separation and audio denoising.
Related Work
The sound localization problem entails identifying which pixels or regions in a video are responsible for the recorded sound. Early work on localization explored correlating pixels with sounds using mutual information or multi-modal embeddings like canonical correlation analysis , often with assumptions that a sounding object is in motion. Beyond identifying correlations for a single input video’s audio and visual streams, recent work investigates learning associations from many such videos in order to localize sounding objects . Such methods typically assume that there is one sound source, and the task is to localize the portion(s) of the visual content responsible for it. In contrast, our goal is to separate multiple audio sources from a monoaural signal by leveraging learned audio-visual associations.
Audio-visual representation learning
Recent work shows that image and audio classification tasks can benefit from representation learning with both modalities. Given unlabeled training videos, the audio channel can be used as free self-supervision, allowing a convolutional network to learn features that tend to gravitate to objects and scenes, resulting in improved image classification . Working in the opposite direction, the SoundNet approach uses image classifier predictions on unlabeled video frames to guide a learned audio representation for improved audio scene classification . For applications in cross-modal retrieval or zero-shot classification, other methods aim to learn aligned representations across modalities, e.g., audio, text, and visual . Related to these approaches, we share the goal of learning from unlabeled video with synchronized audio and visual channels. However, whereas they aim to improve audio or image classification, our method discovers associations in order to isolate sounds per object, with the ultimate task of audio-visual source separation.
Audio source separation
Audio source separation (from purely audio input) has been studied for decades in the signal processing literature. Some methods assume access to multiple microphones, which facilitates separation . Others accept a single monoaural input to perform “blind” separation. Popular approaches include Independent Component Analysis (ICA) , sparse decomposition , Computational Auditory Scene Analysis (CASA) , non-negative matrix factorization (NMF) , probabilistic latent variable models , and deep learning . NMF is a traditional method that is still widely used for unsupervised source separation . However, existing methods typically require supervision to get good results. Strong supervision in the form of isolated recordings of individual sound sources is effective but difficult to secure for arbitrary sources in the wild. Alternatively, “informed” audio source separation uses special-purpose auxiliary cues to guide the process, such as a music score , text , or manual user guidance . Our approach employs an existing NMF optimization , chosen for its efficiency, but unlike any of the above we tackle audio separation informed by automatically detected visual objects.
Audio-visual source separation
The idea of guiding audio source separation using visual information can be traced back to early work in which mutual information is used to learn the joint distribution of the visual and auditory signals, then applied to isolate human speakers. Subsequent work explores audio-visual subspace analysis, NMF informed by visual motion, statistical convolutive mixture models, and correlating temporal onset events. Recent work attempts both localization and separation simultaneously; however, it assumes a moving object is present and only aims to decompose a video into background (assumed low-rank) and foreground sounds/pixels. Prior methods nearly always tackle videos of people speaking or playing musical instruments, domains where salient motion signals accompany audio events (e.g., a mouth or a violin bow starts moving, a guitar string suddenly accelerates). Some studies further assume side cues from a written musical score, require that each sound source has a period when it alone is active, or use ground-truth motion captured by MoCap.
Whereas prior work correlates low-level visual patterns—particularly motion and onset events—with the audio channel, we propose to learn from video how different objects look and sound, whether or not an object moves with obvious correlation to the sounds. Our method assumes access to visual detectors, but assumes no side information about a novel test video. Furthermore, whereas existing methods analyze a single input video in isolation and are largely constrained to human speakers and instruments, our approach learns a valuable prior for audio separation from a large library of unlabeled videos.
Concurrent with our work, other new methods for audio-visual source separation are being explored specifically for speech or musical instruments . In contrast, we study a broader set of object-level sounds including instruments, animals, and vehicles. Moreover, our method’s training data requirements are distinctly more flexible. We are the first to learn from uncurated “in the wild” videos that contain multiple objects and multiple audio sources.
Generating sounds from video
More distant from our work are methods that aim to generate sounds from a silent visual input, using recurrent networks , conditional generative adversarial networks (C-GANs) , or simulators integrating physics, audio, and graphics engines . Unlike any of the above, our approach learns the association between how objects look and sound in order to disentangle real audio sources; our method does not aim to synthesize sounds.
Weakly supervised visual learning
Given unlabeled video, our approach learns to disentangle which sounds within a mixed audio signal go with which recognizable objects. This can be seen as a weakly supervised visual learning problem, where the “supervision” in our case consists of automatically detected visual objects. The proposed setting of weakly supervised audio-visual learning is entirely novel, but at a high level it follows the spirit of prior work leveraging weak annotations, including early “words and pictures” work , internet vision methods , training weakly supervised object (activity) detectors , image captioning methods , or grounding acoustic units of spoken language to image regions . In contrast to any of these methods, our idea is to learn sound associations for objects from unlabeled video, and to exploit those associations for audio source separation on new videos.
Approach
Our approach learns what objects sound like from a batch of unlabeled, multi-sound-source videos. Given a new video, our method returns the separated audio channels and the visual objects responsible for them.
We first formalize the audio separation task and overview audio basis extraction with NMF (Sec. 3.1). Then we introduce our framework for learning audio-visual objects from unlabeled video (Sec. 3.2) and our accompanying deep multi-instance multi-label network (Sec. 3.3). Next we present an approach to use that network to associate audio bases with visual objects (Sec. 3.4). Finally, we pose audio source separation for novel videos in terms of a semi-supervised NMF approach (Sec. 3.5).
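Non-negative matrix factorization (NMF) is often employed to approximate the (non-negative real-valued) spectrogram matrix V as a product of two matrices W and H:
\[ V \approx W H, \]
where the columns of W are the non-negative spectral basis vectors and H holds their activations. The factorization is obtained by solving
\[ \min_{W \ge 0,\; H \ge 0} D(V \,\|\, W H), \]
where $D$ is a measure of divergence; we employ the Kullback-Leibler (KL) divergence.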
For each unlabeled training video, we perform NMF independently on its audio magnitude spectrogram to obtain its spectral patterns W, and discard the activation matrix H. $M$ audio basis vectors (the columns of W) are therefore extracted from each video.
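The basis extraction step can be illustrated with a minimal NumPy sketch of KL-divergence NMF using the classic multiplicative updates. This is not the NMF implementation used in our experiments; the function names `kl_divergence` and `nmf_kl`, the random initialization, and the iteration count are assumptions made for illustration only.

```python
import numpy as np

def kl_divergence(V, V_hat, eps=1e-10):
    """Generalized KL divergence D(V || V_hat), summed over all entries."""
    return float(np.sum(V * np.log((V + eps) / (V_hat + eps)) - V + V_hat))

def nmf_kl(V, num_bases, num_iters=200, seed=0, eps=1e-10):
    """Factor a non-negative V (F x N) as W (F x M) @ H (M x N)."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, num_bases)) + eps
    H = rng.random((num_bases, N)) + eps
    for _ in range(num_iters):
        # Multiplicative update for the activations H (KL objective).
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        # Multiplicative update for the bases W (KL objective).
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

# Toy usage: keep the spectral patterns W, discard the activations H.
toy_spectrogram = np.abs(np.random.default_rng(0).standard_normal((64, 30)))
W, _ = nmf_kl(toy_spectrogram, num_bases=5, num_iters=50)
```

Only the columns of `W` are carried forward to the later stages; `H` is specific to the training clip and is not reused.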
3.2 Weakly-Supervised Audio-Visual Object Learning Framework
Multiple objects can appear in an unlabeled video at the same time, and similarly in the associated audio track. At this point, it is unknown which of the audio bases extracted (columns of W) go with which visible object(s) in the visual frames. To discover the association, we devise a multi-instance multi-label learning (MIML) framework that matches audio bases with the detected objects.
As shown in Fig. 2, given an unlabeled video, we extract its visual frames and the corresponding audio track. As defined above, we perform NMF independently on the magnitude spectrogram of each audio track and obtain $M$ basis vectors from each video. For the visual frames, we use an ImageNet pre-trained ResNet-152 network to make object category predictions, and we max-pool over predictions of all frames to obtain a video-level prediction. The top labels (with class probability larger than a threshold) are used as weak “labels” for the unlabeled video. The extracted basis vectors and the visual predictions are then fed into our MIML learning framework to discover associations, as defined next.
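The weak-label step can be sketched as follows, assuming per-frame class scores are already available from some image classifier. The function name `video_level_labels` and the input layout are hypothetical; the softmax and the 0.3 threshold follow the implementation details given later in the paper, but exactly which scores are passed to the softmax is an assumption here.

```python
import numpy as np

def video_level_labels(frame_scores, threshold=0.3):
    """frame_scores: (num_frames, num_classes) per-frame class scores."""
    # Max-pool over frames to get one score per class for the whole video.
    video_scores = frame_scores.max(axis=0)
    # Softmax over classes (shifted for numerical stability).
    exp_scores = np.exp(video_scores - video_scores.max())
    probs = exp_scores / exp_scores.sum()
    # Classes above the threshold become the weak "labels" of the bag.
    return probs, np.flatnonzero(probs > threshold)

# Toy usage: two frames, three classes.
scores = np.array([[2.0, 0.0, 0.1],
                   [0.1, 0.0, 1.9]])
probs, weak_labels = video_level_labels(scores)
```

In this toy case, classes 0 and 2 each dominate one frame, so both survive the pooling and thresholding and become weak labels for the video.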
3.3 Deep Multi-Instance Multi-Label Network
We cast the audio basis-object disentangling task as a multi-instance multi-label (MIML) learning problem. In single-label MIL, one has bags of instances, and a bag label indicates only that some number of the instances within it have that label. In MIML, the bag can have multiple labels, and there is ambiguity about which labels go with which instances in the bag.
We design a deep MIML network for our task. A bag of basis vectors is the input to the network, and each bag contains the $M$ basis vectors extracted from one video. The “labels” are only available at the bag level, and come from noisy visual predictions of the ResNet-152 network trained for ImageNet recognition. The labels for each instance (basis vector) are unknown. We incorporate MIL into the deep network by modeling that there must be at least one audio basis vector from a certain object that constitutes a positive bag, so that the network can output a correct bag-level prediction that agrees with the visual prediction.
Fig. 3 shows the detailed network architecture. The $M$ basis vectors are fed through a Siamese network of $M$ branches with shared weights. The Siamese network is designed to reduce the dimension of the audio frequency bases and learns the audio spectral patterns through a fully-connected layer (FC) followed by batch norm (BN) and a rectified linear unit (ReLU). The outputs of all branches are stacked to form a feature map, each slice of which represents a basis vector with reduced dimension. Inspired by prior work, each label is decomposed into $K$ sub-concepts to capture latent semantic meanings. For example, for drum, the latent sub-concepts could be different types of drums, such as bongo drum, tabla, and so on. The stacked output from the Siamese network is forwarded through a Convolution-BN-ReLU module, and then reshaped into a feature cube of dimension $K \times L \times M$, where $K$ is the number of sub-concepts, $L$ is the number of object categories, and $M$ is the number of audio basis vectors. The depth of the tensor equals the number of input basis vectors, with each slice corresponding to one particular basis. The activation score of the $(k, l, m)$-th node in the cube represents the matching score of the $k$-th sub-concept of the $l$-th label for the $m$-th basis vector.
To get a bag-level prediction, we conduct two max-pooling operations. Max pooling in deep MIL is typically used to identify the positive instances within an aggregated bag. Our first pooling is over the sub-concept dimension ($K$) to generate an audio basis-object relation map. The second max-pooling operates over the basis dimension ($M$) to produce a video-level prediction. We train the network with a multi-label hinge loss.
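The two pooling steps and the training loss can be sketched in NumPy as below. The loss is written in one common multi-label margin (hinge) form, in which each positive label should outscore every non-positive label by a margin of 1, normalized by the number of categories. Treating this as the exact loss is an assumption, and `bag_prediction` and `multilabel_hinge_loss` are illustrative names.

```python
import numpy as np

def bag_prediction(cube):
    """cube: (K, L, M) scores for K sub-concepts, L objects, M bases."""
    relation_map = cube.max(axis=0)          # (L, M): pool over sub-concepts
    video_scores = relation_map.max(axis=1)  # (L,):   pool over bases
    return relation_map, video_scores

def multilabel_hinge_loss(scores, positive_labels, margin=1.0):
    """Assumed multi-label margin loss over L category scores."""
    num_classes = scores.shape[0]
    positives = {int(i) for i in positive_labels}
    loss = 0.0
    for p in positives:
        for j in range(num_classes):
            if j not in positives:
                # Penalize negatives that come within `margin` of a positive.
                loss += max(0.0, margin - (scores[p] - scores[j]))
    return loss / num_classes

# Toy usage: K=4 sub-concepts, L=3 objects, M=5 bases.
cube = np.random.default_rng(0).standard_normal((4, 3, 5))
relation_map, video_scores = bag_prediction(cube)
loss = multilabel_hinge_loss(video_scores, positive_labels=[0])
```

The intermediate `relation_map` is the quantity reused later to link individual bases to objects; only `video_scores` enters the loss.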
3.4 Disentangling Per-Object Bases
The MIML network above learns from audio-visual associations, but does not itself disentangle them. The sounds in the audio track and objects present in the visual frames of unlabeled video are diverse and noisy (see Sec. 4.1 for details about the data we use). The audio basis vectors extracted from each video could be a component shared by multiple objects, a feature composed of them, or even completely unrelated to the predicted visual objects. The visual predictions from the ResNet-152 network give approximate predictions about the objects that could be present, but are certainly not always reliable (see Fig. 5 for examples).
Therefore, to collect high quality representative bases for each object category, we use our trained deep MIML network as a tool. The audio basis-object relation map after the first pooling layer of the MIML network produces matching scores across all basis vectors for all object labels. We perform a dimension-wise softmax over the basis dimension ($M$) to normalize object matching scores to probabilities along each basis dimension. By examining the normalized map, we can discover links from bases to objects. We only collect the key bases that trigger the prediction of the correct objects (namely, the visually detected objects). Further, we only collect bases from an unlabeled video if multiple basis vectors strongly activate the correct object(s). See Supp. for details, and see Fig. 5 for examples of typical basis-object relation maps. In short, at the end of this phase, we have a set of audio bases for each visual object, discovered purely from unlabeled video and mixed single-channel audio.
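A rough sketch of this collection step is given below. The exact selection criteria are deferred to the supplementary material, so the probability threshold, the minimum number of activated bases, and the function name `collect_key_bases` are all assumptions made only to illustrate the idea.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def collect_key_bases(relation_map, detected_objects,
                      prob_threshold=0.2, min_bases=2):
    """relation_map: (L, M) matching scores of L objects for M bases.

    Returns {object index: indices of bases linked to that object}.
    Thresholds are hypothetical, not the values used in the paper.
    """
    probs = softmax(relation_map, axis=1)  # normalize over the M bases
    selected = {}
    for obj in detected_objects:
        strong = np.flatnonzero(probs[obj] > prob_threshold)
        # Keep a video's bases only if several of them activate the object.
        if strong.size >= min_bases:
            selected[int(obj)] = strong
    return selected

# Toy usage: object 0 is supported by two bases, object 1 by only one.
toy_map = np.array([[5.0, 5.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0, 9.0]])
key_bases = collect_key_bases(toy_map, detected_objects=[0, 1])
```

In the toy map, object 0 passes the "multiple strongly activated bases" check while object 1 does not, so only the bases for object 0 are collected.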
3.5 Object Sound Separation for a Novel Video
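Finally, we present our procedure to separate audio sources in new videos. As shown in Fig. 4, given a novel test video $q$, we obtain its audio magnitude spectrogram $V^{(q)}$ through STFT and detect objects using the same ImageNet-trained ResNet-152 network as before. Then, we retrieve the learnt audio basis vectors for each detected object, and use them to “guide” NMF-based audio source separation. Specifically,
\[ V^{(q)} \approx W^{(q)} H^{(q)} = \begin{bmatrix} W_1^{(q)} & \cdots & W_J^{(q)} \end{bmatrix} \begin{bmatrix} H_1^{(q)} \\ \vdots \\ H_J^{(q)} \end{bmatrix}, \]
where $J$ is the number of detected objects ($J$ potential sound sources), and $W_j^{(q)}$ contains the retrieved bases corresponding to object $j$ in input video $q$. In other words, we concatenate the basis vectors learnt for each detected object to construct the basis dictionary $W^{(q)}$. Next, in the NMF algorithm, we hold $W^{(q)}$ fixed, and only estimate the activations $H^{(q)}$ with multiplicative update rules. Then we obtain the spectrogram corresponding to each detected object by $V_j^{(q)} = W_j^{(q)} H_j^{(q)}$. We reconstruct the individual (compressed) audio source signals by soft masking the mixture spectrogram:
\[ \mathbf{V}_j = \frac{V_j^{(q)}}{\sum_{i=1}^{J} V_i^{(q)}} \odot \mathbf{V}, \]
where $\mathbf{V}$ is the mixture spectrogram with both magnitude and phase, and the division and product are element-wise.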
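The guided separation can be sketched in NumPy as follows: the per-object bases are concatenated into a fixed dictionary, only the activations are updated with the KL multiplicative rule, and each object's reconstruction is turned into a soft mask. The function name `separate_with_fixed_bases`, the random initialization, and the iteration count are assumptions for illustration.

```python
import numpy as np

def separate_with_fixed_bases(V_mag, bases_per_object,
                              num_iters=200, seed=0, eps=1e-10):
    """V_mag: (F, N) magnitude spectrogram; bases_per_object: list of (F, M_j)."""
    W = np.concatenate(bases_per_object, axis=1)  # fixed basis dictionary
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V_mag.shape[1])) + eps
    col_sums = W.sum(axis=0)[:, None] + eps
    for _ in range(num_iters):
        # Update activations only; W is held fixed throughout.
        H *= (W.T @ (V_mag / (W @ H + eps))) / col_sums
    # Per-object spectrograms V_j = W_j H_j.
    sources, start = [], 0
    for W_j in bases_per_object:
        stop = start + W_j.shape[1]
        sources.append(W_j @ H[start:stop])
        start = stop
    total = np.sum(sources, axis=0) + eps
    masks = [V_j / total for V_j in sources]  # soft masks that sum to ~1
    return masks, H

# Toy usage: two objects whose bases occupy disjoint frequency bins.
W_a = np.array([[1.0], [2.0], [1.0], [0.0], [0.0], [0.0]])
W_b = np.array([[0.0], [0.0], [0.0], [1.0], [1.0], [3.0]])
acts = np.random.default_rng(1).random((2, 8)) + 0.1
mixture_mag = W_a @ acts[:1] + W_b @ acts[1:]
masks, _ = separate_with_fixed_bases(mixture_mag, [W_a, W_b])
```

Each mask would then be multiplied element-wise with the complex mixture spectrogram before inverting it back to a waveform.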
Experiments
We now validate our approach and compare to existing methods.
We consider two public video datasets: AudioSet and the benchmark videos used in previous studies, which we refer to as AV-Bench.
We use AudioSet as the source of unlabeled training videos. (AudioSet offers noisy video-level audio class annotations. However, we do not use any of its label information.) The dataset consists of short 10 second video clips that often concentrate on one event. However, our method makes no particular assumptions about using short or trimmed videos, as it learns bases in the frequency domain and pools both visual predictions and audio bases from all frames. The videos are challenging: many are of poor quality and unrelated to object sounds, such as silence, sine wave, echo, infrasound, etc. As is typical for related experimentation in the literature, we filter the dataset to those likely to display audio-visual events. In particular, we extract musical instruments, animals, and vehicles, which span a broad set of unique sound-making objects. See Supp. for a complete list of the object categories. Using the dataset’s provided split, we randomly reserve some videos from the “unbalanced” split as validation data, and the rest as the training data. We use videos from the “balanced” split as test data. The final AudioSet-Unlabeled data contains 104k, 2.9k, 1k / 22k, 1.2k, 0.5k / 58k, 2.4k, 0.6k video clips in the train, val, test splits, for the instruments, animals, and vehicles, respectively.
AudioSet-SingleSource:
To facilitate quantitative evaluation (cf. Sec. 4.4), we construct a dataset of AudioSet videos containing only a single sounding object. We manually examine videos in the val/test set, and obtain 23 such videos. There are 15 musical instruments (accordion, acoustic guitar, banjo, cello, drum, electric guitar, flute, french horn, harmonica, harp, marimba, piano, saxophone, trombone, violin), 4 animals (cat, dog, chicken, frog), and 4 vehicles (car, train, plane, motorbike). Note that our method never uses these samples for training.
AV-Bench:
This dataset contains the benchmark videos (Violin Yanni, Wooden Horse, and Guitar Solo) used in previous studies.
4.2 Implementation Details
We extract a 10 second audio clip and 10 frames (every 1s) from each video. Following common settings , the audio clip is resampled at 48 kHz, and converted into a magnitude spectrogram of size through STFT of window length 0.1s and half window overlap. We use the NMF implementation of with KL divergence and the multiplicative update solver. We extract basis vectors from each audio. All video frames are resized to , and center crops are used to make visual predictions. We use all relevant ImageNet categories and group them into 23 classes by merging the posteriors of similar categories to roughly align with the AudioSet categories; see Supp. A softmax is finally performed on the video-level object prediction scores, and classes with probability greater than 0.3 are kept as weak labels for MIML training. The deep MIML network is implemented in PyTorch with , , , and . We report all results with these settings and did not try other values. The network is trained using Adam with weight decay and batch size 256. The starting learning rate is set to 0.001, and decreased by 6% every 5 epochs and trained for 300 epochs.
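The arithmetic behind some of these settings can be checked with a short sketch. It assumes the FFT size equals the 0.1 s window length and that "decreased by 6% every 5 epochs" means multiplying the learning rate by 0.94 at every fifth epoch; both are interpretations, and the function name `learning_rate` is illustrative.

```python
sample_rate = 48_000       # Hz, from the resampling setting
window_seconds = 0.1       # STFT window length

# Assumption: FFT size equals the window length in samples.
n_fft = round(sample_rate * window_seconds)   # samples per window
hop_length = n_fft // 2                       # half window overlap
num_freq_bins = n_fft // 2 + 1                # one-sided spectrum

def learning_rate(epoch, base_lr=0.001, decay=0.06, step=5):
    """Step schedule: multiply by (1 - decay) every `step` epochs (assumed)."""
    return base_lr * (1.0 - decay) ** (epoch // step)

final_lr = learning_rate(299)  # rate during the last of 300 epochs
```

Under these assumptions each window spans 4800 samples with a hop of 2400 samples, giving 2401 frequency bins per basis vector.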
4.3 Baselines
We compare to several existing methods and multiple baselines:
This is an off-the-shelf unsupervised audio source separation method. The separated channels are first converted into Mel frequency cepstrum coefficients (MFCC), and then K-means clustering is used to group separated channels. This is an established pipeline in the literature, making it a good representative for comparison. We use the publicly available code (https://github.com/interactiveaudiolab/nussl).
AV-Loc [62], JIVE [55], Sparse CCA [47]:
We refer to previously reported results on the AV-Bench dataset to compare to these methods.
AudioSet Supervised Upper-Bound:
This baseline uses AudioSet ground-truth labels to train our deep MIML network. AudioSet labels are organized in an ontology and each video is labeled by many categories. We use the 23 labels aligned with our subset (15 instruments, 4 animals, and 4 vehicles). This baseline serves as an upper-bound.
K-means Clustering Unsupervised Separation:
We use the same number of basis vectors as our method to initialize the W matrix, and perform unsupervised NMF. K-means clustering is then used to group separated channels, with the number of clusters equal to the number of ground-truth sources. The sound sources are separated by aggregating the channel spectrograms belonging to each cluster.
Visual Exemplar for Supervised Separation:
We recognize objects in the frames, and retrieve bases from an exemplar video for each detected object class to supervise its NMF audio source separation. An exemplar video is the one that has the largest confidence score for a class among all unlabeled training videos.
Unmatched Bases for Supervised Separation:
This baseline is the same as our method except that it retrieves bases of the wrong class (at random from classes absent in the visual prediction) to guide NMF audio source separation.
Gaussian Bases for Supervised Separation:
We initialize the weight matrix W randomly using a Gaussian distribution, and then perform supervised audio source separation (with W fixed) as in Sec. 3.5.
4.4 Quantitative Results
For “in the wild” unlabeled videos, the ground-truth of separated audio sources never exists. Therefore, to allow quantitative evaluation, we create a test set consisting of combined single-source videos, following prior practice. In particular, we take pairwise video combinations from AudioSet-SingleSource (cf. Sec. 4.1) and 1) compound their audio tracks by normalizing and mixing them and 2) compound their visual channels by max-pooling their respective object predictions. Each compound video is a test video; its reserved source audio tracks are the ground truth for evaluation of separation results.
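Constructing one compound test video can be sketched as below. The kind of normalization is not specified in the text, so peak normalization is an assumption, as are the function name `mix_pair` and the truncation to the shorter clip.

```python
import numpy as np

def mix_pair(audio_a, audio_b, preds_a, preds_b, eps=1e-10):
    """Build a compound test example from two single-source videos."""
    n = min(len(audio_a), len(audio_b))
    # Assumed: peak-normalize each track before summing.
    a = audio_a[:n] / (np.max(np.abs(audio_a[:n])) + eps)
    b = audio_b[:n] / (np.max(np.abs(audio_b[:n])) + eps)
    mixture = a + b
    # Compound the visual channel by max-pooling the object predictions.
    preds = np.maximum(preds_a, preds_b)
    return mixture, (a, b), preds

# Toy usage with two synthetic tones and two-class predictions.
t = np.linspace(0.0, 1.0, 1000)
mixture, references, preds = mix_pair(
    3.0 * np.sin(2 * np.pi * 5 * t), 0.5 * np.sin(2 * np.pi * 9 * t),
    np.array([0.9, 0.1]), np.array([0.2, 0.7]))
```

The normalized tracks in `references` serve as the ground truth against which the separated outputs are scored.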
To evaluate source separation quality, we use the widely used BSS-EVAL toolbox and report the Signal to Distortion Ratio (SDR). We perform four sets of experiments: pairwise compound two videos of musical instruments (Instrument Pair), two of animals (Animal Pair), two of vehicles (Vehicle Pair), and two cross-domain videos (Cross-Domain Pair). For unsupervised clustering separation baselines, we evaluate both possible matchings and take the best results (to the baselines’ advantage).
Table 1 shows the results. Our method significantly outperforms the Visual Exemplar, Unmatched, and Gaussian baselines, demonstrating the power of our learned bases. It also achieves large gains over the unsupervised clustering baselines. Moreover, it can match each separated source to acoustic objects in the video, whereas the baselines can only return ungrounded audio signals. We stress that both our method and the baselines use no audio-based supervision. In contrast, other state-of-the-art audio source separation methods supervise the separation process with labeled training data containing clean ground-truth sources and/or tailor separation to music/speech. Such methods are not applicable here.
Our MIML solution is fairly tolerant to imperfect visual detection. Using weak labels from the ImageNet pre-trained ResNet-152 network performs similarly to using the AudioSet ground-truth labels with about 30% of the labels corrupted. Using the true labels (Upper-Bound in Table 1) reveals the extent to which better visual models would improve results.
Visually-aided audio denoising
To facilitate comparison to prior audio-visual methods (none of which report results on AudioSet), next we perform the same visually-assisted audio denoising experiment on AV-Bench as in prior work. Following the same setup, the audio signals in all videos are corrupted with white noise with the signal to noise ratio set to 0 dB. To perform audio denoising, our method retrieves bases of detected object(s) and appends the same number of randomly initialized bases as the weight matrix W to supervise NMF. The randomly initialized bases are intended to capture the noise signal. As in that work, we report Normalized SDR (NSDR), which measures the improvement of the SDR between the mixed noisy signal and the denoised sound.
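The NSDR definition can be illustrated with a simplified sketch. The SDR below is a plain energy ratio between the reference and the residual; it is not the BSS-EVAL SDR used for the reported numbers, which additionally accounts for allowed distortions of the reference. The function names `sdr` and `nsdr` are illustrative.

```python
import numpy as np

def sdr(reference, estimate, eps=1e-10):
    """Simplified SDR in dB (energy ratio; not the full BSS-EVAL measure)."""
    residual = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) /
                           (np.sum(residual ** 2) + eps))

def nsdr(reference, estimate, mixture):
    """Improvement in SDR of the estimate over the unprocessed mixture."""
    return sdr(reference, estimate) - sdr(reference, mixture)

# Toy usage: a denoiser that removes 90% of the noise amplitude.
rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0.0, 20.0, 1000))
noise = rng.standard_normal(1000)
noisy = clean + noise
denoised = clean + 0.1 * noise
improvement_db = nsdr(clean, denoised, noisy)
```

Reducing the noise amplitude by a factor of ten reduces its energy by a factor of one hundred, so the improvement in this toy case is 20 dB.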
Table 2 shows the results. Note that the method of Pu et al. is tailored to separate noise from the foreground sound by exploiting the low-rank nature of background sounds. Still, our method outperforms it on 2 out of the 3 videos, and performs much better than the other two prior audio-visual methods. Pu et al. also exploit motion in manually segmented regions. On Guitar Solo, the hand’s motion may strongly correlate with the sound, leading to their better performance.
4.5 Qualitative Results
Next we provide qualitative results to illustrate the effectiveness of MIML training and the success of audio source separation. Here we run our method on the real multi-source videos from AudioSet. They lack ground truth, but results can be manually inspected for quality (see our video).
Fig. 5 shows example unlabeled videos and their discovered audio basis associations. For each example, we show sample video frames, ImageNet CNN visual object predictions, as well as the corresponding audio basis-object relation map predicted by our MIML network. We also report the AudioSet audio ground truth labels, but note that they are never seen by our method. The first example (Fig. 5-a) has both piano and violin in the visual frames, which are correctly detected by the CNN. The audio also contains the sounds of both instruments, and our method appropriately activates bases for both the violin and piano. Fig. 5-b shows a man playing the violin in the visual frames, but both piano and violin are strongly activated. Listening to the audio, we can hear that an out-of-view player is indeed playing the piano. This example accentuates the advantage of learning object sounds from thousands of unlabeled videos; our method has learned the correct audio bases for piano, and “hears” it even though it is off-camera in this test video. Fig. 5-c/d show two examples with inaccurate visual predictions, and our model correctly activates the label of the object in the audio. Fig. 5-e/f show two more examples of an animal and a vehicle, and the results are similar. These examples suggest that our MIML network has successfully learned the prototypical spectral patterns of different sounds, and is capable of associating audio bases with object categories.
Please see our video (http://vision.cs.utexas.edu/projects/separating_object_sounds/) for more results, where we use our system to detect and separate object sounds for novel “in the wild” videos.
Overall, the results are promising and constitute a noticeable step towards visually guided audio source separation for more realistic videos. Of course, our system is far from perfect. The most common failure modes by our method are when the audio characteristics of detected objects are too similar or objects are incorrectly detected (see Supp.). Though ImageNet-trained CNNs can recognize a wide array of objects, we are nonetheless constrained by its breadth. Furthermore, not all objects make sounds and not all sounds are within the camera’s view. Our results above suggest that learning can be robust to such factors, yet it will be important future work to explicitly model them.
Conclusion
We presented a framework to learn object sounds from thousands of unlabeled videos. Our deep multi-instance multi-label network automatically links audio bases to object categories. Using the disentangled bases to supervise non-negative matrix factorization, our approach successfully separates object-level sounds. We demonstrate its effectiveness on diverse data and object categories. Audio source separation will continue to benefit many appealing applications, e.g., audio events indexing/remixing, audio denoising for closed captioning, or instrument equalization. In future work, we aim to explore ways to leverage scenes and ambient sounds, as well as integrate localized object detections and motion.
Acknowledgements: This research was supported in part by an IBM Faculty Award, IBM Open Collaboration Research Award, and DARPA Lifelong Learning Machines. We thank members of the UT Austin vision group and Wenguang Mao, Yuzhong Wu, Dongguang You, Xingyi Zhou and Xinying Hao for helpful input. We also gratefully acknowledge a GPU donation from Facebook.