Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation

Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

Introduction

With the recent progress of audio representation learning, general-purpose audio representations have shown good performance in various audio tasks (Saeed et al., 2021; Niizumi et al., 2021; Wang et al., 2021). While previous supervised learning methods (Hershey et al., 2017; Kong et al., 2020; Koutini et al., 2021) learn to discriminate labels, these general-purpose audio representations are pre-trained by self-supervised learning methods that do not rely on labels.

These methods utilize self-supervising signals such as the temporal relationship between audio samples or differences in multiple audio samples generated by data augmentations. For example, triplet loss or contrastive learning methods (Shor et al., 2020; Saeed et al., 2021; Spijkervet and Burgoyne, 2021; Fonseca et al., 2021) learn to make representations closer to temporally close audio segments while pushing away remote segments. Our previous study BYOL-A (Niizumi et al., 2021) learns representations invariant against the difference in audio signals created by data augmentations.

However, the training signals of these methods do not provide information about the complete details of the input. The temporal relationship among audio segments conveys only how the signals differ from each other and does not describe input details, and data augmentations can change the details of the original input audio signals. Therefore, we think these training signals are suboptimal for learning to represent the audio input as it is.

We think that the input signal itself can be the best training signal to learn representations that describe the input in detail. Learning frameworks that achieve this goal include Masked Language Modeling (MLM) in natural language processing (NLP) and Masked Image Modeling (MIM) in the image domain. These methods learn representations by masking a part of the input signal and using the other parts to predict the masked signals. In particular, MLM methods such as BERT (Devlin et al., 2019) have already proven highly effective. Inspired by BERT, speech self-supervised learning methods that learn from masked input prediction (Baevski et al., 2020; Liu et al., 2020; Chi et al., 2021; Hsu et al., 2021) have also shown solid results in the audio domain. In the image domain, recent MIM methods such as BEiT (Bao et al., 2021) and Masked Autoencoders (MAE) (He et al., 2021) have shown promising performance, such that “Self-supervised learning in vision may now be embarking on a similar trajectory as in NLP” (He et al., 2021).

In this study, we explore the learning of general-purpose audio representations through the MIM applied to the audio spectrogram, which we call Masked Spectrogram Modeling (MSM). MIM splits the input image into grid patches. Therefore, applying MIM to the audio spectrogram splits the input along the time and frequency axes, allowing “the model to learn both the temporal and frequency structure” (Gong et al., 2021), unlike previous methods for speech (e.g., Mockingjay (Liu et al., 2020) and wav2vec 2.0 (Baevski et al., 2020)) that split audio along time only.

To implement MSM, we use MAE as a training framework. MAE learns to efficiently encode the small number of visible patches into latent representations to carry essential information for reconstructing masked patches, a large portion of the input signal. It then calculates the reconstruction error as a training loss, achieving our goal of using the input itself as a training signal.

Our main contributions are the proposal and implementation of MSM using MAE (MSM-MAE) to learn general-purpose audio representations, and results showing that our variants of MAE outperform other methods on some tasks in the HEAR 2021 NeurIPS Challenge (Turian et al., 2022). In addition, we investigate how the design choices of MSM-MAE impact performance and present qualitative analyses of the learned representations using visualizations. Our code is available online.

Related Work

Audio representation learning closely related to our work. SSAST (Gong et al., 2021) is a self-supervised learning method that pre-trains ViT (Dosovitskiy et al., 2021) using a pretext task of Joint Discriminative and Generative Masked Spectrogram Patch Modeling (MSPM), which combines contrastive learning and masked patch reconstruction. While the patch reconstruction task is the same as that with MAE, it uses a two-layer MLP to reconstruct masked patches, unlike MAE, which uses a sufficiently deep transformer.

Mockingjay (Liu et al., 2020) proposed the Masked Acoustic Model (MAM), a pretext task of reconstructing masked time frames; the subsequent studies TERA (Liu et al., 2021) and Audio ALBERT (Chi et al., 2021) also use MAM. Unlike MIM, MAM slices the spectrogram along time as a natural way of handling time-series data, in the same way as the methods that accept raw audio as input, such as wav2vec 2.0 (Baevski et al., 2020) and HuBERT (Hsu et al., 2021).

PaSST (Koutini et al., 2021) is a supervised learning method that pre-trains ViT with the proposed Patchout, taking a similar approach to MAE. Patchout reduces the number of patches to encode, thus saving computational resources.

Audio self-supervised learning. Methods based on triplet loss (Shor et al., 2020) or contrastive loss (Saeed et al., 2021; Fonseca et al., 2021; Spijkervet and Burgoyne, 2021; Wang et al., 2021) learn from the temporal relationship between audio segments; they learn to pull the feature embeddings of temporally close segments together and to push those of temporally remote segments apart. Consequently, these methods do not model the representation of the input audio as it is.

On the other hand, our previous study BYOL-A (Niizumi et al., 2021) and SERAB BYOL-S (Scheidwasser-Clow et al., 2021) use data augmentations to produce multiple audios with a small degree of difference from the same audio input; they learn to make invariant representations for these augmented audios. Therefore, these methods may not learn part of the audio in which data augmentations make changes.

Masked Image Modeling. As a preliminary exploration, ViT (Dosovitskiy et al., 2021) conducts self-supervised learning that predicts the average 3-bit color of masked patches, which resulted in performance 4% behind the supervised pre-trained version of ViT. BEiT (Bao et al., 2021) proposes a masked image modeling task that learns to predict discrete visual tokens of masked patches. BEiT outperforms supervised pre-trained ViT; however, it pre-trains the model to encode the input image into discrete tokens rather than representing the input as it is, and it requires a pre-trained discrete VAE (Ramesh et al., 2021), which is not available for audio.

MAE (He et al., 2021) reconstructs the original signal given its partial observation. Figure 1 illustrates the pre-training flow with a spectrogram as input. First, MAE splits the input into patches and randomly masks a large part of them, and then an encoder encodes only the visible patches into latent representations. Next, a decoder reconstructs the input from the latent representations of the visible patches and mask tokens representing the masked patches. The loss is then calculated as the mean squared error (MSE) between the reconstruction and the target, which is the normalized input, over all the masked patches. After pre-training, only the encoder is used, and it encodes all patches of the input images to produce representations for downstream tasks.
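The masking and loss steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `random_mask` and `masked_mse`, the 95-patch grid, the 256-value patch size, and the all-zero placeholder standing in for a decoder output are assumptions made only for illustration.

```python
import numpy as np

def random_mask(num_patches, mask_ratio, rng):
    """Randomly split patch indices into visible and masked sets."""
    num_keep = int(num_patches * (1 - mask_ratio))  # number of visible patches
    perm = rng.permutation(num_patches)
    return np.sort(perm[:num_keep]), np.sort(perm[num_keep:])

def masked_mse(reconstruction, target, masked_idx):
    """Mean squared error computed over the masked patches only."""
    diff = reconstruction[masked_idx] - target[masked_idx]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
# Hypothetical input: 95 patches, each flattened to 16 * 16 = 256 values.
patches = rng.normal(size=(95, 256))
visible_idx, masked_idx = random_mask(len(patches), mask_ratio=0.75, rng=rng)

# Placeholder for a decoder output; a real model would predict these values.
reconstruction = np.zeros_like(patches)
loss = masked_mse(reconstruction, patches, masked_idx)
print(len(visible_idx), len(masked_idx), loss)
```

Only the patches in `visible_idx` would be passed to the encoder, which is where the computational saving of the high mask ratio comes from.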

MAE masks a very large portion (e.g., 75%) of patches based on the notion that information density differs between languages and images; images are natural signals with heavy spatial redundancy, whereas languages are highly semantic and information-dense. The MAE paper shows that the optimal mask ratio is 75%, much higher than the 15% of BERT (Devlin et al., 2019) in the NLP domain.

Unlike classical autoencoders, MAE has an asymmetric encoder-decoder design; an encoder operates on the partial visible signal only, whereas a lightweight decoder reconstructs the full signal. These design choices save computation load and enable us to scale MAE to train large models efficiently.

A sufficiently deep decoder is essential for linear evaluation performance without fine-tuning. The last several layers in an autoencoder can be more specialized for reconstruction, thus becoming less relevant for other tasks. A reasonably deep decoder can help make latent representations from the encoder output more abstract (He et al., 2021; Cao et al., 2022).

Masked Spectrogram Modeling using Masked Autoencoders

We apply Masked Image Modeling to the audio spectrogram, which we call Masked Spectrogram Modeling (MSM). MSM splits the input along the time and frequency axes, allowing it to learn both the temporal and frequency structure, unlike the previous methods for speech that split along time only. To implement MSM, we use Masked Autoencoders (MSM-MAE). In preliminary experiments, we found that MSM-MAE can follow the basic design choices of the original MAE, except that the decoder depth is halved. While the same mask ratio of 75% suggests that the information density of the spectrogram is close to that of the image, the halved decoder depth might suggest that the context of the spectrogram is less complex than that of the image.

Besides the original designs, taking a spectrogram as input introduces new choices, namely the input and patch sizes, because the spectrogram has distinct frequency and time axes, unlike an image. For the input size, which consists of the number of frequency bins and time frames (denoted $F$ and $T$ ), we keep $F$ constant and focus on exploring the optimal $T$ . In preliminary experiments, we confirmed that both $T$ (the input audio duration) and the patch size can positively influence downstream task performance, unlike many other parameters, for which changes degraded performance. Therefore, we investigate these two design choices in this paper.

In addition, to better use learned features in the downstream tasks, we also introduce a feature calculation specialized to the spectrogram input.

During pre-training, the longer the duration, the more chances the model has to learn the relationships among the contents. Therefore, a longer duration could result in better representations. Meanwhile, the duration of samples in the downstream tasks ranges from 1 s to tens of seconds or even longer. In addition, it may be fixed or vary from sample to sample. The optimal duration can therefore depend on the task.

From the perspective of complexity, a shorter duration reduces the computational load of the two transformers in MAE because their computational and memory complexity is quadratic in the length of the input sequence; thus, a shorter duration is beneficial for scaling the system. For these reasons, we study various input audio durations.

3.2 Patch Size

While both the input and the patches are square for images, the spectrogram input is rectangular, and so the patches can be rectangular as well.

The patch size also affects task performance because it sets the frequency/time resolution of encoded representations. There are various task demands; for example, pitch detection requires sufficient frequency resolution, whereas short event detection requires fine time resolution. To meet these demands, we can make the patch smaller to make the resolution finer.

However, the available choice of resolution is limited due to computational complexity; for example, making the patch size half on both frequency and time will quadruple the sequence length. We explore various patch sizes based on the default resolution of $16\times 16$ .
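The trade-off between resolution and sequence length can be checked with a short calculation. The sketch below assumes an input of $F=80$ frequency bins and $T=208$ time frames (values taken from the experiments) and uses hypothetical helper names; the squared sequence length is used only as a rough proxy for self-attention cost.

```python
def patch_grid(num_bins, num_frames, patch_f, patch_t):
    """Number of patches along frequency and time for a given patch size."""
    if num_bins % patch_f or num_frames % patch_t:
        raise ValueError("input size must be divisible by the patch size")
    return num_bins // patch_f, num_frames // patch_t

def sequence_length(num_bins, num_frames, patch_f, patch_t):
    """Total number of patches fed to the transformer."""
    n_f, n_t = patch_grid(num_bins, num_frames, patch_f, patch_t)
    return n_f * n_t

base = sequence_length(80, 208, 16, 16)   # default 16x16 patches
half = sequence_length(80, 208, 8, 8)     # patch halved on both axes

print(base, half, half // base)           # sequence length quadruples
print((half ** 2) // (base ** 2))         # rough self-attention cost ratio
```

Halving only the time axis of the patch (for example, $16\times 8$ ) doubles the sequence length instead of quadrupling it, which is one reason to explore the two axes separately.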

3.3 Feature Calculation for Downstream Tasks

The encoder of the learned MAE encodes all patches of each audio sample in the task, yielding embeddings of all patches for each sample; thereby, the embeddings for a single time frame consist of the embeddings of multiple patches along frequency.

While the patch embeddings for a time frame can typically be averaged to get a single embedding, we think that averaging embeddings across frequency impairs the available information. Therefore, we calculate features by concatenating all the patch embeddings of the same time frame, which preserves all available features.

The calculation rearranges the encoder output $z\in R^{B\times N_{F}N_{T}\times D}$ into $z^{\prime}\in R^{B\times N_{T}\times N_{F}D}$ , where $B$ is the batch size, $N_{F}$ is the number of patches along frequency, $N_{T}$ is the number of patches along time, and $D$ is the feature dimension. This calculation summarizes the encoded features of a time frame for all frequencies into a single vector. For example, the feature dimension of $z^{\prime}$ will be $3840$ when $D=768$ and $N_{F}=5$ , which is the setting used in our experiments.
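The rearrangement from $z$ to $z^{\prime}$ can be sketched in NumPy as follows. This is an illustrative reconstruction under stated assumptions, not necessarily the authors' exact code: the function name `concat_frequency_patches` is hypothetical, and the flattened patch axis is assumed to be ordered frequency-major (patch index $= f\cdot N_{T}+t$ ).

```python
import numpy as np

def concat_frequency_patches(z, n_f):
    """Rearrange (B, N_F * N_T, D) patch embeddings into (B, N_T, N_F * D)."""
    b, n, d = z.shape
    if n % n_f:
        raise ValueError("number of patches must be divisible by n_f")
    n_t = n // n_f
    # Assumption: patches are flattened frequency-major (index = f * n_t + t).
    z = z.reshape(b, n_f, n_t, d)
    z = z.transpose(0, 2, 1, 3)           # -> (B, N_T, N_F, D)
    return z.reshape(b, n_t, n_f * d)     # concatenate along frequency

rng = np.random.default_rng(0)
z = rng.normal(size=(2, 5 * 19, 768))     # B=2, N_F=5, N_T=19, D=768
z_prime = concat_frequency_patches(z, n_f=5)
print(z_prime.shape)                      # (2, 19, 3840)
```

Each time step of `z_prime` holds the embeddings of all frequency patches side by side, so no information is lost to averaging at the cost of a larger feature dimension.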

Experiments

We evaluate our MSM-MAE models on a benchmark suite from the HEAR (Holistic Evaluation of Audio Representations) 2021 NeurIPS Challenge (Turian et al., 2022), which spans multiple audio domains, including speech, environmental sound, and music.

We describe the details of the experiments in Section 4.1 and the downstream tasks in Section 4.2. Next, we evaluate our models with results on the HEAR 2021 Challenge in Section 4.3 and investigate how design choices impact the performance in Section 4.4. Then, we analyze the learned representations qualitatively using visualizations in Section 4.5.

We used ViT-base (Dosovitskiy et al., 2021) as the encoder and a smaller decoder with a width of 384, a depth of 4, and 6 heads. We pre-trained the MAEs with the original hyperparameters, except for 100 pre-training epochs, 10 warmup epochs, a batch size of 512, and a learning rate of 6e-4. While we normalized batch inputs with dataset statistics, we did not normalize the target when calculating the reconstruction loss because we observed better performance without it in early experiments. We followed the original mask ratio of 0.75.

The pre-training dataset consisted of $1,963,807$ samples from balanced_train_segments and unbalanced_train_segments data splits of the AudioSet (Gemmeke et al., 2017). We preprocessed samples to a log-scaled mel spectrogram with a sampling frequency of 16,000 Hz, window size of 25 ms, hop size of 10 ms, and mel-spaced frequency bins $F=80$ in the range 50–8,000 Hz.
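As a quick check of the preprocessing parameters, the sketch below converts the window and hop sizes into samples and converts a number of spectrogram frames into an approximate duration. The helper names are hypothetical, and the frame-to-duration conversion simply multiplies by the hop size, ignoring window overlap and padding at the edges.

```python
SAMPLE_RATE = 16_000   # Hz
WINDOW_MS = 25         # analysis window length in milliseconds
HOP_MS = 10            # hop between successive frames in milliseconds

def ms_to_samples(ms, sample_rate=SAMPLE_RATE):
    """Convert a length in milliseconds into a number of samples."""
    return sample_rate * ms // 1000

def frames_to_seconds(num_frames, hop_ms=HOP_MS):
    """Approximate duration covered by a number of spectrogram frames."""
    return num_frames * hop_ms / 1000

window_samples = ms_to_samples(WINDOW_MS)
hop_samples = ms_to_samples(HOP_MS)
print(window_samples, hop_samples)
print(frames_to_seconds(96), frames_to_seconds(512))
```

With a 10 ms hop, a model input of 96 frames therefore covers about 0.96 s of audio and 512 frames about 5.12 s.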

4.1.2 Evaluation

We evaluated models with input audio duration $T\in\{96,208,304,400,512\}$ , corresponding to 960 ms to 5.12 seconds. We also evaluated models with patch sizes ( $F\times T$ ) of $16\times 16$ by default, $16\times 8$ or $16\times 4$ for double or quadruple time resolutions, and $8\times 16$ for a double frequency resolution. In addition, we evaluated a model with a patch size of $80\times 4$ , which cuts the input along time, making spectrogram strips; note that we use the fixed number of frequency bins $F=80$ . The model configurations are listed in the model configuration table.

We used the hear-eval-kit (https://github.com/neuralaudio/hear-eval-kit) from the HEAR 2021 Challenge. The hear-eval-kit evaluates the performance of the models on the downstream tasks without fine-tuning. First, it encodes all the task samples into embeddings using the model as a feature extractor. Then, it trains a shallow downstream model to solve the task, taking the embeddings as input. It reports the test performance of the downstream model as the model performance.

The hear-eval-kit requires two types of embeddings from models. One is timestamp embeddings, for which we used $z^{\prime}$ , the features calculated for the downstream task as described in Section 3.3; the other is scene embeddings, for which we calculated the temporal average of $z^{\prime}$ . Since the ViT model accepts the fixed input duration $T$ , we convert variable-length inputs into feature vectors in two steps: we encode all the segments of length $T$ into which the input is divided, and then concatenate the encoded features along time. All other details of the downstream task evaluation follow the defaults of the hear-eval-kit, including the network design of the shallow downstream models for each task.
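The two-step conversion of variable-length inputs can be sketched as follows. The sketch makes several assumptions that the text does not specify: the last partial segment is zero-padded, `dummy_encoder` is a stand-in for the pre-trained encoder followed by the feature calculation, and all function names are hypothetical.

```python
import numpy as np

def encode_variable_length(spec, encode_fn, seg_frames):
    """Encode a (F, T_total) spectrogram segment by segment, then join along time."""
    pad = (-spec.shape[1]) % seg_frames
    if pad:
        # Assumption: zero-pad the final partial segment to the model input length.
        spec = np.pad(spec, ((0, 0), (0, pad)))
    segments = [spec[:, i:i + seg_frames]
                for i in range(0, spec.shape[1], seg_frames)]
    features = [encode_fn(seg) for seg in segments]   # each (N_T, feature_dim)
    return np.concatenate(features, axis=0)           # concatenate along time

def dummy_encoder(seg, patch_t=16, dim=8):
    """Stand-in encoder: one dim-sized vector per time patch (illustration only)."""
    n_t = seg.shape[1] // patch_t
    per_patch = seg.reshape(seg.shape[0], n_t, patch_t).mean(axis=(0, 2))
    return per_patch[:, None] * np.ones(dim)

rng = np.random.default_rng(0)
spec = rng.normal(size=(80, 500))                     # 80 bins, 500 frames
timestamp_emb = encode_variable_length(spec, dummy_encoder, seg_frames=208)
scene_emb = timestamp_emb.mean(axis=0)                # temporal average
print(timestamp_emb.shape, scene_emb.shape)
```

Here the timestamp embeddings keep one vector per time patch, while the scene embedding collapses them into a single vector per audio sample.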

4.2 Downstream Tasks

We used 15 downstream tasks from HEAR 2021 (Turian et al., 2022), consisting of four environmental sound tasks, five speech tasks, and six music tasks.

Speech tasks. These tasks are non-semantic speech tasks, which do not include automatic speech recognition: Speech Commands (Warden, 2018) (SPC, speech command word classification), CREMA-D (Cao et al., 2014) (CRM-D, speech emotion recognition), LibriCount (Stöter et al., 2019) (LbCount, speaker count estimation), Vocal Imitations (Kim et al., 2018a) (VoImit, matching a vocal imitation with an original sound as a classification), and Vox Lingua Top 10 (Valk and Alumäe, 2021) (Lingua10, language identification).

Environmental sound tasks. ESC-50 (Piczak, 2015) (environmental sound classification), FSD50K (Fonseca et al., 2020) (multilabel sound event classification), Gunshot Triangulation (Cooper and Shaw, 2020) (Gunshot, recording location classification), Beehive States (Nolasco et al., 2019) (Beehive, normal or queen-less binary classification).

Music tasks. GTZAN (Tzanetakis and Cook, 2002) (music genre recognition), GTZAN Music Speech (GTZ-M/S, music or speech binary classification), NSynth (Engel et al., 2017) Pitch (NSPitch, pitch classification), Mridangam Stroke and Tonic (Anantapadmanabhan et al., 2013) (Mrd-Stk for stroke, or Mrd-Ton for tonic, pitched percussion stroke or tonic classification), and Beijing Opera Percussion (Tian et al., 2014) (Beijing, percussion instrument classification).

4.3 Experimental Results: Comparison with the HEAR 2021 Results

We compare the results of our two best performing models with the results from the HEAR 2021 Challenge in the result tables for the environmental sound, speech, and music tasks. We used the HEAR 2021 Challenge results for which a single model is used (an ensemble of models is out of the scope of this study) and whose details are available in their papers.

These tables show that our models outperform others in seven tasks: Gunshot Triangulation, Vox Lingua Top 10, Vocal Imitations, CREMA-D, LibriCount, and Mridangam Tonic and Stroke. Conversely, models specialized in a task or domain outperform our models on the other tasks. On the environmental sound tasks FSD50K and ESC-50, PANNs and PaSST-base outperform ours; these models are pre-trained by supervised learning on AudioSet using its labels. KW-MLP, wav2vec2, and SERAB BYOL-S outperform ours on the Speech Commands task; these models are pre-trained on a speech corpus and thus specialize in speech. CREPE, which specializes in pitch estimation, outperforms ours on the NSynth Pitch task. With the exception of these specialized models, our MSM-MAE models show the top results in most tasks. These results show that MSM-MAE learns audio representations that are effective for general tasks without specializing in particular domains.

4.4 Experimental Results: Impact of Design Choices on Performance

We explore the impact of the design choices of input audio duration and patch size, for which we found positive effects in preliminary experiments. In addition, we also evaluate the performance difference between designs for splitting the input.

The input-duration results table shows the results of MSM-MAE with various input audio durations. The number at the end of the model name is the duration in units of 10 ms (e.g., 96 and 512 correspond to 960 ms and 5.12 seconds, respectively).

These results show that longer durations generally yield better results on the HEAR 2021 tasks. The exceptions are some speech tasks (Lingua10, SPC, and LbCount), Gunshot, and GTZ-M/S. Nevertheless, we think that a long input duration is more beneficial for general-purpose use, because most tasks show the best result with the longer input durations of 400 or 512.

4.4.2 Impact of Patch Size

The patch-size results table shows the results of MSM-MAE with various patch sizes for the models that accept input audio durations of about 2 s: MSM-MAE-208 ( $16\times 16$ ), MSM-MAE-200 ( $16\times 8$ ), MSM-MAE-200 ( $16\times 4$ ), and MSM-MAE-208 ( $8\times 16$ ). Here, $16\times 16$ , $16\times 8$ , $16\times 4$ , and $8\times 16$ denote the patch size of $F$ by $T$ .

The results show that the finer time resolutions ( $16\times 8$ and $16\times 4$ ) improve performance on 11 out of 15 tasks (Gunshot, FSD50K, Beehive, Lingua10, CRM-D, LbCount, GTZAN, NSPitch, Mrd-Ton, Mrd-Stk, and Beijing). In contrast, the finer frequency resolution ( $8\times 16$ ) improves eight tasks (Gunshot, FSD50K, Beehive, CRM-D, LbCount, GTZAN, NSPitch, and Mrd-Ton), and the degree of improvement is smaller than for the finer time resolutions.

We can also find the different trends in improvements among models. If we focus on the best results (bold numbers), the finer time resolutions ( $16\times 8$ and $16\times 4$ ) excel on speech and music tasks, whereas the finer frequency resolution ( $8\times 16$ ) performs better on environmental sound tasks.

In summary, we can improve task performance with finer resolutions using smaller patch sizes, especially with finer time resolutions. However, the comparison of $16\times 4$ with $16\times 8$ shows that this is not always the case; the finer $16\times 4$ does not clearly improve performance over $16\times 8$ , even though its doubled sequence length makes its computational cost grow quadratically relative to that of $16\times 8$ .

4.4.3 Impact of Input Splitting: Patches vs. Strips

The patch-versus-strip results table shows the results of MSM-MAE for different designs of splitting the input spectrogram into patches or strips. MSM-MAE-304 ( $16\times 16$ ) cuts the input spectrogram into patches along both frequency and time. In contrast, MSM-MAE-304 ( $80\times 4$ ) cuts the input into strips along time, simulating previous methods such as Mockingjay (Liu et al., 2020) that handle the input as a sequence of spectrogram strips. We compare the performance of these two designs.

The results show that the patch input model ( $16\times 16$ ) outperforms the strip input model ( $80\times 4$ ) on most tasks. This could indicate that the MIM framework, which takes patches as input and learns both frequency and time structure, is more suitable for learning general-purpose audio representations than a training framework that takes sequential strips as input. By contrast, the strip input model outperforms the patch input model on the CREMA-D and SPC tasks, which could indicate that the strip input is effective for speech tasks, as in the previous speech self-supervised learning studies that take sequential spectrogram strips as input.

4.5 Qualitative Analysis with Visualizations

This section presents analyses based on visualizations to gain an understanding of the representations learned by MSM-MAE. We first present visualizations of reconstruction results with random masks, patterned masks, and various mask ratios, and then show visualizations of attention maps. We present more visualizations in the appendix.

The reconstructions of three sounds in the random-mask reconstruction figure show results similar to those in the MAE paper (He et al., 2021): the inputs are reconstructed well but with blurry details, indicating that our models were successfully trained in the experiments. In this figure, we observe that the frequency structures, especially the harmonic structures, are reconstructed, and that stationary sounds are easier to reconstruct than short sound events.

Frequency structure reconstructions can be observed by focusing on reconstruction along the frequency axis (vertical axis). Frequency bins are reconstructed even where a few visible patches (white squares) are available in a time frame. Furthermore, we can find clear patterns of the harmonic structures in the reconstruction of sound 1 with trumpet notes. These observations show that the latent representation is encoded effectively to reconstruct frequency bins in a time frame using limited information of a few patches.

We find from sound 2, which is the sound of roaring low-pitched wind, that stationary sound can be easily reconstructed. The error of sound 2 shows a lighter color than that of sound 1 or 3, indicating that the reconstruction of sound 2 has a smaller error. In addition, frequency structures are recovered entirely, even without an available visible patch in a time frame, in contrast to sounds 1 and 3, where some sound events are not recovered. These observations suggest that the learned representations carry information related to the temporal structure of each sound, such as whether it is stationary or consists of short events.

4.5.2 Reconstructions with Patterned Masks

We compare the difference in reconstruction under the different availability of visible information created by the three mask patterns in the patterned-mask reconstruction figure. In A, the vertical mask, visible and masked patches alternate along the time axis; models must recover masked patches using visible patches adjoining along the time axis. In B, the horizontal mask, they alternate along the frequency axis; visible patches adjoining along the frequency axis are available. In C, the chessboard mask makes visible patches available all around the masked patches. All cases have the same mask ratio of 0.5.
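The three mask patterns can be generated with a few lines of NumPy, as sketched below. The function name `patterned_mask`, the choice of which row or column is masked first, and the even-sized 4-by-8 grid are assumptions for illustration; with an odd number of rows or columns, the ratio of a striped mask is only approximately 0.5.

```python
import numpy as np

def patterned_mask(n_f, n_t, pattern):
    """Boolean (n_f, n_t) patch grid; True marks a masked patch."""
    f_idx, t_idx = np.indices((n_f, n_t))
    if pattern == "vertical":        # masked and visible alternate along time
        return t_idx % 2 == 1
    if pattern == "horizontal":      # masked and visible alternate along frequency
        return f_idx % 2 == 1
    if pattern == "chessboard":      # every masked patch is surrounded by visible ones
        return (f_idx + t_idx) % 2 == 1
    raise ValueError(f"unknown pattern: {pattern}")

masks = {name: patterned_mask(4, 8, name)
         for name in ("vertical", "horizontal", "chessboard")}
for name, mask in masks.items():
    print(name, mask.mean())         # fraction of masked patches
    print(mask.astype(int))
```

In the chessboard pattern, each interior masked patch has four visible neighbors, whereas in the striped patterns it has only two, which matches the different availability of adjoining visible patches discussed here.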

In the patterned-mask reconstruction figure, we observe that the more adjoining visible patches are available, the easier the reconstruction becomes. For all the sounds, the errors show a lighter color for the chessboard mask than for the vertical and horizontal masks, indicating a smaller reconstruction error for the chessboard mask, where more adjoining visible patches are available.

If we focus on the horizontal mask in B, we can observe that harmonic structures are reconstructed more easily than noises. Sound 1, a rich harmonic structure of trumpet notes, shows less reconstruction error than sound 3, with less structured frequency patterns of noises in laughing voices. This observation suggests that the learned representation encodes the information of the harmonic structure more effectively than that of the noise.

4.5.3 Reconstructions with Various Mask Ratios

We varied the mask ratio to observe how the reconstruction changes according to it. We used the 3-s model (MSM-MAE-304) to reconstruct the input spectrograms with varied mask ratios from 0.40 to 0.99, focusing on cases with extremely small numbers of visible patches (e.g., 1, 2, 5, and 10 visible patches, corresponding to mask ratios of 0.99, 0.98, 0.95, and 0.90, respectively).

The mask-ratio reconstruction figure shows example reconstruction results. The results show that reconstruction succeeds entirely up to the default ratio of 0.75, whereas it degrades noticeably at higher ratios. As the mask ratio increases, only the patches around the visible patch are reconstructed relatively accurately, while the rest become blurry copies of the pattern around the nearest visible patch.

Focusing on the mask ratio of 0.99, where only one visible patch is encoded, we can observe the reconstruction of both local and global patterns. Locally, the frequency pattern of the time frames around the visible patch is reconstructed relatively clearly. Globally, a frequency pattern similar to the average of the original spectrogram is reconstructed as a stationary pattern over the entire spectrogram. This observation suggests that even though only one patch was encoded, information related to both local and global patterns learned from the training dataset is embedded in the representation.

4.5.4 Self-Attention Map Visualizations

We visualize the self-attention of the encoder of the pre-trained MSM-MAE-304 for six sounds in the two self-attention map figures. We picked two reference points for each sound to show self-attention maps. These self-attention maps are averaged over all heads in the last layer.
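The head-averaged attention map for one reference point can be sketched as below. The sketch assumes that the attention weights of one layer are available as an array of shape (heads, patches, patches) with any class token already removed, and that patches are ordered frequency-major; the function name `attention_map` and the random weights are hypothetical stand-ins for values taken from a trained encoder.

```python
import numpy as np

def attention_map(attn, ref_index, n_f, n_t):
    """Average attention from one reference patch over heads, as an (n_f, n_t) grid."""
    if attn.shape[1] != n_f * n_t or attn.shape[2] != n_f * n_t:
        raise ValueError("attention size does not match the patch grid")
    row = attn[:, ref_index, :].mean(axis=0)   # average over all heads
    # Assumption: patches are flattened frequency-major (index = f * n_t + t).
    return row.reshape(n_f, n_t)

rng = np.random.default_rng(0)
n_heads, n_f, n_t = 12, 5, 19
logits = rng.normal(size=(n_heads, n_f * n_t, n_f * n_t))
# Softmax over the last axis stands in for real attention weights.
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

ref_index = 2 * n_t + 7                        # reference patch at f=2, t=7
amap = attention_map(attn, ref_index, n_f, n_t)
print(amap.shape, amap.sum())
```

The resulting grid can be overlaid on the spectrogram to see which time-frequency regions the reference patch attends to.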

We see that the self-attention map reflects the repetition or continuation in the input sounds. Sound 1 of pop music has a clear repetition of notes in the input, which we can also find in the self-attention map. On the other hand, sound 2 is a stationary sound of blowing wind, and the attention continues along time similarly. Interestingly, we find in the first reference point in sound 1 that the self-attentions are strong at the same position in the beat, even though there are two similar notes per beat.

We also observe coarse segmentation of similar sounds for the reference points. For example, sound 4 shows that heartbeats are segmented in the self-attention maps of both reference points, less attending to the following voice; sound 5 shows that the first half of children’s singing voices are segmented in the first reference point, whereas the latter half of an adult female voice is coarsely segmented in the second reference point.

Conclusion

In this paper, we sought to learn audio representation from the input itself as supervision by using a pretext task of modeling masked spectrogram patches, which we call Masked Spectrogram Modeling (MSM). To implement MSM, we employed Masked Autoencoders (MAE) with audio spectrogram as input.

We conducted evaluations on the HEAR 2021 NeurIPS Challenge using its benchmark suite across a variety of domains, including speech, environmental sound, and music. Our models outperformed the HEAR 2021 Challenge results on seven out of 15 tasks (e.g., accuracies of 73.4% on CREMA-D and 85.8% on LibriCount) while showing top performances on other tasks where specialized models perform better. In addition, we investigated design choices of input audio duration and patch size and confirmed that longer duration and finer time resolution with a smaller patch size improve performance.

We also conducted qualitative analyses on various visualizations of outputs from both the MAE encoder and decoder. We observed frequential and temporal structures in the reconstruction results and the self-attention maps, suggesting that the learned representations hold information related to these structures.

While this study does not provide an exhaustive exploration, the quantitative results demonstrated the effectiveness of MSM using MAE, and the qualitative observations suggested that the learned representations contain useful information. We believe that the MSM framework holds promise for further improvements and applications in the future.