It's Raw! Audio Generation with State-Space Models

Karan Goel, Albert Gu, Chris Donahue, Christopher Ré

Introduction

Generative modeling of raw audio waveforms is a challenging frontier for machine learning due to their high-dimensionality—waveforms contain tens of thousands of timesteps per second and exhibit long-range behavior at multiple timescales. A key problem is developing architectures for modeling waveforms with the following properties:

Globally coherent generation, which requires modeling unbounded contexts with long-range dependencies.

Computational efficiency through parallel training, and fast autoregressive and non-autoregressive inference.

Sample efficiency through a model with inductive biases well suited to high-rate waveform data.

Among the many training methods for waveform generation, autoregressive (AR) modeling is a fundamentally important approach. AR models learn the distribution of future variables conditioned on past observations, and are central to recent advances in machine learning for language and image generation. With AR models, computing the exact likelihood is tractable, which makes them simple to train and lends them to applications such as lossless compression and posterior sampling. When generating, they can condition on arbitrary amounts of past context to sample sequences of unbounded length, potentially even longer than contexts observed during training. Moreover, architectural developments in AR waveform modeling can have a cascading effect on audio generation more broadly. For example, WaveNet, the earliest such architecture, remains a central component of state-of-the-art approaches for text-to-speech (TTS), unconditional generation, and non-autoregressive (non-AR) generation.

Despite notable progress in AR modeling of (relatively) short sequences found in domains such as natural language (e.g. $1$K tokens), it is still an open challenge to develop architectures that are effective for the much longer sequence lengths of audio waveforms (e.g. $1$M samples). Past attempts have tailored standard sequence modeling approaches like CNNs, RNNs, and Transformers to fit the demands of AR waveform modeling, but these approaches have limitations. For example, RNNs lack computational efficiency because they cannot be parallelized during training, while CNNs cannot achieve global coherence because they are fundamentally constrained by the size of their receptive field.

We introduce SaShiMi, a new architecture for modeling waveforms that yields state-of-the-art performance on unconditional audio generation benchmarks in both the AR and non-AR settings. SaShiMi is designed around recently developed deep state space models (SSMs), specifically S4. SSMs have a number of key features that make them ideal for modeling raw audio data. Concretely, S4:

Incorporates a principled approach to modeling long range dependencies with strong results on long sequence modeling, including raw audio classification.

Can be computed either as a CNN for efficient parallel training, or an RNN for fast autoregressive generation.

Is implicitly a continuous-time model, making it well-suited to signals like waveforms.

To realize these benefits of SSMs inside SaShiMi, we make $3$ technical contributions. First, we observe that while stable to train, S4’s recurrent representation cannot be used for autoregressive generation due to numerical instability. We identify the source of the instability using classical state space theory, which states that SSMs are stable when the state matrix is Hurwitz, which is not enforced by the S4 parameterization. We provide a simple improvement to the S4 parameterization that theoretically ensures stability.

Second, SaShiMi incorporates pooling layers between blocks of residual S4 layers to capture hierarchical information across multiple resolutions. This is a common technique in neural network architectures such as standard CNNs and multi-scale RNNs, and provides empirical improvements in both performance and computational efficiency over isotropic stacked S4 layers.

Third, while S4 is a causal (unidirectional) model suitable for AR modeling, we provide a simple bidirectional relaxation to flexibly incorporate it in non-AR architectures. This enables it to better take advantage of the available global context in non-AR settings.

For AR modeling in audio domains with unbounded sequence lengths (e.g. music), SaShiMi can train on much longer contexts than existing methods including WaveNet (sequences of length $128$K vs $4$K), while simultaneously having better test likelihood, faster training and inference, and fewer parameters. SaShiMi outperforms existing AR methods in modeling the data ($>0.15$ bits better negative log-likelihoods), with substantial improvements ($+0.4$ points) in the musicality of long generated samples ($16$ s) as measured by mean opinion scores. In unconditional speech generation, SaShiMi achieves superior global coherence compared to previous AR models on the difficult SC09 dataset both quantitatively ($80\%$ higher inception score) and qualitatively ($2\times$ higher audio quality and digit intelligibility opinion scores by human evaluators).

Finally, we validate that SaShiMi is a versatile backbone for non-AR architectures. Replacing the WaveNet backbone with SaShiMi in the state-of-the-art diffusion model DiffWave improves its quality, sample efficiency, and robustness to hyperparameters with no additional tuning.

The central contribution of this paper is showing that deep neural networks using SSMs are a strong alternative to conventional architectures for modeling audio waveforms, with favorable tradeoffs in training speed, generation speed, sample efficiency, and audio quality.

We technically improve the parameterization of S4, ensuring its stability when switching into recurrent mode at generation time.

We introduce SaShiMi, an SSM-based architecture with high efficiency and performance for unconditional AR modeling of music and speech waveforms.

We show that SaShiMi is easily incorporated into other deep generative models to improve their performance.

Related Work

This work focuses primarily on the task of generating raw audio waveforms without conditioning information. Most past work on waveform generation involves conditioning on localized intermediate representations like spectrograms, linguistic features, or discrete audio codes. Such intermediaries provide copious information about the underlying content of a waveform, enabling generative models to produce globally-coherent waveforms while only modeling local structure.

In contrast, modeling waveforms in an unconditional fashion requires learning both local and global structure with a single model, and is thus more challenging. Past work in this setting can be categorized into AR approaches, where audio samples are generated one at a time given previous audio samples, and non-AR approaches, where entire waveforms are generated in a single pass. While non-AR approaches tend to generate waveforms more efficiently, AR approaches have two key advantages. First, unlike non-AR approaches, they can generate waveforms of unbounded length. Second, they can tractably compute exact likelihoods, allowing them to be used for compression and posterior sampling.

In addition to these two advantages, new architectures for AR modeling of audio have the potential to bring about a cascade of improvements in audio generation more broadly. For example, while the WaveNet architecture was originally developed for AR modeling (in both conditional and unconditional settings), it has since become a fundamental piece of infrastructure in numerous audio generation systems. For instance, WaveNet is commonly used to vocode intermediaries such as spectrograms or discrete audio codes into waveforms, often in the context of text-to-speech (TTS) systems. Additionally, it serves as the backbone for several families of non-AR generative models of audio in both the conditional and unconditional settings:

Distillation: Parallel WaveNet and ClariNet distill parallelizable flow models from a teacher WaveNet model.

Likelihood-based flow models: WaveFlow, WaveGlow, and FloWaveNet all use WaveNet as a core component of reversible flow architectures.

Autoencoders: WaveNet Autoencoder and WaveVAE, which use WaveNets in their encoders.

Generative adversarial networks (GAN): Parallel WaveGAN and GAN-TTS, which use WaveNets in their discriminators.

Diffusion probabilistic models: WaveGrad and DiffWave learn a reversible noise diffusion process on top of dilated convolutional architectures.

In particular, we point out that DiffWave represents the state-of-the-art for unconditional waveform generation, and incorporates WaveNet as a black box.

Despite its prevalence, WaveNet is unable to model long-term structure beyond the length of its receptive field (up to $3$ s), and in practice, may even fail to leverage available information beyond a few tens of milliseconds. Hence, we develop an alternative to WaveNet which can leverage unbounded context. We focus primarily on evaluating our proposed architecture SaShiMi in the fundamental AR setting, and additionally demonstrate that, like WaveNet, SaShiMi can also transfer to non-AR settings.

Background

We provide relevant background on autoregressive waveform modeling in Section 3.1, state-space models in Section 3.2 and the recent S4 model in Section 3.3, before introducing SaShiMi in Section 4.

Given a distribution over waveforms $x=(x_{0},\dots,x_{T-1})$, autoregressive generative models model the joint distribution as the factorized product of conditional probabilities

$$p(x)=\prod_{t=0}^{T-1}p(x_{t}\mid x_{0},\dots,x_{t-1}).$$

Autoregressive models have two basic modes:

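Training: Given a sequence of samples $x_{0},\dots,x_{T-1}$, maximize the log-likelihood

$$\log p(x_{0},\dots,x_{T-1})=\sum_{i=0}^{T-1}\log p(x_{i}\mid x_{0},\dots,x_{i-1})=-\sum_{i=0}^{T-1}\ell(y_{i-1},x_{i})$$

where $\ell$ is the cross-entropy loss function and $y_{i-1}$ is the model's predicted distribution for $x_{i}$ (for $i=0$, the prediction made with no context).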

Inference (Generation): Given $x_{0},\dots,x_{t-1}$ as context, sample from the distribution represented by $y_{t-1}=p(x_{t}\mid x_{0},\dots,x_{t-1})$ to produce the next sample $x_{t}$ .

We remark that by the training mode, autoregressive models are equivalent to causal sequence-to-sequence maps $x_{0},\dots,x_{T-1}\mapsto y_{0},\dots,y_{T-1}$, where $x_{k}$ are input samples to the model and $y_{k}$ represents the model’s guess of $p(x_{k+1}\mid x_{0},\dots,x_{k})$. For example, when modeling a sequence of categorical inputs over $n$ classes, typically $x_{k}\in\mathbbm{R}^{d}$ are embeddings of the classes and $y_{k}\in\mathbbm{R}^{n}$ represents a categorical distribution over the classes.
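The two modes above can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the names `nll_bits`, `generate`, and `uniform_model` are hypothetical and do not come from the paper, and the stand-in model simply returns uniform logits over $256$ classes.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def nll_bits(logits, targets):
    """Training mode: mean cross-entropy in bits per sample.

    logits[k] scores the classes of the next sample given samples 0..k,
    and targets[k] is that next sample (an integer class index).
    """
    logp = log_softmax(logits)
    picked = logp[np.arange(len(targets)), targets]
    return float(-picked.mean() / np.log(2.0))

def generate(step_fn, context, n_new, rng):
    """Inference mode: sample n_new values one at a time.

    step_fn is a hypothetical model: it maps the prefix seen so far to
    logits over the next sample.
    """
    seq = list(context)
    for _ in range(n_new):
        probs = np.exp(log_softmax(step_fn(np.array(seq))))
        seq.append(int(rng.choice(len(probs), p=probs)))
    return seq

def uniform_model(prefix):
    # Stand-in model (assumption): ignores the prefix, uniform over 256 classes.
    return np.zeros(256)

rng = np.random.default_rng(0)
loss = nll_bits(np.zeros((10, 256)), np.arange(10))
sampled = generate(uniform_model, context=[128, 127, 129], n_new=5, rng=rng)
```

With uniform predictions over $256$ classes the loss is $\log_{2}256=8$ bits per sample, which is the baseline any trained model has to beat on $8$-bit audio.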

The most popular models for autoregressive audio modeling are based on CNNs and RNNs, which have different tradeoffs during training and inference. A CNN layer computes a convolution with a parameterized kernel

$$K=(k_{0},\dots,k_{w-1})\qquad y=K\ast x\qquad(1)$$

where $w$ is the width of the kernel. The receptive field or context size of a CNN is the sum of the widths of its kernels over all its layers. In other words, modeling a context of size $T$ requires learning a number of parameters proportional to $T$ . This is problematic in domains such as audio which require very large contexts.

A variant of CNNs particularly popular for modeling audio is the dilated convolution (DCNN) popularized by WaveNet, where each kernel $K$ is non-zero only at its endpoints. By choosing kernel widths carefully, such as in increasing powers of $2$, a DCNN can model larger contexts than vanilla CNNs.
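To make the comparison concrete, the sketch below computes the context size of a stack of causal convolutions. It assumes the usual receptive-field formula $1+\sum_{l}(w-1)\,d_{l}$ for layers with $w$ non-zero taps and dilation $d_{l}$; the formula and the function name `receptive_field` are illustrative assumptions, not taken from the paper.

```python
def receptive_field(taps, dilations):
    """Context size (in samples) of stacked causal convolutions.

    taps: number of non-zero kernel entries per layer.
    dilations: spacing between those entries, one value per layer.
    """
    return 1 + sum((taps - 1) * d for d in dilations)

# Ten ordinary layers with two taps each.
dense = receptive_field(2, [1] * 10)

# Ten dilated layers with two taps each, dilations 1, 2, 4, ..., 512.
dilated = receptive_field(2, [2 ** i for i in range(10)])
```

Both stacks have the same number of kernel parameters, but the dilated stack covers $1024$ samples while the ordinary stack covers only $11$. Even so, the context remains bounded by the architecture.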

RNNs such as SampleRNN maintain a hidden state $h_{t}$ that is sequentially computed from the previous state and current input, and model the output as a function of the hidden state

$$h_{t}=f(h_{t-1},x_{t})\qquad y_{t}=g(h_{t})\qquad(2)$$

The function $f$ is also known as an RNN cell, such as the popular LSTM.

CNNs and RNNs have efficiency tradeoffs as autoregressive models. CNNs are parallelizable: given an input sequence $x_{0},\dots,x_{T-1}$ , they can compute all $y_{k}$ at once, making them efficient during training. However, they become awkward at inference time when only the output at a single timestep $y_{t}$ is needed. Autoregressive stepping requires specialized caching implementations that have higher complexity requirements than RNNs.

On the other hand, RNNs are stateful: the entire context $x_{0},\dots,x_{t}$ is summarized into the hidden state $h_{t}$. This makes them efficient at inference, requiring only constant time and space to generate the next hidden state and output. However, this inherent sequentiality leads to slow training and optimization difficulties (the vanishing gradient problem).

3.2 State Space Models

A recent class of deep neural networks has been developed that combines properties of both CNNs and RNNs. The state space model (SSM) is defined in continuous time by the equations

$$\begin{aligned}h^{\prime}(t)&=Ah(t)+Bx(t)\\ y(t)&=Ch(t)+Dx(t)\end{aligned}\qquad(3)$$

To operate on discrete-time sequences sampled with a step size of $\Delta$, SSMs can be computed with the recurrence

$$h_{k}=\overline{A}h_{k-1}+\overline{B}x_{k}\qquad y_{k}=\overline{C}h_{k}+\overline{D}x_{k}\qquad(4)$$

$$\overline{A}=(I-\Delta/2\cdot A)^{-1}(I+\Delta/2\cdot A)\qquad(5)$$

where $\overline{A}$ is the discretized state matrix and $\overline{B},\dots$ have similar formulas. Starting from a zero initial state, Eq. (4) is equivalent to the convolution

$$\overline{K}=(\overline{C}\,\overline{B},\ \overline{C}\,\overline{A}\,\overline{B},\ \overline{C}\,\overline{A}^{2}\overline{B},\dots)\qquad y=\overline{K}\ast x+\overline{D}x\qquad(6)$$

SSMs can be viewed as particular instantiations of CNNs and RNNs that inherit their efficiency at both training and inference and overcome their limitations. As an RNN, (4) is a special case of (2) where $f$ and $g$ are linear, giving it much simpler structure that avoids the optimization issues found in RNNs. As a CNN, (6) is a special case of (1) with an unbounded convolution kernel, overcoming the context size limitations of vanilla CNNs.
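The equivalence between the recurrent and convolutional views can be checked numerically. The following NumPy sketch uses a small dense SSM with a single input and output channel. It assumes the standard bilinear formula $\overline{B}=(I-\Delta/2\cdot A)^{-1}\Delta B$ with $\overline{C}=C$ and $\overline{D}=D$ (the text only says these have "similar formulas"), a zero initial state, and a naive kernel computation rather than S4's fast algorithm. All function names are hypothetical.

```python
import numpy as np

def discretize(A, B, dt):
    # Bilinear discretization; the B-bar formula is an assumption (see lead-in).
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - dt / 2 * A)
    return inv @ (I + dt / 2 * A), inv @ (dt * B)

def run_recurrent(Ab, Bb, C, D, x):
    # RNN mode: constant work per step, sequential in time.
    h = np.zeros(Ab.shape[0])
    ys = []
    for xk in x:
        h = Ab @ h + Bb * xk
        ys.append(C @ h + D * xk)
    return np.array(ys)

def run_convolution(Ab, Bb, C, D, x):
    # CNN mode: build the kernel (C B, C A B, C A^2 B, ...) then convolve.
    L = len(x)
    K = np.empty(L)
    v = Bb.copy()
    for j in range(L):
        K[j] = C @ v
        v = Ab @ v
    return np.convolve(x, K)[:L] + D * x

rng = np.random.default_rng(0)
N, L = 4, 64
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))
B = rng.standard_normal(N)
C = rng.standard_normal(N)
D = 0.5
x = rng.standard_normal(L)

Ab, Bb = discretize(A, B, dt=0.1)
y_rnn = run_recurrent(Ab, Bb, C, D, x)
y_cnn = run_convolution(Ab, Bb, C, D, x)
max_gap = float(np.max(np.abs(y_rnn - y_cnn)))
```

The two outputs agree up to floating-point error, which is what allows an SSM to be trained in convolutional mode and then sampled from in recurrent mode.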

3.3 S4

S4 is a particular instantiation of SSM that parameterizes $A$ as a diagonal plus low-rank (DPLR) matrix, $A=\Lambda+pq^{*}$. This parameterization has two key properties. First, this is a structured representation that allows faster computation: S4 uses a special algorithm to compute the convolution kernel $\overline{K}$ (6) very quickly. Second, this parameterization includes certain special matrices called HiPPO matrices, which theoretically and empirically allow the SSM to capture long-range dependencies better. In particular, HiPPO specifies a special equation $h^{\prime}(t)=Ah(t)+Bx(t)$ with closed formulas for $A$ and $B$. This particular $A$ matrix can be written in DPLR form, and S4 initializes its $A$ and $B$ matrices to these.

Model

SaShiMi consists of two main components. First, S4 layers are the core component of our neural network architecture, to capture long context while being fast at both training and inference. We provide a simple improvement to S4 that addresses instability at generation time (Section 4.1). Second, SaShiMi connects stacks of S4 layers together in a simple multi-scale architecture (Section 4.2).

We use S4’s representation and algorithm as a black box, with one technical improvement: we use the parameterization $\Lambda-pp^{*}$ instead of $\Lambda+pq^{*}$ . This amounts to essentially tying the parameters $p$ and $q$ (and reversing a sign).

To justify our parameterization, we first note that it still satisfies the main properties of S4’s representation (Section 3.3). First, this is a special case of a DPLR matrix, and can still use S4’s algorithm for fast computation. Moreover, we show that the HiPPO matrices still satisfy this more restricted structure; in other words, we can still use the same initialization which is important to S4’s performance.

Proposition 4.1. All three original HiPPO matrices are unitarily equivalent to a matrix of the form ${A}=\Lambda-pp^{*}$ for diagonal $\Lambda$ and $p\in\mathbbm{R}^{N\times r}$ for $r=1$ or $r=2$. Furthermore, all entries of $\Lambda$ have real part $0$ (for HiPPO-LegT and HiPPO-LagT) or $-\frac{1}{2}$ (for HiPPO-LegS).

Next, we discuss how this parameterization makes S4 stable. The high-level idea is that stability of SSMs involves the spectrum of the state matrix ${A}$ , which is more easily controlled because $-pp^{*}$ is a negative semidefinite matrix (i.e., we know the signs of its spectrum).

Definition 4.2. A Hurwitz matrix ${A}$ is one where every eigenvalue has negative real part.

Hurwitz matrices are also called stable matrices, because they imply that the SSM (3) is asymptotically stable. In the context of discrete time SSMs, we can easily see why ${A}$ needs to be a Hurwitz matrix from first principles with the following simple observations.

First, unrolling the RNN mode (equation (4)) involves powering up $\overline{A}$ repeatedly, which is stable if and only if all eigenvalues of $\overline{A}$ lie inside or on the unit disk. Second, the transformation (5) maps the complex left half plane (i.e. negative real part) to the complex unit disk. Therefore computing the RNN mode of an SSM (e.g. in order to generate autoregressively) requires ${A}$ to be a Hurwitz matrix.
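The second observation can be checked on individual eigenvalues: if $\lambda$ is an eigenvalue of $A$, the corresponding eigenvalue of $\overline{A}$ in (5) is $(1+\Delta\lambda/2)/(1-\Delta\lambda/2)$. The sketch below is a minimal illustration with hypothetical values of $\lambda$ and $\Delta$ chosen for this example.

```python
def bilinear(lam, dt):
    # Eigenvalue of A-bar in Eq. (5) corresponding to eigenvalue lam of A.
    return (1 + dt / 2 * lam) / (1 - dt / 2 * lam)

dt = 0.01
stable = bilinear(-0.5 + 3j, dt)    # negative real part
marginal = bilinear(0.0 + 3j, dt)   # purely imaginary
unstable = bilinear(0.5 + 3j, dt)   # positive real part

# Magnitude of each mode after 16000 recurrent steps
# (one second of audio at 16 kHz).
steps = 16000
decay = abs(stable) ** steps
growth = abs(unstable) ** steps
```

Eigenvalues with negative real part land strictly inside the unit disk and decay under repeated powering, while even a small positive real part yields a mode that grows by many orders of magnitude within one second of generated audio.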

However, controlling the spectrum of a general DPLR matrix is difficult; empirically, we found that S4 matrices generally became non-Hurwitz after training. We remark that this stability issue only arises when using S4 during autoregressive generation, because S4’s convolutional mode during training does not involve powering up $\overline{A}$ and thus does not require a Hurwitz matrix. Our reparameterization makes controlling the spectrum of $\overline{A}$ easier.

Proposition 4.3. A matrix $A=\Lambda-pp^{*}$ is Hurwitz if all entries of $\Lambda$ have negative real part.

Proof. We first observe that if ${A}+{A}^{*}$ is negative definite, then ${A}$ is Hurwitz. This follows because $0>v^{*}({A}+{A}^{*})v=(v^{*}{A}v)+(v^{*}{A}v)^{*}=2\mathfrak{Re}(v^{*}{A}v)=2\mathfrak{Re}(\lambda)$ for any (unit length) eigenpair $(\lambda,v)$ of ${A}$.

Next, note that the condition implies that $\Lambda+\Lambda^{*}$ is negative definite (it is a real diagonal matrix with negative entries). Since the matrix $-2pp^{*}$ is negative semidefinite (NSD), the sum ${A}+{A}^{*}=\Lambda+\Lambda^{*}-2pp^{*}$ is negative definite. ∎

Proposition 4.3 implies that with our tied reparameterization of S4, controlling the spectrum of the learned ${A}$ matrix reduces to controlling the diagonal portion $\Lambda$. This is a far easier problem than controlling a general DPLR matrix, and can be enforced by regularization or reparameterization (e.g. running its entries through an $\exp$ function). In practice, we found that not restricting $\Lambda$ and letting it learn freely led to stable trained solutions.
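A quick numerical check of this argument is sketched below. It is an illustration only: the random matrices, the $-\exp(\cdot)$ reparameterization of the real part of $\Lambda$, the step size, and the small untied counterexample are all hypothetical choices made for this example, not values from the paper.

```python
import numpy as np

def is_hurwitz(A):
    # True when every eigenvalue has strictly negative real part.
    return bool(np.all(np.linalg.eigvals(A).real < 0))

rng = np.random.default_rng(0)
N, r = 8, 1

# Tied form Lambda - p p^*. The real part of Lambda is forced negative
# by passing unconstrained values through -exp(.).
raw = rng.standard_normal(N)
imag = rng.standard_normal(N)
Lam = np.diag(-np.exp(raw) + 1j * imag)
p = rng.standard_normal((N, r)) + 1j * rng.standard_normal((N, r))
A_tied = Lam - p @ p.conj().T
tied_hurwitz = is_hurwitz(A_tied)

# Spectral radius of the discretized matrix from Eq. (5).
dt = 0.01
I = np.eye(N)
A_bar = np.linalg.inv(I - dt / 2 * A_tied) @ (I + dt / 2 * A_tied)
radius = float(np.max(np.abs(np.linalg.eigvals(A_bar))))

# Untied form Lambda + p q^*: a negative diagonal alone is not enough.
Lam2 = np.diag([-0.1, -0.2]).astype(complex)
p2 = np.array([[3.0], [0.0]])
q2 = np.array([[3.0], [0.0]])
A_untied = Lam2 + p2 @ q2.conj().T
untied_hurwitz = is_hurwitz(A_untied)
```

In the untied example the low-rank term adds $9$ to the first diagonal entry, producing a positive eigenvalue even though every entry of $\Lambda$ is negative. In the tied form the low-rank term can only push the spectrum further into the left half plane.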

4.2 SaShiMi Architecture

Figure 1 illustrates the complete SaShiMi architecture.

S4 Block. SaShiMi is built around repeated deep neural network blocks containing our modified S4 layers, following the design of the original S4 model. Compared to Gu et al., we add additional pointwise linear layers after the S4 layer in the style of the feed-forward network in Transformers or the inverted bottleneck layer in CNNs. Model details are in Appendix A.

Multi-scale Architecture. SaShiMi uses a simple architecture for autoregressive generation that consolidates information from the raw input signal at multiple resolutions. The SaShiMi architecture consists of multiple tiers, with each tier composed of a stack of residual S4 blocks. The top tier processes the raw audio waveform at its original sampling rate, while lower tiers process downsampled versions of the input signal. The output of lower tiers is upsampled and combined with the input to the tier above it in order to provide a stronger conditioning signal. This architecture is inspired by related neural network architectures for AR modeling that incorporate multi-scale characteristics such as SampleRNN and PixelCNN++.

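The pooling is accomplished by simple reshaping and linear operations. Concretely, an input sequence $x\in\mathbbm{R}^{T\times H}$ with context length $T$ and hidden dimension size $H$ is transformed through these shapes:

$$\text{(Down-pool)}\quad(T,H)\xrightarrow{\text{reshape}}(T/p,\ H\cdot p)\xrightarrow{\text{linear}}(T/p,\ H\cdot q)$$

$$\text{(Up-pool)}\quad(T,H)\xrightarrow{\text{linear}}(T,\ H\cdot p/q)\xrightarrow{\text{reshape}}(T\cdot p,\ H/q)$$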

Here, $p$ is the pooling factor and $q$ is an expansion factor that increases the hidden dimension while pooling. In our experiments, we always fix $p=4,q=2$ and use a total of just two pooling layers (three tiers).

We additionally note that in AR settings, the up-pooling layers must be shifted by a time step to ensure causality.
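The pooling operations and the causal shift can be sketched in NumPy as follows. This is a minimal illustration under assumptions: down-pooling is taken to be a reshape that groups $p$ consecutive steps followed by a linear map to width $H\cdot q$, up-pooling is the mirror image, and the causal shift is implemented by delaying the pooled sequence by one pooled step. The function names and random weights are hypothetical.

```python
import numpy as np

def down_pool(x, W, p):
    # (T, H) -> reshape (T/p, H*p) -> linear (T/p, H*q); W has shape (H*p, H*q).
    T, H = x.shape
    return x.reshape(T // p, H * p) @ W

def up_pool(x, W, p, causal=True):
    # (T, H) -> linear (T, H*p/q) -> reshape (T*p, H/q); W has shape (H, H*p/q).
    if causal:
        # Delay by one pooled step so no output position sees its own future.
        x = np.concatenate([np.zeros_like(x[:1]), x[:-1]], axis=0)
    T, _ = x.shape
    y = x @ W
    return y.reshape(T * p, y.shape[1] // p)

rng = np.random.default_rng(0)
T, H, p, q = 16, 4, 4, 2
W_down = rng.standard_normal((H * p, H * q))
W_up = rng.standard_normal((H * q, H * p))

x = rng.standard_normal((T, H))
pooled = down_pool(x, W_down, p)
out = up_pool(pooled, W_up, p)

# Perturb the input at time step 8 and recompute.
x_perturbed = x.copy()
x_perturbed[8] += 1.0
out_perturbed = up_pool(down_pool(x_perturbed, W_down, p), W_up, p)
```

Here `pooled` has shape $(4, 8)$ and `out` has shape $(16, 4)$. Changing the input at step $8$ only changes outputs from step $12$ onward, so the up-pooled signal at any position depends on strictly earlier inputs.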

Bidirectional S4. Like RNNs, SSMs are causal with an innate time dimension (equation (3)). For non-autoregressive tasks, we consider a simple variant of S4 that is bidirectional. We simply pass the input sequence through an S4 layer, and also reverse it and pass it through an independent second S4 layer. These outputs are concatenated and passed through a positionwise linear layer as in the standard S4 block.

We show that bidirectional S4 outperforms causal S4 when autoregression is not required (Section 5.3).
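The bidirectional variant can be sketched with plain causal convolutions standing in for the two S4 layers. The sketch assumes a single channel, fixed decaying kernels in place of learned SSM kernels, and that the backward output is flipped back into the original time order before concatenation (the text does not state this explicitly). All names are hypothetical.

```python
import numpy as np

def causal_conv(x, k):
    # Output at step t depends only on x[0..t].
    return np.convolve(x, k)[: len(x)]

def bidirectional_layer(x, k_fwd, k_bwd, W):
    fwd = causal_conv(x, k_fwd)
    # Reverse, apply an independent causal kernel, then restore time order.
    bwd = causal_conv(x[::-1], k_bwd)[::-1]
    # Concatenate along the feature axis and mix with a positionwise linear map.
    return np.stack([fwd, bwd], axis=-1) @ W

T = 8
k_fwd = 0.5 ** np.arange(T)
k_bwd = 0.9 ** np.arange(T)
W = np.array([1.0, 1.0])

# An impulse at the final time step.
x = np.zeros(T)
x[-1] = 1.0

y_causal = causal_conv(x, k_fwd)
y_bidir = bidirectional_layer(x, k_fwd, k_bwd, W)
```

With the impulse at the last step, the causal output at step $0$ is zero, whereas the bidirectional output at step $0$ is non-zero: the backward branch gives every position access to future context.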

Experiments

We evaluate SaShiMi on several benchmark audio generation and unconditional speech generation tasks in both AR and non-AR settings, validating that SaShiMi generates more globally coherent waveforms than baselines while having higher computational and sample efficiency.

Baselines. We compare SaShiMi to the leading AR models for unconditional waveform generation, SampleRNN and WaveNet. In Section 5.3, we show that SaShiMi can also improve non-AR models.

Datasets. We evaluate SaShiMi on datasets spanning music and speech generation (Table 1).

Beethoven. A benchmark music dataset, consisting of Beethoven’s piano sonatas.

YouTubeMix. Another piano music dataset with higher-quality recordings than Beethoven.

SC09. A benchmark speech dataset, consisting of $1$-second recordings of the digits “zero” through “nine” spoken by many different speakers.

All datasets are quantized using $8$-bit quantization, either linear or $\mu$-law, depending on prior work. Each dataset is divided into non-overlapping chunks; the SampleRNN baseline is trained using truncated backpropagation through time (TBPTT), while WaveNet and SaShiMi are trained on entire chunks. All models are trained to minimize the negative log-likelihood (NLL) of individual audio samples; results are reported in base $2$, i.e. in bits per sample, also known as bits per byte (BPB) because of the one-byte-per-sample quantization. All datasets were sampled at a rate of $16$ kHz. Table 1 summarizes characteristics of the datasets and processing.
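For reference, the two quantization schemes can be sketched as follows. The paper does not spell out its exact preprocessing, so this uses the standard $\mu$-law companding formula with $\mu=255$ and assumes waveforms normalized to $[-1,1]$; the function names are hypothetical.

```python
import numpy as np

def linear_encode(x, bits=8):
    # Uniformly spaced levels over [-1, 1].
    levels = 2 ** bits
    x = np.clip(x, -1.0, 1.0)
    return np.minimum(((x + 1) / 2 * levels).astype(np.int64), levels - 1)

def mu_law_encode(x, bits=8):
    # Compress amplitudes logarithmically, then quantize uniformly.
    mu = 2 ** bits - 1
    x = np.clip(x, -1.0, 1.0)
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((y + 1) / 2 * mu).astype(np.int64)

def mu_law_decode(q, bits=8):
    # Invert the uniform quantization, then expand the amplitudes.
    mu = 2 ** bits - 1
    y = 2 * q.astype(np.float64) / mu - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

x = np.linspace(-1.0, 1.0, 1001)
codes_linear = linear_encode(x)
codes_mu = mu_law_encode(x)
max_error = float(np.max(np.abs(mu_law_decode(codes_mu) - x)))
```

Either way, each sample becomes one of $256$ classes, so the model's output at every step is a categorical distribution over $256$ values and the NLL in base $2$ is naturally read as bits per byte.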

Because music audio is not constrained in length, AR models are a natural approach for music generation, since they can generate samples longer than the context windows they were trained on. We validate that SaShiMi can leverage longer contexts to perform music waveform generation more effectively than baseline AR methods.

We follow the setting of Mehri et al. for the Beethoven dataset. Table 2 reports results found in prior work, as well as our reproductions. In fact, our WaveNet baseline is much stronger than the one implemented in prior work. SaShiMi substantially improves the test NLL by $0.09$ BPB compared to the best baseline. Table 3 ablates the context length used in training, showing that SaShiMi significantly benefits from seeing longer contexts, and is able to effectively leverage extremely long contexts (over $100$ k steps) when predicting next samples.

Next, we evaluate all baselines on YouTubeMix. Table 4 shows that SaShiMi substantially outperforms SampleRNN and WaveNet on NLL. Following Dieleman et al. (protocol in Section C.4), we measured mean opinion scores (MOS) for audio fidelity and musicality for $16$ s samples generated by each method (longer than the training context). All methods have similar fidelity, but SaShiMi substantially improves musicality by around $0.40$ points, validating that it can generate long samples more coherently than other methods.

Figure 2 shows that SaShiMi trains stably and more efficiently than baselines in wall clock time. Appendix B, Figure 5 also analyzes the peak throughput of different AR models as a function of batch size.

5.2 Model ablations: Slicing the SaShiMi

We validate our technical improvements and ablate SaShiMi’s architecture.

Stabilizing S4. We consider how different parameterizations of S4’s representation affect downstream performance (Section 4.1). Recall that S4 uses a special matrix $A=\Lambda+pq^{*}$ specified by HiPPO, which theoretically captures long-range dependencies (Section 3.3). We ablate various parameterizations of a small SaShiMi model ( $2$ layers, $500$ epochs on YouTubeMix). Learning $A$ yields consistent improvements, but becomes unstable at generation. Our reparameterization allows $A$ to be learned while preserving stability, agreeing with the analysis in Section 4.1. A visual illustration of the spectral radii of the learned $\overline{A}$ in the new parameterization is provided in Figure 3.

Parameterization of $A$: NLL, stable generation
Frozen $\Lambda+pq^{*}$: $1.445$, ✓
Learned $\Lambda+pq^{*}$: $1.420$, ✗
Learned $\Lambda-pp^{*}$: $1.419$, ✓

Figure 3: (S4 Stability) Comparison of spectral radii for all $\overline{A}$ matrices in a SaShiMi model trained with different S4 parameterizations. The instability in the standard S4 parameterization is solved by our Hurwitz parameterization.

Multi-scale architecture. We investigate the effect of SaShiMi’s architecture (Section 4.2) against isotropic S4 layers on YouTubeMix. Controlling for parameter counts, adding pooling in SaShiMi leads to substantial improvements in computation and modeling (Table 5, Bottom).

Efficiency tradeoffs. We ablate different sizes of the SaShiMi model on YouTubeMix to show its performance tradeoffs along different axes.

Table 5 (Top) shows that a single SaShiMi model simultaneously outperforms all baselines on quality (NLL) and computation at both training and inference, with a model more than $3\times$ smaller. Moreover, SaShiMi improves monotonically with depth, suggesting that quality can be further improved at the cost of additional computation.

5.3 Unconditional Speech Generation

The SC09 spoken digits dataset is a challenging unconditional speech generation benchmark, as it contains several axes of variation (words, speakers, microphones, alignments). Unlike the music setting (Section 5.1), SC09 contains audio of bounded length ($1$-second utterances). To date, AR waveform models trained on this benchmark have yet to generate spoken digits which are consistently intelligible to humans. (While AR waveform models can produce intelligible speech in the context of TTS systems, this capability requires conditioning on rich intermediaries like spectrograms or linguistic features.) In contrast, non-AR approaches are capable of achieving global coherence on this dataset, as first demonstrated by WaveGAN.

Although our primary focus thus far has been the challenging testbed of AR waveform modeling, SaShiMi can also be used as a flexible neural network architecture for audio generation more broadly. We demonstrate this potential by integrating SaShiMi into DiffWave, a diffusion-based method for non-AR waveform generation which represents the current state-of-the-art for SC09. DiffWave uses the original WaveNet architecture as its backbone; here, we simply replace WaveNet with a SaShiMi model containing a similar number of parameters.

We compare SaShiMi to strong baselines on SC09 in both the AR and non-AR (via DiffWave) settings by measuring several standard quantitative and qualitative metrics such as Fréchet Inception Distance (FID) and Inception Score (IS) (Section C.3). We also conduct a qualitative evaluation where we ask several annotators to label the generated digits and then compute their inter-annotator agreement. Additionally, as in Donahue et al., we ask annotators for their subjective opinions on overall audio quality, intelligibility, and speaker diversity, and report MOS for each axis. Results for all models appear in Table 6.

Autoregressive. SaShiMi substantially outperforms other AR waveform models on all metrics, and achieves $2\times$ higher MOS for both quality and intelligibility. Moreover, annotators agree on labels for samples from SaShiMi far more often than they do for samples from other AR models, suggesting that SaShiMi generates waveforms that are more globally coherent on average than prior work. Finally, SaShiMi achieves higher MOS on all axes compared to WaveGAN while using more than $4\times$ fewer parameters.

Non-autoregressive. Integrating SaShiMi into DiffWave substantially improves performance on all metrics compared to its WaveNet-based counterpart, and achieves a new overall state-of-the-art performance on all quantitative and qualitative metrics on SC09. We note that this result involved zero tuning of the model or training parameters (e.g. diffusion steps or optimizer hyperparameters) (Section C.2). This suggests that SaShiMi could be useful not only for AR waveform modeling but also as a new drop-in architecture for many audio generation systems which currently depend on WaveNet (see Section 2).

We additionally conduct several ablation studies on our hybrid DiffWave and SaShiMi model, and compare performance earlier in training and with smaller models (Table 7). When paired with DiffWave, SaShiMi is much more sample efficient than WaveNet, matching the performance of the best WaveNet-based model with half as many training steps. Kong et al. also observed that DiffWave was extremely sensitive with a WaveNet backbone, performing poorly with smaller models and becoming unstable with larger ones. We show that, when using WaveNet, a small DiffWave model fails to model the dataset, whereas it works much more effectively when using SaShiMi. Finally, we ablate our non-causal relaxation, showing that this bidirectional version of SaShiMi performs much better than its unidirectional counterpart (as expected).

Discussion

Our results indicate that SaShiMi is a promising new architecture for modeling raw audio waveforms. When trained on music and speech datasets, SaShiMi generates waveforms that humans judge to be more musical and intelligible respectively compared to waveforms from previous architectures, indicating that audio generated by SaShiMi has a higher degree of global coherence. By leveraging the dual convolutional and recurrent forms of S4, SaShiMi is more computationally efficient than past architectures during both training and inference. Additionally, SaShiMi is consistently more sample efficient to train—it achieves better quantitative performance with fewer training steps. Finally, when used as a drop-in replacement for WaveNet, SaShiMi improved the performance of an existing state-of-the-art model for unconditional generation, indicating a potential for SaShiMi to create a ripple effect of improving audio generation more broadly.

Acknowledgments

We thank John Thickstun for helpful conversations. We gratefully acknowledge the support of NIH under No. U54EB020405 (Mobilize), NSF under Nos. CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); ARL under No. W911NF-21-2-0251 (Interactive Human-AI Teaming); ONR under No. N000141712266 (Unifying Weak Supervision); ONR N00014-20-1-2480: Understanding and Applying Non-Euclidean Geometry in Machine Learning; N000142012275 (NEPTUNE); NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, Google Cloud, Salesforce, Total, the HAI-AWS Cloud Credits for Research program, the Stanford Data Science Initiative (SDSI), and members of the Stanford DAWN project: Facebook, Google, and VMWare. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of NIH, ONR, or the U.S. Government.

References

Appendix A Model Details

We prove Proposition 4.1. We build off the S4 representation of HiPPO matrices, using their decomposition as a normal plus low-rank matrix which implies that they are unitarily similar to a diagonal plus low-rank matrix. Then we show that the low-rank portion of this decomposition is in fact negative semidefinite, while the diagonal portion has non-positive real part.

We consider the diagonal plus low-rank decompositions shown in Gu et al. of the three original HiPPO matrices Gu et al. , and show that the low-rank portions are in fact negative semidefinite.

HiPPO-LagT. The family of generalized HiPPO-LagT matrices is defined for $0\leq\beta\leq\frac{1}{2}$, with the main HiPPO-LagT matrix having $\beta=0$, by

$$A=-\begin{bmatrix}\frac{1}{2}+\beta&&&&\\ 1&\frac{1}{2}+\beta&&&\\ 1&1&\frac{1}{2}+\beta&&\\ 1&1&1&\frac{1}{2}+\beta&\\ \vdots&&&&\ddots\end{bmatrix}=-\beta I-\begin{bmatrix}0&-\frac{1}{2}&-\frac{1}{2}&-\frac{1}{2}&\cdots\\ \frac{1}{2}&0&-\frac{1}{2}&-\frac{1}{2}&\\ \frac{1}{2}&\frac{1}{2}&0&-\frac{1}{2}&\\ \frac{1}{2}&\frac{1}{2}&\frac{1}{2}&0&\\ \vdots&&&&\ddots\end{bmatrix}-\frac{1}{2}\begin{bmatrix}1&1&1&1&\cdots\\ 1&1&1&1&\\ 1&1&1&1&\\ 1&1&1&1&\\ \vdots&&&&\ddots\end{bmatrix}.$$

The first matrix is skew-symmetric, so it is unitarily similar to a (complex) diagonal matrix with pure imaginary eigenvalues (i.e., real part $0$). The second term, $\frac{1}{2}$ times the all-ones matrix, can be factored as $pp^{*}$ for $p=2^{-1/2}\begin{bmatrix}1&\cdots&1\end{bmatrix}^{*}$. Thus the whole matrix $A$ is unitarily similar to a matrix $\Lambda-pp^{*}$ where the eigenvalues of $\Lambda$ have real part between $-\frac{1}{2}$ and $0$.

HiPPO-LegS. The HiPPO-LegS matrix is defined as

$$A_{nk}=-\begin{cases}(2n+1)^{1/2}(2k+1)^{1/2}&\mbox{if }n>k\\ n+1&\mbox{if }n=k\\ 0&\mbox{if }n<k\end{cases}.$$

Adding $\frac{1}{2}(2n+1)^{1/2}(2k+1)^{1/2}$ to every entry gives $-\frac{1}{2}I-S$, where

$$S_{nk}=\begin{cases}\frac{1}{2}(2n+1)^{1/2}(2k+1)^{1/2}&\mbox{if }n>k\\ 0&\mbox{if }n=k\\ -\frac{1}{2}(2n+1)^{1/2}(2k+1)^{1/2}&\mbox{if }n<k\end{cases}$$

is skew-symmetric. Equivalently, $A=-\frac{1}{2}I-S-pp^{*}$ with $p_{n}=2^{-1/2}(2n+1)^{1/2}$.
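The LegS case can be sanity-checked numerically. The sketch below is illustrative only and is not taken from the S4 codebase: the function names are ours, and the split $A=-\frac{1}{2}I-S-pp^{*}$ with $p_{n}=2^{-1/2}(2n+1)^{1/2}$ is derived here from the entrywise definition of the HiPPO-LegS matrix, with $S$ the skew-symmetric part. It checks the identity entry by entry in plain Python.

```python
import math


def legs_matrix(N):
    """HiPPO-LegS matrix (N x N) built from its entrywise definition."""
    A = [[0.0] * N for _ in range(N)]
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n][k] = -math.sqrt(2 * n + 1) * math.sqrt(2 * k + 1)
            elif n == k:
                A[n][k] = -(n + 1.0)
    return A


def skew_part(N):
    """Candidate skew-symmetric part S (assumed form, see lead-in)."""
    S = [[0.0] * N for _ in range(N)]
    for n in range(N):
        for k in range(N):
            s = 0.5 * math.sqrt(2 * n + 1) * math.sqrt(2 * k + 1)
            if n > k:
                S[n][k] = s
            elif n < k:
                S[n][k] = -s
    return S


def p_vector(N):
    """Rank-one factor with p_n = 2^{-1/2} (2n+1)^{1/2}."""
    return [math.sqrt((2 * n + 1) / 2.0) for n in range(N)]


def is_skew_symmetric(S, tol=1e-12):
    N = len(S)
    return all(abs(S[n][k] + S[k][n]) <= tol for n in range(N) for k in range(N))


def max_decomposition_error(N):
    """Largest entrywise gap between A and -I/2 - S - p p^T."""
    A, S, p = legs_matrix(N), skew_part(N), p_vector(N)
    err = 0.0
    for n in range(N):
        for k in range(N):
            recon = -(0.5 if n == k else 0.0) - S[n][k] - p[n] * p[k]
            err = max(err, abs(A[n][k] - recon))
    return err
```

Calling `max_decomposition_error(8)` should return a value at the level of floating-point round-off, and `is_skew_symmetric(skew_part(8))` should hold, which is what the argument needs: a diagonal part with real part $-\frac{1}{2}$ minus a positive semidefinite rank-one term.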

Up to the diagonal scaling, the LegT matrix is

The first term is skew-symmetric and the second term can be written as $pp^{*}$ for

A.2 Model Architecture

The first portion of the S4 block is the same as the one used in Gu et al. .

$$\begin{aligned}y&=Wy+b\\ y&=x+y\end{aligned}$$

Here $\phi$ is a non-linear activation function, chosen to be GELU in our implementation. Note that all operations aside from the S4 layer are position-wise (with respect to the time or sequence dimension).

These operations are followed by more position-wise operations, which are standard in other deep neural networks such as Transformers (where it is called the feed-forward network) and CNNs (where it is called the inverted bottleneck layer).

$$\begin{aligned}y&=W_{1}y+b_{1}\\ y&=\phi(y)\\ y&=W_{2}y+b_{2}\\ y&=x+y\end{aligned}$$

Here $W_{1}\in\mathbb{R}^{d\times ed}$ and $W_{2}\in\mathbb{R}^{ed\times d}$, where $e$ is an expansion factor. We fix $e=2$ in all our experiments.
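The position-wise feed-forward operations can be sketched in a few lines. This is a minimal plain-Python illustration, not the authors' PyTorch implementation: the function names are hypothetical, vectors are treated as rows so that $W_{1}$ has shape $d\times ed$, and any normalization or other steps the real block may contain are omitted.

```python
import math


def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(y, W, b):
    """Row-vector convention: y (length d_in) times W (d_in x d_out), plus b."""
    return [sum(y[i] * W[i][j] for i in range(len(y))) + b[j]
            for j in range(len(b))]


def feed_forward_block(x, W1, b1, W2, b2):
    """y = W1 y + b1, y = phi(y), y = W2 y + b2, y = x + y at one position."""
    y = linear(x, W1, b1)      # expand d -> e*d
    y = [gelu(v) for v in y]   # non-linearity phi
    y = linear(y, W2, b2)      # project e*d -> d
    return [xi + yi for xi, yi in zip(x, y)]  # residual connection


def apply_positionwise(seq, W1, b1, W2, b2):
    """Apply the same block independently at every time step."""
    return [feed_forward_block(x, W1, b1, W2, b2) for x in seq]
```

Because the block is applied independently at each time step, it mixes information across features only; mixing across time is left entirely to the S4 layer.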

Appendix B Additional Results

We provide details of ablations, including architecture ablations and efficiency benchmarking.

We conduct architectural ablations and efficiency benchmarking for all baselines on the YouTubeMix dataset.

Architectures. SampleRNN-$2$ and SampleRNN-$3$ correspond to the $2$- and $3$-tier models described in Section C.2 respectively. WaveNet-$512$ and WaveNet-$1024$ refer to models with $512$ and $1024$ skip channels respectively, with all other details fixed as described in Section C.2. SaShiMi-$\{2,4,6,8\}$ consist of the indicated number of S4 blocks in each tier of the architecture, with all other details being the same.

Isotropic S4. We also include an isotropic S4 model to ablate the effect of pooling in SaShiMi. Isotropic S4 can be viewed as SaShiMi without any pooling (i.e. no additional tiers aside from the top tier). We note that due to larger memory usage for these models, we use a sequence length of $4$ s for the $4$ layer isotropic model, and a sequence length of $2$ s for the $8$ layer isotropic model (both with batch size $1$ ), highlighting an additional disadvantage in memory efficiency.

Throughput Benchmarking. To measure peak throughput, we track the time taken by models to generate $1000$ samples at batch sizes that vary from $1$ to $8192$ in powers of $2$ . The throughput is the total number of samples generated by a model in $1$ second. Figure 4 shows the results of this study in more detail for each method.
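The benchmarking loop can be sketched as follows. This is an illustrative harness under stated assumptions, not the script used for Figure 4: `dummy_generate` is a hypothetical stand-in for a model's sampling loop, and we assume that throughput counts samples across the whole batch, i.e. batch size times steps divided by wall-clock time.

```python
import time


def batch_sizes(max_batch=8192):
    """Batch sizes 1, 2, 4, ..., max_batch (powers of 2)."""
    sizes, b = [], 1
    while b <= max_batch:
        sizes.append(b)
        b *= 2
    return sizes


def measure_throughput(generate_fn, batch_size, n_steps=1000):
    """Samples per second when generating n_steps samples per sequence."""
    start = time.perf_counter()
    generate_fn(batch_size, n_steps)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return batch_size * n_steps / elapsed


def peak_throughput(generate_fn, max_batch=8192, n_steps=1000):
    """Return (best batch size, its throughput, all results)."""
    results = {b: measure_throughput(generate_fn, b, n_steps)
               for b in batch_sizes(max_batch)}
    best = max(results, key=results.get)
    return best, results[best], results


def dummy_generate(batch_size, n_steps):
    """Hypothetical placeholder for an autoregressive sampling loop."""
    state = [0.0] * batch_size
    for _ in range(n_steps):
        state = [s + 1.0 for s in state]
    return state
```

In a real benchmark, `generate_fn` would wrap the model's generation call and synchronize the accelerator before the timer is stopped.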

Diffusion model ablations. Table 7 reports results for the ablations described in Section 5.3. Experimental details are provided in Section C.2.

Appendix C Experiment Details

We include experimental details, including dataset preparation, hyperparameters for all methods, details of ablations as well as descriptions of automated and human evaluation metrics below.

A summary of dataset information can be found in Table 1. Across all datasets, audio waveforms are preprocessed to $16$ kHz using torchaudio.

Beethoven. The dataset consists of recordings of Beethoven’s $32$ piano sonatas. We use the version of the dataset shared by Mehri et al. , which can be found here. Since we compare to numbers reported by Mehri et al. , we use linear quantization for all (and only) Beethoven experiments. We attempt to match the splits used by the original paper by reference to the code provided here.

YouTubeMix. A $4$ hour dataset of piano music taken from https://www.youtube.com/watch?v=EhO_MrRfftU. We split the audio track into .wav files of $1$ minute each, and use the first $88\%$ of the files for training, the next $6\%$ for validation, and the final $6\%$ for testing.
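A sequential split of this kind can be sketched as below. The helper name and the rounding rule are our assumptions; the text specifies only the percentages and that the files are taken in order.

```python
def split_files(files, train_frac=0.88, val_frac=0.06):
    """Split an ordered list of files into train / validation / test.

    Assumes `files` is already in track order; counts are rounded to the
    nearest integer and the test split takes whatever remains.
    """
    n = len(files)
    n_train = int(round(train_frac * n))
    n_val = int(round(val_frac * n))
    train = files[:n_train]
    val = files[n_train:n_train + n_val]
    test = files[n_train + n_val:]
    return train, val, test
```

For a hypothetical list of 240 one-minute clips, this yields 211 training, 14 validation and 15 test files under the rounding rule above.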

SC09. The Speech Commands dataset contains many spoken words by thousands of speakers under various recording conditions, including some very noisy environments. Following prior work, we use the subset that contains the spoken digits “zero” through “nine”. This SC09 dataset contains 31,158 training utterances (8.7 hours in total) by 2,032 speakers, where each utterance is 1 second long and sampled at 16 kHz. The generative models need to model this variation without any conditional information.

The datasets we used can be found on Huggingface datasets: Beethoven, YouTubeMix, SC09.

C.2 Models and Training Details

For all datasets, SaShiMi, SampleRNN and WaveNet receive $8$ -bit quantized inputs. During training, we use no additional data augmentation of any kind. We summarize the hyperparameters used and any sweeps performed for each method below.
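For reference, the two $8$-bit quantization schemes mentioned in this appendix can be sketched as follows. This is a minimal illustration assuming the standard mu-law companding formula with $\mu=255$ and waveforms normalized to $[-1,1]$; the function names are ours and the exact rounding used in our pipeline may differ.

```python
import math


def linear_encode(x, bits=8):
    """Uniformly quantize x in [-1, 1] to an integer in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    x = max(-1.0, min(1.0, x))
    return int((x + 1.0) / 2.0 * levels + 0.5)


def mu_law_encode(x, bits=8):
    """Mu-law companding followed by uniform quantization."""
    mu = 2 ** bits - 1
    x = max(-1.0, min(1.0, x))
    companded = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((companded + 1.0) / 2.0 * mu + 0.5)


def mu_law_decode(code, bits=8):
    """Invert mu_law_encode up to quantization error."""
    mu = 2 ** bits - 1
    companded = 2.0 * code / mu - 1.0
    magnitude = math.expm1(abs(companded) * math.log1p(mu)) / mu
    return math.copysign(magnitude, companded)
```

Mu-law spends more of the $256$ levels on small amplitudes, whereas linear quantization spaces the levels evenly.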

All methods in the AR setting were trained on single V100 GPU machines.

We adapt the S4 implementation provided by Gu et al. to incorporate parameter tying for $pq^{*}$. For simplicity, we do not train the low-rank term $pp^{*}$, the timescale $dt$, or the $B$ matrix in any of our experiments, and let $\Lambda$ be trained freely. We find that this is stable in practice, but leads to a small degradation in performance compared to the original S4 parameterization. Rerunning all experiments with our updated Hurwitz parameterization (which constrains the real part of the entries of $\Lambda$ using an $\exp$ function) would be expensive, but would improve performance. For all datasets, we use a feature expansion of $2\times$ when pooling, and a feedforward dimension of $2\times$ the model dimension in all inverted bottlenecks in the model. We use a model dimension of $64$. Among the S4 parameters, we train only $\Lambda$ and $C$, with the recommended learning rate of $0.001$, and freeze all others (including $pp^{*}$, $B$, and $dt$). We train with $4\times\rightarrow 4\times$ pooling for all datasets, with $8$ S4 blocks per tier.

On Beethoven, we learn separate $\Lambda$ matrices for each SSM in the S4 block, while we use parameter tying for $\Lambda$ within an S4 block on the other datasets. On SC09, we found that swapping in a gated linear unit (GLU) in the S4 block improved NLL as well as sample quality.

We train SaShiMi on Beethoven for $1$ M steps, YouTubeMix for $600$ K steps, and SC09 for $1.1$ M steps.

We adapt an open-source PyTorch implementation of the SampleRNN backbone, and train it using truncated backpropagation through time (TBPTT) with a chunk size of $1024$ . We train $2$ variants of SampleRNN: a 3-tier model with frame sizes $8,2,2$ with $1$ RNN per layer to match the 3-tier RNN from Mehri et al. and a 2-tier model with frame sizes $16,4$ with $2$ RNNs per layer that we found had stronger performance in our replication (than the 2-tier model from Mehri et al. ). For the recurrent layer, we use a standard GRU model with orthogonal weight initialization following Mehri et al. , with hidden dimension $1024$ and feedforward dimension $256$ between tiers. We also use weight normalization as recommended by Mehri et al. .

We train SampleRNN on Beethoven for $150$ K steps, YouTubeMix for $200$ K steps, and SC09 for $300$ K steps. We found that SampleRNN could be quite unstable, improving steadily and then suddenly diverging. It also appeared to be better suited to training with linear rather than mu-law quantization.

We adapt an open-source PyTorch implementation of the WaveNet backbone, trained using standard backpropagation. We set the number of residual channels to $64$ , dilation channels to $64$ , end channels to $512$ . We use $4$ blocks of dilation with $10$ layers each, with a kernel size of $2$ . Across all datasets, we sweep the number of skip channels among $\{512,1024\}$ . For optimization, we use the AdamW optimizer, with a learning rate of $0.001$ and a plateau learning rate scheduler that has a patience of $5$ on the validation NLL. During training, we use a batch size of $1$ and pad each batch on the left with zeros equal to the size of the receptive field of the WaveNet model ( $4093$ in our case).
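The receptive field of $4093$ follows from the block structure. The short calculation below assumes the usual WaveNet layout, in which the dilation doubles at every layer within a block ($1,2,4,\dots,512$) and resets at the start of the next block; the function name is ours.

```python
def wavenet_receptive_field(n_blocks=4, layers_per_block=10, kernel_size=2):
    """Receptive field (in samples) of stacked dilated causal convolutions.

    Assumes dilation 2**layer within each block, reset for every block.
    Each layer extends the field by (kernel_size - 1) * dilation.
    """
    field = 1
    for _ in range(n_blocks):
        for layer in range(layers_per_block):
            dilation = 2 ** layer
            field += (kernel_size - 1) * dilation
    return field


print(wavenet_receptive_field())  # 4 * (2**10 - 1) + 1 = 4093
```

Each block contributes $2^{10}-1=1023$ samples, so four blocks give $4\times 1023+1=4093$.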

We train WaveNet on Beethoven for $400$ K steps, YouTubeMix for $200$ K steps, and SC09 for $500$ K steps.

C.2.2 Details for Diffusion Models

All diffusion models were trained on 8-GPU A100 machines.

We adapt an open-source PyTorch implementation of the DiffWave model. The DiffWave baseline in Table 6 is the unconditional SC09 model reported in Kong et al., which uses a 36 layer WaveNet backbone with dilation cycle $ $ and hidden dimension $256$, a linear diffusion schedule $\beta_{t}\in[1\times 10^{-4},0.02]$ with $T=200$ steps, and the Adam optimizer with learning rate $2\times 10^{-4}$. The small DiffWave model reported in Table 7 has 30 layers with dilation cycle $ $ and hidden dimension $128$.
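The linear diffusion schedule can be written out explicitly. The sketch below assumes that the $T=200$ values are evenly spaced from $10^{-4}$ to $0.02$ with both endpoints included, which is the common convention but is not stated in the text; the function names are ours.

```python
def linear_beta_schedule(T=200, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced noise levels beta_1, ..., beta_T (endpoints included)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * t for t in range(T)]


def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s <= t} (1 - beta_s), the surviving signal fraction."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out
```

The cumulative products decrease monotonically, so later diffusion steps retain progressively less of the original waveform.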

Our large SaShiMi model has hidden dimension $128$ and $6$ S4 blocks per tier, with the standard two pooling layers with pooling factor $4$ and expansion factor $2$ (Section 4.2). We additionally place S4 layers in the down-blocks as well as the up-blocks of Figure 1. Our small SaShiMi model (Table 7) reduces the hidden dimension to $64$. These architectures were chosen to roughly parameter match the DiffWave model. While DiffWave experimented with other architectures such as deep and thin WaveNets or different dilation cycles, we only ran a single SaShiMi model of each size. All optimization and diffusion hyperparameters were kept the same, with the exception that we manually decayed the learning rate of the large SaShiMi model at $500$ K steps as it had saturated and the model had already caught up to the best DiffWave model (Table 7).

C.3 Automated Evaluations

NLL. We report negative log-likelihood (NLL) scores for all AR models in bits, on the test set of the respective datasets. To evaluate NLL, we follow the same protocol as we would for training, splitting the data into non-overlapping chunks (with the same length as training), running each chunk through a model and then using the predictions made on each step of that chunk to calculate the average NLL for the chunk.
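The chunked NLL computation can be sketched as follows. This is an illustration rather than our evaluation script: the function names are hypothetical, the input is the probability the model assigned to each true sample, and we assume that a trailing remainder shorter than one chunk is dropped.

```python
import math


def chunk_nll_bits(target_probs):
    """Average NLL in bits over one chunk of per-step probabilities."""
    return -sum(math.log2(p) for p in target_probs) / len(target_probs)


def split_into_chunks(sequence, chunk_len):
    """Non-overlapping chunks; a shorter trailing remainder is dropped."""
    n = len(sequence) // chunk_len
    return [sequence[i * chunk_len:(i + 1) * chunk_len] for i in range(n)]


def dataset_nll_bits(per_step_probs, chunk_len):
    """Mean of the per-chunk average NLLs, in bits."""
    chunks = split_into_chunks(per_step_probs, chunk_len)
    return sum(chunk_nll_bits(c) for c in chunks) / len(chunks)
```

As a reference point, a model that predicts a uniform distribution over the $256$ levels of an $8$-bit waveform scores exactly $8$ bits.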

Evaluation of generated samples. Following Kong et al. , we use 4 standard evaluation metrics for generative models evaluated using an auxiliary ResNeXT classifier which achieved 98.3% accuracy on the test set. Note that Kong et al. reported an additional metric NDB (number of statistically-distinct bins), which we found to be slow to compute and generally uninformative, despite SaShiMi performing best.

Fréchet Inception Distance (FID) uses the classifier to compare moments of generated and real samples in feature space.

Inception Score (IS) measures both quality and diversity of generated samples, and favors samples that the classifier is confident on.

Modified Inception Score (mIS) provides a measure of both intra-class and inter-class diversity.

AM Score takes into account the marginal label distribution of the training data, which IS does not.

We also report the Cohen’s inter-annotator agreement $\kappa$ score, which is computed with the classifier as one rater and a crowdworker’s digit prediction as the other rater (treating the set of crowdworkers as a single rater).
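Cohen's $\kappa$ compares the observed agreement between two raters with the agreement expected by chance. A minimal sketch is given below, with the classifier's predictions and the crowdworkers' predictions passed as two aligned label lists; the function name is ours.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A value of $1$ indicates perfect agreement, and $0$ indicates agreement no better than chance.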

Because autoregressive models have tractable likelihood scores that are easily evaluated, we use them to perform a form of rejection sampling when evaluating their automated metrics. For each model, we generated $5120$ samples and ranked them by likelihood score. The lowest-scoring $0.40$ and highest-scoring $0.05$ fractions of samples were thrown out, and the remaining samples were used to calculate the automated metrics.

The two thresholds for the low- and high- cutoffs were found by validation on a separate set of $5120$ generated samples.
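The likelihood-based filtering can be sketched as below. The function name and the rounding of the cutoff counts are our assumptions; the fractions are the ones stated above.

```python
def filter_by_likelihood(samples, scores, low_frac=0.40, high_frac=0.05):
    """Drop the lowest- and highest-likelihood fractions of the samples.

    `scores[i]` is the likelihood score of `samples[i]`; the kept samples
    are returned in order of increasing score.
    """
    order = sorted(range(len(samples)), key=lambda i: scores[i])
    n_low = int(round(low_frac * len(samples)))
    n_high = int(round(high_frac * len(samples)))
    kept = order[n_low:len(order) - n_high]
    return [samples[i] for i in kept]
```

With $5120$ generated samples, this removes $2048$ from the bottom and $256$ from the top of the ranking, leaving $2816$ samples for the metrics.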

C.3.2 Evaluation Procedure for Non-autoregressive Models

Automated metrics were calculated on 2048 random samples generated from each model.

C.4 Evaluation of Mean Opinion Scores

For evaluating mean opinion scores (MOS), we repurpose scripts for creating jobs for Amazon Mechanical Turk from Neekhara et al. .

We collect MOS scores on audio fidelity and musicality, following Dieleman et al. . The instructions and interface used are shown in Figure 6.

The protocol we follow to collect MOS scores for YouTubeMix is outlined below. For this study, we compare the unconditional AR models: SaShiMi, SampleRNN and WaveNet.

For each method, we generated $1024$ unconditional samples, where each sample had length $16$ s ( $1.024$ M steps). For sampling, we directly sample from the distribution output by the model at each time step, without using any other modifications.

As noted by Mehri et al. , autoregressive models can sometimes generate samples that are “noise-like”. To fairly compare all methods, we sequentially inspect the samples and reject any that are noise-like. We also remove samples that mostly consist of silences (defined as more than half the clip being silence). We carry out this process until we have $30$ samples per method.

Next, we randomly sample $25$ clips from the dataset. Since this evaluation is quite subjective, we include some gold standard samples. We add $4$ clips that consist mostly of noise (and should have musicality and quality MOS $\leq 2$ ). We include $1$ clip that has variable quality but musicality MOS $\leq 2$ . Any workers who disagree with this assessment have their responses omitted from the final evaluation.

We construct $30$ batches, where each batch consists of $1$ sample per method (plus a single sample for the dataset), presented in random order to a crowdworker. We use Amazon Mechanical Turk for collecting responses, paying $\$0.50$ per batch and collecting $20$ responses per batch. We use Master qualifications for workers, and restrict to workers with a HIT approval rating above $98\%$. We note that it is likely enough to collect $10$ responses per batch.

C.4.2 Mean Opinion Scores for SC09

Next, we outline the protocol used for collecting MOS scores on SC09. We collect MOS scores on digit intelligibility, audio quality and speaker diversity, as well as asking crowdworkers to classify digits following Donahue et al. . The instructions and interface used are shown in Figure 7.

For each method, we generate $2048$ samples of $1$ s each. For autoregressive models (SaShiMi, SampleRNN, WaveNet), we directly sample from the distribution output by the model at each time step, without any modification. For WaveGAN, we obtained $50000$ randomly generated samples from the authors, and subsampled $2048$ samples randomly from this set. For the diffusion models, we run $200$ steps of denoising following Kong et al. .

We use the ResNeXT model (Section C.3) to classify the generated samples into digit categories. Within each digit category, we choose the top- $50$ samples, as ranked by classifier confidence. We note that this mimics the protocol followed by Donahue et al. , which we established through correspondence with the authors.
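The per-digit selection step can be sketched as follows. The function name and the input format, a list of (sample id, predicted digit, classifier confidence) triples, are hypothetical; only the top-$50$ rule comes from the protocol.

```python
from collections import defaultdict


def top_k_per_class(predictions, k=50):
    """Keep the k most confidently classified samples for each class.

    `predictions` is an iterable of (sample_id, predicted_class, confidence).
    Returns a dict mapping each class to sample ids, most confident first.
    """
    by_class = defaultdict(list)
    for sample_id, label, confidence in predictions:
        by_class[label].append((confidence, sample_id))
    selected = {}
    for label, items in by_class.items():
        ranked = sorted(items, key=lambda t: t[0], reverse=True)
        selected[label] = [sample_id for _, sample_id in ranked[:k]]
    return selected
```

With ten digit classes and $k=50$, this yields at most $500$ samples per method.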

Next, we construct batches consisting of $10$ random samples (randomized over all digits) drawn from a single method (or the dataset). Each method (and the dataset) thus has $50$ total batches. We use Amazon Mechanical Turk for collecting responses, paying $\$0.20$ per batch and collecting $10$ responses per batch. We use Master qualification for workers, and restrict to workers with a HIT approval rating above $98\%$.

Note that we elicit digit classes and digit intelligibility scores for each audio file, while audio quality and speaker diversity are elicited once per batch.