Music Transformer
Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, Douglas Eck
Introduction
A musical piece often consists of recurring elements at various levels, from motifs to phrases to sections such as verse-chorus. To generate a coherent piece, a model needs to reference elements that came before, sometimes in the distant past, repeating, varying, and further developing them to create contrast and surprise. Intuitively, self-attention (Parikh et al., 2016) appears to be a good match for this task. Self-attention over its own previous outputs allows an autoregressive model to access any part of the previously generated output at every step of generation. By contrast, recurrent neural networks have to learn to proactively store elements to be referenced in a fixed size state or memory, potentially making training much more difficult. We believe that repeating self-attention in multiple, successive layers of a Transformer decoder (Vaswani et al., 2017) helps capture the multiple levels at which self-referential phenomena exist in music.
In its original formulation, the Transformer relies on absolute position representations, using either positional sinusoids or learned position embeddings that are added to the per-position input representations. Recurrent and convolutional neural networks instead model position in relative terms: RNNs through their recurrence over the positions in their input, and CNNs by applying kernels that effectively choose which parameters to apply based on the relative position of the covered input representations.
Music has multiple dimensions along which relative differences arguably matter more than their absolute values; the two most prominent are timing and pitch. To capture such pairwise relations between representations, Shaw et al. (2018) introduce a relation-aware version of self-attention which they use successfully to modulate self-attention by the distance between two positions. We extend this approach to capture relative timing and optionally also pitch, which yields improvement in both sample quality and perplexity for JSB Chorales. As opposed to the original Transformer, samples from a Transformer with our relative attention mechanism maintain the regular timing grid present in this dataset. The model furthermore captures global timing, giving rise to regular phrases.
The original formulation of relative attention (Shaw et al., 2018) requires memory where is the sequence length and is the dimension of the model’s hidden state. This is prohibitive for long sequences such as those found in the Piano-e-Competition dataset of human-performed virtuosic, classical piano music. In Section 3.4, we show how to reduce the memory requirements to , making it practical to apply relative attention to long sequences.
The Piano-e-Competition dataset consists of MIDI recorded from performances of competition participants, bearing expressive dynamics and timing on the granularity of < 10 miliseconds. Discretizing time on a fixed grid that would yield unnecessarily long sequences as not all events change on the same timescale. We hence adopt a sparse, MIDI-like, event-based representation from (Oore et al., 2018), allowing a minute of music with 10 milisecond resolution to be represented at lengths around 2K, as opposed to 6K to 18K on a fixed-grid representation with multiple performance attributes. As position in sequence no longer corresponds to time, a priori it is not obvious that relative attention should work as well with such a representation. However, we will show in Section 4.2 that it does improve perplexity and sample quality over strong baselines.
We speculate that idiomatic piano gestures such as scales, arpeggios and other motifs all exhibit a certain grammar and recur periodically, hence knowing their relative positional distances makes it easier to model this regularity. This inductive bias towards learning relational information, as opposed to patterns based on absolute position, suggests that the Transformers with relative attention could generalize beyond the lengths it was trained on, which our experiments in Section 4.2.1 confirm.
Domain contributions We show the first successful use of Transformers in generating music that exhibits long-term structure. Before our work, LSTMs were used at timescales of 15s (~500 tokens) on the Piano-e-Competition dataset (Oore et al., 2018). Our work show that Transformers not only achieve state-of-the-art perplexity on modeling these complex expressive piano performances, but can also generate them at the scale of 60s (~2000 tokens) with remarkable internal consistency. Our relative attention mechanism is essential to the model’s quality. In listening tests (see Section 4.2.3), samples from models with relative self-attention were perceived as more coherent than the baseline Transformer model from Vaswani et al. (2017). Relative attention not only enables Transformers to generate continuations that elaborate on a given motif, but also to generalize and generate in consistent fashion beyond the length it was trained on (see Section 4.2.1). In a seq2seq setup, Transformers can generate accompaniments conditioned on melodies, enabling users to interact with the model.
Algorithmic contributions The space complexity of the relative self attention mechanism in its original formulation (Shaw et al., 2018) made it infeasible to train on sequences of sufficient length to capture long-range structure in longer musical compositions. Addressing this we present a crucial algorithmic improvement to the relative self attention mechanism, dramatically reducing its memory requirements from to . For example, we reduce the memory consumption per layer from GB to MB (per head from GB to MB) for a sequence of length and hidden-state size (per head , where number of heads is ) (see Table 1), allowing us to use GPUs to train the relative self-attention Transformer on long sequences.
Related work
Sequence models have been the canonical choice for modeling music, from Hidden Markov Models to RNNs and Long Short Term Memory networks (e.g., Eck & Schmidhuber, 2002; Liang, 2016; Oore et al., 2018), to bidirectional LSTMs (e.g., Hadjeres et al., 2017). Successful application of sequential models to polyphonic music often requires serializing the musical score or performance into a single sequence, for example by interleaving different instruments or voices. Alternatively, a 2D pianoroll-like representation (see A.1 for more details) can be decomposed into a sequence of multi-hot pitch vectors, and their joint probability distributions can be captured using Restricted Boltzmann Machines (Smolensky, 1986; Hinton et al., 2006) or Neural Autoregressive Distribution Estimators (NADE; Larochelle & Murray, 2011). Pianorolls are also image-like and can be modeled by CNNs trained either as generative adversarial networks (e.g., Dong et al., 2018) or as orderless NADEs (Uria et al., 2014; 2016) (e.g., Huang et al., 2017).
Lattner et al. (2018) use self-similarity in style-transfer fashion, where the self-similarity structure of a piece serves as a template objective for gradient descent to impose similar repetition structure on an input score. Self-attention can be seen as a generalization of self-similarity; the former maps the input through different projections to queries and keys, and the latter uses the same projection for both.
Dot-product self-attention is the mechanism at the core of the Transformer, and several recent works have focused on applying and improving it for image generation, speech, and summarization (Parmar et al., 2018; Povey et al., 2018; Liu et al., 2018). A key challenge encountered by each of these efforts is scaling attention computationally to long sequences. This is because the time and space complexity of self-attention grows quadratically in the sequence length. For relative self-attention (Shaw et al., 2018) this is particularly problematic as the space complexity also grows linearly in the dimension, or depth, of the per-position representations.
Model
We take a language-modeling approach to training generative models for symbolic music. Hence we represent music as a sequence of discrete tokens, with the vocabulary determined by the dataset. Datasets in different genres call for different ways of serializing polyphonic music into a single stream and also discretizing time.
The JSB Chorale dataset consists of four-part scored choral music, which can be represented as a matrix where rows correspond to voices and columns to time discretized to sixteenth notes. The matrix’s entries are integers that denote which pitch is being played. This matrix can than be serialized in raster-scan fashion by first going down the rows and then moving right through the columns (see A.1 for more details). Compared to JSB Chorale, the piano performance data in the Piano-e-Competition dataset includes expressive timing information at much finer granularity and more voices. For the Piano-e-Competition we therefore use the performance encoding proposed by Oore et al. (2018) which consists of a vocabulary of 128 NOTE_ON events, 128 NOTE_OFFs, 100 TIME_SHIFTs allowing for expressive timing at 10ms and 32 VELOCITY bins for expressive dynamics (see A.2 for more details).
2 Background: Self-attention in Transformer
The Transformer decoder is a autoregressive generative model that uses primarily self-attention mechanisms, and learned or sinusoidal position information. Each layer consists of a self-attention sub-layer followed by a feedforward sub-layer.
The attention layer first transforms a sequence of -dimensional vectors into queries , keys , and values , where , , and are each square matrices. Each query, key, and value matrix is then split into parts or attention heads, indexed by , and with dimension , which allow the model to focus on different parts of the history. The scaled dot-product attention computes a sequence of vector outputs for each head as
The attention outputs for each head are concatenated and linearly transformed to get , a by dimensional matrix. A upper triangular mask ensures that queries cannot attend to keys later in the sequence. For other details of the Transfomer model, such as residual connections and learning rates, the reader can refer Vaswani et al. (2017). The feedforward (FF) sub-layer then takes the output from the previous attention sub-layer, and performs two layers of point-wise dense layers on the depth dimension, as shown in Equation 2. are weights and biases of those two layers.
3 Relative positional self-attention
As the Transformer model relies solely on positional sinusoids to represent timing information, Shaw et al. (2018) introduced relative position representations to allow attention to be informed by how far two positions are apart in a sequence. This involves learning a separate relative position embedding of shape , which has an embedding for each possible pairwise distance between a query and key in position and respectively. The embeddings are ordered from distance to , and are learned separately for each head. In Shaw et al. (2018), the relative embeddings interact with queries and give rise to a , an dimensional logits matrix which modulates the attention probabilities for each head as:
We dropped head indices for clarity. Our work uses the same approach to infuse relative distance information in the attention computation, while significantly improving upon the memory footprint for computing . For each head, Shaw et al. (2018) instantiate an intermediate tensor of shape , containing the embeddings that correspond to the relative distances between all keys and queries. is then reshaped to an tensor, and .We assume that the batch size is here. With a batch size of , would be reshaped to and would be computed with a batch matrix–matrix product. This incurs a total space complexity of , restricting its application to long sequences.
4 Memory efficient implementation of relative position-based attention
We improve the implementation of relative attention by reducing its intermediate memory requirement from to , with example lengths shown in Table 1. We observe that all of the terms we need from are already available if we directly multiply with , the relative position embedding. After we compute , its entry contains the dot product of the query in position with the embedding of relative distance . However, each relative logit in the matrix from Equation 3 should be the dot product of the query in position and the embedding of the relative distance , to match up with the indexing in . We therefore need to “skew” so as to move the relative logits to their correct positions, as illustrated in Figure 1 and detailed in the next section. The time complexity for both methods are , while in practice our method is 6x faster at length 650.
Hence, we propose a “skewing” procedure to transform an absolute-by-relative indexed matrix into an absolute-by-absolute indexed matrix. The row indices stay the same while the columns indices are shifted according to the following equation: . For example in Figure 1 the upper right green dot in position of after skewing has a column index of , resulting in a position of in .
We outline the steps illustrated in Figure 1 below.
1. Pad a dummy column vector of length before the leftmost column.
2. Reshape the matrix to have shape . (This step assumes NumPy-style row-major ordering.)
3. Slice that matrix to retain only the last rows and all the columns, resulting in a matrix again, but now absolute-by-absolute indexed, which is the that we need.
5 Relative local attention
For very long sequences, the quadratic memory requirement of even baseline Transformer is impractical. Local attention has been used for example in Wikipedia and image generation (Liu et al., 2018; Parmar et al., 2018) by chunking the input sequence into non-overlapping blocks. Each block then attends to itself and the one before, as shown by the smaller thumbnail on the top right corner of Figure 2.
To extend relative attention to the local case, we first note that the right block has the same configuration as in the global case (see Figure 1) but much smaller: (where is the number of blocks, and be the resulting block length) as opposed to . The left block is unmasked with relative indices running from -1 (top right) to - (bottom left). Hence, the learned for the local case has shape .
Similar to the global case, we first compute and then use the following procedure to skew it to have the same indexing as , as illustrated in Figure 2.
1. Pad a dummy column vector of length after the rightmost column.
2. Flatten the matrix and then pad with a dummy row of length .
3. Reshape the matrix to have shape .
4. Slice that matrix to retain only the first rows and last columns, resulting in a matrix.
Experiments
J.S. Bach chorales is a canonical dataset used for evaluating generative models for music J.S. Bach chorales dataset: https://github.com/czhuang/JSB-Chorales-dataset (e.g., Allan & Williams, 2005; Boulanger-Lewandowski et al., 2012; Liang, 2016; Hadjeres et al., 2016; Huang et al., 2017). It consists of score-based four-part chorales. We first discretize the scores onto a 16th-note grid, and then serialize it by iterating through all the voices within a time step and then advancing time (see A.1 for more details). As there is a direct correspondence between position in sequence and position on the timing/instrument grid in a piece, adding relative position representations could make it easier to learn this grammar. We indeed see relative attention drastically improve negative log-likelihood (NLL) over baseline Transformer (Table 2). This improvement is also reflected in sample quality. The samples now maintain the necessary timing/instrument grid, always advancing four steps before advancing in time. As local timing is maintained, the model is able to capture timing on a more global level, giving rise to regular phrasing, as shown in Figure 3.
In addition to relative attention, we also explored enhancing absolute timing through concatenating instead of adding the sinusoids to the input embeddings. This allows the model to more directly learn its absolute positional mapping. This further improves performance for both the baseline and relative transformer (Table 2). We compare against Coconet as it is one of the best-performing models that has also been evaluated on the 16-note grid using the canonical dataset split. To directly compare, we re-evaluated Coconet to obtain note-wise losses on the validation set Some earlier papers report frame-wise losses to compare to models such as RNN-RBM which model “chords”. Coconet can be evaluated under note-wise or frame-wise losses.. For the Transformer models (abbreviated as TF), we implemented our attention mechanisms in the Tensor2Tensor framework (Vaswani et al., 2018). We use 8 heads, and keep the query, key (att) and value hidden size (hs) fixed within a config. We tuned number of layers (L in {4,5,6}), attention hidden size (att in {256, 512}) and pointwise feedforward hidden size (ff in {512, 1024}).
A musical event bears multiple attributes, such as timing, pitch, instrument etc. To capture more relational information, we extend relative attention to capture pairwise distances on additional attributes. We learn separate relative embeddings for timing and also pitch . has entries corresponding to how many sixteenth notes apart are two positions, while embeds the pairwise pitch interval. However this approach is not directly scalable beyond J.S. Bach Chorales because it involves explicitly gathering relative embeddings for and , resulting in a memory complexity of as in Shaw et al. (2018). This is due to relative information being computed based on content as opposed to content-invariant information such as position in sequence. It was sufficient to add the extra timing signals to the first layer, perhaps because it is closest to the raw input content. Here, the relative logits are computed from three terms, in contrast with other layers that only have one term, .
2 Piano-e-Competition
We use the first 6 years of of Piano-e-Competition because these years have corresponding MIDI data released Piano-e-Competition dataset (competition history): http://www.piano-e-competition.com/, resulting in about 1100 pieces, split 80/10/10. Each piece is MIDI data capturing a classical piano performance with expressive dynamics and timing, encoded with the MIDI-like representation described in Section A.2. We trained on random crops of 2000-token sequences and employed two kinds of data augmentation: pitch transpositions uniformly sampled from half-steps, and time stretches uniformly sampled from the set .
We compare to Magenta’s PerformanceRNN (LSTM, which first used this dataset) (Oore et al., 2018) and LookBack RNN (LSTM with attention) (Waite, 2016). LookBack RNN uses an input representation that requires monophonic music with barlines which is information that is not present in performed polyphonic music data, hence we simply adopt their architecture. Table 3 shows that Transformer-based architectures fits this dataset better than LSTM-based models.
We implemented our attention mechanisms in the Tensor2Tensor framework (Vaswani et al., 2018), and used the default hyperparameters for training, with 0.1 learning rate, 0.1 dropout, and early stopping. We compare four architectures, varying on two axes: global versus local, and regular versus relative attention. We found that reducing the query and key hidden size (att) to half the hidden size (hs) works well and use this relationship for all of the models, while tuning on number of layers (L) and filter size (fs). We use block size (bs) 512 for local attention. We set the maximum relative distance to consider to half the training sequence length for relative global attention, and to the full memory length (which is two blocks) for relative local attention. Table 3 show that relative attention (global or local) outperforms regular self-attention (global or local). All else being equal, local and global attention perform similarly. Each though local attention does not see all the history at once, it can build up a larger receptive field across layers. This can be an advantage in the future for training on much longer sequences, as local attention requires much less memory.
When primed with an initial motif (Chopin’s Étude Op. 10, No. 5) shown in the top left corner of Figure 4, we see the models perform qualitatively differently. Transformer with relative attention elaborates the motif and creates phrases with clear contour which are repeated and varied. Baseline Transformer uses the motif in a more uniform fashion, while LSTM uses the motif initially but soon drifts off to other material. Note that the generated samples are twice as long as the training sequences. Relative attention was able to generalize to lengths longer than trained but baseline Transformer deteriorates beyond its training length. See Appendix C for visualizations of how the our Relative Transformer attends to past motifs.
2.2 Harmonization: Conditioning on melody
2.3 Human evaluations
To compare the perceived sample quality of models trained on the Piano-e-Competition dataset, and their ability to generate a continuation for a priming sequence, we carried out a listening test study comparing the baseline Transformer, our Transformer with relative-attention, PerformanceRNN (LSTM), and the validation set. Participants were presented with two musical excerpts (from two different models that were given the same priming sequence) and asked to rate which one is more musical on a Likert scale. For each model, we generated 10 samples each with a different prime, and compared them to three other models, resulting in 60 pairwise comparisons. Each pair was rated by 3 different participants, yielding a total of 180 comparisons.
Figure 5 shows the number of comparisons in which an excerpt from each model was selected as more musical. The improvement in sample quality from using relative attention over the baseline Transformer model was statistically significant (see Appendix B for the analysis), both in aggregate and between the pair. Even though in aggregate LSTMs performed better in the study than the Transformer, despite having higher perplexity, but when compared against each other head to head, the results were not statistically significant (see Table 5 in Appendix B).
Conclusion
In this work we demonstrated that the Transformer equipped with relative attention is very well-suited for generative modeling of symbolic music. The compelling long-term structure in the samples from our model leaves us enthusiastic about this direction of research. Moreover, the ability to expand upon a primer, in particular, suggests potential applications as creative tool.
The significant improvement from relative attention highlights a shortcoming of the original Transformer that might also limit its performance in other domains. Improving the Transformer’s ability to capture periodicity at various time scales, for instance, or relations between scalar features akin to pitch could improve time-series models. Our memory-efficient implementation enables the application of relative attention to much longer sequences such as long texts or even audio waveforms, which significantly broadens the range of problems to which it could be applied.
Acknowledgement
We thank many colleagues from the Transformer (Vaswani et al., 2017) and Tensor2Tensor (Vaswani et al., 2018) papers for helping us along the way: Lukasz Kaiser, Ryan Sepassi, Niki Parmar and Llion Jones. Many thanks to Magenta and friends for their support throughout and for many insightful discussions: Jesse Engel, Adam Roberts, Fred Bertsch, Erich Elsen, Sander Dieleman, Sageev Oore, Carey Radebaugh, Natasha Jaques, Daphne Ippolito, Sherol Chan, Vida Vakilotojar, Dustin Tran, Ben Poole and Tim Cooijmans.
References
Appendix A Domain-specific representations
Adapting sequence models for music requires making decisions on how to serialize a polyphonic texture. The data type, whether score or performance, makes certain representations more natural for encoding all the information needed while still resulting in reasonable sequence lengths.
A.2 MIDI-like event-based (Piano-e-Competition)
The second dataset, Piano-e-Competition, consists of polyphonic piano performances with expressive timing and dynamics. The time resolution here is on the millisecond level, so a grid representation would result in sequences that are too long. Instead, the polyphonic performance is serialized into a sequence of one hot encoded events as proposed in (Oore et al., 2018).
First, the input MIDI files are preprocessed to extend note durations based on sustain pedal control events. The sustain pedal is considered to be down whenever a sustain control change is encountered with a value ; the sustain pedal is then considered up after a control change with a value . Within a period where the sustain pedal is down, the duration of each note is extended to either the beginning of the next note of the same pitch or the end of the sustain period, whichever happens first. If the original duration extends beyond the time when the sustain pedal is down, that original duration is used.
Next, the MIDI note events are converted into a sequence from the following set of vocabulary: 128 NOTE_ON events for starting a note of with one of the 128 MIDI pitches, 128 NOTE_OFF events for ending a note with one of the 128 MIDI pitches, 100 TIME_SHIFT events representing forward time shifts in 10ms increments from 10ms to 1s, and 32 SET_VELOCITY events representing the velocity for future NOTE_ON events in the form of the 128 possible MIDI velocities quantized into 32 bins. An example performance encoding is illustrated in Figure 7.
Appendix B Supplement of listening test
Participants were presented with two musical excerpts that shared a common priming sequence. For each excerpt, the priming sequence was played, followed by 2.5 seconds of silence, followed by the priming sequence again and a continuation of that sequence. The continuations were either sampled from one of the models or extracted from our validation set. We evaluated all possible pairs in the space of data and model samples, except from the same model. Each continuation had a length of 512 events using the encoding described in Section A.2. This corresponds to the length the models were trained on to remove the deteriorating effect that happens with baseline Transformer when asked to generate beyond the length it was trained on. Participants were asked which excerpt they thought was more musical on a Likert scale of 1 to 5. The pair is laid out left versus right, with 1 indicating the left is much more musical, 2 the left is slightly more musical, 3 being a tie, 4 being the right is slightly more musical, and 5 the right is much more musical. For each model, we generated 10 samples each with a different prime, and compared them to three other models, resulting in 60 pairwise comparisons. Each pair was rated by 3 different participants, yielding a total of 180 comparisons.
B.2 Analysis
A Kruskal-Wallis H test of the ratings showed that there was a statistically significant difference between the models: e-14. Table 5 show a post-hoc analysis on the comparisons within each pair, using the Wilcoxon signed-rank test for matched samples. Table 6 shows a post-hoc analysis of how well each model performed when compared to all pairs, and compares each model’s aggregate against each other, using the Mann–Whitney U test for independent samples. We use a Bonferroni correction on both to correct for multiple comparisons. The win and loss counts bucket scores 4, 5 and scores 1, 2 respectively, while the tieing score is 3.
Both within pairs and between aggregates, participants rated samples from our relative Transformer as more musical than the baseline Transformer with .
For within pairs, we did not observe a consistent statistically significant difference between the other model pairs, baseline transformer versus LSTM and LSTM versus relative Transformer.
When comparing between aggregates, LSTM was overall perceived as more musical than baseline Transformer. Relative Transformer came a bit close to outperforming LSTM with . When we listen to the samples from the two, they do sound qualitatively different. Relative Transformer often exhibits much more structure (as shown in Figure 4), but the effects were probably less pronounced in the listening test because we used samples around 10s to 15s, which is half the length of those shown in Figure 4 to prevent the baseline Transformer from deteriorating. This weakens the comparison on long-term structure.
When compared to real music from the validation set, we see that in aggregates, real music was better than LSTM and baseline Transformer. There was no statistical significant difference between real music and relative Transformer. This is probably again due to the samples being too short as real music is definitely still better.
Appendix C Visualizing softmax attention
One advantage of attention-based models is that we can visualize its attention distribution 3. This gives us a glimpse of how the model might be building up recurring structures and how far it is attending back. The pianorolls in the visualizations below is a sample generated from Transformer with relative attention. Each figure shows a query (the source of all the attention lines) and previous memories being attended to (the notes that are receiving more softmax probabiliy is highlighted in). The coloring of the attention lines correspond to different heads and the width to the weight of the softmax probability.