AudioPaLM: A Large Language Model That Can Speak and Listen

Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, Neil Zeghidour, Yu Zhang, Zhishuai Zhang, Lukas Zilka, Christian Frank

cs.CL cs.AI cs.SD eess.AS stat.ML

Introduction

Large language models (LLMs) [Brown et al., 2020, Rae et al., 2021, Chowdhery et al., 2022] excel at generating text for tasks that require the modeling of complex interactions as well as knowledge retrieval, such as open-domain question answering or few-shot machine translation [Anil et al., 2023]. The remarkable generative abilities of the underlying system, a Transformer [Vaswani et al., 2017] trained to predict sequences of discrete tokens, have subsequently been extended to continuous, natural signals: images [Yu et al., 2022b] or audio waveforms [Lakhotia et al., 2021, Kreuk et al., 2022, Wang et al., 2023] are converted into a stream of discrete units through a lossy compression algorithm and then modeled sequentially, as text would be.

In the context of audio generation, the AudioLM framework [Borsos et al., 2022] has introduced a hierarchical approach which combines two types of audio tokens, with high-level coarse tokens extracted from self-supervised embeddings [Chung et al., 2021] being used to condition the generation of lower-level codes of a neural codec [Zeghidour et al., 2021]. This general framework, which makes few assumptions about the nature of the modeled audio signals, has been used to generate speech and music [Kharitonov et al., 2023, Agostinelli et al., 2023, Donahue et al., 2023]. In the particular case of text-to-music [Agostinelli et al., 2023] or text-to-speech [Kharitonov et al., 2023], a Transformer model takes text tokens as inputs and generates audio tokens, such that text and audio vocabularies do not interact with each other. Such models could naturally be converted into, respectively, music captioning and speech recognition systems by swapping their inputs and outputs. Following this observation, combining text and audio vocabularies into a single multimodal vocabulary would allow for training a single model in both directions.

In this work, we introduce AudioPaLM, a multimodal generative model of speech and text. At the heart of AudioPaLM is a joint vocabulary that can represent speech and text with a limited number of discrete tokens which, combined with an elementary markup description of tasks, allows training a single decoder-only model on a mixture of tasks that involve arbitrarily interleaved speech and text. This includes speech recognition, text-to-speech synthesis, and speech-to-speech translation, unifying tasks that are traditionally solved by heterogeneous models into a single architecture and training run. Moreover, as the underlying architecture of AudioPaLM is a large Transformer model, we can initialize its weights with those of a large language model pretrained on text, which allows it to benefit from the linguistic and common sense knowledge of models such as PaLM [Chowdhery et al., 2022] or PaLM 2 [Anil et al., 2023]. In particular, we show in Section 5.4.8 how the model’s translation capability is derived from the translation capability of the underlying text model. The contributions of this work are:

We present a unified speech-text LLM, capable of consuming and producing both speech and text, and leveraging the existing capabilities of PaLM [Chowdhery et al., 2022] and PaLM-2 [Anil et al., 2023] coming from text-only pretraining.

This unified approach across modalities allows training AudioPaLM on a mixture of tasks such as Automatic Speech Recognition (ASR), Automatic Speech Translation (AST) and Speech-to-Speech Translation (S2ST), achieving state-of-the-art results on AST and S2ST benchmarks, and competitive performance on ASR benchmarks.

Leveraging AudioLM’s audio prompting [Borsos et al., 2022], our model performs S2ST with voice transfer of unseen speakers, surpassing existing methods in terms of speech quality and voice preservation, as measured by both objective and subjective evaluations.

Our model exhibits zero-shot capabilities, performing AST with speech input/target language combinations that were not seen in training.

The remainder of this paper is organized as follows: in Section 2 we discuss the relation to existing work. In Section 3 we describe our method. In Section 4 we provide details about the data we use, and other technical details as a prelude to the experiments. In Section 5 we present our experimental results including a series of ablations to determine the influence of various design choices. We conclude in Section 6.

Related work

Encoder-based models are used to learn features which can be used for downstream tasks. By learning joint representations of both modalities together, the goal is that in addition to the learned features being richer than they would be with each modality treated separately, they are aligned with one another, improving their performance when used for inter-modality tasks. Such approaches have been applied in audio [Chen et al., 2022c, Bapna et al., 2022, Zhang et al., 2023a] and in vision [Chen et al., 2020, Gan et al., 2020, Fu et al., 2021] as well as combining both audio and video inputs [Shi et al., 2022].

Similar to BERT [Devlin et al., 2018], such encoders may be trained with a masked language model objective for both the multimodal setting as in previously mentioned works and for the unimodal setting [Baevski et al., 2020, Hsu et al., 2021, Chiu et al., 2022]. They may alternatively be trained in a contrastive manner [Radford et al., 2021, Yuan et al., 2021, Yu et al., 2022a] resulting in separate encoders for each modality with each informed by the other due to the contrastive objective.

A line of work on multimodal encoder-decoder models (also known as Vision Language Models in the vision literature) has sought to fuse text-decoders with advances in non-text encoder models. Examples include Flamingo [Alayrac et al., 2022] and PaLI [Chen et al., 2022b] in the vision domain, and Whisper [Radford et al., 2022] in the audio domain. The general idea of these approaches is to take an audio or vision encoder and a text decoder and to combine them, either with adapter layers as in Flamingo and Whisper, or by merging via a separate encoder as in PaLI.

Both PaLI and Flamingo use pretrained components. The advantage of this is that individual components can be frozen while finetuning the model on multimodal data (Whisper does not use a pretrained encoder or decoder and so does not freeze individual components). The disadvantage is that such models are constrained to only output text, since the decoder is text-only. In contrast, our proposed approach results in a decoder-only model which models sequences of arbitrary audio and text tokens. This is similar to the approach taken by Wang et al. except that we use a single decoder-only model and all audio seen by the model is tokenized, whereas Wang et al. use an encoder-decoder architecture and use continuous inputs and tokenized outputs for images.

2.2 Generating audio with language models

Recent work [Lakhotia et al., 2021, Wang et al., 2023] has explored generating speech by modeling discretized representations as target tokens of an autoregressive Transformer [Vaswani et al., 2017] network. Such discrete tokens can be extracted from self-supervised speech representations [Oord et al., 2018, Baevski et al., 2020, Hsu et al., 2021, Chung et al., 2021], modeling long-term patterns in audio sequences while providing limited reconstruction quality, or from a neural codec [Zeghidour et al., 2021, Défossez et al., 2022], providing high-fidelity reconstruction but with less temporal structure. AudioLM [Borsos et al., 2022] addresses this dichotomy by introducing a hierarchical approach, where a first stage produces “semantic” tokens from a self-supervised w2v-BERT system [Chung et al., 2021], which a second stage then uses as conditioning to generate the “acoustic” tokens of a SoundStream [Zeghidour et al., 2021] neural codec. This joint modeling of semantic and acoustic tokens allows the model to learn linguistic structure from the syntactic to the lexical and phonetic levels from speech-only corpora, without any textual guidance, while generating realistic speech from arbitrary speakers and in diverse acoustic conditions.

SPEAR-TTS [Kharitonov et al., 2023] combines the decoder-only generator of AudioLM with a text encoder, such that the model can perform text-to-speech synthesis. By leveraging pretraining and backtranslation [Sennrich et al., 2016], SPEAR-TTS can be trained with only 15 minutes of labeled speech. The ability of this model to learn a mapping between text and semantic tokens in such a low-data regime suggests that these representations are very close, yet the model’s encoder-decoder architecture specifically ingests text and outputs audio, such that the two token vocabularies (text and semantic) are disjoint and modeled separately. SpeechLM [Hassid et al., 2023] also exploits the similarity between text and semantic tokens by initializing a decoder-only audio generator with the weights of a pretrained text-based language model. While this allows some transfer of knowledge from text to speech modeling, the resulting architecture is not multimodal: semantic tokens replace the text vocabulary, rather than extending it, and the model is finetuned on speech-only data. AudioPaLM bridges these gaps and combines semantic tokens and text into an extended, multimodal set of tokens used interchangeably as inputs and outputs, such that text-only language model pretraining can be used to initialize a decoder-only model that can then be finetuned on a mixture of tasks that map freely between speech and text (e.g. speech-to-text, text-to-speech or speech-to-speech).

2.3 Speech-to-speech translation

The field of speech-to-speech translation (S2ST) focuses on converting spoken language from one language to another, facilitating communication between individuals who speak different languages. Conventional automatic speech-to-speech translation systems are typically composed of a cascade of three components: automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis [Lavie et al., 1997, Wahlster, 2000, Nakamura et al., 2006]. However, these cascade-based approaches primarily focus on the text and may overlook important aspects such as para-linguistic features, computational efficiency, compound errors, and the accurate handling of proper names, nouns, and non-verbal communication that do not require translation.

Direct speech-to-speech translation systems [Jia et al., 2019b, Kano et al., 2021, Jia et al., 2022b] are trained end-to-end, operating in the audio spectrogram domain without relying on text representations at inference time. In these systems, the synthesized audio has access to acoustic information in the source speech, so the model can potentially learn to preserve acoustic features and reduce compound errors and computational requirements.

There are other cascaded S2ST systems that use learned discrete speech representations as an intermediate representation [Tjandra et al., 2019, Zhang et al., 2021, Lee et al., 2022, Ma et al., 2021, Lee et al., 2021]. In these systems, translation operates in the learned discrete representation space, which allows alignment to be learned in the discrete domain and makes it easier to leverage text pre-training. Lastly, other S2ST approaches improve on performance, efficiency, and data requirements: Jia et al. [2022a] and Wei et al. [2022b] leveraged weakly supervised data and component pre-training to improve translation accuracy while requiring little parallel speech data.

Method

We use a decoder-only Transformer to model sequences consisting of text and audio tokens. As far as the model is concerned, text and audio are just sequences of arbitrary integers, as the inputs are tokenized before feeding to the model, and any outputs are detokenized before being returned to a user of the model. By representing speech with discrete tokens in a finite vocabulary, we can build a multimodal vocabulary which is the union of this audio vocabulary and a SentencePiece [Kudo and Richardson, 2018b] one used to represent text. Thus, in principle there is almost no difference between our setting and the usual decoder-only setup for pure text, except that in our setting some of the tokens represent audio and some text, and we initialize our multimodal model using a pretrained text-only checkpoint.

The overall model is described in Figure 1. In the rest of this section we describe the main steps of the model: first, how text and audio inputs are tokenized; second, how we modify existing pretrained text decoders to also model audio; and third, how we convert the model output into raw audio. Since the first and third steps are identical to the process used by Borsos et al. [2022, 2023], we keep our explanation of these points high-level and refer the reader to those papers for further details.

Finally, we describe how we finetune AudioPaLM on a mixture of combined speech and text tasks including speech recognition and translation from or into either speech or text.

We follow the process of Lakhotia et al. [2021] and Borsos et al. [2022] to convert raw waveforms into tokens. This involves extracting embeddings from an existing speech representation model and subsequently discretizing those embeddings into a limited set of audio tokens. Borsos et al. [2022] extract embeddings from the w2v-BERT model [Chung et al., 2021] and quantize them via k-means. In this work, we experiment with the following approaches to obtain a set of discrete audio tokens.

w2v-BERT: We follow the procedure described in Borsos et al. [2022] with two modifications. First, we use a w2v-BERT model that has been trained on multilingual data, as opposed to the English-only setting of Borsos et al. [2022]. Second, we do not normalize the embeddings before performing the k-means clustering. While Borsos et al. [2022] found that the normalization removed speaker-identity information without degrading performance, we found that in the multilingual setting normalization did cause degradation. This method produces tokens at a rate of 25 Hz and the token vocabulary is of size 1024.

USM-v1: We perform the same procedure with the more performant Universal Speech Model (USM) encoder [Zhang et al., 2023a] instead of the w2v-BERT encoder. We use the largest 2B parameter variant of this multilingual speech encoder and extract embeddings from the middle layer. Similar to w2v-BERT, this method produces tokens at a rate of 25 Hz and the token vocabulary is of size 1024.

USM-v2: We additionally experiment with a quantizer that is trained with an auxiliary ASR loss. This version has been finetuned further to provide better multilingual performance. As with USM-v1, this method accepts raw audio as input and returns a sequence of integers with length proportional to the length of the audio as output.
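The discretization step shared by these tokenizers can be illustrated as a nearest-centroid lookup. The following is a minimal sketch, assuming frame-level embeddings are already available as a NumPy array. The random centroids stand in for a fitted k-means codebook, and the function name `quantize`, the embedding dimension, and the toy data are illustrative assumptions rather than details of the actual tokenizers. No normalization is applied before the lookup, in line with the modification described for the multilingual setting.

```python
import numpy as np


def quantize(embeddings: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each frame embedding to the index of its nearest centroid.

    embeddings: (num_frames, dim) encoder outputs (hypothetical data here).
    centroids:  (vocab_size, dim) codebook, e.g. obtained with k-means.
    Returns a (num_frames,) integer array of audio tokens.
    """
    # Squared Euclidean distance between every frame and every centroid.
    diffs = embeddings[:, None, :] - centroids[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    return np.argmin(dists, axis=-1)


rng = np.random.default_rng(0)
centroids = rng.normal(size=(1024, 16))  # vocabulary of 1024 audio tokens
frames = rng.normal(size=(75, 16))       # 3 seconds of audio at 25 frames per second
tokens = quantize(frames, centroids)     # one integer token per frame
```

At a token rate of 25 Hz, three seconds of audio yield 75 tokens, each in the range 0 to 1023.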

3.2 Modifying text-only decoders to model both text and audio

In a Transformer decoder, the first layer of the model after input preprocessing is the token embeddings matrix $\mathbf{E}$ which maps integer-valued tokens to dense embeddings; given a vocabulary of $t$ tokens and embeddings of size $m$, $\mathbf{E}$ is a $t\times m$ matrix whose $i$-th row gives the embedding for the $i$-th token. Another embeddings matrix $\mathbf{E}^{\prime}$ appears in the final softmax layer used to compute the logits over all tokens at each position; it is an $m\times t$ matrix which is multiplied with the $m$-dimensional output of the model to obtain a $t$-dimensional vector of logits, one for each of the tokens. In the PaLM architecture, these matrices have shared variables, so that one is the transpose of the other, that is, $\mathbf{E}^{\prime}=\mathbf{E}^{\intercal}$.

The rest of the decoder architecture is completely agnostic to the number of tokens modelled. Therefore we only need to make one small modification to turn a text-only model into one that models both text and audio: we expand the size of the embeddings matrix $\mathbf{E}$ to be of size $(t+a)\times m$, where $a$ is the number of audio tokens (the size of $\mathbf{E}^{\prime}=\mathbf{E}^{\intercal}$ changes accordingly).

In order to make use of pretrained text models, we change the existing model checkpoints by adding $a$ new rows to the embeddings matrix $\mathbf{E}$. An implementation detail is that the first $t$ tokens (indices $0$ to $t-1$) correspond to the SentencePiece text tokens while the next $a$ tokens (indices $t$ to $t+a-1$) represent audio tokens. While we can re-use the text embeddings of the pre-trained model, the new audio embeddings are freshly initialized and must be trained. We found it necessary to train all model parameters rather than keeping the previous weights fixed. We train using mixed speech and text tasks, as detailed in Section 4. In Section 5.4.2 we show how adding audio tokens to a text-pretrained checkpoint in the above manner is highly beneficial for performance on the considered speech and text tasks (compared to re-training from scratch). For further details about the PaLM architecture we refer the reader to Section 2 of [Chowdhery et al., 2022].
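The vocabulary expansion can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the actual checkpoint surgery: the helper names `expand_embeddings` and `audio_token_id`, the toy sizes, and in particular the small Gaussian initialization of the new rows are assumptions made for the example.

```python
import numpy as np


def expand_embeddings(text_emb: np.ndarray, num_audio_tokens: int,
                      rng: np.random.Generator, scale: float = 0.02) -> np.ndarray:
    """Append `num_audio_tokens` new rows to a (t, m) embedding matrix."""
    _, m = text_emb.shape
    # Assumption: small Gaussian initialization, chosen only for illustration.
    audio_emb = rng.normal(scale=scale, size=(num_audio_tokens, m))
    # Text rows keep their pretrained values and their original indices.
    return np.concatenate([text_emb, audio_emb], axis=0)


def audio_token_id(audio_code: int, text_vocab_size: int) -> int:
    """Audio code k is stored at row t + k of the joint vocabulary."""
    return text_vocab_size + audio_code


rng = np.random.default_rng(0)
t, m, a = 100, 8, 1024             # toy sizes: text vocabulary, embedding size, audio vocabulary
E_text = rng.normal(size=(t, m))   # stands in for the pretrained text embeddings
E = expand_embeddings(E_text, a, rng)

hidden = rng.normal(size=(m,))     # final-layer output at one position
logits = E @ hidden                # tied output projection, since E' is the transpose of E
```

Because the input and output embeddings are shared, adding rows to $\mathbf{E}$ extends both the set of tokens the model can read and the set of tokens it can emit.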

3.3 Decoding audio tokens to raw audio

To synthesize an audio waveform from audio tokens, we experimented with two different methods: i) autoregressive decoding, following the setup of AudioLM [Borsos et al., 2022] and ii) non-autoregressive decoding, using the recently proposed SoundStorm model [Borsos et al., 2023]. In both cases the audio tokens are first used to generate SoundStream tokens [Zeghidour et al., 2021], which are then converted to an audio waveform with a convolutional decoder.

The acoustic generation in AudioLM proceeds in two stages: “Stage 2” is a decoder-only Transformer model that takes the audio tokens produced by AudioPaLM and a voice conditioning as input, and generates SoundStream tokens that can be used to materialize the speech in the desired voice, but at a very low bitrate. “Stage 3” reconstructs higher levels of SoundStream’s residual vector quantizer, which increases the bitrate and improves the audio quality. We use the same hyperparameters and the training process as in [Kharitonov et al., 2023].

SoundStorm proposes an alternative non-autoregressive decoding scheme, which applies an iterative method that proceeds in parallel on all tokens. SoundStorm produces audio of the same quality as AudioLM, but with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster.

In both cases we train on Multilingual LibriSpeech [Pratap et al., 2020] and the voice conditioning is supplied as a 3-second long voice sample, represented as both audio tokens and SoundStream tokens. By providing part of the original input speech as the voice conditioning, the model is able to preserve the original speaker’s voice when translating their speech to a different language (see Section 5). Whenever the original audio is shorter than 3 seconds, it is repeated to reach the required duration.
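The preparation of a fixed-length voice prompt can be sketched as follows. This is a minimal sketch on raw samples, assuming the prompt is taken from the start of the utterance when the audio is long enough; the source only states that part of the input speech is used, so that choice, along with the function name `make_voice_prompt`, is an assumption.

```python
from typing import List, Sequence


def make_voice_prompt(samples: Sequence[float], sample_rate: int,
                      prompt_seconds: float = 3.0) -> List[float]:
    """Return exactly `prompt_seconds` of audio, repeating short inputs."""
    if not samples:
        raise ValueError("cannot build a voice prompt from empty audio")
    target = int(round(prompt_seconds * sample_rate))
    # Ceiling division: number of copies needed to cover the target length.
    repeats = -(-target // len(samples))
    # Assumption: long inputs are truncated to their first `target` samples.
    return (list(samples) * repeats)[:target]


# Toy example at 2 samples per second, so a 3-second prompt has 6 samples.
short_prompt = make_voice_prompt([0.1, 0.2, 0.3, 0.4], sample_rate=2)
long_prompt = make_voice_prompt([float(i) for i in range(10)], sample_rate=2)
```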

3.4 Training tasks

We apply our method to the problems of speech recognition, speech synthesis and speech-to-speech translation. All datasets used in this report are speech-text datasets which contain a subset of the following fields.

Audio: the speech in the source language.

Transcript: a transcript of the speech in Audio.

Translated audio: the spoken translation of the speech in Audio.

Translated transcript: the written translation of the speech in Audio.

The component tasks that we consider in this report are:

ASR (automatic speech recognition): transcribing the audio to obtain the transcript.

AST (automatic speech translation): translating the audio to obtain the translated transcript.

S2ST (speech-to-speech translation): translating the audio to obtain the translated audio.

TTS (text-to-speech): reading out the transcription to obtain the audio.

MT (text-to-text machine translation): translating the transcript to obtain the translated transcript.

A dataset including more than two of the fields may be used for multiple possible tasks. As explored in the experiment of Section 5.4.1, we found that including multiple tasks (for example, both ASR and AST) from the same dataset resulted in improved performance.

Following Raffel et al., we signal to the model which task it should perform on a given input by prefixing the input with a tag specifying the task and the English name of the language of the input and, optionally, the language of the output if it is different.

For example, to query the model to perform ASR on an utterance in French, the tokenized audio input would be preceded by the tag [ASR French]. To perform TTS in English, the text would be preceded by [TTS English]. To perform S2ST from English to French, the tokenized English audio would be preceded by [S2ST English French]. The tag is tokenized using the normal text tokenizer of the model; we do not introduce special tokens to express the task or the languages involved. We found that changing task names to be more human-readable, such as using transcribe the following French audio instead of [ASR French], does not change the performance of the model. Naming the language in the task – compared to just using generic tags like transcribe audio or [ASR] – is not ultimately required but is beneficial for low-resource languages.
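The tag format used in these examples can be reproduced with a small helper. This is a sketch of the string format only; the function name `task_tag` is hypothetical, and the resulting string would still need to be passed through the model's text tokenizer.

```python
from typing import Optional


def task_tag(task: str, input_language: str,
             output_language: Optional[str] = None) -> str:
    """Build a tag such as "[ASR French]" or "[S2ST English French]"."""
    parts = [task, input_language]
    # The output language is only named when it differs from the input language.
    if output_language is not None and output_language != input_language:
        parts.append(output_language)
    return "[" + " ".join(parts) + "]"


asr_tag = task_tag("ASR", "French")
tts_tag = task_tag("TTS", "English")
s2st_tag = task_tag("S2ST", "English", "French")
```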

We consider both direct tasks, where the model is expected to directly map from input to output, and combined tasks, where we instruct the model to also output intermediate steps for a complex task. This is similar in spirit to chain of thought prompting [Wei et al., 2022a].

For example, for S2ST we could demand that the model directly maps from English audio tokens to French audio tokens. This would be expressed with the task tag [S2ST English French]. Alternatively we can train the model to first output English text, followed by French text, and finally French audio tokens. We express this with the task tag [ASR AST S2ST English French]. The model performs this task as a single autoregressive decoding, i.e. it is not performed with multiple separate calls to the model for each task. In particular this means that the model can attend to the input and all prior decoded content at each stage, as opposed to a separated pipeline approach of doing ASR, MT and then TTS.

We found combined tasks to improve performance, which we explore in the experiment of Section 5.4.4.
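The layout of a combined-task example can be sketched as below. Tokens are shown as readable strings, the tag is kept as a single item rather than being split by the text tokenizer, and the stage outputs are simply concatenated; these are simplifying assumptions for the example, as is the function name `build_combined_example`.

```python
from typing import List, Sequence, Tuple


def build_combined_example(tasks: Sequence[str], source_language: str,
                           target_language: str, input_tokens: Sequence[str],
                           stage_outputs: Sequence[Sequence[str]]
                           ) -> Tuple[List[str], List[str]]:
    """Return (inputs, targets) for one combined-task example."""
    tag = "[" + " ".join([*tasks, source_language, target_language]) + "]"
    inputs = [tag, *input_tokens]
    # All stages are decoded in one autoregressive pass, so the target is a
    # single sequence holding each stage's output in order.
    targets = [token for stage in stage_outputs for token in stage]
    return inputs, targets


inputs, targets = build_combined_example(
    tasks=["ASR", "AST", "S2ST"],
    source_language="English",
    target_language="French",
    input_tokens=["<audio_17>", "<audio_902>"],       # English audio tokens
    stage_outputs=[
        ["hello"],                                    # English transcript
        ["bonjour"],                                  # French translation
        ["<audio_5>", "<audio_44>"],                  # French audio tokens
    ],
)
```

Since the three stages share one decoded sequence, the French audio tokens are generated with the English transcript and the French text already in context.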

3.5 Training mixtures

In this section we describe the data mixtures used to train our best models based on the datasets listed in Table 1. Mixtures were implemented using the SeqIO library [Roberts et al., 2022]. More details on the datasets can be found in Section 4.

There are two mixtures: one used to train the AudioPaLM 8B AST and AudioPaLM-2 8B AST models, which output text and are trained on ASR and AST tasks; the other used to train the AudioPaLM 8B S2ST model, which outputs both text and speech and additionally includes TTS and S2ST tasks.

The AST mixture is composed of:

The ASR tasks from the CVSS, VoxPopuli ASR, CommonVoice 11, Conversational EsEn and Youtube ASR datasets. For the CVSS and Conversational EsEn datasets, we use ASR in both source and target languages.

The AST tasks from CVSS, Conversational EsEn and VoxPopuli S2ST. We use VoxPopuli S2ST for AST by mapping from the translated audio to the transcript, since the translated transcript is not available.

The combined AST + ASR task for the CVSS and Conversational EsEn datasets.

The S2ST mixture is composed of the above, plus:

The TTS tasks from the CVSS and VoxPopuli ASR datasets. For CVSS we use only the source transcript and audio.

The S2ST tasks from the VoxPopuli S2ST, CVSS, WMT/TED and PaLM MT TTS datasets. Note that except for VoxPopuli S2ST, the speech targets of these datasets are all synthetically generated. For VoxPopuli S2ST we perform translation from both source to target, and target to source.

The combined ASR + AST + S2ST tasks from the Conversational EsEn, CVSS and WMT/TED datasets.

In general, the components of the mixture are weighted according to the number of elements in each component, with larger datasets downweighted; Table 1 lists the amounts of audio that models trained on the above mixtures have seen during training.

3.6 Training setup

In all experiments, we use the same finetuning setup as described in Section 6.1.2 of [Chowdhery et al., 2022]. In particular, we finetune with the Adafactor optimizer with a constant learning rate of $5\times 10^{-5}$ and dropout rate of 0.1, and we use loss masking on the inputs.
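Loss masking on the inputs means that only target positions contribute to the training objective. The following pure-Python sketch illustrates the idea on per-position log-probabilities; the function name `masked_nll` and the toy numbers are assumptions, and a real training setup would compute this on batched tensors.

```python
import math
from typing import List, Sequence


def masked_nll(log_probs: Sequence[Sequence[float]], targets: Sequence[int],
               loss_mask: Sequence[int]) -> float:
    """Average negative log-likelihood over positions where the mask is 1."""
    total, count = 0.0, 0
    for position_log_probs, target, keep in zip(log_probs, targets, loss_mask):
        if keep:
            total -= position_log_probs[target]
            count += 1
    return total / count if count else 0.0


# Four positions over a toy 4-token vocabulary: two input positions
# followed by two target positions.
uniform: List[float] = [math.log(0.25)] * 4
peaked: List[float] = [math.log(0.97)] + [math.log(0.01)] * 3
log_probs = [peaked, peaked, uniform, uniform]
targets = [1, 2, 0, 3]
loss_mask = [0, 0, 1, 1]  # zeros mask out the input positions

loss = masked_nll(log_probs, targets, loss_mask)
unmasked_loss = masked_nll(log_probs, targets, [1, 1, 1, 1])
```

In this toy case the masked loss depends only on the two target positions, whereas the unmasked loss is also affected by how well the model predicts its own input.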

Data and Metrics

Table 1 lists the datasets used in AudioPaLM training.

CoVoST2 [Wang et al., 2020] is a speech-to-text dataset mapping speech in 21 languages to English text.

CVSS [Jia et al., 2022c] augmented CoVoST2 to synthesize speech for the target text in two flavors: CVSS-C uses a canonical speaker's voice, while CVSS-T transfers voice properties from the source voice. Unless stated otherwise, we use the CVSS-C flavor in speech-to-speech translation experiments.

VoxPopuli [Wang et al., 2021] contains speeches from the European Parliament together with their transcripts – which can be used for speech recognition (ASR) tasks – as well as spoken translations from parliamentary interpreters – which can be used for speech translation (S2ST) tasks.

Common Voice [Ardila et al., 2020] consists of text paired with recordings where people were asked to read the text aloud.

The conversational dataset described in [Jia et al., 2019a] was obtained by crowd-sourcing humans to read a subset of the Spanish side of a proprietary conversational Spanish-English MT dataset.

YouTube ASR is an unlabeled multilingual dataset of YouTube-based audio which was transcribed automatically by using the USM-2B ASR model [Zhang et al., 2023a]. The dataset helps to improve models for YouTube to perform better on captioning and translation.

WMT/TED TTS is based on WMT [Barrault et al., 2020, 2019, Bojar et al., 2018, 2017, 2015, 2013] and TED [Qi et al., 2018] text-to-text translation datasets as described in [Bapna et al., 2022]. Following Jia et al. [2022a] the dataset is augmented by running all the source and target text through a TTS engine to generate synthetic paired audio.

PaLM MT TTS provides additional training data for S2ST: We use PaLM-2 to translate the transcripts of the YouTube, Common Voice, and Babel [Gales et al., 2017] datasets to English, and use a prior AudioPaLM 8B S2ST model (trained without this dataset) to synthesize the speech. The method is similar in spirit to [Jia et al., 2019a] which combines MT and TTS to generate additional paired data.

We train models on mixtures of ASR, AST, and S2ST tasks drawn from these datasets, as described in Section 3.5. In Section 5.4.6 we explore how adding more data improves the performance of our method.

Note that our method makes use of the text-pretrained PaLM checkpoints and audio tokenizers. So while the models are trained on the datasets listed in Table 1, they can also benefit from PaLM’s text training data [Anil et al., 2023] via the pre-trained PaLM checkpoint, and from the data used to train the audio tokenizers.

4.2 Evaluation Metrics

We evaluate our method on the following benchmarks:

CoVoST2 AST: We use BLEU scores, with the SacreBLEU corpusBLEU implementation [Papineni et al., 2002, Post, 2018]. We do not perform any normalization to the text before computing BLEU.

FLEURS AST: The FLEURS [Conneau et al., 2023] dataset contains speech utterances and their corresponding transcripts in 102 languages and is used for evaluation only. We use BLEU scores, as described for CoVoST2 AST.

VoxPopuli ASR: We use the JiWER implementation of word error rate (WER). We normalise the text by ignoring capitalisation and punctuation before computing the WER.

CoVoST2 ASR: Comparable to VoxPopuli ASR, but for Japanese and Chinese, the character error rate (CER) is reported instead of WER. We report this metric only for experiments trained on CoVoST2.

CVSS S2ST: Following Translatotron 2 [Jia et al., 2022b], we feed the audio output of our model into an ASR model and use BLEU to compare the ASR output with the ground truth target text. We use the same ASR model as Jia et al. [2022b] and so the metrics presented here are directly comparable.

All evaluations are performed on the test splits of the corresponding datasets.
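To make the WER metric and its text normalization concrete, the following is a small pure-Python sketch. It is a stand-in written for illustration, not the JiWER implementation used for the reported numbers; the function names are hypothetical and the punctuation stripping only covers ASCII punctuation.

```python
import string
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, strip ASCII punctuation, and split into words."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return stripped.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance computed one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                          # deletion
                current[j - 1] + 1,                       # insertion
                previous[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


same_after_normalization = word_error_rate("Hello, world!", "hello world")
one_sub_one_del = word_error_rate("a b c d", "a x c")
```

A character error rate can be obtained from the same routine by comparing character sequences instead of word sequences.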

Experiments

We start with our top-level results presenting significant improvements over prior results on automatic speech-to-text translation (AST) and direct speech-to-speech translation (S2ST), as well as competitive results on automatic speech recognition (ASR). Ablations of individual factors are provided in Section 5.4.

Table 2 displays results on ASR, AST and S2ST benchmarks for our method and existing baselines. Our models come in two variants; the first variant (referred to as AST) is trained on AST tasks without S2ST and TTS data; the second variant (referred to as S2ST) is trained with S2ST and TTS data and is therefore able to produce speech as well. To generate the audio for the S2ST results we used SoundStorm [Borsos et al., 2023]. For details on the training mixtures see Section 3.5.

As an initial checkpoint, we use a PaLM-2 8B checkpoint [Anil et al., 2023] to which we add the capability to process audio tokens as input and output as described in Section 3.2. The additional audio token embeddings are initialized to . As in the original PaLM and PaLM 2 models, the input and output embeddings are shared.

Our method exceeds the baselines on AST and S2ST and is competitive on ASR. Our method also comes close in AST performance to a cascaded approach in which we use our best AudioPaLM-2 ASR model followed by translation with another AudioPaLM-2 model finetuned only for text-to-text translation on CoVoST2.

5.2 Zero-shot behaviour

We evaluate the zero-shot capabilities of our AST models on the FLEURS multilingual dataset [Conneau et al., 2023]. The dataset contains speech utterances and their corresponding transcripts in 102 languages. Note that none of our models were trained on FLEURS, so we use the dataset for evaluation only. In this context, we focus on the language pairs X $\rightarrow$ English and extract two subsets of languages:

29 AST-observed languages: languages for which speech-to-text translation (AST) data (X $\rightarrow$ En) was seen during training (as these language pairs were present in the VoxPopuli S2ST, CoVoST2 and/or Conversational EsEn datasets). These languages are indicated with a § in Table 17.

26 ASR-observed languages: languages for which no speech-to-text translation data was seen when training our AST models, but for which at least 1 hour of transcription (ASR) data was present. We removed 3 languages (Cantonese, Kurdish and Ganda) for which we did not have a BLEU score for the baseline. These languages are indicated with a † in Table 17.

In Table 3, we present the results obtained with the two proposed AST models AudioPaLM and AudioPaLM-2, as well as the baseline model “Whisper Large-v2 1.5B”. We also present the number of AST and ASR speech training hours for these three models. For the proposed models, the reported number of hours does not take into account the amount of speech used to train the tokenizers.

We observe that the proposed AudioPaLM-2 model significantly outperforms the Whisper model on AST-observed languages. Although Whisper is used as a reference in the ASR-observed setting, its results there are not zero-shot, as Whisper was trained on 40.6K hours of speech-to-text translation (AST) data for these languages. For the AudioPaLM models, this setting is zero-shot, as they did not see any AST data for these languages. Despite this disadvantage, AudioPaLM-2 also outperforms the Whisper model on ASR-observed languages. For a detailed performance comparison for each language, see Appendix D.

There is a large improvement obtained by using the AudioPaLM-2 instead of the AudioPaLM model: 28% increase for AST-observed languages and 107% increase for ASR-observed languages. These numbers show that the superior text translation capabilities of AudioPaLM-2 immediately transfer to the audio domain, despite the fact that the model has not seen any speech-to-text data for these language pairs during training in the case of ASR-observed languages.

3 Quality of generated speech

In addition to measuring the translation quality of the speech content as reported in Table 2, we are also interested in evaluating whether the speech generated by AudioPaLM is (a) of high quality, and (b) truthfully preserves the voice of the speaker when translating to a different language. To this end, we use a combination of objective metrics and subjective evaluation studies that use the test split of the CVSS-T dataset [Jia et al., 2022c]. The subjective experiments were conducted on an earlier version of AudioPaLM using the acoustic generation method described in AudioLM [Borsos et al., 2022].

As a first baseline, we use the ground-truth translated utterances which are provided as a part of the CVSS-T dataset. These utterances were obtained by synthesizing the ground-truth translated text with a high-quality TTS system which was modified to enable voice transfer [Jia et al., 2021, 2022c]. As a result, the ground-truth utterances mimic the voice in the source utterance.

As a second baseline, we use Translatotron 2. Note that we could not use the “Translatotron 2 + pretraining + TTS aug” model (mentioned in Table 2) in the comparison because it was not trained to preserve voices and instead generates speech in a single canonical voice. Instead, we use the Translatotron 2 system presented by Jia et al. [2022c], which is capable of transferring the voice from the source utterance (albeit with a lower BLEU score on CVSS). This version of the Translatotron 2 model was trained on the CVSS-T dataset and implements S2ST from 21 languages to English.

As the first objective metric, we use a no-reference MOS estimator akin to Reddy et al. which, given an audio sample, provides an estimate of the perceived audio quality on a scale from 1 to 5. To measure cross-lingual voice transfer quality, we rely on an off-the-shelf speaker verification model [Chen et al., 2022a] as used by Zhang et al. [2023b] and Kharitonov et al., and compute the cosine similarity between the embeddings of the source (encoded/decoded with SoundStream) and the translated speech. Besides voice preservation, we also measure how well the acoustic properties (recording conditions, background noise) are transferred from the source audio to the target. We do so by computing the cosine similarity between embeddings extracted from a model trained to identify segments that belong to the same recording [Borsos et al., 2023].
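The similarity computation described above can be sketched as follows. This is a minimal illustration only: the embedding vectors below are toy values, and in practice they would come from the speaker verification model or the acoustic-consistency model, neither of which is reproduced here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins (assumption) for embeddings of the source and translated speech.
source_embedding = [0.2, 0.9, -0.4]
translated_embedding = [0.25, 0.8, -0.5]
similarity = cosine_similarity(source_embedding, translated_embedding)
```

A value close to 1 indicates that the two embeddings point in nearly the same direction, which is read as a well-preserved voice or recording condition.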

We run two separate studies, one for evaluating the quality of the generated speech, and another for assessing the voice similarity. We use the same set of samples for both studies. Since utterances in CVSS-T are sourced from volunteer-generated data of variable quality, we noticed that some of the utterances contain loud overlapping speech (e.g., a TV show or a song playing in the background) or extremely strong noise (e.g., clothes rubbing against the microphone). Such aberrations complicate the work of raters, thus we decided to pre-filter by only selecting inputs with an estimated MOS of at least $3.0$. Finally, we sampled 10 examples per language, giving us $21\times 10=210$ source utterances to translate. All utterances were peak normalised and resampled to 16kHz, if needed.
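The pre-filtering and sampling step above can be sketched as below. The record layout (dictionaries with `language` and `estimated_mos` keys), the function name and the fixed random seed are assumptions made for illustration; only the threshold of 3.0 and the 10 examples per language come from the text.

```python
import random
from collections import defaultdict


def select_rating_inputs(utterances, min_mos=3.0, per_language=10, seed=0):
    """Keep utterances with estimated MOS >= min_mos, then sample per language."""
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    by_language = defaultdict(list)
    for utt in utterances:
        if utt["estimated_mos"] >= min_mos:
            by_language[utt["language"]].append(utt)

    selected = []
    for language in sorted(by_language):
        pool = by_language[language]
        if len(pool) < per_language:
            raise ValueError(f"not enough clean utterances for {language}")
        selected.extend(rng.sample(pool, per_language))
    return selected


# Synthetic pool (assumption): 21 languages, 30 utterances each, half below threshold.
pool = [
    {"language": f"lang{i:02d}", "estimated_mos": 2.5 if j % 2 == 0 else 3.5}
    for i in range(21)
    for j in range(30)
]
chosen = select_rating_inputs(pool)
```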

Before starting, the raters were provided with a small set of illustrative examples with ground-truth grades. They also completed a small pilot study as training. The utterances (pairs of source-target utterances, in the case of the voice similarity evaluation) were presented one-by-one. The ratings are provided on a 5-grade scale from 1 (poor quality or completely different voices) to 5 (excellent quality, identical voices). In the voice similarity study, the raters are explicitly asked to ignore differences in the recording conditions and language, and solely focus on the voice. Each of the $630$ output examples ($10$ inputs from each of $21$ languages were generated with each of the $3$ different systems) was rated $10$ times, which results in $6300$ ratings per study. Aggregating those ratings per system, we obtain mean opinion score (MOS) and similarity mean opinion score (SMOS).
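The rating counts and the per-system aggregation can be checked with a short sketch. The system names and scores in the toy input are placeholders, not results from the study.

```python
from collections import defaultdict
from statistics import mean


def mean_opinion_scores(ratings):
    """Average the 1-5 ratings per system; `ratings` holds (system, score) pairs."""
    by_system = defaultdict(list)
    for system, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError("ratings must lie on the 1-5 scale")
        by_system[system].append(score)
    return {system: mean(scores) for system, scores in by_system.items()}


# Counts stated in the text.
num_languages, inputs_per_language, num_systems, ratings_per_example = 21, 10, 3, 10
num_examples = num_languages * inputs_per_language * num_systems
total_ratings = num_examples * ratings_per_example

# Placeholder ratings (assumption) to show the aggregation.
toy_scores = mean_opinion_scores([("system_a", 4), ("system_a", 5), ("system_b", 3)])
```

The same function yields MOS when fed the quality ratings and SMOS when fed the voice similarity ratings.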

We report the results of the objective and subjective evaluations in Table 4. From these results we observe that AudioPaLM significantly outperforms the baseline Translatotron 2 system both in audio quality and in voice similarity, in objective and subjective measurements. Moreover, AudioPaLM has higher quality and better voice similarity than the ground-truth synthesized recordings in CVSS-T, with a relatively large gap in most of the metrics. Following Jia et al. [2022c], we also compared the systems across high and low-resource groups (French, German, Spanish and Catalan vs. the rest) and found no significant variation of the metrics across these groups.

4 Impact of model and data choices

In this section we walk the reader through experiments that guided us towards our final training recipe from initial early experimentation. These show the impact of individual factors and build on top of one another until reaching the final setup described and analysed in the previous sections.

To achieve the results in Section 5.1, we trained on multiple tasks based on the same underlying data to improve performance. For example, the CoVoST2 data can be used for both ASR and AST tasks, and we observed that adding ASR tasks in training results in improved performance on AST benchmarks, compared to training with the AST tasks alone. In this section we investigate the effect of this choice on model performance.

We train two models on the CoVoST2 dataset. All conditions are identical except that in one experiment we use only the AST data, while in the other we train with both AST and ASR tasks. Both models start from the PaLM 8B checkpoint and use the USM-v1 tokenizer. We evaluate on the CoVoST2 AST benchmark.

See Table 5. We observe that adding ASR tasks into the dataset increases BLEU by 2.5 from 16.0 to 18.5 on the CoVoST2 AST benchmark.

Although ASR is not part of the evaluation task, adding ASR data helped improve performance. Our hypothesis is that ASR tasks help the model to better connect its understanding of the new audio input to its previous understanding of text. In subsequent experiments we include both ASR and AST tasks when using the CoVoST2 training data.

4.2 Training from scratch vs. finetuning

The results in Section 5.1 are based on finetuning a text-pretrained PaLM checkpoint. Here we investigate the effect of using such a model compared to starting training from scratch on the same architecture.

In the 1B from-scratch and 8B from-scratch experiments we start with randomly initialised weights. In the 8B finetune experiment we start from the PaLM 8B checkpoint, which has been modified by adding extra rows to the token embedding matrix for the audio tokens, which are randomly initialised.
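The modification of the embedding matrix in the finetuning setup can be sketched as follows. This is an illustration under assumptions: the Gaussian initialisation, its standard deviation and the function name are hypothetical, since the text only states that the new rows are randomly initialised.

```python
import random


def extend_embedding_matrix(text_embeddings, num_audio_tokens, init_std=0.02, seed=0):
    """Return the pretrained rows followed by freshly initialised audio-token rows."""
    rng = random.Random(seed)
    dim = len(text_embeddings[0])
    # New rows for the audio vocabulary; distribution and scale are assumptions.
    audio_rows = [
        [rng.gauss(0.0, init_std) for _ in range(dim)]
        for _ in range(num_audio_tokens)
    ]
    # Pretrained text rows are copied unchanged so text token ids keep their meaning.
    return [list(row) for row in text_embeddings] + audio_rows


# Toy pretrained matrix (assumption): 2 text tokens with 2-dimensional embeddings.
pretrained = [[1.0, 2.0], [3.0, 4.0]]
extended = extend_embedding_matrix(pretrained, num_audio_tokens=3)
```

With this layout, audio token `k` is looked up at row `len(pretrained) + k`, so the text part of the checkpoint is untouched at the start of finetuning.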

All three models are trained on CoVoST2 ASR and AST tasks.

See Table 6. We observe that finetuning the PaLM 8B checkpoint achieves substantially higher performance than training from scratch on CoVoST2 tasks for both ASR and AST. The 1B-from-scratch experiment was added to determine whether a smaller model architecture would work better than the 8B model when trained from scratch on CoVoST2; it does not.

Finetuning a pretrained checkpoint substantially improves results. This is in some sense not surprising as the base model is very capable to begin with; nonetheless it is interesting that with finetuning the model is able to adapt to a completely new input stimulus, since the audio tokens are totally new embeddings that the model must learn to understand. Furthermore, the audio tokens are very different from text: despite the low sampling rate, there is presumably still some redundancy in the data, and the token rate is still much higher than that of text. We estimate from the data that at 25Hz, one text token corresponds to approximately 6-8 audio tokens.

4.3 Different tokenization schemes

To obtain the results in Section 5.1, we tokenized audio based on USM-v2. Here we investigate the impact of the choice of tokenization scheme on the final results.

We train three models with all conditions identical except for the tokenization scheme applied to the audio. All models are trained using the PaLM 8B checkpoint. In each case we use the CVSS datasets with ASR and AST tasks, with the source audio preprocessed using different tokenizers. The three tokenizers used are the w2v-BERT, USM-v1 and USM-v2 tokenizers, which were discussed in Section 3.1.

See Table 7. We observe that the choice of tokenization scheme has a large impact on the performance of the model. The fact that the USM encoder is more powerful than w2v-BERT indeed translates to an improvement in performance in our setting. The USM-v2 tokens perform even better, yielding substantially improved results.

The choice of tokenization scheme has a substantial effect on performance. This is not surprising; the model is only exposed to the information captured by the tokenizer, and this may be in a form which is easy or difficult for the model to process. Future work should consider tokenization of audio more carefully because this is still relatively immature as a research area.

4.4 Training with combined tasks

To obtain the results in Section 5.1, we required the model to compute intermediate steps for complex tasks by combining multiple tasks into one, as described in Section 3.4. In the following we investigate the impact of this choice.

We train pairs of models on the CoVoST2 AST dataset. Within a pair, the only change is that for one model we train with ASR and AST tasks, while for the other we also include the combined task consisting of first doing ASR and then outputting the AST result. For the latter model, at evaluation time we report the result of doing the combined task from which we use only the final output. We repeat this setup twice: once with the USM-v1 tokens, and once with the USM-v2 tokens.
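To make the difference between direct and combined tasks concrete, the sketch below builds input and target strings for each case. The tag strings, the separator between intermediate and final outputs, the audio token spelling and the function names are all assumptions for illustration and are not taken from the training code.

```python
SEPARATOR = " | "  # hypothetical delimiter between the ASR and AST outputs


def build_example(task, audio_tokens, transcript, translation,
                  source_lang="French", target_lang="English"):
    """Return (inputs, targets) strings for a direct or combined task."""
    audio = " ".join(audio_tokens)
    if task == "asr":
        return f"[ASR {source_lang}] {audio}", transcript
    if task == "ast":
        return f"[AST {source_lang} {target_lang}] {audio}", translation
    if task == "asr_ast":
        # Combined task: emit the transcript first, then the translation.
        tag = f"[ASR AST {source_lang} {target_lang}]"
        return f"{tag} {audio}", transcript + SEPARATOR + translation
    raise ValueError(f"unknown task: {task}")


def final_output(decoded_target):
    """At evaluation time only the last segment (the translation) is scored."""
    return decoded_target.rsplit(SEPARATOR, 1)[-1]


inputs, targets = build_example("asr_ast", ["<a1>", "<a2>"], "bonjour", "hello")
```

Because the transcript is part of the same output sequence, the model can condition on it, as well as on the input audio, while producing the translation.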

See Table 8. This shows that expressing the AST task as a combination of simpler tasks results in improved performance on the AST task. At the same time, we see a small reduction in performance on the ASR task.

Our results are consistent with prior works which have observed that allowing the model to break down a complex task into easier pieces results in improved performance, relative to making the model directly output the answer [Wei et al., 2022a].

At the same time, we observe a reduction in performance on the ASR task. We hypothesize that this may be a consequence of our checkpoint selection criterion, which was to select the checkpoint with the best AST metric on the validation split. It may also be a consequence of the large change in the data mixture resulting from this change.

We note that it may appear that combined tasks reduce the problem to a pipeline approach of separate ASR and translation systems. However, this is not the case, as the model can refer to all previous tokens at each step and is a single unified model. For example, when decoding the translated text, it is possible to refer to the input audio and any information contained in it. This is particularly important for the S2ST setting (see Experiment 5.4.5) where prosodic information may be present in the input audio, which can be attended to while decoding output audio.

4.5 Training with additional speech-to-speech tasks

In the following, we investigate the impact of adding speech-to-speech translation (S2ST) tasks to the trained tasks.

We train two models using the CoVoST2 dataset. One model is only trained on the AST, ASR and combined AST tasks. The other model is additionally trained on S2ST as a direct and combined task. Thus the difference between these two models is that in the S2ST tasks the model must additionally output audio tokens, whereas for the other tasks (and all previous experiments) the model only outputs text tokens.

See Table 9. We observe that adding the S2ST task results in the new capability of being able to perform S2ST, but that this comes at the cost of a modest decrease in performance on both the AST BLEU score and ASR WER score when evaluating on the CoVoST2 test split.

Since we use loss masking on the inputs for each training example, performing S2ST is fundamentally different from ASR or AST since the model must learn to emit audio tokens. For ASR and AST, the model takes audio tokens as input, but the loss masking means that it doesn’t need to learn to model these sequences of audio tokens. It is thus perhaps not surprising that this results in a decrease in performance on the text-output tasks, since model capacity must be devoted to audio modelling.
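The effect of loss masking can be illustrated with a small sketch. The per-token log-probabilities are toy values and the function name is hypothetical; the point is only that input positions contribute nothing to the loss.

```python
import math


def masked_nll(token_log_probs, is_target):
    """Average negative log-likelihood over target positions only."""
    if len(token_log_probs) != len(is_target):
        raise ValueError("log-probabilities and mask must have equal length")
    target_terms = [-lp for lp, keep in zip(token_log_probs, is_target) if keep]
    if not target_terms:
        raise ValueError("at least one target position is required")
    return sum(target_terms) / len(target_terms)


# Toy ASR-style example (assumption): three input audio tokens, two output text tokens.
log_probs = [math.log(0.01), math.log(0.02), math.log(0.05),
             math.log(0.5), math.log(0.25)]
mask = [False, False, False, True, True]
loss = masked_nll(log_probs, mask)
```

For ASR and AST the masked positions are audio tokens, so the model is never penalised for modelling audio poorly. In S2ST the unmasked target positions are themselves audio tokens, which is why that task requires the model to learn to emit them.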

4.6 Scaling the training data

In this section we investigate the impact of increasing the amount of training data.

We run this analysis on two types of models, both trained from a PaLM 8B checkpoint and with USM-v2 tokens. The models “AudioPaLM 8B AST” are trained without the S2ST tasks, the models “AudioPaLM 8B S2ST” are trained with the S2ST tasks.

We train these two types of models with an increasing amount of data:

The CoVoST2 dataset only. For the S2ST model, we use the modified S2ST version of this dataset: CVSS.

All the public datasets described in Table 1, namely CoVoST2/CVSS, VoxPopuli AST, VoxPopuli S2ST, CommonVoice 11 and Conversational EsEn.

All the public datasets, as well as the YouTube ASR dataset.

All the public datasets, as well as the YouTube ASR dataset and the WMT/TED text-to-text translation dataset. For the S2ST models, we follow Jia et al. [2022a] and synthesise a paired S2ST dataset from this by using TTS on the examples in this dataset.

As above, but using the synthetic PaLM-based MT TTS dataset S2ST mixture instead of the YouTube ASR dataset. For this dataset we used PaLM-2 to translate the transcripts of the YouTube, Common Voice, and Babel datasets to English text, and then synthesized the English speech to create a speech-to-speech dataset.

See Table 10. We observe that training with increasing amounts of data yields a substantial improvement. In particular, consistent with Experiment 5.4.1 we see that adding additional ASR data helps on AST tasks. Consistent with Experiment 5.4.5 we observe that for each fixed dataset mixture for which we compare the AST and S2ST mixtures, including the S2ST tasks brings new S2ST capabilities at the cost of a modest reduction in performance on AST. All of the S2ST results in Table 10 use AudioLM stage 2 and 3 models [Borsos et al., 2022] to reconstruct the audio samples from audio tokens as discussed in Section 3.3.

It is unsurprising that scaling the amount of training data results in an improvement in performance. We observe that adding more data in some cases leads to a small reduction in performance on the ASR tasks, though always an improvement on the AST tasks. Similar to Experiment 5.4.4, this may be a consequence of our checkpoint selection criterion, which is based on AST performance on the CVSS validation set.

4.7 Decoding with AudioLM vs SoundStorm

In this section we investigate the impact on S2ST metrics of decoding using AudioLM stage 2 and 3 models vs SoundStorm.

We take the best AudioPaLM model from Section 5.4.6 trained with the mixture consisting of public, PaLM MT TTS and WMT/TED datasets. The previous experiment used AudioLM stage 2 and 3 models to decode the audio tokens output by AudioPaLM to wave audio. We rerun this using a SoundStorm model instead, and measure the impact on the CVSS S2ST task.

See Table 11. We observe a 1.3 BLEU point increase when using SoundStorm compared to AudioLM. This result corresponds to the S2ST model presented in Table 2 trained on the S2ST mixture described in Section 3.5.

These observations are consistent with those reported in Borsos et al., which found that compared to AudioLM, SoundStorm produces more intelligible speech when used to decode semantic audio tokens. This was measured by how faithfully the resulting audio matches a ground truth transcript when transcribed with an ASR system, which is similar to our setup.

4.8 Impact of using PaLM-2

In the following we explore the effect of using the PaLM-2 checkpoint vs the original PaLM model. PaLM-2 was trained with improved data and techniques compared to the original PaLM model, and was explicitly trained with parallel translation data. We therefore aim to understand whether these improvements translate to gains in speech tasks.

We focus on speech-to-text tasks and do not consider S2ST. We train two pairs of models on the two largest datasets considered in Section 5.4.6. For each dataset we train two models, one using the PaLM 8B checkpoint and the other using the PaLM-2 8B checkpoint. Compared to the PaLM finetuning experiments, the optimization hyperparameters differed: we used a dropout rate of $0.2$ and a learning rate schedule of linear ramp-up to $10^{-4}$ followed by exponential decay to $10^{-5}$.

See Table 12. On the smaller mixture, we observe an improvement on the CoVoST2 AST task, a minor degradation on VoxPopuli ASR and a more significant degradation on CoVoST2 ASR. On the larger data mixture, we see that PaLM-2 exceeds PaLM on the VoxPopuli ASR and CVSS AST tasks, and is slightly worse on CoVoST2 ASR. Our interpretation of these results is that the improved ability of PaLM-2 to perform text translation leads to an improvement for AST. The impact on ASR capabilities is mixed: with the full training mixture, PaLM-2 exhibits slightly worse ASR capabilities on CoVoST2 and slightly better ones on VoxPopuli.
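One possible reading of this learning rate schedule is sketched below. The peak and final values come from the text; the number of warm-up steps, the total number of steps and the exact decay form are assumptions, since they are not stated here.

```python
def learning_rate(step, warmup_steps=1000, total_steps=100_000,
                  peak_lr=1e-4, final_lr=1e-5):
    """Linear ramp-up to peak_lr, then exponential decay reaching final_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clipped at 1 after total_steps.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    # Geometric interpolation between peak_lr and final_lr.
    return peak_lr * (final_lr / peak_lr) ** progress


# Sample the schedule at a few steps (step counts are hypothetical).
schedule = [learning_rate(s) for s in (0, 500, 1000, 50_500, 100_000)]
```

The decay is exponential in the step count because the rate is multiplied by a constant factor per step, so it falls by the same ratio over equal intervals.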

While we do see a difference, we suspect that the different capabilities between PaLM and PaLM-2 are not as important in this setting as they might be for purely text-based tasks, since the addition of tokenized audio is novel for both models.

4.9 Impact of architecture scale

In the following we investigate the impact of the model size on the downstream task performance. We use PaLM-2 for this and focus on the ASR and AST settings.

We train three PaLM-2 models of different sizes (128M, 1B, and 8B) using USM-v2 tokens with the same two largest datasets from Section 5.4.6 and observe their performance on our benchmark ASR and AST tasks.

See Table 13. We find that our results improve substantially with model size. Moving from the 128M to the 1B model on the full Public + YT + WMT/TED dataset gives 42% and 28% reductions in WER on the CVSS and VoxPopuli ASR tasks respectively, and an increase of over 13 points in BLEU on translation tasks. Increasing the model size further from 1B to 8B gives further WER reductions of 10% and 16% on the CVSS and VoxPopuli ASR tasks and a further 6.2 point improvement in BLEU score. We find the scaling improvements also hold across different training datasets (e.g., Public + YT compared with Public + YT + WMT/TED).

As expected, performance on downstream ASR/AST tasks improves with larger model size. Our 1B model outperforms the Whisper Large 1.5B model by over 5 BLEU points and achieves a 28% reduction in WER on VoxPopuli ASR.
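The percentages quoted here are presumably relative rather than absolute reductions in WER. A minimal sketch of that computation follows; the WER values used are made up for illustration and are not taken from Table 13.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction by which WER drops relative to the baseline (0.42 means 42%)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (assumption): a drop from 20.0 to 11.6 is a 42% relative reduction.
example_reduction = relative_wer_reduction(20.0, 11.6)
```

BLEU improvements, by contrast, are quoted in absolute points in this section.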

Conclusion

We introduce AudioPaLM, a large language model that can process and generate speech and text interchangeably. AudioPaLM starts from a pre-trained text-based LLM and extends its vocabulary with discrete audio tokens. In doing so, the model can leverage its existing text capabilities while being finetuned to also consume and produce tokenized audio on a mixture of speech-text tasks. Moreover, by expressing the different tasks with textual tags, a single model can be trained on all tasks together. AudioPaLM demonstrates state-of-the-art results on speech translation benchmarks and competitive performance on speech recognition tasks, as well as zero-shot speech-to-text translation abilities on unseen language pairs. AudioPaLM also benefits from features of audio language models, such as voice prompting, and can perform S2ST with voice transfer of a superior quality compared to existing baselines, as measured by both automatic metrics and human raters.

The fact that our model can natively produce audio is a consequence of the fact that we make use of tokenized audio. This introduces a strong dependency on the quality of the audio tokenizer, as demonstrated in Section 7. We additionally empirically found it necessary to finetune the whole model, unlike a Flamingo-like [Alayrac et al., 2022] approach which freezes most of the weights and thus provides guarantees on preservation of the original capabilities of the model components.

There are numerous further avenues of research. One strand is around audio tokenization: what are desirable properties of audio tokens, how can we measure them, and how can we optimize for them? Another is around evaluations. In comparison to text, the set of established benchmarks for generative text/audio tasks is less developed. This work has focused on speech recognition and speech translation, for which the benchmarks are more mature. The establishment of more benchmarks and metrics for generative audio tasks will help to accelerate research further.

Acknowledgements

We would like to thank Nobuyuki Morioka and Yifan Ding for their help in re-creating the TTS-augmented WMT/TED dataset which was also used in Jia et al. [2022a] and Adam Roberts and Ron Weiss for their advice and reviews. We would like to thank Slav Petrov, Colin Cherry and the PaLM-2 team for their advice and support.

References

Appendix A Author Contributions

Paul initiated the project, created the AudioPaLM model architecture and proved its viability with a number of pre-training, speech recognition and translation tasks, onboarded the team to the project, and contributed significantly to the write-up of this report. Chulayuth performed many audio tokenization experiments and completed the audio integration with PaLM2. Duc carried out the experiments around combined tasks (performing both ASR and AST in the same task), synthesized the WMT-derived speech-to-speech datasets and further developed the audio vocabulary. All of the above contributed to the best-performing configuration for speech-to-speech translation. Paul and Christian coordinated the write-up of this report. Duc and Chulayuth ran a majority of the ablation experiments.

Ankur, Johan, Tara and Yu collaborated on the research and development of USM-v2 tokens which led to our best-performing configuration across tasks. Jiahui and Zhishuai developed the token learning approach for images and Jiahui advised on the development of USM-v2 tokens. James, Wei and Yongqiang developed the large-scale tokenization and transcription infrastructure for USM models. Yongqiang, Wei and Félix curated the semi-supervised ASR datasets. Alexandru put in place data processing pipelines, improved our best mixture by adding a variety of ASR datasets, worked on AudioLM speech generation models, and initially worked on the USM-v2 audio tokens together with Johan.

Danny worked on speech-to-speech translation and the ASR-BLEU metric for S2ST models together with Alexandru and Duc. Peter and Vicky worked on the PaLM2 integration and cascaded model baselines. Dalia significantly improved the best configuration of this report by adding TTS tasks and text-to-text and synthetic speech-to-speech datasets to the model’s task mixtures.

Eugene, Damien, Mihajlo, and Neil worked on AudioLM speech quality and in particular on making the translated voice consistent with the source voice and providing objective metrics. Mihajlo coordinated this effort, trained speech generation models, and created the website together with Hannah. Hannah further tuned the best models for the paper, analysed the zero-shot capabilities of the models, managed the rating tasks for subjective speech quality metrics, and performed a detailed training data analysis.

Neil contributed significantly to the writing of this report. Marco identified the opportunity to leverage AudioLM for speech-to-speech translation and Zalán trained the very first such model. Zalán, Neil, and Marco provided guidance around AudioLM details and project planning. Michelle provided guidance on speech-to-speech baselines, Translatotron, and other related work. Lukas, Dirk, Matt and Johan supported and advised the team throughout the project. Christian initiated the project, coordinated the effort, and contributed with core ideas and technical work.