SpecTr: Fast Speculative Decoding via Optimal Transport

Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, Felix Yu

Introduction

Autoregressive language models have been shown to achieve state-of-the-art results in several natural language tasks. During inference, given a context $x^{t}{:=}x(1),x(2),\ldots,x(t)$, an autoregressive model ${\cal M}_{b}$ generates successive tokens $x(t+1),x(t+2),\ldots$ via temperature sampling, where the next token $x(t+1)$ is drawn from the temperature-scaled distribution ${\cal M}_{b}(\cdot|x^{t})$. If the temperature is zero, i.e., greedy decoding, the next token is determined by the maximum likelihood method, i.e., $x(t+1)=\arg\max_{x\in\Omega}{\cal M}_{b}(x|x^{t})$, where $\Omega$ is the domain of a single token, also referred to as the vocabulary. The sampling approach can be further combined with other sampling primitives such as nucleus sampling and top-$k$ sampling.

All these approaches are autoregressive decoding methods, where tokens are generated serially one after another, which can be slow or even prohibitive in several applications. (In this work, we use the words sampling and decoding interchangeably to refer to the process of sequentially generating tokens from a language model.) Hence, several techniques have been proposed to improve the speed of decoding. Before we proceed further, we first present some notation and a simplified computational model.

Standard inference. Given a context $x^{t}$ , with $O(t^{2})$ computation and $O(1)$ time, an autoregressive model ${\cal M}_{b}$ can compute ${\cal M}_{b}(y|x^{t})$ , the (temperature-scaled) probability of all possible next tokens $y\in\Omega$ .

Parallelization along the time axis. Given a context $x^{t}$ , with $O(t^{2})$ computation and $O(1)$ time, an autoregressive model ${\cal M}_{b}$ can compute ${\cal M}_{b}(y|x^{i})$ , for all $y\in\Omega$ and $i\in\{1,2,\ldots,t\}$ .

Parallelization along time and batch axis. Let $K$ be the maximum batch size that can be used during the inference of the autoregressive model. Given several contexts, $x^{t}_{1},x^{t}_{2},\ldots,x^{t}_{K}$, with $O(Kt^{2})$ computation and $O(1)$ time, an autoregressive model ${\cal M}_{b}$ can compute ${\cal M}_{b}(y|x^{i}_{j})$, for all $y\in\Omega$, $i\in[t]$, and $j\in[K]$. (When this assumption holds, one could naively batch multiple decoding contexts, which improves decoding throughput, but not the latency of each context.)

The above computation model shows that parallelizing along time and batch axes does not increase the computation time. It is a simplified characterization of the typical hardware, such as TPUs and GPUs, used in neural network inference. Previous approaches also assume a similar computational model to devise faster decoding algorithms. In practice, there will be some overhead depending on hardware, implementation and resource utilization. In Appendix E, we experimentally verify that the theoretical gains are largely preserved for a large transformer model in practice. We also note that there are efficient transformer architectures, which reduce the computation cost from $O(t^{2})$ to $O(t\log t)$ (see for a detailed survey). Such approaches are orthogonal to the focus of this paper, and they can be easily combined with our approach.

Broadly speaking, multiple previous approaches proposed to guess a few possible future tokens using an efficient model. They then compute several conditional probability distributions from the large model based on the guesses. Computing the distributions takes $O(1)$ time due to parallelization along the time axis. The guessed tokens are then accepted or rejected based on a statistical method such that the accepted tokens are effectively samples from the large model. This guarantees that there is provably no degradation in the quality of the decoded output compared to that of the large model. When the guesses are plausible under the large model, multiple tokens will be accepted, leading to a larger gain in latency improvement. We will further characterize the acceptance probability as a function of the closeness of the distributions of large model and the small model. While this approach incurs the same computation cost as vanilla decoding (under the simplified computational model assumed in this paper), it can significantly improve decoding latency due to parallelization.

The goal of this work is to provide a principled understanding of the above approaches and discuss optimality conditions and algorithmic improvements. We start by providing a more formal overview of speculative decoding and related works.

Previous works and speculative decoding

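Previous approaches make use of parallelization along the time axis to provide speedups. They first predict multiple tokens and validate if these multiple tokens can be generated by the model with the corresponding sampling or decoding scheme. For greedy decoding, multiple tokens can be predicted by a separate model, aggressive decoding, or retrieval augmented text. For sampling, an algorithm called speculative decoding was recently proposed, and we provide an overview of this algorithm in the rest of the section. Suppose we have access to a computationally-inexpensive draft model ${\cal M}_{s}$, which predicts the next token given the context, and the predictions of ${\cal M}_{s}$ are close to that of ${\cal M}_{b}$ for most contexts. Suppose we have obtained prefix $x^{t}$. The next iteration of the speculative algorithm can be broken down into three steps (see Fig. 1 for an illustration).

Draft construction. Use the draft model ${\cal M}_{s}$ to sample $L$ draft tokens $z(1),\ldots,z(L)$ autoregressively given $x^{t}$, keeping the conditional probabilities of ${\cal M}_{s}$ on the next token at each step.

Conditional probability computation. Compute the conditional probabilities of the large model ${\cal M}_{b}$ on the next token for every prefix of the draft in parallel.

Draft selection. Select the first $L^{\prime}\leq L$ draft tokens and set $x(t+i)=z(i)$ for $i\leq L^{\prime}$ given the draft sequence and the conditional probabilities from both models. Sample a token $x(t+L^{\prime}+1)$ from a residual distribution as a correction to the rejected token.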

After this step, we use $x^{t+L^{\prime}+1}_{1}$ as the next context and sample the next few tokens using speculative decoding iteratively. For a complete statement of the algorithm, we refer the readers to . The crux of the above steps is draft selection, which given a draft sequence and the conditional probabilities from both models, selects a valid sequence such that the output has the same distribution as that of the large model. In speculative decoding, this is achieved via recursively applying a token-level maximal coupling algorithm, which is provided in Algorithm 1. Note that for the draft selection, Algorithm 1 is applied where $p$ is the conditional distribution of the draft model ${\cal M}_{s}(\cdot\mid x^{t})$ and $q$ is the conditional distribution of the large model ${\cal M}_{b}(\cdot\mid x^{t})$ (which may be further conditioned on the newly decoded tokens).

Algorithm 1 returns a random variable $Y$ which is either the accepted input $X$ or a sample from the residual distribution $p^{\text{res}}$, which is defined in Step $1$ of Algorithm 1. The algorithm is recursively applied as long as the draft tokens are accepted to select the first $L^{\prime}\leq L$ tokens from the draft model. For the first rejected token, the sample $Y$ from the residual distribution is used as a correction. Previous works showed that if $X\sim p$, then $Y\sim q$. In the case of the draft selection, this means that the output of the algorithm is distributed according to ${\cal M}_{b}(\cdot\mid x^{t})$, which is exactly the desired outcome. Furthermore,

$$\Pr(Y=X)=\sum_{x\in\Omega}\min(p(x),q(x))=1-d_{\text{TV}}(p,q),\qquad(1)$$

where $d_{\text{TV}}$ is the total variation distance between $p$ and $q$. The closer $p$ and $q$ are in $d_{\text{TV}}$, the higher $\Pr(Y=X)$ is, and the fewer serial calls to the larger model are needed. In the ideal case, if $p=q$, then $\Pr(Y=X)=1$, i.e., the draft token is always accepted, and when used for speculative decoding we have $L^{\prime}=L$. Together with the extra token sampled from ${\cal M}_{b}$ (when $L^{\prime}=L$, $x(t+L+1)$ is sampled from ${\cal M}_{b}(\cdot\mid x^{t+L})$), $L+1$ tokens are obtained in one iteration. In such a case, based on our computational model (Section 1), assuming the decoding time of the draft model is negligible, the speedup is $(L+1)$ times.
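The token-level step described above can be sketched in a few lines of Python. This is a minimal illustration, not a reproduction of Algorithm 1 as stated in the original work: it assumes distributions are given as lists of probabilities indexed by token id, that the draft token is accepted with probability $\min(1,q(x)/p(x))$, and that the residual is proportional to $q(x)-\min(p(x),q(x))$. All function names are hypothetical.

```python
import random

def residual_distribution(p, q):
    """Residual p_res(x), proportional to q(x) - min(p(x), q(x))."""
    excess = [qi - min(pi, qi) for pi, qi in zip(p, q)]
    total = sum(excess)  # equals d_TV(p, q)
    if total == 0.0:     # p == q: the residual is never used
        return list(q)
    return [e / total for e in excess]

def acceptance_probability(p, q):
    """Pr(Y = X) = sum_x min(p(x), q(x)) = 1 - d_TV(p, q)."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def token_level_maximal_coupling(p, q, x, rng=random):
    """Given a draft token x ~ p, return a token distributed according to q."""
    if rng.random() < min(1.0, q[x] / p[x]):
        return x  # draft token accepted
    res = residual_distribution(p, q)
    return rng.choices(range(len(q)), weights=res)[0]  # correction token

def output_distribution(p, q):
    """Exact law of the returned token when X ~ p (should equal q)."""
    res = residual_distribution(p, q)
    reject = 1.0 - acceptance_probability(p, q)
    return [min(pi, qi) + reject * ri for pi, qi, ri in zip(p, q, res)]
```

Calling `output_distribution(p, q)` for any pair of distributions over the same vocabulary is a quick way to check numerically that the step is a valid coupling, i.e., that the output follows $q$.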

Our contributions

From a theoretical viewpoint, the speculative decoding algorithm raises multiple questions.

What is the relationship between speculative decoding and the broader literature of sampling in statistics?

Is speculative decoding optimal in an information-theoretic sense?

Speculative decoding uses parallelization along time to speed up decoding; would it be possible to use parallelization along batch (number of drafts) to further improve decoding speed?

We provide answers to all the above questions in this work. We first relate the problem of speculative decoding to the broader and well-studied discrete optimal transport theory through a token-level coupling problem (Section 4). With this connection, it becomes clear that the token-level draft selection is the optimal solution for optimal transport with indicator cost function and also related to the problem of maximal coupling . Based on the connection to optimal transport, we show that one can further speed up the decoding by parallelizing along the batch axis by using multiple drafts from the draft model (Section 5).

More precisely, we formulate the token-level draft selection problem as a discrete optimal transport problem with membership cost, which is referred to as OTM. Discrete optimal transport can be solved with a linear program, but the number of variables is exponential in batch size, which can be prohibitive. To address this, we propose a valid transport plan that can be efficiently computed. Moreover, it achieves a $(1-1/e)$ -approximation of the optimal acceptance probability (Section 6).

With the theoretically motivated algorithms and guarantees, we circle back to speeding up decoding and propose a new algorithm called SpecTr and theoretically show that it can be used to derive valid sequences from the large model with better speedups (Section 7). See Fig. 2 for an illustration of SpecTr. Compared to speculative decoding (Fig. 1), the main difference lies in the number of drafts sampled from the small model and the selection algorithm that selects a valid sequence from multiple draft sequences. We remark here that the latter requires completely new statistical tools, and the connection between the token-level draft selection and OTM is critical for obtaining valid transport plans with good guarantees. We view this as one of the main contributions of the work. Similar to speculative decoding, there is provably no degradation in the quality of the decoded output compared to that of the large model.

We then experimentally demonstrate the benefit of our approach on standard datasets (Section 8). More precisely, we show that for state-of-the-art large language models, SpecTr achieves a wall clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.

Token-level draft selection and optimal transport

In this section, we focus on the draft selection step of SpecTr. We start by considering the case when $L=1$, which is a token-level draft selection problem. In particular, given context $x^{t}$, let $X_{1},\ldots,X_{k}$ be a collection of draft tokens sampled from the small model, e.g., sampled i.i.d. from ${\cal M}_{s}(\cdot\mid x^{t})$. Note that by our assumption of the computation model, we could compute the following conditional probabilities from the large model in parallel (along time and batch axes):

$${\cal M}_{b}(\cdot\mid x^{t})\quad\text{and}\quad\forall i\in\{1,\ldots,k\},\ {\cal M}_{b}(\cdot\mid x^{t},X_{i}).$$

The goal of the draft selection algorithm $f:\Omega^{k}\rightarrow\Omega$ is to output $Y=f(X^{k})$ , whose distribution follows ${\cal M}_{b}(\cdot\mid x^{t})$ , and hence is a valid sample from the large model. Moreover, when $Y\in\{X_{1},\ldots,X_{k}\}$ , we could sample an extra token from ${\cal M}_{b}(\cdot\mid x^{t},Y)$ without calling ${\cal M}_{b}$ since we have already computed the conditional probabilities ${\cal M}_{b}(\cdot\mid x^{t},Y)$ . Hence we would like to maximize the probability that we accept one token from the set of drafts.

When $L>1$, the drafts are sequences sampled from ${\cal M}_{s}$, and a sequence of token-level draft selection algorithms can be used along the time axis to select a valid sequence from ${\cal M}_{b}$. See an example in Fig. 3. The full details of the sequence-level selection algorithm are provided in Section 7.

The remainder of the section focuses on the token-level draft selection problem. From the above discussion, there are two main goals of the draft selection problem.

Validity. The output token is always a valid token from the large model i.e., its distribution follows the conditional probability of the large model. This guarantees that there is no quality degradation compared to the large model.

Maximizing acceptance. The higher the probability that we accept a draft token, the more serial computation we can save through parallelization, and hence better speedup.

Before proposing our framework to achieve the above goals, we would like to first discuss the technical challenge of draft selection with multiple draft tokens. One attempt is to sequentially apply the acceptance phase of Algorithm 1 (line 3 - 5) to each draft token $X_{i}$ with $p={\cal M}_{s}(\cdot\mid x^{t})$ and $q={\cal M}_{b}(\cdot\mid x^{t})$ . However, this approach would not guarantee that the final accepted token is from the desired distribution. To see this, consider the example of $p=\text{Ber}(1)$ and $q=\text{Ber}(1/2)$ . $\text{Ber}(b)$ denotes a Bernoulli distribution with the probability of seeing a head $b$ . Then we have $\forall i=1,\ldots,k,$ $X_{i}=1$ and each of them will be accepted with probability $1/2$ . After applying Algorithm 1 to all $X_{i}$ ’s, the probability of getting a $1$ will be at least $1-1/2^{k}$ and hence the output distribution would not be $\text{Ber}(1/2)$ for $k>1$ . Therefore the algorithm does not produce valid samples, which is a requirement of the draft selection problem.
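The counterexample above can be checked with a short exact computation. The sketch below assumes that, after all $k$ drafts are rejected, the naive procedure falls back to the residual distribution of Algorithm 1, which for $p=\text{Ber}(1)$ and $q=\text{Ber}(1/2)$ places all its mass on $0$. The function name is hypothetical.

```python
def naive_output_prob_of_one(k):
    """P(output = 1) when the acceptance test of the single-draft algorithm
    is applied to each of k drafts in turn, for p = Ber(1) and q = Ber(1/2)."""
    p_one, q_one = 1.0, 0.5
    accept = min(1.0, q_one / p_one)   # every draft X_i = 1 is accepted w.p. 1/2
    all_rejected = (1.0 - accept) ** k
    # Assumed fallback: the residual, proportional to q - min(p, q), is a point
    # mass on 0, so the output is 1 only when some draft is accepted.
    return 1.0 - all_rejected
```

For $k=1$ this returns $0.5$, matching $q$, while for $k=3$ it returns $0.875$, so the naive procedure over-samples the token favored by the draft model.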

In this work, we conduct a principled investigation of the draft selection problem, and show that these two main goals could be captured by the framework of optimal transport with a properly defined cost function. Next we define optimal transport formally and then connect it to draft selection with one draft. The generalization to multiple drafts is provided in Section 5.

To simplify notations, we assume $\Omega$ is a discrete domain.

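For two probability distributions $P$ over ${\cal X}$ and $Q$ over ${\cal Y}$, we say a joint distribution $\pi$ supported over ${\cal X}\times{\cal Y}$ is a coupling between $P$ and $Q$ if $\forall x,y,\pi(x,y)\geq 0$,

$$\forall y\in{\cal Y},\ \sum_{x\in{\cal X}}\pi(x,y)=Q(y),\quad\text{and}\quad\forall x\in{\cal X},\ \sum_{y\in{\cal Y}}\pi(x,y)=P(x).$$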

We use $\Pi(P,Q)$ to denote the set of all possible couplings between $P$ and $Q$ .

When it is clear from context, we will overload notation and refer to the probabilistic mapping $f_{\pi}:{\cal X}\rightarrow{\cal Y}$ introduced by the conditional probability $\pi(y\mid x){:=}\pi(x,y)/P(x)$ as a coupling, which is also referred to as a transport plan from $P$ to $Q$ . In this paper, we will set $P$ to be the distribution of the draft tokens and $Q$ to be the target distribution of the output token. In this case, the $f_{\pi}$ is a valid draft selection algorithm. Formally, this is stated in the claim below.

For all $\pi\in\Pi(P,Q)$ , let $f_{\pi}$ be the probabilistic mapping defined above . If $X\sim P$ , then $f_{\pi}(X)\sim Q$ .

In this paper, we will design selection algorithms by finding valid couplings between the draft distribution and target distribution to guarantee validity of the output tokens.

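For a cost function $c:{\cal X}\times{\cal Y}\rightarrow\mathbb{R}_{+}$, the transportation cost of a coupling $\pi$ is $C(\pi)=\mathbb{E}_{(X,Y)\sim\pi}[c(X,Y)]$. The optimal transport plan is the coupling $\pi\in\Pi(P,Q)$ that minimizes the transportation cost.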

Speculative decoding with one draft token.

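With these definitions in place, we can see that with ${\cal X}={\cal Y}=\Omega$, the domain of the tokens, and $P=p,Q=q$, we recover the speculative decoding objective with one draft token using the indicator cost, which captures the resampling cost, defined below:

$$\forall x,y\in\Omega,\quad c(x,y)=\mathbb{1}\{y\neq x\}.$$

The transportation cost of a coupling $\pi$ is then $\Pr_{(X,Y)\sim\pi}(Y\neq X)$, and the optimal transport cost is

$$\min_{\pi\in\Pi(p,q)}\Pr_{(X,Y)\sim\pi}(Y\neq X)=1-\sum_{x\in\Omega}\min(p(x),q(x)),$$

which is achieved by the maximal coupling between $p$ and $q$ stated in Algorithm 1. Hence speculative decoding achieves the optimal cost with one draft token.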

Optimal transport with multiple draft tokens

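In this section, we generalize token-level selection to multiple drafts. Let ${\cal X}=\Omega^{k}$ be the space of $k$ draft tokens and ${\cal Y}=\Omega$ be the space of the output token. To characterize the resampling cost with multiple draft tokens, we use the membership cost, defined below:

$$\forall x\in\Omega^{k},y\in\Omega,\quad c(x,y)=\mathbb{1}\{y\notin S(x)\},$$

where $S(x)=\{o\in\Omega\mid o\text{ appears in }x\}$ denotes the set of distinct elements in $x$. When $k=1$, it recovers the indicator cost mentioned before. The transportation cost of the coupling is

$$C(\pi)=\Pr_{(X,Y)\sim\pi}(Y\notin S(X)).$$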

From now on we will use membership cost as the default cost function and refer to the optimal transport solution as optimal transport with membership cost (OTM). We use $\pi^{*}$ to denote the coupling that minimizes this cost, $\pi^{*}=\arg\min_{\pi\in\Pi(P,Q)}C(\pi)$, and the cost $C(\pi^{*})$ is referred to as the optimal transport cost between $P$ and $Q$. (The existence of an optimal coupling in a discrete domain is well known. When the optimal coupling is not unique, we use $\pi^{*}$ to denote one of the optimal couplings.) We use $\alpha(P,Q)=1-C(\pi^{*})$ to denote the corresponding optimal acceptance probability.

In this paper, we will mainly focus on the case when the draft tokens are i.i.d. samples from a base distribution. (The above generic formulation immediately allows generalization to more complex draft selection strategies, such as sampling $k$ tokens without replacement, or using a different drafting distribution for each draft.) Let $p,q$ be supported over $\Omega$ and the goal is to obtain one valid token from $q$ given $k$ i.i.d. samples from $p$. For SpecTr with context $x^{t}$, we have $p={\cal M}_{s}(\cdot\mid x^{t})$ and $q={\cal M}_{b}(\cdot\mid x^{t})$. We set $P=p^{\otimes k}$, a product distribution whose marginals are all $p$, and $Q=q$. The OT problem we want to solve is the following:

$$\min_{\pi}\ C(\pi)\quad\text{s.t. }\pi\in\Pi(p^{\otimes k},q).\qquad(3)$$

We overload notation and denote the optimal acceptance probability as $\alpha_{k}(p,q){:=}\alpha(p^{\otimes k},q)=1-C(\pi^{*})$ . To better understand the quantity, we state a few properties about $\alpha_{k}$ .

(Appendix A.2) The optimal acceptance probability satisfies the following properties.

Monotonicity. For any $p,q$ and $k\geq 1$ , $\alpha_{k}(p,q)\leq\alpha_{k+1}(p,q)$ .

The above properties demonstrate that for a large $k$ , the value of $\alpha_{k}$ can become large. Hence increasing $k$ could increase the acceptance probability, leading to further speedups. We now focus on computing the optimal transport plan and the optimal acceptance probability.

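OTM via Linear programming. Optimal transport in discrete domain has been studied extensively, and it is shown that the optimal transport problem is equivalent to the following linear programming problem:

$$\min_{\pi}\ \sum_{x^{k}\in\Omega^{k}}\sum_{y\in\Omega}\pi(x^{k},y)\,\mathbb{1}\{y\notin S(x^{k})\}\qquad(4)$$

$$\text{s.t. }\ \pi(x^{k},y)\geq 0\ \ \forall x^{k},y;\qquad\sum_{y\in\Omega}\pi(x^{k},y)=\prod_{i=1}^{k}p(x_{i})\ \ \forall x^{k};\qquad\sum_{x^{k}\in\Omega^{k}}\pi(x^{k},y)=q(y)\ \ \forall y.$$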

The linear program in (4) has $|\Omega|^{k+1}$ variables and $|\Omega|^{k}+|\Omega|$ equality constraints (see Definition 1). Linear programming can be solved in time polynomial in the number of variables and constraints, implying the following lemma. (To the best of our knowledge, the best practical computation bound, through the interior-point method, is $O(|\Omega|^{3k})$ and the best theoretical computation bound is $O(|\Omega|^{2.5k})$.)

Given $p,q$ over $\Omega$ , the solution to Eq. 3 can be computed in time $O(|\Omega|^{O(k)})$ .
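To get a feel for why this runtime is prohibitive, the variable and constraint counts quoted above can be evaluated directly. The helper below is a hypothetical illustration and only restates the counts $|\Omega|^{k+1}$ and $|\Omega|^{k}+|\Omega|$.

```python
def otm_lp_size(vocab_size, k):
    """Number of variables and equality constraints of the OTM linear program
    for a vocabulary of the given size and k draft tokens."""
    variables = vocab_size ** (k + 1)                # one pi(x^k, y) per (x^k, y)
    equality_constraints = vocab_size ** k + vocab_size  # marginals on x^k and on y
    return variables, equality_constraints
```

For example, assuming a vocabulary of 32,000 tokens (a value chosen only for illustration) and $k=2$, `otm_lp_size(32000, 2)` reports $32000^{3}\approx 3.3\times 10^{13}$ variables, which is far beyond what a generic linear programming solver can handle per decoding step.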

We refer to the optimal coupling obtained above as OTM- $k$ and denote it as $\pi^{{\rm OTM-}k}$ . When $k=1$ , there is a closed form expression for the optimal acceptance cost (see Eq. 1), whereas for larger values of $k$ , we are unaware of a general closed form expression. In Section A.1, we provide an information-theoretic upper (and lower) bound, which is tight up to a multiplicative constant of $1-(1-1/k)^{k}\geq 1-1/e$ .

While solving OTM in Eq. 4 gives the plan with optimal acceptance probability, to the best of our knowledge, the best-known runtime will be exponential in $k$, which can be prohibitive when either the vocabulary size $|\Omega|$ or the number of draft tokens $k$ is large. (For discrete OT, the Sinkhorn algorithm could be used to solve an entropy-regularized version of OT, which has a better computation complexity. However, the computation cost of the algorithm will still have a linear dependence on $|\Omega|^{k}$, which can be prohibitive.) In the next section, we will present a selection algorithm that can be efficiently computed and show that it achieves an acceptance probability of at least $(1-(1-1/k)^{k})\alpha_{k}\geq(1-1/e)\alpha_{k}$.

Draft selection via $k$-sequential selection

In this section, we present a sequential selection algorithm (k-Seq), an approximate solution to the optimal transport problem in Eq. 3, which can be efficiently computed in time almost linear in $|\Omega|$ and logarithmic in $k$. (Note here that the solution still satisfies the constraints in Eq. 3, and hence is a valid transport plan. The term approximate here means that the solution is not the exact minimizer of the cost in Eq. 3.) The algorithm is presented in Algorithm 2.

At a high level, the algorithm goes over all $k$ draft samples generated from $p$ sequentially, and decides whether to accept each $X_{i}$ based on the ratio $q(X_{i})/p(X_{i})$. The algorithm outputs the first accepted sample, or a sample from a residual distribution $p^{\text{res}}$ if none of the samples is accepted. To guarantee that the final returned token is a valid sample from $q$, we choose an appropriate $\rho\in[1,k]$ and accept $X_{i}$ with probability $\min(1,q(X_{i})/(\rho\cdot p(X_{i})))$ instead of $\min(1,q(X_{i})/p(X_{i}))$ as in Algorithm 1. In Theorem 1, we show that with an appropriately chosen $\rho$, Algorithm 2 is indeed a valid transport plan from $p^{\otimes k}$ to $q$. Moreover, to find the best transport plan within the family, we only need to search over a single parameter $\rho$, which reduces the computation cost significantly. We also show that searching over this sub-family of couplings decreases the optimal acceptance probability by at most a multiplicative constant. The performance of Algorithm 2 is stated in Theorem 1.

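Let $\beta_{p,q}(\rho)=\sum_{x\in\Omega}\min\bigl(p(x),\frac{q(x)}{\rho}\bigr)$ and $\rho^{*}$ be the solution to the identity below.

$$1-(1-\beta_{p,q}(\rho))^{k}=\rho\,\beta_{p,q}(\rho).$$

When $\rho\geq\rho^{*}$, the coupling $\pi^{\textsc{k-Seq}}_{\rho}$ in Algorithm 2 is a valid transport plan from $p^{\otimes k}$ to $q$. When $\rho=\rho^{*}$, we have

$$\alpha(\pi^{\textsc{k-Seq}}_{\rho^{*}})\geq(1-(1-1/k)^{k})\,\alpha_{k}(p,q)\geq(1-1/e)\,\alpha_{k}(p,q).$$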

Moreover, $\rho^{*}$ can be computed up to accuracy $\delta$ in time $O(|\Omega|\log((k-1)/\delta))$ .
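The sequential selection rule and the search for $\rho^{*}$ can be sketched as follows. This is a hedged illustration rather than the paper's Algorithm 2: it assumes that $\rho^{*}$ is the root in $[1,k]$ of $\rho\,\beta_{p,q}(\rho)=1-(1-\beta_{p,q}(\rho))^{k}$, and it derives the residual distribution from the requirement that the output follow $q$ (the mass that sequential acceptance places on token $x$ is $\min(p(x),q(x)/\rho)\cdot p_{\text{acc}}/\beta_{p,q}(\rho)$ with $p_{\text{acc}}=1-(1-\beta_{p,q}(\rho))^{k}$). Function names are hypothetical.

```python
import random

def beta(p, q, rho):
    """beta_{p,q}(rho) = sum_x min(p(x), q(x) / rho)."""
    return sum(min(pi, qi / rho) for pi, qi in zip(p, q))

def find_rho_star(p, q, k, delta=1e-9):
    """Bisection on [1, k] for the root of rho * beta - (1 - (1 - beta)^k).
    The gap is non-decreasing in rho, non-positive at 1, non-negative at k."""
    def gap(rho):
        b = beta(p, q, rho)
        return rho * b - (1.0 - (1.0 - b) ** k)
    lo, hi = 1.0, float(k)
    while hi - lo > delta:
        mid = 0.5 * (lo + hi)
        if gap(mid) >= 0.0:
            hi = mid
        else:
            lo = mid
    return hi  # upper end of the bracket, so the returned rho is >= rho*

def k_seq_plan(p, q, k, rho):
    """Acceptance probability and (assumed) residual distribution for a given rho."""
    b = beta(p, q, rho)
    p_acc = 1.0 - (1.0 - b) ** k
    scale = p_acc / b if b > 0.0 else 0.0
    # Mass still owed to each token after the sequential acceptance phase.
    res = [max(0.0, qi - min(pi, qi / rho) * scale) for pi, qi in zip(p, q)]
    total = sum(res)
    if total > 0.0:
        res = [r / total for r in res]
    return p_acc, res

def k_seq_select(p, q, drafts, rho, rng=random):
    """Return the first accepted draft, or a residual sample if all are rejected."""
    _, res = k_seq_plan(p, q, len(drafts), rho)
    for x in drafts:
        if rng.random() < min(1.0, q[x] / (rho * p[x])):
            return x
    return rng.choices(range(len(q)), weights=res)[0]
```

Each bisection step costs one pass over the vocabulary, which is consistent with the $O(|\Omega|\log((k-1)/\delta))$ bound stated above. Choosing $\rho$ below $\rho^{*}$ makes the accepted mass on some token exceed $q$, so no residual distribution can repair the output law.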

We provide the proof in Section C.1. In Appendix B, using a few canonical examples of distributions, we plot the acceptance probability of k-Seq and compare it with the optimal acceptance probability $\alpha_{k}$ . It can be shown that k-Seq could have a strictly worse acceptance probability compared to the OTM solution for certain cases while there also exist non-trivial cases where k-Seq achieves the optimal acceptance probability.

A concurrent and recent work has proposed another efficient algorithm for the draft selection phase. To the best of our knowledge, there is no optimality guarantee proved for their proposed algorithm. In Section B.3, we present its acceptance probability empirically for the canonical case of Bernoulli distributions, and show that both our proposed algorithms (OTM and k-Seq) have a higher acceptance probability.

SpecTr: Application of OTM in autoregressive sampling

In this section, we describe how OTM can be used to speed up auto-regressive sampling, which we refer to as SpecTr sampling. Similar to speculative decoding, each iteration of SpecTr can be decomposed into three phases (Fig. 2):

Draft set construction. Given current context $x^{t}$, use the draft model to sample a set of $K$ draft sequences with length $L$, denoted by $S=\{z^{L}\sim{\cal M}_{s}(\cdot\mid x^{t})\}$. We keep the conditional probabilities ${\cal M}_{s}(y\mid x^{t},z^{i})$ for all $y\in\Omega,i\leq L$ and $z^{L}\in S$.

Conditional probability computation. Compute the conditional probabilities on the next token for the large model ${\cal M}_{b}(y\mid x^{t},z^{i})$ for all $y\in\Omega,i\leq L$ and $z^{L}\in S$ in parallel.

Draft selection. Select first $L^{\prime}$ of the $L$ tokens and set $x(t+i)=z(i)$ for $i\leq L^{\prime}$ and some $z\in S$ given the set of draft sequences and the conditional probabilities from both models. Sample a token from a residual distribution as a correction to the rejected tokens.

The conditional probability computation step takes $O(1)$ time when $|S|$ is not large, based on our simplified computation model. We mainly focus on the draft set construction phase and draft selection phase.

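Draft set with i.i.d. draft sequences. Given context $x^{t}$, a natural way to come up with a set of $K$ drafts is to independently sample $K$ draft sequences from ${\cal M}_{s}(\cdot\mid x^{t})$, i.e.,

$$S=\{z_{1}^{L},\ldots,z_{K}^{L}\},\quad z_{j}^{L}\overset{\text{i.i.d.}}{\sim}{\cal M}_{s}(\underbrace{\cdot,\cdot,\ldots,\cdot}_{L\text{ dots}}\mid x^{t}).\qquad(7)$$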

The draft set construction method in (7) can be generalized to a prefix-tree based algorithm. However, this generalized version did not perform better in our experiments. We include this construction in Appendix D for completeness.

Draft selection with multiple candidates. We present the sequence-level selection algorithm given a set of draft sequences in Algorithm 3. We assume the conditional probabilities on the next token are available given any prefix in the candidate set since they are computed in parallel in the second phase, and won’t list them as inputs explicitly in Algorithm 3.

A sample run of the algorithm is presented in Fig. 3. The algorithm proceeds in a recursive fashion. Given prompt $x^{t}$ and a candidate set $S$ sampled from ${\cal M}_{s}(\cdot\mid x^{t})$, the algorithm first computes a token-level draft selection algorithm $f_{\pi}:\Omega^{|S|}\rightarrow\Omega$ which is a transport plan from ${\cal M}_{s}(\cdot\mid x^{t})^{\otimes|S|}$ to ${\cal M}_{b}(\cdot\mid x^{t})$. Then $f_{\pi}$ is applied to the set of first tokens of the draft sequences in $S$ to obtain a valid token $Y$ from ${\cal M}_{b}(\cdot\mid x^{t})$. If $Y$ is not the last token ($L\geq 2$), we filter out sequences in $S$ whose first token is not $Y$, denote the remaining sequences as $S_{\rm next}$, and feed it to the algorithm with context $(x^{t},Y)$ and draft length $L-1$. This goes on until we have $L=1$ or $S_{\rm next}=\emptyset$.

In the case when $Y$ is the last token (i.e., $L=1$) and $Y\in S$, we have the choice to sample an additional token from ${\cal M}_{b}(\cdot\mid(x^{t},Y))$ since this conditional probability is already computed in the second phase. Due to the property of the token-level selection algorithms and the autoregressive structure of language models, it can be shown that $Y$ is always a valid sample from ${\cal M}_{b}(\cdot\mid x^{t})$. Let $L^{\prime}$ be the number of decoded tokens in one iteration. Note that this is a random variable in the range $[1,L+1]$.
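The recursive procedure described above can be sketched as a loop over draft positions. The code below is an illustration under several assumptions, not the paper's Algorithm 3: `small_probs` and `big_probs` are hypothetical callables that map a token prefix to a next-token distribution (in SpecTr the large-model probabilities would already be available from the parallel second phase), and the token-level step uses sequential selection with the conservative choice $\rho=k$ and a residual derived from the requirement that the output follow the large model.

```python
import random

def k_seq_token(p, q, tokens, rng):
    """Token-level sequential selection with rho = k (an assumption of this sketch)."""
    k = len(tokens)
    rho = float(k)
    b = sum(min(pi, qi / rho) for pi, qi in zip(p, q))
    p_acc = 1.0 - (1.0 - b) ** k
    for x in tokens:
        if rng.random() < min(1.0, q[x] / (rho * p[x])):
            return x  # first accepted draft token
    # All drafts rejected: sample from the (unnormalized) residual weights.
    scale = p_acc / b if b > 0.0 else 0.0
    res = [max(0.0, qi - min(pi, qi / rho) * scale) for pi, qi in zip(p, q)]
    return rng.choices(range(len(q)), weights=res)[0]

def select_sequence(context, drafts, small_probs, big_probs, rng=random):
    """Select a valid continuation from equal-length draft sequences.

    drafts: list of token tuples sampled i.i.d. from the small model.
    Returns the newly decoded tokens (between 1 and L + 1 of them)."""
    decoded = []
    remaining = list(drafts)
    for i in range(len(drafts[0])):
        prefix = list(context) + decoded
        first_tokens = [z[i] for z in remaining]
        y = k_seq_token(small_probs(prefix), big_probs(prefix), first_tokens, rng)
        decoded.append(y)
        # Keep only the drafts that agree with the selected token.
        remaining = [z for z in remaining if z[i] == y]
        if not remaining:
            return decoded  # y was a correction token; stop this iteration
    # Every position matched some draft: sample one extra token from the large model.
    q = big_probs(list(context) + decoded)
    decoded.append(rng.choices(range(len(q)), weights=q)[0])
    return decoded
```

A tuned implementation would pick $\rho$ closer to $\rho^{*}$ at each position to raise the acceptance probability; $\rho=k$ is used here only to keep the sketch short.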

The formal quality guarantee is stated in Theorem 2. We present the proof in Section C.2.

Assume all drafts in the set $S$ are generated from the small model with input $x^{t}$, or more precisely, $\forall z\in S,$

$$z\sim{\cal M}_{s}(\underbrace{\cdot,\cdot,\ldots,\cdot}_{L\text{ dots}}\mid x^{t}).$$

Let $(x^{t},Y^{\tau})$ be the output of Algorithm 3 where $\tau$ is the length of the newly decoded tokens, then it satisfies that $Y^{1:\tau}$ is distributed according to ${\cal M}_{b}(\underbrace{\cdot,\cdot,\ldots,\cdot}_{\tau\text{ dots}}\mid x^{t})$. More precisely, for any $\tau_{0}\in[1,L+1]$ and any $\tau_{0}$-length sequence $o^{\tau_{0}}=(o(1),\ldots,o(\tau_{0}))\in\Omega^{\tau_{0}}$, we have

Experiments

We empirically evaluate SpecTr and compare it with two methods: (1) the baseline autoregressive decoding; and (2) speculative decoding with $K=1$. Note that all three methods effectively generate samples from the same baseline large model, and hence the quality of the two speculative decoding methods is provably neutral to that of the large model. Thus, we will only focus on measuring the speedup in our experiments. In the simplified computation model, we made the following assumptions: (1) Decoding time from the small model is negligible compared to decoding from the large model; (2) Parallelization along the batch and time axis doesn’t increase the time for a serial call to the large model. With these, the theoretical speedup compared to baseline decoding will be the average number of decoded tokens per serial call, which is called block efficiency, defined below

$$\text{Block efficiency}=\frac{\text{Total number of decoded tokens}}{\text{Number of serial calls to }{\cal M}_{b}}.$$

However, in real deployment of the SpecTr algorithm, the actual end-to-end (wall clock) speedup is further impacted by the following aspects. (1) The decoding time for ${\cal M}_{s}$ might not be negligible; (2) Parallelization along the batch and time axis might increase the time for a single call to ${\cal M}_{b}$ ; (3) Overhead due to the implementation of additional functionalities in SpecTr such as the draft selection algorithm and switching between models. These factors will depend on how the algorithm is implemented and optimized. In our experiment, we consider both the block efficiency, and average wall clock speedup with our implementation of SpecTr.

We first present the performance of our algorithm and compare it to speculative decoding using state-of-the-art PALM-2 models with prompts from the one-billion language benchmark (LM1B) . In Appendix E, we use a pair of smaller transformer models to break down different affecting factors mentioned above.

In Table 1, we use PALM-2-Gecko and PALM-2-Bison as the small model and large model, respectively. The wall clock speedup is normalized by the wall clock latency of baseline autoregressive decoding. The time we log includes all the above-mentioned aspects. In the considered parameter configurations, the wall clock speedup increases as $K$ and $L$ increase. As seen from the table, the actual wall clock speedup is smaller than the theoretical speedup of block efficiency, which is consistent with what we expected. Importantly, the benefit from SpecTr outweighs these overheads. In particular, when $L=8$ and $K=8$, our proposed SpecTr algorithm has a speedup of 2.13x, a further 1.37x increase compared to speculative decoding ($K=1$).

Acknowledgements

Authors thank Asaf Aharoni, Kwangjun Ahn, Badih Ghazi, Sanjiv Kumar, Teodor Marinov, Michael Riley, and NeurIPS reviewers for helpful comments and discussions.

Appendix A Properties of optimal transport cost

Below we provide an information-theoretic upper (and lower) bound in Lemma 3, which is tight up to a multiplicative constant of $1-(1-1/k)^{k}\geq 1-1/e$ . The proof is presented in Section A.3. For the case of $k=1$ , the upper bound matches the optimal acceptance probability.

For any two distributions $p,q$ and $\forall k\geq 1$ , we have

In Appendix B, we plot $\alpha_{k}$ as a function of $k$ for a few simple pairs of $(p,q)$ ’s as illustrative examples. We note that the upper bound in Lemma 3 is tight for examples considered in Appendix B.

A.2 Proof of Lemma 1

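We first prove monotonicity. By definition,

$$\alpha_{k}(p,q)=1-\min_{\pi\in\Pi(p^{\otimes k},q)}\Pr_{(X^{k},Y)\sim\pi}(Y\notin S(X^{k}))=\max_{\pi\in\Pi(p^{\otimes k},q)}\Pr_{(X^{k},Y)\sim\pi}(Y\in S(X^{k})).$$

Moreover, for any $\pi\in\Pi(p^{\otimes{k}},q)$, we can construct $\pi^{\prime}\in\Pi(p^{\otimes{k+1}},q)$ by setting

$$\pi^{\prime}(x^{k+1},y)=\pi(x^{k},y)\cdot p(x_{k+1}),$$

i.e., adding an independent sample from $p$ to $X^{k}$. Since $S(X^{k})\subseteq S(X^{k+1})$, the acceptance probability under $\pi^{\prime}$ is at least that under $\pi$, and hence $\alpha_{k}(p,q)\leq\alpha_{k+1}(p,q)$.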

Next we prove consistency. We start with the case when $\forall x\in\Omega,q(x)/p(x)<\infty$. To prove this, we will show that Algorithm 2 with $\rho_{\max}=\max_{x\in\Omega}q(x)/p(x)$ satisfies

$$\lim_{k\rightarrow\infty}\alpha(\pi_{\rho_{\max}}^{\textsc{k-Seq}})=1.$$

Since $\alpha(\pi_{\rho_{\max}}^{\textsc{k-Seq}})\leq\alpha_{k}(p,q)$, the above equation implies $\lim_{k\rightarrow\infty}\alpha_{k}(p,q)=1$. Notice that by Lemma 4 and Theorem 1, $\pi_{\rho_{\max}}^{\textsc{k-Seq}}$ is a valid coupling, and

$$\alpha(\pi_{\rho_{\max}}^{\textsc{k-Seq}})=1-(1-\beta_{p,q}(\rho_{\max}))^{k},$$

where $\beta_{p,q}(\rho_{\max})=\sum_{x\in\Omega}\min(p(x),\frac{q(x)}{\rho_{\max}})\geq 1/\rho_{\max}>0$. Taking $k\rightarrow\infty$ concludes the proof.

For the case when $q(x)/p(x)$ is unbounded, there exists $x\in\Omega$ such that $q(x)>0$ and $p(x)=0$ . Let

Let $x_{0}$ be such that $p(x_{0})>0$ . We define $q^{\prime}$ such that

Then we have $d_{\rm TV}(q,q^{\prime})=p_{\rm off}$ , and hence by subadditivity of transport cost,

Moreover, we have $\forall x\in\Omega,q^{\prime}(x)/p(x)<\infty$ . Hence

A.3 Proof of Lemma 3

For the upper bound, it would be enough to show that for any $\pi\in\Pi(p^{\otimes k},q)$ , and any $\Omega_{0}\subset\Omega$ , we have

For the lower bound, we show that k-Seq achieves an acceptance probability of at least $(1-(1-1/k)^{k})\bar{\alpha}_{k}(p,q)$ , see Eq. 11, implying the lower bound guarantee.

We illustrate the acceptance probabilities for our proposed token-level selection algorithms using a few simple examples and plot them in Figures 4 and 5. The analysis for these simple distributions is presented in Sections B.1 and B.2.

Let $\text{Ber}(b)$ be a Bernoulli distribution with probability $b$ of getting a head. In Figure 4, we plot the acceptance probability comparison between OTM- $k$ and k-Seq for different Bernoulli distributions $q=\text{Ber}(b)$ as a function of $k$ when $p=\text{Ber}(0.25)$ . Note that when $p=q$ ( $b=0.25$ ), the acceptance probability is always one for both methods. When $p\neq q$ , the acceptance probabilities for both methods increase as $k$ increases before they reach one. When $b=0.1$ or $0.75$ , k-Seq has a worse acceptance probability compared to the OTM- $k$ algorithm. When $b=1$ , the two algorithms have the same performance.

Pairs of uniform distributions.

Let $U(d)$ denote a uniform distribution over $[d]$ . In Figure 5, we plot the optimal acceptance probability for different uniform distributions $q$ as a function of $k$ . For these distributions, it can be shown that k-Seq achieves the optimal acceptance probability $\alpha_{k}$ . Hence only $\alpha_{k}$ is plotted. Observe that all acceptance probabilities are monotonically increasing and tend to one when $k\to\infty$ , as stated in Lemma 1.

In this section, we provide a sketch of the optimal acceptance probability calculations for the results in Figures 4 and 5.

Here, with a slight abuse of notation, $p$ and $q$ denote the head probabilities of the two Bernoulli distributions. Consider the transport plan $\pi$ given by $\pi(1^{k},1)=\min(p^{k},q)$ , $\pi(1^{k},0)=p^{k}-\min(p^{k},q)$ , $\pi(0^{k},0)=\min((1-p)^{k},1-q)$ , and $\pi(0^{k},1)=(1-p)^{k}-\min((1-p)^{k},1-q)$ . It can be checked that this is a valid transport plan. To see that this matches the upper bound on the optimal cost from Lemma 3, notice that

If $p^{k}\leq q$ and $(1-p)^{k}\leq 1-q$ , then the above equation simplifies to $1$ and (10) also simplifies to $1$ . If $p^{k}>q$ and $(1-p)^{k}\leq 1-q$ , then the above equation simplifies to $1+q-p^{k}$ and (10) also simplifies to the same quantity. Similarly, the proof applies for $p^{k}\leq q$ and $(1-p)^{k}>1-q$ .
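The case analysis above can be reproduced with a few lines of Python. This is a minimal sketch, assuming (as in the transport plan above) that a rejection can only occur when all $k$ drafts are identical, so that the acceptance probability is $1-\pi(1^{k},0)-\pi(0^{k},1)$. The function name is hypothetical, and `p` and `q` are the head probabilities of the draft and target distributions.

```python
def bernoulli_acceptance(p: float, q: float, k: int) -> float:
    """Acceptance probability of the plan pi for drafts Ber(p)^k and target Ber(q)."""
    # Mass of pi(1^k, 0): all drafts are heads but the output is a tail.
    reject_all_heads = p ** k - min(p ** k, q)
    # Mass of pi(0^k, 1): all drafts are tails but the output is a head.
    reject_all_tails = (1 - p) ** k - min((1 - p) ** k, 1 - q)
    return 1.0 - reject_all_heads - reject_all_tails


# Setting of the Bernoulli example: p = Ber(0.25), q = Ber(b).
for b in (0.1, 0.25, 0.75, 1.0):
    print(b, [bernoulli_acceptance(0.25, b, k) for k in (1, 2, 4, 8)])
```

For $k=1$ the function returns $1-|p-q|$, i.e., one minus the total variation distance between the two Bernoulli distributions.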

Figure 5: $p=U(d)$ and $q=U(d/r)$ .

We first prove $\alpha_{k}(U(d),U(d/r))\geq 1-(1-1/r)^{k}$ by a construction. Let $S(X^{k})$ be the set of unique symbols in $X^{k}$ . Consider the following transport plan: $Y$ is drawn uniformly from $S(X^{k})\cap[d/r]$ if this set is non-empty, and is otherwise a new uniform sample from $[d/r]$ . Observe that since $U(d)$ is uniform over $[d]$ , this is a valid transport plan and furthermore,

$$\Pr(Y\in S(X^{k}))=\Pr(S(X^{k})\cap[d/r]\neq\emptyset)=1-(1-1/r)^{k}.$$

The upper bound follows by setting $\Omega_{0}=[d]\setminus[d/r]$ in Lemma 3.
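The construction for the uniform case can be sanity-checked by simulation. The Python sketch below is illustrative only: it samples $k$ drafts from $U(d)$, applies the transport plan described above, and estimates the acceptance probability, which should be close to $1-(1-1/r)^{k}$. Symbols are indexed from $0$, so $[d/r]$ corresponds to $\{0,\ldots,d/r-1\}$; the function name and parameter values are hypothetical.

```python
import random


def simulate_uniform_plan(d: int, r: int, k: int, trials: int, seed: int = 0):
    """Estimate the acceptance rate and the output marginal of the plan."""
    rng = random.Random(seed)
    support_q = d // r                      # q = U(d/r) lives on {0, ..., d/r - 1}
    accepted = 0
    y_counts = [0] * support_q
    for _ in range(trials):
        drafts = {rng.randrange(d) for _ in range(k)}      # S(X^k)
        overlap = [x for x in drafts if x < support_q]     # S(X^k) within [d/r]
        if overlap:
            y = rng.choice(overlap)         # output is one of the drafts: accepted
            accepted += 1
        else:
            y = rng.randrange(support_q)    # fresh sample, not among the drafts
        y_counts[y] += 1
    return accepted / trials, [c / trials for c in y_counts]


rate, marginal = simulate_uniform_plan(d=12, r=3, k=4, trials=20000)
print("empirical acceptance:", rate, "closed form:", 1 - (1 - 1 / 3) ** 4)
print("empirical marginal of Y:", marginal)
```

The empirical marginal of $Y$ should be close to uniform over $[d/r]$, which reflects the validity of the transport plan.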

B.2 Acceptance probability of k-Seq for the example in Figure 5

In this section, we show that for the example in Figure 5, k-Seq achieves the optimal acceptance probability. In this case, $p=U(d)$ and $q=U(d/r)$ . Recall that the optimal acceptance probability is

Hence, solving $1-(1-\beta(\rho))^{k}=\rho\beta(\rho)$ gives $\rho^{*}=r(1-(1-1/r)^{k})$ . By Theorem 1, we have

And the equality holds since this is an upper bound for any coupling.
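The closed form for $\rho^{*}$ can be verified directly. The sketch below relies on our own simplification, derived from the definition of $\beta_{p,q}$: for $p=U(d)$ and $q=U(d/r)$, $\beta(\rho)=\sum_{x\in[d/r]}\min(1/d,\,r/(d\rho))=\min(1/r,1/\rho)$. The function names are hypothetical.

```python
def beta_uniform(rho: float, r: float) -> float:
    """beta(rho) for p = U(d), q = U(d/r); the value does not depend on d."""
    return min(1.0 / r, 1.0 / rho)


def rho_star_uniform(r: float, k: int) -> float:
    """Closed-form solution rho* = r * (1 - (1 - 1/r)^k)."""
    return r * (1.0 - (1.0 - 1.0 / r) ** k)


# Check that rho* solves 1 - (1 - beta(rho))^k = rho * beta(rho).
for r in (2, 3, 5):
    for k in (1, 2, 4, 8):
        rho = rho_star_uniform(r, k)
        b = beta_uniform(rho, r)
        print(r, k, rho, 1.0 - (1.0 - b) ** k, rho * b)
```

Since $\rho^{*}\leq r$, we have $\beta(\rho^{*})=1/r$, and both sides of the equation equal $1-(1-1/r)^{k}$.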

B.3 Comparison to multi-round rejection sampling in [21, 20]

In this section, we compare our proposed draft selection algorithms (OTM and k-Seq) to the multi-round rejection sampling algorithm (multi-round) in the concurrent and recent work of [21, 20] (see Algorithm 1 therein) using the example of Bernoulli distributions. As Figure 6 demonstrates, both our proposed algorithms outperform their algorithm. The advantage of OTM follows from the fact that it is the optimal algorithm under the validity guarantee of the final accepted token. Our proposed efficient algorithm k-Seq also outperforms multi-round for the considered examples. We leave a systematic comparison of the algorithms as future work.

Appendix C Analysis of SpecTr

We start by proving the following lemma on $\rho^{*}$ .

Let $\rho^{*}$ be the solution to Eq. 6, i.e., the root of $f(\rho):=1-(1-\beta_{p,q}(\rho))^{k}-\rho\beta_{p,q}(\rho)$ . Then when $d_{\rm TV}(p,q)\in(0,1)$ , we have:

$f(\rho)$ is monotone in $\rho$ in $[1,\infty)$ ;

$\rho^{*}\in\big{[}1,\min\{k,\max_{x}\frac{q(x)}{p(x)}\}\big{]}$ .

It suffices to prove the following: (1) $f(\rho)$ is monotone in $\rho$ in $[1,\infty)$ ; (2) $f(1)\geq 0$ ; (3) $f(k)\leq 0$ ; (4) $f\big{(}\max_{x}\frac{q(x)}{p(x)}\big{)}\leq 0$ .

To see (1), since $\beta_{p,q}(\rho)$ is decreasing in $\rho$ , so is $1-(1-\beta_{p,q}(\rho))^{k}$ . Moreover, $\rho\beta_{p,q}(\rho)=\sum_{x}\min\{\rho p(x),q(x)\}$ , which is non-decreasing in $\rho$ . Hence we have $1-(1-\beta_{p,q}(\rho))^{k}-\rho\beta_{p,q}(\rho)$ is decreasing.

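To see (2), note that when $\rho=1$ , $\beta_{p,q}(\rho)=1-d_{\rm TV}(p,q)$ . Hence we have

$$f(1)=1-d_{\rm TV}(p,q)^{k}-(1-d_{\rm TV}(p,q))=d_{\rm TV}(p,q)\big(1-d_{\rm TV}(p,q)^{k-1}\big)\geq 0.$$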

When $\rho=k$ , (3) holds since for $x\in[0,1]$ , we have $1-(1-x)^{k}\leq kx$ . Moreover, when $\rho=\max_{x}\frac{q(x)}{p(x)}>1$ , we have $\beta_{p,q}(\rho)=1/\rho$ and (4) holds since

$$f(\rho)=1-(1-1/\rho)^{k}-1=-(1-1/\rho)^{k}\leq 0.$$

Next we prove Theorem 1. We break the proof into four parts: (1) computation efficiency; (2) $\pi_{\rho}^{\textsc{k-Seq}}$ is a valid transport plan; (3) acceptance probability; (4) optimality guarantee of $\pi^{\textsc{k-Seq}}_{\rho^{*}}$ .

Note that Lemma 4 immediately implies that $\rho^{*}$ can be computed up to arbitrary accuracy $\delta$ in time $O(|\Omega|\log((k-1)/\delta))$ using binary search over $[1,k]$ .
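The binary search mentioned above can be sketched as follows. This is a minimal Python illustration, not the implementation used in our experiments: it assumes $f(\rho)=1-(1-\beta_{p,q}(\rho))^{k}-\rho\beta_{p,q}(\rho)$ (the function shown to be decreasing in the proof of Lemma 4), with $f(1)\geq 0$ and $f(k)\leq 0$. All function names are hypothetical.

```python
def beta(p, q, rho):
    """beta_{p,q}(rho) = sum_x min(p(x), q(x) / rho); costs O(|Omega|)."""
    return sum(min(px, qx / rho) for px, qx in zip(p, q))


def f(p, q, k, rho):
    """f(rho) = 1 - (1 - beta(rho))^k - rho * beta(rho)."""
    b = beta(p, q, rho)
    return 1.0 - (1.0 - b) ** k - rho * b


def solve_rho_star(p, q, k, delta=1e-9):
    """Binary search for the root of f on [1, k] up to accuracy delta."""
    lo, hi = 1.0, float(k)          # f(lo) >= 0 and f(hi) <= 0 by Lemma 4
    while hi - lo > delta:
        mid = 0.5 * (lo + hi)
        if f(p, q, k, mid) > 0:
            lo = mid                # root lies to the right of mid
        else:
            hi = mid                # root lies at or to the left of mid
    return hi                       # f(hi) <= 0, i.e. hi is at least rho*


# Example: p = U(12), q = U(4), i.e. r = 3, with k = 2 drafts.
p = [1.0 / 12] * 12
q = [0.25] * 4 + [0.0] * 8
print(solve_rho_star(p, q, k=2), 3 * (1 - (1 - 1 / 3) ** 2))
```

Returning the upper end of the final interval is a design choice: it guarantees $f(\rho)\leq 0$ for the returned value, which is the side on which the transport plan remains valid. Each iteration evaluates $\beta_{p,q}$ once, and the number of iterations is $\log_{2}((k-1)/\delta)$.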

Valid transport plan.

We next prove that $\pi^{\textsc{k-Seq}}_{\rho}$ is a valid transport plan when $\rho\geq\rho^{*}$ . By Lemma 4, $f$ is decreasing with $f(\rho^{*})=0$ , so when $\rho\geq\rho^{*}$ , we have $1-(1-\beta_{p,q}(\rho))^{k}\leq\rho\beta_{p,q}(\rho)$ . Recall that $p_{\rm acc}=1-(1-\beta_{p,q}(\rho))^{k}$ , and

this implies $p^{\text{res}}(x)\geq 0$ for all $x\in\Omega$ . Moreover,

Hence $p^{\text{res}}$ is a valid distribution. It remains to show that the marginal of $Y$ is $q$ . We first compute the probability of the output $Y=x$ . Note that the probability that $Y=X_{1}$ is

Acceptance probability.

It can be seen that $\beta(\rho)$ is decreasing in $\rho$ , and so is $1-(1-\beta_{p,q}(\rho))^{k}$ . Hence we have

The statement holds since $f(x)=\frac{1-(1-x)^{k}}{kx}$ is monotonically decreasing when $x\in(0,1/k]$ and $f(1/k)=1-(1-1/k)^{k},\lim_{x\rightarrow 0^{+}}f(x)=1$ .

Moreover, $\forall x\in[0,1],kx\geq 1-(1-x)^{k}$ . Hence we have

where the last inequality is due to the upper bound in Lemma 3 with $\Omega_{0}=\Omega$ .
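To make the validity argument concrete, the following Python sketch simulates a k-Seq style selection and checks empirically that the output marginal is $q$. It is a reconstruction under stated assumptions rather than a transcription of Algorithm 2: we assume each draft $X_{i}$ is accepted in turn with probability $\min(1,q(X_{i})/(\rho p(X_{i})))$ and that, if all drafts are rejected, $Y$ is drawn from the residual $p^{\text{res}}(x)\propto q(x)-\min\{p(x),q(x)/\rho\}\cdot p_{\rm acc}/\beta_{p,q}(\rho)$, which is the form implied by requiring the marginal of $Y$ to equal $q$. All names and the toy distributions are hypothetical.

```python
import random


def k_seq(p, q, drafts, rho, rng):
    """Return (Y, accepted) for drafts X_1..X_k drawn i.i.d. from p."""
    k = len(drafts)
    b = sum(min(px, qx / rho) for px, qx in zip(p, q))   # beta_{p,q}(rho)
    p_acc = 1.0 - (1.0 - b) ** k
    for x in drafts:
        # Accept the draft x with probability min(1, q(x) / (rho * p(x))).
        if rng.random() <= min(1.0, q[x] / (rho * p[x])):
            return x, True
    # All drafts rejected: sample from the (unnormalised) residual distribution.
    residual = [max(qx - min(px, qx / rho) * p_acc / b, 0.0)
                for px, qx in zip(p, q)]
    y = rng.choices(range(len(q)), weights=residual)[0]
    return y, False


def simulate(p, q, k, rho, trials, seed=0):
    """Empirical marginal of Y and fraction of trials in which a draft is accepted."""
    rng = random.Random(seed)
    counts = [0] * len(q)
    accepted = 0
    for _ in range(trials):
        drafts = rng.choices(range(len(p)), weights=p, k=k)
        y, ok = k_seq(p, q, drafts, rho, rng)
        counts[y] += 1
        accepted += ok
    return [c / trials for c in counts], accepted / trials


p = [0.5, 0.3, 0.2]
q = [0.1, 0.3, 0.6]
marginal, acc = simulate(p, q, k=3, rho=3.0, trials=20000)
print("marginal of Y:", marginal, "target q:", q)
print("acceptance:", acc, "p_acc:", 1 - (1 - 1 / 3) ** 3)
```

Here we use $\rho=k$, which satisfies $\rho\geq\rho^{*}$ by Lemma 4, so the residual weights are non-negative. The accepted fraction should be close to $p_{\rm acc}=1-(1-\beta_{p,q}(\rho))^{k}$.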

C.2 Proof of Theorem 2

We prove the theorem via induction. When $L=1$ , $\tau\in\{1,2\}$ . Let $k=|S|$ . Since, for the first step, $f_{\pi}$ in Algorithm 3 is a valid transport plan from ${\cal M}_{s}(\cdot\mid x^{t})^{\otimes k}$ to ${\cal M}_{b}(\cdot\mid x^{t})$ , we have $Y_{1}\sim{\cal M}_{b}(\cdot\mid x^{t})$ , which completes the proof when $\tau=1$ . When $\tau=2$ , we have $Y_{2}\sim{\cal M}_{b}(\cdot\mid x^{t},Y_{1})$ as stated in Step 5 of Algorithm 3. Hence the statement holds.

Note that in this case $\tau=\tau^{\prime}+1$ , and for any $(\tau_{0}+1)$ -length sequence $o^{\tau_{0}+1}=(o(1),\ldots,o(\tau_{0}),o(\tau_{0}+1))\in\Omega^{\tau_{0}+1}$ , we have

Combining the two cases, we complete the proof.

Appendix D Candidate set construction via a prefix-tree

As discussed in Section 1, the size of the draft set $S$ is constrained by the number of parallel computations that can be supported in the hardware. Hence it is important to design the draft set carefully to allow for a longer sequence of accepted candidate sets. In addition to the i.i.d. draft set selection approach listed in Section 7, we present an algorithm that samples a draft set that forms the leaves of a prefix tree. Given a draft set size $K$ , the algorithm can be specified by a sequence of parameters $(k_{1},k_{2},\ldots,k_{L})$ satisfying $\prod_{i=1}^{L}k_{i}=K$ .

The algorithm starts with a root node holding the sequence $x^{1:t}$ and forms a prefix tree of depth $L$ . At depth $i\in[1:L-1]$ , each node is expanded by a factor of $k_{i+1}$ , and each of its children contains a sequence that satisfies: (1) its prefix agrees with the sequence in the parent node; (2) its next token is sampled from the small model's conditional distribution given the prefix. These child nodes are at depth $i+1$ , and the process continues until it reaches depth $L$ . We give a detailed description of the algorithm in Algorithm 4.
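The prefix-tree construction described above can be sketched in Python as follows. This is a simplified illustration and not a transcription of Algorithm 4: the callable `sample_next`, which stands in for sampling one token from the small model given a prefix, and the function name `build_prefix_tree_drafts` are hypothetical. The root (the context) is expanded by $k_{1}$, and each subsequent level by the next factor.

```python
import random
from typing import Callable, List, Sequence


def build_prefix_tree_drafts(
    context: Sequence[int],
    branching: Sequence[int],
    sample_next: Callable[[List[int]], int],
) -> List[List[int]]:
    """Return the leaf sequences of a prefix tree with branching (k_1, ..., k_L)."""
    frontier = [list(context)]                  # the root holds the context
    for k_i in branching:                       # one tree level per factor k_i
        next_frontier = []
        for prefix in frontier:
            for _ in range(k_i):
                token = sample_next(prefix)     # draw the next token given the prefix
                next_frontier.append(prefix + [token])
        frontier = next_frontier
    return frontier                             # prod(k_i) leaves, each L tokens longer


# Toy stand-in for the small model: uniform over a 5-token vocabulary.
rng = random.Random(0)
leaves = build_prefix_tree_drafts([7, 3], branching=(2, 3, 1),
                                  sample_next=lambda prefix: rng.randrange(5))
print(len(leaves), leaves)
```

With branching factors $(2,3,1)$ the draft set has $K=6$ sequences. Sequences under the same depth-one node share their first drafted token, which is what allows shared prefixes to be reused when scoring with the large model.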

Appendix E Additional experiments

In this section, we perform a detailed investigation of different factors that affect the speed of SpecTr with smaller transformer models. We train decoder-only transformer models on the LM1B dataset based on the example provided in the FLAX library . For the draft model, we use transformer models with $2M$ , $6M$ and $20M$ parameters, and for the large model we use a $97M$ parameter transformer model.

We first provide a verification of the computational model introduced in Section 1 by reporting the latencies of using the large model to compute the probability distributions with parallelization over time and batch axes. As shown in Table 2, the latency stays roughly constant in these settings.

Similar to Table 2, we report relative latency when parallelizing across the time and batch axes using the small $6M$ draft model in Table 3. In Table 3, the reported relative latencies are relative to the large $97M$ model to get a sense of the relative cost of sampling multiple drafts with the small model compared to the large model.

To see how the size of the draft model affects the block efficiency, we also include results for varying draft model sizes with the same $97M$ large model for LM1B in Table 4. These draft models were produced by either halving ( $2M$ ) or doubling ( $20M$ ) the original $6M$ draft model’s number of layers, embedding dimension, MLP dimension, and number of attention heads. As expected, the larger draft models improve all speculative methods’ block efficiency with SpecTr maintaining the best performance across all draft model sizes.