Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui
Introduction
Large Language Models (LLMs) have achieved remarkable proficiency in a range of downstream tasks OpenAI (2023); Touvron et al. (2023a, b); Chiang et al. (2023); Jiang et al. (2023). They are progressively becoming the cornerstone of comprehensive API interfaces (e.g., ChatGPT, https://chat.openai.com) that offer services and guidance in everyday life. However, the inference latency of these sizable models has emerged as a substantial obstacle to their broader application. This latency primarily arises from the token-by-token generation required by autoregressive decoding, so inference latency grows with both the length of the generated sequence and the model’s scale.
To accelerate LLM inference, an innovative inference paradigm, Speculative Decoding, has been introduced Stern et al. (2018); Xia et al. (2022); Leviathan et al. (2023); Chen et al. (2023a); Miao et al. (2023). As demonstrated in Figure 1, Speculative Decoding first leverages a drafter model to efficiently decode multiple tokens as speculation of future decoding steps and then uses the target LLM to verify the drafted tokens in parallel. Only those tokens that meet the LLM’s verification criterion are accepted to guarantee high-quality outputs.
Speculative Decoding is founded upon two key observations about LLM inference: 1) many easy tokens can be predicted with less computation (e.g., using a smaller model), and 2) LLM inference is highly memory-bandwidth bound Patterson (2004), with the main latency bottleneck arising from memory reads and writes of LLM parameters rather than from arithmetic computation. Drawing on these observations, Speculative Decoding adapts the concept of speculative execution Burton (1985); Hennessy and Patterson (2012), an optimization technique used in computer architecture where tasks are performed in advance and subsequently verified for their necessity, thereby circumventing the delays inherent in sequential task execution. Speculative Decoding applies this idea by focusing the LLM’s computational effort on validating multiple pre-drafted tokens at once, which substantially reduces the number of memory read/write operations on LLM parameters and thereby improves inference efficiency.
While Speculative Decoding shows promise, it raises several critical questions that warrant further investigation. For instance, how should the drafter model be selected or designed to strike a balance between speculation accuracy and drafting efficiency Xia et al. (2022); Chen et al. (2023a); Santilli et al. (2023)? Additionally, it is essential to examine whether the verification criterion can maintain both generation diversity and output quality Miao et al. (2023); Spector and Re (2023). Furthermore, careful consideration should be given to closely aligning the prediction behavior of the drafter with that of the target LLM for higher speculation accuracy Zhou et al. (2023); Liu et al. (2023).
Amid the rapid expansion of research in Speculative Decoding, this work makes the first attempt to present a survey of this field, aiming to raise awareness within the academic community regarding the latest advancements. We provide a systematically organized categorization of current research and an in-depth analysis of relevant studies. Besides, we highlight the challenges and potential directions, hoping to serve as an essential guide for newcomers and to shed light on future research.
Overview
This paper offers a comprehensive survey of Speculative Decoding as a promising decoding paradigm for accelerating LLM inference. We commence by delivering an in-depth introduction to the early stages of Speculative Decoding research (§3), illustrated by a timeline of its evolution (as shown in Figure 2). This is followed by a formal definition and formulation of Speculative Decoding (§4). We present a taxonomy-based organizational framework to categorize relevant studies, as depicted in Figure 3. Then, this paper delves into a detailed discussion of leading techniques in Speculative Decoding, including the selection of drafter models (§5), verification strategies (§6), and alignment between the drafter and the target LLM (§7). Furthermore, we summarize several application scenarios where Speculative Decoding exhibits extraordinary effectiveness (§8). Finally, to facilitate beginners interested in this field, we outline the challenges faced and highlight potential directions for future research (§9).
Evolution of Speculative Decoding
The widespread adoption of LLMs has established autoregressive decoding as the de facto standard for LLM inference Chowdhery et al. (2023); OpenAI (2023); Jiang et al. (2024). However, autoregressive decoding is limited by its inference latency, which primarily stems from the memory-bound computation of LLMs Patterson (2004); Shazeer (2019). Specifically, the main latency bottleneck of each decoding step is not the computation itself but the need to transfer all LLM parameters from High-Bandwidth Memory (HBM) to the on-chip cache of modern accelerators such as GPUs. Because this process generates only one token per step, the accelerators are underutilized and inference is inefficient.
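As a rough back-of-the-envelope illustration of the memory-bound argument above, the sketch below lower-bounds the latency of one decoding step by the time needed to stream every parameter once from HBM. The function name and the model size, precision, and bandwidth figures are hypothetical round numbers chosen for illustration; they are not measurements from any paper or device.

```python
def min_step_latency_ms(num_params: float, bytes_per_param: int,
                        hbm_bandwidth_gb_s: float) -> float:
    """Lower bound on one decoding step, assuming each parameter is read once from HBM."""
    bytes_to_read = num_params * bytes_per_param
    seconds = bytes_to_read / (hbm_bandwidth_gb_s * 1e9)
    return seconds * 1e3


# Hypothetical setting: 7e9 parameters, 2 bytes each (16-bit), 1000 GB/s bandwidth.
latency = min_step_latency_ms(7e9, 2, 1000.0)
print(f"at least {latency:.1f} ms per decoding step")
```

Under this simplified model the bound does not depend on how many token positions are scored in the forward pass, which is why verifying several drafted tokens in one pass can cost roughly as much as generating a single token, as long as the step stays memory bound.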
Pioneering Draft-then-Verify Efforts
To mitigate the above issue, an intuitive approach is to trade idle computational resources for more parallelism in LLM inference. To this end, Stern et al. (2018) introduced Blockwise Decoding, an approach that incorporates extra feedforward network (FFN) heads atop the Transformer decoder, enabling the simultaneous drafting of multiple tokens per step. These tokens are then verified by the original LLM in parallel, ensuring that the outputs align with those of the original LLM. As a pioneering work proposing the Draft-then-Verify paradigm, Blockwise Decoding effectively reduces the total number of required LLM calls by increasing generation parallelism per step, thereby accelerating inference. However, the limited capacity of those extra FFN heads resulted in suboptimal drafting quality, so the potential of this paradigm was underestimated.
$$q_{n+1} \leftarrow \mathcal{M}_q\left(x \mid x_{<n+1}\right)$$
Following SpecDec, Leviathan et al. (2023) and Chen et al. (2023a) made concurrent contributions by proposing Speculative Sampling, expanding this paradigm to encompass the lossless acceleration of nucleus sampling. These methods employed smaller LMs from the same series (e.g., T5-small) to speed up the inference of their larger counterparts (e.g., T5-XXL). Compared to previous work, these off-the-shelf small LMs do not require additional training, enabling the rapid adoption of Speculative Decoding in LLM acceleration. This advancement has elevated Speculative Decoding to the forefront of LLM efficiency research, attracting widespread interest within the NLP community.
To sum up, these pioneering efforts in Speculative Decoding have gradually solidified the Draft-then-Verify paradigm, showcasing its promising potential in LLM acceleration. We provide a detailed categorization and discussion of these studies and subsequent research in the following sections.
Formulation and Definition
In this section, we first provide a succinct overview of standard autoregressive decoding (§4.1). Then, we offer an in-depth exposition of Speculative Decoding (§4.2), which encompasses a formal definition, a comprehensive description of the methodology, and a detailed elaboration of the algorithm.
Figure 3: Taxonomy of Speculative Decoding.
Drafting (§5)
  Independent Drafting (§5.1)
    Fine-tuned Drafter: SpecDec Xia et al. (2022), Online Speculative Liu et al. (2023), SpecInfer Miao et al. (2023), BiLD Kim et al. (2023), DistillSpec Zhou et al. (2023)
    Tuning-Free Drafter: Speculative Decoding Leviathan et al. (2023), SpS Chen et al. (2023a), SpecTr Sun et al. (2023), StagedSpec Spector and Re (2023), REST He et al. (2023), CS. Drafting Chen et al. (2023b)
  Self-Drafting (§5.2)
    FFN Heads: Blockwise Stern et al. (2018), Medusa Cai et al. (2023)
    Early-Exiting: PPD Yang et al. (2023b), Self-Speculative Zhang et al. (2023a), SPEED Hooper et al. (2023)
    Mask-Predict: Parallel Decoding Santilli et al. (2023), Lookahead Decoding Fu et al. (2023), PaSS Monea et al. (2023)
Verification (§6)
  Greedy Decoding (§6.1)
    Lossless: Blockwise Stern et al. (2018), SpecDec Xia et al. (2022), Parallel Decoding Santilli et al. (2023), PPD Yang et al. (2023b), SPEED Hooper et al. (2023), Self-Speculative Zhang et al. (2023a), Lookahead Decoding Fu et al. (2023)
    Approximate: Blockwise Stern et al. (2018), SpecDec Xia et al. (2022), BiLD Kim et al. (2023)
  Nucleus Sampling (§6.2)
    Lossless: Speculative Decoding Leviathan et al. (2023), SpS Chen et al. (2023a), Online Speculative Liu et al. (2023), DistillSpec Zhou et al. (2023), PaSS Monea et al. (2023), CS. Drafting Chen et al. (2023b)
    Approximate: Speculative Decoding Leviathan et al. (2023), DistillSpec Zhou et al. (2023)
  Token Tree Verification (§6.3): SpecInfer Miao et al. (2023), StagedSpec Spector and Re (2023), SpecTr Sun et al. (2023), Medusa Cai et al. (2023), REST He et al. (2023)
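Transformer-based LLMs typically generate text in an autoregressive manner. Given an input sequence $x_1, \dots, x_t$, an autoregressive language model $\mathcal{M}_q$ generates the next token $x_{t+1}$ according to:
$$x_{t+1} \sim q_{t+1} = \mathcal{M}_q\left(x \mid x_{<t+1}\right), \tag{1}$$
where $q_{t+1}$ is the conditional probability distribution calculated by $\mathcal{M}_q$ and $x_{t+1}$ denotes the next token sampled from $q_{t+1}$.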
As discussed in Section 3, although the standard autoregressive decoding offers desirable generation quality, it is strongly bound by memory bandwidth, resulting in low utilization of contemporary accelerator hardware. In this process, each memory-bound LLM call (i.e., an LLM forward step) produces merely a single token for the entire sequence, making the whole generation inefficient and time-consuming.
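The one-token-per-call behavior described above can be sketched with a toy loop. This is a minimal illustration, not any real model: `toy_target_model`, `VOCAB`, and the cyclic preference rule are hypothetical stand-ins for the target LLM and its vocabulary.

```python
import random

VOCAB = ["a", "b", "c", "[EOS]"]


def toy_target_model(prefix):
    """Hypothetical stand-in for the target LLM: a distribution over VOCAB."""
    # Toy rule: favour the token that follows the last one, cyclically.
    last = VOCAB.index(prefix[-1]) if prefix else 0
    probs = [0.1] * len(VOCAB)
    probs[(last + 1) % len(VOCAB)] = 0.7
    return probs


def autoregressive_decode(model, prompt, max_new_tokens, rng):
    """Standard autoregressive decoding: one model call yields exactly one token."""
    tokens = list(prompt)
    calls = 0
    for _ in range(max_new_tokens):
        q = model(tokens)                              # one (memory-bound) forward step
        calls += 1
        nxt = rng.choices(VOCAB, weights=q, k=1)[0]    # sample the next token from q
        tokens.append(nxt)
        if nxt == "[EOS]":
            break
    return tokens, calls


tokens, calls = autoregressive_decode(toy_target_model, ["a"], 8, random.Random(0))
print(tokens, "model calls:", calls)
```

The number of model calls always equals the number of generated tokens, which is the inefficiency that Speculative Decoding targets.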
Speculative Decoding
Following Xia et al. (2022), Leviathan et al. (2023), and Chen et al. (2023a), we here provide a formal definition of Speculative Decoding:
Speculative Decoding is a Draft-then-Verify decoding paradigm that, at each decoding step, first efficiently drafts multiple future tokens and then verifies all of them in parallel using the target LLM to speed up inference.
We formulate a detailed Speculative Decoding process in Algorithm 2. Subsequently, we delve into the two fundamental substeps integral to this paradigm – drafting and verification:
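At each decoding step, Speculative Decoding first efficiently drafts multiple future tokens as a speculation of the target LLM’s output. Formally, given an input sequence $x_1, \dots, x_t$ and the target LLM $\mathcal{M}_q$, Speculative Decoding employs an efficient drafter model $\mathcal{M}_p$ (e.g., a smaller LM) to decode the next $K$ drafted tokens:
$$p_1, \dots, p_K = \operatorname{Draft}\left(x_{\leq t}, \mathcal{M}_p\right), \quad \widetilde{x}_i \sim p_i, \quad i = 1, \dots, K, \tag{2}$$
where $\operatorname{Draft}(\cdot)$ denotes various drafting strategies that we will discuss in Section 5, $p_i$ is the conditional probability distribution calculated by $\mathcal{M}_p$, and $\widetilde{x}_i$ denotes the drafted token sampled from $p_i$.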
Verification
Subsequently, these drafted tokens are verified by the target LLM $\mathcal{M}_q$ in parallel. Formally, given the input sequence $x_1, \dots, x_t$ and the draft $\widetilde{x}_1, \dots, \widetilde{x}_K$, Speculative Decoding utilizes $\mathcal{M}_q$ to compute $K+1$ probability distributions simultaneously:
$$q_i = \mathcal{M}_q\left(x \mid x_{\leq t}, \widetilde{x}_{<i}\right), \quad i = 1, \dots, K+1. \tag{3}$$
Then, each drafted token $\widetilde{x}_i$ is verified by a specific criterion $\operatorname{Verify}(\widetilde{x}_i, p_i, q_i)$. Only those tokens that meet the criterion are selected as final outputs, ensuring quality consistent with the target LLM’s standards. Otherwise, the first drafted token $\widetilde{x}_c$ that fails the verification will be corrected by the strategy $\operatorname{Correct}(p_c, q_c)$. All drafted tokens after position $c$ will be discarded to guarantee the high quality of the final outputs. If all tokens pass verification, an additional token $x_{t+K+1}$ will be sampled from $q_{K+1}$ as in Eq. (1).
The drafting and verification substeps will be iterated until the termination condition is met, i.e., the [EOS] token is decoded or the sentence reaches the maximal length.
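The iteration of drafting and verification described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the algorithm of any particular paper: it uses greedy matching as the verification criterion, integers as tokens, and two hypothetical toy functions (`target_next`, `draft_next`) in place of real models. A real system would score all drafted positions in one parallel forward pass; the inner loop here only simulates that pass.

```python
def speculative_decode(target_next, draft_next, prompt, k, max_len, eos):
    """Draft-then-verify loop with greedy verification (prompt must be non-empty)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) < max_len and tokens[-1] != eos:
        # Drafting: the drafter proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # Verification: simulated here position by position; conceptually a
        # single parallel pass of the target LLM over k + 1 positions.
        target_passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target_next(tokens + draft[:i])
            if i < k and draft[i] == expected:
                accepted.append(draft[i])    # drafted token passes verification
            else:
                accepted.append(expected)    # correction, or bonus token if i == k
                break
        for tok in accepted:
            tokens.append(tok)
            if tok == eos or len(tokens) >= max_len:
                break
    return tokens, target_passes


# Hypothetical toy models: the target counts upwards to 9 (used as [EOS]);
# the drafter agrees except right after tokens 3 and 7.
def target_next(t):
    return min(t[-1] + 1, 9)


def draft_next(t):
    return 0 if t[-1] % 4 == 3 else min(t[-1] + 1, 9)


output, passes = speculative_decode(target_next, draft_next, [0], 4, 20, 9)
print(output, "target passes:", passes)
```

In this toy run the output matches plain greedy decoding of the target while using fewer target passes than generated tokens, which is the source of the speedup.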
Consequently, the acceleration effect of Speculative Decoding primarily hinges on the number of drafted tokens accepted per step. This acceptance rate is contingent on several factors, including the capacity of the drafter model, the verification criterion, and the behavior alignment between the drafter and the target LLM. Additionally, the intrinsic efficiency of the drafter itself also contributes to the overall end-to-end speedup. The subsequent section will delve into these pivotal components of Speculative Decoding, as depicted in Figure 3, through a comparative analysis of current leading methods.
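To make the dependence on accepted tokens concrete, the sketch below computes the expected number of tokens produced per verification pass under a simplifying assumption that this survey does not make: every drafted token is accepted independently with the same probability `alpha`. Under that assumption the expectation is the geometric sum $1 + \alpha + \dots + \alpha^K$. The function name and parameters are illustrative.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens per target pass with k drafted tokens.

    Assumes i.i.d. acceptance with probability alpha; the count includes the
    corrected or bonus token that the target LLM always contributes.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for alpha in (0.5, 0.8, 0.9):
    print(alpha, round(expected_tokens_per_step(alpha, 4), 3))
```

The end-to-end speedup is smaller than this figure because the drafter’s own latency must also be paid at every step.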
Drafting
As a vital component of Speculative Decoding, the drafting process has a crucial impact on the acceleration effect of the paradigm. The impact is determined by two key factors: the speculation accuracy of the drafter $\mathcal{M}_p$, measured by the average number of accepted tokens per step, and the drafting latency Stern et al. (2018); Xia et al. (2022). Balancing high speculation accuracy against low drafting latency is a major challenge in this process. In this section, we classify various drafting strategies into two categories: independent drafting (§5.1) and self-drafting (§5.2), and summarize their formulations in Table 1.
To strike a balance between speculation accuracy and efficiency, SpecDec Xia et al. (2022) first proposed to utilize an independent model to perform the drafting task. Specifically, it introduced a specialized Non-Autoregressive Transformer that drafts multiple tokens simultaneously per step. This model has a deep-shallow encoder-decoder architecture to run efficiently. In addition, SpecDec incorporated sequence-level knowledge distillation Kim and Rush (2016) to align the drafter’s outputs with those of the target LLM, thereby improving speculation accuracy. However, this method requires training a specialized drafter model from scratch, which demands an increased computational budget.
Considering the available models in existing LLM series (e.g., OPT Zhang et al. (2022) and LLaMA Touvron et al. (2023a, b)), a more straightforward and efficient approach is directly employing a small LM from the same series as the drafter to accelerate the inference of its larger counterparts Leviathan et al. (2023); Chen et al. (2023a); Spector and Re (2023); Sun et al. (2023); Chen et al. (2023b). For instance, Leviathan et al. (2023) utilized T5-small as the drafter, to accelerate the inference of T5-XXL. These off-the-shelf small LMs do not require additional training or any modification on model architectures, facilitating the quick adoption of Speculative Decoding. Moreover, since models in the same series share tokenizers, pretraining corpora, and similar training processes, they inherently have an alignment in generation behaviors. Nevertheless, there is still a considerable behavior gap between the small LM and the target LLM, resulting in suboptimal speculation accuracy.
To improve behavior alignment, recent studies have investigated various knowledge distillation strategies to finetune existing small LMs as effective drafters Miao et al. (2023); Kim et al. (2023); Zhou et al. (2023); Liu et al. (2023). Notably, Miao et al. (2023) proposed a collective boost-tuning strategy to align various small LMs with the target LLM on distinct subsets of the training corpus. The aggregated output of these small LMs, which run in parallel, offers an enhanced speculative prediction of the target LLM’s outputs. Online Speculative Decoding Liu et al. (2023) proposed to continually align the drafter with the target LLM on the user query data stream. It introduced an online knowledge distillation strategy that dynamically adapts the drafter model to the evolving distribution of user queries on the fly, thereby improving the speculation accuracy of the drafter.
Self-Drafting
While leveraging an external drafter model in Speculative Decoding shows promising speedup, this approach requires additional effort to train or identify a suitable drafter model that aligns closely with the target LLM. It becomes more challenging when the target LLM lacks smaller counterparts, e.g., LLaMA-7B Touvron et al. (2023a, b). Moreover, integrating two distinct models within one system increases computational and operational complexity, especially in distributed settings Cai et al. (2023).
To address the above issues, some work has proposed to utilize the target LLM itself for efficient drafting Stern et al. (2018); Santilli et al. (2023); Hooper et al. (2023); Cai et al. (2023); Fu et al. (2023); Monea et al. (2023). In particular, Blockwise Decoding Stern et al. (2018) and Medusa Cai et al. (2023) introduced additional FFN heads on top of the Transformer decoder, enabling the generation of multiple tokens simultaneously per step. Compared with external drafters, these lightweight FFN heads reduce extra computational overhead and are friendly to distributed inference. Another line of research utilizes early exiting or layer skipping within the target LLM itself to perform the drafting task Yang et al. (2023b); Zhang et al. (2023a); Hooper et al. (2023). For instance, Yang et al. (2023b) introduced additional subprocesses that exit early in the current decoding step to start drafting future tokens in advance. Similarly, Self-Speculative Zhang et al. (2023a) proposed to adaptively skip several intermediate layers during inference to draft efficiently.
In contrast to prior work that focused on extending model architectures or altering the inference process, Santilli et al. (2023) introduced a simple drafting strategy that directly adds multiple [PAD] tokens to the end of the input prompt. The effectiveness of this method stems from the robustness of LLMs in handling noisy inputs. Specifically, LLMs may still be capable of predicting the next token even with several [PAD] tokens inserted in the prefix. However, this approach deviates from the autoregressive pretraining pattern of LLMs, leading to suboptimal drafting quality. To tackle this problem, Fu et al. (2023) proposed to reform these low-quality drafted tokens into multiple n-grams, which effectively improves the speculation accuracy; Monea et al. (2023) introduced multiple learnable [LA] tokens and finetuned these token embeddings on a small training dataset to enhance the parallel decoding performance.
Verification
In each decoding step, the drafted tokens are then verified in parallel, to ensure the output quality is highly consistent with the target LLM. This process also determines the number of accepted tokens per step, a vital factor impacting the speedup. In this section, we summarize various verification criteria (as shown in Table 2), encompassing those supporting greedy decoding (§6.1) and nucleus sampling (§6.2) in LLM inference. Besides, we introduce token tree verification (§6.3), an effective strategy to increase token acceptance per step.
Table 2: Summary of formulations for various verification strategies in Speculative Decoding.
Greedy Decoding: correction $x_{t+c} \leftarrow \arg\max q_c$; representative work: Blockwise Decoding Stern et al. (2018), SpecDec Xia et al. (2022).
Nucleus Sampling: representative work: Speculative Decoding Leviathan et al. (2023), SpS Chen et al. (2023a).
Greedy Decoding
Early attempts at Speculative Decoding focused on the verification criterion that supports greedy decoding, which guarantees that the outputs are exactly the same as the greedy decoding results of the target LLM Stern et al. (2019); Sun et al. (2021); Xia et al. (2022). Specifically, under this criterion only those drafted tokens that match the top-1 predictions of the target LLM pass the verification. Formally, given the input sequence $x_1, \dots, x_t$, the drafted tokens $\widetilde{x}_1, \dots, \widetilde{x}_K$, and the computed probability distributions $p_1, \dots, p_K$ and $q_1, \dots, q_K$, as obtained from Eq. (2) and (3), respectively, the verification criterion on the $i$-th drafted token is formulated as
$$\widetilde{x}_i = \arg\max q_i, \tag{4}$$
where $i = 1, \dots, K$. The first position $c$ at which the drafted token fails the verification is the bifurcation position. The output token at this position will be corrected by the correction strategy, which simply replaces the drafted token with the top-1 prediction of the target LLM:
$$x_{t+c} \leftarrow \arg\max q_c. \tag{5}$$
The verification criterion of greedy decoding is straightforward and effective. Thus, multiple subsequent studies have adopted this criterion to demonstrate the efficacy of their methodologies Santilli et al. (2023); Yang et al. (2023b); Hooper et al. (2023); Zhang et al. (2023a); Fu et al. (2023). In addition, this criterion has been prominently featured in numerous online demonstrations Joao Gante (2023); Cai et al. (2023); Fu et al. (2023), highlighting how the algorithm generates faster than greedy decoding while maintaining identical outputs. However, this approach is not without its limitations. The strict matching requirement of this criterion often results in the rejection of potentially suitable drafted tokens, simply because they differ from the top-1 predictions of the target LLM, thereby constraining the speedup of the paradigm.
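The greedy criterion and its correction rule can be sketched in a few lines. This is an illustrative sketch, not code from any cited system: the function name is hypothetical, tokens are integer ids, and the target distributions are assumed to have already been computed in one parallel pass.

```python
def greedy_verify(draft_tokens, target_dists):
    """Return the tokens emitted in one step under the greedy criterion.

    draft_tokens: K drafted token ids.
    target_dists: K + 1 probability lists from the target LLM, where the i-th
                  list is conditioned on the prompt plus the first i - 1 drafts.
    """
    def argmax(dist):
        return max(range(len(dist)), key=dist.__getitem__)

    output = []
    for tok, q in zip(draft_tokens, target_dists):
        top1 = argmax(q)
        if tok != top1:
            output.append(top1)    # bifurcation position: replace with top-1
            return output          # everything drafted after it is discarded
        output.append(tok)         # drafted token equals the top-1 prediction
    # All K drafts accepted: take one extra token from the last distribution.
    output.append(argmax(target_dists[len(draft_tokens)]))
    return output


dists = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]
print(greedy_verify([2, 0, 1], dists))
```

In the usage line the third drafted token differs from the target’s top-1 prediction, so it is replaced and the step ends there.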
To tackle this problem, multiple studies have proposed various approximate verification criteria Stern et al. (2018); Xia et al. (2022); Kim et al. (2023). Compared with the lossless greedy decoding criterion above, these methods slightly relax the matching requirement to trust the drafts more, leading to higher acceptance of drafted tokens. For instance, SpecDec Xia et al. (2022) only requires the drafted tokens to fall in top-k candidates of the target LLM with a tolerable log-likelihood gap away from the top-1 prediction; BiLD Kim et al. (2023) proposed a rollback verification criterion that rejects drafted tokens if the number of consecutive mismatch tokens exceeds a fixed threshold.
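A relaxed criterion of the kind described above (membership in the top-k candidates plus a bounded log-likelihood gap to the top-1 prediction) might look like the following. This is a hedged sketch of the idea only; the function name, parameter names, and threshold values are hypothetical and are not taken from the SpecDec implementation.

```python
import math


def relaxed_accept(token, target_dist, top_k, max_log_gap):
    """Accept a drafted token if it is among the target's top_k candidates and
    its log-likelihood is within max_log_gap of the top-1 prediction."""
    ranked = sorted(range(len(target_dist)),
                    key=target_dist.__getitem__, reverse=True)
    if token not in ranked[:top_k]:
        return False
    if target_dist[token] <= 0.0:
        return False                      # avoid log(0); never accept such tokens
    gap = math.log(target_dist[ranked[0]]) - math.log(target_dist[token])
    return gap <= max_log_gap


q = [0.5, 0.4, 0.1]
print(relaxed_accept(1, q, top_k=2, max_log_gap=0.5))   # close runner-up
print(relaxed_accept(2, q, top_k=2, max_log_gap=0.5))   # outside the top-2
```

Setting `top_k=1` recovers the strict greedy criterion, so the two parameters control how far the outputs may drift from exact greedy decoding.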
Nucleus Sampling
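Following Stern et al. (2019) and Xia et al. (2022), subsequent work extended Speculative Decoding to support nucleus sampling Leviathan et al. (2023); Chen et al. (2023a), accelerating the target LLM’s inference without changing its output distribution. Formally, given the initial sequence $x_1, \dots, x_t$, the drafted tokens $\widetilde{x}_1, \dots, \widetilde{x}_K$, and the computed distributions $p_1, \dots, p_K$ and $q_1, \dots, q_K$, the verification criterion on the $i$-th drafted token is
$$r < \min\left(1, \frac{q_i(\widetilde{x}_i)}{p_i(\widetilde{x}_i)}\right), \quad r \sim U[0, 1], \tag{6}$$
where $r$ denotes a random number drawn from the uniform distribution $U[0, 1]$; $q_i(\widetilde{x}_i)$ and $p_i(\widetilde{x}_i)$ are the probabilities of $\widetilde{x}_i$ according to $\mathcal{M}_q$ and $\mathcal{M}_p$, respectively; and $i = 1, \dots, K$. In other words, this criterion accepts the drafted token $\widetilde{x}_i$ if $q_i(\widetilde{x}_i) \geq p_i(\widetilde{x}_i)$, and in case $q_i(\widetilde{x}_i) < p_i(\widetilde{x}_i)$ it rejects the token with probability $1 - \frac{q_i(\widetilde{x}_i)}{p_i(\widetilde{x}_i)}$. The corresponding correction strategy resamples the output token at the bifurcation position $c$ from an adjusted distribution: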
$$x_{t+c} \sim \operatorname{norm}\left(\max\left(0, q_c - p_c\right)\right). \tag{7}$$
Leviathan et al. (2023) and Chen et al. (2023a) theoretically proved that Speculative Decoding with this sampling strategy maintains output distributions identical to those of the target LLM, and this strategy has been followed by multiple subsequent studies Liu et al. (2023); Zhou et al. (2023); Monea et al. (2023); Chen et al. (2023b). In addition to the strict requirement in Eq. (6), some work also introduces approximate verification strategies to improve the acceptance rate of drafted tokens Leviathan et al. (2023); Zhou et al. (2023). For instance, Leviathan et al. (2023) multiplies $p_i(\widetilde{x}_i)$ in Eq. (6) by a lenience parameter $l \in [0, 1]$, relaxing the criterion to trust the draft more.
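The accept/reject test and the residual resampling rule above can be sketched as follows. This is an illustrative sketch with hypothetical names, not the implementation of Leviathan et al. (2023) or Chen et al. (2023a); it assumes integer token ids, explicit probability lists, and that every drafted token has non-zero drafter probability (which holds when it was sampled from the drafter).

```python
import random


def speculative_sample_step(draft_tokens, p_dists, q_dists, rng):
    """One verification step of speculative sampling.

    p_dists: K drafter distributions; q_dists: K + 1 target distributions.
    """
    output = []
    for tok, p, q in zip(draft_tokens, p_dists, q_dists):
        r = rng.random()
        if r < min(1.0, q[tok] / p[tok]):
            output.append(tok)            # accepted
            continue
        # Rejected, so q[tok] < p[tok] and the residual has positive mass.
        residual = [max(0.0, qi - pi) for qi, pi in zip(q, p)]
        output.append(rng.choices(range(len(q)), weights=residual, k=1)[0])
        return output                     # later drafts are discarded
    bonus_q = q_dists[len(draft_tokens)]  # all accepted: sample one more token
    output.append(rng.choices(range(len(bonus_q)), weights=bonus_q, k=1)[0])
    return output


# Empirical check on a two-token vocabulary: the first emitted token should
# follow the target distribution q even though drafts come from p.
rng = random.Random(0)
p, q = [0.9, 0.1], [0.3, 0.7]
n = 20000
hits = 0
for _ in range(n):
    d = rng.choices([0, 1], weights=p, k=1)[0]
    hits += speculative_sample_step([d], [p], [q, q], rng)[0] == 1
print("empirical P(token 1):", hits / n)
```

`random.choices` normalizes the weights itself, which plays the role of $\operatorname{norm}(\cdot)$ in Eq. (7).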
Token Tree Verification
As illustrated in Table 2, prior verification strategies focus on verifying a single draft sequence and lack the consideration of different draft candidates, leading to suboptimal speculation accuracy. To address this issue, SpecInfer Miao et al. (2023) first proposed token tree verification, an effective strategy that enables the target LLM to verify multiple candidate draft sequences in parallel.
As illustrated in Figure 4, compared to prior strategies that only verify a single draft sequence, token tree verification first merges multiple candidate draft sequences into a token tree, then utilizes a specially designed tree attention mechanism to verify the whole token tree in parallel. Recent research has investigated various approaches to obtain the candidate draft sequences. For instance, Miao et al. (2023) generated diverse draft sequences from different boost-tuned LMs; Cai et al. (2023) considered the top-k predictions from each FFN head to obtain multiple candidates, while He et al. (2023) utilized different continuations (of the input prompt) from the retrieved documents as candidate draft sequences. Subsequently, those obtained candidate draft sequences are merged into a token tree by sharing prefixes and are fed into the target LLM with a tree attention mask for parallel verification, as shown in Figure 4.
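The prefix merging and the tree attention mask described above can be illustrated with a small sketch. The function names and the parent-pointer representation are hypothetical choices for illustration and do not reproduce the SpecInfer implementation; the mask covers only the tree nodes, on the assumption that every node additionally attends to the full prompt.

```python
def build_token_tree(candidates):
    """Merge candidate draft sequences into a tree by sharing prefixes.

    Returns (tokens, parents): tokens[i] is the token at node i and parents[i]
    is the index of its parent node (-1 if the node follows the prompt).
    """
    tokens, parents, children = [], [], {}
    for seq in candidates:
        parent = -1
        for tok in seq:
            key = (parent, tok)
            if key not in children:        # new branch: create a node
                children[key] = len(tokens)
                tokens.append(tok)
                parents.append(parent)
            parent = children[key]         # shared prefix: reuse the node
    return tokens, parents


def tree_attention_mask(parents):
    """mask[i][j] is True iff node i may attend to node j (itself or an ancestor)."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


cands = [["the", "cat", "sat"], ["the", "cat", "ran"], ["a", "dog"]]
tokens, parents = build_token_tree(cands)
for tok, row in zip(tokens, tree_attention_mask(parents)):
    print(f"{tok:>4}", "".join("x" if m else "." for m in row))
```

Eight candidate tokens collapse into six tree nodes here, and each node sees only its own branch, so one forward pass can score all candidates without them interfering with each other.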
Alignment
As illustrated in Section 5, improving speculation accuracy is the key to the speedup of Speculative Decoding: the closer the prediction behavior of the drafter is to the target LLM, the higher the acceptance rate of drafted tokens. To this end, existing work has explored various knowledge distillation (KD) strategies to align the drafter’s outputs with those of the target LLM Stern et al. (2018); Xia et al. (2022); Miao et al. (2023); Liu et al. (2023); Kim et al. (2023); Zhou et al. (2023). Blockwise Decoding first adopted sequence-level knowledge distillation (Seq-KD) Kim and Rush (2016) for alignment, which trained the drafter model on the sentences generated by the target LLM. Besides, Seq-KD is also an effective strategy to improve the generation quality of parallel decoding Gu et al. (2018); Qian et al. (2021), which enhances the drafting performance. Thus, this approach has been adopted by multiple subsequent studies Xia et al. (2022); Miao et al. (2023); Kim et al. (2023). For instance, Miao et al. (2023) proposed a collective boost-tuning (Col-BT) strategy, which adopted Seq-KD to finetune multiple small LMs on the training data and utilized their aggregated output as the draft, improving the speculation accuracy.
Though Seq-KD is effective, it ignores the probability distributions of the target LLM and only trains the drafter model with one-hot labels. Thus, this strategy becomes less effective when Speculative Decoding is adopted for nucleus sampling. To address this, recent studies have explored other KD strategies for Speculative Decoding Zhou et al. (2023); Liu et al. (2023). Notably, DistillSpec Zhou et al. (2023) conducted a comprehensive comparison of different KD strategies on Speculative Decoding across various downstream tasks, pointing out that the choice of the optimal KD algorithm largely depends on specific tasks and the verification strategy. Online Speculative Liu et al. (2023) proposed an online KD strategy that dynamically aligns the drafter with the target LLM on the fly using the query data.
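The limitation of one-hot labels noted above can be seen in a toy comparison. The sketch below contrasts a Seq-KD style cross-entropy on a single target-generated token with a distribution-level objective; forward KL divergence is used here only as one possible example of such an objective, and the function names and the numbers are hypothetical.

```python
import math


def seq_kd_loss(drafter_dist, target_token):
    """Cross-entropy against a one-hot label: only the chosen token matters."""
    return -math.log(drafter_dist[target_token])


def forward_kl_loss(target_dist, drafter_dist):
    """KL(q || p): penalizes mismatch over the whole target distribution."""
    return sum(q * math.log(q / p)
               for q, p in zip(target_dist, drafter_dist) if q > 0)


target = [0.6, 0.3, 0.1]           # target LLM distribution at one position
drafter_a = [0.6, 0.3, 0.1]        # matches the target everywhere
drafter_b = [0.6, 0.05, 0.35]      # matches only on the top-1 token

for name, d in (("A", drafter_a), ("B", drafter_b)):
    print(name, round(seq_kd_loss(d, 0), 4), round(forward_kl_loss(target, d), 4))
```

Both drafters receive the same one-hot loss when the label is the top-1 token, yet drafter B would be rejected more often under the sampling criterion of Eq. (6) because its distribution differs from the target’s on the remaining tokens.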
We summarize the main features of existing Speculative Decoding methods in Table 3, including the drafter type or the drafting strategy, the alignment approach, supported verification strategies, and the reported speedup, etc.
Applications
In addition to serving as a general paradigm, recent work has revealed that some variants of Speculative Decoding demonstrate extraordinary effectiveness in specific tasks. Furthermore, other research has applied this paradigm to address latency issues unique to certain application scenarios, achieving inference acceleration. Below, we will provide a detailed introduction to these promising works.
Recent studies by Sun et al. (2021) and Yang et al. (2023a) have highlighted that Speculative Decoding is particularly well suited for tasks where model inputs and outputs are highly similar, such as Grammatical Error Correction Wang et al. (2021); Bryant et al. (2023) and Retrieval-augmented Generation Lewis et al. (2020); Cai et al. (2022). These methods introduced a specialized form of Speculative Decoding, where the initial user input or the retrieved context is directly employed as drafts. For instance, SAD Sun et al. (2021), an early attempt at Speculative Decoding on Grammatical Error Correction, utilized the input sentence with grammatical errors as a draft and leveraged the LLM to verify the whole sentence in parallel, achieving a speedup. Similarly, LLMA Yang et al. (2023a) selected text spans from the reference as drafts, demonstrating a speedup across various practical application scenarios including Retrieval-augmented Generation, Cache-assisted Generation, and Multi-turn Conversations.
Beyond these works, RaLMSpec Zhang et al. (2023b) adopted Speculative Decoding to accelerate retrieval-augmented language models (RaLMs). It pointed out that the main latency bottleneck of iterative RaLMs is the frequent retrieval from a vast knowledge base. To accelerate inference, this method proposed to maintain a local cache for speculative retrieval, achieving a speedup with identical model outputs. LLMCad Xu et al. (2023) applied Speculative Decoding to on-device LLM inference. Concretely, it proposed to generate drafts with a smaller real-time LM that can be hosted in device memory, and to utilize the target LLM only for parallel verification. This approach effectively reduces the repeated releasing and loading of model weights, achieving a speedup compared to existing inference engines.
Challenges and Future Directions
As discussed in Section 5, scaling up the drafter can effectively enhance speculation accuracy, yet it substantially reduces drafting efficiency and can even reduce the overall speedup. Therefore, it is essential to strike a balance between speculation accuracy and drafting latency. Among existing strategies, behavior alignment is a promising approach to address this issue, as it improves speculation accuracy without increasing latency. However, despite recent advancements Miao et al. (2023); Zhou et al. (2023); Liu et al. (2023), there is still considerable room for improvement in aligning the drafter with the target LLM. For example, given that the drafted tokens after the bifurcation position are all discarded, one potential direction could involve encouraging the drafter to prioritize the generation quality of early-position tokens. Beyond alignment, other factors such as the quality of drafting Fu et al. (2023) and the determination of speculation length Su et al. (2023) also influence speculation accuracy and merit further exploration.
How to integrate Speculative Decoding with other leading techniques?
As a general decoding paradigm, Speculative Decoding has already demonstrated its potential in conjunction with other advanced techniques Yang et al. (2023a); Zhang et al. (2023b); Li et al. (2023). For instance, Yuan et al. (2023) combined Speculative Decoding with Contrastive Decoding Li et al. (2023), which not only speeds up the inference but also substantially improves the generation quality. In addition to the acceleration of text-only LLMs, the application of Speculative Decoding in multimodal inference, such as image synthesis, text-to-speech synthesis, and video generation, is also an intriguing and valuable direction for future research.
Conclusion
In recent years, the continual scaling up of LLMs has significantly increased the demand for efficient LLM inference. Speculative Decoding, a novel decoding paradigm that accelerates LLM inference while maintaining identical generation quality, has emerged as a promising solution. This paper presents a comprehensive survey of the existing literature on Speculative Decoding, including a formal definition and formulation of Speculative Decoding, an in-depth review of various leading techniques, as well as challenges and potential directions for future research. To the best of our knowledge, this is the first survey dedicated to Speculative Decoding. The primary objective of this paper is to clarify the current research landscape and provide insights into the future trajectory of this promising paradigm.