QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks

Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, Christopher De Sa

Introduction

Large language models (LLMs) have driven rapid advances across diverse fields such as natural language processing (Touvron et al., 2023b), scientific modeling (Nguyen et al., 2023), and program synthesis (Rozière et al., 2024). However, the massive size of these models poses significant challenges to their deployment. For example, the largest model in the Llama2 family has 70B parameters, and requires 140GB of GPU memory in native 16-bit precision (Touvron et al., 2023b). This massive memory footprint motivates research into methods that can compress LLMs without sacrificing quality.

Post-training quantization (PTQ) reduces the memory requirements of large models by converting trained weights to a lower precision. For example, with 2-bit quantization, a 16-bit Llama2 model with 70B parameters fits on a single consumer-grade 24GB GPU and benefits from increased inference throughput (Cai et al., 2024). However, 2-bit quantization also often reduces the quality of the model and pushes the limits of PTQ algorithms (Chee et al., 2023).

In this work, we introduce QuIP#, a weight-only PTQ method that achieves a new state-of-the-art in model quantization. QuIP# improves over existing work via three techniques: incoherence processing, lattice codebooks, and fine-tuning. Incoherence processing is a principled form of outlier suppression that produces approximately sub-Gaussian distributed weight matrices (Chee et al., 2023). QuIP# performs incoherence processing with the computationally-efficient randomized Hadamard transform (Halko et al., 2011) (Section 3). To quantize incoherent matrices, QuIP# uses the BlockLDLQ block adaptive rounding algorithm with compressible codebooks based on the $E_{8}$ lattice, which achieves the highest density 8 dimensional unit-ball packing (Viazovska, 2017) (Section 4). The $E_{8}$ lattice is highly structured and symmetric, which means that our codebooks are hardware-friendly and admit fast inference. Finally, QuIP# includes an inter-layer fine-tuning algorithm that further improves quantization quality (Section 5).

These developments allow QuIP# to significantly outperform existing PTQ methods including OmniQuant (Shao et al., 2024), QuIP (Chee et al., 2023) (a previous, separate work), and AQLM (Egiazarian et al., 2024). To the best of our knowledge, QuIP# is also the first PTQ method where 3-bit models scale better than 4-bit models. This directly refutes Dettmers & Zettlemoyer (2023)’s claim that 4-bit models are “optimal” and indicates that as the field of PTQ develops, 2-bit models are likely to scale better than 3-bit models in the near future.

Finally, we note that QuIP# was designed from the ground up to be fast. Algorithm 2 describes fast inference with a QuIP#-quantized linear layer. Our “proof of concept” CUDA implementation of QuIP# achieves over 50% of peak memory bandwidth on an NVIDIA RTX 4090, validating our design choices.

In summary, we introduce QuIP#, a post-training quantization method that achieves state-of-the-art results by

Performing incoherence processing with the Randomized Hadamard Transform, which has better incoherence properties and faster runtime than the Kronecker factorization in QuIP.

Rounding incoherence-processed weight matrices with block adaptive rounding and codebooks based on the $E_{8}$ lattice, which achieves the highest 8-dimension unit ball packing density (kissing number).

Introducing an inter-layer fine-tuning algorithm that further improves quantization quality.

Background / Related Work

A large body of work has focused on compressing LLMs, as doing so can directly benefit LLM inference at scale. Methods such as pruning, quantization aware training (QAT), and post-training quantization (PTQ) all focus on different areas of this problem and are not strictly orthogonal to each other. Pruning removes weights from models while preserving model quality and inference performance (Chee et al., 2022; Sun et al., 2023). QAT focuses on training models that are more “quantizable” but usually requires training models from scratch (Nagel et al., 2022). PTQ, which QuIP# falls under, instead quantizes pre-trained models. PTQ generally requires much less compute than QAT and achieves competitive performance (Chee et al., 2023; Frantar et al., 2023; Shao et al., 2024; Egiazarian et al., 2024). For the rest of this paper, we focus on the PTQ realm of LLM compression.

2.2 Quantization and Adaptive Rounding

In QuIP#, we follow existing state-of-the-art PTQ methods and round weights to minimize the per-layer proxy loss, as formalized by Nagel et al. (2020):
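The displayed equation did not survive in this copy of the text. As a hedged reconstruction, consistent with the trace expression used in the BlockLDLQ discussion and with the proxy loss of QuIP (Chee et al., 2023), the loss referred to later as Eq. (1) can be written as below. Here $W$ is the original weight matrix, $\hat{W}$ its quantized counterpart, $x$ a layer input, and we assume the proxy Hessian is $H=\mathbf{E}_{x}[xx^{T}]$.

```latex
\ell(\hat{W})
  = \mathbf{E}_{x}\left[\left\|(\hat{W}-W)x\right\|^{2}\right]
  = \operatorname{tr}\left((\hat{W}-W)\,H\,(\hat{W}-W)^{T}\right)
```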

2.3 Incoherence Processing

Multiple works have observed that outliers in model activations and weights can hinder quantization quality, motivating methods that “suppress” outliers during quantization. For example, AWQ (Lin et al., 2023) scales model weights by information from activations and OmniQuant (Shao et al., 2024) uses simple learnable model-preserving transformations. However, these heuristic-based approaches tend to fail at lower bitrates.

Instead, in QuIP, Chee et al. (2023) proposed that incoherence is important for LLM quantization. Informally, incoherent matrices have concentrated entry magnitudes—ruling out outliers. In LLMs, incoherent weight and Hessian matrices mean that both the thing being rounded (weights) and important rounding directions (Hessians) are not too large in any coordinate. This enables quantization with provably bounded error.

These structured orthogonal multiplies by a Kronecker product lead to a runtime overhead of $\Theta(n\sqrt{n}+m\sqrt{m})$, which is small relative to the $\Theta(mn)$ cost of the multiply by $W$.

Incoherence processing can be seen as a principled alternative to more complicated and heuristic methods for outlier suppression. Methods such as grouping require extra storage and can negatively impact performance. For example, using a 16 bit scale per group of 64 weights requires an extra 0.25 bits per weight. This increase is significant in extreme compression regimes, whereas incoherence processing allows more bits to be spent on actually quantizing model weights.
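The storage overhead quoted above is simple arithmetic, which the following sketch makes explicit. The function name and its parameters are illustrative only and are not taken from any QuIP# code.

```python
def grouping_overhead_bits_per_weight(scale_bits: int, group_size: int) -> float:
    """Extra bits per weight incurred by storing one scale per group of weights."""
    return scale_bits / group_size


# One 16-bit scale shared by 64 weights, as in the example in the text.
overhead = grouping_overhead_bits_per_weight(scale_bits=16, group_size=64)
print(overhead)  # 0.25

# Relative cost at a nominal 2 bits per weight: the scales add 12.5% on top.
print(overhead / 2.0)  # 0.125
```

At a nominal 2 bits per weight, the scales alone would therefore account for an additional eighth of the weight storage.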

2.4 Vector Quantization

Incoherence Processing with the Randomized Hadamard Transform

In this section, we propose a way of improving the incoherence processing of QuIP by replacing the 2-factor Kronecker product by a Randomized Hadamard Transformation (RHT) (Halko et al., 2011). This change yields three advantages: (1) the theoretical bound on the incoherence parameter $\mu$ is improved; (2) the asymptotic cost of multiplying by the structured random orthogonal matrix is improved from $\Theta(n\sqrt{n})$ to $\Theta(n\log n)$; (3) the cost to multiply is further reduced by a constant factor, since a Hadamard matrix multiply can be performed without any floating-point multiplies as its entries are in $\{-1,+1\}$. Additionally, we show in Section 6.4 that this change by itself improves the perplexity of quantized LLMs.
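To make the transform concrete, the following is a minimal sketch of a randomized Hadamard transform $x\mapsto\frac{1}{\sqrt{n}}HSx$ for a vector, where $S$ is a diagonal matrix of random signs. It assumes $n$ is a power of two (so that the Sylvester construction applies) and uses a plain-Python fast Walsh-Hadamard transform. The function names are hypothetical, and this is not the CUDA kernel used in our experiments.

```python
import math
import random


def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform in O(n log n) additions.

    len(x) must be a power of two. Only additions and subtractions are used.
    """
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x


def randomized_hadamard(x, signs):
    """Apply the orthogonal map x -> H S x / sqrt(n), with S = diag(signs)."""
    n = len(x)
    y = fwht([s * v for s, v in zip(signs, x)])
    return [v / math.sqrt(n) for v in y]


random.seed(0)
n = 16
signs = [random.choice((-1.0, 1.0)) for _ in range(n)]

# A vector with a single large outlier entry.
x = [10.0] + [0.0] * (n - 1)
y = randomized_hadamard(x, signs)

# The outlier is spread evenly across all coordinates; the norm is unchanged.
print(max(abs(v) for v in x), max(abs(v) for v in y))  # 10.0 2.5
```

The example illustrates the outlier-suppression effect on a single vector: the largest entry magnitude drops from 10 to $10/\sqrt{16}=2.5$ while the Euclidean norm is preserved.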

In QuIP (Chee et al., 2023), the 2-factor Kronecker approach achieves $\mu_{W}^{Kron}=A^{2}\log\left(4Cmn/\delta\right)^{2}$, where $A$ and $C$ are global constants independent of $n$ and the number of factors. QuIP#’s RHT achieves superior incoherence via a log dependence on the matrix size rather than the Kronecker method’s log-squared dependence. All of QuIP’s theory analyzing the proxy loss in Eq. (1) still holds with the RHT, with the improved incoherence rates propagating through.

While the Hadamard conjecture states that $\exists H_{k}\ \forall k,\ 4\mid k$, finding such Hadamard matrices is still an open problem (Hedayat & Wallis, 1978). In cases when there does not exist a factorization $n=pq$ where $\exists H_{p},H_{q}$, we present a Randomized Fast Fourier Transform (RFFT) incoherence processing algorithm with similar runtime and concentration properties as the RHT. At a high level, the RFFT performs incoherence processing with the Fast Fourier Transform (FFT) (Cochran et al., 1967) and a random complex phase. The RFFT only requires $n$ to be even, which is much weaker than the RHT’s restrictions on $n$. The RFFT is also useful when there does exist a decomposition $n=pq$ but $p\not\gg q$, resulting in reduced speedups over an $\Theta(n\sqrt{n})$ algorithm. The FFT itself is also well supported on a wide variety of hardware, meaning that it may be easier to implement a fast RFFT when adapting QuIP# to new hardware. In practice, we find that the RFFT performs slightly worse than the RHT but still achieves strong results (Table 1). We describe the RFFT in detail in Section A.2 in the Appendix.

BlockLDLQ and Lattice Codebooks

The LDLQ algorithm sets $U$ to be $L^{T}-I$ where $H=L^{T}DL$ is the LDL decomposition of the proxy Hessian $H$. From QuIP, we know that LDLQ is optimal within adaptive rounding methods with linear feedback when rounding to the integers. However, LDLQ does not work with vector quantization, which rounds multiple columns together.

We propose to extend LDLQ to support vector quantization. Given a block size $g$ that evenly divides $n$, our block LDLQ is based on a novel $g$-block LDL decomposition $H=\mathbf{L}^{T}\mathbf{D}\mathbf{L}$, where $\mathbf{L}$ is a unit block lower triangular matrix (among the $n^{2}/g^{2}$ $g\times g$ blocks of $L\in\mathbf{R}^{n\times n}$, the $n/g$ diagonal blocks are all $I$ and all blocks above the diagonal are $0$), and $\mathbf{D}$ is a block diagonal matrix. It is straightforward to produce the $g$-block LDL decomposition from the Cholesky decomposition of $H$. As before, we set $\mathbf{U}=\mathbf{L}^{T}-I$, and round $W$ in a block-wise fashion via
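As an illustration of how a $g$-block LDL decomposition $H=\mathbf{L}^{T}\mathbf{D}\mathbf{L}$ might be obtained from a Cholesky factorization, the sketch below factors the index-reversed matrix to get an upper-triangular $A$ with $H=AA^{T}$, then normalizes each column block of $A$ by its diagonal block. This is one possible construction under the stated definitions, written in plain Python for small dense matrices. The helper names are hypothetical and it is not the routine used in our code.

```python
import math


def cholesky_lower(H):
    """Return lower-triangular C with H = C C^T (H symmetric positive definite)."""
    n = len(H)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = H[i][j] - sum(C[i][k] * C[j][k] for k in range(j))
            C[i][j] = math.sqrt(s) if i == j else s / C[j][j]
    return C


def block_ldl(H, g):
    """Return (L, D) with H = L^T D L.

    L is unit block lower triangular and D is block diagonal, with g x g blocks.
    """
    n = len(H)
    assert n % g == 0, "block size must divide the matrix dimension"
    # Reversing rows and columns turns the lower Cholesky factor of the
    # reversed matrix into an upper-triangular A with H = A A^T.
    H_rev = [row[::-1] for row in H[::-1]]
    C = cholesky_lower(H_rev)
    A = [row[::-1] for row in C[::-1]]

    Lt = [[0.0] * n for _ in range(n)]  # holds L^T (unit block upper triangular)
    D = [[0.0] * n for _ in range(n)]
    for b in range(0, n, g):
        cols = range(b, b + g)
        # D block = B B^T, where B is the upper-triangular diagonal block of A.
        for i in cols:
            for j in cols:
                D[i][j] = sum(A[i][k] * A[j][k] for k in cols)
        # Column block of L^T = A[:, cols] B^{-1}; solve X B = A[:, cols] row by row.
        for r in range(n):
            for j in cols:
                acc = A[r][j] - sum(Lt[r][k] * A[k][j] for k in range(b, j))
                Lt[r][j] = acc / A[j][j]
    L = [[Lt[j][i] for j in range(n)] for i in range(n)]
    return L, D


# Small symmetric, diagonally dominant (hence positive definite) example.
H = [
    [4.0, 1.0, 0.5, 0.2],
    [1.0, 3.0, 0.3, 0.1],
    [0.5, 0.3, 2.0, 0.4],
    [0.2, 0.1, 0.4, 1.5],
]
L, D = block_ldl(H, g=2)
```

With $g=1$ the same routine reduces to a scalar decomposition of the form $H=L^{T}DL$ with $L$ unit lower triangular.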

Observe that under the same conditions, just quantizing all blocks independently would yield $\mathbf{E}[\operatorname{tr}((\hat{W}-W)H(\hat{W}-W)^{T})]\leq gm\sigma^{2}\operatorname{tr}(H)$: this “improvement” from the trace of $H$ to the square of the trace of its square root divided by $n$ is the same factor achieved in the scalar case in QuIP. The original QuIP paper also included multiple other technical guarantees, including a bound that considers more rigorously the “real” case of finite-sized codebooks. While these results could also be generalized to the block-LDLQ case, we view this as not providing much insight relevant to QuIP# beyond Theorem 4.1, so (if desired) they are left as an exercise for the reader.

4.2 The E8P (“E8 Padded”) Codebook

$\hat{D}_{8}$ has nice symmetry properties: flipping any (nonzero) even number of signs of an element in $\hat{D}_{8}$ yields another distinct element in $\hat{D}_{8}$. This means that if $|\hat{D}_{8}|$ denotes the set of elementwise absolute values of entries in $\hat{D}_{8}$, then each element of $\hat{D}_{8}$ can be expressed (uniquely) as the elementwise product of an entry $s\in|\hat{D}_{8}|$ and a sign vector of appropriate parity. So, if we start from some “source codebook” of absolute entries $S\subset|\hat{D}_{8}|$, we can use the 128 possible odd- or even-parity sign flips to generate a subset of $\hat{D}_{8}$. Each entry in $S$ is either an odd or even number of flips away from an entry in $\hat{D}_{8}$, but not both. Thus, given $s\in S$ and 7 out of the 8 sign flips, we can infer the last one from the parity of the 7 sign flips and $s$. This lets us use the following pattern to store a 16-bit codeword in $E_{8}+\frac{1}{4}$: 8 bits for the entry in $S$, 7 bits for sign flips, and 1 bit for the $\pm\frac{1}{4}$ shift. This lets us decode a size $2^{16}$ codebook by looking up into only a size $2^{8}$ codebook ($S$) and performing some operations. All that remains is how to choose $S$: we set $S$ to be the 227 elements of $|\hat{D}_{8}|$ with norm $\leq\sqrt{10}$ plus 29 “padding” elements from $|\hat{D}_{8}|$ with norm $\sqrt{12}$ (see Section C.1). We call this ball-shaped $2^{16}$-entry lattice codebook “E8P.”
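The counts in this construction can be checked by brute force. The sketch below assumes that $|\hat{D}_{8}|$ consists of all 8-vectors with positive half-integer entries, which follows if $\hat{D}_{8}$ is the set of half-integer vectors with even coordinate sum (any positive half-integer vector can be brought to even sum with at most one sign flip). The function name is hypothetical, and the specific 29 padding elements are not reproduced here.

```python
from fractions import Fraction
from itertools import product


def abs_d8_hat_where(norm_sq_predicate):
    """Enumerate positive half-integer 8-vectors whose squared norm passes the predicate.

    Entries up to 5/2 suffice for squared norms <= 12, since a single entry of
    7/2 already contributes 49/4 > 12.
    """
    values = [Fraction(1, 2), Fraction(3, 2), Fraction(5, 2)]
    return [
        v for v in product(values, repeat=8)
        if norm_sq_predicate(sum(x * x for x in v))
    ]


core = abs_d8_hat_where(lambda q: q <= 10)      # norm <= sqrt(10)
shell_12 = abs_d8_hat_where(lambda q: q == 12)  # norm == sqrt(12)

print(len(core))      # 227
print(len(shell_12))  # 224 candidates, of which 29 are used as padding

# 256 source entries x 128 sign patterns x 2 shifts gives a 16-bit codebook.
print((len(core) + 29) * 128 * 2 == 2 ** 16)  # True
```

Exact rational arithmetic via `Fraction` avoids any floating-point ambiguity at the norm boundaries.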

Fine-Tuning During Quantization

Recent works have suggested that at extreme quantization levels (e.g. 2 bits), inter-layer interactions are a significant hurdle to lossless quantization (Shao et al., 2024; Egiazarian et al., 2024). Here, we employ a simple fine-tuning algorithm that attempts to recover the original unquantized model during quantization. Our fine tuning method runs on a small development set and works by relaxing the sign vectors in the RHT to arbitrary real vectors after quantization.

Our fine tuning method contains two steps. First, we fine-tune within each transformer block by quantizing the first linear layer, fine-tuning the remaining parameters to match the unquantized model’s output activations on the unquantized model’s input activations, quantizing the second linear layer, fine-tuning, and so on until all linear layers are quantized. This step attempts to minimize the activation error caused by an individual linear layer during quantization, and it is parallelizable across transformer blocks as the activation error does not consider the effect of quantizing preceding blocks. The idea of fine-tuning on the level of a transformer block was previously proposed in Egiazarian et al. (2024); our methodology differs in that we set a different set of parameters to be trainable. In the second step, after all linear layers in the model are quantized, the unquantized parameters (layernorms, sign vectors, language model head) are fine-tuned to minimize activation error over the entire model.

By optimizing the sign vectors as real vectors instead of binary vectors in both steps, we allow the incoherence processing step to shape the weight matrix to the codebook. While this means we must store the sign vectors in FP16 instead of as bitvectors, the size of LLM matrices means that the sign vectors still add less than 0.01 bits per weight. We describe these steps in more detail in Section D.
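The storage claim above can be checked with a short calculation. The sketch assumes one length-$m$ and one length-$n$ sign vector per $m\times n$ weight matrix. The $4096\times 4096$ shape is only an illustrative example of an LLM-sized layer, and the function name is hypothetical.

```python
def sign_vector_overhead_bits_per_weight(m: int, n: int, bits_per_entry: int = 16) -> float:
    """Bits per weight needed to store one length-m and one length-n vector."""
    return bits_per_entry * (m + n) / (m * n)


# FP16 sign vectors for an illustrative 4096 x 4096 layer.
fp16_overhead = sign_vector_overhead_bits_per_weight(4096, 4096)
print(fp16_overhead)  # 0.0078125

# The same vectors stored as 1-bit signs, for comparison.
bit_overhead = sign_vector_overhead_bits_per_weight(4096, 4096, bits_per_entry=1)
print(bit_overhead)  # 0.00048828125
```

Because the overhead scales as $(m+n)/(mn)$, it shrinks further for larger layers.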

Experiments

Our main experiments show the performance of QuIP# on the Llama 1 (Touvron et al., 2023a) and 2 (Touvron et al., 2023b) family of models. These models range in size from 7 billion to 70 billion parameters and offer good performance, making them suitable for understanding how quantization methods perform and scale. Additional results for other models are available in the Appendix.

In Section 6.1, we compare QuIP# with recently published weight-only PTQ methods. AWQ scales weights by activation magnitudes before quantizing to reduce outliers (Lin et al., 2023). OmniQuant learns model-preserving layerwise transformations that reduce outliers per transformer block (Shao et al., 2024). AQLM uses additive quantization with learnable per-layer codebooks and performs a fine-tuning step on codebook entries and layernorms (Egiazarian et al., 2024). We also include QuIP (Chee et al., 2023) as a baseline for the improvements in QuIP#.

We report WxxA16 numbers for AWQ and OmniQuant from the OmniQuant paper and AQLM numbers from AQLM. We note that there are currently 2 methods for evaluating perplexity: using the Llama 1 context length of 2048 or using the model’s native context length (e.g. 4096 for Llama 2). OmniQuant and AWQ use 2048 for Llama 2 while AQLM uses 4096; we report both sets of numbers. We also note that the AQLM paper reports QuIP# numbers from an outdated version of QuIP#. QuIP# is under active development and the numbers here represent the latest QuIP# numbers. Finally, we bold numbers in our tables when they are clearly better, such as a smaller model matching or outperforming a larger model or a similar sized model significantly outperforming another model.

Table 2 shows a comparison of QuIP# with OmniQuant, AWQ, and QuIP# (no FT and no $E_{8}$ lattice codebook) with context length 2048. QuIP# offers a paradigm shift in quantization quality over OmniQuant and AWQ. Notably, while AWQ falls apart at even 2.15 bits (Shao et al., 2024) and OmniQuant produces unusable models at 2 bits, QuIP# produces high quality models that are close to OmniQuant 3 bit models. Table 2 also shows the importance of incoherence processing. QuIP# without fine-tuning or lattice codebooks significantly outperforms OmniQuant and AWQ, which both rely on heuristics to reduce model outliers during quantization.

Table 4 shows a comparison of QuIP# with AQLM with context length 4096. QuIP# offers strong improvements over AQLM at the 2 and 3 bit level, either significantly outperforming similarly sized models or offering similar performance with a smaller model (in our experience, at extreme quantization levels, even 0.1 bits can make a significant difference in quantization quality). At the 4 bit level, QuIP# and AQLM both perform similarly. This is not surprising as state-of-the-art 4 bit models are all very close to FP16 performance. Furthermore, the QuIP# 3 and 4 bit results presented in this paper use residual vector quantization; one could potentially achieve better numbers with more advanced multi-codebook quantization approaches.

Table 3 shows zeroshot results for QuIP#, AQLM, and OmniQuant. Both AQLM and QuIP# significantly outperform OmniQuant, which correlates with the perplexity results. AQLM and QuIP# both perform very close to FP16 at higher bitrates and for larger models, but QuIP# tends to outperform AQLM at lower bitrates and model sizes. We note that zeroshot tasks have an element of randomness and even FP16 numbers can disagree by up to 0.5%.

6.2 QuIP# Bit Scaling

Figures 1 (first page) and 4 show how QuIP# scales on the Llama family of models and Wikitext2. On both Llama 1 and 2, QuIP# 3 bit outperforms QuIP# 4 bit and QuIP# 2 bit offers similar scaling to 3 and 4 bit models. Furthermore, on Llama 2, QuIP# 3 bit outperforms a theoretical lossless 4 bit model (FP16 at 4 bits). To the best of our knowledge, this is the first time a 3 bit PTQ method has outperformed a theoretical lossless 4 bit model and also the first time a 2 bit PTQ method has offered similar scaling to higher bitrates.

6.3 Efficient Inference with QuIP#

Table 5 shows 2 and 4 bit QuIP# Llama model generation speed measured on an NVIDIA RTX 4090 GPU with a “proof of concept” CUDA implementation of E8P and the FlashAttention library (Dao et al., 2022; Dao, 2023) implementation of Llama. 4 bit generation speed is faster than 50% of the 2 bit speed since both bitrates still need to perform two Hadamard multiplies per linear layer. We note that we are not CUDA experts and a better optimized implementation of E8P could likely achieve higher throughput. Nevertheless, we find that QuIP# inference is fast and scalable on modern GPUs.

6.4 Ablations

Table 4 also contains an ablation on the various components of QuIP#. The “no FT” row shows QuIP# without fine-tuning and the “no $E_{8}$” row shows QuIP# without fine-tuning and lattice codebooks. For the latter, we round to the 1-dimensional half-integer grid. We also include QuIP numbers as reported by AQLM. At all bitrates, each component of QuIP# brings additional performance gains. The difference between QuIP and QuIP# without fine-tuning and lattice codebooks also shows the difference between QuIP’s Kronecker factorization and QuIP#’s RHT. The RHT offers stronger incoherence properties than the Kronecker factorization (Section 3), which improves performance.

Conclusion

We present QuIP#, a weight-only post training compression method that achieves state-of-the-art results on LLMs at 2, 3, and 4 bits per weight. QuIP# uses the Randomized Hadamard Transform as an efficient and principled form of outlier suppression, and introduces the $E_{8}$ lattice-based E8P codebook to better quantize RHT transformed weights. The E8P codebook is highly symmetric and admits fast inference, allowing a “proof of concept” QuIP# CUDA implementation to achieve over 50% peak memory bandwidth on modern GPUs. QuIP# also implements inter-layer fine tuning, further improving quantization. To the best of our knowledge, QuIP# is the first PTQ method to achieve superior scaling at 3 bits over 4 bits and similar scaling at 2 bits to higher bitrates. Our results indicate that, in the near future, 2 bit models are likely to scale better than 3 bit ones.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Acknowledgements

We thank, in no particular order, David Hou for helping with the QuIP# CUDA implementation, Tiancheng Yuan for lending his RTX 4090 and helping with acquiring QuIP# timing numbers, Tri Dao for a fast CUDA implementation of the Hadamard transform and general help with QuIP#, and Together AI for compute resources.

References

Appendix A Concentration Inequalities for the Randomized Hadamard Transform and Fast Fourier Transform

We start with the following “standard” integral. For non-negative integer $m$ and real $n>0$,

Applying the Legendre duplication formula, for integer $m$,

where the last line follows from Lemma A.1. It follows from the standard application of Markov’s inequality that for any $a>0$,

That is, with probability at least $1-\epsilon$, multiplying by either $HS$ or $FP$ makes the resulting orthogonal matrix $\mu$-incoherent, where

Setting $b=e_{i}$ and $x=Ue_{j}$ in Lemma A.2,

proves the lemma. The FFT case is identical. ∎

That is, with probability at least $1-\epsilon$, multiplying on both sides by a randomized Hadamard transform or a randomized FFT yields a weight matrix that is $\mu_{W}$-incoherent, where

By applying this once on each side to the rows and columns respectively, and union bounding over the $mn$ entries, we get

The proof in the FFT case is identical. ∎

The incoherence of $H$ follows from the application of Lemma A.3. The incoherence of $W$ follows from the application of Lemma A.4. ∎

A.2 Incoherence Processing with the Randomized Fast Fourier Transform (RFFT)

Incoherence processing via the RFFT achieves similar theoretical guarantees as the RHT; see Lemmas A.3 and A.4. Ultimately the choice of the orthogonal transformation is up to the user. A Fourier transform works almost as well as a Hadamard transform in practice (Table 1), so if a fast Hadamard implementation is not available, the FFT is a good option.

Appendix B Block LDLQ

where $M_{D}=I\otimes\mathbf{1}_{d\times d}$ is the block diagonal mask. If, in addition, $H$ is $\mu$-incoherent in the sense that its matrix of eigenvectors $U$ has

Observe that the derivative of the loss is

If $R=L^{-1}$, then $HR=L^{T}D$. But this must be a block upper triangular matrix, because it’s the product of a unit block upper triangular matrix ($L^{T}$) and a block diagonal matrix $D$. It follows that $\nabla f(L^{-1})$ is zero in all the directions in which we could move $R$, since $R$ only varies in the strictly lower triangular directions. Therefore, $R=L^{-1}$ is the solution to this optimization problem, and for any $R$, $f(R)\geq f(L^{-1})=\operatorname{tr}\left(D\right)$.

Now, let $M$ denote the strictly block lower triangular mask, and observe that $M+M^{T}+M_{D}=\mathbf{1}_{nd\times nd}$. Set $\alpha=\left\|H^{1/2}\odot M_{D}\right\|_{2}^{-1}$, and consider $R=\left(I+\alpha M\odot H^{1/2}\right)^{-1}$. Observe that

It follows by inverting both sides that $RR^{T}\preceq\alpha^{-1}H^{-1/2}$.

This proves the first part of the lemma. For the second part, observe that

First recall that from the description of block LDLQ,

We can also write this in matrix form in terms of the matrix $\mathbf{L}_{k}$ as

Here, $\mathbf{Q}$ is interpreted as operating independently block-wise. Let $\eta$ denote the quantization error

But by assumption, $\mathbf{E}[\eta\eta^{T}]\preceq m\sigma^{2}I$ (since each block is just an independent application of $\mathbf{Q}$ and we sum over $m$ rows), so

Combining this with the result of Lemma B.1 proves the theorem. ∎

Appendix C E8P details

We use the following 29 elements of $\hat{D}_{8}$ with norm squared 12 to pad $S$ to 256 entries.

C.2 Example Decoding with E8P

Here, we give an example of decoding with E8P. In this example, the first 8 bits of the codeword encode the entry in $S$, the next 7 bits encode the 7 rightmost sign flips, and the last bit encodes whether or not we shift by $\frac{1}{4}$. Let the codeword be 0001010110010111. The first 8 bits 00010101 = 21 would indicate that we start with the 21st entry in $S$. In this example, let that be the vector

which is not in $\hat{D}_{8}$. Thus, $s$ requires an odd number of sign flips to get into $\hat{D}_{8}$. Then, the next 7 bits 1001011 would indicate that we need to negate the 1st, 2nd, 4th, and 7th entries from the right. Since we need an odd number of sign flips, the 8th entry from the right is also negated. The sign-decoded vector is then

which we can verify is in $E_{8}$. Finally, the last bit 1 indicates that we need to add $\frac{1}{4}$, so the final decoded vector is

which is in $E_{8}+\frac{1}{4}$ as desired.
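The decoding steps in this example can be sketched as follows. The source-codebook entry at index 21 is a hypothetical stand-in chosen so that, as in the example, it is not in $\hat{D}_{8}$. The treatment of a zero shift bit as $-\frac{1}{4}$ is an assumption based on the “$\pm\frac{1}{4}$” description in the main text, and the function name is illustrative.

```python
from fractions import Fraction

QUARTER = Fraction(1, 4)


def decode_e8p(codeword, source_table):
    """Decode a 16-bit E8P codeword into an 8-vector of exact Fractions."""
    index = codeword >> 8               # first 8 bits: entry in S
    sign_bits = (codeword >> 1) & 0x7F  # next 7 bits: flips for the 7 rightmost entries
    shift_bit = codeword & 1            # last bit: the 1/4 shift
    s = source_table[index]

    # flips[k] == 1 means: negate the (k+1)-th entry counting from the right.
    flips = [(sign_bits >> k) & 1 for k in range(7)]
    # s has positive half-integer entries; it needs an odd number of flips
    # exactly when its coordinate sum is odd.
    needs_odd_flips = int(sum(s) % 2 == 1)
    # Infer the 8th flip so the total number of flips has the required parity.
    flips.append((sum(flips) + needs_odd_flips) % 2)

    signed = [-x if flips[7 - i] else x for i, x in enumerate(s)]
    # Assumption: a shift bit of 0 is read as -1/4 (the example only uses bit 1).
    shift = QUARTER if shift_bit else -QUARTER
    return [x + shift for x in signed]


half, three_halves = Fraction(1, 2), Fraction(3, 2)
# Hypothetical entry 21 of S: coordinate sum is 5 (odd), so it is not in D8-hat.
table = {21: (half, half, half, three_halves, half, half, half, half)}

decoded = decode_e8p(0b0001010110010111, table)
print([str(x) for x in decoded])
# ['-1/4', '-1/4', '3/4', '7/4', '-1/4', '3/4', '-1/4', '-1/4']
```

Subtracting $\frac{1}{4}$ from each decoded entry gives a half-integer vector with even coordinate sum, which is the membership condition used for $\hat{D}_{8}\subset E_{8}$ in this sketch.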

C.3 Why not K-Means?

A significant motivating factor behind E8P is that post-incoherence processing, entries of $W$ are approximately Gaussian distributed. However, E8P is uniformly distributed, which raises the question: why not use a K-means based codebook? K-means based codebooks offer strong theoretical performance but have a few issues. First, it is difficult to enforce symmetry in a “learned” K-means codebook. This is crucial to be able to have a compressible codebook. If we force sign symmetry by learning cluster centers on only the positive orthant of an $n$-dimensional Gaussian, we can get around this but sacrifice accuracy at the axis region. Second, using K-means requires storing a codebook in fp16, whereas the entries of E8P can be stored as 4 bit integers. This means that during inference, the source codebook for an 8-dimensional K-means codebook will be 4 times larger than the source codebook of E8P, running the risk of a cache eviction. Finally, we observe that empirically, E8P actually outperforms K-means, which is somewhat interesting and suggests that allocating more information to the edge of the distribution, even after incoherence processing, is useful.

Appendix D Fine-Tuning During Quantization

In Algorithm 5 we describe our fine tuning procedure for QuIP#.

Appendix E Additional Results

E.2 Zeroshot performance for ablation on lattice codebooks and fine-tuning

E.3 More Scaling Plots

Appendix F Implementation Details

This section contains implementation details for our Llama experiments. These details also mostly apply to the Mixtral and Falcon numbers except we use the Falcon dataset (Almazrouei et al., 2023) as it is publicly available.

Hessian matrices $H$ were generated with 6144 sequences of a model’s native context length (2048 for Llama 1, 4096 for Llama 2) from the RedPajama 1T (Computer, 2023) dataset.

F.2 Hadamard Matrices

We use Hadamard matrices available at Neil Sloane’s website (Sloane, ).

F.3 Perplexity and Zeroshot Evaluation

We use the OPTQ (Frantar et al., 2023) “Wikitext2” and “C4” (not “C4 New”) sampling functions to calculate perplexity for our experiments. We use LM Eval (Gao et al., 2023) to calculate zeroshot numbers.

F.4 Fine Tuning

For the within-transformer block section of fine-tuning, we use the Adam optimizer (Kingma & Ba, 2017), a learning rate of $5\times 10^{-5}$, batch size of 8, and sequence length equal to the model’s native context length. We train on a small development dataset of 256 sequences from RedPajama 1T and validate on 128 sequences. We train for 5 epochs (160 steps) and keep the best model parameters based on the validation set. For the end to end tuning, we use the Adam optimizer, a learning rate of $5\times 10^{-5}$, batch size of 1, sequence length equal to the model’s context length (except for 70B, where we had to use 3072 to avoid an OOM on our not very well optimized training script), and the same dataset and epoch setup as before. We observe that outside of using a low enough learning rate, the other hyperparameters did not affect fine-tuning much. For the 2 bit models, we used a learning rate of $5\times 10^{-4}$ for $S_{U}$ and $S_{V}$ ($5\times 10^{-5}$ for everything else as above) for both the within-block and end to end fine tuning stages.

F.5 Hardware

All experiments were run on NVIDIA A100 GPUs except for the timing numbers, which were measured on an NVIDIA RTX 4090 to see what was possible with the current state-of-the-art NVIDIA consumer GPU. We find that we can quantize Llama 2 70B without fine tuning in under 10 GPU-hours and with fine tuning in around 100 GPU-hours. Neither number includes Hessian generation, which can be done once for a model and reused across many different quantization experiments.

F.6 Code and Prequantized Models

Our code is available at https://github.com/Cornell-RelaxML/quip-sharp and prequantized QuIP# models are available at https://huggingface.co/relaxml.

Appendix G Example Generation

Below are some example generations from Llama 2 70B chat quantized with QuIP# to 2 bits, truncated to 256 tokens.