OneBit: Towards Extremely Low-bit Large Language Models

Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, Wanxiang Che

Introduction

Transformer (Vaswani et al., 2017) has emerged as the pivotal architecture in large language models (LLMs), fundamentally reshaping the approach to natural language processing in the deep learning era (Bubeck et al., 2023; Touvron et al., 2023; Bisk et al., 2020). Despite their popularity, deploying transformer-based LLMs presents significant challenges due to their computational intensity and considerable memory requirements, which grow with the number of parameters. For instance, even a moderately sized LLM like LLaMA-13B (Touvron et al., 2023) requires around 26GB of memory to load all of its parameters in FP16 format. Such overheads make deploying LLMs difficult beyond mid-to-high-end GPUs like the A100, let alone on mobile devices. The high demand for resources not only drives up usage costs, but also restricts their wider application.

Numerous efforts (Dettmers et al., 2022; Frantar et al., 2022; Frantar and Alistarh, 2023) have been devoted to reducing the computational and memory overheads of LLMs, while still preserving most of their original model capabilities. Among these efforts, quantization has gained widespread attention, particularly Post-Training Quantization (PTQ), which benefits from its lower transfer cost. Seminal studies such as GPTQ (Frantar et al., 2022), SpQR (Dettmers et al., 2023b), and AWQ (Lin et al., 2023) successfully compress the weight matrices of LLMs to 4-bit values while maintaining the main abilities of LLMs. Efficient quantization represents a significant advance in LLM optimization, achieving a balance between time and space efficiency as well as model performance.

Unfortunately, the efficacy of PTQ rapidly diminishes when the quantization bit-width is extremely low, as shown in Figure 1. Existing PTQ methods have managed to compress weight matrices to no fewer than 3 bits (Dettmers and Zettlemoyer, 2023). Recent research seeks to leverage Quantization-Aware Training (QAT) to overcome the bottlenecks faced by PTQ. LLM-QAT (Liu et al., 2023) introduces a few learnable parameters into the quantization process, achieving notable results. OmniQuant (Shao et al., 2023), integrating learnable equivalent transformation, presents promising results in 2-bit quantization. However, the performance of existing methods declines sharply when compressing model weights to 1 bit, and they struggle to maintain effectiveness. This mainly stems from the drastic precision loss at extremely low bit-width in the weight matrix $\mathbf{W}$, which significantly increases the error of the linear projection $\mathbf{WX}$, the core operator within LLMs.

In this paper, we propose a novel Linear layer and Sign-Value-Independent Decomposition (SVID) for weight matrices to represent LLMs using approximately 1-bit values. In SVID, each original high-bit weight matrix is decomposed into one sign matrix ( $\pm 1$ ) and two value vectors. The value vectors provide necessary floating-point precision in linear projection at little cost and help the model to be trained easily. The sign matrix maintains the high rank of the original weight matrix with a small space cost, thereby preserving high information capacity. SVID offers a better parameter initialization for 1-bit models and we employ quantization-aware knowledge distillation to transfer the capabilities of the original model to the proposed 1-bit counterpart. Experiments demonstrate that our method performs well at the W1A16 (1-bit weight and 16-bit activation) quantization level. Furthermore, our 1-bit model is more amenable to training and knowledge transfer than previous works. In summary, the contributions of this work are 3-fold:

We propose a novel and efficient 1-bit model architecture for LLMs, which can improve both the time and space efficiency during model inference. Moreover, our architecture is more stable when quantizing LLMs.

We propose SVID to decompose high-bit matrices into low-bit ones, which is essential for the initialization of our 1-bit architecture. Experiments demonstrate that the SVID-based initialization can improve the model performance and convergence speed.

Extensive experiments demonstrate that our method works well in model sizes from 1.3B to 13B in OPT, LLaMA, and LLaMA2, showcasing its generalizability.

Related Work

Quantization, pruning, and knowledge distillation (KD) are the mainstream methods for model compression. Quantization compresses model weights into low-bit values (Frantar et al., 2022; Lin et al., 2023; Dettmers et al., 2023a). For data type alignment in computation and reducing memory, it also involves quantizing activation (Dettmers et al., 2022; Xiao et al., 2023) and key-value cache (Shao et al., 2023). Pruning simplifies model complexity by removing unimportant weights or modules, thereby sparsifying the original larger models (Frantar and Alistarh, 2023; Sun et al., 2023; Ma et al., 2023). KD trains a smaller student model under the guidance of a larger teacher model (Hsieh et al., 2023; Agarwal et al., 2023), achieving the purpose of compressing the larger one. Beyond these methods, low-rank factorization approximates the original weight matrix $\mathbf{W}$ with the product of two lower-rank matrices (Xu et al., 2023) and also achieves promising results. Our work belongs to quantization, using KD for knowledge transfer from the original LLM and uniquely focusing on extremely low bit-width quantization. For more details about model compression, readers can refer to existing surveys (Wan et al., 2023; Zhu et al., 2023).

2.2 Large Language Model Quantization

Since this paper aims to obtain extremely low-bit LLMs, we introduce LLM quantization in more detail here. Quantization is a popular and crucial method for model compression, capable of achieving a significant compression ratio with a relatively small loss. It can be classified into Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) according to when quantization is applied.

PTQ directly converts trained models into lower-bit counterparts using accurate solvers and limited calibration data without additional training. For example, GPTQ (Frantar et al., 2022) quantizes weight matrices row by row and adjusts the remaining weights to compensate for the precision loss caused by quantization, achieving nearly lossless 4-bit weight quantization. Moreover, numerous studies have observed the effect of “outliers” in quantization (Dettmers et al., 2022; Kim et al., 2023b; Lin et al., 2023). LLM.int8() (Dettmers et al., 2022) suggests mixed-precision decomposition to ensure the accuracy of a few outliers in activations. SmoothQuant (Xiao et al., 2023) reduces the difficulty of quantization by smoothing the outliers of activation. SpQR (Dettmers et al., 2023b) identifies sensitive weights to ensure their precision, while quantizing other weights to lower bit-width.

QAT integrates quantization steps within the model, applying them during training or fine-tuning. It allows the model to better adapt to the reduced precision induced by quantization, leading to improved performance compared to PTQ. LLM-QAT (Liu et al., 2023) introduces a small number of learnable parameters into quantization and employs KD using data generated by the original model itself. OmniQuant (Shao et al., 2023; we classify it as QAT) further introduces learnable equivalent transformation, achieving acceptable results in 2-bit weight quantization. PEQA (Kim et al., 2023a) and QLoRA (Dettmers et al., 2023a) focus on fine-tuning a limited number of extra parameters to mitigate the precision loss caused by sub-4bit weight quantization. Our work is closely related to QAT, but due to the unique challenges posed by 1-bit quantization, our representation and initialization methods of quantized weights are distinct from any existing work.

Methodology

This section presents our 1-bit architecture for the Linear layers to be quantized and discusses how to initialize the quantized model to achieve better performance in knowledge distillation. We start with a short review of classical weight quantization methods in Section 3.1 and then formulate OneBit in detail from Section 3.2 to Section 3.4.

The main idea of model quantization is to compress each weight matrix $\mathbf{W}$ within models in FP32 or FP16 format to a low-bit counterpart. Specifically, we often quantize the weight matrices of Linear layers in transformer to 8, 4, and even 2 bits.

The majority of quantization studies primarily employ the round-to-nearest (RTN) method, by which the weight $w$ is rounded to the nearest value in the quantization grid. It can be formulated as
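For reference, RTN is commonly written in the following form. The symbols $s$ (quantization scale), $z$ (zero point), and $N$ (bit-width) follow their use in the next paragraph; the exact expression is the standard definition and is assumed here, with $\mathrm{Clip}(\cdot)$ truncating the result to the range from $0$ to $2^{N}-1$.

```latex
\hat{w} = \mathrm{Clip}\left(\left\lfloor \frac{w}{s} \right\rceil + z,\; 0,\; 2^{N}-1\right) \tag{1}
```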

Furthermore, when $N$ equals $1$ , quantization based on RTN method is essentially equivalent to setting a threshold, with weight $w$ on either side of it being converted to corresponding integer value $\hat{w}$ . In such a scenario, the parameters $s$ and $z$ in Eq. (1) effectively lose their practical significance. Consequently, when quantizing weights to 1 bit, the element-wise RTN operation drastically undermines the precision of the weight matrix $\mathbf{W}$ , leading to poor performance of the quantized model.
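The effect described above can be illustrated with a minimal NumPy sketch. It assumes an asymmetric min-max choice of $s$ and $z$ (one common convention, not necessarily the one used by any specific baseline), and the function name `rtn_quantize` is hypothetical.

```python
import numpy as np

def rtn_quantize(w, n_bits):
    """Round-to-nearest quantization of an array to n_bits, then dequantize."""
    q_max = 2 ** n_bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    s = (w_max - w_min) / q_max      # scale (assumed min-max calibration)
    z = np.round(-w_min / s)         # zero point
    w_int = np.clip(np.round(w / s) + z, 0, q_max)
    return (w_int - z) * s           # map integers back to real values

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256))

# Mean squared reconstruction error for several bit-widths.
errors = {n: float(np.mean((w - rtn_quantize(w, n)) ** 2)) for n in (8, 4, 2, 1)}
for n, e in errors.items():
    print(f"{n}-bit RTN, mean squared error: {e:.3e}")
```

With only two grid points available at $N=1$, the reconstruction error is the largest of the four settings, which matches the argument that element-wise RTN is a poor fit for 1-bit weights.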

3.2 1-bit Linear Layer Architecture

Due to the severe precision loss of 1-bit weight quantization, converting weight matrices in Linear layers directly from FP32/16 to 1-bit format based on RTN is challenging. Wang et al. (2023) explore this possibility by studying the capabilities of purely 1-bit weight matrices, training the 1-bit model from scratch. In the W1A16 setting, their Linear layers are designed as

where $\mathbf{g}$ and $\mathbf{h}$ are the two FP16 value vectors. Note that we specify the calculation order using brackets in Eq. (3) to minimize the time and space cost. The main difference between Wang et al. (2023) and OneBit is the extra parameters $\mathbf{g}$ and $\mathbf{h}$. Although additional parameters are introduced, the benefits far outweigh their small cost. For instance, when we quantize one weight matrix with the shape $4096\times 4096$, the average bit-width of the quantized result is 1.0073. See A.6 for the details.
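A layer of this kind can be sketched in NumPy as follows. The sketch assumes the forward pass scales the input element-wise by $\mathbf{g}$, multiplies by the $\pm 1$ sign matrix, scales the output element-wise by $\mathbf{h}$, and then applies LayerNorm; the names `onebit_linear` and `layer_norm` are illustrative, and the LayerNorm here has no learned affine parameters.

```python
import numpy as np

def layer_norm(y, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned affine)."""
    mean = y.mean(axis=-1, keepdims=True)
    var = y.var(axis=-1, keepdims=True)
    return (y - mean) / np.sqrt(var + eps)

def onebit_linear(x, w_sign, g, h):
    """x: (batch, in), w_sign: (out, in) with entries +/-1, g: (in,), h: (out,)."""
    # The bracket order matters: both value vectors are applied element-wise,
    # so no full-precision (out, in) matrix is ever materialized.
    y = ((x * g) @ w_sign.T) * h
    return layer_norm(y)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)).astype(np.float32)
w_sign = np.sign(rng.normal(size=(16, 8))).astype(np.float32)
g = np.abs(rng.normal(size=8)).astype(np.float32)
h = np.abs(rng.normal(size=16)).astype(np.float32)
z = onebit_linear(x, w_sign, g, h)
```

Applying the vectors before and after the matrix product keeps the extra work linear in the layer width, while the sign matrix carries one bit per weight.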

3.3 Sign-Value-Independent Decomposition

We can employ some widely used matrix decomposition methods to perform the rank-1 approximation, such as SVD (Beltrami, 1990) and NMF (Paatero and Tapper, 1994).

Given the weight matrix $\mathbf{W}$ and input $\mathbf{X}$ , the Linear layer can be reformulated as the following according to SVID:

Proposition 2

Note that, given the predominantly low precision of most parameters, it is quite challenging to approximate the weight matrix $\mathbf{W}$ accurately. SVID does not aim to precisely replicate the original model’s parameters, but to provide an effective starting point for further training that leverages the extensive training of the original model. Details on transferring knowledge from the original model to the quantized counterpart are in Section 3.4.
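The decomposition described in this section can be sketched as below, assuming SVID keeps the sign of every weight and takes a rank-1 approximation of $|\mathbf{W}|$ as the outer product of two value vectors. SVD is used to keep the sketch dependency-free; the function name `svid` and the vector names `a` and `b` are illustrative.

```python
import numpy as np

def svid(w):
    """Split w (out, in) into a +/-1 sign matrix and value vectors a (out,), b (in,)
    such that w is approximated by w_sign * outer(a, b)."""
    w_sign = np.where(w >= 0, 1.0, -1.0)
    u, s, vt = np.linalg.svd(np.abs(w), full_matrices=False)
    # The leading singular vectors of a non-negative matrix share one sign,
    # so taking absolute values only removes the arbitrary global sign.
    a = np.abs(u[:, 0]) * np.sqrt(s[0])
    b = np.abs(vt[0, :]) * np.sqrt(s[0])
    return w_sign, a, b

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(16, 8))
x = rng.normal(size=(4, 8))
w_sign, a, b = svid(w)

# Dense form: rebuild the approximated weight matrix, then project.
dense = x @ (w_sign * np.outer(a, b)).T
# Factored form: scale the input, multiply by signs, scale the output.
factored = ((x * b) @ w_sign.T) * a
rel_err = np.linalg.norm(w - w_sign * np.outer(a, b)) / np.linalg.norm(w)
```

The dense and factored forms give the same output, which is why the layer can store only the sign matrix and the two vectors.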

3.4 Knowledge Transfer

We employ quantization-aware knowledge distillation to transfer knowledge from the original model (i.e. teacher model) to the quantized one (i.e. student model). In the student model, the elements of matrix $\mathbf{W}$ and vectors $\mathbf{g}$ / $\mathbf{h}$ in Eq. (3) are trained. We use a cross-entropy loss on the logits and a mean-square-error loss on the hidden states of the full-precision teacher model to guide the quantized student model (Sun et al., 2019). Language modeling loss is not used. The cross-entropy is defined as

where $c$ denotes the number of classes and $n_{s}$ denotes the number of training samples in the current batch. $\mathcal{T}$ and $\mathcal{S}$ are the teacher model and student model, respectively. The error of hidden states is defined as

where $n_{l}$ denotes the number of layers and $\mathbf{q}$ denotes the hidden state. Hence the final objective function can be formulated as

where $\alpha$ is the hyper-parameter that balances the importance of the cross-entropy loss and the features in the intermediate layers.
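The objective described above can be sketched in NumPy as follows. The sketch assumes the cross-entropy is taken between teacher and student output distributions and averaged over the $n_{s}$ samples, and that the hidden-state error is a squared distance between L2-normalized hidden states summed over layers and samples; both the normalization and the reduction are assumptions, and `kd_loss` is a hypothetical name.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kd_loss(t_logits, s_logits, t_hidden, s_hidden, alpha=1.0):
    """t_logits, s_logits: (n_s, c). t_hidden, s_hidden: (n_l, n_s, d)."""
    # Cross-entropy of the student distribution under the teacher distribution.
    ce = -np.mean(np.sum(softmax(t_logits) * log_softmax(s_logits), axis=-1))
    # Squared error between L2-normalized hidden states (assumed normalization).
    t_n = t_hidden / np.linalg.norm(t_hidden, axis=-1, keepdims=True)
    s_n = s_hidden / np.linalg.norm(s_hidden, axis=-1, keepdims=True)
    mse = np.sum((t_n - s_n) ** 2)
    return float(ce + alpha * mse)

rng = np.random.default_rng(0)
t_logits = rng.normal(size=(5, 10))
t_hidden = rng.normal(size=(3, 5, 16))
s_logits = t_logits + rng.normal(scale=0.5, size=t_logits.shape)
s_hidden = t_hidden + rng.normal(scale=0.5, size=t_hidden.shape)

loss_perfect = kd_loss(t_logits, t_logits, t_hidden, t_hidden)
loss_noisy = kd_loss(t_logits, s_logits, t_hidden, s_hidden)
```

A student that reproduces the teacher exactly attains the minimum of both terms, so `loss_perfect` is lower than `loss_noisy`.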

Experiments

We experiment with 1-bit weight-only quantization and maintain 16-bit activation (W1A16) in this work. We evaluate our approach by performing experiments on OPT-1.3B/2.7B models, LLaMA-7B/13B models and LLaMA2-7B/13B models, and present results on various tasks.

For the training data of our quantization-aware knowledge distillation, we follow Liu et al. (2023) to synthesize a corpus using next-token generation from the original teacher model. The procedure randomizes the first token from the vocabulary and generates the next token iteratively until reaching either the end-of-sequence token or the maximum length. Specifically, the top-1 predictions are selected deterministically for the first 3 to 5 tokens, followed by stochastic sampling for the remaining tokens. We utilized LLaMA-7B to generate a total of 132k data entries, each with a maximum length of 2,048.
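The generation procedure can be sketched as follows. Here `next_token_probs` is a hypothetical callable standing in for the teacher model, `toy_model` is a made-up stand-in used only to make the sketch runnable, and the fixed `n_greedy=3` is one choice within the 3 to 5 range mentioned above.

```python
import numpy as np

def generate_sample(next_token_probs, vocab_size, eos_id, max_len, rng, n_greedy=3):
    """next_token_probs(tokens) returns a probability vector over the vocabulary."""
    tokens = [int(rng.integers(vocab_size))]              # random first token
    while len(tokens) < max_len and tokens[-1] != eos_id:
        probs = next_token_probs(tokens)
        if len(tokens) <= n_greedy:
            nxt = int(np.argmax(probs))                   # deterministic top-1
        else:
            nxt = int(rng.choice(vocab_size, p=probs))    # stochastic sampling
        tokens.append(nxt)
    return tokens

def toy_model(tokens, vocab_size=50):
    # Stand-in for the teacher: a fixed distribution keyed on the last token.
    local = np.random.default_rng(tokens[-1])
    logits = local.normal(size=vocab_size)
    e = np.exp(logits - logits.max())
    return e / e.sum()

sample = generate_sample(toy_model, vocab_size=50, eos_id=0, max_len=32,
                         rng=np.random.default_rng(0))
```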

Training Details

Every KD experiment learns the training data over 50 epochs, from which 2048-token segments are selected. We employ NMF in scikit-learn (https://scikit-learn.org/) to decompose the weight matrices in SVID. The quantized student models are optimized by Adam (Kingma and Ba, 2014) with $\beta_{1}=0.9$, $\beta_{2}=0.98$. The learning rate for all experiments is scheduled by a cosine strategy. We use NVIDIA A100 GPUs and maintain FP16 precision while training quantized models. For additional details such as learning rate, please refer to Table 1.

Baselines

To our knowledge, there is no previous work exploring the 1-bit quantization of LLMs from a knowledge transfer perspective. To this end, we relax the quantization bit-width of baselines to 2 bits (W2A16) while maintaining the W1A16 setting in our method. We compare our method with GPTQ (Frantar et al., 2022), LLM-QAT (Liu et al., 2023) and OmniQuant (Shao et al., 2023). To ensure a fair comparison in terms of space usage, baselines do not employ grouped quantization. Additionally, we included the results of vanilla transformers with FP16 precision as a reference. While the recent work BitNet (Wang et al., 2023) also introduced one 1-bit model architecture, it only focused on training models from scratch. We also analyze its capability to transfer knowledge from the original models in Appendix A.5.

Evaluation Metrics

We evaluate quantized models by measuring perplexity on the validation sets of WikiText2 (Merity et al., 2016) and C4 (Raffel et al., 2020). Lower perplexity indicates that the compressed model is better at preserving the output distribution of the original model. Furthermore, accuracies on zero-shot tasks including Winogrande (Sakaguchi et al., 2021), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), BoolQ (Clark et al., 2019), and ARC (Clark et al., 2018) are also reported. They evaluate whether the capabilities of the original model on downstream tasks are retained. We utilize the open-source toolkit “LM-Evaluation-Harness” (https://github.com/EleutherAI/lm-evaluation-harness) to perform the perplexity test and all zero-shot tasks.
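As a reminder of what the perplexity metric measures, the following sketch computes it from per-token log-probabilities using the standard definition (the exponential of the average negative log-likelihood). It is not the evaluation toolkit's implementation, and the function name `perplexity` is illustrative.

```python
import numpy as np

def perplexity(token_log_probs):
    """token_log_probs: natural-log probabilities that the model assigned to
    each ground-truth token of the evaluation text."""
    return float(np.exp(-np.mean(token_log_probs)))

# A model that assigns probability 0.25 to every correct token is as
# uncertain as a uniform choice among four options.
uniform_four = perplexity(np.log(np.full(10, 0.25)))
```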

4.2 Main Results

Table 2 compares our method with other typical strong baselines on different models. Due to space limitations, results of LLaMA2-7B/13B are listed in Appendix A.3. Across model sizes, our 1-bit weight quantization method clearly outperforms the others, which use the W2A16 setting. Moreover, the effectiveness of QAT-based methods consistently improves as the model size increases, whereas the result of the PTQ method, GPTQ, may degrade when model size increases (e.g., from 7B to 13B on LLaMA). This demonstrates that QAT-based methods can achieve stable results in extremely low-bit quantization. Specifically, our method approaches the performance of FP16 more closely as the model size increases. For instance, when scaling from LLaMA-7B to LLaMA-13B, the perplexity of the FP16 model decreases by only 0.59, whereas our method sees a reduction of 1.20.

For perplexity, only our method achieves results comparable to the strongest FP16 baseline. For instance, our method achieves 9.18 on the Wiki2 dataset with the LLaMA-13B model, while the FP16 baseline is 5.09. The performance loss of other methods is significant, even though they use 2-bit quantization, which is twice our bit-width. For GPTQ and LLM-QAT, the performance degradation after quantization is severe. As for OmniQuant, even though it is the strongest baseline under the W2A16 setting, it still suffers greater performance loss than our method does under the W1A16 setting.

For zero-shot accuracy, although all methods inevitably have some degradation, our method achieves the closest performance to the FP16 baseline among most models. On the OPT-1.3B/2.7B model, our method shows smaller performance loss on most tasks such as PIQA and ARC-e. Additionally, the loss of other tasks is negligible compared with the second-best baseline, OmniQuant. On the LLaMA-7B model, our method notably outperforms OmniQuant in all tasks except ARC-e/ARC-c, averaging about a 4% improvement overall.

4.3 Problem Solving Ability

We have demonstrated the superior performance of our method under the W1A16 setting, compared to other representative baselines. Although all methods inevitably face performance degradation in 1-bit weight quantization, it remains of interest how our method fares in solving practical problems among the various approaches to reducing model size, for instance, directly training smaller models (Zhang et al., 2024) or employing low-rank decomposition to reduce the number of parameters.

To this end, we consider two crucial abilities of LLMs: commonsense reasoning and world knowledge. For commonsense reasoning, we use the 6 tasks (Hellaswag, etc.) and settings described in Section 4.2. For world knowledge, we examine it using the Massive Multi-task Language Understanding (MMLU; Hendrycks et al., 2021), a benchmark that covers a wide range of domains and knowledge. We compare the following 4 models:

Pythia-1.0B (Biderman et al., 2023). A well-trained model released by EleutherAI whose memory footprint is 1.54x that of our 7B model.

TinyLLaMA-1.1B (Zhang et al., 2024). A model with the same structure as the LLaMA models, which undergoes continued training. To compare fairly, we use the checkpoint at 10k training steps, which is 2x that of our model.

LowRank LLaMA (Noach and Goldberg, 2020). Decompose every weight matrix in Linear layers to two low-rank matrices and learn from the original LLaMA-7B model by KD in the same setting of OneBit-7B.

OneBit-7B. The model that we use in Section 4.2, which is built with OneBit.

Figure 3(a) and 3(b) demonstrate common sense reasoning ability and general world knowledge of different models. We can observe that, although other models have more parameters and are more thoroughly trained than ours, our model still has advantages in common sense reasoning. This reflects the benefits inherited from the larger 7B model. In terms of world knowledge, despite a significant loss in social sciences, our model outperforms the fully trained Pythia-1B in other domains. These results demonstrate the practical usability of OneBit.

Analysis and Discussion

It is evident that extremely low-bit quantization of weights can significantly reduce the memory footprint of models. As shown in Table 3, the actual compression ratio increases as the model size increases. This is particularly meaningful for larger models, making it possible to fit the model into one GPU. While there is a performance loss, Figure 4 illustrates that our method achieves a good trade-off between space occupancy and model performance. For example, we can achieve comparable performance to FP16 with only 0.2x the model space. Furthermore, quantizing to $\pm 1$ also aids in accelerating matrix multiplication on CPUs. This is because the floating-point multiplication of elements in two matrices can be converted into much faster bit operations on these chips. Thus the substantial reduction in memory overhead allows these low-bit LLMs to meet the requirements for deployment on PCs and smartphones.
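Two of the points above can be illustrated with a small NumPy sketch: a $\pm 1$ matrix can be stored with one bit per weight, and a product with a $\pm 1$ matrix needs only additions and subtractions. This is an illustration under those assumptions rather than an optimized kernel, and the helper names are hypothetical.

```python
import numpy as np

def pack_signs(w_sign):
    """Store a +/-1 matrix with 1 bit per weight (bit 1 means +1)."""
    return np.packbits(w_sign > 0, axis=1)

def unpack_signs(packed, n_cols):
    """Recover the +/-1 matrix from its packed form."""
    bits = np.unpackbits(packed, axis=1)[:, :n_cols]
    return bits.astype(np.float32) * 2.0 - 1.0

def sign_matvec(x, w_sign):
    """Compute w_sign @ x without any multiplication by weights."""
    pos = w_sign > 0
    return np.where(pos, x, 0.0).sum(axis=1) - np.where(~pos, x, 0.0).sum(axis=1)

rng = np.random.default_rng(0)
w_sign = np.sign(rng.normal(size=(16, 20))).astype(np.float32)
x = rng.normal(size=20).astype(np.float32)

packed = pack_signs(w_sign)              # 20 bits per row fit in 3 bytes
restored = unpack_signs(packed, n_cols=20)
y = sign_matvec(x, w_sign)
```

In this toy case the packed matrix occupies 48 bytes, compared with 640 bytes for the same matrix stored in FP16.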

5.2 Robustness

Existing work (Wang et al., 2023) has already noted the instability within QAT. Extremely low-bit quantization makes the training process highly sensitive to the learning rate, making it difficult for the model to converge when the rate is too small or too large. This is primarily due to the large magnitude of gradients generated as the weight elements fluctuate between +1 and -1, leading to substantial fluctuations in the output of Linear layers. Experiments demonstrate that OneBit has a more stable training process and is not sensitive to the learning rate. Please refer to Appendix A.5 for more details.

5.3 Effect of Different Components

The variable components in our method primarily include Post-LayerNorm, value vectors, and parameter initialization.

Post-LayerNorm

We discover that models might experience floating-point overflow during the QAT process. As depth increases, the activations can become progressively larger. We tackle this by using Post-LayerNorm instead of Pre-LayerNorm, since Pre-LayerNorm may occasionally be ineffective.

Value Vectors

The main structural difference between OneBit and BitNet (Wang et al., 2023) is the two value vectors, which are demonstrated to be effective in Section 4.2. Please refer to Appendix A.5 for more details of comparison.

Parameter Initialization

In our proposed SVID, both NMF and SVD can be used to decompose $|\mathbf{W}|$ and we recommend using the former. This is because we find that NMF may make training converge faster. Figure 5 shows that initializing by NMF facilitates better performance.

Conclusion

We propose a model structure for 1-bit weight quantization and a corresponding parameter initialization method. Extensive experiments on models of various sizes and series demonstrate that OneBit has clear advantages over representative strong baselines and achieves a good tradeoff between model size and performance. We further analyze the capabilities of such extremely low-bit quantized models and provide guidance for future research.

Limitation

Although our proposed method significantly reduces the memory footprint of LLMs, bringing hope for their efficient deployment, there are still some limitations. Firstly, compared to the original model, our extremely low-bit quantization inevitably incurs a performance loss. Additionally, we do not yet understand the mathematical principles behind the optimal parameters of the 1-bit quantized model, thus capability transfer can only be achieved through the costly process of KD. Fortunately, this cost is a one-time expense. Moreover, due to the unique nature of 1-bit quantization, our method cannot be naturally extended to higher bit-width. Lastly, we have not considered activation quantization and leave it as future work.

Ethics Statement

In this study, we employ models that are publicly available and open source. We affirm that the use of these models aligns with their original intended purposes. These models have been utilized strictly within the scope of academic and research-based activities, adhering to ethical guidelines and ensuring compliance with open-source licenses.

References

Appendix A Appendix

In this section, we provide the necessary and detailed proofs for the propositions presented in this paper. All symbols have the same definition as in the main text.

Given the weight matrix $\mathbf{W}$ and input $\mathbf{X}$ , the Linear layer can be reformulated as the following according to SVID:

Proof

Lemma 1

Let $\sigma_{i}\left(\mathbf{W}\right)$ denote the $i$-th largest singular value of matrix $\mathbf{W}$. The following inequality holds:

Proof

According to the definition of the induced norm, we have

Note that for $\forall\mathbf{x}$ , $\|\mathbf{x}\|_{2}=1$ and we have

Proposition 2

Proof

Here we consider SVD to prove it. For SVD, the squared Frobenius norm of the error matrix $\mathbf{E}$ in the rank-1 approximation is the sum of the squares of all singular values except for the largest one. We have

Based on $\|\mathbf{W}\|_{F}^{2}=\||\mathbf{W}|\|_{F}^{2}$ , we have

From the equation in this proposition, we can formulate
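The argument of this subsection can be checked numerically with the sketch below. It assumes that Lemma 1 states $\sigma_{1}\left(|\mathbf{W}|\right)\geq\sigma_{1}\left(\mathbf{W}\right)$ and that Proposition 2 compares the rank-1 approximation error of $|\mathbf{W}|$ (with signs kept exactly, as in SVID) against that of $\mathbf{W}$ itself; these readings are inferred from the surrounding proof text.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 48))

sigma_w = np.linalg.svd(w, compute_uv=False)
sigma_abs = np.linalg.svd(np.abs(w), compute_uv=False)

# Squared Frobenius error of the best rank-1 approximation:
# the sum of squared singular values except the largest.
err_plain = float(np.sum(sigma_w[1:] ** 2))    # rank-1 approximation of W
err_svid = float(np.sum(sigma_abs[1:] ** 2))   # rank-1 approximation of |W|

# W and |W| have the same Frobenius norm, so a larger leading singular
# value for |W| leaves less energy in the error term.
same_norm = np.isclose(np.sum(sigma_w ** 2), np.sum(sigma_abs ** 2))
```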

A.2 Details on Baselines

In this subsection, we provide the essential details of the baselines in this work:

GPTQ (Frantar et al., 2022): We employ the open-source code released by the author. Both OPT models and LLaMA models take 128 2048-token samples from the C4 dataset to calibrate the quantized model. For LLaMA models, we apply the activation order heuristic according to the recommendation from the code.

LLM-QAT (Liu et al., 2023): We reimplement this method to adapt it to the W2A16 setting, as LLM-QAT is not designed for 2-bit weight quantization. We also do not quantize the KV Cache. When quantizing the weight matrix in a Linear layer, we use symmetric MinMax quantization in which the zero-point is set to 0. The training hyper-parameters are the same as ours. Please refer to the training details in Section 4.1.

OmniQuant (Shao et al., 2023): We employ the open-source code released by the author. Both OPT models and LLaMA models take 128 2048-token samples from the WikiText2 dataset to calibrate the quantized model. The learning rate for learnable weight clipping and equivalent transformation is set to 5e-3 and 1e-2, respectively. We use a batch size of 1 and train 40 epochs for each model. For OPT models, both learnable weight clipping and equivalent transformation are leveraged. For LLaMA models, only learnable weight clipping is used.

A.3 Results of LLaMA2

Table 4 compares the results on LLaMA2-7B/13B. Our method has clear advantages in both perplexity and zero-shot accuracy. The results also show that the advantages of our method are more pronounced in larger models. For instance, when scaling from LLaMA2-7B to LLaMA2-13B, the perplexity of the FP16 model decreases by only around 0.5, whereas our method reduces it by around 1.0 on both Wiki2 and C4 datasets.

A.4 Instruction Following Ability

Instruction following is an important ability of LLMs (Radford et al., 2019; Brown et al., 2020; Peng et al., 2023). Beyond the discussion on model abilities and efficiency before, we also focus on the instruction following ability of extremely low-bit models, which is closely related to their practical usability. In this subsection, we empirically study this capability of our quantized model. We fine-tune the model for 3 epochs using the alpaca_en_52k dataset and alpaca templates (Taori et al., 2023), then observe the generation in both zero-shot and few-shot settings before and after fine-tuning. During training, the learning rate is set to 1e-7 and the batch size to 32. Other parameters are consistent with Section 4.1.

Table 5 demonstrates the content generation and instruction following abilities of our 7B model. Under the zero-shot setting, the model without SFT produced verbose, repetitive, and low-quality text. However, after SFT, our model is able to smoothly output high-quality content, exhibiting excellent instruction following ability. For the few-shot setting, our model exhibits instruction following ability both before and after SFT.

A.5 Comparison with BitNet

Recently, BitNet (Wang et al., 2023) introduced a 1-bit model architecture and applied it to train models from scratch, demonstrating the feasibility and application value of the 1-bit model structure. In this paper, we attempt to combine 1-bit quantization with knowledge distillation to quantize the LLaMA-7B model. Unfortunately, despite following the suggestion to use larger learning rates, the behavior remains unstable during training.

Figure 6 shows that the training process of BitNet may suffer from instability during knowledge distillation. We conjecture that this is because the gradient is very large when the weight elements fluctuate between +1 and -1, which further amplifies the fluctuations in the output of the Linear layer.

As a more effective measure, the value vectors we propose for quantization not only supplement the necessary floating-point numerical precision but also limit the fluctuation range of the matrix multiplication results after quantization. This can be understood from forward and backward computation, respectively.

Forward stability.

Quantized matrix multiplication is more prone to overflow than FP16 counterparts in response to minor perturbations of input activations. This is because the magnitude of elements in quantized matrices, particularly the value $\pm 1$, is far greater than the parameters of most FP16 matrices. By multiplying by value vectors of a magnitude similar to that of the FP16 model, the range of variation in model output activations can be restored to the level of FP16. Furthermore, we also avoid the increasingly large “drift phenomenon” of activations through Post-LayerNorm.

Backward stability.

A.6 Average Bit-width of Linear Layer

This subsection formulates the calculation of the average bit-width of Linear layers. Assuming there is a weight matrix with a shape of $4096\times 4096$ in such a layer, the number of bits in each component is $1\times 4096\times 4096$ and $16\times 1\times 4096\times 2$,

where the first is for the 1-bit quantized weight matrix and the second is for the two FP16 value vectors. Hence the overall number of bits is $16,908,288$ . Moreover, the number of parameters is $4096\times 4096+2\times 4096\times 1=16,785,408$ . Therefore, the average bit-width of this Linear layer is $16,908,288\div 16,785,408\approx 1.0073$ .
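The arithmetic above can be reproduced with a few lines of Python. The function name `average_bit_width` is illustrative, and the generalization to non-square matrices (one value vector of length `rows` and one of length `cols`) is an assumption.

```python
def average_bit_width(rows, cols, weight_bits=1, vector_bits=16):
    """Average bits per parameter for a sign matrix plus two value vectors."""
    matrix_bits = weight_bits * rows * cols        # 1-bit quantized weight matrix
    value_bits = vector_bits * (rows + cols)       # two FP16 value vectors
    total_bits = matrix_bits + value_bits
    n_params = rows * cols + rows + cols
    return total_bits, n_params, total_bits / n_params

total_bits, n_params, avg = average_bit_width(4096, 4096)
print(total_bits, n_params, round(avg, 4))
```

Because the vector cost grows linearly with the layer width while the matrix cost grows quadratically, the average bit-width moves closer to 1 as the layer gets larger.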