PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization

Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, Yue Zhang

Introduction

Large language models (LLMs) have attracted increasing attention in the field of artificial intelligence, with applications ranging from question answering and machine translation to content creation. The Alpaca project has been a pioneering effort in instruction tuning of LLaMA, setting a precedent for instruction tuning LLMs, followed by Vicuna. Subsequent research has typically adopted Alpaca’s hyperparameters as a standard for training LLMs. Given the necessity of instruction tuning for these pre-trained models to effectively understand and follow natural language instructions, optimizing their tuning hyperparameters is crucial for peak performance. Critical factors such as optimizer selection, learning rate, number of training epochs, and quality and size of training data significantly influence the model’s performance. However, a research gap remains in the area of hyperparameter optimization specifically designed for instruction tuning LLMs. To address this issue, we aim to construct an automated, reliable, and robust evaluation method, which can be integrated into any open-sourced LLM and used as the judging basis for hyperparameter optimization.

The development of such an evaluation method presents its own challenges, including ensuring evaluation reliability and privacy protection. Current methods often involve either crowd-sourcing work or API usage, which can be costly and time-consuming. Besides, these methods face challenges in terms of consistency and reproducibility, primarily due to the lack of transparency regarding language model change logs and the inherent subjectivity of human annotations. Note that API-based evaluation also carries the risk of data leaks, which can be costly to address. Although open-sourced LLMs can serve as alternative evaluators, they are not specifically designed for assessment, which makes it difficult to deploy them directly as evaluators.

On the other hand, the labels used by previous evaluation methods are simply definite answers and fail to consider the language complexity encountered in practice. The evaluation metrics of these procedures are typically accuracy and F1-score, which ignore the subjective evaluation metrics that autoregressive generative language models should pay attention to, and thus do not reflect the potential of such models to generate contextually relevant text. Appropriate subjective evaluation metrics include relative conciseness, clarity, adherence to instructions, comprehensiveness, formality, and context relevance.

To tackle these challenges, we introduce a judge language model, aiming for Reproducible and Automated Language Model Assessment (PandaLM). Tuned from LLaMA-7B, PandaLM is used to distinguish the most superior model among various candidates, each fine-tuned with different hyperparameters, and is also capable of providing the rationale behind its choice based on the reference response for the context. PandaLM surpasses the limitations of traditional evaluation methods and focuses on more subjective aspects, such as relative conciseness, clarity, comprehensiveness, formality, and adherence to instructions. Furthermore, the robustness of PandaLM is strengthened by its ability to identify and rectify problems such as logical fallacies, unnecessary repetitions, grammatical inaccuracies, and context irrelevance. By considering these diverse aspects, we leverage PandaLM’s ability to distinguish the most superior model among candidates on the validation set and then provide insights for facilitating hyperparameter optimization of instruction tuning.

In practice, we generate paired responses from a diverse set of similarly sized foundation models including LLaMA-7B, Bloom-7B, Cerebras-GPT-6.7B, OPT-7B, and Pythia-6.9B. Each of these models is fine-tuned using the same data and hyperparameters as Alpaca. The paired responses from these tuned LLMs constitute the input part of the training data for PandaLM. The most straightforward approach to generating the corresponding targets is human annotation, but this method can be costly and time-consuming. Considering that GPT-3.5 can provide reliable evaluation to some extent, we reduce costs by following self-instruct to distill data from GPT-3.5 and by applying heuristic data filtering strategies to mitigate noise.

To ensure the reliability of PandaLM, we develop a test dataset that aligns with human preference and covers a wide range of tasks and contexts. The instructions and inputs of the test data are sampled from the human evaluation dataset of self-instruct, with responses generated by different LLMs and each label independently provided by three different human evaluators. Samples with significant divergences are excluded to ensure the Inter Annotator Agreement (IAA) of each annotator remains larger than 0.85. PandaLM-7B demonstrates highly competitive performance, achieving 93.75% of GPT-3.5’s evaluation ability and 88.28% of GPT-4’s in terms of F1-score on our diverse human-annotated test dataset.

Moreover, as illustrated in Figure 1, adopting PandaLM’s selected optimal hyperparameters, covering optimizer selection, learning rate, number of training epochs, and learning rate scheduler, brings noteworthy improvements. When assessed using GPT-4 with a set of 170 instructions, a group of five open language models, tuned with optimal hyperparameters selected by PandaLM, achieves an average of 47.0 superior responses and 26.2 inferior responses, outperforming those trained using Alpaca’s hyperparameters. Note that the training data remains the same to ensure fair comparisons. Moreover, when these LLMs are evaluated by human experts, using the same set of 170 instructions, they exhibit an average of 79.8 superior responses and 25.2 inferior responses, once again surpassing the performance of models trained with Alpaca’s hyperparameters. The experimental results underline the effectiveness of PandaLM in determining optimal hyperparameters for choosing the best LLMs. In addition, when the fine-tuned LLMs are assessed using tasks from lm-eval, a unified framework for testing LLMs on a large number of traditional evaluation tasks, the results further reinforce the superiority of LLMs optimized by PandaLM.

In conclusion, our work delivers three key contributions:

We introduce PandaLM, a privacy-protected judge language model for evaluating and optimizing hyperparameters for LLMs.

We create a reliable human-annotated dataset, essential for validating PandaLM’s performance and further research.

We utilize PandaLM to optimize the hyperparameters of a series of open-sourced LLMs. Tuning models with PandaLM-selected hyperparameters yields substantial performance enhancements.

By open-sourcing PandaLM with the associated resources at https://github.com/WeOpenML/PandaLM, we hope to facilitate further research and inspire new advancements in this area.

Related Work

This section reviews the relevant literature on the topic of hyperparameter optimization and the evaluation of language models.

Hyperparameter Optimization The importance of hyperparameter optimization in machine learning, particularly in the context of fine-tuning deep learning language models such as BERT and GPT, cannot be ignored. For these models, the choice of hyperparameters like the learning rate, batch size, or the number of training epochs can significantly influence their performance. This selection process becomes even more critical when fine-tuning these models on domain-specific tasks, where the optimal set of hyperparameters can vary significantly among different domains.

Evaluation of Language Models Accurate evaluation of language models is crucial in determining optimal hyperparameters, thus improving the models’ overall performance. Conventional objective metrics like perplexity and accuracy on downstream tasks provide valuable insights, but they may not effectively guide the choice of hyperparameters to enhance LLMs because evaluating LLMs requires other subjective metrics. Advanced language models, such as GPT-4 and Bard, incorporate human evaluations as part of their testing method for LLMs, aiming to better align with human judgements. Although human-based evaluation methods offer considerable insight into a model’s performance, they are costly and labor-intensive, making them less feasible for iterative hyperparameter optimization processes.

Subjective qualitative analysis of a model’s outputs, such as its ability to handle ambiguous instructions and provide contextually appropriate responses, is increasingly being recognized as a valuable metric for evaluating models. Optimizing hyperparameters with these qualitative measures in mind could lead to models that perform more robustly in diverse real-world scenarios. Previous qualitative analysis relies either on human evaluators or on the APIs of advanced language models, which differs from our motivation.

Methodology

As shown in Figure 2, the process of instruction tuning begins with a foundation model, which is then fine-tuned using instructions. The performance of each tuned model is evaluated to determine the best output. This involves exploring numerous models, each tuned with different hyperparameters, to identify the optimal one. To facilitate this pipeline, a reliable and automated language model assessment system is essential. To address this, we introduce PandaLM, a judge LLM specifically designed to assess the performance of LLMs fine-tuned with various hyperparameters. Our goal is to accurately identify the superior model from a pool of candidates.

The training data collection aims to create a rich dataset that allows the model to evaluate different responses in a given context and generate an evaluation reason and a reference response using the same context. As demonstrated in Figure 3, each training data instance consists of an input tuple (instruction, input, response1, response2) and an output tuple (evaluation result, evaluation reason, reference response). The instructions and inputs in the input tuple are sampled from the Alpaca 52K dataset. The response pairs are produced by various instruction-tuned models: LLaMA-7B, Bloom-7B, Cerebras-GPT-6.7B, OPT-7B, and Pythia-6.9B. These models are selected due to their comparable sizes and the public availability of their model weights. Each is fine-tuned using the same instruction data and hyperparameters following Alpaca. The corresponding output tuple includes an evaluation result, a brief explanation for the evaluation, and a reference response. The evaluation result is ‘1’, ‘2’, or ‘Tie’: ‘1’ or ‘2’ indicates that response 1 or response 2 is better, and ‘Tie’ indicates that the two responses are similar in quality. As it is impractical to source millions of output tuples from human annotators, and given that GPT-3.5 is capable of evaluating LLMs to some degree, we follow self-instruct to generate output tuples using GPT-3.5. As illustrated in Figure 4, we design prompts carefully to guide the generation of training data for PandaLM. The goal is to ensure PandaLM not only prioritizes objective response correctness but also emphasizes critical subjective aspects such as relative conciseness, clarity, comprehensiveness, formality, and adherence to instructions. Besides, we encourage PandaLM to identify and rectify issues like logical fallacies, unnecessary repetitions, grammatical inaccuracies, and the absence of context relevance. A heuristic data filtering strategy is then applied to remove noisy data. Specifically, to address the observed inherent bias in GPT-3.5 regarding the order of input responses even with carefully designed prompts, samples are removed from the training dataset if their evaluation results conflict when the order of the input responses is swapped. We finally obtain a filtered dataset containing 300K samples. The training data and self-instruct prompts are open-sourced at https://github.com/WeOpenML/PandaLM.
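The order-swap filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released preprocessing code: the `JudgedPair` class, its field names, and the reading of "conflict" (the verdict on the swapped pair must be the mirror image of the original verdict, with ‘Tie’ mapping to ‘Tie’) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedPair:
    """One distilled sample. Field names are illustrative, not the released schema."""
    instruction: str
    input: str
    response1: str
    response2: str
    verdict: str           # "1", "2", or "Tie" for the order (response1, response2)
    verdict_swapped: str   # verdict for the swapped order (response2, response1)
    reason: str = ""       # evaluation reason
    reference: str = ""    # reference response


# In the swapped order the two responses trade slots, so "1" and "2" trade
# meaning while "Tie" is unchanged.
UNSWAP = {"1": "2", "2": "1", "Tie": "Tie"}


def is_order_consistent(sample: JudgedPair) -> bool:
    """True if both orders point at the same underlying response (or both tie)."""
    return sample.verdict == UNSWAP[sample.verdict_swapped]


def filter_order_bias(samples):
    """Drop samples whose verdict changes when the response order is swapped."""
    return [s for s in samples if is_order_consistent(s)]


samples = [
    JudgedPair("Summarize.", "text", "A", "B", "1", "2"),      # consistent: A wins twice
    JudgedPair("Summarize.", "text", "A", "B", "1", "1"),      # conflict: first slot wins twice
    JudgedPair("Summarize.", "text", "A", "B", "Tie", "Tie"),  # consistent tie
]
kept = filter_order_bias(samples)
print(len(kept))
```

Only the second sample is dropped here, because the judge favored whichever response appeared first.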

PandaLM-7B Training

In this subsection, we provide details about the training procedure for PandaLM. The backbone of PandaLM is a 7B parameter variant of the LLaMA model, as it exhibits strong performance on multiple complicated NLP tasks.

We train PandaLM with the DeepSpeed library, and Zero Redundancy Optimizer (ZeRO) Stage 2, on 8 NVIDIA A100-SXM4-80GB GPUs. We use the bfloat16 (BF16) computation precision option to further optimize the model’s speed and efficiency. Regarding the training hyperparameters, we apply the AdamW optimizer with a learning rate of 2e-5 and a cosine learning rate scheduler. The model is trained for 2 epochs. The training process utilizes a warmup ratio of 0.03 to avoid large gradients at the beginning of training. We use a batch size of 2 per GPU with all inputs truncated to a maximum of 1024 tokens and employ a gradient accumulation strategy with 8 steps.
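To make the batch arithmetic explicit, the stated settings can be collected in a small configuration object. This is a sketch only: the `TrainConfig` class and its field names are hypothetical and do not correspond to the DeepSpeed configuration schema or to the authors’ training script. The effective batch size and the warmup step count are derived here from the stated values, with the approximate 300K training-set size and the rounding rule taken as assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values are the ones stated in the text; field names are hypothetical.
    num_gpus: int = 8
    per_gpu_batch_size: int = 2
    gradient_accumulation_steps: int = 8
    optimizer: str = "adamw"
    learning_rate: float = 2e-5
    lr_scheduler: str = "cosine"
    num_epochs: int = 2
    warmup_ratio: float = 0.03
    max_seq_len: int = 1024
    precision: str = "bf16"
    zero_stage: int = 2

    @property
    def effective_batch_size(self) -> int:
        """Sequences contributing to one optimizer step across all GPUs."""
        return (self.num_gpus
                * self.per_gpu_batch_size
                * self.gradient_accumulation_steps)


def warmup_steps(cfg: TrainConfig, num_samples: int) -> int:
    """Warmup length in optimizer steps, assuming ceil rounding throughout."""
    steps_per_epoch = math.ceil(num_samples / cfg.effective_batch_size)
    total_steps = steps_per_epoch * cfg.num_epochs
    return math.ceil(total_steps * cfg.warmup_ratio)


cfg = TrainConfig()
print(cfg.effective_batch_size)    # 2 * 8 * 8 = 128
print(warmup_steps(cfg, 300_000))  # assumes roughly 300K filtered samples
```

Under these assumptions each optimizer step sees 128 sequences; a real training framework may round the step counts slightly differently.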

Reliability Evaluation of PandaLM-7B

To ensure the reliability of PandaLM-7B, we create a test dataset that is labeled by humans and designed to align with human preferences for responses. Each instance of this test dataset consists of one instruction and input, and two responses produced by different instruction-tuned LLMs. The paired responses are provided by LLaMA-7B, Bloom-7B, Cerebras-GPT-6.7B, OPT-7B, and Pythia-6.9B, all instruction tuned using the same instruction data and hyperparameters following Alpaca. The test data is sampled from the diverse human evaluation dataset of self-instruct, which includes data from Grammarly, Wikipedia, National Geographic, and nearly one hundred apps or websites. The inputs and labels are solely human-generated and include a range of tasks and contents. Three different human evaluators independently annotate the labels indicating the preferred response. Samples with significant divergences are excluded to ensure the Inter Annotator Agreement (IAA) of each annotator remains larger than 0.85. This is because such samples demand additional knowledge or hard-to-obtain information, making them challenging for humans to evaluate. The filtered test dataset contains 1K samples, while the original unfiltered dataset has 2.5K samples.

To maintain high-quality crowdsourcing work, we involve three experts to annotate the same data point concurrently during the annotation process. These experts receive specialized training that goes beyond evaluating response correctness, enabling them to emphasize other crucial aspects like relative conciseness, clarity, comprehensiveness, formality, and adherence to instructions. Furthermore, we guide these annotators in identifying and addressing issues such as logical fallacies, unnecessary repetitions, grammatical inaccuracies, and a lack of contextual relevance. After the trial phase of data annotation, we eliminate some low-quality labeled data. The final IAA amongst the three annotators, as measured by Cohen’s Kappa, yields average scores of 0.85, 0.86, and 0.88 respectively, indicating a relatively high level of reliability for our test dataset. The test data comprises 105 instances of ties, 422 instances where Response 1 wins, and 472 instances where Response 2 wins. Note that the human-generated dataset has no personally identifiable information or offensive content, and all annotators receive ample labor fees.
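For reference, Cohen’s Kappa between two annotators can be computed as in the sketch below. The labels are invented for illustration, and the text does not specify how the per-annotator averages were aggregated, so only the pairwise statistic is shown.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented labels over the three classes used in the test set.
annotator_1 = ["1", "1", "2", "2", "Tie", "1", "2", "2"]
annotator_2 = ["1", "1", "2", "2", "Tie", "2", "2", "2"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 4))
```

Here the annotators agree on 7 of 8 items, but kappa is lower than 0.875 because part of that agreement is expected by chance given the label frequencies.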

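Table 1: Pairwise comparison results as judged by humans, GPT-3.5, GPT-4, and PandaLM-7B. Each cell reports (win, lose, tie) counts for the row model against the column model.

| Judged By | Base Model | LLaMA-7B | Bloom-7B | Cerebras-6.7B | OPT-7B | Pythia-6.9B |
| Human | LLaMA-7B | / | (72,28,11) | (80,24,6) | (71,24,11) | (58,27,9) |
| Human | Bloom-7B | (28,72,11) | / | (59,30,11) | (43,35,11) | (47,49,11) |
| Human | Cerebras-6.7B | (24,80,6) | (30,59,11) | / | (33,49,9) | (27,53,11) |
| Human | OPT-7B | (24,71,11) | (35,43,11) | (49,33,9) | / | (32,53,15) |
| Human | Pythia-6.9B | (27,58,9) | (49,47,11) | (53,27,11) | (53,32,15) | / |
| GPT-3.5 | LLaMA-7B | / | (59,19,33) | (71,13,26) | (58,17,31) | (49,16,29) |
| GPT-3.5 | Bloom-7B | (19,59,33) | / | (40,19,41) | (36,30,23) | (33,34,40) |
| GPT-3.5 | Cerebras-6.7B | (13,71,26) | (19,40,41) | / | (24,38,29) | (22,43,26) |
| GPT-3.5 | OPT-7B | (17,58,31) | (30,36,23) | (38,24,29) | / | (30,30,40) |
| GPT-3.5 | Pythia-6.9B | (16,49,29) | (34,33,40) | (43,22,26) | (30,30,40) | / |
| GPT-4 | LLaMA-7B | / | (58,15,38) | (69,9,32) | (58,14,34) | (52,17,25) |
| GPT-4 | Bloom-7B | (15,58,38) | / | (47,16,37) | (35,31,23) | (32,33,42) |
| GPT-4 | Cerebras-6.7B | (9,69,32) | (16,47,37) | / | (23,40,28) | (17,41,33) |
| GPT-4 | OPT-7B | (14,58,34) | (31,35,23) | (40,23,28) | / | (25,37,38) |
| GPT-4 | Pythia-6.9B | (17,52,25) | (33,32,42) | (41,17,33) | (37,25,38) | / |
| PandaLM-7B | LLaMA-7B | / | (46,29,36) | (68,18,24) | (52,26,28) | (35,28,31) |
| PandaLM-7B | Bloom-7B | (29,46,36) | / | (50,18,32) | (36,30,23) | (36,31,40) |
| PandaLM-7B | Cerebras-6.7B | (18,68,24) | (18,50,32) | / | (28,39,24) | (24,46,21) |
| PandaLM-7B | OPT-7B | (26,52,28) | (30,36,23) | (39,28,24) | / | (30,32,38) |
| PandaLM-7B | Pythia-6.9B | (28,35,31) | (31,36,40) | (46,24,21) | (32,30,38) | / |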

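Table 2: Evaluation performance of GPT-3.5, GPT-4, and PandaLM-7B against human annotations.

| Judged Model | Accuracy | Precision | Recall | F1 |
| GPT-3.5 | 0.6296 | 0.6195 | 0.6359 | 0.5820 |
| GPT-4 | 0.6647 | 0.6620 | 0.6815 | 0.6180 |
| PandaLM-7B | 0.5926 | 0.5728 | 0.5923 | 0.5456 |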

After obtaining the human-labeled test dataset, we can assess and compare the evaluation performances of GPT-3.5, GPT-4, and PandaLM-7B. An interesting observation from Table 1 is that GPT-3.5, GPT-4, PandaLM-7B, and humans share a similar partial order graph. Furthermore, Figure 5 illustrates directed orders of model superiority and provides a visual representation of comparative model effectiveness: if model A outperforms model B, a directed edge from A to B is drawn; if model A and model B perform similarly, a dashed line from A to B is drawn. The experimental results indicate similarities in the preferences of GPT-3.5, GPT-4, PandaLM-7B, and humans. Note that for PandaLM, GPT-3.5, and GPT-4, we swap the input response order and infer twice to procure the final evaluation output. Conflicting evaluation results are revised to ‘Tie’.
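The swap-and-infer-twice procedure can be sketched as a thin wrapper around any judge. The `judge` callable and its signature are hypothetical stand-ins for a call to PandaLM, GPT-3.5, or GPT-4; the two toy judges exist only to exercise the wrapper.

```python
from typing import Callable

Verdict = str  # "1", "2", or "Tie"
Judge = Callable[[str, str, str, str], Verdict]

# A verdict given on the swapped order, expressed in the original order.
UNSWAP = {"1": "2", "2": "1", "Tie": "Tie"}


def judge_both_orders(judge: Judge, instruction: str, inp: str,
                      resp_a: str, resp_b: str) -> Verdict:
    """Query the judge in both response orders; conflicts become 'Tie'."""
    first = judge(instruction, inp, resp_a, resp_b)
    second = UNSWAP[judge(instruction, inp, resp_b, resp_a)]
    return first if first == second else "Tie"


def position_biased_judge(instruction, inp, r1, r2):
    """Toy judge that always prefers whichever response is shown first."""
    return "1"


def shorter_is_better_judge(instruction, inp, r1, r2):
    """Toy judge with no position bias: prefers the shorter response."""
    if len(r1) == len(r2):
        return "Tie"
    return "1" if len(r1) < len(r2) else "2"


biased = judge_both_orders(position_biased_judge, "Say hi.", "", "Hi.", "Hello there!")
stable = judge_both_orders(shorter_is_better_judge, "Say hi.", "", "Hi.", "Hello there!")
print(biased, stable)
```

A purely position-biased judge contradicts itself across the two orders and is resolved to ‘Tie’, while an order-independent judge keeps its verdict.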

As shown in Table 2, we conduct a statistical analysis comparing the accuracy, precision, recall, and F1-score of GPT-3.5, GPT-4, and PandaLM-7B against human annotations. GPT-4 demonstrates superior performance, recording the highest scores across all assessed metrics. Although PandaLM-7B has the lowest F1-score, it still demonstrates notable performance, achieving 93.75% of GPT-3.5’s evaluation ability and 88.28% of GPT-4’s in terms of F1-score. Moreover, we are committed to continuously training larger versions of PandaLM to further enhance its evaluation performance.
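The relative figures quoted above follow directly from the F1 column of Table 2, as this short check shows. The dictionary simply restates the table values.

```python
# F1-scores against human annotations, copied from Table 2.
f1 = {"GPT-3.5": 0.5820, "GPT-4": 0.6180, "PandaLM-7B": 0.5456}

# PandaLM-7B's F1 expressed as a percentage of each API-based judge's F1.
relative = {
    name: round(100 * f1["PandaLM-7B"] / score, 2)
    for name, score in f1.items()
    if name != "PandaLM-7B"
}
print(relative)
```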

In addition, beyond performance metrics, PandaLM-7B introduces unique advantages that are not present in models like GPT-3.5 and GPT-4. It offers open-source availability, enabling reproducibility, and protecting data privacy. Furthermore, it provides unlimited access, removing any restrictions that might hinder comprehensive evaluation and application.

Using PandaLM-7B to Instruction Tune LLMs

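Table 3: Models tuned with PandaLM’s selected hyperparameters compared against the same models tuned with Alpaca’s hyperparameters on 170 instructions. Each cell reports (win, lose, tie) counts for the PandaLM-tuned model.

| Judge Model | LLaMA-7B | Bloom-7B | Cerebras-6.7B | OPT-7B | Pythia-6.9B |
| GPT-3.5 | (45,26,99) | (48,24,98) | (58,21,91) | (48,34,88) | (59,20,91) |
| GPT-4 | (40,17,113) | (44,34,92) | (60,20,90) | (39,30,101) | (52,30,88) |
| Human | (82,21,67) | (79,23,68) | (88,25,57) | (68,26,76) | (82,31,57) |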

To highlight the effectiveness of using PandaLM-7B for instruction tuning LLMs, we compare the performance of models tuned with PandaLM’s selected optimal hyperparameters against those tuned with Alpaca’s hyperparameters, using GPT-3.5, GPT-4, and human experts as judges. This comparison evaluates multiple tuned LLMs: LLaMA-7B, Bloom-7B, Cerebras-GPT-6.7B, OPT-7B, and Pythia-6.9B. The assessment is conducted on a validation set comprising 170 distinct instructions and inputs obtained from our 1K test set introduced in Section 4. Alpaca’s tuning protocol involves training for three epochs, with the final iteration’s checkpoint being used. It uses the AdamW optimizer with a learning rate of 2e-5 and a cosine learning rate scheduler. We explore a wider range of hyperparameters to tune LLMs using PandaLM-7B. Specifically, we explore checkpoints from each epoch (ranging from epoch 1 to epoch 5), four different learning rates (2e-6, 1e-5, 2e-5, 2e-4), two types of optimizers (SGD and AdamW), and two learning rate schedulers (cosine and linear). In total, this creates a configuration space of 80 different possibilities per model.
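The size of the configuration space follows from the Cartesian product of the four hyperparameter axes, as the sketch below enumerates. The dictionary keys are illustrative names, not identifiers from the released code.

```python
from itertools import product

# The four axes listed in the text.
CHECKPOINT_EPOCHS = [1, 2, 3, 4, 5]
LEARNING_RATES = [2e-6, 1e-5, 2e-5, 2e-4]
OPTIMIZERS = ["sgd", "adamw"]
SCHEDULERS = ["cosine", "linear"]

# One dictionary per candidate configuration; key names are hypothetical.
grid = [
    {
        "optimizer": opt,
        "lr_scheduler": sched,
        "learning_rate": lr,
        "checkpoint_epoch": epoch,
    }
    for opt, sched, lr, epoch in product(
        OPTIMIZERS, SCHEDULERS, LEARNING_RATES, CHECKPOINT_EPOCHS
    )
]
print(len(grid))  # 2 * 2 * 4 * 5 = 80
```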

We search for optimal hyperparameters among the 80 configurations. These are divided into four blocks, each containing 20 configurations. Sequential comparisons identify the best configuration in each block. The top configurations from each block are then compared to determine the overall best configuration. We repeat each comparison twice for robustness and carry out 800 comparisons in total. Conflicting evaluation results are modified to ‘Tie’. Key insights from our tuning process include: Bloom-7B performs best with SGD, a learning rate of 2e-5, and a cosine schedule over 5 epochs. Cerebras-GPT-6.7B also favors SGD with the same learning rate but with a linear schedule. LLaMA-7B prefers AdamW, a learning rate of 1e-5, and a linear schedule over 4 epochs. OPT-7B achieves top results with AdamW, a learning rate of 2e-5, and a linear scheduler over 5 epochs. Pythia-6.9B prefers SGD, a learning rate of 1e-5, a cosine schedule, and 5 epochs. This highlights the importance of customized hyperparameter tuning for different models to achieve peak performance. We also provide analyses of data size and quality and of LoRA for instruction tuning LLMs in Appendix B and Appendix C.
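One way to read the block-wise search is as a two-level sequential tournament, sketched below. This is an interpretation under stated assumptions rather than the authors’ procedure: the `compare` callable stands in for a PandaLM comparison of two tuned models over the validation set, the incumbent is kept on a ‘Tie’, and the way configurations are assigned to blocks (here, contiguous slices) is not specified in the text.

```python
def best_in_block(block, compare):
    """Sequentially compare candidates, keeping the current champion."""
    champion = block[0]
    for challenger in block[1:]:
        # compare returns "1" (first wins), "2" (second wins), or "Tie".
        if compare(champion, challenger) == "2":
            champion = challenger  # assumption: a tie keeps the incumbent
    return champion


def block_search(configs, compare, num_blocks=4):
    """Pick a winner per block, then compare the block winners."""
    if len(configs) % num_blocks != 0:
        raise ValueError("configs must split evenly into blocks")
    size = len(configs) // num_blocks
    blocks = [configs[i:i + size] for i in range(0, len(configs), size)]
    finalists = [best_in_block(block, compare) for block in blocks]
    return best_in_block(finalists, compare)


# Toy stand-in: each "configuration" is just a quality score.
stats = {"calls": 0}


def toy_compare(a, b):
    stats["calls"] += 1
    if a == b:
        return "Tie"
    return "1" if a > b else "2"


winner = block_search(list(range(80)), toy_compare)
print(winner, stats["calls"])
```

In this sketch a single pass over 80 configurations uses 4 × 19 comparisons within blocks plus 3 among the finalists.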

As illustrated in Table 3, for GPT-3.5, GPT-4, and human, all base models achieve superior performance when tuned with PandaLM’s selected hyperparameters compared to Alpaca’s hyperparameters. Note that the procedure of switching the order of input responses, as applied for PandaLM, is also implemented for GPT-3.5 and GPT-4 to acquire more robust evaluation results. This outcome not only supports the claim that PandaLM-7B can enhance the performance of models but also highlights its potential to further improve various large language models. In addition, as shown in Appendix A, based on PandaLM’s evaluation, the model demonstrating superior performance is LLaMA-PandaLM. It leads the ranking, followed by LLaMA-Alpaca, Bloom-PandaLM, Pythia-PandaLM, OPT-PandaLM, Cerebras-PandaLM, OPT-Alpaca, Bloom-Alpaca, Pythia-Alpaca, and Cerebras-Alpaca. This order emphasizes the efficacy of PandaLM’s approach in choosing hyperparameters, resulting in better model performance. Models tuned using PandaLM’s hyperparameters tend to consistently surpass those optimized with Alpaca’s hyperparameters in a hybrid ranking scenario, reinforcing the effectiveness of PandaLM. However, the base foundation model also plays a vital role, as demonstrated by LLaMA claiming both the first and second positions in performance.
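The averages quoted in the Introduction (47.0 superior and 26.2 inferior responses under GPT-4; 79.8 and 25.2 under human evaluation) can be reproduced from Table 3. The sketch below restates the table and averages the win and lose counts over the five base models.

```python
# (win, lose, tie) counts from Table 3, in the column order
# LLaMA-7B, Bloom-7B, Cerebras-6.7B, OPT-7B, Pythia-6.9B.
table3 = {
    "GPT-3.5": [(45, 26, 99), (48, 24, 98), (58, 21, 91), (48, 34, 88), (59, 20, 91)],
    "GPT-4": [(40, 17, 113), (44, 34, 92), (60, 20, 90), (39, 30, 101), (52, 30, 88)],
    "Human": [(82, 21, 67), (79, 23, 68), (88, 25, 57), (68, 26, 76), (82, 31, 57)],
}


def average_win_lose(rows):
    """Mean number of wins and losses per base model."""
    wins = sum(w for w, _, _ in rows) / len(rows)
    losses = sum(l for _, l, _ in rows) / len(rows)
    return wins, losses


for judge, rows in table3.items():
    # Every cell should account for all 170 validation instructions.
    assert all(sum(cell) == 170 for cell in rows)
    print(judge, average_win_lose(rows))
```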

Moreover, Table 4 compares fine-tuned LLMs on various traditional tasks with lm-eval. We select classic yet challenging datasets that require strong reasoning ability or real-world knowledge, as well as popular datasets from existing LLM leaderboards. The results show that models fine-tuned with PandaLM consistently outperform those optimized with Alpaca across most tasks. Specifically, the LLaMA-PandaLM model achieves the highest scores in most tasks, demonstrating the effectiveness of PandaLM’s approach in model fine-tuning. Even in other models like Bloom, Cerebras, OPT, and Pythia, we observe a noticeable improvement in performance when PandaLM is used for optimization.

Table 4: Comparison of fine-tuned LLMs on traditional tasks with lm-eval.

| LLMs | ARC Challenge (Accuracy) | CB (Accuracy) | COQA (F1) | HellaSwag (Accuracy) | SQuAD 2.0 (F1) | WSC (Accuracy) |
| LLaMA-Alpaca | 0.4206 | 0.5179 | 0.7335 | 0.7244 | 0.2239 | 0.3654 |
| LLaMA-PandaLM | 0.4249 | 0.5357 | 0.7420 | 0.7343 | 0.1807 | 0.4327 |
| Bloom-Alpaca | 0.3549 | 0.4464 | 0.0000 | 0.5985 | 0.0832 | 0.3654 |
| Bloom-PandaLM | 0.3515 | 0.4286 | 0.0002 | 0.5997 | 0.1137 | 0.3654 |
| Cerebras-Alpaca | 0.3063 | 0.1071 | 0.5565 | 0.5493 | 0.1163 | 0.3654 |
| Cerebras-PandaLM | 0.3174 | 0.3929 | 0.5665 | 0.5528 | 0.1319 | 0.3654 |
| OPT-Alpaca | 0.3413 | 0.0893 | 0.6535 | 0.6488 | 0.1096 | 0.4135 |
| OPT-PandaLM | 0.3422 | 0.0893 | 0.6442 | 0.6503 | 0.1304 | 0.4904 |
| Pythia-Alpaca | 0.3387 | 0.3929 | 0.5859 | 0.6025 | 0.1443 | 0.3654 |
| Pythia-PandaLM | 0.3481 | 0.4464 | 0.6045 | 0.6260 | 0.1545 | 0.3654 |

Limitations

While the outcomes of our study are encouraging, we discuss several limitations here. Firstly, the selected range of hyperparameters used in this work is based on common practice and prior literature, and thus may not encompass the absolute optimal hyperparameters. Extending the search range, however, would inevitably increase the computational cost. Another limitation pertains to the size of PandaLM. Currently, we only support a 7B version. However, we are committed to continuously updating PandaLM to support larger sizes, including 13B and 65B versions, in the future.

Conclusion

In our exploration of hyperparameter optimization, we apply PandaLM-7B: an automatic and reliable judge model for the tuning of LLMs. Our findings demonstrate that the use of PandaLM-7B is feasible and consistently produces models of superior performance compared to those tuned with Alpaca’s default parameters. We are dedicated to continually enhancing PandaLM by expanding its capacity to support larger models and analyzing its intrinsic features, thereby developing increasingly robust versions of the judging model in the future.

Appendix A Ranking of Models Tuned with Alpaca’s and PandaLM’s Hyperparameters

A directed acyclic graph (DAG) is presented in Figure 6, illustrating the relative rankings of various models fine-tuned with different sets of hyperparameters. Notably, this ranking differs from those in Figure 5 due to the difference in test data: the test data for Figure 6 is a sampled subset of that used in Figure 5, deliberately chosen to ensure a high Inter-Annotator Agreement (IAA). A discernible pattern emerges from the rankings: models fine-tuned using PandaLM’s hyperparameters consistently outshine their counterparts fine-tuned with Alpaca’s. The top-rated model is PandaLM-LLaMA, followed by Alpaca-LLaMA, PandaLM-Bloom, PandaLM-Pythia, PandaLM-OPT, PandaLM-Cerebras-GPT, Alpaca-OPT, Alpaca-Bloom, Alpaca-Pythia, and Alpaca-Cerebras-GPT, in descending order of performance. This juxtaposition accentuates the effectiveness of PandaLM’s hyperparameter selection in improving model performance, as models optimized with PandaLM consistently rank higher than those using Alpaca’s hyperparameters in the hybrid ranking. These findings underscore the potential of PandaLM as a powerful tool in enhancing the performance of large language models, further supporting the assertion of its efficacy.

Appendix B Data Size and Quality Analysis in Instruction Tuning

We conduct an ablation study to investigate the impact of training data size (up to 1,344,000) on the performance of the model, given optimal hyperparameters. Importantly, a relationship exists between the size and quality of training data. Thus, we focus on an ablation study of data size here, but conducting a similar experiment on data quality is feasible. We derive the results from PandaLM-7B. The objective is to discern how much training data is required to reach each model’s peak performance. Table 5 reveals the optimal quantity of training data varies among models. More training data typically enhances model performance. However, an optimal point exists for each model, beyond which further data doesn’t improve performance. For example, the OPT model peaks at 992,000 data points, indicating additional data does not enhance the model’s performance.

Table 5: Optimal training data size for each model.

| Model | Bloom | Cerebras-GPT | LLaMA | OPT | Pythia |
| Optimal Training Data Size | 1,216,000 | 1,344,000 | 11,520,000 | 992,000 | 1,344,000 |

Appendix C LoRA Analysis in Instruction Tuning

We further aim to evaluate the efficacy of Low-Rank Adaptation (LoRA) compared to full fine-tuning across various models, utilizing optimal hyperparameters. The results are also obtained from PandaLM-7B. Our analysis seeks to provide a comparative understanding of these tuning methodologies. As shown in Table 6, the results for the Bloom model reveal a distinct advantage for full fine-tuning, which triumphs over LoRA in 66 instances as opposed to LoRA’s 35. Notably, they tie in 69 instances. In the case of the Cerebras model, full fine-tuning again proves superior, leading in 59 cases compared to LoRA’s 40, despite drawing even 71 times. The trend of full fine-tuning superiority is consistent in the LLaMA model. Out of 170 instances, full fine-tuning results in better performance in 48 instances, whereas LoRA emerges victorious in only 28 instances. The majority of the results are tied, amounting to 94 instances. In the OPT model, full fine-tuning once more showcases its advantage with 64 instances of superior performance compared to LoRA’s 33, while recording a tie in 73 instances. Lastly, for the Pythia model, full fine-tuning leads the race with 71 instances of better performance against LoRA’s 21, and a tie occurring in 78 instances. These results underscore that full fine-tuning generally yields more favorable results compared to the use of LoRA, though the outcomes can vary depending on the model. Despite the considerable number of ties, full fine-tuning holds the upper hand in most models, thereby highlighting its effectiveness. This suggests that while LoRA may provide comparable results in some instances, a strategy of full fine-tuning often proves to be the more beneficial approach in enhancing model performance.

Appendix D Data License and Maintenance Plan

The test data we create is open-sourced at https://github.com/WeOpenML/PandaLM under the Apache License 2.0. For the model weights of PandaLM, we follow the LLaMA license. We plan to collect more multilingual test data from more apps and websites and to open-source it for future research.