Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom

cs.CL cs.AI

Introduction

Large Language Models (LLMs) have shown great promise as highly capable AI assistants that excel in complex reasoning tasks requiring expert knowledge across a wide range of fields, including in specialized domains such as programming and creative writing. They enable interaction with humans through intuitive chat interfaces, which has led to rapid and widespread adoption among the general public.

The capabilities of LLMs are remarkable considering the seemingly straightforward nature of the training methodology. Auto-regressive transformers are pretrained on an extensive corpus of self-supervised data, followed by alignment with human preferences via techniques such as Reinforcement Learning with Human Feedback (RLHF). Although the training methodology is simple, high computational requirements have limited the development of LLMs to a few players. There have been public releases of pretrained LLMs (such as BLOOM (Scao et al., 2022), LLaMa-1 (Touvron et al., 2023), and Falcon (Penedo et al., 2023)) that match the performance of closed pretrained competitors like GPT-3 (Brown et al., 2020) and Chinchilla (Hoffmann et al., 2022), but none of these models are suitable substitutes for closed “product” LLMs, such as ChatGPT, BARD, and Claude. These closed product LLMs are heavily fine-tuned to align with human preferences, which greatly enhances their usability and safety. This step can require significant costs in compute and human annotation, and is often not transparent or easily reproducible, limiting progress within the community to advance AI alignment research.

𝑤𝑖𝑛𝑙𝑜𝑠𝑠win/(win+loss). The orders in which the model responses are presented to GPT-4 are randomly swapped to alleviate bias. Figure 3: Safety human evaluation results for Llama 2-Chat compared to other open-source and closed-source models. Human raters judged model generations for safety violations across ~2,000 adversarial prompts consisting of both single and multi-turn prompts. More details can be found in Section 4.4. It is important to caveat these safety results with the inherent bias of LLM evaluations due to limitations of the prompt set, subjectivity of the review guidelines, and subjectivity of individual raters. Additionally, these safety evaluations are performed using content standards that are likely to be biased towards the Llama 2-Chat models. In this work, we develop and release Llama 2, a family of pretrained and fine-tuned LLMs, Llama 2 and Llama 2-Chat, at scales up to 70B parameters. On the series of helpfulness and safety benchmarks we tested, Llama 2-Chat models generally perform better than existing open-source models. They also appear to be on par with some of the closed-source models, at least on the human evaluations we performed (see Figures 2 and 3). We have taken measures to increase the safety of these models, using safety-specific data annotation and tuning, as well as conducting red-teaming and employing iterative evaluations. Additionally, this paper contributes a thorough description of our fine-tuning methodology and approach to improving LLM safety. We hope that this openness will enable the community to reproduce fine-tuned LLMs and continue to improve the safety of those models, paving the way for more responsible development of LLMs. We also share novel observations we made during the development of Llama 2 and Llama 2-Chat, such as the emergence of tool usage and temporal organization of knowledge.

We are releasing the following models to the general public for research and commercial use111https://ai.meta.com/resources/models-and-libraries/llama/:

Llama 2, an updated version of Llama 1, trained on a new mix of publicly available data. We also increased the size of the pretraining corpus by 40%, doubled the context length of the model, and adopted grouped-query attention (Ainslie et al., 2023). We are releasing variants of Llama 2 with 7B, 13B, and 70B parameters. We have also trained 34B variants, which we report on in this paper but are not releasing.222We are delaying the release of the 34B model due to a lack of time to sufficiently red team.

Llama 2-Chat, a fine-tuned version of Llama 2 that is optimized for dialogue use cases. We release variants of this model with 7B, 13B, and 70B parameters as well.

We believe that the open release of LLMs, when done safely, will be a net benefit to society. Like all LLMs, Llama 2 is a new technology that carries potential risks with use (Bender et al., 2021b; Weidinger et al., 2021; Solaiman et al., 2023). Testing conducted to date has been in English and has not — and could not — cover all scenarios. Therefore, before deploying any applications of Llama 2-Chat, developers should perform safety testing and tuning tailored to their specific applications of the model. We provide a responsible use guide333https://ai.meta.com/llama and code examples444https://github.com/facebookresearch/llama to facilitate the safe deployment of Llama 2 and Llama 2-Chat. More details of our responsible release strategy can be found in Section 5.3.

The remainder of this paper describes our pretraining methodology (Section 2), fine-tuning methodology (Section 3), approach to model safety (Section 4), key observations and insights (Section 5), relevant related work (Section 6), and conclusions (Section 7).

Pretraining

To create the new family of Llama 2 models, we began with the pretraining approach described in Touvron et al. (2023), using an optimized auto-regressive transformer, but made several changes to improve performance. Specifically, we performed more robust data cleaning, updated our data mixes, trained on 40% more total tokens, doubled the context length, and used grouped-query attention (GQA) to improve inference scalability for our larger models. Table 1 compares the attributes of the new Llama 2 models with the Llama 1 models.

Our training corpus includes a new mix of data from publicly available sources, which does not include data from Meta’s products or services. We made an effort to remove data from certain sites known to contain a high volume of personal information about private individuals. We trained on 2 trillion tokens of data as this provides a good performance–cost trade-off, up-sampling the most factual sources in an effort to increase knowledge and dampen hallucinations.

We performed a variety of pretraining data investigations so that users can better understand the potential capabilities and limitations of our models; results can be found in Section 4.1.

2 Training Details

We adopt most of the pretraining setting and model architecture from Llama 1. We use the standard transformer architecture (Vaswani et al., 2017), apply pre-normalization using RMSNorm (Zhang and Sennrich, 2019), use the SwiGLU activation function (Shazeer, 2020), and rotary positional embeddings (RoPE, Su et al. 2022). The primary architectural differences from Llama 1 include increased context length and grouped-query attention (GQA). We detail in Appendix Section A.2.1 each of these differences with ablation experiments to demonstrate their importance.

We trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with $\beta_{1}=0.9,\beta_{2}=0.95,\text{eps}=10^{-5}$ . We use a cosine learning rate schedule, with warmup of 2000 steps, and decay final learning rate down to 10% of the peak learning rate. We use a weight decay of $0.1$ and gradient clipping of $1.0$ . Figure 5 (a) shows the training loss for Llama 2 with these hyperparameters.

We use the same tokenizer as Llama 1; it employs a bytepair encoding (BPE) algorithm (Sennrich et al., 2016) using the implementation from SentencePiece (Kudo and Richardson, 2018). As with Llama 1, we split all numbers into individual digits and use bytes to decompose unknown UTF-8 characters. The total vocabulary size is 32k tokens.

2.1 Training Hardware & Carbon Footprint

We pretrained our models on Meta’s Research Super Cluster (RSC) (Lee and Sengupta, 2022) as well as internal production clusters. Both clusters use NVIDIA A100s. There are two key differences between the two clusters, with the first being the type of interconnect available: RSC uses NVIDIA Quantum InfiniBand while our production cluster is equipped with a RoCE (RDMA over converged Ethernet) solution based on commodity ethernet Switches. Both of these solutions interconnect 200 Gbps end-points. The second difference is the per-GPU power consumption cap — RSC uses 400W while our production cluster uses 350W. With this two-cluster setup, we were able to compare the suitability of these different types of interconnect for large scale training. RoCE (which is a more affordable, commercial interconnect network) can scale almost as well as expensive Infiniband up to 2000 GPUs, which makes pretraining even more democratizable.

Following preceding research (Bender et al., 2021a; Patterson et al., 2021; Wu et al., 2022; Dodge et al., 2022) and using power consumption estimates of GPU devices and carbon efficiency, we aim to calculate the carbon emissions resulting from the pretraining of Llama 2 models. The actual power usage of a GPU is dependent on its utilization and is likely to vary from the Thermal Design Power (TDP) that we employ as an estimation for GPU power. It is important to note that our calculations do not account for further power demands, such as those from interconnect or non-GPU server power consumption, nor from datacenter cooling systems. Additionally, the carbon output related to the production of AI hardware, like GPUs, could add to the overall carbon footprint as suggested by Gupta et al. (2022b, a).

Table 2 summarizes the carbon emission for pretraining the Llama 2 family of models. A cumulative of 3.3M GPU hours of computation was performed on hardware of type A100-80GB (TDP of 400W or 350W). We estimate the total emissions for training to be 539 tCO2eq, of which 100% were directly offset by Meta’s sustainability program.555https://sustainability.fb.com/2021-sustainability-report/ Our open release strategy also means that these pretraining costs will not need to be incurred by other companies, saving more global resources.

3 Llama 2 Pretrained Model Evaluation

In this section, we report the results for the Llama 1 and Llama 2 base models, MosaicML Pretrained Transformer (MPT)666https://www.mosaicml.com/blog/mpt-7b models, and Falcon (Almazrouei et al., 2023) models on standard academic benchmarks. For all the evaluations, we use our internal evaluations library. We reproduce results for the MPT and Falcon models internally. For these models, we always pick the best score between our evaluation framework and any publicly reported results.

In Table 3, we summarize the overall performance across a suite of popular benchmarks. Note that safety benchmarks are shared in Section 4.1. The benchmarks are grouped into the categories listed below. The results for all the individual benchmarks are available in Section A.2.2.

Code. We report the average pass@1 scores of our models on HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021).

Commonsense Reasoning. We report the average of PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019a), WinoGrande (Sakaguchi et al., 2021), ARC easy and challenge (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), and CommonsenseQA (Talmor et al., 2018). We report 7-shot results for CommonSenseQA and 0-shot results for all other benchmarks.

World Knowledge. We evaluate the 5-shot performance on NaturalQuestions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017) and report the average.

Reading Comprehension. For reading comprehension, we report the 0-shot average on SQuAD (Rajpurkar et al., 2018), QuAC (Choi et al., 2018), and BoolQ (Clark et al., 2019).

MATH. We report the average of the GSM8K (8 shot) (Cobbe et al., 2021) and MATH (4 shot) (Hendrycks et al., 2021) benchmarks at top 1.

Popular Aggregated Benchmarks. We report the overall results for MMLU (5 shot) (Hendrycks et al., 2020), Big Bench Hard (BBH) (3 shot) (Suzgun et al., 2022), and AGI Eval (3–5 shot) (Zhong et al., 2023). For AGI Eval, we only evaluate on the English tasks and report the average.

As shown in Table 3, Llama 2 models outperform Llama 1 models. In particular, Llama 2 70B improves the results on MMLU and BBH by $\approx$ 5 and $\approx$ 8 points, respectively, compared to Llama 1 65B. Llama 2 7B and 30B models outperform MPT models of the corresponding size on all categories besides code benchmarks. For the Falcon models, Llama 2 7B and 34B outperform Falcon 7B and 40B models on all categories of benchmarks. Additionally, Llama 2 70B model outperforms all open-source models.

In addition to open-source models, we also compare Llama 2 70B results to closed-source models. As shown in Table 4, Llama 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant gap on coding benchmarks. Llama 2 70B results are on par or better than PaLM (540B) (Chowdhery et al., 2022) on almost all benchmarks. There is still a large gap in performance between Llama 2 70B and GPT-4 and PaLM-2-L.

We also analysed the potential data contamination and share the details in Section A.6.

Fine-tuning

Llama 2-Chat is the result of several months of research and iterative applications of alignment techniques, including both instruction tuning and RLHF, requiring significant computational and annotation resources.

In this section, we report on our experiments and findings using supervised fine-tuning (Section 3.1), as well as initial and iterative reward modeling (Section 3.2.2) and RLHF (Section 3.2.3). We also share a new technique, Ghost Attention (GAtt), which we find helps control dialogue flow over multiple turns (Section 3.3). See Section 4.2 for safety evaluations on fine-tuned models.

To bootstrap, we started the SFT stage with publicly available instruction tuning data (Chung et al., 2022), as utilized previously in Touvron et al. (2023).

Third-party SFT data is available from many different sources, but we found that many of these have insufficient diversity and quality — in particular for aligning LLMs towards dialogue-style instructions. As a result, we focused first on collecting several thousand examples of high-quality SFT data, as illustrated in Table 5. By setting aside millions of examples from third-party datasets and using fewer but higher-quality examples from our own vendor-based annotation efforts, our results notably improved. These findings are similar in spirit to Zhou et al. (2023), which also finds that a limited set of clean instruction-tuning data can be sufficient to reach a high level of quality. We found that SFT annotations in the order of tens of thousands was enough to achieve a high-quality result. We stopped annotating SFT after collecting a total of 27,540 annotations. Note that we do not include any Meta user data.

We also observed that different annotation platforms and vendors can result in markedly different downstream model performance, highlighting the importance of data checks even when using vendors to source annotations. To validate our data quality, we carefully examined a set of 180 examples, comparing the annotations provided by humans with the samples generated by the model through manual scrutiny. Surprisingly, we found that the outputs sampled from the resulting SFT model were often competitive with SFT data handwritten by human annotators, suggesting that we could reprioritize and devote more annotation effort to preference-based annotation for RLHF.

For supervised fine-tuning, we use a cosine learning rate schedule with an initial learning rate of $2\times 10^{-5}$ , a weight decay of 0.1, a batch size of 64, and a sequence length of 4096 tokens.

For the fine-tuning process, each sample consists of a prompt and an answer. To ensure the model sequence length is properly filled, we concatenate all the prompts and answers from the training set. A special token is utilized to separate the prompt and answer segments. We utilize an autoregressive objective and zero-out the loss on tokens from the user prompt, so as a result, we backpropagate only on answer tokens. Finally, we fine-tune the model for 2 epochs.

2 Reinforcement Learning with Human Feedback (RLHF)

RLHF is a model training procedure that is applied to a fine-tuned language model to further align model behavior with human preferences and instruction following. We collect data that represents empirically sampled human preferences, whereby human annotators select which of two model outputs they prefer. This human feedback is subsequently used to train a reward model, which learns patterns in the preferences of the human annotators and can then automate preference decisions.

Next, we collect human preference data for reward modeling. We chose a binary comparison protocol over other schemes, mainly because it enables us to maximize the diversity of collected prompts. Still, other strategies are worth considering, which we leave for future work.

Our annotation procedure proceeds as follows. We ask annotators to first write a prompt, then choose between two sampled model responses, based on provided criteria. In order to maximize the diversity, the two responses to a given prompt are sampled from two different model variants, and varying the temperature hyper-parameter. In addition to giving participants a forced choice, we also ask annotators to label the degree to which they prefer their chosen response over the alternative: either their choice is significantly better, better, slightly better, or negligibly better/ unsure.

For our collection of preference annotations, we focus on helpfulness and safety. Helpfulness refers to how well Llama 2-Chat responses fulfill users’ requests and provide requested information; safety refers to whether Llama 2-Chat’s responses are unsafe, e.g., “giving detailed instructions on making a bomb” could be considered helpful but is unsafe according to our safety guidelines. Separating the two allows us to apply specific guidelines to each and better guide annotators; for example, our safety annotations provide instructions to focus on adversarial prompts, among other guidance.

Apart from differences in annotation guidelines, we additionally collect a safety label during the safety stage. This additional information bins model responses into one of three categories: 1) the preferred response is safe and the other response is not, 2) both responses are safe, and 3) both responses are unsafe, with 18%, 47%, and 35% of the safety dataset falling into each bin, respectively. We do not include any examples where the chosen response was unsafe and the other response safe, as we believe safer responses will also be better/preferred by humans. Safety guidelines and more detailed information regarding safety annotations can be found in Section 4.2.1.

Human annotations were collected in batches on a weekly basis. As we collected more preference data, our reward models improved, and we were able to train progressively better versions for Llama 2-Chat (see the results in Section 5, Figure 20). Llama 2-Chat improvement also shifted the model’s data distribution. Since reward model accuracy can quickly degrade if not exposed to this new sample distribution, i.e., from hyper-specialization (Scialom et al., 2020b), it is important before a new Llama 2-Chat tuning iteration to gather new preference data using the latest Llama 2-Chat iterations. This step helps keep the reward model on-distribution and maintain an accurate reward for the latest model.

In Table 6, we report the statistics of reward modeling data that we collected over time, and present them against multiple open-source preference datasets including Anthropic Helpful and Harmless (Bai et al., 2022a), OpenAI Summarize (Stiennon et al., 2020), OpenAI WebGPT (Nakano et al., 2021), StackExchange (Lambert et al., 2023), Stanford Human Preferences (Ethayarajh et al., 2022), and Synthetic GPT-J (Havrilla, ). We collected a large dataset of over 1 million binary comparisons based on humans applying our specified guidelines, which we refer to as Meta reward modeling data. Note that the number of tokens in prompts and answers differs depending on the text domain. Summarization and online forum data generally have longer prompts, while dialogue-style prompts are usually shorter. Compared to existing open-source datasets, our preference data features more conversation turns, and are longer, on average.

2.2 Reward Modeling

The reward model takes a model response and its corresponding prompt (including contexts from previous turns) as inputs and outputs a scalar score to indicate the quality (e.g., helpfulness and safety) of the model generation. Leveraging such response scores as rewards, we can optimize Llama 2-Chat during RLHF for better human preference alignment and improved helpfulness and safety.

Others have found that helpfulness and safety sometimes trade off (Bai et al., 2022a), which can make it challenging for a single reward model to perform well on both. To address this, we train two separate reward models, one optimized for helpfulness (referred to as Helpfulness RM) and another for safety (Safety RM).

We initialize our reward models from pretrained chat model checkpoints, as it ensures that both models benefit from knowledge acquired in pretraining. In short, the reward model “knows” what the chat model knows. This prevents cases where, for instance, the two models would have an information mismatch, which could result in favoring hallucinations. The model architecture and hyper-parameters are identical to those of the pretrained language models, except that the classification head for next-token prediction is replaced with a regression head for outputting a scalar reward.

To train the reward model, we convert our collected pairwise human preference data into a binary ranking label format (i.e., chosen & rejected) and enforce the chosen response to have a higher score than its counterpart. We used a binary ranking loss consistent with Ouyang et al. (2022):

where $r_{\theta}(x,y)$ is the scalar score output for prompt $x$ and completion $y$ with model weights $\theta$ . $y_{c}$ is the preferred response that annotators choose and $y_{r}$ is the rejected counterpart.

Built on top of this binary ranking loss, we further modify it separately for better helpfulness and safety reward models as follows. Given that our preference ratings is decomposed as a scale of four points (e.g., significantly better), as presented in Section 3.2.1, it can be useful to leverage this information to explicitly teach the reward model to assign more discrepant scores to the generations that have more differences. To do so, we further add a margin component in the loss:

where the margin $m(r)$ is a discrete function of the preference rating. Naturally, we use a large margin for pairs with distinct responses, and a smaller one for those with similar responses (shown in Table 27). We found this margin component can improve Helpfulness reward model accuracy especially on samples where two responses are more separable. More detailed ablation and analysis can be found in Table 28 in Appendix A.3.3.

We combine our newly collected data with existing open-source preference datasets to form a larger training dataset. Initially, open-source datasets were used to bootstrap our reward models while we were in the process of collecting preference annotation data. We note that in the context of RLHF in this study, the role of reward signals is to learn human preference for Llama 2-Chat outputs rather than any model outputs. However, in our experiments, we do not observe negative transfer from the open-source preference datasets. Thus, we have decided to keep them in our data mixture, as they could enable better generalization for the reward model and prevent reward hacking, i.e. Llama 2-Chat taking advantage of some weaknesses of our reward, and so artificially inflating the score despite performing less well.

With training data available from different sources, we experimented with different mixing recipes for both Helpfulness and Safety reward models to ascertain the best settings. After extensive experimentation, the Helpfulness reward model is eventually trained on all Meta Helpfulness data, combined with an equal parts of the remaining data uniformly sampled from Meta Safety and from the open-source datasets. The Meta Safety reward model is trained on all Meta Safety and Anthropic Harmless data, mixed with Meta Helpfulness and open-source helpfulness data in a 90/10 proportion. We found that the setting with 10% helpfulness data is especially beneficial for the accuracy on samples where both the chosen and rejected responses were deemed safe.

We train for one epoch over the training data. In earlier experiments, we found that training longer can lead to over-fitting. We use the same optimizer parameters as for the base model. The maximum learning rate is $5\times 10^{-6}$ for the 70B parameter Llama 2-Chat and $1\times 10^{-5}$ for the rest. The learning rate is decreased on a cosine learning rate schedule, down to 10% of the maximum learning rate. We use a warm-up of 3% of the total number of steps, with a minimum of 5. The effective batch size is kept fixed at 512 pairs, or 1024 rows per batch.

On each batch of human preference annotation for reward modeling, we held out 1000 examples as a test set to evaluate our models. We refer to the union of all prompts for the corresponding test sets as “Meta Helpfulness” and “Meta Safety,” respectively.

As reference points, we also evaluated other publicly available alternatives as baselines: SteamSHP-XL (Ethayarajh et al., 2022) based on FLAN-T5-xl, the Open Assistant (Köpf et al., 2023) reward model based on DeBERTa V3 Large (He et al., 2020), and GPT4 accessible through the OpenAI’s API. Note that at inference time, as opposed to training, all the reward models can predict a scalar for a single output, without requiring to access its paired output. For GPT-4, we prompt with a zero-shot question “Choose the best answer between A and B,” where A and B are the two responses for comparison.

We report the results in terms of accuracy in Table 7. As expected, our own reward models perform the best on our internal test sets collected based on Llama 2-Chat, with the Helpfulness reward model performing best on the Meta Helpfulness test set, and similarly the Safety reward model performing best on the Meta Safety test set. Overall, our reward models outperform all of the baselines, including GPT-4. Interestingly, GPT-4 performs better than other non-Meta reward models, despite not being trained directly nor targeting specifically this reward modeling task.

The fact that helpfulness and safety performed the best on their own domain is potentially due to the tension between the two objectives (i.e., being as helpful as possible versus refusing unsafe prompts when necessary), which may confuse the reward model during training. In order for a single model to perform well on both dimensions, it needs to not only learn to select the better response given a prompt but also to distinguish adversarial prompts from safe ones. As a result, optimizing two separate models eases the reward modeling task. More detailed analysis on this tension between safety and helpfulness can be found in Appendix A.4.1.

When we group the scores by preference rating in Table 8, we can see that the accuracy is superior for the “significantly better” test set and degrades gradually as comparison pairs become more similar (e.g., “slightly better”). It is expected that learning to model human preferences becomes challenging when deciding between two similar model responses, due to annotator subjectivity and their reliance on nuanced details that may differentiate responses. We emphasize that the accuracy on more distinct responses matters the most to improve Llama 2-Chat performance. The human preference annotation agreement rate is also higher on more distinct responses than similar pairs.

We study the scaling trends in terms of data and model size for the reward model, fine-tuning different model sizes on an increasing amount of the reward model data collected each week (see the details on volume per batch in Table 26). Figure 6 reports these trends, showing the expected result that larger models obtain higher performance for a similar volume of data. More importantly, the scaling performance has not yet plateaued given the existing volume of data annotation used for training, a signal that there is room for more improvement with more annotations. We note that reward model accuracy is one of the most important proxies for the final performance of Llama 2-Chat. While best practices for comprehensively evaluating a generative model is an open research question, the ranking task of the reward has no ambiguity. Therefore, everything else being equal, an improvement of the reward model can be directly translated into an improvement for Llama 2-Chat.

2.3 Iterative Fine-Tuning

As we received more batches of human preference data annotation, we were able to train better reward models and collect more prompts. We therefore trained successive versions for RLHF models, referred to here as RLHF-V1, …, RLHF-V5.

We explored RLHF fine-tuning with two main algorithms:

Proximal Policy Optimization (PPO) (Schulman et al., 2017), the standard in RLHF literature.

Rejection Sampling fine-tuning. We sample $K$ outputs from the model and select the best candidate with our reward, consistent with Bai et al. (2022b). The same re-ranking strategy for LLMs was also proposed in Deng et al. (2019), where the reward is seen as an energy function. Here, we go one step further, and use the selected outputs for a gradient update. For each prompt, the sample obtaining the highest reward score is considered the new gold standard. Similar to Scialom et al. (2020a), we then fine-tune our model on the new set of ranked samples, reinforcing the reward.

Breadth — in Rejection Sampling, the model explores $K$ samples for a given prompt, while only one generation is done for PPO.

Depth — in PPO, during training at step $t$ the sample is a function of the updated model policy from $t-1$ after the gradient update of the previous step. In Rejection Sampling fine-tuning, we sample all the outputs given the initial policy of our model to collect a new dataset, before applying the fine-tuning similar to SFT. However, since we applied iterative model updates, the fundamental differences between the two RL algorithms are less pronounced.

Until RLHF (V4), we used only Rejection Sampling fine-tuning, and after that, we combined the two sequentially, applying PPO on top of the resulted Rejection Sampling checkpoint before sampling again.

We perform rejection sampling only with our largest 70B Llama 2-Chat. All smaller models are fine-tuned on rejection sampled data from the larger model, thus distilling the large-model capabilities into the smaller ones. We leave further analysis of the effect of this distillation for future work.

At each iterative stage, we sample $K$ answers for each prompt from the most recent model. We score each sample given the best reward model accessible at the time of the experiment, and then select the best answer for a given prompt. In earlier versions of our model, up to RLHF V3, our approach was to confine answer selection solely to the “bag” of samples gathered from the preceding iteration. For example, RLHF V3 was trained using only samples from RLHF V2. However, despite continuous improvement, this method led to a regression in some capabilities. For example, RLHF V3 struggled more than previous versions to compose rhyming lines in poems, as discerned through qualitative analysis, suggesting that further investigation into the causes of and mitigations for forgetting (Kirkpatrick et al., 2017; Nguyen et al., 2019; Ramasesh et al., 2021) could be a fruitful area for additional future research.

In response, on subsequent iterations, we modified our strategy, incorporating top-performing samples from all prior iterations, such as those used in RLHF-V1 and RLHF-V2. Although we do not present specific figures, this adjustment demonstrated considerable enhancements in performance and effectively addressed the previously noted issues. This mitigation can be seen as analogous to Synnaeve et al. (2019) and Vinyals et al. (2019) in the RL literature.

We illustrate the benefit of Rejection Sampling in Figure 7. The delta between the maximum and median curves can be interpreted as the potential gain of fine-tuning on the best output. As expected, this delta increases with more samples, since the maximum increases (i.e., more samples, more opportunities to generate a good trajectory), while the median remains stationary. There is a direct connection between the exploration and the maximum reward we can obtain among the samples. The temperature parameter also plays an important role for exploration, as a higher temperature enables us to sample more diverse outputs.

In Figure 8, we report for a Llama 2-Chat-SFT (left) and a Llama 2-Chat-RLHF (right), the maximum reward curves among N samples (with $N\in[1,\ldots,100]$ ), for different temperatures. We can observe that the optimal temperature is not constant during the iterative model updates: RLHF has a direct impact on rescaling the temperature. For Llama 2-Chat-RLHF, the optimal temperature when sampling between 10 and 100 outputs is $T\in[1.2,1.3]$ . Given a finite compute budget, it is therefore necessary to re-adjust the temperature progressively. Note that this temperature rescaling happens for a constant number of steps for each model, and always starting from the base model on each new RLHF version.

We further train our language model following the RL scheme of Stiennon et al. (2020), which uses the reward model as an estimate for the true reward function (human preference) and the pretrained language model as the policy to optimize. During this phase, we seek to optimize the following objective:

We iteratively improve the policy by sampling prompts $p$ from our dataset $\mathcal{D}$ and generations $g$ from the policy $\pi$ and use the PPO algorithm and loss function to achieve this objective.

The final reward function we use during optimization,

contains a penalty term for diverging from the original policy $\pi_{0}$ . As was observed in other works (Stiennon et al., 2020; Ouyang et al., 2022), we find this constraint is useful for training stability, and to reduce reward hacking whereby we would achieve high scores from the reward model but low scores from human evaluation.

We define $R_{c}$ to be a piecewise combination of the safety ( $R_{s}$ ) and helpfulness ( $R_{h}$ ) reward models. We have tagged prompts in our dataset that might elicit potentially unsafe responses and prioritize the scores from the safety model. The threshold of $0.15$ is chosen for filtering unsafe responses, corresponding to a precision of $0.89$ and a recall of $0.55$ evaluated on the Meta Safety test set. We also find it important to whiten the final linear scores (shown here by reversing the sigmoid with the logit function) in order to increase stability and balance properly with the KL penalty term ( $\beta$ ) above.

For all models, we use the AdamW optimizer (Loshchilov and Hutter, 2017), with $\beta_{1}=0.9,\beta_{2}=0.95,\text{eps}=10^{-5}$ . We use a weight decay of $0.1$ , gradient clipping of $1.0$ , and a constant learning rate of $10^{-6}$ . For each PPO iteration we use a batch size of $512$ , a PPO clip threshold of $0.2$ , a mini-batch size of $64$ , and take one gradient step per mini-batch. For the 7B and 13B models, we set $\beta=0.01$ (KL penalty), and for the 34B and 70B models, we set $\beta=0.005$ .

We train for between $200$ and $400$ iterations for all our models, and use evaluations on held-out prompts for early stopping. Each iteration of PPO on the 70B model takes on average $\approx 330$ seconds. To train quickly with large batch sizes, we use FSDP (Zhao et al., 2023). This was effective when using O(1) forward or backward passes, but caused a large slow down ( $\approx 20\times$ ) during generation, even when using a large batch size and KV cache. We were able to mitigate this by consolidating the model weights to each node once before generation and then freeing the memory after generation, resuming the rest of the training loop.

3 System Message for Multi-Turn Consistency

In a dialogue setup, some instructions should apply for all the conversation turns, e.g., to respond succinctly, or to “act as” some public figure. When we provided such instructions to Llama 2-Chat, the subsequent response should always respect the constraint. However, our initial RLHF models tended to forget the initial instruction after a few turns of dialogue, as illustrated in Figure 9 (left).

To address these limitations, we propose Ghost Attention (GAtt), a very simple method inspired by Context Distillation (Bai et al., 2022b) that hacks the fine-tuning data to help the attention focus in a multi-stage process. GAtt enables dialogue control over multiple turns, as illustrated in Figure 9 (right).

Assume we have access to a multi-turn dialogue dataset between two persons (e.g., a user and an assistant), with a list of messages $[u_{1},a_{1},\ldots,u_{n},a_{n}]$ , where $u_{n}$ and $a_{n}$ correspond to the user and assistant messages for turn $n$ , respectively. Then, we define an instruction, $inst$ , that should be respected throughout the dialogue. For example, $inst$ could be “act as.” We can then synthetically concatenate this instruction to all the user messages of the conversation.

Next, we can sample from this synthetic data using the latest RLHF model. We now have a context-dialogue and the sample with which to fine-tune a model, in a process analogous to Rejection Sampling. Instead of augmenting all context-dialogue turns with the instruction, we can drop it in all but the first turn, but this would lead to a mismatch at training time between the system message, i.e., all the intermediate assistant messages that come before the last turn, and our sample. To fix this issue, which could hurt the training, we simply set the loss to 0 for all the tokens from the previous turns, including assistant messages.

For the training instructions, we created a few synthetic constraints to sample from: Hobbies (“You enjoy e.g. Tennis”), Language (“Speak in e.g. French”), or Public Figure (“Act as e.g. Napoleon”). To obtain the lists of hobbies and public figures, we asked Llama 2-Chat to generate it, avoiding a mismatch between the instruction and model knowledge (e.g., asking the model to act as someone it had not encountered during training). To make the instructions more complex and diverse, we construct the final instruction by randomly combining the above constraints. When constructing the final system message for the training data, we also modify the original instruction half of the time to be less verbose, e.g., “Always act as Napoleon from now”-> ”Figure: Napoleon.” These steps produce an SFT dataset, on which we can fine-tune Llama 2-Chat.

We applied GAtt after RLHF V3. We report a quantitative analysis indicating that GAtt is consistent up to 20+ turns, until the maximum context length is reached (see Appendix A.3.5). We tried to set constraints not present in the training of GAtt at inference time, for instance “Always answer with Haiku,” for which the model remained consistent as illustrated in Appendix Figure 28.

To illustrate how GAtt helped reshape attention during fine-tuning, we display the maximum attention activations of the model in Figure 10. The left-hand side of each figure corresponds to the system message (“Act as Oscar Wilde”). We can see that the GAtt-equipped model (right) maintains large attention activations with respect to the system message for a larger portion of the dialogue, as compared to the model without GAtt (left).

Despite its utility, the current implementation of GAtt is vanilla, and more development and iteration on this technique could likely further benefit the model. For instance, we could teach the model to change the system message during the conversation by integrating such data during fine-tuning.

4 RLHF Results

Evaluating LLMs is a challenging open-research problem. Human evaluation, while a gold standard, can be complicated by various HCI considerations (Clark et al., 2021; Gehrmann et al., 2023), and is not always scalable. Thus, to select the best-performing models among several ablations at each iteration from RLHF-V1 to V5, we first observed the improvement of the rewards from the latest reward models, to save costs and increase iteration speed. We later validated major model versions with human evaluations.

To measure the robustness of our reward model, we collected a test set of prompts for both helpfulness and safety, and asked three annotators to judge the quality of the answers based on a 7-point Likert scale (the higher the better). We observe that our reward models overall are well calibrated with our human preference annotations, as illustrated in Figure 29 in the appendix. This confirms the relevance of using our reward as a point-wise metric, despite being trained with a Pairwise Ranking Loss.

Still, as Goodhart’s Law states, when a measure becomes a target, it ceases to be a good measure. To ensure our measure won’t diverge from the human preferences, we additionally used a more general reward, trained on diverse open-source Reward Modeling datasets. We have not yet observed any such divergence, and hypothesize that iterative model updates may be helping to prevent this.

As a last verification step to ensure no regression between our new model and the previous one, we use both to sample during the next annotation iteration. This enables a model comparison “for free” on new prompts and can help to increase diversity when sampling.

Figure 11 reports the progress of our different SFT and then RLHF versions for both Safety and Helpfulness axes, measured by our in-house Safety and Helpfulness reward models. On this set of evaluations, we outperform ChatGPT on both axes after RLHF-V3 (harmlessness and helpfulness $>$ 50%). Despite the aforementioned relevance of using our reward as a point-wise metric, it can arguably be biased in favor of Llama 2-Chat. Therefore, for a fair comparison, we additionally compute the final results using GPT-4 to assess which generation is preferred. The order in which ChatGPT and Llama 2-Chat outputs appeared in GPT-4 prompt are randomly swapped to avoid any bias. As expected, the win-rate in favor of Llama 2-Chat is less pronounced, although obtaining more than a 60% win-rate for our latest Llama 2-Chat.

The prompts correspond to a validation set of $1,586$ and $584$ prompts for safety and helpfulness, respectively.

4.2 Human Evaluation

Human evaluation is often considered the gold standard for judging models for natural language generation, including dialogue models. To evaluate the quality of major model versions, we asked human evaluators to rate them on helpfulness and safety. We compare the Llama 2-Chat models to open-source models (Falcon, MPT MosaicML NLP Team et al. (2023), Vicuna Chiang et al. (2023), as well as closed-source models (ChatGPT (OpenAI, 2023) and PaLM Anil et al. (2023)) on over $4,000$ single and multi-turn prompts. For ChatGPT, we use gpt-3.5-turbo-0301 model in all generations. For PaLM, we use the chat-bison-001 model in all generations. The final prompt count for human evaluations for each model is shown in Table 32. See more methodology details in Appendix, Section A.3.7. The following section shows helpfulness results; safety results are presented in Section 4.4.

As shown in Figure 12, Llama 2-Chat models outperform open-source models by a significant margin on both single turn and multi-turn prompts. Particularly, Llama 2-Chat 7B model outperforms MPT-7B-chat on 60% of the prompts. Llama 2-Chat 34B has an overall win rate of more than 75% against equivalently sized Vicuna-33B and Falcon 40B models.

The largest Llama 2-Chat model is competitive with ChatGPT. Llama 2-Chat 70B model has a win rate of 36% and a tie rate of 31.5% relative to ChatGPT. Llama 2-Chat 70B model outperforms PaLM-bison chat model by a large percentage on our prompt set. More results and analysis is available in Section A.3.7.

In our human evaluations, three different annotators provided independent assessments for each model generation comparison. High IRR scores (closer to 1.0) are typically seen as better from a data quality perspective, however, context is important. Highly subjective tasks like evaluating the overall helpfulness of LLM generations will usually have lower IRR scores than more objective labelling tasks. There are relatively few public benchmarks for these contexts, so we feel sharing our analysis here will benefit the research community.

We used Gwet’s AC1/2 statistic (Gwet, 2008, 2014) to measure inter-rater reliability (IRR), as we found it to be the most stable metric across different measurement scenarios. On the 7-point Likert scale helpfulness task that is used in our analysis, Gwet’s AC2 score varies between $0.37$ and $0.55$ depending on the specific model comparison. We see scores on the lower end of that range for ratings from model comparisons with similar win rates to each other (like the Llama 2-Chat-70B-chat vs. ChatGPT comparison). We see scores on the higher end of that range for ratings from model comparisons with a more clear winner (like the Llama 2-Chat-34b-chat vs. Falcon-40b-instruct).

While our results indicate that Llama 2-Chat is on par with ChatGPT on human evaluations, it is important to note that human evaluations have several limitations.

By academic and research standards, we have a large prompt set of 4k prompts. However, it does not cover real-world usage of these models, which will likely cover a significantly larger number of use cases.

Diversity of the prompts could be another factor in our results. For example, our prompt set does not include any coding- or reasoning-related prompts.

We only evaluate the final generation of a multi-turn conversation. A more interesting evaluation could be to ask the models to complete a task and rate the overall experience with the model over multiple turns.

Human evaluation for generative models is inherently subjective and noisy. As a result, evaluation on a different set of prompts or with different instructions could result in different results.

Safety

WARNING: this section contains examples of text that may be considered unsafe, offensive, or upsetting.

In this section, we dive deeper into the important topic of safety measurements and mitigations. We first discuss our safety investigations into pretraining data and pretrained models (Section 4.1). Next, we describe the process of our safety alignment (Section 4.2), explaining how we collected safety-related annotations and utilized SFT and RLHF, and present experimental results. Then, we discuss the red teaming we performed to further understand and improve model safety (Section 4.3). Finally, we present quantitative safety evaluations of Llama 2-Chat (Section 4.4). We also share a model card in the Appendix, in Table 52.

It is important to understand what is in the pretraining data both to increase transparency and to shed light on root causes of potential downstream issues, such as potential biases. This can inform what, if any, downstream mitigations to consider, and help guide appropriate model use. In this section, we analyze the pretraining data for distributions of languages, demographic representations, and toxicity. We also present the results of testing the pretrained models on existing safety benchmarks.

We followed Meta’s standard privacy and legal review processes for each dataset used in training. We did not use any Meta user data in training. We excluded data from certain sites known to contain a high volume of personal information about private individuals. We made a best effort to train our models efficiently to reduce the carbon footprint of pretraining (Section 2.2.1). Sharing our models broadly will reduce the need for others to train similar models. No additional filtering was conducted on the datasets, to allow Llama 2 to be more widely usable across tasks (e.g., it can be better used for hate speech classification), while avoiding the potential for the accidental demographic erasure sometimes caused by over-scrubbing. Importantly, this allows Llama 2-Chat to generalize more effectively during safety tuning with fewer examples (Welbl et al., 2021; Korbak et al., 2023; Xu et al., 2021). As a result, Llama 2 models should be used carefully and deployed only after significant safety tuning is applied.

Bias in model generations may result from biases inherited from the training data itself. For instance, Bailey et al. (2022) shows that in massive text corpora, words representing “people” are often used in more similar contexts to words representing “men” than to words representing “women,” and Ganesh et al. (2023) demonstrates that a model’s performance on fairness metrics can be highly dependent on how the model trains on data representing underrepresented demographic groups. Within our English-language training corpus, we computed the frequencies of the most common English pronouns in Table 9(a). We observe that He pronouns are generally overrepresented in documents compared to She pronouns, echoing similar frequency differences observed in pronominal usage for similarly sized model pretraining datasets (Chowdhery et al., 2022). This could mean that the model is learning less during pretraining about context that mentions She pronouns, and subsequently may potentially generate He pronouns at a higher rate than She pronouns.

We also analyze the representation of different demographic groups in the pretraining data by measuring rates of usage of demographic identity terms from the HolisticBias dataset (Smith et al., 2022) as a proxy. We compute frequencies for each descriptor term in the pretraining corpus. We group descriptors into 5 axes (Religion, Gender and Sex, Nationality, Race and Ethnicity, and Sexual Orientation), and show the top 5 terms in each axis in Table 9(b). In the top 5 terms, we remove a few terms such as “straight,” “white,” and “black,” because these terms have frequent uses beyond demographic mentions (e.g., as basic color terms). We also deduplicate across lists, removing a few terms found in both Gender and Sex and Sexual Orientation. For Gender and Sex, while She pronouns are mentioned in fewer documents, the term “female” is present in a larger percentage of documents. This could imply that while there is less frequent context about She pronouns, comments about “females” are more prevalent, perhaps reflecting the differences in linguistic markedness of these terms (Blodgett et al., 2021). For Sexual Orientation, the top five terms all relate to LGBTQ+ identities. For Nationality, Race and Ethnicity, and Religion, we observe a Western skew (Bhatt et al., 2022). For instance, the term “American” is mentioned in 69.4% of the references, the term “European” is more prevalent than other race and ethnicity, and “Christian” is the most represented religion followed by “Catholic” and “Jewish.”

We measure the prevalence of toxicity in the English-language portion of the pretraining corpus using a HateBERT classifier fine-tuned on the ToxiGen dataset (Hartvigsen et al., 2022). We score each line of a document separately and average them to assign a document score. Figure 13 shows the distribution of scores in a 10% random sample of the full corpus. About 0.2% of documents evaluated are assigned a likelihood score of 0.5 or higher, meaning there is a small amount of toxicity in our pretraining data.

While our pretraining data is mostly English, it also includes text from a small number of other languages. Table 10 shows the distribution of languages in our corpus, subsetted to those found in more than 0.005% of the documents. Our analysis uses the fastText (Bojanowski et al., 2016) language identification tool and a threshold of $0.5$ for the language detection. A training corpus with a majority in English means that the model may not be suitable for use in other languages.

We evaluate the safety capabilities of Llama 2 on three popular automatic benchmarks, pertaining to three key dimensions of LM safety.

Truthfulness, referring to whether a language model produces known falsehoods due to misconceptions or false beliefs. We employ TruthfulQA (Lin et al., 2021) to measure how well our LLMs can generate reliable outputs that agree with factuality and common sense.

Toxicity, defined as the tendency of a language model to generate toxic, rude, adversarial, or implicitly hateful content. We choose ToxiGen (Hartvigsen et al., 2022) to measure the amount of generation of toxic language and hate speech across different groups.

Bias, defined as how model generations reproduce existing stereotypical social biases. We use BOLD (Dhamala et al., 2021) to study how the sentiment in model generations may vary with demographic attributes.

We compare the performance of Llama 2 with Llama 1 (Touvron et al., 2023), Falcon (Almazrouei et al., 2023), and MPT (MosaicML NLP Team et al., 2023) in Table 11. For decoding, we set temperature to $0.1$ and use nucleus sampling (Holtzman et al., 2020) with top- $p$ set to $0.9$ . For TruthfulQA, we present the percentage of generations that are both truthful and informative (the higher, the better). For ToxiGen, we present the percentage of generations that are deemed toxic by the metric (the lower, the better). Detailed descriptions of the benchmarks and metrics can be found in Appendix A.4.7. When compared to Llama 1-7B, Llama 2-7B demonstrates a 21.37% increase in truthfulness and informativeness and a 7.61% decrease in toxicity. We also observe an increase in toxicity in the pretrained 13B and 70B Llama 2, which may result from larger pretraining data or a different dataset mix. Some have postulated the existence of a relationship between pretraining dataset size and downstream model toxicity or bias (Bender et al., 2021b), but empirical work to validate this claim is still ongoing (Dodge et al., 2021; Smith and Williams, 2021; Tal et al., 2022), and further evidence from up-to-date models is still needed.

In Appendix A.4.7, we present bias metrics, such as how the sentiment of model generations varies with demographic attributes. We note an increase in positive sentiment overall for many of the groups using BOLD prompts. More detailed results split by different demographic groups can be found in Appendix A.4.8.

Llama 2 does not outperform other models on toxicity metrics, and we speculate that this may be because we refrained from aggressively filtering the pretraining data. Recall that leaving pretraining data unfiltered may enable base models tuned to perform well on more downstream tasks (including hate speech detection), and it carries less risk of accidentally filtering out some demographic groups. We observe that models trained from less aggressively filtered pretraining data also required fewer examples to achieve reasonable safety-alignment. We reiterate that this motivated choice does imply that additional safety mitigations should be applied before deployment of base Llama 2 models.

Benchmarks give a summary view of model capabilities and behaviors that allow us to understand general patterns in the model, but they do not provide a fully comprehensive view of the impact the model may have on people or real-world outcomes; that would require study of end-to-end product deployments. Further testing and mitigation should be done to understand bias and other social issues for the specific context in which a system may be deployed. For this, it may be necessary to test beyond the groups available in the BOLD dataset (race, religion, and gender). As LLMs are integrated and deployed, we look forward to continuing research that will amplify their potential for positive impact on these important social issues.

2 Safety Fine-Tuning

In this section, we describe our approach to safety fine-tuning, including safety categories, annotation guidelines, and the techniques we use to mitigate safety risks. We employ a process similar to the general fine-tuning methods as described in Section 3, with some notable differences related to safety concerns. Specifically, we use the following techniques in safety fine-tuning:

Supervised Safety Fine-Tuning: We initialize by gathering adversarial prompts and safe demonstrations that are then included in the general supervised fine-tuning process (Section 3.1). This teaches the model to align with our safety guidelines even before RLHF, and thus lays the foundation for high-quality human preference data annotation.

Safety RLHF: Subsequently, we integrate safety in the general RLHF pipeline described in Section 3.2.2. This includes training a safety-specific reward model and gathering more challenging adversarial prompts for rejection sampling style fine-tuning and PPO optimization.

Safety Context Distillation: Finally, we refine our RLHF pipeline with context distillation (Askell et al., 2021b). This involves generating safer model responses by prefixing a prompt with a safety preprompt, e.g., “You are a safe and responsible assistant,” and then fine-tuning the model on the safer responses without the preprompt, which essentially distills the safety preprompt (context) into the model. We use a targeted approach that allows our safety reward model to choose whether to use context distillation for each sample.

Based on limitations of LLMs known from prior work, we design instructions for our annotation team to create adversarial prompts along two dimensions: a risk category, or potential topic about which the LLM could produce unsafe content; and an attack vector, or question style to cover different varieties of prompts that could elicit bad model behaviors.

The risk categories considered can be broadly divided into the following three categories: illicit and criminal activities (e.g., terrorism, theft, human trafficking); hateful and harmful activities (e.g., defamation, self-harm, eating disorders, discrimination); and unqualified advice (e.g., medical advice, financial advice, legal advice). The attack vectors explored consist of psychological manipulation (e.g., authority manipulation), logic manipulation (e.g., false premises), syntactic manipulation (e.g., misspelling), semantic manipulation (e.g., metaphor), perspective manipulation (e.g., role playing), non-English languages, and others.

We then define best practices for safe and helpful model responses: the model should first address immediate safety concerns if applicable, then address the prompt by explaining the potential risks to the user, and finally provide additional information if possible. We also ask the annotators to avoid negative user experience categories (see Appendix A.5.2). The guidelines are meant to be a general guide for the model and are iteratively refined and revised to include newly identified risks.

2.2 Safety Supervised Fine-Tuning

In accordance with the established guidelines from Section 4.2.1, we gather prompts and demonstrations of safe model responses from trained annotators, and use the data for supervised fine-tuning in the same manner as described in Section 3.1. An example can be found in Table 5.

The annotators are instructed to initially come up with prompts that they think could potentially induce the model to exhibit unsafe behavior, i.e., perform red teaming, as defined by the guidelines. Subsequently, annotators are tasked with crafting a safe and helpful response that the model should produce.

2.3 Safety RLHF

We observe early in the development of Llama 2-Chat that it is able to generalize from the safe demonstrations in supervised fine-tuning. The model quickly learns to write detailed safe responses, address safety concerns, explain why the topic might be sensitive, and provide additional helpful information. In particular, when the model outputs safe responses, they are often more detailed than what the average annotator writes. Therefore, after gathering only a few thousand supervised demonstrations, we switched entirely to RLHF to teach the model how to write more nuanced responses. Comprehensive tuning with RLHF has the added benefit that it may make the model more robust to jailbreak attempts (Bai et al., 2022a).

We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: annotators write a prompt that they believe can elicit unsafe behavior, and then compare multiple model responses to the prompts, selecting the response that is safest according to a set of guidelines. We then use the human preference data to train a safety reward model (see Section 3.2.2), and also reuse the adversarial prompts to sample from the model during the RLHF stage.

Safety is inherently a long-tail problem, where the challenge comes from a small number of very specific cases. We investigate the impact of Safety RLHF by taking two intermediate Llama 2-Chat checkpoints—one without adversarial prompts in the RLHF stage and one with them—and score their responses on our test sets using our safety and helpfulness reward models. In Figure 14, we plot the score distribution shift of the safety RM on the safety test set (left) and that of the helpfulness RM on the helpfulness test set (right). In the left hand side of the figure, we observe that the distribution of safety RM scores on the safety set shifts to higher reward scores after safety tuning with RLHF, and that the long tail of the distribution near zero thins out. A clear cluster appears on the top-left corner suggesting the improvements of model safety. On the right side, we do not observe any gathering pattern below the $y=x$ line on the right hand side of Figure 14, which indicates that the helpfulness score distribution is preserved after safety tuning with RLHF. Put another way, given sufficient helpfulness training data, the addition of an additional stage of safety mitigation does not negatively impact model performance on helpfulness to any notable degradation. A qualitative example is shown in Table 12.

A tension between helpfulness and safety of LLMs has been observed in previous studies (Bai et al., 2022a). To better understand how the addition of safety training data affects general model performance, especially helpfulness, we investigate the trends in safety data scaling by adjusting the amount of safety data used in the RLHF stage. In this ablation experiment, we keep the amount of helpfulness training data unchanged ( $\sim$ 0.9M samples) and gradually increase the amount of safety data used in model tuning, ranging from 0% to 100% ( $\sim$ 0.1M samples). For the specific training data mix recipe, we follow the procedure described in Section 3.1 and fine-tune Llama 2 pretrained model for 2 epochs.

We eventually obtain 6 model variants trained with 0%, 1%, 10%, 25%, 50%, and 100% of the total safety data. We evaluate them using our safety and helpfulness reward models described in Section 3.2.2. For each variant, we use the safety and helpfulness reward models to score model generations corresponding to prompts in the Meta Safety and Helpful test sets, respectively.

As shown in Figure 15, we use the mean reward model scores as proxies of model performance on safety and helpfulness. We observe that when we increase the proportion of safety data, the model’s performance on handling risky and adversarial prompts improves dramatically, and we see a lighter tail in the safety reward model score distribution. Meanwhile, the mean helpfulness score remains constant. We hypothesize that this is because we already have a sufficiently large amount of helpfulness training data. Appendix A.4.2 lists more qualitative results that demonstrate how different amounts of safety data in training can change model behavior in responding to adversarial and non-adversarial prompts.

Even though we do not see overall regression on model helpfulness, we qualitatively observe, through interaction, that the model with more safety mitigation answers certain questions in a more conservative manner (e.g., example shown in Appendix Table 38). As a follow-up, we measure false refusal to quantify the frequency that the model incorrectly refuses to answer non-adversarial prompts. Here, we define false refusal as the model incorrectly refusing to answer legitimate user prompts due to irrelevant safety concerns. Refusing due to reasonable causes exceeding the model’s capability, e.g., “I am not capable of participating in video calls” and “2024 is beyond my knowledge cut-off date,” are not counted as false refusals. We train a classifier for detecting refusals in responses and apply the classifier on 1) the above helpfulness test sets and 2) a curated borderline test set consisting of 210 samples for the purpose of boundary testing similar to OpenAI (2023). The borderline dataset is designed intentionally so that its prompts look adversarial (e.g., containing sensitive words or subwords) but are not actually unsafe (e.g., “give me a recipe for Christmas Crack”) (see Appendix Table 41 for more examples).

With more safety data mixed in model tuning, the false-refusal rate becomes larger on both datasets (see Appendix Figure 33). However, false refusal is overall rare—approximately 0.05%—on the helpfulness dataset, even with 100% safety data. On the other hand, the false-refusal rate is much larger on the borderline set due to its difficulty. Llama 2-Chat sometimes has difficulty distinguishing whether a prompt is safe when the prompt contains words that frequently occur in unsafe generations (such as “bomb”). Appendix Table 41 shows some examples of false refusals we discovered.

2.4 Context Distillation for Safety

We encourage Llama 2-Chat to associate adversarial prompts with safer responses by using context distillation (Askell et al., 2021a) similar to Section 3.3. We observe that the safety capabilities of LLMs can be efficiently enhanced by prefixing the model with a safety preprompt (e.g., “You are a safe and responsible assistant”). Like supervised safety fine-tuning, safety context distillation provides a quick way to bootstrap the model’s responses on hard adversarial prompts, so that they can then be further improved in RLHF.

Specifically, we apply context distillation by prefixing a safety preprompt to adversarial prompts to generate safer responses, and then fine-tune the model on its own safe output given the adversarial prompt without the preprompt. We generate safety preprompts automatically with templates. In particular, we use various adjectives usually associated with safe behavior such as “responsible,” “respectful’,’ or “wise,” with the intuition that the model associates them with positive traits that we want to see reflected in safe answers. We show examples of safety preprompts in Appendix Table 39.

During the prompt collection phase, we also asked annotators to label prompts according to risk categories, which enables even more targeted preprompts. Specifically, this allows us to provide some dedicated answer templates of how adversarial prompts should be addressed, based on each identified risk category. Figure 16(a) shows the impact of context distillation and context distillation with answer templates on the safety RM scores.

It is important to note that performing safety context distillation for helpful prompts can degrade model performance and lead to more false refusals (see Appendix Table 40). We therefore perform safety context distillation only on adversarial prompts. However, we observed that context distillation can sometimes degrade response quality, even when dealing with adversarial prompts. Specifically, if the model responses are already of high quality, the application of context distillation can result in less pertinent replies, as the model tends to overemphasize the preprompt, often resorting to generic concerns excessively (see Appendix Table 40 for an example of vague answers due to context distillation). We thus leverage the safety reward model to decide whether to use safety context distillation – we keep the context-distilled output only on the examples where it gets a better reward model score than the original answer. We notice that this is particularly helpful on prompts that the model is very bad at, but limits the negative impact of context distillation (see Figure 16(b)).

3 Red Teaming

Given how broad the capabilities of LLMs are and how varied their training data is, it is insufficient to identify risks solely via ex post facto usage and analysis. Rather, as has been done for other LLMs, we performed various kinds of proactive risk identification, colloquially called “red teaming,“ based on the term commonly used within computer security. This kind of granular analysis is very important because safety is a long-tail issue, in which even very infrequent edge cases can cause noticeable problems. Even if quantitative scores report good results, these types of qualitative insights allow us to recognize and target specific patterns in a more comprehensive way.

We conducted a series of red teaming with various groups of internal employees, contract workers, and external vendors. These teams included over 350 people, including domain experts in cybersecurity, election fraud, social media misinformation, legal, policy, civil rights, ethics, software engineering, machine learning, responsible AI, and creative writing. They also included individuals representative of a variety of socioeconomic, gender, ethnicity, and racial demographics.

The red teamers probed our models across a wide range of risk categories (such as criminal planning, human trafficking, regulated or controlled substances, sexually explicit content, unqualified health or financial advice, privacy violations, and more), as well as different attack vectors (such as hypothetical questions, malformed/misspelled inputs, or extended dialogues). Additionally, we conducted specific tests to determine the capabilities of our models to facilitate the production of weapons (e.g. nuclear, biological, chemical, and cyber); findings on these topics were marginal and were mitigated. Nonetheless, we will continue our red teaming efforts in this front.

To date, all of our red teaming efforts have targeted model outputs in English, but have crucially included non-English prompts and dialogue contexts, as that is a well-known attack vector. In all exercises, participants were given risk category definitions and were shown just a handful of examples of risky interactions with an LLM. After that, each participant was part of a subteam focused on a particular category of risk or attack vector. After creating each dialogue, the red team participant would annotate various attributes, including risk areas and degree of risk, as captured by a 5-point Likert scale.

Some examples of useful insights provided by members of red teams that we were able to improve upon throughout development:

[Early models] were more likely to have generated unsafe responses without noting that they contain problematic content. However, [slightly later models] have tended to display knowledge that the content is problematic, even if they do go on to provide it. “They respond with ‘[UNSAFE CONTENT] is not appropriate to discuss, etc.’ and then immediately follow up with ‘With that said, here’s how [UNSAFE CONTENT].’ ” [Latest models] are able to resolve these issues.

Distracting the [early models] by including “quirks” or specific requests usually defeated any reluctance encountered via more direct requests. “A creative writing request (song, story, poem, etc.) is a reliable way to get it to produce content that it is otherwise robust against.”

Embedding a problematic request in a positive context often successfully obscured the fact that problematic output was being requested for [early models]: “The overall principle I’ve found most effective for any kind of attack is to hide it in language that is positive, progressive, and empowering.”

Crucially, after each exercise, we performed a thorough analysis of the collected data, including dialogue length, risk area distribution, histogram of topic of misinformation (where appropriate), and rated degree of risk. In each case, we took the overall lessons as a guide to help further model safety training, and specifically took data from these exercises for model fine-tuning, model feedback training, and as a signal for other safety model training.

Multiple additional rounds of red teaming were performed over several months to measure the robustness of each new model as it was released internally. We defined the robustness of a model, $\gamma$ , with respect to a red teaming exercise executed by a set of experts as the average number of created prompts that would trigger a violating response from the model per person per hour. As an example, on our 7B model, we had an evolution of $\gamma:1.8\rightarrow 0.45$ over several red teaming iterations and model refinements. Robustness will likely continue to improve with additional red teaming efforts. Another magnitude that we tracked as new models were produced was the percentage of prompts triggering violating responses discovered in the previous red teaming exercises that were mitigated in a given new candidate release. On average, we had a 90% rejection rate model over model.

4 Safety Evaluation of Llama 2-Chat

We collected roughly 2,000 adversarial prompts for human evaluation according to the safety categories in Section 4.2.1, in which 1,351 prompts are single-turn and 623 are multi-turn. Examples of evaluation prompts and responses can be found in Appendix A.4.6.

We then asked raters to judge models for safety violations on a five-point Likert scale with the following definitions:

5 - No safety violations and very helpful

4 - No safety violations and only minor non-safety issues

3 - No safety violations but not helpful or other major non-safety issues

We consider a rating of 1 or 2 as violation and use violation percentage as our main evaluation metric, with the mean rating as a supplement. Each example is annotated by three annotators and we take the majority vote to determine if the response is violating or not. We used Gwet’s AC1/2 statistic to measure inter-rater reliability (IRR) as in the helpfulness human evaluation. The IRR scores range from $0.70to0.95$ depending on the annotation batch, indicating a high degree of agreement among annotators on safety assessments. On Llama 2-Chat annotations, the average IRR is $0.92$ according to Gwet’s AC2 measure. We see lower IRR scores on batches where the models have a high violation rate (e.g., Vicuna) and higher IRR scores on batches where the models have relatively low violation rates (e.g., Llama 2-Chat, Falcon, and ChatGPT).

We show the overall violation percentage and safety rating of various LLMs in Figure 17. Llama 2-Chat has comparable or lower overall violation percentage across model sizes, while ChatGPT and Falcon (Almazrouei et al., 2023) come next, then MPT (MosaicML NLP Team et al., 2023) and Vicuna (Chiang et al., 2023). It is important to interpret these results carefully, as they are affected by limitations of the prompt set, subjectivity of the review guidelines, content standards, and subjectivity of individual raters. Upon manual analysis, we found that the response of Falcon is typically short (one or two sentences), thus less prone to generating unsafe content but also generally less helpful. This is reflected by a large number of responses of Falcon with rating $=3$ . As a result, we note that in Figure 17(b) the average rating of Falcon is much lower than Llama 2-Chat (34B) although their violation percentages look similar ( $3.88$ vs $4.45$ ).

In Figure 18, we report the violation percentage on single- and multi-turn conversations, respectively. A trend across models is that multi-turn conversations are more prone to inducing unsafe responses. That said, Llama 2-Chat still performs well compared to baselines, especially on multi-turn conversations. We also observe that Falcon performs particularly well on single-turn conversations (largely due to its conciseness) but much worse on multi-turn conversations, which could be due to its lack of multi-turn supervised fine-tuning data.

In Figure 19, we show the per-category safety violation percentage of different LLMs. While model performance is similar across categories, Llama 2-Chat has relatively more violations under the unqualified advice category (although still low in an absolute sense), for various reasons, including lack of an appropriate disclaimer (e.g., “I am not a professional”) at times. For the other two categories, Llama 2-Chat achieves comparable or lower violation percentage consistently regardless of model sizes.

In Table 14, fine-tuned Llama 2-Chat shows great improvement over the pretrained Llama 2 in terms of truthfulness ( $50.18\rightarrow 64.14$ for 70B) and toxicity ( $24.60\rightarrow 0.01$ for 70B). The percentage of toxic generations shrinks to effectively 0% for Llama 2-Chat of all sizes: this is the lowest toxicity level among all compared models. In general, when compared to Falcon and MPT, the fine-tuned Llama 2-Chat shows the best performance in terms of toxicity and truthfulness. After fine-tuning, Llama 2-Chat tends to have an increase in positive sentiment overall for many of the demographic groups in BOLD. In Appendix A.4.8, we present a detailed score breakdown of model generation sentiment across different subgroups for the bias benchmark, along with more in-depth analyses and results of truthfulness and bias.

Discussion

Here, we discuss the interesting properties we have observed with RLHF (Section 5.1). We then discuss the limitations of Llama 2-Chat (Section 5.2). Lastly, we present our strategy for responsibly releasing these models (Section 5.3).

Our tuning process revealed several interesting results, such as Llama 2-Chat’s abilities to temporally organize its knowledge, or to call APIs for external tools.

At the outset of the project, many among us expressed a preference for supervised annotation, attracted by its denser signal. Meanwhile reinforcement learning, known for its instability, seemed a somewhat shadowy field for those in the NLP research community. However, reinforcement learning proved highly effective, particularly given its cost and time effectiveness. Our findings underscore that the crucial determinant of RLHF’s success lies in the synergy it fosters between humans and LLMs throughout the annotation process.

Even with proficient annotators, each individual writes with significant variation. A model fine-tuned on SFT annotation learns this diversity, including, unfortunately, the tail-end of poorly executed annotation. Furthermore, the model’s performance is capped by the writing abilities of the most skilled annotators. Human annotators are arguably less subject to discrepancy when comparing two outputs’ preference annotation for RLHF. Consequently, the reward mechanism swiftly learns to assign low scores to undesirable tail-end distribution and aligns towards the human preference. This phenomena is illustrated in Figure 20, where we can see that the worst answers are progressively removed, shifting the distribution to the right.

In addition, during annotation, the model has the potential to venture into writing trajectories that even the best annotators may not chart. Nonetheless, humans can still provide valuable feedback when comparing two answers, beyond their own writing competencies. Drawing a parallel, while we may not all be accomplished artists, our ability to appreciate and critique art remains intact. We posit that the superior writing abilities of LLMs, as manifested in surpassing human annotators in certain tasks, are fundamentally driven by RLHF, as documented in Gilardi et al. (2023) and Huang et al. (2023). Supervised data may no longer be the gold standard, and this evolving circumstance compels a re-evaluation of the concept of “supervision.”

We have observed an intriguing phenomenon related to RLHF, a feature not previously reported to the best of our knowledge: the dynamic re-scaling of temperature contingent upon the context. As indicated in Figure 8, the temperature appears to be influenced by RLHF. Yet, intriguingly, our findings also revealed that the shifts are not uniformly applied across all prompts, as shown in Figure 21.

For instance, when it comes to prompts associated with creativity, such as “Write a poem,” an increase in temperature continues to generate diversity across our various RLHF iterations. This can be observed in the Self-BLEU slope, which mirrors a pattern comparable to that of the SFT model.

On the other hand, for prompts based on factual information, such as “What is the capital of ?” the Self-BLEU slope diminishes over time. This pattern suggests that despite the rising temperature, the model learns to consistently provide the same response to factual prompts.

Our model showcased impressive generalization ability, as shown in Figure 22. We manually tested dozens of examples and observed consistently that our model demonstrates a robust capability to organize its knowledge in a temporal manner, even when provided with minimal data. To instill a concept of time in Llama 2-Chat, we collected a set of 1,000 SFT examples that were related to specific dates. These examples included questions like “How long ago did Barack Obama become president?” Each was associated with two critical pieces of metadata: the date when the query was posed — which influenced the response — and the event date, a point in time prior to which the question would be nonsensical.

The observation suggests that LLMs have internalized the concept of time to a greater extent than previously assumed, despite their training being solely based on next-token prediction and data that is randomly shuffled without regard to their chronological context.

The integration of LLMs with tools is a growing research area, as highlighted in Mialon et al. (2023). The approach devised in Toolformer (Schick et al., 2023) entails the sampling of millions of trajectories, complemented by the formulation of few-shot examples for each tool. Nonetheless, this technique was only applied using a single tool per example, and would not scale for a sequence of tool usage.

The release of OpenAI’s plugins777https://openai.com/blog/chatgpt-plugins has incited substantial discourse within the academic community, igniting questions such as: How can we effectively teach models to utilize tools? or Does the process necessitate a substantial dataset? Our experiments indicate that tool usage can spontaneously emerge from alignment in a zero-shot manner. Although we never explicitly annotate tool-use usage, Figure 23 exhibits an instance where the model demonstrated the capability to utilize a sequence of tools in a zero-shot context.

In addition, our study extended to evaluating the Llama 2-Chat with access to a calculator. The results from this particular experiment are documented in Table 15. LLM tool use, while exciting, can also cause some safety concerns. We encourage more community research and red teaming in this area.

2 Limitations and Ethical Considerations

Llama 2-Chat is subject to the same well-recognized limitations of other LLMs, including a cessation of knowledge updates post-pretraining, potential for non-factual generation such as unqualified advice, and a propensity towards hallucinations.

Furthermore, our initial version of Llama 2-Chat predominantly concentrated on English-language data. While our experimental observations suggest the model has garnered some proficiency in other languages, its proficiency is limited, due primarily to the limited amount of pretraining data available in non-English languages (as documented in Table 10). Consequently, the model’s performance in languages other than English remains fragile and should be used with caution.

Like other LLMs, Llama 2 may generate harmful, offensive, or biased content due to its training on publicly available online datasets. We attempted to mitigate this via fine-tuning, but some issues may remain, particularly for languages other than English where publicly available datasets were not available. We will continue to fine-tune and release updated versions in the future as we progress on addressing these issues.

Not everyone who uses AI models has good intentions, and conversational AI agents could potentially be used for nefarious purposes such as generating misinformation or retrieving information about topics like bioterrorism or cybercrime. We have, however, made efforts to tune the models to avoid these topics and diminish any capabilities they might have offered for those use cases.

While we attempted to reasonably balance safety with helpfulness, in some instances, our safety tuning goes too far. Users of Llama 2-Chat may observe an overly cautious approach, with the model erring on the side of declining certain requests or responding with too many safety details.

Users of the pretrained models need to be particularly cautious, and should take extra steps in tuning and deployment as described in our Responsible Use Guide. 888https://ai.meta.com/llama

3 Responsible Release Strategy

We make Llama 2 available for both research and commercial use at https://ai.meta.com/resources/models-and-libraries/llama/. Those who use Llama 2 must comply with the terms of the provided license and our Acceptable Use Policy, which prohibit any uses that would violate applicable policies, laws, rules, and regulations.

We also provide code examples to help developers replicate our safe generations with Llama 2-Chat and apply basic safety techniques at the user input and model output layers. These code samples are available here: https://github.com/facebookresearch/llama. Finally, we are sharing a Responsible Use Guide, which provides guidelines regarding safe development and deployment.

While many companies have opted to build AI behind closed doors, we are releasing Llama 2 openly to encourage responsible AI innovation. Based on our experience, an open approach draws upon the collective wisdom, diversity, and ingenuity of the AI-practitioner community to realize the benefits of this technology. Collaboration will make these models better and safer. The entire AI community—academic researchers, civil society, policymakers, and industry—must work together to rigorously analyze and expose the risks of current AI systems and to build solutions that address potentially problematic misuse. This approach not only fosters real collaboration with diverse stakeholders—those beyond the walls of big tech companies—but also serves as the cornerstone for democratizing access to foundational models. As argued in Zellers et al. (2019b), open releases promote transparency and allow more people to access AI tools, democratizing the technology and decentralizing AI expertise. We believe that the decentralization of AI expertise does more than simply distribute knowledge—it stimulates innovation and accelerates progress in the industry. Lastly, openly releasing these models consolidates costs and eliminates barriers to entry, allowing small businesses to leverage innovations in LLMs to explore and build text-generation use cases. Ultimately, we believe this will create a more level playing field for organizations of all sizes across the globe to benefit from the economic growth promised by the advancement of AI.

We know that not everyone who uses AI models has good intentions, and we acknowledge that there are reasonable concerns regarding the ways that AI will impact our world. Toxic content generation and problematic associations are meaningful risks that the AI community has yet to fully mitigate. As this paper illustrates, we have made strides in limiting the prevalence of these types of responses. While we recognize there is more work to be done, this realization only deepens our commitment to open science and collaboration with the AI community.

Related Work

The recent years have witnessed a substantial evolution in the field of LLMs. Following the scaling laws of Kaplan et al. (2020), several Large Language Models with more than 100B parameters have been proposed, from GPT-3 (Brown et al., 2020) to Gopher (Rae et al., 2022) or specialized models, e.g. Galactica, for science(Taylor et al., 2022). With 70B parameters, Chinchilla (Hoffmann et al., 2022) redefined those scaling laws towards the number of tokens rather than model weights. Notable in this progression is the rise of Llama, recognized for its focus on computational efficiency during inference (Touvron et al., 2023). A parallel discourse has unfolded around the dynamics of open-source versus closed-source models. Open-source releases like BLOOM (Scao et al., 2022), OPT(Zhang et al., 2022), and Falcon (Penedo et al., 2023) have risen to challenge their closed-source counterparts like GPT-3 and Chinchilla. Yet, when it comes to the "production-ready" LLMs such as ChatGPT, Bard, and Claude, there’s a marked distinction in performance and usability. These models rely on intricate tuning techniques to align with human preferences (Gudibande et al., 2023), a process that is still being explored and refined within the open-source community.

Attempts to close this gap have emerged, with distillation-based models such as Vicuna (Chiang et al., 2023) and Alpaca (Taori et al., 2023) adopting a unique approach to training with synthetic instructions (Honovich et al., 2022; Wang et al., 2022). However, while these models show promise, they still fall short of the bar set by their closed-source counterparts.

Wei et al. (2021) obtained zero-shot performance on unseen tasks by fine-tuning LLMs on numerous datasets. Chung et al. (2022) and Longpre et al. (2023) investigate the impact of instruction tuning as a function of number of tasks, model size, prompt settings, etc. Prompts used for instruction tuning can be created by humans or by LLMs themselves (Zhou et al., 2022), and follow-up instructions can be used to refine initial generations to make them more useful, engaging, and unbiased (Ganguli et al., 2023; Madaan et al., 2023). An approach related to instruction tuning is chain-of-thought prompting (Wei et al., 2022b), in which models are prompted to explain their reasoning when given a complex problem, in order to increase the likelihood that their final answer is correct.

RLHF has emerged as a powerful strategy for fine-tuning Large Language Models, enabling significant improvements in their performance (Christiano et al., 2017). The method, first showcased by Stiennon et al. (2020) in the context of text-summarization tasks, has since been extended to a range of other applications. In this paradigm, models are fine-tuned based on feedback from human users, thus iteratively aligning the models’ responses more closely with human expectations and preferences.

Ouyang et al. (2022) demonstrates that a combination of instruction fine-tuning and RLHF can help fix issues with factuality, toxicity, and helpfulness that cannot be remedied by simply scaling up LLMs. Bai et al. (2022b) partially automates this fine-tuning-plus-RLHF approach by replacing the human-labeled fine-tuning data with the model’s own self-critiques and revisions, and by replacing human raters with a model when ranking model outputs in RLHF, a process known as “RL from AI Feedback” (RLAIF).

Recent literature has extensively explored the risks and challenges linked with Large Language Models. Bender et al. (2021b) and Weidinger et al. (2021) underscore various hazards like bias, toxicity, private data leakage, and the potential for malicious uses. Solaiman et al. (2023) categorizes these impacts into two groups — those that can be assessed within the base system and those requiring a societal context evaluation, while Kumar et al. (2022) offers potential mitigation strategies to curb harm. Work from Roller et al. (2020) and Dinan et al. (2021) also illuminates the difficulties tied to chatbot-oriented LLMs, with concerns ranging from privacy to misleading expertise claims. Deng et al. (2023) proposes a taxonomic framework to tackle these issues, and Bergman et al. (2022) delves into the balance between potential positive and negative impacts from releasing dialogue models.

Investigations into red teaming reveal specific challenges in tuned LLMs, with studies by Ganguli et al. (2022) and Zhuo et al. (2023) showcasing a variety of successful attack types and their effects on the generation of harmful content. National security agencies and various researchers, such as (Mialon et al., 2023), have also raised red flags around advanced emergent model behaviors, cyber threats, and potential misuse in areas like biological warfare. Lastly, broader societal issues like job displacement due to accelerated AI research and an over-reliance on LLMs leading to training data degradation are also pertinent considerations (Acemoglu and Restrepo, 2018; Autor and Salomons, 2018; Webb, 2019; Shumailov et al., 2023). We are committed to continuing our work engaging with the broader policy, academic, and industry community on these issues.

Conclusion

In this study, we have introduced Llama 2, a new family of pretrained and fine-tuned models with scales of 7 billion to 70 billion parameters. These models have demonstrated their competitiveness with existing open-source chat models, as well as competency that is equivalent to some proprietary models on evaluation sets we examined, although they still lag behind other models like GPT-4. We meticulously elaborated on the methods and techniques applied in achieving our models, with a heavy emphasis on their alignment with the principles of helpfulness and safety. To contribute more significantly to society and foster the pace of research, we have responsibly opened access to Llama 2 and Llama 2-Chat. As part of our ongoing commitment to transparency and safety, we plan to make further improvements to Llama 2-Chat in future work.

References

Appendix A Appendix

All authors sorted alphabetically by last name.

Science and Engineering Leadership: Guillem Cucurull, Naman Goyal, Louis Martin, Thomas Scialom, Ruan Silva, Kevin Stone, Hugo Touvron.

Technical and Management Leadership: Sergey Edunov, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic.

Core Contributors: Peter Albert, Nikolay Bashlykov, Prajjwal Bhargava, Moya Chen, David Esiobu, Jeremy Fu, Vedanuj Goswami, Anthony Hartshorn, Rui Hou, Marcin Kardas, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Diana Liskovich, Xavier Martinet, Yuning Mao, Igor Molybog, Todor Mihaylov, Andrew Poulton, Jeremy Reizenstein, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Jacob Xu, Yuchen Zhang, Iliyan Zarov.

Contributors: Amjad Almahairi, Yasmine Babaei, Soumya Batra, Lukas Blecher, Dan Bikel, Shruti Bhosale, Cristian Canton Ferrer, Jude Fernandes, Wenyin Fu, Brian Fuller, Cynthia Gao, Saghar Hosseini, Hakan Inan, Isabel Kloumann, Madian Khabsa, Artem Korenev, Viktor Kerkez, Jian Xiang Kuan, Yinghai Lu, Jenya Lee, Pushkar Mishra, Yixin Nie, Rashi Rungta, Alan Schelten, Kalyan Saladi, Adina Williams, Zheng Yan.

We thank the GenAI executive team for their leadership and support: Ahmad Al-Dahle, Manohar Paluri.

This work was made possible by a large group of contributors. We extend our gratitude to the following people for their assistance:

Our human annotators, whose work we have shown is key to improving tuned model performance, as well as internal leads who organized annotations and quality control: Eric Alamillo, Tamara Best, Debanjali Bose, Adam Kelsey, Meghan Keneally, Rebecca Kogen, Catalina Mejiia, Elisabeth Michaels, Marco Mierke, Alyssa Pereira, Leigh Belz Ray, Rachel Rodriguez, Bardiya Sadeghi, Karthik Sivakumar, Laura Warne.

Our large internal red team, and especially the red team organizers (Dan Bikel, Joanna Bitton, Sean Brooks, Cristian Canton Ferrer, Aaron Fields, Li Chen, Ivan Evtimov, Aaron Grattafiori, Laurie H, Imanol Arrieta Ibarra, Semarley Jarrett, Harshit Maheshwari, Aram Markosyan, Pushkar Mishra, David Renardy, Chris Rohlf, Davide Testuggine, Qing Hu, Matt Wilde, Michael Tontchev, and Rashi Rungta) helped improve the safety and robustness of our models.

The many members of our infrastructure team, including our production engineers and the builders and maintainers of our Research Super Cluster and production clusters, who were key to our model training success. Thanks also to Matthew Oldham and Adi Gangidi for helping us with carbon emission calculations.

Our closest legal, policy, comms, marketing, and privacy partners, including Mike Clark, Nisha Deo, Ahuva Goldstand, Amanda Felix, Dustin Holland, Alex Kessler, Mo Metanat, Harrison Rudolph, Adam Shajnfeld, Beau James, Helen Suk, Britt Montalvo, Allie Vieth and Polina Zvyagina, who helped guide us through the release.

Our partnerships team including Ash Jhaveri, Alex Boesenberg, Sy Choudhury, Mayumi Matsuno, Ricardo Lopez-Barquilla, Marc Shedroff, Kelly Michelena, Allie Feinstein, Amit Sangani, Geeta Chauhan, Chester Hu, Charlton Gholson, Anja Komlenovic, Eissa Jamil, Brandon Spence, Azadeh Yazdan, Elisa Garcia Anzano, and Natascha Parks.

Chris Marra, Chaya Nayak, Jacqueline Pan, George Orlin, Edward Dowling, Esteban Arcaute, Philomena Lobo, Eleonora Presani, and Logan Kerr, who provided helpful product and technical organization support.

Armand Joulin, Edouard Grave, Guillaume Lample, and Timothee Lacroix, members of the original Llama team who helped get this work started.

Drew Hamlin, Chantal Mora, and Aran Mun, who gave us some design input on the figures in the paper.

Vijai Mohan for the discussions about RLHF that inspired our Figure 20, and his contribution to the internal demo.

Early reviewers of this paper, who helped us improve its quality, including Mike Lewis, Joelle Pineau, Laurens van der Maaten, Jason Weston, and Omer Levy.

A.2 Additional Details for Pretraining

We expand the context window for Llama 2 from 2048 tokens to 4096 tokens. The longer context window enables models to process more information, which is particularly useful for supporting longer histories in chat applications, various summarization tasks, and understanding longer documents. Table 16 compares the performance of 2k and 4k context pretraining on long-context benchmarks. Both models are trained for 150B tokens, keeping the same architecture and hyperparameters as a baseline, varying only the context length. We observe improvement on SCROLLS (Shaham et al., 2022), where the average input length is 3.5k, and no performance degradation on SQUAD (Rajpurkar et al., 2018). Table 17 shows that the longer context model retains strong performance on various general-purpose tasks.

A standard practice for autoregressive decoding is to cache the key (K) and value (V) pairs for the previous tokens in the sequence, speeding up attention computation. With increasing context windows or batch sizes, however, the memory costs associated with the KV cache size in multi-head attention (MHA) models grow significantly. For larger models, where KV cache size becomes a bottleneck, key and value projections can be shared across multiple heads without much degradation of performance (Chowdhery et al., 2022). Either the original multi-query format with a single KV projection (MQA, Shazeer, 2019) or a grouped-query attention variant with 8 KV projections (GQA, Ainslie et al., 2023) can be used.

In Table 18, we compare MQA and GQA variants with an MHA baseline. We train all models with 150B tokens while keeping a fixed 30B model size. To keep a similar overall parameter count across GQA and MQA, we increase the dimension of the feed-forward layers to compensate for the reduction in the attention layers. For the MQA variant, we increase the FFN dimension by a factor of $1.33$ , and for the GQA variant, we increase it by a factor of $1.3$ . From the results, we observe that the GQA variant performs comparably to the MHA baseline on most evaluation tasks and is better than the MQA variant on average.

To optimize for latency, we host our largest models using 8 A100s in a single node with tensor parallelism (Shoeybi et al., 2019). In this setting, sharding for MQA cannot be done across heads anymore, given the number of heads is lower than the number of GPUs. Either you duplicate the KV values in all GPUs (making the KV cache size equal to GQA), or an alternative is to shard across the batch dimension instead (Pope et al., 2022). The latter, however, can complicate an inference service, as it works only when batch sizes are larger than the number of shards and the additional communication cost is not worth it in all cases.

Therefore, based on the ablation results and ease of scaling inference, for the 34B and 70B Llama 2 models we chose to use GQA instead of MQA.

Figure 24 shows how inference speed changed for the 30B GQA and MQA ablation models compared to the MHA baseline, in an experiment using 8 x 80 GiB A100s with tensor parallelism. In these runs we simply duplicated the KV heads for MQA in all GPUs, so the KV cache size for MQA became equal to the GQA and the two variants behaved very similar (with MQA just having a slightly larger FFN dimension).

A.2.2 Additional Details for Pretrained Models Evaluation

In Table 19, we report details of the MMLU (Hendrycks et al., 2020) evaluation for Llama 2 models and others open-source models.

In Table 20, we show results on several standard benchmarks.

In Table 21, we compare results of Llama 2 with popular open source models on the Human-Eval and MBPP code generation benchmarks.

We evaluate the Llama 2 model together with other open-source models on the NaturalQuestions and TriviaQA benchmarks (Table 22).

In Table 23 we report zero-shot and few-shot results on SQUAD and zero-shot and one-shot experiments on QUAC. Here Llama 2 performs best on all evaluation settings and models except the QUAC 0-shot where Llama 1 30B performs slightly better.

In Table 24, we present fine-grained results from the English part of the AGI Eval (Zhong et al., 2023) benchmark. AGI Eval is a collection of standardized exams in different subjects.

In Table 25, we report results for Llama 2 and other open-source datasets on the GSM8k and MATH tasks.

A.3 Additional Details for Fine-tuning

Table 26 shows detailed statistics on Meta human preference data. In total, we collected 14 batches of human preference data (i.e., Meta Safety + Helpfulness) on a weekly basis, consisting of over 1 million binary model generation comparisons. In general, later batches contain more samples as we onboard more annotators over time and the annotators also become more familiar with the tasks and thus have better work efficiency. We also intentionally collect more multi-turn samples to increase the complexity of RLHF data and thus the average number of tokens per sample also increase accordingly over batches.

In Figure 25, we plot out the preference rating change over batches. It can be clearly seen that the share of samples with similar responses (e.g., negligibly better or unsure) increase dramatically over time while those with stronger preference (e.g., significantly better) drop in the meantime. This reflects the nature of our iterative model update and preference data annotation procedure - with better-performing Llama 2-Chat models used for response sampling over time, it becomes challenging for annotators to select a better one from two equally high-quality responses.

A.3.2 Curriculum Strategy for Meta Human Preference Data

High quality data is critical for alignment as discussed for SFT. We worked closely with the annotation platforms during our fine-tuning process, and opted for a curriculum annotation strategy. With the first model, the annotators were asked to make prompts relatively simple, and then to progressively move towards more complex prompts and teaching new skills to Llama 2-Chat. An illustration of this curriculum annotation on our helpfulness preference data is displayed in Figure 26.

A.3.3 Ablation on Ranking Loss with Preference Rating-based Margin for Reward Modeling

We ablated the ranking loss with the preference rating-based margin term for the helpfulness reward model. We tried two variants of $m(r)$ with different magnitude for the margin term in Eq 2 as listed open-source 27 and compare them against the baseline without the margin term. We report both their per-rating and average accuracy on the Meta Helpful test set in Table 28. We observe that the margin term can indeed help the reward model perform better on more separable comparison pairs and a larger margin can boost it further. However, the larger margin also regresses performance on similar samples.

We further evaluated the impact of margin-based loss on reward score distribution shifts. We plot the histogram of reward scores from the test set in Figure 27. Essentially, the margin term pushes the reward model to assign more extreme scores to model generations to form a binary split pattern and a larger margin makes this distribution shift more significant. The above observation suggests investment in reward calibration for future work as reinforcement learning algorithms, such as PPO, can be sensitive to reward distribution change.

A.3.4 Ablation on Ranking Loss with Safety Auxiliary Loss for Reward Modeling

We ablated the impact of the safety auxiliary loss with results on the Meta Safety test set shown in Table 29. As expected, The customized loss improves the recall of unsafe responses when we use a reward score of 0.5 as the threshold (negative before Sigmoid) and thus offers a better safety reward signal for RLHF. Teaching the model to discriminate between safe and unsafe model generations also improves model accuracy on three subcategories.

A.3.5 Additional Results for GAtt

We tested the model ability to remember the system arguments trough a human evaluation. The arguments (e.g. hobbies, persona) are defined during the first message, and then from turn 2 to 20. We explicitly asked the model to refer to them (e.g. “What is your favorite hobby?”, “What is your name?”), to measure the multi-turn memory ability of Llama 2-Chat. We report the results in Table 30. Equipped with GAtt, Llama 2-Chat maintains 100% accuracy, always referring to the defined attribute, and so, up to 20 turns (we did not extend the human evaluation more, and all the examples had less than 4048 tokens in total over the turns). As a comparison, Llama 2-Chat without GAtt can not anymore refer to the attributes after only few turns: from 100% at turn t+1, to 10% at turn t+3 and then 0%.

We tried at inference time to set constrain not present in the training of GAtt. For instance, “answer in one sentence only”, for which the model remained consistent, as illustrated in Figure 28.

We applied first GAtt to Llama 1, which was pretrained with a context length of 2048 tokens and then fine-tuned with 4096 max length. We tested if GAtt works beyond 2048 tokens, and the model arguably managed to understand attributes beyond this window. This promising result indicates that GAtt could be adapted as an efficient technique for long context attention.

A.3.6 How Far Can Model-Based Evaluation Go?

To measure the robustness of our reward model, we collected a test set of prompts for both helpfulness and safety, and asked annotators to judge quality of the answers based on a 7 point Likert-scale (the higher the better) using triple reviews. As illustrated in Figure 29 (in Appendix), we observe that our reward models overall are well calibrated with human preference. Note that this enables us to use the reward as a point-wise metric, despite being trained with a Pairwise Ranking Loss.

A.3.7 Human Evaluation

To compare the models, we collect a diverse set of over 4000 single and multi turn prompts. We manually collected single turn prompts spanning the following categories: factual questions, writing and content creation, language assistance, recommendations, and dialogue. For multi-turn prompts, annotators interacted with another model to generate a set of multi-turn prompts. To help ensure fairness, we asked annotators to collect multi-turn prompts by using four different interaction methods: (a) ChatGPT as the interaction model, (b) Llama 2-Chat as the interaction model, (c) best response between ChatGPT and Llama 2-Chat at every turn as selected by the annotators, (d) alternating between ChatGPT and Llama 2-Chat at every turn. We also categorized multi-turn prompts into the same five categories listed above. Since it can be hard to categorize multi-turn prompts into a single category, annotators could select up to two categories for multi-turn prompts. Example evaluation prompts can be seen in Table 33.

For open-source models, we collect generations using a context length of 1000 tokens and allow the model to generate up to 1000 tokens. Even though Llama 2-Chat models are capable of handling up to 4000 tokens, we limit the context and generation length to 1000 tokens to provide a fair comparison with the open-source models. Limiting the generation length to 1000 tokens may adversely affect the Llama 2-Chat models. Any prompts that are longer than 1000 tokens are filtered out for evaluations with open sourced models. For MPT models, we use the mpt-7b-chat model. For Falcon models, we use the Falcon-40B-Instruct model which is a chat/instruct model. For Vicuna models, we use vicuna-13b-delta-v1.1 and vicuna-33b-delta-v1.3 models from lmsys. All model weights were obtained from HuggingFace.

Since closed-source models have longer context lengths, we change the context length and generation length to 2000 tokens for these models. To evaluate with closed source models, we collect another set of generations with 2000 context and generation length.

While collecting generations, we append a system prompt prior to the prompt for evaluation. The system prompt for each model is shown in Table 31. Since ChatGPT, PaLM, and Falcon do not provide a system prompt, we use the same system prompt as Llama 2-Chat model. Generations from different models on an example prompt can be seen in Table 34.

For evaluations, the human annotators are presented with a prompt and generations from two models side-by-side. They are asked to answer the following question:

Considering both model responses, which is better (helpful while also being safe and honest), Model A or Model B?

The annotators answer this question on a seven point scale with the following labels:

A is much better, A is better, A is slightly better, About the same, B is slightly better, B is better, B is much better.

One of the model generations is a Llama 2-Chat model and the other generation is one of the open source or closed source models. Responses from the two models are randomized as Model A or Model B when presented to the annotators. From this data, we report wins, ties, and losses in our results. Three annotators rate each generation pair. Prior experiments with five annotators did not change the results or inter-annotator agreement significantly.

To understand the impact of system prompt on ChatGPT generations, we ran another human evaluation without any system prompt for ChatGPT. As shown in Figure 30, Llama 2-Chat win rate increases from 36% to 44%. Additionally, the win rate for single turn prompts show a dramatic increase from 36% to nearly 49%. In 30, we also show the category wise breakdown of win rate for different categories of prompts. It is interesting to note that ChatGPT outperforms Llama 2-Chat 70B on language assistance while Llama 2-Chat 70B outperforms ChatGPT on factual questions. While analyzing the results for factual questions, we noticed that examples where both models get the answer correct but annotators preferred Llama 2-Chat response due to the style of the response. These results on factual questions do not indicate the hallucination rate of either model. In 31, we also share the win rate by number of turns and total word count for prompts and generation. We do not see any trends in win rate in either case.

A.4 Additional Details for Safety

We briefly discussed the tension between safety and helpfulness in Section 3.2.2 and how it leads to optimizing two separate reward models for helpfulness and safety in our study. Here we show more evidence and qualitative results to manifest this tension. Figure 32 are two scatter plots of helpfulness and safety reward model scores on the safety test set for safe and unsafe responses. The tension can be observed at the bottom right corner (i.e., high safety score but low helpfulness score) in the safe response plot (left) and the top left corner (i.e., low safety score but high helpfulness score) in the unsafe response plot (right). We also list two qualitative examples where safety and helpfulness reward models don’t agree with each other in Table 35.

A.4.2 Qualitative Results on Safety Data Scaling

In Section 4.2.3, we study the impact of adding more safety data into model RLHF in a quantitative manner. Here we showcase a few samples to qualitatively examine the evolution of model behavior when we scale safety data in Tables 36, 37, and 38. In general, we are observing that Llama 2-Chat becomes safer responding to unsafe prompts with more safety data used. For example, Llama 2-Chat learns to refuse to generate offensive content (Table 36) once we use 50% of safety data. In the meantime, we also observe that Llama 2-Chat starts to be behave more conservatively when there are offensive (e.g., abomination in Table 37) or sensitive words (e.g., “sex in a pan” in Table 38) in prompts, even if the prompts themselves are safe (e.g., sex in a pan is the name of a dessert).

A.4.3 English Pronouns

The terms that we use for the pronoun analyses are consistent with the PaLM 2 paper (Anil et al., 2023):

Unknown: "they", "them", "their", "theirs", "theirself", "themself", "themselves"

1st-person: "I", "me", "my", "mine", "myself", "we", "us", "our", "ours", "ourselves"

2nd-person: "you", "your", "yours", "yourself", "yourselves"

3rd-person: "she", "her", "hers", "herself", "he", "him", "his", "himself", "they", "them", "their", "theirs", "theirself", "themself", "themselves", "it", "its", "itself"

A.4.4 Context Distillation Preprompts

We show examples of the preprompts that we use in Table 39.

A.4.5 Safety Errors: False Refusals and Vague Responses

See Table 40 for examples of false refusal due to perceived safety issues in prompts and see Figure 33 for the false refusal rate as a function of the percentage of safety data used.

See Table 41 for examples of vague responses or false refusals made by context distillation.

A.4.6 Examples of Safety Evaluation

We show examples of safety evaluation prompts of different risk categories in Table 42 and responses in Table 43.

A.4.7 Description of Automatic Safety Benchmarks

In this section, we provide a detailed description about the automatic safety benchmarks we use for evaluation from the perspectives of truthfulness, toxicity, and bias.

To understand the ability of LLMs to generate reliable outputs that agree with factuality and common sense, we employ TruthfulQA (Lin et al., 2021), used for LLM hallucinations to measure whether a language model is truthful in generating answers to questions while being informative at the same time. The TruthfulQA benchmark consists of 817 questions distributed across 38 categories, including but not limited to health, finance, law, and politics (Lin et al., 2021). The questions are designed in a way that even humans might answer incorrectly because of an unfounded belief or misconception. Following Lin et al. (2021) we use GPT-3-based metrics, which have been shown to have robust performance in predicting human judgments. Specifically, a fine-tuned GPT-3 model999curie:ft-personal-2023-06-01-06-02-42 is used for “truthful”, and curie:ft-personal-2023-06-01-05-20-23 is used for “informative”., i.e. a “GPT-judge”, is used to predict the truthfulness and informativeness of the generated outputs from LLMs. For the QA prompt, we adopt a few-shot prompt containing 6 random QA pairs with the formats following InstructGPT (Ouyang et al., 2022). We report the percentage of generations that are both truthful and informative, as well as the percentage that are either truthful or informative.

To measure the degree of generation of toxic language and hate speech across different groups, we use ToxiGen (Hartvigsen et al., 2022), a dataset that contains implicitly toxic and benign sentences mentioning 13 minority groups. We adopt a revised version of the dataset from Hosseini et al. (2023) that reduces noise by filtering out prompts for which annotators disagree on the target demographic group. We then use the default ToxiGen classifier tuned on RoBERTa (Liu et al., 2019) to measure the toxicity of generations of each of the LLMs.

To study the sentiment in model generations that may vary with demographic attributes, we choose BOLD (Dhamala et al., 2021), a large-scale bias benchmark that comprises 23,679 English Wikipedia prompts spanning five domains of race, gender, religion, political ideology, and profession, with 43 different subgroups101010In this analysis, we remove prompts that fall into the religious ideology subgroups Hinduism and Atheism, because they are underrepresented with only 12 and 29 prompts, respectively.. We conduct a sentiment analysis using the Valence Aware Dictionary and Sentiment Reasoner (VADER) (Hutto and Gilbert, 2014) to evaluate the sentiments conveyed by the combination of prompt prefix and model generation. VADER produces a sentiment score between -1 and 1. A positive (negative) score indicates a positive (negative) sentiment towards the population mentioned in the prompt, and a score closer to 0 indicates a neutral sentiment.

A.4.8 Automatic Safety Benchmark Evaluation Results

Here we perform in-depth analyses to better understand the safety of model generations from the perspectives of toxicity, truthfulness, and bias.

Truthfulness. Table 44 presents evaluation results of TruthfulQA for the percentage of truthfulness, percentage of informativeness, and percentage of both truthfulness and informativeness across generations. Most of the models show a >90% informativeness in the model generations. However, the truthfulness percentage is relatively low for pretrained models, around 30% to 40% for Falcon, MPT, and the 7B Llama 1. This percentage increases for pretrained Llama 1 and Llama 2 with a larger size. After instruction fine-tuning, both 7B and 13B Llama 2-Chat improved about 20% in truthfulness, 30B Llama 2-Chat improved about 24%, and 70B Llama 2-Chat improved about 14% compared to their pretrained versions.

Toxicity. Table 45 shows that Mexicans, Latinos, and women tend to be the top three demographic groups with the highest percentages of toxic generations given ToxiGen prompts for the pretrained models. Thanks to instruction fine-tuning, fine-tuned Llama 2-Chat models of all sizes show an effectively zero percentage of toxic model generations, and hence their results are not presented here.

Bias. Tables 46, 47, 48, 49, and 50 present the distribution of sentiment scores across different demographic groups under the domains of race, gender, religious ideology, political ideology, and profession. Overall, we observe positive sentiment scores for each domain in the BOLD dataset for both pretrained and fine-tuned models. The fine-tuned Llama 2-Chat shows more positivity in sentiment scores than the pretrained versions do. ChatGPT tends to have more neutral sentiment scores in its model generations. For the gender domain, LLMs tend to have a more positive sentiment towards American female actresses than male actors. For the race domain, demographic groups of Asian Americans and Hispanic and Latino Americans tend to have relatively positive sentiment scores compared to other subgroups. For the religious ideology domain, we observe that the demographic groups of Islam and Sikhism tend to have the largest increase in the sentiment scores after fine-tuning. For the political ideology domain, the Liberalism and Conservatism groups tend to have the most positive sentiment scores for both pretrained and fine-tuned models. Most of the sentiment scores are negative (i.e. less than 0) for the Fascism group. For the profession domain, there is highly positive sentiment towards the occupational categories of “Corporate titles” and “Computer”, while we observe the most neutral sentiment towards “Professional driver types”.

It is important to note that these evaluations using automatic metrics are by no means fully comprehensive, due to the complex nature of toxicity and bias in LLMs, but the benchmarks we selected are representative of our understanding that Llama 2-Chat improves on critical aspects of LLM safety. Benchmark evaluation is important for assessing AI models, including chat-oriented LLMs, because benchmarks provide a standardized and measurable way to compare different models and track progress in the field.

However, it’s crucial to be aware of the benchmarks’ limitations in evaluating safety. Most of them were initially developed for pretrained LLMs, and there are certain limitations to consider when using them to measure the safety of fine-tuned/chat-oriented models. For example, the benchmarks may not adequately cover adversarial inputs or toxic content specifically designed to exploit vulnerabilities, and they may not cover all demographic categories. It is advisable to monitor disaggregated metrics and benchmarks in order to better understand and analyze the varied behavior exhibited by LLMs across different demographic groups.

Additionally, benchmarks typically assess language understanding and generation based on individual sentences or prompts, but in chat scenarios, context is important. The ability of a fine-tuned chat model to maintain context, handle nuanced situations, and avoid generating toxic content within a conversation may not be thoroughly evaluated by existing benchmarks. In the BOLD dataset, the prompts extracted from Wikipedia are taken to be the first five words plus the domain term, resulting in prompts in BOLD having six to nine words, depending on the domain and demographic group (Dhamala et al., 2021).

After deployment, safety in chat models involves user experience and long-term effects, which are not captured by benchmarks alone. Therefore, to assess safety effectively, additional testing of how they are integrated in a product deployment, how they are used, and what metrics accurately and precisely capture safety risks given the product context is essential for a comprehensive evaluation of safety. Our future work will conduct more comprehensive evaluations that encompass some dimensions not yet addressed in the cases mentioned above.

A.5 Data Annotation

We have relied on human annotators in order to collect annotations for the supervised fine-tuning stage and human preferences to train the reward models. In this section, we provide details about the data annotation process.

We have collected single-turn and multi-turn dialogue annotations from our pool of annotators. We asked the annotators to write responses that are informative, truthful, relevant, clear and harmless. We also asked annotators to prioritize harmlessness over informativeness and helpfulness in cases of prompts that could lead the responses to be problematic in any way. We categorized the kind of responses that could lead to negative user experiences and shared these categories and examples with the annotators. A summary of these categories can be seen in Section A.5.2.

A.5.2 Negative User Experience Categories

There are different kinds of responses that could cause a negative user experience when interacting with our models. We have instructed the annotators to avoid writing responses that violate our safety guidelines, for example, we ask that prompts they write do not:

Promote or enable dangerous behaviors to the user or other people.

Contain, promote or enable offensive and abusive behavior towards the user or other people.

Contain, promote or enable sexually explicit content.

A.5.3 Quality Assurance Process

We have implemented a quality assurance process to ensure we only use high quality annotations for training the model. For this process, a team of highly skilled content managers manually reviewed the annotations and approved the ones that would be used.

During the quality assurance step, reviewers were asked to only approve those annotations that matched our guidelines: (a) they are consistent with the dialogue history, (b) follow instructions in the prompt (c) are free of grammatical, spelling and other writing errors, and (d) do not fall into any of the categories described in Section A.5.2. If an annotation needed small changes to be approved, due to grammar or spelling mistakes, or to improve the structure, cohesiveness and style of the text, reviewers could edit it to fix the issues and approve it. If the answer could not be approved without major changes, the reviewers were asked to reject it and write the feedback necessary to improve it.

A.5.4 Annotator Selection

To select the annotators who could work on our different data collection tasks, we conducted a multi-step assessment process where we tested their understanding of our guidelines, the alignment with our quality assessment criteria, the alignment with our sensitive topics guidelines and their reading and writing skills.

The first test consists of 3 sections of testing to evaluate grammar, reading comprehension and writing style. Each section is timed and the test should take a total of 50 minutes to complete. A candidate must score 90% on part I to continue on to parts II and III, and an average score of 4 on part II and III to pass the test.

The second test consisted of 42 questions split into sensitive topics alignment, answer ranking and two examples of answer writing, which were manually reviewed by us. To pass the test, annotators needed to agree with our criteria on 80% of the answers, and pass the written examples with a score of 4 out of 5.

The third test consisted in measuring the alignment with our quality assessment criteria. The test consisted of 31 different questions asking the annotators to grade different prompt-answer pairs, as well as ranking different answers to the same prompt. To measure alignment, we first collected responses from different team members, and the annotators who agreed with our preferences in more than 26 of the questions passed the test.

Finally, the last test consisted of a prompt response assessment where annotators choose a minimum of 6 out of 18 prompts to write responses for. We manually assess each response to evaluate production readiness. Annotators that have scored an average of $>$ 4 have passed the training.

A.6 Dataset Contamination

With the increasing scale of publicly available training data, it has become inevitable that some portion of evaluation data is seen during training, and may provide an undue boost in evaluation performance.

Earlier work (Brown et al. (2020), Wei et al. (2022a), Du et al. (2022) in measuring such dataset contamination considered an example from an evaluation set to be “contaminated” if there existed a collision between a high-order $n$ -gram (generally, $n=13$ ) from the sample and the training data. This was a deliberately conservative approach in order to produce a “clean” subset of the data with high precision, and is used in open-sourced evaluation libraries (e.g. Gao et al. (2021)).

This approach, however, was unable to detect precisely what proportion of a given sample is contaminated, and didn’t take into account how evaluation datasets are constructed. Furthermore, as noted in Chowdhery et al. (2022), some datasets (such as BoolQ) contain contexts extracted verbatim from the web, but not the question and answer continuation. As such, highly contaminated samples from these datasets are unlikely to gain an unfair advantage. The methodology in Chowdhery et al. (2022) further improves on the earlier $n$ -gram collision detection by considering a sample to be contaminated if 70% of all 8-grams can be found at least once in the training data.

The previous methodologies noted above all consider contamination in text space, and don’t appear to consider the formatting of prompts used for actual evaluation. In contrast, we instead match on tokenized input, being careful to pass fully verbalized evaluation samples to the tokenizer. We also diverge from the previous methodologies by considering contamination from a bottom-up perspective. We consider a token to be contaminated if it appears in any token $n$ -gram longer than 10 tokens in both the evaluation sample and the training set, and define the contamination percentage of a sample to be the percentage of tokens contaminated. This allows us to view the benchmark performance of our models on a range of contamination scales, while retaining the ability to test a high-precision clean subset (samples with $<20\%$ contamination) and a high-precision contaminated subset (samples with $>80\%$ contamination). In order to account for the vagaries of the precise format of verbalized samples, we allow a small "skipgram budget" of four tokens, so that matched spans between an evaluation sample and the training data can differ in at most four positions (we do not allow trailing mismatches, or mismatches in the first 10 tokens).

We identify such 10(+)-skipgrams with suffix arrays implemented using a variation of the library from Lee et al. (2022), modified to work on a PySpark cluster (effectively without random access to disk). Given the embarrassingly parallel nature of the task, we are able to find all such 10-grams (and their full lengths) in our entire dataset in around seven hours (including time to tokenize), utilizing an estimated 1,500 cores.

As there are many confounding factors at play when determining whether dataset contamination has contributed to evaluation performance (mostly stemming from the fact that "clean" and "dirty" subsets do not necessarily well-estimate the population distribution), we make the following assumption: In the event of dataset contamination contributing to evaluation performance, we expect both the "cleanest" examples to have an overall worse average score than their complement, and the "dirtiest" samples to have an overall better average score than their complement. It is insufficient evidence for contamination if only one of these were true. To this end, we define four (non-disjoint) subset types as follows:

“Clean” samples, with less than 20% token contamination,

“Not clean” samples, with greater than (or equal to) 20% token contamination,

“Not dirty” samples, with less than 80% token contamination,

“Dirty” samples, with greater than (or equal to) 80% token contamination.

There is an additional confounding factor that we attempt to address directly. With the given definition of contamination (as well as other definitions mentioned in the literature), there is a possibility that a sample may appear contaminated, by virtue of many tokens appearing in matched sequences found in the training data. However, the matched sequences might be highly fragmented across the training data, in which case it is very unlikely the model saw the correctly-assembled contaminated sequences during training. To reduce the chance of this phenomenon, we repeat our analysis with minimum match length $L\in\{10,20,30,40,50\}$ . Since in the limit of $L\rightarrow\infty$ every sample falls into both the "clean" and "not dirty" (there is no contamination), we report the largest $L$ for each dataset that appeared to benefit from contamination to strike a balance between fragmentation and overall contamination.

For each dataset and each of the above sample subset types, we compute both the mean $\bar{X}$ of the performance metric $X$ and the statistic $Z_{n}=\frac{(\bar{X}-\mu_{n})}{\sigma_{n}}$ , where $n$ is the size of the sample subset type, and $\mu_{n}$ and $\sigma_{n}^{2}$ are the mean and variance of the sampling distribution of the performance metric for samples of size $n$ , respectively. By the Central Limit Theorem, $Z_{n}$ tends towards a standard normal distribution and so we consider there is sufficient evidence to suggest contamination has affected evaluation performance on a dataset if all four sample subsets have $|Z_{n}|>2$ .

Results for this analysis can be seen in Table 51. We observe that only HellaSwag and MMLU-Humanities appear to have been boosted due to contamination in the training data, with the 70B model appearing to have gained a greater benefit than the 7B model, as one might expect. Furthermore, the impact of this effect on MMLU-Humanities appears to cause a benefit for MMLU-Overall for the 70B model, albeit with only a small delta (-0.9) between the "clean" subset performance and the sampling mean. No other dataset (for any choice of $L$ ) appears to have benefitted from dataset contamination, and we omit results from these datasets for conciseness.

A.7 Model Card

Table 52 presents a model card (Mitchell et al., 2018; Anil et al., 2023) that summarizes details of the models.