Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan

Introduction

The problem of aligning large language models (LLMs) to human values and intentions, in terms of being comprehensive, respectful, and compliant (this is the definition of AI alignment used in this paper, distinct from following simple instructions), has gained significant attention in research as recent AI systems (like ChatGPT or GPT-4) have rapidly advanced in their capabilities. Presently, state-of-the-art AI systems predominantly depend on supervised fine-tuning (SFT) with human instructions and annotations, as well as reinforcement learning from human feedback (RLHF) on their preferences. The success of these techniques heavily relies on the availability of extensive human supervision, which is not only expensive to obtain but also raises potential issues with the quality, reliability, diversity, creativity, self-consistency, and undesirable biases of human-provided annotations.

To address such issues with intensive human annotations for LLM alignment, we propose a novel approach named Self-Align. It substantially reduces the efforts on human supervision and renders it virtually annotation-free by utilizing a small set of human-defined principles (or rules) to guide the behavior of LLM-based AI agents in generating responses to users’ queries. Our approach encompasses four essential stages:

(Topic-Guided Red-Teaming) Self-Instruct: We employ the self-instruct mechanism by Wang et al. with 175 seed prompts to generate synthetic instructions, plus 20 topic-specific prompts to ensure diversified topic coverage of the instructions. Such instructions ensure a comprehensive range of contexts/scenarios for the AI system to learn from.

Principle-Driven Self-Alignment: We offer a small set of 16 human-written principles in English about the desirable quality of the system-produced responses, or the rules behind the behavior of the AI model in producing answers. (The detailed principles are given in the appendix. Analogous to Constitutional AI, the design of these principles in Self-Align remains exploratory and primarily serves research purposes.) These principles function as guidelines for generating helpful, ethical, and reliable responses. We conduct in-context learning (ICL) with a few (5) exemplars (demonstrations) that illustrate how the AI system complies with the rules when formulating responses in different cases. From the human-written principles, ICL exemplars, and the incoming self-instructed prompts, the LLM can trigger the matching rules and generate the explanations for a refused answer if the query is detected as a harmful or ill-formed one.

Principle Engraving: In the third stage, we fine-tune the original LLM (the base model) on the self-aligned responses, generated by the LLM itself through prompting, while pruning the principles and demonstrations for the fine-tuned model. The fine-tuning process enables our system to directly generate responses that are well-aligned with the helpful, ethical, and reliable principles across a wide range of queries, due to shared model parameters. Notice that the fine-tuned LLM can directly generate high-quality responses for new queries without explicitly using the principle set and the ICL exemplars.

Verbose Cloning: Lastly, we employ context distillation to enhance the system’s capability to produce more comprehensive and elaborate responses than the overly short or indirect responses.

Impressively, the entire Self-Align process necessitates fewer than 300 lines of annotations (including 195 seed prompts, 16 principles, and 5 exemplars), while previous aligned AI systems such as InstructGPT or Alpaca required at least 50K human/teacher annotations. This highlights the supervision efficiency of our approach in comparison with other state-of-the-art AI assistants, as shown in Table 1. Our principle-driven approach, which is essentially rule-based, not only significantly reduces the required human effort for supervision but also showcases aligning neural language models with human understanding of principles or rules about quality language generation in both an effective and efficient manner.

We should also point out that the advancements of recent models like Alpaca and Vicuna have shown that potent conversational capabilities can be obtained by distilling existing human-preference-aligned LLMs (i.e., Text-Davinci-003 and ChatGPT, respectively) into smaller, more manageable models. Those resulting smaller models, however, still rely on the successful alignment of existing LLMs, which are based on extensive human-provided supervision. In other words, those smaller models indirectly inherit the dependence on the availability of intensive supervision from humans. In contrast, our approach focuses on language model alignment from scratch, independent of the existence of well-aligned LLMs like ChatGPT or GPT-4. That is the main distinction between our approach and other existing approaches, and it is why we call it self-alignment from scratch.

We are providing the code for the Self-Align method as open source to promote collaboration and innovation within the research community. The base model of Dromedary is the LLaMA-65b language model, which is accessible for research-only, noncommercial purposes. By investigating strategies different from those in RLHF, our work seeks to broaden the scope of AI alignment techniques, and promote a deeper understanding of how to improve AI systems, not only in terms of being more powerful, but also more responsible and well-aligned with human values.

Related Works

The domain of AI alignment has garnered substantial attention in recent years, with LLMs exhibiting remarkable proficiencies across a wide array of tasks. GPT-4 epitomizes this development, implementing a post-training alignment process to bolster factuality and adherence to desired behavior, while concurrently mitigating potential risks. A prominent strategy for aligning language models with human values entails fine-tuning via human feedback. Notably, Ouyang et al. and Bai et al. utilized reinforcement learning from human feedback (RLHF) to refine models, enhancing helpfulness and truthfulness, and diminishing toxic output generation. This technique requires extensive human annotations.

Constitutional AI (CAI) or self-critique investigates self-improvement without human labels for harmful outputs, leveraging AI-generated self-critiques, revisions, and preference models. Based on a list of human-generated rules or principles, this approach fosters the evolution of safe, reliable, and effective AI systems with increased behavioral precision and reduced dependency on human labels. Both Self-Align and CAI are rule-based alignment techniques for powerful AI systems. However, there are substantial differences between them, as outlined below:

In the principle-driven self-alignment procedure of Self-Align, the language model itself determines which rules to adhere to given user queries, and it subsequently generates appropriate responses conditional on these queries and rules. Conversely, CAI employs a self-critique methodology; given a pair comprising a user query and the model’s response, it selects a rule to scrutinize the existing response, thereby yielding a refined output.

The self-critique nature of CAI necessitates RLHF warm-up. In stark contrast, Self-Align explores the alignment of language models from scratch, requiring minimal human supervision.

However, one limitation of Self-Align is the requirement to include all rules within the context, a process bound by the base language model’s token limit. In contrast, the CAI technique, as a post-generation self-critique method, is not subject to this token limit constraint (for example, the latest Claude employs at least 58 rules).

State-of-the-art AI-assistant agents have significantly advanced in recent years, with InstructGPT leading the way as the first model trained with supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) on user queries. ChatGPT, a sibling model to InstructGPT, has garnered widespread success as a commercial AI assistant, showcasing its ability to follow instructions in prompts and provide detailed responses. Alpaca, as a subsequent open-source model, was developed using Self-Instruct to learn the knowledge from Text-Davinci-003 (similar to InstructGPT), offering cost-effective and accessible alternatives. In parallel, models like Vicuna, Koala, and Baize have been trained on ChatGPT outputs, essentially distilling the ChatGPT model to create new open-source chatbots. Dolly-V2, another open-source effort, utilizes 15k new instruction-following data points for training. OpenAssistant follows a similar approach to ChatGPT by collecting its own data. These advancements in AI assistants continue to push the boundaries of usability and accessibility, making significant strides in the open-source domains.

Our Self-Align approach distinguishes itself by concentrating on the creation of novel alignment techniques for LLMs, developed from the ground up and independent of established AI systems, while requiring minimal human supervision. This research direction aims to investigate the potential of aligning AI models under circumstances where dependence on or access to existing systems may be unfeasible or unfavorable. A comparison of annotation cost between Self-Align and previous methods is shown in Table 1 and Figure 2.

Our Method: Self-Align

The Self-Align method involves four distinct stages. The first stage is called Topic-Guided Red-Teaming Self-Instruct, which employs the language model itself to generate synthetic instructions and enhance diversity via a topic-guided red-teaming approach. The second stage, Principle-Driven Self-Alignment, defines a set of principles that the AI model must adhere to and provides in-context learning demonstrations for constructing helpful, ethical, and reliable responses. The third stage, Principle Engraving, fine-tunes the base language model by pruning principles and demonstrations, empowering the model to directly generate appropriate responses. Finally, the fourth stage, Verbose Cloning, serves as a complementary step to address challenges arising from overly-brief or indirect responses by refining the model to produce detailed and comprehensive answers to user queries. We will describe each of these stages in detail.

The Self-Instruct method is a semi-automated, iterative bootstrapping process that harnesses the capabilities of a pretrained LLM to generate a wide array of instructions (and corresponding outputs). The method commences with 175 manually-written instructions (https://github.com/yizhongw/self-instruct/blob/main/data/seed_tasks.jsonl), and the LLM proceeds to develop new tasks and augment the task pool (after eliminating low-quality or repetitive instructions). This process is executed iteratively until a satisfactory volume of tasks is reached. A noteworthy application of this method can be observed in Alpaca, where Self-Instruct is utilized to generate new queries and distilled output from Text-Davinci-003.
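The bootstrapping loop described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `generate` and `is_acceptable` are hypothetical stand-ins for the base-LLM call and the quality/repetition filter, and the toy versions at the bottom exist only so the sketch runs without a model.

```python
import random


def self_instruct(seed_tasks, generate, is_acceptable, target_size, shots=3, seed=0):
    """Grow a task pool from seed instructions until it holds target_size items.

    generate(examples) -> list[str]       hypothetical call to the base LLM
    is_acceptable(candidate, pool) -> bool hypothetical quality/repetition filter
    """
    rng = random.Random(seed)
    pool = list(seed_tasks)
    while len(pool) < target_size:
        # Show the model a few tasks sampled from the current pool.
        examples = rng.sample(pool, min(shots, len(pool)))
        added = 0
        for candidate in generate(examples):
            if len(pool) < target_size and is_acceptable(candidate, pool):
                pool.append(candidate)
                added += 1
        if added == 0:
            break  # guard for this sketch: stop if a round yields nothing new
    return pool


# Toy stand-ins (assumptions for illustration only).
_counter = iter(range(10_000))


def toy_generate(examples):
    # One fresh task plus one repeat, to exercise the filter.
    return [f"Toy task {next(_counter)}", examples[0]]


def toy_filter(candidate, pool):
    return len(candidate) > 5 and candidate not in pool


seeds = ["Write a haiku about rain.", "Sort a list of numbers."]
pool = self_instruct(seeds, toy_generate, toy_filter, target_size=6)
print(len(pool), "tasks in the pool")
```

In practice the filter would be a similarity check against existing tasks rather than exact string matching; the exact criterion is not specified in this section.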

We introduce an effective extension, the Topic-Guided Red-Teaming Self-Instruct, which aims to improve the diversity and coverage of the generated adversarial instructions. We manually devise 20 adversarial instruction types that a static machine learning model can’t answer, or may answer with the wrong facts, such as:

, and prompt the base LLM to generate novel topics (e.g., Water) relevant to these types (see the appendix for the seed prompts we used for Topic-Guided Red-Teaming Self-Instruct). Subsequently, after removing duplicated topics, we prompt the base LLM to generate novel instructions corresponding to the specified instruction type and topic. Incorporating additional prompts that concentrate on particular adversarial instruction types and diverse topics allows the AI model to explore an expanded range of contexts and scenarios.
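The two-step procedure (topics per instruction type, then instructions per type and topic) might look like the sketch below. The helper names `propose_topics` and `propose_instructions` are hypothetical stand-ins for prompting the base LLM, the type names are placeholders, and deduplicating topics per type with case-insensitive matching is an assumption; the text only says duplicated topics are removed.

```python
def topic_guided_prompts(instruction_types, propose_topics, propose_instructions):
    """Build (type, topic, instruction) records for red-teaming prompts."""
    records = []
    for itype in instruction_types:
        # Step 1: ask for topics and drop duplicates (assumed: case-insensitive).
        seen, topics = set(), []
        for topic in propose_topics(itype):
            key = topic.strip().lower()
            if key and key not in seen:
                seen.add(key)
                topics.append(topic.strip())
        # Step 2: ask for instructions for each (type, topic) pair.
        for topic in topics:
            for instruction in propose_instructions(itype, topic):
                records.append(
                    {"type": itype, "topic": topic, "instruction": instruction}
                )
    return records


# Toy stand-ins so the sketch runs without a model (illustrative only).
def toy_topics(itype):
    return ["Water", "water ", "Energy"]


def toy_instructions(itype, topic):
    return [f"[{itype}] Ask something about {topic}."]


records = topic_guided_prompts(["type A", "type B"], toy_topics, toy_instructions)
print(len(records), "instructions generated")
```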

2 Principle-Driven Self-Alignment

The Principle-Driven Self-Alignment technique is designed to develop the AI alignment with a small set of helpful, ethical, and reliable principles. This stage capitalizes on the (Topic-Guided Red-Teaming) Self-Instruct as an instruction generator. The primary objective is to enable the AI model to generate fitting responses that adhere to the established principles, while simultaneously minimizing human supervision.

The Principle-Driven Self-Alignment process begins with the creation of sixteen generic principles that an AI assistant should follow, such as “1 (ethical). Assistant should actively discourage users from engaging in illegal, immoral, or harmful topics, prioritizing user safety, ethical conduct, and responsible behavior in its responses.” Subsequently, five in-context learning (ICL) demonstrations are provided to exemplify how the AI assistant should apply these principles through an explicit process termed “internal thoughts” (the effectiveness of such a thinking procedure has been demonstrated on a wide range of reasoning, action, and knowledge-intensive tasks). For instance, in the ICL demonstration, the user query can be:

And we annotate the internal thoughts of the AI assistant as:

Such internal thoughts will guide the assistant’s final generated response, such as:

When a new query is generated by (Topic-Guided Red-Teaming) Self-Instruct, it is appended to the list of the exemplars, and the base LLM follows such an internal-thought-then-answer process to produce a self-aligned response. The whole process is illustrated in Figure 3.
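As a rough illustration of how the principles, the ICL exemplars, and a new query could be assembled into one prompt, consider the sketch below. The layout (the "# Principles" header and the "User:" / "Assistant (internal thoughts):" / "Assistant:" labels) is an assumption made for this example; the actual prompt is given in the appendix. The exemplar fields are placeholders, and only the text of principle 1 is taken from this section.

```python
def build_self_align_prompt(principles, exemplars, query):
    """Concatenate principles, ICL exemplars, and a new query into one prompt.

    principles: list of (name, text) tuples, numbered from 1 in order
    exemplars:  list of dicts with 'query', 'thoughts', 'response' keys
    """
    lines = ["# Principles"]
    lines += [f"{i} ({name}). {text}" for i, (name, text) in enumerate(principles, 1)]
    for ex in exemplars:
        lines += [
            "",
            f"User: {ex['query']}",
            f"Assistant (internal thoughts): {ex['thoughts']}",
            f"Assistant: {ex['response']}",
        ]
    # The new query goes last; the model continues with its internal thoughts.
    lines += ["", f"User: {query}", "Assistant (internal thoughts):"]
    return "\n".join(lines)


principles = [
    (
        "ethical",
        "Assistant should actively discourage users from engaging in illegal, "
        "immoral, or harmful topics, prioritizing user safety, ethical conduct, "
        "and responsible behavior in its responses.",
    ),
]
exemplars = [
    {
        "query": "<exemplar query>",
        "thoughts": "<which principles apply and why>",
        "response": "<exemplar response>",
    }
]
prompt = build_self_align_prompt(
    principles, exemplars, "How do I sort a list in Python?"
)
print(prompt)
```

Because the prompt ends at the internal-thoughts label, the base model is expected to name the relevant principles first and only then write the answer.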

In this paper, the design of the principles remains exploratory and primarily serves research purposes. (Analogous to Constitutional AI, we believe that, in the future, such principles should be redeveloped and refined by a more extensive set of stakeholders. Given the small number of bits of information involved in these principles, a thorough examination of these bits is warranted.) We (the authors) brainstormed sixteen principles, namely 1 (ethical), 2 (informative), 3 (helpful), 4 (question assessment), 5 (reasoning), 6 (multi-aspect), 7 (candor), 8 (knowledge recitation), 9 (static), 10 (clarification), 11 (numerical sensitivity), 12 (dated knowledge), 13 (step-by-step), 14 (balanced & informative perspectives), 15 (creative), 16 (operational), drawing inspiration from existing principles in Constitutional AI and the new Bing Chatbot, as well as the principles proven to enhance AI performance in recent research papers, such as step-by-step reasoning and knowledge recitation. The detailed principles and the ICL exemplars are given in the appendix.

3 Principle Engraving

Principle Engraving constitutes a vital element of the Self-Align methodology, focusing on honing the AI model’s behavior to produce responses that adhere to predefined principles. During this stage, the base LLM is fine-tuned after pruning the principles, the in-context learning demonstrations, and the self-generated thoughts, effectively engraving these principles into the LLM’s parameters. Figure 3 provides a visual representation of this process.

A noteworthy advantage of principle engraving is its ability to enhance the AI model’s alignment while reducing token usage, which enables longer context lengths during inference (as allocating more than 1.7k tokens to fixed principles and ICL demonstrations would be excessive). Remarkably, our empirical observations reveal that the base LLM, after being fine-tuned on its self-aligned outputs, surpasses its prompted counterpart on alignment benchmarks. This improvement can likely be attributed to the generalization effect that occurs when the language model is directly optimized to generate output that is helpful, ethical, and reliable.
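The pruning step can be pictured as turning each prompted generation into a plain (query, response) pair, with the principles, exemplars, and internal thoughts removed. The sketch below assumes a hypothetical generation format in which the thoughts are followed by a line starting with "Assistant:"; the marker and the field names are illustrative choices, not the released data format.

```python
def extract_final_answer(generation, answer_marker="\nAssistant:"):
    """Return the text after the answer marker, or None if the marker is missing."""
    _thoughts, sep, answer = generation.partition(answer_marker)
    if not sep:
        return None  # malformed generation: no final answer found
    return answer.strip() or None


def engraving_pairs(queries, generations):
    """Pair each bare query with its final answer only.

    The principles, ICL exemplars, and internal thoughts are dropped, so the
    fine-tuned model learns to answer directly from the query.
    """
    pairs = []
    for query, generation in zip(queries, generations):
        answer = extract_final_answer(generation)
        if answer is not None:
            pairs.append({"prompt": query, "response": answer})
    return pairs


queries = ["q1", "q2"]
generations = [
    " The user asks q1. I should follow rule 2 (informative).\nAssistant: Here is the answer.",
    " The model rambled and never produced a final answer",
]
pairs = engraving_pairs(queries, generations)
print(pairs)
```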

4 Verbose Cloning

In our preliminary testing of the principle-engraved model, we identified two primary challenges: 1) the model tended to generate unduly brief responses, while users typically expect more comprehensive and elaborate answers from an AI assistant, and 2) the model occasionally recited relevant Wikipedia passages without directly addressing the user’s query.

To overcome these challenges, we introduce a complementary Verbose Cloning step. This stage involves utilizing a human-crafted prompt to create a verbose version of the aligned model that is capable of generating in-depth, detailed responses. We then employ context distillation to produce a new model that is not only aligned but also generates thorough and extensive responses to user queries. Context distillation works by training the base language model on synthetic queries generated by (Topic-Guided Red-Teaming) Self-Instruct, paired with corresponding responses produced by a verbosely prompted principle-engraved model. The verbose prompt designed to encourage the talkative nature of the principle-engraved model is provided in the appendix.
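The key point of context distillation is that the teacher sees the verbose prompt while the student is trained on the bare query. A minimal sketch of the data construction, with `teacher_generate` as a hypothetical stand-in for the principle-engraved model and a placeholder verbose prompt:

```python
def verbose_cloning_dataset(queries, teacher_generate, verbose_prompt):
    """Build training pairs for context distillation.

    The teacher is conditioned on verbose_prompt + query; the stored training
    prompt is the query alone, so the verbose behavior is absorbed into the
    student's weights instead of being supplied in its context.
    """
    dataset = []
    for query in queries:
        response = teacher_generate(f"{verbose_prompt}\n\n{query}")
        dataset.append({"prompt": query, "response": response})
    return dataset


# Toy teacher so the sketch runs without a model (illustrative only).
def toy_teacher(full_prompt):
    query = full_prompt.split("\n\n")[-1]
    return f"A long, detailed answer to: {query}"


VERBOSE_PROMPT = "<verbose prompt from the appendix>"  # placeholder
dataset = verbose_cloning_dataset(
    ["What is a dromedary?", "Explain context distillation."],
    toy_teacher,
    VERBOSE_PROMPT,
)
print(dataset[0])
```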

Evaluation

We quantitatively evaluate Dromedary on benchmark datasets and also assess its qualitative performance on several datasets for demonstration purposes. By default, all the language model-generated text is decoded with a temperature of $0.7$ .
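For readers unfamiliar with temperature decoding, the sketch below shows the standard temperature-scaled softmax that such decoding samples from: next-token logits are divided by the temperature before normalization, so a value below 1 (here 0.7) sharpens the distribution relative to temperature 1. The logits are made-up numbers, and this illustrates the general technique rather than the paper's decoding code.

```python
import math


def softmax_with_temperature(logits, temperature=0.7):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # illustrative values only
probs_t07 = softmax_with_temperature(logits, 0.7)
probs_t10 = softmax_with_temperature(logits, 1.0)
print(probs_t07, probs_t10)
```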

Dromedary is the AI assistant developed by implementing the Self-Align process on the LLaMA-65b base language model. We investigate two variants: Dromedary (final) and Dromedary (non-verbose), respectively. The former represents the model obtained by applying all four steps of the Self-Align process, while the latter is the principle-engraved model, excluding the final step of verbose cloning. Due to the space limit, the experimental details of Dromedary such as training process and decoding hyper-parameters can be found in the appendix.

Our comparison involves several notable baselines. LLaMA provides a set of performant base language models for research usage. Text-Davinci-003, ChatGPT (or GPT-3.5), and GPT-4, successors to their previous versions, have demonstrated significant enhancements in generating contextually relevant and high-quality content. Alpaca, a fine-tuned model derived from Text-Davinci-003, and Vicuna, a chatbot trained on user-shared conversations with ChatGPT, offer unique insights into model performance. Dolly-V2, an instruction-following model, showcases commercial applications of language models. Finally, results from Anthropic-LM, though not publicly available, provide valuable benchmarks. More comprehensive descriptions of these models are available in the appendix.

2 Benchmark Results

The TruthfulQA benchmark evaluates a model’s ability to identify true claims, specifically in the context of literal truth about the real world. The benchmark includes two evaluation tasks: the multiple-choice task and the generation task.

In the Multiple-Choice (MC) task, models are tested on their ability to select true answers from sets of true and false (usually 2-7) reference answers (the evaluation prompt we used for TruthfulQA-MC can be found in the appendix). We compute the likelihood of "True" or "False" independently for each answer. The MC1 accuracy results are shown in Figure 4 (left). We can see that with a modified ranking approach, Dromedary significantly outperforms the powerful GPT-4 model and other baselines, achieving a new state-of-the-art MC1 accuracy of 69.
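One way to read the scoring described above: each candidate answer receives its own "True"/"False" likelihoods, the candidates are ranked by normalized P("True"), and a question counts as correct when the top-ranked candidate is a true answer. The sketch below implements that reading; the normalization and ranking rule are assumptions about the "modified ranking approach", and the numbers are invented for illustration.

```python
def normalized_true_prob(p_true, p_false):
    """Probability mass on 'True' relative to 'True' + 'False'."""
    return p_true / (p_true + p_false)


def mc1_correct(scores, labels):
    """scores: list of (p_true, p_false) per answer; labels: 1 = true answer."""
    ranked = [normalized_true_prob(t, f) for t, f in scores]
    best = max(range(len(ranked)), key=ranked.__getitem__)
    return labels[best] == 1


def mc1_accuracy(questions):
    """questions: list of (scores, labels) tuples, one per question."""
    hits = sum(mc1_correct(scores, labels) for scores, labels in questions)
    return hits / len(questions)


# Made-up likelihoods for two questions (illustrative only).
questions = [
    ([(0.9, 0.1), (0.4, 0.6)], [1, 0]),              # top answer is true
    ([(0.3, 0.7), (0.6, 0.4), (0.5, 0.5)], [1, 0, 0]),  # top answer is false
]
accuracy = mc1_accuracy(questions)
print(accuracy)
```

Scoring each answer independently avoids having the candidates compete inside one prompt, which is the property the paragraph emphasizes.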

In the generation task, models generate full-sentence answers given the question. The benchmark reports two metrics: the fraction of answers judged truthful, and the fraction judged both truthful and informative. As shown in Figure 4 (right), Dromedary achieves higher scores than GPT-3, LLaMA, and Alpaca in both categories, while falling behind the ChatGPT-distilled Vicuna model.

2.2 BIG-bench HHH Eval

The BIG-bench HHH Eval was specifically designed to evaluate a model’s performance in terms of helpfulness, honesty, and harmlessness (HHH). It is a Multiple-Choice (MC) task, which tests the models’ ability to select superior answers from two reference answers (the evaluation prompt we used for HHH Eval can be found in the appendix). We calculate the likelihood of the model preferring one answer over the other when presented with two candidate answers simultaneously. The MC accuracy results are displayed in Table 2. It can be observed that Dromedary demonstrates significantly improved performance compared to other open-source models, such as LLaMA and Alpaca, particularly in the Harmless metric. Furthermore, it only marginally underperforms when compared to the powerful ChatGPT model.
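A pairwise preference of this kind can be sketched as a two-way softmax over the log-likelihoods the model assigns to choosing the first or the second answer. The function names, the tie-breaking rule, and the log-likelihood values below are assumptions for illustration; the actual evaluation prompt is in the appendix.

```python
import math


def prefer_first_prob(logp_first, logp_second):
    """Two-way softmax: probability that the model prefers the first answer."""
    return 1.0 / (1.0 + math.exp(logp_second - logp_first))


def pairwise_accuracy(items):
    """items: list of (logp_first, logp_second, gold) with gold in {'A', 'B'}."""
    hits = 0
    for logp_a, logp_b, gold in items:
        # Assumed tie-break: prefer 'A' when both are equally likely.
        predicted = "A" if prefer_first_prob(logp_a, logp_b) >= 0.5 else "B"
        hits += predicted == gold
    return hits / len(items)


# Made-up log-likelihoods (illustrative only).
items = [(-0.2, -1.7, "A"), (-2.0, -0.5, "B"), (-0.4, -1.0, "B")]
hhh_accuracy = pairwise_accuracy(items)
print(round(hhh_accuracy, 3))
```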

2.3 Vicuna Benchmark Questions (Evaluated by GPT-4)

Chiang et al. introduced an evaluation framework leveraging GPT-4 to automate the assessment of chatbot performance. In this framework, GPT-4 generates challenging questions across diverse categories, and answers from five chatbots—LLaMA, Alpaca, ChatGPT, Bard, and Vicuna—are collected. We directly use these data to compare Dromedary with these chatbots.

We followed Chiang et al. and utilized GPT-4 to rate chatbot responses based on helpfulness, relevance, accuracy, and detail. Inspired by Vicuna (https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py), we use two conversation examples as ICL to improve the response quality of Dromedary (the two-shot prompt we used for open-ended conversation can be found in the appendix). A Win/Tie/Lose comparison between the final version of Dromedary and various baselines is illustrated in Figure 10. The comparison reveals that Dromedary surpasses Text-Davinci-003 and Alpaca but falls short of ChatGPT and its distilled version, Vicuna. Additionally, we present a comparison of relative performance with respect to ChatGPT in Figure 5(b).
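Given per-question scores from the judge, a Win/Tie/Lose tally and a relative score can be computed as in the sketch below. The scores are invented, and defining relative performance as the ratio of total scores is an assumption made for this example rather than a detail stated in this section.

```python
from collections import Counter


def win_tie_lose(score_pairs):
    """score_pairs: list of (our_score, baseline_score), one per question."""
    tally = Counter(win=0, tie=0, lose=0)
    for ours, theirs in score_pairs:
        if ours > theirs:
            tally["win"] += 1
        elif ours == theirs:
            tally["tie"] += 1
        else:
            tally["lose"] += 1
    return dict(tally)


def relative_score(score_pairs):
    """Assumed definition: total of our scores divided by the baseline's total."""
    ours = sum(o for o, _ in score_pairs)
    theirs = sum(t for _, t in score_pairs)
    return ours / theirs


# Made-up judge scores on a 1-10 scale (illustrative only).
score_pairs = [(8, 7), (6, 6), (5, 9), (9, 8)]
tally = win_tie_lose(score_pairs)
ratio = relative_score(score_pairs)
print(tally, round(ratio, 3))
```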

2.4 Discussions

Interestingly, in contrast to the prevailing alignment paradigm of first-following-then-align, i.e., SFT (supervised fine-tuning) + RLHF (reinforcement learning from human feedback), Self-Align prioritizes improving harmlessness and reliability through Principle-Driven Self-Alignment and Principle Engraving. Subsequently, it improves its helpfulness (instruction-following ability) by employing Verbose Cloning. Determining the superior paradigm (first-following-then-align or first-align-then-following) may need future research.

The final Verbose Cloning step in Self-Align aims to enhance the model’s ability to generate comprehensive and detailed responses. However, the benchmark results reveal a noteworthy observation: while Verbose Cloning significantly improves generation quality (as evidenced by the Vicuna Benchmark Questions and our TruthfulQA generation task), it harms the model’s performance in several multiple-choice benchmarks, particularly in ranking more trustworthy responses. Drawing on the “alignment taxes” concept introduced by Bai et al., we refer to this phenomenon as verbose tax. Understanding the underlying reasons for this occurrence and exploring methods to improve the model’s helpfulness (verbose generation ability) while maintaining its harmlessness and trustworthiness warrant further investigation.

3 Qualitative Demonstrations

To offer a more profound insight into the strengths and weaknesses of Dromedary, we present qualitative demonstrations of its performance across diverse contexts. Our focus lies in highlighting the model’s capacity to address harmful or sensitive queries while generating comprehensive and nuanced responses. Due to the space limit, we present these results in the appendix. The results of Anthropic-LM (or ALM) HH RLHF and a few other baselines are taken from Bai et al., while the results of other baselines on Vicuna benchmark questions are taken from Chiang et al.

Conclusion & Future Work

Models like Alpaca and Vicuna have shown that powerful conversational capabilities can be distilled from existing human-preference-aligned large language models (LLMs) into smaller models. In this paper, we introduce Dromedary, a model for the research community based on principle-driven self-alignment, trained from scratch and requiring very little human annotation. By harnessing the intrinsic knowledge within an LLM, we can define principles that guide how we want an LLM-based AI model to behave, resulting in an AI assistant that not only produces quality interactions but also produces responses that respect the guardrails defined by the model creator. This method represents a distinct direction from RLHF, and it focuses on developing novel alignment techniques for language models from scratch, independent of pre-existing, well-established AI systems. In other words, our approach seeks to explore the potential of aligning AI models in situations where reliance on or access to existing systems may not be feasible or desired.

For future work, we propose the following research directions:

Conduct ablation studies on Dromedary’s 16 self-alignment principles to evaluate the impact of adding or removing specific principles.

Apply Constitutional AI-based self-critique and reinforcement learning techniques to enhance the performance of Dromedary further.

Perform human evaluations to assess the real-world applicability and effectiveness of Self-Align.

Investigate better utilization of existing open-source annotation data, such as the 15k original instruction-following data in .

Engage with the broader research community to explore how the definition of principles interacts with different ethical, cultural, and application contexts. Principle-guided self-alignment provides a starting point for multi-stakeholder communities to engage with the alignment of AI models, but substantial ongoing work will be needed to ensure that these methods drive positive outcomes across a range of communities.

Acknowledgements

This work was supported in part by IBM research, the Microsoft Accelerate Foundation Models Research award, and the Google PhD Fellowship. We would also like to thank the computation support from AiMOS, a server cluster for the IBM Research AI Hardware Center. We thank Yizhong Wang, Frank Xu, and Zhengbao Jiang for their insightful discussions and help with the experiments. We thank the anonymous reviewers for their helpful suggestions.

Appendix A Limitations & Social Impacts

In this section, we discuss the limitations of the proposed Self-Align technique and the released Dromedary model, and address the potential social impacts that may arise from its release.

Incompleteness of intrinsic knowledge: While Dromedary harnesses the intrinsic knowledge within an LLM, it is subject to the limitations of the base model’s knowledge, which may be incomplete or outdated. Consequently, the model’s responses may sometimes be inaccurate or fail to reflect recent developments.

Challenges in defining principles: The process of defining principles for the self-alignment approach is non-trivial, as it may be difficult to anticipate all potential scenarios and challenges that a model might encounter during deployment. Furthermore, balancing between competing principles may result in unexpected behavior. We welcome the involvement of a broad community of stakeholders and ethics experts in helping shape these principles. Different communities and applications will demand different approaches, and we do not see any one set of principles as being the globally appropriate ones; our approach provides these different stakeholders with an avenue to engage in a structured way with this process. We do not expect any one set of principles to be the final word on alignment, but rather provide these methods as a jumping off point for broader community engagement.

Limited generalizability: While the model demonstrates strong performance in several domains, it may not generalize well to all possible applications or contexts. There may be situations where the model’s performance falls short of expectations, necessitating additional fine-tuning or adaptation.

Inconsistent principle adherence: In our preliminary testing, we observed that Dromedary occasionally hallucinates information that violates our pre-defined principles. Further investigation is required to improve strict principle adherence in the Self-Align process. One should not assume that the alignment procedures presented here provide definitive guardrails that prevent all undesired outputs and outcomes.

A.2 Social Impacts

By investigating the alternative AI alignment strategies, our work seeks to contribute to the broader landscape of AI alignment, expanding the range of possibilities and promoting a more diverse and robust understanding of how AI systems can be developed to be not only more powerful, but also more responsible and aligned with human values. Through this research, we aspire to pave the way for the safer and more harmonious integration of AI into various aspects of our lives, fostering a collaborative, ethical, and multi-stakeholder approach to AI development.

However, the potential negative impacts of our work include:

Potential misuse: As with any powerful AI system, there is the risk of misuse, such as generating malicious content or automated disinformation. It is crucial to establish mechanisms for detecting and mitigating such abuse, as well as promoting ethical guidelines for AI developers and users.

Bias and fairness: The Dromedary model may inadvertently perpetuate or exacerbate existing biases present in the pre-training data of its base language model, potentially leading to unfair or discriminatory outcomes. Future work should address bias mitigation strategies to ensure fairness and inclusivity in AI applications.

Appendix B More Details about Dromedary

The Dromedary model is an AI assistant developed by implementing the Self-Align process on the LLaMA-65b base language model. This section describes the details of how the Dromedary model was created. Additional experimental details of Dromedary, such as training and decoding hyper-parameters, can be found in Appendix D.2.

We first followed Alpaca’s recipe, employing Self-Instruct, which automatically produced 267,597 open-domain prompts along with their corresponding inputs. Additionally, we utilized Topic-Guided Red-Teaming Self-Instruct, which automatically generated 99,121 prompts specifically tailored to 20 red-teaming instruction types.

After applying the Principle-Driven Self-Alignment process and filtering out low-quality responses, we obtained 191,628 query-response pairs derived from Self-Instruct and 67,250 query-response pairs from Topic-Guided Red-Teaming Self-Instruct, resulting in a total of 258,878 query-response pairs. Figure 6 presents a detailed analysis of the principles applied and the instruction types encompassed in the Topic-Guided Red-Teaming (TGRT) approach. We observed that the instructions generated by the original Self-Instruct and TGRT Self-Instruct appear to evoke distinct principles. For instance, Self-Instruct datasets use the principles 5 (reasoning), 13 (step-by-step), and 15 (creative) extensively, whereas TGRT Self-Instruct relies more on 8 (knowledge recitation) and 14 (balanced and informative perspectives).

Next, we fine-tuned the LLaMA-65b base language model with the automatically generated 258,878 (after filtering) query-response pairs, as well as a modified version of 910 pairs of dummy data from the Vicuna project (the dummy data are used to improve the self-identification of Dromedary: https://github.com/lm-sys/FastChat/blob/main/playground/data/dummy.json). This results in a non-verbose principle-engraved AI assistant, namely Dromedary (non-verbose).

Finally, we prompted the non-verbose principle-engraved model to generate more verbose outputs and utilized its output as the teacher model to produce 358,777 verbose responses to (Topic-Guided Red-Teaming) Self-Instruct queries. The Dromedary (the final version) model is trained on this dataset, resulting in an AI assistant designed to be helpful, ethical, and reliable, developed from scratch with a base language model (without any SFT or RLHF), and achieved with minimal human supervision (less than 300 lines of human annotations).
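The counts reported in this appendix can be cross-checked with a few lines of arithmetic. The snippet below uses only numbers stated in the text; the dictionary keys are names chosen for this example, and the retention rates are derived quantities rather than figures reported by the authors.

```python
# Prompts generated before filtering (from the text).
generated = {"self_instruct": 267_597, "tgrt_self_instruct": 99_121}
# Query-response pairs kept after filtering (from the text).
kept = {"self_instruct": 191_628, "tgrt_self_instruct": 67_250}

total_generated = sum(generated.values())
total_kept = sum(kept.values())

# Fraction of generated prompts that survived filtering, per source.
retention = {name: kept[name] / generated[name] for name in generated}

# Human annotation budget: 175 seed prompts + 20 topic-specific prompts
# + 16 principles + 5 ICL exemplars.
annotation_items = 175 + 20 + 16 + 5

print("total generated:", total_generated)
print("total kept:", total_kept)
for name, rate in retention.items():
    print(f"retention[{name}] = {rate:.3f}")
print("annotation items:", annotation_items)
```

The kept total matches the 258,878 pairs used for principle engraving, and the annotation count is consistent with the stated bound of fewer than 300 lines.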

Appendix C Dromedary-2

After the recent release of LLaMA-2 with its extended 4k context length, we experimented with responses that adhere more closely to the general-specific-general style within the ICL Self-Align examples. The results are highly promising: the Dromedary-2 model, trained from LLaMA-2 with improved ICL exemplars, shows enhanced performance even without the verbose cloning phase or inference-time few-shot examples. For more details and a comparison with other models, please see Sun et al.

Appendix D Additional Experimental Details

LLaMA consists of a series of base language models with a parameter count ranging from 7 billion to 65 billion. These base models are solely trained to optimize the likelihood of next-word prediction in the language modeling task. For fair comparison, we employ the same prompt for LLaMA as used for Dromedary, detailed as follows.

Dromedary is the AI assistant developed by implementing the Self-Align process on the LLaMA-65b base language model. We investigate two variants: Dromedary (final) and Dromedary (non-verbose). The former is the model obtained by applying all four steps of the Self-Align process, while the latter is the principle-engraved model, which excludes the final step of verbose cloning. The details of the verbose prompt are presented in Appendix K.1.

The Text-Davinci-003 model is built on top of InstructGPT, with improved performance in several aspects over Text-Davinci-002, such as producing higher-quality writing, handling more complex instructions, and generating longer-form content.

GPT-3.5 (aka ChatGPT) is a sibling model of InstructGPT, specifically designed for conversational AI. It is trained to follow instructions and to generate detailed, contextually relevant responses. GPT-4 represents a significant leap in language model capabilities, exhibiting human-level performance on a wide range of professional and academic benchmarks. Both ChatGPT and GPT-4 are fine-tuned from the corresponding base language models with SFT (Supervised Fine-Tuning) and RLHF (Reinforcement Learning from Human Feedback).

Alpaca is a fine-tuned instruction-following language model derived from the LLaMA base model. It utilizes 52K instruction-following demonstrations generated through a cost-effective adaptation of the Self-Instruct method, in conjunction with Text-Davinci-003. Designed to address the research accessibility gap in academia, Alpaca exhibits qualitative similarities to Text-Davinci-003 in the single-turn instruction setting. For a fair comparison with Dromedary-65b, we employ a training methodology comparable to that of Dromedary, that is, fine-tuning the LoRA weights in the multi-head attention modules, to obtain our own reproduced Alpaca-65b model.

Vicuna is an open-source chatbot developed by fine-tuning a LLaMA base model on a dataset of approximately 70,000 user-shared conversations from ShareGPT.com, which effectively leverages the distilled knowledge from ChatGPT. The model's training process involves refining the loss function to account for multi-round conversations. A preliminary evaluation, utilizing GPT-4 as a judge, indicates that Vicuna attains over 90% quality in comparison to ChatGPT, while surpassing models like LLaMA and Alpaca in more than 90% of cases.

Dolly-V2 is an open-source, instruction-following LLM fine-tuned for research and commercial use. Based on the Pythia-12b model, Dolly-V2 is fine-tuned on a new high-quality dataset, databricks-dolly-15k, which consists of 15k human-generated prompt/response pairs crowdsourced among Databricks employees.

Anthropic-LM (or ALM) is not a publicly released model, so we directly report results from Bai et al. On BIG-bench HHH Eval, we report the results for both Context Distillation (CD) and Preference Model (PM) from Bai et al.

D.2 Hyperparameters

For both Self-Instruct and Topic-Guided Red-Teaming Self-Instruct, we set the maximal number of new tokens in the generation to 384. The new tokens are generated by nucleus sampling with a top-p threshold $p=0.98$ and temperature $t=1.0$.
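To make the roles of the top-p threshold and the temperature concrete, the following is a minimal, dependency-free sketch of top-p (nucleus) sampling over a single vector of logits. It is an illustration only, not the decoding code used in our experiments; the function name `nucleus_sample` and its signature are hypothetical.

```python
import math
import random


def nucleus_sample(logits, top_p=0.98, temperature=1.0, rng=random):
    """Sample one token index with temperature scaling and top-p filtering."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the nucleus; random.choices renormalises the weights.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

A lower temperature sharpens the distribution before filtering, and a lower top-p shrinks the candidate set, so the later stages (with smaller $p$ and $t$) sample more conservatively than this first stage.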

The aggregated principles and in-context learning demonstrations in Appendices G and H take around 1800 tokens under the LLaMA tokenizer, so we set the maximal number of new tokens in the generation to 256. The new tokens are generated by nucleus sampling with a top-p threshold $p=0.9$ and temperature $t=0.5$.

We fine-tune the base LLaMA-65b model on our aggregated Self-Instruct and Topic-Guided Red-Teaming Self-Instruct dataset for 1 epoch. We only fine-tune the LoRA weights in the multi-head attention modules (following https://github.com/huggingface/peft and https://github.com/tloen/alpaca-lora). We use a batch size of 768, a maximal sequence length of 512, and a max learning rate of $4 \times 10^{-4}$. A 1-epoch (approximately 335 steps) training schedule is used, where the learning rate increases (i.e., warm-up) in the first $100$ steps with a log curve, and decays linearly to zero in the rest of the training steps.
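The schedule described above can be sketched as a function of the step index. The exact form of the "log curve" is not specified in the text; the sketch assumes a warm-up proportional to $\log(1 + \text{step})$, scaled so that the maximum learning rate is reached when warm-up ends. The function name and default arguments are illustrative.

```python
import math


def learning_rate(step, total_steps=335, warmup_steps=100, max_lr=4e-4):
    """Learning rate at a 0-indexed optimizer step."""
    if step < warmup_steps:
        # ASSUMPTION: "log curve" is read as log(1 + step), normalised so the
        # curve would reach max_lr at step == warmup_steps.
        return max_lr * math.log(1 + step) / math.log(1 + warmup_steps)
    # Linear decay from max_lr (end of warm-up) to zero (end of training).
    remaining = total_steps - step
    return max_lr * max(0.0, remaining / (total_steps - warmup_steps))
```

With these defaults the rate starts at zero, rises quickly during the first few steps, peaks at step 100, and reaches zero at step 335.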

The teacher model (i.e., the principle-engraved model) uses the verbose-encouraging prompt to relabel all the queries generated by (Topic-Guided Red-Teaming) Self-Instruct. We set the maximal number of new tokens in the generation to 512. The new tokens are generated by nucleus sampling with a top-p threshold $p=0.7$ and temperature $t=0.3$, as well as a repetition penalty.

We fine-tune the base LLaMA-65b model on the dataset generated by the teacher model for 1 epoch. We only fine-tune the LoRA weights in the multi-head attention modules. We use a batch size of 768, a maximal sequence length of 768, and a max learning rate of $4 \times 10^{-4}$. A 1-epoch (approximately 465 steps) training schedule is used, where the learning rate increases (i.e., warm-up) in the first $100$ steps with a log curve, and decays linearly to zero in the rest of the training steps.
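As a sanity check, the approximate step counts follow from the dataset sizes and the batch size. The sketch below assumes that the principle-engraving set is the 258,878 filtered pairs plus the 910 dummy pairs and that the verbose-cloning set is the 358,777 teacher responses alone; the exact composition of each training set is an assumption.

```python
batch_size = 768

# ASSUMPTION: filtered query-response pairs plus the dummy data.
engraving_examples = 258_878 + 910
# ASSUMPTION: only the verbose responses produced by the teacher model.
cloning_examples = 358_777

engraving_steps = engraving_examples / batch_size
cloning_steps = cloning_examples / batch_size

print(round(engraving_steps), round(cloning_steps))  # prints: 338 467
```

Both values are within a few steps of the reported "approximately 335" and "approximately 465"; the small gaps may stem from details not stated here, such as additional length filtering.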

D.3 Benchmark Datasets

The TruthfulQA benchmark evaluates a model’s ability to identify true claims, specifically in the context of literal truth about the real world. The goal is to assess the risks of generating false claims or misinformation. The benchmark includes questions written in diverse styles, covering 38 categories, and designed to be adversarial. The benchmark includes two evaluation tasks: the multiple-choice task and the generation task.

The BIG-bench HHH Eval was specifically designed to evaluate a model’s performance in terms of helpfulness, honesty, and harmlessness (HHH). The dataset’s creators developed approximately 50 comparison evaluations for each category, including an ’other’ label, resulting in a total of around 200 comparisons. The dataset’s purpose is to assess both model alignment and capabilities without explicitly distinguishing between these two aspects.

Chiang et al. introduced an evaluation framework leveraging GPT-4 to automate the assessment of chatbot performance. This framework employs a diverse array of question categories, such as Fermi problems, roleplay scenarios, and coding/math tasks, to evaluate chatbot capabilities. GPT-4 generates challenging questions across these categories, and answers from five chatbots (LLaMA, Alpaca, ChatGPT, Bard, and Vicuna) are collected in Chiang et al. We directly use this data to compare the performance of Dromedary with these chatbots.

Appendix E Additional Analysis on Vicuna Benchmark Questions

To provide deeper insights into the performance of Dromedary in comparison to other baseline models, an analysis of each question category is presented in Fig. 7. The results indicate that, when compared to other LLaMA-based baseline models such as Alpaca or Vicuna, Dromedary consistently outperforms them in question categories that demand more reasoning abilities, such as “fermi”, “counterfactual”, “coding”, and “math”. Similar relative strengths and weaknesses of Dromedary can also be observed in the zero-shot setting, as shown in Fig. 8.

However, when juxtaposed with ChatGPT and its distilled version Vicuna, Dromedary falls short and does not demonstrate competitive performance, particularly in question categories that necessitate a comprehensive organization of responses, such as “knowledge”, “roleplay”, “common-sense”, and “writing”.

Appendix F Many More Samples

Questions are drawn from the sensitive questions in Shaikh et al. Please be warned that some of these questions contain harmful material (by design).

Questions are drawn from the sensitive questions in Solaiman and Dennison. Please be warned that some of these questions contain sensitive material (by design).

These are some prompts drawn from Thoppilan et al.

We use two prompts in Ganguli et al. to demonstrate the moral self-correction in Dromedary without additional instructions. One prompt is from Bias Benchmark for QA (BBQ), and the other is from Winogender.

We use the first prompt in each category from the Vicuna benchmark questions.

Due to a LaTeX formatting error, we provide Dromedary's output for the coding prompt below and omit the responses from other models:

In the preliminary evaluation of the final Dromedary model, we identified two prominent failure modes that still impacted its performance. Addressing these shortcomings comprehensively requires further investigation and development in future work.

Failure Mode II: Inability to Strictly Adhere to Pre-Defined Principles (such as hallucinating misinformation, which violates the candor rule)

Appendix G Principles in Principle-Driven Self-Alignment

The former codename of our developed AI model was Watson. To ensure reproducibility, we have included the original prompt here. Furthermore, we found that “Watson” seems to be a more suitable name than “Dromedary” for the Principle-Driven Self-Alignment stage, as it uses fewer tokens. This enables us to utilize more tokens for the model’s output.

Appendix H In-Context Learning Demonstrations for Principle-Driven Self-Alignment

Appendix I Prompts for Principle Engraving

From the Principle Engraving step onward, we replace the deprecated codename "Watson" with "Dromedary" in all responses generated by Principle-Driven Self-Alignment. In the Principle Engraving step, the target (fine-tuned) model is prompted with the following introduction prompt:

Appendix J Prompts for Verbose Cloning

In the Verbose Cloning stage, the teacher model (i.e., the principle engraved model) is prompted with the following text to facilitate the generation of extensive, comprehensive, and detailed responses.

The final Self-Aligned model is fine-tuned on the pairs of "[User Query]" and "[Dromedary (extensive) Response]" as supervision with the following prompt (standard):

Appendix K Inference Prompts

The final Dromedary model is trained with a mixture of standard prompt and introduction prompt as shown above, but we discovered that we can influence Dromedary’s behavior by altering the prompts during the inference stage. In this section, we present two prompts that we employed for this purpose.

K.2 Prompts for multilingual outputs

We call it the multilingual prompt. The prompt below is slightly modified in order to display non-English characters. The original multilingual prompt can be found in our codebase.

Appendix L 20 Seed Prompts for Topic-Guided Red-Teaming Self-Instruct

Appendix M Instruction Prompts for Topic-Guided Red-Teaming Self-Instruct

Topic-Guided Red-Teaming Self-Instruct has two steps. In the first step, we use the base LLM to generate novel topics related to a given instruction (question) type. Some instructions are taken from the Alpaca project (https://github.com/tatsu-lab/stanford_alpaca/blob/main/prompt.txt).

In the second step, we prompt the base LLM with deduplicated topics and their instruction types to generate new questions.
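The two steps can be summarised in a short sketch. The `generate` callable stands in for a call to the base LLM and is hypothetical, as are the two prompt templates and the case-insensitive deduplication rule; the actual instruction prompts are those listed in this appendix.

```python
from typing import Callable, Dict, List


def tgrt_self_instruct(
    instruction_types: List[str],
    generate: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Two-step Topic-Guided Red-Teaming Self-Instruct (illustrative only)."""
    questions: Dict[str, List[str]] = {}
    for itype in instruction_types:
        # Step 1: ask the base LLM for novel topics for this instruction type.
        # (Hypothetical template, not the prompt used in the paper.)
        topics = generate(f"List new topics for questions that {itype}.")

        # Deduplicate topics case-insensitively while preserving order.
        seen, unique_topics = set(), []
        for topic in topics:
            key = topic.strip().lower()
            if key and key not in seen:
                seen.add(key)
                unique_topics.append(topic.strip())

        # Step 2: ask for new questions for each (instruction type, topic) pair.
        questions[itype] = [
            question
            for topic in unique_topics
            for question in generate(
                f"Write a question about '{topic}' that {itype}."
            )
        ]
    return questions
```

Passing a stub for `generate` that returns canned lists is enough to exercise the control flow without a model.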

Appendix N Evaluation Prompts for MC Benchmarks

We assess the likelihood of true and false as the score for each answer candidate.

We assess the likelihood of A and B as the scores for two answer candidates. Since the correct answer is consistently A in the original dataset, we aggregate the scores of the options by swapping their positions.
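The position-swapping step can be illustrated as follows. The `log_prob` callable, which returns the model's log-likelihood of an option label given a prompt, is hypothetical, as is the prompt template; summing log-likelihoods over the two orderings is one possible reading of "aggregate" and is an assumption.

```python
from typing import Callable


def hhh_prefers_first(
    query: str,
    answer_1: str,
    answer_2: str,
    log_prob: Callable[[str, str], float],
) -> bool:
    """Return True if answer_1 is preferred after scoring both option orders."""

    def prompt(a: str, b: str) -> str:
        # Hypothetical template; the actual evaluation prompt is given in this appendix.
        return f"{query}\nA. {a}\nB. {b}\nAnswer:"

    original = prompt(answer_1, answer_2)  # answer_1 shown as option A
    swapped = prompt(answer_2, answer_1)   # answer_1 shown as option B

    # ASSUMPTION: aggregate by summing log-likelihoods across both orders.
    score_1 = log_prob(original, "A") + log_prob(swapped, "B")
    score_2 = log_prob(original, "B") + log_prob(swapped, "A")
    return score_1 > score_2
```

Because each answer appears once as A and once as B, a model that simply favours one label regardless of content obtains equal scores for both answers and gains no advantage from the dataset's fixed answer position.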

Appendix O Few-Shot Prompts for Vicuna Benchmark Questions

Inspired by Vicuna (https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py), we use two conversation examples as ICL to improve the response quality of Dromedary. The prompt we used is a combination of our original verbose prompt, the “renewable energy” prompt taken from Vicuna (https://github.com/lm-sys/FastChat/blob/75d8ab26ee308f9cf0990976508232f06dd421e4/fastchat/conversation.py#LL204C3-L204C3), and an additional conversation example on a coding problem.

Appendix P Additional Related Work

The field of natural language processing has witnessed remarkable advancements with the advent of Large Language Models (LLMs), which have significantly improved various NLP tasks. The introduction of the Transformer architecture laid the groundwork for the development of these powerful language models (Devlin et al. 12, Radford et al. 33, Lewis et al. 21, Raffel et al. 34, Brown et al. 7, Chowdhery et al. 9, Zhang et al. 52, Scao et al. 36, Touvron et al. 44, inter alia). Among them, GPT-3 has been particularly influential, showcasing an exceptional capacity to adapt to diverse tasks through the in-context learning capabilities of LLMs. Recently, LLaMA has emerged as a pivotal open-source base language model, driving a series of open-source breakthroughs that strive to keep pace with the closed-source frontier in the field.

Appendix Q Cherry-Picked Demonstration Examples of Principle-Driven Self-Alignment

In order to provide readers with a comprehensive understanding of how the Self-Align methodology aids in the development of AI models that are helpful, ethical, and reliable, we will explore the principle-driven self-alignment process through a selection of illustrative examples. Each example has been chosen to effectively demonstrate a specific case within the self-alignment framework.

1 (ethical). Dromedary should actively refrain users on illegal, immoral, or harmful topics, prioritizing user safety, ethical conduct, and responsible behavior in its responses.

2 (informative). Dromedary should provide users with accurate, relevant, and up-to-date information in its responses, ensuring that the content is both educational and engaging.

3 (helpful). Dromedary’s responses should be positive, interesting, helpful and engaging.

4 (question assessment). Dromedary should first assess whether the question is valid and ethical before attempting to provide a response.

5 (reasoning). Dromedary’s logics and reasoning should be rigorous, intelligent and defensible.

6 (multi-aspect). Dromedary can provide additional relevant details to respond thoroughly and comprehensively to cover multiple aspects in depth.

7 (candor). Dromedary should admit its lack of knowledge when the information is not in Dromedary’s internal knowledge.

8 (knowledge recitation). When a user’s question pertains to an entity that exists on Dromedary’s knowledge bases, such as Wikipedia, Dromedary should recite related paragraphs to ground its answer.

9 (static). Dromedary is a static model and cannot provide real-time information.

10 (clarification). If the provided information is insufficient or the question is ambiguous, Dromedary ought to request the user to provide further clarification on their query.

11 (numerical sensitivity). Dromedary should be sensitive to the numerical information provided by the user, accurately interpreting and incorporating it into the response.

12 (dated knowledge). Dromedary’s internal knowledge and information were only current until some point in the year of 2021, and could be inaccurate / lossy.

13 (step-by-step). When offering explanations or solutions, Dromedary should present step-by-step justifications prior to delivering the answer.

14 (balanced & informative perspectives). In discussing controversial topics, Dromedary should fairly and impartially present extensive arguments from both sides.

15 (creative). Dromedary can create novel poems, stories, code (programs), essays, songs, celebrity parodies, summaries, translations, and more.

16 (operational). Dromedary should attempt to provide an answer for tasks that are operational for a computer.
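To illustrate how a numbered principle list, a few in-context exemplars, and an incoming query might be combined into a single prompt, consider the sketch below. It uses three of the sixteen principles for brevity; the layout produced by `build_prompt`, the `User:`/`Dromedary:` turn markers, and the function name itself are hypothetical and do not reproduce the original prompt, which is given in Appendices G and H.

```python
from typing import List, Tuple

# Three of the sixteen principles listed above: (number, name, text).
PRINCIPLES: List[Tuple[int, str, str]] = [
    (3, "helpful",
     "Dromedary's responses should be positive, interesting, helpful and engaging."),
    (7, "candor",
     "Dromedary should admit its lack of knowledge when the information "
     "is not in Dromedary's internal knowledge."),
    (9, "static",
     "Dromedary is a static model and cannot provide real-time information."),
]


def build_prompt(
    principles: List[Tuple[int, str, str]],
    exemplars: List[Tuple[str, str]],
    query: str,
) -> str:
    """Assemble principles, ICL exemplars, and the new query (hypothetical layout)."""
    rules = "\n".join(f"{num} ({name}). {text}" for num, name, text in principles)
    demos = "\n\n".join(f"User: {q}\nDromedary: {a}" for q, a in exemplars)
    # The prompt ends with an open assistant turn for the model to complete.
    return f"{rules}\n\n{demos}\n\nUser: {query}\nDromedary:"
```

For example, `build_prompt(PRINCIPLES, [("example question", "example answer")], "new question")` yields the rule list, one demonstration turn, and an open turn for the new query.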