Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou

Introduction

The instruction-following ability of large language models (LLMs) refers to their capacity to understand, interpret, and execute commands given to them in natural language (Lou et al., 2023; OpenAI et al., 2024). This ability is fundamental to contemporary LLMs as it enables them to leverage their underlying knowledge, interact intuitively with users (Ouyang et al., 2022), adapt to various requirements (Zhang et al., 2023), and perform complex tasks (Sun et al., 2024). Misunderstandings in following instructions can lead to unintended outcomes, potentially resulting in severe consequences, particularly in critical scenarios (Zhou et al., 2023; Chang et al., 2024).

Although instruction following is crucial, scalable and reliable methods to enhance this capability of LLMs remain elusive. Current efforts in this field are divided into manual annotation (Wei et al., 2021; Zhou et al., 2023; Jiang et al., 2024b) and behavior imitation (Xu et al., 2023; Zhao et al., 2024). Manual annotation involves annotators designing instructions and writing corresponding responses. However, due to human cognition’s limitations, creating highly complex and diverse instructions is challenging, making the process difficult to scale. Furthermore, accurately executing complex instructions can sometimes be difficult for humans (Sun et al., 2024; Cao et al., 2024b), requiring multiple rounds of rigorous and costly validation (Wang et al., 2024a; Wei et al., 2024). On the other hand, behavior imitation aims to distill responses from more advanced LLMs (Taori et al., 2023; Peng et al., 2023) like GPT-4. This approach limits models to the capabilities of the advanced LLMs from which they are distilled. Moreover, even advanced LLMs can make mistakes, and the reliability of the distilled data cannot be guaranteed (Cui et al., 2023). Consequently, models trained with this data may have a propensity to not follow instructions accurately (Zhou et al., 2024).

In this paper, we introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data for Supervised Fine-tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022). The core idea of AutoIF is to use code to verify whether instructions have been followed correctly. Intuitively, if designed properly, a significant portion of instructions, such as “Keep your response under 20 characters in length”, can be verified for correctness using code, as illustrated in Figure 1. Therefore, the key components of AutoIF include (1) automatically generating instructions that can be verified by code, (2) automatically generating corresponding verification code for these instructions, and (3) ensuring the reliability of the first two steps. Specifically, we start by providing AutoIF with a small set of hand-written seed instructions. Then, LLMs, not necessarily advanced ones, generate an augmented instruction set through self-instruct (Wang et al., 2023). Next, LLMs write verification code and unit test cases for each instruction. Only the code that compiles correctly, passes the test cases, and back-translates to the original instruction is retained. If an instruction does not have corresponding code that can verify its correctness, it is discarded. Finally, we employ LLMs to generate responses that either pass or fail the verification code using execution feedback-based rejection sampling (Yuan et al., 2023). Responses that pass can be directly used for SFT, while pairs of passing and failing responses can be used to create chosen-rejected pairs for Direct Preference Optimization (DPO) (Rafailov et al., 2023) and other RLHF algorithms. Moreover, once the instructions and verification code are determined, this process can be conducted on-policy, continually enhancing the instruction-following capabilities.

Through extensive experiments, we have demonstrated that AutoIF achieves significant improvements under three training algorithms—SFT, Offline DPO, and Online DPO—when applied to the top open-source LLMs, Qwen2-72B and LLaMA3-70B, in both self-alignment and strong-to-weak distillation settings. In the IFEval benchmark, we achieved Loose Instruction (Acc.) rates of up to 88.0% with Qwen2-72B and 90.4% with LLaMA3-70B, marking the first instance of surpassing 90% accuracy. In the FollowBench benchmark, these models also showed significant improvements, with increases of over 5% in the SSR metric (avg). Additionally, they enabled Qwen2-7B and LLaMA3-8B to achieve average performance gains of over 4% in both benchmarks. Replacing Qwen2-72B and LLaMA3-70B with the more advanced GPT-4 resulted in further substantial improvements. We will open-source the SFT and DPO datasets constructed using AutoIF on Qwen2-72B, representing the first open-source complex instruction-following dataset at a scale of tens of thousands.

Related Works

Instruction-following capabilities are among the most essential features of LLMs (OpenAI et al., 2024; Lou et al., 2023), which are expected to precisely follow a broad and complex set of instructions. Consequently, recent research has concentrated on evaluating LLMs’ instruction-following abilities in various contexts, such as verifiable (Zhou et al., 2023), compositional (Qin et al., 2024), format-related (Xia et al., 2024), refuting (Yan et al., 2024), and fine-grained instructions (Jiang et al., 2024b). However, a significant gap remains between open-source and proprietary closed-source LLMs. Sun et al. (2024) propose Conifer, which enhances the instruction-following capabilities of open-source LLMs through knowledge distillation from proprietary LLMs. Wang et al. (2024b) use LLMs to encode instruction metadata and augment diverse instructions from this metadata, employing proprietary LLMs for quality control. Both approaches, however, rely on proprietary LLMs for response distillation or judgment, which not only limits their potential but also subjects them to OpenAI’s terms of use (https://openai.com/policies/terms-of-use). In this work, we propose AutoIF, a more scalable and reliable method to enhance the instruction-following capabilities of LLMs. AutoIF uses execution feedback from self-generated verification functions to provide supervision for instructions. This allows for effective self-alignment and strong-to-weak distillation on open-source models, thereby narrowing the performance gap with proprietary LLMs.

Learning with Execution Feedback is a widely-used technique in automated alignment for tool use and coding (Cao et al., 2024a). These learning methods typically utilize execution feedback from tools such as code executors to provide supervision for specific tasks. For instance, Le et al. (2022) employ feedback from unit tests via code compilers to enhance code synthesis capabilities through reinforcement learning. Similarly, Chen et al. (2023) train LLMs to provide debugging suggestions as feedback to improve coding abilities. Additionally, Qiao et al. (2024) introduce Reinforcement Learning with execution feedback to enhance LLMs using execution results from tools. Building on this learning paradigm, we propose a novel scalable oversight method that enables LLMs to autonomously generate verification functions and unit tests for natural language instructions, thereby applying execution feedback to enhance their instruction-following capabilities.

AutoIF

We introduce AutoIF, an automated, scalable, and reliable method designed to enhance the instruction-following capabilities of LLMs. In this section, we outline the preliminaries (Section 3.1), detail the two core components of AutoIF (Section 3.2, Section 3.3), and discuss various training strategies that can be seamlessly integrated with AutoIF (Section 3.4).

Instruction-following Capabilities. Following instructions is one of the most crucial skills in modern LLMs. These models are expected to provide precise responses to queries containing complex instructions, which can be either atomic or compositional. To evaluate the instruction-following capability of LLMs, we define a general instruction-following requirement as a specific task. In this task, given an instruction $I=\{i_{j}\}_{j=1}^{N}$ with $N$ specific constraints (e.g., “Please generate text in Shakespearean style, no more than 50 tokens” contains 2 constraints) and a specific query $x$, an LLM $\pi_{\theta}$ should generate a precise response $y\sim\pi_{\theta}(y\mid x,I)$ adhering to the constraints.

Verifiable Instructions. The complexity and diversity of instructions necessitate manual construction and verification for reliable supervision. This practical challenge motivates us to focus initially on instructions that can be automatically verified through programs and code executors, also known as verifiable instructions (Zhou et al., 2023). Specifically, for a given instruction $I$ and task-specific query $x$, there exists a verification function $f_{I}$ such that $f_{I}(y)$ returns true when the model’s response $y$ correctly follows the instruction. We demonstrate that supervision of such instructions can be self-generated through scalable oversight with LLMs and execution feedback. Extensive experiments in our work show that training on verifiable instructions significantly benefits the handling of other general instructions that are more complex but unverifiable with simple code snippets.
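As a minimal illustration of a verifiable instruction, consider the constraint quoted in the introduction, “Keep your response under 20 characters in length”. The sketch below is hypothetical: the function name `evaluate` and the way length is measured are assumptions made for illustration, not the verification code that AutoIF actually generates.

```python
def evaluate(response: str) -> bool:
    """Hypothetical verification function f_I for the instruction
    'Keep your response under 20 characters in length'."""
    # Assumption: length is counted in Python characters, whitespace included.
    return len(response) < 20


# A short answer satisfies the constraint; a long one does not.
print(evaluate("Paris."))
print(evaluate("The capital of France is Paris."))
```

Because the check is ordinary code, the same function can score any number of candidate responses without human involvement.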

Method Overview. AutoIF synthesizes high-quality instruction-following data through self-evolution, rejection sampling, and execution feedback. As illustrated in Figure 2, AutoIF integrates automated data augmentation with quality verification processes, including automatically generated verification functions and back-translation instructions. This approach enables a two-stage automated data synthesis at both the instruction (Section 3.2) and query levels (Section 3.3). Additionally, we introduce three training strategies (Section 3.4) and explore two experimental settings (Section 4.1) to thoroughly evaluate the effectiveness and generalization of AutoIF.

3.2 Instruction Augmentation and Verification

We first develop verifiable instructions along with corresponding evaluation functions, using rejection sampling informed by execution feedback.

Seed Instruction Construction. We start by handwriting a set of seed instructions, denoted as $D_{seed}$, ensuring that each instruction contains only a single atomic constraint (e.g., “Answer the words that begin with B”). Detailed information on seed instructions is listed in Appendix A.

Self-Instruct. Self-Instruct (Wang et al., 2023) is a straightforward and intuitive strategy for automated data augmentation that has garnered significant attention in the field of LLM reasoning (Xu et al., 2023; Zhao et al., 2023). For each instruction in $D_{seed}$, we use an LLM to perform $K$ instruction rewrites, generating $D_{aug}$. We then combine the seed and augmented data sets to obtain an enhanced set of instructions, $D_{ins}=D_{seed}\cup D_{aug}$, and remove any duplicates.
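The merge-and-deduplicate step can be sketched as follows. This is an illustrative sketch only: `rewrite` stands in for the LLM call, `toy_rewrite` is a made-up replacement so the example runs offline, and the case- and whitespace-insensitive duplicate key is an assumption, since the text does not specify how duplicates are detected.

```python
from typing import Callable, List


def augment_instructions(
    seeds: List[str],
    rewrite: Callable[[str, int], List[str]],
    k: int,
) -> List[str]:
    """Build D_ins = D_seed ∪ D_aug and drop duplicates, preserving order."""
    augmented = [new for seed in seeds for new in rewrite(seed, k)]
    seen = set()
    merged = []
    for instruction in list(seeds) + augmented:
        # Assumption: duplicates are compared ignoring case and extra whitespace.
        key = " ".join(instruction.lower().split())
        if key not in seen:
            seen.add(key)
            merged.append(instruction)
    return merged


def toy_rewrite(seed: str, k: int) -> List[str]:
    """Hypothetical stand-in for the LLM: one duplicate plus k-1 new variants."""
    return [seed.upper()] + [f"{seed} Use at most {n} words." for n in range(1, k)]


seeds = ["Answer the words that begin with B.", "Use exactly two sentences."]
d_ins = augment_instructions(seeds, toy_rewrite, k=3)
print(len(d_ins))
```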

Automated Quality Cross Verification. Previous research has shown that relying solely on model-generated augmented instructions often leads to the inclusion of low-quality samples (Mumuni & Mumuni, 2022; Xie et al., 2020; Zheng et al., 2024). Inspired by a series of tool execution studies, we employ an LLM to generate verification functions and test cases for each instruction. We use feedback from executing Python programs to ensure quality control. Given the instruction set $D_{ins}$, the LLM $M$ employs rejection sampling (Touvron et al., 2023; Yuan et al., 2023) to generate $K$ verification functions $f_{I}=\{f_{j}\}_{j=1}^{K}$ and test cases $c_{I}=\{c_{j}\}_{j=1}^{K}$ for each instruction $I$, resulting in the set $\{I,f_{I},c_{I}\}\in D_{ins}$. We then cross-validate the quality of the instructions using the verification functions and test cases, ensuring they meet the following criteria:

The verification function $f\in f_{I}$ can be successfully compiled by the Python executor.

Each test case $c\in c_{I}$ achieves an accuracy rate greater than 0.5 across all verification functions.

Each verification function $f\in f_{I}$ achieves an accuracy rate greater than 0.5 across all test cases.

Each instruction includes at least one evaluation function and test case.

By adhering to these four conditions, we obtain the quality-filtered instruction set $\{I^{(2)},f_{I}^{(2)}\}\in D_{ins}^{(2)}$.
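The four criteria above can be sketched as a small filtering routine. This is a simplified illustration under stated assumptions: each generated function is assumed to be source code defining `evaluate(response) -> bool`, a test case is assumed to be a `(response, expected_verdict)` pair, and both accuracy rates are computed against the full, unfiltered sets. Generated code should only ever be executed inside a sandbox.

```python
from typing import Callable, Dict, List, Tuple

TestCase = Tuple[str, bool]  # (candidate response, expected verdict)


def compile_functions(sources: List[str]) -> List[Callable[[str], bool]]:
    """Criterion 1: keep only sources that execute and define `evaluate`."""
    compiled = []
    for src in sources:
        namespace: Dict[str, object] = {}
        try:
            exec(src, namespace)  # untrusted code: sandbox this in practice
        except Exception:
            continue
        fn = namespace.get("evaluate")
        if callable(fn):
            compiled.append(fn)
    return compiled


def agrees(fn: Callable[[str], bool], case: TestCase) -> bool:
    """True if the function's verdict matches the test case's expected verdict."""
    response, expected = case
    try:
        return bool(fn(response)) == expected
    except Exception:
        return False


def cross_validate(sources: List[str], cases: List[TestCase], threshold: float = 0.5):
    funcs = compile_functions(sources)
    if not funcs or not cases:
        return [], []  # criterion 4 fails: the instruction is discarded
    # Criterion 2: each kept test case agrees with > threshold of the functions.
    kept_cases = [
        c for c in cases if sum(agrees(f, c) for f in funcs) / len(funcs) > threshold
    ]
    # Criterion 3: each kept function agrees with > threshold of the test cases.
    kept_funcs = [
        f for f in funcs if sum(agrees(f, c) for c in cases) / len(cases) > threshold
    ]
    return kept_funcs, kept_cases


sources = [
    "def evaluate(response):\n    return len(response) < 20\n",
    "def evaluate(response):\n    return len(response) <= 19\n",
    "def evaluate(response)\n    return True\n",               # does not compile
    "def evaluate(response):\n    return len(response) > 20\n",  # wrong logic
]
cases = [("Hi", True), ("x" * 30, False), ("Hello there", False)]  # last label is wrong
kept_funcs, kept_cases = cross_validate(sources, cases)
print(len(kept_funcs), len(kept_cases))
```

An instruction is retained only if both returned lists are non-empty, which corresponds to the fourth criterion.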

Back-translation Verification. After the cross-validation stage, we obtain an initial set of quality-verified verification functions and instructions. To further ensure the consistency between instructions and verification functions, we introduce back-translation. For a given pair $\{I^{(2)},f_{I}^{(2)}\}\in D_{ins}^{(2)}$, we use the LLM $M$ to back-translate the verification function $f\in f_{I}^{(2)}$ into an instruction $I_{f}$. We then treat $I$ as the premise and the back-translated instruction $I_{f}$ as the hypothesis. Using the NLI model, we identify the semantic relationship between the two instructions. The prediction can fall into one of three categories: entailment, contradiction, or neutral:

$$p_{\theta}(\cdot\mid I,I_{f})=\operatorname{softmax}\left(\operatorname{score}_{\theta}(I,I_{f})\right)$$

where $\operatorname{score}_{\theta}:\mathbb{R}^{k\times\ell_{I}}\times\mathbb{R}^{k\times\ell_{I_{f}}}\rightarrow\mathbb{R}^{3}$ is a model-dependent scoring function with parameters $\theta$. We filter out any instruction $I$ labeled as contradiction to ensure intent consistency. Finally, we obtain the set $\{I^{(3)},f_{I}^{(3)}\}\in D_{ins}^{(3)}$.
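The contradiction filter can be sketched as below. This is an illustration under assumptions: `score_fn` stands in for the NLI model and returns three raw scores, the label order in `LABELS` is assumed, and `toy_score` is a made-up keyword heuristic used only to make the example runnable.

```python
import math
from typing import Callable, List, Sequence, Tuple

LABELS = ("entailment", "neutral", "contradiction")  # assumed label order


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nli_label(premise: str, hypothesis: str, score_fn: Callable) -> str:
    """Return the most probable NLI label for (premise, hypothesis)."""
    probs = softmax(score_fn(premise, hypothesis))
    return LABELS[probs.index(max(probs))]


def filter_by_back_translation(
    pairs: List[Tuple[str, str]], score_fn: Callable
) -> List[Tuple[str, str]]:
    """Keep (I, I_f) pairs unless the NLI prediction is 'contradiction'."""
    return [
        (ins, back) for ins, back in pairs
        if nli_label(ins, back, score_fn) != "contradiction"
    ]


def toy_score(premise: str, hypothesis: str) -> List[float]:
    """Hypothetical stand-in for the NLI model's three output scores."""
    if premise == hypothesis:
        return [5.0, 0.0, -5.0]
    if ("under" in premise) != ("under" in hypothesis):
        return [-5.0, 0.0, 5.0]
    return [0.0, 2.0, 0.0]


pairs = [
    ("Keep the response under 20 characters", "Keep the response under 20 characters"),
    ("Keep the response under 20 characters", "Write over 20 characters"),
    ("Answer in uppercase", "Use capital letters only"),
]
print(len(filter_by_back_translation(pairs, toy_score)))
```

Note that neutral predictions are retained; only contradictions are discarded, matching the filtering rule in the text.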

3.3 Query Augmentation and Verification

Once we have obtained verified instructions and verification functions, we utilize them to create training data comprising queries and responses.

Query Reforming and Augmentation. In the real-world application of modern chatbots, instructions are typically employed to generate constrained responses to user queries. Therefore, creating high-quality instructions is merely the initial step toward achieving effective instruction-following capabilities. To acquire authentic queries, as shown in the bottom part of Figure 2, we randomly selected $K$ user queries from ShareGPT (Chiang et al., 2023) for each instruction and concatenated them to construct the seed query dataset $\{x,f_{I}^{(3)}\}\in D_{q}$. To further enhance the diversity and complexity of the input $x$, we utilized the LLM to generate $K$ responses $y_{x}=\{y_{i}\}_{i=1}^{K}$, resulting in $\{x,f_{I}^{(3)},y_{x}\}\in D_{q}$.

Instruction-following Verification. Following the previous quality cross-verification process, we further employ verification functions to assess whether the augmented responses adhere to the constraints in input $x$. Similarly, we require each response in $D_{q}$ to meet the following conditions:

Each response must achieve an accuracy rate greater than 0.5 across all verification functions.

Each input must include at least one verification function and one response.

Based on these rules, we obtain the set $(x^{(2)},f_{I}^{(3)},y^{(2)})\in D_{q}^{(2)}$.

Query Quality Verification. Additionally, we observe that concatenated instructions and queries often conflict. For instance, a high-quality response to the query “help me write a news article” is unlikely to comply with the instruction “please limit your answer to two words”. Such high-level semantic inconsistencies are challenging for a simple NLI model to discern. Therefore, we employ the LLM $M$ to assign matching scores between the instruction and query in input $x^{(2)}$ and the corresponding responses $y^{(2)}$, on a scale from 1 to 10. We then filter out samples with a score lower than 8, constructing the final training set $D_{\text{train}}=\{x_{i},y_{i},f_{I_{i}}\}_{i=1}^{N}$.
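The two query-level filters (execution-based pass rate, then LLM matching score) can be combined as in the following sketch. The data layout (`query`, `funcs`, `responses` keys) and the `judge_score` callable, which stands in for the LLM scorer, are assumptions introduced for illustration.

```python
from typing import Callable, Dict, List


def pass_rate(response: str, funcs: List[Callable[[str], bool]]) -> float:
    """Fraction of verification functions that accept the response."""
    if not funcs:
        return 0.0
    passed = 0
    for fn in funcs:
        try:
            passed += bool(fn(response))
        except Exception:
            pass  # a crashing function counts as a failure
    return passed / len(funcs)


def build_training_set(
    samples: List[Dict],
    judge_score: Callable[[str, str], int],
    rate_threshold: float = 0.5,
    min_score: int = 8,
) -> List[Dict[str, str]]:
    """Keep responses with pass rate > 0.5 whose matching score is at least 8."""
    train = []
    for sample in samples:
        for y in sample["responses"]:
            if pass_rate(y, sample["funcs"]) <= rate_threshold:
                continue  # fails instruction-following verification
            if judge_score(sample["query"], y) < min_score:
                continue  # fails query quality verification
            train.append({"x": sample["query"], "y": y})
    return train


samples = [{
    "query": "Name a fruit. Answer in at most five words.",
    "funcs": [lambda y: len(y.split()) <= 5],
    "responses": ["An apple.", "This response clearly has far more than five words."],
}]
print(build_training_set(samples, judge_score=lambda q, y: 9))
```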

3.4 Training Strategies

AutoIF offers multifaceted supervision for the instruction-following task, making it adaptable to various training strategies. To thoroughly evaluate the effectiveness of AutoIF, we propose the following training approaches:

Supervised Fine-tuning (SFT). Given $(x_{i},y_{i})\in D_{\text{train}}$, we apply the standard Supervised Fine-tuning (SFT) objective on the base model $P$ with parameters $\theta$: $\mathcal{L}(\theta)=\sum_{(x_{i},y_{i})\in D_{\text{train}}}\log P_{\theta}(y_{i}\mid x_{i})$, where $x_{i}$ denotes the $i$-th input, consisting of a concatenated instruction and user query.
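For intuition, the objective can be evaluated by hand on toy numbers. The sketch below assumes the usual autoregressive factorization, so that $\log P_{\theta}(y_{i}\mid x_{i})$ is the sum of per-token log-probabilities of the response; the probabilities are made up for illustration. In practice, training frameworks minimize the negative of this quantity.

```python
import math
from typing import List


def sft_objective(batch_token_probs: List[List[float]]) -> float:
    """Sum over samples of log P(y_i | x_i), factored over response tokens.

    batch_token_probs[i][t] is the (toy) model probability of the t-th
    response token of sample i given the input and the preceding tokens.
    """
    return sum(math.log(p) for probs in batch_token_probs for p in probs)


# Two samples: a two-token response and a one-token response.
value = sft_objective([[0.5, 0.5], [0.25]])
print(value)  # equals 2 * log(0.25)
```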

SFT + Offline DPO. In the process of AutoIF, multiple scales of quality filtering are utilized, naturally generating a substantial number of positive and negative sample pairs. This motivates us to obtain pairwise preference data $(x,y_{w},y_{l})$. Our preference data mining is divided into two parts:

Instruction Level: During the automated quality cross-verification stage, we first extract positive samples $c_{w}$ from cases with an accuracy rate higher than 0.5 on all verification functions and negative samples $c_{l}$ from cases with an accuracy rate of 0. We then construct pairwise preference data for each instruction: $D_{\text{ins}}^{\text{pref}}\rightarrow(I,c_{w},c_{l})$.

Query Level: In the query quality verification process, we similarly extract positive samples $y_{w}$ from responses with an accuracy rate higher than 0.5 on all verification functions and negative samples $y_{l}$ from responses with an accuracy rate of 0. We then construct query preference data: $D_{\text{query}}^{\text{pref}}\rightarrow(x,y_{w},y_{l})$.
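The pair-mining rule described above (positives with pass rate above 0.5, negatives with pass rate exactly 0) can be sketched as follows. How positives and negatives are matched is not specified in the text; pairing every positive with every negative is an assumption of this sketch.

```python
from itertools import product
from typing import List, Tuple


def mine_preference_pairs(
    x: str,
    scored: List[Tuple[str, float]],
    pos_threshold: float = 0.5,
) -> List[Tuple[str, str, str]]:
    """Build (x, y_w, y_l) triples from (response, pass_rate) pairs."""
    winners = [y for y, rate in scored if rate > pos_threshold]
    losers = [y for y, rate in scored if rate == 0.0]
    # Responses with 0 < pass rate <= 0.5 are used neither as y_w nor as y_l.
    # Assumption: every winner is paired with every loser.
    return [(x, w, l) for w, l in product(winners, losers)]


scored = [("resp_a", 1.0), ("resp_b", 0.75), ("resp_c", 0.25), ("resp_d", 0.0)]
for triple in mine_preference_pairs("prompt", scored):
    print(triple)
```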

Finally, we merge the two parts of the data: $D_{\text{pref}}=D_{\text{ins}}^{\text{pref}}\cup D_{\text{query}}^{\text{pref}}$. To further explore the potential of pairwise preference data $(x,y_{w},y_{l})\in D_{\text{pref}}$, we first perform vanilla SFT on the base model $\pi_{\theta}$ with the SFT objective above to obtain an SFT model $\pi_{\theta}^{\text{SFT}}$. Then, we apply Direct Preference Optimization (DPO) (Rafailov et al., 2024) on our SFT model, which can be formulated as follows:

$$\mathcal{L}_{\text{DPO}}(\pi_{\theta};\pi_{\text{ref}})=-\mathbb{E}_{(x,y_{w},y_{l})\sim D_{\text{pref}}}\left[\log\sigma\left(\beta\log\frac{\pi_{\theta}(y_{w}\mid x)}{\pi_{\text{ref}}(y_{w}\mid x)}-\beta\log\frac{\pi_{\theta}(y_{l}\mid x)}{\pi_{\text{ref}}(y_{l}\mid x)}\right)\right]$$

where the policy $\pi_{\theta}$ is initialized from $\pi_{\theta}^{\text{SFT}}$, and the reference model $\pi_{\text{ref}}$ is set to $\pi_{\theta}^{\text{SFT}}$ initially and remains fixed throughout training. $\beta$ is a hyperparameter and $\sigma$ is the sigmoid function. Minimizing $\mathcal{L}_{\text{DPO}}$ increases the log probability of the preferred $y_{w}$ relative to the dispreferred $y_{l}$.
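To make the role of $\beta$ and the reference model concrete, the per-pair DPO loss of Rafailov et al. can be computed from four sequence log-probabilities. The sketch below is framework-free and uses made-up log-probabilities; the default `beta=0.3` follows the value reported in Appendix B.

```python
import math


def dpo_loss(
    policy_logp_w: float,
    policy_logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.3,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (log-ratio_w - log-ratio_l)))."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy shifts probability mass toward y_w and away from y_l: the loss drops.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss depends only on how far the policy has moved relative to the reference on each response, which is why the reference model is kept fixed during training.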

SFT + Iterative Online DPO. Online training enables real-time, iterative optimization of model weaknesses. It relies on high-quality, lightweight reward models to provide continuous supervision feedback. In the case of AutoIF, verification functions serve as rigorous filtering standards, akin to reward models, delivering immediate feedback on model responses across training iterations. As in offline DPO, we conduct initial SFT on the base model $\pi_{\theta}$ to derive an SFT model $\pi_{\theta}^{\text{SFT}}$ with initial instruction-following capabilities. As depicted in Figure 3, we set the generation temperature to 0.8 and allow the SFT model to generate $K$ responses through self-sampling for each training sample, forming a response set $\{R_{1},\ldots,R_{K}\}$. Then, we employ the corresponding verification functions to assess the $K$ responses, thereby constructing the online DPO dataset $D_{\text{online}}^{\text{pref}}=(x,y_{w},y_{l})$ based on average pass rates across all functions. Finally, leveraging $D_{\text{online}}^{\text{pref}}$, we sequentially perform DPO training on $\pi_{\theta}^{\text{SFT}}$. Importantly, our iterative online optimization process progressively unlocks enhanced instruction-following capabilities.
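One round of online pair construction can be sketched as below. Several details are assumptions, since the text does not fix them: the sampler `sample` stands in for the SFT model decoding at temperature 0.8, `k=8` is borrowed from the rejection-sampling setting in Appendix C, and taking the highest-scoring response as $y_{w}$ and the lowest-scoring one as $y_{l}$ is one plausible reading of "based on average pass rates".

```python
import itertools
from typing import Callable, List, Tuple


def avg_pass_rate(response: str, funcs: List[Callable[[str], bool]]) -> float:
    """Average pass rate of a response across all verification functions."""
    if not funcs:
        return 0.0
    passed = 0
    for fn in funcs:
        try:
            passed += bool(fn(response))
        except Exception:
            pass  # a crashing function counts as a failure
    return passed / len(funcs)


def build_online_pairs(
    prompts: List[str],
    sample: Callable[[str], str],
    funcs_for: Callable[[str], List[Callable[[str], bool]]],
    k: int = 8,
) -> List[Tuple[str, str, str]]:
    """Sample k responses per prompt and keep a (x, y_w, y_l) pair if they differ."""
    pairs = []
    for x in prompts:
        responses = [sample(x) for _ in range(k)]
        scored = sorted(
            ((avg_pass_rate(y, funcs_for(x)), y) for y in responses),
            key=lambda item: item[0],
        )
        (low, y_l), (high, y_w) = scored[0], scored[-1]
        if high > low:  # no training signal if all responses score the same
            pairs.append((x, y_w, y_l))
    return pairs


# Toy sampler that alternates between a passing and a failing response.
pool = itertools.cycle(["ok", "this answer is far too long to pass"])
pairs = build_online_pairs(
    ["Reply in under 20 characters."],
    sample=lambda x: next(pool),
    funcs_for=lambda x: [lambda y: len(y) < 20],
    k=4,
)
print(pairs)
```

Repeating this step with the latest checkpoint as the sampler, followed by a DPO update, gives the iterative loop described in the text.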

Experiment

Datasets & Baselines. We conduct experiments using two LLMs from the Qwen2 series (Qwen2-7B and Qwen2-72B-Instruct) and two from the LLaMA3 series (LLaMA3-8B and LLaMA3-70B-Instruct). Please note that the AutoIF method proposed in this work has been employed in the open-source Qwen2-Instruct model. Thus, the version of Qwen2-Instruct we utilized represents an early iteration during internal development, rather than the final open-source model. The training datasets are respectively generated from Qwen2-72B-Instruct and LLaMA3-70B-Instruct, with detailed statistics provided in Table 5. We demonstrate the effectiveness of AutoIF by evaluating the instruction-following capabilities of models fine-tuned with self-generated datasets using AutoIF. Additionally, we include strong open and closed-source LLM baselines such as Mixtral-8x22B and GPT-4. For more details, refer to Appendix B.

Settings. In our experiments, we mainly explore two experimental setups:

(1) Strong-to-Weak Distillation involves aligning a less powerful model with a stronger, well-aligned model by mimicking its generated responses. In AutoIF, we can utilize a strong model such as Qwen2-72B-Instruct for data synthesis. Subsequently, we train a less powerful model like Qwen2-7B-Instruct using this synthesized data to achieve strong-to-weak alignment.

(2) Self-Alignment: Following several self-alignment works (Chen et al., 2024; Yuan et al., 2024), we utilize the LLM to perform the AutoIF process for synthesizing data, and then train the same model using this synthesized data.

Evaluation. We evaluate our methods using two instruction-following benchmarks: IFEval (Zhou et al., 2023) and FollowBench (Jiang et al., 2024b). IFEval comprises 25 types of verifiable instructions across about 500 prompts. While IFEval also focuses on verifiable instructions, extensive n-gram probing confirms no overlap between the IFEval test set and our training sets, thus eliminating any contamination concerns. We report strict and loose accuracy metrics at both prompt and instruction levels for IFEval. FollowBench is a fine-grained constraint-following benchmark with five levels of difficulty. It contains diverse and open-ended instructions requiring evaluation by strong LLMs, such as GPT-4, which can fully examine the generalization of AutoIF to more general instructions not verifiable by simple code executions. At the same time, we also evaluate our models on C-Eval (Huang et al., 2023), MMLU (Hendrycks et al., 2021), GSM8k (Cobbe et al., 2021), and HumanEval (Chen et al., 2021a) to obtain a comprehensive assessment of capabilities.
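The difference between prompt-level and instruction-level accuracy can be shown with a small computation. The sketch assumes per-instruction boolean verdicts are already available (produced under either the strict or the loose criterion); the verdicts in the example are made up.

```python
from typing import List, Tuple


def ifeval_accuracy(results: List[List[bool]]) -> Tuple[float, float]:
    """Return (prompt-level accuracy, instruction-level accuracy).

    results[p][j] is True if instruction j of prompt p was followed.
    """
    # Prompt level: a prompt counts only if all of its instructions are followed.
    prompt_acc = sum(all(verdicts) for verdicts in results) / len(results)
    # Instruction level: every instruction counts individually.
    flat = [v for verdicts in results for v in verdicts]
    instruction_acc = sum(flat) / len(flat)
    return prompt_acc, instruction_acc


# Three prompts carrying 2, 2, and 1 instructions respectively.
prompt_acc, instruction_acc = ifeval_accuracy([[True, True], [True, False], [False]])
print(prompt_acc, instruction_acc)
```

Instruction-level accuracy is never lower than prompt-level accuracy, because partially satisfied prompts still contribute their followed instructions.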

Section 4.1 reports the main results. Overall, AutoIF substantially enhances instruction-following performance across all models, configurations (strong-to-weak distillation and self-alignment), and training methodologies (SFT, Offline DPO, and Online DPO) on both benchmarks. These results demonstrate the effectiveness of our approach. Furthermore, we have identified the following insights:

On-policy Learning is More Effective. Comparing Online DPO and Offline DPO, the model-generated online data obtained through self-supervision yields better performance than offline data (Qwen2-7B, IFEval: 1.7% ↑, FollowBench: 2.6% ↑). This confirms that on-policy iterative execution feedback can effectively target and remedy the model’s weaknesses.

Larger models yield greater improvements. FollowBench provides a more comprehensive instruction-following assessment than IFEval. Notably, base models with more parameters typically improve more on FollowBench than smaller models (Qwen2 72B: 4.6% ↑, LLaMA3 70B: 5.6% ↑). This underscores that models with robust foundational capabilities, coupled with AutoIF, can further unlock powerful instruction-following alignment potential.

General abilities do not decline. Improving instruction-following abilities without compromising other capabilities is crucial. AutoIF notably preserves general abilities (MMLU, C-Eval), mathematical reasoning (GSM8k), and coding (HumanEval) performance across all training setups. Surprisingly, there are even slight performance gains in on-policy settings. We attribute this preservation largely to incorporating ShareGPT data during data synthesis, highlighting AutoIF’s capability to strike a balance across diverse abilities and excel in broad applicability.

Ablation on Supervision Model. Table 3 presents the results of replacing the supervision model Qwen2-72B with GPT-4. We observe that in AutoIF, a stronger supervision model (GPT-4) yields more effective strong-to-weak distillation alignment, most evident in a performance gain of over 15% in loose prompt-level accuracy on IFEval. This is reasonable because AutoIF requires the supervision model to perform several tasks, such as text augmentation (instruction, query, and response rewriting), code generation (verification function), and quality assessment (scoring). This implies that a supervision model with stronger fundamental abilities can synthesize higher-quality data when using AutoIF.

Quality Control on Instructions and Responses. In Figure 4, we examine how varying pass rate thresholds of verification functions (indicative of data quality) affect the amount of SFT data and instruction-following performance. As the pass rate threshold increases, the amount of SFT data decreases at the instruction level, while model performance consistently improves. This suggests that the quality of instructions is a crucial factor influencing IF performance. At the query level, the SFT data amount also decreases with higher pass rate thresholds. Notably, performance peaks at a pass rate threshold of 0.8 and declines as the threshold rises further toward 1. This observation aligns with our expectations, indicating a trade-off between data quality and quantity.

Ablation on Specific Components. To investigate the effectiveness of various modules in AutoIF, we conduct an ablation study, as presented in Table 3. We use w/o to denote the variant without a specific module. The results reveal the following: (1) The performance of AutoIF declines when any quality filtering process is removed, indicating that each component contributes. (2) The most significant performance drop occurs when the Cross Verification of instructions is removed, highlighting its importance over query quality verification. This underscores that a high-quality instruction set is fundamental to the AutoIF process. (3) Eliminating the overall quality filtering process results in a more substantial performance drop than removing any single component, suggesting that quality filtering at both the instruction and query levels provides a mutually reinforcing effect.

Scaling Analysis on SFT & DPO Data. Figure 4 presents the scaling analysis of SFT and DPO data using GPT-4 as the supervision model. The results demonstrate that even with just 1/64 of AutoIF-generated SFT/DPO data, Qwen2-7B achieves impressive performance, particularly with 1/64 DPO data reaching nearly 55% in loose prompt accuracy, an increase of 11.4 percentage points. This strongly supports the high quality of AutoIF-generated data. Further analysis reveals that IF capability steadily improves with an increase in data quantity, a scaling trend confirmed by numerous reasoning studies (Yuan et al., 2023; Muennighoff et al., 2024).

Contamination Analysis. We evaluate the contamination of the training dataset generated by AutoIF on IFEval and FollowBench. Specifically, we employ contamination detectors from LM-Sys (Yang et al., 2023), which utilize advanced chatbots to identify potentially rephrased contaminated test samples. Additionally, we report contamination findings detected by traditional n-gram contamination algorithms. As shown in Table 4, both contamination rates are lower than those of the ShareGPT dataset we used. This indicates that the self-generated training samples introduce no contamination of the test sets beyond that already present in ShareGPT. More cases can be viewed in Appendix D.

Data Efficiency. Table 5 explores the relationship between model coding ability, data quality pass rate (samples with a query quality score above 8), and instruction-following capability. Surprisingly, we observe consistency in the supervision model across all three metrics. This indicates that the execution feedback resulting from the supervision model’s coding ability substantially influences data synthesis quality and the final capability.

In this paper, we propose AutoIF, a scalable and automated method to enhance the instruction-following abilities of LLMs. It uses self-instruct and rejection sampling to enhance the supervisory signals of seed instructions and relies on self-generated execution feedback for quality filtering. We introduce three training strategies and two alignment settings to comprehensively analyze AutoIF. Experiments demonstrate that our method significantly improves performance across all settings in both IFEval and FollowBench, yielding the first LLM to achieve over 90% loose instruction accuracy on IFEval.

In this paper, we propose AutoIF, a system for automated instruction augmentation and quality filtering, capable of scaling to over 10,000 instructions. While our focus is not on the construction of cross-instructions, the strong results achieved in two instruction-following benchmarks demonstrate the generalizability of our method in handling complex instruction-following tasks. Additionally, we believe a more direct strategy would involve combining multiple simple instructions into cross-instructions, and subsequently enhancing and quality-filtering them using AutoIF. This approach has the potential to further amplify the effectiveness of our method. Therefore, we consider automating and scaling cross-instruction tasks as a key direction for future research.

In this paper, we have fully presented the seed instruction set used by AutoIF in the Appendix. All concatenated queries are sourced from the publicly available ShareGPT dataset and have undergone multiple steps of quality filtering. Therefore, our method strives to minimize potential safety and ethical risks as much as possible. However, during the rejection sampling process, malicious prompts can lead the model to produce harmful or inappropriate outputs, which is a shared problem. Ensuring the quality of generated content in a safe and controllable manner is crucial. The application of these techniques should be guided by ethical considerations, with safeguards in place to prevent misuse and reduce the likelihood of producing harmful outcomes.

Appendix

Appendix A Seed Instructions

Figure 5 illustrates our hand-written seed instructions.

Appendix B Implementation Details

To facilitate reproduction of our results, we report the experimental details below:

In the SFT phase, we perform full fine-tuning on Qwen2-7B and LLaMA3-8B with a learning rate of 7e-6, using a linear scheduler with 20 warm-up steps. All models are trained with DeepSpeed ZeRO Stage 3 (Rasley et al., 2020) and Flash-Attention 2 (Dao, 2023). We use a global batch size of 128, a weight decay of 0.1, and train for 3 epochs. Mixed precision training with bf16 is used, and the maximum context length is set to 8192 tokens. For Qwen2-72B and LLaMA3-70B, the global batch size is 512.

In the DPO phase, the learning rate is set to 5e-7 with a cosine scheduler and a 0.1 warm-up ratio. We use DeepSpeed ZeRO Stage 3 and Flash-Attention 2 for efficiency, with a global batch size of 64. Training utilizes a sigmoid loss function with a beta value of 0.3 and spans 2 epochs, with checkpoints every 200 steps. Mixed precision training with bf16 is employed, and the maximum context length is 4096 tokens.

We run all our experiments on NVIDIA A100 and H800 GPUs. Specifically, we train Qwen2-7B and LLaMA3-8B on 8 A100 GPUs, and Qwen2-72B-Instruct and LLaMA3-70B-Instruct on 64 H800 GPUs. Notably, we use an in-house version of Qwen2-7B without any targeted optimizations on instruction-following capabilities. For evaluations, we report pass@1 results with greedy decoding for HumanEval and zero-shot accuracy for GSM8K. We report averaged performance from five randomly seeded experiments.

Appendix C Details of AutoIF

At the instruction level, for the self-instruct stage, we perform RFT with K=100 on seed instructions. During the Automated Quality Cross Verification stage, we filter samples according to the four criteria outlined in the main text. For NLI filtering, we use mDeBERTa as our filtering model (available at https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7) and filter out only samples predicted as "Contradiction" (approximately 15%).
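The NLI filtering rule above can be sketched as follows. The sketch assumes that each sample is a pair of an original instruction and its back-translated counterpart, and it takes the classifier as an injected callable `predict_label` (a hypothetical wrapper around the NLI model) so that the filtering logic stays independent of any particular inference library.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (original instruction, back-translated instruction)


def filter_by_nli(pairs: Iterable[Pair],
                  predict_label: Callable[[str, str], str]) -> List[Pair]:
    """Keep every pair except those the NLI model labels a contradiction.

    `predict_label(premise, hypothesis)` is a hypothetical helper that
    returns the predicted class name, e.g. "entailment", "neutral",
    or "contradiction".
    """
    kept = []
    for original, back_translated in pairs:
        label = predict_label(original, back_translated)
        # Only contradictions are dropped; neutral pairs are retained.
        if label.strip().lower() != "contradiction":
            kept.append((original, back_translated))
    return kept
```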

At the query level, we randomly select 16 ShareGPT samples for each instruction and perform Response Rejection Sampling with K=8. For instruction following verification, we adhere to the two standards mentioned in the text. Finally, for query quality verification, we filter for consistency using a threshold of 8.
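A minimal sketch of the query-level setup above: each instruction is paired with 16 randomly chosen queries, and each resulting prompt is marked for K=8 sampled responses. Sampling without replacement, the fixed seed, the newline layout of the prompt, and the job dictionary are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Sequence

# Layout assumed from the response-generation template in Appendix E.
PROMPT = "Instruction: {instruction}\nQuery: {query}"


def build_sampling_jobs(instructions: Sequence[str], queries: Sequence[str],
                        queries_per_instruction: int = 16, k: int = 8,
                        seed: int = 0) -> List[Dict[str, object]]:
    """Pair each instruction with random queries and request k samples each.

    Raises ValueError (from random.sample) if fewer than
    `queries_per_instruction` queries are available.
    """
    rng = random.Random(seed)
    jobs = []
    for instruction in instructions:
        # Draw distinct queries for this instruction.
        for query in rng.sample(list(queries), queries_per_instruction):
            jobs.append({
                "instruction": instruction,
                "query": query,
                "prompt": PROMPT.format(instruction=instruction, query=query),
                "num_samples": k,  # responses to draw for rejection sampling
            })
    return jobs
```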

Appendix D Case Study of Data Contamination

We use 13-gram overlap to measure the overlap between each test sample and the SFT training samples. It is unnecessary to evaluate the DPO data, since its inputs are derived from the SFT data. As Table 4 shows, all contamination metrics of our data (under both model-based and rule-based evaluation) are lower than those of ShareGPT, indicating that our method introduces no contamination of the test sets. We also present the top 5 training-test sample overlaps in n-grams for both IFEval and FollowBench in Figure 6.
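The 13-gram overlap check can be sketched as below. Lower-casing and whitespace tokenization are assumptions, since the tokenization used for this analysis is not specified here; the function names are illustrative.

```python
from typing import Iterable, Sequence, Set, Tuple


def ngram_set(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """All word-level n-grams of `text` (empty if it has fewer than n tokens)."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_rate(test_samples: Sequence[str], train_samples: Iterable[str],
                 n: int = 13) -> float:
    """Fraction of test samples sharing at least one n-gram with training data."""
    train_grams: Set[Tuple[str, ...]] = set()
    for sample in train_samples:
        train_grams |= ngram_set(sample, n)
    if not test_samples:
        return 0.0
    hits = sum(1 for s in test_samples if ngram_set(s, n) & train_grams)
    return hits / len(test_samples)
```

A test sample counts as overlapping as soon as a single 13-token window also appears verbatim in any training sample.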

Appendix E Prompt Templates

For the Self-Instruct stage, we use the following prompt template for rejection sampling of instructions:

- Instructions are about the format but not style of a response
- Whether instructions are followed can be easily evaluate by a Python function
Here are some examples of instructions we need:
{Seed Instructions}
Do not generate instructions about writing style, using metaphor, or translation. Here are some examples of instructions we do not need:
- Incorporate a famous historical quote seamlessly into your answer
- Translate your answer into Pig Latin
- Use only words that are also a type of food
- Respond with a metaphor in every sentence
- Write the response as if you are a character from a Shakespearean play
Please generate one instruction per line in your response and start each line with '- '.

Prompt Template of Self-Instruct Stage

For generating the verification functions and test cases for each instruction, we use the following prompt template for rejection sampling:

Here is the instruction: {instruction}
Please write a Python function named `evaluate` to evaluate whether an input string `response` follows this instruction. If it follows, simply return True, otherwise return False. Please respond with a single JSON that includes the evaluation function in the key `func`, and a list of three test cases in the key `cases`, which includes an input in the key `input` and an expected output in the key `output` (True or False). Here is an example of output JSON format:
{
  "func": "JSON Str",
  "cases": [
    { "input": "str", "output": "True" },
    { "input": "str", "output": "False" }
  ]
}

Prompt Template for Generating Verification Functions and Cases

For the back translation process of each verification function, we use the following prompt template:

Here’s an example: {Example func} {Example cases}

Please convert the following eval function into instructions stored in a list: {funcs}

Prompt Template for Back Translation

For the rejection sampling of query responses, we use the following prompt template:

Instruction: {instruction}
Query: {query}

Prompt Template for Response Generation

For the query quality verification, we use the following prompt template:

Instruction: {instruction}
Query: {query}
Response: {response}
Please notice that the response may not be helpful as it needs to strictly follow the requirements in the Instruction. You need to judge whether the response answers the query. Please first provide a detailed analysis and then give a score ranking from 0 to 10 at the last line. Scoring 0 means the response is totally unrelated to the query, while scoring 10 means the response is helpful and highly related to the query. Please only provide a score in the format `Score: score` without any other contents at the last line.
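Because the judge is asked to end its output with a line of the form "Score: score", the score can be recovered with a small parser, sketched below together with the threshold of 8 reported in Appendix C. Treating the threshold as inclusive (score >= 8) and discarding unparseable or out-of-range judgements are assumptions; the helper names are illustrative.

```python
import re
from typing import Optional

SCORE_RE = re.compile(r"Score:\s*(\d+(?:\.\d+)?)")


def parse_score(judgement: str) -> Optional[float]:
    """Extract the score from the last non-empty line of a judge output."""
    lines = [ln.strip() for ln in judgement.strip().splitlines() if ln.strip()]
    if not lines:
        return None
    match = SCORE_RE.search(lines[-1])
    if match is None:
        return None
    score = float(match.group(1))
    # Scores outside the 0-10 range requested by the prompt are rejected.
    return score if 0 <= score <= 10 else None


def keep_response(judgement: str, threshold: float = 8.0) -> bool:
    """Assumption: a response is kept when its score is >= threshold."""
    score = parse_score(judgement)
    return score is not None and score >= threshold
```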

Appendix F Case Study of AutoIF

Here, we illustrate the data format of AutoIF, including the query, the response (verification function accuracy > 0.8), and the verification function.
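To make the role of the verification functions concrete, the sketch below loads an `evaluate` function from its source string, checks it against its own test cases in the JSON format requested in Appendix E, and computes the fraction of verification functions that a response passes. Reading "Acc" as this pass fraction is an assumption, as are the helper names. Generated code is untrusted, so in practice it should run in a sandbox rather than through a bare `exec`.

```python
import json
from typing import Callable, Sequence


def load_evaluate(func_source: str) -> Callable[[str], bool]:
    """Execute the source and return its `evaluate` function (unsandboxed)."""
    namespace: dict = {}
    exec(func_source, namespace)  # assumption: source is trusted or sandboxed
    return namespace["evaluate"]


def passes_own_cases(payload_json: str) -> bool:
    """True if the function reproduces every expected output in `cases`."""
    try:
        payload = json.loads(payload_json)
        evaluate = load_evaluate(payload["func"])
        return all(
            # Expected outputs arrive as "True"/"False" strings in the JSON.
            str(bool(evaluate(case["input"]))) == str(case["output"])
            for case in payload["cases"]
        )
    except Exception:
        # Malformed JSON, uncompilable code, or runtime errors all fail.
        return False


def response_accuracy(response: str, func_sources: Sequence[str]) -> float:
    """Fraction of verification functions that accept `response`."""
    if not func_sources:
        return 0.0
    passed = 0
    for source in func_sources:
        try:
            if bool(load_evaluate(source)(response)):
                passed += 1
        except Exception:
            pass  # a crashing function counts as a failed check
    return passed / len(func_sources)
```

Under this reading, a response would be retained when `response_accuracy(response, funcs) > 0.8`.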

We briefly introduce the LLM baselines used in our instruction-following experiments.

LLaMA3

(Meta, 2024), developed by Meta AI, is the latest iteration of the LLaMA series, featuring significant upgrades. Compared to LLaMA2, LLaMA3 expands its training dataset, context length, and vocabulary, resulting in improved performance across various tasks. Enhancements in contextual understanding and language generation further distinguish LLaMA3.

Qwen2

(Bai et al., 2023), developed by Alibaba, includes five sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B. Trained on high-quality data in Chinese, English, and 27 other languages, Qwen2 excels in multilingual capabilities and shows strong performance in coding and mathematics. Additionally, it supports extended context lengths of up to 128K tokens (Qwen2-72B-Instruct), making it ideal for long texts and complex tasks.

Mistral-7B

(Jiang et al., 2023), released by Mistral AI in September 2023, leverages grouped query attention (GQA) combined with sliding window attention (SWA) to efficiently process sequences of any length, enhance inference speed, and improve throughput. It outperforms many 13B models across various tasks.

Mixtral-8×7B

(Jiang et al., 2024a), developed by Mistral AI, is the first open-source MoE large model. It is a sparse mixture-of-experts network and, like Mistral-7B, employs the GQA mechanism. With a smaller parameter count than LLaMA2-70B and GPT-3.5, it outperforms them across numerous tasks.

GPT Series

GPT-3.5 (OpenAI, 2022) and GPT-4 (Achiam et al., 2023), developed by OpenAI, are advanced models in the GPT series that are trained with a three-stage reinforcement learning from human feedback (RLHF) algorithm. This enhances their instruction-following capabilities and minimizes harmful content generation. GPT-3.5 excels in text completion, translation, and summarization. Building on these strengths, GPT-4 further refines the RLHF algorithm, enhancing performance on complex instructions and making it suitable for applications ranging from academic research to industrial use.

In addition to the two instruction-following benchmarks introduced in the main text, we also provide a detailed overview of the other datasets covered in the experiments:

ShareGPT refers to the multi-turn chat histories used by Vicuna (Chiang et al., 2023). ShareGPT includes 86K human queries and responses from ChatGPT and other chatbots. We randomly select 20K samples to train LLaMA3-8B and Qwen2-7B to obtain our baseline models: LLaMA3-8B (ShareGPT) and Qwen2-7B (ShareGPT). (Following the setup of Dong et al., we use the cleaned raw dataset from https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered and follow the Vicuna preprocessing.)

GSM8K (Cobbe et al., 2021) is a dataset designed to evaluate the mathematical problem-solving abilities of language models. It consists of diverse, high-quality grade-school math word problems that require understanding and manipulating mathematical concepts to arrive at a correct solution, with 7,473 training samples and 1,319 test samples.

HumanEval (Chen et al., 2021b) includes 164 unique programming challenges, each paired with 9.6 test cases on average. To provide a more comprehensive evaluation of the functional correctness of code generated by large language models, HumanEval+ substantially increases the number of test cases to an average of 774.8 per problem. In this paper, we report the Pass@1 result under greedy decoding.
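Under greedy decoding there is exactly one completion per problem, so Pass@1 reduces to the fraction of problems whose single completion passes all of its test cases. The sketch below makes this explicit; the input layout (a mapping from task ID to per-test outcomes) is an assumption for illustration and not the format of any particular evaluation harness.

```python
from typing import Dict, List


def pass_at_1_greedy(test_results: Dict[str, List[bool]]) -> float:
    """Pass@1 with one greedy completion per problem.

    `test_results` maps each task ID to the pass/fail outcome of every
    test case run against that task's single completion.
    """
    if not test_results:
        raise ValueError("no problems were evaluated")
    solved = sum(
        1 for outcomes in test_results.values()
        # A problem counts as solved only if it has tests and all pass.
        if outcomes and all(outcomes)
    )
    return solved / len(test_results)
```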

MMLU (Hendrycks et al., 2021) is a benchmark designed to assess pretraining knowledge in models using zero-shot and few-shot evaluations. It includes 57 subjects across STEM, humanities, social sciences, and more, with difficulty levels ranging from elementary to advanced professional. MMLU tests both world knowledge and problem-solving skills, covering traditional disciplines like mathematics and history, as well as specialized areas such as law and ethics.

C-Eval (Huang et al., 2023) consists of multiple-choice questions categorized into four difficulty levels: middle school, high school, college, and professional. The questions cover 52 varied disciplines, including humanities, science, and engineering. Additionally, there is C-Eval Hard, a subset of particularly challenging topics within C-Eval that demand advanced reasoning skills. We perform an in-depth evaluation of leading language models on C-Eval, testing both English and Chinese-focused models.