Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning

Ming Li, Lichang Chen, Jiuhai Chen, Shwai He, Jiuxiang Gu, Tianyi Zhou

Introduction

The quality of instruction-tuning data Wei et al. (2022); Chen et al. (2023a); Mishra et al. (2021); Chung et al. (2022); Zhang et al. (2023) is paramount to the LLM being fine-tuned, i.e., the student model. There is a growing trend and demand in the community to automatically improve the quality of instruction-tuning data. Previous works either curate datasets with human experts Conover et al. (2023); Longpre et al. (2023); Zhou et al. (2023) or distill the responses of well-trained LLMs Taori et al. (2023); Peng et al. (2023); Chiang et al. (2023); Vu et al. (2023); Xu et al. (2023a). The self-improvement ability of LLMs Bai et al. (2022b); Huang et al. (2022); Pan et al. (2023) has also been explored to improve the instruction or response of a training sample.

However, these existing methods of data enhancement Huang et al. (2022); Ye et al. (2023); Li et al. (2023b); Mitra et al. (2023) do not take a critical criterion into account: Is the teacher-refined data compatible with the needs of the student model? These approaches typically do not account for the inherent randomness and potential degradation associated with the generative models’ output, and thus overlook how the student model responds to these “improved” data samples. A mechanism for the student model to selectively integrate these enhancements has therefore been notably absent. To bridge this gap, our work introduces a teacher-student collaboration pipeline wherein a teacher generative model engages in a reflection process to enhance both the instruction and response of a data sample. The student model then decides whether to incorporate these improvements based on its unique statistical attributes. This pipeline is versatile and can be adapted to various contexts where data enhancement is needed.

Then, another pivotal question arises: How does the student model decide which enhanced data are needed and critical to its training? This question underpins the challenge of autonomously evaluating the quality of instructions and responses. Common practices involve utilizing sophisticated models like GPT-4 for assessment purposes Zheng et al. (2023); Li et al. (2023e); Liu et al. (2023b); Chiang and Lee (2023) or employing a secondary judge model equipped with evaluative capabilities Wang et al. (2023c); Li et al. (2023a). These methods, however, present limitations: they fail to address the discrepancies between the evaluating model and the actual student model undergoing training. Particularly in the latter approach, even though the judge model and the student model might share the same structural framework, their weight distributions diverge once endowed with the evaluative functions. Consequently, the preferences of the judge model may not align with the real student model’s requirements. To circumvent these issues, we adopt a statistical method, utilizing the Instruction-Following Difficulty (IFD) score proposed by Li et al. (2023c). This score is derived directly from the raw student model, thereby mitigating potential domain shifts and ensuring that the evaluation is better aligned with the student model’s learning context.

In our approach, the IFD score serves as a crucial metric that measures how much help the instruction can provide to the likelihood of the response if added as an extra condition, representing the Difficulty of the sample. However, though effective, the IFD score mainly assesses the instructions. Motivated by Humpback Li et al. (2023d) which requires LLMs to generate potential instruction based on responses, we further introduce a reversed version of IFD named reversed-IFD (r-IFD). This metric evaluates how much the response contributes to predicting the corresponding instruction. A lower r-IFD score suggests the student can easily deduce the corresponding instruction given the response, indicating this sample is feasible for the student to learn, representing the Feasibility of the sample. This dual approach, employing both IFD scores for Difficulty and r-IFD scores for Feasibility, enables a comprehensive and nuanced assessment of the instruction-tuning process, ensuring the refined data aligns well with the student model’s capabilities and objectives.

We name our overall method Selective Reflection-Tuning, which contains the selective instruction reflection phase and the selective response reflection phase. In the first phase, a teacher model is utilized to reflect on the instruction of the given sample based on some criteria and generate a new sample. Then the student model decides whether to accept the improvement based on difficulty (IFD). In the second phase, the teacher model reflects and generates a sample with a new response, and the student model decides whether to accept it based on feasibility (r-IFD). With our interactive pipeline, we obtain a high-quality dataset: with instruction tuning on only a relatively small amount of data, our model outperforms most existing open-source models, even those with larger model sizes. Our contributions include:

We propose a teacher-student collaboration pipeline where the teacher model and student model cooperate to build a more coherent and model-compatible instruction tuning dataset, which can be further adapted into other self-improvement scenarios.

We present a nuanced evaluation schema reversed-IFD, quantifying the relevance of instruction-response pairs, and representing the feasibility of the sample for the student.

With instruction tuning on only a few thousand automatically generated samples, our models achieve top-tier performance, indicating the high quality of our data.

Preliminaries

Let $f_{\theta}$ denote the pre-trained student model, e.g., LLaMA, with parameters $\theta$, and $g$ the teacher model, e.g., ChatGPT. Let lowercase letters $x,y,z,c,\ldots$ denote text segments, which could be phrases or sentences, and let $x[i]$ denote the $i$-th token of $x$. We use uppercase letters $D,\ldots$ to denote collections of language sequences or datasets, and $D_{0}$ represents the initial base dataset. Since both $f_{\theta}$ and $g$ are auto-regressive, the probability of a sequence $x=(x[1],...,x[n])$ can be further denoted as:

$$f_{\theta}(x)=\prod_{i=1}^{n}f_{\theta}(x[i]\mid x[1],...,x[i-1])$$

In the instruction tuning setting, there will be a mapping function that turns the original raw instruction $x$ into the desired format and requests a response $y$ from the model. For simplicity, we directly denote this process as $y\sim f(y|x)$ . The loss function for instruction tuning can be denoted as:

$$L_{\theta}=-\frac{1}{n}\sum_{i=1}^{n}\log f_{\theta}(y[i]\mid x,y[1],...,y[i-1])$$

where $n$ is the length of the response $y$ .

Motivated by Cherry LLM Li et al. (2023c), which proposes the IFD score to measure the difficulty of the instruction in a given instruction-response pair, we utilize the perplexity version of the IFD score Li et al. (2024), which is formulated as:

$$\text{IFD}_{\theta}(y|x)=\frac{\text{ppl}(y|x)}{\text{ppl}(y)}$$

where $\text{ppl}(y|x)$ represents the perplexity of model $f_{\theta}$ to fit the response $y$ given the instruction $x$ as the context, and $\text{ppl}(y)$ represents the perplexity of model $f_{\theta}$ to directly fit the response $y$ without any context given. This value represents how the given instruction $x$ affects the generation of corresponding response $y$ for given model $f_{\theta}$ , which has been shown as an effective metric for evaluating the given instruction-following data pairs Li et al. (2024). A higher IFD score indicates that the instruction is more challenging for the student model to generate the response, suggesting the instruction’s difficulty for the student model.
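As an illustration of the definition above, the following minimal sketch computes an IFD score from per-token log-probabilities of the response. It assumes the log-probabilities have already been obtained from the student model; the function names and the numbers are hypothetical and are not taken from the authors' code.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity as exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ifd_score(logp_y_given_x: Sequence[float], logp_y: Sequence[float]) -> float:
    """IFD(y|x) = ppl(y|x) / ppl(y), both taken over the tokens of y."""
    return perplexity(logp_y_given_x) / perplexity(logp_y)


# Hypothetical log-probabilities for a three-token response.
with_instruction = [-0.5, -0.5, -0.5]     # response scored with x as context
without_instruction = [-1.0, -1.0, -1.0]  # response scored with no context
score = ifd_score(with_instruction, without_instruction)
print(f"IFD = {score:.4f}")
```

In this toy case the instruction makes the response more likely, so the ratio falls below 1; a ratio close to or above 1 would mean the instruction offers little help in predicting the response.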

Methodology

As shown in Figure 1, there are two main phases in our method: Selective Instruction Reflection and Selective Response Reflection. In each phase, the teacher model generates an updated version of the instructions or responses based on specific given criteria $\{c_{ins,1},...,c_{ins,k}\}$ (the prompts for reflection can be found in Appendix B), and the student model then judges whether the updates are beneficial to it based on difficulty (IFD) or feasibility (r-IFD). Finally, these selectively improved samples are used for the final instruction tuning.

Given the instruction-response pair $(x_{0},y_{0})$ from the original dataset $D_{0}$ with some specific criteria $\{c_{ins,1},...,c_{ins,k}\}$ , the teacher model $g$ is required to reflect on this sample and generate a better instruction-response pair $(x_{ins},y_{ins})$ according to its reflection. With the criteria given, the teacher model $g$ is able to generate critical responses $z_{ins,1},...,z_{ins,k}$ :

$$[z_{ins,1},...,z_{ins,k}]=g(x_{0},y_{0},c_{ins,1},...,c_{ins,k})$$

where both the original instruction and response are wrapped into the prompt rather than the original instruction alone. These critical responses further serve as the guidance (chain of thought) Wei et al. (2023); Yao et al. (2023) for the generation of the new instruction and response pair:

$$[x_{ins},y_{ins}]=g(x_{0},y_{0},c_{ins,1},...,c_{ins,k},z_{ins,1},...,z_{ins,k})$$

where the above process is sampled as a continuous language sequence, and the critical responses would not be decomposed from the whole output.

Though the given sample pair is updated by the teacher model, it remains uncertain whether this updated version is truly better for the student model. While most existing work evaluates the quality of a data sample by directly prompting existing generative models, they inevitably suffer from the misalignment problem. Thus we utilize the IFD score Li et al. (2023c) calculated based on the specific base student model, which measures how the instruction benefits the generation of corresponding responses for the model, representing the difficulty of the sample.

After obtaining the updated instruction-response pair, the base model $f_{\theta}$ is required to compare the IFD score of the original pair $(x_{0},y_{0})$ and the updated pair $(x_{ins},y_{ins})$ , and the sample with the higher IFD score will be chosen:

$$(x_{1},y_{1})=\arg\max_{(x,y)}\text{IFD}_{\theta}(y|x)$$

where $(x,y)\in\{(x_{0},y_{0}),(x_{ins},y_{ins})\}$ . Then the chosen data pair $(x_{1},y_{1})$ with a higher IFD score will be sent to the next phase.
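The selection step of the first phase can be sketched as follows. This is only an illustration under stated assumptions: the IFD scores are taken as precomputed, the `ScoredPair` container and `select_by_ifd` are hypothetical names, and keeping the original pair on a tie is our own choice, which the text does not specify.

```python
from typing import NamedTuple


class ScoredPair(NamedTuple):
    """An instruction-response pair with its student-side IFD score."""
    instruction: str
    response: str
    ifd: float


def select_by_ifd(original: ScoredPair, reflected: ScoredPair) -> ScoredPair:
    """Keep whichever pair has the higher IFD score.

    Assumption: on an exact tie the original pair is kept.
    """
    return reflected if reflected.ifd > original.ifd else original


# Hypothetical scores for one sample before and after instruction reflection.
pair_0 = ScoredPair("Name a fruit.", "Apple.", ifd=0.62)
pair_ins = ScoredPair("Name a fruit rich in vitamin C and say why.", "Orange, because ...", ifd=0.81)
pair_1 = select_by_ifd(pair_0, pair_ins)
print(pair_1.instruction)
```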

Selective Reflection on Response

After the first phase, although the instruction $x_{1}$ is guaranteed to be difficult for the student model, the corresponding response $y_{1}$ is still sub-optimal. Thus a further reflection on the response is proposed. Similar to the above procedure, a new set of criteria for reflection on the response is defined as $\{c_{res,1},...,c_{res,m}\}$ . The overall process can be denoted as:

$$[z_{res,1},...,z_{res,m}]=g(x_{1},y_{1},c_{res,1},...,c_{res,m})$$

$$y_{res}=g(x_{1},y_{1},c_{res,1},...,c_{res,m},z_{res,1},...,z_{res,m})$$

where $z_{res,i}$ represents the critical response for the $i$-th response criterion $c_{res,i}$ . In this process, the instruction and response pair $(x_{1},y_{res})$ is fully improved.

Our pipeline aims to improve both the instruction and response in an instruction-tuning sample. IFD score measures the difficulty of the sample. We take a step further by adding another dimension which we call reversed IFD (r-IFD) representing the feasibility for the student to generate the instruction given the response. A lower r-IFD score suggests the student can easily deduce the corresponding instruction given the response, indicating this sample is feasible for the student to learn, which measures the model-specific matching degree of the existing data pair. Two examples with low or high r-IFD scores can be found in Appendix H for better illustration.

The high-level idea of r-IFD is in line with the success of Humpback Li et al. (2023d), which utilizes LLM to predict the corresponding instruction from given texts (responses), and hypothesizes that “we can predict instructions for these candidate gold answers that can be used as high-quality example pairs”. In our paper, we further hypothesize that a response is more informative for training if it is feasible for the LLM to predict the corresponding instruction from the response. This hypothesis is naturally proved by the Humpback, which generates instructions that can be handled by LLMs, while those difficult ones are naturally discarded.

Under this circumstance, the reversed IFD score should be small, since a smaller value means that it is easier for the model to generate the corresponding instruction given the response. Specifically, the r-IFD score is calculated as:

$$\text{r-IFD}_{\theta}(x|y)=\frac{\text{ppl}(x|y^{\prime})}{\text{ppl}(x)}$$

where $y^{\prime}$ represents the text segment generated by mapping the original $y$ into a query to guess the corresponding potential instructions.

For the given original sample pair $(x_{1},y_{1})$ from the first phase and the reflected sample pair $(x_{1},y_{res})$ , the selection process can be formulated as:

$$(x_{2},y_{2})=\arg\min_{(x,y)}\text{r-IFD}_{\theta}(x|y)$$

where $(x,y)\in\{(x_{1},y_{1}),(x_{1},y_{res})\}$ .
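A minimal sketch of the second-phase selection is given below, assuming per-token log-probabilities of the instruction are available both with the query built from the response as context and without any context. The function names and the numbers are hypothetical and serve only to illustrate the ratio and the minimum-based choice described above.

```python
import math
from typing import Mapping, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity as exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def r_ifd_score(logp_x_given_query: Sequence[float], logp_x: Sequence[float]) -> float:
    """r-IFD(x|y) = ppl(x|y') / ppl(x), both taken over the tokens of x."""
    return perplexity(logp_x_given_query) / perplexity(logp_x)


def select_by_r_ifd(scores: Mapping[str, float]) -> str:
    """Return the name of the candidate with the lowest r-IFD score."""
    return min(scores, key=lambda name: scores[name])


# Hypothetical log-probabilities of the same instruction under two responses.
scores = {
    "original": r_ifd_score([-0.9, -0.9], [-1.0, -1.0]),
    "reflected": r_ifd_score([-0.2, -0.2], [-1.0, -1.0]),
}
chosen = select_by_r_ifd(scores)
print(chosen, scores[chosen])
```

Here the reflected response makes the instruction much easier to predict, so it has the lower r-IFD and is the one kept.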

After the above phases, there will be a corresponding data pair $(x_{2},y_{2})$ for each original $(x_{0},y_{0})$ , which constitutes our selectively reflected data. We then discard all samples that are not response-reflected, for consistency of the response distribution. We name the whole process above selective recycling; it greatly improves the quality of the previous dataset (some statistical analysis can be found in Appendix E). The student model $f_{\theta}$ is trained on the newly generated data, and the resulting models are denoted “sRecycled Models”, e.g., sRecycled Alpaca.

Experimental Setup

The Alpaca dataset Taori et al. (2023), sourced from Stanford University, offers $52,002$ instruction samples. Developed via the self-instruct paradigm Wang et al. (2023d), it leveraged the capabilities of the text-davinci-003 model. The WizardLM dataset Xu et al. (2023a) is a refined collection encompassing a total of $250,000$ instruction samples. To enhance data fidelity, gpt-3.5-turbo-0613 has been meticulously integrated during the refinement process. From this extensive dataset, we predominantly focused on the WizardLM-7b subset, comprising $70,000$ samples. We test our method on both of these two datasets to verify the effectiveness of our method and name the corresponding models as “sRecycled Alpaca” and “sRecycled WizardLM”.

Evaluation Metric

To evaluate the effectiveness of our method, we utilize 4 commonly used automatic evaluation metrics, including (1) Pair-wise Comparison, (2) Alpaca Eval, (3) Open LLM Leaderboard, and (4) MT-Bench. In addition, (5) a Human Study is also conducted for the evaluation. Detailed descriptions can be found in Appendix C.

Experimental Results

For Pair-wise Comparison, we compare our sRecycled WizardLM 7B with other classic open-source models using GPT4 as the judge, as shown in Figure 2. Notably, our model outperforms most models by a large margin, regardless of whether they are 7B or 13B (“LLaMA2 Chat 13B”, “Vicuna 13B v1.3”), whether extra RLHF/RLAIF is utilized (“LLaMA2 Chat 7B”, “Zephyr 7B Alpha”), or whether other data improvement methods are utilized (“Recycled Wiz 7B”, “WizardLM Orca 7B” (https://huggingface.co/datasets/pankajmathur/WizardLM_Orca), “Orca 2 7B” Mitra et al. (2023)).

Table 1 delineates the outcomes on the AlpacaEval Leaderboard in which our models stand out for delivering promising results with a streamlined approach. This comparison provides a direct quantification of a model’s capacity for instruction adherence and the intrinsic quality of its output. Remarkably, with a win rate that competes closely with heavyweight counterparts, our models achieve this with only instruction tuning on a small amount of our high-quality data. Furthermore, our approach does not rely on additional processes such as RLHF Ouyang et al. (2022); Bai et al. (2022a) or RLAIF Bai et al. (2022b); Lee et al. (2023), which demand a significant overhead. This reduction in complexity represents a significant advancement in model efficiency, making it a cost-effective and agile solution for real-world applications. The ingenuity of our model lies in its simplicity and effectiveness, proving that with intelligent design less is more.

Table 2 showcases the performance comparison on the Huggingface Open LLM Leaderboard with some related models. Similarly, with instruction tuning on only a small amount of data, our models surpass many of the models in average performance across representative benchmarks. These benchmarks do not directly measure the instruction-following ability or the quality of responses generated by LLMs, but a relatively higher performance on them still indicates that our method does not degrade the model.

For the human evaluation, human evaluators compare the responses of our sRecycled WizardLM 7B model and the original WizardLM 7B model to the given testing instructions. There are $57/108$ wins for our model, $23/108$ ties, and $28/108$ losses. These results further support the efficacy of our method in improving the quality of the original data.

Fewer Data Scenario

To better illustrate the high quality of our sRecycled dataset, we further conduct experiments where only part of the data samples are utilized. Following Li et al. (2023c), we calculate the IFD score of each data sample and select the top $k$ percent of the data for instruction tuning. The resulting performances on the Open LLM Leaderboard and the Alpaca Eval Leaderboard are shown in Table 3 (the detailed table and ablation can be found in Appendix F). Since selecting data by IFD score is an effective method to find a better instruction tuning subset from the overall dataset, the consistent decrease in performance on Alpaca Eval indicates the difficulty of finding a subset with higher performance, which further verifies the overall high quality of our selectively recycled data.
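The top-$k$-percent selection described above can be sketched as a sort followed by a cut. This is an illustrative sketch only: the record layout, the function name, and the floor-based rounding of the cut-off are assumptions, and any additional filtering used by Li et al. (2023c) is not reproduced here.

```python
from typing import Dict, List


def top_percent_by_ifd(samples: List[Dict], percent: float) -> List[Dict]:
    """Return the `percent` highest-IFD samples, sorted from high to low.

    Assumption: the cut-off is floored, and at least one sample is kept.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ranked = sorted(samples, key=lambda s: s["ifd"], reverse=True)
    keep = max(1, int(len(ranked) * percent / 100))
    return ranked[:keep]


# Ten hypothetical samples with IFD scores 0.0, 0.1, ..., 0.9.
data = [{"id": i, "ifd": i / 10} for i in range(10)]
subset = top_percent_by_ifd(data, percent=20)
print([s["id"] for s in subset])
```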

Figure 3 draws the scatters comparing the data used and corresponding performance. It illustrates a striking balance of efficiency and performance achieved by our models. Despite using markedly less data, our models—represented by the distinctive star markers—consistently occupy the upper echelons of the performance spectrum on both the Alpaca Eval benchmark and the open LLM leaderboard. Furthermore, the plots reveal that our models achieve these results without scaling up to the larger data requirements that other models seem to necessitate, as indicated by their position further to the right along the x-axis. The results not only signal superior data quality but also suggest a potential reduction in the computational resources and time required for training, which is crucial for sustainable and scalable AI development.

Furthermore, it is notable that with fewer than $1,000$ selectively recycled samples, our “sRecycled WizardLM 7B (2%) (926)” outperforms most existing 7B models, including LIMA, which is trained with manually curated data samples. This not only supports the LIMA hypothesis Zhou et al. (2023) but also pushes it further: in addition to carefully human-crafted instruction-tuning data, fewer than $1,000$ fully automatically generated samples can also yield substantial benefits in model alignment and performance.

Ablation Study

Extensive experiments are conducted on several 7B models as shown in Table 4. We utilize the pair-wise comparison with GPT4 as the judge to measure the performance of different models.

Compared with the original WizardLM model, our performance is dramatically better, which directly showcases the capability of our method to increase data quality. “Reflect on Ins.” and “Reflect on Res.” represent models trained with data reflected only on the instruction or only on the response, with no selection process. These comparisons show that reflection on the instruction improves the data quality only slightly, while reflection on the response improves it more. This is reasonable given the similarity in response distribution between the original WizardLM data and the WizardLM data reflected on instruction. In contrast, when the response is reflected, it directly changes the target that the LLM needs to fit, and thus directly improves the response quality. “Reflect on Ins. + Res.” represents the model trained with reflection-tuning (“Recycled WizardLM 7B”) without the selection process. Although it already follows instructions well, our model still outperforms it with less data.

Ablation on Selection

Moreover, to further verify the effectiveness of our selection mechanism, experiments with different selection methods are conducted, as shown in Table 4.

“Select by Randomness” represents the student model randomly choosing whether to accept improved data. Not only does this model underperform our final model largely, but it also underperforms both “Reflect on Res.” and “Reflect on Ins. + Res.”. This baseline result indicates that without a proper selection method, the blind mixture of data might harm the model’s performance.

“Select by Coherence” represents the data selected based on the coherence between instruction and response, which is calculated by cosine similarity of the Sentence-BERT Reimers and Gurevych (2019) embeddings. In this setting, the data pairs, whose instruction and response are more related, are more likely to be selected. The performance of this model is slightly better than the random selection model, and still worse than both “Reflect on Res.” and “Reflect on Ins. + Res.”, indicating the ineffectiveness of this selection method.
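The coherence score used by this baseline is a cosine similarity between two embedding vectors. The sketch below shows only that similarity computation; the embeddings are hypothetical stand-ins, and the Sentence-BERT encoding step itself is not shown.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical low-dimensional embeddings of an instruction and its response.
instruction_embedding = [0.2, 0.1, 0.7]
response_embedding = [0.4, 0.2, 1.4]  # same direction, different length
coherence = cosine_similarity(instruction_embedding, response_embedding)
print(f"coherence = {coherence:.3f}")
```

Because the measure depends only on direction, it captures topical relatedness but carries no information about how hard or how feasible the pair is for a particular student model.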

“Select by Perplexity” represents the student model choosing whether to accept the improved data by whether the perplexity is improved, which is the closest to ours. The performance of this model surpasses both “Reflect on Res.” and “Reflect on Ins. + Res.”, showing that a selection process can definitely further improve the model’s performance, verifying our motivation for adding the selection mechanism. However, this model still underperforms our model, indicating the efficacy of our selection strategy.

“Select by IFD only” and “Select by r-IFD only” represent situations where we only utilize IFD or r-IFD scores for student side selection. Utilizing only IFD results in a model that is close to our main model, indicating the usefulness of the IFD score. However, its performance is still lower, indicating the effect of the r-IFD.

Comparison with Related Work

Earlier works on instruction tuning focus on creating large, high-quality datasets curated by human experts Khashabi et al. (2020); Ye et al. (2021); Wei et al. (2022); Wang et al. (2022); Du et al. (2022), which is time-consuming and labor-intensive. Thus a number of works try to construct instruction-tuning datasets automatically. Self-Instruct Wang et al. (2023d) utilizes the in-context learning capability of GPT-3 to expand tasks to many diverse instruction-response pairs. WizardLM Xu et al. (2023a) applies an evolution methodology to refine and diversify the original instruction data. LaMini-LM Wu et al. (2024) proposes generating topic-guided instructions based on Wiki data. Peng et al. (2023) utilize GPT4 to generate responses for existing datasets. UltraChat Ding et al. (2023) establishes various scopes and systematically generates a multitude of instructions within each designated area. Orca Mitra et al. (2023) directly applies GPT4 to generate reasoning steps for given instructions. SelFee Ye et al. (2023) utilizes ChatGPT to enhance the response quality. Reflection-Tuning Li et al. (2023b) improves both the instruction and response sequentially by reflecting on specific criteria. DEITA Liu et al. (2023a) utilizes ChatGPT to diversify and then select the data. LIFT Xu et al. (2023b) also utilizes ChatGPT/GPT4 to expand and compress the data.

All the above works are related to ours in that they involve a teacher model to improve the instruction data. However, all of them are teacher-dominated: both the generation and the selection are decided by the teacher model, without involving the student. We are the first to introduce a teacher-student collaboration pipeline, and our experiments show that it is effective.

Conclusion

Selective Reflection-Tuning, as proposed in this paper, marks a significant advancement in data improvement for instruction tuning of Large Language Models. By integrating an interactive pipeline between a teacher model and a student model, and utilizing the novel metrics of IFD and reversed-IFD, this approach has demonstrated a marked improvement in the quality and relevance of instruction-tuning datasets. The resulting enhancement in model performance across various benchmarks not only attests to the efficacy of our method but also suggests its potential applicability in broader machine learning contexts.

Limitations

The involvement of the student model makes it possible to build high-quality and student-compatible instruction-response data. However, the main limitation of this method is that the data samples selected by different student models are different, so the statistics (IFD scores and r-IFD scores) need to be calculated again for each student model. We believe the use of model-specific data samples is more reasonable due to the distinct characteristics of different models, and the statistics-based method is much more efficient than generation-based alternatives; nevertheless, the need for re-calculation for new models is still not efficient enough.

References

Appendix A Prompt for Evaluation

We provide the detailed prompt we used for the pair-wise comparison in Figure 4.

Appendix B Prompt for Reflection

The prompts for the reflection are shown in Figure 5 and Figure 6.

Appendix C Evaluation Metric

Evaluating the responses generated by LLMs is an open problem that many researchers are still working on. Due to the lack of real ground truth for open-domain questions, most previous methods cannot be directly applied to judge the instruction-following ability of LLMs. However, using an LLM as a judge, e.g., GPT4, has recently become a widely accepted and common practice Touvron et al. (2023); Chiang et al. (2023); Dettmers et al. (2023); Liu et al. (2023b); Chiang and Lee (2023). Previous studies Zheng et al. (2023); Li et al. (2023e) have shown that GPT4’s evaluations are consistent with human evaluations. We utilize the testing instruction set from WizardLM Xu et al. (2023a), which contains $218$ diverse human-curated instructions categorized into specific sub-categories.

Specifically, we directly follow the evaluation method from Chen et al. (2023a); Li et al. (2023c), which rates each model-generated response on a scale from $1$ to $10$ , with scores covering several aspects such as accuracy and relevance. To further mitigate the positional bias discussed in Ko et al. (2020); Wang et al. (2023b), the model-generated outputs are presented to the LLM judge in two different orders and scored each time. The final outcome for a model is then determined as follows. Wins: it is better in both orders, or better in one and tied in the other. Tie: it is tied in both orders, or better in one and worse in the other. Loses: it is worse in both orders, or tied in one and worse in the other.
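The win/tie/loss rule above can be expressed compactly by summing the signed outcomes of the two presentation orders. The sketch below is one way to encode that rule; the function names and the example scores are hypothetical.

```python
from typing import Tuple


def outcome(ours: float, other: float) -> int:
    """+1 if our score is higher, -1 if it is lower, 0 if the scores are equal."""
    return (ours > other) - (ours < other)


def pairwise_verdict(first_order: Tuple[float, float],
                     second_order: Tuple[float, float]) -> str:
    """Combine judge scores from the two presentation orders.

    Each argument is (our_score, other_score) as rated in one order.
    """
    total = outcome(*first_order) + outcome(*second_order)
    if total > 0:
        return "win"   # better in both, or better in one and tied in the other
    if total < 0:
        return "loss"  # worse in both, or worse in one and tied in the other
    return "tie"       # tied in both, or better in one and worse in the other


print(pairwise_verdict((8, 6), (7, 7)))  # better once, tied once
print(pairwise_verdict((8, 6), (5, 7)))  # better once, worse once
```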

C.2 Alpaca Eval Leaderboard

AlpacaEval Leaderboard offers an LLM-centric automatic assessment utilizing the AlpacaFarm Dubois et al. (2023) evaluation dataset. It is an automated evaluation mechanism for LLMs that offers efficiency, cost-effectiveness, and reliability. Operating on the AlpacaFarm evaluation dataset, it gauges models’ proficiency in adhering to generic user instructions. The generated outputs are juxtaposed against benchmark responses from Davinci003. Empirical evidence suggests that AlpacaEval’s alignment with ground truth annotations sourced from human experts is notably high.

C.3 Open LLM Leaderboard

The Huggingface Open LLM Leaderboard employs the evaluation methodology from Gao et al. (2021), providing a cohesive framework for assessing generative language model capabilities across a spectrum of evaluation tasks. It focuses on $4$ pivotal benchmarks: ARC Clark et al. (2018), HellaSwag Zellers et al. (2019), MMLU Hendrycks et al. (2021), and TruthfulQA Lin et al. (2022).

C.4 MT-Bench

We also provide the performances of our sRecycled Models on MT-bench, as shown in Table 5. Since our training focused on 1-turn instructions and did not include any multi-turn data, the 1-turn score on the MT bench is promising and comparable to LLaMA2-13B-chat, while the 2-turn score is not that satisfactory. However, the Vicuna dataset Chiang et al. (2023) can introduce multi-turn dialog data to the model training. Hence, we tried training with our data based on the existing Vicuna 7B v1.5 model, whose result is reported in the last row as “sRecycled Wiz + Vicuna 7B”. Compared with the original Vicuna model, the 1-turn, 2-turn, and overall scores are improved dramatically and the overall score is similar to the performance of Vicuna-13B.

C.5 Human Study

To further validate our method, we conducted a human study. In the test set, there are $27$ sub-categories that have $4$ or more testing instructions, so we randomly sampled $4$ instructions from each sub-category to form a set of $108$ instructions. Then $3$ human participants are given the task of comparing the responses generated by the compared models, using the same criteria as the previous pair-wise evaluation. For each comparison, 3 options are given (Win, Tie, and Loss), and the final result is determined by majority voting among the participants.
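The majority-voting step can be sketched as below. The text does not say how a three-way split (one Win, one Tie, one Loss) is resolved, so the fallback to "Tie" in that case is purely our assumption, as is the function name.

```python
from collections import Counter
from typing import Sequence


def majority_vote(votes: Sequence[str], fallback: str = "Tie") -> str:
    """Return the label chosen by a strict majority of annotators.

    Assumption: when no label has a strict majority, `fallback` is returned.
    """
    if not votes:
        raise ValueError("need at least one vote")
    label, count = Counter(votes).most_common(1)[0]
    return label if count * 2 > len(votes) else fallback


print(majority_vote(["Win", "Win", "Loss"]))
print(majority_vote(["Win", "Tie", "Loss"]))  # no majority, falls back
```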

Appendix D Implementation Details

For the Llama2 pre-trained model Touvron et al. (2023), we utilize the prompt and code base from Vicuna Chiang et al. (2023) and flash attention Dao et al. (2022) while the overall training arguments are aligned with protocols from Alpaca and WizardLM datasets. The Adam optimizer Kingma and Ba (2017), with a $2\times 10^{-5}$ learning rate for the 7b model and a $1\times 10^{-5}$ learning rate for the 13b model, and a batch size of $128$ , steer the training across three epochs with a max length of $2048$ . The warmup rate is set to $0.03$ .
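For reference, the hyperparameters listed above can be collected in a small configuration object. The class and field names below are hypothetical and do not correspond to the authors' training script, and the conversion from warmup rate to warmup steps is an assumed convention rather than something stated in the text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in this appendix (field names are ours)."""
    learning_rate: float
    batch_size: int = 128
    num_epochs: int = 3
    max_length: int = 2048
    warmup_ratio: float = 0.03


CONFIGS = {
    "7b": TrainConfig(learning_rate=2e-5),
    "13b": TrainConfig(learning_rate=1e-5),
}


def warmup_steps(config: TrainConfig, num_samples: int) -> int:
    """Assumed convention: warmup covers ceil(ratio * total optimizer steps)."""
    steps_per_epoch = math.ceil(num_samples / config.batch_size)
    total_steps = steps_per_epoch * config.num_epochs
    return math.ceil(total_steps * config.warmup_ratio)


# Hypothetical dataset size, for illustration only.
print(warmup_steps(CONFIGS["7b"], num_samples=1280))
```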

Appendix E Statistic Analysis

In this section, we delve into a quantitative analysis of the instruction-response data, pre- and post-application of our methodology, as delineated in Table 6. We first compare both “Recycled Data” and “sRecycled Data” to the original data.

Observationally, there’s an increase in the average token length of instructions within the Alpaca dataset, whereas a decrement manifests for the WizardLM dataset, epitomizing the method’s adept adaptability. The succinctness and elementary nature of the Alpaca dataset’s instructions warrant an enhancement in intricacy through our method, thereby elongating their length. Conversely, the pre-existing complexity and intricacy in WizardLM’s instructions render our algorithm inclined towards succinctness. Pertaining to the response section, there’s a marked propensity of our approach to engender detail-rich textual content, leading to relatively long responses.

Moreover, leveraging Sentence-BERT Reimers and Gurevych (2019), we quantify the coherence metric between instructions and their affiliated responses. It’s discernible that our technique consistently produces samples with better coherence, signifying a superior alignment between the modified instructions and the consequent responses. Additionally, to elucidate the change in instructional difficulty, we employ the IFD score, computed on the pre-trained llama2-7b language model, to check the difficulty of the instructions. The increase in IFD scores represents an increase in the overall difficulty of the instructions. Moreover, r-IFD is also calculated, and the decrease in r-IFD scores indicates that the instruction-response pairs are more closely related.

E.2 Data Component Distribution

In our selective reflection-tuning, there are four different outcomes for each original data sample: both instruction and response are modified, only the instruction is modified, only the response is modified, and neither the instruction nor the response is modified. To provide a better view of the data components, we provide pie charts for our sRecycled Alpaca 7B and sRecycled WizardLM 7B data, as shown in Figure 7.
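The four outcomes can be tallied with a simple counter, as in the sketch below. The labels, function names, and input flags are hypothetical; the sketch only illustrates how the fractions behind such a pie chart could be computed.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def outcome_label(instruction_modified: bool, response_modified: bool) -> str:
    """Map the two selection decisions to one of the four outcomes."""
    if instruction_modified and response_modified:
        return "both modified"
    if instruction_modified:
        return "instruction only"
    if response_modified:
        return "response only"
    return "neither modified"


def component_distribution(flags: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Fraction of samples falling into each outcome."""
    counts = Counter(outcome_label(ins, res) for ins, res in flags)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples given")
    return {label: n / total for label, n in counts.items()}


# Hypothetical (instruction_modified, response_modified) flags for four samples.
example = [(True, True), (True, False), (False, True), (False, False)]
print(component_distribution(example))
```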

Appendix F Detailed Few Data Scenario

The detailed performances in the few data scenarios are shown in Table 7, and comparisons with the random selection method are shown in Table 8.

Appendix G Ablation on Larger Evaluate Set

The evaluation set used in the main text for Table 4 is the WizardLM test set, which contains 218 human-written instructions and is currently one of the most widely used test sets. Another widely used test set is the Vicuna test set, which is used in MT-Bench, but it contains only 80 instructions; the results are presented in Appendix C. Thus the test set we used for the ablation is almost three times the size of the Vicuna set. Moreover, in our evaluation, every comparison is processed twice to eliminate potential position bias. Thus we do not consider it a particularly small test set.

However, to further validate the effectiveness of our method, we further combine the Vicuna Chiang et al. (2023) test set (80), Koala Vu et al. (2023) test set (180), WizardLM Xu et al. (2023a) test set (218), Self-instruct Wang et al. (2023d) test set (252), and LIMA Zhou et al. (2023) test set (300) into a huge evaluation set of 1030 instructions for the ablation study as shown in Table 9. The results on this huge test set share similar trends compared with using the WizardLM test set alone, indicating the effectiveness of our method.

Appendix H Examples for r-IFD Illustration

Example 2: (r-IFD=0.921, High, Not Preferred)

Instruction: Identify the type of sentence "I drove to the store yesterday".

In the first example, after reading through the given code, LLM can easily understand the task and guess what this code is for, indicating sufficient information in the response and its good match to the instruction. However, in the second example, the response is not able to provide enough information to derive the instructions and is vague in various aspects. It indicates that the response might not be feasible to be reasoned by the model and thus needs to be improved.