Yi: Open Foundation Models by 01.AI

01. AI, :, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Guoyin Wang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yanpeng Li, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai

cs.CL cs.AI

Introduction

Recent breakthroughs in large language models have revolutionized the whole field of artificial intelligence and potentially radiate across the entire human society. Our vision for large language models is to make them the next generation computational platform and empower the whole community with significantly amplified intelligence. As a step towards this mission, we present the Yi model series, 6B and 34B language models pretrained from scratch on 3.1T highly-engineered large amount of data, and finetuned on a small but meticulously polished alignment data. Due to the data quality resulting from our substantial engineering efforts, which we will detail in the upcoming sections, Yi achieves near GPT-3.5 benchmark scores and human preferences.

In designing the Yi model series, we are mostly concerned on the following dimensions regarding model scale, data scale, and data quality: (1). when choosing model scale, the desiderata is to have small enough model that is feasible for inference on consumer-grade hardware like the RTX 4090 where the bounding factor is its limited 24G memory, yet still large enough with complex reasoning and emergent abilities. This is why we found 34B gives a nice performance-cost balance; (2). since 34B is smaller than the conventional 70B used by Chinchilla and LLaMA , we increase the pretrain data scale to 3.1T tokens to compensate for the decreased compute flops. This makes the model-data scale combination fall into the post Chinchilla optimal regime , i.e., we overtrain the model on more tokens (3T) than the compute optimal (around 1T). The benefit is from the inference side, as we achieve stronger performance with reduced serving cost: after int4 quantization, one can serve the 34B chat model on 24G GPU memory with almost no performance drop; (3). our data engineering principle is to promote quality over quantity for both pretraining and finetuning. The pretraining data quality is guaranteed by a sophisticated data cleaning pipeline with cascaded filtering methods and intentionally increased deduplication strength; (4). for finetuning data we heavily emphasize quality by handcrafting less than 10K instructions over multiple iterations based on user feedback. This approach significantly deviates from the quantity-scaling styled instruction tuning works like FLAN and UltraChat , but aligns more with the handcrafting styled works like LIMA .

Our pretraining data cleaning system features a sophisticated filtering pipeline based on language, heuristic textual features, perplexity, semantics, topic, and safety, as well as a cascaded deduplication process based on paragraph, MinHash, and exact matching. This thorough pipeline leads to a much higher removal ratio than existing pipelines like CCNet , RefinedWeb and RedPajama , which we believe is key to the success of data engineering. The underlying principle is although pretraining requires data scaling, one would like to make sure the data used are of high quality, rather than training the model on large raw data, i.e., we prefer 3T tokens over sophasticated engineering over 10T tokens without extensive filtering. Regarding the model architecture, we use standard implementation of the Transformer architecture with Grouped-Query Attention (GQA) , SwiGLU activation, and RoPE with an adjusted base frequency (RoPE ABF) . This design choice is the standard approach rooted from the Transformer original paper , later modified by GPT-3 and Chinchilla , then followed by LLaMA , Baichuan , Qwen and many related works.

To approach GPT-3.5-matching human preferences, our finetuning dataset is curated from carefully selected multi-turn instruction-response pairs, annotated directly by our team of machine learning engineers then polished over multiple iterations of user feedback. As mentioned above, the size of our finetuning dataset is less than 10K, but improved over and over again across the model development timeline. Benefiting from the dataset’s manageable size, we employed an extensive grid search to identify the optimal data composition, promote diversity, and discover effective hyperparameters. After 8-bit and 4-bit quantization, the final chat model can be deployed on consumer-grade GPUs nearly without performance degradation compared to the bf16 format.

We further extend the Yi model capability from three dimensions: context scaling, vision-language adaptation, and depth-upscaling. To achive 200K context length, we continue pretrain the model on about 5B length-upsampled data, similar to the concurrent work in Fu et al. . To adapt the model to vision-language tasks, we integrate a vision encoder and develop a multi-stage training method, following and improving the practice of Liu et al. . We also study the effectiveness of depth-upscailng , i.e., making the model deeper by continual pretraining, and confirming its effectiveness to further improve model performance.

Our infrastructure provides strong support for the full-stack development of the Yi model series, from pretraining to finetuning to serving. To support pretraining, we develop cross-cloud elastic task scheduling, automatic failure recovery, and topology-aware resource allocation which collectively enable us to run tasks according to the real-time available GPU nodes cross clusters with limited switching overhead. To support finetuning, we build a hierarchical scheduling framework supporting different distributed backends for different models (e.g., Megatron for the policy model and DeepSpeed for the reward model). For efficient inference, we use 4-bit model and 8-bit KV cache quantization, combining with PagedAttention and Dynamic Batching.

Extensive experiments demonstrate that Yi-34B can match GPT-3.5 in both performance and efficiency. On most standard benchmarks like MMLU (for the base model) and LMSys ELO Rating (for the chat model), Yi-34B generally achieves scores on par with GPT-3.5. After model parameter and KV cache quantization, the inference cost is also controlled such that a wide range of the community can deploy the model on cost effective devices. We further report a detailed performance comparison between Yi and major LLMs on commonsense reasoning, college exams, math, coding, reading comprehension, and human preference win-rate on multiple evaluation benchmarks.

Since its release, the Yi model series has benefited the community from the following perspectives: (1). it provides GPT-3.5-matching quality yet cost-effective models to researchers, and enables developers to build AI-native applications like language model based agents; (2). it empowers end users with locally runnable chatbots, which consequently helps protecting user data privacy; (3). it sheds light on the direction on further data and model scaling to achieve even stronger frontier models. for both research and commercial use.

Pretraining

Our approach to pretraining is to train a standard dense transformer architecture on a heavily engineered large pretraining corpora, where our underlying assumption is that when trained on extensive data of high-enough quality, a standard architecture can exhibit advanced capability. This is to say, we may not need much architectural modification, although we have indeed conducted extensive preliminary architectural experiments. In the following subsections, we first detail our data engineering pipeline, then briefly discuss the model architecture.

The Yi data mixture is shown in Fig. 2. To produce a high-quality bilingual pretraining data, we meticulously designed a cascaded data-processing pipeline, as illustrated in Fig 1. This pipeline features a series of data-cleaning strategies targeting quality and diversity. We start with web documents from Common Crawl, use the CCNet pipeline for language identification and perplexity scoring. Then we use a combination of filtering and deduplication process, as detailed below.

This part of filter aims for removing text of low quality. We filter out text based on: (1). URL, domain, word blocklists and garbled text filters; (2). document length, the ratio of special symbols, and the ratio of short, consecutive, or incomplete lines; (3). repeated words, n-grams, or paragraphs ; The filtering thresholds are based on a statistical analysis of large document samples, as described in Nguyen et al. . Furthermore, we identify and anonymize Personal Identifiable Information (PII), such as email addresses and phone numbers.

We use learned filters to address nuanced cases that exceed the capabilities of standard heuristic rules. Notably, the Chinese content extracted from Common Crawl present unique challenges, particularly with a higher ratio of inappropriate content like pornography and gambling. Traditional heuristic-rule-based filters struggle to effectively identify and eliminate all harmful content. To enhance our filtering process, we have integrated a suite of learned scorers for filtering, namely the perplexity scorer, quality scorer, safety scorer, and document coherence scorer: (1). the Perplexity Scorer, utilizing the KenLM library as per CCNet , evaluates a vast array of web documents, discarding those with perplexity scores largely above average; (2). the Quality Scorer is a classifier trained to recognize and favor pages similar to Wikipedia in quality and assign scores accordingly. Documents that fail to meet the quality standard are subsequently removed; (3). the Document Coherence Scorer identifies low-quality web documents that consist of disparate sentences or paragraphs, thus being incoherence. Such documents are either segmented for further analysis or removed entirely. (4). the Safety Scorer identifies and removes web documents containing toxic content, such as violence, pornography, and political propaganda.

We further use unsupervised semantic clustering to group web documents. This clustering process enables efficient identification and analysis of documents sharing similar semantic features. The clustered data are subsequently annotated with quality labels, providing essential references for the optimization of Yi’s data mixture strategy. Documents identified as low-quality through automatic and manual verification are excluded from the dataset.

After filtering, we implement a comprehensive deduplication pipeline following the procedure in Penedo et al. (2023) . This pipeline integrates document-level MinHash deduplication and sub-document exact-match deduplication, effectively identifying and removing duplicate content within and across documents. We further categorize web documents into specific themes using a topic model predicting labels like as news, ads, and knowledge-based content. In the final pretraining dataset, we down-sample less helpful content, mostly advertisements, to ensure information density. The final composition of Yi’s pretraining data is shown in Fig. 2.

2 Tokenization

We use byte-pair encoding (BPE) implemented in the SentencePiece framework , to tokenize the pretraining data. The vocabulary size of Yi is set to 64,000 to balance computational efficiency and word comprehension. Specifically, we split numbers into individual digits to facilitate a better understanding of numeric data. We allow rare characters to fall back to the unicode-byte encoding to ensure fault tolerance. We employ the identity tokenizer to avoid transferring all punctuations to the half-width format. LLMs prioritizing English usually utilize dummy prefix (whitespace at the beginning of text) in their tokenizers to generalize the same words at different positions of sentences. We do not use this approach because the assumption does not always hold even in the English context, especially for sentences that begin with quotation marks, also it does not show positive effect in Chinese context.

3 Model Architecture

Yi uses a modified version of the classical decoder-only Transformer architecture where the code is based on LLaMA’s implementation. The main parameter setting is summarized in Table 1. The modifications from LLaMA to Yi are further summarized below:

LLaMA 2 uses Grouped-Query Attention(GQA) only on its largest 70B model, and its 7B and 13B uses full attention. We incorporate GQA in both Yi-6B and Yi-34B. GQA splits query-heads into G groups, sharing a single key and value head within each group of query . This approach offers substantial reductions of training and inference costs, compared to the original Multi-Head Attention (MHA) . We do not observe performance degradation after applying GQA to our 6B smaller model.

We use SwiGLU as Yi’s post-attention layer, reducing its activation size from $4h$ to $8/3h$ ( $h$ denotes hidden size) to be consistent with the normal post-attention layer. This adjustment also compensates for the reduction in parameter resulted from GQA, making the overall parameter count comparible of existing 7B and 34B models.

We use Rotary Position Embedding (RoPE) following the standard implementation. We adjust the base frequency (RoPE ABF), introduced in Xiong et al. , to support long context windows up to 200K where the base model itself is trained on 4K context length. To adapt the base model to longer context, we continue pretrain the model on 10B tokens from our pretraining data mixture with slightly upsampled long sequences, mostly from book. We observe that only 1-2B tokens is enough for the model to converge to low loss on 4K-200K length, and a lightweight finetuning further induces near-perfect long-context retrieval performance. Based on this observation, we tend to view that the capability of modeling longer dependency than the pretrained length (4K) is a intrinsic capability (rather than an being injected by post-train). This is to say, the base model already has the capability to model longer than 4K dependency even the model is trained shorter, and the post-train / finetuning procedure simply release this capability.

Finetuning

Our finetuning method significantly emphasizes data quality over quantity. Our approach does not follow existing data-intensive approaches like FLAN and UltraChat , which scales the SFT data to millions of entries but each of the entries may not been examined carefully because the scale is too large. In contrast, our method aligns with the LIMA and DEITA approach, which focus on data selection rather than scaling. With the scale being less than 10K, we are able to examine and optimize every single data point. Below we discuss our data construction and training details.

Our finetuning dataset consists of less than 10K multi-turn instruction-response dialog pairs, with each and every one of the entry constructed and polished over multiple iterations and from user feedback. We take this approach because in our preliminary experiments, we observe that compared to the open-source data of several hundred thousand entries, the results from a smaller, manually annotated dataset are superior. These observations align with those reported in Touvron et al. , Zhou et al. , Gemini Team et al. .

We use the following techniques to improve prompt distribution selection, response formatting, and chain-of-thought formatting: (1). for prompt distribution selection, drawing inspiration from WizardLM, we develope compound instructions and progressively evolved them to increase their complexity. This approach has significantly reduced the size of SFT data in our experiments; (2). for response formatting, we generally use a default style extended from LIMA. Overall, the responses are structured in an introduction-body-conclusion format where the body is usually a list of bullet point; (3). for CoT data formatting, we have use a “Step-Back” pattern, inspired by Zheng et al. , by performing abstraction to formulate higher-level solutions before delving into reasoning about the original, more concrete questions.

We spend extra efforts on reducing hallucination and repetition: (1). to reduce hallucinations, we examine and ensure that the knowledge in the responses is not contained within the model, and eliminate responses that might lead to memorization; (2). to reduce repetition, we rewrite the repetitive turns of the responses that usually exist but may be overlooked in the finetuning data.

To ensure the coverage of different capabilities, we have included a wide spectrum of open-source prompt, encompassing areas such as question answering, creative writing, dialogue, reasoning, mathematics, coding, safety, bilingual capabilities, and others.

To obtain a fine-grained control of different directions of capabilities, inspired by InsTag, we develop a instruction tagging system. By designing a diversity-focused sampling algorithm, we carefully balanced the distribution of instructions across various tags. This approach ensures a diverse finetuning dataset, aiming to achieve enhanced cross-task robustness.

To achieve the optimal data ratio for balancing different directions of the capability, we use an approximate grid search to determine our data mixture. Motivated by Dong et al. , this process involved experimenting with {1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64} proportions for each ability. The search process was guided by validation results and our in-house human evaluation sets.

Beyond the focus on data quality and diversity, our observations revealed that the format of the data substantially influences the model’s ultimate performance. To this end, we implemented the ChatML-style format . This structured approach empowers the model to differentiate among various information types, such as system configurations, user inputs, and assistant responses.

2 Training Method

We use next-word prediction loss for finetuning, and only compute loss on the responses, but not system and user instructions. We use AdamW optimizer with $\beta_{1}$ set to 0.9, $\beta_{2}$ set to 0.999, and $\epsilon$ set to $10^{-8}$ . We use a sequence length of 4096, alongside a batch size of 64. We set training step to 300 with a constant $1\times 10^{-5}$ learning rate, a weight decay of 0.1, gradient clipping with a maximum threshold of 1.0, and NEFTune with a noise scale of 45 for Yi-34B-Chat and 5 for Yi-6B-Chat.

Infrastructure

We build the infrastructure supporting the full-stack data processing, pretraining, finetuning, and serving. Our infrastructure feasures: (1). automated managing and monitoring the computing resource; (2). improved the training speed from optimized parallel strategies, kernel efficiency, and long-context support; (3). unified finetuning framework supporting heterogeneous distributed training backend, such as simultaneously using Megatron and DeepSpeed for multiple models in Direct Preference Optimization (DPO) ; (4). reducing the deployment cost by various LLM serving accelerations such as quantization, continuous batching, and paged attention. Below we explain these techniques one by one.

To efficient schedule large-scale language model development, particularly pretraining, which may take months on thousands of GPUs , we build a highly efficient multi-cloud task scheduling algorithm to manage pre-training, SFT, and RLHF tasks of different priorities. We also build a high-performance in-house training framework that allows us to automatically elastic scale the pre-train jobs to different node sizes based on the GPU availability. More importantly, all the training-related hyper-parameters will be scaled at the same time seamlessly.

During the large language model training stage, a wide range of failures regularly occur, ranging from GPU crashes to communication fabric errors to loss spikes. We use the following strategies to address these reliability challenges: (1) we apply automated inspection, prediction, and labeling of nodes for different kind of software/hardware error categories. Nodes marked as tainted will be temporarily removed from the resource pool until the errors got cleared. (2) we implement a task queuing system with pre-checks and the capability for fast, automatic recovery in the event of failures during training tasks. (3) we develop of a user-friendly multi-task submission and management console, enabling developers to seamlessly manage and track their training tasks and hyper-parameters.

Memory and communication restrictions are the two major technical challenges of large scale model training requiring integrated solutions beyond adding more GPUs. We use and improve upon the following techniques to tackle the memory and communication restrictions: (1) ZeRO-1 to remove the memory consumption by partitioning optimizer states cross data-parallel processes; (2) tensor parallel combined with pipeline parallel within each compute node to avoid inter-node communication bottleneck, and the 3D parallel strategy is well designed and optimized to avoid using activation checkpointing and minimize the pipeline bubbles; (3) kernel fusion techniques like flash attention and JIT kernels to reduce redundant global memory access and consumption; (4) topology-aware resource allocation (ranking strategy) to minimize the communication across different layers of switches, which is the limitation of a typical fat-tree-topology.

Different from pretraining, finetuning LLMs may require the orchestration of multiple models, as is the practice of DPO and PPO . In such training jobs, a typical process is to use reference/reward model to predict a batch of data (which also requires nontrivial time), then let the target model use this data to calculate loss and update parameters. To this end, we build a multi-model scheduling framework to support multiple backends for different LLMs in a single job. For example, when finetuning a language model with DPO, the intermediate results from the reference model can be cached and reused, improving the training speed and resource cost to be close to the supervised finetuning counterparts.

We primarily use quantization, dynamic batching, and Paged Attention for improving decoding speed and memory usage. We use quantization to decrease both the memory footprint and computation demand. By 4-bit model quantization and 8-bit KV cache quantization , we are able to achieve significant GPU memory saving with near-zero performance degradation (e.g., less than 1 $\%$ accuracy drop in MMLU/CMMLU benchmark). We use dynamic batching to minimize the response time and improve batching efficiency. We use PagedAttention to improve memory utilization and improve decoding.

We implement and improve computation-communication overlapping, sequence parallelism, and communication compression to support up to 200K context length continue pretraining and finetuning. Our method to scale the context length to 200K is solely based on engineering, that is to say, we do not modify the model architecture like sparse, local, or sliding window attention – the model remains using the full attention even the input is 200K.

Safety

To enhance the model’s trustworthiness and safety, we develop a full-stack Responsible AI Safety Engine (RAISE). RAISE ensures safe pretraining, alignment, and deployment. This section discusses our safety measures in the pretraining and alignment stages.

Aligning with standard pretraining data safety practices , we build a set of filters based on heuristic rules, keyword matching, and learned classifiers to remove text containing personal identifiers and private data, and reduce sexual, violent, and extremist content.

Informed by existing research in , we first build a comprehensive safety taxonomy. This taxonomy covers a broad spectrum of potential concerns, including environmental disharmony, superstitious, religious sensitivities, discriminatory practices, substance abuse, violent behavior, illegal activities, hate speech, ethical violations, privacy breaches, self-harm, sexually explicit content, mental health issues, and cybersecurity threats. We curated datasets reflecting these categories for a robust alignment, and mix them with our dialog SFT data. We also include a targeted set of prompts simulating attack scenarios in the alignment phase, which effectively improved the model’s resilience against malicious use.

Evaluations

Our evaluation demonstrates that the Yi model family achieves inspiring performance on a wide range of tasks and delivers close to GPT-3.5 user preference rate. We first report the base model performance on standard benchmarks, then we discuss the chat model performance and its user preference rate.

Here we present the results for our base models and several other well-known base models across standard academic benchmarks. While benchmarking open-source models, we observed a disparity between the results generated by our pipeline and those reported in public sources. Upon conducting a more in-depth investigation of this difference, mostly because different models use different prompts, post-processing strategies, and sampling techniques. These differences may potentially induce significant variations in the outcomes. Our prompt and post-processing strategy remains consistent with the default settings of the original benchmarks. We use greedy decoding without any post-processing for the generated content. For scores that were not reported publicly (or scores reported with different settings), we try to get results with our pipeline. For scores that can be found publicly, we directly report the existing numbers. We use the following benchmarks, largely following the practice of LLaMA 2 :

We included PIQA, SIQA, HellaSwag, WinoGrande , ARC, OpenBookQA(OBQA), and CommonsenseQA(CSQA) to assess common sense reasoning. CSQA was exclusively tested using a 7-shot setup, while all other tests were conducted with a 0-shot configuration.

For reading comprehension, we report the 0-shot average on SQuAD, QuAC, and BoolQ.

We report the average of the GSM8K (8 shot), and MATH (4 shot) benchmarks with pass@1 accuracy without any specific prompting strategy (e.g. Chain-of-Thought prompting) and other ensemble technique (e.g., majority voting).

We report the average pass@1 scores of our models on HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021).

We report the overall results for MMLU(5-shot), CMMLU (5-shot), Gaokao-Bench (5-shot), and BigBench Hard (BBH) (3-shot).

By training on a significantly larger number of tokens (3.1T) compared to prior work (usually $\leq$ 2T), we have observed a substantial performance gain across benchmarks, as shown in Table 2. However, it is important to note that there are still discernible disparities between our model and existing open-source and close-source models, particularly in tasks related to mathematics and coding. As performance in these domains can be significantly improved by continual pretraining and instruction fine-tuning, we have refrained from incorporating extensive mathematical and coding content in the pretraining corpus when making the initial design choices. We do plan to release models with enhanced math and coding capabilities in the future.

1.2 Discussions

We observe that Yi-34B has substantial performance improvement compared to Yi-6B, though they utilized the same pretrain corpora. Larger model size leads to higher performance gain on Code and Math benchmarks, referring to Tab. 3, compared to benchmarks focusing on Commonsense Reasoning, Reading Comprehension, or Knowledge.

Smaller models of higher quality pretrain data, like Yi-34B or Qwen-14B, usually demonstrate better performance than models of larger size but (presumably) lower quality data, such as Falcon-180B (though the focus of Falcon-180B might be more on the scaling side, which is definitely of important value on its own).

Based on Tab. 2, we note that open-source LLMs still lag behind the performance of GPT-4 and GPT-3.5 on various benchmarks. Yet representative bilingual LLMs, e.g. Qwen-14B and Yi-34B, can match or even surpass the performance of GPT-4 on Chinese knowledge related benchmarks, including C-Eval , CMMLU , and Gaokao . However, there is still a huge gap between GPT-4 and open-source models on reasoning-related benchmarks like BBH , code (HumanEval), and math (MATH).

1.3 In-Context Learning Study

We further investigate the in-context learning capability, i.e., the capability of inferring the underlying function given the few-show input-output demonstrations. We consider the task of inferring the linear coefficient of a weighted sum. Specifically, define $y=w_{1}x_{1}+w2x_{2}+...+w_{n}x_{n}$ , our few-shot demonstration is $x_{1},x_{2},...,x_{n},y$ , and we ask the model to (implicitly) infer $w_{1},w_{2},...,w_{n}$ by predicting the $y$ given a new set of input $x$ . We use (a). the absolute difference between model prediction $y$ and the ground truth $y^{*}$ , i.e., $|y-y^{*}|$ as a continuous measure, and use (b). the exact match $y==y^{*}$ as a discontinuous measure. We further note that most of the models perform reasonably well on addition and subtraction, so the ability to do arithmetic, as a confounding factor, can be ruled out.

The results are shown in Figure 3. When setting the linear coefficients of be , we see that Yi 34B and LLaMA-2 70B performs the best in-terms of answer exact match. If we increase the number of the linear coefficients to be , we observe the emergent behavior that only large models (LLaMA-2 70B and Mixtral) can achieve good scores on exact match, although the differences to target is more continuous. These observations give side evidence for Yi-34B’s performance on in-context learning and indicates that further scaling may allow the model to infer more complicated functions by in-context learning.

2 Chat Model Performance

In this section, we report the automatic and human preference evaluation of the Chat Model. We use greedy decoding to generate responses. For the automatic evaluation benchmarks, we extract answers from the model’s generated outputs and calculate accuracy. During the evaluation process, we observed that different prompts have varying influence on results. Therefore, for the same set of questions, we use identical prompts to evaluate all models, aiming to ensure as fair and unbiased results as possible.

For automatic evaluation, we use the same benchmarks as is for the base model, detailed in Sec. 6.1.1. We use both zero-shot and few-shot methods but generally, zero-shot is more suitable for chat models. Our evaluation involves generating responses while following instructions explicitly or implicitly (such as the format in the few-shot examples). We then isolate relevant answers from the generated text. Unlike the base model, for the zero-shot evaluations on the GSM8K and BBH datasets, we employ the Chain-of-Thought (CoT) approach to guide the model in deliberation before reaching an answer.

The results shown in Tab. 4 demonstrate the effectiveness of our chat models in understanding human instructions and generating appropriate instruction-following responses. We particularly highlight the 4-bit quantization results, as 4-bit quantization substantially reduces the memory requirement while the model performance nearly does not drop. This observation serve as the foundation of serving the model on consumer-grade devices.

In line with Goodhart’s principle, when a measurement metric becomes the target of our pursuit, it ceases to serve as a reliable standard of assessment. Consequently, the outcomes of our evaluations on benchmarks are exclusively employed for ensuring that our alignment training does not detrimentally impact the foundational knowledge and capabilities of the base model. We do not engage in targeted optimization of our chat model with the objective of enhancing benchmark performance.

To further evaluate the generalizability of our model’s capabilities, we conducted assessments of its mathematical computation proficiency by subjecting it to the 2023 Hungarian high school mathematics final exam questions, first proposed by the xAI Grok team then reproduced by Paster . This evaluation was undertaken with the aim of determining whether our model exhibited signs of overfitting to training datasets that are mathematically oriented. The results in Fig. 4 show that Yi-34B-Chat performs inspiringly on both the GSM8K and the Hungarian mathematics exam. However, note that Yi-6B-Chat does not exhibit strong mathematical capabilities (on both GSM8K and the Hungarian mathematics exam). We speculate that smaller models may require more data to activate their corresponding abilities during the SFT stage.

2.2 Human Evaluations

In this section we conducted an assessment of the model’s conversational abilities, considering aspects to ensure its effectiveness and safety. We have compiled a collection of open-source evaluation datasets from the community, such as alpaca-eval, Belle-eval , and MT-bench. Additionally, we have established our own helpful and harmless evaluation dataset by gathering and constructing data of varying difficulty levels, for the purpose of comprehensively assessing the conversational abilities of chat models.

However, whether it is a public evaluation set or a self-built evaluation set, the evaluation results are strongly influenced by the assessment criteria and the design of the prompt. Our internal evaluation results may be unfair to other models, making it difficult to accurately represent the true capability level of our model. Therefore, here we only present external evaluation results to demonstrate the current conversational abilities of our chat model. We consider: (1). AlapcaEval111https://tatsu-lab.github.io/alpaca_eval/ , which is designed to assess the English conversation capabilities of models by comparing the responses of a specified model to reference replies from Davinci003 in order to calculate a win-rate; (2). LMSys222https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard Chatbot Arena, which showcases the responses of different models through a dialogue platform, then asks users to make selections based on their preferences, then computes the Elo score; (3). SuperClue333https://www.superclueai.com/, on the other hand, is a leaderboard aimed at comprehensively evaluating the Chinese language capabilities of models.

Tab. 5 presents the performance results of Yi-34B-Chat in the three third-party evaluations we consider, with the cutoff date for the results being December 21, 2023. The data demonstrates that, although there is still a gap compared to GPT-4, our model exhibits proficient bilingual (Chinese and English) dialogue capabilities and aligns well with user preferences. Additional comparative results of various models are accessible for review on the official website.

We further demonstrate the data quality by comparing the speed of preference increase during data scaling. As is shown in Fig. 5, when compared with UltraChat and its cleaned version UltraChat 200K, we see a clear tendency of performance improvements when scaling up the Yi data.

Capability Extension

In this section, we discuss our post-training methods to extend the Yi base model to 200K long-context, equip it with visual understanding capability, and enhance the 6B model by depth upsacaling.

Our long-context solution consists of a continual pretraining and a finetuning phase, both are light-weight. We hold the basic hypothesis that the potential of utilizing information anywhere within the 200K input context is already exist in the base model (same as Fu et al. 22), the continue pretraining phase “unlocks” such capability, evidenced by a strong performance on Needle-in-a-Haystack test, then the finetuning phase further adapt the style of response to follow human instruction and preference.

Continue Pretraining We continue pretrain the full-attention model using sequence parallelism and distributed attention. This is to say, we do not use any sparse or linear attention, but use a brute force implementation of the full attention. We continue pretrain the Yi 6B/ 34B base model on the data mixture of (1). original pretraining data, as is introduced in section 2; (2). length-upsampled long-context data, where the long documents are mostly from books; (3). multi-document question-answering synthetic data, where we construct QA pairs where the answer contains a recitation of the related paragraph before the answer. Our data approach mostly follows the data engineering practice in Fu et al. and Yu et al. . We continue pretrain the model on 5B tokens with 4M batch size, which translate to 100 optimization steps. Aligning with the concurrent work from Fu et al. , we observe that such light-weight continue pretraining is already able to enable a strong performance on Needle-in-a-Haystack test, as we will show in Figure 6.

Supervised Finetuning We mix our short-context SFT data with long-context document question-answering data. We use model-assisted automated methods (i.e., synthetic data) to construct document QA. Specifically, we randomly concatenate multiple documents into a sequence, sample one or more paragraphs from the long sequence, and ask a chat model to construct question and answer pairs based on the sampled paragraph. One important detail is recitation and rephrasing: before giving the answer, we ask the model to recite or paraphrase the original paragraph. This data format encourages the model’s retrieval behavior and consequently discourages the hallucination behavior: given a question, the model is more likely to use the information within the input to construct the answer, rather than use its internal knowledge, which may be related but inaccurate. Our finetuned model is deployed at www.wanzhi01.com, and we encourage the readers to try it out.

The performance of the 200K models is shown in figure. 6 and table 6. Specifically, Figure 6 shows the famous Needle-in-a-Haystack test of Yi-34B-200K, though we tend to view that this level of retrieval is relatively easy for long-context LLMs. Table 6 shows that our context scaling does not significantly influence the short-context generic capability.

2 Vision-Language

In the burgeoning field of multimodal research, the integration of image understanding capabilities into large language models has become increasingly viable. Drawing inspiration from the open-sourced LLaVA , we present Yi Vision Language (Yi-VL) models, i.e., Yi-VL-6B and Yi-VL-34B, based on Yi-6B-Chat and Yi-34B-Chat language models. The architecture of Yi-VL models, as illustrated in Figure 7, comprises three primary modules. The Vision Transformer (ViT), used for image encoding, is initialized with CLIP ViT-H/14 model . A Projection Module, designed to align image features with text feature spcae, consists of a two-layer Multilayer Perceptron (MLP) with layer normalizations. Finally, the large language model, initialized with the Yi-Chat models, demonstrating exceptional proficiency in understanding and generating both English and Chinese. To enhance the performance of Yi-VL models in bilingual multimodal understanding and generation, we leverage a rich dataset of bilingual image-text pairs.

Yi-VL models undergo a three-stage training process:

we train the parameters of the ViT and the projection module using an image resolution of $224^{2}$ . The training leverages a substantial dataset comprising $100$ million image-text pairs from LAION-400M . The primary objective is to enhance the ViT’s knowledge acquisition within our specified architecture and to achieve better alignment between the ViT and the LLM.

we scale up the image resolution of ViT to $448^{2}$ , aiming to further boost the model’s capability for discerning intricate visual details. The dataset used in this stage includes $20$ million image-text pairs derived from LAION-400M. Additionally, we incorporate around $4.8$ million image-text pairsn from diverse sources, e.g., CLLaVA , LLaVAR , Flickr , VQAv2 , RefCOCO , Visual7w and so on.

the parameters of the entire model are trained. The primary goal is to enhance the model’s proficiency in multimodal chat interactions, thereby endowing it with the ability to seamlessly integrate and interpret visual and linguistic inputs. To this end, the training dataset encompasses a diverse range of sources, totalling approximately $1$ million image-text pairs, including GQA , VizWiz VQA , TextCaps , OCR-VQA , Visual Genome , ShareGPT4V and so on. To ensure data balancing, we impose a cap on the maximum data contribution from any single source, restricting it to no more than $50,000$ pairs.

In Stage 1 and 2, we set the global batch size, the learning rate, the gradient clip and the number of epoch to $4096$ , $1$ e $-4$ , $0.5$ and $1$ , respectively. In Stage 3, these parameters are adjusted to $256$ , $2$ e $-5$ , $1.0$ and $2$ . The training consumes $128$ NVIDIA A100 GPUs. The total training time amounted to approximately $3$ days for Yi-VL-6B and $10$ days for Yi-VL-34B.

Table 7 shows the MMMU test set leaderboard by Yi-VL’s release. We note that this area is currently actively under research, aligning with the community’s advances, we will continuously improve the update Yi-VL’s performance.

3 Depth Upscaling

Recent studies on scaling laws have underscored the predictable improvement in model performance with increases in computational budget, model size, and data size. Yet, identifying the most effective distribution of resources between model and data sizes upon expanding the computational budget remains a formidable challenge in the field of scaling laws. Additionally, research conducted by DeepSeek-AI et al. has highlighted that the allocation of an increased computational budget towards model scaling should be proportional to the quality of the data available. In light of these insights, we propose a novel approach aimed at dynamically adjusting the resource allocation between data and model sizes through a series of staged training processes. This strategy iteratively fine-tunes the balance between data characteristics and model size according to scaling laws, enhancing both model training efficiency and performance.

Method Following the methodology outlined by Kim et al. , our goal is to upscale our Yi-6B base model, which has 32 layers, to a 9B model named the Yi-9B base model, featuring 48 layers, by duplicating the original 16 middle layers 12-28. Depth up-scaling involves expanding the base model’s depth and subsequently continuing the pretraining phase for the enhanced model.

Our investigations reveal that the decision on which layers to replicate could be informed by evaluating the cosine similarity scores between the inputs and outputs of each layer. Such an approach allows for targeted model scaling without necessitating additional pretraining, leading only to minimal performance impacts. This minimal impact on performance is attributed to the high cosine similarity, approaching one, between the inputs and outputs of the duplicated layers, as evidenced in Figure 8. This observation suggests that the replication of these layers does not significantly alter the output logits produced by the original model. This method ensures the efficient scaling of the model by optimizing its architecture based on the internal processing dynamics of its layers.

Continual Training The dataset is composed of approximately 800 billion tokens across two stages, with around 70% having been recently collected and carefully selected. We have enhanced the code coverage in the final stage to improve code performance.

To optimize the training process, we maintain a constant learning rate of 3e-5, and adopt a strategic approach to gradually increase the batch size from 4M tokens whenever the model’s loss plateaued. This incremental adjustment of the batch size, alongside maintaining all other parameters in alignment with the established Yi-6B base model configuration, was instrumental in navigating the challenges of training at scale.

The effectiveness of these strategies is demonstrated in Table 8, which details the Yi-9B base model’s performance across a variety of benchmarks, including common sense, reasoning, knowledge, coding, and mathematics. It underscores the competitive advantages of Yi-9B base model in specific domains, illustrating the efficacy of our methodology in enhancing model performance by optimally adjusting the interplay between data characteristics and model size.

Final Discussions

In this report, we discuss the full-stack development of the Yi language model family. Yi-34B achieves GPT-3.5 matching performance and is deployable (thank to the 4/8-bit quantization) on consumer-grade devices, making it an ideal model for local deployment.

The key takeaways from the Yi pretraining procedure are about data quantity and quality: (1). training the model on a larger amount of data than the Chinchilla optimal delivers clear and consistent performance gain, which we highly recommend for all pretraining teams. Our model is trained on 3.1T tokens, yet we belive with larger amount of data, we can continue improve the model performance (i.e., the model have not saturated at 3.1T); (2). when it comes to the pretraining data quality, we believe the most critical two factors are the source of the data (e.g., whether the text is produced for professional usage or for casual social media posting) and the details of the data cleaning (e.g., the strength of filtering and deduplication). Since data cleaning is a very complicated pipeline and it is extremely difficult to conduct extensive grid-search styled optimizations, our current solution may still have room for improvements.

The key takeaways from the Yi finetuning procedure is to heavily iterate on a small amount of data ( $\leq$ 10K), case by case, over multiple iterations, directly by the machine learning engineer, and improved from real user feedback. This approach clearly deviates from the instruction-scaling approach, initially introduced by the FLAN series then followed by the UltraChat series .

As is demonstrated by our current results, the reasoning capability, which we view as the core capability for real-world deployment of language models, is strongly correlated with model scale when the amount of pretraining data is fixed. We believe that given our current results, continuing to scale up model parameters using thoroughly optimized data will lead to even stronger frontier models in our upcoming next versions.

Appendix A Author List and Contributions

Our team members contribute to the development of Yi from the following perspectives:

We list our team members in alphabetical order. All authors contributed equally to this work.