Stable LM 2 1.6B Technical Report

Marco Bellagente, Jonathan Tow, Dakota Mahan, Duy Phung, Maksym Zhuravinskyi, Reshinth Adithyan, James Baicoianu, Ben Brooks, Nathan Cooper, Ashish Datta, Meng Lee, Emad Mostaque, Michael Pieler, Nikhil Pinnaparju, Paulo Rocha, Harry Saini, Hannah Teufel, Niccolo Zanichelli, Carlos Riquelme

cs.CL stat.ML

Introduction

Following the development of the Transformer architecture, a remarkable number of proprietary and open-source large language models have been trained and deployed. While countless new ideas and artifacts are announced weekly or daily, some key aspects remain opaque, especially around the most powerful models. Often, the training data is not disclosed. This poses a fundamental challenge at a time when society demands transparency as it faces a brand-new disruptive technology that is hard to inspect and audit. In this report, we explain in a reproducible manner how to train a modest-size but state-of-the-art language model. All the data we used is public (see Table 1), and training required around 92k GPU hours, worth around \$322k on popular cloud providers (assuming \$3.5 per GPU hour). We hope this work contributes to the open AI community and helps set the standard for upcoming transparent models.

This report is organized as follows: Section 2 details the process of pre-training Stable LM 2 1.6B. We devote Section 3 to fine-tuning and human preference alignment. Section 4 presents model evaluations on standard downstream benchmarks. Section 5 outlines compiling Stable LM 2 and running inference with it on several edge devices. We consider several follow-up directions in Section 6. Carbon emissions and societal impacts related to the training and release of Stable LM 2 are covered in Section 7. Finally, Section 8 concludes and summarizes this work.

Model Pre-Training

The first stage in training large language models (LLMs) focuses on learning to predict the next token in a sequence using a vast and diverse array of data sources. We refer to this stage as pre-training. It enables models to build general-purpose internal representations suitable for basic language capabilities and even more advanced generation and comprehension tasks. In fact, it has been hypothesized that the majority of model knowledge and capabilities are learned during pre-training. In this section, we introduce the design principles and ablations that influenced the creation of our training set, as well as details about the model architecture and training procedure. While many similar reports exist for other cutting-edge models, they often omit critical details, such as the particular data sources, sampling weights, or the complete set of ablations they performed. As a result, the open-source community cannot accurately reproduce these models. In contrast, we present a fully transparent log of our model training details. We are confident that researchers and practitioners will find valuable insights in this comprehensive account.

Data

Model performance is affected by pre-training data design decisions, including both the source selection and the sampling weights. Our approach is close to that of prior work: the majority of our training data consists of sources utilized in the training of other LLMs, such as RefinedWeb, subsets of the Pile, RedPajama, and the Stack. We supplement these with OpenWebText, OpenWebMath, and parts of CulturaX. While inspecting randomly sampled documents from the mC4 subset of CulturaX, we encountered frequent HTML boilerplate and decided to remove this portion entirely, keeping only the OSCAR subset. Additionally, we incorporate FanFics (https://huggingface.co/datasets/atom-in-the-universe/fanfics-10k-50k), a subset of 50k documents from fanfiction.net selected by lowest perplexity scores according to a KenLM model (https://huggingface.co/edugp/kenlm). Finally, following prior work, we restructure several raw datasets into rich fixed forms for downstream tasks such as summarization, question-answering, sentiment analysis, etc., and add instruction data from prior work; we collectively call the aggregate Restruct-v1. The list of sources used in Restruct-v1 is made available in Tab. 11. Stable LM’s training set is comprised entirely of open-source datasets compatible with commercial usage, most of which are hosted on the Hugging Face Hub. The only dataset not hosted on the Hub, Restruct-v1, can easily be reproduced by following the approaches and prompt templates provided in the work it builds on.

Carefully selecting the mixture proportions of the various data domains is critical, particularly with respect to the amount of non-English and code data. We trained several models on different mixes and evaluated them on downstream benchmarks to pick our final dataset. The full set of ablations is available in Appendix A, together with rationales for selecting these particular mixes. Based on the results of the ablations, we trained our model with the mix shown in Table 1, which accounts for around 2 trillion tokens. Note that it includes multilingual data in German (DE), Spanish (ES), French (FR), Italian (IT), Dutch (NL), and Portuguese (PT). The split of our dataset across different domains is visualized in Fig. 1.

Tokenizer

We use Arcade100k, a BPE tokenizer extended from OpenAI’s tiktoken.cl100k_base to include special tokens for code and digit-split handling. The vocabulary consists of 100,289 tokens and is padded to the nearest multiple of 64 (100,352) during training to meet the recommended Tensor Core alignment on NVIDIA A100 devices. In preliminary experiments, we did not observe statistically significant deviations in downstream natural language task performance when compared against a hyperparameter-matched model trained with the smaller GPT-NeoX tokenizer. Increased compression rates for code and non-English languages influenced our decision to use Arcade100k over existing tokenizers.
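The padding arithmetic can be checked with a few lines of Python. This is a minimal sketch; the helper name `pad_to_multiple` is introduced here for illustration and is not taken from the training code.

```python
def pad_to_multiple(size: int, multiple: int) -> int:
    """Round `size` up to the next multiple (unchanged if already aligned)."""
    return ((size + multiple - 1) // multiple) * multiple


vocab_size = 100_289                       # Arcade100k vocabulary size quoted above
padded_vocab = pad_to_multiple(vocab_size, 64)

print(padded_vocab)                        # 100352
print(padded_vocab - vocab_size)           # 63 unused padding rows in the embedding matrix
```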

Architecture and Training Layout

The model is a causal decoder-only transformer similar in design to the LLaMA architecture. Table 3 shows some of the key architectural details. The main differences with respect to LLaMA are the following:

Position Embeddings. Rotary Position Embeddings are applied to the first $25\%$ of head embedding dimensions for improved throughput.

Normalization. We use LayerNorm with learned bias terms as opposed to RMSNorm.

Biases. We remove all bias terms from the feed-forward networks and multi-head self-attention layers, except for the biases of the key, query, and value projections.
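To make the partial rotary embedding concrete, the sketch below rotates only the first quarter of a single head vector and passes the remaining dimensions through unchanged. It is an illustration under stated assumptions, not the model's implementation: the head dimension of 64, the frequency base of 10000, and the pairing of dimension `j` with dimension `j + half` are all assumptions made here.

```python
import math


def apply_partial_rotary(vec, position, rotary_fraction=0.25, base=10000.0):
    """Rotate the first `rotary_fraction` of a head vector; leave the rest as is."""
    rot_dim = int(len(vec) * rotary_fraction)
    rot_dim -= rot_dim % 2                 # rotations act on pairs of dimensions
    half = rot_dim // 2
    out = list(vec)
    for j in range(half):
        # One frequency per pair; dimension j is paired with dimension j + half
        # (an assumed pairing convention).
        theta = position * base ** (-2.0 * j / rot_dim)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[j], vec[j + half]
        out[j] = x1 * c - x2 * s
        out[j + half] = x1 * s + x2 * c
    return out


head_dim = 64                              # hypothetical head dimension
query = [float(i + 1) for i in range(head_dim)]
rotated = apply_partial_rotary(query, position=5)

# Only the first 16 of 64 dimensions carry positional information.
print(rotated[16:] == query[16:])          # True
```

Because three quarters of each head skip the rotation entirely, the per-token positional work shrinks accordingly, which is the throughput motivation stated above.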

Stable LM 2 1.6B is trained on 64 Amazon P4d instances comprising 512 NVIDIA A100 (40GB HBM2) GPUs. The size of our model, together with ZeRO stage 1 distributed optimization, eliminates the need for model sharding. Still, different triplets of micro-batch size, gradient accumulation steps, and activation checkpointing granularity lead to different speed metrics. Following recommendations from prior work, we obtain our final configuration by finding the micro-batch size that allows us to completely remove activation checkpointing. We then determine the gradient accumulation steps based on our target batch size and data parallel degree. We employ a batch size of $8,388,608$ tokens, based on the observations in Appendix D. With the setup in Table 3, we achieve $\approx$ 170 TFLOPs/s per device, or $54.5\%$ model FLOPs utilization (MFU). A higher hardware utilization of $\approx$ 200 TFLOPs/s ($64\%$ MFU) can be trivially achieved by decreasing the degree of data parallelism and correspondingly increasing the number of gradient accumulation steps, at the cost of an increased iteration time.
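The batch layout and MFU figures can be reproduced with simple arithmetic. The sketch below assumes one data-parallel rank per GPU (consistent with the absence of model sharding, but not stated explicitly), a sequence length of 4096 tokens, and the A100 dense BF16 peak of 312 TFLOPs/s as the MFU denominator.

```python
tokens_per_batch = 8_388_608     # global batch size in tokens (from the text)
seq_len = 4096                   # training context length
data_parallel_ranks = 512        # assumption: one data-parallel rank per GPU

sequences_per_batch = tokens_per_batch // seq_len
# Product of micro-batch size and gradient accumulation steps on each rank.
sequences_per_rank = sequences_per_batch // data_parallel_ranks

A100_PEAK_TFLOPS = 312.0         # dense BF16 peak of an NVIDIA A100


def mfu(achieved_tflops: float) -> float:
    """Model FLOPs utilization as a fraction of the hardware peak."""
    return achieved_tflops / A100_PEAK_TFLOPS


print(sequences_per_batch)       # 2048
print(sequences_per_rank)        # 4
print(f"{mfu(170):.1%}")         # 54.5%
print(f"{mfu(200):.1%}")         # 64.1%
```

Under these assumptions, each GPU processes four sequences per optimizer step, which can be split between micro-batch size and gradient accumulation in any way that fits in memory without activation checkpointing.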

Learning Rate Scheduler

We propose a new learning rate scheduler. It consists of multiple stages and is designed to favor flexibility in terms of continued pre-training. We begin by linearly increasing the learning rate to its maximum value of $1e-3$ over 9720 steps. This warm-up stage is followed by the main training phase, in which the learning rate is decreased according to Eq. 1:

where $m$ and $M$ are respectively the minimum and maximum learning rates, $i$ is the current step, and $N$ is the total number of steps. The free parameters $\alpha$ and $\beta$ are chosen to enforce the continuity of the scheduler and its derivative at $i=N/4$. We finalize training by linearly cooling the learning rate down to zero over 80k steps, which corresponds to around 670B tokens. The full scheduler is illustrated in Fig. 2; further details and ablations can be found in Appendix B.
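As an illustration of how the two continuity constraints pin down $\alpha$ and $\beta$, the sketch below assumes a specific functional form: a cosine with period $N$ for $i \le N/4$, followed by an inverse-square-root tail $\alpha/\sqrt{i+\beta}$. These forms are assumptions chosen to match the description here and in Appendix B; Eq. 1 in the report remains the authoritative definition. The warm-up and final cooldown stages are omitted.

```python
import math


def hybrid_lr(i, N, M, m=0.0):
    """Cosine decay up to i = N/4, then alpha / sqrt(i + beta) (assumed forms; needs M > m)."""
    i0 = N / 4
    value_at_i0 = m + (M - m) / 2              # cos(pi / 2) = 0 at the switch point
    slope_at_i0 = -(M - m) * math.pi / N       # derivative of the cosine branch at i0
    # For f(i) = alpha / sqrt(i + beta), f'(i) = -f(i) / (2 * (i + beta)).
    # Matching value and slope at i0 gives beta, then alpha.
    beta = -value_at_i0 / (2 * slope_at_i0) - i0
    alpha = value_at_i0 * math.sqrt(i0 + beta)
    if i <= i0:
        return m + (M - m) / 2 * (math.cos(2 * math.pi * i / N) + 1)
    return alpha / math.sqrt(i + beta)


N, M = 1000, 1e-3                              # toy step count, peak LR from the text
print(hybrid_lr(0, N, M))                      # 0.001 at the start of the decay
print(hybrid_lr(N / 4, N, M))                  # half the peak at the switch point
```

The switch happens where the cosine is steepest, so the inverse-square-root branch inherits that slope and then flattens out, which is what makes the tail suitable for continued training.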

Fine-tuning and Alignment

Following pre-training, we further develop the conversational skills of our model via a fine-tuning stage that consists of three main steps: supervised fine-tuning (SFT), direct preference optimization (DPO), and self-knowledge learning. Importantly, we do not use multilingual data at this stage. We now describe each step in detail, and in Section 4 we report the results after all three steps.

The first step is supervised fine-tuning. We fine-tune the pre-trained model on a number of instruction datasets publicly available on the Hugging Face Hub. In particular, we use the following conversational datasets: UltraChat, WizardLM, SlimOrca, ShareGPT, Capybara, Deita, and MetaMathQA. We removed any samples that exceeded eight turns, leading to a total of 826,938 samples.

We train our SFT models for three epochs using a cosine learning rate scheduler. A warm-up phase of $10\%$ of the training duration is employed before reaching the peak learning rate of $5e-5$. We set the global batch size to 512 sequences and pack inputs into sequences of up to 4096 tokens in length.

DPO

Direct Preference Optimization (DPO) has been a fundamental tool in recent strong models such as Zephyr-7B, Neural-Chat-7B, and Tulu-2-DPO-70B. Accordingly, after applying SFT, we aligned the resulting model via DPO. We use two datasets at this stage: UltraFeedback and Intel Orca Pairs. We filter the datasets by removing pairs with tied ranking, pairs with duplicated content, and pairs in which the score for the chosen response is less than eight out of ten. We train the model with DPO following the Zephyr recipe and borrowing most of its hyperparameters, except for $\beta$, which we lower to $0.01$, and the learning rate, which we lower to $4e-6$, both of which helped improve the stability of training and final performance. This stage of training is performed using the Alignment Handbook.
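The three filters can be sketched as a single pass over the preference pairs. This is a minimal illustration: the dictionary keys (`prompt`, `chosen`, `rejected`, `chosen_score`, `rejected_score`) are hypothetical and do not reflect the actual dataset schemas, and "duplicated content" is interpreted here as an exact repeat of the same prompt and responses, which is an assumption.

```python
def filter_preference_pairs(pairs, min_chosen_score=8.0):
    """Drop tied pairs, low-scoring chosen responses, and exact duplicates."""
    seen = set()
    kept = []
    for pair in pairs:
        if pair["chosen_score"] == pair["rejected_score"]:   # tied ranking
            continue
        if pair["chosen_score"] < min_chosen_score:          # chosen score below 8/10
            continue
        key = (pair["prompt"], pair["chosen"], pair["rejected"])
        if key in seen:                                      # duplicated content
            continue
        seen.add(key)
        kept.append(pair)
    return kept


examples = [
    {"prompt": "p1", "chosen": "a", "rejected": "b", "chosen_score": 9, "rejected_score": 4},
    {"prompt": "p1", "chosen": "a", "rejected": "b", "chosen_score": 9, "rejected_score": 4},
    {"prompt": "p2", "chosen": "a", "rejected": "b", "chosen_score": 7, "rejected_score": 3},
    {"prompt": "p3", "chosen": "a", "rejected": "b", "chosen_score": 8, "rejected_score": 8},
]
print(len(filter_preference_pairs(examples)))                # 1
```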

Self-Knowledge

The model produced by the DPO stage does not know who created it, or even what limitations a language model has. To remedy this, we draw on the data generation method of Reinforcement Learning from Contrast Distillation (RLCD) and the training method of Conditioned Reinforcement Learning Fine-Tuning (C-RLFT), which we apply to self-knowledge training.

To generate the initial prompts, we use the base model to generate 10k random first messages to a language model with no duplicates. To generate continuations, we use a few-shot prompt with self-knowledge in the previous chat turns as a positive completion. For the negative prompt, we simply sample from the prompt with no additional prompting or few-shot turns.

We train with unpacked examples for 6 epochs using a batch size of 256, a warmup stage for 100 steps to a maximum LR of 3e-6 followed by a cosine decay to zero. The positive prompts are trained in the same way as the SFT step, while the negative prompts are trained with a negative token instead of the Assistant token.

Experimental Results and Benchmarks

This section presents the main experimental results for Stable LM 2 1.6B. We compare with similarly-sized open-source models showing strong improvements on various tasks, including multilingual capabilities in Spanish, German, French, Italian, Portuguese, and Dutch. For context, we also provide comparisons with much larger models. We split our experiments into three sets: few- and zero-shot evaluations in English (as commonly done in the Hugging Face Open LLM leaderboard), multilingual evaluations, and conversational evaluations.

First, we assess the 0-shot and few-shot capabilities of Stable LM 2 by evaluating our model over popular benchmarks and comparing results against similarly sized open-source pre-trained models. Table 4 presents model evaluations in English. Our results cover the 6 benchmarks from the Open LLM Leaderboard: ARC-Challenge 25-shot (ARC), HellaSwag 10-shot (HS), MMLU 5-shot (MMLU), TruthfulQA 0-shot (TQA), WinoGrande 5-shot (Wino) and GSM8K 5-shot (GSM). Also, as Stable LM 2 is a general-purpose foundational model, we further assess natural language understanding capabilities by evaluating English and machine-translated versions of LAMBADA. All evaluations are performed with the Language Model Evaluation Harness framework (https://github.com/Stability-AI/lm-evaluation-harness/tree/stablelm-2/multilingual-bench). As shown in Table 4, Stable LM 2 1.6B (stablelm-2-1-6b) outperforms other base models by a significant margin. Similarly, the instruction-tuned version (stablelm-2-1-6b-dpo) improves on Microsoft’s Phi-1.5 by two average points while lagging behind the larger Phi-2.0 on few-shot accuracy. Performance versus Google’s Gemma 2B (2.5B parameters) is also remarkable.

Multilingual Evaluations

We assess knowledge and reasoning in the multilingual setting for non-English languages seen during pre-training by evaluating on ChatGPT-translated versions of ARC, HS, TQA, and MMLU. In addition, we test next-word prediction capabilities using machine-translated LAMBADA datasets. After manual inspection by native speakers, we deemed the existing machine translations (https://huggingface.co/datasets/EleutherAI/lambada_openai) too noisy to draw accurate performance signals from. We instead evaluate multilingual next-word prediction with new translations, which are made available for researchers (https://huggingface.co/datasets/marcob/lambada_multilingual).

The zero-shot results are presented in Tab. 5 and highlight Stable LM 2’s superior performance compared to models even twice its size.

MT Benchmark Evaluations

Finally, we also test the conversational skills of our model on the popular multi-turn benchmark MT-Bench. The results are provided in Fig. 3 and Tab. 7. While lagging behind much more powerful models such as Mistral 7B Instruct v0.2 (more than 4x the size of Stable LM 2), our model beats Phi-2, Gemma 2B, and TinyLLaMA 1.1B in chat performance by a wide margin, despite the larger size of the first two.

Inference and Quantization

This model represents a substantial leap towards making advanced generation capabilities available directly on-device without the computational overhead of larger models. We believe this model strikes a great balance between remarkable efficiency and effectiveness in inference tasks when paired with inference frameworks and quantization methods. As part of our release, we provide quantized weights of stablelm-2-1-6b supported on popular inference libraries such as llama.cpp (https://github.com/ggerganov/llama.cpp), Apple MLX, and Intel OpenVINO (https://github.com/openvinotoolkit/openvino).

We make available quantization files for various models and formats to support easier integration with different inference frameworks, including:

GGUF: two 4-bit quantized models (Q4_0 and Q4_1) and a 5-bit quantized model (Q5_K_M)

INT4 for OpenVINO quantized with Intel’s Neural Network Compression Framework (NNCF)

These quantization files can be found in the model’s Hugging Face repository for the convenience of developers and researchers working with our models. We aim to facilitate smoother deployment experiences across various deep-learning framework ecosystems by offering a range of quantization formats.
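For a rough sense of on-disk size, the sketch below estimates the footprint of the two 4-bit GGUF formats from their block layouts: Q4_0 stores 32 weights as 4-bit values plus one 16-bit scale per block, and Q4_1 adds a second 16-bit value per block. The estimate assumes every weight is stored in the block format and uses a rounded parameter count of 1.6B, so actual file sizes will differ; Q5_K_M mixes several block types and is left out.

```python
def bits_per_weight(block_size: int, payload_bits: int, overhead_bytes: int) -> float:
    """Effective bits per weight for a block-quantized format."""
    return (block_size * payload_bits + overhead_bytes * 8) / block_size


def approx_size_gb(n_params: float, bpw: float) -> float:
    """Approximate size in gigabytes if all parameters use the same format."""
    return n_params * bpw / 8 / 1e9


q4_0_bpw = bits_per_weight(32, 4, 2)     # 4.5 bits per weight
q4_1_bpw = bits_per_weight(32, 4, 4)     # 5.0 bits per weight

n_params = 1.6e9                         # rounded parameter count (assumption)
print(q4_0_bpw, round(approx_size_gb(n_params, q4_0_bpw), 2))
print(q4_1_bpw, round(approx_size_gb(n_params, q4_1_bpw), 2))
```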

Throughput

In Tab. 8 we provide throughput numbers obtained from our model running on consumer-grade devices, together with the system environments used. Our initial runs show that lower precision yields almost 2x the throughput. Note that these figures are provided as a reference: they are not the result of rigorous benchmarking but are intended to give users practical insight into what they can expect in terms of performance on commonly used devices. Likewise, as lower precision quantization is expected to reduce the model’s performance, we encourage researchers and developers to assess the potential degradation in real-world scenarios.

Future Work

There are a number of research avenues we would like to explore to further improve the model:

Data. In this work, we focused on publicly available data. In particular, most of the data comes from web-crawled content, as is common for most models. This data is known to contain many low-quality documents that can potentially harm training. We believe there is significant potential in smart filtering, re-writing, and synthetic data generation with strong models.

Hallucination Mitigation. Language models are prone to generating incorrect or misleading information, and small language models are even more prone to doing so. Finding reliable ways to detect hallucinations in these models will unlock new use cases in areas that are sensitive to hallucinations.

Long Contexts and Retrieval. The ability to retrieve information across long context windows is essential for applications such as chat models or dataset integration. Accordingly, in Appendix C, we explore the current capabilities and limitations of Stable LM 2 1.6B on the Needle-in-a-haystack task. Going forward, we plan to build further upon this work and to extend our models to context lengths beyond 4k.

Conditional Computation. Small models are often capacity-constrained; that is, with the current training approaches, they lack the capacity to process and exploit all of the training data. Recently, ideas such as Mixture of Experts have been successfully applied to take a dense model and extend it to contain more parameters that are selectively applied to certain inputs (for instance, via sparse upcycling). Importantly, if each token selects only one expert, the overall inference FLOPs do not significantly change. Applying this to the Stable LM 2 1.6B model is a natural extension we will investigate.
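The claim about top-1 routing in the last item can be illustrated by counting parameters in a single feed-forward layer. The sketch below uses hypothetical dimensions (hidden size 2048, feed-forward size 5632, 8 experts) and assumes a gated feed-forward block with three weight matrices and a linear router; none of these values are taken from the report. Per-token FLOPs are roughly proportional to the number of active parameters.

```python
def ffn_params(d_model: int, d_ff: int, n_matrices: int = 3) -> int:
    """Weights in one feed-forward block (three matrices assumes a gated FFN)."""
    return n_matrices * d_model * d_ff


def moe_layer_params(d_model: int, d_ff: int, n_experts: int, top_k: int = 1):
    """Return (total, active-per-token) parameters of a mixture-of-experts layer."""
    router = d_model * n_experts
    total = n_experts * ffn_params(d_model, d_ff) + router
    active = top_k * ffn_params(d_model, d_ff) + router
    return total, active


d_model, d_ff, n_experts = 2048, 5632, 8      # hypothetical dimensions
dense = ffn_params(d_model, d_ff)
total, active = moe_layer_params(d_model, d_ff, n_experts)

print(f"capacity: {total / dense:.2f}x dense")            # roughly 8x the parameters
print(f"per-token compute: {active / dense:.4f}x dense")  # about 1x, router overhead only
```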

Environmental and Societal Impact

The training of Stable LM 2 has consumed energy with associated carbon dioxide emissions. In line with prior work, we report our carbon footprint based on the formula

where the Power Usage Effectiveness (PUE) is set to 1.1. We trained Stable LM 2 for $\approx 92,000$ GPU-hours, giving a total energy consumption of 30 MWh given our average power usage. Using the US national average carbon intensity factor of 0.385 kg CO2eq/kWh, we estimate total emissions of 11 tCO2eq.
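The reported figures can be related to each other with a short calculation. The sketch below assumes the usual accounting in which total energy equals GPU-hours times average per-GPU power times PUE; the implied average power is derived here from the rounded totals and is not a number stated in the report.

```python
gpu_hours = 92_000
pue = 1.1                          # Power Usage Effectiveness
total_energy_mwh = 30.0            # reported total (rounded)
carbon_intensity = 0.385           # kg CO2eq per kWh, US national average

# MWh -> kWh, times kg/kWh, then kg -> tonnes.
emissions_t = total_energy_mwh * 1000 * carbon_intensity / 1000

# Assumed relation: energy = GPU-hours * average power * PUE.
implied_avg_power_w = total_energy_mwh * 1e6 / (gpu_hours * pue)

print(f"{emissions_t:.2f} tCO2eq")                 # 11.55
print(f"{implied_avg_power_w:.0f} W per GPU")      # 296
```

With the rounded 30 MWh input, the estimate comes out near 11.5 tCO2eq, consistent with the reported figure of about 11 tCO2eq up to rounding of the inputs.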

Societal Impact

Stability AI is committed to releasing open models to help improve access to foundational AI technology. Open access to model weights enables researchers to inspect models for suitability and vulnerabilities, test the effectiveness of different optimization strategies, and correct for biases observed in the model. To that end, this model is released under an open non-commercial license. However, open release can introduce challenges in assessing the societal impact of a model. For example, Stability AI does not have direct visibility into downstream applications of Stable LM 2 1.6B, the distribution of applications by sector, or the distribution of model usage by geography. Since the model is released with a noncommercial license, we expect a limited number of applications outside of fine-tuning or evaluation interfaces and a limited number of third parties affected by the model.

We will continue to monitor openly released fine-tuned models to understand the extent of fine-tuning research or development activity that uses Stable LM 2 1.6B as a base model, including the evaluation results of these derivative models.

Conclusion

In this report, we introduced Stable LM 2 1.6B, a compact decoder-only language model trained on multilingual datasets. It fluidly handles up to seven languages: English, Spanish, German, Italian, French, Portuguese, and Dutch. To ensure that the community can reproduce our run, we detail all datasets used during training, including the exact data mix, and our newly designed learning rate schedule. We also conduct extensive model evaluations and comparisons with other similarly-sized models, demonstrating Stable LM 2 1.6B’s exceptional performance. Finally, we profile the model on common edge computing architectures. We hope this report contributes to the improvement of, and further research on, small language models.

Acknowledgments

We thank our awesome MLOps team members, particularly Richard Vencu, for the support provided. We also thank Christian Laforte, Cedric Wagrez, and Jerry Chi for their feedback, useful ideas, and comments.

Appendix A Data Ablations

How to select the optimal training mix for pre-training from a set of sources is an open problem. Tuning weights based on downstream tasks can be extremely costly and bears the risk of overfitting on particular tasks as well as exploiting data leakage. While the computationally cheap, principled approach introduced in prior work is promising, we found it delivers sub-optimal weights when the data sources are highly imbalanced and have different information content (e.g., large web sources vs. curated datasets). Furthermore, multilingual evaluations introduce a more explicit dependence on the tokenizer, with increased noise due to the lack of high-quality, non-machine-translated benchmarks. We therefore aim to find general guiding principles that are expected to hold against changes in the tokenizer or in the absence of high-quality benchmarks for each data category, while keeping the cost of these ablations low.

We trained a set of 1B models on a total of 100B tokens sampled according to Tab. 9.

Evaluations of each model on English and non-English benchmarks are shown in Tab. 10. We observe the following trends:

Contrary to prior findings, we find less conclusive evidence that code can be used as a neutral filler for training data, as increasing the amount of code leads to a degradation in language modeling performance. We leave a more thorough exploration of this to future work, including math and reasoning tasks that may benefit from a higher proportion of code data.

Performance on non-English benchmarks increases for each language by adding any amount of data in that same language. However, this increase saturates very fast and we observe only modest gains beyond $6\%$ . We hypothesise that this might be due to the lack of high-quality, structured data sources in non-English languages, which we only sample from the web.

Upsampling Academic and Books sources improves downstream performance over the control run, particularly in natural language understanding.

Appendix B Scheduler Ablations

Prior work shows that the widely adopted cosine learning rate decay achieves optimal performance only when performing a full cosine period, forcing practitioners to fix the number of steps beforehand. As multi-epoch training performs well for LLMs, and larger and cleaner data sources are made accessible by the open-source community, it becomes more and more important to alleviate this limitation. To this end, we experimented with the "inverse square root" (rsqrt) learning rate scheduler (Eq. 2),

where $i$ is the current iteration and $k$ the number of warmup steps. As the scheduler is strictly convex and reaches zero asymptotically, it can be used to train for infinite iterations.
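A common form of the rsqrt schedule can be sketched as follows. The exact expression is an assumption (linear warm-up to the peak over $k$ steps, then decay proportional to $1/\sqrt{i}$); Eq. 2 in the report is the authoritative definition.

```python
import math


def rsqrt_lr(i: int, k: int, peak: float) -> float:
    """Linear warm-up over k >= 1 steps, then decay proportional to 1 / sqrt(i)."""
    if i < k:
        return peak * i / k
    return peak * math.sqrt(k / i)


k, peak = 100, 1e-3                    # toy values for illustration
for step in (50, 100, 400, 10_000):
    print(step, rsqrt_lr(step, k, peak))
```

The decay never reaches zero at any finite step, so training can be stopped or extended at any point without re-planning the schedule, unlike a cosine decay tied to a fixed step budget.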

However, in standard scenarios, of which we show an example in Fig. 4, rsqrt consistently underperforms cosine. We make the comparison by decaying the learning rate to 0 in both cases, with a linear cool down for the last $10\%$ of the steps for the rsqrt scheduler.

A sharp difference between the two schedulers is how they start from the peak learning rate: with a flat slope, negative second derivative for cosine and a large slope, positive second derivative for rsqrt. This allows rsqrt to escape the high learning rate region quickly, which we identified as the main cause of the performance gap. We then get the best of both worlds by combining both schedulers into a single function that is again the cosine for the first section, and smoothly switches to rsqrt, as described in Eq. 1.

Finally, we experimented with two different versions of our scheduler, whose performance we show in the right panels of Fig. 4: hybrid (hyb) corresponds to our final scheduler defined in Eq. 1, while for hybrid2 (hyb2) we double the cosine period and move the turning point to half of the training steps instead of a fourth. We attribute the difference in performance between the two versions to the different average learning rates. For the set of experiments in Fig. 4, integrating the scheduler over the training volume, we obtain an average learning rate of $1.5e-4$, $1.13e-4$, and $1.67e-4$ for cosine, hyb, and hyb2, respectively. We leave to future work a proof that schedulers with the same average value lead to statistically equivalent models under mild conditions, such as monotonicity and fixed endpoints.

Appendix C Evaluation of Performance Across Different Sized Context Windows

The Needle-in-a-haystack test is commonly used to assess the retrieval capabilities of LLMs across different context window sizes. Following the established methodology, a fact ("the needle") is embedded within a context of unrelated essays ("the haystack"). The model under evaluation is tasked to answer a question requiring the retrieval of the needle from the context. The evaluation is carried out systematically by placing the needle at 35 different depths within the context and comparing the results across context window sizes from 500 to 4000 tokens. Notably, these window sizes are specifically chosen to assess the performance within our trained context size of 4096 tokens. An AI judge, typically GPT-4, scores the answers from 1 to 10 based on whether or not the fact was correctly retrieved.
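The evaluation grid can be sketched in a few lines. The helper names are hypothetical, and both the even spacing of the 35 depths and the 500-token step between context sizes are assumptions; the text only fixes the number of depths and the range of window sizes.

```python
def insert_needle(haystack_tokens, needle_tokens, depth_fraction):
    """Place the needle at a relative depth in [0, 1] within the haystack."""
    cut = round(len(haystack_tokens) * depth_fraction)
    return haystack_tokens[:cut] + needle_tokens + haystack_tokens[cut:]


def evaluation_grid(n_depths=35, context_sizes=range(500, 4001, 500)):
    """All (context size, depth) pairs; evenly spaced depths are an assumption."""
    depths = [d / (n_depths - 1) for d in range(n_depths)]
    return [(size, depth) for size in context_sizes for depth in depths]


grid = evaluation_grid()
print(len(grid))                                         # 280 (8 sizes x 35 depths)
print(insert_needle(list("abcdef"), ["<needle>"], 0.5))  # needle lands mid-context
```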

While performing the evaluation, we noticed that the order of the context was not fixed or controlled by a seed, leading in some cases to significantly different scores. For this reason, we show in Fig. 5 the average results over 10 different runs, together with the corresponding standard deviation. For the base Stable LM 2 model, we employ a prompt from prior work, whereas for our fine-tuned version we use the official repository as is. Averaging the scores over the evaluated grid, we observe a slight degradation from $\approx 7.7$ for our base model to $\approx 6.8$ for the fine-tuned version. The different prompt structures make it hard to directly compare the results. In future work, we will investigate how different elements, such as the distribution of the document lengths and the attention mask, correlate with this behavior.

Appendix D Global Batch Size

The role of the batch size in stochastic optimization convergence can be illustrated as follows. The loss $L({\theta})$ of a model parameterized by $\theta$ over a training set $D$ is estimated by independently drawing random samples from $D$ to form a batch $B$. Thus, the loss gradient used to update the model’s parameters is given by

$$\nabla_{\theta} L_{B}(\theta) = \frac{1}{|B|} \sum_{x \in B} \nabla_{\theta} L_{x}(\theta),$$

where $L_{x}(\theta)$ denotes the loss on a single sample $x$, and its variance scales with $1/|B|$. In other words, a bigger batch has a lower variance and thus provides a more accurate estimate of the gradient. A more accurate gradient suggests that we should increase the learning rate accordingly to converge faster; however, in practice, this is far from trivial and requires special handling such as fine-tuning the learning rate scheduler or performing per-layer updates.
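The $1/|B|$ scaling can be checked numerically with a toy experiment. The sketch below is an illustration only: each per-sample "gradient" is modeled as an independent draw from a unit-variance Gaussian, a simplification that stands in for real per-sample gradients.

```python
import random
import statistics


def batch_mean_variance(batch_size: int, n_trials: int = 2000, seed: int = 0) -> float:
    """Empirical variance of the mean of `batch_size` unit-variance samples."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(batch_size))
        for _ in range(n_trials)
    ]
    return statistics.pvariance(means)


for size in (8, 32, 128):
    # Each 4x increase in batch size should cut the variance by roughly 4x.
    print(size, batch_mean_variance(size))
```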

To choose the global batch size for Stable LM 2, we make the following assumptions:

1. We do not change the learning rate based on the batch size.

2. The batch size can be approximately scaled at no cost.

With assumption 1, we are, in principle, giving up on potential further gains from tuning the step size to the gradient noise. In practice, we notice that using a large learning rate at the boundary between convergence and divergence of our ablations is more than enough to compensate for this. Assumption 2 follows from a combination of theoretical results, hardware optimizations, and increased training data availability. Prior work has empirically demonstrated that multiple epochs of repeated data are as good as fresh data in LLM pre-training, while the computational overhead of increasing data parallel workers is minimal.

The data volume available for pre-training is consistently increasing through larger datasets, and there are promising results of multi-epoch training. Hence, we explore training with larger batch sizes, which requires more overall training tokens to reach the final loss but significantly decreases training time. To determine our final batch size, we start by training a baseline with a batch $B_{0}$ of 4M tokens on $T_{0}=50B$ total training tokens. Subsequently, we train new models from scratch with batches $B_{i}=$ 8M, 12M, and 16M, increasing $T_{i}$ tokens until the same final loss is reached. We do this with an rsqrt scheduler as the number of required training steps is unknown beforehand.

In Fig. 6, we show the number of iterations $T_{i}/B_{i}$ required to match the baseline loss. Assuming for simplicity that the iteration time is independent of the batch size, this gives an upper bound on the speed-up we can achieve by increasing the batch.

In the regime considered, we observe that an increase in batch size leads to a decrease in the number of iterations required to match the baseline loss all the way to the biggest batch considered, $16.7$M tokens, which speeds up training by a factor of 2. However, to achieve an equal loss, we require $T_{i}=96B$ training tokens, which is a factor 1.96 increase compared to the baseline. Therefore, to train Stable LM 2, we opt for a global batch of $8,388,608$ tokens, which with the layout of Tab. 3 offers the best compromise between the decrease in training time and the additional training tokens required.

Appendix E Restructured Pre-training Sources

While many sources are restructured in prior work, in this work we have only considered those listed in Tab. 11 with commercially viable licenses, such as BIGPATENT, CLOTH, SciTLDR, TriviaQA, WordNet, and WikiHow. We also restructure additional sources with similar licensing compatibility and list them below.