TinyLlama: An Open-Source Small Language Model

Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, Wei Lu

Introduction

Recent progress in natural language processing (NLP) has been largely propelled by scaling up language model sizes (Brown et al.,, 2020; Chowdhery et al.,, 2022; Touvron et al., 2023a, ; Touvron et al., 2023b, ). Large Language Models (LLMs) pre-trained on extensive text corpora have demonstrated their effectiveness on a wide range of tasks (OpenAI,, 2023; Touvron et al., 2023b, ). Some empirical studies demonstrated emergent abilities in LLMs, abilities that may only manifest in models with a sufficiently large number of parameters, such as few-shot prompting (Brown et al.,, 2020) and chain-of-thought reasoning (Wei et al.,, 2022). Other studies focus on modeling the scaling behavior of LLMs (Kaplan et al.,, 2020; Hoffmann et al.,, 2022). Hoffmann et al., (2022) suggest that, to train a compute-optimal model, the size of the model and the amount of training data should be increased at the same rate. This provides a guideline on how to optimally select the model size and allocate the amount of training data when the compute budget is fixed.

Although these works show a clear preference on large models, the potential of training smaller models with larger dataset remains under-explored. Instead of training compute-optimal language models, Touvron et al., 2023a highlight the importance of the inference budget, instead of focusing solely on training compute-optimal language models. Inference-optimal language models aim for optimal performance within specific inference constraints This is achieved by training models with more tokens than what is recommended by the scaling law (Hoffmann et al.,, 2022). Touvron et al., 2023a demonstrates that smaller models, when trained with more data, can match or even outperform their larger counterparts. Also, Thaddée, (2023) suggest that existing scaling laws (Hoffmann et al.,, 2022) may not predict accurately in situations where smaller models are trained for longer periods.

Motivated by these new findings, this work focuses on exploring the behavior of smaller models when trained with a significantly larger number of tokens than what is suggested by the scaling law (Hoffmann et al.,, 2022). Specifically, we train a Transformer decoder-only model (Vaswani et al.,, 2017) with 1.1B parameters using approximately 3 trillion tokens. To our knowledge, this is the first attempt to train a model with 1B parameters using such a large amount of data. Following the same architecture and tokenizer as Llama 2 (Touvron et al., 2023b, ), we name our model TinyLlama. TinyLlama shows competitive performance compared to existing open-source language models of similar sizes. Specifically, TinyLlama surpasses both OPT-1.3B (Zhang et al.,, 2022) and Pythia-1.4B (Biderman et al.,, 2023) in various downstream tasks.

Our TinyLlama is open-source, aimed at improving accessibility for researchers in language model research. We believe its excellent performance and compact size make it an attractive platform for researchers and practitioners in language model research.

Pretraining

This section describes how we pre-trained TinyLlama. First, we introduce the details of the pre-training corpus and the data sampling method. Next, we elaborate on the model architecture and the hyperparameters used during pretraining.

Our main objective is to make the pre-training process effective and reproducible. We adopt a mixture of natural language data and code data to pre-train TinyLlama, sourcing natural language data from SlimPajama (Soboleva et al.,, 2023) and code data from Starcoderdata (Li et al.,, 2023). We adopt Llama’s tokenizer (Touvron et al., 2023a, ) to process the data.

This is a large open-source corpus created for training language models based on RedPajama (Together Computer,, 2023). The original RedPajama corpus is an open-source research effort aimed at reproducing Llama’s pretraining data (Touvron et al., 2023a, ) containing over 1.2 trillion tokens. The SlimPajama was derived by cleaning and deduplicating the original RedPajama.

Starcoderdata

This dataset was collected to train StarCoder (Li et al.,, 2023), a powerful open-source large code language model. It comprises approximately 250 billion tokens across 86 programming languages. In addition to code, it also includes GitHub issues and text-code pairs that involve natural languages. To avoid data duplication, we remove the GitHub subset of the SlimPajama and only sample code data from the Starcoderdata.

After combining these two corpora, we have approximately 950 billion tokens for pre-training in total. TinyLlama is trained on these tokens for approximately three epochs, as observed by Muennighoff et al., (2023), where training on data repeated for up to four epochs results in minimal performance degradation compared to using unique data. During training, we sample the natural language data to achieve a ratio of around 7:3 between natural language data and code data.

2 Architecture

We adopt a similar model architecture to Llama 2 (Touvron et al., 2023b, ). We use a Transformer architecture based on Vaswani et al., (2017) with the following details:

We use RoPE (Rotary Positional Embedding) (Su et al.,, 2021) to inject positional information into our model. RoPE is a widely adopted method recently used by many mainstream large language models, such as PaLM (Anil et al.,, 2023), Llama (Touvron et al., 2023a, ), and Qwen (Bai et al.,, 2023).

RMSNorm

In pre-normalization, to attain a more stable training, we normalize the input before each transformer sub-layer. In addition, we apply RMSNorm (Zhang and Sennrich,, 2019) as our normalization technique, which can improve training efficiency.

SwiGLU

Instead of using the traditional ReLU non-linearity, we follow Llama 2 and combine Swish and Gated Linear Unit together, which is referred to as SwiGLU (Shazeer,, 2020), as our activation function in TinyLlama.

Grouped-query Attention

To reduce memory bandwidth overhead and speed up inference, we use grouped-query attention (Ainslie et al.,, 2023) in our model. We have 32 heads for query attention and use 4 groups of key-value heads. With this technique, the model can share key and value representations across multiple heads without sacrificing much performance.

3 Speed Optimizations

During training, our codebase has integrated FSDPhttps://huggingface.co/docs/accelerate/usage_guides/fsdp to leverage multi-GPU and multi-node setups efficiently. This integration is crucial in scaling the training process across multiple computing nodes, which significantly improves the training speed and efficiency.

Flash Attention

Another critical improvement is the integration of Flash Attention 2 (Dao,, 2023), an optimized attention mechanism. The repository also provides fused layernorm, fused cross entropy loss, and fused rotary positional embedding, which together play a pivotal role in boosting computational throughput.

xFormers

We have replaced the fused SwiGLU module from the xFormers (Lefaudeux et al.,, 2022) repository with the original SwiGLU module, further enhancing the efficiency of our codebase. With these features, we can reduce the memory footprint, enabling the 1.1B model to fit within 40GB of GPU RAM.

Performance Analysis and Comparison with Other Models

The incorporation of these elements has propelled our training throughput to 24,000 tokens per second per A100-40G GPU. When compared with other models like Pythia-1.0B (Biderman et al.,, 2023) and MPT-1.3B https://huggingface.co/mosaicml/mpt-1b-redpajama-200b, our codebase demonstrates superior training speed. For instance, the TinyLlama-1.1B model requires only 3,456 A100 GPU hours for 300B tokens, in contrast to Pythia’s 4,830 and MPT’s 7,920 hours. This shows the effectiveness of our optimizations and the potential for substantial time and resource savings in large-scale model training.

4 Training

We build our framework based on lit-gpt.https://github.com/Lightning-AI/lit-gpt In adhering to Llama 2 (Touvron et al., 2023b, ), we employ an autoregressive language modeling objective during the pretraining phase. Consistent with Llama 2’s settings, we utilize the AdamW optimizer (Loshchilov and Hutter,, 2019), setting $\beta_{1}$ at 0.9 and $\beta_{2}$ at 0.95. Additionally, we use a cosine learning rate schedule with maximum learning rate as $4.0\times 10^{-4}$ and minimum learning rate as $4.0\times 10^{-5}$ . We use 2,000 warmup steps to facilitate optimized learning.Due to a bug in the config file, the learning rate did not decrease immediately after warmup and remained at the maximum value for several steps before we fixed this. We set the batch size as 2M tokens. We assign weight decay as 0.1, and use a gradient clipping threshold of 1.0 to regulate the gradient value. We pretrain TinyLlama with 16 A100-40G GPUs in our project.

Results

We evaluate TinyLlama on a wide range of commonsense reasoning and problem-solving tasks and compare it with several existing open-source language models with similar model parameters.

We primarily focus on language models with a decoder-only architecture, comprising approximately 1 billion parameters. Specifically, we compare TinyLlama with OPT-1.3B (Zhang et al.,, 2022), Pythia-1.0B, and Pythia-1.4B (Biderman et al.,, 2023).

Commonsense reasoning tasks

To understand the commonsense reasoning ability of TinyLlama, we consider the following tasks: Hellaswag (Zellers et al.,, 2019), OpenBookQA (Mihaylov et al.,, 2018), WinoGrande (Sakaguchi et al.,, 2021), ARC-Easy and ARC-Challenge (Clark et al.,, 2018), BoolQ (Clark et al.,, 2019), and PIQA (Bisk et al.,, 2020). We adopt the Language Model Evaluation Harness framework (Gao et al.,, 2023) to evaluate the models. Following previous practice (Biderman et al.,, 2023), the models are evaluated in a zero-shot setting on these tasks. The results are presented in Table 2. We notice that TinyLlama outperforms baselines on many of the tasks and obtains the highest averaged scores.

Evolution of performance during training

We tracked the accuracy of TinyLlama on commonsense reasoning benchmarks during its pre-training, as shown in Fig. 2. Generally, the performance of TinyLlama improves with increased computational resources, surpassing the accuracy of Pythia-1.4B in most benchmarks.In our initial dataset preprocessing, we inadvertently over-inserted end-of-sequence (EOS) tokens. This excess of EOS tokens may have negatively affected the model by introducing substantial less meaningful signals into the training data. However, after approximately 2.3T tokens, we removed these repetitive EOS tokens and continued pre-training TinyLlama with our refined data. This rectification likely contributed significantly to the observed sudden improvements in performance on benchmarks such as hellasag, piqa, arc_challenge, and arc_easy during that period.

Problem-solving evaluation

We also evaluate TinyLlama’s problem-solving capabilities using the InstructEval benchmark (Chia et al.,, 2023). This benchmark includes the following tasks:

Massive Multitask Language Understanding (MMLU) (Hendrycks et al.,, 2021): This task is used to measure a model’s world knowledge and problem-solving capabilities across various subjects. We evaluate the models in a 5-shot setting.

BIG-Bench Hard (BBH) (Suzgun et al.,, 2023): This is a subset of 23 challenging tasks from the BIG-Bench benchmark (Srivastava et al.,, 2022) designed to measure a language model’s abilities in complex instruction following. The models are evaluated in a 3-shot setting.

Discrete Reasoning Over Paragraphs (DROP) (Dua et al.,, 2019): This reading comprehension task measures a model’s math reasoning abilities. We evaluate the models in a 3-shot setting.

HumanEval (Zheng et al.,, 2023): This task is used to measure a model’s programming capabilities. The models are evaluated in a zero-shot setting.

The evaluation results are presented in Table 3. We observe that TinyLlama demonstrates better problem-solving skills compared to existing models.

Conclusion

In this paper, we introduce TinyLlama, an open-source, small-scale language model. To promote transparency in the open-source LLM pre-training community, we have released all relevant information, including our pre-training code, all intermediate model checkpoints, and the details of our data processing steps. With its compact architecture and promising performance, TinyLlama can enable end-user applications on mobile devices, and serve as a lightweight platform for testing a wide range of innovative ideas related to language models. We will leverage the rich experience accumulated during the open, live phase of this project and aim to develop improved versions of TinyLlama, equipping it with a diverse array of capabilities to enhance its performance and versatility across various tasks. We will document further findings and detailed results in upcoming reports.

Acknowledgements

We express our gratitude to the open-source community for their strong support during the open, live phase of our research. Special thanks go to Qian Liu, Longxu Dou, Hai Leong Chieu, and Larry Law for their help to our project. This research/project is supported by Ministry of Education, Singapore, under its Academic Research Fund (AcRF) Tier 2 Programme (MOE AcRF Tier 2 Award No.: MOE-T2EP20122-0011), Ministry of Education, Singapore, under its Tier 3 Programme (The Award No.: MOET320200004), the National Research Foundation Singapore and DSO National Laboratories under the AI Singapore Program (AISG Award No: AISG2-RP-2020-016), an AI Singapore PhD Scholarship (AISG Award No: AISG2-PhD-2021-08-007), an SUTD Kick-Starter Project (SKI 2021_03_11), and the grant RS-INSUR-00027-E0901-S00. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the funding agencies.