Tülu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, Hannaneh Hajishirzi

Introduction

“Just as the camel shares its burdens with others in the caravan, the wise share their insights to lighten the load of ignorance.” – Proverb generated by Tülu 3.

Post-training, the collection of techniques that includes instruction tuning, reinforcement learning from human feedback, and other types of finetuning, has become a crucial step in building frontier language models (OpenAI2024; Anthropic2024), yet developments in these techniques are frequently not accompanied by open resources and recipes. Fully open-source counterparts (e.g., Tülu 2 (ivison2023camels) and Zephyr-β (tunstall2023zephyr)) often rely on simpler-to-implement and cheaper pipelines and have become outdated on many metrics.

To close the gap between open and closed post-training, we introduce Tülu 3, a family of open state-of-the-art post-trained models, alongside all of the data, training recipes, code, infrastructure, and evaluation framework. Integrating partial details from proprietary methods with novel techniques and established academic research, Tülu 3 pushes the boundaries of research in post-training. The advancements of Tülu 3 are attributed to three components: Tülu 3 Data, new permissively licensed training datasets targeting core skills; Tülu 3 Eval, an evaluation suite and tools to establish clear performance goals and guide improvement through training stages; and Tülu 3 Recipe, an advanced multi-stage training pipeline incorporating new algorithmic advancements in reinforcement learning, cutting-edge infrastructure, and rigorous experimentation to optimize data mixes, methods, and parameters across various training stages.

To build Tülu 3, we identify a set of core skills to improve after training (e.g., reasoning, math, coding, safety, precise instruction following, and knowledge recall) and build an evaluation framework to establish clear performance goals and guide model improvement over a selection of development and unseen tasks. Tülu 3 benefits significantly from leveraging publicly available open data, generating diverse, skill-specific synthetic data at various training stages, and aggressively decontaminating this data against our evaluation suite.
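As an illustration of what decontaminating training data against an evaluation suite can look like, the following is a minimal sketch based on word-level n-gram overlap. The function names, the 8-gram window, and the 0.5 threshold are assumptions made for this example; this section does not specify the actual matching procedure used for Tülu 3.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 8) -> Set[NGram]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_prompts: Iterable[str], n: int = 8) -> Set[NGram]:
    """Collect every n-gram that appears in any evaluation prompt."""
    index: Set[NGram] = set()
    for prompt in eval_prompts:
        index |= ngrams(prompt, n)
    return index


def overlap_ratio(example: str, eval_index: Set[NGram], n: int = 8) -> float:
    """Fraction of the example's n-grams that also occur in the eval index."""
    grams = ngrams(example, n)
    if not grams:
        return 0.0  # example shorter than n tokens: nothing to match
    return len(grams & eval_index) / len(grams)


def decontaminate(
    train_examples: Iterable[str],
    eval_prompts: Iterable[str],
    n: int = 8,           # assumed window size
    threshold: float = 0.5,  # assumed cutoff; tune per dataset
) -> List[str]:
    """Keep only training examples whose overlap with the eval set is below threshold."""
    eval_index = build_eval_index(eval_prompts, n)
    return [
        ex for ex in train_examples
        if overlap_ratio(ex, eval_index, n) < threshold
    ]
```

In this sketch a training prompt that wraps an evaluation question almost verbatim is dropped, while unrelated prompts are kept. Exact-match or embedding-based checks would be alternative designs with different precision and recall trade-offs.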

The Tülu 3 training recipe involves multiple stages, with each stage building upon the previous model and focusing on a different type of data: prompt-completion instances for supervised finetuning, preferences for preference tuning, or verifiable rewards for reinforcement learning. Our methodology facilitates identifying skill deficiencies and refining the data mix, methods, and parameters, ensuring balanced performance on core skills across the training process. Through rigorous, principled experimentation, we determine the best data mix for supervised finetuning, resulting in the Tülu 3 SFT checkpoint. Leveraging recent advances in preference tuning, we then train a model on carefully curated on-policy preference data obtained by comparing Tülu 3 SFT completions against outputs from other language models. Furthermore, we introduce a new final finetuning stage, Reinforcement Learning with Verifiable Rewards (RLVR), which employs a novel RL objective tailored to enhance specific skills with verifiable answers, such as mathematics and precise instruction following.
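To make the idea of a verifiable reward concrete, the sketch below shows one way such a reward could be computed for a math prompt and for a simple instruction-following constraint. It is only an illustration: the answer-extraction heuristic (take the last number in the completion), the word-limit constraint, and the `alpha` parameter are assumptions for this example, and the RL objective itself is not shown.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last number in the completion, or None if there is none.

    Assumed heuristic: the final numeric token is the model's answer.
    Thousands separators such as "1,000" are normalized by dropping commas.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return matches[-1] if matches else None


def math_reward(completion: str, ground_truth: str, alpha: float = 1.0) -> float:
    """Return alpha if the extracted answer equals the ground truth, else 0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        correct = float(answer) == float(ground_truth)
    except ValueError:
        return 0.0  # non-numeric ground truth cannot be verified here
    return alpha if correct else 0.0


def word_limit_reward(completion: str, max_words: int, alpha: float = 1.0) -> float:
    """Return alpha if the completion respects a hypothetical word limit, else 0."""
    return alpha if len(completion.split()) <= max_words else 0.0
```

The key property is that the reward is binary and computed by a deterministic check rather than by a learned reward model, so it applies only to prompts whose answers or constraints can be checked programmatically.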

Our best-performing recipe yields Tülu 3 models that outperform the state-of-the-art post-trained open-weight models of the same size, such as Llama 3.1 Instruct (dubey2024llama), Qwen2.5 Instruct (qwen2.5), or Mistral-Instruct (mistral2024ministraux), and at the large 70B size Tülu 3 matches the offerings of closed providers such as Claude 3.5 Haiku and GPT-4o mini.

In summary, Tülu 3 represents a family of state-of-the-art open language models, featuring a modern post-training framework with fully open-source data (Tülu 3 Data), evaluation (Tülu 3 Eval), training code (Tülu 3 Code), and development recipes (Tülu 3 Recipe). Key contributions from the development of Tülu 3 include:

Extensive guidance and tooling for evaluation, decontamination, and recipe design,

Scaled, new synthetic instruction datasets,

Scaling preference data with on-policy generations,

Reinforcement learning with verifiable rewards, an RL-based method in which the model receives a reward only if its completions are verified to be correct,

Advanced infrastructure, details, and code to facilitate the successful implementation of large models.
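The on-policy preference data listed among the contributions above can be pictured with a small sketch: completions from the policy model and from other models are scored, and the best and worst become a (chosen, rejected) pair. The `Candidate` and `PreferencePair` types, the numeric `score` field (for instance from an external judge), and the best-versus-worst selection rule are all assumptions for illustration, not the pipeline described in the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    model: str    # which model produced the completion (hypothetical field)
    text: str     # the completion itself
    score: float  # assumed quality score, e.g. from an external judge


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_preference_pair(
    prompt: str, candidates: List[Candidate]
) -> Optional[PreferencePair]:
    """Pair the highest-scored completion with the lowest-scored one.

    Returns None when fewer than two candidates exist or when all scores
    tie, since a tie carries no preference signal.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    best, worst = ranked[0], ranked[-1]
    if best.score == worst.score:
        return None
    return PreferencePair(prompt=prompt, chosen=best.text, rejected=worst.text)
```

Including a completion from the current SFT model among the candidates is what would make such data on-policy in this sketch; the other candidates would come from different language models.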