Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matthew Dixon, Ronen Eldan, Victor Fragoso, Jianfeng Gao, Mei Gao, Min Gao, Amit Garg, Allie Del Giorno, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Wenxiang Hu, Jamie Huynh, Dan Iter, Sam Ade Jacobs, Mojan Javaheripi, Xin Jin, Nikos Karampatziakis, Piero Kauffmann, Mahoud Khademi, Dongwoo Kim, Young Jin Kim, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Xihui Lin, Zeqi Lin, Ce Liu, Liyuan Liu, Mengchen Liu, Weishung Liu, Xiaodong Liu, Chong Luo, Piyush Madan, Ali Mahmoudzadeh, David Majercak, Matt Mazzola, Caio César Teodoro Mendes, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Liliang Ren, Gustavo de Rosa, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Yelong Shen, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Praneetha Vaddamanu, Chunyu Wang, Guanhua Wang, Lijuan Wang, Shuohang Wang, Xin Wang, Yu Wang, Rachel Ward, Wen Wen, Philipp Witte, Haiping Wu, Xiaoxia Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Jilong Xue, Sonali Yadav, Fan Yang, Jianwei Yang, Yifan Yang, Ziyi Yang, Donghan Yu, Lu Yuan, Chenruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, Xiren Zhou
Introduction
The striking progress of AI in the last few years can be largely attributed to major efforts throughout the world towards scaling up to ever-larger models and datasets. Large Language Models (LLMs) have steadily increased in size from a mere billion parameters just five years ago (GPT-2 had 1.5 billion parameters [RWC+19]) to a trillion parameters today. The impetus for this effort originates in the seemingly predictable improvement one obtains by training large models, the so-called scaling laws [KMH+20, HBM+22, MRB+23]. However, these laws assume a “fixed” data source. This assumption is now significantly disrupted by the existence of frontier LLMs themselves, which allow us to interact with data in novel ways. In our previous works on the phi models [GZA+23, LBE+23, JBA+23] it was shown that a combination of LLM-based filtering of web data and LLM-created synthetic data enables performance in smaller language models that was typically seen only in much larger models. For example, our previous model trained on this data recipe, phi-2 (2.7B parameters), matched the performance of models times larger trained on regular data. In this report we present a new model, phi-3-mini (3.8B parameters), trained for 3.3T tokens on larger and more advanced versions of the datasets used in phi-2. With its small size, phi-3-mini can easily be run locally on a modern phone (see Figure 1), yet it achieves a quality that seems on par with models such as Mixtral 8x7B [JSR+24] and GPT-3.5.
Technical Specifications
The phi-3-mini model is a transformer decoder architecture [VSP+17], with a default context length of 4K. We also introduce a long context version via LongRope [DZZ+24] that extends the context length to 128K, called phi-3-mini-128K.
To best benefit the open source community, phi-3-mini is built upon a block structure similar to that of Llama-2 [TLI+23] and uses the same tokenizer with a vocabulary size of 32064 (we remove BoS tokens and add some additional tokens for the chat template). This means that all packages developed for the Llama-2 family of models can be directly adapted to phi-3-mini. The model uses hidden dimension, heads and layers. We trained using bfloat16 for a total of 3.3T tokens. The model is already chat-finetuned, and the chat template is as follows:
<|user|>\n Question <|end|>\n <|assistant|>
The phi-3-small model (7B parameters) leverages the tiktoken tokenizer (for better multilingual tokenization) with a vocabulary size of 100352 and has default context length . It follows the standard decoder architecture of a 7B model class, having layers and a hidden size of . To minimize KV cache footprint, the model also leverages grouped-query attention, with queries sharing key. Moreover, phi-3-small uses alternating layers of dense attention and a novel blocksparse attention to further optimize KV cache savings while maintaining long context retrieval performance. An additional 10% multilingual data was also used for this model.
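As a rough illustration of the chat template quoted above, the following sketch wraps a single user question in the special tokens. It assumes that the markers after `<|user|>` and `<|end|>` denote newline characters and that the spaces around "Question" in the template are not literal. The function name `format_phi3_prompt` is hypothetical and not part of any released package.

```python
def format_phi3_prompt(question: str) -> str:
    """Wrap one user question in the chat template quoted in the report.

    Assumption: each tag is followed by a newline and no extra spaces
    are inserted around the question text.
    """
    return f"<|user|>\n{question}<|end|>\n<|assistant|>"


if __name__ == "__main__":
    # The model is expected to continue generating after the assistant tag.
    print(format_phi3_prompt("How many layers does a transformer decoder have?"))
```

In practice the tokenizer shipped with a model usually provides its own template, which should be preferred over a hand-written string like this one.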
Thanks to its small size, phi-3-mini can be quantized to 4 bits so that it only occupies 1.8GB of memory. We tested the quantized model by deploying phi-3-mini on an iPhone 14 with an A16 Bionic chip, running natively on-device and fully offline, achieving more than tokens per second.
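The memory figure above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts only the weights at 4 bits each and ignores quantization metadata (scales, zero points), activations, and the KV cache, so it is an approximation rather than the report's own accounting. The helper name `quantized_weight_bytes` is hypothetical.

```python
def quantized_weight_bytes(num_params: float, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in bytes."""
    return num_params * bits_per_weight / 8


# 3.8B parameters at 4 bits per weight (both figures from the text).
size_bytes = quantized_weight_bytes(3.8e9, 4)

size_gb = size_bytes / 1e9       # decimal gigabytes
size_gib = size_bytes / 2**30    # binary gibibytes

if __name__ == "__main__":
    print(f"{size_gb:.2f} GB, {size_gib:.2f} GiB")
```

Under these assumptions the weights come to about 1.9 decimal gigabytes, or roughly 1.77 GiB, which is in line with the reported 1.8GB.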
Training Methodology.
We follow the sequence of works initiated in “Textbooks Are All You Need” [GZA+23], which utilize high quality training data to improve the performance of small language models and deviate from the standard scaling laws. In this work we show that such a method makes it possible to reach the level of highly capable models such as GPT-3.5 or Mixtral with only 3.8B total parameters (while Mixtral has 45B total parameters, for example). Our training data consists of heavily filtered web data (according to the “educational level”) from various open internet sources, as well as synthetic LLM-generated data. Pre-training is performed in two disjoint and sequential phases. Phase-1 consists mostly of web sources aimed at teaching the model general knowledge and language understanding. Phase-2 merges even more heavily filtered web data (a subset used in Phase-1) with some synthetic data that teach the model logical reasoning and various niche skills.
Data Optimal Regime.
Unlike prior works that train language models in either the “compute optimal regime” [HBM+22] or the “over-train regime”, we mainly focus on the quality of data for a given scale. (Just like for “compute optimal regime”, we use the term “optimal” in an aspirational sense for “data optimal regime”. We are not implying that we actually found the provably “optimal” data mixture for a given scale.) We try to calibrate the training data to be closer to the “data optimal” regime for small models. In particular, we filter the web data to contain the correct level of “knowledge” and keep more web pages that could potentially improve the “reasoning ability” of the model. As an example, the result of a game in the Premier League on a particular day might be good training data for frontier models, but we need to remove such information to leave more model capacity for “reasoning” in mini-size models. We compare our approach with Llama-2 in Figure 2.
To test our data on larger models, we also trained phi-3-medium, a model with 14B parameters using the same tokenizer and architecture as phi-3-mini, and trained on the same data for slightly more epochs (4.8T tokens total as for phi-3-small). The model has 40 heads and 40 layers, with embedding dimension 5120. We observe that some benchmarks improve much less from 7B to 14B than they do from 3.8B to 7B, perhaps indicating that our data mixture needs further work to be in the “data optimal regime” for a 14B-parameter model. We are still actively investigating some of those benchmarks (including a regression on HumanEval), hence the numbers for phi-3-medium should be considered as a “preview”.
Post-training.
Post-training of phi-3-mini went through two stages: supervised finetuning (SFT) and direct preference optimization (DPO). SFT leverages highly curated high-quality data across diverse domains, e.g., math, coding, reasoning, conversation, model identity, and safety. The SFT data mix starts with English-only examples. DPO data covers chat format data, reasoning, and responsible AI (RAI) efforts. We use DPO to steer the model away from unwanted behavior, by using those outputs as “rejected” responses. Besides improvement in math, coding, reasoning, robustness, and safety, post-training transforms a language model into an AI assistant that users can efficiently and safely interact with.
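To make the role of “rejected” responses concrete, here is a minimal sketch of the standard DPO objective for a single preference pair. The report does not state the exact loss variant or hyperparameters used, so the function name `dpo_loss`, the argument names, and the default `beta=0.1` are illustrative assumptions. Inputs are summed token log-probabilities of the chosen and rejected responses under the trained policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where margin is how much more the
    policy prefers the chosen response over the rejected one, relative to
    the reference model. beta=0.1 is an illustrative value only.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) equals softplus(-x); computed in a stable form.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Policy already favors the chosen response: loss falls below log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss shrinks as the policy assigns relatively more probability to the chosen response and less to the rejected one, which is the mechanism by which unwanted outputs are pushed down.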
As part of the post-training process, we developed a long context version of phi-3-mini with context length limit enlarged to 128K instead of 4K. Across the board, the 128K model quality is on par with the 4K length version, while being able to handle long context tasks. Long context extension has been done in two stages, including long context mid-training and long-short mixed post-training with both SFT and DPO.
Academic benchmarks
On the next page we report the results for phi-3-mini on standard open-source benchmarks measuring the model’s reasoning ability (both common sense reasoning and logical reasoning). We compare to phi-2 [JBA+23], Mistral-7b-v0.1 [JSM+23], Mixtral-8x7b [JSR+24], Gemma 7B [TMH+24], Llama-3-instruct-8b [AI23], and GPT-3.5. All the reported numbers are produced with the exact same pipeline to ensure that the numbers are comparable. These numbers might differ from other published numbers due to slightly different choices in the evaluation. As is now standard, we use few-shot prompts to evaluate the models, at temperature . The prompts and number of shots are part of a Microsoft internal tool to evaluate language models, and in particular we did no optimization to the pipeline for the phi-3 models. (For example, we found that using ## before the Question can lead to a noticeable improvement to phi-3-mini’s results across many benchmarks, but we did not make such changes in the prompts.) The number of k-shot examples is listed per benchmark. An example of a 2-shot prompt is described in Appendix A.
Benchmark | Phi-3-mini 3.8b | Phi-3-small 7b (preview) | Phi-3-medium 14b (preview) | Phi-2 2.7b | Mistral 7b | Gemma 7b | Llama-3-In 8b | Mixtral 8x7b | GPT-3.5 version 1106
MMLU (5-Shot) [HBK+21] | 68.8 | 75.3 | 78.2 | 56.3 | 61.7 | 63.6 | 66.0 | 68.4 | 71.4
HellaSwag (5-Shot) [ZHB+19] | 76.7 | 78.7 | 83.0 | 53.6 | 58.5 | 49.8 | 69.5 | 70.4 | 78.8
ANLI (7-Shot) [NWD+20] | 52.8 | 55.0 | 58.7 | 42.5 | 47.1 | 48.7 | 54.8 | 55.2 | 58.1
GSM-8K (0-Shot; CoT) [CKB+21] | 82.5 | 88.9 | 90.3 | 61.1 | 46.4 | 59.8 | 77.4 | 64.7 | 78.1
MedQA (2-Shot) [JPO+20] | 53.8 | 58.2 | 69.4 | 40.9 | 49.6 | 50.0 | 58.9 | 62.2 | 63.4
AGIEval (0-Shot) [ZCG+23] | 37.5 | 45.0 | 48.4 | 29.8 | 35.1 | 42.1 | 42.0 | 45.2 | 48.4
TriviaQA (5-Shot) [JCWZ17] | 64.0 | 59.1 | 75.6 | 45.2 | 72.3 | 75.2 | 73.6 | 82.2 | 85.8
Arc-C (10-Shot) [CCE+18] | 84.9 | 90.7 | 91.0 | 75.9 | 78.6 | 78.3 | 80.5 | 87.3 | 87.4
Arc-E (10-Shot) [CCE+18] | 94.6 | 97.1 | 97.8 | 88.5 | 90.6 | 91.4 | 92.3 | 95.6 | 96.3
PIQA (5-Shot) [BZGC19] | 84.2 | 87.8 | 87.7 | 60.2 | 77.7 | 78.1 | 77.1 | 86.0 | 86.6
SociQA (5-Shot) [BZGC19] | 76.6 | 79.0 | 80.2 | 68.3 | 74.6 | 65.5 | 73.2 | 75.9 | 68.3
BigBench-Hard (0-Shot) [SRR+22, SSS+22] | 71.7 | 75.0 | 81.3 | 59.4 | 57.3 | 59.6 | 68.9 | 69.7 | 68.32
WinoGrande (5-Shot) [SLBBC19] | 70.8 | 82.5 | 81.4 | 54.7 | 54.2 | 55.6 | 58.0 | 62.0 | 68.8
OpenBookQA (10-Shot) [MCKS18] | 83.2 | 88.4 | 87.2 | 73.6 | 79.8 | 78.6 | 81.6 | 85.8 | 86.0
BoolQ (0-Shot) [CLC+19] | 77.2 | 82.9 | 86.6 | – | 72.2 | 66.0 | 78.3 | 77.6 | 79.1
CommonSenseQA (10-Shot) [THLB19] | 80.2 | 80.3 | 82.6 | 69.3 | 72.6 | 76.2 | 73.6 | 78.1 | 79.6
TruthfulQA (10-Shot) [LHE22] | 65.0 | 68.7 | 75.7 | – | 52.1 | 53.0 | 62.0 | 60.1 | 85.8
HumanEval (0-Shot) [CTJ+21] | 59.1 | 59.1 | 55.5 | 47.0 | 28.0 | 34.1 | 60.4 | 37.8 | 62.2
MBPP (3-Shot) [AON+21] | 70.0 | 71.4 | 74.5 | 60.6 | 50.8 | 51.5 | 65.3 | 60.2 | 77.8
Average | 71.2 | 74.9 | 78.2 | – | 61.0 | 62.0 | 68.0 | 69.9 | 75.3
GPQA (2-Shot; CoT) [RHS+23] | 32.8 | 34.3 | – | – | – | – | – | – | 29.0
MT Bench (2 round ave.) [ZCS+23] | 8.38 | 8.70 | 8.91 | – | – | – | – | – | 8.35
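The "Average" row can be reproduced from the table, assuming it is an unweighted mean over the 19 benchmarks listed above it (MMLU through MBPP), with GPQA and MT Bench excluded. That reading is an inference from the numbers, not something the report states. The sketch below checks it for the phi-3-mini column; the variable names are hypothetical.

```python
from statistics import mean

# phi-3-mini scores copied from the table, MMLU through MBPP, in row order.
phi3_mini_scores = [
    68.8, 76.7, 52.8, 82.5, 53.8, 37.5, 64.0, 84.9, 94.6, 84.2,
    76.6, 71.7, 70.8, 83.2, 77.2, 80.2, 65.0, 59.1, 70.0,
]

# Unweighted mean, rounded to one decimal place as in the table.
phi3_mini_average = round(mean(phi3_mini_scores), 1)

if __name__ == "__main__":
    print(phi3_mini_average)  # compare with the reported 71.2
```

The same calculation applied to the phi-3-small column also matches its reported average of 74.9, which supports the unweighted-mean reading.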
Safety
Phi-3-mini was developed in accordance with Microsoft’s responsible AI principles. The overall approach consisted of safety alignment in post-training, red-teaming, automated testing and evaluations across dozens of RAI harm categories. Helpfulness and harmlessness preference datasets [BJN+22, JLD+23] with modifications inspired by [BSA+24] and multiple in-house generated datasets were leveraged to address the RAI harm categories in safety post-training. An independent red team at Microsoft iteratively examined phi-3-mini to further identify areas of improvement during the post-training process. Based on their feedback, we curated additional datasets tailored to address their insights, thereby refining the post-training dataset. This process resulted in a significant decrease of harmful response rates, as shown in Figure 3.
Table 1 shows the results of in-house RAI benchmarks for phi-3-mini-4k and phi-3-mini-128k compared to phi-2 [JBA+23], Mistral-7b-v0.1 [JSM+23], Gemma 7b [TMH+24], and Llama-3-instruct-8b [AI23]. This benchmark utilized GPT-4 to simulate multi-turn conversations in five different categories and to evaluate the model responses. Ungroundedness, scored between 0 (fully grounded) and 4 (not grounded), measures whether the information in a response is based on a given prompt. In other categories, responses were evaluated in terms of the severity of harmfulness from 0 (no harm) to 7 (extreme harm), and the defect rates (DR-x) were computed as the percentage of samples with a severity score greater than or equal to x.
Category | Phi-3-Mini-4k 3.8b | Phi-3-Mini-128k 3.8b | Phi-2 2.7b | Mistral 7b | Gemma 7b | Llama-3-In 8b
Ungroundedness | 0.603 | 0.637 | 1.481 | 0.935 | 0.679 | 0.328
Intellectual Property (DR-1) | 23.95% | 21.50% | 24.00% | 56.20% | 38.33% | 37.30%
Harmful Content Continuation (DR-3) | 0.75% | 1.08% | 2.93% | 2.58% | 1.28% | 1.30%
Harmful Content Summarization (DR-3) | 10.00% | 10.20% | 14.35% | 22.33% | 10.33% | 8.20%
Jailbreak (DR-1) | 12.29% | 12.57% | 15.00% | 15.57% | 11.43% | 13.00%
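The defect rate described above can be sketched as a simple thresholded percentage, with the threshold taken from the row labels in Table 1 (DR-1, DR-3). The function name `defect_rate` and the sample severity scores below are made up for illustration and are not data from the report.

```python
from typing import Sequence


def defect_rate(severities: Sequence[int], threshold: int) -> float:
    """Percentage of samples whose severity score is >= threshold.

    Severity is assumed to be an integer from 0 (no harm) to 7 (extreme harm).
    """
    if not severities:
        raise ValueError("at least one severity score is required")
    defects = sum(1 for score in severities if score >= threshold)
    return 100.0 * defects / len(severities)


if __name__ == "__main__":
    # Hypothetical scores for eight simulated conversations.
    sample_scores = [0, 0, 1, 3, 5, 0, 2, 0]
    print(defect_rate(sample_scores, threshold=1))  # DR-1 on the toy data
    print(defect_rate(sample_scores, threshold=3))  # DR-3 on the toy data
```

A lower threshold counts milder responses as defects, so on the same set of scores DR-1 is always at least as large as DR-3.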
Weakness
In terms of LLM capabilities, while the phi-3-mini model achieves a similar level of language understanding and reasoning ability as much larger models, it is still fundamentally limited by its size for certain tasks. The model simply does not have the capacity to store much “factual knowledge”, which can be seen, for example, in its low performance on TriviaQA. However, we believe such weakness can be resolved by augmentation with a search engine. We show an example using the HuggingFace default Chat-UI with phi-3-mini in Figure 4. Another weakness related to the model’s capacity is that we mostly restricted the language to English. Exploring multilingual capabilities for Small Language Models is an important next step, with some initial promising results on phi-3-small by including more multilingual data.
Despite our diligent RAI efforts, as with most LLMs, there remain challenges around factual inaccuracies (or hallucinations), reproduction or amplification of biases, inappropriate content generation, and safety issues. The use of carefully curated training data, targeted post-training, and improvements from red-teaming insights significantly mitigate these issues across all dimensions. However, there is significant work ahead to fully address these challenges.