Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample
“smaller models trained longer will ultimately be cheaper at inference”
This single sentence reframed the scaling debate. Earlier scaling-law work, starting with Kaplan et al., asked how to get the best model for a fixed training compute budget; LLaMA argued that inference cost should be the optimization target. A deployed model is trained once but serves inference requests many times over, so the argument looks obvious in hindsight. The field had spent years optimizing a training-side objective before LLaMA made the switch explicit.
paper7 AI
Jun 30, 2026
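The training-versus-inference argument in the annotation above can be made concrete with back-of-the-envelope arithmetic. The sketch below is not from the paper. It assumes the common approximations of about 6 * N * D FLOPs to train a model with N parameters on D tokens and about 2 * N FLOPs per token at inference, and the two model configurations are hypothetical, chosen only for illustration. It also assumes both models reach the same quality, which is the premise of the quoted sentence.

```python
def training_flops(params: float, train_tokens: float) -> float:
    """Approximate training compute with the 6 * N * D rule of thumb (an assumption)."""
    return 6.0 * params * train_tokens


def inference_flops(params: float, tokens_served: float) -> float:
    """Approximate inference compute as 2 * N FLOPs per token served (an assumption)."""
    return 2.0 * params * tokens_served


def lifetime_flops(params: float, train_tokens: float, tokens_served: float) -> float:
    """Total compute over a model's life: one training run plus all inference."""
    return training_flops(params, train_tokens) + inference_flops(params, tokens_served)


def break_even_tokens_served(small: tuple, large: tuple) -> float:
    """Tokens served at which the smaller model's lifetime compute matches the larger one's.

    Each argument is a (params, train_tokens) pair. Returns 0.0 if the smaller
    model is already cheaper to train.
    """
    small_params, small_tokens = small
    large_params, large_tokens = large
    if small_params >= large_params:
        raise ValueError("'small' must have fewer parameters than 'large'")
    extra_training = (
        training_flops(small_params, small_tokens)
        - training_flops(large_params, large_tokens)
    )
    saving_per_token = 2.0 * (large_params - small_params)
    return max(0.0, extra_training / saving_per_token)


# Hypothetical configurations, assumed to reach the same quality.
SMALL = (7e9, 1.0e12)   # 7B parameters trained on 1T tokens
LARGE = (13e9, 0.5e12)  # 13B parameters trained on 0.5T tokens

if __name__ == "__main__":
    tokens = break_even_tokens_served(SMALL, LARGE)
    print(f"Break-even after about {tokens:.2e} tokens served")
```

Under these assumptions the smaller model costs more to train but recovers the difference after roughly 250 billion served tokens. Past that point every further token widens its advantage.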
More annotations on this paper
“LLaMA-13B outperforms GPT-3 on most benchmarks, despite being 10× smaller”
This result was a watershed moment not because of the model itself but because of what it implied about GPT-3: that at 175B parameters it had been substantially undertrained. The benchmark win came mainly from training longer on more tokens, not from a fundamentally different architecture. It retroactively turned GPT-3 into a proof of concept rather than a ceiling, which recontextualized nearly three years of 'we can't match GPT-3 without massive compute.'
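One way to see the "undertrained" claim is to compare training tokens per parameter. The sketch below uses the parameter and token counts reported in the GPT-3 and LLaMA papers; those figures are not given in the annotation itself, and the helper names are illustrative.

```python
# Parameter and training-token counts as reported in the GPT-3 and LLaMA
# papers (not stated in the annotation above).
MODELS = {
    "GPT-3 175B": {"params": 175e9, "train_tokens": 300e9},
    "LLaMA-13B": {"params": 13e9, "train_tokens": 1.0e12},
}


def tokens_per_parameter(params: float, train_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return train_tokens / params


def report(models: dict) -> list:
    """Return one formatted line per model, in insertion order."""
    lines = []
    for name, spec in models.items():
        ratio = tokens_per_parameter(spec["params"], spec["train_tokens"])
        lines.append(f"{name}: {ratio:.1f} training tokens per parameter")
    return lines


if __name__ == "__main__":
    for line in report(MODELS):
        print(line)
```

By this measure GPT-3 saw fewer than two tokens per parameter, while LLaMA-13B saw more than seventy.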
“only use publicly available data, making our work compatible with open-sourcing”
This constraint shaped the open-source LLM ecosystem. By committing to public data, Meta could release weights with less of the legal exposure that proprietary-data models faced. The decision also meant LLaMA could not draw on proprietary sources, yet it outperformed anyway, which was the more important empirical finding: it suggests that, once the data clears a reasonable quality threshold, training duration matters more than further curation.
“we find that the performance of a 7B model continues to improve even after 1T tokens”
This is often read as contradicting the Chinchilla scaling laws, which put the compute-optimal budget for a 7B model at roughly 130B to 140B tokens. Strictly, Chinchilla describes the best split of a fixed training budget, not the point where a model stops improving, and LLaMA kept improving well past it. The implication, that a training-compute optimum says little about where downstream benchmark performance saturates, has since been observed repeatedly but still isn't fully formalized. Inference-optimal training remains an empirically driven practice, not a theory.
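The Chinchilla figure can be roughly reproduced with the widely used rule of thumb of about 20 training tokens per parameter. That ratio is an approximation of Hoffmann et al.'s result and an assumption here, not a number taken from the LLaMA paper; the function names are illustrative.

```python
# Rule-of-thumb reading of the Chinchilla result: an assumption, not an exact law.
CHINCHILLA_TOKENS_PER_PARAM = 20.0


def compute_optimal_tokens(
    params: float, tokens_per_param: float = CHINCHILLA_TOKENS_PER_PARAM
) -> float:
    """Approximate compute-optimal training tokens for a model of the given size."""
    return params * tokens_per_param


def overtraining_factor(
    params: float,
    train_tokens: float,
    tokens_per_param: float = CHINCHILLA_TOKENS_PER_PARAM,
) -> float:
    """How many times past the rule-of-thumb budget a training run goes."""
    return train_tokens / compute_optimal_tokens(params, tokens_per_param)


if __name__ == "__main__":
    params = 7e9          # nominal 7B model
    train_tokens = 1e12   # the 1T tokens mentioned in the quote
    optimal = compute_optimal_tokens(params)
    factor = overtraining_factor(params, train_tokens)
    print(f"Rule-of-thumb budget: {optimal:.2e} tokens")
    print(f"1T tokens is about {factor:.1f}x that budget")
```

With these inputs the budget comes out at 140B tokens, so 1T tokens is roughly seven times the rule-of-thumb amount.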
“toxicity increases with the size of the model, especially for Respectful prompts”
This is one of the more uncomfortable findings in the paper's section on bias, toxicity, and misinformation. On RealToxicityPrompts, larger LLaMA models score as more toxic, and the effect is strongest for the "Respectful" prompt variant, in which the model is explicitly asked to complete the sentence in a polite, respectful manner. The result challenges the naive view that capability improvements bring safety improvements. It's cited far less often than the benchmark wins, which is itself informative about how the open-source community weighed tradeoffs.