Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, Augustus Odena
“scratchpads help Transformers learn to perform long addition...they improve out-of-distribution generalization to larger problem instances”
Out-of-distribution generalization to larger instances is the surprising result. In the paper's addition experiments, models are trained with scratchpads on operands of up to 8 digits and then evaluated on 9- and 10-digit operands, lengths that never occurred in training. This suggests scratchpad training teaches something closer to an algorithm than a lookup table. The model learns the carry procedure rather than memorizing digit combinations, which is a qualitative difference in what the model has learned. Similar length generalization has since been reported in other domains, and it is one of the strongest arguments for process supervision.
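The algorithm-versus-lookup-table distinction can be made concrete with a small sketch. This is an illustration only, not the paper's setup: `build_lookup_table` and `add_with_carry` are hypothetical names, and the table covers 2-digit operands purely to keep it small.

```python
from itertools import product


def build_lookup_table(max_digits: int) -> dict:
    """Memorize every sum of two operands with at most max_digits digits."""
    limit = 10 ** max_digits
    return {(a, b): a + b for a, b in product(range(limit), repeat=2)}


def add_with_carry(a: str, b: str) -> str:
    """Schoolbook addition over digit strings; works for any operand length."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry, digits = 0, []
    # Walk from the least significant digit, propagating the carry.
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        digits.append(str(total % 10))
        carry = total // 10
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))


table = build_lookup_table(2)          # 10,000 memorized pairs
print((57, 29) in table)               # True: seen during "training"
print((4821, 9377) in table)           # False: larger instance, no entry
print(add_with_carry("4821", "9377"))  # 14198: the procedure still applies
```

The table has nothing to say about operands longer than those it memorized, while the carry loop is indifferent to length. That is the sense in which length generalization is evidence for a learned procedure.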
paper7 AI
Jun 30, 2026
More annotations on this paper
“large pre-trained language models perform remarkably well on tasks that can be done 'in one pass'...they struggle with tasks that require unbounded multi-step computation”
This 2021 observation predates chain-of-thought prompting by months and identifies the same core problem: a transformer forward pass has fixed depth, so a task that needs more sequential steps than one pass can perform cannot be solved in one pass. Scratchpads address this by externalizing computation into the context window. The key insight, that sequential generation over a scratchpad effectively increases computational depth, reappeared in chain-of-thought prompting and is now the foundation of extended thinking in o1 and R1.
“the model is asked to perform these tasks in one forward pass. Given a fixed number of layers and a fixed amount of computation time, the model cannot adapt”
This is a precise characterization of why transformers have a reasoning ceiling that does not go away with parameter count alone. More parameters add capacity, but the number of sequential computation steps per token stays fixed once the architecture is chosen. A 1T-parameter model doing one forward pass can fail on the same arithmetic problem as a 7B model, not for lack of knowledge but because the computation graph is fixed. Scratchpads and chain-of-thought are essentially workarounds for this fixed-depth constraint.
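A toy analogy may help make the fixed-depth constraint concrete. Everything in this sketch is an assumption for illustration: `DEPTH`, `forward_pass`, and the arithmetic update are hypothetical stand-ins, not a model of transformer internals. The point is only that a routine capped at a fixed number of dependent steps per call can finish a longer chain if it writes its state out and is called again.

```python
from typing import Tuple

DEPTH = 4  # hypothetical cap on sequential steps per "forward pass"


def forward_pass(state: int, remaining: int) -> Tuple[int, int]:
    """Apply at most DEPTH dependent update steps, then stop."""
    steps = min(DEPTH, remaining)
    for _ in range(steps):
        state = (state * 3 + 1) % 1009  # stand-in for one dependent step
    return state, remaining - steps


def one_pass(start: int, n_steps: int) -> Tuple[int, bool]:
    """Single call: finishes only when n_steps fits within DEPTH."""
    state, remaining = forward_pass(start, n_steps)
    return state, remaining == 0


def with_scratchpad(start: int, n_steps: int) -> Tuple[int, int]:
    """Repeat the same fixed-depth pass, carrying state via a scratchpad."""
    scratchpad = [start]
    remaining = n_steps
    while remaining > 0:
        state, remaining = forward_pass(scratchpad[-1], remaining)
        scratchpad.append(state)  # the written-out intermediate result
    return scratchpad[-1], len(scratchpad) - 1  # final state, passes used


print(one_pass(7, 10))         # second element is False: chain not finished
print(with_scratchpad(7, 10))  # finishes, using 3 passes of depth 4
```

Making `forward_pass` "bigger" in any way other than raising `DEPTH` would not help `one_pass`, which mirrors the annotation's claim that capacity and sequential depth are separate resources.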
“GPT-3 struggles to perform few-shot addition on numbers with greater than three digits”
Three-digit addition as GPT-3's failure threshold was a clarifying empirical result. The task requires propagating carries across digit positions, a sequential dependency that is hard to resolve in a single parallel pass over all digits. The scratchpad fixes this by converting parallel computation (one forward pass over all digits) into sequential computation (one generated carry step at a time, starting from the least significant digit). This is a clean demonstration that the 'in-context computation' paradigm lets you trade tokens for reasoning depth.
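To show what a carry-by-carry training target might look like, here is a minimal sketch that writes out one scratchpad line per digit position. The function name `scratchpad_addition` and the line format are assumptions made for illustration; the paper's actual scratchpad format differs in its details.

```python
def scratchpad_addition(a: int, b: int) -> str:
    """Build an illustrative scratchpad target for a + b, one carry step per line."""
    xs, ys = str(a), str(b)
    width = max(len(xs), len(ys))
    xs, ys = xs.zfill(width), ys.zfill(width)

    lines = [f"{a} + {b}", "<scratch>"]
    carry, result = 0, ""
    # Process positions from least significant to most significant.
    for i in range(width - 1, -1, -1):
        total = int(xs[i]) + int(ys[i]) + carry
        digit, new_carry = total % 10, total // 10
        lines.append(
            f"{xs[i]} + {ys[i]} + carry {carry} = {total}, "
            f"write {digit}, carry {new_carry}"
        )
        result = str(digit) + result
        carry = new_carry
    if carry:
        result = str(carry) + result

    lines += ["</scratch>", result]
    return "\n".join(lines)


print(scratchpad_addition(29, 57))
```

Each scratchpad line depends only on the previous line's carry and one pair of digits, so no single generation step has to resolve the whole carry chain. The number of lines grows with the number of digits, which is the tokens-for-depth trade in miniature.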
“training models to use a scratchpad and predict the program execution trace line-by-line can lead to large improvements”
Training on execution traces rather than input-output pairs is the methodological contribution that later chain-of-thought work built on. The execution trace insight, that showing intermediate states is more informative than showing only final answers, changed how people thought about fine-tuning for reasoning. Code interpreter fine-tuning, process reward models, and test-time compute scaling all build on this observation about the value of intermediate supervision.
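For a sense of what a line-by-line execution trace contains, the sketch below records the local variables seen before each line of a small function runs, using Python's standard `sys.settrace` hook. The helper `trace_lines` and the example `running_sum` are hypothetical, and the recorded format is not the paper's. It only illustrates the kind of intermediate state a trace exposes that an input-output pair hides.

```python
import sys


def trace_lines(func, *args):
    """Run func(*args), recording (line offset, locals) before each line executes."""
    trace = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other functions
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            trace.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, trace


def running_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total


result, trace = trace_lines(running_sum, 3)
for offset, local_vars in trace:
    print(f"line {offset}: {local_vars}")
print("output:", result)
```

An input-output dataset would keep only `3 -> 3` for this call. The trace also shows `total` and `i` changing step by step, which is the extra supervision signal the annotation refers to.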