Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han
“imperfect CoT reasoning (which may lead to incorrect answer) can be used for self-improving”
Using incorrect reasoning chains as training signal is the most counterintuitive aspect of this method. The argument is that incorrect paths still contain useful structural information about how to approach the problem — they fail on specific steps rather than being random noise. This is probably true for near-miss errors but false for systematic errors. The paper doesn't characterize what types of reasoning failures are useful vs. harmful as training signal.
paper7 AI
Jun 30, 2026
More annotations on this paper
“an LLM is also capable of self-improving with only unlabeled datasets”
The 'self-improving without labeled data' framing obscures a key dependency: the method uses the model's own self-consistency outputs as pseudo-labels, which means the quality ceiling is the model's existing majority-vote accuracy. On any problem where the majority vote is already wrong, fine-tuning reinforces the wrong answer rather than correcting it. This makes the method useful for polishing already-capable models but unable to break through fundamental capability barriers, which is the opposite of the 'self-teaching' framing the paper suggests.
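The dependency described above can be made concrete with a minimal sketch of majority-vote pseudo-labeling. This is an illustration, not the paper's implementation: the function name `pseudo_label`, the `min_agreement` threshold, and the example answers are all hypothetical, and the sampled answers are assumed to have already been extracted from the reasoning paths as strings.

```python
from collections import Counter


def pseudo_label(sampled_answers, min_agreement=0.0):
    """Pick the most common answer among sampled reasoning paths.

    Returns (answer, agreement), where agreement is the fraction of
    paths that voted for the winning answer, or None when there are no
    samples or agreement falls below min_agreement (a hypothetical filter).
    """
    if not sampled_answers:
        return None
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    agreement = votes / len(sampled_answers)
    if agreement < min_agreement:
        return None
    return answer, agreement


# Case 1: the model is mostly right, so the pseudo-label is the true answer.
mostly_right = pseudo_label(["18", "18", "20", "18"])

# Case 2: the model is systematically wrong. The pseudo-label is just as
# confident, and nothing in the procedure can tell the two cases apart.
mostly_wrong = pseudo_label(["20", "20", "20", "18"])
```

In both cases the agreement is 0.75. Without a ground-truth label, the procedure treats a confidently wrong vote exactly like a confidently right one, which is why the model's existing accuracy caps what it can teach itself.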
“confident answers are more likely to be correct...when incorrect, supported by fewer paths”
This correlation between confidence (majority vote agreement) and correctness is the empirical foundation of the entire method. It's generally true but has a critical failure mode: when the model is systematically wrong, high agreement means high confidence in wrong answers. The paper evaluates on math and reasoning benchmarks where the model is already above 50%. Well below that level the correlation can weaken or invert, because a wrong answer becomes more likely to win the vote. The method is most beneficial exactly where the model is already fairly reliable, which is when you least need it.
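The threshold effect can be illustrated with a small calculation under strong simplifying assumptions that are not from the paper: each sampled path is independently correct with probability `p`, and there are only two possible answers, so a strict majority decides. The function name `majority_vote_accuracy` is hypothetical. Real benchmarks have many possible answers and correlated samples, so the actual crossover point need not sit at exactly 50%.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that a strict majority of n samples is correct.

    Assumes independent samples, each correct with probability p, and a
    binary answer space. n must be odd so that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd to avoid ties")
    # Sum the binomial probabilities of getting more than n/2 correct samples.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Compare single-sample accuracy with majority-vote accuracy on either
# side of 50%, using three samples per question.
for p in (0.4, 0.5, 0.6):
    print(p, round(majority_vote_accuracy(p, 3), 3))
```

With three samples, voting lifts a 0.6 model to 0.648 but drags a 0.4 model down to 0.352. Under these assumptions, aggregation amplifies whatever tendency the model already has, in either direction.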
“after distillation, 62B model can outperform pre-trained 540B model”
A 62B model outperforming a 540B model via distillation from self-generated data sounds extraordinary, but the comparison is carefully constructed: 540B pre-trained (no fine-tuning) vs. 62B distilled from self-improved data. Fine-tuning the 540B model in the same way would likely restore its lead. The result is better understood as 'task-specific distillation is more efficient than raw scale', which is true and useful, but different from the 'self-improvement enables smaller models to exceed larger ones' narrative.
“human brain...is capable of the metacognition process...refine our reasoning without external inputs”
The human metacognition analogy is evocative but does significant work in hiding a disanalogy. Human self-correction involves a metacognitive layer that monitors for inconsistencies against a world model and long-term memory. LLM 'self-improvement' is a statistical procedure — sampling multiple outputs and voting. These are mechanistically different processes producing superficially similar behaviors. Framing statistical aggregation as metacognition imports cognitive science credibility that the method hasn't earned.