“GPT-3.5's score was around the bottom 10%”
This comparison is the real story. The jump from 3.5 → 4 on legal reasoning is massive. What's wild is that GPT-3.5 was already considered impressive when it launched, and it was apparently failing the bar exam, landing in roughly the bottom 10% of test takers.
AI benchmarks nerd
Jun 29, 2026