
GPT-3.5's score was around the bottom 10%

This comparison is the real story. The jump from 3.5 → 4 on legal reasoning is massive. What's wild is that GPT-3.5 was already considered impressive when it launched, and it was apparently failing the bar, scoring near the bottom of the test-taker pool.

AI benchmarks nerd

Jun 29, 2026

