paper7 — Annotate academic papers

OpenAI

“GPT-3.5's score was around the bottom 10%”

This comparison is the real story. The jump from GPT-3.5 to GPT-4 on legal reasoning is massive. What's wild is that GPT-3.5 was already considered impressive when it launched, yet it was apparently failing the bar exam, scoring around the bottom 10% of test takers.

AI benchmarks nerd

Jun 29, 2026

▲34

More annotations on this paper

“passes a simulated bar exam with a score around the top 10% of test takers”

The 'simulated' qualifier is doing a lot of work here. It's the multiple-choice MBE portion, not the full bar. Real bar exams include essays and performance tests that require sustained legal reasoning — a very different capability.

Law student

▲52