Why Claude 3.5 Sonnet Quietly Won the Coding Benchmarks
Anthropic's mid-tier model now beats GPT-4o on most agentic coding tasks. Here's why it matters for AI-assisted development.
SWE-bench scores are the canary. Claude 3.5 Sonnet hits 49% on SWE-bench Verified, higher than any other production-available model and roughly double what GPT-4o achieves on the same harness. Benchmarks lie about a lot of things, but this one tracks real-world coding agent performance closely: each task is a real GitHub issue, and a patch only counts as resolved if it passes the repository's tests.
Why it matters for agentic tools
For Cursor, Continue, Cline, and the new wave of agentic coding tools, this is decisive. Cost per resolved task drops dramatically when the model doesn't burn iterations chasing hallucinated function names or imports that don't exist. We've measured 40–60% fewer agent loops on the same tasks when switching from GPT-4o to Sonnet on internal projects.
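To make "cost per resolved task" concrete, here is a back-of-envelope sketch of the comparison. Every number in it is an illustrative placeholder, not one of our measurements; plug in your own per-call cost, average loop count, and resolution rate.

```python
# Hypothetical back-of-envelope comparison: cost per resolved task,
# not cost per token. All numbers below are illustrative placeholders.

def cost_per_resolved_task(cost_per_call: float, avg_loops: float,
                           resolve_rate: float) -> float:
    """Expected spend to get one resolved task.

    cost_per_call: average API cost of one agent iteration
    avg_loops:     average iterations per attempt
    resolve_rate:  fraction of attempts that end in a resolved task
    """
    return (cost_per_call * avg_loops) / resolve_rate

# Illustrative only: a pricier model that needs fewer loops and resolves
# more often can be cheaper per outcome.
print(cost_per_resolved_task(cost_per_call=0.04, avg_loops=6, resolve_rate=0.49))   # ~0.49
print(cost_per_resolved_task(cost_per_call=0.02, avg_loops=12, resolve_rate=0.30))  # ~0.80
```

The point of the formula: once wasted loops and failed attempts are priced in, the model with the higher sticker price can be the cheaper one per outcome.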
The three things Claude does better
- It admits what it doesn't know. Sonnet asks clarifying questions or says "I need to read this file first" instead of confabulating.
- Multi-file edits stay consistent. Variable renames propagate; imports update; type signatures stay coherent across files.
- Tool calling is sturdier. Function schemas come back correctly formatted on the first try roughly 98% of the time (see the sketch after this list).
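For readers who haven't wired this up, here is a minimal tool-calling sketch using the anthropic Python SDK. The get_weather tool, its schema, and the prompt are hypothetical illustrations; only the SDK calls and the model ID are Anthropic's.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical example tool; any JSON-schema-described function works here.
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
)

# When the model decides to call a tool, stop_reason is "tool_use" and the
# content includes a tool_use block with the arguments parsed into a dict.
if response.stop_reason == "tool_use":
    for block in response.content:
        if block.type == "tool_use":
            print(block.name, block.input)  # e.g. get_weather {'city': 'Berlin'}
```

The "sturdier" claim above is about how often that tool_use block arrives well-formed on the first attempt, with no retry or repair pass needed.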
Where it still loses
GPT-4o is faster (lower latency to first token), cheaper for very large context windows on certain pricing tiers, and better at one-shot UI generation. For interactive autocomplete inside an editor, latency matters more than absolute correctness — that's still GPT-4o or Codex territory.
What to do this quarter
If you ship a coding agent product, route agentic loops (plan + execute + verify) through Claude 3.5 Sonnet. Keep faster models for autocomplete and quick chat. Measure cost per resolved task, not per token — Sonnet often costs more per call but less per outcome.
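As a sketch of that split, assuming a simple task taxonomy of our own invention (the TaskKind enum and route() helper below are hypothetical, not any product's API):

```python
# Hypothetical routing layer: agentic loops to Sonnet, latency-sensitive
# paths to a faster model. Tune the split against your own
# cost-per-resolved-task numbers, not per-token price sheets.
from enum import Enum

class TaskKind(Enum):
    AGENT_LOOP = "agent_loop"      # plan + execute + verify
    AUTOCOMPLETE = "autocomplete"  # interactive, latency-bound
    QUICK_CHAT = "quick_chat"

def route(kind: TaskKind) -> str:
    """Return a model ID for the given task kind (hypothetical policy)."""
    if kind is TaskKind.AGENT_LOOP:
        return "claude-3-5-sonnet-20241022"  # consistency beats latency here
    return "gpt-4o"                          # latency beats absolute correctness

print(route(TaskKind.AGENT_LOOP))    # claude-3-5-sonnet-20241022
print(route(TaskKind.AUTOCOMPLETE))  # gpt-4o
```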
Bottom line
The model wars are over for coding. Until further notice, Claude wins agentic loops; everything else is tactical.