Claude 3.7 Sonnet: Anthropic's Extended Thinking Arrives
Claude 3.7 Sonnet dropped February 24, 2025 with extended thinking — Anthropic's answer to o1 and DeepSeek R1. It immediately set a new bar for coding benchmarks and introduced a dual-mode architecture that would define the Claude 4 generation.
Two things about Claude 3.7 Sonnet matter: the model itself, and what it signals about where Anthropic is headed.
The model: a new state-of-the-art for coding benchmarks at launch. The signal: Anthropic's hybrid architecture — instant responses for simple tasks, extended chain-of-thought for hard ones — is now shipping to users, not just being previewed at research conferences.
What's New
- Extended thinking — 3.7 is the first Claude with a visible, user-accessible chain of thought. A budget control lets you cap how many tokens the model can spend thinking before it answers (see the API sketch after this list).
- Best coding model at launch — Leads SWE-bench Verified, arguably the most realistic benchmark of real-world coding ability, and matches or beats GPT-4o and Gemini 2.0 Pro on software engineering tasks.
- 200K context window — Same as 3.5, but the model makes better use of long contexts due to improved retrieval and reasoning over them.
- Claude Code (preview) — Anthropic's coding agent launches alongside 3.7. It can autonomously work through multi-step coding tasks with file access, tests, and iteration.
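Here is a minimal sketch of enabling extended thinking with a token budget through the Anthropic Python SDK. The model ID and budget values reflect the launch-era documentation and are illustrative, not a recommendation; verify parameter names and limits against the current docs before relying on them.

```python
# A minimal sketch of extended thinking via the Anthropic Python SDK.
# The budget is a cap on reasoning tokens, not a guarantee the model uses
# them all. Model ID and budget values are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",   # launch-era model ID
    max_tokens=8000,                      # must exceed the thinking budget
    thinking={
        "type": "enabled",
        "budget_tokens": 4000,            # cap on tokens spent reasoning
    },
    messages=[
        {"role": "user", "content": "Find the bug in this binary search and explain your reasoning."},
    ],
)

# The response interleaves "thinking" blocks (the visible chain of thought)
# with ordinary "text" blocks (the final answer).
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print(block.text)
```

Omit the `thinking` parameter entirely and the model behaves like a standard instant-response Claude, which is the dual-mode design in action.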
How It Compares at Launch
| Model | SWE-bench Verified | GPQA Diamond | Notes |
|---|---|---|---|
| Claude 3.7 Sonnet | ~70% | ~68% | Best coding model at release |
| Grok 3 Thinking | — | top tier | Math-first; coding is secondary |
| GPT-4o | ~38% | ~53% | Still strong for general tasks |
| DeepSeek R1 | moderate | ~65% | Open-weight reasoning model |
| Gemini 2.0 Pro | moderate | ~60% | Pre-2.5, still solid |
All figures are approximate; independent evaluations vary by scaffold, harness, and whether extended thinking is enabled.
Best For
- Software engineers using Claude Code or API-based coding agents
- Any task where "think before you answer" produces better results — legal analysis, code review, complex debugging, multi-step planning
- Teams already on Anthropic's API who want an instant upgrade
Not Yet For
- Real-time web access — Claude 3.7 doesn't have live search; Grok 3's DeepSearch is better here
- Math competition benchmarks — Grok 3 Thinking leads on AIME at this moment
- Budget-sensitive deployments — extended thinking uses more tokens; disable it for simple tasks (the routing sketch under Verdict shows one way to do this)
Verdict
Claude 3.7 Sonnet is the best coding model available at this moment, and the extended thinking mode is genuinely useful — not just a benchmarking trick. The dual-mode architecture (instant + thinking) is the right design for production: you don't want a model that stops to think before every simple autocomplete. The bigger story is Claude Code, which is quietly becoming the most capable AI coding agent shipped to developers so far.
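What that looks like in practice: route plain requests to the instant mode and reserve a thinking budget for requests flagged as hard. The sketch below assumes the same SDK call as above; the `is_hard()` heuristic and all thresholds are placeholders, not a recommended policy.

```python
# A sketch of per-request routing between instant and extended-thinking modes.
# The is_hard() heuristic is a stand-in; real routing would use task metadata
# rather than prompt length.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-7-sonnet-20250219"  # launch-era model ID; verify against current docs


def is_hard(prompt: str) -> bool:
    """Placeholder heuristic: treat long or debugging-flavoured prompts as hard."""
    return len(prompt) > 2000 or "debug" in prompt.lower()


def ask(prompt: str) -> str:
    kwargs = dict(
        model=MODEL,
        max_tokens=8000,
        messages=[{"role": "user", "content": prompt}],
    )
    if is_hard(prompt):
        # Pay for reasoning tokens only when the task warrants it.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 4000}
    response = client.messages.create(**kwargs)
    # Return only the final text blocks, skipping any thinking blocks.
    return "".join(b.text for b in response.content if b.type == "text")
```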
Part of our Model Watch series. Next: GPT-4.5 →
