Claude 3.7 Sonnet: Anthropic's Extended Thinking Arrives
Claude 3.7 Sonnet dropped February 24, 2025 with extended thinking — Anthropic's answer to o1 and DeepSeek R1. It immediately set a new bar for coding benchmarks and introduced a dual-mode architecture that would define the Claude 4 generation.
Two things about Claude 3.7 Sonnet matter: the model itself, and what it signals about where Anthropic is headed.
The model: a new state-of-the-art for coding benchmarks at launch. The signal: Anthropic's hybrid architecture — instant responses for simple tasks, extended chain-of-thought for hard ones — is now shipping to users, not just being previewed at research conferences.
What's New
- Extended thinking — 3.7 is the first Claude with a visible, user-accessible chain of thought. A budget control lets you cap how many tokens the model can spend thinking before it answers (see the API sketch after this list).
- Best coding model at launch — Leads SWE-bench Verified, arguably the most realistic benchmark of real-world coding ability, and matches or beats GPT-4o and Gemini 2.0 Pro on software engineering tasks.
- 200K context window — Same as 3.5, but the model makes better use of long contexts due to improved retrieval and reasoning over them.
- Claude Code (preview) — Anthropic's coding agent launches alongside 3.7. It can autonomously work through multi-step coding tasks with file access, tests, and iteration.
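Here is a minimal sketch of enabling extended thinking with a token budget through the Anthropic Python SDK. The model ID and budget values reflect the launch-era documentation and are illustrative, not a recommendation; verify parameter names and limits against the current docs before relying on them.

```python
# A minimal sketch of extended thinking via the Anthropic Python SDK.
# The budget is a cap on reasoning tokens, not a guarantee the model uses
# them all. Model ID and budget values are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",   # launch-era model ID
    max_tokens=8000,                      # must exceed the thinking budget
    thinking={
        "type": "enabled",
        "budget_tokens": 4000,            # cap on tokens spent reasoning
    },
    messages=[
        {"role": "user", "content": "Find the bug in this binary search and explain your reasoning."},
    ],
)

# The response interleaves "thinking" blocks (the visible chain of thought)
# with ordinary "text" blocks (the final answer).
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print(block.text)
```

Omit the `thinking` parameter entirely and the model behaves like a standard instant-response Claude, which is the dual-mode design in action.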
How It Compares at Launch
| Model | SWE-bench Verified | GPQA Diamond | Notes |
|---|---|---|---|
| Claude 3.7 Sonnet | ~70% | ~68% | Best coding model at release |
| Grok 3 Thinking | — | top tier | Math-first; coding is secondary |
| GPT-4o | ~38% | ~53% | Still strong for general tasks |
| DeepSeek R1 | moderate | ~65% | Open-weight reasoning model |
| Gemini 2.0 Pro | moderate | ~60% | Pre-2.5, still solid |
All figures are approximate; independent evaluations vary by scaffold, harness, and whether extended thinking is enabled.
Best For
- Software engineers using Claude Code or API-based coding agents
- Any task where "think before you answer" produces better results — legal analysis, code review, complex debugging, multi-step planning
- Teams already on Anthropic's API who want an instant upgrade
Not Yet For
- Real-time web access — Claude 3.7 doesn't have live search; Grok 3's DeepSearch is better here
- Math competition benchmarks — Grok 3 Thinking leads on AIME at this moment
- Budget-sensitive deployments — extended thinking uses more tokens; disable it for simple tasks (the routing sketch under Verdict shows one way to do this)
Verdict
Claude 3.7 Sonnet is the best coding model available at this moment, and the extended thinking mode is genuinely useful — not just a benchmarking trick. The dual-mode architecture (instant + thinking) is the right design for production: you don't want a model that stops to think before every simple autocomplete. The bigger story is Claude Code, which is quietly becoming the most capable AI coding agent shipped to developers so far.
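What that looks like in practice: route plain requests to the instant mode and reserve a thinking budget for requests flagged as hard. The sketch below assumes the same SDK call as above; the `is_hard()` heuristic and all thresholds are placeholders, not a recommended policy.

```python
# A sketch of per-request routing between instant and extended-thinking modes.
# The is_hard() heuristic is a stand-in; real routing would use task metadata
# rather than prompt length.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-7-sonnet-20250219"  # launch-era model ID; verify against current docs


def is_hard(prompt: str) -> bool:
    """Placeholder heuristic: treat long or debugging-flavoured prompts as hard."""
    return len(prompt) > 2000 or "debug" in prompt.lower()


def ask(prompt: str) -> str:
    kwargs = dict(
        model=MODEL,
        max_tokens=8000,
        messages=[{"role": "user", "content": prompt}],
    )
    if is_hard(prompt):
        # Pay for reasoning tokens only when the task warrants it.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 4000}
    response = client.messages.create(**kwargs)
    # Return only the final text blocks, skipping any thinking blocks.
    return "".join(b.text for b in response.content if b.type == "text")
```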
Part of our Model Watch series. Next: GPT-4.5 →
