Grok 3: xAI Finally Has a Real Contender
xAI's Grok 3 launched on February 17, 2025, atop the math and reasoning leaderboards — the first time a non-Google/OpenAI/Anthropic model legitimately led the pack. Here's what changed.
For the first two years of the frontier model race, it was a three-way fight: OpenAI, Anthropic, and Google. Everyone else was playing catch-up. Grok 3, released on February 17, 2025, is the first serious sign that the field is broadening.
What's New
- Thinking mode — Like DeepSeek R1 and Claude 3.7, Grok 3 introduces extended reasoning with a chain-of-thought that's visible to users. The model pauses, thinks, then answers.
- AIME 2025: 93.3% — Tested on the 2025 AIME math competition released just seven days before launch. That score, using cons@64 (consensus over 64 samples; see the sketch after this list), beats every model available at launch, including o3-mini and DeepSeek R1.
- DeepSearch — Real-time web access baked in. Not an afterthought — it reasons over live results, synthesizes them, and cites sources. Think Perplexity, but with a frontier reasoning model underneath.
- Colossus scale — Trained on 200K H100s in Memphis. xAI went from zero to the largest known GPU cluster in roughly 18 months.
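A quick note on that cons@64 figure: it is a consensus score, not single-attempt accuracy. The model samples 64 solutions per problem, and only the most common final answer gets graded, which generally lifts scores relative to pass@1. Below is a minimal sketch of that scoring logic in Python; `sample_fn` is a hypothetical placeholder for whatever produces the 64 completions, not an actual xAI API.

```python
from collections import Counter

def cons_at_k(final_answers: list[str]) -> str:
    """Consensus (majority-vote) answer over k sampled completions.

    cons@64 means: sample 64 independent solutions to the same problem,
    extract each one's final answer, and grade only the most common answer.
    """
    # AIME answers are integers 0-999, so strip whitespace and leading zeros
    # so that "042" and "42" vote together.
    normalized = [a.strip().lstrip("0") or "0" for a in final_answers]
    return Counter(normalized).most_common(1)[0][0]

def cons_at_k_accuracy(problems: list[str], answer_key: dict[str, str],
                       sample_fn, k: int = 64) -> float:
    """Fraction of problems whose consensus answer matches the answer key.

    `sample_fn(problem, k)` is a hypothetical stand-in that returns the k
    final answers produced by the model; it is not an xAI API.
    """
    correct = sum(
        cons_at_k(sample_fn(p, k)) == answer_key[p] for p in problems
    )
    return correct / len(problems)
```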
How It Compares at Launch
| Model | AIME 2025 | GPQA Diamond | LiveCodeBench |
|---|---|---|---|
| Grok 3 (Thinking) | 93.3% | top tier | top tier |
| DeepSeek R1 | ~79% | ~65% | strong |
| Claude 3.5 Sonnet | moderate | ~59% | strong |
| GPT-4o | moderate | ~53% | solid |
| Gemini 2.0 Pro | moderate | ~60% | solid |
Scores at Feb 2025 release; xAI self-reported. Independent verification followed within weeks.
On pure math and science benchmarks, Grok 3 Thinking is the strongest model available at this moment. On coding and instruction-following for everyday tasks, Claude 3.5 Sonnet and GPT-4o are still competitive. The gap isn't as wide outside of competitive math.
Best For
- Math-heavy tasks: quantitative finance, engineering, scientific research
- Real-time research with DeepSearch (news synthesis, fast-moving topics)
- Anyone who wants a reasoning model with live web access in one package
- X Premium+ subscribers who want it for free
Not Yet For
- Coding-heavy workflows — Claude 3.5 Sonnet still leads on SWE-bench at this point
- API developers — Grok 3 launched on X first; API access was initially limited
- Teams that need enterprise SLAs and audit trails — xAI infrastructure is early
Verdict
Grok 3 is a genuine benchmark leader on math and reasoning, and DeepSearch is a legitimately useful product feature. The xAI infrastructure story (Colossus, 200K H100s) is real and the pace of improvement has been faster than anyone expected. The open question is whether xAI can translate benchmark wins into a platform developers actually build on. That work is still ahead.
Part of our Model Watch series — a breakdown of every meaningful frontier model release. Next up: Claude 3.7 Sonnet →
