Grok 3: xAI Finally Has a Real Contender
xAI's Grok 3 launched on February 17, 2025, atop the math and reasoning leaderboards — the first time a non-Google/OpenAI/Anthropic model legitimately led the pack. Here's what changed.
For the first two years of the frontier model race, it was a three-way fight: OpenAI, Anthropic, and Google. Everyone else was playing catch-up. Grok 3, released on February 17, 2025, is the first serious sign that the field is broadening.
What's New
- Thinking mode — Like DeepSeek R1 and Claude 3.7, Grok 3 introduces extended reasoning with a chain-of-thought that's visible to users. The model pauses, thinks, then answers.
- AIME 2025: 93.3% — Tested on the 2025 AIME math competition released just seven days before launch. That score, using cons@64 (consensus over 64 samples; see the sketch after this list), beats every model available at launch, including o3-mini and DeepSeek R1.
- DeepSearch — Real-time web access baked in. Not an afterthought — it reasons over live results, synthesizes them, and cites sources. Think Perplexity, but with a frontier reasoning model underneath.
- Colossus scale — Trained on 200K H100s in Memphis. xAI went from zero to the largest known GPU cluster in roughly 18 months.
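A quick note on that cons@64 figure: it is a consensus score, not single-attempt accuracy. The model samples 64 solutions per problem, and only the most common final answer gets graded, which generally lifts scores relative to pass@1. Below is a minimal sketch of that scoring logic in Python; `sample_fn` is a hypothetical placeholder for whatever produces the 64 completions, not an actual xAI API.

```python
from collections import Counter

def cons_at_k(final_answers: list[str]) -> str:
    """Consensus (majority-vote) answer over k sampled completions.

    cons@64 means: sample 64 independent solutions to the same problem,
    extract each one's final answer, and grade only the most common answer.
    """
    # AIME answers are integers 0-999, so strip whitespace and leading zeros
    # so that "042" and "42" vote together.
    normalized = [a.strip().lstrip("0") or "0" for a in final_answers]
    return Counter(normalized).most_common(1)[0][0]

def cons_at_k_accuracy(problems: list[str], answer_key: dict[str, str],
                       sample_fn, k: int = 64) -> float:
    """Fraction of problems whose consensus answer matches the answer key.

    `sample_fn(problem, k)` is a hypothetical stand-in that returns the k
    final answers produced by the model; it is not an xAI API.
    """
    correct = sum(
        cons_at_k(sample_fn(p, k)) == answer_key[p] for p in problems
    )
    return correct / len(problems)
```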
How It Compares at Launch
| Model | AIME 2025 | GPQA Diamond | LiveCodeBench |
|---|---|---|---|
| Grok 3 (Thinking) | 93.3% | top tier | top tier |
| DeepSeek R1 | ~79% | ~65% | strong |
| Claude 3.5 Sonnet | moderate | ~59% | strong |
| GPT-4o | moderate | ~53% | solid |
| Gemini 2.0 Pro | moderate | ~60% | solid |
Scores at Feb 2025 release; xAI self-reported. Independent verification followed within weeks.
On pure math and science benchmarks, Grok 3 Thinking is the strongest model available at this moment. On coding and instruction-following for everyday tasks, Claude 3.5 Sonnet and GPT-4o are still competitive. The gap isn't as wide outside of competitive math.
Best For
- Math-heavy tasks: quantitative finance, engineering, scientific research
- Real-time research with DeepSearch (news synthesis, fast-moving topics)
- Anyone who wants a reasoning model with live web access in one package
- X Premium+ subscribers who want it for free
Not Yet For
- Coding-heavy workflows — Claude 3.5 Sonnet still leads on SWE-bench at this point
- API developers — Grok 3 launched on X first; API access was initially limited
- Teams that need enterprise SLAs and audit trails — xAI infrastructure is early
Verdict
Grok 3 is a genuine benchmark leader on math and reasoning, and DeepSearch is a legitimately useful product feature. The xAI infrastructure story (Colossus, 200K H100s) is real and the pace of improvement has been faster than anyone expected. The open question is whether xAI can translate benchmark wins into a platform developers actually build on. That work is still ahead.
Part of our Model Watch series — a breakdown of every meaningful frontier model release. Next up: Claude 3.7 Sonnet →
