Frontier Model Comparison
Interactive benchmark data for the top AI models. Updated weekly.
Last updated: March 13, 2026
| Spec | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| Released | Mar 5, 2026 | Feb 5, 2026 | Feb 19, 2026 |
| Context | 1M | 200K (1M beta) | 1M |
| Max output | 128K | 128K | 64K |
| Input / 1M tokens | $2.50 | $5.00 | $2.00 |
| Output / 1M tokens | $15.00 | $25.00 | $12.00 |
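To make the pricing rows concrete, here is a minimal sketch of per-request cost using the $/1M-token prices from the table above; the token counts in the example are hypothetical, and real billing may differ (e.g. caching discounts are not modeled).

```python
# Per-1M-token prices taken from the spec table above:
# (input price, output price) in USD.
PRICES = {
    "GPT-5.4": (2.50, 15.00),
    "Claude Opus 4.6": (5.00, 25.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for one request at list prices."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Hypothetical request: 50K input tokens, 2K output tokens.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 50_000, 2_000):.3f}")
```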

| Benchmark | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| GPQA Diamond (PhD-level science reasoning) | 92.8% | 91.3% | 94.3% |
| GDPval (knowledge work across 44 occupations) | 83.0% | 78.0% | — |
| MMMU Pro (multimodal understanding) | 81.2% | 85.1% | 80.5% |
| ARC-AGI-2 (novel problem-solving) | 73.3% | 75.2% | 77.1% |
| HLE (Humanity's Last Exam) | 52.1% | 40.0% | 44.4% |
| SWE-bench Verified (real-world software engineering) | 77.2% | 80.8% | 80.6% |
| SWE-bench Pro (advanced software engineering) | 57.7% | — | 54.2% |
| Terminal-Bench 2.0 (terminal-based coding tasks) | 75.1% | 65.4% | 68.5% |
| OSWorld Verified (computer-use tasks) | 75.0% | 72.7% | — |
| WebArena Verified (web-based agent tasks) | 67.3% | — | — |
| BrowseComp (web browsing comprehension) | 82.7% | 84.0% | 85.9% |
| Toolathlon (tool-use proficiency) | 54.6% | — | — |
| MCP Atlas (MCP tool integration) | 67.2% | 59.5% | 69.2% |
| LMArena Overall (crowdsourced human preference, Elo) | 1,485 | 1,503 | 1,492 |
| LMArena Coding (coding preference, Elo) | 1,460 | 1,552 | 1,457 |
| Non-hallucination rate (AA-Omniscience: % of non-correct responses that are abstentions rather than wrong answers) | — | — | 50.0% |
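The non-hallucination rate can be confusing at first read, so here is a minimal sketch of the metric as defined in the table: among responses that are *not* correct, the fraction that are abstentions rather than wrong answers. The counts in the example are hypothetical.

```python
def non_hallucination_rate(abstentions: int, wrong_answers: int) -> float:
    """Fraction of non-correct responses that are abstentions.

    Correct answers are excluded entirely; only abstentions and
    wrong answers count toward the denominator.
    """
    non_correct = abstentions + wrong_answers
    return abstentions / non_correct

# Hypothetical run: 40 abstentions and 40 wrong answers out of the
# non-correct responses gives a rate of 0.5, i.e. 50.0%.
rate = non_hallucination_rate(40, 40)
```

A higher rate means the model more often says "I don't know" instead of hallucinating when it cannot answer correctly.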