Frontier Model Comparison

Interactive benchmark data for the top AI models. Updated weekly.

Last updated: March 13, 2026

| Spec | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| Released | Mar 5, 2026 | Feb 5, 2026 | Feb 19, 2026 |
| Context window | 1M | 200K (1M beta) | 1M |
| Max output | 128K | 128K | 64K |
| Input price / 1M tokens | $2.50 | $5.00 | $2.00 |
| Output price / 1M tokens | $15.00 | $25.00 | $12.00 |
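
To make the per-token prices concrete, the sketch below computes the cost of a single request at each model's published rates. The 50K-input / 2K-output request size is an arbitrary illustration, not a figure from this page.

```python
# Per-request cost at the rates in the spec table above.
# The request size (50K in / 2K out) is a made-up example.
PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "GPT-5.4": (2.50, 15.00),
    "Claude Opus 4.6": (5.00, 25.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request: tokens / 1M, times the per-1M rate."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

for model in PRICES:
    print(f"{model}: ${request_cost(model, 50_000, 2_000):.4f}")
# GPT-5.4: $0.1550
# Claude Opus 4.6: $0.3000
# Gemini 3.1 Pro: $0.1240
```

At these rates, output tokens dominate cost only for generation-heavy workloads; at 50K in / 2K out, the input term is the larger one for all three models.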
| Benchmark | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| GPQA Diamond (PhD-level science reasoning) | 92.8% | 91.3% | 94.3% |
| GDPval (knowledge work across 44 occupations) | 83.0% | 78.0% | n/a |
| MMMU Pro (multimodal understanding) | 81.2% | 85.1% | 80.5% |
| ARC-AGI-2 (novel problem-solving) | 73.3% | 75.2% | 77.1% |
| HLE (Humanity's Last Exam) | 52.1% | 40.0% | 44.4% |
| SWE-bench Verified (real-world software engineering) | 77.2% | 80.8% | 80.6% |
| SWE-bench Pro (advanced software engineering) | 57.7% | 54.2% | n/a |
| Terminal-Bench 2.0 (terminal-based coding tasks) | 75.1% | 65.4% | 68.5% |
| OSWorld Verified (computer-use tasks) | 75.0% | 72.7% | n/a |
| WebArena Verified (web-based agent tasks) | 67.3% | n/a | n/a |
| BrowseComp (web browsing comprehension) | 82.7% | 84.0% | 85.9% |
| Toolathlon (tool-use proficiency) | 54.6% | n/a | n/a |
| MCP Atlas (MCP tool integration) | 67.2% | 59.5% | 69.2% |
| LMArena Overall (crowdsourced human preference, Elo) | 1,485 | 1,503 | 1,492 |
| LMArena Coding (coding preference, Elo) | 1,460 | 1,552 | 1,457 |
| Non-Hallucination Rate (AA-Omniscience: share of non-correct responses that are abstentions rather than wrong answers) | 50.0% | n/a | n/a |

n/a: no score reported for that model.
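
The two LMArena rows are Elo-style ratings rather than percentages. Under the standard logistic Elo formula, a rating gap implies an expected head-to-head win rate; the sketch below applies that formula to the coding gap in the table. This is a general reading aid and an assumption on my part, not LMArena's exact methodology.

```python
# Expected head-to-head win rate implied by an Elo-style rating gap.
# Standard logistic Elo formula; assumed here, not LMArena-specific code.
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# Coding Elo from the table: Claude Opus 4.6 (1,552) vs GPT-5.4 (1,460).
print(f"{elo_win_prob(1552, 1460):.1%}")  # ~62.9%
```

Read this way, the 92-point coding gap corresponds to roughly a 63/37 preference split, a clear but far from one-sided margin, while the 18-point spread in the overall ratings is close to a coin flip (about 52.6%).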