Frontier Model Comparison

Interactive benchmark data for the top AI models. Updated weekly.

Last updated: March 13, 2026

| Spec | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| Released | Mar 5, 2026 | Feb 5, 2026 | Feb 19, 2026 |
| Context window | 1M | 200K (1M beta) | 1M |
| Max output | 128K | 128K | 64K |
| Input price / 1M tokens | $2.50 | $5.00 | $2.00 |
| Output price / 1M tokens | $15.00 | $25.00 | $12.00 |
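
To make the per-token prices concrete, the sketch below computes the cost of a single request at each model's published rates. The 50K-input / 2K-output request size is an arbitrary illustration, not a figure from this page.

```python
# Per-request cost at the rates in the spec table above.
# The request size (50K in / 2K out) is a made-up example.
PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "GPT-5.4": (2.50, 15.00),
    "Claude Opus 4.6": (5.00, 25.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request: tokens / 1M, times the per-1M rate."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

for model in PRICES:
    print(f"{model}: ${request_cost(model, 50_000, 2_000):.4f}")
# GPT-5.4: $0.1550
# Claude Opus 4.6: $0.3000
# Gemini 3.1 Pro: $0.1240
```

At these rates, output tokens dominate cost only for generation-heavy workloads; at 50K in / 2K out, the input term is the larger one for all three models.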
| Benchmark | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| GPQA Diamond (PhD-level science reasoning) | 92.8% | 91.3% | 94.3% |
| GDPval (knowledge work across 44 occupations) | 83.0% | 78.0% | n/a |
| MMMU Pro (multimodal understanding) | 81.2% | 85.1% | 80.5% |
| ARC-AGI-2 (novel problem-solving) | 73.3% | 75.2% | 77.1% |
| HLE (Humanity's Last Exam) | 52.1% | 40.0% | 44.4% |
| SWE-bench Verified (real-world software engineering) | 77.2% | 80.8% | 80.6% |
| SWE-bench Pro (advanced software engineering) | 57.7% | 54.2% | n/a |
| Terminal-Bench 2.0 (terminal-based coding tasks) | 75.1% | 65.4% | 68.5% |
| OSWorld Verified (computer-use tasks) | 75.0% | 72.7% | n/a |
| WebArena Verified (web-based agent tasks) | 67.3% | n/a | n/a |
| BrowseComp (web browsing comprehension) | 82.7% | 84.0% | 85.9% |
| Toolathlon (tool-use proficiency) | 54.6% | n/a | n/a |
| MCP Atlas (MCP tool integration) | 67.2% | 59.5% | 69.2% |
| LMArena Overall (crowdsourced human preference, Elo) | 1,485 | 1,503 | 1,492 |
| LMArena Coding (coding preference, Elo) | 1,460 | 1,552 | 1,457 |
| Non-Hallucination Rate (AA-Omniscience: share of non-correct responses that are abstentions rather than wrong answers) | 50.0% | n/a | n/a |

n/a: no score reported for that model.
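
The two LMArena rows are Elo-style ratings rather than percentages. Under the standard logistic Elo formula, a rating gap implies an expected head-to-head win rate; the sketch below applies that formula to the coding gap in the table. This is a general reading aid and an assumption on my part, not LMArena's exact methodology.

```python
# Expected head-to-head win rate implied by an Elo-style rating gap.
# Standard logistic Elo formula; assumed here, not LMArena-specific code.
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# Coding Elo from the table: Claude Opus 4.6 (1,552) vs GPT-5.4 (1,460).
print(f"{elo_win_prob(1552, 1460):.1%}")  # ~62.9%
```

Read this way, the 92-point coding gap corresponds to roughly a 63/37 preference split, a clear but far from one-sided margin, while the 18-point spread in the overall ratings is close to a coin flip (about 52.6%).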