GPT-5.4: OpenAI's Most Capable Model Arrives With Computer Use
GPT-5.4 drops with verified computer use benchmarks, 1M token context, and a Tool Search feature that changes how you architect agents. Here's what actually changed and what it means if you're evaluating models right now.
GPT-5.4 landed on March 5th in three variants: a standard GPT-5.4, a reasoning-focused GPT-5.4 Thinking, and a high-performance GPT-5.4 Pro. The headline numbers are genuinely good. Record scores on OSWorld-Verified and WebArena-Verified put it at the top of computer use benchmarks. It leads on Mercor's APEX-Agents benchmark for professional skills in law and finance. It hit 83% on GDPval, a knowledge work eval. The error rate dropped: 33% fewer errors in individual claims versus GPT-5.2, 18% fewer overall.
That's a lot of numbers. Here's what actually matters.
Computer Use Is No Longer a Curiosity
When Anthropic shipped computer use with Claude in late 2024, it was clever but rough. The model would click around a screen like someone who'd read a manual but never touched a mouse. The benchmarks were academic. Real-world deployments were fragile.
Leading on OSWorld-Verified and WebArena-Verified in early 2026 means something different. These are verified benchmarks, which require actual task completion, not just plausible-looking outputs. Scoring at the top of both suggests computer use is genuinely becoming production-viable.
If you've been watching this space, this is the year things start to consolidate. Models that score well on verified evals tend to actually work in production. That's not guaranteed, but it's a stronger signal than it was 18 months ago.
1M Context and Token Efficiency
The context window is now 1 million tokens on the API. That's the largest OpenAI has shipped. Google's Gemini 1.5 Pro technically goes to 2M, but that model is showing its age against the current frontier and only makes sense for a few niche use cases. In practice, 1M covers most document-heavy workloads and entire codebases comfortably.
What's more interesting is the token efficiency claim. GPT-5.4 reportedly solves the same problems with fewer tokens than GPT-5.2. Combined with pricing that undercuts Claude Opus 4.6 by roughly half, that changes the math for production workloads. If the quality holds up, you're getting frontier capability at mid-tier cost. That's meaningful when you're building products on top of the API.
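The two factors compound, which is easy to miss when you read them as separate bullet points. Here's the back-of-the-envelope math, with placeholder numbers that are purely illustrative, not published pricing:

```python
def cost_per_task(price_per_1k_tokens: float, tokens_per_task: int) -> float:
    """Cost of completing one task at a given per-1K-token price."""
    return price_per_1k_tokens * tokens_per_task / 1000

# Hypothetical numbers for illustration only (not real pricing):
# old model: $0.02 per 1K tokens, 10,000 tokens to finish a task.
old = cost_per_task(0.02, 10_000)   # $0.20 per task
# new model: half the price AND 20% fewer tokens for the same task.
new = cost_per_task(0.01, 8_000)    # $0.08 per task

print(f"old: ${old:.2f}, new: ${new:.2f}, savings: {1 - new / old:.0%}")
```

Halve the price and shave 20% off token usage, and per-task cost drops 60%, not 50%. That multiplicative effect is why efficiency claims matter as much as the sticker price.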
Tool Search: The Underrated Feature
The most interesting thing in this release for agent builders isn't the benchmarks. It's Tool Search.
Currently, if you're building an agent with access to a large tool library, you stuff tool definitions into the system prompt. This eats context, slows things down, and adds noise. The model has to parse a wall of definitions before it can do anything useful.
Tool Search flips this. Instead of loading all tools upfront, the model looks up tool definitions on demand. Think of it like lazy loading for capabilities. The model searches for what it needs, when it needs it.
For agents with dozens or hundreds of tools, this is a real architectural improvement. You can build agents with broader capability surfaces without paying the context penalty. It also scales better as your tool library grows, because you're not hitting a ceiling where adding the 50th tool starts breaking things.
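The underlying pattern is worth sketching in application code, independent of OpenAI's specific API. The names below (TOOL_REGISTRY, search_tools) are illustrative, not the real Tool Search interface: the idea is that the model's context starts with one tool, the search tool, and full definitions enter the context only on demand.

```python
# Sketch of the lazy-loading pattern behind Tool Search.
# All names here are illustrative, not OpenAI's actual API surface.

TOOL_REGISTRY = {
    "get_weather": {
        "description": "Fetch the current weather for a city.",
        "parameters": {"city": "string"},
    },
    "send_invoice": {
        "description": "Create and email an invoice to a customer.",
        "parameters": {"customer_id": "string", "amount": "number"},
    },
    # ...hundreds more tools can live here without touching the prompt.
}

def search_tools(query: str, limit: int = 3) -> list[dict]:
    """Return matching tool definitions on demand, instead of
    stuffing every definition into the system prompt upfront."""
    q = query.lower()
    hits = [
        {"name": name, **spec}
        for name, spec in TOOL_REGISTRY.items()
        if q in name or q in spec["description"].lower()
    ]
    return hits[:limit]

# The model is given ONE tool: search_tools. When it decides it needs
# weather data, it searches, and only then does the get_weather
# definition consume any context.
print([t["name"] for t in search_tools("weather")])
```

The payoff is that context cost scales with tools *used* per task, not tools *available*, which is exactly what lets the capability surface grow without the prompt growing with it.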
This is the kind of API feature that doesn't make headlines but changes how you architect systems. Worth paying attention to.
Where It Sits on the Frontier
Honest answer: it depends on what you're measuring.
GPT-5.4 is clearly ahead of GPT-5.2. Whether it beats Gemini 3.1 Pro or Claude Opus 4.6 depends on the task. The computer use benchmarks favor GPT-5.4 right now. The Mercor APEX-Agents result (law, finance, long-horizon deliverables) is notable because those are hard, real-world tasks that enterprise buyers actually care about.
Brendan Foody at Mercor described it as excelling at "creating long-horizon deliverables such as slide decks, financial models, and legal analysis." That's a specific claim about professional productivity work, and Mercor has skin in the game since they run the benchmark. Take it with some context, but it's not a throwaway quote.
Where Claude tends to win, in my experience, is instruction-following precision and code quality. Where Gemini leads is multimodal reasoning and raw context depth. GPT-5.4 seems to be targeting professional work and computer use specifically. The Thinking variant also gets chain-of-thought safety evaluation, which OpenAI says reduces deceptive reasoning. That's a harder thing to verify independently, but it's the right problem to be working on.
Benchmarks vs. the Real Test
I've been running GPT-5.4 as my daily driver for a few days now, and so far it's been a solid replacement for Claude Opus 4.6. Hard to tell yet whether the quirks I'm hitting are from the driver or the car.
That's actually a useful analogy. These frontier models are like high-performance cars. The benchmarks give us the specs: torque, power, 0-60 times. But the real test is how professionals handle them on the track. Think Top Gear and the Stig. Every manufacturer ships impressive numbers. What actually matters is how the car feels, what real drivers can extract from it on a real circuit in real conditions.
We all drive at different levels. There's likely a different model to suit your needs, your prompting style, your domain. An engineer who writes precise system prompts will get different results than a product manager using natural language. The "best model" depends on the driver as much as the car.
Here's where the analogy breaks down, though, and it's the most important part. Car performance improvements have been logarithmic for decades. The gap between a 2020 and 2025 sports car is marginal. Shaving tenths of a second off lap times costs exponentially more engineering effort. But model improvements are still exponential. GPT-5.4 isn't a tenth of a second faster than last year's model. It's a fundamentally different class of capability. Computer use, million-token context, halved pricing. That pace shows no sign of slowing.
So yes, drive the car before you buy it. But also recognize that next year's model won't be a minor refresh. It'll be something we can't fully predict from where we're sitting today.
On the "Desperate Win" Narrative
Gizmodo and others framed this release as OpenAI needing a win after a turbulent stretch. That framing is lazy.
Yes, OpenAI has had a rough period. The GPT-5 rollout was messy. The competitive pressure from Gemini and Claude has been real. But a model that leads on computer use benchmarks and ships a meaningful API feature like Tool Search isn't spin. The benchmarks don't care about the company's internal drama.
The more interesting question is whether OpenAI can sustain this. The release cadence from all three major labs is so fast now that a "win" in March is under pressure by June. GPT-5.4 has to perform in production, not just on evals, and it needs to hold up long enough for teams to actually build on it.
What to Do If You're Evaluating Now
If computer use is on your roadmap, test GPT-5.4 now. The verified benchmark results are the best signal the field has that it's getting reliable.
If you're building agents with large tool libraries, look at Tool Search closely. This is the kind of architectural unlock that compounds as your agent capability surface grows.
If you're doing knowledge work or code generation, don't switch just because GPT-5.4 exists. Run your own evals against your specific use case. The benchmark gaps between frontier models are often smaller than they look in press releases once you narrow to your actual domain.
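"Run your own evals" can be as lightweight as a scored task list. A minimal sketch of that idea; call_model is a placeholder for whatever API client you actually use, and the tasks and graders should come from your real workload:

```python
# Minimal domain-specific eval harness (sketch).
# call_model is a stub; swap in your actual API client.

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up your API client here")

TASKS = [
    # (prompt, grader) pairs drawn from your real workload.
    ("Extract the invoice total from: 'Total due: $1,240.00'",
     lambda out: "1,240" in out or "1240" in out),
    ("Answer yes or no: is a limited liability clause present?",
     lambda out: out.strip().lower() in ("yes", "no")),
]

def score(model: str) -> float:
    """Fraction of tasks the model passes; errors count as failures."""
    passed = 0
    for prompt, grader in TASKS:
        try:
            if grader(call_model(model, prompt)):
                passed += 1
        except Exception:
            pass  # an API error or crash is a failed task
    return passed / len(TASKS)

# Compare candidates on identical tasks:
# for m in ("gpt-5.4", "claude-opus-4.6"):
#     print(m, score(m))
```

Twenty tasks from your actual domain will tell you more than any leaderboard delta, because the graders encode what "correct" means for your product.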
The model is real. The progress is real. But the best deployment decision is still the one you make from the driver's seat, not the spec sheet.
